Negative Binomial Hurdle Models

Modelling

Hurdle Models

Alcohol Use

Psychological Distress

Modelling psychological distress, alcohol and other drug use, and occupational functioning in Australian Public Safety Personnel

Published

March 31, 2026

Introduction

In this analysis, I model alcohol and other drug use, psychological distress, and occupational functioning in Australian public safety personnel (PSP). The aims of the analysis are to:

Describe the rates and severity of alcohol and other drug (AOD) use in a sample of Australian public safety personnel (PSP)
Examine associations between AOD use, psychological distress, and occupational functioning (absenteeism and presenteeism)
Test whether AOD use and psychological distress, independently and together, predict occupational impairment

Set-Up

Show/hide the code

# load packages 

pacman::p_load(here, tidyverse, dplyr, scales, rstatix, forecast, gtsummary, flextable, officer, ggplot2, ggpubr, dunn.test, rcompanion, MASS, pscl, AER, car, marginaleffects )
# set wd 
here::i_am("projects/hurdles/index.qmd")
# load data 
df <- readRDS(here::here("R/clean_self_assessment.RData"))
# filter to time 1 
df <- df |> filter(assessment_time == "Time 1")
# source custom flextable function 
source(here("R/customflextable.R"))

# set chunk options
knitr::opts_chunk$set(message = FALSE, warning = FALSE)

Key variables

Role	Variable(s)
Psychological distress	`k10_total`, `k10_total_interpretation`
Alcohol use	`alcohol_use_scores`, `alcohol_use_interpretation`
Other drug use	`max_other_drug_score`, `max_drug_use_interpretation`
Occupational functioning	`k10_11_absenteeism`, `k10_12_presenteeism`
Potential covariates	`gender`, `age`, `agency_years`

Screening

Show/hide the code

# Shapiro-Wilk for all relevant vars
options(scipen = 999)
tbl_shapiro <- df |>
  # select vars 
 dplyr::select(age, agency_years, alcohol_use_scores, max_other_drug_score, k10_total, k10_11_absenteeism, k10_12_presenteeism) |> 
  # pivot longer so variable name is one col and shapiro stat is the other 
  pivot_longer(everything(), names_to = "var", values_to = "value") |>
  group_by(var) |>
  shapiro_test(value) |>
  # rename each var for the table 
  mutate(
    var = c(
      "Age",
      "Agency Years",
      "Alcohol Use Risk",
      "Other Drug Use Risk",
      "Psychological Distress",
      "Absenteeism",
      "Presenteeism"
    ),
    variable = NULL,
    # round numbers 
    statistic = number(statistic, accuracy = .001),
    p = number(p, accuracy = .001)
  ) |>
  rename(Variable = var, Statistic = statistic) |>
  # convert to flextable 
  flextable() |> 
  # apply my flextable formatting from `my_flextable(font_size = 10)
  my_flextable(font_size = 10)

tbl_shapiro

Table 1: Shapiro tests of numeric variables

Variable	Statistic	p
Age	0.946	0.000
Agency Years	0.883	0.000
Alcohol Use Risk	0.933	0.000
Other Drug Use Risk	0.452	0.000
Psychological Distress	0.649	0.000
Absenteeism	0.949	0.000
Presenteeism	0.840	0.000

Shapiro-wilk tests confirm violations of normality. Non-parametric tests should be used for group comparisons (Kruskal-Wallis) that use risk categories as grouping variable

Show/hide the code

# visualise absenteeism and presenteeism 
fig_histograms <- df |>
  dplyr::select(Absenteeism = k10_11_absenteeism, Presenteeism = k10_12_presenteeism) |> 
  # pivot longer
  pivot_longer(
    cols = c(Absenteeism, Presenteeism),
    names_to = "Variable",
    values_to = "Days") |> 
  # filter NA answers 
  filter(!is.na(Days)) |>
  # create histograms
  gghistogram(
    x = "Days",
    y = "count",
    xlab = "Days",
    ylab = FALSE,
    add = "median",
    facet.by = "Variable",
    binwidth = 1
  ) +
  scale_x_continuous(breaks = seq(0, 30, by = 5))
fig_histograms

Figure 1: Histograms of absenteeism and presenteeism

Seems to be a strong right skew and large proportion of zeros. This suggests standard linear models would be inappropriate. might be better to work with count based models.

Prevalence and Severity of AOD Use

Show/hide the code

# pivot assist vars to long
# apply risk categories
tbl_rates <- df |>
  dplyr::select(starts_with("assist_total_"), -assist_total_scores_count) |>
  pivot_longer(
    cols = starts_with("assist_total_"),
    names_to = "substance",
    values_to = "score"
  ) |>
  mutate(score = case_when(score < 1 ~ NA, TRUE ~ score)) |>
  filter(!is.na(score)) |>
  mutate(
    substance = str_remove(substance, "assist_total_") |> str_to_title()
  ) |>
  mutate(
    `Risk` = case_when(
      substance == "Alcohol" & score < 10 ~ "Low risk",
      substance == "Alcohol" &
        score >= 10 &
        substance == "Alcohol" &
        score <= 26 ~ "Moderate risk",
      substance == "Alcohol" & score > 26 ~ "High risk",
      substance != "Alcohol" & score < 4 ~ "Low risk",
      substance != "Alcohol" &
        score >= 4 &
        substance != "Alcohol" &
        score <= 26 ~ "Moderate risk",
      substance != "Alcohol" & score > 26 ~ "High risk",
      TRUE ~ "fix"
    )
  ) |>
  group_by(substance, Risk) |>
  summarise(
    n = n()
  ) |>
  ungroup() |>
  group_by(substance) |>
  mutate(total = sum(n), percent = round(100 * n / sum(n), 1)) |>
  group_by(substance, Risk) |>
  ungroup() |>
  dplyr::select(substance, total, Risk, n, percent) |>
  pivot_wider(names_from = Risk, values_from = c(n, percent)) |>
  dplyr::select(
    Substance = substance,
    total,
    ends_with("Low risk"),
    ends_with("Moderate risk"),
    ends_with("High risk")
  ) |>
  mutate(
    total = paste0(total, " (", number(total / 220 * 100, accuracy = .1), "%)")
  ) |>
  filter(Substance != 'Other') |>
  filter(Substance != "Smoking") |>
  flextable()

# change spanning headers to group n and % under each risk category
typology <- data.frame(
  col_keys = c(
    "Substance",
    "total",
    "n_Low risk",
    "percent_Low risk",
    "n_Moderate risk",
    "percent_Moderate risk",
    "n_High risk",
    "percent_High risk"
  ),
  head = c("", "", rep("ASSIST scores", 6)),
  risk = c(
    "",
    "",
    rep("Low risk", 2),
    rep("Moderate risk", 2),
    rep("High risk", 2)
  ),
  names = c("Substance", "Total using", rep(c("n", "%"), 3))
)

# print and format rates table
tbl_rates <- tbl_rates |>
  set_header_df(mapping = typology, key = "col_keys") |>
  merge_h(part = "header") |>
  my_flextable(font_size = 10) |>
  hline(
    border = fp_border(color = "black", width = 1),
    i = c(1, 2),
    part = "header"
  ) |>
  hline_top(border = fp_border(color = "black", width = 1), part = "all") |> 
  align(j = c(2:8), align = "center", part = "all")

tbl_rates

Table 2: Rates of substance use in past three months by risk category

		ASSIST scores
		Low risk		Moderate risk		High risk
Substance	Total using	n	%	n	%	n	%
Alcohol	216 (98.2%)	77	35.6	83	38.4	56	25.9
Amphetamines	23 (10.5%)	5	21.7	16	69.6	2	8.7
Cannabis	28 (12.7%)	6	21.4	20	71.4	2	7.1
Cocaine	27 (12.3%)	12	44.4	13	48.1	2	7.4
Hallucinogens	18 (8.2%)	4	22.2	12	66.7	2	11.1
Inhalants	14 (6.4%)	6	42.9	6	42.9	2	14.3
Opioids	12 (5.5%)	5	41.7	6	50.0	1	8.3
Sedatives	26 (11.8%)	7	26.9	18	69.2	1	3.8

Alcohol use is nearly universal, and meaningfully spread. Other substance use is less prevalent, but does seem to be more moderate risk (although, the low-risk threshold is quite low, so even occasional use is likely to be at least moderate). High risk other drug use is less common, but still evident. Cannabis, amphetamines, and sedatives are more skewed toward moderate risk rather than low risk.

Descriptives Based on Alcohol Risk

Show/hide the code

# turn low cell counts to NA 
df <- df %>% mutate(gender = case_when(gender == "Prefer not to answer" ~ NA,
                                       gender == "Other" ~ NA,
                                       TRUE ~ gender),
                    gender = factor(gender))

tbl_alc <- df |>

# restrict to use in past 3 months 
  filter(recent_alcohol_user == 1) |>
  dplyr::select(
    gender,
    age,
    agency_years,
    alcohol_use_interpretation,
    k10_total,
    absenteeism = k10_11_absenteeism,
    presenteeism = k10_12_presenteeism
  ) |>
  tbl_summary(
    by = alcohol_use_interpretation,
    missing = "no",
    missing_text = "Missing",
    statistic = list(
      all_continuous() ~ c("{mean} ({sd})"),
      all_categorical() ~ "{n} ({p}%)"
    ),
    percent = "row",
    digits = list(all_continuous() ~ c(1, 1), all_categorical() ~ c(0, 1)),
    label = list(
      gender = "Gender",
      age = "Age",
      k10_total = "Psychological distress",
      absenteeism = "Absenteeism",
      presenteeism = "Presenteeism",
      agency_years = "Agency years",
      alcohol_use_interpretation = "Alcohol use risk category"
    )
  ) |>
  add_n(statistic = "{N_nonmiss}", col_label = "n") |>
  modify_header(list(
    label ~ "Characteristic",
    stat_1 ~ "Low Risk\n(n = 77)",
    stat_2 ~ "Moderate Risk\n(n = 83)",
    stat_3 ~ "High Risk\n(n = 56)"
  )) |>
  add_p(
    test = list(all_continuous() ~ "kruskal.test", gender ~ "chisq.test"),
    pvalue_fun = function(x) {
      formatted <- ifelse(x < 0.001, "<.001", sprintf("%.3f", x))
      sub("^0\\.", ".", formatted)
    }
  ) |>
  add_stat_label(
    label = list(all_continuous() ~ "Mean (SD)", all_categorical() ~ "n (%)")
  ) |>
  modify_footnote(everything() ~ NA) |>
  modify_header(p.value ~ "p") |>
  gtsummary::as_flex_table() |> 
  my_flextable(font_size = 10)
tbl_alc

Table 3: Sample characteristics by alcohol use risk levels

Characteristic	n	Low Risk (n = 77)	Moderate Risk (n = 83)	High Risk (n = 56)	p
Gender, n (%)	189				.011
Man or male		30 (26.8%)	52 (46.4%)	30 (26.8%)
Woman or female		35 (45.5%)	21 (27.3%)	21 (27.3%)
Age, Mean (SD)	187	37.2 (11.2)	42.8 (12.9)	40.6 (9.7)	.024
Agency years, Mean (SD)	196	8.0 (7.9)	12.9 (9.7)	10.8 (9.8)	.006
Psychological distress, Mean (SD)	202	18.9 (7.7)	20.9 (6.8)	25.4 (7.2)	<.001
Absenteeism, Mean (SD)	200	1.3 (4.3)	1.4 (4.1)	3.2 (5.7)	<.001
Presenteeism, Mean (SD)	200	4.4 (8.5)	2.0 (4.3)	7.6 (9.0)	<.001

Higher alcohol risk is associated with greater psychological distress and higher absenteeism and presenteeism. Age and agency years are higher in the moderate/higher risk groups - which might be related to cumulative exposure to traumatic events?

Show/hide the code

# post hoc pairwise comparisons - include for completeness. 
tests <- df |> filter(recent_alcohol_user == 1)

dunn.test(tests$age, tests$alcohol_use_interpretation, kw = TRUE, method = "bonferroni")

dunn.test(tests$agency_years, tests$alcohol_use_interpretation, kw = TRUE, method = "bonferroni")

dunn.test(tests$k10_total, tests$alcohol_use_interpretation, kw = TRUE, method = "bonferroni")

dunn.test(tests$k10_11_absenteeism, tests$alcohol_use_interpretation, kw = TRUE, method = "bonferroni")

dunn.test(tests$k10_12_presenteeism, tests$alcohol_use_interpretation, kw = TRUE, method = "bonferroni")

# gender - chi-squared then pairwise 
report::report(stats::chisq.test(tests$gender, tests$alcohol_use_interpretation)) 

gender_tbl <- table(tests$gender, tests$alcohol_use_interpretation)
pairwiseNominalIndependence(gender_tbl, fisher = FALSE, gtest = FALSE, chisq = TRUE, method = "bonferroni")

prop.table(table(tests$gender, tests$alcohol_use_interpretation), margin = 1)

Post hoc dunn tests with bonferroni correction show most differences are driven by low vs. moderate/high. moderate vs. high risk are pretty similar.

Descriptives Based on Other Drug Risk

Show/hide the code

tbl_drug <- df |>
  filter(recent_drug_user == 1) |>
  dplyr::select(
    gender,
    age,
    agency_years,
    max_drug_use_interpretation,
    k10_total,
    absenteeism = k10_11_absenteeism,
    presenteeism = k10_12_presenteeism
  ) |>
  tbl_summary(
    by = max_drug_use_interpretation,
    missing = "no",
    missing_text = "Missing",
    statistic = list(
      all_continuous() ~ c("{mean} ({sd})"),
      all_categorical() ~ "{n} ({p}%)"
    ),
    percent = "row",
    digits = list(all_continuous() ~ c(1, 1), all_categorical() ~ c(0, 1)),
    type = list(
      recent_drug_count = "continuous",
      #recent_alcohol_user = "categorical",
      recent_drug_user = "categorical",
      poly_drug_user = "categorical"
    ),
    label = list(
      gender = "Gender",
      age = "Age",
      k10_total = "Psychological distress",
      absenteeism = "Absenteeism",
      presenteeism = "Presenteeism",
      agency_years = "Agency years",
      max_drug_use_interpretation = "Maximum other drug use risk category"
    )
  ) |>
  add_n(statistic = "{N_nonmiss}", col_label = "n") |>
  modify_header(list(
    label ~ "Characteristic",
    stat_1 ~ "Low Risk\n(n = 17)",
    stat_2 ~ "Moderate Risk\n(n = 46)",
    stat_3 ~ "High Risk\n(n = 5)"
  )) |>
  add_p(
    test = list(all_continuous() ~ "kruskal.test", gender ~ "fisher.test"),

    #test.args = all_categorical() ~ list(simulate.p.value = TRUE),
    pvalue_fun = function(x) {
      formatted <- ifelse(x < 0.001, "<.001", sprintf("%.3f", x))
      sub("^0\\.", ".", formatted)
    }
  ) |>
  add_stat_label(
    label = list(all_continuous() ~ "Mean (SD)", all_categorical() ~ "n (%)")
  ) |>

  modify_footnote(everything() ~ NA) |>
  modify_header(p.value ~ "p") |>
  gtsummary::as_flex_table() |>
  my_flextable(font_size = 10)
tbl_drug

Table 4: Sample characteristics by other drug use risk levels

Characteristic	n	Low Risk (n = 17)	Moderate Risk (n = 46)	High Risk (n = 5)	p
Gender, n (%)	56				.789
Man or male		8 (22.2%)	26 (72.2%)	2 (5.6%)
Woman or female		6 (30.0%)	13 (65.0%)	1 (5.0%)
Age, Mean (SD)	59	41.2 (12.4)	37.7 (12.1)	32.6 (8.6)	.382
Agency years, Mean (SD)	61	11.2 (11.9)	8.5 (6.6)	5.0 (3.0)	.501
Psychological distress, Mean (SD)	61	19.7 (7.7)	24.6 (8.4)	34.2 (10.6)	.011
Absenteeism, Mean (SD)	61	0.7 (1.4)	2.3 (3.6)	10.2 (11.9)	.037
Presenteeism, Mean (SD)	61	3.6 (5.2)	7.2 (9.0)	8.4 (8.7)	.440

Much smaller n here (especially high risk group, n = 5); interpret w caution. Fisher’s exact test used for gender given small cell sizes.

Psychological distress and absenteeism seem to show a dose-reponse pattern across risk groups. Presenteeism is the same direction, but not significant, possible power issues.

Age and agency years and gender = no differences; contrast with alcohol use results.

Show/hide the code

# post hoc pairwise comparisons - included for completeness. 
tests <- df |> filter(recent_drug_user == 1)
dunn.test(tests$age, tests$max_drug_use_interpretation, kw = TRUE, method = "bonferroni")
dunn.test(tests$agency_years, tests$max_drug_use_interpretation, kw = TRUE, method = "bonferroni")
dunn.test(tests$k10_total, tests$max_drug_use_interpretation, kw = TRUE, method = "bonferroni")
dunn.test(tests$k10_11_absenteeism, tests$max_drug_use_interpretation, kw = TRUE, method = "bonferroni")
dunn.test(tests$k10_12_presenteeism, tests$max_drug_use_interpretation, kw = TRUE, method = "bonferroni")

dunn tests aren’t really reliable here.

Associations between AOD Use, Psychological Distress, and Occupational Functioning

Correlations

Show/hide the code

# select vars 
df_cor_vars <- df |>
  dplyr::select(
    alcohol_use_scores, max_other_drug_score, 
    k10_total, absenteeism = k10_11_absenteeism, presenteeism = k10_12_presenteeism
  ) 

# define variable types 
vars_continuous <- c("k10_total", "alcohol_use_scores", "max_other_drug_score")
vars_count <- c('presenteeism', 'absenteeism', 'recent_drug_count')

# get all correlation pairs 
cor_pairs <- combn(names(df_cor_vars), 2, simplify = FALSE)

# create function to get correct correlation type automatically 
get_cors <- function(x_name, y_name, data) {
  x <- data[[x_name]]
  y <- data[[y_name]]
  
  # define variable types 
  is_x_count <- x_name %in% vars_count
  is_y_count <- y_name %in% vars_count
  is_x_continuous <- x_name %in% vars_continuous
  is_y_continuous <- y_name %in% vars_continuous
  
  # define correlation method 
  cor_method <- case_when(
    is_x_count & is_y_count ~ "spearman", # use spearman
    is_x_count & is_y_continuous ~ "spearman", # use spearman
    is_x_continuous & is_y_count ~ "spearman", # use spearman
    is_x_continuous & is_y_continuous ~ "pearson",  # use pearson 
    TRUE ~ "spearman"  # if else, use spearman 
  )
  
  cor_type <- case_when(
    is_x_continuous & is_y_continuous ~ "pearson",
    TRUE ~ "spearman"
  )
  
  # calculate correlations 
  cor_test <- suppressWarnings(cor.test(x, y, method = cor_method))
  
  # return correlation coefficient, p-value, method used
  list(r = unname(cor_test$estimate), p = cor_test$p.value, method = cor_method, type = cor_type)
}

# compute correlations for all pairs 
results_cor <- purrr::map_dfr(cor_pairs, function(vars) {
  res <- get_cors(vars[1], vars[2], df_cor_vars)
  tibble(
    var1 = vars[1],
    var2 = vars[2],
    cor = res$r,
    p = number(res$p, accuracy = .001),
    method = res$cor_method,
    type = res$cor_type#,
    #p2 = res$p
  )
})

# make symmetric by mirroring rows
df_cor_long <- results_cor |>
  bind_rows(
    results_cor |>
      rename(var1 = var2, var2 = var1)
  ) |>
  mutate(
    cor = round(cor, 2),
    sig = p <= 0.05,
    stars = case_when(
      p < .001 ~ "***",
      p < .01 ~ "**", 
      p < .05 ~ "*",
      TRUE ~ ""),
    cor_ch = paste0(formatC(cor, format = "f", digits = 2), stars))
  
# make wide 
matrix_cor <- df_cor_long |>
  dplyr::select(var1, var2, cor_ch) |>
  pivot_wider(names_from = var2, values_from = cor_ch) |> 
  dplyr::select(var1, alcohol_use_scores, max_other_drug_score, k10_total, absenteeism, presenteeism)

# set cor labels 
labs_cor <- c(
  k10_total = "Psychological Distress",
  presenteeism = "Presenteeism",
  absenteeism = "Absenteeism",
  alcohol_use_scores = "Alcohol Use Score",
  max_other_drug_score = "Max Other Drug Score")

colnames(matrix_cor) <- c(" " = "var1", labs_cor[colnames(matrix_cor)[-1]])
matrix_cor[[1]] <- labs_cor[matrix_cor[[1]]]
matrix_cor[upper.tri(matrix_cor)] <- NA

# create ft 
tbl_cors <- flextable(matrix_cor) |> 
  my_flextable(font_size = 10) |> 
  set_header_labels("var1" = " ")

tbl_cors

Table 5: Correlation matrix of key variables

	Alcohol Use Score	Max Other Drug Score	Psychological Distress	Absenteeism
Alcohol Use Score
Max Other Drug Score	0.18
Psychological Distress	0.39***	0.33**
Absenteeism	0.28***	0.44***	0.52***
Presenteeism	0.24**	0.28*	0.49***	0.51***

Used pearson for continuous-continuous pairs (alcohol score, drug score, k10 score) and Spearman for pairs involving absenteeism/presenteeism - count variables, zero inflated, rank based more appropriate (see Figure 1)

Pretty much everything is highly correlated (ranging from .18 - .52).

Strongest associations are with psychological distress (particularly with absenteeism and presenteeism), but that makes conceptual sense. Might be also that they come from the k10+? Questions specifically asks “due to the symptoms described” (in the k10), so naturally they would be correlatedexcept for alcohol use and other drug use.

Weaker associations between alcohol use and other drug use which is sort of good for modelling - no collinearity

Model selection

Considered several candidate models that fit the data for absenteeism and presenteeism:

Poisson regression
Negative binomial regression
Zero-inflated Poisson (ZIP)
Zero-inflated negative binomial models (ZINB)
Hurdle negative binomial models

Show/hide the code

# round absenteeism and presenteeism 
df <- df |> mutate(
  k10_11_absenteeism = round(k10_11_absenteeism, 1),
  k10_12_presenteeism = round(k10_12_presenteeism, 1)
) |> rename(absenteeism = k10_11_absenteeism, 
             presenteeism = k10_12_presenteeism)
# select vars 
df_sub <- df |> dplyr::select(age, gender, agency_years, k10_total, rtc_3_readiness, 
                               recent_drug_user, recent_alcohol_user, recent_any_user, poly_drug_user,
                               alcohol_use_scores, max_other_drug_score, absenteeism, presenteeism)

# create function for comparing models 
predictors <- df_sub |> dplyr::select(-absenteeism, -presenteeism) |> names()

# function to run series of univariate models and check dispersion. 
compare_models <- function(var, data) {
  fmla <- as.formula(paste(outcome, "~", var))

  # print results:
  cat("\n============================\n")
  cat("Variable:", var, "\n")
  cat("============================\n\n")
  
  # fit models 
  # poisson
  pois <- glm(fmla, data = data, family = poisson())
  # negative binomial
  nb <- glm.nb(fmla, data = data)
  # zero inflated poisson
  zip <- zeroinfl(fmla, data = data, dist = "poisson")
  # zero inflated negative binomial
  zinb <- zeroinfl(fmla, data = data, dist = "negbin")
  # hurdle negative binomial 
  hurdle_nb <- hurdle(fmla, data = data, dist = "negbin", zero.dist = "binomial")
  
  # dispersion:
  disp_stat <- dispersiontest(pois)$statistic 
  cat('Dispersion (Poisson):', round(disp_stat, 2), '\n')
  
  # compare AIC / BIC
  aic_df <- AIC(pois, nb, zip, zinb, hurdle_nb)
  bic_df <- BIC(pois, nb, zip, zinb, hurdle_nb)
  
  cat("\nAIC Comparison:\n")
  print(aic_df)
  cat("\nBIC Comparison:\n")
  print(bic_df)
  
  # vuong tests
  cat("\nVuong Test: ZIP vs Poisson\n")
  print(vuong(pois, zip))
  
  cat("\nVuong Test: ZINB vs NB\n")
  print(vuong(nb, zinb))
  
  cat("\nVuong Test: Hurdle vs NB\n")
  print(vuong(nb, hurdle_nb))

# get predicted vs observed zeros 
obs_zeros <- sum(data[[outcome]] == 0)
pred_zeros <- list(
  Poisson = sum(predict(pois, type = "response") < 1e-6),
  NB = sum(predict(nb, type = "response") < 1e-6),
  ZIP = sum(predict(zip, type = "response") < 1e-6),
  ZINB = sum(predict(zinb, type = "response") < 1e-6),
  Hurdle = sum(predict(hurdle_nb, type = "response") < 1e-6)
)

cat("\nObserved Zeros:", obs_zeros, "\n")
cat("Predicted Zeros:\n")
print(pred_zeros)
}

# run and save externally 
outcome <- "absenteeism"
predictors <- df_sub |> dplyr::select(-presenteeism, -absenteeism) |> names()

sink(here("R/model_comparisons_absenteeism.txt"))
lapply(predictors, compare_models, data = df_sub |> filter(!is.na(absenteeism)))
sink()

outcome <- "presenteeism"
sink(here("R/model_comparisons_presenteeism.txt"))
lapply(predictors, compare_models, data = df_sub |> filter(!is.na(presenteeism)))
sink()

Overdispersion was confirmed across all predictors. Poisson dispersion consistently >1 (not a good candidate).Zero-inflated and hurdle models outperformed NB on AIC/BIC and predicted zeros better. Vuong tests favoured hurdle and zero-inflated, though difference between ZINB and hurdle NB was small.

Fit statistics between ZINB and hurdle NB were similar enough.

!!!

MS and I discussed this, and we think zero-inflation models were too rigid for our outcomes.

zero inflated models assume 2 groups: ppl structurally always zero (i.e., will never ever have any absenteeism/presenteeism) and people who are at risk. Hurdle models seem more appropriate, because they model the likelihood of someone crossing into abscence/not, and then a seperate model to determine how many days are lost if they do…. this fits our research question better…

decision: Hurdle negative binomial models

They account for overdispersion, handle excess zeros in two stages, and align with our aims

Modelling Absenteeism

Each predictor was examined as univariate effect before building multivariable model to see variable-specific effects and check that the hurdle structure was good.

Show/hide the code

df_sub <- df |>
  dplyr::select(
    age, gender, agency_years,
    alcohol_use_scores, max_other_drug_score,
    k10_total, absenteeism, presenteeism
  ) |>
  mutate(
    # NA = did not use in past 3 months; treat as zero risk for modelling
     alcohol_use_scores = case_when(is.na(alcohol_use_scores) ~ 0, TRUE ~ alcohol_use_scores),
     max_other_drug_score = case_when(is.na(max_other_drug_score) ~ 0, TRUE ~ max_other_drug_score)
    # alcohol_use_scores = tidyr::replace_na(alcohol_use_scores, 0),
    # max_other_drug_score = tidyr::replace_na(max_other_drug_score, 0)
  )


  # variable labels used across all model tables:
  labs_vars <- c(
  agency_years = "Years in agency",
  age = "Age",
  gender = "Gender",
  alcohol_use_scores = "Alcohol use risk",
  max_other_drug_score = "Maximum other drug use risk",
  k10_total = "Psychological distress",
  `max_other_drug_score:k10_total` = "Other drug use risk x Psychological distress"
)

Univariable Models

Show/hide the code

# function to run univariable models at once 
run_uv_hurdle <- function(outcome, predictors, 
                          data, exponentiate = T) {
  
  # build both models seperately 
  model_tables <- map(predictors, ~ {
  fmla <- as.formula(paste0(outcome, " ~ ", .x, " | ", .x))
  model <- hurdle(fmla, data = data, dist = "negbin", zero.dist = "binomial")
    
  # logit part 
  tbl_logit <- tbl_regression(
    model,
    component = "zero_inflated",
    exponentiate = TRUE,
    estimate_fun = function(x) formatC(x, digits = 2, format = "f"),
    pvalue_fun = function(x) {
      p <- style_pvalue(x, digits = 3)
      gsub("(?<=<)0\\.|^0\\.", ".", p, perl = TRUE)
    }, 
    tidy_fun = broom.helpers::tidy_zeroinfl
  ) |> 
    modify_header(label = "Variable", estimate = "OR") |> 
    modify_table_body(~ .x |>
                        mutate(
                          label = ifelse(
                            row_type == "label" & variable %in% names(labs_vars),
                            labs_vars[variable],
                            label
                          ),
                          ci = ifelse(
                            !is.na(estimate) & (conf.high - conf.low > 50),
                            ">50",
                            ci
                          )
                        )
    ) |> 
    add_n() 
  
  # NB count part
  tbl_count <- tbl_regression(
    model,
    component = "conditional",
    exponentiate = T,
    estimate_fun = function(x) formatC(x, digits = 2, format = "f"),
    pvalue_fun = function(x) {
      p <- style_pvalue(x, digits = 3)
      # remove leading zero from both "0.03" and "<0.001"
      gsub("(?<=<)0\\.|^0\\.", ".", p, perl = TRUE)
    },
    
    tidy_fun = broom.helpers::tidy_zeroinfl
  ) |> 
    modify_header(label = "Variable", estimate = "IRR") |> 
    modify_table_body(~ .x |>
                        mutate(
                          label = ifelse(
                            row_type == "label" & variable %in% names(labs_vars),
                            labs_vars[variable],
                            label
                          ),
                          ci = ifelse(
                            !is.na(estimate) & (conf.high - conf.low > 50),
                            ">50",
                            ci
                          )
                        )) 
  
  list(tbl_logit = tbl_logit, tbl_count = tbl_count)
  })

  # separate and flatten count and zero tables
  tbls_logit  <- map(model_tables, "tbl_logit")
  tbls_count <- map(model_tables, "tbl_count")
  
  tbl_merge(
    tbls = list(
      tbl_stack(tbls_logit),
      tbl_stack(tbls_count) 
      ),
    tab_spanner = c("Hurdle model (logit)", "Count model (NB)")
  )
}

## apply to absenteeism 

outcome <- "absenteeism"
predictors <- df_sub |> dplyr::select(-presenteeism, -absenteeism) |> names()

tbl_abs_uv <- suppressMessages(
  run_uv_hurdle(outcome, predictors, data = df_sub, exponentiate = TRUE)
) 

tbl_abs_uv <- tbl_abs_uv |> 
  modify_table_body(~ .x |>
    mutate(
      estimate_1 = number(estimate_1, accuracy = .01),
      estimate_1 = if_else(is.na(estimate_1), "—", estimate_1),
      estimate_2 = number(estimate_2, accuracy = .01),
      estimate_2 = if_else(is.na(estimate_2), "—", estimate_2),
      label = case_when(label == "0" ~ "No", label == "1" ~ "Yes", TRUE ~ label),
      groupname_col_1 = NULL,
      groupname_col_2 = NULL
    )
  ) |>
  as_flex_table() |>
  delete_part("footer") 


# change spanning headers
typology <- data.frame(
  col_keys = tbl_abs_uv$col_keys,
  head1 = c("", "", rep("Hurdle Model (logit)", 3), rep("Count Model (NB)", 3)),
  head2 = c("Variable", "N", "OR", "95% CI", "p", "IRR", "95% CI", "p"),
  stringsAsFactors = FALSE
)

# print and format 
tbl_abs_uv <- tbl_abs_uv |>
  set_header_df(mapping = typology, key = "col_keys") |> 
  merge_h(part = "header") |>
  my_flextable(font_size = 10)

# fix NA
tbl_abs_uv <- flextable::compose(tbl_abs_uv, i = 2, j = c(3, 4, 6, 7), value = flextable::as_paragraph(" "))
tbl_abs_uv <- flextable::compose(tbl_abs_uv, i = 3, j = c(3, 4, 6, 7), value = flextable::as_paragraph("\u2014"))



tbl_abs_uv

		Hurdle Model (logit)			Count Model (NB)
Variable	N	OR	95% CI	p	IRR	95% CI	p
Age	200	0.99	0.96, 1.02	.434	1.01	0.97, 1.05	.584
Gender	201
Man or male		—	—		—	—
Woman or female		1.24	0.67, 2.29	.492	2.24	1.09, 4.60	.028
Years in agency	209	0.99	0.95, 1.02	.426	0.99	0.94, 1.05	.761
Alcohol use risk	213	1.05	1.02, 1.08	<.001	1.00	0.97, 1.03	.806
Maximum other drug use risk	213	1.10	1.05, 1.16	<.001	1.01	0.98, 1.04	.476
Psychological distress	213	1.17	1.11, 1.22	<.001	1.09	1.05, 1.14	<.001

Higher alcohol use risk, greater other drug use risk, and higher psychological distress were all associated with increased odds of reporting any absenteeism; age, gender, and years employed at the agency were not associated with the likelihood of reporting any absenteeism. For those reporting any absenteeism, psychological distress and gender were associated with the frequency of absenteeism.

Multivariable model

Predictors showing evidence of association in either component for the univariable models were retained for multivariable modelling.

Show/hide the code

formula_abs <- absenteeism ~ gender + alcohol_use_scores + max_other_drug_score + k10_total
mod_abs_mv <- hurdle(formula_abs, data = df_sub, dist = "negbin", zero.dist = "binomial")

# check multicollinearity 
car::vif(lm(formula_abs, data = df_sub))

              gender   alcohol_use_scores max_other_drug_score 
            1.037990             1.136693             1.066019 
           k10_total 
            1.155369

All vifs were below 5 (ranging from 1.04-1.16)

Show/hide the code

# define merged table with both components
tbl_abs_mv <- list(
  
  # logit component 
  logit = suppressMessages(tbl_regression(
    mod_abs_mv,
    component = "zero_inflated",
    exponentiate = TRUE,
    estimate_fun = function(x) formatC(x, digits = 2, format = "f"),
    pvalue_fun = function(x) {
      p <- style_pvalue(x, digits = 3)
      gsub("(?<=<)0\\.|^0\\.", ".", p, perl = TRUE)
    },
    tidy_fun = broom.helpers::tidy_zeroinfl)) |> 
    modify_header(label = "Variable", estimate = "OR") |>
    modify_table_body(~ .x |>
      mutate(
        label = if_else(row_type == "label" & variable %in% names(labs_vars), labs_vars[variable], label)
       )) |> 
        add_n(),

  # count component 
  count = suppressMessages(tbl_regression(
    mod_abs_mv,
    component = "conditional",
    exponentiate = TRUE,
    estimate_fun = function(x) formatC(x, digits = 2, format = "f"),
    pvalue_fun = function(x) {
      p <- style_pvalue(x, digits = 3)
      gsub("(?<=<)0\\.|^0\\.", ".", p, perl = TRUE)
    },
    tidy_fun = broom.helpers::tidy_zeroinfl)) |>
    modify_header(label = "Variable", estimate = "IRR") |>
    modify_table_body(~ .x |>
      mutate(
        label = if_else(
          row_type == "label" & variable %in% names(labs_vars),
          labs_vars[variable],
          label
        )
      )) |>
      add_n()) |> 
      tbl_merge(tab_spanner = c("Odds of Absenteeism", "Frequency of Absenteeism"))

tbl_abs_mv <- tbl_abs_mv |> 
  modify_table_body(~ .x |> 
    mutate(
      groupname_col_1 = NULL, 
      estimate_1 = number(estimate_1, accuracy = .01),
      #estimate_1 = if_else(is.na(estimate_1), "—", estimate_1),
      ci_1 = paste(number(conf.low_1, accuracy = .01), ", ", number(conf.high_1, accuracy = .01)),
      ci_1 = if_else(ci_1 == "NA, NA", "  ", ci_1),
      groupname_col_2 = NULL, 
      stat_n_2 = NULL, 
      estimate_2 = number(estimate_2, accuracy = .01),
      #estimate_2 = if_else(is.na(estimate_2), "-", estimate_2),
      ci_2 = paste(number(conf.low_2, accuracy = .01), ", ", number(conf.high_2, accuracy = .01)),
      ci_2 = if_else(ci_2 == "NA, NA", "  ", ci_2),

    )
  ) |>
  as_flex_table() |>
  delete_part("footer") 

tbl_abs_mv <- flextable::compose(tbl_abs_mv, i = 1, j = c(3, 4, 6, 7), value = flextable::as_paragraph(" "))
tbl_abs_mv <- flextable::compose(tbl_abs_mv, i = 2, j = c(3, 4, 6, 7), value = flextable::as_paragraph("\u2014"))

# change spanning headers
typology <- data.frame(
  col_keys = tbl_abs_mv$col_keys,
  head1 = c("", "", rep("Odds of Absenteeism", 3), rep("Frequency of Absenteeism", 3)),
  head2 = c("Variable", "N", "OR", "95% CI", "p", "IRR", "95% CI", "p"),
  stringsAsFactors = FALSE
)

tbl_abs_mv <- tbl_abs_mv |>
  set_header_df(mapping = typology, key = "col_keys") |> 
  merge_h(part = "header") |>
  my_flextable(font_size = 10)


tbl_abs_mv

Table 6: Multivariable hurdle negative binomial model predicting absenteeism

		Odds of Absenteeism			Frequency of Absenteeism
Variable	N	OR	95% CI	p	IRR	95% CI	p
Gender	201
Man or male		—	—		—	—
Woman or female		1.54	0.73, 3.24	.259	1.02	0.49, 2.12	.949
Alcohol use risk	201	1.03	1.00, 1.06	.048	0.99	0.97, 1.02	.624
Maximum other drug use risk	201	1.09	1.02, 1.15	.007	0.98	0.95, 1.02	.374
Psychological distress	201	1.14	1.08, 1.20	<.001	1.10	1.04, 1.16	<.001

Higher alcohol use risk, greater other drug use risk, and higher psychological distress were all associated with increased odds of reporting any absenteeism. Effect of gender was attentuated and was no longer associated with likelihood. Higher psychological distress was associated with higher counts. No other predictors were associated with frequency once any absence had occurred.

Interactions

Interactions were not significant and likelihood ratio tests showed adding the interactions added no extra value.

Show/hide the code

mod_abs_alc_k10 <- absenteeism ~ gender + alcohol_use_scores*k10_total + max_other_drug_score
mod_abs_drug_k10 <- absenteeism ~ gender + max_other_drug_score*k10_total + alcohol_use_scores

mod_abs_alc_k10_mod <- hurdle(mod_abs_alc_k10, data = df_sub, 
                          dist = "negbin", zero.dist = "binomial")
mod_abs_drug_k10_mod <- hurdle(mod_abs_drug_k10, data = df_sub, 
                           dist = "negbin", zero.dist = "binomial")


# lrt 
lrtest(mod_abs_mv, mod_abs_alc_k10_mod)

Likelihood ratio test

Model 1: absenteeism ~ gender + alcohol_use_scores + max_other_drug_score + 
    k10_total
Model 2: absenteeism ~ gender + alcohol_use_scores * k10_total + max_other_drug_score
  #Df  LogLik Df  Chisq Pr(>Chisq)
1  11 -238.68                     
2  13 -238.17  2 1.0297     0.5976

Show/hide the code

lrtest(mod_abs_mv, mod_abs_drug_k10_mod)

Likelihood ratio test

Model 1: absenteeism ~ gender + alcohol_use_scores + max_other_drug_score + 
    k10_total
Model 2: absenteeism ~ gender + max_other_drug_score * k10_total + alcohol_use_scores
  #Df  LogLik Df  Chisq Pr(>Chisq)
1  11 -238.68                     
2  13 -238.09  2 1.1946     0.5503

Interactions were not significant and likelihood ratio tests showed adding the interactions added no extra value.

Modelling Presenteeism

Univariable models

Show/hide the code

outcome <- "presenteeism"
predictors <- df_sub |> dplyr::select(-presenteeism, -absenteeism) |> names()

tbl_pres_uv <- suppressMessages(
  run_uv_hurdle(outcome, predictors, data = df_sub, exponentiate = TRUE)
) 

tbl_pres_uv <- tbl_pres_uv |> 
  modify_table_body(~ .x |>
    mutate(
      estimate_1 = number(estimate_1, accuracy = .01),
      estimate_1 = if_else(is.na(estimate_1), "—", estimate_1),
      estimate_2 = number(estimate_2, accuracy = .01),
      estimate_2 = if_else(is.na(estimate_2), "—", estimate_2),
      label = case_when(label == "0" ~ "No", label == "1" ~ "Yes", TRUE ~ label),
      groupname_col_1 = NULL,
      groupname_col_2 = NULL
    )
  ) |>
  as_flex_table() |>
  delete_part("footer") 


# change spanning headers
typology <- data.frame(
  col_keys = tbl_pres_uv$col_keys,
  head1 = c("", "", rep("Hurdle Model (logit)", 3), rep("Count Model (NB)", 3)),
  head2 = c("Variable", "N", "OR", "95% CI", "p", "IRR", "95% CI", "p"),
  stringsAsFactors = FALSE
)

# print and format 
tbl_pres_uv <- tbl_pres_uv |>
  set_header_df(mapping = typology, key = "col_keys") |> 
  merge_h(part = "header") |>
  my_flextable(font_size = 10)

# fix NA
tbl_pres_uv <- flextable::compose(tbl_pres_uv, i = 2, j = c(3, 4, 6, 7), value = flextable::as_paragraph(" "))
tbl_pres_uv <- flextable::compose(tbl_pres_uv, i = 3, j = c(3, 4, 6, 7), value = flextable::as_paragraph("\u2014"))

tbl_pres_uv

Table 7: Univariable hurdle negative binomial regressions predicting presenteeism

		Hurdle Model (logit)			Count Model (NB)
Variable	N	OR	95% CI	p	IRR	95% CI	p
Age	200	0.97	0.95, 1.00	.029	1.00	0.98, 1.03	.864
Gender	201
Man or male		—	—		—	—
Woman or female		0.99	0.57, 1.74	.982	0.97	0.62, 1.52	.907
Years in agency	209	0.97	0.94, 1.00	.050	1.00	0.97, 1.03	.957
Alcohol use risk	213	1.03	1.01, 1.06	.015	1.01	0.99, 1.02	.419
Maximum other drug use risk	213	1.08	1.03, 1.13	.002	1.01	0.99, 1.04	.260
Psychological distress	213	1.12	1.08, 1.17	<.001	1.05	1.02, 1.09	<.001

Age, years in the agency, higher alcohol use risk, greater other drug use risk, and higher psychological distress were all associated with increased odds of reporting any presenteeism; For those reporting any presenteeism, only psychological distress was associated with the frequency of presenteeism.

Multivariable model

Show/hide the code

formula_pres <- presenteeism ~ age + agency_years + alcohol_use_scores + max_other_drug_score + k10_total
mod_pres_mv <- hurdle(formula_pres, data = df_sub, dist = "negbin", zero.dist = "binomial")

# check multicollinearity 
car::vif(lm(formula_pres, data = df_sub))

                 age         agency_years   alcohol_use_scores 
            2.107382             2.085608             1.139047 
max_other_drug_score            k10_total 
            1.120903             1.184623

all vifs were below 5 (ranging from 1.12-2.11)

Show/hide the code

# define merged table with both components
tbl_pres_mv <- list(
  
  # logit component 
  logit = suppressMessages(tbl_regression(
    mod_pres_mv,
    component = "zero_inflated",
    exponentiate = TRUE,
    estimate_fun = function(x) formatC(x, digits = 2, format = "f"),
    pvalue_fun = function(x) {
      p <- style_pvalue(x, digits = 3)
      gsub("(?<=<)0\\.|^0\\.", ".", p, perl = TRUE)
    },
    tidy_fun = broom.helpers::tidy_zeroinfl)) |> 
    modify_header(label = "Variable", estimate = "OR") |>
    modify_table_body(~ .x |>
      mutate(
        label = if_else(row_type == "label" & variable %in% names(labs_vars), labs_vars[variable], label)
       )) |> 
        add_n(),

  # count component 
  count = suppressMessages(tbl_regression(
    mod_pres_mv,
    component = "conditional",
    exponentiate = TRUE,
    estimate_fun = function(x) formatC(x, digits = 2, format = "f"),
    pvalue_fun = function(x) {
      p <- style_pvalue(x, digits = 3)
      gsub("(?<=<)0\\.|^0\\.", ".", p, perl = TRUE)
    },
    tidy_fun = broom.helpers::tidy_zeroinfl)) |>
    modify_header(label = "Variable", estimate = "IRR") |>
    modify_table_body(~ .x |>
      mutate(
        label = if_else(
          row_type == "label" & variable %in% names(labs_vars),
          labs_vars[variable],
          label
        )
      )) |>
      add_n()) |> 
      tbl_merge(tab_spanner = c("Odds of Presenteeism", "Frequency of Presenteeism"))

tbl_pres_mv <- tbl_pres_mv |> 
  modify_table_body(~ .x |> 
    mutate(
      groupname_col_1 = NULL, 
      estimate_1 = number(estimate_1, accuracy = .01),
      #estimate_1 = if_else(is.na(estimate_1), "—", estimate_1),
      ci_1 = paste(number(conf.low_1, accuracy = .01), ", ", number(conf.high_1, accuracy = .01)),
      ci_1 = if_else(ci_1 == "NA, NA", "  ", ci_1),
      groupname_col_2 = NULL, 
      stat_n_2 = NULL, 
      estimate_2 = number(estimate_2, accuracy = .01),
      #estimate_2 = if_else(is.na(estimate_2), "-", estimate_2),
      ci_2 = paste(number(conf.low_2, accuracy = .01), ", ", number(conf.high_2, accuracy = .01)),
      ci_2 = if_else(ci_2 == "NA, NA", "  ", ci_2),

    )
  ) |>
  as_flex_table() |>
  delete_part("footer") 

# change spanning headers
typology <- data.frame(
  col_keys = tbl_pres_mv$col_keys,
  head1 = c("", "", rep("Odds of Presenteeism", 3), rep("Frequency of Presenteeism", 3)),
  head2 = c("Variable", "N", "OR", "95% CI", "p", "IRR", "95% CI", "p"),
  stringsAsFactors = FALSE
)

tbl_pres_mv <- tbl_pres_mv |>
  set_header_df(mapping = typology, key = "col_keys") |> 
  merge_h(part = "header") |>
  my_flextable(font_size = 10)


tbl_pres_mv

Table 8: Multivariable hurdle negative binomial model predicting presenteeism

		Odds of Presenteeism			Frequency of Presenteeism
Variable	N	OR	95% CI	p	IRR	95% CI	p
Age	199	0.97	0.94, 1.01	.210	1.00	0.97, 1.02	.834
Years in agency	199	0.99	0.95, 1.04	.797	1.00	0.97, 1.03	.994
Alcohol use risk	199	1.02	0.99, 1.05	.150	1.01	0.99, 1.02	.409
Maximum other drug use risk	199	1.05	0.99, 1.11	.076	1.01	0.99, 1.04	.215
Psychological distress	199	1.11	1.06, 1.16	<.001	1.06	1.02, 1.09	<.001

Unlike absenteeism, alcohol and other drug use risk were not associated with odds of presenteeism, but psychological distress was. Effect of age and agency years was attentuated and was no longer associated with likelihood

Psychological distress was associated with higher counts. no other predictors were associated with frequency once any presenteeism had occurred.

Interactions

Show/hide the code

mod_pres_alc_k10 <- presenteeism ~ age + agency_years + alcohol_use_scores*k10_total + max_other_drug_score
mod_pres_drug_k10 <- presenteeism ~ age + agency_years + max_other_drug_score*k10_total + alcohol_use_scores

mod_pres_alc_k10_mod <- hurdle(mod_pres_alc_k10, data = df_sub, 
                          dist = "negbin", zero.dist = "binomial")
mod_pres_drug_k10_mod <- hurdle(mod_pres_drug_k10, data = df_sub, 
                           dist = "negbin", zero.dist = "binomial")

lrtest(mod_pres_mv, mod_pres_drug_k10_mod)

Likelihood ratio test

Model 1: presenteeism ~ age + agency_years + alcohol_use_scores + max_other_drug_score + 
    k10_total
Model 2: presenteeism ~ age + agency_years + max_other_drug_score * k10_total + 
    alcohol_use_scores
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1  13 -407.58                         
2  15 -398.59  2 17.978  0.0001248 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Show/hide the code

lrtest(mod_pres_mv, mod_pres_alc_k10_mod)

Likelihood ratio test

Model 1: presenteeism ~ age + agency_years + alcohol_use_scores + max_other_drug_score + 
    k10_total
Model 2: presenteeism ~ age + agency_years + alcohol_use_scores * k10_total + 
    max_other_drug_score
  #Df  LogLik Df  Chisq Pr(>Chisq)
1  13 -407.58                     
2  15 -406.37  2 2.4158     0.2988

Results showed no significant interaction or added value of the additonal alcohol use model. But results showed a significant interaction with other drug use and psychological distress for both components of the model

Show/hide the code

# define merged table with both components
tbl_pres_mv2 <- list(
  
  # logit component 
  logit = suppressMessages(tbl_regression(
    mod_pres_drug_k10_mod,
    component = "zero_inflated",
    exponentiate = TRUE,
    estimate_fun = function(x) formatC(x, digits = 2, format = "f"),
    pvalue_fun = function(x) {
      p <- style_pvalue(x, digits = 3)
      gsub("(?<=<)0\\.|^0\\.", ".", p, perl = TRUE)
    },
    tidy_fun = broom.helpers::tidy_zeroinfl)) |> 
    modify_header(label = "Variable", estimate = "OR") |>
    modify_table_body(~ .x |>
      mutate(
        label = if_else(row_type == "label" & variable %in% names(labs_vars), labs_vars[variable], label)
       )) |> 
        add_n(),

  # count component 
  count = suppressMessages(tbl_regression(
    mod_pres_drug_k10_mod,
    component = "conditional",
    exponentiate = TRUE,
    estimate_fun = function(x) formatC(x, digits = 2, format = "f"),
    pvalue_fun = function(x) {
      p <- style_pvalue(x, digits = 3)
      gsub("(?<=<)0\\.|^0\\.", ".", p, perl = TRUE)
    },
    tidy_fun = broom.helpers::tidy_zeroinfl)) |>
    modify_header(label = "Variable", estimate = "IRR") |>
    modify_table_body(~ .x |>
      mutate(
        label = if_else(
          row_type == "label" & variable %in% names(labs_vars),
          labs_vars[variable],
          label
        )
      )) |>
      add_n()) |> 
      tbl_merge(tab_spanner = c("Odds of Presenteeism", "Frequency of Presenteeism"))

tbl_pres_mv2 <- tbl_pres_mv2 |> 
  modify_table_body(~ .x |> 
    mutate(
      groupname_col_1 = NULL, 
      estimate_1 = number(estimate_1, accuracy = .01),
      #estimate_1 = if_else(is.na(estimate_1), "—", estimate_1),
      ci_1 = paste(number(conf.low_1, accuracy = .01), ", ", number(conf.high_1, accuracy = .01)),
      ci_1 = if_else(ci_1 == "NA, NA", "  ", ci_1),
      groupname_col_2 = NULL, 
      stat_n_2 = NULL, 
      estimate_2 = number(estimate_2, accuracy = .01),
      #estimate_2 = if_else(is.na(estimate_2), "-", estimate_2),
      ci_2 = paste(number(conf.low_2, accuracy = .01), ", ", number(conf.high_2, accuracy = .01)),
      ci_2 = if_else(ci_2 == "NA, NA", "  ", ci_2),

    )
  ) |>
  as_flex_table() |>
  delete_part("footer") 

# change spanning headers
typology <- data.frame(
  col_keys = tbl_pres_mv2$col_keys,
  head1 = c("", "", rep("Odds of Presenteeism", 3), rep("Frequency of Presenteeism", 3)),
  head2 = c("Variable", "N", "OR", "95% CI", "p", "IRR", "95% CI", "p"),
  stringsAsFactors = FALSE
)

tbl_pres_mv2 <- tbl_pres_mv2 |>
  set_header_df(mapping = typology, key = "col_keys") |> 
  merge_h(part = "header") |>
  my_flextable(font_size = 10)


tbl_pres_mv2

Table 9: Multivariable hurdle negative binomial model with interaction predicting presenteeism

		Odds of Presenteeism			Frequency of Presenteeism
Variable	N	OR	95% CI	p	IRR	95% CI	p
Age	199	0.98	0.94, 1.02	.328	1.00	0.97, 1.03	.956
Years in agency	199	0.99	0.94, 1.04	.648	1.00	0.96, 1.03	.821
Maximum other drug use risk	199	1.38	1.16, 1.65	<.001	1.08	1.00, 1.16	.056
Psychological distress	199	1.16	1.10, 1.23	<.001	1.07	1.03, 1.11	<.001
Alcohol use risk	199	1.02	0.99, 1.05	.127	1.01	0.99, 1.02	.256
Other drug use risk x Psychological distress	199	0.99	0.98, 1.00	<.001	1.00	1.00, 1.00	.085

Significant interaction effect of other drug use and psychological distress on odds of presenteeism.

Margins

Show/hide the code

options(marginaleffects_safe = FALSE)
df_sub <- df_sub |>
  mutate(presenteeism1 = ifelse(presenteeism >= 1, 1, 0))

means <- df_sub |>
  summarise(
    age_mean = mean(age, na.rm = TRUE),
    agency_years_mean = mean(agency_years, na.rm = TRUE),
    alcohol_mean = mean(alcohol_use_scores, na.rm = TRUE),
    k10_mean = mean(k10_total, na.rm = TRUE),
    k10_sd = sd(k10_total, na.rm = TRUE)
  )

k10_low <- with(means, k10_mean - k10_sd)
k10_high <- with(means, k10_mean + k10_sd)
m <- glm(
  presenteeism1 ~ age +
    agency_years +
    max_other_drug_score * k10_total +
    alcohol_use_scores,
  data = df_sub,
  family = binomial
)

preds <- avg_predictions(
  m,
  by = c("max_other_drug_score", "k10_total"),
  newdata = datagrid(
    max_other_drug_score = 0:40,
    k10_total = c(k10_low, k10_high)
  )
)
# extract predicted probabilities at min/max drug score for each k10 level
preds |>
  filter(max_other_drug_score %in% c(0, 40)) |>
  dplyr::select(
    max_other_drug_score,
    k10_total,
    estimate,
    conf.low,
    conf.high
  ) |>
  mutate(across(c(estimate, conf.low, conf.high), ~ round(.x * 100, 1)))


 Estimate 2.5 % 97.5 %
     18.0   9.8   26.3
     69.3  57.0   81.6
     99.7  98.7  100.8
     87.5  66.0  109.1

Show/hide the code

# marginal effect of drug score at low vs high distress
comparisons(
  m,
  variables = "max_other_drug_score",
  newdata = datagrid(k10_total = c(k10_low, k10_high))
)


 k10_total Estimate Std. Error    z Pr(>|z|)   S   2.5 % 97.5 %
      13.9   0.0388    0.01292 3.00  0.00266 8.6  0.0135 0.0641
      29.5   0.0058    0.00576 1.01  0.31423 1.7 -0.0055 0.0171

Term: max_other_drug_score
Type: response
Comparison: +1

Show/hide the code

# test whether marginal effects differ between distress levels
comparisons(
  m,
  hypothesis = "b2-b1=0",
  variables = "max_other_drug_score",
  newdata = datagrid(k10_total = c(k10_low, k10_high))
)


 Hypothesis Estimate Std. Error     z Pr(>|z|)   S   2.5 %  97.5 %
    b2-b1=0   -0.033     0.0108 -3.07  0.00217 8.8 -0.0541 -0.0119

Type: response

Analysis of marginal effects showed that at lower levels of psychological distress (1 SD below the mean = 13.9), the predicted probability of presenteeism ranged from 18% at zero other drug use risk, to 99.7% at max risk. At high levels of psychological distress, probabilities ranged from 69.3% at low levels of risk to 87.5% at maximum risk.

However, the high psychological distress x maximum other drug use risk group had an upper confidence interval exceeding 100. So the effect of other drug use risk on presenteeism is less certain at higher levels of distress.

!!!

Each one point increase in other drug use risk was associated with a 3.9% increase in the probability of presenteeism at low psychological distress (p = .003), whereas the effect is attenuated and non-significant at high psychological distress (0.6% increase; p = .310). The difference in marginal effects between low and high distress was significant (b = -0.03, p = .002), indicating that the effect of other drug use risk on presenteeism is significantly stronger at lower levels of psychological distress.

Plot

Show/hide the code

colourpal <- c("#1b9e77", "#d95f02", "#7570b3", "#e7298a", "#e6ab02")
preds <- preds %>% mutate(distress = if_else(k10_total < 13.9, "Low","High"),
                 distress = factor(distress, levels = c("Low", "High"), labels = c("Low (mean - 1 SD)", "High (mean + 1 SD)"),
                                   ordered = TRUE))

preds %>% ggplot(aes(x = max_other_drug_score, y = estimate, color = distress, fill = distress)) +
  geom_line(size = 1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.3, colour = NA) +
  scale_y_continuous(
    labels = label_percent(accuracy = 1),
   breaks = seq(0, 1, by = 0.1), limits = c(0, 1.10) ) +
  scale_x_continuous(breaks = seq(0, 40, 5), limits = c(0, 40)) + 
  scale_colour_manual(values = c("#1b9e77","#7570b3")) +    
  scale_fill_manual(values = c("#1b9e77","#7570b3")) +
    
    labs(
      x = "Maximum other drug use risk score (ASSIST)",
      y = "Predicted probability of presenteeism",
      color = "Psychological distress (K10+)",
      fill = "Psychological distress (K10+)"
    ) +
    theme_pubr(base_size = 13) +
    labs_pubr()

Figure 2: Predicted probability of any presenteeism by other drug use risk and psychological distress

(Note. Lines show model-based predictions from the logit component of the hurdle negative binomial model for presenteeism plotted over maximum other drug use risk scores (0-40) at low (1 SD below the mean) and high (1 SD above the mean) levels of psychological distress (M = 21.7, SD = 7.8). Shaded areas represent 95% confidence intervals. All other covariates in the final model (age, years working for agency, alcohol use risk) are held at their sample means. The significant interaction shows that when psychological distress is low, increases in other drug use risk are associated with an increase in the probability of any presenteeism. At higher levels of psychological distress, the baseline probability of presenteeism is elevated and the effect of other drug use risk is attentuated)

Interpretation

Risky levels of alcohol use, drug use, and psychological distress were prevalent in the sample. Alcohol use, other drug use, and psychological distress predicted the odds of absenteeism, and other drug use predicted the odds of presenteeism (particularly when psychological distress was low). Psychological distress was the most consistent predictor of the odds and frequency of both absenteeism and presenteeism. This suggests that psychological distress is the primary driver of occupational impairment in PSP. AOD use contributes to occupational impairment, but in ways that differ across contexts (absenteeism/presenteeism), and substance classes (alcohol vs other drugs)