Data Management in Health Research: You're funded but are you FAIR?

This 2-day event will run on Tuesday 7th and Wednesday 8th April 2020 at University College Cork.

Background:

Research funding is evolving, and the onus now rests on primary researchers to plan the collection and curation of research data with a view to maximising its availability, together with the software and materials that underpin it.

From the 1st January 2020, data gathered and generated in whole or in part from Health Research Board (HRB) funded research will be subject to the HRB policy on Management and Sharing of Research Data.

Target audience:

Day 1 - Symposium: Research leaders, PIs and post-docs currently holding, or actively seeking, grants funded by the HRB.

Day 2 - Workshop: Researchers actively engaged with data collection, curation and management as part of HRB-funded research.

Aims:

Day 1 - To provide an overview of the HRB policy, to hear from researchers currently engaged in HRB-funded research, and to hear from researchers already benefiting from the adoption of improved data management policies in their own organisation(s).

Day 2 - To provide guidance to researchers tasked with employing data management protocols. Instruction will be given on both quantitative and qualitative data types through the lens of the FAIR data principles (Findable, Accessible, Interoperable, Re-usable).

To register for the event, please follow this link.

Speakers/Facilitators:

Topics covered will include:

Preliminary Schedule

Evaluating a logistic regression based prediction tool in R

This is a tutorial on how to use R to evaluate a previously published prediction tool in a new dataset. Most of the good ideas came from Maarten van Smeden, and any mistakes are surely mine. This post is not intended to explain why one might do what follows, but rather how to do it in R.

It is based on a recent analysis we published (in press) that validated the HOMR model to predict all-cause mortality within one year of a hospitalization, in a cohort of patients aged 65 years or older who were under the care of the geriatric medicine service at Cork University Hospital (2013-01-01 to 2015-03-06). The predictors used in the HOMR model were generally things that are easy enough to collect during a hospitalization. You can read all about it in the paper linked above.

All materials for the analysis can be found on the project’s OSF page.

# The packages used (tidyverse, rms, etc.) are listed at the end of the post
  data <- read_csv("https://crfcsdau.github.io/public/homr_data.csv") 
  data$alive <- factor(data$alive, levels = c("Yes", "No")) # "No" = died

Calculating the HOMR score

The first step in evaluating the HOMR model in our cohort of patients was to calculate the linear predictor (Z) for each of them. This was done by multiplying the fitted regression coefficients (on the log-odds scale; found in appendix E of the HOMR development paper) by each patient’s respective variable values, and then adding them all together, including the intercept term (data$z_homr).
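To make the arithmetic concrete, here is a minimal sketch of that calculation for a single patient. The coefficient and predictor values below are made up purely for illustration; the real HOMR coefficients are the ones in appendix E of the development paper.

# Illustration only: hypothetical coefficients, not the published HOMR values
  b0 <- -10.0                              # intercept (log-odds scale)
  b  <- c(sqrt_age = 1.2, log_cci = 0.8)   # coefficients for two predictors

# Linear predictor for one hypothetical patient: an 80 year old with a
# Charlson comorbidity index of 3
  z_example <- unname(b0 + b["sqrt_age"] * sqrt(80) + b["log_cci"] * log(3))
  z_example # Z on the log-odds scale; plogis(z_example) converts it to a probability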

The original HOMR development paper also provided a set of tables to calculate a score based on the predictors (data$homr_3). These scores are just a rescaled version of Z, giving a set of integer values that can easily be added together from simple tables to get a prediction. However, since there was considerable rounding error (as you can see from the plot below), we used Z instead of the HOMR score in our evaluation.

Figure 1. The relationship between the HOMR score and the HOMR linear predictor (Z) in our cohort of patients.

  ggplot(data, aes(x = z_homr, y = homr_3)) +
    geom_point(alpha = 0.2, size = 2, color = viridis(1, begin = 0.5)) +
    xlab("HOMR linear predictor (log odds scale)") +
    ylab("HOMR score") +
    theme_minimal()

Evaluating the model: Overview

To evaluate the HOMR Model, we followed the procedure outlined in Vergouwe et al (2016) and estimated four logistic regression models. The first included the HOMR linear predictor, with its coefficient set equal to 1 and its intercept set to zero (the original HOMR model). The second model allowed the intercept to be freely estimated (Recalibration in the Large). The third then allowed the coefficient on the HOMR linear predictor to be freely estimated as well (Logistic Recalibration). Finally, the fourth model included the complete set of variables used in the HOMR model, including the same transformations and interactions, and allowed their respective coefficients to be freely estimated (Model Revision). Here is the code for executing those models:

# HOMR (intercept = 0 and beta = 1)
  m1 <- glm(alive ~ -1 + offset(z_homr), data = data, family = binomial)
# Recalibration in the Large (estimate intercept, beta = 1)
  m2 <- glm(alive ~  1 + offset(z_homr), data = data, family = binomial)
# Logistic Recalibration (estimate both)
  m3 <- glm(alive ~  1 +        z_homr,  data = data, family = binomial)
# Model Revision
  data2 <- data
  data2$admit_service <- as.character(data2$admit_service)
  data2$admit_service <- factor(
    ifelse( # Sparse cells, so this was collapsed to binary
    data2$admit_service == "General Medicine", 
    data2$admit_service, 
    "Other"
    )
  )
  
  formula_4 <- as.formula(
    "alive ~  1 + drs + sqrt_age + sex + living_status + log_cci +
    trans_ed + trans_amb + admit_service + urgency_admit + urgent_readmit +      
    (sqrt_age * log_cci) + (living_status * trans_amb) + 
    (urgency_admit * trans_amb)"
    )

  m4 <- glm(formula_4, data = data2, family = binomial)
  
# Model predicted probabilities
  data$m1_pred <- predict(m1, type = "response")
  data$m2_pred <- predict(m2, type = "response")
  data$m3_pred <- predict(m3, type = "response")
  data$m4_pred <- predict(m4, type = "response")

The overall performance of these models was evaluated using the Brier score, rescaled to range from 0 to 1 (with higher values indicating better performance), as suggested by Steyerberg et al (2010). We assessed calibration graphically, in addition to using the maximum and average difference in predicted vs loess-calibrated probabilities (Emax and Eavg); and we used the c-statistic to assess discrimination. For each of these metrics, we reported bootstrapped 95% confidence intervals. Finally, for Model Revision, we estimated an optimism-corrected c-statistic and shrinkage factor using bootstrapping, as described in Harrell, Lee and Mark (1996).

To get most metrics, we used rms::val.prob as follows:

# Metrics  
  val_m1 <- val.prob(data$m1_pred, as.numeric(data$alive) - 1, 
                     pl = FALSE) %>% round(3)
  val_m2 <- val.prob(data$m2_pred, as.numeric(data$alive) - 1, 
                     pl = FALSE) %>% round(3)
  val_m3 <- val.prob(data$m3_pred, as.numeric(data$alive) - 1, 
                     pl = FALSE) %>% round(3)
  val_m4 <- val.prob(data$m4_pred, as.numeric(data$alive) - 1, 
                     pl = FALSE) %>% round(3)
  
  rescale_brier <- function(x, p, ...){ 
    format(round(1 - (x / (mean(p) * (1 - mean(p)))), digits = 2), nsmall = 2)
  }
  
  b1 <- rescale_brier(val_m1["Brier"], 0.184) 
  b2 <- rescale_brier(val_m2["Brier"], 0.184)
  b3 <- rescale_brier(val_m3["Brier"], 0.184)
  b4 <- rescale_brier(val_m4["Brier"], 0.184)
# Note: 0.184 is the marginal probability of death in the entire sample

Here is what the rms::val.prob output looks like.

  pander(val_m1) 
| Dxy   | C (ROC) | R2   | D     | D:Chi-sq | D:p | U     | U:Chi-sq | U:p |
|-------|---------|------|-------|----------|-----|-------|----------|-----|
| 0.564 | 0.782   | 0.22 | 0.145 | 204.6    | NA  | 0.021 | 32.23    | 0   |

| Q     | Brier | Intercept | Slope | Emax  | E90   | Eavg  | S:z    | S:p |
|-------|-------|-----------|-------|-------|-------|-------|--------|-----|
| 0.123 | 0.127 | -0.428    | 0.985 | 0.103 | 0.102 | 0.058 | -3.806 | 0   |

Uncertainty in these metrics was evaluated with bootstrapping.

  set.seed(48572)

  boot_val <- function(data, formula, ...){
    out <- list()
    for(i in 1:500){
      df <- sample_n(data, nrow(data), replace = TRUE)
      md <- glm(formula, data = df, family = binomial)
      out[[i]] <- val.prob(predict(md, type = "response"),
                           as.numeric(df$alive) - 1, 
                           pl = FALSE) %>% round(3)
    }
    return(out)
  }

  boot_vals_m1 <- boot_val(data,  as.formula("alive ~ -1 + offset(z_homr)"))
  boot_vals_m2 <- boot_val(data,  as.formula("alive ~  1 + offset(z_homr)"))
  boot_vals_m3 <- boot_val(data,  as.formula("alive ~  1 +        z_homr "))
  boot_vals_m4 <- boot_val(data2, formula_4) 

This code just pulls out 95% intervals from the bootstrapped values.

  calc_ci <- function(metric, boot_vals, n){
    x <- unlist(map(boot_vals, `[`, c(metric)))
    if(metric == 'Brier'){x <- as.numeric(rescale_brier(x, 0.184))}
    paste0("(", round(quantile(x, 0.025), n), " to ", 
           round(quantile(x, 0.975), n), ")")
  }

# m1
  m1_c_boot_ci     <- calc_ci("C (ROC)", boot_vals_m1, 2)
  m1_brier_boot_ci <- calc_ci("Brier",   boot_vals_m1, 2)
  m1_emax_boot_ci  <- calc_ci("Emax",    boot_vals_m1, 2)
  m1_eavg_boot_ci  <- calc_ci("Eavg",    boot_vals_m1, 2)

# m2
  m2_c_boot_ci     <- calc_ci("C (ROC)", boot_vals_m2, 2)
  m2_brier_boot_ci <- calc_ci("Brier",   boot_vals_m2, 2)
  m2_emax_boot_ci  <- calc_ci("Emax",    boot_vals_m2, 2)
  m2_eavg_boot_ci  <- calc_ci("Eavg",    boot_vals_m2, 2)
  
# m3
  m3_c_boot_ci     <- calc_ci("C (ROC)", boot_vals_m3, 2)
  m3_brier_boot_ci <- calc_ci("Brier",   boot_vals_m3, 2)
  m3_emax_boot_ci  <- calc_ci("Emax",    boot_vals_m3, 2)
  m3_eavg_boot_ci  <- calc_ci("Eavg",    boot_vals_m3, 2)
  
# m4
  m4_c_boot_ci     <- calc_ci("C (ROC)", boot_vals_m4, 2)
  m4_brier_boot_ci <- calc_ci("Brier",   boot_vals_m4, 2)
  m4_emax_boot_ci  <- calc_ci("Emax",    boot_vals_m4, 2)
  m4_eavg_boot_ci  <- calc_ci("Eavg",    boot_vals_m4, 2)

This code is for the shrinkage-corrected c-index and calibration slope for Model Revision. Since we are estimating new coefficients for each predictor from our own data, this helps us guard against over-fitting. To do this, we use rms::validate.

# To use validate, we need to estimate m4 with lrm instead of glm. 

  d <- datadist(data2) # match the data used to fit the model below
  options(datadist = "d")
  
  m4b <- lrm(formula_4, data = data2, x = TRUE, y = TRUE)

  set.seed(04012019)
  val_new <- rms::validate(m4b, B = 500)
  
  shrink_factor <- round(val_new["Slope","index.corrected"], 2)
  c_corrected <- round(0.5 * (1 + val_new["Dxy","index.corrected"]), 2)

Finally, this is the code for making the calibration plots (rms::val.prob will also give a nice calibration plot unless it’s suppressed in the call, but I wanted a ggplot based version so I could tweak it to my liking).

# Function to produce the calibration plots

  cal_plot <- function(model, model_name, pred_var, ...){

    require(tidyverse)
    require(viridis)
    require(gridExtra)
    require(grid) # needed for grid.newpage() and grid.draw() below

# The calibration plot        
    g1 <- mutate(data, bin = ntile(get(pred_var), 10)) %>% 
          # Bin prediction into 10ths
      group_by(bin) %>%
      mutate(n = n(), # Get ests and CIs
             bin_pred = mean(get(pred_var)), 
             bin_prob = mean(as.numeric(alive) - 1), 
             se = sqrt((bin_prob * (1 - bin_prob)) / n), 
             ul = bin_prob + 1.96 * se, 
             ll = bin_prob - 1.96 * se) %>%
      ungroup() %>%
    ggplot(aes(x = bin_pred, y = bin_prob, ymin = ll, ymax = ul)) +
      geom_pointrange(size = 0.5, color = "black") +
      scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
      scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
      geom_abline() + # 45 degree line indicating perfect calibration
      geom_smooth(method = "lm", se = FALSE, linetype = "dashed", 
                  color = "black", formula = y~-1 + x) + 
                  # straight line fit through estimates
      geom_smooth(aes(x = get(pred_var), y = as.numeric(alive) - 1), 
                  color = "red", se = FALSE, method = "loess") + 
                  # loess fit through estimates
      xlab("") +
      ylab("Observed Probability") +
      theme_minimal() +
      ggtitle(model_name)

# The distribution plot        
    g2 <- ggplot(data, aes(x = get(pred_var))) +
      geom_histogram(fill = "black", bins = 200) +
      scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.1)) +
      xlab("Predicted Probability") +
      ylab("") +
      theme_minimal() +
      scale_y_continuous(breaks = c(0, 40)) +
      theme(panel.grid.minor = element_blank())
    
# Combine them    
    g <- arrangeGrob(g1, g2, respect = TRUE, heights = c(1, 0.25), ncol = 1)
    grid.newpage()
    grid.draw(g)
    return(g[[3]])

  }

Results

# Combine all the results into a single dataframe 

  model_tab_1 <- data_frame(
    est = c("Intercept", "Slope", "Residual deviance", "Df", 
            "LRT Chisq p-value", "Brier score (rescaled)",
            "Emax", "Eavg", "c-statistic"),
    m1_est = c(round(c(0, 1,                 m1$deviance, m1$df.residual, 0), 2), 
               paste(b1, m1_brier_boot_ci), 
               paste(val_m1["Emax"], m1_emax_boot_ci),
               paste(val_m1["Eavg"], m1_eavg_boot_ci),
               paste(round(val_m1["C (ROC)"], 2), m1_c_boot_ci)),
    m2_est = c(round(c(tidy(m2)$estimate, 1, m2$deviance, m2$df.residual, 0), 2), 
               paste(b2, m2_brier_boot_ci), 
               paste(val_m2["Emax"], m2_emax_boot_ci),
               paste(val_m2["Eavg"], m2_eavg_boot_ci), 
               paste(round(val_m2["C (ROC)"], 2), m2_c_boot_ci)),
    m3_est = c(round(c(tidy(m3)$estimate,    m3$deviance, m3$df.residual, 0), 2), 
               paste(b3, m3_brier_boot_ci), 
               paste(val_m3["Emax"], m3_emax_boot_ci),
               paste(val_m3["Eavg"], m3_eavg_boot_ci),
               paste(round(val_m3["C (ROC)"], 2), m3_c_boot_ci)),
    m4_est = c("", "",               round(c(m4$deviance, m4$df.residual, 0), 2), 
               paste(b4, m4_brier_boot_ci), 
               paste(val_m4["Emax"], m4_emax_boot_ci),
               paste(val_m4["Eavg"], m4_eavg_boot_ci),
               paste(round(val_m4["C (ROC)"], 2), m4_c_boot_ci))
    )
  
  names(model_tab_1) <- c("", "HOMR model", "Calibration in the Large", 
                          "Logistic Recalibration", "Model Revision")
  
  model_tab_1[5, 2:5] <- c(
    "-", "<0.001", round(anova(m2, m3, test = "Chisq")[5][2, ], 2), "-"
    )

Table 1. All the results hacked into a table.

  knitr::kable(model_tab_1)
|                        | HOMR model           | Calibration in the Large | Logistic Recalibration | Model Revision       |
|------------------------|----------------------|--------------------------|------------------------|----------------------|
| Intercept              | 0                    | -0.42                    | -0.43                  |                      |
| Slope                  | 1                    | 1                        | 0.99                   |                      |
| Residual deviance      | 1139.96              | 1107.76                  | 1107.73                | 1046.55              |
| Df                     | 1409                 | 1408                     | 1407                   | 1389                 |
| LRT Chisq p-value      | -                    | <0.001                   | 0.85                   | -                    |
| Brier score (rescaled) | 0.15 (0.09 to 0.22)  | 0.19 (0.11 to 0.26)      | 0.19 (0.11 to 0.27)    | 0.23 (0.17 to 0.32)  |
| Emax                   | 0.103 (0.08 to 0.17) | 0.111 (0.03 to 0.25)     | 0.121 (0.03 to 0.23)   | 0.017 (0.01 to 0.11) |
| Eavg                   | 0.058 (0.04 to 0.08) | 0.016 (0.01 to 0.03)     | 0.017 (0.01 to 0.03)   | 0.008 (0 to 0.02)    |
| c-statistic            | 0.78 (0.75 to 0.81)  | 0.78 (0.75 to 0.81)      | 0.78 (0.75 to 0.81)    | 0.82 (0.8 to 0.86)   |

Calibration plots

Figure 2. Calibration plot for the HOMR model in a sample of 1409 patients aged 65 years or older that were under the care of geriatric medicine service at Cork University Hospital (2013-01-01 to 2015-03-06)

  x <- cal_plot(m1, "HOMR model", "m1_pred")

Figure 3. Calibration plot for Recalibration in the Large
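The code for this plot wasn’t shown in the original post, but following the pattern used for Figure 2, it would presumably be something like:

  x <- cal_plot(m2, "Recalibration in the Large", "m2_pred")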

Figure 4. Calibration plot for Model Revision
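Again, the call wasn’t shown in the original post, but presumably:

  x <- cal_plot(m4, "Model Revision", "m4_pred")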

Other stuff

Figure 5. Logistic curve fit of the original HOMR model to one year post-hospitalization mortality

# Plot of data with a logistic curve fit
  ggplot(data, aes(x = z_homr, y = as.numeric(alive) - 1)) +
    geom_jitter(height = 0.1, size =1, alpha = 0.5) +
    geom_smooth(method = "glm",
                method.args = list(family = "binomial")) +
    theme_minimal() +
    scale_y_continuous(breaks = c(0, 1), labels = c("Alive", "Dead")) +
    ylab("") +
    xlab("HOMR Linear Predictor")

Figure 6. Number/Proportion of patients who died within one year of hospitalization by risk level (Z)

  g1 <- ggplot(data, aes(x = z_homr, fill = alive)) +
    geom_histogram() +
    theme_minimal() +
    xlab("HOMR Linear Predictor") +
    ylab("Number of Participants") +
    scale_fill_brewer("Alive", palette = "Paired")

  g2 <- ggplot(data, aes(x = z_homr, fill = alive)) +
    geom_histogram(position = "fill") +
    theme_minimal() +
    xlab("HOMR Linear Predictor") +
    ylab("Proportion") +
    scale_fill_brewer("Alive", palette = "Paired")
  
  grid.arrange(g1, g2, ncol = 1)

Figure 7. Distribution of the original HOMR linear predictor among those who did and didn’t die within one year after hospitalization

  ggplot(data, aes(y = z_homr, x = alive, fill = alive, color = alive)) +
    geom_beeswarm() +
    geom_boxplot(alpha = 0, color = "black") +
    theme_minimal() +
    ylab("HOMR Linear Predictor") +
    xlab("Alive at 1 year") +
    scale_fill_brewer(guide = FALSE, palette = "Paired") +
    scale_color_brewer(guide = FALSE, palette = "Paired") 

Table 2. HOMR Model Revision coefficients

  covariates <- c(
    "Intercept",
    "DRS",                                            
    "sqrt(Age)",                                            
    "Male (vs Female)",                                       
    "Rehab",                             
    "Homecare",                         
    "Nursing Home",                     
    "log(CCI)",                                            
    "sqrt(Ed visits in the previous year + 1)",                                
    "1/(Admissions by ambulance in previous year +1)",    
    "Other (vs General Medicine)",                        
    "ED w/o Ambulance",                 
    "ED w/Ambulance",                   
    "Urgent readmission",                             
    "Sqrt(Age) * log(CCI)",                                       
    "Rehab * 1/(Admissions by ambulance in previous year +1)",              
    "Homecare * 1/(Admissions by ambulance in previous year +1)",           
    "Nursing Home * 1/(Admissions by ambulance in previous year +1)",       
    "ED w/o Ambulance * 1/(Admissions by ambulance in previous year +1)",
    "ED w/Ambulance * 1/(Admissions by ambulance in previous year +1)"
  )

  sjPlot::tab_model(m4, show.p = FALSE, show.ci = FALSE, show.se = TRUE, 
                    transform = NULL, pred.labels = covariates)
Outcome: alive

| Predictors | Log-Odds | std. Error |
|---|---|---|
| Intercept | -14.84 | 4.02 |
| DRS | 0.11 | 0.02 |
| sqrt(Age) | 1.45 | 0.43 |
| Male (vs Female) | 0.44 | 0.17 |
| Rehab | -1.16 | 0.71 |
| Homecare | 0.41 | 0.66 |
| Nursing Home | -0.34 | 1.29 |
| log(CCI) | 2.78 | 2.83 |
| sqrt(Ed visits in the previous year + 1) | 0.16 | 0.71 |
| 1/(Admissions by ambulance in previous year +1) | 0.19 | 0.61 |
| Other (vs General Medicine) | -0.68 | 0.46 |
| ED w/o Ambulance | 0.38 | 0.67 |
| ED w/Ambulance | 1.21 | 1.12 |
| Urgent readmission | 0.60 | 0.27 |
| Sqrt(Age) * log(CCI) | -0.23 | 0.31 |
| Rehab * 1/(Admissions by ambulance in previous year +1) | -0.31 | 0.79 |
| Homecare * 1/(Admissions by ambulance in previous year +1) | -0.51 | 0.82 |
| Nursing Home * 1/(Admissions by ambulance in previous year +1) | -0.46 | 1.78 |
| ED w/o Ambulance * 1/(Admissions by ambulance in previous year +1) | -0.87 | 0.76 |
| ED w/Ambulance * 1/(Admissions by ambulance in previous year +1) | -1.91 | 1.34 |
| Observations | 1409 | |
| Cox & Snell’s R2 / Nagelkerke’s R2 | 0.191 / 0.310 | |

Note: Admit service was recoded to General Medicine vs Other due to small cell sizes. ICU admission was omitted as there were only 3 such cases. Home O2 was omitted since no patients in our sample were using it.

Full distributions of bootstrapped values.

  boots <- function(metric, boot_vals){
    x <- as.numeric(unlist(map(boot_vals, `[`, metric)))
    if(metric == 'Brier'){x <- as.numeric(rescale_brier(x, 0.184))}
    return(x)
  }

  boot_list <- list(boot_vals_m1, boot_vals_m2, boot_vals_m3, boot_vals_m4)
  metrics <- c("C (ROC)", "Brier", "Eavg", "Emax")
  
  x <- c()
  for(i in metrics){
    for(j in 1:4){
      x <- c(x, boots(i, boot_list[[j]]))
    }
  }
  
  y <- rep(c(paste0("m", 1:4, " c-index"), paste0("m", 1:4, " Brier"), 
             paste0("m", 1:4, " Eavg"), paste0("m", 1:4, " Emax")), 
           each = 500) # label order must match the metrics vector above

  boot_data <- data_frame(var = y, val = x)

Figure 8. Distributions of bootstrapped model statistics

  ggplot(boot_data, aes(x = val)) +
    geom_density() +
    facet_wrap(~var, nrow = 4, scales = "free") +
    theme_minimal() +
    xlab("") +
    ylab("Density")

# Packages used

# library(tidyverse)
# library(rms)
# library(Hmisc)
# library(knitr)
# library(broom)
# library(pander)
# library(ggbeeswarm)
# library(gridExtra)
# library(grid)
# library(sjPlot)
# library(sjmisc)
# library(sjlabelled)
# library(viridis)

  sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] grid      methods   stats     graphics  grDevices utils     datasets 
## [8] base     
## 
## other attached packages:
##  [1] viridis_0.5.1     viridisLite_0.3.0 sjlabelled_1.0.17
##  [4] sjmisc_2.7.9      sjPlot_2.6.2      gridExtra_2.3    
##  [7] ggbeeswarm_0.6.0  pander_0.6.2      broom_0.4.4      
## [10] knitr_1.20        rms_5.1-2         SparseM_1.77     
## [13] Hmisc_4.1-1       Formula_1.2-3     survival_2.41-3  
## [16] lattice_0.20-35   forcats_0.3.0     stringr_1.3.1    
## [19] dplyr_0.8.0.1     purrr_0.2.5       readr_1.1.1      
## [22] tidyr_0.8.1       tibble_2.0.1      ggplot2_3.1.0    
## [25] tidyverse_1.2.1  
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-131        lubridate_1.7.4     insight_0.2.0      
##  [4] RColorBrewer_1.1-2  httr_1.4.0          rprojroot_1.3-2    
##  [7] TMB_1.7.13          tools_3.4.3         backports_1.1.2    
## [10] R6_2.2.2            rpart_4.1-11        vipor_0.4.5        
## [13] lazyeval_0.2.1      colorspace_1.3-2    nnet_7.3-12        
## [16] withr_2.1.2         tidyselect_0.2.5    mnormt_1.5-5       
## [19] emmeans_1.2.3       curl_3.3            compiler_3.4.3     
## [22] cli_1.0.1           rvest_0.3.2         quantreg_5.36      
## [25] htmlTable_1.12      xml2_1.2.0          sandwich_2.4-0     
## [28] labeling_0.3        bookdown_0.9        scales_1.0.0       
## [31] checkmate_1.8.5     polspline_1.1.13    mvtnorm_1.0-8      
## [34] psych_1.8.4         digest_0.6.18       minqa_1.2.4        
## [37] foreign_0.8-69      rmarkdown_1.10      base64enc_0.1-3    
## [40] pkgconfig_2.0.2     htmltools_0.3.6     lme4_1.1-17        
## [43] highr_0.7           htmlwidgets_1.3     rlang_0.3.4        
## [46] readxl_1.1.0        rstudioapi_0.7      zoo_1.8-2          
## [49] jsonlite_1.6        acepack_1.4.1       magrittr_1.5       
## [52] Matrix_1.2-12       Rcpp_1.0.0          munsell_0.5.0      
## [55] stringi_1.4.3       multcomp_1.4-8      yaml_2.1.19        
## [58] snakecase_0.9.1     MASS_7.3-47         plyr_1.8.4         
## [61] parallel_3.4.3      crayon_1.3.4        ggeffects_0.9.0    
## [64] haven_2.1.0         splines_3.4.3       sjstats_0.17.4     
## [67] hms_0.4.2           pillar_1.3.1        estimability_1.3   
## [70] reshape2_1.4.3      codetools_0.2-15    glue_1.3.0         
## [73] evaluate_0.10.1     blogdown_0.6        latticeExtra_0.6-28
## [76] data.table_1.11.4   modelr_0.1.2        nloptr_1.0.4       
## [79] MatrixModels_0.4-1  cellranger_1.1.0    gtable_0.2.0       
## [82] assertthat_0.2.0    xfun_0.2            xtable_1.8-2       
## [85] coda_0.19-1         glmmTMB_0.2.1.0     beeswarm_0.2.3     
## [88] cluster_2.0.6       TH.data_1.0-8

Trial Methods Week 1 - Reading Notes (Draft 1.0)

Week 1 - Introduction

The goals of week 1 are to:

  1. Welcome everyone and introduce the module
  2. Explain the role of the statistician on the trial team
  3. Motivate an interest in proper clinical trial design and analysis
  4. Introduce the most fundamental aspects of trial design, and justify the importance of control and randomization.

Welcome

I want to welcome you all to this module on Clinical Trial Design and Analysis.

You have already learned about all the important people it takes to run a clinical trial. These include research nurses, pharmacists, data managers, information technologists, programmers, research assistants, monitors, regulatory experts, statisticians, clinical investigators, and of course the patients themselves. Some of you taking this module are Clinical Investigators, and will be responsible for the medical care of patients enrolled in the study, the overall project management of the trial, and the scientific content of the study. This is a lot for any one person to contend with, so it is critical that you can count on the support from the other experts in the study team.

When it comes to meeting the scientific goals of the trial, this support will primarily come from the Trial Statistician. Importantly, the clinical investigator and the trial statistician will each bring different but vital knowledge to bear on the trial’s design. The clinical investigator will be the subject-matter expert, knowledgeable about the clinical problem at hand and the potential solution being evaluated in the trial. The statistician’s expertise, on the other hand, is ultimately about how to draw appropriate inferences from the results of a given trial, and how to design trials that best support these efforts. Another way to view this distinction is that the clinical investigator is an expert on what we know, while statisticians focus on understanding how we come to know things.

There is no doubt that good clinical investigators will, over time, acquire some of the statistician’s expertise, just as any respectable statistician will eventually become quite knowledgeable about the clinical areas they work in. However, a tenet of this module is that the role of the Trial Statistician should be respected as a unique contribution that is necessary for the successful conduct of a clinical trial. Consequently, this module is not designed to teach you how to “be a statistician”, but rather how to work with a statistician to support optimal trial design.

Why good design and analysis matters

Perhaps it is obvious to you why good trial design and analysis matters. There is, however, a wealth of evidence that clinical trials are often flawed in their design, and that errors in the analysis and reporting of trial data are common.

The problems are not inconsequential. In 1994, the late Doug Altman, writing in the BMJ, talked about “the scandal of poor medical research.”

What should we think about a doctor who uses the wrong treatment, either wilfully or through ignorance, or who uses the right treatment wrongly (such as by giving the wrong dose of a drug)? Most people would agree that such behaviour was unprofessional, arguably unethical, and certainly unacceptable. What, then, should we think about researchers who use the wrong techniques (either wilfully or in ignorance), use the right techniques wrongly, misinterpret their results, report their results selectively, cite the literature selectively, and draw unjustified conclusions? We should be appalled. Yet numerous studies of the medical literature, in both general and specialist journals, have shown that all of the above phenomena are common. This is surely a scandal.

One might hope that much has changed since Altman wrote these words, but in 2009, Chalmers and Glasziou made the provocative claim that perhaps as much as 85% of research funding in the biomedical sciences was wasted. This claim was followed up in a special issue on “research waste” in the Lancet five years later. The reasons for research waste included a number of issues we will touch on, to varying degrees, in this module. These were inappropriate or irrelevant research questions; biased or inaccessible reports of research results; and most importantly, from our perspective, errors in study design and data analysis.

Around the time that research waste was capturing our attention, Ioannidis published his landmark 2005 paper in PLOS Medicine, titled “Why Most Published Research Findings Are False.” While Chalmers, Glasziou and others were discussing a relatively broad set of problems, Ioannidis’s observations were focused more on statistics and epistemology. In a nutshell, he argued that many research findings were likely to be so-called false positives, and outlined the circumstances that would most often lead to this. While Ioannidis’s position has been challenged (see here), it helped promote the importance of meta-research (research about research) aimed at improving the methodological issues we are discussing.

Ioannidis’s paper also coincided with increasing recognition of the modern reproducibility crisis. Largely starting in psychology, but since extending to other fields, researchers started to pay more attention to replicating earlier studies. Many were surprised to find that studies often failed to replicate, even very famous ones. This led researchers to think more deeply about the reasons for this, which subsequently drew more attention to so-called questionable research practices, such as p-hacking (the process of modifying a statistical analysis, often with good intentions, until an acceptable result is found).

During this tumultuous period (which is ongoing), one particular statistical tool, the ubiquitous p-value, started to draw some of the blame for many of these problems. As we will learn next week, statisticians have been arguing about the value of p-values, or lack thereof, practically since Sir Ronald Fisher popularised them. However, this questioning of p-values among the wider community of scientists was notable, leading the American Statistical Association to commission an expert panel to weigh in. This resulted in a position paper from the ASA attempting to clarify how p-values should and shouldn’t be used. Of course, in the great tradition of statisticians, the position paper has hardly settled matters, and arguments over p-values continue (e.g. “Redefining Statistical Significance” vs “Justify Your Alpha”).

The Researcher as Statistician

So where does all of this leave us with respect to this module on clinical trial design and analysis? In our opinion, many of the problems we just described are substantially driven by a lack of expert statistical input in many, if not most, studies. This includes both the design of studies before they are conducted, and the analysis of the resulting data. Unfortunately, there aren’t enough experienced statisticians in the world to contribute to every study, so scientists are often expected to act as their own statistician (or similarly nominate someone they are working with). Thus most of us receive at least some statistical training, though it is typically limited in the following ways:

  • It typically centres on a statistical toolkit from which researchers select the appropriate test or procedure based on the characteristics of the data they are using. While this toolkit is often sufficient for the analysis of well-conducted, relatively simple experiments, it lacks the tools needed for common statistical challenges that occur in the wild, such as missing data, clustered observations, over-fitting, measurement error, and model selection.
  • The rationale for different statistical methods is often left unexplored, leaving students ill-equipped to apply them critically.
  • There is little training, if any, in data management, manipulation, or visualization.
  • Statistics is typically presented as something you do to data once they are collected, so that important links between study design and statistical methods are obscured.
  • Because almost all scientists receive some training in statistics, the mathematical underpinnings are often omitted or glossed over to accommodate the wide range of learners’ previous maths training and quantitative aptitude.
  • Students are frequently left with the impression that statistics is a monolith, firmly rooted in mathematics and thus somehow pure - while in truth the subject is highly contentious and as closely aligned to philosophy as it is to maths.

A troubling consequence of all of this is that too many statistical analyses are conducted by researchers who readily admit that they don’t feel comfortable with statistics (or even dislike them), or by researchers who don’t understand the limits of their own statistical knowledge. Further, because many senior scientists aren’t any more comfortable with statistics than their less experienced colleagues, the task of analysing the data often falls on the latter. These problems, in our experience, can be exacerbated in clinical trials. This is because most clinical investigators receive even less training in study design and statistics than the scientists leading research in other fields. This is of course no fault of their own – they are busy learning all of the critically important things they need to be good clinicians! This module, then, is partly aimed at catching everyone up, so to speak, while redressing some of the limitations of typical statistical training listed above - but perhaps more importantly, it is about teaching you how to work with a trial statistician.

Randomized controlled trials: An Overview

The randomized controlled trial (RCT) is widely recognized as the preferred study design for understanding the effects of an intervention. However, to reap the benefits of an RCT, it must be well-designed and properly conducted. This module is primarily focused on ensuring the former. RCT designs include several components, and choices regarding these components are deeply interrelated. It will thus be helpful to now take a big picture view of what these key components are.

The first thing any trial needs is patients. This fact underpins critically important ethical considerations, as well as many practical aspects of trial conduct. Special care will be taken to define exactly which kinds of patients to include in the trial.

The next thing a trial needs is an intervention we would like to test. Common examples include medicines, devices, and educational programs. For any proposed intervention, we will need to carefully consider equipoise, and how the nature of the intervention may impact the trial’s design.

The final critical component to a trial is the outcome (or endpoint), which is something about the patients we can measure to facilitate judgement of the intervention. In other words, it is almost always the thing about the patients that the intervention is meant to improve. Examples would be blood pressure, weight, quality of life, and time to death.

While one can technically run a trial with just the above three components, it’s the next two that really increase its evidential value. These are the use of a control arm, and the use of randomization to allocate patients to the different study arms.

Concurrent Controls

There is a famous story of a psychologist who went to work with the Israeli Air Force to help them improve the performance of their pilots (ref). The psychologist talked about the value of positive reinforcement - but an experienced trainer questioned this, pointing out that when they praised pilots after they did well on an exercise, they almost always did worse the next time out. Conversely, when pilots performed poorly, they were yelled at or punished, and sure enough, their performance would almost always improve. In other words, the experience of the trainer was completely counter to the psychologist’s advice. So who was right?

If you think it was the trainers, then you’ve just fallen victim to one of the most pervasive challenges in data analysis, called regression to the mean. In a nutshell, regression to the mean is when we select a group based on having relatively extreme values for some variable (e.g. take all the pilots who did very well on an exercise), measure that variable again, and see that the second set of measurements aren’t as extreme, on average, as the first set. To better understand this phenomenon, please watch the following video, which uses an example from SIDD.

[rtm.video]()

rtm.app
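If you would like to see the phenomenon numerically, here is a small simulation in R (mine, not from SIDD or the video): we take two noisy measurements of the same stable trait, select the top performers on the first measurement, and find that their average on the second measurement falls back toward the overall mean, with no feedback or intervention of any kind.

# A minimal simulation of regression to the mean (illustration only)
  set.seed(2020)
  n <- 1000
  ability <- rnorm(n, 0, 1)            # stable underlying trait
  test_1  <- ability + rnorm(n, 0, 1)  # first noisy measurement
  test_2  <- ability + rnorm(n, 0, 1)  # second, independent noisy measurement

# Select the "best" performers on the first test (the praised pilots)
  top <- test_1 > quantile(test_1, 0.9)

  mean(test_1[top]) # extreme, partly due to luck
  mean(test_2[top]) # closer to the overall mean, despite no intervention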

Randomization

Randomization refers to a set of tools used to allocate study participants to different treatments based on chance alone. R.A. Fisher is widely credited with first employing randomization in experimental research, when he used chance to assign treatments to different plots of land in agricultural experiments (Fisher RA. The Design of Experiments, 7th ed. Edinburgh: Oliver and Boyd, 1971). For Fisher, randomization was fundamental to the tests of significance he advocated, by supporting the assumptions about normal error distributions, and facilitating permutation tests when parametric assumptions were untenable (REF). We will discuss this further in the lessons on probability theory and statistical inference.
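To make that last point a bit more concrete, here is a minimal sketch of a permutation (re-randomization) test on simulated data (my own illustration, not Fisher’s example). Under the null hypothesis that the treatment does nothing, the observed allocation is just one of many equally likely allocations the randomization could have produced, so we can compare the observed difference between groups to the differences obtained by re-randomizing the labels.

# A minimal two-group permutation test on simulated data (illustration only)
  set.seed(1)
  y     <- rnorm(40, mean = 10, sd = 2) # outcome for 40 participants
  group <- rep(c("A", "B"), each = 20)  # the (randomized) allocation

  obs_diff <- mean(y[group == "A"]) - mean(y[group == "B"])

# Re-randomize the labels many times to build the null distribution
  perm_diffs <- replicate(5000, {
    g <- sample(group)
    mean(y[g == "A"]) - mean(y[g == "B"])
  })

  mean(abs(perm_diffs) >= abs(obs_diff)) # two-sided permutation p-value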

Bradford Hill is then credited with promoting randomization in the context of clinical trials (Medical Research Council (MRC). Streptomycin treatment of pulmonary tuberculosis: A Medical Research Council investigation. Br Med J 2:769-782, 1948). Though Hill was a statistician, his focus on randomization was less about its statistical properties and more about his desire to allocate patients to treatments in an unbiased manner – to avoid “personal idiosyncrasies” and “personal judgement”, and, importantly, to protect oneself from critics who might say that our treatment groups are different due to “our predilections or through our stupidity.” In other words, Hill saw randomization as our best tool for maintaining allocation concealment, and thus improving causal inferences. The importance of allocation concealment in experiments was recognized much earlier than randomization per se, and researchers were already using methods such as alternation (see the James Lind Library for an interesting, comprehensive overview). Hill understood, however, that while these tools might conceal allocation in theory, they were far from fool-proof. The value of randomization for this purpose is now well recognized: in modern trials, randomization is required for “utmost good faith”, and any regulator would likely be suspicious of a trial where the investigators chose not to randomize (SIDD p35).

Required reading

SIDD 2007 - Chapter 3

Discussion

Topic 1 - Randomization Workplace Wellness Programs Don’t Work Well. Why Some Studies Show Otherwise. https://www.nytimes.com/2018/08/06/upshot/employer-wellness-programs-randomized-trials.html

Topic 2 – Placebo effects and regression to the mean The Therapeutic Effect of Intra-articular Normal Saline Injections for Knee Osteoarthritis: A Meta-analysis of Evidence Level 1 Studies http://journals.sagepub.com/doi/abs/10.1177/0363546516680607?journalCode=ajsb

Additional reading:

SIDD – Chapter 5

Bridging Clinical Investigators and Statisticians: Writing the Statistical Methodology for a Research Proposal

The Statistician’s Role in Developing a Protocol for a Clinical Trial

Researcher Requests for Inappropriate Analysis and Reporting: A U.S. Survey of Consulting Biostatisticians

1-day R workshop for experienced users - Oct 30, 2018

Overview: R is a programming language and free software environment for statistical computing and graphics. User instruction in R is often limited to course-specific content that focuses primarily on statistical operations in base R. R’s functionality extends far beyond this and across research fields. When combined with the RStudio IDE, R becomes a powerful tool encompassing data cleaning, exploration, visualisation and documentation.

The dual aim of this 1-day workshop is to (A) highlight how the reorganisation of popular R packages, under the umbrella banner of the “tidyverse”, has made the versatility of R more accessible and (B) introduce data analysis through RStudio project workflows.

Preliminary schedule:

  • 9.30 am: Setup, introductions, housekeeping
  • 10 am: Lecture - Introduction to the tidyverse
  • 11 am: Coffee break – Discussion
  • 11.30 am: 3 × 30 minute tidyverse tutorials including:
    • Example scripts and problem sheets exploring R packages tidyr, dplyr, ggplot2 etc.
    • Opportunity to work on your own data sets
    • Opportunity to work in groups
  • 1 pm: Lunch break
  • 2 pm: Lecture - Project orientated workflows through RStudio
  • 3 pm: 3 × 30 minute tutorials dedicated to reproducible research including:
    • Project folder structure
    • Scripting guides and formats
    • Automated report generation using RMarkdown
  • 4.30 pm: Closing remarks

Essential Requirements:

  1. Experience using R and RStudio
  2. Your own laptop
  3. R and RStudio pre-installed and running
  4. Access to the Eduroam Wi-Fi network

Next step: To register your interest, email b.palmer@ucc.ie. Please ensure “[R Workshop]” appears in the subject line. Your place at the workshop will be confirmed via return email.

And finally: This workshop will lead into a 3-day Open Science event set to run from Wednesday 31st October – Friday 2nd November here in UCC (see attached poster and https://crfcsdau.github.io/post/openscienceucc/ for more). Research funders (including H2020, SFI, Wellcome and the HRB) are recognising the long-term benefits of Open Science, and new grant calls increasingly reflect this. The challenge for researchers is to become aware of the available options and to explore ways in which they can apply these techniques to their own day-to-day research workflows.

Open Science @ University College Cork (Oct 31 to Nov 2)

Open Science @ UCC is a 3-day workshop running from October 31 to November 2 (2018) at University College Cork.

Open science practices help make research accessible, transparent, and replicable. Research funders (including H2020, SFI and the HRB) are recognising the long-term benefits of open science, and new grant calls increasingly reflect this. The challenge for researchers is to become aware of the available options and to explore ways in which they can apply these techniques to their day-to-day research workflows.

The goal of Open Science @ UCC is to bring accomplished experts in open science to Cork to train 30 Open Science Champions. The event will be a mixture of lectures, hands-on tutorials, and participant-driven un-conferencing. There will also be time for discussions, break-out groups, and socializing. Participants will thus leave the workshop as skilled, confident advocates for open science, connected to a network of other open science experts.

Open Science @ UCC will take place in a multi-purpose learning space called the Creative Zone, located in the main Library on the UCC campus. The topics and confirmed speakers are listed below. We welcome researchers from all stages of their careers. The organizers are dedicated to a harassment-free workshop experience for everyone. Our anti-harassment policy can be found here.

To register for the event, please email Dr Darren Dahly (ddahly@ucc.ie). The registration fee is 300 euros per participant for the full three-day event. Payments can be made by invoice or (preferably) credit card (links for payment will be provided upon registration). Full refunds will be available until Monday, October 15, 2018. Given the goals of the workshop, there are no single day options.

The full programme is now available here.

Speakers/Facilitators:

Open Science Topics we will cover include:

Dealing with negative trial results

Welcome to the world of clinical trials. You’ve got a great idea that has real potential to improve health. But I have some bad news for you. Are you ready?

Your intervention won’t work. I’m sorry. It won’t. Not even a little.

“But hold on a minute,” you say. “I haven’t even run the trial yet. How can you possibly know it won’t work?” Experience my friend – and as any experienced trialist can tell you, the intervention never works. (Ok, sometimes the intervention works, but it usually doesn’t, even though we almost always think it will.)

With that in mind, let’s go forward in time. The last follow-up was completed a few weeks ago. Now you are standing next to me, the trial statistician, at the front of a large meeting room in some cookie-cutter hotel. I have a clicker in my hand, there is a graph projected onto the large screen behind us, and I just told everyone that the intervention didn’t work.

Who is everyone? The entire study team of course, all sat quietly at a giant U-shaped table with little name placards in front of them. All the site investigators are there. There is someone from the DSMB too. And of course all the other people it takes to run a trial. Someone from the funding body is there as well - it seems they are interested in what you did with their money. And don’t forget the patients’ representative – they gave more than anyone.

Look at their faces. They look so confused. Disappointed. A few even look sad. Oh no, are those tears? And we were having so much fun up to this point, drinking free coffee and catching up since our last meeting. Now they are just blankly staring at us. You have to say something. You have to say something now.

Freeze time

What are you going to say?

You’re new to this, so let me help you. To survive this moment, you need to be able to look everyone in the eye and remind them that the trial was designed to rigorously answer an important question. If this is true, then you have likely just learned something useful, regardless of whether the intervention worked or not. Further, if the study was well designed, other people will be more likely to trust your result and act on it. In other words, there is no such thing as a “negative” trial result – the answer given by a well-designed trial is always useful, whether the intervention worked or not. So you simply remind everyone of this. People will still be disappointed of course - we’d be fools to test treatments if we didn’t think they worked, and it’s natural to hope. But at least we know we ran a good trial and added knowledge to the world - that’s not nothing.

And what if the trial wasn’t well designed? Then the errors in the design might be used to reasonably explain away any result, “positive” or “negative”. Maybe the intervention didn’t appear to work because the trial was too small. Or maybe it only appeared to work because you used a surrogate outcome. And regardless of the explanations you come up with, nobody else will believe the result anyway, so what’s the point? So the best case scenario is that you’ve wasted lots of time and money. The worst case scenario is that you have needlessly put patients at risk. Trust me, there’s not enough perfume in France to cover that kind of stink.

So here is my advice. At the start of any trial, imagine yourself in the situation above. The intervention didn’t work. How are you going to live with yourself? Now act accordingly.

HRB CRFC Open Science Challenge Week 3

Thanks again to everyone who came to Week 2 of the Open Science Challenge. If you haven’t made it yet, no worries, you are welcome to join us any time (but if you want me to stop emailing you, let me know).

Our next meeting will be Friday, May 18 at 2-3 pm in WGB room 4.05, where we will have a visit from Vicky Hellon from F1000 to talk about HRB Open Research.

At our last meeting, we had an excellent discussion of what makes for a good research question. I think the key points from our conversation were as follows:

  1. The questions must be scientific, but what exactly does this mean? I offered a bit on the Problem of Induction, Popper, and Falsification, largely just to establish that smart people can and do disagree on what science is and how it progresses. However, we all seemed to generally agree that science is about asking questions that can be addressed by empirical data.
  2. We discussed the idea of novelty, contrasting false bravado and exaggeration with genuine novelty that drives scientific progress. We also discussed the relative merits of novelty and the importance of replication.
  3. We talked about the importance of our research questions. I mentioned concerns that many research questions, at least in biomedical science, aren’t the ones that stakeholders and decision makers want answered. I also mentioned the role of initiatives like the JLA Priority Setting Partnerships, as well as funding bodies, in focusing our collective efforts on important questions. We also discussed impact and how the REF in the UK has affected wider perceptions of this term – and thankfully we all agreed that seeing your research in the media isn’t necessarily impact. This conversation dovetailed into shared concerns that basic and methodological research are wrongly placed at a disadvantage because their potential value isn’t as immediately apparent as, say, an assessment of an intervention – and so we highlighted the need to make the case for the importance of this research, rather than assume that others (reviewers/editors) understand it.
  4. We discussed ethics, which turned more towards a discussion of ethics around data collection (rather than scientific questions), so we agreed to pick up this topic again.
  5. We were also able to identify cultural differences across the various fields we each represent.

In light of this conversation, we should now be more prepared to devise our own research questions, which we can discuss further on a newly established Slack channel.

HRB CRFC Open Science Challenge - Week 2

Thank you to everyone who came to our first session of the HRB Open Science Challenge. If you weren’t able to make it, no problem – we’ll catch you at the next session on May 4th, in room 4.05 of the Western Gateway Building (10:30 to 11:30), when we’ll be discussing what makes a good research question. Please give some thought to that before we meet, as well as what data you might/will use for your project.

During the first meeting we discussed motivations for open science.

  • Open Access - Since much of science is publicly funded, it seems reasonable that the public should be able to access the results of said research.

  • Reproducibility/Replicability – Single experiments in isolation provide little assurance that an observed effect is “real”. Science thus relies on scientists repeating and then confirming/dis-confirming earlier studies to eventually arrive at a consensus. Sharing research methods and materials is needed to facilitate this process.

  • Research Integrity – There is plenty of evidence to suggest that many studies are not conducted properly, especially when it comes to analyses of the resulting data. Mistakes tend to go un-detected because we typically work in isolation and don’t usually fully document our analyses. Further, we can’t really count on peer review to catch mistakes, since we don’t typically share materials with editors/reviewers, and their ability to judge science is highly variable. Thus, the best way (at least in my opinion) to improve the quality of data analyses is to ensure that they are done in a transparent, reproducible manner, which will be a substantial focus of these workshops.

Lastly we touched on the various tools we’ll cover during the challenge. These include:

  • Statistical programming/scripting
  • Reproducible reports/literate programming
  • Software testing
  • Pre-registration
  • Registered reports
  • Pre-prints
  • Open science publishing platforms

HRB CRFC Open Science Challenge

I’m a big fan of open science, and regularly argue for its importance. Strangely, however, I’ve never actually used many of the available open science tools. It’s just too damn easy to keep going with the status quo. That stops today!

Starting on Friday, April 20, Brendan and I will be hosting the HRB CRFC Open Science challenge. The challenge will be open until the end of the year, during which we will meet 2 or 3 times per month to develop an open science research project. To complete the challenge, participants must meet the following goals before the end of the year:

  1. Create a project on the Open Science Framework.
  2. Pre-register the analysis.
  3. Prepare all the analysis code so that it can be shared.
  4. Prepare the data according to FAIR principles.
  5. Write up the manuscript.
  6. Submit a pre-print to the relevant repository.
  7. Submit the manuscript to the HRB Open Research Publishing Platform for publication.

During the workshops, we will introduce and discuss the various open science tools, as well as talk about how to develop a manuscript in a manner that facilitates pre-registration.

To sign up, please email me your name, position, and affiliation (ddahly@ucc.ie). Anyone is welcome, but a large proportion of places will be reserved for PhD Students and Postdoctoral Researchers working at UCC. Ideally, participants will have some ideas about their research question, and already have access to relevant data. Workshops will rely substantially on small-group discussions - you have been warned.

Meetings will generally take place on Friday mornings in the Western Gateway Building on the UCC campus. The specific schedule will be announced for each month as the rooms are sorted.

Magical clusters

I was greeted today with the news that there are 5, not 2, types of diabetes. This is earth-shattering if you are a diabetologist or diabetes researcher. However, as soon as I saw the term “data-driven clustering”, I knew I could probably relax.

For the uninitiated, data-driven clustering techniques can seem magical. The basic premise is that you take a sample, measure lots of things about them, and then feed all those data into an algorithm that will divide your sample into mutually exclusive groups. Then, hopefully, knowing which group a person is in will tell you something about them that isn’t otherwise apparent in the data. This is a seductive idea. In this diabetes example, the authors are touting that a patient’s group membership could eventually be used to guide their treatment.

I suspect that some people reading about this research won’t understand just how easy it is to detect groups. It’s so easy in fact that I would be shocked if groups weren’t found in most cases. This is because clustering, regardless of the specific method used, is just another way to describe relationships among variables - and as long as the variables are in fact related, you will find groups. Here is a simple example:

First, create some data for two variables. The first is just 1000 observations drawn from a standard normal distribution. The second is the first multiplied by 2, plus some additional normally distributed noise.

  x <- rnorm(1000, 0, 1)
  y <- (2 * x) + rnorm(1000, 0, 3)

  data <- data.frame(x = x, y = y)

Plotting these data confirms that they are related.

  plot(y ~ x, data = data)

Pretending that we didn’t just simulate these data, most people seeing them for the first time would probably try to fit a regression line, the results of which are below.

  library(pander)

  pander(lm(y ~ x, data))
Fitting linear model: y ~ x

|             | Estimate | Std. Error | t value | Pr(>\|t\|) |
|-------------|----------|------------|---------|------------|
| (Intercept) | 0.07992  | 0.09135    | 0.8749  | 0.3818     |
| x           | 2.023    | 0.08921    | 22.68   | 3.411e-92  |

So the results of the linear model confirm what we already know. However, there are other ways to explain the data, so to speak. Here are the results from a fairly constrained finite mixture model which is similar enough to the k-means approach used in the diabetes paper.

  library(mclust)

  bic <- mclustBIC(data)

  clusters <- Mclust(data, x = bic, modelNames = "EII")
  
# summary(clusters, parameters = TRUE)
  plot(clusters, what = "classification")

What you can see from the plot is that the clustering algorithm explained the x-y scatter plot by positing the existence of 8 groups of observations, each covering a different area in the overall space. In other words, the clustering algorithm is trying to describe 2 correlated variables when the only tool it has is to group people. As long as there is in fact a correlation between the two variables, then you are going to need more than one group to describe the data (unless you start to use more flexible models).

So in most cases it is trivially easy to find groups. The challenge is of course to ascribe meaning to them. If we return to the linear regression, we might want to infer that x is a cause of y based on the observed relationship. However, we’d be foolish to do so without other information to support the causal claim. Similarly, you can pretend that data-driven clusters are revealing some deeper truth, but again, without other corroborating information, I wouldn’t be making strong claims.

For additional thoughts, Frank Harrell also blogged about this here; and Maarten van Smeden did some other simulation work described here.