The topic of prediction has gained considerable attention in the field of statistics and meta-analysis is no exception. Perhaps the most important practical benefit of a prediciton interval in meta-analysis is that it encapsulates information about the between-trial variability and it complements information about the point estimate. In contrast, the magnitude of the between-trial variance is difficult to interpret in isolation and the inability to capture the between-trial variance has been a topic of intense debate. When the estimate is zero (or close) the argument is that a “fixed” model is more appropriate, but the interpretation and usefulness of a prediction interval in this case is questionable. The “random-effects” model is assumed to always be the “correct” model, but not every dataset supports this type of model. Providing prediciton intervals in meta-analysis relfects a more accurate representation about whether the drug, treatment or intervention will be effective in future environments or for other trials or studies not in our sample.
Let’s define it:
An interval that will contain a future observation (i.e. not part of our sample) with a given level of miscoverage (say 5 or 10%), given our observed data. For this application, our inference is about a new study or trial, not about an individual observation or subject within a trial or study.
Very wide intervals have a high probability of containing the future trial/study but are likely to be impractically wide for decision making. Intervals that are too narrow and do not have nominal coverage give us an illusion of precision and certainty which might not hold true in the long run.
A prediction interval (https://en.wikipedia.org/wiki/Prediction_interval) is not a new idea and it goes back to the tradition of Gosset, Fisher and Neyman (Shafer and Vovk, 2008).
Given a sample (\(X_1, X_2, X_3, ..., X_n\)) we want an interval that will contain \(X_{n+1}\) with probability (\(1 - \alpha\), where \(\alpha \in (0,1)\)). The level of miscoverage (\(\alpha\)) is typically 0.05 or 0.10. It is common to suppose that \(X\) needs to meet the assumption of identically and indepenednently distributed (iid) and Gaussian, but this can be relaxed.
This package provides different methods for calculating a prediction interval for a random-effects model. This type of model is typical in meta-analysis.
There is substantial material on this topic here: https://femiguez.github.io/paf/
Classes supported:
All of these methods can be accessed through the interface ‘pred_int’ (not finished yet).
If we want a prediction interval for a new observation (\(Y_{n+1}\)) after a sample of \(Y_1,Y_2,Y_3,...,Y_n\) we can use the following functions.
set.seed(101)
y <- rnorm(50)
## Using the t-distribution
pdi.t <- pred_int_tdist(y)
## or pdi.t <- pred_int(y)
## Using conformal inference
pdi.c <- pred_int_conformal(y)
## or pdi.c <- pred_int(y, method = "conformal")
## Compare this with quantiles
y.quant <- quantile(y, probs = c(0.5, 0.025, 0.975))
method | fit | lwr | upr |
---|---|---|---|
tdist | -0.12 | -2.02 | 1.77 |
conformal | -0.12 | -2.06 | 1.43 |
quantile | -0.03 | -2.07 | 1.19 |
Random effects models are a way of incorporating the effect of ‘trial’ but with an imposed distribution. Commonly it is assumed that they are normally distributed but other options are possible.
\[ y_{ij} = \mu + \theta_i + e_{ij} \]
If we have a data set with the proper structure with a response variable a ‘trial’ variable and ‘reps’ within trials we can use the functions that work with data.frames
Let’s first visualize the data
data(soyrs)
soyrs.s <- tsum(soyrs, var.names = c("lrr", "Trial_ID"))
ggplot(data = soyrs.s, aes(x = m, y = id)) +
geom_point() + xlab("log RR") + ylab("ID") +
geom_errorbarh(aes(xmin = lb, xmax = ub)) +
geom_vline(xintercept = 0)
Now we can calculate prediction intervals
pdi.mf <- pred_int(soyrs, method = "metafor", var.names = c("TRT1_Yld","TRT2_Yld","Trial_ID"))
pdi.mcg <- pred_int_mcg_ntrial(lrr ~ Trial_ID, data = soyrs)
pdi.boot <- pred_int_boot(lrr ~ Trial_ID, data = soyrs)
pdi.t.df <- pred_int_tdist_df(soyrs, var.names = c("lrr", "Trial_ID"))
Alternatively, we can fit a model first using ‘lme4’ and then calcualte prediction intervals using different methods.
soy.lmm <- lmer(lrr ~ 1 + (1|Trial_ID), data = soyrs)
pdi.lm1 <- pred_int(soy.lmm, method = "tdist")
pdi.lm2 <- pred_int(soy.lmm, method = "boot")
pdi.lm3 <- pred_int(soy.lmm, method = "tdist2")
It is also possible to change the method used for degrees of freedom.
pdi.lm4 <- pred_int(soy.lmm, method = "tdist", degfr = "kr")
pdi.lm5 <- pred_int(soy.lmm, method = "tdist", degfr = "zdist") ## or 'Inf'
Now I implemented the sub-sampling method for conformal prediction, which in this case, is distribution-free.
pdi.cf.df <- pred_int_conformal_df(soyrs, var.names = c("lrr","Trial_ID"), s.size = 3)
Let’s display the data for the different methods.
method | fit | lwr | upr | |
---|---|---|---|---|
pdi.mf | metafor | 0.0116357 | -0.0909873 | 0.1142587 |
pdi.mcg | mcg-ntrial | 0.0124180 | -0.1221102 | 0.1491420 |
pdi.boot | boot | 0.0003741 | -0.0681357 | 0.2125807 |
pdi.lm1 | lmer-tdist | 0.0139535 | -0.1180481 | 0.1459551 |
pdi.lm2 | lmer-boot | 0.0150078 | -0.1180880 | 0.1481037 |
pdi.lm3 | lmer-tdist2 | 0.0139535 | -0.1167379 | 0.1446449 |
pdi.lm4 | lmer-tdist-kr | 0.0139535 | -0.1174986 | 0.1454056 |
pdi.lm5 | lmer-tdist-inf | 0.0139535 | -0.1080889 | 0.1359959 |
pdi.cf.df | conformal | 0.0127034 | -0.1140148 | 0.2977822 |
pdi.t.df | tdist-df | 0.0127452 | -0.1438330 | 0.1693234 |
This is an example where a foliar fungicide was applied.
data(mzfg)
## Remove one trial with just one rep
mzfg <- subset(mzfg, Trial_ID != "ST2007236A")
mzfg.s <- tsum(mzfg, var.names = c("lrr","Trial_ID"))
ggplot(data = mzfg.s, aes(x = m, y = id)) +
geom_point() +
geom_errorbarh(aes(xmin = lb, xmax = ub)) +
geom_vline(aes(xintercept = 0))
pdi.mf <- pred_int(mzfg, method = "metafor", var.names = c("TRT_Yld","CTR_Yld","Trial_ID"))
pdi.mcg <- pred_int(formula = lrr ~ Trial_ID, x = mzfg, method = "ntrial")
## Warning in pred_int_mcg_ntrial(formula = formula, data = x, level = level):
## chain might have not converged, gd1 (fixed). ratio: 1.5766152882168
## Warning in pred_int_mcg_ntrial(formula = formula, data = x, level = level):
## chain might have not converged, gd2 (random). ratio: 1.62737526925158
pdi.t.df <- pred_int_tdist_df(mzfg, var.names = c("lrr", "Trial_ID"))
mz.lmm <- lmer(lrr ~ 1 + (1|Trial_ID), data = mzfg)
pdi.lm1 <- pred_int(mz.lmm, method = "tdist")
pdi.lm2 <- pred_int(mz.lmm, method = "boot")
pdi.lm3 <- pred_int(mz.lmm, method = "tdist2")
pdi.lm4 <- pred_int(mz.lmm, method = "tdist", degfr = "kr")
pdi.lm5 <- pred_int(mz.lmm, method = "tdist", degfr = "zdist") ## or 'Inf'
pdi.cf.df <- pred_int_conformal_df(mzfg, var.names = c("lrr","Trial_ID"), s.size = 3)
method | fit | lwr | upr | |
---|---|---|---|---|
pdi.mf | metafor | 0.0228009 | -0.0100376 | 0.0556394 |
pdi.mcg | mcg-ntrial | 0.0224006 | -0.0043962 | 0.0494299 |
pdi.t.df | tdist-df | 0.0242552 | -0.0844140 | 0.1329245 |
pdi.lm1 | lmer-tdist | 0.0228913 | -0.0083930 | 0.0541756 |
pdi.lm2 | lmer-boot | 0.0222711 | 0.0042597 | 0.0402825 |
pdi.lm3 | lmer-tdist2 | 0.0228913 | -0.0766857 | 0.1224683 |
pdi.lm4 | lmer-tdist-kr | 0.0228913 | -0.0084089 | 0.0541915 |
pdi.lm5 | lmer-tdist-inf | 0.0228913 | -0.0081226 | 0.0539052 |
pdi.cf.df | conformal | 0.0242657 | -0.0812638 | 0.1473882 |
For this example I will use 90% prediction intervals as they are more stable than the typical 95%.
There are many models here that solve the problem in different ways. The conformal prediction interval has some advantages and disadvantages:
Because it does not make assumptions about an underlying model, the interval is wider, which makes sense. For small sample sizes it is not possible to construct a proper interval because the bounds end up being the upper and lower values of the sample (‘minimum and maximum’).
data(soysff)
## Comparing methods
pdi.mf <- pred_int(soysff, method = "metafor", level = 0.90,
var.names = c("TRT_Yld","CTR_Yld","Trial_ID"))
pdi.mcg <- pred_int(formula = lrr ~ Trial_ID, x = soysff,
method = "ntrial", level = 0.90)
pdi.t.df <- pred_int_tdist_df(soysff, level = 0.90,
var.names = c("lrr", "Trial_ID"))
soy.lmm <- lmer(lrr ~ 1 + (1|Trial_ID), data = soysff)
## boundary (singular) fit: see ?isSingular
pdi.lm1 <- pred_int(soy.lmm, method = "tdist", level = 0.90)
pdi.lm2 <- pred_int(soy.lmm, method = "boot", level = 0.90)
pdi.lm3 <- pred_int(soy.lmm, method = "tdist2", level = 0.90)
pdi.lm4 <- pred_int(soy.lmm, method = "tdist", degfr = "kr", level = 0.90)
pdi.lm5 <- pred_int(soy.lmm, method = "tdist", degfr = "zdist", level = 0.90) ## or 'Inf'
## Bayesian methods
prior1 <- list(R = list(V = 1, nu = 0.001), G = list(G1=list(V = 1, nu = 0.001)))
soy.mcg <- MCMCglmm(lrr ~ 1, random = ~Trial_ID, data = soysff,
prior = prior1, pr = TRUE, verbose = FALSE)
pdi.mcg1 <- pred_int(soy.mcg, method = "simulate", level = 0.90)
methods <- c("metafor", "mcg-ntrial","lmer-tdist","lmer-boot","lmer-tdist2",
"lmer-tdist-kr","lmer-tdist-inf","conformal","mcg-simulate",
"tdist-df")
cpdi <- data.frame(method = methods,
rbind(pdi.mf,pdi.mcg,pdi.lm1,pdi.lm2,pdi.lm3,pdi.lm4,pdi.lm5,
pdi.cf.df,pdi.mcg1,pdi.t.df))
kable(cpdi)
method | fit | lwr | upr | |
---|---|---|---|---|
pdi.mf | metafor | 0.0219847 | 0.0063394 | 0.0376300 |
pdi.mcg | mcg-ntrial | 0.0280974 | 0.0200789 | 0.0357235 |
pdi.lm1 | lmer-tdist | 0.0281040 | 0.0205813 | 0.0356267 |
pdi.lm2 | lmer-boot | 0.0279835 | 0.0115147 | 0.0444524 |
pdi.lm3 | lmer-tdist2 | 0.0281040 | -0.0785486 | 0.1347566 |
pdi.lm4 | lmer-tdist-kr | 0.0281040 | 0.0205616 | 0.0356464 |
pdi.lm5 | lmer-tdist-inf | 0.0281040 | 0.0207804 | 0.0354276 |
pdi.cf.df | conformal | 0.0242657 | -0.0812638 | 0.1473882 |
pdi.mcg1 | mcg-simulate | 0.0295827 | -0.0929196 | 0.1517079 |
pdi.t.df | tdist-df | 0.0271355 | -0.0824906 | 0.1367616 |
ggplot(data = cpdi, aes(x = fit, y = method)) +
geom_point() +
geom_errorbarh(aes(xmin = lwr, xmax = upr), height = 0.5)
data(soyrs)
## Simply calculate the trial means
tmns <- aggregate(lrr ~ Trial_ID, data = soyrs, FUN = mean)
## Bootstrapped stratified trial means
btm <- boot_tmeans(lrr ~ Trial_ID, data = soyrs, R = 2e3)$dat
pdi.cf <- pred_int_conformal_df(formula = lrr ~ Trial_ID, x = soyrs)
btm.q <- quantile(btm$ys, probs = c(0.5, 0.025, 0.975))
ggplot() + xlab("lrr") +
geom_density(data = btm, aes(x = ys)) +
geom_jitter(data = soyrs, aes(x = lrr, y = 2.5)) +
geom_jitter(data = tmns, aes(x = lrr, y = 5), color = "blue", size = 1.2) +
geom_point(aes(x = pdi.cf[1], y = -1), color = "orange", size = 1.2) +
geom_errorbarh(mapping = aes(xmin = pdi.cf[2], xmax = pdi.cf[3],
y = -1), color = "orange", size = 1.2) +
geom_point(aes(x = btm.q[1], y = -2), color = "red", size = 1.2) +
geom_errorbarh(aes(xmin = btm.q[2], xmax = btm.q[3], y = -2),
color = "red", size = 1.2)
set.seed(12345)
x <- rnorm(25)
## t-dist method
pp1 <- pred_profile(x)
## conformal - quantile
pp2 <- pred_profile(x, method = "conformal")
## conformal - deviation
pp3 <- pred_profile(x, method = "conformal", m.method = "deviation")
## conformal - jackknife
pp4 <- pred_profile(x, method = "conformal", m.method = "jackknife")
par(mfrow=c(2,2))
plot(pp1)
plot(pp2)
plot(pp3)
plot(pp4)
par(mfrow=c(1,1))
Shafer and Vovk (2008). “A Tutorial on Conformal Prediction”. Journal of Machine Learning Research 9 (2008) 371-421.
Lei et al. “Distribution-Free Predictive Inference for Regression”. https://arxiv.org/abs/1604.04173
Dunn and Wasserman (2018) “Distribution-Free Prediction Sets with Random Effects”. https://arxiv.org/abs/1809.07441