Include here an introduction to linear mixed models
Some useful background for the following topic is: (doi: 10.2134/agronj2019.02.0135).
This first example is for soybean row spacing. The question is: is there a difference between growing soybean in rows 15 inches apart vs. 30 inches apart? The planting density is assumed to be the same, so only the configuration of the planting scheme was changed. The experimental design is that of paired strips, organized in different trials located on different farms. Each farm/field/trial has unique soils and management history, and experiences its own weather conditions. They are all located in Iowa, so each season will exert similar effects across trials (e.g. dry vs. wet). It is common to analyze the data as the response ratio, the yield of the treatment over the yield of the control, and taking the log of this ratio is common practice.
\[ RR = \frac{Yield_T}{Yield_C} \]
\[ lrr = \log(RR) \]
data("soyrs")
soy <- soyrs
## Compute the log of the response ratio (treatment yield over control yield)
soy$lrr <- log(soy$TRT1_Yld/soy$TRT2_Yld)
soy.o <- soy[order(soy$lrr),]
soy.o$id <- 1:nrow(soy.o)
Visually, it is hard to make a definite statement, but it would seem that the practice is highly beneficial. We can calculate how many observations were positive: around 60% of the observations have a positive outcome. Is this enough to recommend the use of the practice? Answer: no; an economic analysis would be a better tool for making this decision.
round(dim(subset(soy.o, lrr > 0))[1]/nrow(soy.o),2)
## [1] 0.6
There is a clear break, which suggests that some of these trials are quite different from the rest. Which variable explains these differences? Almost all of the high values are from year 2014. In addition, we should also investigate how the observations group with respect to studies; a quick cross-tabulation is sketched below.
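A minimal sketch of such a cross-tabulation (the column name 'Year' and the 0.1 cutoff for a "large" response are assumptions; check names(soy.o) for the actual variables in 'soyrs'):
## Do the large responses cluster by year or by trial? ('Year' and the 0.1 cutoff are assumed)
table(soy.o$Year, soy.o$lrr > 0.1)
table(soy.o$Trial_ID, soy.o$lrr > 0.1)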
Almost all of the observations which have a large response are from a single study. It is not uncommon to throw this out and call it an ‘outlier’, but at the moment we have no reason for doing so. What if it truly represents a situation that can occur again? We should consider it as part of the decision-making process.
The function ‘tsum’ used here is in the package ‘predintma’.
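As a rough base-R sketch of a similar per-trial summary (the actual ‘tsum’ output may differ in details; this assumes each trial has at least two observations):
## Per-trial t-based confidence intervals for lrr and a simple classification
cis <- lapply(split(soy.o$lrr, soy.o$Trial_ID), function(x) t.test(x)$conf.int)
pos <- sapply(cis, function(ci) ci[1] > 0)  ## interval entirely above zero
neg <- sapply(cis, function(ci) ci[2] < 0)  ## interval entirely below zero
round(c(pos = mean(pos), neg = mean(neg), zero = mean(!pos & !neg)), 2)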
The variability in the confidence intervals is a reflection of both the underlying data and the sample size within each trial. Trials with a smaller sample size tend to have wider confidence intervals.
## Trials sig. > 0: 0.11
## Trials sig. < 0: 0.06
## Trials incl. 0: 0.83
Now only 11% of the trials had a significant positive response, about 6% had a significant negative response and the rest, 83%, had a response for which the confidence interval included zero. This is much less promising than the 60% we observed previously, but now we are counting the number of trials which had a positive response instead of the number of observations. It is somewhat ironic that the trial with the highest number of reps (the one with the highest precision) is the one that deviates the most from the response observed in the rest of the trials. This trial experienced hail, which likely contributed to the observed response. One possible explanation for the trial with a significant negative response is higher moisture in the canopy (in narrow rows), which leads to higher disease pressure, particularly white mold.
The disadvantage of the previous analysis is that we are not analyzing the data together. If we want to make a statement about the overall performance of the management practice, combining the trials is necessary. In the following steps we will gradually increase the complexity of the analysis. The first model (lm0) does not include trial; it analyzes the data ignoring its structure.
The concept of a prediction interval was introduced in Chapter 0. In this context, it represents a statement about a trial which is not part of the sample we collected. That is, we want an interval that will contain a future trial with 95% probability (or a different level). This is not trivial in this context. For example, Partlett et al. (2017) show that the prediction interval as defined by Higgins et al. (2009) does not perform well when the between-trial variance is equal to or less than the within-trial variance. Many variations on the interval using different methods did not achieve coverage close to the nominal level (e.g. 95%).
## Fitting a first model without trial, just the intercept
soy.lm0 <- lm(lrr ~ 1, data = soy.o)
## Intercept parameter confidence interval
pp.lm0 <- predict(soy.lm0,
newdata = data.frame(lrr = mean(soy.o$lrr)),
interval = "confidence")
## Prediction interval for an individual observation
ppp.lm0 <- predict(soy.lm0,
newdata = data.frame(lrr = mean(soy.o$lrr)),
interval = "prediction")
## The auxiliary function pred_int can also be used
## ppp.lm0 <- pred_int(soy.o$lrr)
In the following figure it is clear that the confidence interval is very narrow and does not represent the true variability in the dataset. The prediction interval is very wide; while it describes the distribution of the data better, its usefulness is questionable.
A second step would be to include ‘trial’ in the model, but still in the context of a linear model. This model tests whether there is an effect of ‘trial’ (in this case it is significant); at least one trial has an effect which is different from the rest.
## Fitting a model with trial (fixed)
## Need to look into defining other contrasts
soy.lm1 <- lm(lrr ~ Trial_ID, data = soy.o)
pp.lm1 <- predict(soy.lm1,
newdata = data.frame(Trial_ID = unique(soy.o$Trial_ID)),
interval = "confidence")
pp.lm1d <- data.frame(Trial_ID = unique(soy.o$Trial_ID), pp.lm1)
## 'soy.s' is a by-trial summary of the data created earlier
pp.lm1d2 <- merge(soy.s, pp.lm1d)
## prediction interval for the trial which is closest to the mean
pmt <- which.min(abs(pp.lm1d$fit - coef(soy.lm0)))
## The previous is potentially dangerous, because the actual mean might
## be far away from any trial in particular
ppp.lm1 <- predict(soy.lm1,
newdata = data.frame(Trial_ID = unique(soy.o$Trial_ID)[pmt]),
interval = "prediction")
ppp.lm1 <- as.data.frame(ppp.lm1)
The blue intervals in the following figure represent the fit from a linear model which includes trial as fixed. Since we assume a common variance among the trials, the confidence intervals are narrower in some cases and a bit wider in others. The new line (“lm1 pred single trial”) is the prediction interval for a specific trial which was close to the estimate of the population mean (\(\hat{\mu}\)). It is much wider than the confidence interval for \(\hat{\mu}\) but narrower than the prediction interval which ignored the structure of the data (lm0).
The main difference now that we analyze the data together is that the confidence intervals for individual trials are narrower (blue dots and error bars). The reason is that we assume homogeneous variance and use the data from all the trials to estimate a single variance (a reasonable assumption here). For some of the trials the precision of the confidence interval has improved (it is narrower), but for others it is slightly wider. Since the sample size for each trial is different, the intervals have different widths; had the sample sizes been the same for all trials, the widths would have been exactly the same. The prediction interval for a single trial (close to the average) is shown and it is narrower than for the intercept of the lm0 (no trial) model.
We have more or less exhausted what we can do using linear models for this example (not really). Now we move on to mixed models and consider ‘trials’ as ‘random’. This means that we are less focused on what has happened in each individual trial and we consider trials as being ‘sampled’ from a larger population (see also the concept of exchangeability). When considering factors as fixed we assume that we can replicate each level, but here each trial is an occurrence which we cannot easily replicate. For a distinction between ‘fixed’ and ‘random’ and other issues for mixed models see: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html
Random effects models are a way of incorporating the effect of ‘trial’ but with an imposed distribution. Commonly it is assumed that they are normally distributed but other options are possible.
\[ y_{ij} = \mu + \theta_i + e_{ij} \] \(y_{ij}\) is the response variable (here, the log of the response ratio), \(\mu\) is the intercept (or population mean), \(\theta_i\) is the random effect of each trial (assumed to have zero mean and variance \(\tau^2\)), and \(e_{ij}\) is the residual which accounts for the within-trial variability (assumed to have zero mean and variance \(\sigma^2\)). Although the distributions of \(\theta_i\) and \(e_{ij}\) are typically assumed to be normal (Gaussian), this assumption can be relaxed.
## lmer is in the 'lme4' package
soy.lme <- lmer(lrr ~ 1 + (1|Trial_ID), data = soy.o)
round(exp(fixef(soy.lme)),3)
## (Intercept)
## 1.014
round(exp(confint(soy.lme, quiet = T))[3,],3)
## 2.5 % 97.5 %
## 0.983 1.046
## For these types of models there are different methods for
## computing confidence intervals ("profile", "Wald", "boot").
## see ?confint.merMod for more details
## Confidence intervals
ci.lme <- confint(soy.lme, quiet = T)[3,]
pp.lme <- c(fixef(soy.lme),ci.lme[1],ci.lme[2])
These results point to a 1.4% increase in yield when using the narrower row spacing (\(\hat{\mu}\)). The confidence interval provides additional support that this is a small increase. The point estimates for the standard deviations are \(\hat{\tau} =\) 0.0602766 and \(\hat{\sigma} =\) 0.058963. The intra-class correlation is \(0.06028^2/(0.06028^2 + 0.05896^2) = 0.51\). Later, the intra-class correlation will be a very important quantity and concept. Another quantity of interest is the heterogeneity ratio (HR), \(\frac{\tau^2}{\sigma^2}\), which in this case is \((0.06028/0.05896)^2 = 1.045\).
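These quantities can be extracted directly from the fitted object; a minimal sketch:
## Variance components from the lmer fit: intra-class correlation and heterogeneity ratio
vc <- as.data.frame(VarCorr(soy.lme))
tau <- vc$sdcor[vc$grp == "Trial_ID"]   ## between-trial standard deviation
sig <- vc$sdcor[vc$grp == "Residual"]   ## within-trial (residual) standard deviation
round(c(icc = tau^2/(tau^2 + sig^2), hr = tau^2/sig^2), 3)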
The confidence interval for the intercept of the mixed model (lme) is wider than the one computed with the linear model (lm0), and its location is much closer to zero, to the point that the interval includes zero. It is wider because the trial variance is incorporated into this uncertainty. The effect is much smaller than what we had anticipated before. This analysis has the characteristic that it gives less weight to extreme observations.
Following Higgins et al. (2009) we can calculate the prediction interval. This is a known result applied to meta-analysis. For the general case, see Chapter 0.
\[ \hat{\mu} \pm t^{\alpha}_{k-2} \sqrt{\hat{\tau}^2 + SE(\hat{\mu})^2} \] \[ \hat{\mu} = \text{estimated 'population' mean (or intercept)} \] \[ \hat{\tau} = \text{estimated between-studies std. deviation} \] \[ SE(\hat{\mu}) = \text{estimated standard error for the intercept} \] \[ k = \text{number of trials (studies)} \]
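This interval can also be computed ‘by hand’ from the lmer fit; a minimal sketch (it should agree with the ‘tdist’ result shown below):
## Higgins et al. (2009) prediction interval computed by hand from the lmer fit
mu.hat  <- unname(fixef(soy.lme))                              ## estimated intercept
se.mu   <- sqrt(vcov(soy.lme)[1, 1])                           ## its standard error
tau.hat <- unname(attr(VarCorr(soy.lme)$Trial_ID, "stddev"))   ## between-trial sd
k  <- length(unique(soy.o$Trial_ID))                           ## number of trials
hw <- qt(0.975, df = k - 2) * sqrt(tau.hat^2 + se.mu^2)        ## interval half-width
round(c(fit = mu.hat, lwr = mu.hat - hw, upr = mu.hat + hw), 3)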
Here we compute the prediction interval using the Higgins formula with the helper function ‘pred_int’ and method = ‘tdist’. There is also a bootstrap option, which can be more robust in the presence of outliers.
## pred_int is a helper function which
## declutters things. For details see Chapter 0
p.soy.lme <- pred_int(soy.lme, method = "tdist")
## The bootstrap can be more robust to outliers
p.soy.lme.b <- pred_int(soy.lme, method = "boot")
p.soy.lme.d <- data.frame(funct = rep("lmer",2),
method = c("tdist","boot"),
fit = c(p.soy.lme[1],p.soy.lme.b[1]),
lwr = c(p.soy.lme[2],p.soy.lme.b[2]),
upr = c(p.soy.lme[3],p.soy.lme.b[3]),
row.names = NULL)
## The bootstrapped result is very similar to the
## simpler tdist method.
kable(p.soy.lme.d)
| funct | method | fit | lwr | upr |
|---|---|---|---|---|
| lmer | tdist | 0.0139535 | -0.1180481 | 0.1459551 |
| lmer | boot | 0.0150939 | -0.1179705 | 0.1481583 |
A fact that is somewhat unexpected is that the prediction interval does not simply span the distribution of the trial means. It is pulled toward zero and does not include the trial with the most extreme positive value.
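This pull toward zero is the familiar shrinkage of trial-level predictions in a mixed model; a small sketch comparing the raw trial means with the model-based (BLUP) predictions:
## Shrinkage: raw trial means vs. model-based (BLUP) trial predictions
raw.means <- tapply(soy.o$lrr, soy.o$Trial_ID, mean)              ## in Trial_ID level order
blups <- fixef(soy.lme) + ranef(soy.lme)$Trial_ID[, 1]            ## same level order
round(cbind(raw = raw.means, blup = blups), 3)
## the predictions are pulled toward the overall mean, most strongly for extreme trials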
There are many reasons for considering a Bayesian model for meta-analysis. An important one is that the between and within variances can be underestimated in the REML-based mixed model method (Gelman 2006). In some cases, including (weakly) informative priors for the variances can produce more realistic results. Other quantities of interest such as the confidence intervals for individual trials or prediction intervals are more easily derived.
Some references: Chung et al. “Avoiding zero between-study variance estimates in random-effects meta-analysis”. Statistics in Medicine. (2013). DOI: 10.1002/sim.5821
Gelman, Andrew: “Prior distributions for variance parameters in hierarchical models”. Bayesian Analysis. (2006).
In this first example we are using the package ‘MCMCglmm’. We provide weak priors.
prior1 <- list(B = list(mu = 0, V = 10),
G = list(G1 = list(V = 1, nu = 0.002)),
R = list(V = 1, nu = 0.002))
soy.mc <- MCMCglmm(lrr ~ 1, random = ~ Trial_ID,
prior = prior1, pr = TRUE, burnin = 5000,
nitt = 50000, thin = 2,
data = soy.o, verbose = FALSE, DIC = TRUE)
soy.mc.ci <- as.vector(summary(soy.mc)$solutions)[1:3]
## soy.mc.ci <- predict(soy.mc, interval = "confidence")[1,]
## Those two lines are equivalent
## With these weak (diffuse) priors the results are almost identical
## to the lme solution
round(exp(rbind(pp.lme, soy.mc.ci)),3)
## (Intercept) 2.5 % 97.5 %
## pp.lme 1.014 0.983 1.046
## soy.mc.ci 1.014 0.981 1.049
## Calculating a predictive interval using MCMCglmm
## There are many different ways of doing this
## and they give similar results
soy.mc.pi0 <- pred_int(soy.mc, method = "tdist")
soy.mc.pi1 <- pred_int(soy.mc, method = "mcmc")
soy.mc.pi2 <- pred_int(soy.mc, method = "simulate")
soy.mc.pi3 <- pred_int(soy.mc, method = "predict")
## 'new_trial' method
soy.mc.pi4 <- pred_int(soy.o, method = "ntrial",
var.names = c("lrr","Trial_ID"))
## I pick one but they all give similar answers
soy.mc.pi <- soy.mc.pi3
soy.mc.col <- as.data.frame(rbind(soy.mc.pi0,
soy.mc.pi1,
soy.mc.pi2,
soy.mc.pi3,
soy.mc.pi4))
mmthd <- c("tdist","mcmc","simulate","predict","ntrial")
soy.mc.col.d <- data.frame(funct = "MCMCglmm",
method = mmthd, soy.mc.col,
row.names = NULL)
soy.mcpi <- rbind(p.soy.lme.d, soy.mc.col.d)
kable(cbind(soy.mcpi[,1:2],round(exp(soy.mcpi[,3:5]),3)),
caption = "Different methods for computing prediction intervals")
| funct | method | fit | lwr | upr |
|---|---|---|---|---|
| lmer | tdist | 1.014 | 0.889 | 1.157 |
| lmer | boot | 1.015 | 0.889 | 1.160 |
| MCMCglmm | tdist | 1.014 | 0.879 | 1.170 |
| MCMCglmm | mcmc | 1.015 | 0.889 | 1.161 |
| MCMCglmm | simulate | 1.014 | 0.891 | 1.154 |
| MCMCglmm | predict | 1.014 | 0.893 | 1.151 |
| MCMCglmm | ntrial | 1.012 | 0.882 | 1.159 |
These methods require an explanation. The ‘lmer-tdist’ method extracts the REML-based variance estimates and uses the standard prediction interval equation from Higgins et al. (2009). The ‘lmer-boot’ method computes the same statistic but uses the bootstrap (‘lme4::bootMer’); the idea is that the bootstrap can behave better in the presence of outliers, although it is computationally intensive (by default it runs 500 times).

The ‘MCMCglmm-tdist’ method uses the same equation but extracts the variance estimates from the Markov chain Monte Carlo (MCMC) Bayesian model. The ‘MCMCglmm-mcmc’ method extracts the posterior distributions for \(\mu\) and \(\tau\), simulates deviations from a normal distribution for the trial effects (using \(\tau_j\), where \(j\) indexes the MCMC sample), and then simply adds these two posteriors. The ‘MCMCglmm-simulate’ method creates a posterior distribution for a new (out-of-sample) trial effect (\(\theta_{new}\)) and adds it to the posterior distribution of \(\mu\). The ‘MCMCglmm-predict’ method is the most computationally intensive; it simulates the distribution of all trial effects and then averages over all the samples (see the function ‘predict.MCMCglmm’ with options marginal = NULL and interval = ‘prediction’). These last two methods generate very similar results, but ‘simulate’ is more efficient. The ‘MCMCglmm-ntrial’ method adds a new trial to the data.frame (with a missing value for the response), fits the model, extracts the posterior distribution of \(\theta_{new}\), and adds it to the posterior distribution of \(\mu\). Conceptually it is the same as the ‘simulate’ method, but in practice it does not seem to account for within-trial variability. The subtle differences among these last three methods can have important consequences in edge cases.
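As an illustration, a minimal sketch of the ‘MCMCglmm-mcmc’ approach using the stored posterior samples (assuming the usual ‘Trial_ID’ column name in the VCV matrix):
## 'mcmc' method sketch: combine the posterior of mu with simulated new-trial effects
post.mu   <- as.vector(soy.mc$Sol[, "(Intercept)"])
post.tau  <- sqrt(as.vector(soy.mc$VCV[, "Trial_ID"]))        ## between-trial sd, per sample
theta.new <- rnorm(length(post.mu), mean = 0, sd = post.tau)  ## new-trial deviations
pred <- post.mu + theta.new
round(exp(quantile(pred, c(0.5, 0.025, 0.975))), 3)           ## fit, lwr, upr on the ratio scale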
The main thing to notice is the difference in the variance of a trial effect in the sample (“theta_i=4”) and that of one outside our sample (“theta_new”). The relative location of both distributions is not important here, only their spread. Trial 4 was chosen as an example; other trials in the network would have a similar spread. A sketch of this comparison follows.
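A sketch of the comparison (with pr = TRUE the in-sample trial effects are stored in soy.mc$Sol after the intercept; the assumption here is that the 4th trial corresponds to the 5th column):
## Spread of an in-sample trial effect vs. a new (out-of-sample) trial effect
theta.in  <- as.vector(soy.mc$Sol[, 5])   ## posterior of the 4th trial effect (assumed column)
theta.new <- rnorm(nrow(soy.mc$Sol), 0, sqrt(as.vector(soy.mc$VCV[, "Trial_ID"])))
round(c(sd.theta.in = sd(theta.in), sd.theta.new = sd(theta.new)), 3)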
There are many alternatives (and software options) for Bayesian analysis. One of these alternatives is based on the BUGS language (cite). In R, an implementation of this approach is available through the connection with JAGS (Just Another Gibbs Sampler) and the ‘rjags’ package (other packages are available). In this case we define the model in terms of the assumptions about the probability distribution of each parameter in the model.
## Note: 'trial' must index the trials as integers 1..t.n (used in b[trial[i]] below)
soyd <- list(lrr = soy.o$lrr, trial = soy.o$Trial_ID,
t.n = length(unique(soy.o$Trial_ID)),
n = nrow(soy.o))
init <- list(mu = 0, tau = 0.01, tau.trial = 0.01)
modelstring="
model {
# Single intercept model likelihood
for (i in 1:n) {
lrr[i]~dnorm(mu + b[trial[i]],tau)
}
# trial effect
for(j in 1:t.n){
b[j] ~ dnorm(0, tau.trial)
}
# priors
mu ~ dnorm(0,0.00001) # intercept prior
tau ~ dgamma(0.0001,0.0001) ## tau is the residual precision
sigma <- 1.0/sqrt(tau)
tau.trial ~ dgamma(0.0001, 0.0001)
sigma.trial <- 1.0/sqrt(tau.trial)
# generate predictions
beff ~ dnorm(0, tau.trial)
pred <- mu + beff
}
"
mdl <- jags.model(textConnection(modelstring), data = soyd, inits = init,
                  n.chains = 2)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 119
## Unobserved stochastic nodes: 22
## Total graph size: 289
##
## Initializing model
## burn-in: discard the first iterations
update(mdl, n.iter = 15000)
output <- coda.samples(model = mdl,
                       variable.names = c("mu","pred","sigma","sigma.trial"),
                       n.iter = 20000, thin = 10)
gelman.diag(output) ## Should be close to 1 to assume convergence
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## mu 1 1.00
## pred 1 1.00
## sigma 1 1.01
## sigma.trial 1 1.00
##
## Multivariate psrf
##
## 1
print(summary(output))
##
## Iterations = 15010:35000
## Thinning interval = 10
## Number of chains = 2
## Sample size per chain = 2000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu 0.01386 0.016684 2.638e-04 0.0003589
## pred 0.01339 0.065776 1.040e-03 0.0010000
## sigma 0.05944 0.004254 6.727e-05 0.0000661
## sigma.trial 0.06286 0.012575 1.988e-04 0.0001989
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu -0.02095 0.003293 0.01390 0.02485 0.04663
## pred -0.11908 -0.028727 0.01276 0.05684 0.14127
## sigma 0.05177 0.056502 0.05931 0.06204 0.06852
## sigma.trial 0.04338 0.054025 0.06113 0.06957 0.09216
plot(output, trace = FALSE)
plot(output, density = FALSE)
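The ‘jags’ row in the comparison table further below can be assembled from the coda summary; a sketch:
## Extract the mean and 95% limits of 'pred' and back-transform to the ratio scale
jsum <- summary(output)
jags.pi <- c(fit = jsum$statistics["pred", "Mean"],
             lwr = jsum$quantiles["pred", "2.5%"],
             upr = jsum$quantiles["pred", "97.5%"])
round(exp(jags.pi), 3)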
## stan_lmer is in the 'rstanarm' package
## refresh = 0 suppresses the sampler progress output
## default priors seem to be okay
soy.stn <- stan_lmer(lrr ~ 1 + (1|Trial_ID), data = soy.o, refresh = 0)
print(soy.stn, digits = 4)
## stan_lmer
## family: gaussian [identity]
## formula: lrr ~ 1 + (1 | Trial_ID)
## observations: 119
## ------
## Median MAD_SD
## (Intercept) 0.0146 0.0158
##
## Auxiliary parameter(s):
## Median MAD_SD
## sigma 0.0594 0.0044
##
## Error terms:
## Groups Name Std.Dev.
## Trial_ID (Intercept) 0.063733
## Residual 0.059598
## Num. levels: Trial_ID 18
##
## Sample avg. posterior predictive distribution of y:
## Median MAD_SD
## mean_PPD 0.0424 0.0076
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
## Calculating a credible interval for parameter 'mu' (Intercept)
soy.stn.ci <- exp(posterior_interval(soy.stn,
prob = 0.95,
pars = "(Intercept)"))
## Predictive interval for 'new_trial'
pdint <- predictive_interval(soy.stn, re.form = NA,
prob = 0.95,
newdata = data.frame(Trial_ID = "new_trial"))
soy.stn.pdi <- c(fixef(soy.stn), pdint)
soy.stn.pdi.d <- data.frame(funct = "stan_lmer", method = "predict",
fit = fixef(soy.stn),
lwr = soy.stn.pdi[2],
upr = soy.stn.pdi[3],
row.names = NULL)
At this point I have introduced many methods for computing prediction intervals and, in this case, there is strong agreement among the different approaches. This, however, does not prove that they are correct; it is possible for all of them to be incorrect, or in other words, they might not have the claimed coverage for a new observation (out-of-sample prediction). A rough leave-one-trial-out check is sketched after the table below.
| funct | method | fit | lwr | upr |
|---|---|---|---|---|
| lmer | tdist | 1.014 | 0.889 | 1.157 |
| lmer | boot | 1.015 | 0.889 | 1.160 |
| MCMCglmm | tdist | 1.014 | 0.879 | 1.170 |
| MCMCglmm | mcmc | 1.015 | 0.889 | 1.161 |
| MCMCglmm | simulate | 1.014 | 0.891 | 1.154 |
| MCMCglmm | predict | 1.014 | 0.893 | 1.151 |
| MCMCglmm | ntrial | 1.012 | 0.882 | 1.159 |
| jags | mcmc | 1.012 | 0.893 | 1.149 |
| stan_lmer | predict | 1.015 | 0.899 | 1.148 |
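One rough way to probe the claimed coverage with the data at hand is a leave-one-trial-out check: refit the model without each trial, compute the t-based prediction interval, and see whether the held-out trial mean falls inside it. This is only a sketch (the held-out trial mean is itself estimated with error), not a substitute for a proper simulation study.
## Leave-one-trial-out check of the t-based prediction interval (a rough sketch)
trials <- as.character(unique(soy.o$Trial_ID))
covered <- sapply(trials, function(tr) {
  dat <- subset(soy.o, Trial_ID != tr)
  fm  <- lmer(lrr ~ 1 + (1 | Trial_ID), data = dat)
  mu  <- unname(fixef(fm))
  se.mu <- sqrt(vcov(fm)[1, 1])
  tau <- unname(attr(VarCorr(fm)$Trial_ID, "stddev"))
  hw  <- qt(0.975, df = length(unique(dat$Trial_ID)) - 2) * sqrt(tau^2 + se.mu^2)
  held.out <- mean(soy.o$lrr[soy.o$Trial_ID == tr])
  (held.out > mu - hw) && (held.out < mu + hw)
})
mean(covered)  ## proportion of held-out trial means inside their interval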
The table below compares the estimated variance components (as standard deviations) across the different fitting approaches.

| method | sigma.trial (between-trial SD) | sigma (residual SD) |
|---|---|---|
| lm | NA | 0.1082348 |
| lmer | 0.0602766 | 0.0589630 |
| mcmc.glmm | 0.0623914 | 0.0593439 |
| jags | 0.0611323 | 0.0593113 |
| stan | 0.0637331 | 0.0593765 |
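A sketch of how these variance components can be extracted from the fitted objects above (the ‘rstanarm’ accessors are assumptions based on its lme4-style methods):
## Between-trial and residual standard deviations from each fit
sig.lm  <- summary(soy.lm0)$sigma                          ## lm: residual sd only
vc.lmer <- as.data.frame(VarCorr(soy.lme))$sdcor           ## lmer: c(tau, sigma)
vc.mcmc <- sqrt(colMeans(soy.mc$VCV))                      ## MCMCglmm: sqrt of posterior mean variances
vc.jags <- summary(output)$statistics[c("sigma.trial", "sigma"), "Mean"]
vc.stan <- c(attr(VarCorr(soy.stn)$Trial_ID, "stddev"), sigma(soy.stn))  ## assumed rstanarm accessors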
Higgins et al. “A re-evaluation of random-effects meta-analysis”. J. R. Statist. Soc. A (2009) 172, Part 1, pp. 137–159.
IntHout J, Ioannidis JPA, Rovers MM, et al. "Plea for routinely presenting prediction intervals in meta-analysis". BMJ Open (2016);6:e010247. doi:10.1136/bmjopen-2015-010247
Hadfield, Jarrod. “MCMC Methods for Multi-Response Generalized Linear Mixed Models: The MCMCglmm R Package”. JSS (2010).
Intro mixed models: (https://stats.idre.ucla.edu/other/mult-pkg/introduction-to-linear-mixed-models/).
Wiki (https://en.wikipedia.org/wiki/Mixed_model).
Another reference is the book on mixed models by McCulloch and Searle (2000/8).