One important concept used later is the heterogeneity ratio (HR), which is the ratio of the between-trial variance to the within-trial variance:
\[ HR = \frac{\tau^2}{\sigma^2} \]
A similar concept is the intra-class correlation
\[ ICC = \frac{\tau^2}{\tau^2 + \sigma^2} \]
which we could also re-interpret in this context as the Relative Heterogeneity Ratio (RHR).
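For concreteness, here is a minimal sketch of how both quantities could be extracted from a fitted lme4 model; fm is a hypothetical fit of the form used later in this section.

library(lme4)
## hypothetical fit: fm <- lmer(lrr ~ 1 + (1 | Trial_ID), data = dat)
vc <- as.data.frame(VarCorr(fm))
tau2 <- vc$vcov[vc$grp == "Trial_ID"]    ## between-trial variance
sigma2 <- vc$vcov[vc$grp == "Residual"]  ## within-trial variance
HR <- tau2 / sigma2
ICC <- tau2 / (tau2 + sigma2)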
There are many examples in which the observed between-trial variance is smaller than the within-trial variance. This is not, in general, what we expect with mixed models. If the sample size is large enough to estimate variances with good precision, we often (but not always) expect the 'between-group' variance to be larger than the 'within-group' variance.
One (somewhat extreme) example where this happens is the ‘Stratego YLD’ foliar fungicide in soybean.
sfs <- read.csv("./ch2/data/StrategoYldOnSoybean.csv")
sfs$lrr <- log(with(sfs, TRT_Yld/CTR_Yld)) ## log response ratio (treated vs. control)
## The lowest value is very different from the rest
summary(sfs$lrr)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.250709 -0.009848 0.024904 0.028104 0.063291 0.254443
## Plot by averaging
sfs.s <- tsum(sfs)
sfs.a <- tsum(sfs, method = "min-max")
## Notice that the variance of trial means is very small
cat("Trial means variance: ",var(sfs.s$m),"\n")
## Trial means variance: 0.0009763225
sfs.lm0 <- lm(lrr ~ 1, data = sfs)
sfs.lm <- lm(lrr ~ Trial_ID, data = sfs)
sfs.lmer <- lmer(lrr ~ 1 + (1 | Trial_ID), data = sfs)
## boundary (singular) fit: see ?isSingular
## with this particular dataset the estimated between-trial variance is 0
## (see ?isSingular)
The options below also apply when the between-trial variance is not exactly zero but very close to it.
Option 0: we give up and decide these data are not worth analyzing.
Option 1: we decide that the 'true' model is a model without the trial effect. The trial effect 'does not exist'; "there is no spoon". In this case we can compute only two intervals: the 'confidence interval' and the 'prediction interval'. Predicting a 'new' trial does not make sense because 'there is no trial'. We can only make inferences about the model parameter \(\mu\) or an observation \(y_i\). The prediction interval is about \(y_{new}\), not about a new trial. Both intervals are sketched below.
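A minimal sketch of both intervals under Option 1, using the intercept-only fit sfs.lm0 defined above (the one-row newdata is a dummy, since the model has no covariates):

confint(sfs.lm0)  ## confidence interval for mu
predict(sfs.lm0, newdata = data.frame(x = 1),
        interval = "prediction")  ## prediction interval for y_new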
Option 2: we decide that the 'true' model is one with a trial effect, but we were not able to estimate it well with these data. One fact that supports this option is that 'trials' were part of the experimental design, representing different farms, operators, machinery, hybrids, etc. Statistically, we could try to fix some value for the between-trial variance, but this is arbitrary and not a reasonable option. Another approach is the bootstrap: perhaps by resampling these data we can obtain a better estimate of the between-trial variance (a sketch follows the output below). This option still results in a very small number for the variance (or standard deviation).
## bootstrapped sigma quantile 95% confidence interval
## 0.025 (%): 0
## 50 (%): 1.134899e-09
## 97.5 (%): 0.02036577
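The code that produced the quantiles above is not shown; a plausible sketch using lme4's parametric bootMer() would be:

## parametric bootstrap of the between-trial standard deviation
bfun <- function(fm) as.data.frame(VarCorr(fm))$sdcor[1]
sfs.boot <- bootMer(sfs.lmer, bfun, nsim = 1000, seed = 123)
quantile(sfs.boot$t, probs = c(0.025, 0.5, 0.975))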
Option 3: we decide that the 'true' model should have a trial effect but, as in option 2, we recognize that these data do not allow us to estimate it. The difference is that now we are using a Bayesian approach. However, this is more involved: it requires developing informative priors for this parameter, and providing informative priors for a variance is not intuitive. One option is to estimate the between-trial variance from other datasets. Also, even if we use weakly informative priors, the estimated variance is likely to be small but not exactly zero (Williams, Rast and Burkner, 2018).
One possible mapping from HR values to beliefs and actions:

HRs | belief | action |
---|---|---|
HR = 1 | trial effect exists | just analyze the data |
HR = 0.75 | trial effect likely to exist | we might need more data |
HR = 0.5 | trial effect may exist | we need more data |
HR = 0.25 | trial effect likely to not exist | are we going to drop the trial effect? |
HR = 0 | trial effect does not exist | just drop the trial effect |
The corresponding mapping when using a Bayesian analysis with priors:

HRs | belief | action |
---|---|---|
HR = 1 | trial effect exists (no need for priors) | just analyze the data |
HR = 0.75 | trial effect exists (we might need priors) | are prediction intervals good? |
HR = 0.5 | trial effect exists (consider better priors) | prediction intervals are not good |
HR = 0.25 | trial effect exists (we need better priors) | we need more data |
HR = 0 | trial effect exists (we really need better priors) | we need more data and priors |
set.seed(123456789)
prior1 <- list(R = list(V = 1, nu = 0.001),
               G = list(G1 = list(V = 1, nu = 0.001)))
## These are 'non-informative' priors
## MCMCglmm runs one chain per call, so the model is fitted three
## times below to obtain three chains for convergence diagnostics
sfs.mcmc1.2 <- MCMCglmm(lrr ~ 1, random = ~Trial_ID, data = sfs,
prior = prior1, pr = TRUE, verbose = FALSE)
sfs.mcmc2.2 <- MCMCglmm(lrr ~ 1, random = ~Trial_ID, data = sfs,
prior = prior1, pr = TRUE, verbose = FALSE)
sfs.mcmc3.2 <- MCMCglmm(lrr ~ 1, random = ~Trial_ID, data = sfs,
prior = prior1, pr = TRUE, verbose = FALSE)
sfs.mcmc.v2 <- mcmc.list(sfs.mcmc1.2$VCV, sfs.mcmc2.2$VCV, sfs.mcmc3.2$VCV)
sfs.mcmc.m2 <- mcmc.list(sfs.mcmc1.2$Sol, sfs.mcmc2.2$Sol, sfs.mcmc3.2$Sol)
gelman.diag(sfs.mcmc.v2)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## Trial_ID 1 1
## units 1 1
##
## Multivariate psrf
##
## 1
plot(sfs.mcmc.v2, density = FALSE)
set.seed(123456)
prior2 <- list(R = list(V = 0.0013978, nu = 1.492583),
               G = list(G1 = list(V = 0.0013978*diag(1), nu = 0.492583)))
## These are informative priors (note the specific values of V and nu)
sfs.mcmc1.3 <- MCMCglmm(lrr ~ 1, random = ~Trial_ID, data = sfs,
prior = prior2, pr = TRUE, verbose = FALSE)
sfs.mcmc2.3 <- MCMCglmm(lrr ~ 1, random = ~Trial_ID, data = sfs,
prior = prior2, pr = TRUE, verbose = FALSE)
sfs.mcmc3.3 <- MCMCglmm(lrr ~ 1, random = ~Trial_ID, data = sfs,
prior = prior2, pr = TRUE, verbose = FALSE)
sfs.mcmc.v3 <- mcmc.list(sfs.mcmc1.3$VCV, sfs.mcmc2.3$VCV, sfs.mcmc3.3$VCV)
sfs.mcmc.m3 <- mcmc.list(sfs.mcmc1.3$Sol, sfs.mcmc2.3$Sol, sfs.mcmc3.3$Sol)
gelman.diag(sfs.mcmc.v3)
## Potential scale reduction factors:
##
## Point est. Upper C.I.
## Trial_ID 1.01 1.02
## units 1.00 1.00
##
## Multivariate psrf
##
## 1
plot(sfs.mcmc.v3, density = FALSE)
Based on these data alone it is difficult to decide which interval is the 'correct' one. If the goal is to have a prediction interval that is likely to contain a 'new trial', then maybe there are some we can discard. First, 'lmer-tdist' seems too narrow: it includes only a few of the observed trials, so if a future sample of trial means looks similar to the one we obtained, it is too narrow. The method 'lmer-boot' is wider than 'lmer-tdist' and is likely to have better coverage of future trial means, but it still does not seem wide enough. After these two we have two groups. The first includes 'mcglm-tdist', 'mcglm-mcmc' and 'mcglm-ntrial'. The support for these intervals being correct is the belief that the observed trial means are overdispersed and that a future collection of data will be contained in these intervals. The second group includes 'mcglm-simulate', 'mcglm-predict', 'lm-tdist' and 'lmer-tdist2'. The widths of these intervals are surprisingly similar, but the methods behind the calculations are very different. The simplest is 'lm-tdist', which is essentially a model in which we removed the effect of trial (a fixed-effects model); the interpretation of the interval is then that it will contain future \(y_{new}\) observations, since trials 'do not exist'. The methods 'mcglm-simulate' and 'mcglm-predict' use similar approaches, but they seem a bit too wide. They are more likely than the first group to contain at least 95% of future trial means, if the future looks similar to the observed data. The method 'lmer-tdist2' is a new experimental equation. If we are forced to make a visual statement, it would seem that the 'correct' interval lies somewhere between the first group and the second group: the first group is too narrow and the second group is too wide. We need more evidence to support these visual impressions.
## Coverage for method 'mcglm-ntrial' 89.2 %
## Coverage for method 'mcglm-simulate' 100 %
This is the coverage of these intervals for the 'current' trials, but these intervals are aimed at including an observation from a trial which is not in the dataset; a \(\theta_{new}\). It would make sense for the wider interval to be more appropriate: since it includes 100% of the current trials, it is reasonable that it might include 95% of future trials, but there is no guarantee of this. It is less likely (although not impossible, depending on what the 'true' model is) that the interval which includes 89% of the 'current' observed trial means will include 95% of 'future' trial means.
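For reference, a minimal sketch of how the coverage numbers above could be computed; lwr and upr are placeholders for the limits of a given method's interval, and sfs.s$m holds the observed trial means computed earlier with tsum():

## proportion of observed trial means inside a given interval
coverage <- mean(sfs.s$m >= lwr & sfs.s$m <= upr) * 100
cat("Coverage: ", coverage, "%\n")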
For completeness we also fit this model using JAGS.
sfsd <- list(lrr = sfs$lrr, trial = sfs$Trial_ID,
t.n = length(unique(sfs$Trial_ID)),
n = nrow(sfs))
init <- list(mu = 0, tau = 0.01, tau.trial = 0.01)
modelstring="
model {
# Single intercept model likelihood
for (i in 1:n) {
lrr[i]~dnorm(mu + b[trial[i]],tau)
}
# trial effect
for(j in 1:t.n){
b[j] ~ dnorm(0, tau.trial)
}
# priors
mu ~ dnorm(0,0.00001) # intercept prior
tau ~ dgamma(0.0001,0.0001) ## tau is the residual precision
sigma <- 1.0/sqrt(tau)
tau.trial ~ dgamma(0.0001, 0.0001)
sigma.trial <- 1.0/sqrt(tau.trial)
# generate predictions
beff ~ dnorm(0, tau.trial)
pred <- mu + beff
}
"
mdl=jags.model(textConnection(modelstring), data=sfsd, inits=init,
n.chains = 2)
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 200
## Unobserved stochastic nodes: 41
## Total graph size: 489
##
## Initializing model
## burn-in: update() discards these iterations
update(mdl, n.iter = 15000)
output <- coda.samples(model = mdl,
                       variable.names = c("mu", "pred", "sigma", "sigma.trial"),
                       n.iter = 20000, thin = 10)
print(summary(output))
##
## Iterations = 15010:35000
## Thinning interval = 10
## Number of chains = 2
## Sample size per chain = 2000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu 0.02784 0.005110 8.080e-05 8.081e-05
## pred 0.02809 0.015336 2.425e-04 2.425e-04
## sigma 0.06217 0.003166 5.007e-05 5.007e-05
## sigma.trial 0.01348 0.004838 7.650e-05 1.044e-04
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu 0.017922 0.024317 0.02778 0.03134 0.03780
## pred -0.002715 0.018871 0.02831 0.03723 0.05842
## sigma 0.056264 0.059971 0.06201 0.06426 0.06859
## sigma.trial 0.006127 0.009902 0.01273 0.01639 0.02450
plot(output, trace = FALSE)
plot(output, density = FALSE)
Point estimates by fitting method:

method | mu | sigma.trial | sigma |
---|---|---|---|
lm | 0.02810 | NA | 0.06241 |
lmer | 0.02810 | 0.00000 | 0.06297 |
mcmc.s.priors | 0.02771 | 0.01934 | 0.06213 |
mcmc.p.priors | 0.02778 | 0.01720 | 0.06188 |
rjags | 0.02784 | 0.01348 | 0.06217 |
Confidence intervals for \(\mu\):

funct | mu | lwr | upr |
---|---|---|---|
lm | 0.0281 | 0.0193 | 0.0369 |
lmer | 0.0281 | 0.0194 | 0.0369 |
MCMCglmm | 0.0278 | 0.0175 | 0.0388 |
jags | 0.0278 | 0.0179 | 0.0378 |
Prediction intervals for a new trial:

funct | method | mu | lwr | upr |
---|---|---|---|---|
lm | tdist | 0.0281 | -0.0964 | 0.1526 |
lmer | tdist | 0.0281 | 0.0191 | 0.0371 |
lmer | boot | 0.0282 | 0.0117 | 0.0447 |
MCMCglmm | tdist | 0.0278 | -0.0089 | 0.0645 |
MCMCglmm | mcmc | 0.0283 | -0.0117 | 0.0639 |
MCMCglmm | simulate | 0.0297 | -0.1022 | 0.1516 |
MCMCglmm | predict | 0.0279 | -0.0955 | 0.1523 |
MCMCglmm | ntrial | 0.0280 | -0.0170 | 0.0735 |
jags | mcmc | 0.0283 | -0.0027 | 0.0584 |
The previous figure shows that there are many different ways of computing the prediction interval, but it does not point to which one is correct. The narrow ones seem too narrow and the wide ones seem too wide. Are the ones in the middle the correct ones?
For this first simulation I used numbers somewhat similar to what is observed in the 'Stratego foliar fungicide' example, but I varied the between-trial variance:

- Number of trials (n.k): 40
- Observations within a trial (n, balanced): 6
- Within-trial variance (\(\sigma^2\)): 0.05
- True value of \(\mu\): 0.10 (log(RR))
- Between-trial variance (\(\tau^2\)) range: 0.001 to 0.1 (100 values)
This range in the between-trial variance represents a heterogeneity ratio from close to zero up to 2. A rough sketch of the simulation setup follows.
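The original simulation code is not shown; the following is a sketch, under the assumptions listed above, of how the 'tdist' prediction interval could be computed at each value of \(\tau^2\) (the other methods would be added analogously):

library(lme4)
set.seed(101)
n.k <- 40; n <- 6; mu <- 0.10; sigma2 <- 0.05
tau2s <- seq(0.001, 0.1, length.out = 100)
res <- data.frame(tau2 = tau2s, HR = tau2s/sigma2, lwr = NA, upr = NA)
for (i in seq_along(tau2s)) {
  trial <- rep(1:n.k, each = n)
  ## simulate trial effects and within-trial noise
  y <- mu + rnorm(n.k, 0, sqrt(tau2s[i]))[trial] + rnorm(n.k * n, 0, sqrt(sigma2))
  fm <- lmer(y ~ 1 + (1 | trial))
  ## 'tdist' interval: mu-hat +/- t * sqrt(tau2-hat + var(mu-hat))
  se.pred <- sqrt(as.data.frame(VarCorr(fm))$vcov[1] + vcov(fm)[1, 1])
  res$lwr[i] <- fixef(fm) - qt(0.975, n.k - 1) * se.pred
  res$upr[i] <- fixef(fm) + qt(0.975, n.k - 1) * se.pred
}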
The simulation took approximately 32 minutes to run on a MacBook Pro laptop.
The first figure shows that the methods differ to some extent, but they fall broadly into two camps. The first follows the philosophy of the disappearing 'trial effect', and thus the intervals get smaller as the HR goes down. The other group maintains a fairly wide interval as HR is reduced, very close to the interval of the model which does not include the trial effect (lm).
This figure is a subset of the previous one, with the number of methods reduced for simplicity. It shows that 'tdist' and 'ntrial' depart from each other only for HRs lower than 0.25. Methods 'simulate' and 'tdist2' are similar for low values of HR but diverge for values of HR between 1 and 2.
The coverage for these methods was assessed by sampling new data using the same mechanism used to generate the original data. It shows that for methods 'ntrial' and 'tdist' the coverage degrades quickly below HR = 0.5. For 'simulate' and 'tdist2' it is reasonably close to the intended 95% between HR = 1 and HR = 0.5. For HR < 0.5 the coverage reaches 100%.
This figure presents the same information as the previous one but 'smoothed' for easier visualization. Method 'tdist' seems to perform well for HR > 1.5, and method 'tdist2' performs better between HR = 1.5 and HR = 0.5. For HR < 0.5 no method has 95% coverage, although if we need to err on the side of caution, we should use either 'tdist2' or 'simulate'.
The between-trial variance is well estimated by both the REML method and the Bayesian method. The 'lm' method is the 'naive' variance of the observed trial effects, shown for comparison.
The residual variance is well estimated without substantial bias.
This part is a work in progress…
It would be useful to summarize the paper by Partlett and Riley (2016), "Random effects meta-analysis: Coverage performance of 95% confidence and prediction intervals following REML estimation".
Important takeaways from that study:
(Table II, pg. 306) REML method only; heterogeneity ratio (HR) = 10. Confidence intervals have lower than 95% coverage for small numbers of studies (k = 10, 7, 5, 3) and close to 95% for a large number of studies (k = 100). Some departures are observed when the number of studies is very small (k = 3); coverage can be as low as 70% for a mixture of study sizes and only three studies. Prediction intervals can have higher than stated coverage when the number of studies is small.
(Table III, pg. 307) REML only; HR = 1. Confidence intervals never reach 95% coverage when the number of studies is small (k = 3, 5, 7, 10), with the lowest value being 77%. Prediction interval coverage for k = 5, 7 and 10 is less than 95%, with values as low as 87%. For balanced studies and k = 3, coverage is close to 100%.
(Table IV, pg. 308) REML only; HR = 0.5. For confidence intervals and k < 100, coverage is lower than 95%. Prediction intervals also have lower than 95% coverage for k < 100, with a few exceptions.
(Table V, pg. 309) REML only; HR = 0.1. Surprisingly, confidence interval coverage is closer to 95% than in the previous cases. For prediction intervals the departure is not dramatic, but coverage can be closer to 90% (instead of 95%) in some cases.
The main difference between the simulations here and the Partlett and Riley study is that the simulations here only looked at the balanced case, but with a small within-trial sample size.
The simulations here show that both the frequentist ('tdist') and Bayesian ('ntrial') methods have lower than nominal coverage when the between-trial variance is small. The Bayesian ('ntrial') method is a bit more robust when the between-trial variance becomes very small. Other Bayesian methods (e.g. 'simulate', 'predict') have higher coverage than the stated nominal level; the intervals are too wide.
They also show that the different methods start to diverge below HR = 1.
I did not run these simulations 10,000 times. The main difference among runs is in how much the coverage of the 'tdist' and 'ntrial' methods diverges for low between-trial variances.
Answer: for a simple problem, the sample size does not affect coverage dramatically (see the table below). This might not be true for unbalanced cases with unequal sample sizes and other messiness.
## samp.size coverage
## 1 3 0.95100
## 2 5 0.94580
## 3 10 0.94870
## 4 50 0.94936
## 5 100 0.95034
So far, there is no perfect method which shows the nominal coverage of 95% across all values of HR.
Method ‘tdist2’ is an attempt to overcome the limitations of the other methods.
See Williams, Rast and Burkner (2018). An important result from this paper is that priors are important, but the specific choice of distribution for the priors is less so: an inverse-gamma, half-Cauchy or positive Student-t can have similar performance. One way to obtain a half-Cauchy-type prior in MCMCglmm is sketched below.
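As a sketch (not the prior used in the analysis above), MCMCglmm's parameter-expanded priors provide one way to place a half-Cauchy-type prior on the between-trial standard deviation; the scale implied by alpha.V below is illustrative only:

## with V = 1, nu = 1, alpha.mu = 0 and alpha.V = s^2, the implied
## prior on the standard deviation is half-Cauchy with scale s
prior.hc <- list(R = list(V = 1, nu = 0.002),
                 G = list(G1 = list(V = 1, nu = 1,
                                    alpha.mu = 0, alpha.V = 1)))
sfs.mcmc.hc <- MCMCglmm(lrr ~ 1, random = ~Trial_ID, data = sfs,
                        prior = prior.hc, verbose = FALSE)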
See: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html
A small sample size leads to downward bias in the estimates of variance components; a quick illustration is sketched below.
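This is a hedged sketch (not part of the original analysis): simulate many small datasets and look at the distribution of the REML estimates of \(\tau^2\).

library(lme4)
set.seed(1)
tau2.true <- 0.01; sigma2 <- 0.05; n.k <- 5; n <- 6
est <- replicate(200, {
  trial <- rep(1:n.k, each = n)
  y <- rnorm(n.k, 0, sqrt(tau2.true))[trial] + rnorm(n.k * n, 0, sqrt(sigma2))
  as.data.frame(VarCorr(lmer(y ~ 1 + (1 | trial))))$vcov[1]
})
mean(est == 0)  ## proportion of (singular) zero estimates
median(est)     ## often below tau2.true with so few trials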
Whether the trial effect 'exists' should be a function of the experimental design and of our conceptual understanding of the system. When HR < 1, the trial effect might be genuinely small, or our sample might simply have failed to detect it. Is it possible for the trial effect to be large and still find HR < 1? It is less likely, but possible.
## Per-trial summaries (means, SDs, sample sizes) for metafor's escalc()
soy.m1i <- aggregate(TRT1_Yld ~ Trial_ID, data = soy.o, FUN = mean)[,2]
soy.m2i <- aggregate(TRT2_Yld ~ Trial_ID, data = soy.o, FUN = mean)[,2]
soy.sd1i <- aggregate(TRT1_Yld ~ Trial_ID, data = soy.o, FUN = sd)[,2]
soy.sd2i <- aggregate(TRT2_Yld ~ Trial_ID, data = soy.o, FUN = sd)[,2]
soy.n1i <- aggregate(TRT1_Yld ~ Trial_ID, data = soy.o, FUN = length)[,2]
soy.n2i <- aggregate(TRT2_Yld ~ Trial_ID, data = soy.o, FUN = length)[,2]
rr.escl <- escalc("ROM", m1i = soy.m1i, m2i = soy.m2i,
sd1i = soy.sd1i, sd2i = soy.sd2i,
n1i = soy.n1i, n2i = soy.n2i)
soy.rma <- rma(yi, vi, data = rr.escl, test = "knha")
forest(soy.rma)
funnel(soy.rma)
## Comparing methods for the estimation of tau^2
## (note: "REML" appears twice in this vector, so rows 1 and 7 of the
## results below are identical)
mthds <- c("REML","DL","HE","HS","SJ","ML","REML","EB","PM")
tau.col <- numeric(length(mthds))
I2.col <- numeric(length(mthds))
for(i in 1:length(mthds)){
tmp <- rma(yi, vi, data = rr.escl, method = mthds[i])
tau.col[i] <- tmp$tau2
I2.col[i] <- tmp$I2
}
col.c <- data.frame(method = mthds, tau = tau.col, I2 = I2.col)
col.c
## method tau I2
## 1 REML 0.0025645157 91.76571
## 2 DL 0.0011192107 82.94572
## 3 HE 0.0027860419 92.37049
## 4 HS 0.0009599012 80.66264
## 5 SJ 0.0028436097 92.51338
## 6 ML 0.0023588147 91.11146
## 7 REML 0.0025645157 91.76571
## 8 EB 0.0027249670 92.21280
## 9 PM 0.0027320625 92.23146
## Caterpillar plot
## Adapted from
## http://www.metafor-project.org/doku.php/plots:caterpillar_plot
### create plot
forest(rr.escl$yi, rr.escl$vi,
xlim=c(-0.2,0.3), ### adjust horizontal plot region limits
subset=order(rr.escl$yi), ### order by size of yi
slab=NA, annotate=FALSE, ### remove study labels and annotations
efac=0, ### remove vertical bars at end of CIs
pch=19, ### changing point symbol to filled circle
col="gray40", ### change color of points/CIs
psize=1, ### increase point size
cex.lab=1, cex.axis=1, ### increase size of x-axis title/labels
lty=c("solid","blank")) ### remove horizontal line at top of plot
### draw points one more time to make them easier to see
points(sort(rr.escl$yi), nrow(rr.escl):1, pch=19, cex=0.5)
### add summary polygon at bottom and text
addpoly(soy.rma, mlab="", annotate=FALSE, cex=1, row = 1, col = "red")
text(-0.1, 1, "RE Model", pos=4, offset=0, cex=1, col = "red")
## Using metafor for the analysis of the sfs dataset
sfs.m1i <- aggregate(TRT_Yld ~ Trial_ID, data = sfs, FUN = mean)[,2]
sfs.m2i <- aggregate(CTR_Yld ~ Trial_ID, data = sfs, FUN = mean)[,2]
sfs.sd1i <- aggregate(TRT_Yld ~ Trial_ID, data = sfs, FUN = sd)[,2]
sfs.sd2i <- aggregate(CTR_Yld ~ Trial_ID, data = sfs, FUN = sd)[,2]
sfs.n1i <- aggregate(TRT_Yld ~ Trial_ID, data = sfs, FUN = length)[,2]
sfs.n2i <- aggregate(CTR_Yld ~ Trial_ID, data = sfs, FUN = length)[,2]
rr.escl.sfs <- escalc("ROM", m1i = sfs.m1i, m2i = sfs.m2i,
sd1i = sfs.sd1i, sd2i = sfs.sd2i,
n1i = sfs.n1i, n2i = sfs.n2i)
sfs.rma <- rma(yi, vi, data = rr.escl.sfs, test = "knha")
sfs.rma
##
## Random-Effects Model (k = 37; tau^2 estimator: REML)
##
## tau^2 (estimated amount of total heterogeneity): 0.0001 (SE = 0.0001)
## tau (square root of estimated tau^2 value): 0.0085
## I^2 (total heterogeneity / total variability): 12.18%
## H^2 (total variability / sampling variability): 1.14
##
## Test for Heterogeneity:
## Q(df = 36) = 33.0775, p-val = 0.6083
##
## Model Results:
##
## estimate se tval pval ci.lb ci.ub
## 0.0220 0.0038 5.8552 <.0001 0.0144 0.0296 ***
##
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
predict(sfs.rma, transf = exp)
##
## pred ci.lb ci.ub cr.lb cr.ub
## 1.0222 1.0145 1.0300 1.0031 1.0417
forest(sfs.rma)
funnel(sfs.rma)
Williams, Rast and Burkner (2018) “Bayesian Meta-Analysis with Weakly Informative Prior Distributions”. https://psyarxiv.com/7tbrm/
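Partlett and Riley (2016) "Random effects meta-analysis: Coverage performance of 95% confidence and prediction intervals following REML estimation". Statistics in Medicine.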