Chapter 0: Basic concepts

0.1 Introduction

In this first Chapter 0 I will illustrate a simple example with the most basic assumption in statistics in which we draw a sample from a population.

The purpose of this document is to present theory, concepts and tools for the challenge of predictions and forecasting in the context of agricultural research.

The first issue is that prediction (https://en.wikipedia.org/wiki/Prediction) is fundamentally about the future and the future can be different from the past so we might deal with an “out-of-sample” problem. This is, the data we have (which once collected is already in the past) can be more or less useful when it comes to making statements about the future. In agriculture, perhaps the most frequent problem is related to seasons and weather. We might want to make decisions in the fall about next summer, but right now the signals are too weak to make forecasts that far in advance. If next year has a combination of conditions (e.g. temperature, precipitation, VPD) which are strikingly different from the ones on record then our predictions and forecasts might be off by quite a bit. The other similar problem is that of making spatial predictions. In this case, soil properties in our samples can be different from the new location to where our predictions are intended. Although, soil properties are sometimes thought of as static, they change over time and even within a growing season.

The exercise of prediction is not always appreciated. Most research papers do not include prediction intervals, even though, implicitly we are always concerned with prediction. We often write research papers describinig what was done and performing a statistical analysis on what was done; this is history. It is not uncommon in the discussion to make statements about the implications for the future without presenting a formal prediction given our data. Let me restate this, it is extremely uncommon to present predictions with associated uncertainty in most experimental work.

0.2 One sample

Here is an example with a very simple dataset. From a set of observations we can calculate the following:

## This is a very simple example with just 10 observations
y <- c(1.2, 2.7, 3.2, 5, 5.5, 6.1, 6.6, 7.3, 8.05, 9.9)
n.y <- length(y) # sample size
m.y <- mean(y) # mean of y
sd.y <- sd(y) # standard deviation of y
var.y <- var(y) # variance of y

The average

\[ \bar{y} = \frac{\Sigma_i y_i}{n} \]

Constitutes a form of estimation of the unknown parameter \(\mu\). But this means we propose the following model:

\[ y_i = \mu + e_i \]

In inferential statistics we recognize that there is uncertainty around this estimate. Using this sample to estimate the unknown \(\mu\) we construct a confidence interval. For this, we use the t-distribution and the usual 95% percent level.

The following is a confidence interval for \(\mu\) \[ CI \; \mu \rightarrow \; \bar{y} \pm t_{\alpha,df} \sqrt{\frac{\sigma^2}{n}} \]

0.3 Confidence and Prediction Intervals

t.level <- qt((0.05/2),(n.y - 1))
stderr.y <- sd.y / (sqrt(n.y))
ci.delta <- t.level * stderr.y
ci.y <- c(m.y + ci.delta, m.y - ci.delta)
print(ci.y, digits = 3)

## [1] 3.67 7.44

What is the confidence interval (3.67 - 7.44) a statement of?

If we were to repeat this process many times this interval will contain the ‘true’ \(\mu\) 95% of the time. Thus, in repeated sampling, 5% of the time we will be wrong. This is interesting, but we never do this, do we?
Values of the population parameter in that range might be consistent with the data we have collected. There is no gurantee that the ‘true’ population parameter lies in that range.

Some wikipedia (See: https://en.wikipedia.org/wiki/Confidence_interval) quotes:

“A 95% confidence level does not mean that for a given realized interval there is a 95% probability that the population parameter lies within the interval”.

“A 95% confidence level does not mean that 95% of the sample data lie within the confidence interval”

“A confidence interval is not a definitive range of plausible values for the sample parameter, though it may be understood as an estimate of plausible values for the population parameter.”

This semantics make my head spin, but it means we should be careful when interpreting confidence intervals.

The idea of prediction is that, if we wanted to predict a single new observation not in our sample (\(y^*\)), then that interval is not appropriate. In fact 5 (50%) of our observations fell outside that range. One way to think about it is that we now have \(var(y^* + \bar{y})\), which is \(var(y^*) + var(\bar{y})\) assuming independence. This is

\[ var(y^* + \bar{y}) = var(y^{*}) + var(\bar{y}) = \sigma^2_{y^*} + \sigma^2_{\bar{y}} \]

\[ PI \; y^* \rightarrow \bar{y} \pm t_{\alpha,df} \sqrt{\sigma^2 + \frac{\sigma^2}{n}} \]

Here \(y^*\) is a future observation, not in the sample. It is important to note that there is a difference between making a statement about a parameter and an individual observation. Although often this will be a very wide range it gives a more informative descritption of the variability in the sample.

psd.ystar <- sd.y * sqrt(1 + 1/n.y)
pci <- c(m.y + qt(0.025,n.y-1)*psd.ystar, m.y - qt(0.025,n.y-1)*psd.ystar)
print(pci, digits = 2)

## [1] -0.7 11.8

This new interval is quite wide (-0.7, 11.8) and it fact goes beyond any original value in the sample and it even has a negative lower limit, which might not be reasonable if our original variable is strictly positive.

It is possible to construct the same interval using the lm function.

fitlm.y <- lm(y ~ 1)
predict(fitlm.y, newdata = data.frame(y = m.y), interval = "prediction")

##     fit        lwr      upr
## 1 5.555 -0.6978848 11.80788

So within the linear model framework it is possible to provide individual level predictions. (There is actually a good argument for computing the degrees of freedom as n-2 instead of n-1 (Higgins, 2009), but this what I had to do to match the calculations by hand with what R provides.)

The residual standard error from this linear model is equivalent to the standard deviation of y.

sd.y

## [1] 2.63549

sigma(fitlm.y)

## [1] 2.63549

0.4 Confidence interval for the sample variance

What is the confidence interval for the standard devaition? One method is to use the chi-squared distribution (https://en.wikipedia.org/wiki/Chi-squared_distribution). Many introductory statistics textbooks show that

\[ \frac{(n - 1) s^2}{\chi^2_{\alpha/2,k}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,k}}\] \(\alpha\) is the desired confidence interval level, typically \(\alpha=0.05\). \(\sigma^2\) is the unknown varaince.

var.y.cil <- ((n.y - 1) * var.y) / qchisq(0.975, n.y - 1)
var.y.ciu <- ((n.y - 1) * var.y) / qchisq(0.025, n.y - 1)

## Variance CI:  3.29 23.15

## Std. Dev. CI:  1.81 4.81

Introducing the distribution for the sample variance is useful here as we will use this information to model variances in later sections.

0.5 The bootstrap method

Another option is to use the bootstrap method (https://en.wikipedia.org/wiki/Bootstrapping_(statistics)). In this case we can sample with replacement from our original vector \(\mathbf{y}\) and calcualte the variance for each instance.

bint <- 5000
varvec <- numeric(bint)


for(i in 1:bint){
  y.star <- sample(y, n.y, replace = TRUE)
  varvec[i] <- var(y.star)
}
quantile(varvec, c(0.025, 0.5, 0.975))

##      2.5%       50%     97.5% 
##  2.092331  6.116250 11.245035

There is a bias with the use of the bootstrap method which is computed as the difference between the ‘true’ estimate using the original data and the boostrap estimate

bvar <- function(x, indices){ x2 <- x[indices]; return(var(x2))}
bv5 <- boot(y, bvar, 5000)
bv5

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = y, statistic = bvar, R = 5000)
## 
## 
## Bootstrap Statistics :
##     original     bias    std. error
## t1* 6.945806 -0.7441014    2.364899

summary(bv5$t)

##        V1        
##  Min.   : 0.374  
##  1st Qu.: 4.468  
##  Median : 6.094  
##  Mean   : 6.202  
##  3rd Qu.: 7.692  
##  Max.   :15.781

boot.ci(bv5, type = "perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 5000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bv5, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 2.107, 11.289 )  
## Calculations and Intervals on Original Scale

hist(bv5$t, main = "", xlab = "bootstrap estimates")

The boostrap method produces a narrower confidence interval than the \(\chi^2\) method. Is the bootstrap method truly underpredicting the distribution of the variance? There is also a bias as the mean (of the sampled variances) is generally lower than the variance calcualted on the original data. The upper confidence limit for the estimate of the variance using the bootstrap method is much lower than the one estimated using the \(\chi^2\) method (\(\chi^2\) = 23 vs. boot = ~11.4) - almost half.

0.6 The jackknife

The jackknife version of this is obtained by leaving one out (https://en.wikipedia.org/wiki/Jackknife_resampling). Not terribly useful in this situation, but I include it for completeness. With this small sample size there is not much that can be extracted from this small resampling exercise. However, I constructed a confidence interval which is somewhat wider than the one obtained through bootstrap.

jvarv <- numeric(n.y)

for(i in 1:n.y){
  y2 <- y[-i]
  jvarv[i] <- var(y2)
}

mjv <- mean(jvarv)
vjv <- sum((jvarv - mjv)^2) * ((n.y-1)/n.y)
sdjv <- sqrt(vjv)
round(c(mjv - 1.96*sdjv, mjv + 1.96*sdjv),2)

## [1]  1.34 12.56

0.7 Bayesian framework

0.7.1 Intro to Bayes

Introduction to Bayesian inference.

0.7.2 Bayes examples

In a Bayesian framework we can generate a sample from a Gaussian distribution for which there is uncertainty about both the mean and the variance. One option for simulating variances is to generate samples using the chi-squared distribution, but this distribution is a special case of the gamma.

“A central chi-squared distribution with n degrees of freedom is the same as a Gamma distribution with shape a = n/2 and scale s = 2”.

Example of a histogram for a Chi-squared distribution for df = 9

ggplot() + 
xlab("variance") +
geom_density(aes(rchisq(1e4, 9)), color = "blue") +
geom_density(aes(rgamma(1e4, shape = 9/2, scale = 2)), color="orange")

Let’s review what the gamma distribution looks like for different levels of scale

For the same mean of 10, the gamma becomes more spread out when the scale increases.

## Gamma mean:  10 9.86 9.66

## Gamma variance:  9.3 19.26 93.94

This is because for the gamma distribution the mean and variance are \(E(X) = a\times s\) and \(Var(X) = a\times s^2\).

## This simple function calculates the shape and scale for the 
## gamma distribution from data. This is called 'method matching'
fgp <- function(x){
  ## the mean of gamma is E(x) = a * s
  ## the variance of gamma is Var(X) = a * s^2
  m.x <- mean(x)
  v.x <- var(x)
  scale <- v.x / m.x
  shape <- m.x / scale
  return(c(shape, scale))
}
## It is also possible to estimate parameters from a distribution
## Using the MASS::fitdistr function or the fitdistrplus package
## This is shown in the next chapter

0.7.3 Bayes: sampling from distributions

This is an example of generating data in a forward fashion, by simulating data from a \(\chi^2\) distribution for the variance and using this in the generation of samples to infer the shape of the distribution. This is one way of creating a ‘prediciton’ interval for a future observation (\(y^*\)).

nit <- 1e5
rvec <- numeric(nit)

for(i in 1:nit){
  ## Generate a variance based on the observed one using the chisq 
  ## distribution. In this case degrees of freedom is equivalent 
  ## to the variance
  vxi <- rchisq(1, df = var.y)
  ## We could accomplish something similar using the gamma 
  ## distribution
  ## vxi <- rgamma(1, shape = var.y/2, scale = 2)
  sdxi <- sqrt(vxi) ## Standard deviation
  stderrxi <- sdxi/sqrt(10) ## Standard error
  mx2 <- rnorm(1, mean = m.y, sd = stderrxi)
  postx <- rnorm(1, mean = mx2, sd = sdxi)
  rvec[i] <- postx
}

quantile(rvec, c(0.025, 0.975))

##        2.5%       97.5% 
## -0.05077841 11.14829168

This confidence interval is very similar to the one obtained at the beginning (using either the ‘manual’ method or lm).

0.8 Example intro to JAGS

This example has non-iformative (almost flat) priors for the mean and precision (1/variance). The mean and variance are very similar to the previous estimates.

yd <- list(y = y, n = n.y)
init <- list(mu = 0, tau = 1)
modelstring="
  model {
    for (i in 1:n) {
      y[i]~dnorm(mu,tau)
    }
    mu~dnorm(0,0.00001)
    tau~dgamma(0.0001,0.0001)
    sigma <- 1.0/sqrt(tau)
  }
"
mdl <- jags.model(textConnection(modelstring), data=yd, inits=init)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 10
##    Unobserved stochastic nodes: 2
##    Total graph size: 19
## 
## Initializing model

update(mdl,n.iter=5000,n.burnin=1000)
output <- coda.samples(model=mdl,variable.names=c("mu", "sigma"), n.iter=10000, thin=10)
print(summary(output))

## 
## Iterations = 5010:15000
## Thinning interval = 10 
## Number of chains = 1 
## Sample size per chain = 1000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##        Mean     SD Naive SE Time-series SE
## mu    5.521 0.9392  0.02970         0.0297
## sigma 2.906 0.8056  0.02547         0.0223
## 
## 2. Quantiles for each variable:
## 
##        2.5%   25%   50%   75% 97.5%
## mu    3.563 4.929 5.518 6.093 7.294
## sigma 1.831 2.362 2.733 3.310 4.960

plot(output, density = FALSE)

plot(output, trace = FALSE)

The confidence interval (3.6 - 7.4) is similar to the confidence interval computed at the beginning. The confidence interval for the variance is similar to that obtained using the \(\chi^2\) method.

ci.var <- summary(output)$quantile[2,]^2 
ci.var

##      2.5%       25%       50%       75%     97.5% 
##  3.351676  5.577189  7.469801 10.956086 24.605666

0.8.1 Prediction interval using JAGS

It is also possible to create a prediction interval within the JAGS fitting process.

## We can also construct a prediciton interval
yd <- list(y = y, n = n.y)
init <- list(mu = 0, tau = 1)
modelstring="
  model {
    for (i in 1:n) {
      y[i]~dnorm(mu,tau)
    }
    mu~dnorm(0,0.00001)
    tau~dgamma(0.0001,0.0001)
    sigma <- 1.0/sqrt(tau)
    ## prediction
    resid~dnorm(0,tau)
    pred <- mu + resid
  }
"
mdl <- jags.model(textConnection(modelstring), data=yd, inits=init)

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 10
##    Unobserved stochastic nodes: 3
##    Total graph size: 21
## 
## Initializing model

update(mdl,n.iter=5000,n.burnin=1000)
output <- coda.samples(model=mdl,
                       variable.names=c("mu", "sigma","pred"),
                       n.iter=10000, thin=10)
## Here the 'pred' contains the prediction interval
## This prediction interval is similar to the ones obtained using 
## the other methods
print(summary(output))

## 
## Iterations = 5010:15000
## Thinning interval = 10 
## Number of chains = 1 
## Sample size per chain = 1000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##        Mean     SD Naive SE Time-series SE
## mu    5.522 0.9611  0.03039        0.03039
## pred  5.519 3.3742  0.10670        0.10670
## sigma 2.919 0.8670  0.02742        0.02742
## 
## 2. Quantiles for each variable:
## 
##         2.5%   25%   50%   75%  97.5%
## mu     3.420 4.974 5.541 6.136  7.370
## pred  -1.910 3.774 5.590 7.608 11.737
## sigma  1.857 2.324 2.757 3.305  4.809

plot(output, density = FALSE)

plot(output, trace = FALSE)

## Warning: Removed 8 rows containing non-finite values (stat_density).

0.9 Individual observation vs. parameter: example with men and women’s heights

To illustrate the conceptual difference between statements about a population parameter and an individual observation let’s think about the difference between men and women’s heights.

From our source (https://en.wikipedia.org/wiki/List_of_average_human_height_worldwide). For Argentina, for example, the average for men is 174.46cm (sd = 7.43cm) and for women 161cm (sd = 6.99cm). Let’s pretend we collected data from 1000 men and women.

In this case, the confidence interval is very narrow and it is a statement about the population parameter. The prediction interval is much wider and it is a statement about an observation not in our sample (\(y_{new}\)). The prediciton interval is not exactly a description of the observed distribution (there are exceptions), but it should have a similar ‘look and feel’. If it is ver, very different, then we might have done something wrong.

m.ht <- rnorm(1e3, mean = 174.43, sd = 7.43)
w.ht <- rnorm(1e3, mean = 161, sd = 6.99)
mw.ht <- data.frame(sex = rep(c("men","women"), each = 1e3),
                    height = c(m.ht,w.ht))

## Fit linear model
ht.lm <- lm(height ~ sex, data = mw.ht)

dsx <- data.frame(sex = c("men","women"))
  
htp1 <- predict(ht.lm, newdata = dsx,
                interval = "confidence") %>% 
  as.data.frame() %>% bind_cols(dsx,.) 

htp2 <- predict(ht.lm, newdata = dsx,
                interval = "prediction") %>% 
  as.data.frame(.,dsx) %>% bind_cols(dsx,.)

## Warning: Ignoring unknown aesthetics: x

## Warning: Ignoring unknown aesthetics: x

0.10 Relationship between confidence, prediction intervals and sample size

An interesting question is whether for very small sample sizes the confidence and prediction intervals do not apply. Certainly, when our sample size is small we should be very careful about making general statements and recognize that we have a small amount of evidence about any statement that will be derived from that data. Somewhat surprisingly, the sample size for the stabilization of confidence and prediction intervals is not that high.

## Let's do this for different samples sizes 
set.seed(123456)
ss <- seq(2, 10)

rdat <- data.frame(ss = rep(ss, 2), 
                   intervals = rep(c("confidence","prediction"), each = length(ss)),
                              fit = NA, lwr = NA, upr = NA)

for(i in 1:length(ss)){
  ## Generate data
  xx <- rnorm(ss[i], mean = 100, sd = 15)
  fit <- lm(xx ~ 1)
  rdat[i,c("fit","lwr","upr")] <- predict(fit, interval = "confidence")[1,]
  rdat[I(i+length(ss)),c("fit","lwr","upr")] <- predict(fit, 
                                          newdata = data.frame(xx=mean(xx)),
                                          interval = "prediction")[1,]
}

ggplot(data = rdat, aes(x = ss, y = fit)) +
  geom_smooth(method = "loess", se = FALSE, color = "black") + 
  geom_smooth(aes(x = ss, y = lwr, color = intervals), method = "loess", 
              se = FALSE) +
  geom_smooth(aes(x = ss, y = upr, color = intervals), method = "loess",
              se = FALSE) +
  xlab("sample size") + ylab("y") + 
  ggtitle("Relationship between sample size and confidence and prediction intervals")