Linear models, Part II: Examples with R

Vassilis Kehayas

Last updated on Sep 29, 2019 23 min read

Linear models, Part II: Examples with R

Vassilis Kehayas

Last updated on Sep 29, 2019 23 min read

Edit File History View Source

In the companion post, I briefly covered the theory behind linear models. In this post, I will go through an example while revisiting some of the principles we described there. This small example should demonstrate the ease with which linear models can be extended to accommodate data with varying structures and satisfy a range of distributional assumptions. An example will also give us the opportunity to highlight some best practices. We will be looking at (made-up) data from neuronal recordings, created and analyzed in R.

Construct and inspect the data

Expand code

packageList = c("ggplot2", # Plotting
                "sjPlot", # Tables for model outputs
                "cowplot", # Arrange plots 
                "ggbeeswarm", # Beeswarm plots
                "DHARMa", # Residual simulation
                "reshape2", # "Melt" matrices
                "lme4") # Fit mixed models
ph = lapply(packageList, 
            require,
            character.only = TRUE)

In the next code block I will create the data:

seed = 0
set.seed(seed) # For reproducible results
n = 600 # Total sample size
nID = 20 # Number of neurons
nTrials = n/nID
# "Ground truth" coefficients
b0 = 1.6 # Intercept
bStim = 0.7
bArea = -1.6
bAreaXStim = 0.7 # Interaction area X stim
# Design matrix
stimKernel = c("no-stim", "stim")
stim = factor(c(rep(stimKernel[1], n/2), 
                rep(stimKernel[2], n/2)), 
              levels = stimKernel)
areaKernel = c("thalamus", "cortex")
area = factor(rep(areaKernel, n/2), 
              levels = areaKernel)
temp = round(rnorm(n, 25, 1), 1) # Degrees Celsius
trial = 1:n
id = rep(1:nID, nTrials)
u = 0.3
zeta = rep(rnorm(nID, 0, u), nTrials)
# Simulate response
xArea = contrasts(area)[area]
xStim = contrasts(stim)[stim]
mu = exp(b0 + bStim*xStim + bArea*xArea + bAreaXStim*xArea*xStim + zeta)
nSpikes = rpois(n, mu)
# Bind data
data = data.frame(trial, id, stim, area, temp, nSpikes)

Let’s first have a look at a sample of the data for a single neuron:

Expand code

data1 = data[data$id == 1, ]
sampleArray = rbind(head(data1), NA, tail(data1))
ShowTable = function(sampleArray) {
  kable(sampleArray, 
        align = "l", 
        format = "markdown",
        row.names = FALSE,
        digits = 4)
}
ShowTable(sampleArray)

trial	id	stim	area	temp	nSpikes
1	1	no-stim	thalamus	26.3	4
21	1	no-stim	thalamus	24.8	1
41	1	no-stim	thalamus	26.8	6
61	1	no-stim	thalamus	25.4	0
81	1	no-stim	thalamus	24.2	3
101	1	no-stim	thalamus	25.8	2
…	…	…	…	…	…
481	1	stim	thalamus	22.7	6
501	1	stim	thalamus	24.4	5
521	1	stim	thalamus	24.0	7
541	1	stim	thalamus	24.7	5
561	1	stim	thalamus	25.8	10
581	1	stim	thalamus	24.7	4

There are 20 neurons in this dataset. For each neuron located in either of two brain areas –the thalamus and the cortex– the number of spikes recorded per trial is stored in the variable nSpikes. In half of the trials, each neuron received stimulation (total number of trials per neuron is 30). The temperature during the trial was recorded in degrees Celsius in the variable temp.

First, I will plot the number of spikes per trial for all neurons split into groups based on whether they received stimulation or not.

Expand code

my_theme = 
  theme_classic() +
  theme(text = element_text(face = "bold", 
                            size = 14,
                            family = "Noto"), 
        panel.grid.major = element_line())

PlotBeeswarm = function(data, my_theme, ...) {
  p1 = ggplot(data = data, aes(stim, nSpikes, ...)) +
    geom_beeswarm(alpha = 0.3, cex = 0.5, size = 1.5) +
    labs(x = "Stimulation",
         y = "Spikes per trial") +
    my_theme
  return(p1)
}
p1 = PlotBeeswarm(data, my_theme)
p1

Each dot represents an observation. It is clear from the plot that stimulation increases the average number of spikes per trial and the variance of their distribution.

Linear model

Let’s fit a linear model to investigate:

lm1 = lm(nSpikes ~ 1 + stim, data)

I have specified a formula nSpikes ~ 1 + stim for this model. This form of formula specification is called Wilkinson notation. The above model is equivalent to the following equation:

$$ nSpikes = \beta_0 + \beta_{stim}x_{stim} \label{eq:lin_mod} $$ where $\beta_0$ is the estimate for the intercept, which corresponds to the term 1 in Wilkinson notation, and $\beta_{stim}$ is the estimate for the effect of factor $x_{stim}$, which can either take values of $0$ or $1$, for the absence or presence of stimulation, respectively.

The model’s output is shown in the following table:

Expand code

ConstructModelTable = function(mdl, ...) {
  tab_model(mdl,
            pred.labels = c("Intercept", attr(terms(mdl), "term.labels")), 
            p.style = "numeric", 
            collapse.ci = TRUE,
            linebreak = FALSE,
            show.aic = TRUE,
            emph.p = FALSE,
            transform = NULL,
            ...)
}
ConstructModelTable(lm1, show.fstat = TRUE)

	n Spikes
Predictors	Estimates	p
Intercept	3.16 (2.68 – 3.64)	<0.001
stim	4.10 (3.42 – 4.77)	<0.001
Observations	600
R² / adjusted R²	0.191 / 0.190
AIC	3433.405

The model is picking up what we can readily observe from the plot. The intercept estimate corresponds to the number of spikes per trial in cases where the neurons received no stimulation, and its value is detected to differ from $0$. The reason that the intercept estimate corresponds to the case where no stimulation was applied is that the first level of the categorical variable stim is set to $0$ by default. We can see from equation $\eqref{eq:lin_mod}$ that when we set $\beta_{stim}$ to $0$ the estimate for the response variable is equal to the intercept term, $\beta_0$. The estimate labelled stim corresponds to the number of spikes per trial in the presence of stimulation, in excess of those for the intercept estimate.

ANOVA interlude: The results of an ANOVA on the same data are identical to those of the simple linear model presented above:

aov1 = aov(nSpikes ~ stim, data)

Expand code

kable(anova(aov1), 
        align = "l", 
        format = "markdown",
        digits = 3)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
stim	1	2517.402	2517.402	141.625	0
Residuals	598	10629.557	17.775

To see that, compare the F-statistic value between the two outputs.

Expand code

modelFStat = data.frame(model = c("Linear model", 
                                  "ANOVA"), 
                        Fstat = c(summary(lm1)$fstatistic[1], 
                                  summary(aov1)[[1]]$`F value`[1]))
kable(modelFStat, 
      col.names = c("Model", 
                    "F-statistic"), 
      row.names = FALSE,
      align = "l", 
      format = "markdown",
      digits = 3)

Model	F-statistic
Linear model	141.625
ANOVA	141.625

In fact, the R function I used for the ANOVA (aov) is simply using the function I used to fit the linear model (lm) under the hood. The help page for aov includes the following note:

aov is designed for balanced designs, and the results can be hard to interpret without balance: beware that missing values in the response(s) will likely lose the balance. If there are two or more error strata, the methods used are statistically inefficient without balance, and it may be better to use [liner mixed models]

Thus, I suggest that it is preferable to explicitly use linear models to become familiar with a statistical procedure that is easily extendable to accommodate more complicated data structures while achieving the same results as ANOVA.

Going back to the original linear model, the next step should always be to evaluate the fit:

Expand code

blank_style = my_theme + 
  theme(axis.line.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(), 
        axis.title.y = element_blank())

lmPredict = predict(lm1, interval = "confidence", level = 0.95)

OverlayPrediction = function(gPlot, data, predictionTable, 
                             fillValue, fill, colour,
                             showLegend = FALSE, ...) {
  gp = gPlot + 
    geom_ribbon(aes(x = as.numeric(data$stim), 
                    ymin = predictionTable[,2], 
                    ymax = predictionTable[,3],
                    fill = fill,
                    colour = NULL), 
                alpha = 0.2) +
    geom_line(aes(x = as.numeric(data$stim), 
                  y = predictionTable[,1], 
                  ...), 
              size = 1) +
    geom_point(aes(x = as.numeric(data$stim), 
                   y = predictionTable[,1],
                   fill = fill, 
                   ...), 
               shape = 22,
               size = 4,
               stroke = 1.5,
               colour = "black") +
    scale_colour_manual(values = fillValue,
                        name = NULL) +
    scale_fill_manual(values = fillValue,
                        name = NULL) +
    guides(colour = FALSE) +
    theme(legend.position = c(0.25, 0.9))
  
  if (showLegend != TRUE) {
    gp = gp + theme(legend.position = "none")
  }
  
  return(gp)
  
}

PlotQQ = function(mdl, my_theme) {
  
  gp = ggplot(mdl, aes(sample = .resid)) +
    stat_qq_line(size = 1, 
                 linetype = "dashed", 
                 alpha = 0.5) +
    stat_qq(alpha = 0.3) +
    labs(x = "Theoretical",
         y = "Sample") +
    my_theme
  
  gp = gp +
    coord_cartesian(ylim = range(pretty(residuals(mdl))))
}

PlotResidualHistogram = function(mdlResiduals, my_theme) {
  
  nBins = 20
  binLabels = c(0, 50, 100, 150, 200)
  
  xLim = range(pretty(mdlResiduals))
  
  gp = ggplot() + 
    geom_histogram(aes(mdlResiduals), 
                   bins = nBins) +
    labs(y = "Count") +
    coord_flip(xlim = xLim) +
    my_theme
  
  binCounts = c(ggplot_build(gp)$data[[1]]$count, 50)
  yLim = binLabels[binLabels <= max(binCounts)]
  
  gp = gp +
    scale_y_continuous(breaks = yLim,
                       labels = yLim) +
    expand_limits(y = max(c(max(binCounts), max(yLim))))
  
}

ArrangePlots = function(pA, pB, pC) {
  plot_grid(pA, pB, pC, 
            labels = c("a", "b"), 
            ncol = 3, 
            rel_widths = c(2, 1.5, 0.75))
}

p2a = OverlayPrediction(p1, data, lmPredict, 
                        fillValue = "black", 
                        fill = "black", 
                        colour = rgb(2/3, 2/3, 2/3))
p2b = PlotQQ(lm1, my_theme)
p2c = PlotResidualHistogram(lm1$residuals, blank_style)
ArrangePlots(p2a, p2b, p2c)

In panel a I added the model’s estimated slope with 95% confidence intervals on top of the data. The square symbols designate the estimated average number of spikes per trial. The left plot in panel b is showing us the theoretical versus the observed quantiles of the fit’s residuals. The residuals represent how much off the fit was for every single observation, i.e. the distance of each point in panel a from the fit line. Recall that the residuals represent the error estimate and are assumed to be approximately normally distributed with an average of $0$. If that was the case, we should expect to see them distributed along the straight line shown on the plot. On the right of the same panel, we see the histogram of the residuals, as another way to visually assess their distribution.

From these plots, we see clearly that the model fails to accommodate the structure of the data. We can also see the inadequacy of the model in the estimates of the linear model: whereas the coefficient estimate accurately describes the relationship between the two groups, the reported standard errors are inaccurate (e.g. 0.24 from the model versus 0.18 from the data for the intercept).

There are certain things we should consider at this point to improve our model. Some may attempt to discard some observations as outliers. However, this is almost never a good idea unless there are good reasons to think that the data acquisition was somehow compromised. Even if that was the case, these are considerations that should be addressed before running any analyses to avoid any potential bias.

Linear model with log-transformed response variable

Another approach would be to apply some transformation to the data to satisfy the assumptions of our model. One option we could consider for these data is to log-transform the response variable.

Expand code

logData = data
logData$nSpikes = log10(data$nSpikes)
logData1 = logData[logData$id == 1, ]
sampleArray = rbind(head(logData1), NA, tail(logData1))
ShowTable(sampleArray)

trial	id	stim	area	temp	nSpikes
1	1	no-stim	thalamus	26.3	0.6021
21	1	no-stim	thalamus	24.8	0.0000
41	1	no-stim	thalamus	26.8	0.7782
61	1	no-stim	thalamus	25.4	-Inf
81	1	no-stim	thalamus	24.2	0.4771
101	1	no-stim	thalamus	25.8	0.3010
…	…	…	…	…	…
481	1	stim	thalamus	22.7	0.7782
501	1	stim	thalamus	24.4	0.6990
521	1	stim	thalamus	24.0	0.8451
541	1	stim	thalamus	24.7	0.6990
561	1	stim	thalamus	25.8	1.0000
581	1	stim	thalamus	24.7	0.6021

Unfortunately, this approach is problematic not in the least because there were observations with no spikes in some trials and the log of $0$ is not defined. We could still go on and fit a model for the sake of the example, treating those undefined values as missing.

logData = logData[logData$nSpikes != -Inf,]
lm2 = lm(nSpikes ~ 1 + stim, logData)

Expand code

ConstructModelTable(lm2)

	n Spikes
Predictors	Estimates	p
Intercept	0.50 (0.46 – 0.55)	<0.001
stim	0.26 (0.20 – 0.32)	<0.001
Observations	517
R² / adjusted R²	0.130 / 0.128
AIC	349.632

Expand code

lmPredict = predict(lm2, interval = "confidence", level = 0.95)
p3a = OverlayPrediction(PlotBeeswarm(logData, my_theme), 
                        logData, 
                        lmPredict, 
                        fillValue = "black", 
                        fill = "black",
                        colour = rgb(2/3, 2/3, 2/3))
p3b = PlotQQ(lm2, my_theme)
p3c = PlotResidualHistogram(lm2$residuals, blank_style)
ArrangePlots(p3a, p3b, p3c)

This model is also able to detect increased spiking with stimulation. Although the diagnostic plots look much better, the coefficient estimates are underestimating the true effect (e.g. 1.65 from the model versus 3.16 from the data for the intercept on the original scale). The reason for that is Jensen’s inequality, which states that the expected value (i.e. the mean) of a concave function, such as the logarithm, is bound to be smaller than or equal to the transformed expected value: $$ \mathbb{E}[f(X)] \le f(\mathbb{E}[X]) \label{eq:jensen_inequality} $$ It is not trivial to get unbiased estimates in units of the original scale, especially for interval estimates such as confidence intervals.¹

Poisson regression

A better approach is to use a GLM that natively incorporates the structure of the data at hand. Since our response variable consists of counts of a process, the number of spikes per trial, the Poisson distribution is a natural fit (pun intended).

glm1 = glm(nSpikes ~ 1 + stim, family = poisson, data)

Expand code

ConstructModelTable(glm1, show.r2 = FALSE)

	n Spikes
Predictors	Log-Mean	p
Intercept	1.15 (1.09 – 1.21)	<0.001
stim	0.83 (0.76 – 0.91)	<0.001
Observations	600
AIC	3812.542

Our estimates are reasonably close to the real values of 1.6 for the intercept and 0.7 for stimulation.

The evaluation of Q-Q plots for GLMs is not as straightforward as for those we used for our linear regression models since we do not assume that the residuals are normally distributed and it is harder to diagnose deviations from the Poisson distribution visually. For this reason, we will use simulation.

Expand code

nSim = 1000
sim1 = simulateResiduals(glm1, seed = seed, n = nSim)

glmPredict = predict(glm1, se.fit = TRUE)
ci95 = glmPredict$fit + outer(glmPredict$se.fit, qnorm(c(0.025, 0.975)))
glmPredict$low = exp(ci95[, 1])
glmPredict$upp = exp(ci95[, 2])
glmPredict$response = exp(glmPredict$fit)
glmPredTable = with(glmPredict, cbind(response, low, upp))

p4a = OverlayPrediction(p1, 
                        data, 
                        glmPredTable, 
                        fillValue = "black", 
                        fill = "black",
                        colour = rgb(2/3, 2/3, 2/3))

PlotQQGLM = function(sim, my_theme) {
  gp = ggplot() + 
    stat_ecdf(aes(sim$scaledResiduals), pad = FALSE) + 
    geom_abline(slope = 1, size = 1, linetype = "dashed", alpha = 0.5) +
    coord_flip(xlim = c(0, 1)) +
    labs(x = "Sample",
         y = "Theoretical") +
    my_theme +
    theme(panel.grid.minor.x = element_blank())
}

p4b = PlotQQGLM(sim1, my_theme)
p4c = PlotResidualHistogram(sim1$scaledResiduals, blank_style)
ArrangePlots(p4a, p4b, p4c)

In panel a, I plot once more the data overlaid with the output of the GLM. In panel b, I plot the resulting residuals from a simulation which involved replicating the dataset 1000 times. A value of $0$ for the simulated residuals means that all simulated values were greater than the corresponding observed value. If the model fit was perfect, we should get uniformly distributed simulated residuals: they would appear close to the diagonal dashed line with slope $1$ in the left plot of panel b and their histogram on the right would be flat. In this case, we can observe detectable deviations from a uniform distribution.

Including more variables in the model

So far we have kept our models simple to explain some basic ideas. Let’s now add more of the available variables as parameters in our model.

glm2 = glm(nSpikes ~ 1 + stim*area*temp, family = poisson, data)

The asterisks in the formula above designate that we are requesting that the model includes stim, area, and temp as well as all possible interactions between them. We can express the same model with the expanded Wilkinson notation:

nSpikes ~ 1 + stim + area + temp + 
          stim:area + stim:temp + area:temp + 
          stim:area:temp

where the symbol : indicates an interaction term. The corresponding formula for this model is:

$$ \begin{align} nSpikes = {\beta_0 + \\ \beta_{stim}x_{stim} + \\ \beta_{area}x_{area} + \\ \beta_{temp}x_{temp} + \\ \beta_{(stim,area)}x_{(stim,area)} + \\ \beta_{(stim,temp)}x_{(stim,temp)} + \\ \beta_{(area,temp)}x_{(area,temp)} + \\ \beta_{(stim,area,temp)}x_{(stim,area,temp)}} \end{align} \label{eq:glm_interaction} $$

Expand code

ConstructModelTable(glm2, show.r2 = FALSE)

	n Spikes
Predictors	Log-Mean	p
Intercept	2.38 (0.65 – 4.10)	0.007
stim	0.23 (-1.90 – 2.36)	0.833
area	-3.50 (-8.41 – 1.40)	0.161
temp	-0.03 (-0.10 – 0.04)	0.441
stim:area	4.59 (-0.85 – 10.03)	0.098
stim:temp	0.02 (-0.07 – 0.10)	0.683
area:temp	0.07 (-0.13 – 0.26)	0.509
stim:area:temp	-0.15 (-0.37 – 0.07)	0.173
Observations	600
AIC	2735.827

Hm…Something has gone wrong here. The estimates of this model are nowhere near their real values and, except for the intercept, none other differs from $0$. For example, the estimate for stim is 0.23 whereas its real value is 0.7. Let’s plot the variance-covariance matrix of the estimates to investigate:

Expand code

maxVCOV = max(signif(range(vcov(glm2)),1))
scaleLimits = c(-maxVCOV, maxVCOV)

PlotVCOV = function(mdl, scaleLimits) {
  cvMat = vcov(mdl)
  cvMat[lower.tri(cvMat)] = NA
  cvMat = melt(cvMat, na.rm = TRUE)
  ggplot(data = cvMat, aes(x = Var1, y = Var2, fill = value)) + 
    geom_raster() +
    geom_text(aes(label = round(value, 3)), color = "black") +
    scale_y_discrete(labels = c("intercept", attr(glm2$terms, "term.labels"))) +
    scale_x_discrete(labels = c("intercept", attr(glm2$terms, "term.labels"))) +
    my_theme +
    theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                     size = 12, hjust = 1),
          panel.grid = element_blank()) + 
    scale_fill_distiller(palette = "RdYlBu", limits = scaleLimits)
}

PlotVCOV(glm2, scaleLimits)

It appears that the variance-covariance matrix takes values of different orders of magnitudes, based on whether the corresponding estimates include the variable temp or not. The reason is that the scale of the variable temp is in degrees Celsius, which is on a completely different order of magnitude compared to the rest of the variables (which are either $0$ or $1$), and this fact is affecting the numerical stability of the algorithm used for the estimates. The way to solve the problems caused by this issue is to scale the variable temp.

glm3 = glm(nSpikes ~ 1 + stim*area*scale(temp), family = poisson, data)
PlotVCOV(glm3, scaleLimits)

The variance-covariance matrix is now much more homogeneous, as we can observe when we plot both in the same range. More importantly, its values are closer to $0$, which is a requirement for the validity of our inferences from the model.

Another reason to scale, or at least center a continuous variable such as temperature by subtracting its mean, is so that the interpretation of its coefficients is more meaningful. If we had left temp on its original scale, all coefficients except those for interaction terms that include temp would refer to 0 degrees Celsius, which is not meaningful in this case. By centering the continuous variable temp, all estimates that do not include an interaction with it refer to its average. By scaling temp to its standard deviation, we can interpret its coefficients as the effect of one standard deviation change in temp on the response variable. It follows that it may not be required to center or scale a variable when $0$ is a meaningful value or when all variables are approximately on the same scale.

We can now have a look at the estimates of the model:

Expand code

ConstructModelTable(glm3, show.r2 = FALSE)

	n Spikes
Predictors	Log-Mean	p
Intercept	1.70 (1.63 – 1.77)	<0.001
stim	0.67 (0.59 – 0.76)	<0.001
area	-1.86 (-2.05 – -1.67)	<0.001
scale(temp)	-0.03 (-0.09 – 0.04)	0.441
stim:area	0.82 (0.61 – 1.03)	<0.001
stim:scale(temp)	0.02 (-0.07 – 0.10)	0.683
area:scale(temp)	0.06 (-0.13 – 0.26)	0.509
stim:area:scale(temp)	-0.15 (-0.36 – 0.07)	0.173
Observations	600
AIC	2735.827

The estimates of this model are much closer to the real values. How can we interpret the returned coefficients? The intercept estimate, 1.7, corresponds to the case where all factors are set to $0$, i.e. for observations in the thalamus (since that is the first level of the factor area) with no stimulation (since that is the first level of the factor stim) at the average temperature (since we have centered and scaled the variable temp). Since we are using a GLM with the Poisson distribution and the canonical link, the logarithm function, we need to transform the returned value back to the original scale: $exp(1.7) = 5.47$ spikes per trial. The coefficient for stim, 0.673, corresponds to the case where all variables except stim are set to $0$, i.e. for observations in the thalamus at average temperature with stimulation. To obtain the estimated number of spikes for those observations, we need to multiply the exponentiated estimates for the intercept and stim, or, equivalently, add them and then exponentiate: $ exp(1.7 + 0.673) = 10.7 $. The coefficient for temp expresses the influence of a unit change in scaled temperature (in standard deviations) for observations in the thalamus under no stimulation and is not detected to differ from $0$. The interaction term stim:area corresponds to the estimated number of spikes per trial with stimulation for observations in the cortex, in excess of those estimated for the intercept, stim, and area. To calculate the estimated number of spikes for observations in the cortex under stimulation we need to sum all those coefficients for which stim and area are not $0$ but temp = 0, i.e. $ exp(1.7 + 0.673 + (-1.86) + 0.818) = 3.78 $. We can interpret the rest of the coefficients in a similar fashion.

The last entry of the table reports the Akaike Information Criterion (AIC) which is a measure of the information lost by the fitted model, penalized for the additional flexibility of the model due to the number of parameters included in the model and is defined as

$$ AIC = 2k -2 ln(\hat{L}) $$ where $k$ is the number of parameters in the model and $\hat{L}$ is the maximum likelihood estimate. The AIC is useful when we compare multiple models to find the model that is better able to explain the data with the minimum number of parameters. The model with the lowest AIC can be considered the most parsimonious model. In this case, while we have increased the number of parameters by including stim, area, temp and all their interactions, the lower AIC value still justifies this more expanded model over the reduced model we used before. I should note that the AIC is only used as a rule of thumb and it is not a statistical test. In the context of statistical inference, the inclusion of a parameter in a model should be primarily informed by prior theoretical considerations rather than trying to minimize the AIC or any other metric. In the context of prediction, various techniques have been developed to deal with the problem of variable selection that can be used without the risk of commiting data dredging.

We can now evaluate the model. I will first plot the effect of the variable temp on nSpikes for different combinations of the other two variables, stim and area, with independent Poisson regression fits overlaid for each combination, to indicate whether there is any trend in the data.

Expand code

ggplot(data, aes(x = temp, y = nSpikes)) +
  geom_point(alpha = 0.2) +
  facet_grid(cols = vars(area), rows = vars(stim), as.table = FALSE) +
  geom_smooth(formula = y ~ x, 
              method = glm, 
              method.args = list(family = poisson), 
              color = "black") +
  my_theme

We can see no real trend in predicting nSpikes from temp in none of the combinations with the other variables, as we should expect since we did not include the variable temp in the generation of the data. Although we can already observe some effects for the variables stim and area, the comparisons are not visually straight-forward between all the different combinations. To make comparisons simpler, let’s try a different plot where we ignore variations in temp, by evaluating our Poisson regression model at the average temperature:

Expand code

sim2 = simulateResiduals(glm3, seed = seed, n = nSim)

newData = cbind(data[c("stim", "area")], temp = mean(data$temp))
glmPredict = predict(glm3, se.fit = TRUE, newdata = newData)
ci95 = glmPredict$fit + outer(glmPredict$se.fit, qnorm(c(0.025, 0.975)))
predTable = cbind(exp(glmPredict$fit), 
                  exp(ci95[, 1]), 
                  exp(ci95[, 2]))

p5a_temp = PlotBeeswarm(data, my_theme, colour = area)
p5a = OverlayPrediction(p5a_temp, data, predTable, 
                        fillValue = c("#D65D00", "#29A2FF"), 
                        fill = area,
                        colour = "black",
                        showLegend = TRUE)
p5b = PlotQQGLM(sim2, my_theme)

p5c = PlotResidualHistogram(sim2$scaledResiduals, blank_style)
ArrangePlots(p5a, p5b, p5c)

Each dot in panel a represents once again a raw observation, split this time by area designated by the two different colours. The overlaid squares and lines represent the model’s estimates for the respective data subsets, evaluated for $temp = 0$, i.e. the average temperature, since we saw that this variable has no effect on the estimated number of spikes. In panel b, I have plotted the expected versus observed distribution of the simulated residuals. Although the simulated residuals are now closer to being uniformly distributed compared to the simple Poisson regression we ran before, we can still see that they deviate from the expected distribution.

Mixed-effects Poisson regression

A parameter that we have not yet considered in our models is that our observations are not independent of each other, as we have multiple measurements from every neuron. To take this non-independence into account we need to fit a generalized linear mixed-effects model. For this, we will include in our formula a random-effects term, (1|id), which allows for a common offset for estimates of observations coming from the same neuron.

glmm = glmer(nSpikes ~ 1 + stim*area*scale(temp) + (1|id), 
             family = poisson, 
             data)

## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl =
## control$checkConv, : Model failed to converge with max|grad| = 0.00325159
## (tol = 0.001, component 1)

As the warning above indicates we run into a small problem –the iterative algorithm did not converge. Problems such as this one can often be mitigated by choosing appropriate parameters for the optimization. For GLMMs, an integral over the space of random effects needs to be approximated to minimize the log-likelihood function. By default, glmer uses the Laplace approximation to estimate the MLE. But we can get more accurate results, at the expense of processing time, by utilizing the adaptive Gauss-Hermite approximation instead. We can do that by setting the number of nodes in the quadrature formula through the nAGQ argument. I show the results for nAGQ = 15 but identical results can be obtained with lower values.

glmm = glmer(nSpikes ~ 1 + stim*area*scale(temp) + (1|id), 
             family = poisson, 
             data, 
             nAGQ = 15)

Expand code

ConstructModelTable(glmm, 
                    show.r2 = FALSE, 
                    show.icc = FALSE)

	n Spikes
Predictors	Log-Mean	p
Intercept	1.65 (1.43 – 1.87)	<0.001
stim	0.68 (0.59 – 0.76)	<0.001
area	-1.87 (-2.22 – -1.52)	<0.001
scale(temp)	0.03 (-0.04 – 0.09)	0.461
stim:area	0.82 (0.61 – 1.03)	<0.001
stim:scale(temp)	-0.01 (-0.10 – 0.08)	0.801
area:scale(temp)	0.03 (-0.17 – 0.22)	0.791
stim:area:scale(temp)	-0.12 (-0.33 – 0.10)	0.290
Random Effects
σ²	0.22
τ₀₀ _id	0.12
Observations	600
AIC	747.770

The table is divided into two parts. The fixed effects’ estimates can be interpreted as before, except that they now refer to the effect on the average cell (and not the average effect across cells). The term $\sigma^2$ refers to the residual variance and $\tau_{00, id}$ refers to the variance among the random effects.

Let’s proceed to model evaluation. We need to evaluate the random effects to confirm that they are approximately normally distributed.

Expand code

glmmResiduals = residuals(glmm)

sjp = plot_model(glmm, "diag")$id 

p6a = ggplot(data = sjp$data,
             aes(x = nQQ, 
                 y = y, 
                 ymin = conf.low, 
                 ymax = conf.high)) + 
  geom_linerange() + 
  geom_point(size = 2) +
  stat_smooth(method = "lm", 
              colour = "black") +
  labs(x = "Standard normal quantiles",
       y = "Random effect quantiles") +
  my_theme

p6b = ggplot(data, aes(id, glmmResiduals), guide = "Residuals") +
  geom_hline(yintercept = 0, size = 1, linetype = "dashed", alpha = 0.3) +
  geom_smooth(method = "loess", colour = "black") + 
  geom_point(alpha = 0.3) + 
  ylab("Residuals") +
  my_theme
plot_grid(p6a, p6b)

On the left, I plot the expected quantiles of the normal distribution versus the conditional modes of the random effects, which we can think of as the model’s prediction about how much each neuron deviates from the marginal (average) population estimate. We see that they are fairly well fit by a straight line. On the right, I plot the residuals of the individual observations for every neuron. In this plot, the residuals show no detectable structure and are fairly close around the horizontal line $y = 0$. Perfect, we can now move to the evaluation of the simulated residuals for this model.

Expand code

simGLMM = simulateResiduals(glmm, seed = seed, n = nSim, use.u = TRUE)
p7a = PlotQQGLM(simGLMM, my_theme)
p7b = PlotResidualHistogram(simGLMM$scaledResiduals, blank_style)
plot_grid(p7a, p7b)

The simulated residuals of this model show no clear deviation from the uniform distribution, suggesting that it adequately addresses the structure of the data. We have finally arrived at our destination!

Summary

Use a model that respects the structure of your data (appropriate distribution assumptions, random effects)
Check your model’s output by evaluating diagnostic plots
Center and/or scale continuous variables on scales with different orders of magnitude from each other
Use data transformations with care

https://stats.stackexchange.com/a/58077/97671 ↥

Vassilis Kehayas

Comments

Jasonwrork
19/07/2022

I’m not sure where you’re getting your information, but good topic. I needs to spend some time learning more or understanding more. Thanks for wonderful information I was looking for this information for my mission.