by Camille Lévesque & Katherine Hébert
This workshop was developed with support from the NSERC CREATE Computational Biodiversity Science and Services (BIOS²) training program.
Part 2 of this workshop is (heavily!) based on Nicholas J. Clark’s Physalia course about Ecological forecasting with mvgam and brms: nicholasjclark.github.io/physalia-forecasting-course/.
This course is designed to demystify hierarchical models as powerful tools to model population dynamics, spatial distributions, and nonlinear relationships in your ecological data. The training is divided into two blocks.
First, we will cover hierarchies in biology, data, and in models to understand what hierarchical models are, some of the forms they can take, and the fundamentals of how they work.
Second, we will introduce latent variable modelling as a way to explain even more of the variation in our response variables, to better disentangle the hierarchies of variation in our data.
Understand how a hierarchical model works, and how it can be used to capture nonlinear effects
Understand dynamic modelling, and how latent variables can be used to capture dynamic processes like temporal or spatial autocorrelation
Use the R packages mgcv and mvgam to build and fit hierarchical models
Understand how to visualize and interpret hierarchical models with these packages
Familiarity with Generalized Additive Modelling
Install R & RStudio
Install the required packages:
Let’s now load them, and set the ggplot theme for all plots we make:
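The load-and-theme step can be sketched as follows; the exact package list is an assumption, based on the packages used later in this tutorial:

```r
# Load the packages used throughout the workshop
# (this list is an assumption; adjust to the packages you installed)
library(mgcv)             # fitting HGAMs (Part 1)
library(mvgam)            # dynamic GAMs (Part 2)
library(dplyr)            # data manipulation
library(tidyr)            # data reshaping
library(ggplot2)          # plotting
library(marginaleffects)  # model visualization

# Set a clean ggplot theme for all plots in this tutorial
theme_set(theme_classic())
```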
Follow along with these slides using the tutorial, available at LINK HERE.
You can also download the model objects and other saved objects to lighten the computation time while following along with this presentation.
In this workshop, we will be analyzing time series of plankton counts (cells per mL) taken from Lake Washington in Washington, USA during a long-term monitoring study. The data are available in the MARSS package.
The data download and preparation steps are identical to the steps in Nicholas J. Clark’s Physalia course.
Let’s have a look at the data’s structure:
# A tibble: 6 × 6
year month y series temp time
<dbl> <dbl> <dbl> <fct> <dbl> <int>
1 1965 1 -0.542 Greens -1.23 1
2 1965 1 -0.344 Bluegreens -1.23 1
3 1965 1 -0.0768 Diatoms -1.23 1
4 1965 1 -1.52 Unicells -1.23 1
5 1965 1 -0.491 Other.algae -1.23 1
6 1965 2 NA Greens -1.32 2
Let’s also plot the data to get a first look at the time series:
If we plot the data for one of the groups (diatoms), we see that the relationship between abundance and time is not linear.
If we fit a linear model to these data, we come to the same conclusion.
We can plot the relationship between time and abundance using ggplot:
Now, if we fit a GAM to our data...
We immediately see a better fit, as the spline seems to follow our data points more accurately.
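As a sketch of the two fits described above (assuming the plankton_data object shown earlier, with columns y, time, and series):

```r
library(mgcv)
library(ggplot2)

# Subset one group (assumes plankton_data as shown earlier)
diatoms <- subset(plankton_data, series == "Diatoms")

# Linear fit: a straight line misses the fluctuations
mod_lm <- lm(y ~ time, data = diatoms)

# GAM fit: a smooth of time can follow the wiggly pattern
mod_gam <- gam(y ~ s(time), data = diatoms, method = "REML")

# Overlay both fits on the raw data
ggplot(diatoms, aes(time, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, colour = "red") +
  geom_smooth(method = "gam", formula = y ~ s(x), colour = "blue")
```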
In the GAM world we have a specific term for this: ✨wiggliness✨
Wiggliness refers to the degree of flexibility in the model’s fitted function. It describes how much the model can fluctuate to fit the data.
The number of basis functions (set by k, often described in terms of knots) is the maximum wiggliness we allow the relationship to have.
Nicholas Clark - Temporal autocorrelation in GAMs and the mvgam package
The model penalizes the wiggliness (penalized log likelihood), which should shrink down our wiggliness to the right amount to match the patterns in the data.
If the wiggliness is too high, we risk overfitting, which results in a model that is less generalizable and more specific to this exact set of data.
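One way to see the penalty at work is to compare fits with different basis dimensions. This is a hedged sketch, again assuming the plankton_data object from earlier:

```r
library(mgcv)

# Subset one group (assumes plankton_data as shown earlier)
diatoms <- subset(plankton_data, series == "Diatoms")

# A small k caps wiggliness; a large k lets the penalty decide
mod_k5  <- gam(y ~ s(time, k = 5),  data = diatoms, method = "REML")
mod_k60 <- gam(y ~ s(time, k = 60), data = diatoms, method = "REML")

# The penalty shrinks the effective degrees of freedom (edf) well
# below k - 1, so a generous k mostly costs computation, not fit
summary(mod_k5)$edf
summary(mod_k60)$edf
```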
wiggliness ≠ shape of the curve
It refers to the organization of organisms into levels
You’re probably familiar with the taxonomic hierarchy
We find that nature often tends to “organize” itself into hierarchies.
We find hierarchies in data when a data item is linked to another data item by a hierarchical relationship, where we would qualify one data item to be the “parent” of the other.
“month” is nested in “year”
“series” = grouping variable with 5 levels
# A tibble: 6 × 6
year month y series temp time
<dbl> <dbl> <dbl> <fct> <dbl> <int>
1 1965 1 -0.542 Greens -1.23 1
2 1965 1 -0.344 Bluegreens -1.23 1
3 1965 1 -0.0768 Diatoms -1.23 1
4 1965 1 -1.52 Unicells -1.23 1
5 1965 1 -0.491 Other.algae -1.23 1
6 1965 2 NA Greens -1.32 2
Hierarchical generalized linear models, or hierarchical models (HMs), are used to estimate linear relationships between predictors and a response variable, while imposing a structure in which the observations are organized into groups.
The relationship between predictors and response variable may differ across groups.
Pedersen et al. (2019)
A hierarchical generalized additive model, or HGAM, is the integration of GAMs and HMs into one model. HGAMs enable us to estimate nonlinear relationships between predictors and a response variable when at least one of the predictors is a grouping variable.
We’ve previously identified hierarchies in the data; now, how do we incorporate them into our models? We represent these hierarchies using smooth functions in our equations.
Let’s break it down.
s(time, k = k_number_time_steps, bs = "cs")
s() = smooth function
k = number of knots; can be the number of time steps or any fraction of that number
bs = type of basis functions
Thin Plate Regression Splines (“tp” in mgcv): General purpose spline basis. If you do not have phenology in your data and your smooth only has one covariate = probably this type of basis functions. To learn more –> Wood (2003)
Random Effect (“re”): Similar to random effect in mixed model. Creates an intercept per grouping level.
Cyclic Cubic Regression Splines (“cc”): Useful to fit models with cyclic components or phenology, where the last value of the cycle wraps back around to the first (e.g., December to January). Note that “cs” is a different basis in mgcv: the shrinkage version of the cubic regression spline.
Tensor product (“te”): We use this to look at interaction between predictors if we believe it to have an impact on the relationship between the predictors and the response variable. For example: the interaction of temperature and latitude on the abundance of a species of plant.
Factor Smoother (“fs”): Produces a smooth curve for each level of a single factor. This means that a model can be constructed with a main effect plus factor level smooth deviations from that effect.
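As a quick reference, here is how each basis type above might appear in a model formula (the variable names are illustrative, drawn from our plankton data; lat is hypothetical):

```r
# Each line is a valid R model formula using one basis type
y ~ s(time, bs = "tp")          # thin plate: general-purpose smooth
y ~ s(series, bs = "re")        # random effect: one intercept per group
y ~ s(month, bs = "cc")         # cyclic cubic: December wraps to January
y ~ te(temp, lat)               # tensor product: temp x latitude interaction
y ~ s(time, series, bs = "fs")  # factor smoother: one curve per group
```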
Global trend across all levels of a grouping variable
Species-specific trends sharing the same wiggliness, but differing in shapes
From Hierarchical generalized additive models in ecology: an introduction with mgcv, by Pedersen et al. (2019)
Model G: A single common smoother for all observations
modG <- gam(y ~ # Abundance
# Global Smooth
s(time, k = 60, bs = "cs") + # k = 60, half the number of time steps (120)
# bs = "cs", a cubic regression spline with shrinkage
# Series as random effect
s(series, k = 5, bs = "re"), # k = 5, because number of series = 5
# bs = "re", to get a random effect
method = "REML", # Standard method for HGAMs
family = "gaussian",
data = plankton_data
)
Not the greatest fit!
Family: gaussian
Link function: identity
Formula:
y ~ s(time, k = 60, bs = "cs") + s(series, k = 5, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.151e-09 3.422e-02 0 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(time) 3.202e+01 59 4.608 <2e-16 ***
s(series) 4.015e-04 4 0.000 1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.321 Deviance explained = 35.9%
-REML = 770.66 Scale est. = 0.67427 n = 576
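To see why the fit is underwhelming, mgcv's built-in diagnostics are a quick check (a sketch; other packages can produce nicer plots):

```r
# Partial effect of each smooth in modG, all on one page
plot(modG, pages = 1, shade = TRUE)

# Residual diagnostics and basis-dimension (k) checks
gam.check(modG)
```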
Model GS: A global smoother plus group-level smoothers that have the same wiggliness.
modGS <- gam(y ~ # Abundance
# Global Smooth
s(time, k = 60, bs = "cs", m = 2) + # k = 60, half the number of time steps (120)
# bs = "cs", a cubic regression spline with shrinkage
# Series as random effect
s(time, series, k = 60, bs = "fs", m = 2), # k = 60, matching the global smooth
# bs = "fs", factor smoother
method = "REML", # Standard method for HGAMs
family = "gaussian",
data = plankton_data
)
We can see more clearly the type of predictions the model makes for each level.
Family: gaussian
Link function: identity
Formula:
y ~ s(time, k = 60, bs = "cs", m = 2) + s(time, series, k = 60,
bs = "fs", m = 2)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.007795 0.023226 -0.336 0.737
Approximate significance of smooth terms:
edf Ref.df F p-value
s(time) 35.62 59 2.715 <2e-16 ***
s(time,series) 104.77 299 2.223 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.699 Deviance explained = 77.2%
-REML = 734.91 Scale est. = 0.29908 n = 576
Short answer: Probably!
Longer answer: There’s always a trade-off: a more complex model accounts for more of the variation in the data, but is more prone to overfitting.
However, these two models were pretty simple, and in the next part of the training we’ll see how to reach another level of model complexity.
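One quick, informal way to weigh this trade-off numerically is an information criterion; for gam objects, AIC() accounts for the effective (penalized) complexity of each model:

```r
# Lower AIC indicates a better balance of fit and complexity
AIC(modG, modGS)
```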
Time for a little break!
In Part 1, we modelled how plankton counts fluctuated through time with a hierarchical structure to understand how different groups vary through time.
We noticed a strong seasonal pattern in these fluctuations.
Before we continue, let’s “zoom out” a little and think about how our model relates to ecology.
Our model says that plankton abundances generally follow the seasonal fluctuations in temperature in Lake Washington, and that different plankton groups fluctuate a bit differently.
All we have to explain these fluctuations is:
Time
Plankton groups
Temperature
There are probably unmeasured predictors that play some role here:
water chemistry
process of measuring these plankton counts through time (i.e., measurement variability or observation error)
What else can you think of?
These factors all leave an “imprint” in the time series we have here.
And most importantly, some of these factors vary through time - they are themselves time series with their own temporal patterns.
Do we have any measured predictors to add to the model to capture the influence of these factors?
Unfortunately, as is often the case in ecological studies, we do not have measurements of these predictors through time…
There are three important things to think about here:
There are unmeasured predictors that influence the trend we measured and are now modelling. The signature of these unmeasured processes is in the data, but we cannot capture it with the previous model because we did not include these predictors. This is called “latent variation”;
Each species responds to these unmeasured predictors in their own way. (This is where the “hierarchical” bit comes in!)
These processes are not static - they are dynamic. Like our response variable, these unmeasured predictors vary through time.
The definition of “latent” in the Merriam-Webster dictionary is:
Present and capable of emerging or developing but not now visible, obvious, active, or symptomatic.
OR
a fingerprint (as at the scene of a crime) that is scarcely visible but can be developed for study
It can be helpful to use this definition to conceptualise latent variation in our data.
As we said above, there are “imprints” of factors that we didn’t measure on the time series we are modelling. These signals are present and influence the trends we estimate, but are “scarcely visible” because we lack information to detect them.
In statistical models, latent variables are essentially random predictors that are generated during the modelling process to capture correlated variations between multiple responses (species, for example).
The idea is that a bunch of these latent variables are randomly generated, and they are penalized until the model only retains a minimal set of latent variables that capture the main axes of covariation between responses. (Here, it can be useful to think of how a PCA reduces the dimensions of a dataset to a minimum of informative “axes”).
Each species is then assigned a “factor loading” for each latent variable, which represents the species’ response to the latent variable.
In a temporal model, these latent variables condense the variation that is left unexplained into a “predictor” that captures some structured pattern across responses (e.g. species’ abundance trends) through time.
For example, latent variables can capture temporal autocorrelation between observations, meaning that we can tell our model to account for the fact that each observation is probably closer to neighbouring values though time (e.g. year 1 and year 2) than to a value that is several time steps away (e.g. year 1 and year 20).
In a spatial model, latent variables can be applied to capture dynamic processes in space, such as spatial autocorrelation.
Similarly to the temporal model, we often make the assumption that observations that are closer in space are more similar than observations that are further apart, because ecological patterns are in part driven by the environment, which is itself spatially-autocorrelated.
Ok, and what do we mean by “dynamic”?
A dynamic factor model assumes the factors that predict our response variable (i.e. biomass) evolve as time series too.
In other words, a dynamic factor model provides even more flexibility to capture the “wiggly” effects of these predictors on our response variable through time.
mvgam
The package mvgam is an interface to Stan, a language used to specify Bayesian statistical models. You could code these models directly in Stan if you wanted to - but mvgam allows you to specify the model in R (using the R syntax you know and love) and produces and compiles the Stan file for your model for you. Isn’t that nice?
Today, we’re just scratching the surface of mvgam, but there are many functionalities and applications of the package that you can explore if you’re interested. For example, there are blog posts about integrating phylogenetic information into your models, how to build spatial models like Joint Species Distribution Models, and more.
Check it out! nicholasjclark.github.io/mvgam.
In Part 2, we will build a Bayesian model but will not discuss Bayesian theory or practices in detail. To learn more:
Towards A Principled Bayesian Workflow by Michael Betancourt
Bayes Rules! An Introduction to Applied Bayesian Modeling by Alicia A. Johnson, Miles Q. Ott, Mine Dogucu.
How to change the default priors in mvgam in this example.
More documentation, vignettes, and tutorials can be found on the mvgam site.
First, load the mvgam package and the marginaleffects package, which we will use to visualize the model:
Our model is an attempt to estimate how plankton vary in abundance through time. Let’s consider what we know about the system, to help us build this model:
Seasonal variation due to temperature changes within each year that drives all plankton groups to fluctuate through time.
We are also interested in how plankton abundance is changing annually, on the longer term of the time series.
We know that these plankton are embedded in a complex system within a lake, and that their dynamics may be interdependent for many reasons. In other words, some groups may be correlated through time, and even more complicated - these correlations may not be happening immediately. There may be lagged correlations between groups as well!
Let’s build this model piece by piece, to capture each of these levels of variation.
First, let’s split the data into a training set, used to build the model, and a testing set, used to evaluate the model.
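A minimal sketch of this split, assuming we hold out the end of the time series for testing (the cutoff at time step 112 is an assumption, not a rule):

```r
library(dplyr)

# Train on the first time steps, test on the held-out tail
# (the cutoff at time = 112 of 120 steps is an assumption)
plankton_train <- plankton_data %>% filter(time <= 112)
plankton_test  <- plankton_data %>% filter(time > 112)
```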
Next, we’ll build a hierarchical GAM with a global smoother for all groups, and species-specific smoothers in mvgam.
This will allow us to capture the “global” seasonality that drives all plankton groups to fluctuate similarly through the time series, and to capture how each group’s seasonal fluctuations differ from this overall, global trend.
notrend_mod <- mvgam(formula =
y ~
# tensor of temp and month to capture
# "global" seasonality across plankton groups
te(temp, month, k = c(4, 4)) +
# series-specific deviation tensor products
# in other words, each plankton group can deviate from the
# global trend in its own way
te(temp, month, k = c(4, 4), by = series),
# set the observation family to Gaussian
family = gaussian(),
# our long-format dataset, split into a training and a testing set
data = plankton_train,
newdata = plankton_test,
# no latent trend model for now (so we can see its impact later on!)
trend_model = 'None',
# compilation & sampling settings
use_stan = TRUE,
# here, we are only going to use the default sampling settings to keep the
# model quick for the tutorial. If you were really going to run
# this, you should set the chains, samples, and burnin arguments.
)
This is the shared seasonal trend estimated across all groups at once.
We can see that the model was able to capture the seasonal temporal structure of the plankton counts.
We can also see how each group deviates from this global smooth (i.e. the partial effects):
The colour shows us the strength of the partial effect. We can see that the model was able to capture some differences between each plankton group’s seasonal trend and the community’s global trend.
But how’s the model doing, overall? Let’s plot the residuals:
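In mvgam, the residual diagnostics for a given series can be drawn with plot(); the autocorrelation panel is the one that reveals leftover temporal structure:

```r
# Residual diagnostics for the first series
# (includes histogram, Q-Q, and autocorrelation panels)
plot(notrend_mod, type = "residuals", series = 1)
```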
There’s still a lot of temporal autocorrelation to deal with in the model. Let’s try to address some of this with a dynamic model!
Let’s now add a dynamic trend component to the model, to capture some of the variation that isn’t captured in our previous model.
You can specify many kinds of dynamic trends with mvgam, but we won’t go into much detail here. We want to demonstrate the power of this approach, but you will need to select the dynamic trend model that works best for you.
Here, we will use a Gaussian Process to model temporal autocorrelation in our data, which can be specified as GP in mvgam.
gptrend_mod <- mvgam(formula=
y ~
te(temp, month, k = c(4, 4)) +
te(temp, month, k = c(4, 4), by = series),
# latent trend model:
# a Gaussian Process, whose correlations decay smoothly
# with the time between observations
trend_model = "GP",
# use dynamic factors to estimate the latent trends
use_lv = TRUE,
# observation family
family = gaussian(),
# our long-format datasets
data = plankton_train,
newdata = plankton_test,
# use reduced samples for inclusion in tutorial data
samples = 100)
Posterior predictive checks are one way to investigate a Bayesian model’s ability to make unbiased predictions. In this check, we’ll be measuring the posterior prediction error of our model.
In other words, how close are our model predictions to the expected value of Y?
These plots are pretty encouraging - our model isn’t generating biased predictions compared to the observation data (black line). Our model is doing a pretty good job!
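These checks can be produced with mvgam's ppc() function; here is a sketch for the first series (other type options include "hist" and "prop_zero"):

```r
# Compare the posterior predictive density to the observed density
ppc(gptrend_mod, series = 1, type = "density")

# Compare predicted and observed means
ppc(gptrend_mod, series = 1, type = "mean")
```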
Plot the global smoother for all species over time:
Plot each species’ estimated trend:
And, importantly, let’s have a look at the dynamic factors estimated by the model to capture temporal dynamics in our response data:
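The figures above come from mvgam's plot() method; as a sketch:

```r
plot(gptrend_mod, type = "smooths")            # estimated smooth functions
plot(gptrend_mod, type = "trend", series = 1)  # latent trend for one series
plot(gptrend_mod, type = "factors")            # the dynamic latent factors
```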
These are the famous dynamic latent variables we’ve been talking about. You can see that they are capturing some of the unexplained temporal variation in our model.
We can plot the partial effect of each variable we included in our model, to understand how they contribute to the overall model.
We often want to know how much our response is changing through time.
We can take a derivative of our estimated trends to get a rate of change in abundance through time.
If the derivative trend is centred on the zero line, the response variable (here, counts) was fluctuating around its mean abundance rather than declining or growing in a consistent way through time.
As an example, let’s look at the blue-green algae trend:
It seems that these blue-green algae were pretty stable in the beginning of the time series, then declined slightly, then declined quickly around time step 100. The counts then increased after this point.
We made a slightly “hacked” version of this function to extract the derivatives:
source(here::here("saved-objects/custom-function.R"))
derivs = lapply(1:5, function(x) plot_mvgam_trend_custom(gptrend_mod, series = x, derivatives = TRUE))
names(derivs) = gptrend_mod$obs_data$series |> levels()
# reformatting the data for easier plotting
df = lapply(derivs, function(x) x[,-1]) |>
lapply(as.data.frame) |>
lapply(pivot_longer, cols = everything()) |>
bind_rows(.id = "Group")
Let’s look at a histogram of the derivatives to get an idea of how this group’s counts changed over time, on average:
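Using the df object assembled above, the histograms can be drawn with ggplot (a sketch):

```r
library(ggplot2)

# One panel of first-derivative draws per plankton group;
# the dashed line marks zero (no net change)
ggplot(df, aes(x = value)) +
  geom_histogram(bins = 40) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  facet_wrap(~ Group)
```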
The median of the distribution of rates of change is ~0 for all the groups.
One way to investigate community dynamics is to check out the correlations between plankton groups.
These correlations are captured with the dynamic model, which estimates unmeasured temporal processes in our data.
Let’s calculate and plot species’ temporal correlations and plot them as a heatmap:
# Calculate the correlations from the latent trend model
species_correlations = lv_correlations(gptrend_mod)
# prepare the matrix for plotting
toplot = species_correlations$mean_correlations
colnames(toplot) = gsub("_", "\n", stringr::str_to_sentence(colnames(toplot)))
rownames(toplot) = gsub("_", "\n", stringr::str_to_sentence(rownames(toplot)))
# plot as a heatmap
corrplot::corrplot(toplot,
type = "lower",
method = "color",
tl.cex = 1, cl.cex = 1, tl.col = "black", font = 3)
So, which model is best?
We will use leave-one-out (LOO) comparison to compare the two models we just built: notrend_mod, which does not have a dynamic trend, and gptrend_mod, which includes a latent Gaussian Process trend model.
This workshop isn’t really about model comparison, so we will just use this approach as an example, but there are other ways you could compare these models.
One way is to compare the predictive performance of the models on new data using a metric of comparison like expected log pointwise predictive density (ELPD).
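mvgam models can be compared directly with loo_compare(), which reports ELPD differences relative to the best model:

```r
# The best model appears first with elpd_diff = 0;
# worse models receive negative elpd_diff values
loo_compare(notrend_mod, gptrend_mod)
```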
The dynamic trend model performs better than the model without one (here, we want higher - less negative - values of elpd_diff).
elpd_diff se_diff
AR1 trend 0.0 0.0
No trend -101.6 11.7
Another way is to consider the ecological processes that influence our response variable. Which model do you think does a better job of representing the factors that cause plankton counts to vary through time?
We hope you’ve enjoyed this short tutorial on hierarchical generalized additive models with mgcv and mvgam.
At the end of this tutorial, we hope you now feel you understand how a hierarchical model works, and how it can be used to capture nonlinear effects. The R packages mgcv and mvgam are great tools to build and fit hierarchical models!
We also covered dynamic modelling, and how latent variables can be used to capture dynamic processes like temporal or spatial autocorrelation.
Can these models help you answer your research questions?
Thank you to BIOS2 and Gracielle Higino, Andrew Macdonald, and Kim Gauthier-Schampaert for supporting this workshop. Special thanks to Nicholas J. Clark for helpful suggestions and for providing much of the material we used to build this workshop.
If you’re interested in talking more about these models, please join our community calls during March and April 2025! We welcome anyone interested in GAMs, computational ecology, or eager to learn more about HGAMs to participate in the following sessions:
Temporal Ecology with HGAMs
When: March 7, 2025 at 1:30 PM (EST)
Spatial Ecology with HGAMs
When: March 14, 2025 at 1:30 PM (EST)
Behavioural Ecology with HGAMs
When: March 24, 2025 at 1:30 PM (EST)
Join us in thinking about how we could use HGAMs to push ecological research forward!