# install packages from CRAN
# install.packages(pkgs = c("dplyr", "tidyr", "mvgam", "mgcv", "gratia", "marginaleffects", "corrplot", "ggplot2"), dependencies = TRUE)
library("dplyr")
library("tidyr")
library("mvgam")
library("mgcv")
library("gratia")
library("marginaleffects")
library("corrplot")
library("ggplot2")
# set all ggplots to this theme
theme_set(theme_minimal())
Introduction to Hierarchical Generalized Additive Models
1 Overview
This course is designed to demystify hierarchical modelling as powerful tools to model population dynamics, spatial distributions, and any non-linear relationships in your ecological data. The training will be divided into two blocks.
First, we will cover hierarchies in biology, data, and in models to understand what hierarchical models are, some of the forms they can take, and the fundamentals of how they work.
Second, we will introduce latent variable modelling as a way to explain even more of the variation in our response variables, to better disentangle the hierarchies of variation in our data.
Both blocks will include a theoretical presentation followed by hands-on coding exercises to implement and interpret hierarchical GAMs.
This workshop was developed with support from the NSERC CREATE Computational Biodiversity Science and Services (BIOS²) training program.
Credits
Part 2 of this workshop is based on Nicholas J. Clark’s Physalia course “Ecological forecasting with mvgam and brms”. It was reworked by Katherine Hébert and Camille Lévesque into this short tutorial (mainly in Part 2), but we recommend having a look at the original for a more in-depth look at the methods we are covering today:
Clark, N. J. (2024). “Ecological forecasting with mvgam and brms”. Physalia. Retrieved from https://nicholasjclark.github.io/physalia-forecasting-course/.
Learning outcomes
Understand how a hierarchical model works, and how it can be used to capture nonlinear effects
Understand dynamic modelling, and how latent variables can be used to capture dynamic processes like temporal or spatial autocorrelation
Use the R packages mgcv and mvgam to build and fit hierarchical models
Understand how to visualize and interpret hierarchical models with these packages
Requirements
Familiarity with Generalized Additive Modelling
We recommend previous experience with GAMs before taking this training. If you would like to follow an introduction to GAMs before this workshop, please have a look at Eric Pedersen’s Introduction to GAMs and/or the Québec Centre for Biodiversity Science’s Workshop 8: GAMs.
Familiarity with Bayesian modelling
Part 2 fits Bayesian models with Stan, so some basic familiarity with Bayesian inference will help you follow along.
R & RStudio
The workshop assumes basic familiarity with R/RStudio. To be able to follow along with the practical exercises on your own computer, in addition to downloading the data files above, you will need to do the following:
Install the latest version of R for your operating system: https://cran.r-project.org/.
Install the latest version of RStudio for your operating system: https://www.rstudio.com/products/rstudio/download/
R packages
Install and load the packages that we will use during the workshop by executing the setup code shown at the top of this document (developed under R version 4.2.0).
Stan
Please note that for Block 2, we will be fitting models using Stan. Stan must be installed, along with either rstan and/or cmdstanr. Please refer to the installation instructions for Stan with cmdstanr here (or with rstan here).
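If you are unsure whether your Stan toolchain works, a quick check along these lines can help. This is a sketch using cmdstanr; the repository URL and function calls follow the cmdstanr documentation, so please double-check against the installation links above:
# Sketch: verify a working CmdStan installation via cmdstanr
# install.packages("cmdstanr", repos = c("https://stan-dev.r-universe.dev", getOption("repos")))
cmdstanr::check_cmdstan_toolchain(fix = TRUE) # checks (and tries to fix) the C++ toolchain
cmdstanr::install_cmdstan()                   # downloads and compiles CmdStan itself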
2 Workshop materials
About the data set
In this workshop, we will be analyzing time series of plankton counts (cells per mL) taken from Lake Washington in Washington, USA during a long-term monitoring study. The data are available in the MARSS package, and are described and analysed in greater detail in this publication:
Hampton, S. E., Scheuerell, M. D., & Schindler, D. E. (2006). Coalescence in the Lake Washington story: Interaction strengths in a planktonic food web. Limnology and Oceanography, 51. doi: 10.4319/lo.2006.51.5.2042.
Some information about the data, from the MARSS package documentation (link):
The lakeWAplankton data set consists of two data sets: lakeWAplanktonRaw and a dataset derived from it, lakeWAplanktonTrans. lakeWAplanktonRaw is a 32-year time series (1962-1994) of monthly plankton counts from Lake Washington, Washington, USA. Columns 1 and 2 are year and month. Column 3 is temperature (C), column 4 is total phosphorous, and column 5 is pH. The next columns are the plankton counts in units of cells per mL for the phytoplankton and organisms per L for the zooplankton.
lakeWAplanktonTrans is a transformed version of lakeWAplanktonRaw. Zeros have been replaced with NAs (missing). The logged (natural log) raw plankton counts have been standardized to a mean of zero and variance of 1 (so logged and then z-scored). Temperature, TP & pH were also z-scored but not logged (so z-score of the untransformed values for these covariates). The single missing temperature value was replaced with -1 and the single missing TP value was replaced with -0.3.
The data download and preparation steps are identical to the steps in Nicholas J. Clark’s Physalia course.
Data download
Load the dataset:
load(url('https://github.com/atsa-es/MARSS/raw/master/data/lakeWAplankton.rda'))
Prepare the data
First, the data need to be prepared in a long format (i.e., one observation per row). In Part 2, the mvgam package will require specific column names, so we will prepare the data to match the package’s requirements.
## Prepare the time series data for analysis
# This code is from https://nicholasjclark.github.io/physalia-forecasting-course/day4/tutorial_4_physalia
# We will work with five different groups of plankton:
outcomes <- c('Greens', 'Bluegreens', 'Diatoms', 'Unicells', 'Other.algae')

# loop across each plankton group to create the long dataframe
plankton_data <- do.call(rbind, lapply(outcomes, function(x){
  # create a group-specific dataframe with counts labelled 'y'
  # and the group name in the 'series' variable
  data.frame(year = lakeWAplanktonTrans[, 'Year'],
             month = lakeWAplanktonTrans[, 'Month'],
             y = lakeWAplanktonTrans[, x],
             series = x,
             temp = lakeWAplanktonTrans[, 'Temp'])})) %>%
  # change the 'series' label to a factor
  dplyr::mutate(series = factor(series)) %>%
  # filter to only include some years in the data
  dplyr::filter(year >= 1965 & year < 1975) %>%
  dplyr::arrange(year, month) %>%
  dplyr::group_by(series) %>%
  # z-score the counts so they are approximately standard normal
  dplyr::mutate(y = as.vector(scale(y))) %>%
  # add the time index
  dplyr::mutate(time = dplyr::row_number()) %>%
  dplyr::ungroup()
Let’s have a look at the data’s structure:
head(plankton_data)
# A tibble: 6 × 6
year month y series temp time
<dbl> <dbl> <dbl> <fct> <dbl> <int>
1 1965 1 -0.542 Greens -1.23 1
2 1965 1 -0.344 Bluegreens -1.23 1
3 1965 1 -0.0768 Diatoms -1.23 1
4 1965 1 -1.52 Unicells -1.23 1
5 1965 1 -0.491 Other.algae -1.23 1
6 1965 2 NA Greens -1.32 2
Let’s also plot the data to get a first look at the time series:
ggplot(data = plankton_data, aes(x = time)) +
geom_line(aes(y = temp), col = "black") + # Temperature in black
geom_line(aes(y = y, col = series), lwd = .4) + # Taxa counts in color
labs(x = "Time", y = "Abundance (z-score)\n(Temperature z-score in black)", col = "Group")
3 Part 1: Hierarchical Generalized Additive Models (HGAMs)
A (very) brief introduction to GAMs
- Definition: Generalized additive models (GAMs) are models that give us enough flexibility to capture nonlinear relationships, without being so “black-boxy” that they become hard to interpret. They offer a middle ground between simple and complex models.
- Why are they relevant in ecology? A lot of relationships in ecology are inherently nonlinear and GAMs are particularly useful in capturing these more complex relationships.
Nonlinear relationship
If we plot the data for one of the groups (diatoms), we see that the relationship between abundance and time is not linear.
diatoms_data <- plankton_data[plankton_data$series == "Diatoms", ]
ggplot(diatoms_data, aes(x = time, y = y)) +
geom_point() +
geom_line(aes(y = y), lwd = .4) +
labs(title = "Diatoms Data Points", x = "Time", y = "Abundance (z-score)") +
theme_minimal()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Linear model on diatoms data
If we fit a linear model on these data, we come to the same conclusion:
diatoms_lm <- lm(y ~ time, data = diatoms_data)
We can plot the relationship between time and abundance using ggplot:
ggplot(diatoms_data, aes(x = time, y = y)) +
geom_point(color = "black") + # Scatter points
geom_smooth(method = "lm", color = "blue1", fill = "lightgray", se = TRUE) + # Regression line
labs(title = "Diatoms Data with Linear Regression Line",
x = "Time",
y = "Y Value") +
theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
GAM on diatoms data
Now, let’s fit a GAM to our data:
# model
diatoms_gam <- gam(y ~ s(time, bs = "cs", k = 50), data = diatoms_data)
We immediately see a better fit, as the spline seems to follow our data points more accurately.
# graph
gratia::draw(diatoms_gam, rug = FALSE)
What are splines?
- Splines (also known as smooths or smoothers) are a type of function used to estimate nonlinear relationships between covariates and the response variable.
- Splines are the sum of simpler functions, called basis functions (see the sketch below).
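To make this concrete, here is a minimal sketch (not part of the original workshop code) that uses mgcv::smoothCon() to evaluate a spline basis directly, and shows that a spline is just a weighted sum of the basis columns. The coefficients here are hypothetical, drawn at random for illustration:
# Build a cubic regression spline basis and sum its weighted columns
x <- seq(0, 1, length.out = 200)
sm <- mgcv::smoothCon(s(x, bs = "cr", k = 6), data = data.frame(x = x))[[1]]
bf <- sm$X                # each column is one basis function evaluated at x
coefs <- rnorm(ncol(bf))  # hypothetical coefficients, for illustration only
spline <- bf %*% coefs    # the spline = the weighted sum of the basis functions
matplot(x, bf, type = "l", lty = 1, ylab = "Basis functions")
lines(x, spline, lwd = 3)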
What determines how “curvy” splines are?
In the GAM world we have a specific term for this: ✨wiggliness✨
Wiggliness refers to the degree of flexibility in the model’s fitted function. It describes how much the model can fluctuate to fit the data.
The number of basis functions (set through the knots argument, k) determines the maximum wiggliness the fitted relationship can have.
Wiggliness
The model penalizes wiggliness (via a penalized log-likelihood), which shrinks the wiggliness down to the right amount to match the patterns in the data.
If the wiggliness is too high, we risk overfitting, which results in a model that is less generalizable and more specific to this exact set of data.
wiggliness ≠ shape of the curve
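To see the penalty at work, here is a small sketch (assuming the diatoms_data object created above) that fixes the smoothing parameter sp at extreme values instead of letting REML estimate it:
# Almost no penalty: the smooth uses nearly all of its wiggliness and overfits
too_wiggly <- gam(y ~ s(time, k = 50), data = diatoms_data, sp = 1e-6)
# Heavy penalty: the smooth is shrunk towards a straight line
too_smooth <- gam(y ~ s(time, k = 50), data = diatoms_data, sp = 1e6)
# Penalty estimated from the data (what we normally want)
just_right <- gam(y ~ s(time, k = 50), data = diatoms_data, method = "REML")
plot(diatoms_data$time, diatoms_data$y, pch = 16, col = "grey")
lines(diatoms_data$time, predict(too_wiggly, diatoms_data), col = "red")
lines(diatoms_data$time, predict(too_smooth, diatoms_data), col = "orange")
lines(diatoms_data$time, predict(just_right, diatoms_data), col = "blue", lwd = 2)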
What is a hierarchy?
What is a hierarchy in biology?
It refers to the organization of organisms into levels
You’re probably familiar with the taxonomic hierarchy
We find that nature often tends to “organize” itself into hierarchies.
What is a hierarchy in data?
We find hierarchies in data when a data item is linked to another data item by a hierarchical relationship, where we would qualify one data item to be the “parent” of the other.
head(plankton_data)
# A tibble: 6 × 6
year month y series temp time
<dbl> <dbl> <dbl> <fct> <dbl> <int>
1 1965 1 -0.542 Greens -1.23 1
2 1965 1 -0.344 Bluegreens -1.23 1
3 1965 1 -0.0768 Diatoms -1.23 1
4 1965 1 -1.52 Unicells -1.23 1
5 1965 1 -0.491 Other.algae -1.23 1
6 1965 2 NA Greens -1.32 2
“month” is nested in “year”
“series” = grouping variable with 5 levels
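We can confirm this grouping structure directly in R (a quick check, not part of the original workshop code):
# five series, each observed at the same monthly time steps
levels(plankton_data$series)
dplyr::count(plankton_data, series)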
What is a hierarchical model (HM)?
Hierarchical generalized linear models, or hierarchical models (HMs), are used to estimate linear relationships between predictors and a response variable, while imposing a structure in which the data are organized into groups.
The relationship between predictors and response variable may differ across groups.
Pedersen et al. (2019)
Now that we have seen GAMs and HMs… What are HGAMs?
HGAM
A hierarchical generalized additive model, or HGAM, is the integration of GAMs and HMs into one model. HGAMs enable us to estimate nonlinear relationships between predictors and a response variable when at least one of the predictors is a grouping variable.
We’ve previously identified hierarchies in the data; now, how do we incorporate them into our models? We represent these hierarchies using smooth functions in our equations.
HGAM equation with mgcv
# gam(y ~ s(time, k = k_number_time_steps, bs = "cs") +
#       s(time, series, k = number_series, bs = "fs", m = 2),
#     family = "gaussian",
#     data = plankton_data,
#     method = "REML")
Let’s break it down.
Smooth function
s(time, k = k_number_time_steps, bs = "cs")
s() = smooth function
k = number of knots; can be the number of time steps or any fraction of that number
bs = type of basis functions
Types of basis functions
Thin Plate Regression Splines (“tp” in mgcv): a general-purpose spline basis, and a good default when your smooth has a single covariate and no cyclic (e.g. phenological) component. To learn more, see Wood (2003).
Random Effect (“re”): Similar to random effect in mixed model. Creates an intercept per grouping level.
Cubic Regression Splines with shrinkage (“cs”): a cubic spline basis whose penalty can shrink the whole smooth towards zero. Note that the cyclic cubic basis, useful for cyclic components or phenology, is “cc” in mgcv.
Tensor product (“te”): we use this to look at interactions between predictors, if we believe they affect the relationship between the predictors and the response variable. For example: the interaction of temperature and latitude on the abundance of a plant species.
Factor Smoother (“fs”): Produces a smooth curve for each level of a single factor. This means that a model can be constructed with a main effect plus factor level smooth deviations from that effect.
Factor smoother
Global trend across all levels of a grouping variable
Species-specific trends sharing the same wiggliness, but differing in shapes
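For quick reference, here is a sketch of how each basis type above might appear in an mgcv formula (the covariate names x, group, doy, temp, and lat are hypothetical):
# s(x, bs = "tp")          # thin plate regression spline (mgcv's default)
# s(group, bs = "re")      # random intercept for each level of `group`
# s(doy, bs = "cc")        # cyclic cubic spline, e.g. day-of-year phenology
# te(temp, lat)            # tensor product interaction of two covariates
# s(x, group, bs = "fs")   # factor smoother: one curve per level of `group`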
Types of functional variations that can be fitted with HGAMs
- Based on Pedersen et al. (2019)
- 5 models are discussed in the paper (G, GS, GI, S, and I)
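As a rough guide (our shorthand, with x a continuous covariate and fac a grouping factor), the five model types can be sketched as mgcv formulas along these lines:
# G : y ~ s(x)                                       # one global smooth
# GS: y ~ s(x) + s(x, fac, bs = "fs")                # global + group smooths, shared wiggliness
# GI: y ~ s(x) + s(x, by = fac) + s(fac, bs = "re")  # global + group smooths, own wiggliness
# S : y ~ s(x, fac, bs = "fs")                       # group smooths only, shared wiggliness
# I : y ~ s(x, by = fac) + s(fac, bs = "re")         # group smooths only, own wiggliness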
Let’s try some of these on our data!
Model G: A single common smoother for all observations
modG <- gam(y ~ # Abundance
              # Global smooth
              s(time, k = 60, bs = "cs") + # k = 60, because number of time steps = 120
              # bs = "cs": cubic regression spline with shrinkage
              # Series as random effect
              s(series, k = 5, bs = "re"), # k = 5, because number of series = 5
              # bs = "re", to get a random effect
            method = "REML", # Standard method for HGAMs
            family = "gaussian",
            data = plankton_data)
Let’s look at it graphically
gratia::draw(modG)
Summary
summary(modG)
Family: gaussian
Link function: identity
Formula:
y ~ s(time, k = 60, bs = "cs") + s(series, k = 5, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.151e-09 3.422e-02 0 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(time) 3.202e+01 59 4.608 <2e-16 ***
s(series) 4.015e-04 4 0.000 1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.321 Deviance explained = 35.9%
-REML = 770.66 Scale est. = 0.67427 n = 576
Not the greatest fit!
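One quick way to probe the fit (a sketch; gam.check() is a standard mgcv diagnostic) is to look at residual plots and the basis dimension check:
# residual diagnostics plus a k-index test of whether k was set large enough
gam.check(modG)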
Now, if we add a bit of complexity to the model…
Model GS: A global smoother plus group-level smoothers that have the same wiggliness.
modGS <- gam(y ~ # Abundance
               # Global smooth
               s(time, k = 60, bs = "cs", m = 2) + # k = 60, because number of time steps = 120
               # Series-specific smooths
               s(time, series, k = 60, bs = "fs", m = 2), # bs = "fs", factor smoother:
               # one smooth per series, all sharing the same wiggliness penalty
             method = "REML", # Standard method for HGAMs
             family = "gaussian",
             data = plankton_data)
Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
1-d smooths of same variable.
Let’s look at it graphically
gratia::draw(modGS)
Predictions with the marginaleffects package
plot_predictions(modGS,
condition = c('time', 'series', 'series'),
points = 0.5,
rug = TRUE) +
theme(legend.position = 'none') +
labs(y = 'Abundance', x = 'Time')
We can see more clearly the type of predictions the model makes for each level.
Summary
summary(modGS)
Family: gaussian
Link function: identity
Formula:
y ~ s(time, k = 60, bs = "cs", m = 2) + s(time, series, k = 60,
bs = "fs", m = 2)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.007795 0.023226 -0.336 0.737
Approximate significance of smooth terms:
edf Ref.df F p-value
s(time) 35.62 59 2.715 <2e-16 ***
s(time,series) 104.77 299 2.223 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.699 Deviance explained = 77.2%
-REML = 734.91 Scale est. = 0.29908 n = 576
But… could we get a better fit?
Short answer: Probably!
Longer answer: There’s always a trade-off: a more complex model can account for more of the variation in the data, but it is also more prone to overfitting.
However, these two models were pretty simple and we’ll see in the next part of the training how we can get to another level of model complexity.
Model comparison
- We won’t delve into model comparison here, as it falls outside the scope of this training. However, a few options are available for comparison, such as using AIC and evaluating predictive performance.
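For instance, comparing our two models by AIC would look like this (a quick sketch, not a full model-selection workflow; lower values indicate a better trade-off between fit and complexity):
# compare the global-smooth model and the global + group-level model
AIC(modG, modGS)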
4 Part 2 - Digging deeper with Dynamic (and Hierarchical!) Generalized Additive Models
Above, we modelled how plankton counts fluctuated through time with a hierarchical structure to understand how different groups vary through time. We noticed a strong seasonal pattern in these fluctuations.
Before we continue, let’s “zoom out” a little and think about how our model relates to ecology. Our model is saying that plankton abundances generally follow the seasonal fluctuations in temperature in Lake Washington, and that different plankton groups fluctuate a bit differently. All we have to explain these fluctuations through time is (1) time itself, (2) plankton groups, and (3) temperature through time.
Are there other factors that could explain why plankton groups vary through time during this period? Yes, absolutely! There are unmeasured environmental factors that play some role here - for example, water chemistry - as well as some variations due to the process of measuring these plankton counts through time (i.e., measurement variability or observation error). These factors all leave an “imprint” in the time series we have here. And most importantly, these factors vary through time - they are themselves time series with their own temporal patterns.
But, do we have any measured predictors to add to the model to capture the influence of these factors? Unfortunately, as is often the case in ecological studies, we do not have measurements of these predictors through time.
There are three important things to think about here:
There are unmeasured predictors that influence the trend we measured and are now modelling. The signature of these unmeasured processes is in the data, but we cannot capture it with the previous model because we did not include these predictors. This is called “latent variation”;
Each species responds to these unmeasured predictors in their own way. (This is where the “hierarchical” bit comes in!)
These processes are not static - they are dynamic. Like our response variable, these unmeasured predictors vary through time.
Pause: Let’s talk latents
You may have noticed that we slipped a new term into the previous section: “latent variation”. The definition of “latent” in the Merriam-Webster dictionary is:
Present and capable of emerging or developing but not now visible, obvious, active, or symptomatic.
OR
a fingerprint (as at the scene of a crime) that is scarcely visible but can be developed for study
It can be helpful to use this definition to conceptualise latent variation in our data. As we said above, there are “imprints” of factors that we didn’t measure on the time series we are modelling. These signals are present and influence the trends we estimate, but are “scarcely visible” because we lack information to detect them.
In statistical models, latent variables are essentially random predictors that are generated during the modelling process to capture correlated variations between multiple responses (species, for example). The idea is that a bunch of these latent variables are randomly generated, and they are penalized until the model only retains a minimal set of latent variables that capture the main axes of covariation between responses. (Here, it can be useful to think of how a PCA reduces the dimensions of a dataset to a minimum of informative “axes”). Each species is then assigned a “factor loading” for each latent variable, which represents the species’ response to the latent variable.
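To build intuition, here is a toy simulation of the factor idea (ours, not mvgam’s actual machinery; all names and values are made up): a few latent time series plus species-specific loadings generate many correlated observed series.
set.seed(1)
n_time <- 100; n_series <- 5; n_factors <- 2
# latent factors: random walks through time
factors <- apply(matrix(rnorm(n_time * n_factors), n_time), 2, cumsum)
# loadings: each series' response to each latent factor
loadings <- matrix(rnorm(n_factors * n_series), n_factors, n_series)
# observed series = loadings-weighted sum of the factors + observation noise
y_sim <- factors %*% loadings + rnorm(n_time * n_series, sd = 0.5)
matplot(y_sim, type = "l", lty = 1, xlab = "Time", ylab = "Simulated series")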
In a temporal model, latent variables condense the variation that is left unexplained into a “predictor” that capture some structured pattern across responses (e.g. species’ abundance trends) through time. For example, these latent variables can capture temporal autocorrelation between observations, meaning that we can tell our model to account for the fact that each observation is probably closer to neighbouring values though time (e.g. year 1 and year 2) than to a value that is several time steps away (e.g. year 1 and year 20).
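You can see this kind of temporal dependence in our own data: for example, the autocorrelation function (ACF) of one series shows that observations close in time are more alike than distant ones (a quick check, assuming plankton_data from above):
# ACF of the diatom counts; na.action = na.pass skips the missing values
acf(plankton_data$y[plankton_data$series == "Diatoms"], na.action = na.pass)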
In a spatial model, latent variables can be applied to capture dynamic processes in space, such as spatial autocorrelation. Similarly to the temporal model, we often make the assumption that observations that are closer in space are more similar than observations that are further apart, because ecological patterns are in part driven by the environment, which is itself spatially-autocorrelated.
Mini pause: Let’s talk dynamic
Ok, and what do we mean by “dynamic”? A dynamic factor model assumes the factors that predict our response variable (here, plankton counts) evolve as time series too. In other words, a dynamic factor model provides even more flexibility to capture the “wiggly” effects of these predictors on our response variable through time.
Okay, now that we’ve covered what a dynamic model is, let’s make one!
Modelling multivariate time series with mvgam
The mvgam package
The mvgam package is an interface to Stan, a language used to specify Bayesian statistical models. You could code these models directly in Stan if you wanted to - but mvgam allows you to specify the model in R (using the R syntax you know and love) and produces and compiles the Stan file for your model for you. Isn’t that nice? That means you can focus on thinking about what your model should be doing, rather than on learning a new programming language (which can be great to learn, too!).
The mvgam package has many more functionalities (many observation families for different types of data, forecasting functions, and more) than we will be using here. We really recommend that you have a look at the quite impressive amount of documentation about the package if you’re interested in specifying different models with your own data. See the seminars, tutorials, vignettes, and more here: nicholasjclark.github.io/mvgam.
A little warning
In Part 2, we will build a Bayesian model but will not discuss Bayesian theory or practices in detail. This is just a quick tutorial to hit the ground running, but before building your own model, you should have a better grasp of how Bayesian models work, and how to build them with care. To learn more:
Towards A Principled Bayesian Workflow by Michael Betancourt
Bayes Rules! An Introduction to Applied Bayesian Modeling by Alicia A. Johnson, Miles Q. Ott, Mine Dogucu.
How to change the default priors in mvgam in this example
More documentation, vignettes, and tutorials can be found on the mvgam site
Build a hierarchical model with mvgam
First, make sure the mvgam package and the marginaleffects package (which we will use to visualize the model) are loaded; we loaded both at the top of this document.
Our model is an attempt to estimate how plankton vary in abundance through time. Let’s consider what we know about the system, to help us build this model:
We know that there is strong seasonal variation due to temperature changes within each year that drives all plankton groups to fluctuate through time.
We are also interested in how plankton abundance is changing annually, on the longer term of the time series.
We know that these plankton are embedded in a complex system within a lake, and that their dynamics may be interdependent for many reasons. In other words, some groups may be correlated through time, and - even more complicated - these correlations may not happen immediately. There may be lagged correlations between groups as well!
Let’s build this model piece by piece, to capture each of these levels of variation:
First, let’s split the data into a training set, used to build the model, and a testing set, used to evaluate the model.
plankton_train <- plankton_data %>%
  dplyr::filter(time <= 112)
plankton_test <- plankton_data %>%
  dplyr::filter(time > 112)
Specify the model
Next, we’ll build a hierarchical GAM in mvgam, with a global smoother for all groups and species-specific smooths.
This will allow us to capture the “global” seasonality that drives all plankton groups to fluctuate similarly through the time series, and to capture how each group’s seasonal fluctuations differ from this overall, global trend.
notrend_mod <- mvgam(formula =
                       y ~
                       # tensor product of temp and month to capture
                       # "global" seasonality across plankton groups
                       te(temp, month, k = c(4, 4)) +
                       # series-specific deviation tensor products
                       # in other words, each plankton group can deviate from the
                       # global trend in its own way
                       te(temp, month, k = c(4, 4), by = series),
                     # set the observation family to Gaussian
                     family = gaussian(),
                     # our long-format dataset, split into a training and a testing set
                     data = plankton_train,
                     newdata = plankton_test,
                     # no latent trend model for now (so we can see its impact later on!)
                     trend_model = 'None',
                     # compilation & sampling settings
                     use_stan = TRUE
                     # here, we are only going to use the default sampling settings to
                     # keep the model quick for the tutorial. If you were really going
                     # to run this, you should set the chains, samples, and burnin arguments.
                     )
Let’s look at the Stan code that mvgam just wrote and ran for us:
stancode(notrend_mod)
// Stan model code generated by package mvgam
functions {
vector rep_each(vector x, int K) {
int N = rows(x);
vector[N * K] y;
int pos = 1;
for (n in 1:N) {
for (k in 1:K) {
y[pos] = x[n];
pos += 1;
}
}
return y;
}
}
data {
int<lower=0> total_obs; // total number of observations
int<lower=0> n; // number of timepoints per series
int<lower=0> n_sp; // number of smoothing parameters
int<lower=0> n_series; // number of series
int<lower=0> num_basis; // total number of basis coefficients
vector[num_basis] zero; // prior locations for basis coefficients
matrix[total_obs, num_basis] X; // mgcv GAM design matrix
array[n, n_series] int<lower=0> ytimes; // time-ordered matrix (which col in X belongs to each [time, series] observation?)
matrix[15,45] S1; // mgcv smooth penalty matrix S1
matrix[15,45] S2; // mgcv smooth penalty matrix S2
matrix[15,45] S3; // mgcv smooth penalty matrix S3
matrix[15,45] S4; // mgcv smooth penalty matrix S4
matrix[15,45] S5; // mgcv smooth penalty matrix S5
matrix[15,45] S6; // mgcv smooth penalty matrix S6
int<lower=0> n_nonmissing; // number of nonmissing observations
vector[n_nonmissing] flat_ys; // flattened nonmissing observations
matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
array[n_nonmissing] int<lower=0> obs_ind; // indices of nonmissing observations
}
parameters {
// raw basis coefficients
vector[num_basis] b_raw;
// gaussian observation error
vector<lower=0>[n_series] sigma_obs;
// smoothing parameters
vector<lower=0>[n_sp] lambda;
}
transformed parameters {
// basis coefficients
vector[num_basis] b;
b[1:num_basis] = b_raw[1:num_basis];
}
model {
// prior for (Intercept)...
b_raw[1] ~ student_t(3, -0.1, 2.5);
// prior for te(temp,month)...
b_raw[2:16] ~ multi_normal_prec(zero[2:16],S1[1:15,1:15] * lambda[1] + S1[1:15,16:30] * lambda[2] + S1[1:15,31:45] * lambda[3]);
// prior for te(temp,month):seriesBluegreens...
b_raw[17:31] ~ multi_normal_prec(zero[17:31],S2[1:15,1:15] * lambda[4] + S2[1:15,16:30] * lambda[5] + S2[1:15,31:45] * lambda[6]);
// prior for te(temp,month):seriesDiatoms...
b_raw[32:46] ~ multi_normal_prec(zero[32:46],S3[1:15,1:15] * lambda[7] + S3[1:15,16:30] * lambda[8] + S3[1:15,31:45] * lambda[9]);
// prior for te(temp,month):seriesGreens...
b_raw[47:61] ~ multi_normal_prec(zero[47:61],S4[1:15,1:15] * lambda[10] + S4[1:15,16:30] * lambda[11] + S4[1:15,31:45] * lambda[12]);
// prior for te(temp,month):seriesOther.algae...
b_raw[62:76] ~ multi_normal_prec(zero[62:76],S5[1:15,1:15] * lambda[13] + S5[1:15,16:30] * lambda[14] + S5[1:15,31:45] * lambda[15]);
// prior for te(temp,month):seriesUnicells...
b_raw[77:91] ~ multi_normal_prec(zero[77:91],S6[1:15,1:15] * lambda[16] + S6[1:15,16:30] * lambda[17] + S6[1:15,31:45] * lambda[18]);
// priors for smoothing parameters
lambda ~ normal(5, 30);
// priors for observation error parameters
sigma_obs ~ student_t(3, 0, 2.5);
{
// likelihood functions
vector[n_nonmissing] flat_sigma_obs;
flat_sigma_obs = rep_each(sigma_obs, n)[obs_ind];
flat_ys ~ normal_id_glm(flat_xs,
0.0,b,
flat_sigma_obs);
}
}
generated quantities {
vector[total_obs] eta;
matrix[n, n_series] sigma_obs_vec;
matrix[n, n_series] mus;
vector[n_sp] rho;
array[n, n_series] real ypred;
rho = log(lambda);
// posterior predictions
eta = X * b;
for (s in 1:n_series) {
sigma_obs_vec[1:n,s] = rep_vector(sigma_obs[s], n);
}
for(s in 1:n_series){
mus[1:n, s] = eta[ytimes[1:n, s]];
ypred[1:n, s] = normal_rng(mus[1:n, s], sigma_obs_vec[1:n, s]);
}
}
And finally, the model summary.
summary(notrend_mod, include_betas = FALSE)
GAM formula:
y ~ te(temp, month, k = c(4, 4)) + te(temp, month, k = c(4, 4),
by = series)
Family:
gaussian
Link function:
identity
Trend model:
None
N series:
5
N timepoints:
120
Status:
Fitted using Stan
4 chains, each with iter = 1000; warmup = 500; thin = 1
Total post-warmup draws = 2000
Observation error parameter estimates:
2.5% 50% 97.5% Rhat n_eff
sigma_obs[1] 0.76 0.87 1.00 1 2620
sigma_obs[2] 0.58 0.66 0.76 1 2721
sigma_obs[3] 0.80 0.92 1.10 1 2954
sigma_obs[4] 0.66 0.75 0.87 1 2969
sigma_obs[5] 0.63 0.73 0.85 1 2457
GAM coefficient (beta) estimates:
2.5% 50% 97.5% Rhat n_eff
(Intercept) -0.11 -0.044 0.023 1 2869
Approximate significance of GAM smooths:
edf Ref.df Chi.sq p-value
te(temp,month) 5.76 15 52.31 2.4e-05 ***
te(temp,month):seriesBluegreens 2.86 15 3.98 1.00
te(temp,month):seriesDiatoms 3.34 15 56.02 0.11
te(temp,month):seriesGreens 2.69 15 1.01 1.00
te(temp,month):seriesOther.algae 1.34 15 2.83 1.00
te(temp,month):seriesUnicells 2.49 15 6.47 0.99
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Stan MCMC diagnostics:
n_eff / iter looks reasonable for all parameters
Rhat looks reasonable for all parameters
0 of 2000 iterations ended with a divergence (0%)
0 of 2000 iterations saturated the maximum tree depth of 10 (0%)
E-FMI indicated no pathological behavior
Samples were drawn using NUTS(diag_e) at Mon Feb 24 15:41:29 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split MCMC chains
(at convergence, Rhat = 1)
Use how_to_cite(notrend_mod) to get started describing this model
Visualise the model
Let’s first plot the global smoother for all species over time:
plot_mvgam_smooth(notrend_mod)
This is the shared seasonal trend estimated across all groups at once. We can see that the model was able to capture the seasonal temporal structure of the plankton counts, based on temperature and time (months).
We can also see how each group deviates from this global smooth (i.e. the partial effects):
gratia::draw(notrend_mod$mgcv_model)
The colour shows us the strength of the partial effect. We can see that the model was able to capture some differences between each plankton group’s seasonal trend and the community’s global trend.
Inspect the model
A great test of how good a model is, is to see how well it forecasts data we already have. We split the data into a training set and a test set above. Let’s see how well the trained model predicts this test set!
sapply(1:5, function(x) plot(notrend_mod, type = 'forecast', series = x))
Out of sample CRPS:
6.76106172887687
Out of sample CRPS:
6.74284136216226
Out of sample CRPS:
4.14878078336935
Out of sample CRPS:
3.52581136668436
Out of sample CRPS:
2.83716936348026
The points are the data points, and the red line and ribbon show the forecast trend and its credible intervals. Overall, these forecasts are okay, but not perfect - the data points are often within the credible intervals, and the forecasted trend seems to follow the seasonal trend pretty well. The model seems to understand that there is a strong seasonal trend in our observations, and is trying to predict it for each plankton group.
But how’s the model doing, overall? Let’s plot the residuals:
sapply(1:5, function(x) plot_mvgam_resids(notrend_mod, series = x))
Looking at the Autocorrelation Function plots (ACF), we can see that there’s still a lot of temporal autocorrelation to deal with in the model. Let’s try to address some of this with a dynamic model!
Dynamic model
Let’s now add a dynamic trend component to the model, to capture some of the variation that isn’t captured in our previous model.
You can specify many kinds of dynamic trends with mvgam, but we won’t go into much detail here. We want to demonstrate the power of this approach, but you will need to select the dynamic trend model that works best for your own data. Here, we will use a Gaussian Process to model temporal autocorrelation in our data, which can be specified as trend_model = "GP" in mvgam.
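For reference, mvgam supports several latent trend specifications; here is a sketch of a few, based on the package documentation (see ?mvgam for the full list - note that recent mvgam versions prefer dedicated constructor functions such as RW(), as the deprecation warning in the summary further below mentions):
# trend_model = 'None'     # no latent trend (as in notrend_mod above)
# trend_model = RW()       # random walk latent trend
# trend_model = AR(p = 1)  # autoregressive latent trend of order p
# trend_model = 'GP'       # Gaussian Process latent trend (used below)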
gptrend_mod <- mvgam(formula =
                       y ~
                       te(temp, month, k = c(4, 4)) +
                       te(temp, month, k = c(4, 4), by = series),
                     # latent trend model: a Gaussian Process, to capture smooth
                     # temporal autocorrelation in the latent trends
                     trend_model = "GP",
                     # use dynamic factors to estimate the latent trends
                     use_lv = TRUE,
                     # observation family
                     family = gaussian(),
                     # our long-format datasets
                     data = plankton_train,
                     newdata = plankton_test,
                     # use reduced samples for inclusion in tutorial data
                     samples = 100)
Inspect the model
This workshop is only a quick glimpse of how to build a Bayesian model. If you want to learn more about how to inspect your model and understand how “good” it is, please check out: Bayes Rules! An Introduction to Applied Bayesian Modeling, Chapter 10: Evaluating Regression Models.
Posterior predictive checks are one way to investigate a Bayesian model’s ability to make unbiased predictions. In this check, we’ll be measuring the posterior prediction error of our model - in other words, how close are our model predictions to the expected value of Y?
These plots compare the observed data (solid black line) to the posterior predictions drawn from the fitted model (red line and credible interval ribbons). If our model is unbiased, these two lines should match pretty well. If there is a bias, we would see that the model predictions are very different from the black solid line.
sapply(1:5, FUN = function(x) ppc(gptrend_mod, series = x, type = 'density'))
These plots are pretty encouraging - our model isn’t generating biased predictions compared to the observation data (black line). Our model is doing a pretty good job!
Let’s check out the model summary - how does it differ from the model without the dynamic factors?
summary(gptrend_mod, include_betas = FALSE)
GAM formula:
y ~ te(temp, month, k = c(4, 4)) + te(temp, month, k = c(4, 4),
by = series)
Family:
gaussian
Link function:
identity
Trend model:
GP
N latent factors:
2
N series:
5
N timepoints:
120
Status:
Fitted using Stan
4 chains, each with iter = 600; warmup = 500; thin = 1
Total post-warmup draws = 400
Observation error parameter estimates:
2.5% 50% 97.5% Rhat n_eff
sigma_obs[1] 0.36 0.42 0.48 1.00 568
sigma_obs[2] 0.55 0.62 0.72 1.01 390
sigma_obs[3] 0.79 0.91 1.10 1.00 595
sigma_obs[4] 0.62 0.71 0.80 1.00 600
sigma_obs[5] 0.52 0.61 0.71 1.00 499
GAM coefficient (beta) estimates:
2.5% 50% 97.5% Rhat n_eff
(Intercept) -0.16 -0.057 0.042 1.01 528
Approximate significance of GAM smooths:
edf Ref.df Chi.sq p-value
te(temp,month) 4.692 15 30.53 0.045 *
te(temp,month):seriesBluegreens 2.751 15 9.26 0.940
te(temp,month):seriesDiatoms 3.755 15 50.18 0.201
te(temp,month):seriesGreens 2.241 15 0.79 1.000
te(temp,month):seriesOther.algae 3.877 15 10.41 0.906
te(temp,month):seriesUnicells 0.933 15 16.65 0.984
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Latent trend length scale (rho) estimates:
2.5% 50% 97.5% Rhat n_eff
rho_gp[1] 2.1 4.6 8.8 1.01 294
rho_gp[2] 1.4 4.3 24.0 1.02 188
Warning: Supplying trend_model as a character string is deprecated
Please use the dedicated functions (i.e. RW() or ZMVN()) instead
This warning is displayed once per session.
Stan MCMC diagnostics:
n_eff / iter looks reasonable for all parameters
Rhats above 1.05 found for 24 parameters
*Diagnose further to investigate why the chains have not mixed
0 of 400 iterations ended with a divergence (0%)
0 of 400 iterations saturated the maximum tree depth of 10 (0%)
E-FMI indicated no pathological behavior
Samples were drawn using NUTS(diag_e) at Thu Feb 27 23:14:09 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split MCMC chains
(at convergence, Rhat = 1)
Use how_to_cite(gptrend_mod) to get started describing this model
Visualise the model
Plot the global smoother for all species over time:
plot_mvgam_smooth(gptrend_mod)
This is the shared temporal trend estimated across all species at once. We can see that, overall, species’ biomasses declined during this time period.
Plot the species’ estimated trends:
sapply(1:5, function(x) plot_mvgam_trend(gptrend_mod, series = x))
And, importantly, let’s have a look at the dynamic factors estimated by the model to capture temporal dynamics in our response data:
plot(gptrend_mod, type = 'factors')
These are the famous dynamic latent variables we’ve been talking about. You can see that they are capturing some of the unexplained temporal variation in our model. The model actually generates a bunch of these and penalizes them harshly, retaining only latent variables with high explanatory power. This is the result!
Partial effects
We can plot the partial effect of each variable we included in our model, to understand how they contribute to the overall model.
Let’s start with the partial effect of temperature on plankton counts across groups:
plot_predictions(gptrend_mod, condition = c("temp"), conf_level = 0.9)
Next, the partial effect of seasonality (here, months) on plankton counts across groups:
plot_predictions(gptrend_mod, condition = c("month"), conf_level = 0.9)
And finally, the effect of different plankton groups on the model:
plot_predictions(gptrend_mod, condition = c("series"), conf_level = 0.9)
Calculate the rate of change in plankton counts
We can add each group’s estimated rate of change in abundance through time to these plots. Think of this as the slope of their estimated trend at each time point - this can give us an idea of the rate at which some groups declined or grew during the time series. If the derivative trend is centred on the zero line, the response variable (here, counts) was fluctuating around its mean abundance rather than declining or growing in a consistent way through time.
As an example, let’s look at the blue-green algae trend, where the first panel is the estimated trend of the blue-green algae counts through time, and the bottom panel is the rate of change in these counts over time. In the second panel, negative values mean there was a decline in counts, and positive values mean there was an increase in counts. This can be very useful to calculate population growth rates through time:
plot_mvgam_trend(gptrend_mod, series = 1, derivatives = TRUE)
It seems that these blue-green algae were pretty stable at the beginning of the time series, then declined slightly, then declined quickly around time step 100. The counts then increased after this point.
We made a slightly “hacked” version of this function to extract the derivatives:
source(here::here("saved-objects/custom-function.R"))
derivs = lapply(1:5, function(x) plot_mvgam_trend_custom(gptrend_mod, series = x, derivatives = TRUE))
names(derivs) = gptrend_mod$obs_data$series |> levels()
Let’s look at a histogram of the derivatives to get an idea of how this group’s counts changed over time, on average:
# reformatting the data for easier plotting
df = lapply(derivs, function(x) x[,-1]) |>
  lapply(as.data.frame) |>
  lapply(pivot_longer, cols = everything()) |>
  bind_rows(.id = "Group")
# make a histogram of rates of change in counts for each plankton group
plot_trenddensity =
  ggplot(data = df) +
  geom_histogram(aes(x = value, fill = after_stat(x)),
                 col = "black", linewidth = .2, bins = 19) +
  geom_vline(xintercept = mean(df$value, na.rm = TRUE)) +
  geom_vline(xintercept = mean(df$value, na.rm = TRUE) - sd(df$value, na.rm = TRUE), lty = 2) +
  geom_vline(xintercept = mean(df$value, na.rm = TRUE) + sd(df$value, na.rm = TRUE), lty = 2) +
  theme(panel.grid.major.x = element_line()) +
  scale_y_sqrt() +
  labs(x = "Rate of change",
       y = "Frequency",
       fill = "Rate of change") +
  scale_fill_distiller(palette = "RdYlGn",
                       direction = 1,
                       limits = c(-.4, .4)) +
  coord_cartesian(xlim = c(-.4, .4)) +
  facet_wrap(~Group, ncol = 2) +
  theme(legend.position = "top")
# display the plot
plot_trenddensity
The median of the distribution of rates of change is ~0 for all the groups - but does this mean they didn’t vary throughout the time series? Not at all! Look at the difference in the spread of rates of change between Bluegreens and the other groups. Bluegreens went through several periods of large growth (green) and of large declines (red), while Diatoms and Greens varied relatively little (all of their rates of change are at or close to 0). If you scroll back up to see the predicted trends for each group, do these histograms make sense to you?
Species correlations
One way to investigate community dynamics is to check out the correlations between plankton groups. These correlations are captured with the dynamic model, which estimates unmeasured temporal processes in our data.
Let’s calculate and plot species’ temporal correlations and plot them as a heatmap:
# Calculate the correlations from the latent trend model
species_correlations = lv_correlations(gptrend_mod)

# prepare the matrix for plotting
toplot = species_correlations$mean_correlations
colnames(toplot) = gsub("_", "\n", stringr::str_to_sentence(colnames(toplot)))
rownames(toplot) = gsub("_", "\n", stringr::str_to_sentence(rownames(toplot)))

# plot as a heatmap
corrplot::corrplot(toplot,
                   type = "lower",
                   method = "color",
                   tl.cex = 1, cl.cex = 1, tl.col = "black", font = 3)
Model comparison
So, which model is best?
We will use leave-one-out comparison to compare the two models we just built: notrend_mod, which does not have a dynamic trend, and gptrend_mod, which includes a Gaussian Process latent trend model.
This workshop isn’t really about model comparison, so we will just use this approach as an example, but there are other ways you could compare these models.
One way is to compare the predictive performance of the models on new data using a metric of comparison like expected log pointwise predictive density (ELPD).
mvgam::loo_compare(notrend_mod, gptrend_mod,
                   model_names = c("No trend", "Dynamic trend"))
Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
elpd_diff se_diff
Dynamic trend 0.0 0.0
No trend -101.6 11.7
The dynamic trend model performs better than the model with no trend (here, we want higher - more positive - values of elpd_diff).
Another way is to consider the ecological processes that influence our response variable. Which model do you think does a better job of representing the factors that cause plankton counts to vary through time?
5 Wrap up
We hope you’ve enjoyed this short tutorial on hierarchical generalized additive models with mgcv and mvgam.
At the end of this tutorial, we hope you now feel you understand how a hierarchical model works, and how it can be used to capture nonlinear effects. The R packages mgcv and mvgam are great tools to build and fit hierarchical models!
We also covered dynamic modelling, and how latent variables can be used to capture dynamic processes like temporal or spatial autocorrelation. Can you use latent variables to explain variation in your own models?
6 Thank you
Thank you to BIOS2 and Gracielle Higino, Andrew Macdonald, and Kim Gauthier-Schampaert for supporting this workshop. Special thanks to Nicholas J. Clark for helpful suggestions and for providing much of the material we used to build this workshop.
If you’re interested in talking more about these models, please join our community calls during March and April 2025! We welcome anyone interested in GAMs, computational ecology, or eager to learn more about HGAMs to participate in the following sessions:
Temporal Ecology with HGAMs
When: March 7, 2025 at 1:30 PM (EST)
Registration: https://us02web.zoom.us/meeting/register/eocqFUC7QZG9HqydWb2Lsw#/registration

Spatial Ecology with HGAMs
When: March 14, 2025 at 1:30 PM (EST)
Registration: https://us02web.zoom.us/meeting/register/BAdUd6hKSDmzG_w4hgbgrQ#/registration

Behavioural Ecology with HGAMs
When: March 24, 2025 at 1:30 PM (EST)
Registration: https://us02web.zoom.us/meeting/register/taVdATjWRw2uhzYTry2bHA#/registration
Each discussion will focus on the outstanding ecological questions that we could answer with HGAMs, highlighting a wide array of potential applications for specific types of ecological and evolutionary data. Join us in thinking about how we could use HGAMs to push ecological research forward!