Introduction to Hierarchical Generalized Additive Models



Authors

Camille Lévesque (Université de Sherbrooke)
Katherine Hébert (McGill University)

Published

March 3, 2025

1 Overview

This course is designed to demystify hierarchical models as powerful tools to model population dynamics, spatial distributions, and any non-linear relationships in your ecological data. The training will be divided into two blocks.

  1. First, we will cover hierarchies in biology, data, and in models to understand what hierarchical models are, some of the forms they can take, and the fundamentals of how they work.

  2. Second, we will introduce latent variable modelling as a way to explain even more of the variation in our response variables, to better disentangle the hierarchies of variation in our data.

Both blocks will include a theoretical presentation followed by hands-on coding exercises to implement and interpret hierarchical GAMs.

This workshop was developed with support from the NSERC CREATE Computational Biodiversity Science and Services (BIOS²) training program.

Credits

Part 2 of this workshop is based on Nicholas J. Clark’s Physalia course “Ecological forecasting with mvgam and brms”, which Katherine Hébert and Camille Lévesque reworked into this short tutorial. We recommend having a look at the original for a more in-depth treatment of the methods we are covering today:

Clark, N. J. (2024). “Ecological forecasting with mvgam and brms”. Physalia. Retrieved from https://nicholasjclark.github.io/physalia-forecasting-course/.

Learning outcomes

  1. Understand how a hierarchical model works, and how it can be used to capture nonlinear effects

  2. Understand dynamic modelling, and how latent variables can be used to capture dynamic processes like temporal or spatial autocorrelation

  3. Use the R packages mgcv and mvgam to build and fit hierarchical models

  4. Understand how to visualize and interpret hierarchical models with these packages

Requirements

Familiarity with Generalized Additive Modelling

We recommend previous experience with GAMs before taking this training. If you would like to follow an introduction to GAMs before this workshop, please have a look at Eric Pedersen’s Introduction to GAMs and/or the Québec Centre for Biodiversity Science’s Workshop 8: GAMs.

Familiarity with Bayesian modelling

Part 2 fits Bayesian models with Stan, so some familiarity with Bayesian inference will help you follow along.

R & RStudio

The workshop assumes basic familiarity with R/RStudio. To be able to follow along with the practical exercises on your own computer, in addition to downloading the data files (see the Workshop materials section below), you will need to do the following:

Install the latest version of R for your operating system: https://cran.r-project.org/

Install the latest version of RStudio for your operating system: https://www.rstudio.com/products/rstudio/download/

R packages

Install and load the packages that we will use during the workshop by executing the following code (the workshop was developed with R version 4.2.0):

# install packages from CRAN

# install.packages(pkgs = c("dplyr", "tidyr", "mvgam", "mgcv", "gratia", "marginaleffects", "corrplot", "ggplot2"), dependencies = TRUE)
library("dplyr")
library("tidyr")
library("mvgam")
library("mgcv")
library("gratia")
library("marginaleffects")
library("corrplot")
library("ggplot2")

# set all ggplots to this theme
theme_set(theme_minimal())

Stan

Please note that for Block 2, we will be fitting models using Stan. Stan must be installed (along with either rstan and/or cmdstanr). Please refer to the installation links for Stan with cmdstanr here (or with rstan here).
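As a sketch, one common installation route uses the cmdstanr R package (this mirrors the official cmdstanr instructions; check the links above for the current steps):

# one possible installation route via cmdstanr (see the official docs)
# install.packages("cmdstanr",
#                  repos = c("https://stan-dev.r-universe.dev", getOption("repos")))
# cmdstanr::check_cmdstan_toolchain() # verify the C++ toolchain is ready
# cmdstanr::install_cmdstan()         # download and build CmdStan itself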


2 Workshop materials

About the data set

In this workshop, we will be analyzing time series of plankton counts (cells per mL) taken from Lake Washington in Washington, USA during a long-term monitoring study. The data are available in the MARSS package, and are described and analysed in greater detail in this publication:

Hampton, S. E., Scheuerell, M. D., & Schindler, D. E. (2006). Coalescence in the Lake Washington story: Interaction strengths in a planktonic food web. Limnology and Oceanography, 51(5). doi: 10.4319/lo.2006.51.5.2042.

Some information about the data, from the MARSS package documentation (link):

The lakeWAplankton data set consists of two data sets: lakeWAplanktonRaw and a dataset derived from the raw dataset, lakeWAplanktonTrans. lakeWAplanktonRaw is a 32-year time series (1962-1994) of monthly plankton counts from Lake Washington, Washington, USA. Columns 1 and 2 are year and month. Column 3 is temperature (C), column 4 is total phosphorous, and column 5 is pH. The next columns are the plankton counts in units of cells per mL for the phytoplankton and organisms per L for the zooplankton.

lakeWAplanktonTrans is a transformed version of lakeWAplanktonRaw. Zeros have been replaced with NAs (missing). The logged (natural log) raw plankton counts have been standardized to a mean of zero and variance of 1 (so logged and then z-scored). Temperature, TP & pH were also z-scored but not logged (so z-score of the untransformed values for these covariates). The single missing temperature value was replaced with -1 and the single missing TP value was replaced with -0.3.

The data download and preparation steps are identical to the steps in Nicholas J. Clark’s Physalia course.

Data download

Load the dataset:

load(url('https://github.com/atsa-es/MARSS/raw/master/data/lakeWAplankton.rda'))

Prepare the data

First, the data needs to be prepared into a long format (i.e., one observation per row). In Part 2, the mvgam package will require specific column names, so we will prepare the data to match the package’s requirements.

## Prepare the time series data for analysis 

# This code is from https://nicholasjclark.github.io/physalia-forecasting-course/day4/tutorial_4_physalia

# We will work with five different groups of plankton:
outcomes <- c('Greens', 'Bluegreens', 'Diatoms', 'Unicells', 'Other.algae')


# loop across each plankton group to create the long dataframe
plankton_data <- do.call(rbind, lapply(outcomes, function(x){
  
  # create a group-specific dataframe with counts labelled 'y'
  # and the group name in the 'series' variable
  data.frame(year = lakeWAplanktonTrans[, 'Year'],
             month = lakeWAplanktonTrans[, 'Month'],
             y = lakeWAplanktonTrans[, x],
             series = x,
             temp = lakeWAplanktonTrans[, 'Temp'])})) %>%
  
  # change the 'series' label to a factor
  dplyr::mutate(series = factor(series)) %>%
  
  # filter to only include some years in the data
  dplyr::filter(year >= 1965 & year < 1975) %>%
  dplyr::arrange(year, month) %>%
  dplyr::group_by(series) %>%
  
  # z-score the counts so they are approximately standard normal
  dplyr::mutate(y = as.vector(scale(y))) %>%
  
  # add the time index
  dplyr::mutate(time = dplyr::row_number()) %>%
  dplyr::ungroup()

Let’s have a look at the data’s structure:

head(plankton_data)
# A tibble: 6 × 6
   year month       y series       temp  time
  <dbl> <dbl>   <dbl> <fct>       <dbl> <int>
1  1965     1 -0.542  Greens      -1.23     1
2  1965     1 -0.344  Bluegreens  -1.23     1
3  1965     1 -0.0768 Diatoms     -1.23     1
4  1965     1 -1.52   Unicells    -1.23     1
5  1965     1 -0.491  Other.algae -1.23     1
6  1965     2 NA      Greens      -1.32     2

Let’s also plot the data to get a first look at the time series:

ggplot(data = plankton_data, aes(x = time)) +
  geom_line(aes(y = temp), col = "black") + # Temperature in black
  geom_line(aes(y = y, col = series), lwd = .4) + # Taxa counts in color
  labs(x = "Time", y = "Abundance (z-score)\n(Temperature z-score in black)", col = "Group")


3 Part 1: Hierarchical Generalized Additive Models (HGAMs)

A (very) brief introduction to GAMs

  • Definition: Generalized additive models (GAMs) give us enough flexibility to capture nonlinear relationships without becoming so “black-boxy” that they are hard to interpret. They offer a middle ground between simple and complex models.
  • Why are they relevant in ecology? A lot of relationships in ecology are inherently nonlinear, and GAMs are particularly useful for capturing these more complex relationships.

Nonlinear relationship

If we plot the data for one of the groups (diatoms), we see that the relationship between abundance and time is not linear.

diatoms_data <- plankton_data[plankton_data$series == "Diatoms", ]
ggplot(diatoms_data, aes(x = time, y = y)) +
  geom_point() +
  geom_line(aes(y = y), lwd = .4) +
  labs(title = "Diatoms Data Points", x = "Time", y = "Abundance (z-score)") +
  theme_minimal()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Linear model on diatoms data

If we fit a linear model to this data, we come to the same conclusion:

diatoms_lm <- lm(y ~ time, data = diatoms_data)

We can plot the relationship between time and abundance using ggplot:

ggplot(diatoms_data, aes(x = time, y = y)) +
  geom_point(color = "black") +  # Scatter points
  geom_smooth(method = "lm", color = "blue1", fill = "lightgray", se = TRUE) +  # Regression line
  labs(title = "Diatoms Data with Linear Regression Line",
       x = "Time",
       y = "Y Value") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

GAM on diatoms data

Now, if we fit a GAM to our data:

# model
diatoms_gam <- gam(y ~ s(time, bs = "cs", k = 50), data = diatoms_data)

We immediately see a better fit, as the spline seems to follow our data points more accurately.

# graph
gratia::draw(diatoms_gam, rug = FALSE)

What are splines?

  • Splines (also known as smooths or smoothers) are a type of function used to estimate nonlinear relationships between covariates and the response variable.
  • Splines are the sum of simpler functions, called basis functions (see the sketch below).
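To make this concrete, here is a minimal sketch of a spline as a weighted sum of basis functions, using mgcv’s smoothCon() to evaluate a basis on an arbitrary time grid; the coefficients here are made up for illustration (a fitted GAM would estimate them):

# evaluate the basis functions underlying s(time) on a grid
time_grid <- data.frame(time = seq(1, 120, length.out = 200))
sm <- smoothCon(s(time, k = 8, bs = "cr"), data = time_grid)[[1]]

# each column of sm$X is one basis function evaluated along 'time'
matplot(time_grid$time, sm$X, type = "l", lty = 1,
        xlab = "Time", ylab = "Basis function value")

# a spline is the weighted sum X %*% beta; here we use random
# (hypothetical) weights just to show how the pieces combine
beta <- rnorm(ncol(sm$X))
lines(time_grid$time, sm$X %*% beta, lwd = 2)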

What determines how “curvy” splines are?

  • In the GAM world we have a specific term for this: ✨wiggliness

  • Wiggliness refers to the degree of flexibility in the model’s fitted function. It describes how much the model can fluctuate to fit the data.

  • The basis dimension k (roughly, the number of basis functions, or knots) we set in our model is the maximum wiggliness we expect our relationship to have.

Wiggliness

  • The model penalizes wiggliness (via a penalized log-likelihood), which should shrink the fitted function down to the right amount of wiggliness to match the patterns in the data.

  • If the wiggliness is too high, we risk overfitting, which results in a model that is less generalizable and more specific to this exact set of data (see the sketch below).

  • wiggliness ≠ shape of the curve
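As a quick illustration of the penalty at work, here is a hedged sketch that fixes the smoothing parameter sp by hand at two extreme values (normally you would let REML estimate it):

# same smooth, two fixed smoothing parameters: a tiny penalty lets
# the fit wiggle freely; a huge penalty flattens it toward a line
m_wiggly <- gam(y ~ s(time, k = 50), sp = 0.0001, data = diatoms_data)
m_stiff  <- gam(y ~ s(time, k = 50), sp = 1000,   data = diatoms_data)

plot(y ~ time, data = diatoms_data, pch = 16, col = "grey60")
new_time <- data.frame(time = seq(1, 120, length.out = 200))
lines(new_time$time, predict(m_wiggly, new_time), col = "red")  # overfit risk
lines(new_time$time, predict(m_stiff, new_time), col = "blue")  # underfit risk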

What is a hierarchy?

What is a hierarchy in biology?

  • It refers to the organization of organisms into levels

  • You’re probably familiar with the taxonomic hierarchy

We find that nature often tends to “organize” itself into hierarchies.

What is a hierarchy in data?

We find hierarchies in data when a data item is linked to another data item by a hierarchical relationship, where one item is the “parent” of the other.

head(plankton_data) 
# A tibble: 6 × 6
   year month       y series       temp  time
  <dbl> <dbl>   <dbl> <fct>       <dbl> <int>
1  1965     1 -0.542  Greens      -1.23     1
2  1965     1 -0.344  Bluegreens  -1.23     1
3  1965     1 -0.0768 Diatoms     -1.23     1
4  1965     1 -1.52   Unicells    -1.23     1
5  1965     1 -0.491  Other.algae -1.23     1
6  1965     2 NA      Greens      -1.32     2
  • “month” is nested in “year”

  • “series” = grouping variable with 5 levels

What is a hierarchical model (HM)?

  • Hierarchical generalised linear models, or hierarchical models (HMs), are used to estimate linear relationships between predictors and a response variable, but impose a structure where the observations are organized into groups.

  • The relationship between predictors and the response variable may differ across groups (a minimal sketch follows below).

Pedersen et al. (2019)
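As a minimal sketch of this idea in R (our example, not Pedersen et al.’s code), a random-intercept hierarchical model can be written in mgcv by combining a shared linear effect of time with a random effect of the grouping variable:

# a simple hierarchical linear model: one shared slope for time,
# plus a random intercept for each plankton series
hm <- gam(y ~ time + s(series, bs = "re"),
          data = plankton_data, method = "REML")
summary(hm)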

Now that we have seen GAMs and HMs… What are HGAMs?

HGAM

  • A hierarchical generalized additive model, or HGAM, is the integration of GAMs and HMs into one model. HGAMs enable us to estimate nonlinear relationships between predictors and a response variable when at least one of the predictors is a grouping variable.

  • We’ve previously identified hierarchies in the data; now, how do we incorporate them into our models? We represent these hierarchies using smooth functions in our equations.

HGAM equation with mgcv

# gam(y ~ s(time, k = k_number_time_steps, bs = "cs") +
#       s(time, series, k = number_series, bs = "fs", m = 2),
#     family = "gaussian",
#     data = plankton_data,
#     method = "REML")

Let’s break it down.

Smooth function

s(time, k = k_number_time_steps, bs = "cs")

s() = smooth function

k = basis dimension (the number of basis functions, often described as knots); this can be the number of time steps or any fraction of that number. You can check whether k was set high enough with gam.check(), as sketched below.

bs = type of basis functions
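For instance, mgcv ships with a basis-dimension diagnostic; here is a quick sketch on the diatom GAM we fitted above:

# gam.check() prints residual diagnostics plus a k table (k' vs. edf and
# a k-index p-value); edf close to k' with a low p-value suggests k may
# be too small
gam.check(diatoms_gam)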

Types of basis functions

  • Thin Plate Regression Splines (“tp” in mgcv): a general-purpose spline basis. If your data have no cyclic (e.g., phenological) component and your smooth has only one covariate, this is probably the basis to use. To learn more, see Wood (2003).

  • Random Effect (“re”): similar to a random effect in a mixed model; creates an intercept per grouping level.

  • Cubic Regression Splines with shrinkage (“cs”): a cubic regression spline whose penalty can shrink the whole smooth towards zero. (For truly cyclic components such as phenology, mgcv also provides the cyclic cubic basis, “cc”.)

  • Tensor product (“te”): used to model an interaction between predictors when we believe it affects the relationship with the response variable. For example: the interaction of temperature and latitude on the abundance of a species of plant.

  • Factor Smoother (“fs”): produces a smooth curve for each level of a single factor. This means that a model can be constructed with a main effect plus factor-level smooth deviations from that effect. (Each basis type is sketched in the code below.)
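Here is a hedged sketch of how each basis type appears in an mgcv formula; the variable names (x, group, doy, lat) are hypothetical placeholders:

# hypothetical formulas illustrating each basis type (not run)
# gam(y ~ s(x, bs = "tp"))         # thin plate: general-purpose smooth
# gam(y ~ s(group, bs = "re"))     # random intercept per factor level
# gam(y ~ s(x, bs = "cs"))         # cubic regression spline with shrinkage
# gam(y ~ s(doy, bs = "cc"))       # cyclic smooth, e.g. day-of-year phenology
# gam(y ~ te(temp, lat))           # tensor product interaction of two covariates
# gam(y ~ s(x, group, bs = "fs"))  # one smooth per factor level (factor smoother)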

Factor smoother

  • Global trend across all levels of a grouping variable

  • Species-specific trends sharing the same wiggliness, but differing in shapes

Types of functional variations that can be fitted with HGAMs

  • 5 models are discussed in the paper (G, GS, GI, S, and I)

From “Hierarchical generalized additive models in ecology: an introduction with mgcv”, by Pedersen et al. (2019)

Let’s try some of these on our data!

Model G: A single common smoother for all observations

modG <- gam(y ~                           # Abundance
              
            #  Global Smooth             
            s(time, k = 60, bs = "cs") +  # k = 60, because number of time steps = 120 
                                          # bs = "cs", because cyclic component in the data
            
            # Series as random effect    
            s(series, k = 5, bs = "re"),  # k = 5, because number of series = 5
                                          # bs = "re", to get a random effect
            
            method = "REML",               # Standard method for HGAMs
            
            family = "gaussian",
            
            data = plankton_data
              
            )

Let’s look at it graphically

gratia::draw(modG)

Summary

summary(modG)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(time, k = 60, bs = "cs") + s(series, k = 5, bs = "re")

Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.151e-09  3.422e-02       0        1

Approximate significance of smooth terms:
                edf Ref.df     F p-value    
s(time)   3.202e+01     59 4.608  <2e-16 ***
s(series) 4.015e-04      4 0.000       1    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.321   Deviance explained = 35.9%
-REML = 770.66  Scale est. = 0.67427   n = 576

Not the greatest fit!

Now, if we add a bit of complexity to the model…

Model GS: A global smoother plus group-level smoothers that have the same wiggliness.

modGS <- gam(y ~                           # Abundance
              
            #  Global Smooth             
            s(time, k = 60, bs = "cs", m = 2) +  # k = 60, because number of time steps = 120 
                                                 # bs = "cs", because cyclic component
            
            # Series as random effect    
            s(time, series, k = 60, bs = "fs", m = 2),  # k = 60, matching the global smooth
                                                        # bs = "fs", factor smoother
            
            method = "REML",                            # Standard method for HGAMs
            
            family = "gaussian",
            
            data = plankton_data
              
            )
Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
1-d smooths of same variable.

Let’s look at it graphically

gratia::draw(modGS)

Predictions with the marginaleffects package

plot_predictions(modGS, 
                 condition = c('time', 'series', 'series'),
                 points = 0.5,
                 rug = TRUE) +
  theme(legend.position = 'none') +
  labs(y = 'Abundance', x = 'Time')

We can see more clearly the type of predictions the model makes for each level.

Summary

summary(modGS)

Family: gaussian 
Link function: identity 

Formula:
y ~ s(time, k = 60, bs = "cs", m = 2) + s(time, series, k = 60, 
    bs = "fs", m = 2)

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.007795   0.023226  -0.336    0.737

Approximate significance of smooth terms:
                  edf Ref.df     F p-value    
s(time)         35.62     59 2.715  <2e-16 ***
s(time,series) 104.77    299 2.223  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.699   Deviance explained = 77.2%
-REML = 734.91  Scale est. = 0.29908   n = 576

But… could we get a better fit?

  • Short answer: Probably!

  • Longer answer: There’s always a trade-off: a more complex model can account for more of the variation in the data, but is more prone to overfitting.

  • However, these two models were pretty simple and we’ll see in the next part of the training how we can get to another level of model complexity.

Model comparison

  • We won’t delve into model comparison here, as it falls outside the scope of this training. However, a few options are available for comparison, such as using AIC and evaluating predictive performance (a quick AIC sketch follows).
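For example, since both models are fitted mgcv objects, a simple comparison by AIC looks like this (lower values indicate a better expected fit after penalizing complexity); a sketch:

# compare the two fitted models by AIC (lower = better)
AIC(modG, modGS)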

4 Part 2: Digging deeper with Dynamic (and Hierarchical!) Generalized Additive Models

Above, we modelled how plankton counts fluctuated through time with a hierarchical structure to understand how different groups vary through time. We noticed a strong seasonal pattern in these fluctuations.

Before we continue, let’s “zoom out” a little and think about how our model relates to ecology. Our model is saying that plankton abundances generally follow the seasonal fluctuations in temperature in Lake Washington, and that different plankton groups fluctuate a bit differently. All we have to explain these fluctuations through time is (1) time itself, (2) plankton groups, and (3) temperature through time.

Are there other factors that could explain why plankton groups vary through time during this period? Yes, absolutely! There are unmeasured environmental factors that play some role here - for example, water chemistry - as well as some variations due to the process of measuring these plankton counts through time (i.e., measurement variability or observation error). These factors all leave an “imprint” in the time series we have here. And most importantly, these factors vary through time - they are themselves time series with their own temporal patterns.

But, do we have any measured predictors to add to the model to capture the influence of these factors? Unfortunately, as is often the case in ecological studies, we do not have measurements of these predictors through time.

There are three important things to think about here:

  1. There are unmeasured predictors that influence the trend we measured and are now modelling. The signature of these unmeasured processes is in the data, but we cannot capture it with the previous model because we did not include these predictors. This is called “latent variation”;

  2. Each species responds to these unmeasured predictors in its own way. (This is where the “hierarchical” bit comes in!)

  3. These processes are not static - they are dynamic. Like our response variable, these unmeasured predictors vary through time.

Pause: Let’s talk latents

You may have noticed that we slipped a new term into the previous section: “latent variation”. The definition of “latent” in the Merriam-Webster dictionary is:

Present and capable of emerging or developing but not now visible, obvious, active, or symptomatic.

OR

a fingerprint (as at the scene of a crime) that is scarcely visible but can be developed for study

It can be helpful to use this definition to conceptualise latent variation in our data. As we said above, there are “imprints” of factors that we didn’t measure on the time series we are modelling. These signals are present and influence the trends we estimate, but are “scarcely visible” because we lack information to detect them.

In statistical models, latent variables are essentially random predictors that are generated during the modelling process to capture correlated variations between multiple responses (species, for example). The idea is that a bunch of these latent variables are randomly generated, and they are penalized until the model only retains a minimal set of latent variables that capture the main axes of covariation between responses. (Here, it can be useful to think of how a PCA reduces the dimensions of a dataset to a minimum of informative “axes”). Each species is then assigned a “factor loading” for each latent variable, which represents the species’ response to the latent variable.

In a temporal model, latent variables condense the variation that is left unexplained into a “predictor” that captures some structured pattern across responses (e.g. species’ abundance trends) through time. For example, these latent variables can capture temporal autocorrelation between observations, meaning that we can tell our model to account for the fact that each observation is probably closer to neighbouring values through time (e.g. year 1 and year 2) than to a value that is several time steps away (e.g. year 1 and year 20).

In a spatial model, latent variables can be applied to capture dynamic processes in space, such as spatial autocorrelation. Similarly to the temporal model, we often make the assumption that observations that are closer in space are more similar than observations that are further apart, because ecological patterns are in part driven by the environment, which is itself spatially-autocorrelated.

Mini pause: Let’s talk dynamic

Ok, and what do we mean by “dynamic”? A dynamic factor model assumes the factors that predict our response variable (e.g. biomass) evolve as time series too. In other words, a dynamic factor model provides even more flexibility to capture the “wiggly” effects of these predictors on our response variable through time.
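To build intuition, here is a toy simulation (ours, not mvgam code) of the dynamic-factor idea: one latent AR(1) time series that three hypothetical species respond to through their own loadings:

# simulate one latent AR(1) trend shared by three hypothetical species
set.seed(1)
n_time <- 100
latent <- as.numeric(arima.sim(model = list(ar = 0.8), n = n_time))

# each species "loads" on the latent trend with its own strength and sign
loadings <- c(species_A = 0.9, species_B = -0.4, species_C = 0.2)
y_sim <- sapply(loadings, function(l) l * latent + rnorm(n_time, sd = 0.3))

# species sharing a latent trend co-vary through time
matplot(y_sim, type = "l", lty = 1,
        xlab = "Time", ylab = "Simulated abundance")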

Okay, now that we’ve covered what a dynamic model is, let’s make one!

Modelling multivariate time series with mvgam

The mvgam package

The package mvgam is an interface to Stan, which is a language used to specify Bayesian statistical models. You could code these models directly in Stan if you wanted to - but mvgam allows you to specify the model in R (using the R syntax you know and love) and produces and compiles the Stan file for your model for you. Isn’t that nice? That means you can focus on thinking about what your model should be doing, rather than on learning a new programming language (which can be great to learn, too!).

The mvgam package has a lot more functionalities (many observation families for different types of data, forecasting functions, and more) than we will be using here. We really recommend that you have a look at the quite impressive amount of documentation about the package if you’re interested in specifying different models with your own data. See the seminars, tutorials, vignettes, and more here: nicholasjclark.github.io/mvgam.

A little warning

In Part 2, we will build a Bayesian model but will not discuss Bayesian theory or practices in detail. This is just a quick tutorial to hit the ground running, but before building your own model, you should have a better grasp of how Bayesian models work, and how to build them with care. To learn more, a good starting point is Bayes Rules! An Introduction to Applied Bayesian Modeling.

Build a hierarchical model with mvgam

We have already loaded the mvgam package, along with the marginaleffects package that we will be using to visualize the model.

Our model is an attempt to estimate how plankton vary in abundance through time. Let’s consider what we know about the system, to help us build this model:

  1. We know that there is strong seasonal variation due to temperature changes within each year that drives all plankton groups to fluctuate through time.

  2. We are also interested in how plankton abundance is changing annually, on the longer term of the time series.

  3. We know that these plankton are embedded in a complex system within a lake, and that their dynamics may be interdependent for many reasons. In other words, some groups may be correlated through time, and even more complicated - these correlations may not be happening immediately. There may be lagged correlations between groups as well!

Let’s build this model piece by piece, to capture each of these levels of variation:

First, let’s split the data into a training set, used to build the model, and a testing set, used to evaluate the model.

plankton_train <- plankton_data %>%
  dplyr::filter(time <= 112)
plankton_test <- plankton_data %>%
  dplyr::filter(time > 112)

Specify the model

Next, we’ll build a hierarchical GAM with a global smoother for all groups, and species-specific smooths in mvgam.

This will allow us to capture the “global” seasonality that drives all plankton groups to fluctuate similarly through the time series, and to capture how each group’s seasonal fluctuations differ from this overall, global trend.

notrend_mod <- mvgam(formula = 
                       y ~ 
                       # tensor of temp and month to capture
                       # "global" seasonality across plankton groups
                       te(temp, month, k = c(4, 4)) +
                       
                       # series-specific deviation tensor products
                       # in other words, each plankton group can deviate from the
                       # global trend in its own way
                       te(temp, month, k = c(4, 4), by = series),
                     
                     # set the observation family to Gaussian
                     family = gaussian(),
                     
                     # our long-format dataset, split into a training and a testing set
                     data = plankton_train,
                     newdata = plankton_test,
                     
                     # no latent trend model for now (so we can see its impact later on!)
                     trend_model = 'None',
                     
                     # compilation & sampling settings
                     use_stan = TRUE,
                     # here, we are only going to use the default sampling settings to keep the
                     # model quick for the tutorial. If you were really going to run
                     # this, you should set the chains, samples, and burnin arguments.
)

Let’s look at the Stan code that mvgam just wrote and ran for us:

stancode(notrend_mod)
// Stan model code generated by package mvgam
functions {
vector rep_each(vector x, int K) {
int N = rows(x);
vector[N * K] y;
int pos = 1;
for (n in 1:N) {
for (k in 1:K) {
y[pos] = x[n];
pos += 1;
}
}
return y;
}
}
data {
int<lower=0> total_obs; // total number of observations
int<lower=0> n; // number of timepoints per series
int<lower=0> n_sp; // number of smoothing parameters
int<lower=0> n_series; // number of series
int<lower=0> num_basis; // total number of basis coefficients
vector[num_basis] zero; // prior locations for basis coefficients
matrix[total_obs, num_basis] X; // mgcv GAM design matrix
array[n, n_series] int<lower=0> ytimes;  // time-ordered matrix (which col in X belongs to each [time, series] observation?)
matrix[15,45] S1; // mgcv smooth penalty matrix S1
matrix[15,45] S2; // mgcv smooth penalty matrix S2
matrix[15,45] S3; // mgcv smooth penalty matrix S3
matrix[15,45] S4; // mgcv smooth penalty matrix S4
matrix[15,45] S5; // mgcv smooth penalty matrix S5
matrix[15,45] S6; // mgcv smooth penalty matrix S6
int<lower=0> n_nonmissing; // number of nonmissing observations
vector[n_nonmissing] flat_ys; // flattened nonmissing observations
matrix[n_nonmissing, num_basis] flat_xs; // X values for nonmissing observations
array[n_nonmissing] int<lower=0> obs_ind; // indices of nonmissing observations
}
parameters {
// raw basis coefficients
vector[num_basis] b_raw;
// gaussian observation error
vector<lower=0>[n_series] sigma_obs;
// smoothing parameters
vector<lower=0>[n_sp] lambda;
}
transformed parameters {
// basis coefficients
vector[num_basis] b;
b[1:num_basis] = b_raw[1:num_basis];
}
model {
// prior for (Intercept)...
b_raw[1] ~ student_t(3, -0.1, 2.5);
// prior for te(temp,month)...
b_raw[2:16] ~ multi_normal_prec(zero[2:16],S1[1:15,1:15] * lambda[1] + S1[1:15,16:30] * lambda[2] + S1[1:15,31:45] * lambda[3]);
// prior for te(temp,month):seriesBluegreens...
b_raw[17:31] ~ multi_normal_prec(zero[17:31],S2[1:15,1:15] * lambda[4] + S2[1:15,16:30] * lambda[5] + S2[1:15,31:45] * lambda[6]);
// prior for te(temp,month):seriesDiatoms...
b_raw[32:46] ~ multi_normal_prec(zero[32:46],S3[1:15,1:15] * lambda[7] + S3[1:15,16:30] * lambda[8] + S3[1:15,31:45] * lambda[9]);
// prior for te(temp,month):seriesGreens...
b_raw[47:61] ~ multi_normal_prec(zero[47:61],S4[1:15,1:15] * lambda[10] + S4[1:15,16:30] * lambda[11] + S4[1:15,31:45] * lambda[12]);
// prior for te(temp,month):seriesOther.algae...
b_raw[62:76] ~ multi_normal_prec(zero[62:76],S5[1:15,1:15] * lambda[13] + S5[1:15,16:30] * lambda[14] + S5[1:15,31:45] * lambda[15]);
// prior for te(temp,month):seriesUnicells...
b_raw[77:91] ~ multi_normal_prec(zero[77:91],S6[1:15,1:15] * lambda[16] + S6[1:15,16:30] * lambda[17] + S6[1:15,31:45] * lambda[18]);
// priors for smoothing parameters
lambda ~ normal(5, 30);
// priors for observation error parameters
sigma_obs ~ student_t(3, 0, 2.5);
{
// likelihood functions
vector[n_nonmissing] flat_sigma_obs;
flat_sigma_obs = rep_each(sigma_obs, n)[obs_ind];
flat_ys ~ normal_id_glm(flat_xs,
0.0,b,
flat_sigma_obs);
}
}
generated quantities {
vector[total_obs] eta;
matrix[n, n_series] sigma_obs_vec;
matrix[n, n_series] mus;
vector[n_sp] rho;
array[n, n_series] real ypred;
rho = log(lambda);
// posterior predictions
eta = X * b;
for (s in 1:n_series) {
sigma_obs_vec[1:n,s] = rep_vector(sigma_obs[s], n);
}
for(s in 1:n_series){
mus[1:n, s] = eta[ytimes[1:n, s]];
ypred[1:n, s] = normal_rng(mus[1:n, s], sigma_obs_vec[1:n, s]);
}
}

And finally, the model summary.

summary(notrend_mod, include_betas = FALSE)
GAM formula:
y ~ te(temp, month, k = c(4, 4)) + te(temp, month, k = c(4, 4), 
    by = series)

Family:
gaussian

Link function:
identity

Trend model:
None

N series:
5 

N timepoints:
120 

Status:
Fitted using Stan 
4 chains, each with iter = 1000; warmup = 500; thin = 1 
Total post-warmup draws = 2000


Observation error parameter estimates:
             2.5%  50% 97.5% Rhat n_eff
sigma_obs[1] 0.76 0.87  1.00    1  2620
sigma_obs[2] 0.58 0.66  0.76    1  2721
sigma_obs[3] 0.80 0.92  1.10    1  2954
sigma_obs[4] 0.66 0.75  0.87    1  2969
sigma_obs[5] 0.63 0.73  0.85    1  2457

GAM coefficient (beta) estimates:
             2.5%    50% 97.5% Rhat n_eff
(Intercept) -0.11 -0.044 0.023    1  2869

Approximate significance of GAM smooths:
                                  edf Ref.df Chi.sq p-value    
te(temp,month)                   5.76     15  52.31 2.4e-05 ***
te(temp,month):seriesBluegreens  2.86     15   3.98    1.00    
te(temp,month):seriesDiatoms     3.34     15  56.02    0.11    
te(temp,month):seriesGreens      2.69     15   1.01    1.00    
te(temp,month):seriesOther.algae 1.34     15   2.83    1.00    
te(temp,month):seriesUnicells    2.49     15   6.47    0.99    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Stan MCMC diagnostics:
n_eff / iter looks reasonable for all parameters
Rhat looks reasonable for all parameters
0 of 2000 iterations ended with a divergence (0%)
0 of 2000 iterations saturated the maximum tree depth of 10 (0%)
E-FMI indicated no pathological behavior

Samples were drawn using NUTS(diag_e) at Mon Feb 24 15:41:29 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split MCMC chains
(at convergence, Rhat = 1)

Use how_to_cite(notrend_mod) to get started describing this model

Visualise the model

Let’s first plot the global smoother for all species over time:

plot_mvgam_smooth(notrend_mod)

This is the shared seasonal trend estimated across all groups at once. We can see that the model was able to capture the seasonal temporal structure of the plankton counts, based on temperature and time (months).

We can also see how each group deviates from this global smooth (i.e. the partial effects):

gratia::draw(notrend_mod$mgcv_model)

The colour shows the strength of the partial effect. We can see that the model was able to capture some differences between each plankton group’s seasonal trend and the community’s global trend.

Inspect the model

A great test of a model is to see how well it forecasts data we already have. We split the data into a training set and a test set above. Let’s see how well the trained model predicts this test set!

sapply(1:5, function(x) plot(notrend_mod, type = 'forecast', series = x))
Out of sample CRPS:
6.76106172887687
Out of sample CRPS:
6.74284136216226
Out of sample CRPS:
4.14878078336935
Out of sample CRPS:
3.52581136668436
Out of sample CRPS:
2.83716936348026

The points are the data points, and the red line and ribbon show the forecast trend and its credible intervals. Overall, these forecasts are okay, but not perfect - the data points are often within the credible intervals, and the forecasted trend seems to follow the seasonal trend pretty well. The model seems to understand that there is a strong seasonal trend in our observations, and is trying to predict it for each plankton group.

But how’s the model doing, overall? Let’s plot the residuals:

sapply(1:5, function(x) plot_mvgam_resids(notrend_mod, series = x))

Looking at the Autocorrelation Function plots (ACF), we can see that there’s still a lot of temporal autocorrelation to deal with in the model. Let’s try to address some of this with a dynamic model!

Dynamic model

Let’s now add a dynamic trend component to the model, to capture some of the variation that isn’t captured in our previous model.

You can specify many kinds of dynamic trends with mvgam, but we won’t go into much detail here. We want to demonstrate the power of this approach, but you will need to select the dynamic trend model that works best for you. Here, we will use a Gaussian Process to model temporal autocorrelation in our data, which can be specified as GP in mvgam.

gptrend_mod <- mvgam(formula=  
                        y ~ 
                        te(temp, month, k = c(4, 4)) +
                        te(temp, month, k = c(4, 4), by = series),
                      
                      # latent trend model: a Gaussian Process trend, which
                      # captures smoothly-decaying temporal autocorrelation
                      trend_model = "GP",  
                      
                      # use dynamic factors to estimate the latent trends
                      use_lv = TRUE,
                      
                      # observation family
                      family = gaussian(),
                      
                      # our long-format datasets
                      data = plankton_train,
                      newdata = plankton_test,
                      
                      # use reduced samples for inclusion in tutorial data
                      samples = 100)

Inspect the model

This workshop is only a quick glimpse of how to build a Bayesian model. If you want to learn more about how to inspect your model and understand how “good” it is, please check out: Bayes Rules! An Introduction to Applied Bayesian Modeling, Chapter 10: Evaluating Regression Models.

Posterior predictive checks are one way to investigate a Bayesian model’s ability to make unbiased predictions. In this check, we’ll be measuring the posterior prediction error of our model - in other words, how close are our model predictions to the expected value of Y?

These plots compare the observed data (solid black line) to the posterior predictions drawn from the fitted model (red line and credible interval ribbons). If our model is unbiased, these two lines should match pretty well. If there is a bias, we would see that the model predictions are very different from the black solid line.

sapply(1:5, FUN = function(x) ppc(gptrend_mod, series = x, type = 'density'))

These plots are pretty encouraging - our model isn’t generating biased predictions compared to the observation data (black line). Our model is doing a pretty good job!

Let’s check out the model summary - how does it differ from the model without the dynamic factors?

summary(gptrend_mod, include_betas = FALSE)
GAM formula:
y ~ te(temp, month, k = c(4, 4)) + te(temp, month, k = c(4, 4), 
    by = series)

Family:
gaussian

Link function:
identity

Trend model:
GP

N latent factors:
2 

N series:
5 

N timepoints:
120 

Status:
Fitted using Stan 
4 chains, each with iter = 600; warmup = 500; thin = 1 
Total post-warmup draws = 400


Observation error parameter estimates:
             2.5%  50% 97.5% Rhat n_eff
sigma_obs[1] 0.36 0.42  0.48 1.00   568
sigma_obs[2] 0.55 0.62  0.72 1.01   390
sigma_obs[3] 0.79 0.91  1.10 1.00   595
sigma_obs[4] 0.62 0.71  0.80 1.00   600
sigma_obs[5] 0.52 0.61  0.71 1.00   499

GAM coefficient (beta) estimates:
             2.5%    50% 97.5% Rhat n_eff
(Intercept) -0.16 -0.057 0.042 1.01   528

Approximate significance of GAM smooths:
                                   edf Ref.df Chi.sq p-value  
te(temp,month)                   4.692     15  30.53   0.045 *
te(temp,month):seriesBluegreens  2.751     15   9.26   0.940  
te(temp,month):seriesDiatoms     3.755     15  50.18   0.201  
te(temp,month):seriesGreens      2.241     15   0.79   1.000  
te(temp,month):seriesOther.algae 3.877     15  10.41   0.906  
te(temp,month):seriesUnicells    0.933     15  16.65   0.984  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Latent trend length scale (rho) estimates:
          2.5% 50% 97.5% Rhat n_eff
rho_gp[1]  2.1 4.6   8.8 1.01   294
rho_gp[2]  1.4 4.3  24.0 1.02   188
Warning: Supplying trend_model as a character string is deprecated
Please use the dedicated functions (i.e. RW() or ZMVN()) instead
This warning is displayed once per session.

Stan MCMC diagnostics:
n_eff / iter looks reasonable for all parameters
Rhats above 1.05 found for 24 parameters
 *Diagnose further to investigate why the chains have not mixed
0 of 400 iterations ended with a divergence (0%)
0 of 400 iterations saturated the maximum tree depth of 10 (0%)
E-FMI indicated no pathological behavior

Samples were drawn using NUTS(diag_e) at Thu Feb 27 23:14:09 2025.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split MCMC chains
(at convergence, Rhat = 1)

Use how_to_cite(gptrend_mod) to get started describing this model

Visualise the model

Plot the global smoother for all species over time:

plot_mvgam_smooth(gptrend_mod)

This is the shared temporal trend estimated across all species at once. We can see that, overall, species’ biomasses declined during this time period.

Plot the species’ estimated trends:

sapply(1:5, function(x) plot_mvgam_trend(gptrend_mod, series = x))

And, importantly, let’s have a look at the dynamic factors estimated by the model to capture temporal dynamics in our response data:

plot(gptrend_mod, type = 'factors')

These are the famous dynamic latent variables we’ve been talking about. You can see that they are capturing some of the unexplained temporal variation in our model. The model actually generates a bunch of these and penalizes them harshly, so that only latent variables with high explanatory power are retained. This is the result!

Partial effects

We can plot the partial effect of each variable we included in our model, to understand how they contribute to the overall model.

Let’s start with the partial effect of temperature on plankton counts across groups:

plot_predictions(gptrend_mod, condition = c("temp"), conf_level = 0.9)

Next, the partial effect of seasonality (here, months) on plankton counts across groups:

plot_predictions(gptrend_mod, condition = c("month"), conf_level = 0.9)

And finally, the effect of different plankton groups on the model:

plot_predictions(gptrend_mod, condition = c("series"), conf_level = 0.9)

Calculate the rate of change in plankton counts

We can add each group’s estimated rate of change in abundance through time to these plots. Think of this as the slope of their estimated trend at each time point - this can give us an idea of the rate at which some groups declined or grew during the time series. If the derivative trend is centred on the zero line, the response variable (here, counts) fluctuated around its mean abundance rather than declining or growing in a consistent way through time.

As an example, let’s look at the blue-green algae trend, where the first panel is the estimated trend of the blue-green algae counts through time, and the bottom panel is the rate of change in these counts over time. In the second panel, negative values mean there was a decline in counts, and positive values mean there was an increase in counts. This can be very useful to calculate population growth rates through time:

plot_mvgam_trend(gptrend_mod, series = 1, derivatives = TRUE)

It seems that these blue-green algae were pretty stable in the beginning of the time series, then declined slightly, then declined quickly around time step 100. The counts then increased after this point.

We made a slightly “hacked” version of this function to extract the derivatives:

source(here::here("saved-objects/custom-function.R"))
derivs = lapply(1:5, function(x) plot_mvgam_trend_custom(gptrend_mod, series = x, derivatives = TRUE))
names(derivs) = gptrend_mod$obs_data$series |> levels()

Let’s look at a histogram of the derivatives to get an idea of how each group’s counts changed over time, on average:

# reformatting the data for easier plotting
df = lapply(derivs, function(x) x[,-1]) |> 
  lapply(as.data.frame) |>
  lapply(pivot_longer, cols = everything()) |>
  bind_rows(.id = "Group")

# make a histogram of rates of changes in counts for each plankton group
(plot_trenddensity = 
   ggplot(data = df) +
   geom_histogram(aes(x = value, fill = after_stat(x)), 
                  col = "black", linewidth = .2, bins = 19) + 
   geom_vline(xintercept = mean(df$value, na.rm = TRUE)) +
   geom_vline(xintercept = mean(df$value, na.rm = TRUE) - sd(df$value, na.rm = TRUE), lty = 2) +
   geom_vline(xintercept = mean(df$value, na.rm = TRUE) + sd(df$value, na.rm = TRUE), lty = 2) +
   theme(panel.grid.major.x = element_line()) +
   scale_y_sqrt() +
   labs(x = "Rate of change", 
        y = "Frequency", 
        fill = "Rate of change") +
   scale_fill_distiller(palette = "RdYlGn", 
                        direction = 1, 
                        limits = c(-.4,.4)) +
   coord_cartesian(xlim = c(-.4, .4))) +
  facet_wrap(~Group, ncol = 2) +
  theme(legend.position = "top")

The median of the distribution of rates of change is ~0 for all the groups - but does this mean they didn’t vary throughout the time series? Not at all! Look at the difference in the spread of rates of change between Bluegreens and the other groups. Bluegreens went through several periods of large growth (green) and of large declines (red), while Diatoms and Greens varied relatively little (all of their rates of change are at or close to 0). If you scroll back up to see the predicted trends for each group, do these histograms make sense to you?

Species correlations

One way to investigate community dynamics is to check out the correlations between plankton groups. These correlations are captured with the dynamic model, which estimates unmeasured temporal processes in our data.

Let’s calculate and plot species’ temporal correlations and plot them as a heatmap:

# Calculate the correlations from the latent trend model
species_correlations = lv_correlations(gptrend_mod)

# prepare the matrix for plotting
toplot = species_correlations$mean_correlations
colnames(toplot) = gsub("_", "\n", stringr::str_to_sentence(colnames(toplot)))
rownames(toplot) = gsub("_", "\n", stringr::str_to_sentence(rownames(toplot)))

# plot as a heatmap
corrplot::corrplot(toplot, 
                   type = "lower",
                   method = "color", 
                   tl.cex = 1, cl.cex = 1, tl.col = "black", font = 3)

Model comparison

So, which model is best?

We will use leave-one-out comparison to compare the two models we just built: notrend_mod, which does not have a dynamic trend, and gptrend_mod, which includes a Gaussian Process latent trend model.

This workshop isn’t really about model comparison, so we will just use this approach as an example, but there are other ways you could compare these models.

One way is to compare the predictive performance of the models on new data using a metric of comparison like expected log pointwise predictive density (ELPD).

mvgam::loo_compare(notrend_mod, gptrend_mod,
                   model_names = c("No trend", "Dynamic trend"))
Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
              elpd_diff se_diff
Dynamic trend    0.0       0.0 
No trend      -101.6      11.7 

The dynamic trend model performs better than the model with no trend (here, we want higher, more positive, values of elpd_diff).

Another way is to consider the ecological processes that influence our response variable. Which model do you think does a better job of representing the factors that cause plankton counts to vary through time?

5 Wrap up

We hope you’ve enjoyed this short tutorial on hierarchical generalized additive models with mgcv and mvgam.

At the end of this tutorial, we hope you now feel you understand how a hierarchical model works, and how it can be used to capture nonlinear effects. The R packages mgcv and mvgam are great tools to build and fit hierarchical models!

We also covered dynamic modelling, and how latent variables can be used to capture dynamic processes like temporal or spatial autocorrelation. Can you use latent variables to explain variation in your own models?

6 Thank you

Thank you to BIOS2 and Gracielle Higino, Andrew Macdonald, and Kim Gauthier-Schampaert for supporting this workshop. Special thanks to Nicholas J. Clark for helpful suggestions and for providing much of the material we used to build this workshop.

If you’re interested in talking more about these models, please join our community calls during March and April 2025! We welcome anyone interested in GAMs, computational ecology, or eager to learn more about HGAMs to participate in these sessions.

Each discussion will focus on the outstanding ecological questions that we could answer with HGAMs, highlighting a wide array of potential applications for specific types of ecological and evolutionary data. Join us in thinking about how we could use HGAMs to push ecological research forward!