Multi-species occupancy models in R

In this tutorial, I cover:

Fitting multi-species single-season occupancy models using jagsUI in R.

The code and sample data from this tutorial are on my Occupancy Workshop GitHub Repo.

Multi-Species Occupancy Models

Occupancy models use repeated visits to estimate the probability of sites being occupied by species. Originally built for one, and then a small number of dependent species, they can be extended for when you have community data that includes detections of many species. For example, data might be from a vegetation survey, a series of point-counts for birds, or a camera trap survey.

One way to consider modelling this type of data is to fit single-species occupancy models for each species or a subset of species in the data set. However, community data typically have a high proportion of rare species with very few detections. Multi-species occupancy models take advantage of the community data so that estimates of occupancy and detection probability for rare species benefit from data on more common species.

The two basic parameters in multi-species occupancy models are similar to those in single-season single-species occupancy models, but are expanded to include more than one species by treating species similar to random effects.

detection probability (\(p_{k,i,t}\)), the probability of finding species k at site i at time t
occupancy probability (\(\psi_{k,i}\)), the probability species k occurs at site i.

We’re going to start with single-season occupancy models so the occupancy status of a species does not change during the study (i.e., sites are either occupied or unoccupied and do not transition between states).

Let’s look at an example.

Loading and organizing data

To get started, we need data describing on which surveys that each species was detected or not detected. My previous tutorials used example data for one species, but for this tutorial I expanded the data set to include 18 species. The data represent auditory detection data of calling amphibians at sites visited repeatedly.

The detection history data can be values of 0, 1, or NA.

The structure of community detection data can vary. They can be stacked tables or dataframes (an array) with rows for sites, columns for surveys, and the third dimension used for different species (our example data are in this form). Alternately, the data are sometimes summarized with sites as rows, columns for each species, and the values representing the number of surveys that a species was detected at a site.

# Load libraries
library(dplyr) # for data organization
library(ggplot2) # for plotting

# Load arrary of community detection history 
# 100 x (sites) x 10 columns (surveys) x 18 species
# It is the same structure as a single species detection history, 
# but stacked with 17 additional species
load(file = "community_detection_history.RData")

# Structure
str(community_detection_history)

##  int [1:100, 1:10, 1:18] 0 1 1 0 0 1 0 1 0 0 ...

# Look at part of one species' detection history (species 10)
head(community_detection_history[,,10])

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    1    0    1    1    1    1    0    1    1     0
## [2,]    0    0    0    0    0    0    0    0    0     0
## [3,]    0    0    0    0    0    0    0    0    0     0
## [4,]    0    0    0    0    0    0    0    0    0     0
## [5,]    1    1    1    1    0    1    0    0    1     1
## [6,]    0    0    0    0    0    0    0    0    0     0

# Restructure detection history data to summarize how many surveys a species was detected at
y <- matrix(data = NA, 
            nrow = nrow(community_detection_history), 
            ncol = dim(community_detection_history)[3])

# Use a simple loop to fill in y
# for each site row
for(i in 1:nrow(community_detection_history)){ 
  # for each species column
  for(k in 1:dim(community_detection_history)[3]){ 
    # sum the number of times a species was detected
    y[i,k] <- sum(community_detection_history[i,,k]) 
  }
}

How Multi-Species Occupancy Models (MSOMs) Work

In previous work, we fit occupancy models using maximum likelihood, but for fitting multi-species occupancy models, we will be working in a Bayesian world. If this is your first venture into Bayesian models, I suggest reading some introductions, such as:

We’re going to use Just Another Gibbs Sampler (JAGS) via the jagsUI package. However, most of this approach translates to similar packages based on the BUGS language, such as NIMBLE, with a few adjustments.

We will build a simple multi-species occupancy model where we assume variation between species in occupancy and detection probability is normally distributed between species, which is conceptually similar to having a random effects term for species in a generalized linear mixed-effects model.

More specifically, we will build a model where:

our detection history (\(y_{i,k}\)) is binomially distributed (0’s and 1’s) based on the product of detection probability (\(p_{i,k}\)) and occupancy state (\(z_{i,k}\)).
Occupancy status (0 or 1) is a Bernoulli draw from the occupancy probability (\(\psi_k\)).
The logit of occupancy probability (logit(\(\psi_k\))) is normally distributed among species with a community mean (\(\mu_{lpsi}\)) and standard deviation (\(\sigma_{lpsi}\)).
The logit of detection probability (logit(\(p_k\))) is normally distributed among species with a community mean (\(\mu_{lp}\)) and standard deviation (\(\sigma_{lp}\)).
We use priors based on Beta distributions (\(\alpha\) = 1, \(\beta\) = 1) for the community means of detection probability and occupancy probability, which bound the estimates between 0 and 1.
We use priors based on uniform distributions between 0 and 5 for the standard deviation of species detection probability and occupancy probability.

Mathematically:

\[\begin{align*} logit(\psi_k) \sim {\sf Normal}(\mu_{lpsi}, 1/\sigma_{lpsi}^2)\\ \mu_{lpsi} \sim logit(\overline{\psi})\\ \overline{\psi} \sim {\sf Beta}(\alpha = 1,\beta = 1)\\ \sigma_{lpsi} \sim {\sf Uniform}(min = 0,max = 5)\\ \\ logit(p_k) \sim {\sf Normal}(\mu_{lp}, 1/\sigma_{lp}^2)\\ \mu_{lp} \sim logit(\overline{p})\\ \overline{p} \sim {\sf Beta}(\alpha = 1,\beta = 1)\\ \sigma_{lp} \sim {\sf Uniform}(min = 0,max = 5)\\ \end{align*}\]

I suggest you read more about how prior selection can affect parameter estimation in Bayesian occupancy models. We will leave out all covariates for now, but will add them in during the next tutorial.

# Load R to jags library
# If installing for the first time, you also need to install jags.
# https://mcmc-jags.sourceforge.io/
library(jagsUI)

# Convert detection history to numeric
z <- (y > 0)*1  # *1 converts to numeric, keeps matrix
z[z == 0] <- NA

# Pack up data for JAGS
msom_jags_data <- list(y = y, # y = the detection history matrix
                 nSobs = ncol(y), # nSobs = number of species observed
                 nSites = nrow(y),  # nSites = number of sites
                 nOcc = ncol(community_detection_history), # nOcc = number of surveys
                 z = z) # z =  input detection data

# Look at structure
str(msom_jags_data)

## List of 5
##  $ y     : int [1:100, 1:18] 5 7 6 4 4 6 5 6 6 7 ...
##  $ nSobs : int 18
##  $ nSites: int 100
##  $ nOcc  : int 10
##  $ z     : num [1:100, 1:18] 1 1 1 1 1 1 1 1 1 1 ...

# Hierarchical model
# For fitting basic multi-species occupancy models, we can think of the components as
# 1. Ecological model
# 2. Observation model
# 3. Priors that describe species level variation in psi and p
# 4. Hyperpriors that describe parameters for the community 

# Create and save file name "msom_simple.jags"
# Identify filepath of model file
msom_simple <- tempfile()

#Write model to file
writeLines("
model{
  for(k in 1:nSobs){  # Loop through species
    # Likelihood
    for(i in 1:nSites) { # Loop through sites
      # Ecological model
      z[i, k] ~ dbern(psi[k])
      # Observation model
      y[i, k] ~ dbin(p[k] * z[i, k], nOcc)
    }
    
    # Priors for Psi (species level)
    lpsi[k] ~ dnorm(mu.lpsi, tau.lpsi)
    psi[k] <- ilogit(lpsi[k])
    
    # Priors for p (species level)
    lp[k] ~ dnorm(mu.lp, tau.lp)
    p[k] <- ilogit(lp[k])
  }
  
  # Hyperpriors for Psi (community level)
  psi.mean ~ dbeta(1, 1)
  mu.lpsi <- logit(psi.mean)
  sd.lpsi ~ dunif(0, 5)
  tau.lpsi <- 1/sd.lpsi^2
  
  # Hyperpriors for p (community level)
  p.mean ~ dbeta(1, 1)
  mu.lp <- logit(p.mean)
  sd.lp ~ dunif(0, 5)
  tau.lp <- 1/sd.lp^2
}
", con = msom_simple)

# Vector of monitored parameters
wanted <- c("psi.mean", "p.mean", "psi", "p")
# Could add other parameters in model: "mu.lpsi", "sd.lpsi", "tau.lpsi",
# "mu.lp", "sd.lp", "tau.lp"

# model run takes <1 min on a macbook
msom_simple_out <- jags(msom_jags_data, NULL, wanted, msom_simple,
                        n.chains = 3, n.adapt = 100, n.iter = 10000, 
                        n.burnin = 5000, n.thin = 2, parallel = TRUE, DIC = FALSE)

## 
## Processing function input....... 
## 
## Done. 
##  
## Beginning parallel processing using 3 cores. Console output will be suppressed.
## 
## Parallel processing completed.
## 
## Calculating statistics....... 
## 
## Done.

We didn’t get any error messages, but that we need to dig deeper than that. The first thing we’ll do is check the chains for mixing and convergence. I’ve commented out the call to plot traceplots for each parameter (plot(msom_simple_out)) and instead show a screenshot for one page of plots for four parameters.

# Plot traces and density estimates 
# (not showing because it will show many pages of plots)
# plot(msom_simple_out)

What are we looking for in the plots?

For the traces (left side plots), we are looking for good mixing and convergence among chains. Each chain is displayed in a different colour. This means random paths exploring a lot of the parameter space on the y-axis without a clear pattern and each chain converging on the same value. Read more here: https://m-clark.github.io/bayesian-basics/diagnostics.html#monitoring-convergence
\(\hat{R}\) values that are very close to 1.0. This is a test statistic for testing if the variance within chains is different than the variance between chains. It is meant to test if each chain was sampling from similar distributions.
Density plots (right side plots) that are not super-wide or with irregular peaks. The more parameter space the density plots include, the higher the uncertainty in a parameter estimate. The density curves don’t have to be normal but shouldn’t have multiple peaks and each chain colour should have approximately the same peak.

Now let’s look at the summary of parameter posterior distributions (model estimates). In the summary table, each row is a parameter from the model and the columns show the mean, uncertainty, and distribution.

# Summary of results (using head() to just show higher levels)
# Full summary includes psi and p estimates for each species
head(summary(msom_simple_out))

##               mean         sd       2.5%       25%       50%       75%
## psi.mean 0.4472205 0.07805864 0.29511415 0.3948868 0.4460298 0.4986848
## p.mean   0.4587719 0.03630463 0.38821261 0.4351796 0.4585168 0.4822607
## psi[1]   0.8099448 0.03871744 0.72846716 0.7853008 0.8121030 0.8367721
## psi[2]   0.2331477 0.04250463 0.15583775 0.2036380 0.2314763 0.2607335
## psi[3]   0.1410031 0.03416597 0.08021745 0.1168710 0.1390492 0.1622683
## psi[4]   0.6896989 0.04685211 0.59387890 0.6590001 0.6912912 0.7227093
##              97.5%     Rhat n.eff overlap0 f
## psi.mean 0.6020469 1.000247  7500        0 1
## p.mean   0.5328280 1.001088  2981        0 1
## psi[1]   0.8802377 1.000907  2145        0 1
## psi[2]   0.3208978 1.000138  7154        0 1
## psi[3]   0.2137691 1.000560  4848        0 1
## psi[4]   0.7770871 1.000667  7500        0 1

What inferences can we draw from this output?

The mean community occupancy probability is 0.45 [95% CI: 0.3, 0.6].
The mean community detection probability is 0.46 [95% CI: 0.39, 0.53].

Let’s visualize the variation in occupancy and detection probability between species.

# Put psi estimates in a dataframe
psi_species_estimates <- summary(msom_simple_out) %>%
  as.data.frame(.) %>%
  mutate(parameter = row.names(.)) %>%
  # Filtering to only the estimates of psi for each species 
  # species psi parameters in our model take the form "psi[1]"
  filter(parameter %in% c(paste0("psi","[",c(1:18),"]"))) %>%
  arrange(mean) %>%
  mutate(plot_order = 1:nrow(.))
  
# Plot psi estimates
ggplot(psi_species_estimates, aes(x = as.factor(plot_order), y = mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = `2.5%`, ymax = `97.5%`), width = 0) +
  labs(x = "Species", y = "Occupancy estimate") +
  scale_x_discrete(labels = psi_species_estimates$parameter, 
                   breaks = order)+ 
  scale_y_continuous(limits = c(0, 1.0), breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0)) +
  coord_flip() +
  theme_classic()

# Put p estimates in a dataframe
p_species_estimates <- summary(msom_simple_out) %>%
  as.data.frame(.) %>%
  mutate(parameter = row.names(.)) %>%
  # Filtering to only the estimates of psi for each species 
  # species psi parameters in our model take the form "psi[1]"
  filter(parameter %in% c(paste0("p","[",c(1:18),"]"))) %>%
  arrange(mean) %>%
  mutate(plot_order = 1:nrow(.))
  
# Plot p estimates
ggplot(p_species_estimates, aes(x = as.factor(plot_order), y = mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = `2.5%`, ymax = `97.5%`), width = 0) +
  labs(x = "Species", y = "Detection probability estimate") +
  scale_x_discrete(labels = p_species_estimates$parameter, 
                   breaks = order)+ 
  scale_y_continuous(limits = c(0, 1.0), breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0)) +
  coord_flip() +
  theme_classic()

From the plots, we can visualize how occupancy and detection probability varies within the community.

And that’s it!

We fit a multi-species occupancy model with JAGS in R and estimated detection probability and occupancy probability for a community of 18 species. That example was simple to demonstrate the principles, but your interest in these models is almost certainly to include covariates. In the next tutorial, we’ll introduce covariates into multi-species occupancy models.

Have suggestions for a tutorial you want on spatial ecology, mark-recapture models, occupancy models, or another related topic? Let me know!

Multi-Species Occupancy Models

Loading and organizing data

How Multi-Species Occupancy Models (MSOMs) Work

Further reading and references