Multi-species occupancy models in R: covariates and estimating species richness

In this tutorial, I cover fitting multi-species single-season occupancy models with covariates and estimating species richness using JAGS in R.

The code and sample data from this tutorial are available on GitHub.

Occupancy models use repeated visits to estimate the probability of sites being occupied by species using a hierarchical modelling framework, and multi-species occupancy models treat species similar to random effects. In my previous tutorial, we built multi-species occupancy models without any covariates, and here we extend that to:

Include covariates, such as habitat variables, for occupancy and detection.
Estimate the species richness at different sites using the posterior distribution of the latent state matrix.

Covariates in Multi-Species Occupancy Models (MSOMs)

We will include covariates by extending our previous model with additional terms in the linear models (with a logit link) for occupancy probability and detection probability. For each additional covariate, we allow the effect to vary between species. For example, we will add a covariate to occupancy probability for the amount of forest within a 1 km radius (scaled to mean = 0, sd = 1). Since our community of sampled species included forest specialists, generalists, and open habitat species, we’d expect that the response to forest amount to vary with species.

The changes in the model include:

Adding coefficients for the covariates in the submodel (occupancy or detection)
Defining the species prior distributions of the coefficients. We will use normal distributions for variation between species.
Defining the community level hyperpriors. For each covariate, we will use uniform prior distributions where the mean value of coefficients on the logit scale can vary between -5 and +5, and the standard deviation is has a uniform prior distribution between 0 and 5.

Mathematically, for species k, at site i:

\[\begin{align*} {\sf Model:}\\ logit(\psi_{i,k}) \sim \beta_{0,k} + \beta_{forest,k}*forest_i + \beta_{agri,k}*agri_i \\ \\ {\sf Species priors:}\\ \\ \beta_{0, k} \sim {\sf Normal}(\mu_{intercept}, 1/\sigma_{intercept}^2)\\ \beta_{forest, k} \sim {\sf Normal}(\mu_{forest}, 1/\sigma_{forest}^2)\\ \beta_{agri, k} \sim {\sf Normal}(\mu_{agri}, 1/\sigma_{agri}^2)\\ \\ {\sf Community hyperpriors:}\\ \\ \overline{\beta_0} \sim {\sf Beta}(\alpha = 1,\beta = 1)\\ \mu_0 \sim logit(\overline{\beta_0})\\ \sigma_0 \sim {\sf Uniform}(min = 0, max = 5)\\ \\ \mu_{forest} \sim {\sf Uniform}(min = -5, max = 5)\\ \sigma_{forest} \sim {\sf Uniform}(min = 0, max = 5)\\ \\ \mu_{agri} \sim {\sf Uniform}(min = -5, max = 5)\\ \sigma_{agri} \sim {\sf Uniform}(min = 0, max = 5)\\ \end{align*}\]

Estimating the number of species at a site

To estimate the number of species at each site, we will use the site x species matrix of latent (unobserved) state for occupancy (z). By summing species across all draws for a site, we can get an estimate (with uncertainty) of species richness at each sampled site. We will use these richness estimates to see how diversity is related to a covariate. We don’t need to modify the actual model, but just need to save the z-matrix (i.e., include z in wanted).

Loading and organizing data

# Load libraries
library(dplyr) # for data organization
library(ggplot2) # for plotting
library(tidyr) # for pivoting data

# Load arrary of community detection history 
# 100 x (sites) x 10 columns (surveys) x 18 species
# It is the same structure as a single species detection history, 
# but stacked with 17 additional species
load(file = "community_detection_history_covariates.RData")

community_detection_history <- community_detection_history_covariates
# Structure
str(community_detection_history)

##  int [1:100, 1:10, 1:18] 0 0 0 1 0 0 0 0 0 1 ...

# Look at part of one species' detection history (species 10)
head(community_detection_history[,,10])

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    0    0    0    0    0    0     0
## [2,]    0    0    0    0    0    0    0    0    0     0
## [3,]    0    0    0    0    0    0    0    0    0     0
## [4,]    0    0    0    0    0    0    0    0    0     0
## [5,]    0    0    1    0    0    0    0    0    0     0
## [6,]    0    0    0    0    0    0    0    0    0     0

# Restructure detection history data to summarize how many surveys a species was detected at
y <- matrix(data = NA, 
            nrow = nrow(community_detection_history), 
            ncol = dim(community_detection_history)[3])

# Use a simple loop to fill in y
# for each site row
for(i in 1:nrow(community_detection_history)){ 
  # for each species column
  for(k in 1:dim(community_detection_history)[3]){ 
    # sum the number of times a species was detected
    y[i,k] <- sum(community_detection_history[i,,k]) 
  }
}

# Convert detection history to numeric
z <- (y > 0)*1  # *1 converts to numeric, keeps matrix
z[z == 0] <- NA

# Load sitecov, effort, observers, 
site.cov <- read.csv(file = "site_cov.csv") %>%
  dplyr::select(-X)

Fit Covariate Model

library(jagsUI)

# Package data for jags, which must specify covariates
msom_covariates_jags_data <- list(y = y,# y = the detection history matrix
                                  nSobs = ncol(y), # The total number of species observed
                                  nSites = nrow(y),  # The total number of sites along the route
                                  nOcc = ncol(community_detection_history), # Each site visited 10 times
                                  z = z,
                                   # Covariates (already standardized)
                                  forest = site.cov$forest,
                                  agri = site.cov$agri)


# Hierarchical model
# For fitting basic multi-species occupancy models, we can think of the components as
# 1. an ecological model
# 2. an observation model
# 3. Priors that describe species level variation in psi and p
# 4. hyperpriors that describe parameters for the community 
# Create a temporary file "msom_covariates"
#Identify filepath of model file
msom_covariates <- tempfile()

#Write model to file
writeLines("
model{
  for(k in 1:nSobs){  # Loop through species
    # Likelihood
    for(i in 1:nSites) {
       # Ecological model
      logit(psi[i, k]) <- b0[k] + bforest[k]*forest[i] + bagri[k]*agri[i]
       z[i, k] ~ dbern(psi[i, k])
       # Observation model
       y[i, k] ~ dbin(p[k] * z[i, k], nOcc)
    }

    # Priors (species level)
    b0[k] ~ dnorm(mu.b0, tau.b0)
    mu.eta[k] <- mu.lp + rho * sd.lp/sd.b0 *
        (b0[k] - mu.b0)
    lp[k] ~ dnorm(mu.eta[k], tau.eta)
    p[k] <- ilogit(lp[k])

    bforest[k] ~ dnorm(mu.forest, tau.forest)
    bagri[k] ~ dnorm(mu.agri, tau.agri)

  }

  # Hyperpriors (community level)
  
  b0.mean ~ dbeta(1, 1)
  mu.b0 <- logit(b0.mean)
  sd.b0 ~ dunif(0, 5)
  tau.b0 <- 1/sd.b0^2

  mu.forest ~ dunif(-5, 5)
  sd.forest ~ dunif(0, 5)
  tau.forest <- 1/sd.forest^2
  
  mu.agri ~ dunif(-5, 5)
  sd.agri ~ dunif(0, 5)
  tau.agri <- 1/sd.agri^2

  p.mean ~ dbeta(1, 1)
  mu.lp <- logit(p.mean)
  sd.lp ~ dunif(0, 5)
  tau.lp <- 1/sd.lp^2

  rho ~ dunif(-1, 1)
  tau.eta <- tau.lp/(1 - rho^2)
  
}
", con = msom_covariates)

# Parameters monitored
wanted <- c("b0.mean", "mu.b0", "sd.b0", 
            "psi", "b0", "bforest",
            "mu.forest", "sd.forest",
            "mu.agri", "sd.agri",
            "p.mean", "sd.lp", "z")

# model run takes under 2 min
system.time(
msom_covariates_out <- jags(msom_covariates_jags_data, NULL, wanted, msom_covariates,
             n.chains = 3, n.iter = 10000, 
             n.burnin = 1000, n.thin = 5, parallel = TRUE, DIC = FALSE)
)

## 
## Processing function input....... 
## 
## Done. 
##  
## Beginning parallel processing using 3 cores. Console output will be suppressed.
## 
## Parallel processing completed.
## 
## Calculating statistics....... 
## 
## Done.

##    user  system elapsed 
##  33.259   1.099  65.034

# Summary
modelSummary <- summary(msom_covariates_out)

Before proceeding, we will check our diagnostics

# Plot traces and density estimates 
# (not showing because it will show many pages of plots)
# plot(msom_covariates_out)

# Summary
head(summary(msom_covariates_out))

##                 mean         sd       2.5%        25%        50%       75%
## b0.mean  0.501498090 0.08019125  0.3450316  0.4485830 0.50143745 0.5528388
## mu.b0    0.006188063 0.33031784 -0.6409508 -0.2063977 0.00574983 0.2121471
## sd.b0    1.355749449 0.29004940  0.8995508  1.1491108 1.31336015 1.5229914
## psi[1,1] 0.511471143 0.15464154  0.2204757  0.3996107 0.51168666 0.6232664
## psi[2,1] 0.904556547 0.03832063  0.8167308  0.8822286 0.90918439 0.9325991
## psi[3,1] 0.373958298 0.07851627  0.2267986  0.3187199 0.37177168 0.4265520
##              97.5%      Rhat n.eff overlap0         f
## b0.mean  0.6609825 1.0005531  5400        0 1.0000000
## mu.b0    0.6676757 1.0006750  5400        1 0.5062963
## sd.b0    2.0163566 0.9999949  5400        0 1.0000000
## psi[1,1] 0.8093752 0.9999855  5400        0 1.0000000
## psi[2,1] 0.9653703 1.0023216  1201        0 1.0000000
## psi[3,1] 0.5350544 1.0002627  4346        0 1.0000000

Visualizing effects of covariates on species’ occupancy probability

So how did the covariates affect the probability of occupancy for each species? We will visualize in two ways. First, we will extract the forest coefficients for each species. We can extract the mean (and uncertainty) for the coefficients in msom_covariates_out. Second, we will plot the relationship between occupancy probability and the forest covariate for each species. To predict the relationship between the forest covariate and occupancy probability for each species, we extract the intercept and forest coefficients, and predict occupancy probability along a sequence of forest values from the minimum to maximum.

forestCoefficients <- summary(msom_covariates_out) %>%
  as.data.frame(.) %>%
  mutate(parameter = row.names(.)) %>%
  # Filtering to only the estimates of psi for each species 
  # species psi parameters in our model take the form "psi[1]"
  filter(parameter %in% c(paste0("bforest","[",c(1:18),"]"))) %>%
  arrange(mean) %>%
  mutate(plot_order = 1:nrow(.))

# Plot forest coefficient estimates
ggplot(forestCoefficients, aes(x = as.factor(plot_order), y = mean)) +
  geom_point() +
  geom_errorbar(aes(ymin = `25%`, ymax = `75%`), width = 0, size = 2) +
  geom_errorbar(aes(ymin = `2.5%`, ymax = `97.5%`), width = 0) +
  labs(x = "Species", y = "Forest Coeffiecient Estimate on Occupancy") +
  scale_x_discrete(labels = forestCoefficients$parameter, 
                   breaks = order)+ 
  scale_y_continuous(limits = c(-0.25, 1.15), breaks = c(-0.25, 0, 0.25, 0.5, 1.0)) +
  coord_flip() +
  theme_classic()

# Extract mean predicted relationships

# Create a sequence to predict on from the minimum to maximum of a covariate
forestSequence <- seq(from = range(site.cov$forest)[1], 
                      to = range(site.cov$forest)[2], 
                      length.out = 101)
  
# Get the predictions for each value of forestSequence
predForest <- matrix(NA, 101, ncol(y))
for(i in 1:101){
  predForest[i,] <- plogis(msom_covariates_out$mean$b0 +
      msom_covariates_out$mean$bforest * forestSequence[i])
}

# Add forestSequence then pivot longer
predForest<- predForest %>%
  as.data.frame(.) %>%
  mutate(forest = forestSequence) %>%
  pivot_longer(., cols = 'V1':'V18', names_to = "species", values_to = "occupancy")

# Plot relationship between species occupaancy and forest
ggplot(predForest, aes(x = forest, y = occupancy, group = species)) +
    geom_path(col = "black") +
    labs(x = "Forest amount (scaled)", y = "Predicted occupancy") +
    theme_classic()

From our two plots, we can see that:

Most species responded positively to forest. In other words, most species were more likely to occupy sites with more forest.
There was variation in the response to forest between species, but only 1 of our 18 detected species was less likely to occupy sites with more forest.

Relationship between species richness and covariates

Beyond looking at how each species respond to a covariate, we can look at how predicted species richness responds to a covariate. To do this, we use the z-matrix posterior distribution, and then plot the relationship between species richness and the forest covariate.

# Species occupancy state at each site
z_results <- msom_covariates_out$sims.list$z

# Calculate average species richness per site
ndraws <- dim(z_results)[1] # Sample size (number of draws from posterior distribution)
richnessMatrix <- matrix(nrow = 100, ncol = ndraws) # Matrix to hold results

# For each draw from the posterior distribution
for(i in 1:ndraws){
  # Calculate the species richness by summing species
  # Each row is a site, each column is one draw
  richnessMatrix[,i] <- rowSums(z_results[i,,])
}

richnessDF <- richnessMatrix %>%
  as.data.frame(.) %>%
  # Calculate the average, SD, lower (2.5%) and upper (95.5%) of the distribution per site
  mutate(meanRichness = rowMeans(dplyr::select(., starts_with("V")), na.rm = TRUE),
         sdRichness = apply(X = richnessMatrix, MARGIN = 1, FUN = sd),
         lowerRichness = apply(X = richnessMatrix, MARGIN = 1, 
                               FUN = quantile, probs = 0.025),
         upperRichness = apply(X = richnessMatrix, MARGIN = 1, 
                               FUN = quantile, probs = 0.975),
         sites = as.character(1:100)) %>%
  # Join to the site dataframe to have the covariate
  left_join(., site.cov %>%
              mutate(sites = as.character(sites)),
            by = "sites") %>%
  # Limit the data frame to just the variables we need
  dplyr::select(sites, meanRichness, sdRichness,
                lowerRichness, upperRichness,
                forest, agri)


# Plot richness vs forest
ggplot(richnessDF, aes(x = forest, y = meanRichness)) +
  geom_point(pch = 16) +
  geom_errorbar(aes(ymin = lowerRichness, ymax = upperRichness), width = 0) +
  labs(y = "Predicted species richness", x = "Forest amount (scaled)") +
  theme_classic()

From our plot, there is a clear pattern where species richness is positively associated with forest.

Conclusion

That’s it! To summarize, we:

Built multi-species occupancy models with continuous covariates for occupancy
Tested how the relationship between occupancy probability and a covariate differ between species
Estimated the number of species at each site, and explored how species richness was related to covariates.

In the next tutorial, we’ll expand our model to estimate how many species our sampling missed by augmenting our detection history data. In our current example the maximum species richness is bound at the upper end by the number of species in the detection history (18), but its feasible that there are other species in the community that were present but not detected.

Have suggestions for a tutorial you want on spatial ecology, mark-recapture models, occupancy models, or another related topic? Let me know!