Introduction to occupancy models in R

In this tutorial, I cover:

What occupancy models are useful for,
Fitting single-season occupancy models with the R package unmarked
Testing goodness-of-fit for single-season occupancy models

The code (and sample data) from this tutorial are available on GitHub.

Let’s get started!

What are occupancy models?

Occupancy models estimate where species occur (a biological process) while accounting for imperfect detection (an observation process). This is important because the probability of detecting a species during any survey is almost always substantially below 1. Without accounting for imperfect detection, estimates of species distribution are considerably biased.

Occupancy models are useful for estimating species’ distributions and relationships between species occurrence and habitat or landscape variables (among many other things!).

A basic occupancy model estimates two parameters:

detection probability (\(p_i,_t\), the probability of finding a species at time t if it is present at site i)
occupancy (\(\Psi_i\), the probability that a species occurs at site i).

Building single-season occupancy models

Let’s build some basic single-season occupancy models with the unmarked package!

The basic single-season form assumes that occupancy status (presence or absence) does not change during the study. That is, a species is either always present or always absent at each site, even if the species is not always detected. Dynamic occupancy models allow occupancy status at a site to change through colonization and extinction.

To get started, we need data describing on which surveys that a species was detected. We will use some sample data I simulated. The unmarked package uses a special object type to organize detection data and covariates (an unmarkedFrameOccu).

# install.packages("unmarked") # first time only
library(unmarked)

# Load detection history (100 sites with 10 visits each)
detection_history <- read.csv("detection_history.csv", 
                              # First variable ("X") has row.names but not data
                              row.names = "X") 

# Examine data
head(detection_history)

##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1  0  0  0  0  0  0  0  0  0   0
## 2  1  1  1  0  0  1  1  1  0   1
## 3  1  0  0  0  1  1  0  1  0   0
## 4  1  0  1  1  0  1  0  0  0   1
## 5  0  0  0  0  0  0  0  0  0   0
## 6  0  0  0  0  0  0  0  0  0   0

# Create unmarkedFrameOccu that holds the data
sample.unmarkedFrame_simple <- unmarkedFrameOccu( # y is a matrix with observed detection history 
                                          # (0's and 1's, one row per site, one column per survey)
                                      y = as.matrix(detection_history)) 

# S4 class for occupancy model data
summary(sample.unmarkedFrame_simple)

## unmarkedFrame Object
## 
## 100 sites
## Maximum number of observations per site: 10 
## Mean number of observations per site: 10 
## Sites with at least one detection: 62 
## 
## Tabulation of y observations:
##   0   1 
## 703 297

Remember there are two parameters (detection and occupancy). Each parameter is estimated using a linear model on the logit scale, which bounds the real estimates to be between 0 and 1. We need to bound the estimates because probabilities need to fall between 0 and 1.

In the call to build our occupancy model, we will start with an intercept-only model where there is only one estimate for detection probability and one estimate for occupancy probability. An intercept-only model in this case is saying the probability of a species being present is the same at each site, and the probability of detecting a species is the same during each survey.

# Build basic single-season occupancy model with intercepts only (one estimate for detection, one for occupancy)
occu.m1 <- occu(formula = ~1 # detection formula first
                          ~1, # occupancy formula second, 
                  data = sample.unmarkedFrame_simple)

summary(occu.m1) # Show AIC, estimates (on logit scale), SE, z-scores

## 
## Call:
## occu(formula = ~1 ~ 1, data = sample.unmarkedFrame_simple)
## 
## Occupancy (logit-scale):
##  Estimate    SE    z P(>|z|)
##     0.493 0.207 2.39  0.0169
## 
## Detection (logit-scale):
##  Estimate     SE     z P(>|z|)
##   -0.0868 0.0809 -1.07   0.283
## 
## AIC: 995.0407 
## Number of sites: 100
## optim convergence code: 0
## optim iterations: 18 
## Bootstrap iterations: 0

# To get real estimate of occupancy (with 95% CI)
predict(occu.m1, 
        newdata = data.frame(site = 1),
        type = "state")

##   Predicted         SE     lower     upper
## 1 0.6209272 0.04861255 0.5221585 0.7105956

# To get real estimate of detection (with 95% CI)
predict(occu.m1, 
        newdata = data.frame(site = 1),
        type = "det")

##   Predicted         SE     lower    upper
## 1  0.478317 0.02018546 0.4389719 0.517933

# Equivalent to inverse logit
boot::inv.logit(coef(occu.m1)[1]) # Real estimate of occupancy

##  psi(Int) 
## 0.6209272

boot::inv.logit(coef(occu.m1)[2]) # Real estimate of detection

##   p(Int) 
## 0.478317

So, we have built a very simple single season occupancy model, and extracted the real parameters predicting occupancy probability (~0.62) and detection probability (~0.48). In other words, we predict the species to occur in about 62% of sites, and to detect the species (when present) about 48% of the time.

However, it is reasonable that some sites are more likely to have the target species than others. And, it is reasonable that some surveys are more likely to detect the target species than others.

How do we add covariates to the model to make it more realistic?

Adding covariates for occupancy and detection

To add covariates to occupancy and detection parts of our model, we need to include data in the call to unmarkedFrameOccu.

There are two types of covariates:

Observation-level covariates, which are different for every combination of site and survey (e.g. search effort per visit per site). Each one of these variables should be included as an element of a list, and each element of the list should have a row for every site and a column for every survey. In our example, that would be 100 rows and 10 columns.
Site-level covariates do not change through time, but differ between each site. For example, measures of habitat type or other landscape metrics like the straight-line distance to a road. Site covariates should be included in a dataframe with a row for every site and a column for each site variable.

# Load covariate data
effort <- read.csv("effort.csv",
                   # First variable ("X") has row.names but not data
                   row.names = "X") 
observers <- read.csv("observers.csv",
                      # First variable ("X") has row.names but not data
                      row.names = "X") 
site_cov <- read.csv("site_cov.csv",
                     # First variable ("X") has row.names but not data
                     row.names = "X") 

# Build a new unmarkedFramOccu
sample.unmarkedFrame_cov <- unmarkedFrameOccu( # y is a matrix with observed detection history 
                                          # (0's and 1's, one row per site, one column per survey)
                                      y = as.matrix(detection_history),
                                      # obsCovs = observation covariates in a list, 
                                      # each variable has site rows x survey columns
                                      obsCovs = list(effort = effort,
                                                     observers = observers),
                                      # siteCovs = dataframe with site rows x column variables
                                      siteCovs = site_cov) 

# S4 class for occupancy model data
summary(sample.unmarkedFrame_cov)

## unmarkedFrame Object
## 
## 100 sites
## Maximum number of observations per site: 10 
## Mean number of observations per site: 10 
## Sites with at least one detection: 62 
## 
## Tabulation of y observations:
##   0   1 
## 703 297 
## 
## Site-level covariates:
##      sites            forest              agri         
##  Min.   :  1.00   Min.   :-12.1549   Min.   :-11.3766  
##  1st Qu.: 25.75   1st Qu.: -3.5126   1st Qu.: -2.3297  
##  Median : 50.50   Median : -0.1068   Median :  0.5792  
##  Mean   : 50.50   Mean   : -0.1729   Mean   :  0.3977  
##  3rd Qu.: 75.25   3rd Qu.:  3.5805   3rd Qu.:  3.1949  
##  Max.   :100.00   Max.   : 12.7956   Max.   : 12.1053  
## 
## Observation-level covariates:
##      effort             observers       
##  Min.   :-17.284178   Min.   :-15.4163  
##  1st Qu.: -3.390206   1st Qu.: -2.7600  
##  Median :  0.204287   Median :  0.5063  
##  Mean   : -0.007534   Mean   :  0.3759  
##  3rd Qu.:  3.200970   3rd Qu.:  3.4080  
##  Max.   : 17.784615   Max.   : 15.1235

Now that the data are organized, we can build a model using occu() that includes covariates.

The names in our formulae must be exactly the same as the names in sample.unmarkedFrame_cov. The formulae work like other linear model building tools in R (e.g. + for additive effects, : for an interaction, and * for main effects plus an interaction).

occu.m2 <- occu(formula = ~effort + observers # detection formula first
                          ~forest + agri, # occupancy formula second,
                data = sample.unmarkedFrame_cov)

# Summarize
summary(occu.m2)

## 
## Call:
## occu(formula = ~effort + observers ~ forest + agri, data = sample.unmarkedFrame_cov)
## 
## Occupancy (logit-scale):
##             Estimate     SE     z  P(>|z|)
## (Intercept)    0.842 0.2695  3.13 1.77e-03
## forest         0.227 0.0552  4.11 4.01e-05
## agri          -0.249 0.0714 -3.48 4.93e-04
## 
## Detection (logit-scale):
##             Estimate     SE     z  P(>|z|)
## (Intercept)   -0.261 0.0992 -2.63 8.44e-03
## effort         0.140 0.0208  6.73 1.69e-11
## observers      0.293 0.0262 11.16 6.67e-29
## 
## AIC: 750.4784 
## Number of sites: 100
## optim convergence code: 0
## optim iterations: 40 
## Bootstrap iterations: 0

From the coefficients, we see that:

detection probability increases with more search effort (positive estimate) and more observers per survey (positive estimate)
occupancy probability increases with forest cover (positive estimate) and decreases with agricultural cover (negative estimate)

We can use the predict() function and ggplot() to visualize the relationships (code not shown).

Great! But, how do we know our model fits our data?

Goodness-of-fit Tests

For single-season occupancy models, the most common method to test the goodness-of-fit is to use a MacKenzie-Bailey test. This approach simulates detection history sets, assuming the detection model fits appropriately, and then compares the binomial distribution of simulated data to the real data.

This test is useful to:

Estimate the variance inflation factor (\(\hat{c}\) or “c-hat”).
Test, using a Chi-square statistic, whether the detection data used to fit the model has more extra-binomial noise than expected.
Identify which sites contributed most to the Chi-square statistic. This is useful for diagnosing places where the data do not meet assumptions (e.g. Is there unmodelled heterogeneity in capture probability?).

The AICcmodavg package implements the test in the mb.gof.test function.

# install.packages("AICcmodavg") # First time only
library(AICcmodavg)

# Do Mackenzie-Bailey goodness of fit test for single-season occupancy model
m2_mb.gof.boot <- mb.gof.test(occu.m2,
                                # Demonstrate with small number of sims (10), 
                                # but then change to large number (e.g. 1000)
                                nsim = 50)

# View Results
m2_mb.gof.boot

## 
## MacKenzie and Bailey goodness-of-fit for single-season occupancy model
## 
## Pearson chi-square table:
## 
##            Cohort Observed Expected Chi-square
## 0000000000      0       38    37.95       0.00
## 0000001011      0        1     0.05      17.85
## 0000011001      0        1     0.09       9.48
## 0000011100      0        1     0.08      10.36
## 0000100100      0        1     0.07      12.14
## 0000100111      0        1     0.05      19.92
## 0001000010      0        1     0.10       8.14
## 0001001010      0        1     0.04      20.32
## 0001010000      0        1     0.10       7.65
## 0001011001      0        1     0.06      15.55
## 0001011100      0        1     0.08      10.99
## 0001101100      0        1     0.04      22.92
## 0001111111      0        1     0.07      12.13
## 0010001100      0        1     0.11       7.29
## 0010001101      0        1     0.12       6.78
## 0010011101      0        1     0.10       8.18
## 0011011100      0        1     0.11       7.47
## 0011011101      0        1     0.10       7.71
## 0100011001      0        2     0.06      67.23
## 0100100001      0        1     0.05      18.88
## 0100111111      0        1     0.05      19.71
## 0101000000      0        1     0.09       9.14
## 0101010010      0        1     0.07      12.43
## 0110010000      0        1     0.07      11.74
## 0110010110      0        1     0.08       9.91
## 0110101010      0        1     0.05      18.43
## 0111011101      0        1     0.05      19.87
## 0111110011      0        1     0.02      41.39
## 1000000010      0        1     0.12       6.58
## 1000000100      0        1     0.07      12.79
## 1000010010      0        1     0.10       8.18
## 1000010110      0        1     0.09       9.46
## 1000110010      0        1     0.06      13.88
## 1000110100      0        1     0.05      17.40
## 1000111101      0        1     0.06      14.16
## 1001010100      0        1     0.11       7.50
## 1001010110      0        1     0.09       9.25
## 1001101101      0        2     0.03     115.21
## 1001110001      0        1     0.07      12.67
## 1001110011      0        1     0.07      12.48
## 1010000110      0        1     0.06      15.51
## 1010001011      0        1     0.08      10.82
## 1010001101      0        1     0.05      19.78
## 1010110010      0        1     0.06      15.09
## 1011000011      0        1     0.09       9.48
## 1011010001      0        1     0.09       9.33
## 1011011010      0        1     0.07      13.39
## 1011011110      0        1     0.07      12.29
## 1011100010      0        1     0.07      12.89
## 1011111101      0        1     0.06      14.44
## 1011111111      0        1     0.04      20.37
## 1100001101      0        1     0.10       8.32
## 1100001110      0        1     0.06      14.49
## 1101000100      0        1     0.09       9.55
## 1101011001      0        1     0.07      12.87
## 1101011110      0        1     0.06      15.24
## 1110000110      0        1     0.04      23.30
## 1110011011      0        1     0.07      11.64
## 1110011101      0        1     0.05      19.88
## 1111011011      0        1     0.16       4.28
## 1111101001      0        1     0.06      15.32
## 
## Chi-square statistic = 1019.125 
## Number of bootstrap samples = 50
## P-value = 0.62
## 
## Quantiles of bootstrapped statistics:
##   0%  25%  50%  75% 100% 
##  937 1000 1033 1076 1275 
## 
## Estimate of c-hat = 0.98

In this example, I created 50 simulated data sets, but I’d recommend a much higher number (e.g. 1000) when actually performing this test. In our example, we see that the detection model fits the data appropriately because the Chi-square test indicates the variance inflation factor is not different than 1.0 (p = 0.4 - 0.5; exact number depends on your simulated dataset; it varies because we only simulate a low number of data sets).

Alright! We now have gone through building single-season occupancy models, adding covariates, and testing the goodness-of-fit of models. In the future, I’ll cover model selection (and model averaging) and dynamic occupancy models where occupancy status changes through time via colonizatin and extinction.

What are occupancy models?

Building single-season occupancy models

Adding covariates for occupancy and detection

Goodness-of-fit Tests

Further reading