In this tutorial, I cover:
- What occupancy models are useful for,
- Fitting single-season occupancy models with the R package
unmarked
- Testing goodness-of-fit for single-season occupancy models
The code (and sample data) from this tutorial are available on GitHub.
Let’s get started!
What are occupancy models?
Occupancy models estimate where species occur (a biological process) while accounting for imperfect detection (an observation process). This is important because the probability of detecting a species during any survey is almost always substantially below 1. Without accounting for imperfect detection, estimates of species distribution are considerably biased.
Occupancy models are useful for estimating species’ distributions and relationships between species occurrence and habitat or landscape variables (among many other things!).
A basic occupancy model estimates two parameters:
- detection probability (\(p_i,_t\), the probability of finding a species at time t if it is present at site i)
- occupancy (\(\Psi_i\), the probability that a species occurs at site i).
Building single-season occupancy models
Let’s build some basic single-season occupancy models with the unmarked
package!
The basic single-season form assumes that occupancy status (presence or absence) does not change during the study. That is, a species is either always present or always absent at each site, even if the species is not always detected. Dynamic occupancy models allow occupancy status at a site to change through colonization and extinction.
To get started, we need data describing on which surveys that a species was detected. We will use some sample data I simulated. The unmarked
package uses a special object type to organize detection data and covariates (an unmarkedFrameOccu
).
# install.packages("unmarked") # first time only
library(unmarked)
# Load detection history (100 sites with 10 visits each)
detection_history <- read.csv("detection_history.csv",
# First variable ("X") has row.names but not data
row.names = "X")
# Examine data
head(detection_history)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## 1 0 0 0 0 0 0 0 0 0 0
## 2 1 1 1 0 0 1 1 1 0 1
## 3 1 0 0 0 1 1 0 1 0 0
## 4 1 0 1 1 0 1 0 0 0 1
## 5 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0 0 0
# Create unmarkedFrameOccu that holds the data
sample.unmarkedFrame_simple <- unmarkedFrameOccu( # y is a matrix with observed detection history
# (0's and 1's, one row per site, one column per survey)
y = as.matrix(detection_history))
# S4 class for occupancy model data
summary(sample.unmarkedFrame_simple)
## unmarkedFrame Object
##
## 100 sites
## Maximum number of observations per site: 10
## Mean number of observations per site: 10
## Sites with at least one detection: 62
##
## Tabulation of y observations:
## 0 1
## 703 297
Remember there are two parameters (detection and occupancy). Each parameter is estimated using a linear model on the logit scale, which bounds the real estimates to be between 0 and 1. We need to bound the estimates because probabilities need to fall between 0 and 1.
In the call to build our occupancy model, we will start with an intercept-only model where there is only one estimate for detection probability and one estimate for occupancy probability. An intercept-only model in this case is saying the probability of a species being present is the same at each site, and the probability of detecting a species is the same during each survey.
# Build basic single-season occupancy model with intercepts only (one estimate for detection, one for occupancy)
occu.m1 <- occu(formula = ~1 # detection formula first
~1, # occupancy formula second,
data = sample.unmarkedFrame_simple)
summary(occu.m1) # Show AIC, estimates (on logit scale), SE, z-scores
##
## Call:
## occu(formula = ~1 ~ 1, data = sample.unmarkedFrame_simple)
##
## Occupancy (logit-scale):
## Estimate SE z P(>|z|)
## 0.493 0.207 2.39 0.0169
##
## Detection (logit-scale):
## Estimate SE z P(>|z|)
## -0.0868 0.0809 -1.07 0.283
##
## AIC: 995.0407
## Number of sites: 100
## optim convergence code: 0
## optim iterations: 18
## Bootstrap iterations: 0
# To get real estimate of occupancy (with 95% CI)
predict(occu.m1,
newdata = data.frame(site = 1),
type = "state")
## Predicted SE lower upper
## 1 0.6209272 0.04861255 0.5221585 0.7105956
# To get real estimate of detection (with 95% CI)
predict(occu.m1,
newdata = data.frame(site = 1),
type = "det")
## Predicted SE lower upper
## 1 0.478317 0.02018546 0.4389719 0.517933
## psi(Int)
## 0.6209272
## p(Int)
## 0.478317
So, we have built a very simple single season occupancy model, and extracted the real parameters predicting occupancy probability (~0.62) and detection probability (~0.48). In other words, we predict the species to occur in about 62% of sites, and to detect the species (when present) about 48% of the time.
However, it is reasonable that some sites are more likely to have the target species than others. And, it is reasonable that some surveys are more likely to detect the target species than others.
How do we add covariates to the model to make it more realistic?
Adding covariates for occupancy and detection
To add covariates to occupancy and detection parts of our model, we need to include data in the call to unmarkedFrameOccu
.
There are two types of covariates:
Observation-level covariates, which are different for every combination of site and survey (e.g. search effort per visit per site). Each one of these variables should be included as an element of a list, and each element of the list should have a row for every site and a column for every survey. In our example, that would be 100 rows and 10 columns.
Site-level covariates do not change through time, but differ between each site. For example, measures of habitat type or other landscape metrics like the straight-line distance to a road. Site covariates should be included in a dataframe with a row for every site and a column for each site variable.
# Load covariate data
effort <- read.csv("effort.csv",
# First variable ("X") has row.names but not data
row.names = "X")
observers <- read.csv("observers.csv",
# First variable ("X") has row.names but not data
row.names = "X")
site_cov <- read.csv("site_cov.csv",
# First variable ("X") has row.names but not data
row.names = "X")
# Build a new unmarkedFramOccu
sample.unmarkedFrame_cov <- unmarkedFrameOccu( # y is a matrix with observed detection history
# (0's and 1's, one row per site, one column per survey)
y = as.matrix(detection_history),
# obsCovs = observation covariates in a list,
# each variable has site rows x survey columns
obsCovs = list(effort = effort,
observers = observers),
# siteCovs = dataframe with site rows x column variables
siteCovs = site_cov)
# S4 class for occupancy model data
summary(sample.unmarkedFrame_cov)
## unmarkedFrame Object
##
## 100 sites
## Maximum number of observations per site: 10
## Mean number of observations per site: 10
## Sites with at least one detection: 62
##
## Tabulation of y observations:
## 0 1
## 703 297
##
## Site-level covariates:
## sites forest agri
## Min. : 1.00 Min. :-12.1549 Min. :-11.3766
## 1st Qu.: 25.75 1st Qu.: -3.5126 1st Qu.: -2.3297
## Median : 50.50 Median : -0.1068 Median : 0.5792
## Mean : 50.50 Mean : -0.1729 Mean : 0.3977
## 3rd Qu.: 75.25 3rd Qu.: 3.5805 3rd Qu.: 3.1949
## Max. :100.00 Max. : 12.7956 Max. : 12.1053
##
## Observation-level covariates:
## effort observers
## Min. :-17.284178 Min. :-15.4163
## 1st Qu.: -3.390206 1st Qu.: -2.7600
## Median : 0.204287 Median : 0.5063
## Mean : -0.007534 Mean : 0.3759
## 3rd Qu.: 3.200970 3rd Qu.: 3.4080
## Max. : 17.784615 Max. : 15.1235
Now that the data are organized, we can build a model using occu()
that includes covariates.
The names in our formulae must be exactly the same as the names in sample.unmarkedFrame_cov
. The formulae work like other linear model building tools in R (e.g. +
for additive effects, :
for an interaction, and *
for main effects plus an interaction).
occu.m2 <- occu(formula = ~effort + observers # detection formula first
~forest + agri, # occupancy formula second,
data = sample.unmarkedFrame_cov)
# Summarize
summary(occu.m2)
##
## Call:
## occu(formula = ~effort + observers ~ forest + agri, data = sample.unmarkedFrame_cov)
##
## Occupancy (logit-scale):
## Estimate SE z P(>|z|)
## (Intercept) 0.842 0.2695 3.13 1.77e-03
## forest 0.227 0.0552 4.11 4.01e-05
## agri -0.249 0.0714 -3.48 4.93e-04
##
## Detection (logit-scale):
## Estimate SE z P(>|z|)
## (Intercept) -0.261 0.0992 -2.63 8.44e-03
## effort 0.140 0.0208 6.73 1.69e-11
## observers 0.293 0.0262 11.16 6.67e-29
##
## AIC: 750.4784
## Number of sites: 100
## optim convergence code: 0
## optim iterations: 40
## Bootstrap iterations: 0
From the coefficients, we see that:
- detection probability increases with more search effort (positive estimate) and more observers per survey (positive estimate)
- occupancy probability increases with forest cover (positive estimate) and decreases with agricultural cover (negative estimate)
We can use the predict()
function and ggplot()
to visualize the relationships (code not shown).
Great! But, how do we know our model fits our data?
Goodness-of-fit Tests
For single-season occupancy models, the most common method to test the goodness-of-fit is to use a MacKenzie-Bailey test. This approach simulates detection history sets, assuming the detection model fits appropriately, and then compares the binomial distribution of simulated data to the real data.
This test is useful to:
- Estimate the variance inflation factor (\(\hat{c}\) or “c-hat”).
- Test, using a Chi-square statistic, whether the detection data used to fit the model has more extra-binomial noise than expected.
- Identify which sites contributed most to the Chi-square statistic. This is useful for diagnosing places where the data do not meet assumptions (e.g. Is there unmodelled heterogeneity in capture probability?).
The AICcmodavg
package implements the test in the mb.gof.test
function.
# install.packages("AICcmodavg") # First time only
library(AICcmodavg)
# Do Mackenzie-Bailey goodness of fit test for single-season occupancy model
m2_mb.gof.boot <- mb.gof.test(occu.m2,
# Demonstrate with small number of sims (10),
# but then change to large number (e.g. 1000)
nsim = 50)
##
## MacKenzie and Bailey goodness-of-fit for single-season occupancy model
##
## Pearson chi-square table:
##
## Cohort Observed Expected Chi-square
## 0000000000 0 38 37.95 0.00
## 0000001011 0 1 0.05 17.85
## 0000011001 0 1 0.09 9.48
## 0000011100 0 1 0.08 10.36
## 0000100100 0 1 0.07 12.14
## 0000100111 0 1 0.05 19.92
## 0001000010 0 1 0.10 8.14
## 0001001010 0 1 0.04 20.32
## 0001010000 0 1 0.10 7.65
## 0001011001 0 1 0.06 15.55
## 0001011100 0 1 0.08 10.99
## 0001101100 0 1 0.04 22.92
## 0001111111 0 1 0.07 12.13
## 0010001100 0 1 0.11 7.29
## 0010001101 0 1 0.12 6.78
## 0010011101 0 1 0.10 8.18
## 0011011100 0 1 0.11 7.47
## 0011011101 0 1 0.10 7.71
## 0100011001 0 2 0.06 67.23
## 0100100001 0 1 0.05 18.88
## 0100111111 0 1 0.05 19.71
## 0101000000 0 1 0.09 9.14
## 0101010010 0 1 0.07 12.43
## 0110010000 0 1 0.07 11.74
## 0110010110 0 1 0.08 9.91
## 0110101010 0 1 0.05 18.43
## 0111011101 0 1 0.05 19.87
## 0111110011 0 1 0.02 41.39
## 1000000010 0 1 0.12 6.58
## 1000000100 0 1 0.07 12.79
## 1000010010 0 1 0.10 8.18
## 1000010110 0 1 0.09 9.46
## 1000110010 0 1 0.06 13.88
## 1000110100 0 1 0.05 17.40
## 1000111101 0 1 0.06 14.16
## 1001010100 0 1 0.11 7.50
## 1001010110 0 1 0.09 9.25
## 1001101101 0 2 0.03 115.21
## 1001110001 0 1 0.07 12.67
## 1001110011 0 1 0.07 12.48
## 1010000110 0 1 0.06 15.51
## 1010001011 0 1 0.08 10.82
## 1010001101 0 1 0.05 19.78
## 1010110010 0 1 0.06 15.09
## 1011000011 0 1 0.09 9.48
## 1011010001 0 1 0.09 9.33
## 1011011010 0 1 0.07 13.39
## 1011011110 0 1 0.07 12.29
## 1011100010 0 1 0.07 12.89
## 1011111101 0 1 0.06 14.44
## 1011111111 0 1 0.04 20.37
## 1100001101 0 1 0.10 8.32
## 1100001110 0 1 0.06 14.49
## 1101000100 0 1 0.09 9.55
## 1101011001 0 1 0.07 12.87
## 1101011110 0 1 0.06 15.24
## 1110000110 0 1 0.04 23.30
## 1110011011 0 1 0.07 11.64
## 1110011101 0 1 0.05 19.88
## 1111011011 0 1 0.16 4.28
## 1111101001 0 1 0.06 15.32
##
## Chi-square statistic = 1019.125
## Number of bootstrap samples = 50
## P-value = 0.62
##
## Quantiles of bootstrapped statistics:
## 0% 25% 50% 75% 100%
## 937 1000 1033 1076 1275
##
## Estimate of c-hat = 0.98
In this example, I created 50 simulated data sets, but I’d recommend a much higher number (e.g. 1000) when actually performing this test. In our example, we see that the detection model fits the data appropriately because the Chi-square test indicates the variance inflation factor is not different than 1.0 (p = 0.4 - 0.5; exact number depends on your simulated dataset; it varies because we only simulate a low number of data sets).
Alright! We now have gone through building single-season occupancy models, adding covariates, and testing the goodness-of-fit of models. In the future, I’ll cover model selection (and model averaging) and dynamic occupancy models where occupancy status changes through time via colonizatin and extinction.
Further reading
Don’t already have your data in capture-history format? Check-out my past post on creating capture histories.
MacKenzie, D. I., and L. L. Bailey. 2004. Assessing the fit of site occupancy models. J. Agric. Biol. Environ. Stat. 9: 300– 318.
unmarked Package Vignette https://cran.r-project.org/web/packages/unmarked/vignettes/unmarked.pdf
AICcmodavg Package documentation https://cran.r-project.org/web/packages/AICcmodavg/index.html