Jolly-Seber models to estimate population size in R

Estimating population size is one of the primary purposes we collect mark-recapture data, so how can we estimate that?

In this tutorial, I cover:

Basics of Jolly-Seber models (using POPAN formulation),
Fitting POPAN Jolly-Seber models with the R package marked
Deriving population size using estimated parameters from POPAN Jolly-Seber models

The code from this tutorial is available on my GitHub page for mark-recapture models.

Let’s get started!

What are Jolly-Seber models?

Jolly-Seber models are a class of open population mark-recapture models that can be used to estimate population size, population growth, and recruitment (depending on the formulation).

In this tutorial, I will consider the POPAN formulation, which can be used to estimate population size. The POPAN Jolly-Seber model estimates four parameters:

detection probability (\(p_t\), the probability of detecting an individual at time t if it is alive)
apparent survival (\(\Phi_t\)) is the probability of an individual surviving to the next time step
Super population size (\(N_{super}\)) is the total number of individuals available to enter the study
Probability of new individuals from the super-population entering at time t (through births and immigration \(pent_t\))

The p and \(\Phi\) parameters are estimated using the same approach as in Cormack-Jolly-Seber models. The additional two parameters estimate how many new individuals enter the population between each time period.

New individuals enter the population between time steps. The number of individuals that enter at time t is the product of \(N_{super}\) and \(pent_t\).

Since the super-population size is the total number of individuals that can enter the study, then the sum of all \(pent_t\) values must be 1. Thus, the model estimates one less \(pent\) value and the last estimate is calculated by subtraction. This constraint also means the parameters need a special link function (the multinomial logit, mlogit, instead of the logit link used for \(p\) and \(\Phi\) parameters).

We will need to remember this when we estimate the population size in the first capture event.

Let’s load the dipper data set and then fit some Jolly-Seber models.

# install.packages("marked") # first time only
library(marked)

# Load dipper data (included in "marked")
data(dipper, package = "marked")

# Examine data structure
head(dipper)

##        ch    sex
## 1 0000001 Female
## 2 0000001 Female
## 3 0000001 Female
## 4 0000001 Female
## 5 0000001 Female
## 6 0000001 Female

For the model fitting, we will use a function to fit a series of models that might be feasible (e.g. time-dependent differences in survival or constant survival, time-dependent entry probability or constant).

# Jolly-Seber models (POPAN formulation) are open population models, and 
# can be used to estimate abundance by including two more parameters than the CJS

# Additional parameters:
# Nsuper (or "superpopulation") = total number of individuals available to enter population throughout study
# pent ("probability of entry") =  the rate at which individuals enter the population from Nsuper (via births and immigration)

# WARNING: there is no adequate GOF tests for Jolly-Seber models. 
# One common method: Test equivalent structure of CJS model with R2ucare (previous tutorials).

# This tests *some* assumptions of Phi and p.
# Jolly-Seber models have an additional assumption:
# marked AND unmarked animals have same p (R2ucare doesn't test this)
# This assumption is required to estimate total abundance (sum of marked and unmarked animals in population)

# First, process data (Notice model = "JS", previous version = "CJS")
dipper.js.proc <- process.data(dipper, 
                               model = "JS", 
                               groups = "sex")

# Second, make design data (from processed data)
dipper.js.ddl <- make.design.data(dipper.js.proc)

fit.js.dipper.models <- function(){
  # Phi formulas
  Phi.dot <- list(formula=~1)
  Phi.time <- list(formula=~time)
  # p formulas
  p.dot <- list(formula=~1)
  # pent formulas. pent estimates MUST SUM to 1 (for each group).
  # This is constained using a Multinomial Logit link
  pent.time <- list(formula=~time)
  pent.sex <- list(formula=~sex)
  pent.dot <- list(formula=~1)
  # Nsuper formulas. Don't confuse "N" from model with predicted population size
  N.sex <- list(formula=~sex)
  N.dot <- list(formula=~1)
  cml <- create.model.list(c("Phi","p", "pent", "N"))
  results <- crm.wrapper(cml, data = dipper.js.proc, ddl = dipper.js.ddl,
                         external = FALSE, accumulate = FALSE, hessian = TRUE)
  
  return(results)
}

# Run function
dipper.js.models <- fit.js.dipper.models()

# Display model table
dipper.js.models

##                                model npar      AIC  DeltaAIC       weight
## 1          Phi(~1)p(~1)pent(~1)N(~1)    4 742.0136  0.000000 0.4077537375
## 3        Phi(~1)p(~1)pent(~sex)N(~1)    5 743.6466  1.632999 0.1802173625
## 2        Phi(~1)p(~1)pent(~1)N(~sex)    5 743.9901  1.976459 0.1517802792
## 7       Phi(~time)p(~1)pent(~1)N(~1)    9 745.0847  3.071070 0.0878058694
## 4      Phi(~1)p(~1)pent(~sex)N(~sex)    6 745.6211  3.607445 0.0671507949
## 9     Phi(~time)p(~1)pent(~sex)N(~1)   10 746.7239  4.710281 0.0386877491
## 8     Phi(~time)p(~1)pent(~1)N(~sex)   10 747.0611  5.047469 0.0326854214
## 10  Phi(~time)p(~1)pent(~sex)N(~sex)   11 748.6968  6.683216 0.0144263381
## 5       Phi(~1)p(~1)pent(~time)N(~1)    9 749.1438  7.130150 0.0115373337
## 6     Phi(~1)p(~1)pent(~time)N(~sex)   10 751.1204  9.106768 0.0042942617
## 11   Phi(~time)p(~1)pent(~time)N(~1)   14 752.0725 10.058881 0.0026677167
## 12 Phi(~time)p(~1)pent(~time)N(~sex)   15 754.0487 12.035103 0.0009931358
##     neg2lnl convergence
## 1  734.0136           0
## 3  733.6466           0
## 2  733.9901           0
## 7  727.0847           0
## 4  733.6211           0
## 9  726.7239           0
## 8  727.0611           0
## 10 726.6968           0
## 5  731.1438           0
## 6  731.1204           0
## 11 724.0725           0
## 12 724.0487           0

The model table (dipper.js.models) suggests the most supported model has constant survival, constant detection probability, and constant probability of entry (~1 = intercept only model). Although we can model-average the results, for simplicity we will just continue with the most-supported model.

Let’s look at the model output of the most supported model, and then estimate real parameters (remember the coefficients are on different scales depending on the link function).

We can use the predict function to estimate the real parameters, or calculate them ‘by hand’ using the link functions for each parameter.

# Look at estimates of top model (row number on left of model table, or using name)
dipper.js.models[[1]]  # or dipper.js.models[["Phi.dot.p.dot.pent.dot.N.dot"]] or dipper.js.models$Phi.dot.p.dot.pent.dot.N.dot

## 
## crm Model Summary
## 
## Npar :  4
## -2lnL:  734.0136
## AIC  :  742.0136
## 
## Beta
##                   Estimate        se        lcl      ucl
## Phi.(Intercept)  0.2361657 0.1014104 0.03740139 0.434930
## p.(Intercept)    2.3455039 0.3412612 1.67663204 3.014376
## pent.(Intercept) 0.6703973 0.2232187 0.23288867 1.107906
## N.(Intercept)    1.9348641 0.4583174 1.03656213 2.833166

# The estimates above are not on probability scale (or in individuals for N)
# (e.g. Phi, p on logit scale, pent on mlogit scale)
# Predict (real) values using top model
dipper.js.predicted <- predict(dipper.js.models[[1]]) # [[1]] just calls the model row according to the model table.

# Look at predictions of real parameters
dipper.js.predicted

## $Phi
##   occ  estimate         se       lcl       ucl
## 1   1 0.5587685 0.02500235 0.5093493 0.6070503
## 
## $p
##   occ  estimate         se      lcl       ucl
## 1   1 0.9125762 0.02722612 0.842458 0.9532194
## 
## $pent
##   time occ  estimate          se       lcl       ucl
## 1    2   2 0.1535743 0.002692886 0.1483701 0.1589269
## 2    3   3 0.1535743 0.002692886 0.1483701 0.1589269
## 3    4   4 0.1535743 0.002692886 0.1483701 0.1589269
## 4    5   5 0.1535743 0.002692886 0.1483701 0.1589269
## 5    6   6 0.1535743 0.002692886 0.1483701 0.1589269
## 6    7   7 0.1535743 0.002692886 0.1483701 0.1589269
## 
## $N
##   estimate       se      lcl     ucl
## 1 6.923103 3.172978 2.819507 16.9992

The output shows that survival between capture events is 0.56, detection probability is 0.91, pent is 0.15 each capture event, and the number of unmarked individuals in the superpopulation is about 7 (so the super population is ~7 + 294 marked individuals =~ 301). Usually the estimated number of unmarked individuals will be higher, but since our detection probability is so high (0.91), there are few individuals that are not captured during the study.

So what about population size?

There is no direct estimate of population size in the model. The esitmate of “N” in the model output is for the number of unmarked individuals in the superpopulation.

To estimate population size, we can derive it using the model estimates.

\(N_{t+1}\) = (\(N_t\) * \(\Phi_t\) )+(\(N_{super}\) * \(pent_t\))

In words, this equation says that the population size is the previous population size times the survival rate (to calculate how many individuals survive) plus the number of new individuals from births and immigration.

To get the inital population size, we estimate \(pent_1\) by the constraint:

\(pent_1\) = 1 - \(\sum (pent_{2..t})\)

# Abundance (N) is derived from the estimated parameters
# We will estimate population size at each time by making a dataframe of estimates and calculating N
# We will use the predicted estimates from the top-performing model (in this case: "dipper.js.predicted")

# NOTE: the below method will have to be adjusted based on your final model and the number of capture events
N.derived <- data.frame(occ = c(1:7), # 7 events
                        Phi = c(rep(dipper.js.predicted$Phi$estimate, 6), NA),   # 6 survival estimates all the same
                        Nsuper = rep(dipper.js.predicted$N$estimate + nrow(dipper), 7), # Nsuper estimate + number of marked animals
                        pent = c(1-sum(dipper.js.predicted$pent$estimate), dipper.js.predicted$pent$estimate)) # Sum of all pent must be 1

# Set-up empty vector for calculating N
N.derived$N <- NA

# The inital population size (N[1]) = Nsuper * (1 - sum(all other pent estimates))
# This is because of the link function for estimating pent.
# The sum of all pent parameters MUST equal 1 (therefore, one less must be estimated)
N.derived$N[1] <- (N.derived$Nsuper[1] * N.derived$pent[1])

# Subsequent population sizes are estimated by calculating surviving individuals (N[t-1] * Phi[t]), and
# Adding new births (Nsuper * pent[t])
for(i in 2:nrow(N.derived)){
  N.derived$N[i] <- (N.derived$N[i-1]*N.derived$Phi[i-1]) + (N.derived$Nsuper[i] * N.derived$pent[i])
}

# Look at what we did
N.derived

##   occ       Phi   Nsuper       pent         N
## 1   1 0.5587685 300.9231 0.07855408  23.63874
## 2   2 0.5587685 300.9231 0.15357432  59.42264
## 3   3 0.5587685 300.9231 0.15357432  79.41756
## 4   4 0.5587685 300.9231 0.15357432  90.59010
## 5   5 0.5587685 300.9231 0.15357432  96.83296
## 6   6 0.5587685 300.9231 0.15357432 100.32127
## 7   7        NA 300.9231 0.15357432 102.27043

We have now derived the population size (N) using the estimates form our Jolly-Seber model. But, how do we estimate standard error and confidence intervals?

How confident are we about these estimates?

To add 95% confidence intervals on these estimates, we will (for convenience) switch to another R package. Since the error from each parameter (\(\Phi\), pent, \(N_{super}\)) propagates throught the derived population size, it is not straight forward to estimate the standard error (and 95% CI) using marked.

Let’s use the RMark package, which makes computing the standard error and 95% CI much easier using the delta method. To do this by hand using our model results from the marked package is frustrating since it is difficult to get the variance-covariance matrix for our real estimtes (not the variance-covariance of the beta estimates). (if you find an easy way to do this, let me know!)

To use RMark you need to have the package installed and have the program MARK installed.

detach("package:marked", unload=TRUE) # many of the function names are the same. unload `marked`

# install.packages("RMark") # first time only
# For RMark to work, you also need mark.exe installed separately
# http://www.phidot.org/software/mark/downloads/
# this may not be easy on a Mac OS (http://www.phidot.org/software/mark/rmark/linux/)
# RMark calls "mark" to do all the work outside of R
library(RMark)

## This is RMark 2.2.6
##  Documentation available at http://www.phidot.org/software/mark/rmark/RMarkDocumentation.zip

# We will use the same data but will just create the same top model (not all the other subsets)
dipper.rmark.processed <- RMark::process.data(dipper,
                                              model = "POPAN")

# Formulae for model
Phi.dot <- list(formula=~1)
p.dot <- list(formula=~1)
pent.dot <- list(formula=~1)
N.dot <- list(formula=~1)

# The argument names are similar but a little different (notice "POPAN" instead of "js")
dipper.rmark <- mark(dipper, model = "POPAN", 
                     model.parameters = list(Phi = Phi.dot, p= p.dot, 
                                                     pent = pent.dot, N = N.dot),
                     realvcv = TRUE)


# The popan.derived function of RMark estimates N 
# (plus estimates SE and 95% CI using the delta method)
dipper.derived.rmark <- popan.derived(dipper.rmark.processed,
                                      dipper.rmark)$N

# Look at results
dipper.derived.rmark

##   Occasion         N       se      LCL       UCL
## 1        1  24.33323 5.060686 14.41429  34.25217
## 2        2  61.08258 2.529976 56.12383  66.04133
## 3        3  81.63592 2.694208 76.35527  86.91657
## 4        4  93.13108 3.849887 85.58530 100.67686
## 5        5  99.56014 4.849041 90.05602 109.06426
## 6        6 103.15582 5.568716 92.24113 114.07050
## 7        7 105.16682 6.056553 93.29598 117.03766

How close to the results from RMark line-up with our by-hand estimates from the marked package results?

Below, I plot the marked estimates in black, and the RMark estimates (and confidence intervals) in blue.

They are very similar, but not exactly the same. What gives?

The difference is because the estimates for \(N_{super}\) are slightly different in the models fit with different R packages (marked estimate = 300.9231, RMark estimate = 309.1735). The other parameter estimates are almost identical.

To review, we covered:

the basics of Jolly-Seber POPAN models
fitting POPAN models in R using marked (and RMark)
estimating population size (‘by-hand’ and using RMark’s popan.derived function)

What are Jolly-Seber models?

So what about population size?

How confident are we about these estimates?

Further reading