Mark-recapture models are a useful framework to test hypotheses about what drives differences in wildlife survival and detection probability. However, it is important to assess the goodness-of-fit (GOF) for these models before we make inferences.
What causes lack of fit?
- assumption violations (see previous post for testing assumptions),
- model misspecification (e.g. missing important variables), or
- extrabinomial noise or overdispersion
In this tutorial, I deal with overdispersion by:
- Calculating the variance inflation factor (\(\hat{c}\), “c-hat”).
- Adjusting model selection when there is evidence of overdispersion.
The code from this tutorial is available on my GitHub page for mark-recapture workshops.
Let’s get started!
Review of Cormack-Jolly-Seber mark-recapture models
Cormack-Jolly-Seber models are mark-recapture models used to estimate two parameters:
- detection probability (\(p_t\), the probability of encountering a live animal at time t)
- apparent survival (\(\Phi_t\), the probability of an animal surviving and remaining in the study area between time t and t + 1).
# install.packages("R2ucare") # first time only
library(R2ucare) # For Goodness of fit tests
library(dplyr) # for tidy data
library(magrittr) # for pipes
Let’s load the dipper data set that we will use for fitting CJS models.
# Load dipper data (in "marked", but also in R2ucare package in different format)
# I'm loading `marked` version to be consistent with other tutorials/workshops
# and for fitting models below in "marked".
data(dipper, package = "marked")
# Full data
dipper.ch.gof <- dipper$ch %>%
strsplit('') %>%
sapply(`[`) %>%
t() %>%
unlist() %>%
as.numeric %>%
matrix(nrow = nrow(dipper))
# Females only
dipper.fem.ch.gof <- dipper$ch[dipper$sex == "Female"] %>%
strsplit('') %>%
sapply(`[`) %>%
t() %>%
unlist() %>%
as.numeric %>%
matrix(nrow = nrow(dipper[dipper$sex == "Female",]))
# Males only
dipper.mal.ch.gof <- dipper$ch[dipper$sex == "Male"] %>%
strsplit('') %>%
sapply(`[`) %>%
t() %>%
unlist() %>%
as.numeric %>%
matrix(nrow = nrow(dipper[dipper$sex == "Male",]))
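The three pipelines above repeat the same transformation. As a sketch (the helper name `ch_to_matrix` is mine, not part of `marked` or `R2ucare`), the conversion could be wrapped in a small function:

```r
# Convert a vector of capture-history strings ("1101...") into a numeric
# matrix with one row per individual and one column per capture occasion.
# Hypothetical helper, not from any package.
ch_to_matrix <- function(ch) {
  split <- strsplit(ch, "")          # list of per-occasion characters
  occ <- unique(lengths(split))
  stopifnot(length(occ) == 1)        # all histories must be the same length
  matrix(as.numeric(unlist(split)),
         nrow = length(ch), ncol = occ, byrow = TRUE)
}

# e.g. ch_to_matrix(dipper$ch) reproduces dipper.ch.gof, and subsetting
# dipper$ch by sex reproduces the female- and male-only matrices.
```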
What is \(\hat{c}\) (“c-hat”) or the variance inflation factor?
The variance inflation factor (\(\hat{c}\)) can help adjust model output based on extra-binomial noise or overdispersion. Extra-binomial noise refers to more variance in the capture history data (binomial strings of 0 and 1) than expected.
The variance inflation factor also serves as a measure of how well the model fits the observed data.
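To see what extra-binomial noise looks like, here is a quick simulation of my own (not part of the tutorial's analysis): when detection probability itself varies among individuals or occasions, counts of detections are more variable than the binomial expectation \(np(1-p)\).

```r
set.seed(42)
n <- 50; p <- 0.6; reps <- 5000

# Pure binomial detections: variance should match n * p * (1 - p)
y_binom <- rbinom(reps, size = n, prob = p)

# Overdispersed detections: p itself varies (beta with mean 0.6),
# which inflates the variance of the counts
p_var <- rbeta(reps, shape1 = 6, shape2 = 4)
y_over <- rbinom(reps, size = n, prob = p_var)

var(y_binom) / (n * p * (1 - p))  # close to 1
var(y_over)  / (n * p * (1 - p))  # well above 1: extra-binomial noise
```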
One (simple!) estimate is \(\hat{c} = \chi^2 / \text{df}\) (in maths).
Or (in words): the summed test statistics from Test2 and Test3 (the U-CARE components) divided by their summed degrees of freedom.
This method of \(\hat{c}\) estimation only applies if you’re using a time-dependent model as the general model. By time-dependent, I mean that there are separate estimates of p and \(\Phi\) for each time period. If you’re using a simpler general model, you will want to use a different estimate of \(\hat{c}\).
For other approaches, look up the median and bootstrap methods in Program MARK: a gentle introduction.
Since we have two groups in our dipper example (females and males), our estimate is \(\hat{c}\) = (male test stat + female test stat) / (male df + female df).
This is the GOF test for the fully time-dependent model Phi(~sex * time)p(~sex * time).
library(R2ucare)
# Male overall
overall_CJS(dipper.mal.ch.gof, rep(1, nrow(dipper[dipper$sex == "Male",])))
## chi2 degree_of_freedom p_value
## Gof test for CJS model: 11.062 9 0.271
# Female overall
overall_CJS(dipper.fem.ch.gof, rep(1, nrow(dipper[dipper$sex == "Female",])))
## chi2 degree_of_freedom p_value
## Gof test for CJS model: 10.276 12 0.592
# Overall chat for dipper (male test stat + female test stat) / (male df + female df)
dipper.chat <- (10.276 + 11.062) / (12 + 9)
dipper.chat # 1.016095, still very close to 1 and indicates relatively good fit
## [1] 1.016095
A simple guide to interpreting \(\hat{c}\) values:
- \(\hat{c}\) < 1: underdispersion; slight deviations below 1 are generally ignored,
- \(\hat{c}\) = 1: the variance in the data is as expected, given the model,
- \(\hat{c}\) > 1: there is extra-binomial noise in the data. We can adjust model selection to use QAIC and inflate the variance around parameter estimates (widening confidence intervals).
- \(\hat{c}\) > 3: generally represents poor fit and suggests the model structure should be revised (is there an important part of the system we haven’t modelled?)
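The rules of thumb above can be written down as a tiny helper (hypothetical function, my wording of the guidelines):

```r
# Rule-of-thumb interpretation of c-hat (hypothetical helper)
interpret_chat <- function(chat) {
  if (chat <= 1)      "No evidence of overdispersion; no adjustment needed"
  else if (chat <= 3) "Some overdispersion; adjust model selection with QAIC"
  else                "Poor fit; reconsider the model structure"
}

# For the dipper estimate (1.016), the QAIC adjustment is cheap insurance
interpret_chat(1.016)
```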
So, now we have an estimate of \(\hat{c}\), but what do we do with that?
Adjusting model selection when there is evidence of overdispersion
We can adjust our model-selection to account for this extra noise and to adjust (increase) the standard error in our predictions.
detach("package:R2ucare", unload=TRUE) # Have to be careful with multiple packages with "dipper" data set
# Here I detach R2ucare to avoid confusion
library(marked)
# Process data (and set grouping variables)
dipper.proc <- process.data(dipper,
group = "sex") # group variable has to be in the data
# Make design data (from processed data)
dipper.ddl <- make.design.data(dipper.proc)
fit.dipper.cjs.models <- function(){
# Apparent survival (Phi) formula
Phi.sex.time <- list(formula=~sex*time) # Just like in other linear models "*" includes main effects and an interaction
Phi.time <- list(formula=~time) # differs between discrete times
Phi.sex <- list(formula=~sex) # differs between males and females
Phi.dot <- list(formula=~1) # constant survival
# Detection probability (p) formula
p.sex <- list(formula=~sex) # differs between males and females
p.time <- list(formula=~time) # one discrete estimate of p per capture event
p.dot <- list(formula=~1) # constant detection
# Construct all combinations and put into one model table
cml <- create.model.list(c("Phi","p")) # makes all possible combinations of those parameter formulas
results <- crm.wrapper(cml,
data = dipper.proc,
ddl = dipper.ddl,
external = FALSE,
accumulate = FALSE,
hessian = TRUE)
return(results)
}
# Run function
dipper.cjs.models <- fit.dipper.cjs.models()
## model npar AIC DeltaAIC weight neg2lnl
## 1 Phi(~1)p(~1) 2 670.8377 0.000000 4.021185e-01 666.8377
## 2 Phi(~1)p(~sex) 3 672.1934 1.355702 2.041583e-01 666.1934
## 4 Phi(~sex)p(~1) 3 672.6762 1.838535 1.603693e-01 666.6762
## 10 Phi(~time)p(~1) 7 673.7301 2.892416 9.468339e-02 659.7301
## 5 Phi(~sex)p(~sex) 4 674.1518 3.314144 7.668259e-02 666.1518
## 11 Phi(~time)p(~sex) 8 675.1583 4.320639 4.635954e-02 659.1583
## 3 Phi(~1)p(~time) 7 678.4802 7.642502 8.806549e-03 664.4802
## 6 Phi(~sex)p(~time) 8 680.3043 9.466601 3.537592e-03 664.3043
## 12 Phi(~time)p(~time) 12 680.9502 10.112543 2.561198e-03 656.9502
## 7 Phi(~sex * time)p(~1) 13 684.2409 13.403249 4.941690e-04 658.2409
## 8 Phi(~sex * time)p(~sex) 14 685.8996 15.061922 2.156250e-04 657.8996
## 9 Phi(~sex * time)p(~time) 18 691.4733 20.635616 1.328578e-05 655.4733
## convergence
## 1 0
## 2 0
## 4 0
## 10 0
## 5 0
## 11 0
## 3 0
## 6 0
## 12 0
## 7 0
## 8 0
## 9 0
We have a set of models ranked by AIC, but this ranking assumes that there is no overdispersion (\(\hat{c}\) = 1).
Let’s adjust our model table (dipper.cjs.models) to use QAIC with our \(\hat{c}\) estimate.
# Adjust our CJS model table to account for chat different than 1 (chat = 1.016095)
# QAIC = (-2lnLik / chat) + 2K
# This favours simpler models as chat gets larger than 1
dipper.cjs.models$model.table$chat <- dipper.chat
dipper.cjs.models$model.table$QAIC <- (dipper.cjs.models$model.table$neg2lnl/dipper.chat) + (2*dipper.cjs.models$model.table$npar)
# Note: this doesn't automatically update the order of the table!
# In our case, the order of supported models did not change
# To sort:
dipper.cjs.models$model.table <- dipper.cjs.models$model.table %>%
arrange(QAIC)
dipper.cjs.models # display models
## model npar AIC DeltaAIC weight neg2lnl
## 1 Phi(~1)p(~1) 2 670.8377 0.000000 4.021185e-01 666.8377
## 2 Phi(~1)p(~sex) 3 672.1934 1.355702 2.041583e-01 666.1934
## 3 Phi(~sex)p(~1) 3 672.6762 1.838535 1.603693e-01 666.6762
## 4 Phi(~time)p(~1) 7 673.7301 2.892416 9.468339e-02 659.7301
## 5 Phi(~sex)p(~sex) 4 674.1518 3.314144 7.668259e-02 666.1518
## 6 Phi(~time)p(~sex) 8 675.1583 4.320639 4.635954e-02 659.1583
## 7 Phi(~1)p(~time) 7 678.4802 7.642502 8.806549e-03 664.4802
## 8 Phi(~sex)p(~time) 8 680.3043 9.466601 3.537592e-03 664.3043
## 9 Phi(~time)p(~time) 12 680.9502 10.112543 2.561198e-03 656.9502
## 10 Phi(~sex * time)p(~1) 13 684.2409 13.403249 4.941690e-04 658.2409
## 11 Phi(~sex * time)p(~sex) 14 685.8996 15.061922 2.156250e-04 657.8996
## 12 Phi(~sex * time)p(~time) 18 691.4733 20.635616 1.328578e-05 655.4733
## convergence chat QAIC
## 1 0 1.016095 660.2748
## 2 0 1.016095 661.6407
## 3 0 1.016095 662.1159
## 4 0 1.016095 663.2798
## 5 0 1.016095 663.5998
## 6 0 1.016095 664.7171
## 7 0 1.016095 667.9546
## 8 0 1.016095 669.7815
## 9 0 1.016095 670.5439
## 10 0 1.016095 673.8142
## 11 0 1.016095 675.4783
## 12 0 1.016095 681.0904
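The table above now carries QAIC, but the DeltaAIC and weight columns are still on the AIC scale. A sketch of recomputing differences and Akaike weights on the QAIC scale (the function name `qaic_weights` is mine; it assumes a data frame with a `QAIC` column, like `dipper.cjs.models$model.table` after the adjustment above):

```r
# Recompute model-selection quantities on the QAIC scale:
# delta-QAIC relative to the best model, and QAIC-based Akaike weights
qaic_weights <- function(mt) {
  mt$DeltaQAIC <- mt$QAIC - min(mt$QAIC)
  mt$QAIC.weight <- exp(-0.5 * mt$DeltaQAIC) / sum(exp(-0.5 * mt$DeltaQAIC))
  mt[order(mt$QAIC), ]   # best-supported model first
}

# e.g. qaic_weights(dipper.cjs.models$model.table)
```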
Model selection favours simpler models more strongly as \(\hat{c}\) increases. In this case, the models were already largely ordered from simplest to most complex.
The end
To review, we covered:
- Estimating the variance inflation factor (\(\hat{c}\)), and
- Adjusting model selection approach when \(\hat{c}\) is greater than 1.
In general, other approaches for estimating \(\hat{c}\), especially the bootstrap and median-\(\hat{c}\) methods, are recommended over the sum of Test2 and Test3 used here. Unfortunately, there is no simple way to do this within R; I recommend exporting the mark-recapture models and estimating \(\hat{c}\) in MARK.
Further reading
Introduction to Cormack-Jolly-Seber mark-recapture models in R
Program MARK: a gentle introduction (e-book; free). http://www.phidot.org/software/mark/docs/book/. I strongly recommend reading Chapter 5 on goodness-of-fit tests.
Phidot Forum (for MARK and RMark): http://www.phidot.org/forum/index.php
Williams, Byron K., James D. Nichols, and Michael J. Conroy. 2002. Analysis and management of animal populations (usually available at University libraries)
Rachel S. McCrea, Byron J. T. Morgan. 2014. Analysis of Capture-Recapture Data. https://www.crcpress.com/Analysis-of-Capture-Recapture-Data/McCrea-Morgan/p/book/9781439836590 (usually available at University libraries)
marked Package Vignette https://cran.r-project.org/web/packages/marked/vignettes/markedVignette.pdf
R2ucare Package Vignette https://cran.r-project.org/web/packages/R2ucare/vignettes/vignette_R2ucare.html