9 Modelling observation noise

Code

using Plots, Measures
include("../src/LoadData.jl")
include("../src/FitModel.jl")

Y = loadData("NZCOVID")

Observation noise is frequently highlighted as a key concern when fitting models to epidemiological data. We saw the impact of ignoring observation noise in Chapter 8. In this chapter, we examine a few possible models for obesrvation noise.

All five models in this chapter use the same state-space model:

\[ \begin{align} \log R_t | \log R_{t-1} &\sim \text{Normal}(\log R_{t-1}, \sigma) \\ I_t | R_t, I_{1:t-1} &\sim \text{Poisson}(R_t \Lambda_t) \end{align} \tag{9.1}\]

and differ only in the observation model.

9.1 Underreporting

If we assume that each infected individual has an independent probability of being reported as a case, we can model \(C_t\) using a binomial distributon: \[ C_t | I_t \sim \text{Binomial}\left(I_t, \ \rho\right) \tag{9.2}\]

The reporting rate \(\rho\) is not identifiable from reported case data alone, so must be set by the user. We use \(\rho = 0.5\) in this example. This model has the same number of fixed parameters as the basic model (one parameter, \(\sigma\)). A Julia implementation of this hidden-state model is provided in the following drop-down box:

Code

function underreportingModel(σ, Y::DataFrame, opts::Dict)

    # Specify reporting rate
    ρ = 0.5

    # Extract frequently used options
    T = opts["T"]
    N = opts["N"]
    L = opts["L"]
    ω = opts["ω"]

    # Initialise output matrices
    R = zeros(N, T)
    I = zeros(N, T)
    W = zeros(N, T)

    # Sample from initial distributions
    R[:,1] = rand.(opts["pR0"], N)
    I[:,1] = rand.(opts["pI0"], N)

    # Run the filter
    for tt = 2:T

        # Project according to the state-space model
        R[:,tt] = exp.(rand.(Normal.(log.(R[:,tt-1]), σ)))
        Λ = sum(I[:, (tt-1):-1:1] .* ω[1:(tt-1)]', dims=2)
        I[:,tt] = rand.(Poisson.(R[:,tt] .* Λ))

        # Weight according to the observation model
        W[:,tt] = pdf.(Binomial.(I[:,tt], ρ), Y.Ct[tt])

        # Resample
        inds = wsample(1:N, W[:,tt], N; replace=true)
        R[:, max(tt - L, 1):tt] = R[inds, max(tt - L, 1):tt]
        I[:, max(tt - L, 1):tt] = I[inds, max(tt - L, 1):tt]

    end

    # Store output as three-dimensional array
    X = zeros(N, T, 2)
    X[:,:,1] = R
    X[:,:,2] = I

    return(X, W)

end

we also specify the standard options dictionary:

Code

opts = Dict(

    # Bootstrap filter options
    "T" => 100, # Number of time-steps
    "N" => 1000, # Number of particles
    "L" => 50, # Fixed-lag resampling length
    "pR0" => Uniform(0, 10), # Prior on Rt at t = 0
    "pI0" => DiscreteUniform(1, 5), # Prior on It at t = 0
    "ω" => pdf.(Gamma(2.36, 2.74), 1:100) / sum(pdf.(Gamma(2.36, 2.74), 1:100)), # Serial interval

    # PMMH options
    "nChains" => 3,
    "chunkSize" => 100,
    "maxChunks" => 50,
    "maxRhat" => 1.05,
    "minESS" => 100,
    "showChunkProgress" => false,
    "propStdDevInit" => [0.1],
    "paramPriors" => [Uniform(0, 1)],
    "initialParamSamplers" => [Uniform(0.1, 0.3)],
    "paramNames" => ["σ"],

    # Marginal posterior options
    "posteriorNumberOfParticles" => 1000,
    "posteriorParamSamples" => 100,
    "stateNames" => ["Rt", "It"]
);

Now we can use fitModel() to run the PMMH algorithm and return the marginalised estimates of \(R_t\). We fit this model to the same data used in Chapter 8.

(df_states, df_params, θ, diag) = fitModel(underreportingModel, Y, opts)

Finally, we extract the hidden-state estimates (both \(R_t\) and \(I_t\)) and plot them:

Code

df_R = df_states[df_states.variable .== "Rt",:]
df_I = df_states[df_states.variable .== "It",:]

plotR = plot(df_R.date, df_R.mean, ribbon=(df_R.mean-df_R.lower, df_R.upper-df_R.mean), label=false, color=:darkgreen, xlabel="Date", ylabel="Reproduction number")
plotI = scatter(df_I.date, df_I.mean, yerror=(df_I.mean-df_I.lower, df_I.upper-df_I.mean), label=false, color=:darkgreen, xlabel="Date", ylabel="Infection incidence", markersize=2)
plot(plotR, plotI, layout=(2,1), size=(800,400), margins=3mm)

Reproduction number and infection incidence estimates from fitting a binomial observation model to data from the first 100 days of the COVID-19 pandemic in New Zealand. A fixed reporting rate of \(\rho = 0.5\) is assumed - results are heavily dependent on this assumption.

9.2 Underreporting and overdispersion

Case reporting can often be over-dispersed, exhibiting greater variance than implied by the binomial observation distribution above. Instead, we can use a beta-binomial distribution:

\[ C_t | I_t \sim \text{Beta-binomial}\left(I_t, \alpha_t, \beta_t\right) \tag{9.3}\]

where:

\[ \alpha_t = \rho \left(\frac{1}{\phi} - 1\right), \ \ \beta_t = (1-\rho) \left(\frac{1}{\phi} - 1\right) \tag{9.4}\]

This model also assumes a reporting rate of \(\rho\), but allows for additional variance through parameter \(\phi\)¹. We now must estimate two parameters: \(\sigma\) and \(\phi\).

Code

function underreportedAndOverdispersedModel(θ, Y::DataFrame, opts::Dict)

    # Specify reporting rate
    ρ = 0.5

    # Extract frequently used options
    T = opts["T"]
    N = opts["N"]
    L = opts["L"]
    ω = opts["ω"]

    # Initialise output matrices
    R = zeros(N, T)
    I = zeros(N, T)
    W = zeros(N, T)

    # Sample from initial distributions
    R[:,1] = rand.(opts["pR0"], N)
    I[:,1] = rand.(opts["pI0"], N)

    # Run the filter
    for tt = 2:T

        # Project according to the state-space model
        R[:,tt] = exp.(rand.(Normal.(log.(R[:,tt-1]), θ[1])))
        Λ = sum(I[:, (tt-1):-1:1] .* ω[1:(tt-1)]', dims=2)
        I[:,tt] = rand.(Poisson.(R[:,tt] .* Λ))

        # Weight according to the observation model
        α = ρ * (1/θ[2] - 1)
        β = (1 - ρ) * (1/θ[2] - 1)
        W[:,tt] = pdf.(BetaBinomial.(I[:,tt], α, β), Y.Ct[tt])

        # Resample
        inds = wsample(1:N, W[:,tt], N; replace=true)
        R[:, max(tt - L, 1):tt] = R[inds, max(tt - L, 1):tt]
        I[:, max(tt - L, 1):tt] = I[inds, max(tt - L, 1):tt]

    end

    # Store output as three-dimensional array
    X = zeros(N, T, 2)
    X[:,:,1] = R
    X[:,:,2] = I

    return(X, W)

end

opts = Dict(

    # Bootstrap filter options
    "T" => 100, # Number of time-steps
    "N" => 1000, # Number of particles
    "L" => 50, # Fixed-lag resampling length
    "pR0" => Uniform(0, 10), # Prior on Rt at t = 0
    "pI0" => DiscreteUniform(1, 5), # Prior on It at t = 0
    "ω" => pdf.(Gamma(2.36, 2.74), 1:100) / sum(pdf.(Gamma(2.36, 2.74), 1:100)), # Serial interval

    # PMMH options
    "nChains" => 3,
    "chunkSize" => 100,
    "maxChunks" => 50,
    "maxRhat" => 1.05,
    "minESS" => 100,
    "showChunkProgress" => false,
    "propStdDevInit" => [0.05, 0.01],
    "paramPriors" => [Uniform(0, 1), Uniform(0, 1)],
    "initialParamSamplers" => [Uniform(0.1, 0.3), Uniform(0.01, 0.02)],
    "paramNames" => ["σ", "ϕ"],

    # Marginal posterior options
    "posteriorNumberOfParticles" => 5000,
    "posteriorParamSamples" => 100,
    "stateNames" => ["Rt", "It"]
)

(df_states, df_params, θ, diag) = fitModel(underreportedAndOverdispersedModel, Y, opts; verbose=false);

Investigating the parameter estimates and plotting the samples of \(\phi\) suggest that overdispersion may be present, although we cannot rule out binomial noise (\(\phi = 0\)), conditional on \(\rho = 0.5\).

θl = resizeParams(θ)
plt = histogram(θl[:,2], bins=50, xlabel="ϕ", ylabel="Proportion of samples", color=:darkorange, legend=false, normalize=:probability, size = (600,400), margins=3mm)

Estimates of \(R_t\) are similar, but slightly smoother, than the binomial model above.

Code

df_R = df_states[df_states.variable .== "Rt",:]
df_I = df_states[df_states.variable .== "It",:]

plotR = plot(df_R.date, df_R.mean, ribbon=(df_R.mean-df_R.lower, df_R.upper-df_R.mean), label=false, color=:darkgreen, xlabel="Date", ylabel="Reproduction number")
plotI = scatter(df_I.date, df_I.mean, yerror=(df_I.mean-df_I.lower, df_I.upper-df_I.mean), label=false, color=:darkgreen, xlabel="Date", ylabel="Infection incidence", markersize=2)
plot(plotR, plotI, layout=(2,1), size=(800,400), margins=3mm)

9.3 Observation noise without underreporting

In Section 9.1, we assumed a pre-determined reporting rate \(\rho\). Without additional information this parameter is not identifiable. If we do not want to assume a value of \(\rho\), a popular distribution for modelling (potentially) overdispersed data is the negative binomial:

\[ C_t | I_t \sim \text{Negative binomial}\left(r = \frac{I_t}{k}, p=\frac{1}{1 + k} \right) \]

which has mean \(I_t\) and variance \((1+k) I_t\), where \(k\) is a dispersion parameter. This results in two parameters to be estimated: \(\sigma\) and \(k\).

Code

function overdispersedModel(θ, Y::DataFrame, opts::Dict)

    # Extract frequently used options
    T = opts["T"]
    N = opts["N"]
    L = opts["L"]
    ω = opts["ω"]

    # Initialise output matrices
    R = zeros(N, T)
    I = zeros(N, T)
    W = zeros(N, T)

    # Sample from initial distributions
    R[:,1] = rand.(opts["pR0"], N)
    I[:,1] = rand.(opts["pI0"], N)

    # Run the filter
    for tt = 2:T

        # Project according to the state-space model
        R[:,tt] = exp.(rand.(Normal.(log.(R[:,tt-1]), θ[1])))
        Λ = sum(I[:, (tt-1):-1:1] .* ω[1:(tt-1)]', dims=2)
        I[:,tt] = rand.(Poisson.(R[:,tt] .* Λ))

        # Weight according to the observation model
        r = I[:,tt] / θ[2]
        p = 1 / (1 + θ[2])
        W[:,tt] = fastNegativeBinomialPDF(Y.Ct[tt], r, p)

        # Resample
        inds = wsample(1:N, W[:,tt], N; replace=true)
        R[:, max(tt - L, 1):tt] = R[inds, max(tt - L, 1):tt]
        I[:, max(tt - L, 1):tt] = I[inds, max(tt - L, 1):tt]

    end

    # Store output as three-dimensional array
    X = zeros(N, T, 2)
    X[:,:,1] = R
    X[:,:,2] = I

    return(X, W)

end


opts = Dict(

    # Bootstrap filter options
    "T" => 100, # Number of time-steps
    "N" => 1000, # Number of particles
    "L" => 50, # Fixed-lag resampling length
    "pR0" => Uniform(0, 10), # Prior on Rt at t = 0
    "pI0" => DiscreteUniform(1, 5), # Prior on It at t = 0
    "ω" => pdf.(Gamma(2.36, 2.74), 1:100) / sum(pdf.(Gamma(2.36, 2.74), 1:100)), # Serial interval

    # PMMH options
    "nChains" => 3,
    "chunkSize" => 100,
    "maxChunks" => 50,
    "maxRhat" => 1.05,
    "minESS" => 100,
    "showChunkProgress" => false,
    "propStdDevInit" => [0.05, 0.1],
    "paramPriors" => [Uniform(0, 1), Uniform(0, 10)],
    "initialParamSamplers" => [Uniform(0.1, 0.3), Uniform(0.1, 0.5)],
    "paramNames" => ["σ", "k"],

    # Marginal posterior options
    "posteriorNumberOfParticles" => 5000,
    "posteriorParamSamples" => 100,
    "stateNames" => ["Rt", "It"]

)

(df_states, df_params, θ, diag) = fitModel(overdispersedModel, Y, opts; verbose=false)

θl = resizeParams(θ)
plt = histogram(θl[:,2], bins=50, xlabel="k", ylabel="Proportion of samples", color=:darkorange, legend=false, normalize=:probability, size = (800,300), margins=3mm)

Code

df_R = df_states[df_states.variable .== "Rt",:]
df_I = df_states[df_states.variable .== "It",:]

plotR = plot(df_R.date, df_R.mean, ribbon=(df_R.mean-df_R.lower, df_R.upper-df_R.mean), label=false, color=:darkgreen, xlabel="Date", ylabel="Reproduction number", size=(800,300), margins=3mm)

9.4 Reporting delays

Reporting delays are a common feature of epidemiological data, infections are vary rarely reported on the same day they occur. Accounting for these delays is important if we want our model to reflect the true underlying transmission dynamics. For this example, we follow Hendy et al. (2021) and assume a (discretised) gamma-distributed delay with a mean of 5.5 days and standard deviation of 2.3 days. The expected number of reported cases at time \(t\) is given by:

\[ \mu_t = \sum_{u=1}^{u_{max}} I_{t-u} d_u \]

where \(d_u = \text{Gamma PDF}(u; \mu=5.5, \sigma=2.3)\). Using the same negative binomial distribution as above, we can model the number of reported cases as: \[ C_t | \mu_t, \phi \sim \text{Negative Binomial}\left(r = \frac{\mu_t}{k}, p=\frac{1}{1 + k} \right) \]

function delayedModel(θ, Y::DataFrame, opts::Dict)

    # Extract frequently used options
    T = opts["T"]
    N = opts["N"]
    L = opts["L"]
    ω = opts["ω"]
    delayDist = opts["delayDist"]

    # Initialise output matrices
    R = zeros(N, T)
    I = zeros(N, T)
    W = zeros(N, T)

    # Sample from initial distributions
    R[:,1] = rand.(opts["pR0"], N)
    I[:,1] = rand.(opts["pI0"], N)

    # Run the filter
    for tt = 2:T

        # Project according to the state-space model
        R[:,tt] = exp.(rand.(Normal.(log.(R[:,tt-1]), θ[1])))
        Λ = sum(I[:, (tt-1):-1:1] .* ω[1:(tt-1)]', dims=2)
        I[:,tt] = rand.(Poisson.(R[:,tt] .* Λ))

        # Weight according to the observation model
        μt = sum(I[:,(tt-1):-1:1] .* delayDist[1:(tt-1)]', dims=2) ./ sum(delayDist[1:(tt-1)])
        r = vec(μt) ./ θ[2]
        p = 1 / (1 + θ[2])
        W[:,tt] = fastNegativeBinomialPDF(Y.Ct[tt], r, p)

        # Resample
        inds = wsample(1:N, W[:,tt], N; replace=true)
        R[:, max(tt - L, 1):tt] = R[inds, max(tt - L, 1):tt]
        I[:, max(tt - L, 1):tt] = I[inds, max(tt - L, 1):tt]

    end

    # Store output as three-dimensional array
    X = zeros(N, T, 2)
    X[:,:,1] = R
    X[:,:,2] = I

    return(X, W)

end

opts = Dict(

    # Bootstrap filter options
    "T" => size(Y, 1), # Number of time-steps
    "N" => 2000, # Number of particles
    "L" => 50, # Fixed-lag resampling length
    "ω" => pdf.(Gamma(2.36, 2.74), 1:128), # Serial interval
    "delayDist" => pdf.(Gamma(5.72, 0.96), 1:200), # Observation delay distribution
    "pR0" => Uniform(0, 10), # Prior on Rt at t = 0
    "pI0" => DiscreteUniform(1, 5), # Prior on I at t = 0

    # PMMH options
    "nChains" => 3, # Number of chains
    "chunkSize" => 100, # Number of iterations
    "maxChunks" => 50, # Maximum number of chunks
    "maxRhat" => 1.05,  # Stopping criterion: maximum Rhat value
    "minESS" => 100, # Stopping criterion: minimum effective sample size
    "showChunkProgress" => true, # Whether to show progress of each chunk
    "propStdDevInit" => [0.05, 0.1], # Initial proposal standard deviation (this is adaptively fit)
    "paramPriors" => [Uniform(0, 1), Uniform(0, 10)],
    "initialParamSamplers" => [Uniform(0.1, 0.3), Uniform(0.1, 0.5)],
    "paramLimits" => [(0, 1), (0, 10)],
    "paramNames" => ["σ", "k"],

    # Marginal posterior options
    "posteriorNumberOfParticles" => 10000,
    "posteriorParamSamples" => 100,
    "stateNames" => ["Rt", "It"]

)

(df_states, df_params, θ, diag) = fitModel(delayedModel, Y, opts; verbose=false, checkStdDevLogLik=false);

df_R_del = df_states[df_states.variable .== "Rt",:]

plotR = plot(df_R.date, df_R.mean, ribbon=(df_R.mean-df_R.lower, df_R.upper-df_R.mean), color=:lightgreen, xlabel="Date", ylabel="Reproduction number", label="Without delay", size=(800,400), margins=3mm)
plotR = plot!(plotR, df_R_del.date, df_R_del.mean, ribbon=(df_R_del.mean-df_R_del.lower, df_R_del.upper-df_R_del.mean), color=:darkgreen, xlabel="Date", ylabel="Reproduction number", label="With delay")

\(R_t\) estimates that account for delayed reporting (dark green) are shifted earlier and feature more uncertainty than those that do not (light green). Accounting for these delays is important if we want our model to reflect the true underlying transmission dynamics, for example, if the model were to be used to evaluate the effectiveness of non-pharmaceutical interventions (NPIs) or to inform public health policy.

9.5 Day-of-the-week effects

Finally, day-of-the-week effects can have a large impact on epidemic models, if these effects are not appropriately treated as noise. Defining \(c_i\) as the relative reporting rate on day \(i\) (where \(c_i > 1\) suggests a greater-than-average reporting rate), we update our model to allow for a day-of-the-week effect:

\[ C_t | \mu_t, \phi, \{c_i\} \sim \text{Negative Binomial}\left(r = \frac{c_{\text{mod}(t,7)}\mu_t}{k}, p=\frac{1}{1 + k} \right) \]

where \(c_i, i = 1, \ldots, 7\) are the relative reporting rates on each day of the week. Parameters \(c_1\) through \(c_6\) are estimated, with \(c_7 = 7 - \sum_{i=1}^6 c_i\).

When fitting this, we use data from the COVID-19 pandemic in Aotearoa New Zealand between 1 April 2024 and 9 July 2024, a 100-day period in which a clear day-of-the-week effect was observed.

Yin = loadData("NZCOVID_1APR2024");
Y = Yin[1:100,:];

function dayofweekModel(θ, Y::DataFrame, opts::Dict)

    # Check that θ are valid
    if sum(θ[3:8]) > 7
        error("Day of week multipliers must sum to less than or equal to 7")
    end

    # Extract frequently used options
    T = opts["T"]
    N = opts["N"]
    L = opts["L"]
    ω = opts["ω"]
    delayDist = opts["delayDist"]

    # Extract frequently used parameters
    day_of_week_mult = vcat(θ[3:8], 7 - sum(θ[3:8]))

    # Initialise output matrices
    R = zeros(N, T)
    I = zeros(N, T)
    W = zeros(N, T)

    # Sample from initial distributions
    R[:,1] = rand(opts["pR0"], N)
    I[:,1] = rand(opts["pI0"], N)

    # Run the filter
    for tt = 2:T

        # Project according to the state-space model
        R[:,tt] = exp.(rand.(Normal.(log.(R[:,tt-1]), θ[1])))
        Λ = sum(I[:, (tt-1):-1:1] .* ω[1:(tt-1)]', dims=2) ./ sum(ω[1:(tt-1)])
        I[:,tt] = rand.(Poisson.(R[:,tt] .* Λ))

        # Weight according to the observation model
        μt = day_of_week_mult[1 + tt % 7] * sum(I[:,(tt-1):-1:1] .* delayDist[1:(tt-1)]', dims=2) ./ sum(delayDist[1:(tt-1)])
        r = vec(μt)/θ[2]
        p = 1 / (1  + θ[2])
        W[:,tt] = pdf.(NegativeBinomial.(r, p), Y.Ct[tt])

        # Resample
        inds = wsample(1:N, W[:,tt], N; replace=true)
        R[:, max(tt - L, 1):tt] = R[inds, max(tt - L, 1):tt]
        I[:, max(tt - L, 1):tt] = I[inds, max(tt - L, 1):tt]

    end


    # Store output as three-dimensional array
    X = zeros(N, T, 2)
    X[:,:,1] = R
    X[:,:,2] = I

    return(X, W)

end

opts = Dict(

    # Bootstrap filter options
    "T" => size(Y, 1), # Number of time-steps
    "N" => 1000, # Number of particles
    "L" => 50, # Fixed-lag resampling length
    "ω" => pdf.(Gamma(2.36, 2.74), 1:128), # Serial interval
    "delayDist" => pdf.(Gamma(5.72, 0.96), 1:200), # Observation delay distribution
    "pR0" => Uniform(0, 10), # Prior on Rt at t = 0
    "pI0" => DiscreteUniform(200, 600), # Prior on I at t = 0

    # PMMH options
    "nChains" => 3, # Number of chains
    "chunkSize" => 100, # Number of iterations
    "maxChunks" => 100, # Maximum number of chunks
    "maxRhat" => 1.05,  # Stopping criterion: maximum Rhat value
    "minESS" => 100, # Stopping criterion: minimum effective sample size
    "showChunkProgress" => true, # Whether to show progress of each chunk
    "propStdDevInit" => sqrt.([0.0002, 7e-6, 0.0005, 0.001, 0.001, 0.001, 0.001, 0.0005]), # Initial proposal standard deviation (this is adaptively fit)
    "paramPriors" => vcat([Uniform(0, 1), Uniform(0, 10)], repeat([Uniform(0.1, 2)], 6)),
    "initialParamSamplers" => [Uniform(0.07, 0.09), Uniform(4, 6), Uniform(0.75, 0.80), Uniform(1.3, 1.35), Uniform(1.2, 1.25), Uniform(1.1, 1.2), Uniform(0.95, 1.00), Uniform(0.9, 0.95)],
    "paramLimits" => vcat([(0, 1), (0, 10)], repeat([(0.1, 2)], 6)),
    "paramNames" => ["sigma", "k", "Day1", "Day2", "Day3", "Day4", "Day5", "Day6"],

    # Marginal posterior options
    "posteriorNumberOfParticles" => 10000,
    "posteriorParamSamples" => 100,
    "stateNames" => ["Rt", "It"]

)

(df_states, df_params, θ, diag) = fitModel(dayofweekModel, Y, opts; verbose=false, checkStdDevLogLik=false);

θl = resizeParams(θ)
C = zeros(size(θl, 1), 7)
C[:,1:6] = θl[:,3:8]
C[:,7] = 7 .- sum(C, dims=2)
m = vec(mean(C, dims=1))
l = [quantile(C[:,i], 0.025) for i in 1:7]
u = [quantile(C[:,i], 0.975) for i in 1:7]

df_R = df_states[df_states.variable .== "Rt",:]

day_of_week_label = ["Sun", "Mon", "Tue", "Wed", "Thur", "Fri", "Sat"]

plt_cases = bar(Y.date, Y.Ct, color=:darkblue, size=(800,400), xlabel="Date", ylabel="Reported cases", label=false, margins=3mm)
plt_Rt = plot(df_R.date, df_R.mean, ribbon=(df_R.mean-df_R.lower, df_R.upper-df_R.mean), color=:darkgreen, xlabel="Date", ylabel="Reproduction number", label=false)

plt_dow = scatter(day_of_week_label, m, errorbar=(m-l, u-m),
    xlabel="Day of week", ylabel="Relative reporting rate", legend=false,
    linewidth=3, color=:darkred, lc=:darkred, markercolor=:darkred, markersize=5,
    yguidefont=font(10), xguidefont=font(10), size=(400,300))

plt = plot(plt_cases, plt_Rt, plt_dow, layout=(3,1), size=(800,600), legend=false)

Demonstrating a clear day-of-the-week effect, with reporting more likely to occur on weekdays than weekends.

9.6 Supplementary material

9.6.1 “Estimating” the underreporting rate

We state in Section 9.1 that the reporting rate \(\rho\) is not identifiable from these data. Actually, as \(\rho\) determines the observation variance, the data likely contain some information about this term.

The observation variance of \(C_t\) is given by:

\[ Var(C_t) = I_t \rho (1 - \rho) \]

Holding \(E[C_t] = I_t \rho\) fixed, we can see that the observation variance is a decreasing function of \(\rho\). That is, the smaller the estimated value of \(\rho\), the greater the estimated observation variance. This is somewhat approximate, as the model must choose a temporally fixed value of \(\rho\), but is free to vary \(I_t\) (within the constraints of the epidemc model).

A combination of model misspecification (reported cases \(C_t\) are unlikely to perfectly follow a binomial distribution with mean \(I_t \rho\)) and limited data (even if the model was correctly specified, a large quantity of data would be required to estimate this parameter) limit the real-world applicability of this approach for inference. In terms of predictions, where identifiabilitiy is not required, this may be a useful approach.

Code

Y = loadData("NZCOVID")

function underreportingModel2(θ, Y::DataFrame, opts::Dict)

    # Extract frequently used options
    T = opts["T"]
    N = opts["N"]
    L = opts["L"]
    ω = opts["ω"]

    # Initialise output matrices
    R = zeros(N, T)
    I = zeros(N, T)
    W = zeros(N, T)

    # Sample from initial distributions
    R[:,1] = rand.(opts["pR0"], N)
    I[:,1] = rand.(opts["pI0"], N)

    # Run the filter
    for tt = 2:T

        # Project according to the state-space model
        R[:,tt] = exp.(rand.(Normal.(log.(R[:,tt-1]), θ[1])))
        Λ = sum(I[:, (tt-1):-1:1] .* ω[1:(tt-1)]', dims=2)
        I[:,tt] = rand.(Poisson.(R[:,tt] .* Λ))

        # Weight according to the observation model
        W[:,tt] = pdf.(Binomial.(I[:,tt], θ[2]), Y.Ct[tt])

        # Resample
        inds = wsample(1:N, W[:,tt], N; replace=true)
        R[:, max(tt - L, 1):tt] = R[inds, max(tt - L, 1):tt]
        I[:, max(tt - L, 1):tt] = I[inds, max(tt - L, 1):tt]

    end

    # Store output as three-dimensional array
    X = zeros(N, T, 2)
    X[:,:,1] = R
    X[:,:,2] = I

    return(X, W)

end

opts = Dict(

    # Bootstrap filter options
    "T" => 100, # Number of time-steps
    "N" => 1000, # Number of particles
    "L" => 50, # Fixed-lag resampling length
    "pR0" => Uniform(0, 10), # Prior on Rt at t = 0
    "pI0" => DiscreteUniform(1, 5), # Prior on It at t = 0
    "ω" => pdf.(Gamma(2.36, 2.74), 1:100) / sum(pdf.(Gamma(2.36, 2.74), 1:100)), # Serial interval

    # PMMH options
    "nChains" => 3,
    "chunkSize" => 100,
    "maxChunks" => 50,
    "maxRhat" => 1.05,
    "minESS" => 500,
    "showChunkProgress" => false,
    "propStdDevInit" => [0.1, 0.1],
    "paramPriors" => [Uniform(0, 1), Uniform(0, 1)],
    "initialParamSamplers" => [Uniform(0.1, 0.3), Uniform(0.1, 0.9)],
    "paramNames" => ["σ", "ρ"],

    # Marginal posterior options
    "posteriorNumberOfParticles" => 10000,
    "posteriorParamSamples" => 100,
    "stateNames" => ["Rt", "It"]
);

(df_states, df_params, θ, diag) = fitModel(underreportingModel2, Y, opts)

#| code-fold: true
#| 
df_R = df_states[df_states.variable .== "Rt",:]
df_I = df_states[df_states.variable .== "It",:]

plotR = plot(df_R.date, df_R.mean, ribbon=(df_R.mean-df_R.lower, df_R.upper-df_R.mean), label=false, color=:darkgreen, xlabel="Date", ylabel="Reproduction number", size=(800,300), margins=3mm)


display(df_params)

2 rows × 6 columns

	param	m	l	u	rhat	ess
	String	Float64	Float64	Float64	Float64	Float64
1	σ	0.222293	0.151875	0.32569	1.00763	517.501
2	ρ	0.305748	0.0387135	0.864569	1.00584	527.193

The model has estimated \(\rho = 0.31 \ (0.04, 0.86)\). Plotting the samples of \(\rho\) below demonstrates a clear non-uniform posterior distribution, so there is some information about this parameter in the data, even if we are not necessarily estimating an interpretable quantity.

θl = resizeParams(θ)
plt = histogram(θl[:,2], bins=50, xlabel="ρ", ylabel="Proportion of samples", color=:darkorange, legend=false, normalize=:probability, size = (600,400), margins=3mm)

The beta-binomial explicitly assumes that \(I_t\) are binomially distributed with random probability \(\rho \sim Beta(\alpha_t, \beta_t)\). Under this interpretation, \((1/\phi - 1)\) can be thought of as the number of “prior trials”, and \(\alpha_t\) the number of “prior successes”.↩︎