The `stats` module uses the Stan statistical programming language to fit a Bayesian model of viral prevalences and metagenomic sequencing read counts.
Our objective is to estimate a coefficient that connects a viral clade's estimated prevalence to the number of metagenomic reads assigned to that clade across a set of samples.
For each virus and study, we estimate both an overall coefficient and a coefficient that is specific to each sampling location in the study.
The Stan code defining the model is contained in `model.stan`.
In the following sections, we will show and explain each block of the code in turn.
The `data` block defines the data that we provide to the model:
```stan
data {
  int<lower=1> J;                    // number of samples
  array[J] int<lower=0> y;           // viral read counts
  array[J] int<lower=0> n;           // total read counts
  vector[J] x;                       // estimated predictor (prevalence or incidence)
  int<lower=1> L;                    // number of sampling locations
  array[J] int<lower=1, upper=L> ll; // sampling locations
}
```
In particular, we must provide:

- The number of metagenomic samples, `J`.
- The number of reads assigned to our focal virus in each sample, $y_j$ with $j = 1, \ldots, J$.
- The total number of reads in each sample, $n_j$.
- The estimated public health predictor (prevalence or incidence) of the virus in the population contributing to each sample, $x_j$.
- The number of sampling locations in the study, `L`.
- The location of each sample, `ll`. These are provided as integer indexes ranging from 1 to `L`.
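As an illustration, here is a minimal sketch (with made-up numbers) of the data dictionary that a Python driver such as `./fit.py` might pass to Stan; the keys match the declarations in the `data` block above:

```python
# Hypothetical example data; keys match the Stan data block above.
data = {
    "J": 4,                           # number of samples
    "y": [0, 3, 1, 7],                # viral read counts per sample
    "n": [10_000_000, 12_000_000, 9_000_000, 11_000_000],  # total read counts
    "x": [1e-4, 2e-4, 1.5e-4, 3e-4],  # estimated predictor per sample
    "L": 2,                           # number of sampling locations
    "ll": [1, 1, 2, 2],               # 1-based location index of each sample
}
```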
The `transformed data` block defines quantities derived from the data that are used in the `model` block:
```stan
transformed data {
  vector[J] x_std = log(x) - mean(log(x));
  real log_mean_y = 0;
  if (sum(y) > 0) // can't normalize by this if there are no viral reads
    log_mean_y = log(mean(y));
  real log_mean_n = log(mean(n));
}
```
Because our logistic regression model works with the log of the predictors, we log-transform and mean-center them to get `x_std`.
In order to center the coefficient, we also need a central value for the read counts. We use the log of the average viral and total read counts, rather than the average of the log counts, because the viral read counts include many zeros, for which the log is undefined.
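For intuition, here is the same transformation written in Python, a sketch assuming `numpy` and the example `data` dictionary from above:

```python
import numpy as np

x = np.array(data["x"])
y = np.array(data["y"])
n = np.array(data["n"])

# Log-transform and mean-center the predictor, as in the Stan code.
x_std = np.log(x) - np.mean(np.log(x))

# Center on the log of the mean counts; guard against all-zero viral counts.
log_mean_y = np.log(y.mean()) if y.sum() > 0 else 0.0
log_mean_n = np.log(n.mean())
```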
When we run the program, Stan uses a Hamiltonian Monte Carlo algorithm to sample from the posterior distribution of the model parameters, defined in the `parameters` block:
```stan
parameters {
  vector[J] theta_std; // standardized true predictor for each sample
  real<lower=0> sigma; // standard deviation of true predictors
  real mu;             // mean P2RA coefficient (on standardized scale)
  real<lower=0> tau;   // std of P2RA coefficients per location
  vector[L] b_l;       // P2RA coefficient per location
}
```
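For example, a hypothetical CmdStanPy session might look like the following (the repository's actual entry point is `./fit.py`; this sketch assumes the `$`-prefixed hyperparameter placeholders in `model.stan` have already been substituted with concrete values):

```python
from cmdstanpy import CmdStanModel

# Compile the model (assumes the $-placeholders have been filled in).
model = CmdStanModel(stan_file="model.stan")

# Run 4 chains of Hamiltonian Monte Carlo on the data dictionary from above.
fit = model.sample(data=data, chains=4, iter_warmup=1000, iter_sampling=1000)

print(fit.summary())  # posterior means, intervals, R-hat, effective sample size
```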
The `model` block defines the model to be fit:
```stan
model {
  sigma ~ gamma($sigma_alpha, $sigma_beta);
  theta_std ~ normal(x_std, sigma);
  mu ~ normal(0, $mu_sigma);
  tau ~ gamma($tau_alpha, $tau_beta);
  b_l ~ normal(mu, tau);
  for (j in 1:J){
    y[j] ~ binomial_logit(n[j], b_l[ll[j]] + theta_std[j] + log_mean_y - log_mean_n);
  }
}
```
The first five lines give prior distributions for the parameters:

- The variance parameters `sigma` and `tau` are given gamma priors with hyperparameters supplied at runtime.
- The true standardized predictor for each sample, `theta_std`, is given a normal prior centered on the estimated value.
- The coefficients linking predictors to relative abundance are given a hierarchical model, where the overall coefficient `mu` has a prior centered at zero (because of the mean-centering) and the location-specific coefficients `b_l` are centered at `mu`.
The last two lines define the likelihood. We assume that the read counts follow a binomial distribution and are independent, conditional on the parameters. The expected relative abundance in each sample is given by the inverse logit of the sum of:

- `b_l[ll[j]]`, the coefficient for the location of the sample
- `theta_std[j]`, the true value of the (standardized) public health predictor
- `log_mean_y - log_mean_n`, included to center the coefficients near zero
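Putting these together, the full model can be sketched in mathematical notation (mirroring the Stan code above, with $\bar{y}$ and $\bar{n}$ the mean viral and total read counts, $\tilde{x}_j$ the standardized log predictor, $\ell(j)$ the location of sample $j$, and the gamma and normal hyperparameters the runtime-supplied values):

$$
\begin{aligned}
\sigma &\sim \mathrm{Gamma}(\alpha_\sigma, \beta_\sigma), \qquad
\tau \sim \mathrm{Gamma}(\alpha_\tau, \beta_\tau), \qquad
\mu \sim \mathrm{Normal}(0, \sigma_\mu), \\
\tilde{\theta}_j &\sim \mathrm{Normal}(\tilde{x}_j, \sigma), \qquad
b_\ell \sim \mathrm{Normal}(\mu, \tau), \\
y_j &\sim \mathrm{Binomial}\!\left(n_j,\ \mathrm{logit}^{-1}\!\left(b_{\ell(j)} + \tilde{\theta}_j + \log\bar{y} - \log\bar{n}\right)\right).
\end{aligned}
$$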
The `generated quantities` block defines quantities to simulate from the fitted model:
```stan
generated quantities {
  // posterior predictive viral read counts
  array[J] int<lower=0> y_tilde;
  for (j in 1:J){
    y_tilde[j] =
      binomial_rng(
        n[j],
        inv_logit(b_l[ll[j]] + theta_std[j] + log_mean_y - log_mean_n)
      );
  }
  // posterior true log predictor (log prevalence or incidence) for each sample
  vector[J] theta = theta_std + mean(log(x));
  // for convenience, a single vector with the location coefficients and
  // overall coefficient in the final position
  vector[L + 1] b;
  b[:L] = b_l;
  b[L + 1] = mu;
  // location-specific expected relative abundance
  // last element is the overall coefficient
  // Converting from 1:100K to 1:100 means multiplying by 1000
  vector[L + 1] ra_at_1in100 = inv_logit(
    b - mean(log(x)) + log_mean_y - log_mean_n + log(1000)
  );
}
```
This block generates two important quantities:

- `y_tilde`, posterior predictive samples of the viral read count in each sample, used for model checking
- `ra_at_1in100`, the vector of coefficients `b` transformed to give the expected relative abundance when the public health predictor is 1 in 100 people
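Continuing the hypothetical CmdStanPy sketch from above, these quantities can be pulled out of the fit for checking:

```python
import numpy as np

# Each row is one posterior draw.
y_tilde = fit.stan_variable("y_tilde")  # shape (draws, J)
ra = fit.stan_variable("ra_at_1in100")  # shape (draws, L + 1)

# A simple posterior predictive check on the total viral read count.
print("observed total:", sum(data["y"]))
print("posterior predictive total (mean):", y_tilde.sum(axis=1).mean())

# Overall expected relative abundance at a 1-in-100 predictor.
print("RA at 1 in 100 (posterior mean):", ra[:, -1].mean())
```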
Running `./fit.py` also generates a number of diagnostic figures in the `fig/` directory.
Each filename follows the pattern `fig/{pathogen_name}-{public_health_predictor_type}-{study}-{figure_type}.pdf`.
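For example, with placeholder names, a violin plot from a prevalence fit would be written to `fig/some_virus-prevalence-some_study-violin.pdf`.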
The figure types are:
- `datascatter`, scatterplot of the input relative abundance vs. the input public health predictor for each sample
- `predictor_vs_date`, scatterplot of the public health predictor vs. sample date for the data and eight draws from the posterior distribution
- `reads_vs_date`, scatterplot of the viral read counts vs. sample date for the data and eight draws from the posterior predictive distribution
- `viral_reads_vs_predictor`, scatterplot of viral reads vs. public health predictor for the data and eight draws from the posterior (predictive) distribution
- `violin`, violin plot of the marginal posteriors of the standardized coefficients (overall and for each sample location)
- `posthist`, histograms of the marginal posteriors of `sigma`, `mu`, and `tau`
- `sigma_vs_mu`, density plot of the joint posterior of `sigma` and `mu`
- `tau_vs_mu`, density plot of the joint posterior of `tau` and `mu`
- `tau_vs_sigma`, density plot of the joint posterior of `tau` and `sigma`
Finally, we note some limitations of the current model and possible extensions:

- Currently, we model the true prevalences as independent of one another. It would be more realistic for them to have shared as well as independent components of variation. For example, our prevalence estimates may be systematically biased in one direction or another, and that would induce a correlated effect on all of the samples. The amount of shared variation between samples might also depend on features they have in common, such as a shared sampling location or similar sampling times. (Issue #59 outlines a simple way forward here.)
- A related issue is that we will usually have prevalence estimates for multiple times and places that don't line up exactly with the sampling times and locations. In this case, we may want to have a stochastic process model (e.g., a Gaussian process) of the true prevalence that can be updated with our prevalence estimates and then emit the parameters for the binomial regressions.
- Our model only represents sample preparation methods indirectly through the sample-location terms, and so does not provide an explicit estimate of its effect. However, such an estimate would be difficult to obtain reliably from so few studies.
- Similarly, we do not directly infer any effects of viral properties (DNA vs. RNA, enveloped vs. not, etc.). In principle, we could create a larger model that estimates RA(1%) for all viruses simultaneously and includes some of these properties as factors. Because we only have public health estimates for about a dozen viruses, and those are split between incidence and prevalence, we decided that fitting such a model would not be worthwhile with this data set.