5.10 theta

Estimating heterozygosity

theta infers the stationary base frequencies \((\boldsymbol{\pi} = \{\pi_A, \pi_C, \pi_G, \pi_T\})\), along with the rate of substitutions \((\theta=2T\mu)\) along the genealogy connecting the two alleles of an individual within a genomic window (default=non overlapping windows of 1Mbp). Here, T corresponds to the time to the most recent common ancestor of the two lineages and \(\mu\) to the mutation rate per base pair per generation. It is not possible to infer (T) and \(\mu\) independently, and therefore only estimate the compound substitution rate \(\theta\) is estimated from the data. To estimate \(\theta\), Felsenstein’s 1981 model of substitutions is used:

\[P(g_i = kl | \theta, \boldsymbol{\pi}) = \begin{cases} \pi_k (e^{-\theta} + \pi_k (1-e^{-\theta})) & \text{if } k=l,\\ \pi_k \pi_l (1-e^{-\theta}) & \text{if } k \neq l, \end{cases} \]

where \(k\) and \(l\) are the two alleles of a diploid individual and \(e^{-\theta}\) is the probability of no mutation having happened in the time to their coalescence. See the methods section below for details on how Felsensteins model is extended to account for the uncertainty in the local genotypes. Base-specific rates of sequencing errors and PMD are taken into account as known constants.

Please note, the method results in slightly different estimates for \(\theta\) than the \(\theta\) statistics calculated based on the infinite sites model (implemented e.g. in ANGSD), as Felsenstein’s model allows for back mutations. You can transform our \(\theta\) estimate into e.g. Watterson’s \(\theta_W\) statistic with

\[ \theta_W = 1-\sum_k{\pi_k}(e^{-\theta} +{\pi_k}(1-e^{-\theta}) )\] For samples with very low sequencing depth, we implemented a way to consider all sites with data to produce a single genome-wide estimate for \(\theta\). Pass the argument genomeWide to use this functionality. We generally use this in combination with depth=2, i.e. only use sites that contain information about \(\theta\) to speed up the process and limit the resource requirements. In order to provide some information about the variability of the genome-wide estimate, we have implemented bootstrapping. First we sample from a binomial distribution how many sites there are with data, and then we randomly select as many sites as had data originally to provide bootstrapped genome-wide estimates.

5.10.1 Input

Required inputs :

--bam Input_bam_file.bam Input bam file

Optional inputs :

`--pmd "library_type:model_for_5'_read:model_for_3'_read"` or `--RGInfo`	Library type followed by the model to be used for the 5-prime read-end and the 3-prime read-end which can be either “Exponential” or “Empiric”. All three arguments must be provided as a string, divided by colons (:). e.g. : –pmdModels “doubleStrand:Exponential:Exponential”. Used to specify Post-mortem damage parameters. Can also be provided as a `.txt` file (see PMDfor generating such a file). Default = Will assume there is no PMD in the data.
`--recal recal_model` or `--RGInfo`	A common recal model for all read groups. Used to specify Quality score recalibration parameters. Can also be provided as a .txt file (see recal for further information). Default = ‘-’, no recalibration is performed.
`--mask file.bed`	To specify a 0-based bed file with regions to be masked. Default = no region is masked.

Specific Parameters :

`--equalBaseFreqs`	To specify equal base frequencies. Default = Will estimate base frequencies.
`--printAll`	A common recal model for all read groups. Used to specify Quality score recalibration parameters. Can also be provided as a .txt file (see recal for further information). Default = Will only print windows for which estimation was possible.
`--prob float_values`	To specify theta estimation after downsampling reads with indicated probabilities. The float-values need to be between 0 and 1. Multiple float values separated by comma can be provided. Default = Will not downsample reads.
`--minSitesWithData integer_value`	To specify the minimum amount of sites with data required to run theta estimation. Default =?
`--extraVerbose`	To print current \(\theta\) estimates for each iteration of EM-algorithm. Default = Current \(\theta\) estimates for each iteration of EM-algorithm are not printed
`--iterations integer_value`	To specify maximum number of EM-iterations to be performed. Default = 100.
`--NRiterations integer_value`	To specify maximum number of Newton-Raphson iterations per EM-iteration. Default = 20.
`--iterationsThetaOnly integer_value`	To specify number of iterations within each full iteration where only \(\theta\) is updated. Default = 10)
`--genomeWide`	To estimate genome-wide heterozygosity \((\theta)\). Default = Will estimating heterozygosity \((\theta)\) per Window.
`--regions`	To specify custom regions for which one single \(\theta\) is to be estimated when `--genomeWide` is used . Default = \(\theta\) is estimated genome-wide.
`--bootstraps integer_value`	To specify number of bootstrap replicates for which \(\theta\) is estimated when `--genomeWide` is used . Default = Will not conduct any bootstrap replicates.

See Filter parameters to apply specific filters for bases, reads and parsing window setting.

Engine parameters that are common to all tasks can be found here.

5.10.2 Output

*_RGInfo.json	File containing read group info.
*_filterSummary.txt	File containing filter summary of general filter counts.
*_theta.txt.gz	File containing theta estimates and some additional data including the Fisher confidence intervals for \(\theta\).

Note: Fisher CI: The confidence intervals are calculated by using the second derivatives of the log-likelihood function around the MLE to fit a quadratic function to the peak. The CI are then the theta values found when descending by 2-log-likelihood units on either side of the MLE.

5.10.3 Usage Example

#! /bin/bash

. $(dirname $0)/find_atlas
. $(dirname $0)/simulate

rm ATLAS_simulations.bam.bai # will automatically recreate it

# theta
$atlas --task theta --bam ATLAS_simulations.bam --printAll --extraVerbose --fixedSeed 0 --out standard --logFile standard.out

# theta with downsample
$atlas --task theta --bam ATLAS_simulations.bam --prob 1,0.5 --extraVerbose --fixedSeed 0 --out downsample --logFile downsample.out 

# theta genomewide
$atlas --task theta --genomeWide --bootstraps 3 --bam ATLAS_simulations.bam --extraVerbose --fixedSeed 0 --out genomewide --logFile genomewide.out

# theta genomewide with downsample
$atlas --task theta  --prob 1,0.5 --genomeWide --bootstraps 3 --bam ATLAS_simulations.bam --extraVerbose --fixedSeed 0 --out genomewide_downsample --logFile genomewide_downsample.out

5.10.4 Additional Information

Method

Consider that at each site \(i\) there are \(n_i\) reads. We denote by \(d_{ij}\), \(j=1,\ldots, n_i\) the base of read \(j\) covering site \(i\).

The likelihood function for estimating \(\theta\) in a genomic window of length \(I\) is the following: \[ \mathbb{P}(\boldsymbol{d}|\theta, \boldsymbol{\pi}) = \prod_{i=1}^I \sum_g \prod\limits_{j=1}^{n_i} \mathbb{P}(d_{ij}|g_i=g)\mathbb{P}(g_i=g|\theta,\boldsymbol{\pi}), \] where \(g_i\) is the genotype at that site and \(g_i\in\{AA,AC,...,TT\}\) is one of all possible 10 diploid genotypes.

\(\mathbb{P}(g_i|\theta,\boldsymbol{\pi})\) is given by Felsenstein’s substitution model 1981 (see above) and the probabilities \(\mathbb{P}(d_i|g_i=g)\) are given by our Genotyping Model, which takes into account the recalibrated error rates and the post-mortem damage patterns (if the necessary parameter values are provided).

\(\theta\) and \(\boldsymbol{\pi}\) are inferred using a standard EM-algorithm, described in detail in Kousathanas, A. et al. (2017). Inferring Heterozygosity from Ancient and Low Coverage Genomes. Genetics, 205(1), 317–332 (page 3: Inference using Expectation-Maximization)