5.4 estimateErrors
Estimating PMD pattern and Sequencing Errors
estimateErrors
uses genotype likelihood model to estimate the likelihood of sequencing errors and PMD using Expectation Maximization (EM) algorithm. The sequencing errors are recalibrated based on quality score of reads, fragment length, mapping quality, position in the read and context.
See Additional information for detailed description of the model.
5.4.1 Input
Required inputs :
--bam Input_bam_file.bam |
Input bam file. |
--fasta Input_refrence_genome_file.fasta |
Reference genome. |
Optional inputs :
--recalModel recal_model |
Semicolon separated recalibration model indicating which covariates and which regression model (empiric or polynomial) to use, e.g."intercept;quality:polynomial3;position:polynomial3;fragmentLength:polynomial3;mappingQuality:polynomial3;context;" . Default = all five covariates are used and empiric regression is used for all. |
--pmdModel pmd_model |
pmd model to be used, e.g."CT5:0.2*exp(-0.3*p)+0.01;GA3:0.5*exp(-0.2*p)+0.01" . Default = CT5;GA5;CT3;GA3 . |
Specific Parameters :
- See Filter parameters to apply specific filters for bases, reads and parsing window setting.
Engine parameters that are common to all tasks can be found here.
5.4.3 Usage Example
#! /bin/bash
. $(dirname $0)/find_atlas
recal="intercept[.0];quality:polynomial[0.95,0.01];position:polynomial[0.02]"
pmd="CT5:0.2*exp(-0.3*p)+0.01;GA3:0.2*exp(-0.3*p)+0.01"
delta=10
k="111"
L="100$k"
. $(dirname $0)/simulate --recal $recal --pmd $pmd \
--type "HKY85" --mu 0.55 --thetaG 0.00033 --thetaR 0.015 \
--chrLength $L,$L --depth 9 --ploidy 2,1 --numReadGroups 3 \
--baseQuality 'categorical(12:1237,22:845,27:1912,32:21069,37:34246,41:339557)' \
--fragmentLength 'normal(40,10)[30,101]' \
--mappingQuality 'normal(60,10)[30,60]' --fixedSeed 80
echo "chr1 0 $L" > bed1.bed
echo "chr2 0 99$k" > bed2.bed
echo "readGroup poolWith" > recal.pool
echo "SimReadGroup2 SimReadGroup1" >> recal.pool
echo "readGroup poolWith" > pmd.pool
echo "SimReadGroup3 SimReadGroup1" >> pmd.pool
out="estimateErrors"
$atlas --task estimateErrors --minDeltaLL $delta \
--bam simulate.bam --fasta simulate.fasta \
--poolRecal "all" --poolPMD "all" \
--fixedSeed 81 --out $out --logFile $out.out 2> $out.eout
recalModel="intercept;quality;position:polynomial1;fragmentLength:polynomial2;mappingQuality;context"
out="specificModel"
$atlas --task estimateErrors --minDeltaLL $delta --recalModel $recalModel \
--bam simulate.bam --fasta simulate.fasta \
--regions bed1.bed,bed2.bed --ploidy 2,1 --window 45672 --writeRestart \
--poolRecal "recal.pool" --poolPMD "pmd.pool" \
--fixedSeed 82 --out $out --logFile $out.out 2> $out.eout
Additional Information
Genotype likelihood model for PMD and sequencing error recalibration
The model implemented in ATLAS estimates three parameters in parallel, namely genotype distribution, PMD and recalibration parameter which is described below.
The figure above shows the directed acyclic graph (DAG) of the model, where the hidden genotype at site are given by , where is one of the 10 possible diploid genotypes; the sequencing data of a single individual at site consist of reads that cover this site and the base of read covering site is denoted by .
The data observed at a site with true genotype is a function of four processes:
The distribution of genotypes within the genome, and let denote the parameters governing that distribution.
The random sampling of alleles of from genotypes. We will denote by the sampled allele in read covering site . Given a genotype with at that site we have
The subset of bases that make up genotype are denoted by .
PMD that potentially modifies . We will denote the resulting base by and by the parameters governing PMD. In the case of no PMD, .
Sequencing errors may result in . The parameters governing sequencing errors is denoted by .
Assuming independence among sequencing reads when conditioned on , we then have,
where denotes the full data with .
This formulation is very general and allows for different models regarding the distribution of genotypes , PMD and sequencing errors .
Implemented models regarding sequencing errors
During implementation of the model regarding sequencing errors i.e. the probability function , ATLAS assumes that,
where is the relevant sequencing error rate and reflects the relative rate for a sequencing error at a particular base to result in any of the three alternative bases .
The error rates themselves are functions of a given external vector of information $q_{ij} with parameters $ ϵ_{ij} = ϵ(q_{ij},) = {0}+ {l=k}^{K} {k}(q{ijk},_{k}) $$
with covariates $q_{ij} = {q_{ij1}, . . . . . . , q_{ijK} of which each has a function with parameters . Thus the total parameter vector is $ {ϵ} = (,)$ with $ = ({AC},{AG}, . . . . . ,{TG})$ and .
- The parameters is estimated using the general EM algorithm introduced above.
ATLAS considers the following potential covariates:
- The transformed machine-reported quality score . We transform the original quality score as
to ensure that using results in no transformation.
The position within the read , either as with parameter .
The sequencing context , i.e. the base that was reported just previously of the current base in 5’–>3’ direction (i.e. for forward mapping reads and(i.e. for reverse mapping reads).
Importantly, we will also infer an independent error model (i.e. independent ) for each read group.
Implemented models regarding PMD
A characteristic feature of ancient DNA is post-mortem damage (PMD). The most common form of PMD is C-deamination, which leads to a C → T transition on the affected strand and a G → A transition on the complimentary strand. These deaminations do not occur randomly along the whole read, but instead are observed much more frequently at the beginning of a read. This is due to fragment ends being more often single-stranded and thus subject to a much higher rate of deamination.
During implementation of the model regarding PMD i.e. the probability function , In case of C → T damage,
where denotes the probability that the base at site was affected by PMD. Analogously, in case of G → A damage,
ATLAS assumes that the rate $D_{ij} = D(q_{ij} , θ_{D}) as well as the type of PMD damage to occur to be a function of a given external vector of information . Specifically, it models as a function of the distance in bases from the nearest end of the fragment (in case of paired-end sequencing) or read (in case of single end sequencing).
Let be the vector of PMD probabilities for distances up to a maximum distance . We then have,
ATLAS considers the following additional categorical information:
It fits independent patterns for each read group.
It uses different PMD patterns depending on whether a single-strand or double strand sequencing library was prepared:
- For single-strand libraries, ATLAS assumes that for forward mapping reads C → T damage occurs on both ends while for reverse mapping reads G → A damage occurs at both ends. It further assumes that damages occur at the same rates (i.e. at the same ) for both types of reads at both ends.
- For double-strand libraries, ATLAS assumes that for forward mapping reads C → T damage occurs at 5’ ends and G → A damage at 3’ ends, and the inverse for reverse mapping reads. It further assumes that damages occur at the same rates at 5’ ends of both forward and reverse mapping reads, and, equivalently, at the same rates at 3’ ends of both types of reads.
ATLAS distinguishes between single-end and paired-end sequencing:
- For single-end sequencing, the 5’ end of the read always marks the 5’ end of the fragment, and the distance measured from it thus reflects the distance relevant for PMD. ATLAS will thus infer a common pattern for all single-end read lengths at the 5’ end. The sequencing read may not reach the 3’ end of the fragment, however. Among the reads with lengths matching the maximum number of sequencing cycles , for which measured from the 3’ end underestimates the true distance to the 3’ end of the fragment. This leads to a reduced PMD rate compared to the 5’ end at the same distance . Due to errors during adapter removal, this issue may also affect reads a few bases shorter than . ATLAS will thus infer a common pattern for all reads shorter than and a different pattern for all reads of each of the lengths $ C_{max},C_{max} − 1, . . . ,C_{max} − S$, where is a user-defined threshold (typically 5 bases).
- For paired-end sequencing, both ends are known and the distances properly reflect the distance to those ends. ATLAS thus uses a single PMD pattern for all fragment lengths.