5.11 thetaRatio

Estimate the ratio in heterozygosity (theta) between genomic regions

thetaRatio uses an MCMC (Metropolis-Hastings Algorithm) to estimate the ratio of the expected heterozygosity from sites in two different regions.

$Φ = l o g (\frac{θ_{1}}{θ_{2}})$

5.11.1 Input

Required inputs :

`--bam Input_bam_file.bam`	Input bam file
`--region1 region_1 --region2 region_2`	Two files, region_1 and region_2 that define the regions of the two thetas.

Optional inputs :

--pmd "library_type:model_for_5'_read:model_for_3'_read" or --RGInfo Library type followed by the model to be used for the 5-prime read-end and the 3-prime read-end which can be either “Exponential” or “Empiric”. All three arguments must be provided as a string, divided by colons (:). e.g. : –pmdModels “doubleStrand:Exponential:Exponential”. Used to specify Post-mortem damage parameters. Can also be provided as a .txt file (see PMDfor generating such a file). Default = Will assume there is no PMD in the data.

--recal recal_model or --RGInfo A common recal model for all read groups.Used to specify Quality score recalibration parameters. Can also be provided as a .txt file (see recal for further information). Default = ‘-’, no recalibration is performed.

Specific Parameters :

See Filter parameters to apply specific filters for bases, reads and parsing window setting.

Engine parameters that are common to all tasks can be found here.

5.11.2 Output

*_RGInfo.json	File containing read group info.
*_filterSummary.txt	File containing filter summary of general filter counts.
*_thetaRatioMCMC.txt.gz	File containing MCMC chain.

5.11.3 Usage Example

#! /bin/bash

. $(dirname $0)/find_atlas

echo "chr1 1 10000" > region1.bed
echo "chr2 1 10000" > region2.bed

. $(dirname $0)/simulate --fixedSeed 250

out="thetaRatio"
$atlas --task thetaRatio --bam simulate.bam \
       --region1 region1.bed --region2 region2.bed \
       --fixedSeed 255 --out $out --logFile $out.out 2> $out.eout

5.11.4 Additional Information

Method

The likelihood function is:

$P (d_{1}, d_{2} | θ_{1}, θ_{2}, π_{1}, π_{2}) = \frac{\prod_{i = 1}^{I} \sum_{g} \prod_{j = 1}^{n_{i}} P (d_{1_{i j}} | g_{i} = g) P (g_{i} = g | θ_{1}, π_{1})}{\prod_{i = 1}^{I} \sum_{g} \prod_{j = 1}^{n_{i}} P (d_{2_{i j}} | g_{i} = g) P (g_{i} = g | θ_{2}, π_{2})}$ An MCMC is used to infer the posterior distribution for all parameters. Updates are performed for all: $(π)$ and $(\log (θ_{1}))$ and $(\log (θ_{2}))$ .

$U [0, 1]$ prior is used for $(\log (θ_{1}))$ and $(\log (θ_{1}))$ and $N (0, 1)$ . is used for $(π)$ .