5.11 thetaRatio
Estimate the ratio in heterozygosity (theta) between genomic regions
thetaRatio
uses an MCMC (Metropolis-Hastings Algorithm) to estimate the ratio of the expected heterozygosity from sites in two different regions.
\[\Phi = log(\frac{\theta_1}{\theta_2})\]
5.11.1 Input
Required inputs :
--bam Input_bam_file.bam |
Input bam file |
--region1 region_1 --region2 region_2 |
Two files, region_1 and region_2 that define the regions of the two thetas. |
Optional inputs :
--pmd "library_type:model_for_5'_read:model_for_3'_read" or --RGInfo |
Library type followed by the model to be used for the 5-prime read-end and the 3-prime read-end which can be either “Exponential” or “Empiric”. All three arguments must be provided as a string, divided by colons (:). e.g. : –pmdModels “doubleStrand:Exponential:Exponential”. Used to specify Post-mortem damage parameters. Can also be provided as a .txt file (see PMDfor generating such a file). Default = Will assume there is no PMD in the data. |
--recal recal_model or --RGInfo |
A common recal model for all read groups.Used to specify Quality score recalibration parameters. Can also be provided as a .txt file (see recal for further information). Default = ‘-’, no recalibration is performed. |
Specific Parameters :
- See Filter parameters to apply specific filters for bases, reads and parsing window setting.
Engine parameters that are common to all tasks can be found here.
5.11.2 Output
*_RGInfo.json | File containing read group info. |
*_filterSummary.txt | File containing filter summary of general filter counts. |
*_thetaRatioMCMC.txt.gz | File containing MCMC chain. |
5.11.3 Usage Example
#! /bin/bash
. $(dirname $0)/find_atlas
echo "chr1 1 10000" > region1.bed
echo "chr2 1 10000" > region2.bed
. $(dirname $0)/simulate --fixedSeed 250
out="thetaRatio"
$atlas --task thetaRatio --bam simulate.bam \
--region1 region1.bed --region2 region2.bed \
--fixedSeed 255 --out $out --logFile $out.out 2> $out.eout
5.11.4 Additional Information
Method
The likelihood function is:
\[ {P}(\boldsymbol{d}_1,\boldsymbol{d}_2|\theta_1,\theta_2, \boldsymbol{\pi}_1, \boldsymbol{\pi}_2) = \dfrac{\prod\limits_{i=1}^I \sum\limits_g \prod\limits_{j=1}^{n_i} \mathbb{P}(d_{1_{ij}}|g_i=g)\mathbb{P}(g_i=g|\theta_1,\boldsymbol{\pi}_1)}{\prod\limits_{i=1}^I \sum\limits_{g} \prod\limits_{j=1}^{n_i} \mathbb{P}(d_{2_{ij}}|g_i=g)\mathbb{P}(g_i=g|\theta_2,\boldsymbol{\pi}_2)} \] An MCMC is used to infer the posterior distribution for all parameters. Updates are performed for all: \((\boldsymbol{\pi})\) and \((\log(\theta_1))\) and \((\log(\theta_2))\).
\(U[0,1]\) prior is used for \((\log(\theta_1))\) and \((\log(\theta_1))\) and \(N(0, 1)\). is used for \((\boldsymbol{\pi})\).