6.7 majorMinor
Estimating major and minor alleles
majorMinor infers the major and minor alleles from a population sample and outputs the genotype likelihoods in a VCF file. This task requires the sample-specific genotype likelihoods in GLF format, which can be created with the ATLAS task GLF. The resulting VCF file can be used as input to ANGSD.
The major and minor alleles can be estimated using the method described by Skotte et al. (2012) or using the MLE method. The MLE method estimates the genotype frequencies simultaneously with the two alleles present at a site. The variant quality is the likelihood ratio of a model with variants and a model without variants.
6.7.1 Input
Required inputs:
--glf glf_file1.glf.gz,glf_file2.glf.gz or --glf glf_file.txt | Input GLF files, one for every sample of the population. They can be provided on the command line as a comma-separated list or in a text file (one file name per line). |
Example text file:
glf_file_1.glf.gz
glf_file_2.glf.gz
glf_file_3.glf.gz
glf_file_4.glf.gz
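Such a list file can be generated rather than typed by hand. The following is a minimal sketch, assuming all GLF files sit in one directory and end in .glf.gz; the function name make_glf_list and both arguments are placeholders chosen for this example, not part of ATLAS.

```shell
# Write the names of all *.glf.gz files in a directory to a list file,
# one name per line, sorted so the sample order is reproducible.
# make_glf_list, its arguments and the file names are illustrative only.
make_glf_list() {
    local dir="$1" list="$2"
    find "$dir" -maxdepth 1 -name '*.glf.gz' | sort > "$list"
}
```

Calling make_glf_list . glf_file.txt produces a file that can then be passed as --glf glf_file.txt.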
Optional inputs:
--sampleNames sample_name1,sample_name2 | Provides alternative sample names. The number of provided sample names must match the number of GLF files. Default = sample names are deduced from the GLF file names. |
Specific parameters:
--method method_name | Estimates the major and minor alleles using the indicated method. Two options are available: Skotte and MLE. Default = MLE. |
--maxF numeric_value | Maximum value of the likelihood function with respect to the parameter theta. Default = 1e-07. |
--phredLik | Transforms the likelihoods onto the Phred quality score scale. This saves space but leads to a loss of precision and thus power. Default = raw likelihoods without any adjustment or transformation. |
--minSamplesWithData integer_value | Keeps only sites for which at least the indicated number of samples have data. Default = 1. |
--minMAF numeric_value | Keeps only sites for which the minor allele frequency is at least the indicated value. Default = all sites are kept regardless of minor allele frequency. |
--limitSites integer_value | Writes likelihoods only up to the indicated input position. Default = disabled. |
Engine parameters that are common to all tasks can be found here.
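The loss of precision mentioned for --phredLik can be illustrated with a small sketch. It assumes the conventional Phred scaling, Q = -10 log10(L), rounded to the nearest integer; the exact encoding ATLAS writes may differ, and the helper name phred is made up for this example.

```shell
# Convert a likelihood to an integer Phred-scaled value.
# Assumption: conventional Phred scaling with rounding to the nearest
# integer; this is an illustration, not ATLAS's own code.
phred() {
    awk -v l="$1" 'BEGIN { printf "%d\n", int(-10 * log(l) / log(10) + 0.5) }'
}

phred 0.001    # a likelihood of 0.001
phred 0.0011   # a slightly larger likelihood
```

Under this assumption both calls print 30: two different likelihoods collapse onto the same integer, which is where the precision is lost.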
6.7.2 Output
*_majorMinor.vcf.gz | One multi-sample VCF-file, containing the likelihoods of the genotypes consisting of the major and minor allele. |
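A quick sanity check of the output is to count the sites that were written. The sketch below assumes only that the output is a gzip-compressed VCF in which header lines start with "#"; the function name count_vcf_sites and the file name in the usage note are placeholders.

```shell
# Count the site records (non-header lines) in a gzip-compressed VCF.
# count_vcf_sites is an illustrative helper, not an ATLAS command.
count_vcf_sites() {
    gunzip -c "$1" | grep -vc '^#'
}
```

For example, count_vcf_sites out_majorMinor.vcf.gz prints the number of sites that passed the filters.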
6.7.3 Usage Example
#! /bin/bash

# Helper script from the same directory; expected to define $atlas.
. $(dirname $0)/find_atlas

# Number of samples: 97 by default, or the first command-line argument.
N=97
if [[ "$#" -eq 1 ]]; then
    N=$1
fi
echo "doing $N samples"

# Simulate the samples; the steps below expect *.bam files and simulate.fasta.
. $(dirname $0)/simulate --type HW --F 0.1 --chrLength 1432 \
    --sampleSize $N --fracPoly 0.1 --alpha 2.0 --beta 2.0 --fixedSeed 133

# Create one GLF file per BAM file.
for f in *.bam; do
    out=GLF_${f%.bam}
    $atlas --task GLF --bam $f --fasta simulate.fasta \
        --fixedSeed 131 --out $out --logFile $out.out 2> $out.eout
done

# Collect all GLF files into a comma-separated list.
allSamples=`find . -path '*_ind*.glf.gz' | paste -s -d ',' -`

# Skotte method without a reference FASTA.
out="majorMinor"
$atlas --task majorMinor --glf $allSamples --method Skotte \
    --minMAF 0.05 --maxThreads 1 --bgz --minSamplesWithData 83 \
    --fixedSeed 132 --out $out --logFile $out.out 2> $out.eout

# Skotte method with a reference FASTA.
out="Skotte_fasta"
$atlas --task majorMinor --method Skotte --minSamplesWithData 83 \
    --glf $allSamples --fasta simulate.fasta \
    --minMAF 0.05 --maxThreads 1 --bgz \
    --fixedSeed 134 --out $out --logFile $out.out 2> $out.eout

# MLE method with a reference FASTA.
out="MLE_fasta"
$atlas --task majorMinor --method MLE --minSamplesWithData 83 \
    --glf $allSamples --fasta simulate.fasta \
    --minMAF 0.05 --maxThreads 1 --bgz \
    --fixedSeed 135 --out $out --logFile $out.out 2> $out.eout

# Build a list of known alleles from every third line of the Skotte_fasta VCF
# (columns: chromosome, position, REF, ALT).
echo "chr pos ref alt" > alleles.txt
gunzip -c Skotte_fasta.vcf.gz | grep -v "^##" | awk '{if (NR % 3 == 0) {print $1, $2, $4, $5}}' >> alleles.txt

# Skotte method restricted to the alleles listed in alleles.txt.
out="alleles"
$atlas --task majorMinor --method Skotte --minSamplesWithData 83 \
    --glf $allSamples --alleles alleles.txt \
    --minMAF 0.05 --maxThreads 1 \
    --fixedSeed 136 --out $out --logFile $out.out 2> $out.eout
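The script above builds alleles.txt from every third line of an existing VCF. A variant that keeps every site can be sketched as follows; it assumes the same four-column layout (chr, pos, ref, alt) shown in the script, and the function name vcf_to_alleles is a placeholder for this example.

```shell
# Read an uncompressed VCF on standard input and print an alleles table
# with the header "chr pos ref alt", one row per site.
# vcf_to_alleles is an illustrative helper, not part of ATLAS.
vcf_to_alleles() {
    echo "chr pos ref alt"
    grep -v '^#' | awk '{ print $1, $2, $4, $5 }'
}
```

It can be used as gunzip -c Skotte_fasta.vcf.gz | vcf_to_alleles > alleles.txt, and the resulting file passed with --alleles alleles.txt as in the last call of the script.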