6.7 majorMinor

Estimating major and minor alleles

majorMinor infers the major and minor alleles from a population sample and writes the genotype likelihoods to a VCF file. This task requires the sample-specific genotype likelihoods in GLF format, which can be created with the ATLAS task GLF. The resulting VCF file can be used as input to ANGSD.

The major and minor alleles can be estimated using the method described by Skotte et al. (2012) or using the MLE method. The MLE method estimates the genotype frequencies simultaneously with the two alleles present at a site. The variant quality is the likelihood ratio of a model with variants and a model without variants.
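As a rough sketch of the last sentence, and not necessarily the exact statistic ATLAS reports, a likelihood ratio between the two models could be written as follows. The symbols are notation introduced here for illustration only: $D$ stands for the genotype-likelihood data at a site and $\theta$ for the parameters of the model with variants.

```latex
\Lambda = \frac{\max_{\theta} L(\theta \mid D)}{L(\text{no variant} \mid D)}
```

Larger values of $\Lambda$ favor the model with a variant at the site.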

6.7.1 Input

Required inputs :

--glf glf_file1.glf.gz,glf_file2.glf. or --glf glf_file.txt Input glf files for every sample of the population. Can be provided on the command line or with an input text file (one file name per line).

Example text file:

glf_file_1.glf.gz
glf_file_2.glf.gz
glf_file_3.glf.gz
glf_file_4.glf.gz
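Such a list file could be generated rather than typed by hand. The following is a minimal sketch, assuming the GLF files sit in one directory and end in .glf.gz. The temporary directory and the name glf_files.txt are placeholders, and the empty stand-in files exist only so the snippet runs on its own.

```shell
#!/bin/bash
# Sketch: write one GLF file name per line into a list file for --glf.
# The directory and file names are placeholders, not real ATLAS output.
workdir=$(mktemp -d)
touch "$workdir/glf_file_1.glf.gz" \
      "$workdir/glf_file_2.glf.gz" \
      "$workdir/glf_file_3.glf.gz"   # empty stand-ins for real GLF files

# List the GLF files in a stable order, one name per line
find "$workdir" -name '*.glf.gz' | sort > "$workdir/glf_files.txt"

cat "$workdir/glf_files.txt"
```

The resulting file would then be passed to the task in place of the comma-separated list, for example as --glf glf_files.txt.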

Optional inputs :

--sampleNames sample_name1,sample_name2 Provide alternative sample names. The number of provided sample names must match the number of GLF files. Default = sample names are deduced from the GLF file names.

Specific Parameters :

--method method_name Estimates major/minor alleles using the indicated method. Two options are available: Skotte and MLE. Default = MLE.
--maxF numeric_value Maximum value of the likelihood function with respect to the parameter theta. Default = 1e-07.
--phredLik Transforms the likelihoods onto the Phred quality score scale. This saves space but leads to a loss of precision and thus power. Default = raw likelihoods without any adjustment/transformation.
--minSamplesWithData integer_value Keeps only sites for which at least the indicated number of samples have data. Default = 1.
--minMAF numeric_value Keeps only sites for which the minor allele frequency is at least the indicated value. Default = all sites are kept regardless of minor allele frequency.
--limitSites integer_value Writes likelihoods only up to the indicated input position. Default = disabled.
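To see why the Phred scale trades precision for space, consider the standard Phred transformation Q = -10 * log10(L) rounded to an integer. The sketch below assumes this standard definition. Whether ATLAS rounds or normalizes in exactly this way is not stated here and is an assumption, and to_phred is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch of the standard Phred transformation Q = -10 * log10(L), rounded to an integer.
# Assumption: ATLAS may round or normalize differently; this only illustrates the idea.
to_phred() {
    awk -v lik="$1" 'BEGIN {
        q = -10 * log(lik) / log(10)   # awk log() is the natural logarithm
        printf "%d\n", int(q + 0.5)    # round to the nearest integer
    }'
}

to_phred 0.001     # prints 30
to_phred 0.00105   # also prints 30: the two likelihoods become indistinguishable
```

Two slightly different raw likelihoods collapse onto the same integer score, which is the kind of precision loss the option description refers to.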

Engine parameters that are common to all tasks can be found here.

6.7.2 Output

*_majorMinor.vcf.gz One multi-sample VCF-file, containing the likelihoods of the genotypes consisting of the major and minor allele.
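A quick sanity check on the output is to count how many sites were written. The sketch below uses only standard tools; count_sites is a hypothetical helper, and the toy file with made-up records stands in for a real *_majorMinor.vcf.gz so that the snippet runs on its own.

```shell
#!/bin/bash
# Sketch: count the sites in a gzipped VCF (all lines that do not start with '#').
count_sites() {
    gunzip -c "$1" | grep -vc '^#'
}

# Toy stand-in for an output file; the records are made up for illustration.
workdir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'chr1\t10\t.\tA\tC\t50\t.\t.\n'
    printf 'chr1\t25\t.\tG\tT\t42\t.\t.\n'
} | gzip -c > "$workdir/toy_majorMinor.vcf.gz"

count_sites "$workdir/toy_majorMinor.vcf.gz"   # prints 2
```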

6.7.3 Usage Example

#! /bin/bash

. $(dirname $0)/find_atlas

N=97
if [[ "$#" -eq 1 ]]; then
    N=$1
fi

echo "doing $N samples"

. $(dirname $0)/simulate --type HW --F 0.1 --chrLength 1432 \
    --sampleSize $N --fracPoly 0.1 --alpha 2.0 --beta 2.0 --fixedSeed 133

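# Compute genotype likelihoods (GLF) for every simulated BAM file
for f in *.bam; do
    out=GLF_${f%.bam}
    $atlas --task GLF --bam "$f" --fasta simulate.fasta \
           --fixedSeed 131 --out "$out" --logFile "$out.out" 2> "$out.eout"
done

# Collect all per-sample GLF files into one comma-separated list
allSamples=$(find . -path '*_ind*.glf.gz' | paste -s -d ',' -)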

out="majorMinor"
$atlas --task majorMinor --glf $allSamples --method Skotte \
       --minMAF 0.05 --maxThreads 1 --bgz --minSamplesWithData 83 \
       --fixedSeed 132 --out $out --logFile $out.out 2> $out.eout

out="Skotte_fasta"
$atlas --task majorMinor --method Skotte --minSamplesWithData 83 \
       --glf $allSamples --fasta simulate.fasta \
       --minMAF 0.05 --maxThreads 1 --bgz \
       --fixedSeed 134 --out $out --logFile $out.out 2> $out.eout

out="MLE_fasta"
$atlas --task majorMinor --method MLE --minSamplesWithData 83 \
       --glf $allSamples --fasta simulate.fasta \
       --minMAF 0.05 --maxThreads 1 --bgz \
       --fixedSeed 135 --out $out --logFile $out.out 2> $out.eout

# Build an alleles file (chr, pos, ref, alt) from every third line of the Skotte_fasta VCF
echo "chr pos ref alt" > alleles.txt
gunzip -c Skotte_fasta.vcf.gz | grep -v "^##" | awk '{if (NR % 3 == 0) {print $1, $2, $4, $5}}' >> alleles.txt

out="alleles"
$atlas --task majorMinor --method Skotte --minSamplesWithData 83 \
       --glf $allSamples --alleles alleles.txt \
       --minMAF 0.05 --maxThreads 1 \
       --fixedSeed 136 --out $out --logFile $out.out 2> $out.eout
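The alleles.txt step in the script above keeps only a subset of the sites from the Skotte_fasta run. A toy illustration of that filter, using made-up records and a hypothetical file name toy.vcf.gz, might look like this:

```shell
#!/bin/bash
# Sketch: reproduce the alleles.txt filter from the script on a toy VCF.
workdir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\n'
    for pos in 1 2 3 4 5 6; do
        printf 'chr1\t%s\t.\tA\tC\n' "$pos"   # made-up records
    done
} | gzip -c > "$workdir/toy.vcf.gz"

echo "chr pos ref alt" > "$workdir/alleles.txt"
# After dropping the '##' meta lines, the '#CHROM' header is line 1,
# so NR % 3 == 0 selects the 2nd, 5th, 8th, ... data records.
gunzip -c "$workdir/toy.vcf.gz" | grep -v "^##" \
    | awk '{if (NR % 3 == 0) {print $1, $2, $4, $5}}' >> "$workdir/alleles.txt"

cat "$workdir/alleles.txt"
```

With six toy records, the file ends up with the header line plus the records at positions 2 and 5, each reduced to chromosome, position, reference allele and alternative allele.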