6.4 calculateF2

Calculate F2 between different samples, and within and between populations

calculateF2 estimates F2 for each pairwise comparison in a multi-sample VCF as the number of sites that differ divided by the total number of compared sites. It can calculate F2 between individual samples, as well as within and between populations. A VCF file suitable as input for calculateF2 can be created with the ATLAS tasks majorMinor or call.
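The counting behind one pairwise comparison can be sketched as follows. This is only an illustration, not the ATLAS implementation: it assumes that a site is compared when both samples have a genotype call, and that any genotype mismatch counts as one different site. The function name and the genotype strings are hypothetical.

```python
from typing import Optional, Sequence, Tuple

def pairwise_counts(a: Sequence[Optional[str]],
                    b: Sequence[Optional[str]]) -> Tuple[int, int]:
    """Return (different, compared) for the per-site genotype calls of two samples."""
    different = 0
    compared = 0
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue  # assumption: sites with a missing call are not compared
        compared += 1
        if ga != gb:
            different += 1  # assumption: any mismatch counts as one different site
    return different, compared

# Hypothetical genotype calls at five sites; None marks a missing call.
sample_a = ["0/0", "0/1", None, "1/1", "0/0"]
sample_b = ["0/0", "1/1", "0/1", "1/1", None]

diff, comp = pairwise_counts(sample_a, sample_b)
# F2 for the pair is the ratio of different to compared sites.
f2 = diff / comp if comp else float("nan")
print(diff, comp, f2)
```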

6.4.1 Input

Required inputs:

--vcf Input_VCF.vcf.gz Input VCF file (see majorMinor or call for generating such a file).

Optional inputs:

--samples samples_Populations.txt Text file listing the samples to be used and their population affiliation. F2 is estimated separately within and between the listed populations. If no populations are provided, all samples are considered to come from the same population.

Example samples_Populations.txt file:

sample1 1
sample2 1
sample5 2
sample8 2
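To make the file layout concrete, here is a minimal sketch of how such a file could be read into population groups. It assumes whitespace-separated columns (sample name, then population label) and, as an assumption of this sketch only, places samples without a label into a single default population. The function name read_sample_populations is hypothetical and not part of ATLAS.

```python
import io
from collections import defaultdict

def read_sample_populations(handle, default_population="1"):
    """Map each population label to the list of its samples, in file order."""
    populations = defaultdict(list)
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        sample = fields[0]
        # Assumption: a missing second column means "same population for all".
        population = fields[1] if len(fields) > 1 else default_population
        populations[population].append(sample)
    return dict(populations)

# The example file from above, held in memory instead of on disk.
example = io.StringIO("sample1 1\nsample2 1\nsample5 2\nsample8 2\n")
groups = read_sample_populations(example)
print(groups)
```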

Specific parameters:

--limitLines integer_value Limits the number of lines read from the VCF file. Default = parses the entire VCF.
--regions *.bed Limits the analysis to the regions defined in the BED file. Default = parses the entire VCF.
--filterDepth integer_value,integer_value Keeps only samples whose depth lies within the indicated range (inclusive). Default = keeps all sites regardless of depth.
--maxMissing numeric_value Filters out sites at which more than the indicated fraction of data is missing. numeric_value must be between 0 and 1 (inclusive). Default = keeps sites regardless of missingness.
--minMAF numeric_value Keeps only sites whose minor allele frequency is at least the indicated value. Default = keeps all sites regardless of minor allele frequency.
--minVarQual numeric_value Keeps only sites whose variant quality is at least the indicated value. Default = keeps sites regardless of their variant quality.
--chr or --limitChr Keeps only the specified chromosomes. Default = keeps all chromosomes.
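The boundary behaviour of the --maxMissing and --minMAF filters can be illustrated with a small sketch. It is not how ATLAS evaluates these filters internally. It assumes diploid genotypes encoded as alternative-allele counts (0, 1 or 2, with None for missing) and an allele frequency estimated from the called genotypes only. The function name passes_site_filters is hypothetical.

```python
from typing import Optional, Sequence

def passes_site_filters(alt_counts: Sequence[Optional[int]],
                        max_missing: float = 1.0,
                        min_maf: float = 0.0) -> bool:
    """Return True if a site survives the missingness and MAF thresholds."""
    n = len(alt_counts)
    called = [g for g in alt_counts if g is not None]
    missing_fraction = (n - len(called)) / n if n else 1.0
    # A site is dropped only if MORE than max_missing of its data is missing.
    if missing_fraction > max_missing:
        return False
    if not called:
        return min_maf <= 0.0
    # Assumption: diploid samples, frequency taken over called genotypes only.
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1.0 - alt_freq)
    # A site is kept if its MAF is AT LEAST min_maf.
    return maf >= min_maf

site = [0, 1, None, 2, 0, 0, 0, 0, 0, 0]  # 10 samples, one missing call
print(passes_site_filters(site, max_missing=0.1, min_maf=0.01))
```

With the thresholds used in the usage example (0.1 and 0.01), this hypothetical site has exactly 10% missing data and is therefore still kept.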

Engine parameters that are common to all tasks can be found here.

6.4.2 Output

*_counts.txt An n*n matrix containing the number of different sites in the upper triangle and the total number of compared sites in the lower triangle for all possible pairs of samples.
*_sampleF2.txt An n*n matrix containing the pairwise sample F2 (#different sites / #compared sites) for all possible pairs of samples.
*_popF2.txt A p*p matrix containing the average F2 within and between populations for all possible pairs of populations.
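The relationship between the three outputs can be sketched as follows. The sketch works on an in-memory matrix with made-up numbers rather than on the actual files, whose exact layout (headers, delimiters) is not described here. It also assumes that the diagonal is ignored and that population averages are taken over distinct sample pairs. Both function names are hypothetical.

```python
import math
from itertools import combinations

def sample_f2_matrix(counts):
    """Turn a counts matrix (upper: different, lower: compared) into pairwise F2."""
    n = len(counts)
    f2 = [[math.nan] * n for _ in range(n)]  # diagonal stays NaN (assumption)
    for i, j in combinations(range(n), 2):   # all pairs with i < j
        different = counts[i][j]              # upper triangle
        compared = counts[j][i]               # lower triangle
        if compared > 0:
            f2[i][j] = f2[j][i] = different / compared
    return f2

def population_f2(f2, populations):
    """Average pairwise F2 per population pair; populations[i] labels sample i."""
    sums = {}
    for i, j in combinations(range(len(f2)), 2):
        if math.isnan(f2[i][j]):
            continue
        key = tuple(sorted((populations[i], populations[j])))
        total, n = sums.get(key, (0.0, 0))
        sums[key] = (total + f2[i][j], n + 1)
    return {key: total / n for key, (total, n) in sums.items()}

# Made-up counts for four samples: 10 compared sites for every pair.
counts = [
    [0,  2,  4,  5],
    [10, 0,  3,  6],
    [10, 10, 0,  1],
    [10, 10, 10, 0],
]
f2 = sample_f2_matrix(counts)
pop = population_f2(f2, ["1", "1", "2", "2"])
print(pop)
```

Here the key ("1", "1") holds the within-population average for population 1, and ("1", "2") the average between the two populations.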

6.4.3 Usage Example

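#! /bin/bash

# Source the helper scripts located next to this script; they are expected
# to set $atlas and to produce simulate.vcf.gz.
. "$(dirname "$0")/find_atlas"
. "$(dirname "$0")/simulate_vcf" --sampleSize 11 --fixedSeed 35

out="calculateF"
# The depth interval is quoted so that the shell cannot expand it as a glob pattern.
$atlas --task calculateF2 --vcf simulate.vcf.gz \
       --filterDepth "[2,]" --maxMissing 0.1 --minMAF 0.01 \
       --fixedSeed 36 --out "$out" --logFile "$out.out" 2> "$out.eout"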