6.5 geneticDist
Estimating the genetic distance between individuals
geneticDist estimates the pairwise distances per genomic window between at least two genomes in glf format. This format can be created with the atlas task glf.
In the first step, the frequencies of nine genotype configurations (aa/aa, aa/ab, ab/aa, aa/bb, ab/ab, ab/ac, aa/bc, ab/cc and ab/cd) are estimated. For example, the configuration aa/aa corresponds to a locus where both individuals are homozygous for the same allele, and the configuration ab/cc corresponds to a locus where one individual is heterozygous and the other is homozygous for a different allele.
In the second step, these genotype configuration frequencies are multiplied by the user-specified distance weights and summed up to produce the genetic distance. Depending on the genetic distance you want to use, you will give different weights to the genotype configurations.
Predefined distance types and weights :
- squaredDiff : This distance measure corresponds to the number of alleles that differ between the genotypes. The distance weights in this case are: 0,1,1,4,0,1,4,4,4.
- euclidian : This distance measure corresponds to the square root of squaredDiff. If this measure is used in a metric MDS, the MDS will be the same as a PCA.
- probMismatch : This distance measure corresponds to the probability that a random allele chosen at a random position differs between two individuals. The distance weights in this case are: 0,0.5,0.5,1,0.5,0.75,1,1,1.
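The second step amounts to a weighted sum over the nine genotype configuration frequencies. The following is a minimal shell sketch of that arithmetic only, not of how atlas implements it. The helper name `weighted_distance` and the frequency values are hypothetical and chosen for illustration; the two weight vectors are the predefined ones listed above.

```shell
#!/bin/sh
# weighted_distance FREQS WEIGHTS
# Hypothetical helper (not part of atlas): multiplies nine genotype
# configuration frequencies by nine weights and sums the products.
# Order of both vectors: aa/aa,aa/ab,ab/aa,aa/bb,ab/ab,ab/ac,aa/bc,ab/cc,ab/cd
weighted_distance() {
    awk -v f="$1" -v w="$2" 'BEGIN {
        nf = split(f, freq, ",")
        nw = split(w, weight, ",")
        if (nf != 9 || nw != 9) {
            print "expected 9 frequencies and 9 weights" > "/dev/stderr"
            exit 1
        }
        d = 0
        for (i = 1; i <= 9; i++) d += freq[i] * weight[i]
        printf "%.4f\n", d
    }'
}

# Predefined weight vectors as given in this section.
squaredDiff="0,1,1,4,0,1,4,4,4"
probMismatch="0,0.5,0.5,1,0.5,0.75,1,1,1"

# Made-up frequencies for illustration only (they sum to 1).
freqs="0.80,0.05,0.05,0.02,0.04,0.01,0.01,0.01,0.01"

weighted_distance "$freqs" "$squaredDiff"    # prints 0.3100
weighted_distance "$freqs" "$probMismatch"   # prints 0.1275
```

The same frequencies give different distances under different weights, which is why the choice of distance type matters when comparing results.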
6.5.1 Input
Required inputs :
--glf glf_file1.glf.gz,glf_file2.glf. or --glf glf_file.txt | At least two glf files, separated by commas. |
Optional inputs :
none
Specific Parameters :
--distWeights 9xnumeric_values | A comma-separated vector of 9 weights, assigned to the genotype configurations in the following order: aa/aa,aa/ab,ab/aa,aa/bb,ab/ab,ab/ac,aa/bc,ab/cc,ab/cd. Each weight represents how distant from each other you consider the two genotypes of that configuration to be. Default = the predefined weights of the selected distance type. |
--distType type_of_distance | Selects one of the predefined distance types listed above, and with it the corresponding distance weights. Default = squaredDiff. |
--iterations integer_value | Sets the maximum number of EM iterations. Default = 100. |
Engine parameters that are common to all tasks can be found here.
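Because --distWeights expects exactly nine comma-separated numeric values in a fixed order, it can help to check a custom vector before passing it to atlas. The following is a minimal shell sketch under stated assumptions: the helper `check_dist_weights` is hypothetical (not part of atlas), the example weights are made up, and restricting the values to non-negative decimals is an assumption of this sketch rather than a documented atlas requirement.

```shell
#!/bin/sh
# check_dist_weights WEIGHTS
# Hypothetical helper (not part of atlas): succeeds only if WEIGHTS is a
# comma-separated list of exactly nine non-negative decimal numbers.
# The non-negativity restriction is an assumption of this sketch.
check_dist_weights() {
    printf '%s\n' "$1" | awk -F',' '
        NF != 9 { exit 1 }
        {
            for (i = 1; i <= NF; i++)
                if ($i !~ /^[0-9]+(\.[0-9]+)?$/) exit 1
        }'
}

# Made-up custom weights, in the order
# aa/aa,aa/ab,ab/aa,aa/bb,ab/ab,ab/ac,aa/bc,ab/cc,ab/cd
weights="0,1,1,2,0,1,2,2,2"

if check_dist_weights "$weights"; then
    # Print the command rather than running it, so the sketch works without atlas.
    echo "atlas --task geneticDist --glf GLF1.glf.gz,GLF2.glf.gz --distWeights $weights --out geneticDist"
else
    echo "invalid --distWeights vector: $weights" >&2
fi
```

Printing the command first makes it easy to confirm that the weights are in the intended order before running the task.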
6.5.2 Output
*_distanceEstimates.txt.gz | Gzipped text file with distance estimates for each pair of glf files. The columns correspond to the position of the window, the four base frequencies, the nine genotype configuration frequencies and the genetic distance. |
*_distanceMatrix.txt | Text file containing the genetic distance for each pair of glf files in matrix form. |
6.5.3 Usage Example
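# Source the helper scripts that sit next to this script; find_atlas is
# expected to set $atlas, which is used below.
. "$(dirname "$0")/find_atlas"
. "$(dirname "$0")/simulate" --type HW --sampleSize 2 --fixedSeed 101

# Create a glf file for each of the two individuals.
out="GLF1"
$atlas --task GLF --bam simulate_ind1.bam --printAll \
    --fixedSeed 102 --out "$out" --logFile "$out.out" 2> "$out.eout"
out="GLF2"
$atlas --task GLF --bam simulate_ind2.bam --printAll \
    --fixedSeed 103 --out "$out" --logFile "$out.out" 2> "$out.eout"

# Estimate the genetic distance between the two glf files.
out="geneticDist"
$atlas --task geneticDist --glf GLF1.glf.gz,GLF2.glf.gz \
    --fixedSeed 104 --out "$out" --logFile "$out.out" 2> "$out.eout"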