6.5 geneticDist

Estimating the genetic distance between individuals

geneticDist estimates the pairwise distances per genomic window between at least two genomes in glf format. This format can be created with the atlas task glf.

In the first step, the frequencies of nine genotype configurations aa/aa, aa/ab, ab/aa, aa/bb, ab/ab, ab/ac, aa/bc, ab/cc and ab/cd are estimated. For example, the configuration aa/aa corresponds to a locus where both individuals are homozygous for the same allele, and the configuration ab/cc corresponds to a locus where one individual is heterozygous and the other is homozygous for a third allele.

In the second step, these genotype configuration frequencies are multiplied by the user-specified distance weights and summed up to produce the genetic distance. Depending on the genetic distance you want to use, you will give different weights to the genotype configurations.
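The second step is a plain weighted sum, which can be sketched in a few lines of shell and awk. The frequencies below are hypothetical values chosen for illustration (they are not ATLAS output), and the weights are the probMismatch weights from the list that follows; this is a sketch of the arithmetic, not the ATLAS implementation.

```shell
# Hypothetical genotype configuration frequencies (sum to 1), in the order
# aa/aa,aa/ab,ab/aa,aa/bb,ab/ab,ab/ac,aa/bc,ab/cc,ab/cd.
freqs="0.90,0.03,0.03,0.01,0.02,0.005,0.002,0.002,0.001"

# Distance weights in the same order (here: the probMismatch weights).
weights="0,0.5,0.5,1,0.5,0.75,1,1,1"

# Multiply each frequency by its weight and sum the products.
dist=$(awk -v f="$freqs" -v w="$weights" 'BEGIN {
    n = split(f, fa, ",")
    split(w, wa, ",")
    for (i = 1; i <= n; i++) s += fa[i] * wa[i]
    printf "%.5f\n", s
}')

echo "$dist"
```

With these illustrative numbers the distance is 0.05875: only the configurations with a non-zero weight contribute to the sum.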

Predefined distance types and weights :

  • squaredDiff : This distance measure corresponds to the number of alleles that differ between the genotypes. The distance weights in this case are: 0,1,1,4,0,1,4,4,4.
  • euclidian : This distance measure corresponds to the square root of the squaredDiff. If this measure is used in a metric MDS, the MDS will be the same as a PCA.
  • probMismatch : This distance measure corresponds to the probability that a random allele chosen at a random position differs between two individuals. The distance weights in this case are: 0,0.5,0.5,1,0.5,0.75,1,1,1.
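The probMismatch weights can be reproduced by hand. The sketch below assumes one particular reading of the definition: at a given locus, one allele is drawn at random from each individual, and the weight is the fraction of the four possible allele pairs that differ. This reading is an assumption, but it yields exactly the weights listed above.

```shell
# The nine genotype configurations, in the order used by --distWeights.
configs="aa/aa aa/ab ab/aa aa/bb ab/ab ab/ac aa/bc ab/cc ab/cd"

# For each configuration, compare every allele of the first genotype with
# every allele of the second and report the fraction of mismatching pairs.
weights=$(echo "$configs" | awk '{
    for (k = 1; k <= NF; k++) {
        split($k, g, "/")
        diff = 0
        for (i = 1; i <= 2; i++)
            for (j = 1; j <= 2; j++)
                if (substr(g[1], i, 1) != substr(g[2], j, 1)) diff++
        printf "%s%g", (k > 1 ? "," : ""), diff / 4
    }
    printf "\n"
}')

echo "$weights"
```

For ab/ac, for instance, three of the four allele pairs (a-c, b-a, b-c) differ, which gives the weight 0.75.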

6.5.1 Input

Required inputs :

--glf glf_file1.glf.gz,glf_file2.glf. or --glf glf_file.txt At least two glf files, separated by commas.

Optional inputs :

  • none

Specific Parameters :

--distWeights 9xnumeric_values A comma-separated vector of 9 weights, assigned to the genotype configurations in the following order: aa/aa,aa/ab,ab/aa,aa/bb,ab/ab,ab/ac,aa/bc,ab/cc,ab/cd. Each weight represents how distant from each other you consider the two genotypes of that configuration to be. Default = the predefined weights of the chosen distance type.
--distType type_of_distance Selects one of the predefined distance types listed above, together with its distance weights. Default = squaredDiff.
--iterations integer_value Sets the maximum number of EM iterations. Default = 100.
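As an illustration of how these parameters combine, the sketch below passes a custom weight vector and a higher iteration limit. It reuses the file names from the usage example in 6.5.3; the output name, the value 200 and the dry-run fallback are choices made for this example only, and the weights simply repeat the probMismatch values.

```shell
# Custom weights in the order aa/aa,aa/ab,ab/aa,aa/bb,ab/ab,ab/ac,aa/bc,ab/cc,ab/cd.
# These values repeat the probMismatch weights listed above.
weights="0,0.5,0.5,1,0.5,0.75,1,1,1"

# Count the comma-separated entries; --distWeights expects exactly nine.
n_weights=$(echo "$weights" | awk -F, '{ print NF }')

if [ "$n_weights" -eq 9 ]; then
    out="geneticDist_custom"
    # Assumption: $atlas is set as in the usage example (6.5.3). If it is
    # unset, the command line is only printed instead of being executed.
    ${atlas:-echo atlas} --task geneticDist --glf GLF1.glf.gz,GLF2.glf.gz \
        --distWeights "$weights" --iterations 200 --out "$out"
else
    echo "expected 9 weights, got $n_weights" >&2
fi
```

Checking the length of the vector before the call catches a miscounted list early, since the position of each weight determines the configuration it applies to.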

Engine parameters that are common to all tasks can be found here.

6.5.2 Output

*_distanceEstimates.txt.gz .txt.gz file with the distance estimates for each pair of glf files. The columns correspond to the position of the window, the four base frequencies, the nine genotype configuration frequencies and the genetic distance.
*_distanceMatrix.txt .txt file containing the genetic distance for each pair of glf files in matrix form.

6.5.3 Usage Example

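# Source the helper scripts: find_atlas is expected to set $atlas, and
# simulate is expected to write simulate_ind1.bam and simulate_ind2.bam.
. "$(dirname "$0")/find_atlas"
. "$(dirname "$0")/simulate" --type HW --sampleSize 2 --fixedSeed 101

# Create a glf file for the first individual.
out="GLF1"
$atlas --task GLF --bam simulate_ind1.bam --printAll \
       --fixedSeed 102 --out "$out" --logFile "$out.out" 2> "$out.eout"

# Create a glf file for the second individual.
out="GLF2"
$atlas --task GLF --bam simulate_ind2.bam --printAll \
       --fixedSeed 103 --out "$out" --logFile "$out.out" 2> "$out.eout"

# Estimate the genetic distance between the two individuals.
out="geneticDist"
$atlas --task geneticDist --glf GLF1.glf.gz,GLF2.glf.gz \
       --fixedSeed 104 --out "$out" --logFile "$out.out" 2> "$out.eout"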