9.5 genfile

This format creates a genotype truth set that can be used as input (genfile) for the imputation tool STITCH. The file contains the genotypes of individuals with high coverage, extracted directly from the VCF file. It has a header with the sample names from the VCF file, and its rows correspond to the rows of the posfile. Genotypes are encoded as the number of copies of the alternative allele, with NA for missing genotypes. In addition, a BED file is created for each individual, listing the loci at which that individual's genotype was written to the genfile. Each BED file has three columns: chromosome, start position and end position (1-based).
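As a rough illustration of the layout described above, the following sketch reads a genfile into a list of sample names and a list of genotype rows. It assumes whitespace-separated columns, which the text does not specify; the function name `parse_genfile` and the example data are hypothetical and not part of the tool.

```python
from typing import List, Optional, Tuple


def parse_genfile(text: str) -> Tuple[List[str], List[List[Optional[str]]]]:
    """Parse genfile content: one header line, then one row per posfile row.

    Assumption: columns are separated by whitespace. "NA" becomes None;
    all other entries are kept as strings so the encoding is not altered.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    samples = lines[0].split()
    rows: List[List[Optional[str]]] = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(samples):
            raise ValueError("row does not match the number of samples")
        rows.append([None if f == "NA" else f for f in fields])
    return samples, rows


# Made-up example data: three samples, two loci.
example = "Ind0 Ind1 Ind2\n0 1 NA\nNA NA 2\n"
samples, rows = parse_genfile(example)
```

Keeping the entries as strings leaves the interpretation of the genotype encoding to the caller.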

This format accepts the following command line arguments:

| **Argument** | **Default** | **Description** |
|---|---|---|
| numSamples | 5 | number of samples to keep per locus |
| minDepth | 3 | minimal depth |
| minDistance | 100 | minimal distance to previous locus |

The algorithm for finding high-coverage sites starts at the first locus of the chromosome. It selects up to numSamples samples (possibly none) whose depth is higher than minDepth, writes the position of this locus to the BED files of these individuals, and writes their genotypes to the genfile. The genotypes of all other individuals are written as NA to the genfile.

For each subsequent locus, the algorithm checks whether the distance to the previous stored locus is at least minDistance base pairs. If so, it stores the genotypes as described above in the genfile and the BED files. If not, the genotypes of all individuals are written as NA to the genfile. Here is an example for three individuals and four loci at consecutive positions 1 to 4, with numSamples = 2 and minDistance = 3:

Ind0 Ind1 Ind2
CC AC NA
NA NA NA
NA NA NA
TT NA TT

The BED file for the first individual (Ind0) would look like this:

chr1 1 2
chr1 4 5
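
The selection procedure described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation: the function `select_truth_sites` and its inputs are hypothetical, the distance is measured to the last locus at which at least one genotype was stored, a distance equal to minDistance is treated as sufficient (as the example above suggests), and when more than numSamples samples qualify, the first ones in column order are kept.

```python
from typing import Dict, List, Sequence, Tuple

BedEntry = Tuple[str, int, int]


def select_truth_sites(
    chrom: str,
    samples: Sequence[str],
    positions: Sequence[int],
    depths: Sequence[Sequence[int]],
    genotypes: Sequence[Sequence[str]],
    num_samples: int = 5,
    min_depth: int = 3,
    min_distance: int = 100,
) -> Tuple[List[List[str]], Dict[str, List[BedEntry]]]:
    """Return the genfile rows and the per-sample BED entries."""
    gen_rows: List[List[str]] = []
    bed: Dict[str, List[BedEntry]] = {s: [] for s in samples}
    last_stored = None  # position of the last locus with stored genotypes

    for pos, dep, gt in zip(positions, depths, genotypes):
        row = ["NA"] * len(samples)
        # Assumption: a distance equal to min_distance is accepted.
        if last_stored is None or pos - last_stored >= min_distance:
            # Assumption: keep the first num_samples qualifying samples.
            chosen = [
                i for i, d in enumerate(dep)
                if d > min_depth and gt[i] != "NA"
            ][:num_samples]
            for i in chosen:
                row[i] = gt[i]
                bed[samples[i]].append((chrom, pos, pos + 1))
            # Assumption: a locus counts as stored only if a sample was kept.
            if chosen:
                last_stored = pos
        gen_rows.append(row)
    return gen_rows, bed


# Hypothetical depths and genotypes chosen to reproduce the example above.
gen_rows, bed = select_truth_sites(
    chrom="chr1",
    samples=["Ind0", "Ind1", "Ind2"],
    positions=[1, 2, 3, 4],
    depths=[[10, 8, 9], [10, 8, 9], [10, 8, 9], [7, 2, 6]],
    genotypes=[["CC", "AC", "CC"], ["GG", "GG", "GG"],
               ["AA", "AA", "AA"], ["TT", "AT", "TT"]],
    num_samples=2,
    min_depth=3,
    min_distance=3,
)
```

With these inputs, `gen_rows` matches the example genfile rows and `bed["Ind0"]` holds the two intervals shown for Ind0.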