9.5 genfile
This format creates a genotype truth set that can be used as input (genfile) for the imputation tool STITCH. The genfile contains the genotypes of high-coverage individuals, extracted directly from the VCF file. Its header lists the sample names from the VCF file, and its rows correspond to the rows of the posfile. Genotypes are encoded as the number of copies of the alternative allele, with NA for missing genotypes. In addition, one BED file is created per individual. Each BED file has three columns: chromosome, start position and end position (1-based).
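As a rough illustration of the encoding described above, the following sketch converts a VCF `GT` string into an alternative-allele count. The function name `alt_allele_count` is hypothetical and not part of the tool. The sketch assumes that any non-reference allele counts as alternative and that a genotype with any missing allele is reported as NA; the tool itself may treat multi-allelic or partially missing genotypes differently.

```python
def alt_allele_count(gt: str) -> str:
    """Encode a VCF GT string (e.g. '0/1', '1|1', './.') for a genfile cell.

    Hypothetical helper: returns the number of non-reference alleles as a
    string, or 'NA' if any allele is missing.
    """
    # Treat phased ('|') and unphased ('/') genotypes the same way.
    alleles = gt.replace("|", "/").split("/")
    if any(allele == "." for allele in alleles):
        return "NA"
    # Assumption: every allele index other than 0 counts as alternative.
    return str(sum(1 for allele in alleles if allele != "0"))


cells = [alt_allele_count(gt) for gt in ["0/0", "0/1", "1|1", "./."]]
```

Here `cells` holds one encoded value per input genotype, in the same order.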
This format accepts the following command line arguments:
| Argument | options/format | default | description |
|---|---|---|---|
| numSamples | | 5 | number of samples to keep per locus |
| minDepth | | 3 | minimal depth |
| minDistance | | 100 | minimal distance to previous locus |
The algorithm to find high-coverage sites starts with the first locus of the chromosome. It selects up to `numSamples` samples (possibly none) whose depth is higher than `minDepth`, writes the position of this locus into the BED files of these individuals, and writes the genotypes of these individuals to the genfile. The genotypes of all other individuals are written as NA to the genfile.
For each subsequent locus, the algorithm checks whether the distance to the previous locus is more than `minDistance` base pairs. If so, it stores the genotypes in the genfile and the BED files as described above. If not, the genotypes of all individuals are written as NA to the genfile. Here is an example for three individuals and four loci, with `numSamples = 2` and `minDistance = 3`:
| Ind0 | Ind1 | Ind2 |
|---|---|---|
| CC | AC | NA |
| NA | NA | NA |
| NA | NA | NA |
| TT | NA | TT |
The BED file for the first individual (Ind0) would look like this:
chr1 | 1 | 2
chr1 | 4 | 5
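The selection procedure described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation. The function `build_truth_set` and its input layout are hypothetical. It assumes that the distance is measured to the last locus that was actually written, that the comparison is strict ("more than"), that samples are picked in header order when more than `num_samples` qualify, and that each BED interval spans from the locus position to the position plus one. The toy depths, positions and genotype counts are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

# One locus: chromosome, position, depth per sample, genotype per sample.
Locus = Tuple[str, int, Dict[str, int], Dict[str, str]]


def build_truth_set(
    loci: List[Locus],
    samples: List[str],
    num_samples: int = 5,
    min_depth: int = 3,
    min_distance: int = 100,
) -> Tuple[List[List[str]], Dict[str, List[Tuple[str, int, int]]]]:
    """Return genfile rows (one per locus) and BED intervals per sample."""
    gen_rows: List[List[str]] = []
    bed: Dict[str, List[Tuple[str, int, int]]] = {s: [] for s in samples}
    last_kept: Optional[Tuple[str, int]] = None  # last locus written (assumption)

    for chrom, pos, depths, genotypes in loci:
        # The first locus of a chromosome is always eligible.
        far_enough = (
            last_kept is None
            or last_kept[0] != chrom
            or pos - last_kept[1] > min_distance
        )
        chosen: List[str] = []
        if far_enough:
            # Assumption: take qualifying samples in header order.
            chosen = [s for s in samples if depths.get(s, 0) > min_depth][:num_samples]
        for s in chosen:
            bed[s].append((chrom, pos, pos + 1))
        # Every locus gets a genfile row; unselected samples are NA.
        gen_rows.append([genotypes[s] if s in chosen else "NA" for s in samples])
        if chosen:
            last_kept = (chrom, pos)
    return gen_rows, bed


samples = ["Ind0", "Ind1", "Ind2"]
loci = [
    ("chr1", 1, {"Ind0": 8, "Ind1": 6, "Ind2": 1}, {"Ind0": "0", "Ind1": "1", "Ind2": "0"}),
    ("chr1", 2, {"Ind0": 9, "Ind1": 9, "Ind2": 9}, {"Ind0": "0", "Ind1": "0", "Ind2": "0"}),
    ("chr1", 3, {"Ind0": 9, "Ind1": 9, "Ind2": 9}, {"Ind0": "1", "Ind1": "1", "Ind2": "1"}),
    ("chr1", 10, {"Ind0": 7, "Ind1": 2, "Ind2": 5}, {"Ind0": "2", "Ind1": "0", "Ind2": "2"}),
]
gen_rows, bed = build_truth_set(loci, samples, num_samples=2, min_depth=3, min_distance=3)
```

With these toy inputs, the second and third loci are too close to the first and yield all-NA rows, while the first and fourth loci each contribute genotypes for two samples and one interval to the corresponding BED lists.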