7.3 VCFCompare
Comparing genotype calls in two VCF files
VCFCompare compares the genotype calls in two Variant Call Format (VCF) files generated by call. A VCF file records where a sample differs from a reference genome. To generate one, the sample's reads are first aligned to the reference genome, producing BAM files. Positions at which the aligned reads differ from the reference are then called and written to the VCF file.
7.3.1 Input
Required inputs :
--vcf example1_trueGenotypes.vcf.gz,example2_trueGenotypes.vcf.gz |
Comma-separated list of the VCF files to be compared. Currently, no more than two files can be provided. |
--samples name_of_Sample_in_example1_trueGenotypes.vcf.gz,name_of_Sample_in_example2_trueGenotypes.vcf.gz |
Comma-separated list of the sample names, one per respective VCF file, whose calls should be compared. |
Optional inputs :
None
Specific Parameters :
--limitLines integer_value |
Limits the number of lines read from the VCF file. Default: the entire VCF file is parsed. |
Engine parameters that are common to all tasks can be found here.
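The names passed to --samples have to match the sample columns in the header of the respective VCF file. As a minimal sketch (it is not part of ATLAS), the helper below lists those names with standard tools. The function name vcf_sample_names and the demo file are hypothetical, and the input is assumed to be gzip- or bgzip-compressed.

```shell
# Print the sample names of a compressed VCF, one per line.
# Sample columns start at column 10 of the "#CHROM" header line.
# (vcf_sample_names is a hypothetical helper, not an ATLAS command.)
vcf_sample_names() {
    gzip -cd "$1" | awk -F'\t' '
        /^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Tiny made-up VCF, written only for this demonstration.
names_dir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind_0\tind_1\n'
    printf 'chr1\t10\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n'
} | gzip -c > "$names_dir/demo.vcf.gz"

# Join the names with commas, the form expected by --samples.
samples=$(vcf_sample_names "$names_dir/demo.vcf.gz" | paste -sd, -)
echo "$samples"
```

The comma-joined result can be pasted directly after --samples, or trimmed to the one or two names of interest.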
7.3.3 Usage Example
#! /bin/bash
# `--fixedSeed N` is needed to get reproducible results in the regression test
. "$(dirname "$0")/find_atlas"
. "$(dirname "$0")/simulate_vcf" --sampleSize 2 --chrLength 2321 --ploidy 2 \
    --out simulate1 --logFile simulate1.out --fixedSeed 1
. "$(dirname "$0")/simulate_vcf" --sampleSize 2 --chrLength 2321 --ploidy 2 \
    --out simulate2 --logFile simulate2.out --fixedSeed 2
# One VCF file, two samples
out="VCFCompare_f_ss"
$atlas --task VCFCompare --vcf simulate1.vcf.gz --samples ind_0,ind_1 \
    --fixedSeed 3 --out "$out" --logFile "$out.out" 2> "$out.eout"
# Two VCF files, one sample
out="VCFCompare_ff_s"
$atlas --task VCFCompare --vcf simulate1.vcf.gz,simulate2.vcf.gz --samples ind_0 \
    --fixedSeed 4 --out "$out" --logFile "$out.out" 2> "$out.eout"
# Two VCF files, two samples
out="VCFCompare_ff_ss"
$atlas --task VCFCompare --vcf simulate1.vcf.gz,simulate2.vcf.gz \
    --samples ind_0,ind_1 \
    --fixedSeed 5 --out "$out" --logFile "$out.out" 2> "$out.eout"
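The --limitLines parameter from the Input section can be combined with any of the calls above, for instance to try a comparison on the first part of a large file. The following is only a sketch: the output prefix VCFCompare_limited and the value 1000 are arbitrary, and falling back to an atlas binary on the PATH is an assumption. If the binary or the input files are missing, the command is printed instead of run.

```shell
# Use $atlas if it is already set (e.g. by find_atlas); otherwise assume
# that a binary called "atlas" is on the PATH.
atlas="${atlas:-atlas}"
out="VCFCompare_limited"

# Compare ind_0 and ind_1 across both files, reading a limited number of lines.
cmd="$atlas --task VCFCompare --vcf simulate1.vcf.gz,simulate2.vcf.gz"
cmd="$cmd --samples ind_0,ind_1 --limitLines 1000"
cmd="$cmd --out $out --logFile $out.out"

if command -v "$atlas" >/dev/null 2>&1 \
        && [ -f simulate1.vcf.gz ] && [ -f simulate2.vcf.gz ]; then
    $cmd 2> "$out.eout"        # word splitting of $cmd is intentional
else
    echo "dry run: $cmd"       # nothing is executed in this branch
fi
```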
7.3.4 Additional Information
VCF files are tab-delimited text files. They contain meta-information lines, a header line, and then data lines, each describing one position in the genome. The format can also hold genotype information for each sample at each position.
VCF file structure :
Column | Column Content Description |
#CHROM | Chromosome |
POS | Coordinate - The start coordinate of the variant. |
ID | Identifier |
REF | Reference allele - The reference allele is whatever is found in the reference genome. It is not necessarily the major allele. |
ALT | Alternative allele - The alternative allele is the allele found in the sample you are studying. |
QUAL | Score - Phred-scaled quality score for the call. Higher values indicate greater confidence. |
FILTER | Pass/fail - If it passed quality filters. |
INFO | Further information - Allows you to provide further information on the variants. Keys in the INFO field can be defined in header lines above the table. |
FORMAT | Information about the following columns - The GT in the FORMAT column tells us to expect genotypes in the following columns. |
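NA19909 | Individual identifier (optional) - The previous column told us to expect to see genotypes here. A genotype lists allele indices (0 for the reference allele, 1 for the first alternative allele) separated by a slash (unphased) or a vertical bar (phased). |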
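To make the column layout above concrete, the following sketch pulls the genotype of one sample out of every data line by hand. It only illustrates the file structure and is not how VCFCompare works internally. The helper name vcf_genotypes and the demo records are hypothetical, and the input is assumed to be gzip- or bgzip-compressed.

```shell
# Print CHROM, POS and the GT value of one sample for every data line.
# Usage: vcf_genotypes FILE.vcf.gz SAMPLE   (hypothetical helper)
vcf_genotypes() {
    gzip -cd "$1" | awk -F'\t' -v sample="$2" '
        /^##/     { next }      # skip meta-information lines
        /^#CHROM/ {             # header line: locate the sample column
            for (i = 10; i <= NF; i++) if ($i == sample) col = i
            next
        }
        col {                   # data line: find the GT key within FORMAT
            n = split($9, keys, ":")
            split($col, vals, ":")
            for (k = 1; k <= n; k++)
                if (keys[k] == "GT") print $1 "\t" $2 "\t" vals[k]
        }'
}

# Tiny made-up VCF with two samples and two positions.
gt_dir=$(mktemp -d)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind_0\tind_1\n'
    printf 'chr1\t10\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t1/1:9\n'
    printf 'chr1\t25\t.\tC\tT\t40\tPASS\t.\tGT:DP\t0/0:7\t0/1:11\n'
} | gzip -c > "$gt_dir/demo.vcf.gz"

gt_ind_1=$(vcf_genotypes "$gt_dir/demo.vcf.gz" ind_1)
echo "$gt_ind_1"
```

Looking up GT through the FORMAT column, rather than assuming it is the first sub-field, keeps the helper correct for files whose FORMAT lists several keys.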
https://samtools.github.io/hts-specs/VCFv4.2.pdf
https://www.ebi.ac.uk/training/online/courses/human-genetic-variation-introduction/