7.3 VCFCompare

Comparing genotype calls in two VCF files

VCFCompare compares the genotype calls in two Variant Call Format (VCF) files generated by call. A VCF file records where the aligned reads of a sample differ from a reference genome. To produce one, the sample's sequences are first aligned to the reference genome, which yields BAM files. Positions at which the aligned reads differ from the reference are then identified (called) and written to the VCF file.

7.3.1 Input

Required inputs:

--vcf example1_trueGenotypes.vcf.gz, example2_trueGenotypes.vcf.gz Comma-separated list of the VCF files to be compared. Currently, no more than two files can be provided.
--samples name_of_Sample_in_example1_trueGenotypes.vcf.gz, name_of_Sample_in_example2_trueGenotypes.vcf.gz Comma-separated list of the sample names, in the respective VCF files, whose calls should be compared.

Optional inputs:

  • None

Specific Parameters:

--limitLines integer_value Limits the number of lines read from the VCF file. Default: the entire VCF file is parsed.

Engine parameters that are common to all tasks can be found here.

7.3.2 Output

Table summarizing the number of calls for every combination of two genotypes.
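To illustrate what such a table of genotype combinations contains, here is a minimal sketch of the counting step. It is not the ATLAS implementation, and the layout of the real output file may differ. The function name `tally_genotype_pairs` and the three-column input format (chromosome, position, genotype, tab-separated) are assumptions made for this example only. Positions present in only one input are ignored, and missing genotypes receive no special handling.

```shell
# Hypothetical sketch, not ATLAS code: count how often each pair of
# genotypes occurs at positions that are present in both inputs.
# Each input file has three tab-separated columns: chromosome, position, genotype.
# Assumes the first file is not empty.
tally_genotype_pairs() {
    awk -F '\t' '
        # First file: remember the genotype seen at each position.
        NR == FNR { first[$1 FS $2] = $3; next }
        # Second file: count the pair (genotype in file 1, genotype in file 2).
        {
            key = $1 FS $2
            if (key in first) count[first[key] FS $3]++
        }
        END { for (pair in count) print pair FS count[pair] }
    ' "$1" "$2" | sort
}

# Made-up genotypes, for illustration only.
a=$(mktemp)
b=$(mktemp)
printf 'chr1\t10\t0/0\nchr1\t20\t0/1\nchr1\t30\t1/1\nchr1\t40\t0/1\nchr1\t60\t0/1\n' > "$a"
printf 'chr1\t10\t0/0\nchr1\t20\t0/1\nchr1\t30\t0/1\nchr1\t50\t1/1\nchr1\t60\t0/1\n' > "$b"

# Prints one line per observed combination: genotype 1, genotype 2, count.
tally_genotype_pairs "$a" "$b"
rm -f "$a" "$b"
```

Each output line reads "genotype in the first input, genotype in the second input, number of positions". In the made-up data, the pair 0/1 and 0/1 is counted twice (positions 20 and 60), while positions 40 and 50 are skipped because they occur in only one input.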

7.3.3 Usage Example

#! /bin/bash

# `--fixedSeed = N` is needed to have reproducible results in the regression test

. $(dirname $0)/find_atlas

. $(dirname $0)/simulate_vcf --sampleSize 2 --chrLength 2321 --ploidy 2 \
    --out simulate1 --logFile simulate1.out --fixedSeed 1

. $(dirname $0)/simulate_vcf --sampleSize 2 --chrLength 2321 --ploidy 2 \
    --out simulate2 --logFile simulate2.out --fixedSeed 2

out="VCFCompare_f_ss"
$atlas --task VCFCompare --vcf simulate1.vcf.gz --samples ind_0,ind_1 \
       --fixedSeed 3 --out $out --logFile $out.out 2> $out.eout

out="VCFCompare_ff_s"
$atlas --task VCFCompare --vcf simulate1.vcf.gz,simulate2.vcf.gz --samples ind_0 \
       --fixedSeed 4 --out $out --logFile $out.out 2> $out.eout

out="VCFCompare_ff_ss"
$atlas --task VCFCompare --vcf simulate1.vcf.gz,simulate2.vcf.gz \
       --samples ind_0,ind_1 \
       --fixedSeed 5 --out $out --logFile $out.out 2> $out.eout

7.3.4 Additional Information

VCF files are tab-delimited text files. They contain meta-information lines, a header line, and then data lines, each describing one position in the genome. The format can also hold genotype information for samples at each position.
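The three kinds of lines can be told apart by their prefix: meta-information lines start with `##`, the header line starts with `#CHROM`, and everything else is a data line. The following sketch counts them in an uncompressed VCF stream. The function names `classify_vcf_lines` and `sample_vcf`, as well as the example records, are made up for illustration.

```shell
# Hypothetical helper: count the three kinds of lines in a VCF read from stdin.
classify_vcf_lines() {
    awk '
        /^##/ { meta++; next }      # meta-information lines start with "##"
        /^#/  { header++; next }    # the header line starts with "#CHROM"
        NF    { data++ }            # any other non-empty line is a data line
        END   { printf "meta=%d header=%d data=%d\n", meta, header, data }
    '
}

# Made-up VCF with two meta lines, one header line and two data lines.
sample_vcf() {
    printf '##fileformat=VCFv4.2\n'
    printf '##source=example\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind_0\n'
    printf 'chr1\t10\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\n'
    printf 'chr1\t20\t.\tC\tT\t50\tPASS\t.\tGT\t1/1\n'
}

sample_vcf | classify_vcf_lines   # prints: meta=2 header=1 data=2
```

For a compressed file, the same function can be fed through a pipe, for example `gzip -dc simulate1.vcf.gz | classify_vcf_lines`.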

VCF file structure:

| Column | Content | Description |
| --- | --- | --- |
| #CHROM | Chromosome | |
| POS | Coordinate | The start coordinate of the variant. |
| ID | Identifier | |
| REF | Reference allele | The reference allele is whatever is found in the reference genome. It is not necessarily the major allele. |
| ALT | Alternative allele | The alternative allele is the allele found in the sample you are studying. |
| QUAL | Score | Phred-scaled quality score of the call. |
| FILTER | Pass/fail | Whether the call passed the quality filters. |
| INFO | Further information | Allows you to provide further information on the variants. Keys in the INFO field can be defined in header lines above the table. |
| FORMAT | Information about the following columns | The GT in the FORMAT column tells us to expect genotypes in the following columns. |
| NA19909 | Individual identifier (optional) | The previous column told us to expect to see genotypes here. The genotype is in the form 0 |
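As a rough illustration of how the FORMAT column and the per-sample columns fit together, the sketch below prints the chromosome, the position and the GT value of one named sample. It assumes an uncompressed VCF on standard input, with sample columns starting at column 10 as laid out in the VCF specification. The function names `sample_genotypes` and `sample_vcf` and the example records are hypothetical, and this is not how ATLAS itself reads VCF files.

```shell
# Hypothetical helper: print CHROM, POS and the GT value of one sample.
# Usage: sample_genotypes SAMPLE_NAME < file.vcf
sample_genotypes() {
    awk -F '\t' -v sample="$1" '
        /^##/ { next }                        # skip meta-information lines
        /^#CHROM/ {
            # Sample names follow the nine fixed columns of the header line.
            for (i = 10; i <= NF; i++) if ($i == sample) col = i
            if (!col) { print "sample not found: " sample > "/dev/stderr"; exit 1 }
            next
        }
        !col { next }                         # no header line seen yet
        {
            # Locate GT among the colon-separated keys of the FORMAT column.
            n = split($9, keys, ":")
            gt = 0
            for (k = 1; k <= n; k++) if (keys[k] == "GT") gt = k
            if (!gt) next                     # this line carries no genotype
            split($col, vals, ":")
            print $1 "\t" $2 "\t" vals[gt]
        }
    '
}

# Made-up VCF with two samples, for illustration only.
sample_vcf() {
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind_0\tind_1\n'
    printf 'chr1\t10\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t0/0:9\n'
    printf 'chr1\t20\t.\tC\tT\t50\tPASS\t.\tGT:DP\t1/1:7\t0/1:11\n'
}

# Prints two lines for ind_1: position 10 with 0/0 and position 20 with 0/1.
sample_vcf | sample_genotypes ind_1
```

Looking up GT by name in the FORMAT column, rather than assuming it is the first key, keeps the sketch working when other per-sample keys such as DP are present.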

https://samtools.github.io/hts-specs/VCFv4.2.pdf

https://www.ebi.ac.uk/training/online/courses/human-genetic-variation-introduction/