8.1 simulate

Generating simulations

simulate generates simulated BAM files, together with a matching simulated reference sequence.

8.1.1 Input

Required inputs :

  • None

Optional inputs :

  • None

Specific Parameters :

--chrLength length1,length2,length3 To provide the length of each chromosome to be simulated (three in this case). Default = three chromosomes named chr1, chr2 and chr3 with lengths [50000, 40000, 60000].
--depth depth1,depth2,depth3 To provide the average sequencing depth for each chromosome. Default = three chromosomes named chr1, chr2 and chr3 with depths [10, 8, 12].
--ploidy ploidy1,ploidy2,ploidy3 To provide the ploidy of each chromosome (1 = haploid, 2 = diploid). Default = three chromosomes named chr1, chr2 and chr3 with ploidies [2, 1, 2].
--refDiv float_value To simulate data with the indicated reference divergence. The float_value should be between 0 and 1 (inclusive). Default = 0.01.
--baseFreq float_value1,float_value2,float_value3,float_value4 To indicate the base frequencies for the simulation. The four values should add up to 1. Default = 0.25 for all bases.
--type pair or --type HW To simulate data for a pair of individuals or under Hardy-Weinberg equilibrium. Default = data for a single individual is simulated.
--phi float_values To specify the genetic distance between the two individuals when data for a pair of individuals is simulated. This parameter is required when --type pair is used.
--sampleSize integer_value To specify the number of individuals for whom data is to be simulated when --type HW is used for simulation. Default = 10 (data will be simulated for 10 individuals).
--fracPoly value To specify the fraction of sites that are to be simulated as polymorphic when --type HW is used for simulation. Default = 0.1.
--master value To specify the master parameter for the beta distribution that describes the derived allele frequencies of the polymorphic sites when --type HW is used for simulation. Default = 0.5.
--beta value To specify the beta parameter for the beta distribution that describes the derived allele frequencies of the polymorphic sites when --type HW is used for simulation. Default = 0.5.
--F float_value To specify an inbreeding coefficient when --type HW is used for simulation. Default = 0 (Will assume no inbreeding).
--numReadGroup integer_value To specify the number of read groups (libraries) for the simulation. Default = Data from one read group is simulated.
--seqType paired or single To specify the type of sequencing: paired-end or single-end. Default = ‘single’ for all read groups.
--seqCycles integer_value To specify the number of sequencing cycles. Default = 100 for all read groups.
--fragmentLength 'model(parameters)' To specify model for fragment length distribution. Default = ‘gamma(10,0.2)[30,200]’ for all read groups.
--baseQuality 'model(parameters)' To specify model for base quality distribution. Default = ‘normal(30,10)[0,93]’ for all read groups.
--mappingQuality 'model(parameters)' To specify model for mapping quality distribution. Default = ‘normal(60,10)[1,255]’ for all read groups.
--softClipping 'model(parameters)' To specify model for soft clipping distribution. Default = ‘poisson(0.5)[0,20]’ for all read groups.
--pmd 'model(parameters)' To specify postmortem damage model. Default = ‘-’ for all read groups.
--recal 'model(parameters)' To specify base quality score recalibration model. Default = ‘-’ for all read groups.
--frequency float_value To specify read group frequency. Default = 1 for all read groups.
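
Of the parameters above, --baseFreq has a constraint that is easy to check before starting a run: four values that add up to 1. The following is a minimal sketch of such a check. The function name base_freq_ok and the tolerance of 1e-9 are illustrative choices and not part of the tool.

```shell
#!/bin/sh
# Return 0 if the argument holds exactly four comma-separated
# frequencies that sum to 1 (within a small tolerance), 1 otherwise.
# base_freq_ok is a hypothetical helper, not part of the simulate task.
base_freq_ok() {
    printf '%s\n' "$1" | awk -F',' '
        NF != 4 { exit 1 }
        {
            sum = 0
            for (i = 1; i <= NF; i++) sum += $i
            diff = sum - 1
            if (diff < 0) diff = -diff
            if (diff < 1e-9) exit 0
            exit 1
        }'
}

if base_freq_ok "0.5,0.3,0.2,0"; then
    echo "base frequencies sum to 1"
else
    echo "base frequencies are invalid"
fi
```

The check only covers the count and the sum; it does not verify that each value lies between 0 and 1.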

--seqType, --seqCycles, --fragmentLength, --baseQuality, --mappingQuality, --softClipping, --pmd and --recal can also be provided together in a separate file, e.g. Readgroupinfo.txt, using the following parameter: --RGInfo Readgroupinfo.txt. See here for such an example.

Note : The number of chromosomes implied by --chrLength, --ploidy and --depth needs to match for the simulation to run! If only one value is given for --chrLength, all simulated chromosomes (as implied by the number of values for --depth and --ploidy) will have the same length.
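
The rule in the note can be checked before a run. The sketch below counts the values in each list and compares the counts. It assumes that a trailing {k}, as in the usage example at the end of this section, stands for k copies of the preceding value; that reading is inferred from the example and is not stated in this section. The function names are hypothetical.

```shell
#!/bin/sh
# Count the values in a comma-separated list.
# Assumption: "value{k}" is shorthand for k copies of value.
count_values() {
    printf '%s\n' "$1" | awk -F',' '{
        n = 0
        for (i = 1; i <= NF; i++) {
            if (match($i, /\{[0-9]+\}$/))
                n += substr($i, RSTART + 1, RLENGTH - 2)
            else
                n += 1
        }
        print n
    }'
}

# Return 0 if chrLength, ploidy and depth imply the same number of
# chromosomes. A single chrLength value is accepted because it is
# applied to all chromosomes.
check_chromosome_counts() {
    n_len=$(count_values "$1")
    n_ploidy=$(count_values "$2")
    n_depth=$(count_values "$3")
    [ "$n_ploidy" -eq "$n_depth" ] || return 1
    [ "$n_len" -eq 1 ] || [ "$n_len" -eq "$n_ploidy" ]
}

if check_chromosome_counts "50000,40000,60000" "2,1,2" "10,8,12"; then
    echo "chromosome counts match"
else
    echo "chromosome counts differ"
fi
```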

  • See Filter parameters to apply specific filters for bases and reads, and to adjust the parsing window settings.

Engine parameters that are common to all tasks can be found here.

Optional Parameters :

--writeTrueGenotypes To write the true genotypes of the simulated data to a file. Default = true genotypes are NOT written to file.
--writeVariantBED To create BED files for the simulated data. Default = BED files with variant and invariant positions are NOT written.

8.1.2 Output

*.bam A simulated BAM file.
*.bam.bai Index file for the simulated BAM file.
*.fasta A simulated reference sequence file.
*.fasta.fai Index file for the simulated reference sequence file.
*_trueGenotypes.vcf.gz (optional) VCF file with the genotypes of the simulated data. By default, genotypes are not written to a file.
*_invariantSites.bed.gz (optional) BED file with invariant site positions and genotypes. By default, BED files with invariant positions are not created.
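
After a run, it can be useful to confirm that the four non-optional files listed above are present. The sketch below assumes that each file name is an output prefix followed directly by the listed suffix; the prefix and the function name missing_outputs are hypothetical.

```shell
#!/bin/sh
# Print every expected non-optional output file that does not exist
# for the given prefix. Assumption: file name = prefix + suffix.
missing_outputs() {
    for suffix in .bam .bam.bai .fasta .fasta.fai; do
        [ -e "$1$suffix" ] || printf '%s\n' "$1$suffix"
    done
}

# "simulated" is a placeholder prefix; replace it with the one used in your run.
missing_outputs "simulated"
```

An empty result means all four files were found.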

8.1.3 Usage Example

#! /bin/bash

k="321"
. $(dirname $0)/find_atlas

. $(dirname $0)/simulate --type pair --seqType paired --seqCycles 200 \
    --chrLength 50$k{2},30$k,40$k,60$k --ploidy 2{3},1,2 --depth 10,8{2},5{2} \
    --phi 0.1{8},0.2 --baseFreq 0.5,0.3,0.2,0 --refDiv 0.5 \
    --fragmentLength 'fixed(500)' --baseQuality 'binomial(95,0.01)[0,93]' \
    --mappingQuality 'normal(60,10)[1,255]' --softClipping 'poisson(20)[0,50]' \
    --pmd 'CT5:0.2*exp(-0.2*p)+0.02;GA3:0.3' --frequency 0.2 --fixedSeed 234
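
The --pmd value in this example contains the expression 0.2*exp(-0.2*p)+0.02 for its CT5 entry. To get a feel for the shape of such a curve, the sketch below tabulates a*exp(-b*p)+c for the first few values of p. It assumes that p is an integer position index starting at 0, which is not stated in this section, and the function name pmd_curve is hypothetical.

```shell
#!/bin/sh
# Tabulate a*exp(-b*p)+c for p = 0 .. n-1.
# Assumption: p is a 0-based position index.
# Arguments: a b c n
pmd_curve() {
    awk -v a="$1" -v b="$2" -v c="$3" -v n="$4" 'BEGIN {
        for (p = 0; p < n; p++)
            printf "%d\t%.4f\n", p, a * exp(-b * p) + c
    }'
}

# Parameters taken from the CT5 entry of the example above
pmd_curve 0.2 0.2 0.02 5
```

The values decay from a+c at p = 0 towards c as p grows.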