8.1 simulate
Generating simulations
simulate is used to simulate BAM files.
8.1.1 Input
Required inputs :
- None
Optional inputs :
- None
Specific Parameters :
--chrLength length1,length2,length3 |
to provide the lengths of the chromosomes to be simulated (three in this case). (Default = data is simulated for 3 chromosomes with names chr1, chr2 and chr3 with lengths [50000, 40000, 60000]) |
--depth depth1,depth2,depth3 |
to provide average sequencing depth for each chromosome (Default = data is simulated for 3 chromosomes with names chr1, chr2 and chr3 with depth [10, 8, 12] ) |
--ploidy ploidy1,ploidy2,ploidy3 |
to provide the ploidy of each chromosome: 1 = haploid, 2 = diploid. (Default = data is simulated for 3 chromosomes with names chr1, chr2 and chr3 with ploidy [2, 1, 2]) |
--refDiv float_value |
To simulate data with indicated reference divergence. The float_value should be between 0 and 1 (inclusive). Default = 0.01. |
--baseFreq float_value1,float_value2,float_value3,float_value4 |
To indicate base frequencies for the simulation. The four values should add up to 1. Default = 0.25 for all bases. |
--type pair or --type HW |
To simulate data for a pair of individuals or to simulate data under Hardy-Weinberg. Default = Data for single individual is simulated. |
--phi float_values |
To specify the genetic distance between the two individuals when data for a pair of individuals is simulated. This parameter is required when --type pair is used for simulation. |
--sampleSize integer_value |
To specify the number of individuals for whom data is to be simulated when --type HW is used for simulation. Default = 10 (data will be simulated for 10 individuals). |
--fracPoly value |
To specify the fraction of sites that are to be simulated as polymorphic when --type HW is used for simulation. Default = 0.1. |
--alpha value |
To specify the alpha parameter for the beta distribution that describes the derived allele frequencies of the polymorphic sites when --type HW is used for simulation. Default = 0.5. |
--beta value |
To specify the beta parameter for the beta distribution that describes the derived allele frequencies of the polymorphic sites when --type HW is used for simulation. Default = 0.5. |
--F float_value |
To specify an inbreeding coefficient when --type HW is used for simulation. Default = 0 (Will assume no inbreeding). |
--numReadGroup integer_value |
To specify the number of read groups (libraries) for the simulation. Default = Data from one read group is simulated. |
--seqType paired or single |
To specify the type of sequencing: paired-end or single-end. Default = ‘single’ for all read groups. |
--seqCycles integer_value |
To specify the number of sequencing cycles. Default = 100 for all read groups. |
--fragmentLength 'model(parameters)' |
To specify model for fragment length distribution. Default = ‘gamma(10,0.2)[30,200]’ for all read groups. |
--baseQuality 'model(parameters)' |
To specify model for base quality distribution. Default = ‘normal(30,10)[0,93]’ for all read groups. |
--mappingQuality 'model(parameters)' |
To specify model for mapping quality distribution. Default = ‘normal(60,10)[1,255]’ for all read groups. |
--softClipping 'model(parameters)' |
To specify model for soft clipping distribution. Default = ‘poisson(0.5)[0,20]’ for all read groups. |
--pmd 'model(parameters)' |
To specify postmortem damage model. Default = ‘-’ for all read groups. |
--recal 'model(parameters)' |
To specify base quality score recalibration model. Default = ‘-’ for all read groups. |
--frequency float_value |
To specify read group frequency. Default = 1 for all read groups. |
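The --baseFreq constraint described above (four values that add up to 1) can be checked before launching a simulation. The following is only a minimal sketch: the function name check_base_freq and the rounding tolerance are assumptions made for illustration and are not part of the tool.

```shell
#!/bin/bash
# Hypothetical pre-flight helper; not part of the simulate task itself.

# Return 0 if the argument holds exactly four comma-separated values,
# each between 0 and 1, that sum to 1 (within a small rounding tolerance).
check_base_freq() {
    awk -F',' -v tol=1e-9 '
        NF != 4 { bad = 1; exit }
        {
            sum = 0
            for (i = 1; i <= NF; i++) {
                if ($i < 0 || $i > 1) { bad = 1; exit }
                sum += $i
            }
            if (sum - 1 > tol || 1 - sum > tol) bad = 1
        }
        END { exit bad + 0 }
    ' <<< "$1"
}

if check_base_freq "0.25,0.25,0.25,0.25"; then
    echo "base frequencies are valid"
fi
```

The same kind of check could be applied to --refDiv, which must lie between 0 and 1.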
--seqType, --seqCycles, --fragmentLength, --baseQuality, --mappingQuality, --softClipping, --pmd and --recal can also be provided together in a separate file, Readgroupinfo.txt, using the following parameter: --RGInfo Readgroupinfo.txt. See here for such an example.
Note: The number of chromosomes implied by --chrLength, --ploidy and --depth must match for the simulation to run! If only one value is given for --chrLength, all simulated chromosomes (their number is implied by the number of values for --depth and --ploidy) will have the same length.
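The matching rule in the note can be sketched as a small check run before the simulation. This is an illustration only: the function names count_values and check_chromosome_counts are hypothetical, and the sketch handles plain comma-separated lists, not any shorthand notation.

```shell
#!/bin/bash
# Hypothetical pre-flight check; not part of the simulate task itself.

# Count the comma-separated entries in a list such as "2,1,2".
count_values() {
    local IFS=','
    local -a parts
    read -r -a parts <<< "$1"
    echo "${#parts[@]}"
}

# Arguments: chrLength list, depth list, ploidy list.
# Return 0 if they describe the same number of chromosomes.
# A single chrLength value is accepted because it applies to all chromosomes.
check_chromosome_counts() {
    local n_len n_depth n_ploidy
    n_len=$(count_values "$1")
    n_depth=$(count_values "$2")
    n_ploidy=$(count_values "$3")
    [ "$n_depth" -eq "$n_ploidy" ] || return 1
    [ "$n_len" -eq 1 ] || [ "$n_len" -eq "$n_depth" ]
}

# The documented defaults describe three chromosomes in all three lists.
if check_chromosome_counts "50000,40000,60000" "10,8,12" "2,1,2"; then
    echo "chromosome counts match"
fi
```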
- See Filter parameters to apply specific filters for bases, reads and parsing window setting.
Engine parameters that are common to all tasks can be found here.
Optional Parameters :
--writeTrueGenotypes |
to write genotype of the simulated data (Default = true genotypes are NOT written to file) |
--writeVariantBED |
to create BED file for the simulated data (Default = BED files with variant and invariant positions are NOT written) |
8.1.2 Output
*.bam | A simulated BAM file. |
*.bam.bai | Index file for the simulated BAM file. |
*.fasta | A simulated reference sequence file. |
*.fasta.fai | Index file for the simulated reference sequence file. |
*_trueGenotypes.vcf.gz (optional) | VCF file with the genotypes of the simulated data. By default, genotypes are not written to a file. |
*_invariantSites.bed.gz (optional) | BED file with invariant site positions and genotypes. By default BED files with invariant positions are not created. |
8.1.3 Usage Example
#! /bin/bash
k="321"
. $(dirname $0)/find_atlas
. $(dirname $0)/simulate --type pair --seqType paired --seqCycles 200 \
--chrLength 50$k{2},30$k,40$k,60$k --ploidy 2{3},1,2 --depth 10,8{2},5{2} \
--phi 0.1{8},0.2 --baseFreq 0.5,0.3,0.2,0 --refDiv 0.5 \
--fragmentLength 'fixed(500)' --baseQuality 'binomial(95,0.01)[0,93]' \
--mappingQuality 'normal(60,10)[1,255]' --softClipping 'poisson(20)[0,50]' \
--pmd 'CT5:0.2*exp(-0.2*p)+0.02;GA3:0.3' --frequency 0.2 --fixedSeed 234
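In the example above, entries such as 2{3} appear to be shorthand for repeating a value, so that --chrLength, --ploidy and --depth each describe five chromosomes. This reading is an assumption inferred from the matching counts, not a documented rule. Under that assumption, the expansion could be sketched as follows; expand_repeats is a hypothetical helper and not part of the tool.

```shell
#!/bin/bash
# Hypothetical helper illustrating the ASSUMED meaning of "value{n}":
# repeat "value" n times. Not part of the simulate task itself.

expand_repeats() {
    local IFS=','
    local re='^(.+)\{([0-9]+)\}$'
    local -a items out
    local item value n i
    read -r -a items <<< "$1"
    out=()
    for item in "${items[@]}"; do
        if [[ $item =~ $re ]]; then
            value=${BASH_REMATCH[1]}
            n=${BASH_REMATCH[2]}
        else
            value=$item
            n=1
        fi
        for ((i = 0; i < n; i++)); do
            out+=("$value")
        done
    done
    # Join the expanded entries with commas (first character of IFS).
    echo "${out[*]}"
}

expand_repeats "2{3},1,2"        # prints 2,2,2,1,2
expand_repeats "10,8{2},5{2}"    # prints 10,8,8,5,5
```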