4.2 BAMDiagnostics

Estimating approximate depth, read length frequencies and mapping quality frequencies

BAMDiagnostics provides a set of read statistics for the input BAM file while taking into account all standard input filters. The output is written to .txt files that summarize the following information:

Total number of reads
Number of reads that passed filters
Number of duplicate reads
Average read length
Maximum read length
Number of proper pairs
Average fragment length (only known for paired-end data)
Total number of soft-clipped positions
Average soft-clipped length
Average aligned length
Mean sequencing depth across the whole genome
Average mapping quality

It also provides histograms which display the distributions of fragment lengths, mapping qualities, read lengths, soft-clipped lengths and aligned lengths. All of this data is written for all read groups combined, as well as for each read group separately.
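As a rough cross-check, an average such as the mean read length can be recomputed from the matching histogram as a count-weighted mean. The sketch below assumes a whitespace-separated file with one header line and lets the caller state which columns hold the value and the count. The actual column layout of the histogram files is not described in this section, so inspect a file before relying on it. The function name `histogram_mean` is hypothetical.

```shell
# Count-weighted mean of a histogram: sum(value * count) / sum(count).
# Arguments: file, 1-based column with the value, 1-based column with the count.
# Assumption: the first line is a header and is skipped.
histogram_mean() {
    awk -v v="$2" -v c="$3" '
        NR > 1 { sum += $v * $c; n += $c }
        END    { if (n > 0) printf "%.2f\n", sum / n }
    ' "$1"
}
```

For a file whose first column is the length and whose second column is the count, the call would be `histogram_mean some_readLengthHistogram.txt 1 2`. The result may differ from the reported average if the histogram and the summary are not computed over the same set of reads.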

4.2.1 Input

Required inputs :

--bam Input_bam_file.bam Input BAM file.

Optional inputs :

none

Specific Parameters :

`--diagnosticsPerChromosome`	To output data per chromosome into the *_diagnostics.txt file. Default = Only per-read group summary statistics are provided (per chromosome summary statistics are not provided).
`--splitMergeInput`	To create input file for `splitMerge`. Default = Will not create input file for splitMerge.
`--printReferenceLength`	To print reference lengths of chromosomes to file. Default = Will not print reference lengths of chromosomes to file.

See Filter parameters for filters on bases and reads and for parsing window settings.

Engine parameters that are common to all tasks can be found here.
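A minimal call that combines the required input with two of the task-specific switches from this section might look like the sketch below. It assumes the executable is available as `atlas` on the PATH and uses `sample1` as an illustrative output prefix; the wrapper function `bam_diagnostics` is hypothetical. The command is printed first and only executed if both the executable and the BAM file are found.

```shell
#!/bin/bash
# Build and (optionally) run a BAMDiagnostics call.
# Arguments: BAM file, output prefix.
bam_diagnostics() {
    local bam="$1" out="$2"
    local cmd=(atlas --task BAMDiagnostics --bam "$bam" --out "$out"
               --diagnosticsPerChromosome --printReferenceLength)

    # Show the command so it can be checked before anything runs.
    printf '%s\n' "${cmd[*]}"

    # Run only when atlas is on the PATH and the BAM file exists (both assumed).
    if command -v atlas >/dev/null 2>&1 && [ -f "$bam" ]; then
        "${cmd[@]}"
    fi
}

bam_diagnostics Input_bam_file.bam sample1
```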

4.2.2 Output

*_filterSummary.txt	Filter summary for all read groups combined and individual read groups.
*_fragmentLengthHistogram.txt	Fragment length counts for all read groups combined and individual read groups.
*_mappingQualityHistogram.txt	Mapping quality counts for all read groups combined and individual read groups.
*_readLengthHistogram.txt	Read length counts for all read groups combined and individual read groups.
*_softClippedLengthHistogram.txt	Soft-clipped length counts for all read groups combined and individual read groups.
*_alignedLengthHistogram.txt	Aligned length counts for all read groups combined and individual read groups.
*_diagnostics.txt	File containing per-read group summary statistics. It also contains per chromosome summary statistics when the `--diagnosticsPerChromosome` parameter is used. This file can be used as input file for the splitMerge task.
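After a run it can be useful to confirm that every file listed above was written. The sketch below assumes that the `*` in the file names stands for the prefix passed to `--out`, joined to the suffix with an underscore. The function name `missing_diagnostics_outputs` is hypothetical.

```shell
# Print the expected BAMDiagnostics output files that do not exist.
# Argument: output prefix (assumed to be the value given to --out).
missing_diagnostics_outputs() {
    local prefix="$1" suffix
    for suffix in filterSummary fragmentLengthHistogram mappingQualityHistogram \
                  readLengthHistogram softClippedLengthHistogram \
                  alignedLengthHistogram diagnostics; do
        [ -e "${prefix}_${suffix}.txt" ] || echo "${prefix}_${suffix}.txt"
    done
}
```

An empty result means all seven files are present for that prefix.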

4.2.3 Usage Example

#! /bin/bash

# `--fixedSeed N` is needed to have reproducible results in the regression test

. $(dirname $0)/find_atlas
. $(dirname $0)/simulate --type HW --F 0.1 --fixedSeed 0 \
    --sampleSize 17 --chrLength 11111 --fracPoly 1.0 \
    --alpha 2.0 --beta 2.0 --seqType single --seqCycles 101

for i in {1..17}; do
    samtools view simulate_ind"$i".bam | head -250 | tail -10 | cut -f1 \
            > blacklist_"$i".txt
    u=$((i * 5))

    out="simple$i"
    $atlas --task BAMDiagnostics --perChromosome --bam simulate_ind$i.bam \
           --fixedSeed $i --out $out --logFile $out.out 2> $out.eout

    out="complex$i"
    $atlas --task BAMDiagnostics --identifyDuplicates --bam simulate_ind$i.bam \
           --downsampleReads 0.75 \
           --filterSoftClips --filterMQ 0,$u --blacklist blacklist_$i.txt \
           --filterReadLength 0,$u --filterFragmentLength 0,$u \
           --fixedSeed 1$i --out $out --logFile $out.out 2> $out.eout
done