4.2 BAMDiagnostics

Estimating approximate depth, read length frequencies and mapping quality frequencies

BAMDiagnostics provides a set of read statistics for the input BAM file while taking into account all standard input filters. The output is written to .txt files that summarize the following information:

  • Total number of reads
  • Number of reads that passed filters
  • Number of duplicate reads
  • Average read length
  • Maximum read length
  • Number of proper pairs
  • Average fragment length (only known for paired-end data)
  • Total number of soft-clipped positions
  • Average soft-clipped length
  • Average aligned length
  • Mean sequencing depth across the whole genome
  • Average mapping quality

It also provides histograms which display the distributions of fragment lengths, mapping qualities, read lengths, soft-clipped lengths and aligned lengths. All of this data is written for all read groups combined, as well as for each read group separately.
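As a rough plausibility check for the reported mean depth, the sketch below divides the total number of aligned bases by the reference length. This is a generic back-of-the-envelope formula and not necessarily how BAMDiagnostics computes depth internally; the helper name `approx_depth` and the numbers are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not part of atlas): approximate mean depth as
#   number of reads * mean aligned length / reference length.
# ASSUMPTION: a generic approximation, not BAMDiagnostics' own definition.
approx_depth() {
    local n_reads=$1 mean_aligned_len=$2 ref_len=$3
    awk -v n="$n_reads" -v l="$mean_aligned_len" -v g="$ref_len" \
        'BEGIN { printf "%.2f\n", n * l / g }'
}

# Illustrative numbers only: 1000 reads with 100 aligned bases each
# on a 10000 bp reference.
approx_depth 1000 100 10000
```

A large gap between such an estimate and the reported value may simply reflect filtering, since the reported statistics take the input filters into account.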

4.2.1 Input

Required inputs :

--bam Input_bam_file.bam Input BAM file.

Optional inputs :

  • none

Specific Parameters :

--diagnosticsPerChromosome To output data per chromosome into a *_diagnostics.txt diagnostics file. Default = Only per-read group summary statistics are provided (per-chromosome summary statistics are not provided).
--splitMergeInput To create an input file for splitMerge. Default = Will not create an input file for splitMerge.
--printReferenceLength To print reference lengths of chromosomes to file. Default = Will not print reference lengths of chromosomes to file.
  • See Filter parameters to apply specific filters for bases and reads, and for parsing window settings.

Engine parameters that are common to all tasks can be found here.

4.2.2 Output

*_filterSummary.txt Filter summary for all read groups combined and individual read groups.
*_fragmentLengthHistogram.txt Fragment length counts for all read groups combined and individual read groups.
*_mappingQualityHistogram.txt Mapping quality counts for all read groups combined and individual read groups.
*_readLengthHistogram.txt Read length counts for all read groups combined and individual read groups.
*_softClippedLengthHistogram.txt Soft-clipped length counts for all read groups combined and individual read groups.
*_alignedLengthHistogram.txt Aligned length counts for all read groups combined and individual read groups.
*_diagnostics.txt File containing per-read group summary statistics. It also contains per-chromosome summary statistics when the --diagnosticsPerChromosome parameter is used. This file can be used as the input file for the splitMerge task.
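The averages in the summary relate to the histograms as count-weighted means. The sketch below computes such a mean from a two-column table. The file layout used here (one header line, then whitespace-separated value and count columns) is an assumption made for illustration, and `histogram_mean` is a hypothetical helper; check the actual columns of your histogram files before adapting it.

```shell
#!/bin/bash
# Hypothetical helper (not part of atlas): count-weighted mean of a histogram.
# ASSUMPTION: one header line, value in column 1, count in column 2.
# The real *_Histogram.txt layout may differ (e.g. one column per read group).
histogram_mean() {
    awk 'NR > 1 && NF >= 2 { sum += $1 * $2; n += $2 }
         END { if (n > 0) printf "%.2f\n", sum / n; else print "NA" }' "$1"
}

# Toy input standing in for a *_readLengthHistogram.txt file.
demo=$(mktemp)
printf 'length\tcount\n50\t2\n100\t6\n150\t2\n' > "$demo"
histogram_mean "$demo"
rm -f "$demo"
```

The helper prints NA when the table holds no counts, which avoids a division by zero for empty read groups.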

4.2.3 Usage Example

#! /bin/bash

# `--fixedSeed = N` is needed to have reproducible results in regression test

. $(dirname $0)/find_atlas
. $(dirname $0)/simulate --type HW --F 0.1 --fixedSeed 0 \
    --sampleSize 17 --chrLength 11111 --fracPoly 1.0 \
    --alpha 2.0 --beta 2.0 --seqType single --seqCycles 101

for i in {1..17}; do
    samtools view simulate_ind"$i".bam | head -250 | tail -10 | cut -f1 \
            > blacklist_"$i".txt
    u=$(echo "$i*5" | bc)

    out="simple$i"
    $atlas --task BAMDiagnostics --perChromosome --bam simulate_ind$i.bam \
           --fixedSeed $i --out $out --logFile $out.out 2> $out.eout

    out="complex$i"
    $atlas --task BAMDiagnostics --identifyDuplicates --bam simulate_ind$i.bam \
           --filterSoftClips --filterMQ 0,$u --blacklist blacklist_$i.txt \
           --filterReadLength 0,$u --filterFragmentLength 0,$u \
           --fixedSeed 1$i --out $out --logFile $out.out 2> $out.eout
done
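Two idioms in the loop above are easy to misread: `head -250 | tail -10` keeps records 241 to 250 of the BAM file (their read names form the blacklist), and `u` is five times the individual's index. The sketch below reproduces both without samtools or atlas; the fake read names and the function names are hypothetical stand-ins.

```shell
#!/bin/bash
# Stand-in for `samtools view file.bam | cut -f1`: 300 fake read names.
fake_read_names() { seq 1 300 | sed 's/^/read/'; }

# Same head/tail idiom as in the script: keep records 241 to 250.
blacklisted_names() { fake_read_names | head -250 | tail -10; }

# Same value as `echo "$i*5" | bc`, using shell arithmetic instead of bc.
filter_value() { echo $(( $1 * 5 )); }

blacklisted_names      # names of the ten selected records
filter_value 17        # value of u for the last individual
```

Shell arithmetic gives the same integer result as the `bc` call and removes the dependency on an external tool.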