4.5 mergeOverlappingReads

Merging paired-end reads in BAM file

mergeOverlappingReads should be run before any consecutive variant discovery or population genetic tool to merge paired end reads. You need to specify which read groups should be considered for merging. Others will just be written to the BAM as they are.

4.5.1 Input

Required inputs :

--bam Input_bam_file.bam Input bam file.
--readGroupSettings Input_text_file.txt Input text file specifying read group settings: the names of the read group to be considered, if they are single-end read, paired or mixed, and if they are single-end or paired their maximum cycle number (separated by any white space). The read group names can be found in the header of your BAM file, but note that ‘ID:’ is not part of the read group name. If you do not know the maximum cycle number the genome was sequenced at, you can find the maximum read length in the BAM file with the task BAMDiagnostics. If you do not know whether a read group was sequenced with a paired or single-end protocol, or whether it is mixed (i.e. contains reads from both paired and single-end sequencing) you can check the SAM flags of the reads. If you specify the wrong type, ATLAS will throw an error saying that it found nonsensical settings for an offending read.

Optional inputs :

--blacklist blacklist.txt Input text file to filter out specific reads (each on a new line) provided as a list that should be ignored and just written to file. Default = nothing specific is filtered out.
--recal OR --RGInfo sample_RGInfo.json Quality score recalibration file (see estimateErrors for further information).
--pmd OR --RGInfo sample_RGInfo.json Post-mortem damage parameters (see estimateErrors for further information).

Specific Parameters :

--mergingMethod method_to_be_used_for_merging To specify the method to be used for merging. Options: middle, firstMate, secondMate, highestQuality, randomRead. Default = middle.
--outQual integer_1,integer_2 To constrain the quality scores to the indicated range when writing alignments. Default = Will use the full range of quality scores when writing alignments.
--writeBinnedQualities To write Illumina-binned quality scores. Default = Writes raw quality scores.
--keepOrphans To keep orphaned reads. Default = Will filter out orphaned reads.
--acceptedDistance integer_value To specify distance up-to which mates will not be considered orphans. Default = 2000 bp.
--removeSoftClippedBases To remove all softclipped bases. Default = Will not remove softclipped bases.

Engine parameters that are common to all tasks can be found here.

4.5.2 Output

*_merged.bam A BAM file with merged reads.
*_merged.bam.bai Index files for merged BAM files.
*_filterSummary.txt A text file with filter summary.

4.5.3 Usage Example

#! /bin/bash

. $(dirname $0)/find_atlas

# paired end
out="paired"
. $(dirname $0)/simulate --numReadGroups 10 --fixedSeed 140 \
  --seqType paired --out $out --logFile $out.out

echo "readGroup seqType seqCycles" > rgs_paired.txt
echo "SimReadGroup1 paired 100" >> rgs_paired.txt
echo "SimReadGroup3 paired 100" >> rgs_paired.txt
echo "SimReadGroup5 paired 100" >> rgs_paired.txt
echo "SimReadGroup7 paired 100" >> rgs_paired.txt
echo "SimReadGroup9 paired 100" >> rgs_paired.txt


for name in "middle" "firstMate" "secondMate" "highestQuality" "randomRead"; do
    out="$name"
    $atlas --task mergeOverlappingReads  --mergingMethod $name \
           --bam paired.bam --readGroupSettings rgs_paired.txt\
           --fixedSeed 141 --out $out --logFile $out.out 2> $out.eout

    out="${name}_2nd"
    $atlas --task mergeOverlappingReads  --mergingMethod $name \
           --bam ${name}_merged.bam --readGroupSettings rgs_paired.txt\
           --fixedSeed 142 --out $out --logFile $out.out 2> $out.eout

    if ! diff -q <(samtools view ${name}_merged.bam) <(samtools view ${name}_2nd_merged.bam) > /dev/null; then
        samtools view ${name}_merged.bam > ${name}_merged.sam
        samtools view ${name}_2nd_merged.bam > ${name}_2nd_merged.sam
        >&2 echo "${name}_merged.bam and  ${name}_2nd_merged.bam differ"
    fi
done

# single end
out="single"
. $(dirname $0)/simulate --numReadGroups 10 --fixedSeed 143 \
  --seqType single --out $out --logFile $out.out

echo "readGroup seqType seqCycles" > rgs_single.txt
echo "SimReadGroup1 single 100" >> rgs_single.txt
echo "SimReadGroup3 single 100" >> rgs_single.txt
echo "SimReadGroup5 single 100" >> rgs_single.txt
echo "SimReadGroup7 single 100" >> rgs_single.txt
echo "SimReadGroup9 single 100" >> rgs_single.txt

out="merge"
$atlas --task mergeOverlappingReads \
       --bam single.bam --readGroupSettings rgs_single.txt \
       --fixedSeed 144 --out $out --logFile $out.out 2> $out.eout

4.5.4 Additional Information

Merging Methods

none No merging.
middle Will keep half of the overlapping positions of each mate.
firstMate Will merge first mate at overlapping positions.
secondMate Will merge second mate at overlapping positions.
highestQuality Will keep read with the highest minimum quality at overlapping positions.
randomRead Will keep random read for all overlapping positions.