4.5 mergeOverlappingReads
Merging paired-end reads in BAM file
mergeOverlappingReads
should be run before any consecutive variant discovery or population genetic tool to merge paired end reads. You need to specify which read groups should be considered for merging. Others will just be written to the BAM as they are.
4.5.1 Input
Required inputs :
--bam Input_bam_file.bam |
Input bam file. |
--readGroupSettings Input_text_file.txt |
Input text file specifying read group settings: the names of the read group to be considered, if they are single-end read, paired or mixed, and if they are single-end or paired their maximum cycle number (separated by any white space). The read group names can be found in the header of your BAM file, but note that ‘ID:’ is not part of the read group name. If you do not know the maximum cycle number the genome was sequenced at, you can find the maximum read length in the BAM file with the task BAMDiagnostics. If you do not know whether a read group was sequenced with a paired or single-end protocol, or whether it is mixed (i.e. contains reads from both paired and single-end sequencing) you can check the SAM flags of the reads. If you specify the wrong type, ATLAS will throw an error saying that it found nonsensical settings for an offending read. |
Optional inputs :
--blacklist blacklist.txt |
Input text file to filter out specific reads (each on a new line) provided as a list that should be ignored and just written to file. Default = nothing specific is filtered out. |
--recal OR --RGInfo sample_RGInfo.json |
Quality score recalibration file (see estimateErrors for further information). |
--pmd OR --RGInfo sample_RGInfo.json |
Post-mortem damage parameters (see estimateErrors for further information). |
Specific Parameters :
--mergingMethod method_to_be_used_for_merging |
To specify the method to be used for merging. Options: middle, firstMate, secondMate, highestQuality, randomRead. Default = middle. |
--outQual integer_1,integer_2 |
To constrain the quality scores to the indicated range when writing alignments. Default = Will use the full range of quality scores when writing alignments. |
--writeBinnedQualities |
To write Illumina-binned quality scores. Default = Writes raw quality scores. |
--keepOrphans |
To keep orphaned reads. Default = Will filter out orphaned reads. |
--acceptedDistance integer_value |
To specify distance up-to which mates will not be considered orphans. Default = 2000 bp. |
--removeSoftClippedBases |
To remove all softclipped bases. Default = Will not remove softclipped bases. |
Engine parameters that are common to all tasks can be found here.
4.5.2 Output
*_merged.bam | A BAM file with merged reads. |
*_merged.bam.bai | Index files for merged BAM files. |
*_filterSummary.txt | A text file with filter summary. |
4.5.3 Usage Example
#! /bin/bash
. $(dirname $0)/find_atlas
echo '{"RG1":{"seqType": "paired"}, "RG2":{"seqType": "single"}, "RG3":{"seqType": "paired"}, "RG4":{"seqType": "single"}, "RG5":{"seqType": "paired"}}' > sim.json
# paired end
out="simulate"
$atlas --task simulate --RGInfo sim.json --chrLength 111111 \
--fixedSeed 140 --out $out --logFile $out.out 2> $out.eout
echo "RG1 RG2" > rgs.txt
out="mergeRG"
$atlas --task mergeRG --bam simulate.bam --readGroups rgs.txt \
--fixedSeed 141 --out $out --logFile $out.out 2> $out.eout
bam=mergeRG_mergedRG.bam
out="simulate_bd"
$atlas --task BAMDiagnostics --bam $bam \
--fixedSeed 142 --logFile $out.out 2> $out.eout
out="simulate_tt"
$atlas --task transitionTable --bam $bam --fasta simulate.fasta \
--fixedSeed 143 --out $out --logFile $out.out 2> $out.eout
out="simulate_an"
$atlas --task analysePairs --bam $bam \
--fixedSeed 144 --out $out --logFile $out.out 2> $out.eout
for name in "middle" "keepFirst" "keepSecond" "keepFwd" "keepRev" "random"; do
out="$name"
$atlas --task mergeOverlappingReads --mergingMethod $name --bam $bam \
--fixedSeed 150 --out $out --logFile $out.out 2> $out.eout
out="${name}_bd"
$atlas --task BAMDiagnostics --bam ${name}_merged.bam \
--fixedSeed 151 --logFile $out.out 2> $out.eout
out="${name}_tt"
$atlas --task transitionTable --bam ${name}_merged.bam --fasta simulate.fasta \
--fixedSeed 152 --out $out --logFile $out.out 2> $out.eout
out="${name}_an"
$atlas --task analysePairs --bam ${name}_merged.bam \
--fixedSeed 153 --out $out --logFile $out.out 2> $out.eout
out="${name}_2nd"
$atlas --task mergeOverlappingReads --mergingMethod $name --bam ${name}_merged.bam \
--fixedSeed 154 --out $out --logFile $out.out 2> $out.eout
if ! diff -q <(samtools view ${name}_merged.bam) <(samtools view ${name}_2nd_merged.bam) > /dev/null; then
samtools view ${name}_merged.bam > ${name}_merged.sam
samtools view ${name}_2nd_merged.bam > ${name}_2nd_merged.sam
>&2 echo "${name}_merged.bam and ${name}_2nd_merged.bam differ"
fi
done
4.5.4 Additional Information
Merging Methods
none |
No merging. |
middle |
Will keep half of the overlapping positions of each mate. |
firstMate |
Will merge first mate at overlapping positions. |
secondMate |
Will merge second mate at overlapping positions. |
highestQuality |
Will keep read with the highest minimum quality at overlapping positions. |
randomRead |
Will keep random read for all overlapping positions. |