4.5 mergeOverlappingReads
Merging paired-end reads in BAM file
mergeOverlappingReads
should be run before any consecutive variant discovery or population genetic tool to merge paired end reads. You need to specify which read groups should be considered for merging. Others will just be written to the BAM as they are.
4.5.1 Input
Required inputs :
--bam Input_bam_file.bam |
Input bam file. |
--readGroupSettings Input_text_file.txt |
Input text file specifying read group settings: the names of the read group to be considered, if they are single-end read, paired or mixed, and if they are single-end or paired their maximum cycle number (separated by any white space). The read group names can be found in the header of your BAM file, but note that ‘ID:’ is not part of the read group name. If you do not know the maximum cycle number the genome was sequenced at, you can find the maximum read length in the BAM file with the task BAMDiagnostics. If you do not know whether a read group was sequenced with a paired or single-end protocol, or whether it is mixed (i.e. contains reads from both paired and single-end sequencing) you can check the SAM flags of the reads. If you specify the wrong type, ATLAS will throw an error saying that it found nonsensical settings for an offending read. |
Optional inputs :
--blacklist blacklist.txt |
Input text file to filter out specific reads (each on a new line) provided as a list that should be ignored and just written to file. Default = nothing specific is filtered out. |
--recal OR --RGInfo sample_RGInfo.json |
Quality score recalibration file (see estimateErrors for further information). |
--pmd OR --RGInfo sample_RGInfo.json |
Post-mortem damage parameters (see estimateErrors for further information). |
Specific Parameters :
--mergingMethod method_to_be_used_for_merging |
To specify the method to be used for merging. Options: middle, firstMate, secondMate, highestQuality, randomRead. Default = middle. |
--outQual integer_1,integer_2 |
To constrain the quality scores to the indicated range when writing alignments. Default = Will use the full range of quality scores when writing alignments. |
--writeBinnedQualities |
To write Illumina-binned quality scores. Default = Writes raw quality scores. |
--keepOrphans |
To keep orphaned reads. Default = Will filter out orphaned reads. |
--acceptedDistance integer_value |
To specify distance up-to which mates will not be considered orphans. Default = 2000 bp. |
--removeSoftClippedBases |
To remove all softclipped bases. Default = Will not remove softclipped bases. |
Engine parameters that are common to all tasks can be found here.
4.5.2 Output
*_merged.bam | A BAM file with merged reads. |
*_merged.bam.bai | Index files for merged BAM files. |
*_filterSummary.txt | A text file with filter summary. |
4.5.3 Usage Example
#! /bin/bash
. $(dirname $0)/find_atlas
# paired end
out="paired"
. $(dirname $0)/simulate --numReadGroups 10 --fixedSeed 140 \
--seqType paired --out $out --logFile $out.out
echo "readGroup seqType seqCycles" > rgs_paired.txt
echo "SimReadGroup1 paired 100" >> rgs_paired.txt
echo "SimReadGroup3 paired 100" >> rgs_paired.txt
echo "SimReadGroup5 paired 100" >> rgs_paired.txt
echo "SimReadGroup7 paired 100" >> rgs_paired.txt
echo "SimReadGroup9 paired 100" >> rgs_paired.txt
for name in "middle" "firstMate" "secondMate" "highestQuality" "randomRead"; do
out="$name"
$atlas --task mergeOverlappingReads --mergingMethod $name \
--bam paired.bam --readGroupSettings rgs_paired.txt\
--fixedSeed 141 --out $out --logFile $out.out 2> $out.eout
out="${name}_2nd"
$atlas --task mergeOverlappingReads --mergingMethod $name \
--bam ${name}_merged.bam --readGroupSettings rgs_paired.txt\
--fixedSeed 142 --out $out --logFile $out.out 2> $out.eout
if ! diff -q <(samtools view ${name}_merged.bam) <(samtools view ${name}_2nd_merged.bam) > /dev/null; then
samtools view ${name}_merged.bam > ${name}_merged.sam
samtools view ${name}_2nd_merged.bam > ${name}_2nd_merged.sam
>&2 echo "${name}_merged.bam and ${name}_2nd_merged.bam differ"
fi
done
# single end
out="single"
. $(dirname $0)/simulate --numReadGroups 10 --fixedSeed 143 \
--seqType single --out $out --logFile $out.out
echo "readGroup seqType seqCycles" > rgs_single.txt
echo "SimReadGroup1 single 100" >> rgs_single.txt
echo "SimReadGroup3 single 100" >> rgs_single.txt
echo "SimReadGroup5 single 100" >> rgs_single.txt
echo "SimReadGroup7 single 100" >> rgs_single.txt
echo "SimReadGroup9 single 100" >> rgs_single.txt
out="merge"
$atlas --task mergeOverlappingReads \
--bam single.bam --readGroupSettings rgs_single.txt \
--fixedSeed 144 --out $out --logFile $out.out 2> $out.eout
4.5.4 Additional Information
Merging Methods
none |
No merging. |
middle |
Will keep half of the overlapping positions of each mate. |
firstMate |
Will merge first mate at overlapping positions. |
secondMate |
Will merge second mate at overlapping positions. |
highestQuality |
Will keep read with the highest minimum quality at overlapping positions. |
randomRead |
Will keep random read for all overlapping positions. |