12.3 Rhea
The Rhea pipeline will use local realignment (GATK3.8) to realign around potential In-Del positions. As with low-depth data, especially from ancient or non-model organisms, the true range of InDel positions is not always known, it is good practice to perform local realignment on the whole dataset at once (increasing the depth). But with very large sample sizes this can increase computational load and adding individual samples at a later stage wil result in re-starting the whole process again.
With the Rhea pipeline, you can provide two fixed subsets of your studies samples, and all individual samples will be locally realigned using this exact set, allowing not only parallelization but also the adding or removal of individual samples at a later stage:
a) A target set to identify potential target InDel-positions.
b) A guidance set to realign alongside every sample in a multi-alignment.
If the mean depth of your samples has high variation, we advise to downsample the BAMfiles in your guidance set to the depth of your lowest sample. This will prevent local realignment to be performed in favor of the sample with the highest depth.
These two sets don’t need to overlap. Further, the guidance set can contain samples from outside your analysis or downsampled copies of samples from your study.
12.3.1 Input
Sample file
If you want to use all samples from a previous Gaia run as targets and guidance bams, you can define sampleFile: fromGaia. The pipeline will automatically take the Gaia_outTable.tsv as inputfile. Beware that this could be computationally very expensive and might not increase performance.
If you want to change the Gaia table to select target and guidance samples manually, we advise to copy it to another place (like supporting_files/) so your changes are not overwritten in case you decide to re-run the Gaia pipeline.
If you want to prepare the table by hand (e.g. because you have already aligned BAMfiles at hand), prepare a tab delimited table with the columns as indicated below.
Example :
We want to perform local realignment on the following files:
/path/to/s1/sample1.bam
/path/to/s2/sample2.bam
/path/to/s3/sample3.bam
/path/to/s4/sample4.bam
/path/to/s5/sample5.bam
/path/to/s6/sample6.bam
/path/to/s7/sample7.bam
/path/to/s8/sample8.bam
/path/to/s9/sample9.bam
/path/to/s10/sample10.bam
We want to use eight samples as a target set (samples 1 to 8) and five samples as guidance set (samples 5 to 10). But as we have quite different sequencing depths, we downsample the five BAMfiles with atlas downsample (link!) and save the results e.g. for sample 5 to /path/to/downsampled/sample5_4X.bam.
Also we want to include 5 modern BAMfiles (downsampled to 4x) to the guidance list, which should not undergo local realignment themselves (here marked as samples A to E).
So our sample table should look like this:
Sample | Path | LocalReal | Target | Guidance | GuidanceAlias | GuidancePath |
---|---|---|---|---|---|---|
#comment | ||||||
sample1 | /path/to/s1/ | T | T | F | F | F |
sample2 | /path/to/s2/ | T | T | F | F | F |
sample3 | /path/to/s3/ | T | T | F | F | F |
sample4 | /path/to/s4/ | T | T | F | F | F |
sample5 | /path/to/s5/ | T | T | T | sample5_4X | /path/to/downsampled/ |
sample6 | /path/to/s6/ | T | T | T | sample6_4X | /path/to/downsampled/ |
sample7 | /path/to/s7/ | T | T | T | sample7_4X | /path/to/downsampled/ |
sample8 | /path/to/s8/ | T | T | T | sample8_4X | /path/to/downsampled/ |
sample9 | /path/to/s9/ | T | F | T | sample9_4X | /path/to/downsampled/ |
sample10 | /path/to/s10/ | T | F | T | sample10_4X | /path/to/downsampled/ |
sampleA | /path/to/mod/ | F | F | T | F | F |
sampleB | /path/to/mod/ | F | F | T | F | F |
sampleC | /path/to/mod/ | F | F | T | F | F |
sampleD | /path/to/mod/ | F | F | T | F | F |
sampleE | /path/to/mod/ | F | F | T | F | F |
The order of the table columns can be changed. Additional columns can be present. The table can contain comments (starting with ‘#’) but no whitespace.
- Sample
The name of your BAMfile without the .bam suffix
- Path
The path to your BAMfile (don’t forget the final ‘/’)
- LocalReal
Indicate if this sample should be used to perform local realignment (and further downstream analysis). Should only be “F” for samples outside of your analysis that are being used in the guidance set. - Target
Indicate if this sample should be used to find InDel positions (target set). Make sure to select samples of a high variety (e.g. from all populations or geographically well distributed) so you capture many common InDel positions.
- Guidance
Indicate if this sample should be used to perform multi-alignment with every sample of your analysis during local-realignment (guidance set).
- GuidanceAlias
If Guidance is “T”, and you downsampled this sample to a new location to use in the guidance set, state here how the downsampled sample is called (without .bam suffix) - GuidancePath
Same as “GuidanceAlias”, but provide the path of the downsampled BAMfile instead.
config file
The config file has to be provided in yaml format. The same example as below can also be found in examples/example_config_Rhea.yaml. To use it as a template, make sure to copy it to a new location, otherwise it will be overwritten once you update the pipeline.
Compulsatory fields are:
- runScript: Rhea
- sampleFile: give location to sample file or specify fromGaia (see also Sample file)
- atlas: location of atlas executable
- ref: location of reference fasta file
# Which module do you want to run?
# select between Gaia, Rhea, Perses and Pallas
runScript: Rhea
###################################################################################
#-----------------------------------Rhea------------------------------------------#
###################################################################################
# set sampleFile to "fromGaia" if you want to use the outfile from Gaia pipeline.
# Attention! this can lead to extreme computational load. Check manual for details.
sampleFile: fromGaia
#sampleFile: "supporting_files/Gaia_outTable.tsv"
ref: ../../Reference/hs37d5.fa
###################################################################################
###################################################################################
#optional parameters:
#knownSites : full path to vcf-files containing known In-Del positions.
# Pass multiple files comma separated. default: "F"
#
#parallelize : run glf and major minor tasks in parallel for each chromosome.
# MajorMinor files will then be concatenated into one outfile.
# default=F
# binCount : only applies if parallelize=T.
# You can additionally set a number of bins for parallelization.
# This makes sense if your number of chromosomes/contigs is very high.
# default=F
#
#gatkParam : specific parameters to pass for all GATK tasks
# (target creator and local realignment).
# default: ""
#
#gatkInit: three options:
# "install" : get GATK 3.8 from online source,
# install locally and initialize.
# This is the default
# "/path/to/GenomeAnalysisTK.jar" : if you have GATK 3.8 installed locally,
# this version will be registered
# "module load ..." : GATK 3.8 is must be loaded
# via environment modules
# specify the exact module load command'
#
###################################################################################
# general parameters
# tmpDir: specify a temporary directory to be used. default: [$TMPDIR]
#
# outPath: specify the name of the results folder.
# In case you refer to prior modules (with fromGaia, fromRhea, etc.)
# these results must be in the same folder as well. default: results
#
parallelize execution
You can parallelize the execution based on your reference genome. In the config-file, specify parallelize: T to execute per contig. In case you have a large number of contigs, you can bin them for parallelization. For this, specify binCount: and the number of bins. Bin sizes will be calculated, taking contig sizes into account.
use GATK
As the program GATK3.8 needs a valid license on the system, it is not sufficient to load the conda environment. There are three options how to invoke GATK3.8:
- You let the ATLAS-Pipeline install a local version of GATK within the pipeline folder. This is the default behavior.
- You have GATK3.8 pre-installed on your system. You can provide the full path to your GenomeAnalysisTK.jar file with gatkInit: and it will be registered within the conda environment by the pipeline
- You are working on a cluster environment, where you can load GATK3.8 via a
module load
command. Specify the loading command in gatkInit:.
Note: the above does not apply if you invoke the whole pipeline with -e (envmodules).