12.5 Pallas
Pallas is the most flexible submodule in the ATLAS-Pipeline.
12.5.1 Input
We strongly advise you to run the Perses module, which estimates base quality recalibration parameters and merges paired-end reads, before running Pallas.
If you want to use all samples from a previous Perses run, you can define sampleFile: fromPerses. The pipeline will automatically take the output tables produced by Perses as its input file.
If you want to change the automatically produced tables to define readgroup-merging, we advise copying them to another place (like supporting_files/samples_Pallas.tsv) so your changes are not overwritten in case you decide to re-run the Gaia or Rhea pipeline.
If you want to prepare the table by hand (e.g. because you already have filtered BAM files or PMD and recal files at hand), prepare a tab-delimited table with the columns as indicated below.
Example: We want to analyze the following files:
Sample1 and sample2 are of very low depth and did not contain enough data for a reliable PMD and recal analysis. Because they come from the same sampling site/area and were sequenced on the same sequencing run, we have estimated the parameters outside of the pipeline for a merged BAM file from sample1 and sample2. These parameters can now be provided with the optional columns PMDFile and recalFile. In our example, the input file could look similar to this:
Sample | Path | PMDFile | recalFile |
sample1 | /path/to/s1/ | /path/to/PMDfile.txt | /path/to/recalFile.txt |
sample2 | /path/to/s2/ | /path/to/PMDfile.txt | /path/to/recalFile.txt |
sample3 | /path/to/s3/ | F | F |
sample4 | /path/to/s4/ | F | F |
sample5 | /path/to/s5/ | F | F |
The order of the table columns can be changed. Additional columns can be present. Columns PMDFile and recalFile are optional. The table can contain comments (starting with ‘#’) but no whitespace.
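The table rules above can be checked before starting a run. The following is a minimal Python sketch and not part of the ATLAS-Pipeline: the function name read_sample_table is hypothetical, the columns Sample and Path are treated as required because only PMDFile and recalFile are described as optional, and the value F is assumed to mean "no file provided", as in the example table. Columns are looked up by name, so their order does not matter and additional columns are ignored.

```python
import csv
import io

REQUIRED = ("Sample", "Path")
OPTIONAL = ("PMDFile", "recalFile")


def read_sample_table(text):
    """Parse a tab-delimited Pallas sample table into a list of dicts."""
    # Drop comment lines (starting with '#') and empty lines before parsing.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing column(s): " + ", ".join(missing))
    samples = []
    for row in reader:
        entry = {"Sample": row["Sample"], "Path": row["Path"]}
        for col in OPTIONAL:
            value = row.get(col)
            # Assumption: an absent column or the value "F" means no file.
            entry[col] = None if value in (None, "", "F") else value
        samples.append(entry)
    return samples


example = (
    "# hand-made sample table\n"
    "Sample\tPath\tPMDFile\trecalFile\n"
    "sample1\t/path/to/s1/\t/path/to/PMDfile.txt\t/path/to/recalFile.txt\n"
    "sample3\t/path/to/s3/\tF\tF\n"
)
for s in read_sample_table(example):
    print(s["Sample"], s["PMDFile"], s["recalFile"])
```

A table without a Sample or Path column raises a ValueError, which surfaces a malformed header before any pipeline job is started.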
The config file has to be provided in YAML format. The same example as below can also be found in examples/example_config_Pallas.yaml. To use it as a template, make sure to copy it to a new location, otherwise it will be overwritten once you update the pipeline.
Compulsory fields are:
- runScript: Pallas
- sampleFile: give location to sample file or specify fromPerses (see also Sample file)
- atlas: location of atlas executable
- ref: location of reference fasta file
runScript: Pallas
sampleFile: fromPerses
atlas: ../atlas/build/atlas
ref: ../Reference/hs37d5.fa
# optional parameters
# atlasParams: general atlas parameters applying to all jobs unless specified within the job parameters. default: ""
# recal: update base quality scores. If enabled, ATLAS-Pipeline will assume recal parameter estimation by Perses workflow.
# You can also provide one file with recal parameters for all samples involved. Can be (T/F/<recal_params>). default=F
# pmd: take post-mortem-damage patterns into account. If enabled, ATLAS-Pipeline will assume PMD parameter estimation by Perses workflow.
# You can also provide one file with PMD parameters for all samples involved. Can be (T/F/<pmd_params>). default=F
# parallelize: run glf and major minor tasks in parallel for each chromosome. MajorMinor files will then be concatenated into one outfile. default=F
# binCount: if parallelize=T you can additionally set a number of bins for parallelization. This makes sense if your number of chromosomes/contigs is very high. default=F
# chromPar: if you want to restrict your parallelization on a certain set of chromosomes, add them here as a list. example: [1,2,3,4]. If "F", default=all (all contigs in BAM header considered)
# glf: estimate genotype likelihoods. default=F
# glfParams: specific parameters to pass for Atlas-task glf. default: ""
# Attention: do not use option "--chr" if "parallelize" is used! use config parameter "chromPar:" instead.
# call: perform base calling with respective method. Can be (Bayes/MLE/AllelePresence/F). default=F
# callParams: specific parameters to pass for Atlas-task call. default: ""
# inbreeding: estimate inbreeding coefficient
# inbreedingParams: specific parameters to pass for Atlas-task inbreeding. default: ""
# theta: estimate theta either per window or genome wide. Can be (window/genomeWide/F). default=F
# thetaParams: specific parameters to pass for Atlas-task theta. default: ""
# maMi: produce major-minor VCF file (will automatically also produce glf-files). default=F
# maMiParams: specific parameters to pass for Atlas-task majorMinor. default: ""
# Attention: do not use option "--chr" if "parallelize" is used! use config parameter "chromPar:" instead.
# Attention: do not use option "--out"! use config parameter "maMiOut:" instead.
# maMiOut: save majorMinor vcf with a different prefix. this can be useful to produce outfiles with different filter options.
# You can also define which maMi file should be used for PCA by providing the prefix (without .vcf.gz) here.
# beagle: convert major-minor file to beagle format (as input for ANGSD). default=F
# will automatically produce glf-files and major-minor vcf file.
# beagleParams: specific parameters to pass for Atlas-task convertVCF to beagle. default: ""
# PCA: perform a PCA with PCAngsd. will automatically produce glf-files, major-minor vcf file and beagle file. default=F
# PCAngsd: location of PCAngsd executable. no default. Must be given.
# PCAParams: specific parameters to pass for PCAngsd. default: ""
# threads: threads for PCAngsd. default=8
# sex: only for human data: estimate genetic sex. default=F
# general parameters
# tmpDir: specify a temporary directory to be used. default: [$TMPDIR]
# outPath: specify the name of the results folder.
# In case you refer to prior modules (with fromGaia, fromRhea, etc.)
# these results must be in the same folder as well. default: results
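Several of the rules in the comments above can be checked mechanically before a run. The following Python sketch is an illustration only and not part of the ATLAS-Pipeline: the function names has_flag and check_pallas_config are hypothetical, the config is assumed to be already loaded into a dict, and switches are assumed to be enabled with the value T.

```python
COMPULSORY = ("runScript", "sampleFile", "atlas", "ref")


def has_flag(params, flag):
    """True if flag appears as its own token (or as flag=value) in params."""
    return any(tok == flag or tok.startswith(flag + "=")
               for tok in str(params).split())


def check_pallas_config(cfg):
    """Return a list of problems found in a Pallas config given as a dict."""
    problems = []
    for key in COMPULSORY:
        if not cfg.get(key):
            problems.append("missing compulsory field: " + key)
    if cfg.get("runScript") not in (None, "Pallas"):
        problems.append("runScript must be Pallas")
    # Assumption: a switch counts as enabled if it is "T" or a boolean True.
    parallel = cfg.get("parallelize") in ("T", True)
    for key in ("glfParams", "maMiParams"):
        if parallel and has_flag(cfg.get(key, ""), "--chr"):
            problems.append(key + ": use chromPar instead of --chr "
                            "when parallelize is set")
    if has_flag(cfg.get("maMiParams", ""), "--out"):
        problems.append("maMiParams: use maMiOut instead of --out")
    if cfg.get("PCA") in ("T", True) and not cfg.get("PCAngsd"):
        problems.append("PCA requires the PCAngsd executable location")
    return problems


cfg = {
    "runScript": "Pallas",
    "sampleFile": "fromPerses",
    "atlas": "../atlas/build/atlas",
    "ref": "../Reference/hs37d5.fa",
    "parallelize": "T",
    "glfParams": "--chr 1",
    "PCA": "T",
}
for problem in check_pallas_config(cfg):
    print(problem)
```

The example config deliberately combines parallelize with --chr and enables PCA without a PCAngsd location, so both checks report a problem.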
You can choose any of the following paths to be executed:
To perform PCA, you need to have PCAngsd pre-installed. To install it, activate the conda environment and run:
git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd
pip3 install -e .
This makes sure that PCAngsd can be called with the command pcangsd.
If you perform PCA and no major-minor file exists (either in the default location or under the prefix defined with maMiOut), the ATLAS-Pipeline will automatically produce the GLF and major-minor files.
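The automatic production of upstream files can be pictured as a small dependency chain. The Python sketch below is a simplified illustration under stated assumptions, not the pipeline's actual logic: the names REQUIRES and implied_steps are hypothetical, only the four options glf, maMi, beagle and PCA are covered, and the sketch does not check whether output files already exist.

```python
# Upstream step that each option triggers, following the config comments:
# PCA needs a beagle file, beagle needs a major-minor VCF, maMi needs glf-files.
REQUIRES = {"PCA": "beagle", "beagle": "maMi", "maMi": "glf"}
RUN_ORDER = ["glf", "maMi", "beagle", "PCA"]


def implied_steps(requested):
    """Return the requested options plus all steps they trigger, in run order."""
    needed = set()
    for step in requested:
        # Walk up the chain until a step has no upstream requirement.
        while step is not None and step not in needed:
            needed.add(step)
            step = REQUIRES.get(step)
    return [s for s in RUN_ORDER if s in needed]


print(implied_steps(["PCA"]))
```

Requesting only PCA therefore implies all four steps, while requesting only maMi implies glf and maMi.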