12.5 Pallas

Pallas is the most flexible submodule in the ATLAS-Pipeline.

12.5.1 Input

We strongly advise running the Perses module, which estimates base quality recalibration parameters and merges paired-end reads, before running Pallas. To use all samples from a previous Perses run, set sampleFile: fromPerses. The pipeline will then automatically use the output tables produced by Perses as the input file.
If you want to edit the automatically generated tables to define readgroup merging, we advise copying them to another location (such as supporting_files/samples_Pallas.tsv) so that your changes are not overwritten if you re-run the Gaia or Rhea pipeline.
If you want to prepare the table by hand (e.g. because you already have filtered BAM files or PMD and recal files at hand), prepare a tab-delimited table with the columns indicated below.

Example: We want to analyze the following files:


Because sample1 and sample2 are of very low depth and did not contain enough data for a reliable PMD and recal analysis, but come from the same sampling site/area and were sequenced on the same sequencing run, we estimated the parameters outside of the pipeline from a merged BAM file of sample1 and sample2. These parameters can now be provided via the optional columns PMDFile and recalFile. In our example, the input file could look like this:

Sample Path PMDFile recalFile
sample1 /path/to/s1/ /path/to/PMDfile.txt /path/to/recalFile.txt
sample2 /path/to/s2/ /path/to/PMDfile.txt /path/to/recalFile.txt
sample3 /path/to/s3/ F F
sample4 /path/to/s4/ F F
sample5 /path/to/s5/ F F

The order of the table columns can be changed. Additional columns can be present. Columns PMDFile and recalFile are optional. The table can contain comments (starting with ‘#’) but no whitespace.
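
The table rules above can be illustrated with a minimal Python sketch that reads such a file. It is not part of the ATLAS-Pipeline; the function name read_sample_table is hypothetical, and it assumes that Sample and Path are the required columns, that comments occupy whole lines, and that "F" means "no file provided", as in the example table.

```python
import csv
import io

REQUIRED = ("Sample", "Path")        # assumed to be the required columns
OPTIONAL = ("PMDFile", "recalFile")  # optional columns named in the text


def read_sample_table(text):
    """Parse a tab-delimited sample table into a list of dicts (illustrative only)."""
    # Skip empty lines and whole-line comments starting with '#'.
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    # DictReader looks columns up by name, so their order does not matter
    # and additional columns are simply ignored below.
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")
    samples = []
    for row in reader:
        entry = {c: row[c] for c in REQUIRED}
        for c in OPTIONAL:
            # Assumption: an absent column or the value "F" means no file.
            value = row.get(c) or "F"
            entry[c] = None if value == "F" else value
        samples.append(entry)
    return samples


if __name__ == "__main__":
    example = (
        "# samples for Pallas\n"
        "Sample\tPath\tPMDFile\trecalFile\n"
        "sample1\t/path/to/s1/\t/path/to/PMDfile.txt\t/path/to/recalFile.txt\n"
        "sample3\t/path/to/s3/\tF\tF\n"
    )
    for sample in read_sample_table(example):
        print(sample)
```

Looking columns up by name rather than by position is what allows reordered and additional columns to be handled without special cases.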

The config file has to be provided in YAML format. The example below can also be found in examples/example_config_Pallas.yaml. To use it as a template, copy it to a new location; otherwise it will be overwritten when you update the pipeline. Compulsory fields are:

  • runScript: Pallas
  • sampleFile: location of the sample file, or fromPerses (see also Sample file)
  • atlas: location of atlas executable
  • ref: location of reference fasta file

runScript: Pallas
sampleFile: fromPerses
atlas: ../atlas/build/atlas 
ref: ../Reference/hs37d5.fa 

# optional parameters
# atlasParams:         general atlas parameters applying to all jobs unless specified within the job parameters. default: ""

# recal:               update base quality scores. If enabled, ATLAS-Pipeline will assume recal parameter estimation by Perses workflow. 
#                      You can also provide one file with recal parameters for all samples involved. Can be (T/F/<recal_params>). default=F
# pmd:                 take post-mortem-damage patterns into account. If enabled, ATLAS-Pipeline will assume PMD parameter estimation by Perses workflow. 
#                      You can also provide one file with PMD parameters for all samples involved. Can be (T/F/<pmd_params>). default=F
# parallelize:         run glf and major minor tasks in parallel for each chromosome. MajorMinor files will then be concatenated into one outfile. default=F
#     binCount:        if parallelize=T you can additionally set a number of bins for parallelization. This makes sense if your number of chromosomes/contigs is very high. default=F
#     chromPar:        to restrict parallelization to a certain set of chromosomes, list them here. example: [1,2,3,4]. default=F (all contigs in the BAM header are considered)
# glf:                 estimate genotype likelihoods. default=F
# glfParams:           specific parameters to pass for Atlas-task glf. default: ""
#                      Attention: do not use option "--chr" if "parallelize" is used! use config parameter "chromPar:" instead.
# call:                perform base calling with respective method. Can be (Bayes/MLE/AllelePresence/F). default=F
# callParams:          specific parameters to pass for Atlas-task call. default: ""
# inbreeding:          estimate inbreeding coefficient
# inbreedingParams:    specific parameters to pass for Atlas-task inbreeding. default: ""
# theta:               estimate theta either per window or genome wide. Can be (window/genomeWide/F). default=F
# thetaParams:         specific parameters to pass for Atlas-task theta. default: ""
# maMi:                produce major-minor VCF file (will automatically also produce glf-files). default=F
# maMiParams:          specific parameters to pass for Atlas-task majorMinor. default: ""
#                      Attention: do not use option "--chr" if "parallelize" is used! use config parameter "chromPar:" instead.
#                      Attention: do not use option "--out"! use config parameter "maMiOut:" instead.
# maMiOut:             save majorMinor vcf with a different prefix. this can be useful to produce outfiles with different filter options.
#                      You can also define which maMi file should be used for PCA by providing the prefix (without .vcf.gz) here.
# beagle:              convert major-minor file to beagle format (as input for ANGSD). default=F
#                      will automatically produce glf-files and major-minor vcf file.
# beagleParams:        specific parameters to pass for Atlas-task convertVCF to beagle. default: ""
# PCA:                 perform a PCA with pcANGSD. will automatically produce glf-files, major-minor vcf file and beagle file. default=F
#    PCAngsd:          location of PCAngsd executable. no default. Must be given.
#    PCAParams:        specific parameters to pass for PCAngsd. default: ""
#    threads:          threads for PCAngsd. default=8
# sex:                 only for human data: estimate genetic sex. default=F
# general parameters
# tmpDir:             specify a temporary directory to be used. default: [$TMPDIR]
# outPath:            specify the name of the results folder.
#                     In case you refer to prior modules (with fromGaia, fromRhea, etc.)
#                     these results must be in the same folder as well. default: results
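
As a quick sanity check before starting a run, the compulsory fields listed above can be verified programmatically. The following Python sketch is only an illustration and not part of the ATLAS-Pipeline: the function name check_pallas_config is hypothetical, and it assumes the config file has already been loaded into a dictionary by a YAML parser of your choice.

```python
COMPULSORY_FIELDS = ("runScript", "sampleFile", "atlas", "ref")


def check_pallas_config(config):
    """Return a list of problems found in an already parsed config dict."""
    # A field counts as missing if it is absent or empty.
    problems = [f"missing compulsory field: {key}"
                for key in COMPULSORY_FIELDS if not config.get(key)]
    # For this module the documentation requires runScript to be Pallas.
    run_script = config.get("runScript")
    if run_script and run_script != "Pallas":
        problems.append("runScript must be 'Pallas'")
    return problems


if __name__ == "__main__":
    config = {
        "runScript": "Pallas",
        "sampleFile": "fromPerses",
        "atlas": "../atlas/build/atlas",
        "ref": "../Reference/hs37d5.fa",
    }
    print(check_pallas_config(config))  # prints []
```

The check only looks at the presence of the fields; it does not test whether the atlas executable or the reference FASTA actually exist at the given locations.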

You can choose any of the following paths to be executed:

To perform PCA, PCAngsd must be installed beforehand. To install it, activate the conda environment and run:

git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd
pip3 install -e .

This ensures that PCAngsd can be called system-wide with the command pcangsd.
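
Whether the pcangsd command is actually reachable from the active environment can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name find_executable is hypothetical.

```python
import shutil


def find_executable(name="pcangsd"):
    """Return the full path of `name` if it is found on PATH, otherwise None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("pcangsd was not found on PATH; is the conda environment active?")
    else:
        print(f"pcangsd found at {path}")
```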

If you perform PCA and no MajorMinor file exists (either in the default location or at the location defined with maMiOut), the ATLAS-Pipeline will automatically produce the GLF and MajorMinor files.
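
The way enabled tasks pull in their upstream tasks, as described in the config comments for maMi, beagle and PCA, can be pictured with the following Python sketch. It is a simplified illustration, not the pipeline's actual implementation: the names IMPLIES and tasks_to_run are hypothetical, and the sketch ignores the check for already existing output files.

```python
# Each task maps to the upstream task whose output it needs,
# following the descriptions of maMi, beagle and PCA in the config comments.
IMPLIES = {
    "PCA": ("beagle",),
    "beagle": ("maMi",),
    "maMi": ("glf",),
}


def tasks_to_run(config):
    """Expand the enabled tasks with all upstream tasks they trigger."""
    # A task is treated as disabled if it is absent or set to F/False.
    enabled = {task for task in ("glf", "maMi", "beagle", "PCA")
               if config.get(task, "F") not in ("F", False, None)}
    stack = list(enabled)
    while stack:
        for dependency in IMPLIES.get(stack.pop(), ()):
            if dependency not in enabled:
                enabled.add(dependency)
                stack.append(dependency)
    return enabled


if __name__ == "__main__":
    print(sorted(tasks_to_run({"PCA": "T"})))
```

With only PCA enabled, the sketch reports glf, maMi and beagle as additional tasks, which matches the statement that these files are produced automatically when they are needed.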