12.1 Requirements
- ATLAS-Pipeline runs on the command line. Support for SLURM clusters is included.
- The program ATLAS needs to be installed on your local machine. The
current pipeline runs with the alpha branch of ATLAS. You can switch
to the alpha branch by running the following command inside the atlas directory:
git checkout --track remotes/origin/alpha
- Conda (Miniconda or Anaconda) must be installed on your system.
12.1.1 Installation
Clone the repository to the location on your computer where the analysis
should be executed:
git clone https://bitbucket.org/wegmannlab/atlas-pipeline.git
The pipeline comes with its own conda environment, in which it works best (environment_X.yaml). To set up the environment, run:
conda env create -f environment_X.yml
Then activate your environment before executing the pipeline:
conda activate ATLAS-Pipeline_X
Set conda to use strict channel priorities:
conda config --set channel_priority strict
Attention: If snakemake is pre-installed, especially on cluster systems, an additional installation of snakemake via mamba may be necessary. After activating the environment, check your snakemake version with snakemake --version. It should be 7.8.3 or higher. If it is lower, run from within the environment:
mamba install -c bioconda snakemake
Deactivate and re-activate the conda environment, then re-check the snakemake version before you proceed.
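For example, assuming the environment name from above, the re-check could look like this:
conda deactivate
conda activate ATLAS-Pipeline_X
snakemake --version   # should now report 7.8.3 or higher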
12.1.2 Getting started
General syntax:
bash Atlas-Pipeline.sh -f [configfile.yaml] [options]
Be aware that if you are not running the pipeline on a SLURM cluster system, all log output will be printed to your stdout (which can get very long). To redirect it to a logfile, append &> log/[logname].txt to your command.
It is advised to first perform a dry-run with the -d option to test for potential input errors; see the example below.
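A typical first invocation might therefore look like this (the config file and log file names are only placeholders):
bash Atlas-Pipeline.sh -f config_Gaia.yaml -d                    # dry-run: check inputs and planned jobs without running anything
bash Atlas-Pipeline.sh -f config_Gaia.yaml &> log/Gaia_run.txt   # real run with all log output redirected to a file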
Unit Tests:
To test that the pipeline is properly set up and all modules are running, you can run a unit test. The config files to run the unit tests are in the folder tests/. You only need to adapt the path to your ATLAS executable in each of the config files; then you can start the unit tests.
Start by simulating a test set with:
./Atlas-Pipeline.sh -f tests/config_createTestset.yaml
This will create an artificial testset in results_testset/0.Testset.
Now you can run tests/config_Gaia-Testset.yaml, tests/config_Rhea-Testset.yaml, tests/config_Perses-Testset.yaml and tests/config_Pallas-Testset.yaml in that order, as shown below. Each module will produce the input files for the next module.
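Put together, the full unit-test sequence might look like this (assuming the ATLAS path has already been adapted in each config file):
./Atlas-Pipeline.sh -f tests/config_createTestset.yaml   # simulate the artificial test set
./Atlas-Pipeline.sh -f tests/config_Gaia-Testset.yaml    # Gaia module
./Atlas-Pipeline.sh -f tests/config_Rhea-Testset.yaml    # Rhea module
./Atlas-Pipeline.sh -f tests/config_Perses-Testset.yaml  # Perses module
./Atlas-Pipeline.sh -f tests/config_Pallas-Testset.yaml  # Pallas module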
Common parameters
On the command line:

| Option | Description |
|---|---|
| -h | open the help page |
| -f | specify the config file (required) |
| -d | perform a dry-run of the pipeline (it is advised to run a workflow with the dry-run option first, to test for potential input errors) |
| -p | additionally make a plot of the DAG structure (can become very large with high sample size) |
| -c | run on a SLURM cluster; a file with cluster parameters can be provided (default: cluster.json) |
| -j | (only on cluster) specify how many jobs should be submitted alongside (default: 50) |
| -s | pass snakemake commands (in quotes) to the snakemake command line (e.g. "--until rulename") |
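For illustration, the command-line options above can be combined; for example (the config file and rule name are placeholders):
bash Atlas-Pipeline.sh -f config_Rhea.yaml -p                      # additionally plot the DAG structure
bash Atlas-Pipeline.sh -f config_Rhea.yaml -s "--until rulename"   # pass extra arguments through to snakemake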
In the config files:

| Keyword | Description |
|---|---|
| tmpDir | temporary files will be stored in the defined directory (default: the system-defined $TEMP) |
| outPath | output results will be written to the specified folder; the default inputs (when using fromGaia or fromRhea sample files) will also be expected in this folder (default: results) |
Sample files
Each of the modules takes a sample file as input. This is a tab-separated table specified in the config file. The Gaia and Rhea modules each produce sample files that are compatible with all subsequent modules, since neither the order of the columns nor the presence of additional columns beyond those required by a module is of importance.
-f [config-file]
To run the ATLAS-Pipeline, you need to provide a config file with the -f flag. Here you specify all major information, input files and thresholds needed for your project. Prepare one config file per module (Gaia, Rhea, Perses, Pallas) and indicate the desired module with the runScript: keyword. You can find example config files in example_files/example.config.* and on the wiki pages of each module.
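As a starting point, you might copy one of the shipped example config files and adapt it; the concrete file names below are only illustrative:
ls example_files/example.config.*                        # list the shipped example config files
cp example_files/example.config.Gaia config_Gaia.yaml    # hypothetical names: copy one per module and adapt it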
-c (SLURM cluster support):
The ATLAS-Pipeline was specifically designed to run on a SLURM cluster, which can be invoked with the option -c.
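For example (the config and cluster file names are placeholders):
bash Atlas-Pipeline.sh -f config_Gaia.yaml -c mycluster.yaml -j 100   # run on the SLURM cluster, submitting up to 100 jobs alongside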
A cluster file named cluster.yaml is expected in your directory (to change this, specify a file after the -c option). The first entry of this file must be called __default__ and contain the defaults for all parameters you want to pass to your cluster system. All subsequent entries are named after their tasks within the pipeline (see example.cluster.yaml), and entries therein overwrite the default behavior for that specific rule.
Example:
If your standard SLURM header looks like this:
#SBATCH --err log/logfile-%A.err
#SBATCH --out log/logfile-%A.out
#SBATCH --mem=25G
#SBATCH --cpus-per-task=1
#SBATCH --time=100:00:00
#SBATCH --job-name=MyJob
#SBATCH --partition=pshort
Then your cluster.yaml could look like this:
{
"__default__" :
{
"time" : "3:00:00",
"mem" : "25G",
"out": "log/{rule}.{wildcards}.out",
"err": "log/{rule}.{wildcards}.out",
"job-name": "default",
"cpus-per-task": "1",
"partition": "pshort"
},
"fastqc" :
{
"time" : "03:00:00",
"mem" : "1G",
"job-name": "rawcount"
}
}
In this example, the rule fastqc inherits all default settings but specifies individual values for time, memory and the job name. Snakemake rule names and wildcards used in this pipeline (like sample names or bin numbers) can be accessed within curly brackets.
Note: not all jobs will be executed as cluster jobs. Computationally non-demanding jobs (like simple text file manipulation) will be executed directly within the main pipeline task.
A list of rule names can be found in the respective module chapters, as well as in the example.cluster.yaml.