12.2 Gaia

This part of the workflow handles the raw data analysis, from unaligned FASTQ files to aligned BAM files. It includes initial quality checks, adapter trimming with trim-galore, alignment with BWA-MEM or bwa aln, duplicate marking with picard-tools, and some preliminary filtering. It also produces input files for subsequent modules (Rhea or Perses).

12.2.1 Input

Sample file
Prepare a tab delimited table with the following columns (case sensitive):

  • Sample - The prefix you want to give your final BAMfile
  • Lib - Library name. Use the same string for all files arising from the same PCR event (for example, if you have lane1 and lane2 files for the same library, or if you sequenced the same library multiple times). Duplicates will be marked among all files of the same sample that share the same ‘Lib’ identifier.
  • File - The actual prefix of each of your input files. No restriction on characters or signs. The suffix must follow the Illumina standard (_R1_001.fastq.gz)
  • Read - Specify either “R1” or “R2” for forward and reverse reads (if only R1 is given, single-end sequencing is assumed).
  • Path - Path to each file. No specific folder structure is needed. Don’t forget the trailing slash.

Assume these are your input-files:

/path/to/paired-end/filename1_R1_001.fastq.gz  
/path/to/paired-end/filename1_R2_001.fastq.gz  
/path/to/paired-end/filename2_R1_001.fastq.gz  
/path/to/paired-end/filename2_R2_001.fastq.gz  
/path/to/paired-end/filename3_R1_001.fastq.gz  
/path/to/paired-end/filename3_R2_001.fastq.gz  
/path/to/single-end/filename4_R1_001.fastq.gz  
/path/to/single-end/filename5_R1_001.fastq.gz  
/path/to/single-end/filename6_R1_001.fastq.gz  
/path/to/single-end/filename7_R1_001.fastq.gz  
/path/to/single-end/filename8_R1_001.fastq.gz  

Assume further that the files filename1 to filename3 were sequenced paired-end, while filename4 to filename8 were sequenced single-end. In addition, filename1 and filename2 arise from the same PCR event (as do the pairs filename3-filename4, filename5-filename6, and filename7-filename8). In this case, duplicates across those files must be removed. To accomplish this, use the same Lib identifier (any string you want) for files arising from the same PCR event. In the end, the Gaia pipeline will produce two BAMfiles named Sample1.bam and Sample2.bam, each containing one read group per filename.

Example samples table:

Sample Lib File Read Path
#comment
Sample1 LibA filename1 R1 /path/to/paired-end/
Sample1 LibA filename1 R2 /path/to/paired-end/
Sample1 LibA filename2 R1 /path/to/paired-end/
Sample1 LibA filename2 R2 /path/to/paired-end/
Sample1 LibB filename3 R1 /path/to/paired-end/
Sample1 LibB filename3 R2 /path/to/paired-end/
Sample1 LibB filename4 R1 /path/to/single-end/
Sample2 LibC filename5 R1 /path/to/single-end/
Sample2 LibC filename6 R1 /path/to/single-end/
Sample2 LibD filename7 R1 /path/to/single-end/
Sample2 LibD filename8 R1 /path/to/single-end/

The order of the table columns can be changed. Additional columns can be present. The table can contain comments (starting with ‘#’) but no whitespace.
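Before starting a run, it can help to check the sample table yourself. The following Python sketch is not part of the ATLAS-Pipeline; the function names (read_sample_table, layout_per_file, files_per_library) are hypothetical and the checks only mirror the rules described above. It reads a tab-delimited table, skips comment lines, verifies the five required columns, and reports which files are treated as paired-end and which files share a Lib identifier within a sample.

```python
import csv
import io
from collections import defaultdict

# Column names as listed above (case sensitive).
REQUIRED_COLUMNS = {"Sample", "Lib", "File", "Read", "Path"}


def read_sample_table(text):
    """Parse a tab-delimited sample table, skipping '#' comments and empty lines."""
    lines = [
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = list(reader)
    for row in rows:
        if row["Read"] not in ("R1", "R2"):
            raise ValueError(f"Read must be R1 or R2, got {row['Read']!r}")
        if not row["Path"].endswith("/"):
            raise ValueError(f"Path lacks a trailing slash: {row['Path']!r}")
    return rows


def layout_per_file(rows):
    """A file prefix with an R2 row counts as paired-end, otherwise single-end."""
    reads = defaultdict(set)
    for row in rows:
        reads[row["File"]].add(row["Read"])
    return {
        name: "paired-end" if "R2" in seen else "single-end"
        for name, seen in reads.items()
    }


def files_per_library(rows):
    """Group file prefixes by (Sample, Lib), the unit within which duplicates are marked."""
    groups = defaultdict(set)
    for row in rows:
        groups[(row["Sample"], row["Lib"])].add(row["File"])
    return {key: sorted(files) for key, files in groups.items()}


# A shortened version of the example table above.
EXAMPLE_TABLE = "\n".join([
    "Sample\tLib\tFile\tRead\tPath",
    "#comment",
    "Sample1\tLibA\tfilename1\tR1\t/path/to/paired-end/",
    "Sample1\tLibA\tfilename1\tR2\t/path/to/paired-end/",
    "Sample1\tLibB\tfilename3\tR1\t/path/to/paired-end/",
    "Sample1\tLibB\tfilename3\tR2\t/path/to/paired-end/",
    "Sample1\tLibB\tfilename4\tR1\t/path/to/single-end/",
])

rows = read_sample_table(EXAMPLE_TABLE)
print(layout_per_file(rows))
print(files_per_library(rows))
```

Grouping by the (Sample, Lib) pair makes it easy to spot a mistyped Lib string, which would otherwise silently split one library into two duplicate-marking groups.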

Config file
The config file has to be provided in YAML format. The example below can also be found in examples/example_config_Gaia.yaml. To use it as a template, make sure to copy it to a new location, otherwise it will be overwritten once you update the pipeline. Compulsory fields are: runScript: Gaia, sampleFile: (location of the sample file), atlas: (location of the atlas executable), ref: (location of the reference fasta file).

# Which module do you want to run?
# select between Gaia, Rhea, Perses and Pallas
runScript: Gaia

###################################################################################
#-----------------------------------GAIA------------------------------------------#
###################################################################################
sampleFile: supporting_files/samples_Gaia.tsv
atlas: ../atlas/build/atlas

#path to reference or bwa index location.
#If the ATLAS-Pipeline should create the index files, set bwaCreateIdx: T
ref: ../../Reference/hs37d5.fa

###################################################################################
# optional parameters
# threads:            default: 1
# javaMem:            increase java heap space. default:10G
# adapter:            indicate if adapter trimming should be performed. default: T
#   adapterSequence1: trimgalore will find standard adapters based on the data.
#                     If no standard adapters were used, add the sequence here.
#                     default: standardAdapters
#   adapterSequence2: same as adapterSequence1 but for second adapter
#                     (only applies to paired-end sequencing)
#   minLen:           shorter reads will be filtered out at trimming. default: 30
#   minQ:             reads with lower quality will be filtered out during adapter trimming.
#                     default: 0
#                     NOTE: if base quality recalibration is performed in Pallas step,
#                           we advise to keep this threshold at 0.
# aligner:            aligner software to be used.
#                     current choices: BWAmem and BWAaln. default: BWAmem
# bwaCreateIdx:       if your reference is not indexed yet and you want the pipeline
#                     to perform this step (default: F)
#
# mappingQ:           reads with lower mapping quality will be filtered out
#                     default: 30
# sequencingFacility: specify the CN flag in BAM headers. default: NA
# BAMDParam:          pass special parameters to BAMDiagnostics step from atlas.
#                     default: ""
#
###################################################################################
# general parameters
# tmpDir:             specify a temporary directory to be used. default: [$TMPDIR]
#
# outPath:            specify the name of the results folder.
#                     In case you refer to prior modules (with fromGaia, fromRhea, etc.)
#                     these results must be in the same folder as well. default: results
#
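A quick way to catch a missing compulsory field before launching the pipeline is sketched below in Python. This is an illustration only and not part of the ATLAS-Pipeline: read_flat_config and missing_keys are hypothetical helpers, and the parser is deliberately simplistic. It only understands flat "key: value" lines and treats everything after a ‘#’ as a comment, so it is not a substitute for a real YAML parser.

```python
# The four compulsory fields named in the text above.
REQUIRED_KEYS = ("runScript", "sampleFile", "atlas", "ref")


def read_flat_config(text):
    """Collect top-level 'key: value' pairs, ignoring comments and blank lines.

    Simplification: anything after '#' is dropped, so values that
    contain '#' are not supported by this sketch.
    """
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip()
    return config


def missing_keys(config):
    """Return the compulsory keys that are absent or have an empty value."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# The compulsory part of the example config above.
EXAMPLE_CONFIG = """
# Which module do you want to run?
runScript: Gaia
sampleFile: supporting_files/samples_Gaia.tsv
atlas: ../atlas/build/atlas
ref: ../../Reference/hs37d5.fa
"""

config = read_flat_config(EXAMPLE_CONFIG)
problems = missing_keys(config)
if problems:
    print("Missing compulsory fields:", ", ".join(problems))
else:
    print("All compulsory fields are set.")
```

Optional parameters that remain commented out are ignored by this check, which matches their role: they fall back to the defaults listed in the config comments.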

12.2.2 Output

The Gaia pipeline will produce the following final output-files:

  1. results/1.Gaia/02.fastqc/
    FastQC results per input-file. You can check your sequencing data quality in the *html summary.
    For further interpretation, refer to the FastQC documentation.
  2. results/1.Gaia/03.trimmed/
    Trimming report and unpaired sequences.
  3. results/1.Gaia/08.MkDup_per_lib/
    Final BAMfile per input file (pair).
    In the example above, this would be filename1.bam, filename2.bam, … filename8.bam.
  4. results/1.Gaia/10.MkDup_per_sample/
    Final merged BAMfile per sample name.
    In the example above, this would be Sample1.bam and Sample2.bam
  5. results/1.Gaia/x.Filestats and results/1.Gaia/x.SamStats
    Output files from atlas task BAMDiagnostics and samtools flagstat for each file / merged sample.
  6. results/1.Gaia/outfiles
    Two tables called FILE_COUNTS and SAM_COUNTS, each containing read statistics per file/sample.
    Also the table Gaia_outTable.tsv, a ready-to-use input table for subsequent modules (Rhea, Perses, Pallas).
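
The locations listed above can be assembled programmatically, for example to check after a run that everything is in place. The Python sketch below is an assumption-laden illustration, not pipeline code: expected_gaia_outputs is a hypothetical helper, it assumes that the outPath setting simply replaces the leading "results" folder, and it assumes the merged BAMfiles are named exactly <Sample>.bam as in the example. Actual file names may differ, so compare against a real run before relying on it.

```python
from pathlib import PurePosixPath


def expected_gaia_outputs(samples, out_path="results"):
    """Build the output locations described above for the given sample names.

    Assumes out_path replaces the default 'results' folder and that
    merged BAMfiles are named '<Sample>.bam'.
    """
    base = PurePosixPath(out_path) / "1.Gaia"
    outputs = {
        "fastqc_dir": base / "02.fastqc",
        "trimmed_dir": base / "03.trimmed",
        "per_file_bam_dir": base / "08.MkDup_per_lib",
        "out_table": base / "outfiles" / "Gaia_outTable.tsv",
    }
    for sample in samples:
        outputs[f"bam:{sample}"] = base / "10.MkDup_per_sample" / f"{sample}.bam"
    return {name: str(path) for name, path in outputs.items()}


for name, path in expected_gaia_outputs(["Sample1", "Sample2"]).items():
    print(f"{name}\t{path}")
```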