3.2 Data from a Population
Let us first create a directory specific for this tutorial:
mkdir atlas-tutorial_B
cd atlas-tutorial_B
3.2.1 Simulating Population
We will start by simulating a population of 11 samples, adding some Postmortem Damage and Sequencing Errors. We will use the following, non-standard, site frequency spectrum:
echo '#SHAPE=<23>
0.96 0.005 0.01 0.0025 0.003 0.002 0.0016 0.0014 0.0013 0.001 0.001 0.0009 0.0008 0.00075 0.0007 0.00065 0.0006 0.0005 0.0006 0.0005 0.0005 0.0005 0.0004' > sfs.txt
atlas --task simulate --type SFS --sfs sfs.txt \
--sampleSize 11 --chrLength 123456 --depth 13 --ploidy 2 \
--recal "intercept[0.1];quality:polynomial[0.95,0.01]" \
--pmd "CT5:0.09*exp(-0.2*p)+0.01;GA3:0.11*exp(-0.2*p)+0.01"
3.2.2 Estimating the Errors
First, for each bam-file, we need to estimate the PMD and Sequencing errors. This time, we will estimate the default empiric error model (this may take a few minutes):
for bam in *.bam; do
atlas estimateErrors --minDeltaLL 0.1 --fasta ATLAS_simulations.fasta --bam $bam
done
This creates 11 json-files, called ATLAS_simulations_ind{1..11}_RGInfo.json
3.2.3 Genotype Likelihood Files
The bam-file format is a per-read file format, however, for most downstream analysis, and especially when comparing different samples, we want to do this per site (position on chromosome). For this, the Genotype Likelihood File Format is convenient, which stores for each site the likelihoods of the ten genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT.
Now, we can create the GLF files using the task GLF:
for bam in *.bam; do
name=${bam%.bam}
atlas GLF --bam $bam --RGInfo ${name}_RGInfo.json
done
The GLF-files are called *.glf.gz
, and the index-files *.glf.idx
. You can open the index-files with a text editor and verify that they all have the same number of chromosomes and the same length.
You can print a GLF-file using the following command:
atlas printGLF --glf ATLAS_simulations_ind5.glf.gz | less
The values of the likelihoods are 16-bit Phreds = -1000*log10(P))
, so a value of 301 would mean a log10Probability of -0.301 or a Probability of 0.5.
We will define a variable that holds all the glf-files:
samples=$(ls -1 *.glf.gz | paste -s -d ',' -)
3.2.4 Major and Minor Alleles
Now that we have the GLF-files, we can do various population tasks on them. Let’s start by Estimating major and minor alleles with the task majorMinor
atlas majorMinor --glf $samples
This creates the Variant Call Format file called ATLAS_majorMinor.vcf.gz
3.2.5 Site Allele Frequencies
We simulated the population with a given SFS, so let’s see if we can estimate it. First, we need to estimate the site allele frequencies using the task saf
:
atlas saf --glf $samples --fasta ATLAS_simulations.fasta
This creates saf-files. If you wish to view it, you can use the command realSFS print
of the angsd tool chain.
3.2.6 Site Frequency Spectrum
To infer the site frequency spectrum of our population, we will use the tool winsfs. Please follow the following instructions to install it:
https://github.com/malthesr/winsfs?tab=readme-ov-file#installation
Please set winsfs
as a global alias with:
alias winsfs="/path/to/your/winsfs/build/winsfs"
We can calculate and view the normalized site frequency spectrum using the same command:
winsfs saf.saf.idx | winsfs view --normalise
Is it similar to the one we simulated? What would the SFS look like without correcting the PMD and sequencing errors?