Salmon

How it ended up being named after a fish I have no idea, but I learnt it in Japan, which is a happy coincidence.

There are many commonly used mapping workflows, and benchmark papers are a good starting point for selecting a suitable one. It is painful but true that one has to try several before settling on the most suitable. Considering time, learning curve and the foreseeable continued support for the related packages, this manual focuses on the Salmon method.

Salmon has 2 quantification modes. Practically, without going into technical details, in the first (mapping-based) mode Salmon maps each fragment (the raw reads stored in the FASTQ file) to an indexed reference genome (quasi-mapping), counts the hit, then moves on to the next. In the second (alignment-based) mode, SAM/BAM alignment files are provided to Salmon, and Salmon produces the quantification from the alignment results. For the second mode, one does not need to index the reference genome with Salmon before running the quantification.
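As a rough sketch of how the two modes differ on the command line, the function below prints (but does not run) one call per mode. All file and folder names here (cDNA_index, reads.fastq, transcripts.fa, aln.bam, quant_mapping, quant_alignment) are placeholders, not files from this manual. Note that in the second mode the alignments are expected to be against the transcript sequences given with -t.

```shell
#!/bin/bash
# Sketch only: print (not run) the Salmon call for each quantification mode.
# All file and folder names below are placeholders.

salmon_cmd() {
    local mode="$1"
    case "${mode}" in
        mapping)
            # Mode 1: raw reads plus a Salmon index built beforehand
            echo "salmon quant -i cDNA_index -l A -r reads.fastq -o quant_mapping"
            ;;
        alignment)
            # Mode 2: existing alignments plus the transcript FASTA they were aligned to
            echo "salmon quant -t transcripts.fa -l A -a aln.bam -o quant_alignment"
            ;;
        *)
            echo "unknown mode: ${mode}" >&2
            return 1
            ;;
    esac
}

salmon_cmd mapping
salmon_cmd alignment
```

The rest of this manual uses the first mode only.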

1. Selecting a reference genome

The most common reference genome databases are Ensembl, RefSeq (NCBI), and UCSC. I have worked exclusively with genomes curated by Ensembl, so let's start from there. Google "Ensembl FTP" and you should land safely on the server within the first 3 hits. The FASTA file of human cDNA is what we are after.

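We want Homo_sapiens.GRCh38.cdna.all.fa.gz; click it to download. This is the reference genome for this manual (strictly speaking it is a set of transcript sequences, a transcriptome, which is what Salmon maps against).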
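If you prefer the terminal to clicking, the download can be sketched as below. The release number (110) is only an example value that I picked, so check the FTP site for the current one. The script prints the URL by default and only downloads when you set DOWNLOAD=yes.

```shell
#!/bin/bash
# Sketch: build the Ensembl download URL for the human cDNA FASTA.
# RELEASE is an assumed example value; check the FTP site for the current release.
RELEASE="${RELEASE:-110}"
FASTA="Homo_sapiens.GRCh38.cdna.all.fa.gz"

ensembl_cdna_url() {
    # $1 = Ensembl release number
    echo "https://ftp.ensembl.org/pub/release-${1}/fasta/homo_sapiens/cdna/${FASTA}"
}

URL="$(ensembl_cdna_url "${RELEASE}")"
echo "Download URL: ${URL}"

# Set DOWNLOAD=yes to actually fetch the file; the default only prints the URL.
if [ "${DOWNLOAD:-no}" = "yes" ]; then
    wget -c "${URL}"          # -c resumes an interrupted download
    gzip -t "${FASTA}" && echo "${FASTA} is a valid gzip file"
fi
```

Run it as DOWNLOAD=yes bash download.sh (the script name is up to you) once the printed URL looks right.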

The difference between DNA and cDNA is that the former contains the whole sequence of a genome, while the latter carries only the sequences of transcripts (mainly coding RNA). For mRNA sequencing, selecting cDNA as the reference genome is more sensible, since we should have essentially no non-coding RNA in our library. This also significantly speeds up the mapping and quantification.
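A quick way to see what is inside the cDNA file is to count its sequences. This is a small sketch that assumes the downloaded file sits in the current folder; every FASTA record starts with a header line beginning with ">", so counting those lines counts the transcripts.

```shell
#!/bin/bash
# Sketch: count the sequences in a gzipped FASTA file.
# Every FASTA record starts with a header line beginning with ">".
count_fasta_records() {
    gzip -dc "$1" | grep -c '^>'
}

REF="Homo_sapiens.GRCh38.cdna.all.fa.gz"
if [ -f "${REF}" ]; then
    echo "${REF} contains $(count_fasta_records "${REF}") transcript sequences"
else
    echo "${REF} not found in the current folder" >&2
fi
```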

2. Install and activate Salmon

Before installing Salmon, we need to install conda first (here via the Miniconda installer). Conda is the package manager that installs Salmon and provides an isolated environment for it.

bash miniconda.sh

$ conda config --add channels conda-forge
$ conda config --add channels bioconda
$ conda create -n salmon salmon

The last line creates an environment called salmon and installs the package called salmon inside that environment. So every time you want to fire up Salmon:

conda activate salmon

You can tell that you are inside the conda environment because the environment name, in brackets, is attached in front of your user name in the terminal, like this:

(salmon) user@computer :
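Besides looking at the prompt, the same check can be scripted. This is a minimal sketch that relies on CONDA_DEFAULT_ENV, the variable conda sets to the name of the active environment; the helper name in_conda_env is my own.

```shell
#!/bin/bash
# Sketch: confirm that the conda environment named "salmon" is active
# and that the salmon program can be found.
# conda sets CONDA_DEFAULT_ENV to the name of the active environment.
in_conda_env() {
    [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}

if in_conda_env salmon; then
    echo "Environment 'salmon' is active"
else
    echo "Environment 'salmon' is not active; run: conda activate salmon"
fi

if command -v salmon >/dev/null 2>&1; then
    salmon --version
else
    echo "salmon not found on PATH"
fi
```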

3. Index reference genome and quantify

Then use this line to index the reference genome, GRCh38.cDNA.fa.gz (use the name of the file you downloaded), for mapping and quantification, and store the index files inside cDNA_index:

salmon index -t GRCh38.cDNA.fa.gz -i cDNA_index

By now one should know that a single character after a hyphen (-), -t and -i in this case, is a parameter/argument passed to the command in front of it to set additional conditions/options.
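Indexing takes a while, so it is handy to skip it when the index is already there. The sketch below does that, and also shows the long forms --transcripts and --index, which mean the same as -t and -i. The helper name needs_index is hypothetical, and the file and folder names are the ones used in this section.

```shell
#!/bin/bash
# Sketch: build the Salmon index only when the index folder does not exist yet.
REF="GRCh38.cDNA.fa.gz"
INDEX_DIR="cDNA_index"

needs_index() {
    # True when the folder is missing or empty
    [ ! -d "$1" ] || [ -z "$(ls -A "$1")" ]
}

if ! needs_index "${INDEX_DIR}"; then
    echo "Index ${INDEX_DIR} already exists, skipping"
elif command -v salmon >/dev/null 2>&1 && [ -f "${REF}" ]; then
    # --transcripts and --index are the long forms of -t and -i
    salmon index --transcripts "${REF}" --index "${INDEX_DIR}"
else
    echo "Would run: salmon index -t ${REF} -i ${INDEX_DIR}"
fi
```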

Then we can map and quantify the FASTQ file using Salmon. Our example is sequence data from a single-end library, so we should use

salmon quant -i /home/user/FUS/cDNA_index -l A \
    -r SRR19241828.fastq \
    -p 7 --validateMappings --gcBias -o /home/user/FUS/quants/SRR19241828

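Refer to the Salmon documentation for the meaning of each parameter. In short: -i is the index folder, -l A lets Salmon detect the library type automatically, -r is the single-end FASTQ file, -p is the number of threads, --validateMappings enables selective alignment, --gcBias corrects for fragment GC bias, and -o is the output folder.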

The -r parameter is for a single-end library. For a paired-end library, replace -r with -1 and -2 to specify the two paired .fastq files.
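The choice between -r and -1/-2 can also be made by a small helper. This is only a sketch: the helper name read_args is hypothetical, and the _1/_2 file suffixes are an assumed naming convention (the one fasterq-dump writes for paired runs), so adjust them to your files. The final command is printed, not run.

```shell
#!/bin/bash
# Sketch: choose -r or -1/-2 depending on which FASTQ files exist for a sample.
# The _1/_2 suffixes are an assumed naming convention; adjust to your files.
read_args() {
    local samp="$1"
    if [ -f "${samp}_1.fastq" ] && [ -f "${samp}_2.fastq" ]; then
        echo "-1 ${samp}_1.fastq -2 ${samp}_2.fastq"   # paired-end
    elif [ -f "${samp}.fastq" ]; then
        echo "-r ${samp}.fastq"                         # single-end
    else
        echo "no FASTQ found for ${samp}" >&2
        return 1
    fi
}

SAMP="SRR19241828"
if ARGS="$(read_args "${SAMP}")"; then
    echo "salmon quant -i cDNA_index -l A ${ARGS} -p 7 --validateMappings --gcBias -o quants/${SAMP}"
fi
```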

4. Batch process

To loop through the whole folder and process all files in one go, first create the .sh file below. You can do that with a .txt file in the GUI and then save it as .sh. One can of course do the same within the terminal using a favorite text editor such as nano.

#!/bin/bash
while read -r line    # read one sample name per line from file.txt
do
    samp=$(basename "${line}")
    echo "Processing sample ${samp}"
    salmon quant -i /home/user/FUS/cDNA_index -l A \
        -r "${samp}.fastq" \
        -p 7 --validateMappings --gcBias -o "/home/user/FUS/quants/${samp}"
done < file.txt

The file.txt at the end of line 9 is the input that the while loop reads. In other words, the names of the FASTQ files that the loop processes come from file.txt, which is essentially the SRR_Acc_List.txt that we generated before.
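If the accession list is not at hand, a list can be rebuilt from the .fastq files themselves. This is a sketch for single-end files named like SRR19241828.fastq; the helper name make_sample_list is hypothetical, and an existing file.txt is left untouched.

```shell
#!/bin/bash
# Sketch: write one sample name per line into a list file,
# taking the names from the .fastq files in a folder.
make_sample_list() {
    local dir="$1" out="$2" f
    : > "${out}"                                  # start with an empty list
    for f in "${dir}"/*.fastq; do
        [ -e "${f}" ] || continue                 # folder has no .fastq files
        basename "${f}" .fastq >> "${out}"        # SRR19241828.fastq -> SRR19241828
    done
}

# Only create file.txt when it does not exist, so an existing list is kept.
[ -e file.txt ] || make_sample_list . file.txt
echo "$(wc -l < file.txt) sample(s) listed in file.txt"
```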

Run the file with bash file.sh, and leave the conda environment with conda deactivate.
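After the batch run it is worth checking that every sample actually produced a result. Salmon writes a tab-separated quant.sf (columns Name, Length, EffectiveLength, TPM, NumReads) into each output folder. The sketch below lists each sample folder and counts the transcripts with a non-zero read count; the helper name count_expressed is hypothetical, and QUANT_DIR defaults to the output path used above.

```shell
#!/bin/bash
# Sketch: after the batch run, report which samples produced a quant.sf file
# and how many transcripts in it have a non-zero read count.
# quant.sf is tab-separated: Name, Length, EffectiveLength, TPM, NumReads.
count_expressed() {
    # Skip the header line, count rows whose NumReads (column 5) is above zero
    awk -F '\t' 'NR > 1 && $5 > 0 { n++ } END { print n + 0 }' "$1"
}

QUANT_DIR="${QUANT_DIR:-/home/user/FUS/quants}"
for d in "${QUANT_DIR}"/*/; do
    [ -d "${d}" ] || continue                     # no sample folders found
    samp="$(basename "${d}")"
    if [ -f "${d}quant.sf" ]; then
        echo "${samp}: $(count_expressed "${d}quant.sf") transcripts with reads"
    else
        echo "${samp}: quant.sf missing, check ${d}logs/salmon_quant.log"
    fi
done
```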

