Salmon
How it ended up being named after a fish I have no idea, but I learnt it in Japan, which is a happy coincidence.
Many mapping workflows are in common use, and benchmark papers are a good starting point for choosing a suitable one. It is painful but true that one has to try several before settling on the most suitable. Considering time, learning curve and the foreseeable continued support for the related packages, this manual focuses on the Salmon method.
Salmon has 2 quantification modes. Without going into technical details: in the first mode, Salmon maps the fragments (raw reads stored inside the fastq file) to an indexed reference (quasi-mapping), counts each hit and moves on to the next fragment. In the second mode, SAM/BAM alignment files are provided to Salmon, and Salmon produces the quantification from the alignment result. For the second mode, one does not need to index the reference with Salmon before running the quantification.

1. Selecting a reference genome
The most common reference genome databases are Ensembl, RefSeq (NCBI) and UCSC. I have worked exclusively with genomes curated by Ensembl, so let's start from there. Google "Ensembl FTP" and you should land safely on the server within the first 3 hits. The FASTA file of human cDNA is what we are after.

We want Homo_sapiens.GRCh38.cdna.all.fa.gz; click on it to download it. This is the reference genome (strictly speaking, a set of transcript sequences rather than a whole genome).
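Before moving on, it may be worth confirming that the download is intact. The following is a minimal sketch, assuming the file sits in the current directory under its Ensembl name; `count_transcripts` is a hypothetical helper written for this manual, not part of Salmon.

```shell
#!/bin/bash
# Hypothetical helper: check a gzipped FASTA file and count its records.
count_transcripts() {
    local fasta_gz="$1"
    # gzip -t exits non-zero if the archive is truncated or corrupted
    gzip -t "$fasta_gz" || return 1
    # every FASTA record starts with a ">" header line
    gzip -dc "$fasta_gz" | grep -c '^>'
}

# Assumed file name, as downloaded from Ensembl
ref="Homo_sapiens.GRCh38.cdna.all.fa.gz"
if [ -f "$ref" ]; then
    echo "Transcripts in ${ref}: $(count_transcripts "$ref")"
fi
```

If the count comes back empty or gzip reports an error, the download was probably interrupted and should be repeated.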
2. Install and activate Salmon
Before installing Salmon, we need to install conda first, which provides the isolated environment that Salmon will run in.
bash miniconda.sh
conda config --add channels conda-forge
conda config --add channels bioconda
conda create -n salmon salmon
The last line creates an environment called salmon and installs the package called salmon inside it. So every time you want to fire up Salmon:
conda activate salmon
You can tell that you are inside a conda environment because the environment name appears in brackets in front of your user name in the terminal, like this:
(salmon) user@computer :
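Another way to confirm that the environment is active is to ask the shell where `salmon` lives. The sketch below assumes nothing beyond a POSIX-style shell; `require_tool` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if a command is on the PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found; did you run 'conda activate salmon'?" >&2
        return 1
    fi
}

# Report the installed version when salmon is available
if require_tool salmon; then
    salmon --version
fi
```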
3. Index reference genome and quantify
Then use this line to index the reference genome, GRCh38.cDNA.fa.gz (this name stands for the cDNA FASTA file downloaded in step 1), for mapping and quantification, and store the index files inside cDNA_index:
salmon index -t GRCh38.cDNA.fa.gz -i cDNA_index
Then we can map and quantify the fastq file using Salmon. Our example is sequence data from a single-end library, so we use:
salmon quant -i /home/user/FUS/cDNA_index -l A \
-r SRR19241828.fastq \
-p 7 --validateMappings --gcBias -o /home/user/FUS/quants/SRR19241828
For a paired-end library, replace -r with -1 and -2 to specify the two paired .fastq files. The -r parameter is for single-end libraries.
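To make the single-end versus paired-end choice concrete, the sketch below prints (but does not run) the command that fits the files found for a sample. It is only an illustration: `quant_cmd` is a hypothetical helper, the `_1.fastq` and `_2.fastq` suffixes are an assumption about how the paired files are named, and the relative paths `cDNA_index` and `quants` are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: print the salmon command suited to the files found
# for one sample. Assumes paired files are named <sample>_1.fastq and
# <sample>_2.fastq, and a single-end file is named <sample>.fastq.
quant_cmd() {
    local samp="$1" dir="${2:-.}"
    local common="salmon quant -i cDNA_index -l A"
    local opts="-p 7 --validateMappings --gcBias -o quants/${samp}"
    if [ -f "${dir}/${samp}_1.fastq" ] && [ -f "${dir}/${samp}_2.fastq" ]; then
        # paired-end: one file per mate
        echo "${common} -1 ${dir}/${samp}_1.fastq -2 ${dir}/${samp}_2.fastq ${opts}"
    elif [ -f "${dir}/${samp}.fastq" ]; then
        # single-end: one file of unpaired reads
        echo "${common} -r ${dir}/${samp}.fastq ${opts}"
    else
        echo "no fastq files found for ${samp} in ${dir}" >&2
        return 1
    fi
}

# Print the command for the example sample; nothing is executed
quant_cmd SRR19241828 . || true
```

Printing the command first makes it easy to inspect before committing to a long run.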

4. Batch process
To loop through the whole folder and process all files in one go, first create the .sh file below. You can do that with a .txt file in the GUI and then save it as .sh, or within the terminal using your favourite text editor, such as nano.
#!/bin/bash
while IFS= read -r line    # read one accession per line from file.txt
do
    samp=$(basename "${line}")    # strip any directory part from the name
    echo "Processing sample ${samp}"
    salmon quant -i /home/user/FUS/cDNA_index -l A \
        -r "${samp}.fastq" \
        -p 7 --validateMappings --gcBias -o "/home/user/FUS/quants/${samp}"
done < file.txt
The file.txt at the end of line 9 is the input that the while loop reads: the names of the fastq files that the loop processes come from file.txt, which is essentially the SRR_Acc_List.txt that we generated before.
Run the script with bash file.sh, and quit conda with conda deactivate.
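After a batch run it can be useful to check that every sample actually produced output. Salmon writes its estimates to a file called quant.sf inside each output directory, so the sketch below lists the accessions in file.txt that lack one. `missing_quants` is a hypothetical helper, and the quants path is the one used in the batch script.

```shell
#!/bin/bash
# Hypothetical helper: list samples that have no quant.sf in the output folder.
missing_quants() {
    local list="$1" quants_dir="$2"
    # the extra test keeps a final line that lacks a trailing newline
    while IFS= read -r samp || [ -n "$samp" ]
    do
        [ -z "$samp" ] && continue          # skip blank lines
        if [ ! -s "${quants_dir}/${samp}/quant.sf" ]; then
            echo "$samp"
        fi
    done < "$list"
}

# Paths taken from the batch script; adjust them to your own layout
if [ -f file.txt ]; then
    missing_quants file.txt /home/user/FUS/quants
fi
```

An empty result means every listed sample has a non-empty quant.sf; any accession printed is a candidate for re-running.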