Salmon
How it ended up being named after a fish I have no idea, but I learnt this in Japan, so it is a happy coincidence
Last updated
There are many commonly used mapping workflows available, and benchmark papers are a good starting point for selecting a suitable one. It is painful but true that one has to try many before settling on the most suitable, but considering time, learning curve and the foreseeable continued support for the related packages, this manual focuses on the Salmon method.
Salmon has two quantification modes. Practically, without going into technical details, in the first mode Salmon maps each fragment (the raw reads stored in a fastq file) to an indexed reference (quasi-mapping), counts the hit, then moves on to the next fragment. In the second mode, SAM/BAM alignment files are provided to Salmon, and Salmon produces the quantification from the alignment results. One does not need to index the reference with Salmon before running the quantification in the second mode.
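As a rough sketch of how the two modes differ on the command line, the functions below wrap one salmon quant call each. The function names and argument order are made up for illustration. The default of -l A asks Salmon to infer the library type, which is an assumption about your data; pass an explicit library type as the last argument if you know it.

```shell
# Mode 1: raw reads are mapped against a Salmon index.
# Arguments: index directory, FASTQ file, output directory, [library type]
quant_from_reads() {
    salmon quant -i "$1" -l "${4:-A}" -r "$2" -o "$3"
}

# Mode 2: quantify from an existing SAM/BAM file.
# No Salmon index is needed, but the transcript FASTA that the reads
# were aligned against has to be supplied with -t.
# Arguments: transcript FASTA, BAM file, output directory, [library type]
quant_from_alignments() {
    salmon quant -t "$1" -l "${4:-A}" -a "$2" -o "$3"
}
```

For example, quant_from_reads cDNA_index sample.fastq sample_quant would run the first mode on a single-end file called sample.fastq (a placeholder name).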
The most common reference genome databases are Ensembl, RefSeq (NCBI), and UCSC. I have worked exclusively with genomes curated by Ensembl, so let's start from there. Google "Ensembl FTP" and you should land safely on the server within the first 3 hits. The FASTA file of human cDNA is what we are after.
We want Homo_sapiens.GRCh38.cdna.all.fa.gz, so please click on it to download it. This is the reference genome.
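If you would rather fetch the file from the terminal than click through the browser, something like the sketch below may work. The directory layout in the URL (current_fasta/homo_sapiens/cdna) is an assumption about how the Ensembl FTP site is organised, and fetch_cdna is just an illustrative name, so check the path in a browser first.

```shell
# Assumed location of the human cDNA FASTA on the Ensembl FTP site.
ENSEMBL_BASE="https://ftp.ensembl.org/pub/current_fasta"
CDNA_FILE="Homo_sapiens.GRCh38.cdna.all.fa.gz"
CDNA_URL="${ENSEMBL_BASE}/homo_sapiens/cdna/${CDNA_FILE}"

# Download the file (resuming a partial download if there is one),
# then check that the gzip archive is intact.
fetch_cdna() {
    wget -c "$CDNA_URL" && gzip -t "$CDNA_FILE"
}
```

Calling fetch_cdna in the folder where you keep your references starts the download.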
The difference between DNA and cDNA is that the former contains the entire sequence of a genome, while the latter carries only the transcribed sequences, here mainly those of protein-coding genes. For mRNA sequencing, selecting cDNA as the reference is more sensible, since we should have little or no non-coding RNA in our library. It also significantly speeds up the mapping and quantification.
Before installing Salmon, we need to install conda, which provides the environment that Salmon is installed into.
sudo sh miniconda.sh
The final installation step creates an environment called salmon and installs the package called salmon inside that environment. So every time you want to fire up Salmon, activate the environment first with conda activate salmon.
You can tell that you are inside the conda environment because the environment name appears in brackets in front of your user name in the terminal, like this
(salmon) user@computer :
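The environment setup described above can be sketched as follows. This is not necessarily the exact set of commands used originally; it follows the usual Bioconda route, and the function name create_salmon_env is made up for illustration.

```shell
# Add the channels that host Salmon, then create an environment
# called "salmon" containing the package "salmon".
create_salmon_env() {
    conda config --add channels conda-forge
    conda config --add channels bioconda
    conda create -n salmon salmon
}
```

After running create_salmon_env once, conda activate salmon switches into the environment and the prompt changes as shown above.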
Then use this line to index the reference genome, GRCh38.cDNA.fa.gz, for mapping and quantification, and store the index files inside cDNA_index -
salmon index -t GRCh38.cDNA.fa.gz -i cDNA_index
By now one should know that a single character after a hyphen (-), -t and -i in this case, is a parameter/argument passed to the command in front of it to set additional conditions/options. Here -t names the FASTA file to index and -i names the folder where the index is stored.
Then we can map and quantify the fastq file using Salmon. Our example is sequence data from a single-end library, so we should use salmon quant with the -r parameter.
Refer to the Salmon documentation for the meaning of the parameters.
Replace -r with -1 and -2 for a paired-end read library to specify the paired .fastq files. The -r parameter is for a single-end library.
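A minimal sketch of the two cases, assuming the index folder cDNA_index built earlier. The function names, the output folder argument and the use of -l A (automatic library type detection) are illustrative choices, not necessarily what was used originally.

```shell
# Single-end library: one FASTQ file, passed with -r.
# Arguments: FASTQ file, output directory
quant_single() {
    salmon quant -i cDNA_index -l A -r "$1" -o "$2"
}

# Paired-end library: the two mate files, passed with -1 and -2.
# Arguments: first mate file, second mate file, output directory
quant_paired() {
    salmon quant -i cDNA_index -l A -1 "$1" -2 "$2" -o "$3"
}
```

For instance, quant_paired sample_1.fastq sample_2.fastq sample_quant would quantify a paired-end sample whose mates are stored in those two (placeholder) files.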
To loop through the whole folder and process all files in one go, first create a .sh file containing the loop. You can do that with a .txt file in the GUI and then save it as .sh. One can surely do that within the terminal too, using a favorite text editor such as nano.
The file.txt at the end of line 9 is the input that the while loop reads, which means the names of the fastq files that the while loop processes come from file.txt, which is essentially the SRR_Acc_List.txt that we generated before.
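A script along the lines described above might look like the sketch below. It is a reconstruction, not the original file, so its line numbers will not match the ones mentioned in the text. It assumes that each line of the list holds one accession, that the matching single-end file is named accession.fastq in the current folder, and that the index lives in cDNA_index. The function name quantify_all and the _quant suffix for output folders are made up for illustration.

```shell
# Quantify every run named in a list file (one accession per line).
# Arguments: list file, [index directory]
quantify_all() {
    local list="$1"
    local index="${2:-cDNA_index}"
    local acc
    # The "|| [ -n ... ]" part keeps a last line without a trailing newline.
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue    # skip blank lines
        # stdin is redirected so the command cannot swallow the list
        salmon quant -i "$index" -l A \
            -r "${acc}.fastq" \
            -o "${acc}_quant" < /dev/null
    done < "$list"    # the list file feeds the while loop
}
```

Saved in file.sh with quantify_all file.txt added as the final line, the whole list is processed in one run.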
Run the file with bash file.sh, and leave the conda environment with conda deactivate