NGS for natural scientist

Salmon

How it ended up being named after a fish I have no idea, but I learnt it in Japan, so it is a happy coincidence.


Last updated 2 years ago

There are many commonly used mapping workflows available, and benchmark papers are a good starting point for selecting a suitable one. It is painful but true that one has to try many before settling on the most suitable. Considering time, learning curve and the foreseeable continued support for the related packages, this manual focuses on the Salmon method.

Salmon has 2 quantification modes. Practically, without going into technical details, in the first mode Salmon maps each fragment (the raw reads stored inside the fastq file) to an indexed reference genome (quasi-mapping), counts the hit, then moves on to the next fragment. In the second mode, SAM/BAM alignment files are provided to Salmon, and Salmon produces the quantification from the alignment result. For the second mode one does not need to index the reference genome with Salmon before running the quantification.
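As a rough illustration of the second mode, here is a minimal sketch of a wrapper that hands an existing alignment file to Salmon. The function name quant_from_alignment and all file names are placeholders made up for this example. It assumes the reads were aligned against the same FASTA that is passed with -t.

```shell
# Sketch of the second (alignment-based) mode. All names are placeholders.
quant_from_alignment() {
    bam="$1"; transcripts="$2"; libtype="$3"; out="$4"
    if ! command -v salmon >/dev/null 2>&1; then
        echo "salmon not found; activate its conda environment first" >&2
        return 1
    fi
    # -a: alignment file, -t: the FASTA the reads were aligned against,
    # -l: library type, -o: output folder
    salmon quant -t "$transcripts" -l "$libtype" -a "$bam" -o "$out"
}

# Example call (U = unstranded single-end library):
# quant_from_alignment aln.bam transcripts.fa U quants/sample1
```

Note that no salmon index step appears here, which is the practical difference from the first mode.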

A name starting with ENST means that this is an annotation from the Ensembl database and that it indicates a transcript instead of a gene, whose name would start with ENSG.
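To make the ENST/ENSG distinction concrete, here is a minimal sketch of a shell function that labels an identifier by its prefix. The function name ensembl_id_type is made up for illustration and is not part of Salmon or Ensembl tooling. It only covers human identifiers, and the two IDs are used purely as examples.

```shell
# Hypothetical helper: classify a human Ensembl identifier by its prefix.
ensembl_id_type() {
    case "$1" in
        ENST*) echo "transcript" ;;
        ENSG*) echo "gene" ;;
        *)     echo "unknown" ;;
    esac
}

ensembl_id_type "ENST00000456328.2"   # prints: transcript
ensembl_id_type "ENSG00000223972"     # prints: gene
```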

1. Selecting a reference genome

We want Homo_sapiens.GRCh38.cdna.all.fa.gz; click it to download. This is the reference genome.

The difference between DNA and cDNA is that the former contains the entire sequence of the genome while the latter carries only the coding RNA. For mRNA sequencing, selecting cDNA as the reference genome is more sensible since we should have no non-coding RNA in our library. A smaller reference also significantly speeds up the mapping and quantification.
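A quick sanity check after downloading is to count how many sequences the FASTA file holds, since every record starts with a ">" header line. The sketch below is one way to do that. The function name count_fasta_records and the tiny two-record file are made up so the example runs without the real download. For the real file, pass Homo_sapiens.GRCh38.cdna.all.fa.gz instead.

```shell
# Count records in a gzipped FASTA: every record header starts with ">".
count_fasta_records() {
    gzip -dc < "$1" | grep -c '^>'
}

# Tiny made-up FASTA (two fake transcripts) so the sketch runs anywhere.
demo=$(mktemp)
printf '>ENST_fake_1\nACGT\n>ENST_fake_2\nGGCC\n' | gzip -c > "$demo"
count_fasta_records "$demo"   # prints: 2
rm -f "$demo"
```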

2. Install and activate Salmon

sudo sh miniconda.sh

$ conda config --add channels conda-forge
$ conda config --add channels bioconda
$ conda create -n salmon salmon

The last line creates an environment called salmon and installs the package called salmon inside that environment. So every time you want to fire up Salmon -

conda activate salmon

The sign that you are inside a conda environment is the bracketed environment name attached in front of your user name in the terminal, like this

(salmon) user@computer :
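Besides reading the prompt, a script can check the active environment itself, because conda stores the name of the active environment in the CONDA_DEFAULT_ENV variable. A minimal sketch follows. The function name in_salmon_env is made up for this example.

```shell
# conda sets CONDA_DEFAULT_ENV to the name of the active environment.
in_salmon_env() {
    [ "${CONDA_DEFAULT_ENV:-}" = "salmon" ]
}

if in_salmon_env; then
    echo "salmon environment is active"
else
    echo "run 'conda activate salmon' first"
fi
```

A check like this at the top of a batch script avoids a long run failing only because the environment was not activated.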

3. Index reference genome and quantify

Then use this line to index the reference genome, GRCh38.cDNA.fa.gz, for mapping and quantification, and store the index files inside cDNA_index -

salmon index -t GRCh38.cDNA.fa.gz -i cDNA_index

By now one should know that a single character after a hyphen (-), -t and -i in this case, is a parameter/argument passed to the command in front to set additional conditions/options. Here -t gives the FASTA file to index and -i gives the folder in which the index is stored.
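To see what happens to such parameters on the receiving side, here is a small sketch of a function that parses -t and -i with the shell builtin getopts. demo_index is a made-up stand-in that only mimics the shape of the index command. It does not run Salmon.

```shell
# Hypothetical stand-in that parses -t and -i the way a command-line tool would.
demo_index() {
    OPTIND=1                      # reset so the function can be called repeatedly
    transcripts="" index_dir=""
    while getopts "t:i:" opt; do
        case "$opt" in
            t) transcripts="$OPTARG" ;;   # value following -t
            i) index_dir="$OPTARG" ;;     # value following -i
            *) return 1 ;;
        esac
    done
    echo "input=${transcripts} output=${index_dir}"
}

demo_index -t GRCh38.cDNA.fa.gz -i cDNA_index
# prints: input=GRCh38.cDNA.fa.gz output=cDNA_index
```

The order of the options does not matter, only which value follows which letter.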

Then we can map and quantify the fastq file using Salmon. Our example is sequence data from a single-end library, so we should use

salmon quant -i /home/user/FUS/cDNA_index -l A \
    -r SRR19241828.fastq \
    -p 7 --validateMappings --gcBias -o /home/user/FUS/quants/SRR19241828

The -r parameter is for a single-end library. For a paired-end library, replace -r with -1 and -2 to specify the two paired .fastq files.
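The single-end versus paired-end choice can be sketched as a small helper that prints, without running, the command for one sample. The helper name salmon_quant_cmd and the sampleX file names are made up for illustration. The remaining options are copied from the command above.

```shell
# Hypothetical helper: print (not run) a salmon quant command for one sample.
# One FASTQ argument -> single-end (-r); two -> paired-end (-1 and -2).
salmon_quant_cmd() {
    index="$1"; out="$2"; shift 2
    case "$#" in
        1) reads="-r $1" ;;
        2) reads="-1 $1 -2 $2" ;;
        *) echo "expected 1 or 2 FASTQ files" >&2; return 1 ;;
    esac
    echo "salmon quant -i ${index} -l A ${reads} -p 7 --validateMappings --gcBias -o ${out}"
}

salmon_quant_cmd cDNA_index quants/SRR19241828 SRR19241828.fastq
salmon_quant_cmd cDNA_index quants/sampleX sampleX_1.fastq sampleX_2.fastq
```

Printing the command first makes it easy to review before pasting it into the terminal.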

4. Batch process

To loop through the whole folder and process all files in one go, first create the .sh file below. You can do that with a .txt in the GUI and then save it as .sh. One can surely do the same within the terminal using a favorite text editor such as nano.

#!/bin/bash
while read -r line
do
    samp=$(basename "${line}")   # sample name, e.g. SRR19241828
    echo "Processing sample ${samp}"
    salmon quant -i /home/user/FUS/cDNA_index -l A \
        -r "${samp}.fastq" \
        -p 7 --validateMappings --gcBias -o "/home/user/FUS/quants/${samp}"
done < file.txt   # file.txt lists one sample name per line

Run the file with bash file.sh, and quit conda with conda deactivate

The most common reference genome databases are Ensembl, RefSeq (NCBI), and UCSC. I have worked exclusively with genomes curated by Ensembl, so let's start from there. Google "Ensembl FTP" and you should safely land on the server within the first 3 hits. The FASTA file of the Human cDNA is what we are after.

Before installing Salmon, we need to install conda first to provide the Python environment for Salmon.

Refer to the Salmon documentation for the meaning of the parameters.

In the batch script, the file.txt at the end of line 9 (done < file.txt) is the input that the while loop reads. That means the names of the fastq files that the while loop works through come from file.txt, which is essentially the SRR_Acc_List.txt that we generated before.
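Because the loop trusts file.txt blindly, it can help to check beforehand that every listed name has a matching .fastq file. The sketch below shows one way to do that. The function name missing_fastq and the two accessions are made up for this example, and it uses a temporary folder so that it runs without any real data.

```shell
# Report names listed in a file that have no matching .fastq in a folder.
missing_fastq() {
    list="$1"; dir="$2"
    # "|| [ -n ... ]" also handles a last line without a trailing newline
    while read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue            # skip blank lines
        samp=$(basename "$line")
        [ -f "${dir}/${samp}.fastq" ] || echo "$samp"
    done < "$list"
}

work=$(mktemp -d)
printf 'SRR0000001\nSRR0000002\n' > "$work/file.txt"   # made-up accessions
: > "$work/SRR0000001.fastq"                           # only one FASTQ present
missing_fastq "$work/file.txt" "$work"                 # prints: SRR0000002
rm -rf "$work"
```

An empty output means every sample in the list is ready for the batch run.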

Figure: Human, cDNA, FASTA
Figure: This is what salmon would give you