DESeq2

Salmon, tximport and DESeq2 flow was authored by the same person, Michael Love. I think I could never understand what he did.

Simply speaking DESeq2 is an open source package that calculate the differential gene expression between 2 groups. It includes some of the popular visualization method that one could generate with a single line command. Statistic and technical details are available here, where most of the content of this page was extracted from.

1. Data preparation

First of all to straight things out we can rearrange the folder structure - that is to group all the .sf files into the working directory of the current R session. This is not necessary computer-wise but I think it helps the beginner to get on their feet faster and better.

After quantification, Salmon generates a folder with the name of the .fastq file and store all the information inside, including logs and alignment summary. The quantification result would be named quant.sf and in this demonstration we want to copy all the quant.sf file to the upper level and be renamed accordingly.

2. colData

A data frame that contains the necessary information for DESeq2 to understand has to be customized every time for each analysis. The minimum information would be sample name with the matching comparison group tag, in a specific order. Since in the final steps DESeq2 does not take the file paths or names into account, all it does is to index the quantification matrix in alphabetical order and process according to the comparison group, or experiment design in its own language, listed in the colData data frame, one has to make sure the colData is logged in the exact order as how the file arranges itself.

The colData can of course be manually created, but to do it with coding one can ensure an error free situation, providing that human error is not involved in the code.
// DESeq2 colData
dir <- getwd()
files <- list.files(dir, pattern=".sf", full.names = T)
library("stringr")
sample <- str_remove_all(list.files(dir, pattern = ".sf", full.names = F), ".sf")
library("stringi")
group <- stri_sub(sample, -6, -3)
sampleTable <- data.frame(sample, group)

This sampleTable is essentially the colData that we will need for DESeq2 analysis later.

3. tximport

As mentioned before, Salmon works on mRNA referencing against cDNA and therefore the .sf file that generated contains code name ENST, meaning the transcriptome in the Ensembl coding system. We need to convert the TPM list of ENST into ENSG, codes in the gene format, and this is where tximport package kicks in.

// transcriptome to gene ID
txdb <- transcripts(EnsDb.Hsapiens.v86, return.type="DataFrame")
tx2gene <- as.data.frame(txdb[,c("tx_id", "gene_id")])
txi <- tximport(files, type="salmon", tx2gene=tx2gene, ignoreTxVersion =T)

4. DESeq2

Finally we can put the txi into DESeq2 to get the log2foldchange of detected genes and their p-values.

//DESeq2 calculation
ddsTxi <- DESeqDataSetFromTximport(txi, colData = sampleTable, design = ~group)
dds <- DESeq(ddsTxi)
resultsNames(dds)

The resultsNames(dds) command would call out the log2foldchange result stored inside the dds, and it should look like this:

> resultsNames(dds)
[1] "Intercept"         "group_CTRL_vs_ALS"

This means "for everything inside ALS as 1, how does CTRL change?" And this is not exactly what we want. We need the result to be in the reversed order. This unwanted behavior was stemmed from we didn't specific the level of the comparison group. The default level is always arranged in the alphabetical order if we do not assign it ourselves. In order to relevel the groups

//comparison relevel
levels(ddsTxi$group)
>[1] "ALS"  "CTRL"
ddsTxi$group <- relevel(ddsTxi$group, "CTRL")
levels(ddsTxi$group)
>[1] "CTRL" "ALS" 
dds <- DESeq(ddsTxi)
res <- results(dds)

We can finally start to analyse and visualize the data. But first of all what is log2foldchange?

Imagine you are calculating fold-change by measuring counts ratios, so:

FoldChange = counts( Treatment ) / Counts ( Control )

Here, FoldChange values larger than one means the gene is more expressed in Treatment, a FoldChange of exactly one means no difference, and FoldChange values between zero and one means the gene is more expressed in Control. Now, if you take the log base 2 of the FoldChange values, you will have:

  • if FoldChange > 1, log2( FoldChange ) > 0

  • if FoldChange = 1, log2( FoldChange ) = 0

  • if 0 < FoldChange < 1, log2( FoldChange ) < 0

So positive values means the gene is more expressed in Treatment, and negative values means the gene is more expressed in Control.

https://www.biostars.org/p/347273/

Last updated