7. Visualization

And this is just one among many jillions.

Data normalization

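# data normalization
library("DESeq2")
vsd <- vst(dds, blind = FALSE)    # variance stabilizing transformation
rld <- rlog(dds, blind = FALSE)   # regularized log transformation
ntd <- normTransform(dds)         # log2(normalized count + 1)
# shrink the log2 fold changes for plotting and ranking
library("apeglm")
resLFC <- lfcShrink(dds, coef = "group_ALS_vs_CTRL", type = "apeglm")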

Data exploration (also known as exploratory data analysis, EDA)

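# plotMA: log2 fold change against mean normalized count
plotMA(res, ylim = c(-3, 3))      # unshrunken fold changes
plotMA(resLFC, ylim = c(-3, 3))   # after apeglm shrinkage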

Data QC

# plotPCA
plotPCA(vsd, intgroup="group")
plotPCA(vsd, intgroup=c("sample","group"))

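# more settings for plotPCA
library("ggplot2")
pcaData <- plotPCA(vsd, intgroup = "group", returnData = TRUE)
# the returned data frame only carries the intgroup columns, so add the
# sample labels by hand; its rows follow the column order of vsd
pcaData$sample <- colData(vsd)$sample
percentVar <- round(100 * attr(pcaData, "percentVar"))

# filled shapes with a black outline (the fixed color overrides color=sample)
ggplot(pcaData, aes(PC1, PC2, color=sample, shape=group)) +
  geom_point(aes(fill=group), size=3, stroke = 0.3, color="black") +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) +
  ylim(-10, 10) +
  scale_shape_manual(values=c(22,24)) +
  coord_fixed()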

ggplot(pcaData, aes(PC1, PC2, color=sample, shape=group)) +
  geom_point(size=3, fill="white") +
  xlab(paste0("PC1: ",percentVar[1],"% variance")) +
  ylab(paste0("PC2: ",percentVar[2],"% variance")) +
  ylim(-10, 10) +
  scale_shape_manual(values=c(22,24)) +
  coord_fixed()

It is good to know what exactly a PCA is and how it is calculated (although I skipped the detailed schematic - I didn't take additional mathematics at GCSE 20 years ago).

https://hbctraining.github.io/scRNA-seq_online/lessons/05_theory_of_PCA.html

PCA stands for Principal Component Analysis. The maths finds the principal components (PCs): new axes, each a weighted combination of genes, ordered by how much of the variation between samples they capture, with PC1 capturing the most, PC2 the next most, and so forth. The genes with the largest weights on a PC are the ones that drive it (plotPCA builds the PCs from the 500 most variable genes by default). The percentages on the axes of the graphs above are the share of the total variance explained by PC1 and PC2 respectively. The scatter plot places each and every sample by its PC1 and PC2 values, so that the specimens cluster into groups without any treatment information involved.
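To make the calculation less abstract, here is a minimal base R sketch of the idea on a made-up matrix. The matrix, the gene and sample names, and the shifted values are all assumptions for illustration, not data from this study; as far as I know plotPCA does something similar with prcomp on the transformed counts.

```r
# Toy expression matrix: 6 genes (rows) x 4 samples (columns); values are made up
set.seed(1)
expr <- matrix(rnorm(24, mean = 8, sd = 1), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4)))
# Raise two genes in the last two samples so there is a group effect to find
expr[1:2, 3:4] <- expr[1:2, 3:4] + 4

# prcomp expects samples in rows, so transpose; genes are centered, not scaled
pca <- prcomp(t(expr), center = TRUE, scale. = FALSE)

# Each PC is a weighted combination of genes; the weights sit in pca$rotation
loadings_pc1 <- pca$rotation[, "PC1"]

# Share of the total variance explained by each PC, in percent
percent_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Sample coordinates on PC1 and PC2: the values a PCA scatter plot draws
scores <- pca$x[, c("PC1", "PC2")]

print(round(percent_var, 1))
print(scores)
```

The first two entries of percent_var play the role of percentVar[1] and percentVar[2] in the axis labels above, and the genes with the largest absolute values in loadings_pc1 are the ones pulling the samples apart along PC1.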

Heatmap

# pheatmap for fancy display
library("pheatmap")
# sample annotation; not used below, but can be passed as annotation_col
df <- as.data.frame(colData(dds)[, "group", drop = FALSE])
# the 200 genes with the highest mean normalized count
select <- order(rowMeans(counts(dds, normalized = TRUE)), decreasing = TRUE)[1:200]
pheatmap(assay(vsd)[select, ],
         color = colorRampPalette(c("navy", "white", "red"))(100),
         cluster_rows = TRUE,
         show_rownames = FALSE,
         show_colnames = TRUE,
         cluster_cols = FALSE,
         labels_col = paste0(sampleTable$sample, sampleTable$group),
         border_color = NA)

First of all copy and paste the code chunk, make sure it works on your machine, then change the parameters to play around for your own study. The graph is customizable down to a granular level and it is vital to make it as visual as possible - you can change the number of genes, the p-value cut-off, the orientation, the division of groups - use the visual aids to reflect your point. For example, the map that I have shows that the technical replicates are close to each other while the difference between biological replicates is larger, just not so large as to mask the difference between the control and diseased groups. The color scheme is the hero that brings out the contrast in values. To conclude, I would say this heatmap visualizes the differential expression of the most highly expressed genes in human neurons between wild-type and mutated FUS.
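As one way of playing with the gene selection, the sketch below contrasts picking genes by expression level (what the heatmap code above does) with picking them by an adjusted p-value cut-off. The table res_toy, its numbers, and the names n_genes and padj_cutoff are hypothetical; only the column names follow what a DESeq2 results table usually carries.

```r
# Hypothetical results table; every number here is made up for illustration
res_toy <- data.frame(
  gene           = paste0("gene", 1:8),
  baseMean       = c(900, 15, 300, 2500, 40, 700, 1200, 60),
  log2FoldChange = c(2.1, -0.2, -1.8, 0.1, 3.0, -0.4, 1.2, -2.5),
  padj           = c(0.001, 0.80, 0.004, 0.95, 0.03, 0.40, 0.02, NA),
  stringsAsFactors = FALSE
)

n_genes <- 3          # how many genes to show (assumed value)
padj_cutoff <- 0.05   # significance cut-off (assumed value)

# Option A: the most highly expressed genes, regardless of any change
by_expression <- head(res_toy$gene[order(res_toy$baseMean, decreasing = TRUE)],
                      n_genes)

# Option B: significant genes only, ranked by absolute fold change;
# genes with a missing padj are dropped
sig <- res_toy[!is.na(res_toy$padj) & res_toy$padj < padj_cutoff, ]
by_significance <- head(sig$gene[order(abs(sig$log2FoldChange),
                                       decreasing = TRUE)],
                        n_genes)

print(by_expression)
print(by_significance)
```

The two selections overlap only partly, which is the point: a heatmap of highly expressed genes and a heatmap of significantly changed genes tell different stories, so pick the one that matches the claim you want the figure to support.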

Pathway analysis (Gene set enrichment analysis)

# GSEA analysis, see https://learn.gencore.bio.nyu.edu/rna-seq-analysis/gene-set-enrichment-analysis/
library("clusterProfiler")
library("enrichplot")
library("ggplot2")
organism <- "org.Hs.eg.db"
library(organism, character.only = TRUE)
# named vector of log2 fold changes, sorted in decreasing order for gseGO
original_gene_list <- res$log2FoldChange
names(original_gene_list) <- row.names(res)
gene_list <- na.omit(original_gene_list)
gene_list <- sort(gene_list, decreasing = TRUE)
# pAdjustMethod = "none" applies no multiple-testing correction; nPerm may raise a warning in newer clusterProfiler releases
gse <- gseGO(geneList = gene_list, ont = "BP", keyType = "ENSEMBL",
             nPerm = 10000, minGSSize = 3, maxGSSize = 800,
             pvalueCutoff = 0.005, verbose = TRUE,
             OrgDb = organism, pAdjustMethod = "none")
dotplot(gse, showCategory = 10, split = ".sign") + facet_grid(. ~ .sign)

At the end of the day we arrive at the whole purpose of this analysis. Instead of crunching individual data points to describe numbers, no matter how meaningful they are, the first job is to make sure the points on the graph justify the gene function groups shown. Assuming the groups in the figure are statistically significant enough to be discussed, we then need to group and interpret the gene functions based on what we know about ALS biology. The endothelin-related genes are suppressed, and changes in endothelin protein levels in ALS have been shown at the molecular level multiple times by multiple sources. Checked. Muscle tissue and organ development, cell differentiation, proliferation and morphogenesis all point to altered development at the cellular level - and how does that associate with phenotypes and/or observed pathology? What sort of experiment do you want to plan around these changes, and what point do you want to address? Autophagosome, lysosome and endosome involvement is known in many neurodegenerative diseases, but ALS does not necessarily come with a foggy brain, so why? RNA and DNA damage, structure, transport: does any of that have anything to do with the dying neurons?

A paper that starts with this pathway analysis and ends with a validation of the predicted phenotype could be strong enough to propose a new hypothesis in Science, albeit not (yet) beneficial to the general public.

