NGS for natural scientist

7. Visualization

And this is just one in many jillion


Last updated 1 year ago

Data normalization

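# Data normalization: transform the counts for visualization and clustering
library("DESeq2")
library("apeglm")
vsd <- vst(dds, blind = FALSE)    # variance-stabilizing transformation
rld <- rlog(dds, blind = FALSE)   # regularized log transformation
ntd <- normTransform(dds)         # log2(normalized count + 1)
# Shrink the log2 fold changes; coef must match an entry of resultsNames(dds)
resLFC <- lfcShrink(dds, coef = "group_ALS_vs_CTRL", type = "apeglm")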

Data exploration (also known as Exploratory Data Analysis, EDA)

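# MA plots: log2 fold change against mean normalized count
plotMA(res, ylim = c(-3, 3))     # unshrunken results
plotMA(resLFC, ylim = c(-3, 3))  # apeglm-shrunken results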

Data QC

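# PCA of the variance-stabilized data
library(ggplot2)
plotPCA(vsd, intgroup = "group")
plotPCA(vsd, intgroup = c("sample", "group"))

# More settings for plotPCA: get the PCA coordinates and draw them with ggplot2
pcaData <- plotPCA(vsd, intgroup = "group", returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
# pcaData has no "sample" column when intgroup = "group", so attach it for coloring
pcaData$sample <- colData(vsd)$sample

# Filled shapes with a black outline, one fill color per group
ggplot(pcaData, aes(PC1, PC2, shape = group)) +
  geom_point(aes(fill = group), size = 3, stroke = 0.3, color = "black") +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  ylim(-10, 10) +  # widen if any sample falls outside this range
  scale_shape_manual(values = c(22, 24)) +
  coord_fixed()

# White-filled shapes, outline colored by sample
ggplot(pcaData, aes(PC1, PC2, color = sample, shape = group)) +
  geom_point(size = 3, fill = "white") +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  ylim(-10, 10) +
  scale_shape_manual(values = c(22, 24)) +
  coord_fixed()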

It is good to know what exactly a PCA is and how it is calculated (although I skipped the detailed schematic, having never sat additional mathematics at GCSE 20 years ago).

PCA stands for Principal Component Analysis. The maths re-expresses the expression data along new axes, the principal components (PCs), each of which is a weighted combination of genes. PC1 is the direction along which the samples vary the most, PC2 is the next most variable direction, and so forth; the genes that change the most between samples carry the largest weights. The percentages shown on the axes of the above graphs are the shares of the total variance captured by PC1 and PC2 respectively. A scatter plot then shows where each and every sample sits along these axes, so that the specimens cluster into groups without any treatment information being involved in the calculation.
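To make the calculation above less abstract, here is a minimal sketch in base R on simulated data. Everything in it (the matrix `expr`, the group labels, the choice of 50 genes) is a made-up assumption for illustration; it is not the ALS dataset, and it only approximates what `plotPCA` does internally.

```r
# Toy PCA in base R, mirroring the steps behind a PCA plot.
# All data here is simulated; nothing comes from the ALS dataset.
set.seed(1)

n_genes <- 100
groups <- rep(c("CTRL", "ALS"), each = 3)          # hypothetical sample groups
expr <- matrix(rnorm(n_genes * 6, mean = 8), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes), paste0("s", 1:6)))
# Shift the first 20 genes in the ALS samples so that a group effect exists
expr[1:20, groups == "ALS"] <- expr[1:20, groups == "ALS"] + 3

# Keep the most variable genes (plotPCA also filters, 500 genes by default)
gene_var <- apply(expr, 1, var)
top <- order(gene_var, decreasing = TRUE)[1:50]

# prcomp expects samples in rows, so transpose the genes x samples matrix
pca <- prcomp(t(expr[top, ]))

# Share of the total variance captured by each principal component
percent_var <- pca$sdev^2 / sum(pca$sdev^2)

# Coordinates of each sample on PC1 and PC2: this is what the scatter plot shows
pca_data <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], group = groups)
print(round(100 * percent_var[1:2]))
print(pca_data)
```

The group labels are only attached at the very end for display; the PCA itself never sees them, which is why a clean separation on PC1 is informative.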

Heatmap

# pheatmap for fancy display
library(pheatmap)
# Sample annotation; pass it as annotation_col = df to add a colored group bar
df <- as.data.frame(colData(dds)[, "group", drop = FALSE])
# The 200 genes with the highest mean normalized count
select <- order(rowMeans(counts(dds, normalized = TRUE)), decreasing = TRUE)[1:200]
pheatmap(assay(vsd)[select, ],
         color = colorRampPalette(c("navy", "white", "red"))(100),
         cluster_rows = TRUE,
         show_rownames = FALSE,
         show_colnames = TRUE,
         cluster_cols = FALSE,
         labels_col = paste0(sampleTable$sample, sampleTable$group),
         border_color = NA)

First of all, copy and paste the code chunk and make sure it works on your machine, then change the parameters and play around for your own study. The graph is customizable down to a granular level, and it is vital to make it as visual as possible: you can change the number of genes, the p-value cut-off, the orientation, and the division of groups. Use the visual aids to make your point. For example, my map shows that the technical replicates are close to each other, while the difference between biological replicates is larger, yet not so large as to mask the difference between the control and diseased groups. The color scheme is the hero here, as it brings out the contrast in values. To conclude, I would say this heatmap visualizes the differential expression of the most highly expressed genes in human neurons between wild-type and mutant FUS.
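As one way of "changing the p-value cut-off", the sketch below picks heatmap rows by adjusted p-value instead of by mean expression. It runs on simulated stand-ins: `toy_res` plays the role of `res`, `toy_mat` the role of `assay(vsd)`, and `padj_cutoff` and `n_max` are hypothetical knobs, so treat it as an illustration of the idea and not as the analysis behind the figure.

```r
# Toy example: choose heatmap rows by adjusted p-value instead of mean expression.
# `toy_res` and `toy_mat` are simulated stand-ins for `res` and `assay(vsd)`.
set.seed(42)
genes <- paste0("gene", 1:50)
toy_res <- data.frame(padj = rev(seq(0.005, by = 0.02, length.out = 50)),
                      row.names = genes)
toy_res$padj[c(5, 48)] <- NA                  # DESeq2 reports NA for filtered genes
toy_mat <- matrix(rnorm(50 * 4, mean = 8), nrow = 50,
                  dimnames = list(genes, paste0("s", 1:4)))

padj_cutoff <- 0.1                            # the p-value cut-off to play with
n_max <- 10                                   # the maximum number of genes to show

# which() drops NA values, so filtered genes never reach the heatmap
sig <- which(toy_res$padj < padj_cutoff)
sig <- sig[order(toy_res$padj[sig])]          # most significant first
select <- head(rownames(toy_res)[sig], n_max)

# Row-wise z-scores put every gene on the same color scale
z <- t(scale(t(toy_mat[select, , drop = FALSE])))
print(select)
print(round(z, 2))
```

In the real workflow, a `select` built this way from `res` could be used to subset `assay(vsd)` in the `pheatmap` call, and `pheatmap` can do the row scaling itself through its `scale = "row"` argument.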

Pathway analysis (Gene set enrichment analysis)

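# GSEA, adapted from https://learn.gencore.bio.nyu.edu/rna-seq-analysis/gene-set-enrichment-analysis/
library(clusterProfiler)
library(enrichplot)
library(ggplot2)
organism <- "org.Hs.eg.db"
library(organism, character.only = TRUE)
# Rank genes by log2 fold change; names must be Ensembl gene IDs to match keyType
original_gene_list <- res$log2FoldChange
names(original_gene_list) <- row.names(res)
gene_list <- na.omit(original_gene_list)
gene_list <- sort(gene_list, decreasing = TRUE)
# nPerm belongs to older clusterProfiler releases; drop it if your version warns about it.
# pAdjustMethod = "none" skips multiple-testing correction; "BH" is the stricter choice.
gse <- gseGO(geneList = gene_list, ont = "BP", keyType = "ENSEMBL", nPerm = 10000,
             minGSSize = 3, maxGSSize = 800, pvalueCutoff = 0.005, verbose = TRUE,
             OrgDb = organism, pAdjustMethod = "none")
# Dot plot of the top 10 terms per direction (activated vs suppressed)
dotplot(gse, showCategory = 10, split = ".sign") + facet_grid(. ~ .sign)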
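The ranked gene list fed to `gseGO` is easy to get subtly wrong, so here is a small base R sketch of that preparation step on made-up values. The IDs (`ENSG_A` and so on) are placeholders and `toy_res` is a hypothetical stand-in for `res`; the two checks at the end are my own suggestion, not something the original tutorial prescribes.

```r
# Toy ranked list for GSEA. The IDs and fold changes are made up.
toy_res <- data.frame(
  log2FoldChange = c(1.8, NA, -2.4, 0.3, -0.7),
  row.names = c("ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D", "ENSG_E")  # placeholder IDs
)

ranks <- toy_res$log2FoldChange
names(ranks) <- rownames(toy_res)

ranks <- ranks[!is.na(ranks)]            # genes without an estimate cannot be ranked
ranks <- sort(ranks, decreasing = TRUE)  # the list must be in decreasing order

print(ranks)
# Sanity checks worth running before handing the list to a GSEA function
stopifnot(!anyDuplicated(names(ranks)), !is.unsorted(rev(ranks)))
```

The result is a named numeric vector, most up-regulated gene first, with one entry per gene ID.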

At the end of the day, we have arrived at the whole purpose of this analysis. Rather than dwelling on individual data points and the numbers behind them, however meaningful they are, first make sure that the points on the graph justify the gene function groups shown. Assuming the groups on the figure are statistically significant enough to discuss, we then need to group and interpret the gene functions based on what we know about ALS biology. The endothelin-related genes are suppressed, and changes in endothelin protein levels in ALS have been demonstrated at the molecular level multiple times by multiple sources. Checked. Muscle tissue and organ development, cell differentiation, proliferation and morphogenesis all point to altered development at the cellular level, so how does this associate with phenotypes and/or observed pathology? What sort of experiment do you want to plan around these changes, and what point do you want to address? Autophagosome, lysosome and endosome involvement is known in many neurodegenerative diseases, but ALS does not necessarily come with a foggy brain, so why? RNA and DNA damage, structure and transport: does any of that have anything to do with the dying neurons?

A paper that starts with this pathway analysis and ends with a validation of the predicted phenotype could be strong enough to propose a new hypothesis in Science, albeit not (yet) beneficial to the general public.

https://hbctraining.github.io/scRNA-seq_online/lessons/05_theory_of_PCA.html
Color is a very important visualization tool in heatmap analysis: the whole point is (just) to show, qualitatively, that the differences between groups are significant.
Each data point on the GSEA dot plot carries 3 readable pieces of information: size for gene count, color for adjusted p-value, and gene ratio for normalized comparison.
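The three readings of a dot can be reproduced by hand from a results table. The sketch below uses a made-up table with the column names found in a GSEA result (`setSize`, `core_enrichment`, `p.adjust`), and it assumes that gene ratio means the number of core enrichment genes divided by the gene set size; that is my understanding of how the dot plot derives it, so check it against your installed version before quoting numbers.

```r
# Toy GSEA result table; the terms, sizes and p-values are invented.
toy_gse <- data.frame(
  Description = c("term A", "term B"),
  setSize = c(40, 25),
  core_enrichment = c("g1/g2/g3/g4", "g5/g6"),   # "/"-separated core genes
  p.adjust = c(0.001, 0.02),
  stringsAsFactors = FALSE
)

# Dot size: number of core enrichment genes in each term
toy_gse$Count <- lengths(strsplit(toy_gse$core_enrichment, "/", fixed = TRUE))

# Dot position: gene ratio (assumed here to be Count / setSize)
toy_gse$GeneRatio <- toy_gse$Count / toy_gse$setSize

# Dot color comes from p.adjust, which is already a column
print(toy_gse[, c("Description", "Count", "GeneRatio", "p.adjust")])
```

Because the ratio divides by the set size, a small gene set with a handful of hits can sit further along the axis than a large set with many more hits, which is the "normalized comparison" the dot plot offers.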