FASTQ quality assessment

Why quality control

Very briefly, the initial QC reports are there to tell you:
1. Do you need to re-run the experiment? (RNA quality, read quality, mapping/alignment quality)
2. Do you need to pre-process the FASTQ files before quantification? (adapter contamination, over-abundance of specific mRNAs, e.g. ribosomal-protein-related mRNA)

Galaxy has done a great job of explaining and interpreting the data on this page. Briefly but informatively, at least I believe: when you encounter warnings from the two criteria Per base sequence content and Sequence Duplication Levels, you do not need to be alarmed in most cases.

Standard FastQC installation and usage

For standard FastQC, enter this into the terminal to install the package:

sudo apt install fastqc

The usage is

fastqc sequence.fastq

which processes files individually, one at a time.

Perform quality check in batch using fastqcr

1. Environment initiation

(if you haven't yet, read how to install fastqcr in RStudio here)

Having the QC done inside RStudio enables GUI interaction and may make things a bit easier, at least for me.

# Install fastqcr in RStudio
library(BiocManager)
install("fastqcr")
# After obtaining the fastqcr package, load the library and install FastQC itself
library(fastqcr)
fastqc_install()
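If you want to make sure the package really is in place before moving on, a quick check (just base R, nothing specific to this pipeline) would be:

# Confirm that fastqcr loads and report the installed version
library(fastqcr)
packageVersion("fastqcr")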

Essentially, the QC is still done in the traditional way by the good old fastqc program. The fastqcr package that we are using here is an enhancement on top of the existing fastqc: it can handle the QC reports of multiple files, for example aggregating them into one single viewer, or summarizing the important points from the routine QC. That means the first step is, and always will be, to run fastqc on the .fastq files, but this time in batch instead of one by one.

2. Generate individual QC report in batch

Then we need to actually check the quality of each and every fastq file.

# Assign the directory storing the fastq files to "qc_dir"
qc_dir <- file.path(getwd(), "fastq") 
file.exists(qc_dir) 

# Assign a directory to hold the output QC reports
qc_rep_dir <- file.path(getwd(), "files", "qc_report") 
file.exists(qc_rep_dir) 

# Perform the quality check
# use the "threads" parameter to allocate the number of CPU threads
fastqc(fq.dir = qc_dir, qc.dir = qc_rep_dir, threads = 7)

This is the output: a QC result for each fastq file, written in .html format (one individual report per file, even though they were generated in batch). One could start examining them one by one from here, but since we have become smarter and refuse to do things one by one anymore, we will group the data together onto one single page to view it all at once.
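If you prefer to confirm from inside R what has actually been written out, a quick way (a small sketch using the qc_rep_dir variable we defined above) is to list the report files:

# List the per-sample FastQC outputs (.html reports and .zip archives)
list.files(qc_rep_dir, pattern = "\\.(html|zip)$")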

3. Aggregating QC data

Aggregate all the QC data into one single data frame with a one-line command in R.

qc <- qc_aggregate(qc_rep_dir) 
qc

The data is collected and formatted by the function qc_aggregate and stored in the variable qc, which is now a data frame.

If you noticed, I have mentioned "data frame" multiple times in the last three seconds. It sounds pretty important, and some may ask: what is that?

What is a data frame? - the importance of understanding the data type of your variable

A data frame is a data type in R. It looks like a table when viewed in RStudio and it acts like a table, but it is a bit more than that. First of all, it has uniquely numbered rows: each row is assigned a unique, non-repeating number that acts as the primary key of that row, think database. The column names are not numbered but are again unique text names, and are customizable. This non-repeating nature is very important for manipulation and sorting. Being able to use data frames in R comfortably is vital in data science.

It is also important to be familiar with the differences between data types. Each data type has its own commands and functions and has to be manipulated in its own way; the most common error in package usage is a data type mismatch.
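As a small illustration (a minimal sketch using base R functions on the qc object from the previous step), you can always ask R what type of object you are holding:

# Inspect the object returned by qc_aggregate
class(qc)    # the class(es) of the object - it behaves as a data frame
dim(qc)      # number of rows and columns
str(qc)      # column names and the data type of each column
head(qc, 10) # peek at the first 10 rows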

4. Selecting the interesting rows

With the data now inside the qc data frame, it is a bit overwhelming to look at it row by row. There must be a way to filter and sort the list to increase readability, and indeed there is, not just one but several. For example:

qc[qc$status == "FAIL", ]  # base R subsetting: keep only the rows whose status is "FAIL"
qc_fails(qc, "sample")     # or use the fastqcr helper to list the samples with failed modules
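fastqcr also provides a few summary helpers that give a broader overview; a minimal sketch (the exact columns in the output may differ between fastqcr versions):

# Summarize the aggregated report per QC module
summary(qc)
# General statistics per sample (e.g. duplication rate, GC content, total sequences)
qc_stats(qc)
# List which sample/module combinations raised a warning
qc_warns(qc)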

Since everything looks ok for this set of data, we will move on to mapping and transcript quantification.

Always check the details here: scroll to the table/figure that you like and read up on how to generate it.
