FASTQ quality assessment

Although an introduction

Why quality control

Very briefly, the initial QC reports are to inform 1. Do you need to re-run the experiment (RNA-quality, read quality, mapping/alignment quality) 2. Do you need to pre-process the fastq file before quantification (adapter contamination, abundance of specific mRNA e.g. ribosomal protein related mRNA)

Galaxy has done a great job in explaining and interpreting the data here. Briefly but informative, at least I believe, when you encounter warning from these 2 criteria, you do not need to be alerted in most of the cases.

Standard FastQC installation and usage

For standard fastqc, input this into terminal to install the package

sudo apt install fastqc

The usage would be

fastqc sequence.fastq

to process individual file one by one.

Perform quality check in batch using fastqcr

1. Environment initiation

(if you haven't, read how to install fastqcr in RStudio from here)

Have QC done inside RStudio would enable a GUI interaction and may make things a bit easier, at least for me

// Install fastqcr in RStudio
library(BiocManager)
install("fastqcr")
// after obtaining fastqcr package, call the library and install fastqc
library(fastqcr)
fastqc_install()

Essentially, the QC are done in the traditional way by the good old fastqc package. Fastqcr package that we are using here is an enhancement to the existing fastqc package, that it has the ability to handle the QC report of multiple files, for example to aggregate them into one single viewer, or to summarize the important point out from the routine QC. That means, the first step is and always will be to perform fastqc to the .fastq file, but this time in batch instead of one by one.

2. Generate individual QC report in batch

Then we need to actually check the quality of every and each fastq file.

// Assign the directory that storing the fastq files into "qc_dir"
qc_dir <- file.path(getwd(), "fastq") 
file.exists(qc_dir) 

// Assign a directory to hold the output QC report
qc_rep_dir <- file.path(getwd(), "files", "qc_report") 
file.exists(qc_rep_dir) 

// Perform Quality Check 
// use parameter "threads" to allocate number of CPU threads
fastqc(fq.dir = qc_dir, qc.dir = qc_rep_dir, threads = 7)

This was the output, QC result of corresponding fastqc file written in the .html format. One can start examine them now one by one from here, but since we became smarter and refuse to do things one by one anymore, we will try to group the data together in one single page to view them all in once.

3. Aggregating QC data

Aggregate all the QC data in one single data frame by a single line command in R.

qc <- qc_aggregate(qc_rep_dir) 
qc

The data was then be collected and formatted by the function qc_aggregate and be stored in variable qc, which had become a data frame by now.

If you notice, I have mentioned data frame for multiple times in 3 seconds. It sounds pretty important and some may ask what is that?

What is data frame? - the importance of understanding the data type of your variable

Data frame is a type of data in R. It looks like a table when viewed in R Studio and it acts like a table, but it is a bit more than that. First of all it has uniquely numbered row, which means each row is assigned with an unique, non-repeating numbers and that is the primary key of that row, think database. The column name is not numbered but again uniquely text named, and is customizable. The non-repeated nature is very important in manipulation and data sorting. The ability to use data frame in R comfortably is vital in data science.

It is also important to be familiar with the difference between data types. Different data type takes their own commands and functions, and has to be manipulated in different ways. The most common error in package usage would have to be data type mismatch.

4. Selecting the interesting row

After getting the data inside the qc data frame, it is a bit overwhelming to look at them row by row. There must be a way to filter and sort the list to increase readability. And indeed, not just one but multiple, for example

qc[qc$status=="FAIL",] or qc_fails(qc, "sample")

Always check this page for details, scroll to the table/figure that you like and read on how to generate them.

Since everything looks ok for this set of data, we will move on to mapping and transcript quantification.

Last updated