FASTQ quality assessment
Although an introduction
Why quality control
Very briefly, the initial QC reports are to inform 1. Do you need to re-run the experiment (RNA-quality, read quality, mapping/alignment quality) 2. Do you need to pre-process the fastq file before quantification (adapter contamination, abundance of specific mRNA e.g. ribosomal protein related mRNA)
Galaxy has done a great job in explaining and interpreting the data here. Briefly but informative, at least I believe, when you encounter warning from these 2 criteria, you do not need to be alerted in most of the cases.
Standard FastQC installation and usage
For standard fastqc, input this into terminal to install the package
sudo apt install fastqc
The usage would be
fastqc sequence.fastq
to process individual file one by one.
Perform quality check in batch using fastqcr
1. Environment initiation
(if you haven't, read how to install fastqcr in RStudio from here)
Have QC done inside RStudio would enable a GUI interaction and may make things a bit easier, at least for me
Essentially, the QC are done in the traditional way by the good old fastqc package. Fastqcr package that we are using here is an enhancement to the existing fastqc package, that it has the ability to handle the QC report of multiple files, for example to aggregate them into one single viewer, or to summarize the important point out from the routine QC. That means, the first step is and always will be to perform fastqc to the .fastq file, but this time in batch instead of one by one.
2. Generate individual QC report in batch
Then we need to actually check the quality of every and each fastq file.
This was the output, QC result of corresponding fastqc file written in the .html format. One can start examine them now one by one from here, but since we became smarter and refuse to do things one by one anymore, we will try to group the data together in one single page to view them all in once.
3. Aggregating QC data
Aggregate all the QC data in one single data frame
by a single line command in R.
The data was then be collected and formatted by the function qc_aggregate
and be stored in variable qc
, which had become a data frame
by now.
If you notice, I have mentioned data frame
for multiple times in 3 seconds. It sounds pretty important and some may ask what is that?
What is data frame? - the importance of understanding the data type of your variable
Data frame is a type of data in R. It looks like a table when viewed in R Studio and it acts like a table, but it is a bit more than that. First of all it has uniquely numbered row, which means each row is assigned with an unique, non-repeating numbers and that is the primary key of that row, think database. The column name is not numbered but again uniquely text named, and is customizable. The non-repeated nature is very important in manipulation and data sorting. The ability to use data frame in R comfortably is vital in data science.
It is also important to be familiar with the difference between data types. Different data type takes their own commands and functions, and has to be manipulated in different ways. The most common error in package usage would have to be data type mismatch.
4. Selecting the interesting row
After getting the data inside the qc data frame, it is a bit overwhelming to look at them row by row. There must be a way to filter and sort the list to increase readability. And indeed, not just one but multiple, for example
qc[qc$status=="FAIL",]
or qc_fails(qc, "sample")
Always check this page for details, scroll to the table/figure that you like and read on how to generate them.
Since everything looks ok for this set of data, we will move on to mapping and transcript quantification.
Last updated