Getting FASTQ files from online database

Re-analyzing published data is common when scientists want a quick look at existing results before planning a new experiment. You can skip this page if you are analyzing your own FASTQ files.


Last updated 2 years ago

(Continue from )

Demonstration with FUS neuron RNA-Seq data

For the rest of the text I will use this paper as an example. I picked it because I have worked with FUS before, the conditions are simple (mutation against healthy), the samples are human, and I like neurons in general. The aim is simple: compare mRNA expression in iPSC-derived motor neurons carrying wild-type FUS against those carrying the H517Q point mutation. Statistically, the null hypothesis is "mRNA is not differentially expressed between the 2 types of neurons". The physiologically relevant hypothesis, in contrast, would be "protein expression is significantly altered by the presence of the H517Q mutation on FUS".

There is strong evidence that FUS is associated with amyotrophic lateral sclerosis (ALS), the disease that killed Stephen Hawking. Point mutations are found in familial cases, although the underlying mechanism is not yet known. In theory, the 2 neurons differ by only a single amino acid in one gene out of many thousands, but RNA-Seq will tell you how this small difference can disrupt the whole system in downstream biological processes.

Thanks Hawking. And Godspeed. (I like Sheldon and I am like Sheldon)

Fetching SRA files from the SRA server is straightforward and well documented. So here I will go through the step that is not clearly written: how to obtain the accession numbers of a project in a smart way.

prefetch --option-file SRR_Acc_List.txt
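Before running the prefetch command above, it can be worth checking that the list really contains one run accession per line. The sketch below assumes the list may carry Windows line endings, blank lines, or duplicates; clean_accession_list and the file name SRR_Acc_List.clean.txt are hypothetical names I made up for illustration and are not part of sra-tools.

```shell
#!/bin/sh
# Sketch: tidy an accession list before handing it to prefetch.
# clean_accession_list is a hypothetical helper, not an sra-tools command.

clean_accession_list() {
    # $1: path to the accession list (one run accession per line)
    # 1. remove Windows carriage returns
    # 2. trim leading and trailing whitespace
    # 3. keep only lines that look like run accessions (SRR/ERR/DRR + digits)
    # 4. sort and remove duplicates
    tr -d '\r' < "$1" \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -E '^[SED]RR[0-9]+$' \
        | sort -u
}

# Usage, once sra-tools is installed:
#   clean_accession_list SRR_Acc_List.txt > SRR_Acc_List.clean.txt
#   wc -l < SRR_Acc_List.clean.txt        # number of runs that will be fetched
#   prefetch --option-file SRR_Acc_List.clean.txt
```

Comparing the line count with the number of runs shown in Run Selector is a cheap way to catch a truncated download of the list.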

The output folder (the folder that holds the downloaded files) cannot be specified with the -o parameter when downloading in batch. The downloaded .sra files will be stored in the current working folder (in this case, the same directory that holds SRR_Acc_List.txt). To specify the receiving folder, use the interactive SRA configuration tool:

vdb-config -i

How long the download takes depends on your broadband connection speed. Once all the .sra files are on your local drive, you can validate their integrity. In this example, the files were stored in /home/user/FUS/sra.

vdb-validate /home/user/FUS/sra
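If you want a per-file report instead of one long log, the validation can be wrapped in a small loop. This is only a sketch: it assumes vdb-validate exits with a non-zero status when a file fails (worth confirming on your version), and validate_sra_dir and VALIDATE_CMD are hypothetical names that I introduce so the loop can be dry-run without sra-tools.

```shell
#!/bin/sh
# Sketch: validate every .sra file under a folder and flag the failures.
# VALIDATE_CMD is a hypothetical override; it defaults to vdb-validate.
VALIDATE_CMD=${VALIDATE_CMD:-vdb-validate}

validate_sra_dir() (
    # $1: folder that holds the downloaded .sra files (searched recursively)
    status=0
    list=$(mktemp)
    find "$1" -type f -name '*.sra' | sort > "$list"
    while IFS= read -r f; do
        # Assumption: a non-zero exit status means the file failed validation.
        if "$VALIDATE_CMD" "$f" </dev/null >/dev/null 2>&1; then
            echo "OK      $f"
        else
            echo "FAILED  $f"
            status=1
        fi
    done < "$list"
    rm -f "$list"
    exit "$status"
)

# Usage:
#   validate_sra_dir /home/user/FUS/sra
```

Any file marked FAILED can simply be deleted and fetched again with prefetch.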

If everything is consistent with the server side, we can proceed to extract the fastq files from the .sra files.

fastq-dump

After you have fetched (downloaded) the SRA files that interest you, you will need to extract the fastq from the compressed files. fastq-dump does just that. In our example, fasterq-dump is more suitable since we have a list of files to extract.

cat SRR_Acc_List.txt | xargs fasterq-dump --outdir fastq --mem 1G --threads 16
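A more defensive variant of the one-liner above handles one run at a time, so that a failure can be traced to a specific accession. It is a sketch rather than the recommended way: dump_runs and DUMP_CMD are hypothetical names of mine, and the thread count of 6 is just an illustrative value.

```shell
#!/bin/sh
# Sketch: run fasterq-dump once per accession and stop at the first failure.
# DUMP_CMD is a hypothetical override; it defaults to fasterq-dump.
DUMP_CMD=${DUMP_CMD:-fasterq-dump}

dump_runs() (
    # $1: accession list (one run per line), $2: output folder
    mkdir -p "$2" || exit 1
    # "|| [ -n "$run" ]" keeps the last line even without a trailing newline
    while IFS= read -r run || [ -n "$run" ]; do
        [ -n "$run" ] || continue          # skip blank lines
        if ! "$DUMP_CMD" "$run" --outdir "$2" --threads 6 </dev/null; then
            echo "fasterq-dump failed for $run" >&2
            exit 1
        fi
    done < "$1"
)

# Usage:
#   dump_runs SRR_Acc_List.txt fastq
```

Since the loop stops at the first failing run, the error message tells you which accession to look at before restarting.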

Use the --threads parameter to set the number of threads used for parallel processing. I didn't read about this, but from my observation the default is 6. At the end of the process, a folder named fastq (in this case) will have been created with all the fastq files dumped inside.
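Once the fastq folder is filled, a quick read count is a reasonable sanity check. The sketch below assumes uncompressed FASTQ with 4 lines per record and, for check_pair, paired-end output named <run>_1.fastq and <run>_2.fastq; whether your runs are paired-end depends on the dataset. count_reads and check_pair are hypothetical helpers, not sra-tools commands.

```shell
#!/bin/sh
# Sketch: sanity-check dumped FASTQ files by counting reads.
# count_reads and check_pair are hypothetical helpers.

count_reads() {
    # $1: uncompressed FASTQ file
    # A FASTQ record spans 4 lines, so reads = lines / 4.
    awk 'END { print NR / 4 }' "$1"
}

check_pair() (
    # $1: run accession, $2: folder holding the dumped files
    # Assumption: paired-end files are named <run>_1.fastq and <run>_2.fastq.
    r1=$(count_reads "$2/${1}_1.fastq")
    r2=$(count_reads "$2/${1}_2.fastq")
    if [ "$r1" = "$r2" ]; then
        echo "$1 OK $r1 read pairs"
    else
        echo "$1 MISMATCH $r1 vs $r2"
        exit 1
    fi
)

# Usage:
#   while IFS= read -r run; do check_pair "$run" fastq; done < SRR_Acc_List.txt
```

A mismatch between the two mates, or a count that is not a whole number, usually means the extraction was interrupted and should be repeated for that run.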

fastq-dump parameter explained here

Obtaining the accession numbers (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA838953):
  • Select all entries and send them to Run Selector
  • Click on the "Accession List" button in the top right-hand corner and download SRR_Acc_List.txt, which contains the accession numbers of all the SRA runs listed under this submission
  • This is the list we need to prefetch in bulk