Getting FASTQ files from an online database

Re-analyzing published data is common when scientists want a quick look at existing results before planning a new experiment. You can skip this section if you are analyzing your own FASTQ files.


Demonstration with FUS neuron RNA-Seq data

https://www.ncbi.nlm.nih.gov/bioproject/PRJNA838953

For the rest of the text I will use this study as an example. I picked it because I have worked with FUS before, the conditions are simple (mutation against healthy), the samples are human, and I like neurons in general. The aim is simple: compare the mRNA expression of wild-type iPSC-derived motor neurons against that of iPSC-derived motor neurons carrying the H517Q point mutation in FUS. Statistically, the null hypothesis is "mRNA is not differentially expressed between the 2 types of neurons". The physiologically relevant hypothesis, in contrast, would be "protein expression is significantly altered by the presence of the H517Q mutation in FUS".

FUS is strongly associated with amyotrophic lateral sclerosis (ALS), the disease that killed Stephen Hawking. Point mutations in FUS are found in familial cases, although the underlying mechanism is not yet known. In theory, the 2 types of neurons differ by only a single amino acid in one gene out of many thousands, but RNA-Seq can show how this small difference disturbs the downstream biological processes of the whole system.

Fetching SRA files from the SRA server is straightforward and well documented. So here I will go through the step that is not clearly written: how to obtain the accession numbers of a project in a smart way, so that the whole list can be passed to prefetch at once.
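One way to build the accession list could look like the sketch below. It assumes the RunInfo table for the project is available as CSV with the run accession in the first column, and that Entrez Direct (esearch, efetch) is installed for the commented command. The helper name extract_runs is made up for illustration.

```shell
# extract_runs (hypothetical helper): read an SRA RunInfo CSV on stdin
# and print one run accession per line.
# Only lines whose first field looks like SRR/ERR/DRR + digits are kept,
# so header lines and blank lines are skipped.
extract_runs() {
  awk -F',' '$1 ~ /^[SED]RR[0-9]+$/ { print $1 }'
}

# Assumed usage with Entrez Direct (needs network access, not run here):
#   esearch -db sra -query PRJNA838953 | efetch -format runinfo \
#     | extract_runs > SRR_Acc_List.txt
```

Alternatively, the SRA Run Selector page of the project offers an "Accession List" download, which gives the same kind of one-accession-per-line file.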

prefetch --option-file SRR_Acc_List.txt

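The output location (where the downloaded files are stored) cannot be specified with the -o parameter when downloading in batch, because -o names a single output file. By default the downloaded .sra files are stored in the current working folder (in this case, the same directory that holds SRR_Acc_List.txt). Recent versions of prefetch also list an -O (--output-directory) option in their help text, so check prefetch --help for your version. To set the receiving folder permanently, use the interactive SRA configuration tool: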

vdb-config -i

The download can take a while, depending on your connection speed. Once all the .sra files are on your local drive, you can validate their integrity. In this example, the files were stored in /home/user/FUS/sra.

vdb-validate /home/user/FUS/sra
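With many runs it can be handy to check the files one by one and get a summary of which ones failed. The following is only a sketch: it assumes the .sra files sit directly in the given folder and that vdb-validate returns a non-zero exit status when a file fails. The function name validate_all and its optional second argument (a replacement validator command, useful for a dry run) are hypothetical.

```shell
# validate_all DIR [VALIDATOR] (hypothetical helper)
# Runs the validator on every .sra file in DIR and reports failures.
# VALIDATOR defaults to vdb-validate; assumption: it exits non-zero on failure.
validate_all() {
  dir=$1
  validator=${2:-vdb-validate}
  failed=0
  for f in "$dir"/*.sra; do
    [ -e "$f" ] || continue          # no .sra files matched the pattern
    if ! "$validator" "$f"; then
      echo "FAILED: $f" >&2
      failed=$((failed + 1))
    fi
  done
  echo "$failed file(s) failed validation" >&2
  [ "$failed" -eq 0 ]                # success only if nothing failed
}

# Example call (not run here):
#   validate_all /home/user/FUS/sra
```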

If everything is consistent with the server side, then we can proceed to extract the fastq file from the .sra file.

fastq-dump

After you have fetched (downloaded) the SRA files you are interested in, you need to extract the FASTQ data from the compressed .sra files. fastq-dump does just that. In our example, fasterq-dump is more suitable: it is the faster, multi-threaded variant, and we have a whole list of files to extract.

cat SRR_Acc_List.txt | xargs fasterq-dump --outdir fastq --mem 1G --threads 16

Use the --threads parameter to set how many threads fasterq-dump uses. From my observation the default is 6, which matches what the fasterq-dump help text reports. At the end of the process, a folder (named fastq in this case) will have been created with all the extracted FASTQ files inside.
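If you prefer to extract one run at a time and keep a record of the runs that failed, a loop like the one below is an option. This is a sketch, not the documented way of doing it: the function name dump_all, the failed.txt log file and the optional third argument (a replacement command, useful for a dry run with echo) are all made up for illustration. The fasterq-dump options are the same ones used in the command above.

```shell
# dump_all LIST OUTDIR [CMD] (hypothetical helper)
# Calls the extractor once per accession listed in LIST.
# CMD defaults to fasterq-dump; accessions that fail are logged
# in OUTDIR/failed.txt (file name chosen for this example).
dump_all() {
  list=$1
  outdir=$2
  cmd=${3:-fasterq-dump}
  mkdir -p "$outdir"
  while IFS= read -r acc; do
    [ -n "$acc" ] || continue        # skip blank lines in the list
    "$cmd" "$acc" --outdir "$outdir" --mem 1G --threads 16 < /dev/null \
      || echo "$acc" >> "$outdir/failed.txt"
  done < "$list"
}

# Example calls (not run here):
#   dump_all SRR_Acc_List.txt fastq         # real extraction
#   dump_all SRR_Acc_List.txt fastq echo    # dry run, only prints the arguments
```

Running one accession per call costs nothing in speed, since the threads are used within each conversion, and a failed run does not stop the rest of the list.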

fastq-dump parameter explained

