Read per sample and data size

Read. Per sample. and how many bytes.

Read per sample

I took this definition from Wikipedia:

In DNA sequencing, a read is an inferred sequence of base pairs (or base pair probabilities) corresponding to all or part of a single DNA fragment. A typical sequencing experiment involves fragmentation of the genome into millions of molecules, which are size-selected and ligated to adapters. The set of fragments is referred to as a sequencing library, which is sequenced to produce a set of reads.

That definition doesn't quite work for me, so in my own words:

Reads per sample is the number of fragments from a sample that are sequenced.

For example, say you have one sample with 100 million fragments to be read. In theory, you will obtain 100 million clusters from the flow cell. Let's assume that your flow cell can form at most 100 million clusters. If you then run 2 samples on a single lane, only 50 million fragments from each sample can be sequenced. This translates into sequencing depth, or coverage, and ultimately affects statistical significance, depending on what you are working on. Although choosing a read number involves some trial and error, the technique is mature enough that there are general recommendations based on the experimental parameters. In particular, the recommended read number is meant to avoid both an under-represented sequencing result and exhaustion of the read pool, which wastes sequencing capacity ($$) on repeated reads.
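The even split described above can be sketched in a few lines of Python. The function name `reads_per_sample` is made up for illustration, and the code assumes clusters divide perfectly evenly between samples, which is an idealization of a real run.

```python
def reads_per_sample(lane_clusters: int, n_samples: int) -> int:
    """Fragments sequenced per sample when a lane is shared evenly.

    Assumption: every cluster yields one usable read and samples are
    pooled in exactly equal amounts (an idealized case).
    """
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return lane_clusters // n_samples


# The flow cell from the example: at most 100 million clusters.
lane_clusters = 100_000_000

for n in (1, 2, 4):
    print(f"{n} sample(s): {reads_per_sample(lane_clusters, n):,} reads each")
```

With 2 samples this gives 50 million reads each, matching the example in the text.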

Data size

To explain data size, which essentially sets the file size of the resulting FASTQ, I was once put through a rather simple calculation exercise, which I found quite educational.

I asked a contractor for a quote on single-cell sequencing, and they explained the price based on the sequencing lane capacity. It turns out that if I rent a whole lane from them, I get 750 Gb per run (they use a NovaSeq 6000), and for single-cell RNA-seq the recommended read number is 20,000 paired reads per cell from 20,000 cells. If I order paired-end reads of 150 bases each (PE150), the data size would be

150 bases x 20,000 reads x 2 (paired end) x 20,000 cells = 120,000,000,000 bases (120 Gb)

Therefore, if I have more than 750 Gb / 120 Gb = 6.25 samples per experiment, ordering per lane would be cheaper than ordering per sample. That was their point.
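The same arithmetic as a small Python sketch, in case you want to plug in your own numbers. The helper name `sequenced_bases` is hypothetical, and the result counts sequenced bases only, as in the calculation above.

```python
def sequenced_bases(read_length: int, reads: int, units: int = 1,
                    paired_end: bool = True) -> int:
    """Total bases = read length x reads x ends x units (e.g. cells)."""
    ends = 2 if paired_end else 1
    return read_length * reads * ends * units


# Single-cell example from the text: PE150, 20,000 paired reads per cell,
# 20,000 cells per sample.
sample_bases = sequenced_bases(read_length=150, reads=20_000, units=20_000)

# Lane capacity quoted by the contractor: 750 Gb per run.
lane_capacity = 750 * 10**9

print(f"Per sample: {sample_bases / 1e9:.0f} Gb")
print(f"Samples per lane: {lane_capacity / sample_bases:.2f}")
```

This prints 120 Gb per sample and 6.25 samples per lane, the same figures as above.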

For bulk mRNA sequencing in the same PE150 format, the general recommendation is 20 million unique reads per sample, which comes to 6 Gb of data. Try computing it yourself to see whether you get the same answer.
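If you want to check your answer, here is a minimal sketch of the bulk calculation. The variable names are my own, and the 20 million reads are treated as read pairs, as in the single-cell example.

```python
# Bulk mRNA-seq, PE150: 20 million read pairs per sample (figure from the text).
read_length = 150          # bases per read
read_pairs = 20_000_000    # recommended unique reads per sample
ends = 2                   # paired-end: each fragment is read from both ends

bulk_bases = read_length * read_pairs * ends
print(f"{bulk_bases:,} bases = {bulk_bases / 1e9:.0f} Gb")
```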
