What is a Sequencing Read?

October 24th, 2023
Probably the most common form of genetic sequencing these days is "paired-end" sequencing. It's very impressive: the sequencing machine can process the same nucleic acid fragment from both ends! This means that each observation looks like:

| forward read |   gap   | reverse read |

Because accuracy ("quality") tends to drop off as you sequence further into a fragment, sequencing from both ends gives you much more accurate data than trying to sequence the whole thing from one end. And because we build up larger sequences ("contigs") by piecing together overlapping ones ("assembly"), two sequences of bases separated by a gap are actually usually more helpful than the same number of bases without a gap.

It's common to refer to paired-end sequencing with designations like "2x150", where the "2x" tells us it's paired-end, and the "150" tells us it reads for 150 bases from each end, for a total of 300 bases per fragment.
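As a rough illustration of how that designation decomposes, here is a minimal Python sketch. The function names `parse_run_designation` and `bases_per_fragment` are made up for this example, and it assumes the designation always has the simple "NxM" shape described above.

```python
import re


def parse_run_designation(designation: str) -> tuple[int, int]:
    """Split a designation like '2x150' into (reads_per_fragment, bases_per_read).

    Assumes the plain 'NxM' form; anything else raises ValueError.
    """
    match = re.fullmatch(r"(\d+)\s*[xX]\s*(\d+)", designation.strip())
    if match is None:
        raise ValueError(f"unrecognized designation: {designation!r}")
    return int(match.group(1)), int(match.group(2))


def bases_per_fragment(designation: str) -> int:
    """Total bases read per fragment: number of ends times bases per end."""
    ends, length = parse_run_designation(designation)
    return ends * length


print(bases_per_fragment("2x150"))  # 300
```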

But this introduces a terminology question: what is a read? When we only had "single-end" sequencing it was clear: each sequenced fragment, each contiguous sequence of bases, was a read. With paired-end sequencing, however, these are no longer the same thing! There are two things a "read" could mean:

  • Read: a continuous series of bases.
  • Read: the bases from a sequenced fragment.

For example, say we have:

>SRR14530724.2 2/1
>SRR14530724.2 2/2

This is a forward read (SRR14530724.2 2/1) and a reverse read (SRR14530724.2 2/2) that together comprise a single observation of a fragment from the sample and would generally be analyzed together. Does this count as one read or two?

Turns out people do both, and it leads to a lot of misunderstandings!
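To make the two conventions concrete, here is a small Python sketch that counts the same two headers both ways. The helper `count_reads` is hypothetical, and it assumes the fragment ID is the first whitespace-separated token after the leading ">" (or "@" in FASTQ), which matches the headers above but is not guaranteed for every file.

```python
def count_reads(headers):
    """Return (sequence_count, fragment_count) for a list of header lines.

    Assumption: the fragment ID is the first whitespace-separated token
    after the leading '>' or '@', so both mates share the same ID.
    """
    fragment_ids = [h.lstrip(">@").split()[0] for h in headers if h.strip()]
    return len(fragment_ids), len(set(fragment_ids))


headers = [
    ">SRR14530724.2 2/1",
    ">SRR14530724.2 2/2",
]
sequences, fragments = count_reads(headers)
print(sequences)  # 2: each contiguous sequence of bases is a read
print(fragments)  # 1: each sequenced fragment (pair) is a read
```

Same data, two defensible answers, which is the whole problem.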

Some examples:

  • Illumina counts them as two. They say the "25B" flow cell on the NovaSeq X will produce 52B paired-end reads, or ~8Tb ("terabases", or trillion bases) of 2x150. Since 52B * 150b = 7.8Tb (what they call ~8Tb), that tells us they're counting the forward and the reverse read separately.

  • Element counts them as one. They say a 2x150 high output flow cell produces 300 Gb and 1B reads. Since 1B * 150b * 2 = 300 Gb, that tells us they're counting the forward and reverse read together as one.

  • Singular is not clear but I'm pretty sure they're counting each fragment's observations as a single read.

  • The European Nucleotide Archive counts them as one. For example, if you visit ERR1470825, which is Illumina MiSeq paired-end sequencing at 2x250, you'll see it says 2.2M reads, and if you download the fastq.gz files you'll find 2.2M reads in each of the forward and reverse files.

  • Rothman et al. 2021 counts them as one. They say "paired reads" on first use and then "reads" later, and you can tell that they are counting them as one because (a) they often give odd numbers for things like "there were only 337 SARS-CoV-2 reads" and (b) if you reanalyze the data their numbers only make sense if they're counting pairs.

  • An academic group I was recently talking to about a potential partnership counted them as one in a recent paper I reanalyzed and as two when talking over email.

  • A commercial sequencing company I was recently talking to counted them as two.

  • ChatGPT and Claude, when asked, both count them as two. Ex: "In short-read paired-end sequencing, a forward-reverse pair is typically considered as two reads".
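The arithmetic I used for the Illumina and Element examples can be sketched in Python. This is only a back-of-the-envelope check under stated assumptions: `infer_read_convention` is a made-up helper, the 5% tolerance is an arbitrary choice to absorb vendor rounding like "~8Tb", and it only makes sense for 2xN paired-end specs.

```python
def infer_read_convention(stated_reads, stated_bases, read_length, tolerance=0.05):
    """Guess whether 'reads' in a 2xN spec means single sequences or pairs.

    Returns "two" if reads * read_length matches the stated output (forward
    and reverse counted separately), "one" if reads * read_length * 2 matches
    (each pair counted once), and "unclear" otherwise.
    """
    def close(a, b):
        # Relative comparison; the tolerance is an assumption, not a standard.
        return abs(a - b) <= tolerance * b

    if close(stated_reads * read_length, stated_bases):
        return "two"
    if close(stated_reads * read_length * 2, stated_bases):
        return "one"
    return "unclear"


# Illumina NovaSeq X "25B" flow cell: 52B reads, ~8Tb at 2x150
print(infer_read_convention(52e9, 8e12, 150))  # two
# Element high output flow cell: 1B reads, 300 Gb at 2x150
print(infer_read_convention(1e9, 300e9, 150))  # one
```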

This is a mess! And, to make it worse, as far as I can tell there's no standard term other than "read" for either "what both the forward and reverse read are examples of" or "what the forward and reverse read are when considered together".

I've been using "read" to mean "read pair", but given the ambiguity I think I should switch to another term. The NCBI SRA uses "spots", but no one else seems to use this terminology. You can just say "read pair", which is pretty good, but a bit long. Possibly "pairs" or "mates" would be good? Thoughts?

