Sequencing Intro II: Adapters

September 14th, 2022
bio, tech
A couple weeks ago I wrote a short sequencing intro. Here's a bit more, starting with a minor puzzle. I'm working with California wastewater sequencing data (Rothman et al 2021) and I found a read that was a partial match for HIV:

>SRR14530740.1578405 1578405/2
AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT
GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT
GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG

The first 83 bases of the read are an exact match for this section near the beginning of the HIV genome:

>AF033819.3 HIV-1, complete genome
GGTCTCTCTGGTTAGACCAGATCTGAGCCTGGGAGCTCTCTGGCTAACTA
GGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTCAAGT
AGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGAC
CCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGAC
...
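To check a match like this by hand, here is a minimal Python sketch. It assumes only the excerpt of the reference shown above, uses brute-force string search, and the helper name `longest_prefix_match` is made up for illustration. It is not necessarily how the match was originally found; a real pipeline would use an aligner.

```python
# The read and the reference excerpt, copied from the listings above.
READ_2 = (
    "AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT"
    "GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT"
    "GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG"
)
HIV_EXCERPT = (
    "GGTCTCTCTGGTTAGACCAGATCTGAGCCTGGGAGCTCTCTGGCTAACTA"
    "GGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTCAAGT"
    "AGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGAC"
    "CCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGAC"
)


def longest_prefix_match(read, reference):
    """Return (length, position) of the longest prefix of `read` that
    occurs exactly in `reference`; (0, -1) if even the first base is absent."""
    best_len, best_pos = 0, -1
    for k in range(1, len(read) + 1):
        pos = reference.find(read[:k])
        if pos == -1:
            break  # if this prefix is absent, every longer one is too
        best_len, best_pos = k, pos
    return best_len, best_pos


length, pos = longest_prefix_match(READ_2, HIV_EXCERPT)
print(length)  # 83
```

The search stops at the first base where the read and the reference disagree, which is where the read stops looking like HIV.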

This dataset was sequenced with paired-end reads, which means there's more information we can get on this particular genetic snippet. This was the 'reverse' read, so let's look at the corresponding forward read:

>SRR14530740.1578405 1578405/1
GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG
GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT
CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG

Paired-end reads are sequenced from opposite ends of the fragment, working towards each other:

   forward read -->
5' --------------------------------------------- 3'
   |||||||||||||||||||||||||||||||||||||||||||||
3' --------------------------------------------- 5'
                                <-- reverse read

Because they're reading in different directions you need to reverse one of the reads, and since they're reading complementary strands you need to take the genetic complement. Here's the reverse complement of the forward read, to match the reverse read we were already looking at:

>SRR14530740.1578405 1578405/1, reverse complement
CAAGCAGAAGACGGCATACGAGATTAATGGCAAGGTCTCGTGGGCTCGGAG
ATGTGTATAAGAGACAGAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCC
GTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCCTTTTAGTC
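Taking a reverse complement is a two-step string operation: swap each base for its partner, then reverse. A minimal sketch, assuming uppercase input with only A, C, G, T, and N (the function name is illustrative):

```python
# Each base maps to its complement; N (unknown base) stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# The first ten bases of the forward read become the last ten
# bases of its reverse complement.
print(reverse_complement("GACTAAAAGG"))  # CCTTTTAGTC
```

Applying the function twice gives back the original sequence, which is a handy sanity check.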

Most commonly your paired-end reads go together like:

[read 1] [gap you didn't sequence] [read 2]

In this case, however, they overlap, allowing us to assemble a larger sequence. Sometimes you might have a read error, where the two don't perfectly match, but we're lucky here and there's no disagreement:

CAAGCAGAAGACGGCATACGAGATTAATGGCAAGGTCTCGTGGGCTCGGAG
ATGTGTATAAGAGACAGAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCC
GTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCCTTTTAGTCCT
GTCTCTTATACACATCTGACGCTGCCGACGACCTTCGTGATGTGTAGATCT
CGGGGGGCGGCGGGG
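The merge above amounts to finding the longest suffix of one read that is also a prefix of the other, then joining them at that point. Here is a rough sketch on toy sequences. It assumes both reads are already on the same strand and that the overlap matches exactly; the function name and the `min_overlap` parameter are made up for illustration. Real read mergers also tolerate mismatches and weigh base quality scores.

```python
def merge_overlapping(left, right, min_overlap=10):
    """Merge two same-strand reads where a suffix of `left` exactly
    equals a prefix of `right`. Returns the merged sequence, or None
    if there is no exact overlap of at least `min_overlap` bases."""
    shortest_allowed = max(min_overlap, 1)
    # Try the longest possible overlap first and work downwards.
    for k in range(min(len(left), len(right)), shortest_allowed - 1, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]
    return None


# Toy example: the reads share the five bases CCGGG.
print(merge_overlapping("AAACCCGGG", "CCGGGTTT", min_overlap=4))  # AAACCCGGGTTT
```

Requiring a minimum overlap guards against joining two reads that only share a base or two by chance.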

Now here's the puzzle: the overlapping portion of the two happens to be exactly the sequence that matches HIV. This isn't something you'd expect to see by chance, right? How much overlap (or distance) you get between two sequences should be unpredictable. So, why is this happening?

In the sequencing process, your input DNA fragment gets more bits of DNA ("adapters") stuck on its ends, to allow the sequencer to manipulate it. At the beginning (5' end) of the target sequence this works well: sequencing uses the adapter to determine where to start the read, which then will nearly always start with the first base of your original fragment. If your fragment is shorter than the read length, however, the read will run past the end of the original sequence and into the adapter on the far side. Here the fragment is only 83 bases but each read is 151, so both reads cover the whole fragment before running into adapter, and the overlap is exactly the fragment. Illumina has some documentation with figures explaining the process.

Here are the original reads again. In both, the portion immediately following the HIV match is CTGTCTCTTATACACATCT:

>SRR14530740.1578405 1578405/2
AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT
GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT
GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG
>SRR14530740.1578405 1578405/1
GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG
GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT
CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG

For the kit used in this paper this sequence is the start of the adapter, and nothing from this bit on is part of our input fragment.

With the adapters removed, we're left with just:

>SRR14530740.1578405 1578405/2
AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT
GGTAACTAGAGATCCCTCAGACCCTTTTAGTC
>SRR14530740.1578405 1578405/1
GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG
GCACACACTACTTGAAGCACTCAAGGCAAGCT

This is now an exact match for HIV, without any junk at the end. Most quality control pipelines contain a step where you remove adapters, just like they remove the poly-G sequences I described last time.
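The adapter removal above can be sketched in a few lines of Python. This assumes the adapter starts with the sequence both reads share right after the HIV match, and it only handles an exact, complete copy of that sequence. The function name is illustrative. Dedicated trimming tools such as cutadapt also handle partial adapters at the very end of a read and adapters containing sequencing errors.

```python
# Sequence that follows the HIV match in both reads above.
ADAPTER_START = "CTGTCTCTTATACACATCT"

READ_1 = (
    "GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG"
    "GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT"
    "CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG"
)
READ_2 = (
    "AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT"
    "GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT"
    "GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG"
)


def trim_adapter(read, adapter=ADAPTER_START):
    """Cut the read at the first exact occurrence of the adapter.
    Reads without the adapter are returned unchanged."""
    start = read.find(adapter)
    return read if start == -1 else read[:start]


print(len(trim_adapter(READ_1)), len(trim_adapter(READ_2)))  # 83 83
```

Both reads trim down to the same 83 bases of fragment, one read being the reverse complement of the other.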
