
  • Sequencing Intro II: Adapters

    September 14th, 2022
    bio, tech
    A couple weeks ago I wrote a short sequencing intro. Here's a bit more, starting with a minor puzzle. I'm working with California wastewater sequencing data (Rothman et al. 2021) and I found a read that was a partial match for HIV:

    >SRR14530740.1578405 1578405/2
    AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT
    GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT
    GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG
    

    The first 83 bases of the read, up through CCCTTTTAGTC, are an exact match for this section near the beginning of the HIV genome:

    >AF033819.3 HIV-1, complete genome
    GGTCTCTCTGGTTAGACCAGATCTGAGCCTGGGAGCTCTCTGGCTAACTA
    GGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTCAAGT
    AGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGAC
    CCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGAC
    ...
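
    One way to check a claim like this is to look for the longest prefix of the read that appears in the reference. The sketch below is only an illustration: it searches the short excerpt quoted above rather than the full genome, and the function name `longest_prefix_match` is made up for this example.

```python
# Excerpt of AF033819.3 as quoted above (not the full genome).
HIV_EXCERPT = (
    "GGTCTCTCTGGTTAGACCAGATCTGAGCCTGGGAGCTCTCTGGCTAACTA"
    "GGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTCAAGT"
    "AGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGAC"
    "CCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGAC"
)

# The reverse read, SRR14530740.1578405/2.
READ_2 = (
    "AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT"
    "GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT"
    "GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG"
)


def longest_prefix_match(read, reference):
    """Return (length, position) of the longest prefix of `read` found in `reference`."""
    best_len, best_pos = 0, -1
    for n in range(1, len(read) + 1):
        pos = reference.find(read[:n])
        if pos == -1:
            # If this prefix is absent, every longer prefix is absent too.
            break
        best_len, best_pos = n, pos
    return best_len, best_pos


match_len, match_pos = longest_prefix_match(READ_2, HIV_EXCERPT)
print(match_len)  # 83
```

    Exact substring search is enough for this one read. For real work you'd want an aligner that tolerates mismatches and searches whole genomes.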
    

    This dataset was sequenced with paired-end reads, which means there's more information we can get on this particular genetic snippet. This was the 'reverse' read, so let's look at the corresponding forward read:

    >SRR14530740.1578405 1578405/1
    GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG
    GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT
    CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG
    

    Paired-end reads are sequenced in opposite directions, working towards each other:

       forward read -->
    5' --------------------------------------------- 3'
       |||||||||||||||||||||||||||||||||||||||||||||
    3' --------------------------------------------- 5'
                                    <-- reverse read
    

    To line the two reads up you need to reverse one of them, since they're read in opposite directions, and take its complement (A↔T, C↔G), since they're read from complementary strands. Here's the reverse complement of the forward read, to match the reverse read we were already looking at:

    >SRR14530740.1578405 1578405/1, reverse complement
    CAAGCAGAAGACGGCATACGAGATTAATGGCAAGGTCTCGTGGGCTCGGAG
    ATGTGTATAAGAGACAGAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCC
    GTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCCTTTTAGTC
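
    Taking a reverse complement is a two-step string operation, so it's easy to sketch. This is a minimal version that assumes the read contains only A, C, G, T, or N; the names `COMPLEMENT` and `reverse_complement` are my own.

```python
# Map each base to its complement; N (unknown base) stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement every base, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]


# The forward read, SRR14530740.1578405/1.
FORWARD_READ = (
    "GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG"
    "GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT"
    "CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG"
)

flipped = reverse_complement(FORWARD_READ)
print(flipped[:10], "...", flipped[-10:])
```

    Applying the function twice gives back the original read, which is a handy sanity check.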
    

    Most commonly your paired-end reads go together like:

    [read 1] [gap you didn't sequence] [read 2]
    

    In this case, however, they overlap, allowing us to assemble a larger sequence. Sometimes you might have a read error, where the two don't perfectly match, but we're lucky here and there's no disagreement:

    CAAGCAGAAGACGGCATACGAGATTAATGGCAAGGTCTCGTGGGCTCGGAG
    ATGTGTATAAGAGACAGAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCC
    GTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCCTTTTAGTCCT
    GTCTCTTATACACATCTGACGCTGCCGACGACCTTCGTGATGTGTAGATCT
    CGGGGGGCGGCGGGG
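
    The assembly above can be reproduced by finding the longest suffix of one sequence that is also a prefix of the other. The sketch below assumes the overlap is exact, which happens to be true for this pair; `merge_overlapping` and its `min_overlap` parameter are hypothetical names for illustration.

```python
def merge_overlapping(left, right, min_overlap=10):
    """Join two sequences on the longest suffix of `left` that is a prefix of `right`.

    Returns (merged, overlap_length), or (None, 0) if no exact overlap of
    at least `min_overlap` bases exists.
    """
    longest_possible = min(len(left), len(right))
    for n in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return left + right[n:], n
    return None, 0


# Reverse complement of the forward read, as shown above.
LEFT = (
    "CAAGCAGAAGACGGCATACGAGATTAATGGCAAGGTCTCGTGGGCTCGGAG"
    "ATGTGTATAAGAGACAGAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCC"
    "GTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCCTTTTAGTC"
)
# The reverse read, unchanged.
RIGHT = (
    "AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT"
    "GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT"
    "GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG"
)

merged, overlap = merge_overlapping(LEFT, RIGHT)
print(overlap, len(merged))  # 83 219
```

    A real read merger would also have to cope with read errors in the overlap, typically by allowing a few mismatches and using the quality scores to pick a base. This sketch skips all of that.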
    

    Now here's the puzzle: the overlapping portion of the two happens to be exactly the sequence that matches HIV. This isn't something you'd expect to see by chance, right? How much overlap (or distance) you get between two sequences should be unpredictable. So, why is this happening?

    In the sequencing process, your input DNA fragment gets more bits of DNA ("adapters") stuck on its ends, to allow the sequencer to manipulate it. At the beginning (5' end) of the target sequence this works well: sequencing uses the adapter to determine where to start the read, which then will nearly always start with the first base of your original fragment. If your initial fragment is shorter than the read length, however, the read will run past the end of the original sequence and into the adapter on the far end. Illumina has some documentation with figures explaining the process.
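
    The arithmetic here is simple enough to write down. The reads shown in this post are 151 bases long and the HIV match is 83 bases; the function name `adapter_bases_read` is made up for this sketch.

```python
def adapter_bases_read(read_length, fragment_length):
    """How many bases of a read fall past the end of the fragment, into the adapter."""
    return max(0, read_length - fragment_length)


# The reads in this post are 151 bases and the fragment turns out to be 83.
print(adapter_bases_read(151, 83))   # 68
# A fragment longer than the read never reaches the adapter.
print(adapter_bases_read(151, 400))  # 0
```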

    Here are the original reads again. In both, the first 83 bases are immediately followed by the same sequence, CTGTCTCTTATACACATCT:

    >SRR14530740.1578405 1578405/2
    AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT
    GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT
    GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG
    >SRR14530740.1578405 1578405/1
    GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG
    GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT
    CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG
    

    For the kit used in this paper, this sequence is the start of the adapter, and nothing from this point on is part of our input fragment.

    With the adapters removed, we're left with just:

    >SRR14530740.1578405 1578405/2
    AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT
    GGTAACTAGAGATCCCTCAGACCCTTTTAGTC
    >SRR14530740.1578405 1578405/1
    GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG
    GCACACACTACTTGAAGCACTCAAGGCAAGCT
    

    This is now an exact match for HIV, without any junk at the end. Most quality control pipelines contain a step where you remove adapters, just like they remove the poly-G sequences I described last time.
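
    The adapter removal described above can be sketched as a search for the adapter's first bases followed by a cut. This is a deliberately naive version: it assumes the adapter start appears in full and without errors, and `ADAPTER_START` and `trim_adapter` are names I've picked for illustration.

```python
# The bases that follow the fragment in both reads shown above.
ADAPTER_START = "CTGTCTCTTATACACATCT"


def trim_adapter(read, adapter=ADAPTER_START):
    """Cut the read at the first exact occurrence of `adapter`; leave it alone if absent."""
    idx = read.find(adapter)
    return read if idx == -1 else read[:idx]


READ_1 = (
    "GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG"
    "GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT"
    "CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG"
)
READ_2 = (
    "AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT"
    "GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT"
    "GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG"
)

for read in (READ_1, READ_2):
    trimmed = trim_adapter(read)
    print(len(trimmed), trimmed)
```

    Dedicated trimming tools also handle sequencing errors inside the adapter and adapters that are cut off by the end of the read, which this sketch ignores.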
