
  • Sequencing Intro II: Adapters

    September 14th, 2022
    bio, tech
    A couple weeks ago I wrote a short sequencing intro. Here's a bit more, starting with a minor puzzle. I'm working with California wastewater sequencing data (Rothman et al 2021) and I found a read that was a partial match for HIV:

    >SRR14530740.1578405 1578405/2
    AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT
    GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT
    GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG
    

    The first 83 bases of the read are an exact match for this section near the beginning of the HIV genome:

    >AF033819.3 HIV-1, complete genome
    GGTCTCTCTGGTTAGACCAGATCTGAGCCTGGGAGCTCTCTGGCTAACTA
    GGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTCAAGT
    AGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGAC
    CCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGAC
    ...
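
A check like this can be sketched in a few lines of Python. This is only an illustration: `longest_prefix_match` is a made-up helper name, the search is a naive exact-match scan rather than a real aligner, and `reference` is just the excerpt shown above, not the full genome.

```python
def longest_prefix_match(read: str, reference: str) -> tuple:
    """Return (length, position) of the longest prefix of `read` that occurs
    exactly in `reference`; position is -1 if not even the first base matches."""
    best_len, best_pos = 0, -1
    for n in range(1, len(read) + 1):
        pos = reference.find(read[:n])
        if pos == -1:
            break  # a longer prefix cannot match if this one does not
        best_len, best_pos = n, pos
    return best_len, best_pos

# Excerpt of AF033819.3 copied from above (not the full genome).
reference = (
    "GGTCTCTCTGGTTAGACCAGATCTGAGCCTGGGAGCTCTCTGGCTAACTA"
    "GGGAACCCACTGCTTAAGCCTCAATAAAGCTTGCCTTGAGTGCTTCAAGT"
    "AGTGTGTGCCCGTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGAC"
    "CCTTTTAGTCAGTGTGGAAAATCTCTAGCAGTGGCGCCCGAACAGGGAC"
)
# The reverse read (SRR14530740.1578405/2) copied from above.
read = (
    "AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT"
    "GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT"
    "GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG"
)
length, position = longest_prefix_match(read, reference)
print(length, position)
```

The printed length is how many bases at the start of the read match the excerpt before the first disagreement.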
    

    This dataset was sequenced with paired-end reads, which means there's more information we can get on this particular genetic snippet. This was the 'reverse' read, so let's look at the corresponding forward read:

    >SRR14530740.1578405 1578405/1
    GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG
    GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT
    CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG
    

    With paired-end sequencing, the two reads are sequenced from opposite ends of the fragment, working towards each other:

       forward read -->
    5' --------------------------------------------- 3'
       |||||||||||||||||||||||||||||||||||||||||||||
    3' --------------------------------------------- 5'
                                    <-- reverse read
    

    Because they're reading in different directions you need to reverse one of the reads, and since they're reading complementary strands you need to take the genetic complement. Here's the reverse complement of the forward read, to match the reverse read we were already looking at:

    >SRR14530740.1578405 1578405/1, reverse complement
    CAAGCAGAAGACGGCATACGAGATTAATGGCAAGGTCTCGTGGGCTCGGAG
    ATGTGTATAAGAGACAGAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCC
    GTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCCTTTTAGTC
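
Taking a reverse complement is simple enough to sketch directly. This is a minimal Python version, assuming uppercase reads containing only A, C, G, T, and N; the helper name `reverse_complement` is my own choice for illustration.

```python
# Map each base to its complement; N (unknown base) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]

# The forward read (SRR14530740.1578405/1) copied from above.
forward_read = (
    "GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG"
    "GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT"
    "CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG"
)
print(reverse_complement(forward_read))
```

Applying the function twice gets you back the original read, which is a handy sanity check.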
    

    Most commonly your paired-end reads go together like:

    [read 1] [gap you didn't sequence] [read 2]
    

    In this case, however, they overlap, allowing us to assemble a larger sequence. Sometimes you might have a read error, where the two don't perfectly match, but we're lucky here and there's no disagreement:

    CAAGCAGAAGACGGCATACGAGATTAATGGCAAGGTCTCGTGGGCTCGGAG
    ATGTGTATAAGAGACAGAGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCC
    GTCTGTTGTGTGACTCTGGTAACTAGAGATCCCTCAGACCCTTTTAGTCCT
    GTCTCTTATACACATCTGACGCTGCCGACGACCTTCGTGATGTGTAGATCT
    CGGGGGGCGGCGGGG
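
Here is a rough sketch of that merge step in Python, shown on toy sequences. It assumes both inputs are already on the same strand and that the overlap is exact; real read mergers also tolerate mismatches and use quality scores, which this does not. `merge_overlapping` and `min_overlap` are names I made up for the example.

```python
def merge_overlapping(left: str, right: str, min_overlap: int = 5):
    """If a suffix of `left` exactly equals a prefix of `right`, return
    (merged_sequence, overlap_length) for the longest such overlap.
    Return None if no overlap of at least `min_overlap` bases exists."""
    for n in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-n:] == right[:n]:
            return left + right[n:], n
    return None

# Toy example: the two sequences share the 8 bases ACGTTGCA.
left = "GGGGGACGTTGCA"
right = "ACGTTGCATTTTT"
merged, overlap = merge_overlapping(left, right)
print(merged, overlap)
```

For the reads in this post, `left` would be the reverse complement of the forward read and `right` would be the reverse read.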
    

    Now here's the puzzle: the overlapping portion of the two happens to be exactly the sequence that matches HIV. This isn't something you'd expect to see by chance, right? How much overlap (or distance) you get between two sequences should be unpredictable. So, why is this happening?

    In the sequencing process, your input DNA fragment gets more bits of DNA ("adapters") stuck on its ends, to allow the sequencer to manipulate it. At the beginning (5' end) of the target sequence this works well: the sequencer uses the adapter to determine where to start the read, which will then nearly always start with the first base of your original fragment. If your fragment is shorter than the read length, however, the read will run past the end of the fragment and into the adapter on the far side. Illumina has some documentation with figures explaining the process.
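
A toy model makes it clearer why the overlap ends up being exactly the fragment. This sketch is deliberately simplified: `simulate_pair` is a hypothetical helper, the adapters are written as runs of X and Y so they stand out, and it ignores sequencing errors and whatever the sequencer reports once it runs off the end of the adapter.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def simulate_pair(fragment: str, adapter_1: str, adapter_2: str, read_length: int):
    """Each read starts at the first base of its strand of the fragment and,
    if the fragment is shorter than read_length, continues into the adapter."""
    read_1 = (fragment + adapter_1)[:read_length]
    read_2 = (reverse_complement(fragment) + adapter_2)[:read_length]
    return read_1, read_2

fragment = "ACGTTGCAAC"  # toy 10-base fragment, shorter than the 14-base reads
read_1, read_2 = simulate_pair(fragment, "XXXXXX", "YYYYYY", read_length=14)
print(read_1)  # fragment followed by the start of adapter_1
print(read_2)  # reverse complement of the fragment, then the start of adapter_2

# Both reads cover the whole fragment, so their overlap is the fragment itself.
n = len(fragment)
print(read_1[:n] == reverse_complement(read_2[:n]))
```

When the fragment is longer than the read length, neither read reaches an adapter, and how much the two reads overlap depends on the fragment length.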

    Here are the original reads again. Immediately after the HIV match, both continue with the same 19 bases, CTGTCTCTTATACACATCT:

    >SRR14530740.1578405 1578405/2
    AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT
    GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT
    GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG
    >SRR14530740.1578405 1578405/1
    GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG
    GCACACACTACTTGAAGCACTCAAGGCAAGCTCTGTCTCTTATACACATCT
    CCGAGCCCACGAGACCTTGCCATTAATCTCGTATGCCGTCTTCTGCTTG
    

    For the kit used in this paper this sequence is the start of the adapter, and nothing from this bit on is part of our input fragment.

    With the adapters removed, we're left with just:

    >SRR14530740.1578405 1578405/2
    AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT
    GGTAACTAGAGATCCCTCAGACCCTTTTAGTC
    >SRR14530740.1578405 1578405/1
    GACTAAAAGGGTCTGAGGGATCTCTAGTTACCAGAGTCACACAACAGACGG
    GCACACACTACTTGAAGCACTCAAGGCAAGCT
    

    This is now an exact match for HIV, without any junk at the end. Most quality control pipelines contain a step where you remove adapters, just like they remove the poly-G sequences I described last time.
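
The simplest possible version of that adapter-removal step looks something like the sketch below. It assumes the adapter shows up as an exact, full-length copy of the 19 bases that both reads share right after the HIV match; real trimming tools also handle sequencing errors and adapters that are cut off at the end of a read. `trim_adapter` and `ADAPTER_START` are names I picked for illustration.

```python
# The 19 bases both reads above share right after the HIV match.
ADAPTER_START = "CTGTCTCTTATACACATCT"

def trim_adapter(read: str, adapter: str = ADAPTER_START) -> str:
    """Cut the read at the first exact occurrence of the adapter.
    Reads without the adapter are returned unchanged."""
    idx = read.find(adapter)
    return read if idx == -1 else read[:idx]

# The reverse read (SRR14530740.1578405/2) copied from above.
reverse_read = (
    "AGCTTGCCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGTTGTGTGACTCT"
    "GGTAACTAGAGATCCCTCAGACCCTTTTAGTCCTGTCTCTTATACACATCT"
    "GACGCTGCCGACGACCTTCGTGATGTGTAGATCTCGGGGGGCGGCGGGG"
)
trimmed = trim_adapter(reverse_read)
print(trimmed)
print(len(trimmed))
```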


