Multispecies Metagenomic Calibration

June 24th, 2025
Cross-posted from my NAO Notebook.

This is something I wrote internally in late 2022. I'm sharing it now with light edits, additional context, and updated links after the idea came up at the Microbiology of the Built Environment conference I'm attending this week.

Metagenomic sequencing data is fundamentally relative: each observation is a fraction of all the observations in a sample. If you want to make quantitative observations, however, like understanding whether there's been an increase in the number of people with some infection, you need to calibrate these observations. For example, samples can differ depending on:

  • How many humans are contributing to the sample?
  • Has it been raining? (Especially in areas with combined sewers, but also a factor with nominally separated sewers.)
  • What have people been eating lately?
  • What temperature has it been?
  • How concentrated is the sample?
  • etc.

If you're trying to understand growth patterns, all of this is noise; can we reverse this variation? I'm using "calibration" to refer to this process of going from raw per-sample pathogen read counts to estimates of how much of each pathogen was originally shed into sewage.

The simplest option is not to do any calibration, and just consider raw relative abundance: counts relative to the total number of reads in the sample. For example, this is what Marc Johnson and Dave O'Connor are doing.
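As a concrete baseline, raw relative abundance is just a ratio of counts. Here is a minimal Python sketch; the function name and the read counts are made up for illustration and don't come from any real pipeline:

```python
def relative_abundance(target_reads: int, total_reads: int) -> float:
    """Fraction of all reads in a sample that were assigned to the target."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return target_reads / total_reads


# Hypothetical counts for two samples sequenced to different depths.
samples = {
    "sample_a": {"target": 120, "total": 1_000_000_000},
    "sample_b": {"target": 90, "total": 500_000_000},
}

abundances = {
    name: relative_abundance(counts["target"], counts["total"])
    for name, counts in samples.items()
}

for name, value in abundances.items():
    print(f"{name}: {value:.2e}")
```

Note that sample_b has fewer target reads but a higher relative abundance, because it was sequenced less deeply. Relative abundance corrects for sequencing depth, but for none of the sources of variation listed earlier.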

It seems like you ought to be able to do better if you normalize by the number of reads matching some other species humans excrete. It's common to use PMMoV for this: peppers are commonly infected with PMMoV, people eat peppers, people excrete PMMoV. All else being equal, the amount of PMMoV in a sample should be proportional to the human contribution to the sample. This is especially common in PCR work, where you take a PCR measurement of your target, and then present it relative to a PCR measurement of PMMoV. For example, this is what WastewaterSCAN does.
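To illustrate what normalizing against a single reference species buys you, here is a small Python sketch. The counts and the function name are hypothetical, and it assumes the idealized case where target and PMMoV reads scale together with the human contribution:

```python
def reference_normalized(target_reads: int, reference_reads: int) -> float:
    """Target reads per reference-species read (for example, per PMMoV read)."""
    if reference_reads <= 0:
        raise ValueError("need at least one reference read to normalize")
    return target_reads / reference_reads


# Made-up counts: both samples have 1B reads, but in the second one
# human-derived material makes up half as much of the sample.
total_reads = 1_000_000_000
samples = {
    "typical_day": {"target": 200, "pmmov": 40_000},
    "diluted_day": {"target": 100, "pmmov": 20_000},
}

for name, counts in samples.items():
    raw = counts["target"] / total_reads
    normalized = reference_normalized(counts["target"], counts["pmmov"])
    print(f"{name}: raw={raw:.1e} normalized={normalized:.4f}")
```

In this toy case the raw relative abundance halves on the diluted day while the PMMoV-normalized value stays the same. Real data won't be this clean: PMMoV shedding depends on diet, so the reference carries its own noise.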

Because the NAO is doing very deep metagenomic sequencing, around 1B read pairs (300 Gbp) per sample, we ought to be able to calibrate against many species at once. PMMoV is commonly excreted, but so are other tobamoviruses, crAssphage, other human gut bacteriophages, human gut bacteria, etc. We pick up thousands of other species, and should be able to combine those measurements to get a much less noisy measurement of the human contribution to a sample.
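How the measurements would be combined is left open here, so the following is only one possible sketch, not something the NAO has implemented. It assumes a median-of-ratios approach, similar in spirit to the normalization used for RNA-seq counts: each reference species gives its own estimate of how far a sample sits above or below that species's typical level, and the median across species is taken as the sample's relative human contribution. All names and counts are hypothetical.

```python
import math
from statistics import median


def human_contribution_factors(ref_counts, total_reads):
    """Estimate one relative scale factor per sample.

    ref_counts: {species: [read count in each sample]}
    total_reads: [total reads in each sample]
    Factors are relative to the geometric mean across the given samples;
    they are not absolute numbers of people.
    """
    n = len(total_reads)
    log_ratios = [[] for _ in range(n)]
    for species, counts in ref_counts.items():
        if len(counts) != n:
            raise ValueError(f"{species}: expected {n} counts")
        if any(c <= 0 for c in counts):
            continue  # assumption: drop species missing from any sample
        log_ra = [math.log(c / t) for c, t in zip(counts, total_reads)]
        baseline = sum(log_ra) / n  # log of geometric mean across samples
        for i, value in enumerate(log_ra):
            log_ratios[i].append(value - baseline)
    if not log_ratios[0]:
        raise ValueError("no usable reference species")
    # The median across species is robust to a few oddly behaving species.
    return [math.exp(median(ratios)) for ratios in log_ratios]


def calibrate(target_counts, total_reads, factors):
    """Target relative abundance divided by each sample's scale factor."""
    return [(c / t) / f for c, t, f in zip(target_counts, total_reads, factors)]


# Made-up counts for two samples; the second has about half the
# human contribution of the first.
total_reads = [1_000_000_000, 1_000_000_000]
ref_counts = {
    "PMMoV": [40_000, 20_000],
    "crAssphage": [100_000, 52_000],
    "other_tobamovirus": [8_000, 3_900],
}
factors = human_contribution_factors(ref_counts, total_reads)
calibrated = calibrate([200, 100], total_reads, factors)
print(factors, calibrated)
```

With thousands of reference species rather than three, the median should be much less sensitive to any one species's quirks, such as diet-driven swings in PMMoV. A real implementation would also need to handle zero counts and species-specific biases more carefully than skipping them.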

This isn't something the NAO has been able to look into yet, but I still think it's quite promising.
