Multispecies Metagenomic Calibration

June 24th, 2025
nao
Cross-posted from my NAO Notebook.

This is something I wrote internally in late-2022. Sharing it now with light edits, additional context, and updated links after the idea came up at the Microbiology of the Built Environment conference I'm attending this week.

Metagenomic sequencing data is fundamentally relative: each observation is a fraction of all the observations in a sample. If you want to make quantitative observations, however, like understanding whether there's been an increase in the number of people with some infection, you need to calibrate these observations. For example, there could be variation between samples due to variation in:

  • Changes in how many humans are contributing to a sample.
  • Has it been raining? (Especially in areas with combined sewers, but also a factor with nominally separated sewers.)
  • What have people been eating lately?
  • What temperature has it been?
  • How concentrated is the sample?
  • etc

If you're trying to understand growth patterns all of this is noise; can we reverse this variation? I'm using "calibration" to refer to this process of going from raw per-sample pathogen read counts to estimates of how much of each pathogen was originally shed into sewage.

The simplest option is not to do any calibration, and just consider raw relative abundance: counts relative to the total number of reads in the sample. For example, this is what Marc Johnson and Dave O'Connor are doing.

It seems like you ought to be able to do better if you normalize by the number of reads matching some other species humans excrete. It's common to use PMMoV for this: peppers are commonly infected with PMMoV, people eat peppers, people excrete PMMoV. All else being equal, the amount of PMMoV in a sample should be proportional to the human contribution to the sample. This is especially common in PCR work, where you take a PCR measurement of your target, and then present it relative to a PCR measurement of PMMoV. For example, this is what WastewaterSCAN does.

Because the NAO is doing very deep metagenomic sequencing, around 1B read pairs (300Gbp) per sample, we ought to be able to calibrate against many species at once. PMMoV is commonly excreted, but so are other tobamoviruses, crAssphage, other human gut bacteriophages, human gut bacteria, etc. We pick up thousands of other species, and should be able to combine those measurements to get a much less noisy measurement of the human contribution to a sample.

This isn't something the NAO has been able to look into yet, but I still think it's quite promising.

Comment via: facebook, lesswrong, mastodon, bluesky, substack

Recent posts on blogs I like:

Linkpost for March

Effective Altruism

via Thing of Things March 2, 2026

The Newest Technology in Frozen

There are lots of different things in Frozen that are new-ish, but my dad and I were wondering: what is the actual newest thing in Frozen? This led me to watch Frozen a lot while taking notes. Some of the things I found included: Elastic hair-ties A safety …

via Lily Wise's Blog Posts March 1, 2026

2025-26 New Year review

This is an annual post reviewing the last year and setting intentions for next year. I look over different life areas (work, health, parenting, effectiveness, etc) and analyze my life tracking data. Highlights include a minimal group house, the usefulness…

via Victoria Krakovna January 19, 2026

more     (via openring)