Multispecies Metagenomic Calibration

June 24th, 2025
nao
Cross-posted from my NAO Notebook.

This is something I wrote internally in late-2022. Sharing it now with light edits, additional context, and updated links after the idea came up at the Microbiology of the Built Environment conference I'm attending this week.

Metagenomic sequencing data is fundamentally relative: each observation is a fraction of all the observations in a sample. If you want to make quantitative observations, however, like understanding whether there's been an increase in the number of people with some infection, you need to calibrate these observations. For example, there could be variation between samples due to variation in:

  • Changes in how many humans are contributing to a sample.
  • Has it been raining? (Especially in areas with combined sewers, but also a factor with nominally separated sewers.)
  • What have people been eating lately?
  • What temperature has it been?
  • How concentrated is the sample?
  • etc

If you're trying to understand growth patterns all of this is noise; can we reverse this variation? I'm using "calibration" to refer to this process of going from raw per-sample pathogen read counts to estimates of how much of each pathogen was originally shed into sewage.

The simplest option is not to do any calibration, and just consider raw relative abundance: counts relative to the total number of reads in the sample. For example, this is what Marc Johnson and Dave O'Connor are doing.

It seems like you ought to be able to do better if you normalize by the number of reads matching some other species humans excrete. It's common to use PMMoV for this: peppers are commonly infected with PMMoV, people eat peppers, people excrete PMMoV. All else being equal, the amount of PMMoV in a sample should be proportional to the human contribution to the sample. This is especially common in PCR work, where you take a PCR measurement of your target, and then present it relative to a PCR measurement of PMMoV. For example, this is what WastewaterSCAN does.

Because the NAO is doing very deep metagenomic sequencing, around 1B read pairs (300Gbp) per sample, we ought to be able to calibrate against many species at once. PMMoV is commonly excreted, but so are other tobamoviruses, crAssphage, other human gut bacteriophages, human gut bacteria, etc. We pick up thousands of other species, and should be able to combine those measurements to get a much less noisy measurement of the human contribution to a sample.

This isn't something the NAO has been able to look into yet, but I still think it's quite promising.

Comment via: facebook, lesswrong, mastodon, bluesky, substack

Recent posts on blogs I like:

Ozy at LessOnline!

I will once again be a guest at LessOnline, alongside many other writers whom you no doubt like less than you like me: Scott Alexander, dynomight, Georgia Ray, David Friedman, Nicholas Decker, Jacob Falkovich, Kelsey Piper, Alicorn, Aella, etc.

via Thing of Things March 23, 2026

Daycares and the Brown School

As someone in Somerville I notice that there are quite high prices regarding childcare. The average family in Somerville pays $1,100 to $3,500 for daycare per month, and I want to make the costs more affordable. I have also noticed that housing is quite …

via Lily Wise's Blog Posts March 22, 2026

2025-26 New Year review

This is an annual post reviewing the last year and setting intentions for next year. I look over different life areas (work, health, parenting, effectiveness, etc) and analyze my life tracking data. Highlights include a minimal group house, the usefulness…

via Victoria Krakovna January 19, 2026

more     (via openring)