Multispecies Metagenomic Calibration |
June 24th, 2025 |
nao |
This is something I wrote internally in late-2022. Sharing it now with light edits, additional context, and updated links after the idea came up at the Microbiology of the Built Environment conference I'm attending this week.
Metagenomic sequencing data is fundamentally relative: each observation is a fraction of all the observations in a sample. If you want to make quantitative observations, however, like understanding whether there's been an increase in the number of people with some infection, you need to calibrate these observations. For example, there could be variation between samples due to variation in:
- Changes in how many humans are contributing to a sample.
- Has it been raining? (Especially in areas with combined sewers, but also a factor with nominally separated sewers.)
- What have people been eating lately?
- What temperature has it been?
- How concentrated is the sample?
- etc
If you're trying to understand growth patterns all of this is noise; can we reverse this variation? I'm using "calibration" to refer to this process of going from raw per-sample pathogen read counts to estimates of how much of each pathogen was originally shed into sewage.
The simplest option is not to do any calibration, and just consider raw relative abundance: counts relative to the total number of reads in the sample. For example, this is what Marc Johnson and Dave O'Connor are doing.
It seems like you ought to be able to do better if you normalize by the number of reads matching some other species humans excrete. It's common to use PMMoV for this: peppers are commonly infected with PMMoV, people eat peppers, people excrete PMMoV. All else being equal, the amount of PMMoV in a sample should be proportional to the human contribution to the sample. This is especially common in PCR work, where you take a PCR measurement of your target, and then present it relative to a PCR measurement of PMMoV. For example, this is what WastewaterSCAN does.
Because the NAO is doing very deep metagenomic sequencing, around 1B read pairs (300Gbp) per sample, we ought to be able to calibrate against many species at once. PMMoV is commonly excreted, but so are other tobamoviruses, crAssphage, other human gut bacteriophages, human gut bacteria, etc. We pick up thousands of other species, and should be able to combine those measurements to get a much less noisy measurement of the human contribution to a sample.
This isn't something the NAO has been able to look into yet, but I still think it's quite promising.
Comment via: facebook, lesswrong, mastodon, bluesky, substack