I'm Jeff Kaufman, and I co-lead the Nucleic Acid Observatory at SecureBio. Today I'm going to be talking about the computational methods we use in our pathogen-agnostic early warning system.

Before I get started, however, this is all the work of a team:

I'm honored to work with this group!

So: what are we up against?

A typical pandemic starts with an infected individual.

They start having symptoms and become infectious.

Someone else gets infected.

Then a few more.

Those people start having symptoms and become infectious.

More people get infected.

At this point doctors notice something unusual is happening and raise the alarm.

But what if there's a long pre-symptomatic period? Imagine something like a faster-spreading HIV, maybe airborne.

You'd still start with one infected individual.

But then you get another.

And more.

Still without the symptoms that would tell us something was wrong.

No one notices.

Then the first person starts showing symptoms, but that's unlikely to be enough for someone to notice something is wrong: there are individuals who report unusual symptoms all the time.

Another person starts showing the same unusual symptoms, but they're far enough away that no one puts the pieces together yet.

By the time you have symptom clusters large enough to be noticed, a very large fraction of the population has been infected with something seriously dangerous. We call this scenario a "stealth" pandemic.

I've talked about this as presymptomatic spread, but note you'd also get the same pattern if the initial symptoms were very generic and it just looked like a cold going around.

Barriers to engineering and synthesis are falling.

There are dozens of companies offering synthesis, many with inadequate screening protocols. And benchtop synthesizers are becoming more practical.

Reverse genetics is getting easier as well, as more students learn how to do it and the knowledge spreads.

And frontier AI systems can provide surprisingly useful advice, helping an attacker bypass pitfalls.

What if a stealth pandemic were spreading now? Could our military personnel and general public already be infected? How would we know?

We probably wouldn't. We are not prepared for this threat. How can we become prepared?

We need an early warning system for stealth pandemics. We're building one.

We're taking a multi-step approach, starting with wastewater.

And pooled nasal swabs.

We extract the nucleic acids, physically enrich for viruses, and run deep untargeted metagenomic sequencing.

Then we analyze that data, looking for evidence of genetic engineering.

When this flags a cluster of sequencing reads indicative of a potential threat, analysts evaluate the decision and the data the decision was based on.

Let's go over each of these steps in more detail.

Where and what should we sample?

We've done the most work with municipal wastewater. It has some serious advantages, but also serious disadvantages.

The best thing about it is the very low cost per covered individual.

A whole city, in a single sample.

On the other hand, only a tiny fraction of the nucleic acids in wastewater come from human-infecting pathogens, and even fewer come from non-gastrointestinal viruses.

We looked into this in a paper I link at the end. We were trying to estimate the fraction of sequencing reads that might come from covid when 1% of people had been infected in the last week.

While our distribution is still wide, we estimate about one in ten million sequencing reads would come from covid.

With influenza it's one to two orders of magnitude lower.

You need a lot of sequencing to see something that's only one in 100M reads!

On the other hand, bulk sequencing keeps getting cheaper. The list price of a NovaSeq 25B is $16k for about 18B reads, or about $900 per billion reads. This is low enough to make this approach practical.
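To make the arithmetic behind these numbers concrete, here is a minimal sketch. It assumes matching reads arrive independently, so the count of matching reads is roughly Poisson; that is a simplifying assumption for illustration, not a description of the production analysis. The function names are hypothetical; the abundances and prices are the figures quoted above.

```python
import math


def expected_hits(total_reads, relative_abundance):
    """Expected number of reads from the target at a given sequencing depth."""
    return total_reads * relative_abundance


def p_at_least(total_reads, relative_abundance, min_hits=1):
    """P(at least min_hits matching reads), under the assumed Poisson model."""
    lam = expected_hits(total_reads, relative_abundance)
    p_fewer = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(min_hits))
    return 1.0 - p_fewer


# Figures quoted in the talk.
LIST_PRICE_USD = 16_000   # one 25B run, list price
READS_PER_RUN = 18e9      # about 18B reads per run
cost_per_billion = LIST_PRICE_USD / (READS_PER_RUN / 1e9)  # about $889

if __name__ == "__main__":
    for label, ra in [("covid-like, 1 in 10M", 1e-7), ("flu-like, 1 in 100M", 1e-8)]:
        for depth in (1e8, 1e9, 2e9):
            print(f"{label}: depth {depth:.0e} -> "
                  f"{expected_hits(depth, ra):.1f} expected reads, "
                  f"P(>=1) = {p_at_least(depth, ra):.3f}")
    print(f"cost per billion reads: ${cost_per_billion:.0f}")
```

Under this model, a target at one in 100M reads needs on the order of 100M reads just to expect a single matching read, which is why depth dominates the design.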

The largest cost at scale, however, is still sequencing.

You still need a lot of sequencing.

A lot of sequencing.

Very deep sequencing.

We sequence a typical sample to between one and two billion reads.

For an approach with the opposite tradeoffs we're also exploring pooled nasal swabs. We do anterior nasal, like a covid test.

With swabs, the fraction of sequencing reads that are useful is far higher.

We looked into this in our swab sampling report, which I'll also link at the end. With covid, a swab from a single sick person might have one in a thousand sequencing reads match the virus.

Instead of sampling a single person, though...

We sample pools. This is a lot cheaper, since you only need to prepare a single sample.

We estimate that with a pool of 200 people, if 1% of the population got covid in the last week, about one in 5,000 sequencing reads would come from the virus.

Because relative abundance is higher, cost-per-read isn't a driving factor. Instead of sending libraries out to run a big Illumina sequencer...

We can sequence them on a Nanopore, right on the bench.

The big downside of pooled nasal swab sampling, though, is you need to get swabs from a lot of people.

A lot of people.

A lot a lot.

We go out to busy public places and ask people for samples. We get about 30 swabs per hour. This isn't yet cost-competitive with wastewater sequencing as a way to see non-GI viruses, but as we keep iterating on our sample collection and processing we're optimistic that we'll get there.

So you collect your samples and sequence them to learn what nucleic acids they contain. What do you do with that data?

How do you figure out what pathogens are present?

The traditional way is to match sequencing reads against known things. We do this! It's great that sequencing lets us check for all known viruses that infect humans, or that could evolve or be modified to infect humans. But this isn't pathogen agnostic, and it doesn't detect novel pathogens.

Our primary production approach is a kind of genetic engineering detection.

If you were trying to cause harm today, you'd very likely start with something known and modify it. Many ways of doing this create a detectable signature:

A junction between modified and unmodified portions of a genome. We can look for reads that cover these junctions.

Here's a real sequencing read we saw on February 27th

The first part is a solid match for HIV

But the second part is not.

This is a common lentiviral vector, likely lab contamination. But it's genetically engineered, and the system flagged it. We've flagged three of these so far, all derived from HIV.
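To make "a read that covers a junction" concrete, here is a toy sketch. It is an illustration only and not the production system: it uses exact k-mer lookups instead of a real aligner, handles only the forward strand, and only detects a known-then-unknown layout. The function names, the tiny `REFERENCE` string, and the parameters `k` and `min_flank` are all hypothetical.

```python
def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def find_junction(read, reference, k=8, min_flank=3):
    """Return the read offset where a reference match stops, or None.

    A read is flagged when it starts with at least min_flank k-mers found
    in the reference and continues with at least min_flank k-mers that are
    all absent from the reference.
    """
    ref_kmers = set(kmers(reference, k))
    hits = [km in ref_kmers for km in kmers(read, k)]

    # Length of the leading run of k-mers that match the reference.
    n_match = 0
    for hit in hits:
        if not hit:
            break
        n_match += 1

    n_rest = len(hits) - n_match
    if n_match >= min_flank and n_rest >= min_flank and not any(hits[n_match:]):
        # The first non-matching k-mer is the first to include a foreign
        # base, and that base is the k-mer's last position.
        return n_match + k - 1
    return None


# Toy data: a made-up "known" genome and a read that leaves it halfway through.
REFERENCE = "ACGGACAGCAAGCCGAGGCAAACCGGACAGCAGCGAGCGG"
chimeric_read = REFERENCE[5:25] + "T" * 12   # 20 known bases, then foreign sequence
clean_read = REFERENCE[5:30]                 # entirely from the known genome

if __name__ == "__main__":
    print(find_junction(chimeric_read, REFERENCE))  # offset 20
    print(find_junction(clean_read, REFERENCE))     # None
```

A real system would also need to tolerate sequencing errors and natural variation, and to handle reverse-complement reads, none of which this sketch attempts.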

To validate this approach of looking for surprising junctions we started in silico

We picked a range of viruses, simulated engineered insertions, simulated reads from these genomes, and verified that the system flags the reads. The system flagged 71% of what a simple model says would be flaggable.
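As an illustration of what a "simple model" of flaggability could look like, here is a hypothetical one; the talk does not specify the model actually used. It assumes reads of fixed length start uniformly at random along a linear genome, and counts a read as flaggable if it has at least `min_flank` bases on each side of the junction. All names and parameter values are assumptions.

```python
def junction_read_fraction(genome_len, read_len, junction, min_flank):
    """Fraction of uniformly placed reads that span the junction.

    `junction` is the index of the first base after the junction. A read
    starting at s covers bases [s, s + read_len) and is counted when it has
    at least min_flank bases on both sides of the junction.
    """
    first_start = max(0, junction + min_flank - read_len)
    last_start = min(genome_len - read_len, junction - min_flank)
    n_covering = max(0, last_start - first_start + 1)
    n_starts = genome_len - read_len + 1
    return n_covering / n_starts


if __name__ == "__main__":
    # Hypothetical 10 kb genome, 150-base reads, junction in the middle.
    frac = junction_read_fraction(10_000, 150, 5_000, 20)
    print(f"{frac:.4f} of reads from this genome would span the junction")
```

Under these assumptions only about 1% of reads from the engineered genome span the junction, which illustrates why methods that do not need to see the junction itself can make use of more of the data.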

Then we validated real world performance with spike-ins. We took raw municipal wastewater influent.

And added viral particles with engineered genomes. With three different HIV-derived genomes we flagged 79% to 89% of reads that matched the junctions.

In addition to detecting pathogens via...

direct match to known things and by flagging suspicious junctions, we can flag based on growth patterns.

Initially we're taking a reference-based approach, detecting pathogens based on known things being introduced or becoming common. This can be more sensitive than junction detection because you don't have to see the junction to know something's wrong. This means a larger fraction of sequencing reads are informative.

This only works if the base genome is very rare in your sample type, but it still limits the adversary's options.

We've also explored a reference-free approach, looking for exponentially-growing k-mers. This requires more data than we have so far, but it's the hardest one for an adversary to bypass: it tracks the fundamental signal of a pathogen spreading through the population.
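A minimal sketch of the core idea, assuming you already have per-sample counts for one k-mer and the total reads in each sample: fit a straight line to the log of relative abundance over time, and treat a steep positive slope as evidence of exponential growth. The function name, the pseudocount, and the example numbers are hypothetical, and a real system would need a statistical test and a way to handle billions of k-mers.

```python
import math


def log_growth_rate(counts, totals, pseudocount=0.5):
    """Least-squares slope of log relative abundance per time step.

    counts[i] is the k-mer's count in sample i and totals[i] is that
    sample's total read count; samples are assumed evenly spaced in time.
    The pseudocount avoids log(0) when a k-mer is absent from a sample.
    """
    ys = [math.log((c + pseudocount) / t) for c, t in zip(counts, totals)]
    xs = list(range(len(ys)))
    n = len(ys)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


if __name__ == "__main__":
    totals = [1e9] * 5
    doubling = [10, 20, 40, 80, 160]   # doubles every time step
    steady = [50, 50, 50, 50, 50]      # background k-mer
    print(log_growth_rate(doubling, totals, pseudocount=0.0))  # about ln 2
    print(log_growth_rate(steady, totals, pseudocount=0.0))    # about 0
```

A slope of ln 2 per step corresponds to doubling every step; with low counts the estimate is noisy, which is one way to see why this approach needs more data.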

The final method we're exploring is novelty detection.

If we can understand these samples well enough, we can flag something introduced just for being unexpected in the sample. This is a major research problem, and we're excited to share our data and collaborate with others here. For example, Willie Neiswanger and Oliver Liu at USC have trained a metagenomic foundation model on this data, which they're hoping to apply to detecting engineered sequences. We're also collaborating with Ryan Teo in Nicole Wheeler's lab at the University of Birmingham on better forms of metagenomic assembly.

How big would a system like this need to be? It depends heavily on our desired sensitivity.

To detect when 1:100 people are infected

We need a medium amount of wastewater sequencing

Or swab collection.

To detect

when 1:1,000 people have been infected

We'd need 10x the sequencing

Or 10x the swabbing
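The tenfold scaling follows from a simple proportionality, sketched here under the assumption that a pathogen's relative abundance in the sample scales linearly with the fraction of people infected. The function name and the target of 100 matching reads are hypothetical; the one-in-ten-million figure is the wastewater covid estimate quoted earlier, used only as an example input.

```python
def reads_needed(target_hits, ra_at_reference, prevalence, reference_prevalence=0.01):
    """Sequencing depth needed to expect target_hits matching reads.

    Assumes relative abundance is proportional to prevalence, anchored at
    ra_at_reference when prevalence equals reference_prevalence.
    """
    relative_abundance = ra_at_reference * prevalence / reference_prevalence
    return target_hits / relative_abundance


if __name__ == "__main__":
    at_1_in_100 = reads_needed(100, 1e-7, 0.01)
    at_1_in_1000 = reads_needed(100, 1e-7, 0.001)
    print(f"1:100   -> {at_1_in_100:.2e} reads")
    print(f"1:1,000 -> {at_1_in_1000:.2e} reads")
    print(f"ratio: {at_1_in_1000 / at_1_in_100:.1f}x")
```

The same reasoning applies to swabs: at a tenth of the prevalence, roughly ten times as many people need to be sampled to expect the same number of infected swabs.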

We've made a simulator for estimating the sensitivity of different system designs.

Given a system configuration, it estimates how likely it is to flag a novel pathogen before some fraction of the population has ever been infected

Let's say we target flagging before 1:1,000 people have ever been infected, we're looking at something that sheds and spreads like covid, and we're using the genetic engineering approach of looking for suspicious junctions.

To detect this with wastewater our simulator estimates you'd need about 12 NovaSeqX runs weekly, costing about $16M/y.

On the other hand, to detect this with nasal swabs our simulator estimates you'd need about ten sites each with daily sampling and sequencing, costing about $5M/y.

Since $5M is less than $16M, should we focus just on swabs? No, we should explore both. There's a lot of uncertainty in both estimates, which could be larger than the difference in estimated cost. Swabbing is also harder to scale, because you need a large collection operation. Finally, different pathogens are more abundant in different sample types.

Today I've described an approach to pathogen-agnostic detection, starting with wastewater and nasal swabs, extracting and sequencing the nucleic acids, and then applying a range of computational methods to identify engineered pathogens. This is a system that we need to defend against stealth pathogens, and we're building it. Some parts are ready to go today, such as junction-based genetic engineering detection. Other parts we're close to, and are a matter of engineering and tuning, such as reference-based growth detection. And others need serious research, such as understanding complex samples so well that we can flag things for being new. We're excited to be working on this, and we're interested in collaborating with others who are thinking along similar lines!

Thank you.

I've prepared a webpage with links to the resources I discussed in this talk:

Does anyone have any questions?

And if I don't get to your question, I can be reached at jeff@securebio.org