Jeff Kaufman's Writing
https://www.jefftk.com/p

Krona Compare
https://www.jefftk.com/p/krona-compare
19 Jul 2024 08:00:00 EST<p><span>
</span>
<i>Cross-posted from my <a href="https://data.securebio.org/jefftk-notebook/sample-vs-global-prevalence">NAO
Notebook</a>.</i>
<p>
When trying to understand how metagenomic samples differ I often want to
drill down through the taxonomic hierarchy, comparing relative abundances.
I've tried several tools for this, existing and custom, and haven't been all
that happy with any. For most purposes, the tool I like most
is <a href="https://github.com/marbl/Krona/wiki">Krona</a>, which shows an
interactive chart. For example, here's Krona showing the results of running
the NAO's <a href="https://github.com/naobservatory/mgs-pipeline">v1
metagenomic sequencing pipeline</a> on the
unenriched <a href="https://en.wikipedia.org/wiki/Hyperion_sewage_treatment_plant">Hyperion
Treatment Plant</a> samples
from <a href="https://pubmed.ncbi.nlm.nih.gov/34550753/">Rothman et
al. 2021</a>:
</p>
<p>
<a href="https://www.jefftk.com/rothman-krona-htp-unenriched-big.png"><img src="https://www.jefftk.com/rothman-krona-htp-unenriched.png" width="550" height="309" class="mobile-fullwidth" style="max-width:100.0vw; max-height:56.2vw;" srcset="https://www.jefftk.com/rothman-krona-htp-unenriched.png 550w,https://www.jefftk.com/rothman-krona-htp-unenriched-2x.png 1100w"><div style="height:min(56.2vw, 309px)" class="image-vertical-spacer"></div></a>
<br>
(<a href="https://www.jefftk.com/krona-rothman-htp">interactive version</a>)
</p>
<p>
What I often wish I had, however, are <i>linked</i> Krona charts, where I
could see multiple samples at once, and drilling down in one sample showed
the corresponding portion of the other samples. After failing to find
something like this, I hacked something together by monkey-patching the
output of Krona. Here's it comparing the samples from several wastewater
treatment plants in the same study:
</p>
<p>
<a href="https://www.jefftk.com/rothman-krona-combined-big.png"><img src="https://www.jefftk.com/rothman-krona-combined.png" width="550" height="309" class="mobile-fullwidth" style="max-width:100.0vw; max-height:56.2vw;" srcset="https://www.jefftk.com/rothman-krona-combined.png 550w,https://www.jefftk.com/rothman-krona-combined-2x.png 1100w"><div style="height:min(56.2vw, 309px)" class="image-vertical-spacer"></div></a>
</p>
<p>
When I click on "viruses" in any of the plots, all four zoom in on the
viral fraction:
</p>
<p>
<a href="https://www.jefftk.com/rothman-krona-viruses-combined-big.png"><img src="https://www.jefftk.com/rothman-krona-viruses-combined.png" width="550" height="309" class="mobile-fullwidth" style="max-width:100.0vw; max-height:56.2vw;" srcset="https://www.jefftk.com/rothman-krona-viruses-combined.png 550w,https://www.jefftk.com/rothman-krona-viruses-combined-2x.png 1100w"><div style="height:min(56.2vw, 309px)" class="image-vertical-spacer"></div></a>
</p>
<p>
That's a lot
of <a href="https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=info&id=12234">Tobamovirus</a>!
</p>
<p>
I only just made this, so it's likely buggy, but if that doesn't put you
off you're welcome to give it a try. The interactive version of the charts
above is <a href="https://www.jefftk.com/rothman-unenriched-krona-combined">here</a> (warning:
60MB) and the generation code is open
source, <a href="https://github.com/naobservatory/krona-compare">on
github</a>.
</p>
<p>
</p>
<hr>
<p>
If you're interested in the technical details of how I made this:
</p>
<ul>
<li><p>It's a wrapper around the <code>ktImportTaxonomy</code> command
from <a href="https://github.com/marbl/Krona/wiki/KronaTools">KronaTools</a>.
</p></li>
<li><p>To get multiple charts on the same page, they're each in their
own <code>iframe</code>, via <code>srcdoc</code>.
</p></li>
<li><p>There is unfortunately no CSS way of saying "please lay out these
rectangles to take up as much of the viewport as possible while
maintaining an aspect ratio close to 1:1", so I use some awkward JS. It
checks each possible number of columns and takes the one that maximizes
the minimum dimension (width or height) of the charts. Luckily there are
only a few options to consider.
</p></li>
<li><p>So that the colors match between the charts, each chart on the
page has the data from all the charts. I reach into
each <code>iframe</code> to set the dataset
dropdown's <code>selectedIndex</code> and
call <code>onDatasetChange</code>. It's not ideal needing to duplicate
the Krona output for each pane in the HTML source and then in the
rendered DOMs, but I don't see another way to keep the colors matching.
</p></li>
<li><p>To intercept navigation, the wrapper rewrites the KronaTools HTML
output to hook <code>navigateForwd</code>, <code>navigateBack</code>,
and, especially, <code>selectNode</code>. It inserts some code that
reaches into all of the other <code>iframe</code>s on the page and
navigates them equivalently.
</p></li>
<li><p>Duplicating <code>selectNode</code> is a little tricky because
normally it takes a <code>Node</code> as an argument, but that's not
equivalent between charts. So I walk up to the root of the tree and then
depth-first search until I find a node with a name matching the intended
target.
</p></li>
<li><p>It's all quite heavy, with the Rothman et al. (2021) screenshot
above coming from a <a href="https://www.jefftk.com/rothman-unenriched-krona-combined">60MB
HTML file</a>, but it's fast enough on my computer to be useful to me.
</p></li>
</ul>
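<p>
The column-count search in the layout bullet above is simple enough to sketch. Here's a hypothetical Python version (the actual krona-compare code does this in JS, and the function name and signature are my invention), scoring each candidate layout by the charts' minimum dimension with the charts kept square, per the 1:1 aspect-ratio goal:
</p>

```python
import math

def best_columns(n_charts, viewport_w, viewport_h):
    """Try every possible column count and keep the one whose grid
    cells let the charts be largest, scoring each layout by the
    charts' minimum dimension (here with square, 1:1 charts)."""
    def chart_side(cols):
        rows = math.ceil(n_charts / cols)
        # largest square chart fitting in a (viewport_w / cols) by
        # (viewport_h / rows) grid cell
        return min(viewport_w / cols, viewport_h / rows)
    return max(range(1, n_charts + 1), key=chart_side)
```

<p>
For four charts in a square viewport this picks a 2x2 grid; in a tall narrow viewport it falls back to a single column. As the post notes, with only a handful of charts there are few candidates, so brute force is fine.
</p>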
<p><i>Comment via: <a href="https://www.facebook.com/jefftk/posts/pfbid0g8ZYe2F4MiDer1NeMtshFwirLNBQbbbon4xsAbiNJqhm5vnHDTm5Gxyeq7scZ8zDl">facebook</a>, <a href="https://lesswrong.com/posts/he9BpRWuds5istBus">lesswrong</a>, <a href="https://mastodon.mit.edu/@jefftk/112816144944344932">mastodon</a></i></p>
Sample Prevalance vs Global Prevalence
https://www.jefftk.com/p/sample-prevalance-vs-global-prevalence
08 Jul 2024 08:00:00 EST<p><span>
</span>
<p>
<i>Cross-posted from my <a href="https://data.securebio.org/jefftk-notebook/sample-vs-global-prevalence">NAO
Notebook</a>. Thanks to Evan Fields and Mike McLaren for editorial
feedback on this post.</i>
</p>
<p>
In <a href="https://naobservatory.org/blog/detecting-genetically-engineered-viruses">Detecting
Genetically Engineered Viruses With Metagenomic Sequencing</a> we have:
</p>
<blockquote>
our best guess is that if this system were deployed at the scale of
approximately $1.5M/y it could detect something genetically engineered
that shed like SARS-CoV-2 before 0.2% of people in the monitored
sewersheds had been infected.
</blockquote>
<p>
I want to focus on the last bit: "in the monitored sewersheds". The idea
is, if a system like this is tracking wastewater from New York City, its
ability to raise an alert for a new pandemic will depend on how far along
that pandemic is in that particular city. This is closely related to
another question: what fraction of the global population would have to be
infected before it could raise an alert?
</p>
<p>
There are two main considerations pushing in opposite directions, both based
on the observation that the pandemic will be farther along in some places
than others:
</p>
<ul>
<li><p>
With so many places in the world where a pandemic might start, the
chance that it starts in NYC is quite low. To take the example of
COVID-19, when the first handful of people were sick they were all in
one city in China. Initially, prevalence in monitored sewersheds in
other parts of the world will be zero, while global prevalence will
be greater than zero. This effect should diminish as the pandemic
progresses, but at least in the <1% cumulative incidence
situations I'm most interested in it should remain a significant
factor. This pushes prevalence in your sample population to lag
prevalence in the global population.
</p></li>
<li><p>
NYC is a highly connected city: lots of people travel between there
and other parts of the world. Since pandemics spread as people move
around, places with many long-distance travelers will generally be
infected before places with few. While if you were monitoring an
isolated sewershed you'd expect this factor to cause an additional
lag in your sample prevalence, if you specifically choose places like
NYC we expect instead the high connectivity to reduce lag relative to
global prevalence, and potentially even to lead global prevalence.
</p></li>
</ul>
<p>
My guess is that with a single monitored city, even the optimal one (which
one is that even?) your sample prevalence will significantly lag global
prevalence in most pandemics, but by carefully choosing a few cities to
monitor around the world you can probably get to where it leads global
prevalence. But I would love to see some research and modeling on this:
qualitative intuitions don't take us very far. Specifically:
</p>
<ul>
<li><p>How does prevalence at a highly-connected site compare
to global prevalence during the beginning of a pandemic?</p></li>
<li><p>What if you instead are monitoring a collection of
highly-connected sites?</p></li>
<li><p>What does the diminishing returns curve look like for bringing
additional sites up? Does it go negative at some point, where you are
sampling so many excellent sites that the marginal site is mostly
dilutative?</p></li>
<li><p>If you look at the initial spread of SARS-CoV-2, how much of the
variance in when places were infected is explained by how connected
they are?</p></li>
<li><p>What about with data from the spread of influenza and SARS-CoV-2
variants?</p></li>
<li><p>Are there other major factors aside from connectedness that lead
to earlier infection? Can we model how valuable different sites are
to sample, in a way that can be combined with how operationally
difficult it is to sample in various places?</p></li>
</ul>
<p>
If you know of good work on these sorts of modeling questions or are
interested in collaborating on them, please get in touch! My work email is
<code>jeff</code> at <code>securebio.org</code>.
</p>
<p><i>Comment via: <a href="https://www.facebook.com/jefftk/posts/pfbid0YPNh5A4cDhbdak69CmiYe7BcrgL5UuFNccg7K7XaRzwm965B1JEeQEHsDbNNanmxl">facebook</a>, <a href="https://lesswrong.com/posts/D3o3ed8WRspbgnPGr">lesswrong</a>, <a href="https://forum.effectivealtruism.org/posts/NHZ3HWz5tkzsoBq9Y">the EA Forum</a>, <a href="https://mastodon.mit.edu/@jefftk/112753373310780449">mastodon</a></i></p>
Quick Thoughts on Our First Sampling Run
https://www.jefftk.com/p/quick-thoughts-on-our-first-sampling-run
22 May 2024 08:00:00 EST<p><span>
</span>
<i>Cross-posted from my <a href="https://data.securebio.org/jefftk-notebook/quick-thoughts-on-our-first-sampling-run">NAO Notebook</a></i>
<p>
While the <a href="https://naobservatory.org/">NAO</a> has primarily
focused on wastewater, we're now spinning up a <a href="https://naobservatory.org/blog/updates-spring-2024#pooled-individual-sequencing">swab
sampling effort</a>. The idea is to go to busy public places, ask
people to swab their noses, pool the swabs, sequence them, and look
for novel pathogens. We did our first collection run yesterday:
</p>
<p>
<a href="https://www.jefftk.com/swab-sample-collection-day-one-big.jpg"><img src="https://www.jefftk.com/swab-sample-collection-day-one.jpg" width="550" height="459" class="mobile-fullwidth" style="max-width:100.0vw; max-height:83.5vw;" srcset="https://www.jefftk.com/swab-sample-collection-day-one.jpg 550w,https://www.jefftk.com/swab-sample-collection-day-one-2x.jpg 1100w"><div style="height:min(83.5vw, 459px)" class="image-vertical-spacer"></div></a>
</p>
<p>
After months of planning and getting approvals, it was great to be out
there! Some thoughts on how it went:
</p>
<p>
</p>
<ul>
<li><p>In two hours in Kendall Square we collected 27 swabs, or
13.5/hr. I think with some optimization (below) we could do about ten
times this rate.
</p></li>
<li><p>It was very bursty. Mostly people ignored us, I think because
they thought we were asking for money. When someone did stop to
provide a sample, and especially when multiple people stopped, the
dynamic changed dramatically. The crowd did the advertising for us,
and we had people giving us samples as fast as we could collect
them. Most of our samples came in a small number of bursts.
</p></li>
<li><p>This means we need to be sampling at a location and time when
there's enough foot traffic that we can maintain a crowd: bringing in
each new person while the previous one is still providing a
sample. (This is very familiar from busking.)
</p></li>
<li><p>Mid-morning in Kendall Square was not this: we were sampling at
the time that was convenient for us, which was too late for most of
the morning commuters and too early for people out getting lunch. I
suspect 8am-9am, 12-1pm, and 5pm-6pm would have been much better
times.
</p></li>
<li><p>There are also higher foot traffic areas around Boston than
Kendall Square: we only started in Galaxy Park because that was,
again, convenient for us. In this case, however, convenience matters
enough that for our initial few iterations I think we'll stick to
Kendall.
</p></li>
<li><p>When you don't have a crowd yet, it's really valuable for
people to be able to tell whether they want to participate from
reading the sign. I think Simon did a great job making a
professional-looking sign, but the text describing the core of what
we're asking ("swab your nose, get $5") is too small. For our next
collection run we'll print a new banner with the title and subtitle
swapped.
</p></li>
<li><p>The banner is only readable from one side, but people don't
only come from one direction. The traditional approach is to <a href="https://en.wikipedia.org/wiki/Sandwich_board">wear a sandwich
board</a>, but this doesn't seem very professional. I'm thinking
maybe two of these retractable banners, back to back?
</p></li>
<li><p>We're having people swab their noses and then seal the swab in
a vial before dropping it into a box. I think it's pretty likely that
once we have more experience at this we can work out a system where we
can safely put the swabs directly in the box, though this will require
coordinating with our biosafety officer. This would be cheaper (no
vials) and simpler for participants (no recapping).
</p></li>
<li><p>We're initially testing offering $5/sample. This is a tradeoff
between spending money on compensation to get more participants per
hour, and having staff paid to be out there for more hours. I think
once we have the rest of the operation working well (sampling in good
places and times, good crowd interaction, good signage) we'll be able
to get good results with less compensation ($1, $2, candy bar, granola
bar, etc). But since I think we'll learn faster if we compensate at
the higher rate, because we'll have more people to sample from, I don't
think we should start optimizing that area yet.
</p></li>
</ul>
<p><i>Comment via: <a href="https://www.facebook.com/jefftk/posts/pfbid024N5iJ1u6u8dRNr73W9GDdnXkJV1mzZVLJnYG77y736C57krL6VDCf9MPpQytmaXJl">facebook</a>, <a href="https://lesswrong.com/posts/KpaXzkhAng5xsPmns">lesswrong</a>, <a href="https://mastodon.mit.edu/@jefftk/112487532286312185">mastodon</a></i></p>
Pandemic Identification Simulator
https://www.jefftk.com/p/pandemic-identification-simulator
08 Apr 2024 08:00:00 EST<p><span>
At my </span>
<a href="https://naobservatory.org/">day job</a> I work on
identifying potential pandemics sooner, so we have more time to
respond. I recently made a simulator which pulls a
<a href="https://www.jefftk.com/p/weekly-incidence-including-delay">lot
of</a>
<a href="https://www.jefftk.com/p/wastewater-rna-read-lengths">things
I've</a>
<a href="https://www.medrxiv.org/content/10.1101/2023.12.22.23300450v1.full">been
thinking</a>
<a href="https://www.jefftk.com/p/sequencing-swabs">about recently</a> into a
single estimate. You can read more
<a href="https://naobservatory.org/blog/simulating-approaches-to-metagenomic-pandemic-identification">on
the NAO blog</a> or
<a href="https://data.securebio.org/simulator/">give the simulator a
try</a>.
<p><i>Comment via: <a href="https://www.facebook.com/jefftk/posts/pfbid0E4P62NU5ec2AB1UQ1ESaBeje5hY1g4hNzgLaFpEW3NpsHZhJX65woF6c3AveZxQZl">facebook</a>, <a href="https://lesswrong.com/posts/KvSgty2jY7XrpJk5r">lesswrong</a>, <a href="https://forum.effectivealtruism.org/posts/Mp2ajw83fN3xAtHRq">the EA Forum</a>, <a href="https://mastodon.mit.edu/@jefftk/112237107534075470">mastodon</a></i></p>
Sequencing Swabs
https://www.jefftk.com/p/sequencing-swabs
31 Jan 2024 08:00:00 EST<p><span>
</span>
<i>Cross-posted from my <a href="https://data.securebio.org/jefftk-notebook/sequencing-swabs">NAO Notebook</a></i>
<p>
<a name="update-2024-02-26"></a><b>Update 2024-02-26</b>: due to a <a href="https://github.com/naobservatory/jefftk-analysis/commit/3075bae70b8986938c2b71c4b1e8436b98896754">serious
typo</a> all the results in this post were off by a factor of 100.
Sequencing from individuals still looks promising, but by less than it
did before. I've updated the numbers in the post, and added notes
above the charts to explain how they're wrong. Thanks to Simon Grimm
for catching my mistake.
</p>
<p>
<i>While this is about an area I work in, I'm speaking for myself and
not my organization.</i>
</p>
<p>
At the <a href="https://naobservatory.org/">Nucleic Acid
Observatory</a> we've been mostly looking at metagenomic sequencing of
wastewater as a way to identify a potential future '<a href="https://www.gcsp.ch/publications/securing-civilisation-against-catastrophic-pandemics">stealth</a>'
pandemic, but this isn't because wastewater is the ideal sample type:
sewage is actually really—the joke writes itself. Among other
things it's inconsistent, microbially diverse, and the nucleic acids
have degraded a lot. Instead the key advantage of wastewater is
how practical it is to get wide coverage. Swabbing the noses and/or
throats of everyone in Greater Boston a few times a week would be
nuts, but <a href="https://www.mwra.com/biobot/biobotdata.htm">sampling the
wastewater</a> gets you <a href="https://www.mwra.com/02org/html/whatis.htm">3M</a> people in a
single sample. [1]
</p>
<p>
Imagine, though, that people were enthusiastic about taking samples
and giving them to you to sequence. How much better would that be?
What does "better" even mean? In <a href="https://www.medrxiv.org/content/10.1101/2023.12.22.23300450v1.full">Grimm
et al. (2023)</a> we operationalized "better" as RA<sub>i</sub>(1%):
what fraction of shotgun metagenomic sequencing reads might come from
the pathogen of interest when 1% of people have been infected in the
last week. For example, in our re-analysis of <a href="https://pubmed.ncbi.nlm.nih.gov/34550753/">Rothman et
al. (2021)</a> we found a RA<sub>i</sub>(1%) of ~1e-7 for SARS-CoV-2,
which means we'd estimate that in a week where 1% of people contracted
Covid-19, one in 10M sequencing reads would come from the virus.
Let's say you were able to get throat swabs instead; what might
RA<sub>i</sub>(1%) look like?
</p>
<p>
The ideal way to determine this would be to swab a lot of people, run
untargeted sequencing, analyze the data, and link it to trustworthy
data on how many people were sick. But while public health
researchers do broad surveillance, collecting and testing swabs for
specific things, as far as I can tell no one has combined this with
metagenomic sequencing. [2] Instead, we can get a rough estimate from
looking at studies that just sequenced sick people.
</p>
<p>
In <a href="https://www.nature.com/articles/s41421-021-00248-3">Lu et
al. (2021)</a> they ran shotgun metagenomic RNA sequencing on throat
swabs from sixteen Covid patients in Wuhan. Patients with higher
viral load, and so lower Ct values on their qPCR tests (roughly the
number of times you need to double the amount of SARS-CoV-2 in the
sample until it's detectable), consistently had higher relative
abundance:
</p>
<p>
<a href="https://www.jefftk.com/covid-throat-swab-relative-abundance-big.png"><img src="https://www.jefftk.com/covid-throat-swab-relative-abundance.png" width="550" height="557" class="mobile-fullwidth" style="max-width:100.0vw; max-height:101.3vw;" srcset="https://www.jefftk.com/covid-throat-swab-relative-abundance.png 550w,https://www.jefftk.com/covid-throat-swab-relative-abundance-2x.png 1100w"><div style="height:min(101.3vw, 557px)" class="image-vertical-spacer"></div></a>
</p>
<p>
<i>(Larger version of <a href="https://www.nature.com/articles/s41421-021-00248-3#:~:text=Fig.%202%3A%20Genome%20coverage%20of%20SARS%2DCoV%2D2.">Fig 2.a.1</a>, reconstructed from <a href="https://static-content.springer.com/esm/art%3A10.1038%2Fs41421-021-00248-3/MediaObjects/41421_2021_248_MOESM1_ESM.pdf">Supplementary Table
S1</a>)</i>
</p>
<p>
Imagine we got swabs from a lot of people, of which 1% were sick.
What sort of relative abundances might we get? If we collect only a
few swabs it depends a lot on whether we get anyone who's sick, and if
we do get a sick person then it matters a lot how high their viral
load is. On the other hand, if we collect a very large number of
swabs then we'll just get the average across the population. Assuming
for the moment that we can model "sick" as "similar to one of those
sixteen hospitalized patients", here's a bit of simulating (<a href="https://github.com/naobservatory/jefftk-analysis/blob/main/2024-01-30--simulate-swabs.py">code</a>):
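<p>
A minimal version of that simulation, not the linked code: the sixteen per-patient relative abundances below are placeholder values spanning a few orders of magnitude, where the real analysis uses the measurements from Lu et al. (2021):
</p>

```python
import random

# Placeholder per-patient relative abundances; the real simulation
# uses the sixteen values from Lu et al. (2021).
_rng = random.Random(1)
PATIENT_RA = [10 ** _rng.uniform(-5, -1) for _ in range(16)]

def simulate_pool_ra(n_swabs, p_sick=0.01, n_sims=10_000, seed=0):
    """In each simulation, pool n_swabs swabs where each is sick with
    probability p_sick; a sick swab contributes a randomly chosen
    patient's relative abundance and a healthy swab contributes zero.
    Returns the sorted per-simulation pooled relative abundances, whose
    percentiles correspond to the curves in the chart."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        total = 0.0
        for _ in range(n_swabs):
            if rng.random() < p_sick:
                total += rng.choice(PATIENT_RA)
        results.append(total / n_swabs)
    return sorted(results)
```

<p>
With n_swabs=50 the median pooled abundance is zero: more than half the time no one in the pool is sick, which is the behavior the n=50 curve shows.
</p>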
</p>
<p>
[EDIT: the y-axis on this chart is 100x too high. For example, the
black line should be just below 1e-3]
</p>
<p>
<a href="https://www.jefftk.com/simulated-ra-1pct-for-throat-swabs-big.png"><img src="https://www.jefftk.com/simulated-ra-1pct-for-throat-swabs.png" width="550" height="384" class="mobile-fullwidth" style="max-width:100.0vw; max-height:69.8vw;" srcset="https://www.jefftk.com/simulated-ra-1pct-for-throat-swabs.png 550w,https://www.jefftk.com/simulated-ra-1pct-for-throat-swabs-2x.png 1100w"><div style="height:min(69.8vw, 384px)" class="image-vertical-spacer"></div></a>
</p>
<p>
This is 10k simulations, ordered by the relative abundance each gave.
For example, if 1% of people are sick and you only swab 50 people then
in half the simulations no one in the sample is sick and the relative
abundance is 0, which is why the blue n=50 line only shows up for
percentiles 50% and above. On the other hand, if we collect a huge
number of swabs we end up with pretty consistently 0.08% of sequencing
reads coming from SARS-CoV-2. With 200 swabs the median
RA<sub>i</sub>(1%) value is 0.01%.
</p>
<p>
One major issue with this approach is that the data was collected from
hospitalized patients only. Having a high viral load seems like the
sort of thing that should make you more likely to be hospitalized, so
that should bias Ct values down. On the other hand, people tend to
have lower viral loads later in their infections, and hospitalization
takes a while, which would bias Ct values up. Here's a chart
illustrating this from <a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0258421">Knudtzen
et al. (2021)</a>:
</p>
<p>
<a href="https://www.jefftk.com/ct-by-days-into-disease-and-hospitalization-big.png"><img src="https://www.jefftk.com/ct-by-days-into-disease-and-hospitalization.png" width="550" height="406" class="mobile-fullwidth" style="max-width:100.0vw; max-height:73.8vw;" srcset="https://www.jefftk.com/ct-by-days-into-disease-and-hospitalization.png 550w,https://www.jefftk.com/ct-by-days-into-disease-and-hospitalization-2x.png 1100w"><div style="height:min(73.8vw, 406px)" class="image-vertical-spacer"></div></a>
<i>Note that Cq and Ct are different abbreviations for <a href="https://en.wikipedia.org/wiki/Cycle_of_quantification/qualification">same
thing</a>.</i>
</p>
<p>
Is there a paper that tells us what sort of Ct values we should expect
if we sample a broad swath of infected people?
</p>
<p>
<a href="https://academic.oup.com/ofid/article/9/7/ofac223/6576478">Souverein
et al. (2022)</a> looked at a year's worth of SARS-CoV-2 PCR tests
from a public health facility in the Netherlands. The good news is
these tests averaged two days from symptom onset and they got results
from 20,207 people. The bad news is we only have data from people who
decided to get tested, which still excludes asymptomatics, and these
were combined nasopharyngeal (NP, "deep nose") and oropharyngeal (OP,
"throat") swabs instead of just throat swabs. Still, pretty good!
Comparing their Ct values to what we see in Lu et al. (2021), it looks
like viral loads are generally a lot higher:
</p>
<p>
<a href="https://www.jefftk.com/viral-loads-in-lu-vs-souverein-big.png"><img src="https://www.jefftk.com/viral-loads-in-lu-vs-souverein.png" width="550" height="335" class="mobile-fullwidth" style="max-width:100.0vw; max-height:60.9vw;" srcset="https://www.jefftk.com/viral-loads-in-lu-vs-souverein.png 550w,https://www.jefftk.com/viral-loads-in-lu-vs-souverein-2x.png 1100w"><div style="height:min(60.9vw, 335px)" class="image-vertical-spacer"></div></a>
</p>
<p>
There are two issues with taking this chart literally. One is that
the combined swabs in Souverein should generally have given lower Ct
scores for the same viral load than throat-only swabs would have
given. A quick scan gives me <a href="https://www.medrxiv.org/content/10.1101/2020.05.05.20084889v1.full">Berenger
et al. (2020)</a> where they found a median Ct 3.2 points lower for
nasopharyngeal than throat samples, so we could try to adjust for this
by assuming the Lu Ct values would have been 3.2 points lower:
</p>
<p>
<a href="https://www.jefftk.com/viral-loads-in-lu-vs-souverein-adjusted-big.png"><img src="https://www.jefftk.com/viral-loads-in-lu-vs-souverein-adjusted.png" width="550" height="337" class="mobile-fullwidth" style="max-width:100.0vw; max-height:61.3vw;" srcset="https://www.jefftk.com/viral-loads-in-lu-vs-souverein-adjusted.png 550w,https://www.jefftk.com/viral-loads-in-lu-vs-souverein-adjusted-2x.png 1100w"><div style="height:min(61.3vw, 337px)" class="image-vertical-spacer"></div></a>
</p>
<p>
The other issue, however, is worse: even though it's common to talk
about Ct scores as if they're an absolute measurement of viral load,
they're dependent on your testing setup. A sample that would read Ct
25 with the approach taken in one study might read Ct 30 with the
approach in another. Comparisons based on Ct <i>within</i> a study
don't have this problem, but ones across studies do.
</p>
<p>
So, what can we do? My best guess currently is that the Lu data gives
maybe slightly lower relative abundances than you'd get sampling
random people, but it's hard to say. I'm going to be a bit
unprincipled here, and stick with the Lu data but drop the 20% of
samples with the highest viral loads (3 of 16) to get a conservative
estimate of how high a relative abundance we might see with throat
swabs. This cuts RA<sub>i</sub>(1%) by a factor of ten:
</p>
<p>
[EDIT: the y-axis on this chart is 100x too high. For example, the
black line should be just below 1e-4]
</p>
<p>
<a href="https://www.jefftk.com/simulated-ra-1pct-for-throat-swabs-drop-three-big.png"><img src="https://www.jefftk.com/simulated-ra-1pct-for-throat-swabs-drop-three.png" width="550" height="381" class="mobile-fullwidth" style="max-width:100.0vw; max-height:69.3vw;" srcset="https://www.jefftk.com/simulated-ra-1pct-for-throat-swabs-drop-three.png 550w,https://www.jefftk.com/simulated-ra-1pct-for-throat-swabs-drop-three-2x.png 1100w"><div style="height:min(69.3vw, 381px)" class="image-vertical-spacer"></div></a>
</p>
<p>
I really don't know if this is enough to where the remaining samples
are a good representation of what you'd see with random people in the
community, including asymptomatics, but let's go ahead with it. Then
with 200 swabs the median RA<sub>i</sub>(1%) value is now 4e-5, a
~400x higher relative abundance than we see with wastewater. [3] If
you could cost-effectively swab a large and diverse group of people,
this would allow surveillance with much lower sequencing costs than
wastewater. But that's a big "if": swabbing cost goes up in proportion
to the number of people, and it's hard to avoid drawing from a
correlated subgroup.
</p>
<p>
<i>Thanks to Simon Grimm for conversations leading to this post and
for sending me Lu et al. (2021), to Will Bradshaw for feedback on the
draft and pointing me to Knudtzen et al. (2021) and Souverein et
al. (2022), and to Mike McLaren for feedback on the draft.</i>
</p>
<p>
<br>
[1] Technically it gets you that in two samples, since Biobot tracks
the North System and South System separately. But you can combine them
if you want simpler logistics.
</p>
<p>
[2] If you know of someone who has, or who would if they had the money
for sequencing, please let me know!
</p>
<p>
[3] <a href="https://www.jefftk.com/p/computational-approaches-to-pathogen-detection">Pathogen
identification</a> would also be much easier with swabs, since it's a
far simpler microbiome and the nucleic acids should be in much better
condition.
</p>
<p><i>Comment via: <a href="https://www.facebook.com/jefftk/posts/YmBNyhopAmZ7M8t43">facebook</a>, <a href="https://www.facebook.com/jefftk/posts/pfbid0KmuT4BR2REBXCVr4zDaVKwKKGdedtMByzPUtnTiPtQGKcyQ4aeziaAeBa5qASWzil">facebook</a>, <a href="https://lesswrong.com/posts/ZrNbigaWCHbTTns29">lesswrong</a>, <a href="https://forum.effectivealtruism.org/posts/YmBNyhopAmZ7M8t43">the EA Forum</a>, <a href="https://mastodon.mit.edu/@jefftk/111853683657388071">mastodon</a></i></p>
Wastewater RNA Read Lengths
https://www.jefftk.com/p/wastewater-rna-read-lengths
10 Nov 2023 08:00:00 EST<p><span>
</span>
<i>Cross-posted from my <a href="https://data.securebio.org/jefftk-notebook/wastewater-rna-read-lengths">NAO Notebook</a></i>
<p>
<i>What I'm calling "read length" here should instead have been
"insert length" or "post-trimming read length". When you just say
"read length" that's usually understood to be the raw length output by
the sequencer, which for short-read sequencing is determined only by
your target cycle count.</i>
</p>
<p>
Let's say you're collecting wastewater and running metagenomic RNA
sequencing, with a focus on human-infecting viruses. For <a href="https://www.jefftk.com/p/computational-approaches-to-pathogen-detection">many kinds of
analysis</a> you want a combination of a low cost per base and more
bases per <a href="https://www.jefftk.com/p/what-is-a-sequencing-read">sequencing
read</a>. The lowest cost per base these days, by a lot, comes from
paired-end "short read" sequencing (also called "Next Generation
Sequencing", or "Illumina sequencing" after the main vendor), where
an observation looks like reading some number of bases (often 150)
from each end of a nucleic acid fragment:
</p>
<p>
</p>
<pre>
+------>>>-----+
| forward read |
+------>>>-----+
... gap ...
+------<<<-----+
| reverse read |
+------<<<-----+
</pre>
<p>
Now, if the fragments you feed into your sequencer are short you can
instead get something like:
</p>
<p>
</p>
<pre>
+------>>>-----+
| forward read |
+------>>>-----+
         +------<<<-----+
         | reverse read |
         +------<<<-----+
</pre>
<p>
That is, if we're reading 150 bases from each end and our fragment is
only 250 bases long, we have a negative "gap" and we'll read the 50
bases in the middle of the fragment twice.
</p>
<p>
And if the fragments are very short, shorter than how much you're
reading from each end, you'll get complete overlap (and then read
through into the <a href="https://www.jefftk.com/p/sequencing-intro-ii-adapters">adapters</a>):
</p>
<p>
</p>
<pre>
+------>>>-----+
| forward read |
+------>>>-----+
+------<<<-----+
| reverse read |
+------<<<-----+
</pre>
<p>
One shortcoming of my ASCII art is that it doesn't show how the effective
read length changes: in the complete overlap case it can be quite
short.  For example, if you're doing 2x150 sequencing you're capable
of learning up to 300bp with each read pair, but if the fragment is
only 80bp long then you'll just read the same 80bp from both
directions.  So overlap is not ideal from the perspective of either
minimizing cost per base or maximizing the length of each observation:
a positive gap is better than any overlap, and more overlap is worse.
</p>
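<p>
Concretely, here's a small sketch of that gap/overlap arithmetic; the
function and its names are mine, not part of any pipeline:
</p>

```python
# A small sketch of the gap/overlap arithmetic above; the function and
# its names are my own, not part of any sequencing pipeline.
def describe_pair(fragment_len, read_len=150):
    """Paired-end sequencing reads read_len bases from each end of a
    fragment_len-base fragment; a negative gap means overlap."""
    gap = fragment_len - 2 * read_len
    return {
        "gap": gap,
        # We can't observe more bases than the fragment contains, nor
        # more than the two reads together cover.
        "observed_bases": min(fragment_len, 2 * read_len),
        # Bases read twice when the reads overlap.
        "double_read_bases": max(0, min(-gap, fragment_len, read_len)),
    }

print(describe_pair(400))  # positive gap: 100 bases in the middle unread
print(describe_pair(250))  # negative gap: middle 50 bases read twice
print(describe_pair(80))   # complete overlap: same 80 bases, both directions
```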
<p>
One important question, however, is whether this is something we can
have much control over. Is RNA in wastewater just so badly degraded
that there's not much you can do, or can you prepare the wastewater
for sequencing in a way that minimizes additional fragmentation? The
best way to answer this would be to run some head-to-head comparisons,
sequencing the same wastewater with multiple candidate techniques, but
what can I do just re-analyzing existing data?
</p>
<p>
As part of putting together the NAO's <a href="https://naobservatory.org/reports/predicting-virus-relative-abundance-in-wastewater/">relative
abundance report</a> I'd already identified four studies that had
conducted untargeted metagenomic RNA sequencing of wastewater and made
their data public on the <a href="https://en.wikipedia.org/wiki/Sequence_Read_Archive">SRA</a>:
</p>
<ul>
<li><p><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7845645/">Crits-Christoph
et al. 2021</a>: SF, summer 2020, 300M read pairs, 2x75.
</p></li>
<li><p><a href="https://pubmed.ncbi.nlm.nih.gov/34550753/">Rothman
et al. 2021</a>: LA, fall 2020, 800M read pairs (unenriched only),
2x100 and 2x150.
</p></li>
<li><p><a href="https://www.frontiersin.org/articles/10.3389/fpubh.2023.1145275/full">Spurbeck
et al. 2023</a>: Ohio, winter 2021-2022, 1.8B read pairs, 2x150.
</p></li>
<li><p><a href="https://www.sciencedirect.com/science/article/abs/pii/S0048969720358514">Yang
et al. 2020</a>: urban Xinjiang, China, 2018, 1.4B read pairs, 2x150.
</p></li>
</ul>
<p>
After running them through the NAO's <a href="https://github.com/naobservatory/mgs-pipeline">pipeline</a>
(trimming adapters, assigning species with Kraken2) I plotted the
length distributions of the viral reads:
</p>
<p>
<a href="https://www.jefftk.com/rna-read-lengths-rough-big.png"><img src="https://www.jefftk.com/rna-read-lengths-rough.png" width="550" height="440" class="mobile-fullwidth" style="max-width:100.0vw; max-height:80.0vw;" srcset="https://www.jefftk.com/rna-read-lengths-rough.png 550w,https://www.jefftk.com/rna-read-lengths-rough-2x.png 1100w"><div style="height:min(80.0vw, 440px)" class="image-vertical-spacer"></div></a>
</p>
<p>
This is a little messy, both because it's noisy and because something
about the process gives oscillations in the Crits-Christoph and Rothman
data. [1] Here's a smoothed version:
</p>
<p>
<a href="https://www.jefftk.com/rna-read-lengths-smoothed-big.png"><img src="https://www.jefftk.com/rna-read-lengths-smoothed.png" width="550" height="344" class="mobile-fullwidth" style="max-width:100.0vw; max-height:62.5vw;" srcset="https://www.jefftk.com/rna-read-lengths-smoothed.png 550w,https://www.jefftk.com/rna-read-lengths-smoothed-2x.png 1100w"><div style="height:min(62.5vw, 344px)" class="image-vertical-spacer"></div></a>
</p>
<p>
For each paper this is showing the distribution of lengths of reads
the pipeline assigned to viruses, in cases where it was able to
"collapse" overlapping forward and reverse reads into a single read. [2]
The fraction of "non-collapsed" reads (essentially, ones with a gap)
is also pretty relevant, and I've included that in the legend. This
is why, for example, the area under the purple Yang line can be
smaller than the one under the red Spurbeck line: 42% of Yang viral
reads aren't represented in the chart, compared to 24% of Spurbeck
ones.
</p>
<p>
Note that for the Spurbeck paper I'm only looking at samples from
sites E, F, G, and H: the paper used a range of processing methods and
the method used at those four sites seemed to work much better than
the others (though still not as well as I'd like).
</p>
<p>
These charts show how common each length is as a fraction of all viral
reads, but what if we additionally scale these by what fraction of
reads are viral? If you're studying viruses you probably wouldn't
want 50% longer viral reads if in exchange a tenth as many of your
reads were viral! Here's the same chart, but as a percentage of all
reads instead of just viral reads:
</p>
<p>
<a href="https://www.jefftk.com/rna-read-lengths-of-all-reads-big.png"><img src="https://www.jefftk.com/rna-read-lengths-of-all-reads.png" width="550" height="344" class="mobile-fullwidth" style="max-width:100.0vw; max-height:62.5vw;" srcset="https://www.jefftk.com/rna-read-lengths-of-all-reads.png 550w,https://www.jefftk.com/rna-read-lengths-of-all-reads-2x.png 1100w"><div style="height:min(62.5vw, 344px)" class="image-vertical-spacer"></div></a>
</p>
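<p>
The rescaling itself is simple; as a toy sketch (mine, not the actual
plotting code used here):
</p>

```python
# Toy sketch (not the actual plotting code): rescale a distribution of
# viral read lengths from "fraction of viral reads" to "fraction of
# all reads" by multiplying through by the viral relative abundance.
def as_fraction_of_all_reads(length_counts, n_viral, n_total):
    viral_fraction = n_viral / n_total
    return {
        length: (count / n_viral) * viral_fraction
        for length, count in length_counts.items()
    }

# e.g. 3 viral reads among 1000 total reads:
print(as_fraction_of_all_reads({100: 2, 180: 1}, 3, 1000))
```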
<p>
This makes the Yang result more exciting: not only are they getting
substantially longer reads than the other papers, but they're able to
combine that with a high relative abundance of viruses.
</p>
<p>
Here's the variability I see between the seven Yang samples:
</p>
<p>
<a href="https://www.jefftk.com/rna-read-lengths-yang-big.png"><img src="https://www.jefftk.com/rna-read-lengths-yang.png" width="550" height="344" class="mobile-fullwidth" style="max-width:100.0vw; max-height:62.5vw;" srcset="https://www.jefftk.com/rna-read-lengths-yang.png 550w,https://www.jefftk.com/rna-read-lengths-yang-2x.png 1100w"><div style="height:min(62.5vw, 344px)" class="image-vertical-spacer"></div></a>
</p>
<p>
This is also pretty good: some samples are a bit better or worse but
nothing that strange.
</p>
<p>
Now, while this does look good, we do need to remember these papers
weren't processing the same influent. There are probably a lot of
ways that municipal wastewater in Xinjiang (Yang) differs from, say,
Los Angeles (Rothman). Perhaps it's the fraction of viral RNA in the
influent, different viruses prevalent in the different regions, the
designs of the sewage system, the weather, or any of the many
differences between these studies other than their sample processing
methods.  Still, I expect some of the difference does come from sample
processing, and this makes me mildly optimistic that it's possible to optimize
protocols to get longer reads without giving up viral relative
abundance.
</p>
<p>
<br>
[1] I don't know what causes this, though I've seen similar length
patterns in a few other data sets including <a href="https://journals.asm.org/doi/10.1128/msystems.00651-22">Fierer
et al. 2022</a>, the NAO's sequencing on an Element Aviti, and some
Illumina data shared privately with us.
</p>
<p>
[2] Ideally I would be plotting the lengths after adapter removal and
collapsing only, but because the current pipeline additionally trims
low-quality bases in the same step that's not something I can easily
get. Still, since Illumina output is generally pretty high quality
and so doesn't need much trimming, and what trimming it does need is
usually on the "inside", this should only be a small effect.
</p>
<p><i>Comment via: <a href="https://www.facebook.com/jefftk/posts/pfbid0wn4jEkyhG2Wxe23zJmg1rdB7K8NTgPHSHe4yddJMJYxjr1WhA9DEurFKBFVgkTJnl">facebook</a>, <a href="https://lesswrong.com/posts/43EWYRcW6YcubFsrs">lesswrong</a>, <a href="https://mastodon.mit.edu/@jefftk/111386893591930433">mastodon</a></i></p>
Computational Approaches to Pathogen Detection
https://www.jefftk.com/p/computational-approaches-to-pathogen-detection
31 Oct 2023 08:00:00 EST<p><span>
</span>
<i>Cross-posted from my <a href="https://data.securebio.org/jefftk-notebook/computational-approaches-to-pathogen-detection">NAO Notebook</a></i>
<p>
<i>While this post is my perspective and not an official post of my <a href="https://securebio.org/">employer</a>'s, it also draws on a lot
of collaborative work with others at the <a href="https://naobservatory.org/">Nucleic Acid Observatory</a>
(NAO).</i>
</p>
<p>
One of the future scenarios I'm most worried about is someone creating
a "<a href="https://www.gcsp.ch/publications/securing-civilisation-against-catastrophic-pandemics">stealth</a>"
pandemic. Imagine a future HIV that first infects a large number of
people with minimal side effects and only shows its nasty side after
it has spread very widely. This is not something we're prepared
for today: current detection approaches (symptom reporting, a
doctor noticing an unusual pattern) require visible effects.
</p>
<p>
Over the last year, with my colleagues at the <a href="https://naobservatory.org/">NAO</a>, I've been exploring one
promising method of identifying this sort of pandemic. The overall
idea is:
</p>
<p>
</p>
<ol>
<li><p>Collect some sort of biological material from a lot of people
on an ongoing basis, for example by sampling sewage.
</p></li>
<li><p>Use metagenomic sequencing to learn what nucleic acids are in
these samples.
</p></li>
<li><p>Run novel-pathogen detection algorithms on the sequencing data.
</p></li>
<li><p>When you find something sufficiently concerning, follow up
with tests for the specific thing you've found.
</p></li>
</ol>
<p>
While there are important open questions in all four of these, I've
been most focused on the third: once you have metagenomic sequencing
data, what do you do?
</p>
<p>
I see four main approaches. You can look for sequences that are:
</p>
<p>
</p>
<dl>
<dt>
<b>Dangerous</b>:
</dt>
<dd>
<p>There are some genetic sequences that code for dangerous things
that we should not normally see in our samples. If you see a series
of base pairs that are unique to smallpox that's very concerning! The
main downside of this approach, if you want to extend it beyond <a href="https://www.selectagents.gov/sat/list.htm">smallpox etc.</a>, is
that you need to make a list of non-obvious dangerous things, which is
in itself a dangerous thing to do: what if your list is stolen and it
points people to sequences they wouldn't have thought to try using?
</p>
<p>
This is similar to another problem: how do you check if people are
synthesizing dangerous sequences without risking a list of all the
things that shouldn't be synthesized? <a href="https://securedna.org/">SecureDNA</a> has been working on this
problem, with an encrypted database with a distributed key
system that allows flagging sequences without it being practical to
get a list of all flagged sequences (<a href="https://secure-dna.up.railway.app/manuscripts/Cryptographic_Aspects_of_DNA_Screening.pdf">paper</a>).
</p>
<p>
There are some blockers to using SecureDNA for this today, since it
was designed for slightly different constraints, but I think they are
all surmountable and I'm hoping to implement a SecureDNA-based
metagenomic sequencing screening system at some point in the next
year.
</p>
<p>
An alternative and somewhat longer-term approach here would be to use
tools that are able to estimate the function of novel sequences to
extend this to sequences that aren't closely derived from existing
ones. I'm less enthusiastic about this: not only could this work end
up increasing risk by improving humanity's ability to judge how
dangerous a novel sequence is, it's not clear to me that this approach
is likely to catch things the other methods wouldn't.
</p>
</dd>
<dt id="modified-human-viruses">
<b>Modified</b>:
</dt>
<dd>
<p>The easiest way to engineer a new virus for a stealth pandemic
would likely be to begin with an existing virus.  If we see a
sequencing <a href="https://www.jefftk.com/p/what-is-a-sequencing-read">read</a> where part
matches a known viral genome and part does not (a "chimera"), one
potential explanation is that the read comes from a genetically
engineered virus.
</p>
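<p>
As a toy illustration of the idea (mine, far cruder than real
alignment-based detection, with made-up sequences): flag a read whose
prefix matches a known genome but whose remainder doesn't appear in it:
</p>

```python
# Toy sketch (mine, much simpler than real alignment) of the chimera
# idea: flag a read whose prefix occurs in a known genome exactly but
# whose remainder does not appear in that genome at all.
def looks_chimeric(read, genome, min_match=20, min_novel=20):
    # Longest prefix of `read` that occurs somewhere in `genome`:
    match_len = 0
    for i in range(1, len(read) + 1):
        if read[:i] in genome:
            match_len = i
        else:
            break
    novel = read[match_len:]
    return (match_len >= min_match
            and len(novel) >= min_novel
            and novel not in genome)

genome = "ACGT" * 30  # stand-in for a known viral genome
chimera = genome[:40] + "TTATTGCCAGTAGCTTGACCGTTAGGCAT"  # novel tail
print(looks_chimeric(chimera, genome))   # flagged
print(looks_chimeric(genome[:60], genome))  # pure viral read, not flagged
```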
<p>
But this is not the only reason this approach could flag a read. For
example, it could come from:
</p>
<ul>
<li><p>Lack of knowledge. Perhaps a virus has a lot of variation, much
more than is reflected in the databases you are using to define
"normal". It will look like you have found a novel virus when it's
just an incomplete database. And, of course, the database will always
be incomplete: viruses are always evolving. Still, solving this seems
practical: handling these initial false positives requires expanding
our knowledge of the variety of existing viruses, but that is
something many virologists are deeply interested in.
</p></li>
<li><p>Sequencing: perhaps some of the biological processing you do
prior to (or during) sequencing can attach unrelated fragments. When
you see a chimera, how do you know whether it existed in the sample
you originally collected or was created accidentally in the lab?
To mitigate this, you can (a) compare the fraction of chimeras in
different sequencing approaches and pick ones where this is rare and
(b) pay more attention to cases where you've seen the same chimera
multiple times.
</p></li>
<li><p>Biological chimerism: bacteria will occasionally incorporate
viral sequences. This method would flag this as genetic engineering
even if it was a natural and unconcerning process. As long as this is
rare enough, however, we can deal with it by surfacing such reads to
a biologist who figures out how concerned to be and what next steps
make sense.
</p></li>
</ul>
<p>
This is the main approach I've been working on lately, trying to get
the false positive rate down.
</p>
</dd>
<dt>
<b>New</b>:
</dt>
<dd>
<p>If we understood what "normal" looked like well enough, then we
could flag anything new for investigation. This is a serious research
project: if you take data from a sewage sample and run it through
basic tooling, it's <a href="https://data.securebio.org/mgs-counts/#mr=1e1&dna=1&rna=1&dna_rna=1&e_none=1&e_viral=1&c_viruses=1&c_bacteria=1&c_other=1&pinned=0">common
to have 50% of reads unclassified</a>. Making progress here will
require, among other things, much better tooling (and maybe
algorithms) for metagenomic assembly: I'm not aware of anything that
could efficiently integrate trillions of bases a week into an assembly
graph.
</p>
<p>
<a href="https://www.teojcryan.com/">Ryan Teo</a>, a first-year
graduate student with <a href="https://www.birmingham.ac.uk/staff/profiles/microbiology-infection/wheeler-nicole.aspx">Nicole
Wheeler</a> at the University of Birmingham has started his thesis in
this area, which I'm really excited to see. <a href="https://github.com/lennijusten">Lenni Justen</a>, another
first-year graduate student, working with <a href="https://www.sculptingevolution.org/kevin-m-esvelt">Kevin
Esvelt</a>, is also exploring this area as part of his work with the
NAO. I'd be excited to see more work, however, and if you're working
on this or interested in working on it but blocked by not having
access to enough metagenomic sequencing data please get in touch!
</p>
</dd>
<dt>
<b>Growing</b>:
</dt>
<dd>
<p>It may turn out that our samples are deeply complex:
potentially, as you sequence, the rate of seeing new things falls off
very slowly.  If it falls off slowly enough, then you will keep
seeing "new" things that are just so rare that you haven't happened to
see them before.  I am quite unsure how likely this is, and I expect
it varies by sample type (sewage is likely much more complex than,
say, blood), but it seems possible.  An approach that's robust to this
is to flag things not just for being new but based on their growth
pattern: first you've never seen it, then you see it once, then you
start seeing it more often, then you start seeing it many times per
sample.  In theory a new pandemic should begin with approximately
exponential spread, since with few people already infected the number
of new infections should be proportional to the number of infectious
people.
</p>
<p>
At the NAO we've been calling this "<a href="https://naobservatory.org/computational-threat-detection/">exponential
growth detection</a>" (EGD). We worked on this some in 2022, but have
put it on hold until we have a deep enough timeseries dataset to work
with.
</p>
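<p>
As a naive sketch of growth-based flagging (my own toy version, not
the NAO's actual EGD work): fit a line to log-counts across successive
samples and flag a consistently positive slope:
</p>

```python
import math

# A naive sketch (my own, not the NAO's EGD method) of flagging a
# sequence whose per-sample counts look roughly exponential: fit a
# least-squares line to log(count + 1) over time and flag slopes above
# an (arbitrary) threshold.
def looks_exponential(counts, min_slope=0.5):
    xs = range(len(counts))
    ys = [math.log(c + 1) for c in counts]
    n = len(counts)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    return slope >= min_slope

print(looks_exponential([0, 1, 2, 4, 9, 16]))  # steadily growing: True
print(looks_exponential([5, 0, 6, 1, 4, 3]))   # noisy and flat: False
```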
</dd>
</dl>
<p>
These approaches can also be combined: if a sequence originally comes
to your attention because it's chimeric but you're not sure how
seriously to take it, you could look at the growth pattern of its
components. Or, while you can detect growing things with a
genome-free approach simply by looking for increasing k-mers, the kind
of "thoroughly understand the metagenome" work that I described above
as an approach for identifying new things can also be used to make a
much more sensitive tool that detects growing things.
</p>
<p>
In terms of prioritization, I'm enthusiastic about work on all of
these, and would like to see them progress in parallel. The
approaches of detecting dangerous and modified sequences require less
scientific progress and should work on amounts of data that are
achievable with philanthropic funding. <a href="https://en.wikipedia.org/wiki/Protein_design">De novo protein
design</a> is getting more capable and more accessible, however, which
allows creation of pathogens those two methods don't catch. We will
need approaches that don't depend on matching known things, which is
where detecting new and/or growing sequences comes in. Those two
methods will require a lot more data, enough that unless sequencing
goes through another round of the kind of massive cost improvement <a href="https://www.genome.gov/about-genomics/fact-sheets/Sequencing-Human-Genome-cost">we
saw in 2008-2011</a> we're talking about large-scale government-funded
projects. Advances in detection methods make it more likely that
we'll be able to make the case for these larger projects, and reduce
the risk that the detection ability might lag infrastructure creation.
</p>
<p><i>Comment via: <a href="https://www.facebook.com/jefftk/posts/pfbid0VNAmyszh71oZbUhFjxsASAxZfuQv4vTuaJJWRhzLzc6i14wrfi45NUttZx6oJSf1l">facebook</a>, <a href="https://lesswrong.com/posts/zZ8vgoKvaEodm9gZL">lesswrong</a>, <a href="https://forum.effectivealtruism.org/posts/acutkuaQQYfpeaxhR">the EA Forum</a>, <a href="https://mastodon.mit.edu/@jefftk/111332489074358553">mastodon</a></i></p>
Weekly Incidence Including Delay
https://www.jefftk.com/p/weekly-incidence-including-delay
20 Sep 2023 08:00:00 EST<p><span>
</span>
<i>Cross-posted from my <a href="https://data.securebio.org/jefftk-notebook/weekly-incidence-including-delay">NAO Notebook</a></i>
<p>
A few days ago I <a href="https://www.jefftk.com/p/weekly-incidence-vs-cumulative-infections">wrote about</a>
some math behind a scenario where you're trying to identify a new
epidemic based on signals proportional to incidence, and ended up
deriving:
</p>
<p>
<math display="block">
<mfrac>
<mrow><mi>i</mi><mo>(</mo><mi>t</mi><mo>)</mo></mrow>
<mrow><mi>c</mi><mo>(</mo><mi>t</mi><mo>)</mo></mrow>
</mfrac>
<mo>=</mo>
<mi>k</mi>
<mo>=</mo>
<mfrac>
<mrow>
<mi>ln</mi>
<mo>(</mo><mn>2</mn><mo>)</mo>
</mrow>
<msub><mi>T</mi><mi>d</mi></msub>
</mfrac></math>
</p>
<p>
Where:
</p>
<ul>
<li>
<math display="inline"><mi>i</mi><mo>(</mo><mi>t</mi><mo>)</mo></math>, is
incidence ("how many people are getting sick now")
</li>
<li>
<math display="inline"><mi>c</mi><mo>(</mo><mi>t</mi><mo>)</mo></math>, is
cumulative infections ("how many people have gotten sick so far")
</li>
<li>
<math display="inline"><mi>k</mi></math>, is
the exponential growth rate.
</li>
<li>
<math display="inline"><msub><mi>T</mi><mi>d</mi></msub></math>, is
the doubling time (redundant with <math display="inline"><mi>k</mi></math>).
</li>
</ul>
<p>
One big problem with this model, however, is that any conclusions you
make today aren't driven by current incidence, but instead some kind
of delayed incidence. There is, unavoidably, time from infection until
you're making your decision, during which the disease is spreading
further:
</p>
<p>
<a href="https://www.jefftk.com/infection-delay-timeframe-illustration-big.png"><img src="https://www.jefftk.com/infection-delay-timeframe-illustration.png" width="550" height="275" class="mobile-fullwidth" style="max-width:100.0vw; max-height:50.0vw;" srcset="https://www.jefftk.com/infection-delay-timeframe-illustration.png 550w,https://www.jefftk.com/infection-delay-timeframe-illustration-2x.png 1100w"><div style="height:min(50.0vw, 275px)" class="image-vertical-spacer"></div></a>
</p>
<p>
If your signal is "people arrive at the hospital and the doctors
notice a weird cluster", then you need to wait for each infection to
progress far enough to result in hospitalization. How do the
conclusions of the previous post change if we extend our model to
account for a delay, but keep the goal of flagging an epidemic before
1% of people have been infected?
</p>
<p>
Recall that last time we estimated that, for something doubling
weekly, when 1% of people have ever been infected then 0.69% of people
became infected in the last seven days. With a delay of a week,
however, by the time we learn that incidence has hit 0.69% many more
people will have been infected and we'd have missed our 1% goal by a
lot. The effect of delay is that during this time the epidemic will
make further progress, which will depend on the growth rate: with a
shorter doubling period there will be more progress. Can we get an
equation relating cumulative infections to delayed incidence?
</p>
<p>
Let's call delay <math display="inline"><mi>d</mi></math>.
Instead of
<math display="inline"><mfrac>
<mrow><mi>i</mi><mo>(</mo><mi>t</mi><mo>)</mo></mrow>
<mrow><mi>c</mi><mo>(</mo><mi>t</mi><mo>)</mo></mrow>
</mfrac></math>
we now want
<math display="inline"><mfrac>
<mrow><mi>i</mi><mo>(</mo><mi>t</mi><mo>-</mo><mi>d</mi><mo>)</mo></mrow>
<mrow><mi>c</mi><mo>(</mo><mi>t</mi><mo>)</mo></mrow>
</mfrac></math>.
That is, what's the relationship between cumulative infections and
what incidence was when the information we're now getting was derived?
</p>
<p>
We can do a bit of math:
</p>
<p>
<math display="block">
<mfrac>
<mrow><mi>i</mi><mo>(</mo><mi>t</mi><mo>-</mo><mi>d</mi><mo>)</mo></mrow>
<mrow><mi>c</mi><mo>(</mo><mi>t</mi><mo>)</mo></mrow>
</mfrac>
<mo>=</mo>
<mfrac>
<mrow>
<mi>k</mi>
<msup><mi>e</mi>
<mrow><mi>k</mi><mo>(</mo><mi>t</mi><mo>-
</mo><mi>d</mi><mo>)</mo></mrow></msup>
</mrow>
<msup><mi>e</mi>
<mrow><mi>k</mi><mi>t</mi></mrow></msup>
</mfrac>
<!--
<mo>=</mo>
<mi>k</mi>
<mfrac>
<mrow>
<msup><mi>e</mi>
<mrow><mi>k</mi><mi>t</mi></mrow></msup>
<msup><mi>e</mi>
<mrow><mo>-</mo><mi>k</mi><mi>d</mi></mrow></msup>
</mrow>
<msup><mi>e</mi>
<mrow><mi>k</mi><mi>t</mi></mrow></msup>
</mfrac>
-->
<mo>=</mo>
<mi>k</mi>
<msup><mi>e</mi>
<mrow><mo>-</mo><mi>k</mi><mi>d</mi></mrow></msup>
<mo>=</mo>
<!--
<mfrac>
<mrow>
<mi>ln</mi>
<mo>(</mo><mn>2</mn><mo>)</mo>
</mrow>
<msub><mi>T</mi><mi>d</mi></msub>
</mfrac>
<msup><mi>e</mi>
<mrow><mo>-</mo><mfrac>
<mrow>
<mi>ln</mi>
<mo>(</mo><mn>2</mn><mo>)</mo>
</mrow>
<msub><mi>T</mi><mi>d</mi></msub></mfrac>
<mi>d</mi></mrow></msup>
<mo>=</mo>
-->
<mfrac>
<mrow>
<mi>ln</mi>
<mo>(</mo><mn>2</mn><mo>)</mo>
</mrow>
<msub><mi>T</mi><mi>d</mi></msub>
</mfrac>
<msup><mn>2</mn>
<mfrac>
<mrow><mo>-</mo><mi>d</mi></mrow>
<msub><mi>T</mi><mi>d</mi></msub>
</mfrac>
</msup>
</math>
</p>
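<p>
Evaluating that relationship numerically (a quick sketch of my own):
</p>

```python
import math

# Sanity-check sketch of the relationship derived above:
# i(t-d)/c(t) = k * e^(-k*d), with k = ln(2)/Td.
def delayed_incidence_over_cumulative(doubling_time_days, delay_days):
    k = math.log(2) / doubling_time_days
    return k * math.exp(-k * delay_days)

# Doubling weekly: with no delay this is just k = ln(2)/7 per day;
# with a week of delay the signal reflects incidence that is one
# doubling smaller, i.e. about half the size.
no_delay = delayed_incidence_over_cumulative(7, 0)
one_week = delayed_incidence_over_cumulative(7, 7)
print(no_delay, one_week, no_delay / one_week)
```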
<p>
What does this look like, for a few different potential delay values?
</p>
<p>
<a href="https://www.jefftk.com/former-weekly-incidence-by-delay-big.png"><img src="https://www.jefftk.com/former-weekly-incidence-by-delay.png" width="550" height="365" class="mobile-fullwidth" style="max-width:100.0vw; max-height:66.4vw;" srcset="https://www.jefftk.com/former-weekly-incidence-by-delay.png 550w,https://www.jefftk.com/former-weekly-incidence-by-delay-2x.png 1100w"><div style="height:min(66.4vw, 365px)" class="image-vertical-spacer"></div></a>
</p>
<p>
My main takeaway is that, under these assumptions, delay matters less
than I would initially have guessed: the sensitivity you need to
design for is driven by the need to catch slow-growing epidemics. For
example, even with up to eleven days of delay a system sensitive
enough to flag an epidemic that doubles every four weeks is sensitive
enough to detect one that doubles every three days.
</p>
<p>
This isn't the whole story, though, because many of the actions that
you would want to do post-discovery are more urgent with higher growth
rates. This means you need to count the delay of your core response
(ex: implementing <a href="https://www.cdc.gov/nonpharmaceutical-interventions/index.html">NPIs</a>)
in the total delay: you can't allocate the entire delay budget to the
detection system.
</p>
<p>
Another thing to note is that this is all under the assumption that
detection depends on incidence above a threshold, and whether this is
actually the case is unclear for several potential systems. For
example, with wastewater sequencing detection my current best guess is
this primarily would rely on the total number of observations of the
pathogen, which would be proportional to (delayed) cumulative
infections and not (delayed) incidence. With detection based on
cumulative infections minimizing delay matters a lot more.
</p>
<p><i>Comment via: <a href="https://www.facebook.com/jefftk/posts/pfbid0vDPhMnh1TgJLy1MivkvaBHxgwvycFvyp5dt51CdgWoaq3BUjF6ny8bY2n25DGQeEl">facebook</a>, <a href="https://lesswrong.com/posts/PiJDrMbrDGKwEAcWd">lesswrong</a>, <a href="https://mastodon.mit.edu/@jefftk/111097813727323731">mastodon</a></i></p>
Weekly Incidence vs Cumulative Infections
https://www.jefftk.com/p/weekly-incidence-vs-cumulative-infections
06 Sep 2023 08:00:00 EST<p><span>
</span>
<i>Cross-posted from my <a href="https://data.securebio.org/jefftk-notebook/weekly-incidence-vs-cumulative-infections">NAO Notebook</a></i>
<p>
Imagine you have a goal of identifying a novel disease by the time some
small fraction of the population has been infected. Many of the signs
you might use to detect something unusual, however, such as doctor
visits or shedding into wastewater, will depend on the number of
people <i>currently</i> infected. How do these relate?
</p>
<p>
Bottom line: if we limit our consideration to the time before anyone has noticed
something unusual, where people aren't changing their behavior to
avoid the disease, the vast majority of people are still
susceptible, and spread is likely approximately exponential, then:
</p>
<p>
<math display="block">
<mi>incidence</mi>
<mo>=</mo>
<mi>cumulative infections</mi>
<mfrac>
<mrow>
<mi>ln</mi>
<mo>(</mo><mn>2</mn><mo>)</mo>
</mrow>
<mrow><mi>doubling time</mi></mrow>
</mfrac>
<!--
<pre>
incidence = cumulative infections * ln(2) / doubling time
</pre>
-->
<p>
Let's derive this! We'll call "cumulative infections" <math display="inline"><mi>c</mi><mo>(</mo><mi>t</mi><mo>)</mo></math>, and "<a href="https://en.m.wikipedia.org/wiki/Doubling_time">doubling
time</a>" <math display="inline"><msub><mi>T</mi><mi>d</mi></msub></math>. So here's cumulative infections at time
<math display="inline"><mi>t</mi></math>:
</p>
<p>
<math display="block">
<mi>c</mi><mo>(</mo><mi>t</mi><mo>)</mo>
<mo>=</mo>
<msup>
<mn>2</mn>
<mfrac>
<mi>t</mi>
<msub><mi>T</mi><mi>d</mi></msub>
</mfrac>
</msup>
</math>
<!--
<pre>
c(t) = 2^(t/Td)
</pre>
-->
</p>
<p>
The math will be easier with natural exponents, so let's define
<!--<code>k=ln(2)/Td</code>-->
<math display="inline">
<mi>k</mi>
<mo>=</mo>
<mfrac>
<mrow>
<mi>ln</mi>
<mo>(</mo><mn>2</mn><mo>)</mo>
</mrow>
<msub><mi>T</mi><mi>d</mi></msub>
</mfrac></math>
and switch our base:
</p>
<p>
<!--
<pre>
e^(kt)
</pre>
-->
<math display="block">
<msup><mi>e</mi>
<mrow><mi>k</mi><mi>t</mi></mrow></msup>
</math>
</p>
<p>
Let's call "incidence" <math display="inline"><mi>i</mi><mo>(</mo><mi>t</mi><mo>)</mo></math>,
which will be the derivative
of <math display="inline"><mi>c</mi><mo>(</mo><mi>t</mi><mo>)</mo></math>:
</p>
<p>
<!--
<pre>
i(t) = d/dt c(t)
= d/dt e^(kt)
= k * e(kt)
</pre>
-->
<math display="block">
<mi>i</mi><mo>(</mo><mi>t</mi><mo>)</mo>
<mo>=</mo>
<mfrac>
<mi>d</mi>
<mrow><mi>d</mi><mi>t</mi></mrow>
</mfrac>
<mi>c</mi><mo>(</mo><mi>t</mi><mo>)</mo>
<mo>=</mo>
<mfrac>
<mi>d</mi>
<mrow><mi>d</mi><mi>t</mi></mrow>
</mfrac>
<msup><mi>e</mi>
<mrow><mi>k</mi><mi>t</mi></mrow></msup>
<mo>=</mo>
<mi>k</mi>
<msup><mi>e</mi>
<mrow><mi>k</mi><mi>t</mi></mrow></msup>
</math>
</p>
<p>
And so:
</p>
<p>
<!--
<pre>
i(t) / c(t) = k * e^(kt) / e^(kt)
= k
= ln(2) / Td
</pre>
-->
<math display="block">
<mfrac>
<mrow><mi>i</mi><mo>(</mo><mi>t</mi><mo>)</mo></mrow>
<mrow><mi>c</mi><mo>(</mo><mi>t</mi><mo>)</mo></mrow>
</mfrac>
<mo>=</mo>
<mfrac>
<mrow>
<mi>k</mi>
<msup><mi>e</mi>
<mrow><mi>k</mi><mi>t</mi></mrow></msup>
</mrow>
<msup><mi>e</mi>
<mrow><mi>k</mi><mi>t</mi></mrow></msup>
</mfrac>
<mo>=</mo>
<mi>k</mi>
<mo>=</mo>
<mfrac>
<mrow>
<mi>ln</mi>
<mo>(</mo><mn>2</mn><mo>)</mo>
</mrow>
<msub><mi>T</mi><mi>d</mi></msub>
</mfrac></math>
</p>
<p>
Which means:
<math display="block"><mi>i</mi><mo>(</mo><mi>t</mi><mo>)</mo>
<mo>=</mo>
<mi>c</mi><mo>(</mo><mi>t</mi><mo>)</mo>
<mfrac><mrow><mi>ln</mi>
<mo>(</mo><mn>2</mn><mo>)</mo></mrow>
<msub><mi>T</mi><mi>d</mi></msub></mfrac></math>
</p>
<p>
What does this look like? Here's a chart of weekly incidence at the
time when cumulative incidence reaches 1%:
</p>
<p>
<a href="https://www.jefftk.com/weekly-incidence-when-1pct-big.png"><img src="https://www.jefftk.com/weekly-incidence-when-1pct.png" width="550" height="331" class="mobile-fullwidth" style="max-width:100.0vw; max-height:60.2vw;" srcset="https://www.jefftk.com/weekly-incidence-when-1pct.png 550w,https://www.jefftk.com/weekly-incidence-when-1pct-2x.png 1100w"><div style="height:min(60.2vw, 331px)" class="image-vertical-spacer"></div></a>
</p>
<p>
For example, if it's doubling weekly then when 1% of people have ever
been infected 0.69% of people became infected in the last seven days, representing
69% of people who have ever been infected. If it's doubling every
three weeks, then when 1% of people have ever been infected 0.23% of
people became infected this week, 23% of cumulative infections.
</p>
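<p>
Those numbers come straight from the formula; as a quick sketch:
</p>

```python
import math

# i(t) = c(t) * ln(2) / Td: weekly incidence when cumulative
# infections reach `cumulative`, with time measured in weeks.
def weekly_incidence(cumulative, doubling_time_weeks):
    return cumulative * math.log(2) / doubling_time_weeks

print(round(weekly_incidence(0.01, 1), 4))  # 0.0069: doubling weekly
print(round(weekly_incidence(0.01, 3), 4))  # 0.0023: doubling every 3 weeks
```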
<p>
Is this really right, though? Let's check our work with a bit of
very simple simulation:
</p>
<pre>
def simulate(doubling_period_weeks):
cumulative_infection_threshold = 0.01
initial_weekly_incidence = 0.000000001
cumulative_infections = 0
current_weekly_incidence = 0
week = 0
while cumulative_infections < \
cumulative_infection_threshold:
week += 1
current_weekly_incidence = \
initial_weekly_incidence * 2**(
week/doubling_period_weeks)
cumulative_infections += \
current_weekly_incidence
return current_weekly_incidence
for f in range(50, 500):
doubling_period_weeks = f / 100
print(doubling_period_weeks,
simulate(doubling_period_weeks))
</pre>
<p>
This looks like:
</p>
<p>
<a href="https://www.jefftk.com/weekly-incidence-when-1pct-simulated-big.png"><img src="https://www.jefftk.com/weekly-incidence-when-1pct-simulated.png" width="550" height="327" class="mobile-fullwidth" style="max-width:100.0vw; max-height:59.5vw;" srcset="https://www.jefftk.com/weekly-incidence-when-1pct-simulated.png 550w,https://www.jefftk.com/weekly-incidence-when-1pct-simulated-2x.png 1100w"><div style="height:min(59.5vw, 327px)" class="image-vertical-spacer"></div></a>
</p>
<p>
The simulated line is jagged, especially for short doubling periods,
but that's not especially meaningful: it comes from running the
calculation a week at a time, so some weeks land just above or just
below the (arbitrary) 1% threshold.
</p></math></p>
<p><i>Comment via: <a href="https://www.facebook.com/jefftk/posts/pfbid04r7XiT3AK6VftKPqj4M4u6fbLVNNExmqAA1DuTYnfMkQLNFSvDAUZNNMwYsHkq5ul">facebook</a>, <a href="https://lesswrong.com/posts/6rnWQW8HtHoavPDiu">lesswrong</a>, <a href="https://mastodon.mit.edu/@jefftk/111021490637605604">mastodon</a></i></p>
/p/growth-of-publicly-available-genetic-sequencing-dataGrowth of Publicly Available Genetic Sequencing Data
https://www.jefftk.com/p/growth-of-publicly-available-genetic-sequencing-data
bionao
20 Jul 2023 08:00:00 EST<p><span>
</span>
<i>Cross-posted from my <a href="https://data.securebio.org/jefftk-notebook/growth-of-publicly-available-genetic-sequencing-data">NAO Notebook</a></i>
<p>
The largest source of publicly available genetic sequencing data is
the <a href="https://en.wikipedia.org/wiki/Sequence_Read_Archive">Sequence
Read Archive</a> (SRA), a joint project of the US (<a href="https://en.wikipedia.org/wiki/National_Center_for_Biotechnology_Information">NCBI</a>),
Europe (<a href="https://en.wikipedia.org/wiki/European_Bioinformatics_Institute">EBI</a>),
and Japan (<a href="https://en.wikipedia.org/wiki/DNA_Data_Bank_of_Japan">DDBJ</a>).
Most relevant funding agencies and journals require sequencing
data to be deposited in the SRA. I was curious how quickly it has
been growing, so I ran some queries.
</p>
<p>
The metadata for the SRA is available <a href="https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud-based-examples/">in
the cloud</a> and we can access it through <a href="https://cloud.google.com/bigquery">BigQuery</a>. I ran:
</p>
<p>
</p>
<pre>
SELECT EXTRACT(YEAR from releasedate),
EXTRACT(MONTH from releasedate),
SUM(mbases),
SUM(mbytes)
FROM `nih-sra-datastore.sra.metadata`
GROUP BY EXTRACT(YEAR from releasedate),
EXTRACT(MONTH from releasedate)
</pre>
<p>
This gave me how much new data there was each month, in terms of both
genetic bases and (compressed) bytes on disk.
</p>
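<p>
For readers who'd rather poke at this locally, the same group-by can
be sketched with pandas on a toy stand-in for the metadata table (the
column names match the query; the rows here are made up):
</p>

```python
import pandas as pd

# Toy stand-in for the `nih-sra-datastore.sra.metadata` table
df = pd.DataFrame({
    "releasedate": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-02-02"]),
    "mbases": [100, 250, 40],
    "mbytes": [30, 80, 12],
})

# Equivalent of GROUP BY EXTRACT(YEAR ...), EXTRACT(MONTH ...)
monthly = df.groupby(
    [df.releasedate.dt.year.rename("year"),
     df.releasedate.dt.month.rename("month")],
)[["mbases", "mbytes"]].sum()
print(monthly)
```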
<p>
<a href="https://www.jefftk.com/sra-data-by-month-big.png"><img src="https://www.jefftk.com/sra-data-by-month.png" width="550" height="306" class="mobile-fullwidth" style="max-width:100.0vw; max-height:55.6vw;" srcset="https://www.jefftk.com/sra-data-by-month.png 550w,https://www.jefftk.com/sra-data-by-month-2x.png 1100w"><div style="height:min(55.6vw, 306px)" class="image-vertical-spacer"></div></a>
</p>
<p>
Note the logarithmic y-axis.
</p>
<p>
This is reminiscent of another chart, the cost to sequence 1M bases:
</p>
<p>
<a href="https://www.jefftk.com/sequencing-cost-over-time-may-2022-big.png"><img src="https://www.jefftk.com/sequencing-cost-over-time-may-2022.png" width="550" height="305" class="mobile-fullwidth" style="max-width:100.0vw; max-height:55.5vw;" srcset="https://www.jefftk.com/sequencing-cost-over-time-may-2022.png 550w,https://www.jefftk.com/sequencing-cost-over-time-may-2022-2x.png 1100w"><div style="height:min(55.5vw, 305px)" class="image-vertical-spacer"></div></a>
</p>
<p>
(This is a pretty amazing chart, with the huge drop around 2008 coming
from <a href="https://en.wikipedia.org/wiki/Massive_parallel_sequencing">next-generation
sequencing</a>.)
</p>
<p>
We could combine these, to get a rough estimate for how much
money is being spent to sequence the data going into the SRA, but to
do this we need to know how long a delay there is between sequencing
and releasing: if it cost $400/Mb in 2007-10, $100/Mb in 2008-01, and
$15/Mb in 2008-04, then which cost should we use for interpreting data
released in 2008-06? Here's a plot modeling 0-, 6-, 12-, and
24-month delays:
</p>
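<p>
To make the delay question concrete, here's a toy sketch (the costs
are just the example figures above, not real SRA-wide numbers) of
mapping a release month back to an assumed sequencing month:
</p>

```python
# Example costs from the text, $ per megabase
cost_per_mbase = {
    "2007-10": 400.0,
    "2008-01": 100.0,
    "2008-04": 15.0,
}

def cost_for_release(release_month, delay_months):
    # Shift the release month back by the assumed
    # sequencing-to-release delay.
    year, month = map(int, release_month.split("-"))
    total = year * 12 + (month - 1) - delay_months
    sequenced = f"{total // 12:04d}-{total % 12 + 1:02d}"
    return cost_per_mbase.get(sequenced)

# Data released 2008-06, under different assumed delays:
print(cost_for_release("2008-06", 2))  # sequenced 2008-04 → 15.0
print(cost_for_release("2008-06", 5))  # sequenced 2008-01 → 100.0
print(cost_for_release("2008-06", 8))  # sequenced 2007-10 → 400.0
```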
<p>
<a href="https://www.jefftk.com/cost-of-sra-data-by-month-big.png"><img src="https://www.jefftk.com/cost-of-sra-data-by-month.png" width="550" height="339" class="mobile-fullwidth" style="max-width:100.0vw; max-height:61.6vw;" srcset="https://www.jefftk.com/cost-of-sra-data-by-month.png 550w,https://www.jefftk.com/cost-of-sra-data-by-month-2x.png 1100w"><div style="height:min(61.6vw, 339px)" class="image-vertical-spacer"></div></a>
</p>
<p>
It looks like the initial delay is maybe ~9 months, and since costs
have been changing more slowly in recent years, the choice of delay
doesn't matter much for recent data.
</p>
<p>
Looking just at the last five years, after it has leveled out some, it
looks like a steady ~1.2e16 bases annually:
<a href="https://www.jefftk.com/sra-data-new-by-month-linear-big.png"><img src="https://www.jefftk.com/sra-data-new-by-month-linear.png" width="550" height="336" class="mobile-fullwidth" style="max-width:100.0vw; max-height:61.1vw;" srcset="https://www.jefftk.com/sra-data-new-by-month-linear.png 550w,https://www.jefftk.com/sra-data-new-by-month-linear-2x.png 1100w"><div style="height:min(61.1vw, 336px)" class="image-vertical-spacer"></div></a>
</p>
<p><i>Comment via: <a href="https://www.facebook.com/jefftk/posts/pfbid0BVvekfDaMQrmXfvHJPMBa1Pq1xG63f9WCAbNzS6dwc1twGBQLRgtNU2L31Mf5ngBl">facebook</a>, <a href="https://lesswrong.com/posts/8pvQndfsSDaSXNX4e">lesswrong</a>, <a href="https://mastodon.mit.edu/@jefftk/110748131015350735">mastodon</a></i></p>