Introducing and Deprecating WoFBench

We present and formally deprecate WoFBench, a novel test that compares the knowledge of Wings of Fire superfans to frontier AI models. The benchmark showed initial promise as a challenging evaluation, but unfortunately proved to be saturated on creation as AI models produced output that was, to the extent of our ability to score responses, statistically indistinguishable from entirely correct.

Benchmarks are important tools for tracking the rapid advancements in model capabilities, but they are struggling to keep up with LLM progress: frontier models now consistently achieve high scores on many popular benchmarks, raising questions about their continued ability to differentiate between models.

In response, we introduce WoFBench, an evaluation suite designed to test recall and knowledge synthesis in the domain of Tui T. Sutherland's Wings of Fire universe.

The superfans were identified via a careful search process, in which all members of the lead author's household were asked to complete a self-assessment of their knowledge of the Wings of Fire universe. The assessment consisted of a single question, with the text "do you think you know the Wings of Fire universe better than Gemini?" Two superfans were identified, whom we keep anonymous to reduce the risk of panel poaching by competing benchmark efforts.

more...
Here's to the Polypropylene Makers

Six years ago, as covid-19 was rapidly spreading through the US, my sister was working as a medical resident. One day she was handed an N95 and told to "guard it with her life", because there weren't any more coming.

N95s are made from meltblown polypropylene, produced from plastic pellets manufactured in a small number of chemical plants. Two of these plants were operated by Braskem America in Marcus Hook PA and Neal WV. If there were infections on site, the whole operation would need to shut down, and the factories that turned their pellets into mask fabric would stall.

Companies everywhere were figuring out how to deal with this risk. The standard approach was staggering shifts, social distancing, temperature checks, and lots of handwashing. This reduced risk, but each shift change was an opportunity for someone to bring in an infection from the community.

Someone had the idea: what if we never left? About eighty people, across both plants, volunteered to move in. The plan was four weeks of twelve-hour shifts, sleeping on air mattresses on the floor each night and seeing their families only through screens. With full isolation no one would be exposed, and they could keep the polypropylene flowing.

more...
Storing Food

I think more people should be storing a substantial amount of food. It's not likely you'll need it, but as with reusable masks, the cost is low enough that I think it's usually worth it.

It's hard for me to really imagine living through a famine. The world as I have experienced it has been one of abundant calories, where people are generally more worried about getting too many than too few. Essentially no one dies in the US from food unavailability. Globally, however, it's different: each year millions die from hunger.

more...
You May Already Be Canadian

I learned a few weeks ago that I'm a Canadian citizen. This was pretty surprising to me, since I was born in the US to American parents, both of whom had American parents. You don't normally suddenly become a citizen of another country! But with Bill C-3, anyone with any Canadian ancestry is now Canadian. [1]

In my case my mother's mother's father's mother's mother was Canadian. While that is really quite far back, there isn't a generational limit anymore.

Possibly you're also a Canadian citizen? Seems worth checking! With how much migration there has been between the US and Canada, and citizenship requiring only a single ancestor, this might mean ~5-10% of Americans are now additionally Canadian, which is kind of nuts.

more...
Text Posts from the Kids Group: 2025

Another round of liberating kid posts from Facebook. For reference, in 2025 Lily turned 11, Anna turned 9, and Nora turned 3.

(Some of these were from me; some were from Julia. Ones saying "me" could mean either of us. Ones from others are labeled.)

more...
Gemini's Hypothetical Present

I use several AI models at work, switching between them based on which is best at the kind of work I'm doing. In my limited testing I've found Opus 4.6 to be the best coder (via Claude Code) and to have the best taste in writing, while GPT 5.2 Thinking is mildly smarter. Gemini 3.0 Pro feels like it has the capacity to be much smarter, but is very often held back by various strange errors. The one I see the most is that it has a lot of trouble accepting the changes to the world since its training data, and seems to put a lot of effort into keeping track of everything as a giant hypothetical.

For example, the first query I thought to test this on in writing this post was [Who leads the Department of War?] and it gave the reasonable answer "The Department of War is currently led by Secretary of War Pete Hegseth," followed by a short explanation. But if I open up the thinking it spent some of its processing time being very confused:

...
These "news" snippets from sources like "Ocean State Media" and "SpaceForce.mil" are throwing up red flags. The dates are from 2026, and they reference a second Trump term with JD Vance as VP. That screams alternate reality or satire to me.
...

more...