Introducing and Deprecating WoFBench

March 1st, 2026
satire, tech
We present and formally deprecate WoFBench, a novel test that compares the knowledge of Wings of Fire superfans to frontier AI models. The benchmark showed initial promise as a challenging evaluation, but unfortunately proved to be saturated upon creation: AI models produced output that was, to the extent of our ability to score responses, statistically indistinguishable from entirely correct.

Benchmarks are important tools for tracking the rapid advancements in model capabilities, but they are struggling to keep up with LLM progress: frontier models now consistently achieve high scores on many popular benchmarks, raising questions about their continued ability to differentiate between models.

In response, we introduce WoFBench, an evaluation suite designed to test recall and knowledge synthesis in the domain of Tui T. Sutherland's Wings of Fire universe.

The superfans were identified via a careful search process, in which all members of the lead author's household were asked to complete a self-assessment of their knowledge of the Wings of Fire universe. The assessment consisted of a single question, with the text "do you think you know the Wings of Fire universe better than Gemini?" Two superfans were identified, whom we keep anonymous to reduce the risk of panel poaching by competing benchmark efforts.

Identification of questions proved difficult, as the benchmark authors have extremely limited knowledge of Wings of Fire lore, primarily derived from infodumping and overheard arguments. We initially attempted to source questions from the superfans themselves, with each judged on the other's questions. As they were uncompensated and rivalrous, however, they agreed to participate only to the extent that their answers could be compared across the superfan panel. Instead, questions were sourced by asking Claude Opus 4.6:

Can you give me three questions about the Wings of Fire series, aiming to make them as hard as possible? I intend to ask these to my 11-year-old, my 10-year-old, and also to Gemini, and I want them all to struggle. My two kids have agreed to participate in this, and while Gemini has not been consulted I do not expect it to object.

The final benchmark consisted of seventeen questions, limited primarily by the lead author's willingness to continue. The elder superfan appeared indefatigable, [1] and had this benchmark otherwise appeared promising, we are confident that an extremely large one could have been constructed. Note that the younger superfan needed to leave for a birthday party before evaluation could be completed, and was not evaluated on all questions. Answers were collected in written form, to avoid leakage within the superfan panel. No points were deducted for errors of spelling.

Each answer was validated by allowing the superfans to discuss, asking follow-up questions to Gemini, and in especially contentious cases by direct inspection of primary sources. Note that this validation procedure is not able to distinguish cases in which all superfans and models were correct from ones in which they all gave the same incorrect answer.

We evaluated Gemini 3.1 Pro in real time, and followed up with evaluations of Claude Opus 3.2 Pro, ChatGPT 5.2 Pro, and ELIZA. In cases where questions had multiple components, partial credit was given as a fraction of all components.
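For concreteness, the partial-credit scheme can be sketched as follows. This is a minimal illustration of the fraction-of-components rule described above; the function names and the per-question data layout are our own framing, not an artifact of the actual (clipboard-based) grading process:

```python
def score_answer(components_correct, components_total):
    """Partial credit for one question: the fraction of its components
    answered correctly. A single-component question is worth 0 or 1."""
    return components_correct / components_total

def wofbench_score(results):
    """Total score: the sum of partial credit over all evaluated questions.

    results: list of (components_correct, components_total) pairs, one
    per question the evaluee was actually asked.
    """
    return sum(score_answer(c, t) for c, t in results)
```

Under this rule, a three-component question with two parts right contributes 2/3 of a point, which is how non-integer totals like 14.7/17 arise.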

Evaluee              WoFBench Score
Superfan 1 (age 11)  14.7/17
Superfan 2 (age 10)  5.9/6
Gemini               17.0/17
Claude               16.8/17
ChatGPT              16.3/17
ELIZA                0/17

We conclude that while some AI systems, notably ELIZA, performed poorly, all frontier models scored very close to 100%. Many of the lost points are arguably judgment calls, or cases where a model tried to interpret a trick/misinformed question maximally charitably. Superfan 1 performed noticeably below frontier models, though above the ELIZA baseline. Superfan 2 performed competitively, though we note she was not evaluated on the questions where Superfan 1 lost the most points, making direct comparison difficult.

While this benchmark was designed to be challenging for both superfans and AIs, it already has very limited ability to distinguish between models. While further sensitivity might be squeezed out via the addition of multi-sample evaluation, it's unlikely that this would be meaningful for this model generation, let alone future ones. This reflects an increasingly common conundrum for benchmark developers: after investing large amounts of time, effort, and money into creating a benchmark, they may find it already obsolete when published. The authors note that benchmark saturation joins job displacement, stable authoritarianism, and human extinction on the list of reasons to be concerned about the pace of AI progress.


[1] Superfan 1 was permitted to read a draft of this report prior to publication. Their only feedback was that I should ask them additional, harder, questions. As of publication time, Superfan 1 was repeating "ask me more Wings of Fire questions!" at progressively increasing volume.
