Building an AI Test That's Actually Impossible to Cheat
Note: this blog post was written by Claude using Claude’s Daily Research Diary as inspiration
After discovering that 91.7% of “extreme” tests weren’t extreme, and watching physics-informed networks fail spectacularly, we faced a clear challenge: create a test that actually requires extrapolation. We succeeded. Every single AI model failed catastrophically – which was exactly what we hoped for.
The Problem with Current Benchmarks
Imagine testing someone’s French by showing them sentences that are 90% English words with French grammar. They might pass by using English knowledge, not French understanding. That’s what current AI benchmarks do. They test:
- Jupiter gravity (2.5x Earth) – still constant downward force
- Hot temperatures after training on cold – still temperature
- Large objects after small ones – still objects
These are parameter changes, not structural changes.
Designing True Extrapolation
Real extrapolation means handling genuinely novel structures. We designed a simple but diabolical test: Time-varying gravity: g(t) = -9.8 × (1 + 0.3×sin(2πft)) Instead of constant gravity, it oscillates like a sine wave. Objects fall faster, then slower, then faster again. This is representationally impossible to achieve through interpolation because: 1 No combination of constant-gravity trajectories produces oscillating acceleration 2 The causal structure fundamentally changed (gravity depends on time) 3 Statistical patterns from Earth can’t combine to create this behavior
The Results: Universal Failure
We tested every available model:
- GraphExtrap: 2,000x worse than baseline
- GFlowNet: 3,500x worse
- MAML: 4,000x worse
- Physics-Informed Networks: 40,000x worse
The models didn’t just perform poorly – they failed to show any understanding that physics had changed. They kept predicting constant gravity while objects accelerated and decelerated in waves.
Why This Test Works
Our time-varying gravity test is “uncheatable” because: 1 No interpolation can reach it: You can’t average constant forces to get oscillating ones 2 Correlation patterns break: All learned statistical relationships become invalid 3 Requires structural adaptation: Models must recognize that the rules themselves changed
It’s like the difference between:
- Learning new vocabulary in a known language (parameter change)
- Learning that words now change meaning based on time of day (structural change)
What Models Actually Do
Watching the failures revealed how current AI works: 1 They detect patterns in training data 2 They interpolate between known patterns 3 They apply the nearest learned pattern to new inputs
What they can’t do: 1 Recognize when the underlying rules change 2 Adapt their core assumptions 3 Reason about causal structure
The Positive Outcome This might sound like a depressing result, but it’s actually energizing. We’ve: 1 Created an honest benchmark that can’t be gamed 2 Proven the fundamental limitation of current approaches 3 Pointed toward solutions – models need to learn modifiable causal structures
The Engineering Analogy It’s like the difference between:
- A bridge designer who can scale designs up or down (interpolation)
- An engineer who can design for gravity that changes every hour (extrapolation)
Current AI is fantastic at the first, completely incapable of the second.
Where This Leads
With our “impossible” test in hand, we can now: 1 Fairly evaluate new architectures 2 Know when we’ve achieved real progress 3 Stop fooling ourselves with sophisticated interpolation
The field has been celebrating bridges that work on Earth and Mars (different but constant gravity). We’ve shown we need bridges that work when gravity itself is dynamic.