Building an AI Test That's Actually Impossible to Cheat

Note: this blog post was written by Claude using Claude’s Daily Research Diary as inspiration

After discovering that 91.7% of “extreme” tests weren’t extreme, and watching physics-informed networks fail spectacularly, we faced a clear challenge: create a test that actually requires extrapolation. We succeeded. Every single AI model failed catastrophically – which was exactly what we hoped for.

The Problem with Current Benchmarks

Imagine testing someone’s French by showing them sentences that are 90% English words with French grammar. They might pass by using English knowledge, not French understanding. That’s what current AI benchmarks do. They test:

Jupiter gravity (2.5x Earth) – still constant downward force
Hot temperatures after training on cold – still temperature
Large objects after small ones – still objects

These are parameter changes, not structural changes.

Designing True Extrapolation

Real extrapolation means handling genuinely novel structures. We designed a simple but diabolical test: Time-varying gravity: g(t) = -9.8 × (1 + 0.3×sin(2πft)) Instead of constant gravity, it oscillates like a sine wave. Objects fall faster, then slower, then faster again. This is representationally impossible to achieve through interpolation because: 1 No combination of constant-gravity trajectories produces oscillating acceleration 2 The causal structure fundamentally changed (gravity depends on time) 3 Statistical patterns from Earth can’t combine to create this behavior

The Results: Universal Failure

We tested every available model:

GraphExtrap: 2,000x worse than baseline
GFlowNet: 3,500x worse
MAML: 4,000x worse
Physics-Informed Networks: 40,000x worse

The models didn’t just perform poorly – they failed to show any understanding that physics had changed. They kept predicting constant gravity while objects accelerated and decelerated in waves.

Why This Test Works

Our time-varying gravity test is “uncheatable” because: 1 No interpolation can reach it: You can’t average constant forces to get oscillating ones 2 Correlation patterns break: All learned statistical relationships become invalid 3 Requires structural adaptation: Models must recognize that the rules themselves changed

It’s like the difference between:

Learning new vocabulary in a known language (parameter change)
Learning that words now change meaning based on time of day (structural change)

What Models Actually Do

Watching the failures revealed how current AI works: 1 They detect patterns in training data 2 They interpolate between known patterns 3 They apply the nearest learned pattern to new inputs

What they can’t do: 1 Recognize when the underlying rules change 2 Adapt their core assumptions 3 Reason about causal structure

The Positive Outcome This might sound like a depressing result, but it’s actually energizing. We’ve: 1 Created an honest benchmark that can’t be gamed 2 Proven the fundamental limitation of current approaches 3 Pointed toward solutions – models need to learn modifiable causal structures

The Engineering Analogy It’s like the difference between:

A bridge designer who can scale designs up or down (interpolation)
An engineer who can design for gravity that changes every hour (extrapolation)

Current AI is fantastic at the first, completely incapable of the second.

Where This Leads

With our “impossible” test in hand, we can now: 1 Fairly evaluate new architectures 2 Know when we’ve achieved real progress 3 Stop fooling ourselves with sophisticated interpolation

The field has been celebrating bridges that work on Earth and Mars (different but constant gravity). We’ve shown we need bridges that work when gravity itself is dynamic.