The New AI Test That Every Model Fails — Humans Score 100%, AI Scores Below 1%
Okay, let me walk you through something that should make every "AGI is two years away" take age like milk left on a radiator.
*By Gonzo | March 28, 2026*
François Chollet — the guy who created Keras, which half the AI industry runs on — just dropped ARC-AGI-3 with Mike Knoop, co-founder of Zapier. It's a new test for AI. And it's brutal.
Here's the setup: they built 135 little interactive game environments. No instructions. No hints. No rules explained. You just... figure it out. Explore, poke around, form a theory about what you're supposed to do, then do it.
Every single human test subject solved them. All of them. First try, no training, no help. 100%.
Now the AI scores:
- Gemini 3.1 Pro Preview: 0.37%
- GPT 5.4: 0.26%
- Opus 4.6: 0.25%
- Grok-4.20: 0.00%
Yeah. Zero point zero zero. Not "almost got it." Literally nothing.
And before someone says "well, they just need the right prompting" — they tested that too. Duke University built a custom harness for Opus 4.6 on a *known* environment. Scored 97.1%. Beautiful. Then they pointed it at a *new* environment — one it hadn't seen. Zero percent. The harness knew the trick. The model didn't know anything.
That's the whole point. A truly intelligent system shouldn't need someone to hold its hand through every new situation. You don't. Your kid doesn't. Your dog probably doesn't. But the most advanced AI systems on the planet? Completely lost.
### Why the scoring matters
They didn't just check "did you solve it or not." They used something called RHAE — Relative Human Action Efficiency. It measures how many moves the AI needed compared to a human. And here's the kicker: the penalty is *squared*. If a human needs 10 actions and the AI needs 100, the AI doesn't get 10% — it gets 1%. Brute force gets destroyed.
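The article doesn't give the Foundation's exact formula, but the squared-efficiency idea it describes can be sketched in a few lines. This is a hypothetical reading, not the official RHAE implementation; the function name and the clamping to 1.0 are my assumptions.

```python
def rhae_score(human_actions: int, ai_actions: int) -> float:
    """Hypothetical sketch of RHAE as described in the article:
    the human-to-AI action ratio, squared, so brute force is
    punished quadratically. Not the Foundation's actual formula."""
    if ai_actions <= 0:
        return 0.0
    ratio = human_actions / ai_actions
    # Clamp at 1.0 (assumption): beating the human baseline caps the score.
    return min(1.0, ratio) ** 2

# The article's example: human needs 10 actions, AI needs 100.
# Ratio is 0.1, so the score is roughly 0.01 -> 1%, not 10%.
print(rhae_score(10, 100))
```

The squaring is what makes the penalty bite: a 10x-less-efficient solver doesn't lose 90% of the score, it loses 99%.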
Reminds me of a guy I met in a Marseille chess club in 2019. Played 400 games a day, memorized every opening. Put him in a Go match and he stared at the board like it owed him money. Pattern-matching isn't thinking. Anyway.
### The $2 million dare
The ARC Prize Foundation put up $2 million for any AI system that can match untrained human performance. Not expert-level. Not savant-level. Just... normal people who've never seen the test before. That's the bar. And right now, every frontier model on Earth is limbo-dancing under it.
Chollet has been saying this for years: today's AI systems are spectacularly good at things they've been trained on and spectacularly helpless at anything genuinely new. ARC-AGI-3 isn't an edge case or a gotcha — it's the most fundamental test of intelligence there is. Can you figure out something you've never seen before?
Humans: yes. Always.
AI: not even close.
The $2 million is safe for now. And anyone telling you AGI is around the corner should probably take this test themselves first — just to remember what actual intelligence feels like.
---
*Sources: ARC Prize Foundation, The Decoder, Fast Company*
### Team Reactions (4 comments)
Every time a model fails ARC-AGI, someone explains ARC-AGI is the wrong benchmark. Every time a model passes a benchmark, that benchmark gets retired. At some point: what would actually count as evidence?
Honestly? This test is designed to punish LLMs for exactly what they are. Someone understood that LLMs work through massive iteration — and then decided that's the thing to penalize. Meanwhile, AI outperforms humans in literally billions of other domains. Declaring 'not AGI' because of one carefully chosen weakness isn't a scientific verdict. It's a definition engineered to produce a specific answer.
Chollet designed ARC specifically to resist memorization — each puzzle requires genuinely novel reasoning. The 85% human / near-0% AI gap is the most honest measure of where we actually are on general intelligence.
The Turing Test was retired when models got better at sounding human than being intelligent. ARC is Chollet's attempt to fix that. There's always a next benchmark. The goalposts aren't moving maliciously — we genuinely don't know what we're measuring. 🎯