- The AI Daily Brief
- Posts
- Why AI Needs Better Benchmarks
Why AI Needs Better Benchmarks
March 26, 2026 · Episode Links & Takeaways
HEADLINES
Apple Can Distill Google's Gemini Models
Apple's AI partnership with Google goes much deeper than previously thought, including the ability to distill Gemini into smaller models. Bloomberg's Mark Gurman reported that Apple has backed down on keeping Siri voice-only, now building a standard chatbot interface with optional voice controls and deep iOS 27 integration — essentially delivering the features they advertised with Apple Intelligence two years ago. New reporting from The Information reveals Apple has full access to Gemini models, meaning they could distill large versions into smaller proprietary models to run on-device. However, sources were skeptical the models would be that useful to Apple's Foundation Models team, since Gemini is optimized for chatbots and coding while Apple has a very different vision for Siri.
The Information Apple Can 'Distill' Google's Big Gemini Model
Bloomberg Apple Plans AI Reboot With Siri App, New Look and 'Ask Siri' Button in iOS 27
MacRumors Apple Can Create Smaller On-Device AI Models From Google's Gemini
Ethan Mollick (X) Skeptical on-device distillation will deliver capable agents
Google's TurboQuant Algorithm and Lyria 3 Pro
Google published a research paper describing TurboQuant, a new compression algorithm that quantizes model context with almost zero losses. The process delivers a 6x reduction in memory usage and an 8x speed boost, potentially cutting inference costs by 50%. Tested on Llama-3.1-8B and Mistral-7B, both models achieved perfect needle-in-a-haystack scores with TurboQuant implemented. Cloudflare CEO Matthew Prince called it "Google's DeepSeek," while others compared it to Pied Piper's middle-out compression.
Google also shipped Lyria 3 Pro, an upgraded music model that can now create full tracks up to three minutes long with better understanding of lyrics and song structure — a big step up from the original 30-second limit. Rohan Paul noted that the real story may be less about the model itself and more about its availability via API, which could see it integrated into many more use cases.
Google Research Blog TurboQuant: Redefining AI efficiency with extreme compression
VentureBeat Google's new TurboQuant algorithm speeds up AI memory 8x, cutting costs by 50% or more
TechCrunch Google unveils TurboQuant, a new AI memory compression algorithm
Matthew Prince (X) "This is Google's DeepSeek"
The Verge Google Lyria 3 Pro makes longer AI songs
Google Blog Lyria 3 Pro: Create longer tracks in more Google products
Rohan Paul (X) Coherence across song structure is the hard part
Sanders and AOC Propose Data Center Construction Ban
Senator Bernie Sanders unveiled a Data Center Moratorium bill with AOC as co-sponsor, proposing to pause all data center construction nationwide until Congress establishes protections for workers, consumers, the environment, and civil rights. AOC's involvement is notable — she's been largely silent on AI, and her support signals the broader progressive movement may be staking out a position ahead of 2028. Senator Mark Warner called the idea "idiocy," arguing it would only let China move faster, though he raised his own alarm about AI job displacement, predicting recent college graduate unemployment could hit 30-35% before 2028.
TechCrunch Bernie Sanders and AOC propose a ban on data center construction
Axios Sanders and AOC unveil data center moratorium bill
Axios Sen. Warner calls AI data center moratorium "idiocy"
Senate.gov Sanders, Ocasio-Cortez Announce AI Data Center Moratorium Act
James Rosen-Birch (X) "Every part of this is detrimentally performative"
Mike Isaac (X) Sanders one of few in Congress seriously reckoning with AI labor consequences
Manus Cofounders Banned From Leaving China
The Financial Times reports that Manus CEO Xiao Hong and Chief Scientist Ji Yichao have been barred from leaving the country while Meta's $2 billion acquisition is reviewed. The theory is that Manus circumvented China's export controls by relocating headquarters from Beijing to Singapore. No formal investigation has been opened, but sources say the extreme outcome could be a forced unwind of the Meta deal — which would be messy given the technology is already being integrated. Chinese entrepreneur Lina Hua argued this was entirely predictable, writing that the founders will likely face prolonged probes, frozen assets, and financial penalties. Former White House advisor Dean Ball saw it differently, calling it a self-own that sends the message: if you want to found a company, move to Singapore first.
Financial Times China reviews $2bn Manus sale to Meta as founders barred from leaving country
The Information Chinese Authorities Bar Manus Cofounders From Leaving Country
Lina Hua (X) Predicting prolonged probe and financial penalties
Bill Bishop (X) "I didn't think the top execs would be so naive as to go back"
Dean Ball (X) "A major self own by China"
Tao Hu (X) China Computer Federation warns researchers away from NeurIPS
MAIN STORY
Why AI Needs Better Benchmarks
The launch of ARC-AGI-3 this week is the latest attempt to solve one of AI's most persistent problems: how do you actually measure whether models are getting smarter? Traditional benchmarks are saturating faster than new ones can be created, benchmark-maxxing is distorting what the numbers mean, and even the best tests tend to measure narrow task performance rather than the jagged frontier of real-world capability. The history of benchmarks over the past two years is really a story of the field chasing its own tail — and ARC-AGI-3 represents the most radical rethink yet.
BUILDING BETTER BENCHMARKS
Saturation and Benchmark-Maxxing
The numbers stopped meaning anything useful.
As far back as GPT-4o in May 2024, major models were already above 80% on MMLU. By last summer, most 2024-era benchmarks had been abandoned or upgraded. Today, GPT-5.4 and Opus 4.6 are posting nearly identical scores on Humanities Last Exam and SWE-Bench variants, despite clear real-world differences in capability. Meanwhile, benchmark-maxxing — training specifically against test sets — further erodes trust. When SWE-ReBench launched with different problems, Chinese models dropped sharply while Western models held up better. Meta's Llama 4 Maverick gamed LM Arena by testing multiple variants until one clicked with users.
Attempts to Fix Benchmarks
Harder questions, real-world tasks, longer horizons.
The brute force approach of making questions harder kept benchmarks like SWE-Bench and GPQA relevant a bit longer but didn't address the core problem. More promising was the shift toward real-world simulation: OpenAI's SWE-Lancer expressed coding ability in dollar terms against Upwork tasks, and GDP-Val extended the concept to white-collar work like spreadsheets and slide decks. GDP-Val's biggest contribution may have been giving labs a concrete set of real tasks to hillclimb against — after its release, tool-use capabilities rapidly improved. METR's long-task benchmark became one of the clearest demonstrations of progress, showing the jump from 5-minute tasks (GPT-4o) to 10-hour tasks (Opus 4.6), but even METR is now running out of tasks to test against.
OpenAI Introducing the SWE-Lancer benchmark
OpenAI Measuring the performance of our models on real-world tasks
MIT Technology Review Can we fix AI's evaluation crisis?
ARC-AGI: From Colored Squares to Real-Time Games
Testing how models learn, not what they know.
ARC-AGI began in 2024 as a series of abstract visual logic puzzles designed to test reasoning rather than memorization. Early results were promising — no models came close to human performance — until OpenAI's o3 blew past the human score with 76% on low inference and 88% on high. ARC-AGI-2 added complexity with symbolic interpretation, compositional reasoning, and context-dependent logic, but the latest generation of models saturated it too, with Gemini 3 Deep Think hitting 84.6%.
ARC-AGI-3 is a complete rethink. The static grids are gone, replaced by 135 simple graphical games that models must explore, understand, and solve in real time with no instructions. Scoring is based on efficiency relative to humans — and every frontier model currently scores below 1%. Google DeepMind's Xiao Ma shared Gemini's playback, noting it thought it was playing Activision Tennis.
ARC Prize Introducing ARC Prize
ARC Prize OpenAI o3 Breakthrough High Score on ARC-AGI-Pub
ARC Prize ARC-AGI-3
ARC Prize (X) ARC-AGI-3 introduction thread
What It All Means
We need innovation in measurement, not just models.
François Chollet warned that ARC-AGI-3 is not a final exam for AGI — it's a moving target designed to spotlight the problems we haven't solved yet. The bigger takeaway may be that trying to "solve" benchmark saturation is as simple as not expecting benchmarks to last. Just as we need innovation in building models, we need innovation in measuring them. The question now is how fast models jump from sub-1% on ARC-AGI-3 to something meaningful — and what new thing we'll need to measure after that.
François Chollet (X) ARC-AGI is not a final exam
Braden Hancock (X) An alien species could ace ARC-AGI-3 on day one
Lisan Al Gaib (X) ARC-AGI-3 scoring breakdown
Xiao Ma (X) Gemini's ARC-AGI-3 playback