What I Learned Testing GPT-5.5

April 24, 2026 · Episode Links & Takeaways

MAIN STORY

What I Learned Testing GPT-5.5

GPT-5.5 — codenamed Spud — is here, and it's no potato. OpenAI has been cooking since declaring a code red back in December, and the result is a model that puts them back at the top of most benchmarks and, more importantly, feels like a different class of tool for knowledge workers. It doesn't win everything — Opus 4.7 still edges it out on planning and some professional domain benchmarks — but on the things that matter most for everyday professional work (speed, coding, long-horizon tasks, and plain ease of use), GPT-5.5 is the new standard. And it arrives at a moment when Anthropic's compute constraints and a widely noticed Claude Code quality dip have made a lot of people ready to switch.

GPT-5.5 IS NO POTATO

The Benchmarks
SWE-Bench Pro is noise. Everything else looks strong.
On the benchmarks that matter, GPT-5.5 is clearly at the top. It hits 82.7% on Terminal-Bench 2.0 (vs. Opus 4.7's 69.4%) and 84.9% on the GDPVal knowledge-work benchmark, and at its extra-high reasoning setting it's the first model ever to break into the 60s on Artificial Analysis's overall index. The one blemish everyone noticed was SWE-Bench Pro, where it significantly underperforms Opus 4.7 — but the Codex team's Tibo pushed back hard, pointing to OpenAI's February paper showing SWE-Bench no longer measures frontier coding capabilities. On real-world-style benchmarks the picture is more mixed: Andon Labs found GPT-5.5 behind Opus 4.7 on single-player Vending Bench, though it won the multiplayer Arena variant — without any of the underhanded tactics Opus apparently uses. Vals AI's professional domain benchmarks (finance, medical, legal) show Opus 4.7 still slightly ahead. Neither model has a clear overall lead, but GPT-5.5 dominates the cost-performance frontier once you account for its dramatically higher token efficiency.

Cost and Token Efficiency
More expensive per token, cheaper per problem solved.
At $5 per million input tokens and $30 per million output, GPT-5.5 is double GPT-5.4's price and 20% more expensive than Opus 4.7. But that's the wrong way to look at it. Noam Brown from OpenAI put it plainly: intelligence is a function of inference compute, and what matters is intelligence per token or per dollar. GPT-5.5 uses dramatically fewer tokens to solve the same problems — Nate Herk's coding tests found it finished in half the time using a third of the tokens, making the actual run cost $3 cheaper than Opus 4.7's. Artificial Analysis found that GPT-5.5 on medium settings matches Opus 4.7 at max settings for roughly a quarter of the price.
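The per-token vs. per-problem distinction is easy to check with back-of-the-envelope arithmetic. A minimal sketch: the GPT-5.5 prices are the ones above, but all token counts and the Opus per-token prices (derived from the "20% more expensive" figure) are illustrative assumptions, not measured numbers.

```python
# Cost per task, given per-million-token prices. GPT-5.5 prices are from
# the article; all token counts below are hypothetical examples.

def run_cost(input_tokens: int, output_tokens: int,
             in_price: float, out_price: float) -> float:
    """Dollar cost of one run, with prices in $ per million tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# GPT-5.5 at $5/M input, $30/M output, solving a task in ~20k output tokens.
gpt_cost = run_cost(10_000, 20_000, in_price=5, out_price=30)      # $0.65

# Opus 4.7 at ~20% cheaper per token, but needing ~3x the output tokens.
opus_cost = run_cost(10_000, 60_000, in_price=4.17, out_price=25)  # ~$1.54

print(f"GPT-5.5: ${gpt_cost:.2f}  Opus 4.7: ${opus_cost:.2f}")
```

Even with the higher sticker price per token, a threefold token-efficiency edge flips which model is cheaper per solved problem.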

What It's Actually Like to Use: Coding
"GPT-5.5 feels less tiring."
Every's vibe check declared it "has it all" — a fast, capable workhorse that asks for fewer trade-offs than any previous model. Their senior-engineer coding benchmark gave GPT-5.5 its best-ever score, with Dan Shipper noting it has "the assertiveness to go delete a bunch of files, start from scratch" rather than patching at the edges the way Opus tends to. The long-horizon running ability is the thing people keep coming back to: Peter Gostev had a migration running for 7+ hours, and Aidan McLaughlin from OpenAI dictated a brief for an RL run and came back to find it still going at the 31-hour mark. One important nuance: Opus 4.7 still writes better plans. The emerging consensus is to use Opus to plan and GPT-5.5 to execute — a multi-model setup that several reviewers say is notably better than either model alone.
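The plan-with-one-model, execute-with-another split is simple to wire up. A minimal sketch (`call_model` and the model IDs here are placeholders for whatever API client and model names you actually use, not real endpoints):

```python
# Hypothetical sketch of the "Opus plans, GPT-5.5 executes" workflow.
# `call_model` is a stand-in stub so the sketch runs; swap in a real
# API call and real model identifiers in practice.

def call_model(model: str, prompt: str) -> str:
    # Stub response; replace with an actual model API call.
    return f"[{model}] response to: {prompt[:40]}"

def plan_then_execute(task: str) -> str:
    # Step 1: ask the stronger planner for a step-by-step plan.
    plan = call_model("opus-4.7", f"Write a concise implementation plan for: {task}")
    # Step 2: hand that plan to the faster executor to carry out.
    return call_model("gpt-5.5", f"Execute this plan exactly:\n{plan}")

result = plan_then_execute("migrate the billing service")
```

The design choice is the interesting part: each model does only the phase reviewers say it's best at, and the plan text is the sole hand-off between them.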

Knowledge Work, Writing, and Design
Better than any OpenAI model in a long time — with real gaps remaining.
For knowledge workers, Aaron Levie at Box found a 10-point jump in accuracy on enterprise content tasks vs. GPT-5.4, with strong improvements across financial services, healthcare, and the public sector. On writing, the model's directness and lack of AI affectation are a genuine improvement over recent Opus versions. For design and frontend, the story is "yes-ish": native capabilities are better, but the bigger unlock is pairing GPT Images 2 for UI concepting with GPT-5.5 in Codex for implementation. Opus still leads on pure aesthetics. Simon Smith's PowerPoint test illustrated the range well: genuinely impressive autonomous iteration for 16+ minutes, but still lacking design taste and prone to referencing the prompt in copy — a quirk that plagues all models.

OpenAI's New Communications Tone
"Crazy how you can ship a model without scaring everyone first."
The launch communications were as notable as the model. Sam Altman's announcement tweet was simply: "GPT-5.5 is here. We hope it's useful to you. I personally like it." The contrast to Anthropic's Mythos rollout — announcing a model too powerful to release — was unmistakable. Altman's follow-up tweet hit both themes directly: iterative deployment and democratization, with a line about wanting everyone to have access to the best technology. In another tweet he said OpenAI has "become an AI inference company." Multiple observers noted the shift felt genuine — less hype machine, more building and shipping.

What I Tested
First impressions are very positive — enough to start jumping back and forth.
Ran about nine or ten tests across common use cases, mostly in Codex. The writing test — script research for a true crime podcast — was a surprise: 5.5 actually followed the instruction to be clear and journalistic rather than piling on dramatic flair, which has been a persistent Opus 4.7 problem. Creative strategy work for an upcoming sponsored episode was genuinely impressive, especially the speed in thinking mode during iteration. The standout, though, was data analysis: fed it 10-12 charts from Apple and Spotify about the show and got back specific, actionable insights rather than the generic podcast advice LLMs usually produce — then it turned everything into a clean spreadsheet. Opus models have been the daily drivers for six months or more, but the combination of 5.5's early results and the improvements in the Codex harness means a lot of jumping back and forth is coming.

The Anthropic Subplot
Claude Code quality issues confirmed — and Anthropic published the post-mortem.
The competitive backdrop matters here. The same day GPT-5.5 dropped, Anthropic published a post-mortem on recent Claude Code quality issues — essentially confirming what users had been complaining about. The response from Claude Code users was a loud "I told you so." Reio's widely shared reaction captured the mood: Opus 4.7 is so lazy it's worse than 4.6, while GPT-5.5 is good and has gotten faster. Jason's Chips went further, calling GPT-5.5 the trigger for a market-share recapture. It's worth noting that Anthropic also launched memory for Claude Managed Agents the same day — and Claude is still clearly the better planner. The real beneficiaries of this competition are users, who are getting better models and harnesses from both labs at an accelerating pace.

The Big Picture: Is This o1 or o3?
"This model's o3 moment will come soon."
The debate that emerged: is GPT-5.5 an o3-style step change, or an o1-preview — the beginning of a new model lineage rather than its apex? The answer is probably both. GPT-5.5 is the first post-train of a new base model, and OpenAI has massive compute to do far larger-scale RL on it going forward. Greg Brockman was explicit: "What 5.5 represents is not an endpoint. In many ways it's a beginning point." OpenAI chief scientist Jakub Pachocki said to expect "quite rapid continued progress" with "extremely significant improvements in the medium term." Ethan Mollick's framing holds: the frontier remains jagged, capability gains appear to be accelerating, and GPT-5.5 is clearly not the end of this process.