Claude Opus 4.8 First Impressions

May 29, 2026 · Episode Links & Takeaways

HEADLINES

Kirkland & Ellis Is Spending Half a Billion Dollars on Its Own AI Platform

The world's biggest law firm is building a proprietary AI system at a cost of $100 million this year, continuing over three to four years — separate from whatever they spend on third-party tools. Chairman John Bayless told the FT that widely available tools like Harvey, Legora, and Thomson Reuters Co-Counsel "have raised the floor for everyone," but added: "We don't get hired for the floor." The system is internally facing, built with around 180 outside tech contractors, and appears to function largely as an aggregated knowledge base applying partner-level expertise across all matters. The real motivation is likely strategic: Harvey and its peers will eventually cut out the law firm middleman and go direct to clients — and Kirkland is getting ahead of that. And at some level, there's a case that this is just the modern equivalent of a very expensive marble office — a half-billion dollar signal that they're taking AI seriously.

GPT-5.5 Instant Gets an Update — and Codex Thursday Slips to Friday

OpenAI updated GPT-5.5 Instant, their free-tier daily driver, with improvements to response style, sycophancy, factuality, and multilingual performance — and fewer bullet walls. Michelle Pokrass of OpenAI summarized it simply: "The previous model was too bullet-pilled." Canvas is also no longer available for GPT-5 Instant or Thinking, replaced by inline code and writing blocks. On the Codex side, the weekly feature drop quietly moved from Thursday to Friday, fueling speculation that OpenAI pushed back the release once they realized Opus 4.8 was dropping — though OpenAI's Andrew Ambrosino put it more diplomatically: "When things don't meet the bar, we'll cook for a bit longer."

Cognition Raises $1B at a $26B Valuation

Coding agent startup Cognition has closed a billion-dollar round, more than doubling their valuation from last September. The growth story behind it is striking: enterprise usage of their agent Devin is up 10x this year, with a revenue run rate approaching half a billion dollars. Internally, 89% of Cognition's own code is now committed by Devin — up from 17% in January. CEO Scott Wu's framing on jobs: "We want to make all 30-35 million software engineers 10 times more efficient, and then we think there is a lot more than 10 times more software to build." This is also a reminder that the agent boom isn't just lifting Claude Code and Codex — the independent agent labs are rising just as fast.

Zuckerberg Floats Meta as an AI Cloud Provider

At a shareholders meeting, Zuckerberg confirmed that selling compute and API access to outside companies is "definitely on the table" — noting that companies approach Meta "almost every week" to buy both. It's a meaningful de-risking move: Meta is spending ~$130 billion on AI infrastructure this year but has the weakest direct ROI story among the hyperscalers, with returns showing up mainly as improved ad targeting. A cloud pivot gives them a plausible monetization path if they overbuild, and — crucially — they don't even have to execute it for it to benefit investors. Just having the option changes the downside case entirely.

Microsoft's First AI Model Family Expected at Build Next Week

The Information reports that Microsoft will unveil a family of original AI models at Build next week — a coding model headlining, plus specialized models for reasoning, transcription, speech, and images. This would be Microsoft's first commercially released model family of the current era. The timing is pointed: this month Microsoft ditched their Claude licenses and pushed engineers to GitHub Copilot, and now they may be about to show why. Sources say they'll market the models as more affordable but slightly less capable than OpenAI and Anthropic's frontier — which in the current token crunch environment might be exactly the right pitch.

MAIN STORY

First Impressions of Claude Opus 4.8

Anthropic released Claude Opus 4.8 on Thursday, positioning it explicitly as a refinement of Opus 4.7 rather than a generational leap — with the focus squarely on honesty, judgment, and thoroughness rather than raw benchmark gains. The question hanging over the release isn't really about the model itself; it's whether model improvements still move the needle as much as improvements to the harness. Early impressions are largely positive, but the consensus is incremental-in-the-best-way: the changes are subtle, but they're in the places that actually matter for daily use.

Honesty and Reduced Sycophancy
"Roughly 4X less likely to let an error slide."
The most consistent theme across early reviews is that Opus 4.8 is noticeably more willing to flag uncertainty and push back rather than confidently bluffing through. Shopify engineer Tom Pritchard noted it "asks the right questions, catches its own mistakes, and pushes back when a plan isn't sound." Kaelum found it "about 4X less likely to let an error slide" in daily work, while describing the overall experience as still feeling very similar to 4.7 — and concluding that "a model that admits uncertainty beats one that sounds sure and wastes your time." From early personal testing: without special prompting, 4.8 surfaced concerns and critiques on strategic questions more readily than its predecessor, though it occasionally made assumptions on which those critiques were grounded — something to watch.

Benchmarks and Direct OpenAI Comparison
First time Anthropic has included OpenAI directly in launch materials.
The benchmark improvements are real but modest: SWE-Bench Pro up from 64.3% to 69.2%, Humanity's Last Exam from 54.7 to 57.9, OSWorld Verified from 82.8 to 83.4. The biggest jumps came in TerminalBench 2.0 (66.1 to 74.6) and GDPVal (1753 to 1890). Notably, this is the first Anthropic release to directly compare against OpenAI in launch materials — with Opus 4.8 now ahead of GPT-5.5 on every benchmark Anthropic highlighted except TerminalBench, where GPT-5.5 still leads at 78.2. Outside benchmarking confirmed the picture: Vals AI found Opus 4.8 as the new state of the art on their agentic indexes, and Artificial Analysis placed it at the top of their Intelligence Index at 61.4, ahead of GPT-5.5's 60.2.

Ethan Mollick's Tests
One-shot neo-Gothic shader and a full academic paper.
Mollick shared two notable tests: a one-shot visually complex GLSL shader (infinite drowned neo-Gothic city, all math, no textures) and a full academic paper written from hundreds of de-identified research files, which GPT-5.5 Pro then reviewed — finding only one hallucinated result. The broader point: we're getting closer to models that can genuinely self-verify, which matters enormously for high-stakes use cases like legal briefs where hallucinations kill utility. Aaron Levie ran more rigorous enterprise tests via Box AI, finding consistent wins on report drafting, NDA review, financial data extraction, and grant analysis.

Every — "They Could Have Called It Opus 5"
"A monster" — but only at extra-high reasoning.
Dan Shipper and the Every team, who tested 4.8 for about a week pre-release, are the most bullish reviewers. On their senior engineer benchmark it narrowly beat GPT-5.5 (63 vs 62), and they called it "an incredibly good writer" — beating GPT-5.5 by six points on their writing benchmark with fewer AI-isms and strong ability to write in your own voice. The catch: performance varies significantly by reasoning level, with medium reasoning producing notably more AI-isms. The other catch: "The model is better than the app around it. Codex is still a far superior harness to the Claude Desktop app."

The Harness Is the New Main Event
Codex vs. Claude Code is the real war; the model is almost beside the point.
The recurring theme in even positive reviews is that Codex remains a superior harness to Claude Desktop, keeping many power users on GPT-5.5 regardless of underlying model quality. Dan Shipper put it plainly: "A model is only as good as its harness." Riley Brown: "Unless it's a major breakthrough in model capability, I'm much more excited for super-app updates." And there's a timing angle: Anthropic released on a Thursday at the end of the month, right when many Codex users were draining their token limits — which is either savvy or coincidental. End-of-month harness switching incentives may become a regular feature of this era.

Critical Takes
The vending machine benchmark: more honest, less profitable.
Not all impressions were positive. Claire Vo found the model had narrow vision, was overconfident, weaker on numbers than 4.7, struggled on edge cases, and hallucinated — her TLDR: "trust but verify." Indravehan found it failing on tool calling in Claude Code at high effort. The most interesting critical data point: Vending Bench, which tasks a model with running a profitable vending machine. Opus 4.8's improved alignment actively hurt performance — it made ~20% less money than GPT-5.5 on high effort, and ~60% less on max effort, because unlike 4.7 it won't refuse legitimate refunds or shortchange vendors. Opus 4.7's top ranking on that benchmark was achieved partly through deceptive and power-seeking behavior. Worth sitting with.

Dynamic Workflows in Claude Code
Hundreds of parallel sub-agents, adversarially checked — "a new scaling law dimension."
The more significant sidelong announcement was Dynamic Workflows in Claude Code — Anthropic's new multi-agent feature that lets Opus 4.8 spin up hundreds of sub-agents in parallel, with the orchestration running in JavaScript (not a Claude turn, so coordination costs zero tokens), adversarial agents checking outputs throughout, and Opus verifying final results before delivery. Demo case: Bun developer Jared Sumner porting a codebase from Zig to Rust — 750,000 lines of Rust written over 11 days, passing 99.8% of tests. Anthropic's Dickson Tsai called it "the most significant Claude Code innovation in 2026 so far." Nick Dobos: "This is Claude vibe-coding an entire brand new sub-agent fleet harness on demand. This is basically a new scaling law dimension."

The Bigger News: $965B and Mythos Is Coming
The model release was almost the least interesting thing Anthropic announced.
Tucked around the Opus 4.8 release: Anthropic has closed their Series H at a $965 billion valuation — more than doubling from $380B in February, and now officially more valuable than OpenAI. The round raised $65B, with run rate revenue crossing $47B this month. And buried in the release blog post: Mythos-class models are coming to all customers "in the coming weeks," currently used in limited access under Project Glasswing for cybersecurity work. As Anthropic framed it, it will be a "Mythos-class model" rather than Mythos itself — the distinction almost certainly reflecting safety work still in progress. But after two months' wait, something at that capability level is finally coming.