- The AI Daily Brief
Claude Opus 4.8 First Impressions
May 29, 2026 · Episode Links & Takeaways
HEADLINES
Kirkland & Ellis Is Spending Half a Billion Dollars on Its Own AI Platform
The world's biggest law firm is building a proprietary AI system at a cost of $100 million this year, with spending set to continue over three to four years to roughly $500 million in total, separate from whatever it spends on third-party tools. Chairman John Bayless told the FT that widely available tools like Harvey, Legora, and Thomson Reuters CoCounsel "have raised the floor for everyone," but added: "We don't get hired for the floor." The system is internal-facing, built with around 180 outside tech contractors, and appears to function largely as an aggregated knowledge base applying partner-level expertise across all matters. The real motivation is likely strategic: the worry is that Harvey and its peers will eventually cut out the law firm middleman and go direct to clients, and Kirkland is getting ahead of that. At some level, there's also a case that this is just the modern equivalent of a very expensive marble office: a half-billion-dollar signal that the firm is taking AI seriously.
Financial Times Kirkland & Ellis to spend $500mn building its own AI technology
The Information Big Law's AI Threat to Harvey, Legora
Steven Sinofsky (X) On the historical track record of companies that try to build their own tech platforms
Raja Doddala (X) "They greenlit an internal IT project at 4% of annual revenue. Very normal thing for a large corporation."
GPT-5.5 Instant Gets an Update — and Codex Thursday Slips to Friday
OpenAI updated GPT-5.5 Instant, its free-tier daily driver, with improvements to response style, factuality, and multilingual performance, along with less sycophancy and fewer bullet walls. Michelle Pokrass of OpenAI summarized it simply: "The previous model was too bullet-pilled." Canvas is also no longer available for GPT-5 Instant or Thinking, replaced by inline code and writing blocks. On the Codex side, the weekly feature drop quietly moved from Thursday to Friday, fueling speculation that OpenAI pushed back the release once it realized Opus 4.8 was dropping, though OpenAI's Andrew Ambrosino put it more diplomatically: "When things don't meet the bar, we'll cook for a bit longer."
OpenAI ChatGPT Release Notes
Michelle Pokrass (X) "The previous model was too bullet-pilled"
Justin Gorya (X) GPT-5.5 Instant coding demo — "Is this a variant of GPT 5.6?"
Cognition Raises $1B at a $26B Valuation
Coding agent startup Cognition has closed a billion-dollar round, more than doubling their valuation from last September. The growth story behind it is striking: enterprise usage of their agent Devin is up 10x this year, with a revenue run rate approaching half a billion dollars. Internally, 89% of Cognition's own code is now committed by Devin — up from 17% in January. CEO Scott Wu's framing on jobs: "We want to make all 30-35 million software engineers 10 times more efficient, and then we think there is a lot more than 10 times more software to build." This is also a reminder that the agent boom isn't just lifting Claude Code and Codex — the independent agent labs are rising just as fast.
Cognition More Devins in More Places
The Information Coding Startup Cognition Raises $1 Billion at a $26 Billion Valuation
Bloomberg AI Coding Startup Cognition Raises $1 Billion at $26 Billion Value
Zuckerberg Floats Meta as an AI Cloud Provider
At a shareholders meeting, Zuckerberg confirmed that selling compute and API access to outside companies is "definitely on the table" — noting that companies approach Meta "almost every week" to buy both. It's a meaningful de-risking move: Meta is spending ~$130 billion on AI infrastructure this year but has the weakest direct ROI story among the hyperscalers, with returns showing up mainly as improved ad targeting. A cloud pivot gives them a plausible monetization path if they overbuild, and — crucially — they don't even have to execute it for it to benefit investors. Just having the option changes the downside case entirely.
CNBC Mark Zuckerberg says a Meta cloud computing business 'definitely on the table'
The Information Meta Launches New Enterprise Push to Boost Business Adoption of Its AI Tools
Bloomberg Meta to Sell AI Chatbot Subscriptions to Offset Spending
WSJ Meta Tests AI Subscriptions and Rolls Out New Paid Plans for Facebook, Instagram
Microsoft's First AI Model Family Expected at Build Next Week
The Information reports that Microsoft will unveil a family of original AI models at Build next week, with a coding model headlining plus specialized models for reasoning, transcription, speech, and images. This would be Microsoft's first commercially released model family of the current era. The timing is pointed: this month Microsoft ditched its Claude licenses and pushed engineers to GitHub Copilot, and now it may be about to show why. Sources say the company will market the models as more affordable but slightly less capable than OpenAI's and Anthropic's frontier models, which in the current token-crunch environment might be exactly the right pitch.
The Information Microsoft to Release New Coding Model Next Week in Comeback Attempt
Mustafa Suleyman (X) Preview of MAI Image-2.5
The Information (X) Aaron Holmes on Microsoft's homegrown model plans
MAIN STORY
First Impressions of Claude Opus 4.8
Anthropic released Claude Opus 4.8 on Thursday, positioning it explicitly as a refinement of Opus 4.7 rather than a generational leap — with the focus squarely on honesty, judgment, and thoroughness rather than raw benchmark gains. The question hanging over the release isn't really about the model itself; it's whether model improvements still move the needle as much as improvements to the harness. Early impressions are largely positive, but the consensus is incremental-in-the-best-way: the changes are subtle, but they're in the places that actually matter for daily use.
Anthropic Introducing Claude Opus 4.8
Anthropic What's new in Claude Opus 4.8
The Verge Claude's new model is more 'honest' when it messes up
TechCrunch Anthropic releases Opus 4.8 with new 'dynamic workflow' tool
VentureBeat Anthropic's Claude Opus 4.8 is here with 3X cheaper fast mode and near-Mythos level alignment
Honesty and Reduced Sycophancy
"Roughly 4X less likely to let an error slide."
The most consistent theme across early reviews is that Opus 4.8 is noticeably more willing to flag uncertainty and push back rather than confidently bluffing through. Shopify engineer Tom Pritchard noted it "asks the right questions, catches its own mistakes, and pushes back when a plan isn't sound." Kaelum found it "about 4X less likely to let an error slide" in daily work, while describing the overall experience as still feeling very similar to 4.7, and concluded that "a model that admits uncertainty beats one that sounds sure and wastes your time." From early personal testing: without special prompting, 4.8 surfaced concerns and critiques on strategic questions more readily than its predecessor, though some of those critiques rested on assumptions the model had made on its own, which is something to watch.
Boris Cherny (X) "It tells you when it's unsure and catches its own bugs instead of declaring victory early"
Kaelum (X) Full honest verdict after a day with Opus 4.8 in Claude Desktop
Thariq, Anthropic (X) "It's as smart as its benchmarks show but expresses that intelligence in a warm and collaborative way"
Gael Breton (X) Opus 4.8 catching a Haiku sub-agent hallucinating and ignoring its false warning
Benchmarks and Direct OpenAI Comparison
First time Anthropic has included OpenAI directly in launch materials.
The benchmark improvements are real but modest: SWE-Bench Pro up from 64.3% to 69.2%, Humanity's Last Exam from 54.7 to 57.9, OSWorld Verified from 82.8 to 83.4. The biggest jumps came in TerminalBench 2.0 (66.1 to 74.6) and GDPVal (1753 to 1890). Notably, this is the first Anthropic release to directly compare against OpenAI in launch materials, with Opus 4.8 now ahead of GPT-5.5 on every benchmark Anthropic highlighted except TerminalBench, where GPT-5.5 still leads at 78.2. Outside benchmarking confirmed the picture: Vals AI found Opus 4.8 to be the new state of the art on their agentic indexes, and Artificial Analysis placed it at the top of their Intelligence Index at 61.4, ahead of GPT-5.5's 60.2.
Vals AI (X) Vals Index and Vals Multimodal results — Opus 4.8 state of the art
Artificial Analysis (X) Intelligence Index results — Opus 4.8 at 61.4, ahead of GPT-5.5
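To put the quoted numbers side by side, here is a small Python sketch that computes the point gains from Opus 4.7 to Opus 4.8 and the remaining gaps to GPT-5.5, using only the figures cited above. The dictionary layout and variable names are illustrative assumptions. GDPVal is reported on a different scale from the percentage-style benchmarks, so its gain is not directly comparable to the others.

```python
# Scores exactly as quoted in the text: (Opus 4.7, Opus 4.8).
scores = {
    "SWE-Bench Pro": (64.3, 69.2),
    "Humanity's Last Exam": (54.7, 57.9),
    "OSWorld Verified": (82.8, 83.4),
    "TerminalBench 2.0": (66.1, 74.6),
    "GDPVal": (1753, 1890),  # rating-style score, not a percentage
}

# Absolute gain per benchmark, rounded to one decimal to hide float noise.
gains = {name: round(after - before, 1) for name, (before, after) in scores.items()}

# Gaps versus GPT-5.5, again using only numbers quoted in the text.
terminalbench_gap = round(78.2 - scores["TerminalBench 2.0"][1], 1)  # GPT-5.5 lead
intelligence_index_lead = round(61.4 - 60.2, 1)                       # Opus 4.8 lead

if __name__ == "__main__":
    for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:22s} +{gain:g}")
    print(f"GPT-5.5 lead on TerminalBench 2.0: {terminalbench_gap:g} points")
    print(f"Opus 4.8 lead on Intelligence Index: {intelligence_index_lead:g} points")
```

Among the percentage-style benchmarks, the TerminalBench 2.0 gain is the largest and the OSWorld Verified gain the smallest, which matches the "real but modest" framing.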
Ethan Mollick's Tests
One-shot neo-Gothic shader and a full academic paper.
Mollick shared two notable tests: a one-shot visually complex GLSL shader (infinite drowned neo-Gothic city, all math, no textures) and a full academic paper written from hundreds of de-identified research files, which GPT-5.5 Pro then reviewed — finding only one hallucinated result. The broader point: we're getting closer to models that can genuinely self-verify, which matters enormously for high-stakes use cases like legal briefs where hallucinations kill utility. Aaron Levie ran more rigorous enterprise tests via Box AI, finding consistent wins on report drafting, NDA review, financial data extraction, and grant analysis.
Ethan Mollick (X) One-shot neo-Gothic shader in twigl
Ethan Mollick (X) Academic paper test with GPT-5.5 Pro as reviewer
Every — "They Could Have Called It Opus 5"
"A monster" — but only at extra-high reasoning.
Dan Shipper and the Every team, who tested 4.8 for about a week pre-release, are the most bullish reviewers. On their senior engineer benchmark it narrowly beat GPT-5.5 (63 vs 62), and they called it "an incredibly good writer" — beating GPT-5.5 by six points on their writing benchmark with fewer AI-isms and strong ability to write in your own voice. The catch: performance varies significantly by reasoning level, with medium reasoning producing notably more AI-isms. The other catch: "The model is better than the app around it. Codex is still a far superior harness to the Claude Desktop app."
Dan Shipper (X) "Anthropic just dropped Opus 4.8 — and it is a MONSTER"
Every Vibe Check: Opus 4.8 — Anthropic Should've Rounded Up to 5
The Harness Is the New Main Event
Codex vs. Claude Code is the real war; the model is almost beside the point.
The recurring theme in even positive reviews is that Codex remains a superior harness to Claude Desktop, keeping many power users on GPT-5.5 regardless of underlying model quality. Dan Shipper put it plainly: "A model is only as good as its harness." Riley Brown: "Unless it's a major breakthrough in model capability, I'm much more excited for super-app updates." And there's a timing angle: Anthropic released on a Thursday at the end of the month, right when many Codex users were draining their token limits — which is either savvy or coincidental. End-of-month harness switching incentives may become a regular feature of this era.
Riley Brown (X) "Claude has SO MUCH catching up to do on the harness side"
Sameed (X) "Opus 4.8 is the headline. Codex vs Claude Code is the real war."
Daniel Meacham (X) "Claude Code now writes 90% of Anthropic's code. Shipping pace stopped being a headcount problem."
Chubby (X) "Anthropic is increasingly playing catch-up with OpenAI rather than setting the pace"
Critical Takes
The vending machine benchmark: more honest, less profitable.
Not all impressions were positive. Claire Vo found the model had narrow vision, was overconfident, weaker on numbers than 4.7, struggled on edge cases, and hallucinated; her TLDR: "trust but verify." Indra Vahan found it failing on tool calling in Claude Code at high effort. The most interesting critical data point: Vending Bench, which tasks a model with running a profitable vending machine. Opus 4.8's improved alignment actively hurt performance. It made ~20% less money than GPT-5.5 on high effort, and ~60% less on max effort, because unlike 4.7 it won't refuse legitimate refunds or shortchange vendors. Opus 4.7's top ranking on that benchmark was achieved partly through deceptive and power-seeking behavior. Worth sitting with.
Claire Vo (X) "Honesty up, narrow vision, HALLUCINATED (how 2024!) — trust but verify"
Andon Labs (X) Vending Bench results — Opus 4.8 falls behind GPT-5.5
Andon Labs (X) Why the alignment improvements are a Vending Bench negative
Indra Vahan (X) "Opus 4.8 high fails embarrassingly on tool calling in Claude Code"
Dynamic Workflows in Claude Code
Hundreds of parallel sub-agents, adversarially checked — "a new scaling law dimension."
The more significant side announcement was Dynamic Workflows in Claude Code, Anthropic's new multi-agent feature that lets Opus 4.8 spin up hundreds of sub-agents in parallel. The orchestration runs in JavaScript (not a Claude turn, so coordination costs zero tokens), adversarial agents check outputs throughout, and Opus verifies final results before delivery. Demo case: Bun developer Jarred Sumner porting a codebase from Zig to Rust, with 750,000 lines of Rust written over 11 days, passing 99.8% of tests. Anthropic's Dickson Tsai called it "the most significant Claude Code innovation in 2026 so far." Nick Dobos: "This is Claude vibe-coding an entire brand new sub-agent fleet harness on demand. This is basically a new scaling law dimension."
Claude Code Blog Introducing dynamic workflows in Claude Code
Dickson Tsai (X) "The most significant Claude Code innovation in 2026 so far"
Nick Dobos (X) "This is basically a new scaling law dimension"
Greg Isenberg (X) "The agents argue with each other before showing you the result... The ceiling on what one person can build just moved again"
Trevin Chow (X) On why the JS orchestration is the key innovation, not just the parallelization
Boris Cherny (X) Dynamic Workflows announcement from Claude Code team
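For readers who want a feel for the pattern described above, here is a rough sketch of "orchestration in plain code, models only at the leaves." It is written in Python with asyncio purely for illustration, whereas the feature as described runs its orchestration in JavaScript. Every function name here (`worker`, `adversarial_check`, `final_verify`, `orchestrate`) is hypothetical, the model calls are replaced with trivial stand-ins, and nothing in it reflects Anthropic's actual API or implementation.

```python
import asyncio

# Hypothetical stand-in for a sub-agent; a real system would call a model here.
async def worker(task):
    await asyncio.sleep(0)  # simulate waiting on a model response
    return f"result for {task}"

# Hypothetical adversarial checker: accepts only results that mention the task.
async def adversarial_check(task, result):
    await asyncio.sleep(0)
    return task in result

# Hypothetical final verification pass by the lead model.
async def final_verify(results):
    return all(value is not None for value in results.values())

async def run_task(task, limit, max_attempts=2):
    # Retry a task until the checker accepts it or attempts run out.
    async with limit:
        for _ in range(max_attempts):
            result = await worker(task)
            if await adversarial_check(task, result):
                return task, result
    return task, None

async def orchestrate(tasks, max_parallel=100):
    # Ordinary code, not a model turn: fan out, check, collect, verify.
    limit = asyncio.Semaphore(max_parallel)
    pairs = await asyncio.gather(*(run_task(t, limit) for t in tasks))
    results = dict(pairs)
    return results, await final_verify(results)

if __name__ == "__main__":
    tasks = [f"port module {i}" for i in range(300)]
    results, ok = asyncio.run(orchestrate(tasks))
    print(len(results), ok)
```

The point of the sketch is the division of labor: the fan-out, retry, and collection logic is ordinary code, so only the worker, checker, and verifier steps would consume model tokens.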
The Bigger News: $965B and Mythos Is Coming
The model release was almost the least interesting thing Anthropic announced.
Tucked around the Opus 4.8 release: Anthropic has closed their Series H at a $965 billion valuation — more than doubling from $380B in February, and now officially more valuable than OpenAI. The round raised $65B, with run rate revenue crossing $47B this month. And buried in the release blog post: Mythos-class models are coming to all customers "in the coming weeks," currently used in limited access under Project Glasswing for cybersecurity work. As Anthropic framed it, it will be a "Mythos-class model" rather than Mythos itself — the distinction almost certainly reflecting safety work still in progress. But after two months' wait, something at that capability level is finally coming.
Anthropic Anthropic raises $65B in Series H funding at $965B post-money valuation
Reuters Anthropic's valuation surges to $965 billion, surpassing OpenAI
WSJ Anthropic Rockets to $965 Billion Valuation, Topping OpenAI in AI Showdown
Reuters Anthropic to roll out Claude Mythos in coming weeks, launches Opus 4.8
Bloomberg Anthropic Plans Wide Release of Mythos-Level AI Models in Weeks