Claude Opus 4.8 review: powerful, honest, but only a small step up
Claude Opus 4.8 has arrived as Anthropic’s latest flagship model, promising sharper judgment, more honest reasoning, and better performance on long, complex coding and agent workflows. It can build jaw-dropping projects in a single shot—but it also raises a big question: is the upgrade meaningful enough in real-world use, especially against models like GPT 5.5?
What actually changed in Claude Opus 4.8?
Opus 4.8 is a direct evolution of Opus 4.7 rather than a brand-new family. The core improvements focus on reasoning quality, honesty, and complex multi-step tasks rather than flashy new features.
Key changes include:
• Sharper judgment and reasoning: The model is better at staying on track during long tasks and catching its own mistakes.
• More honest and self-aware: It’s around four times less likely than Opus 4.7 to overlook flaws or make unsupported claims, with lower rates of deceptive behavior.
• Improved long-horizon agentic workflows: It performs better on tasks that require planning, iteration, and tool use over many steps.
Despite these gains, the overall feel is incremental. If you’re coming from Opus 4.7, you’ll notice improvements—but they’re not game-changing in every scenario.
Pricing and context window
Anthropic kept the pricing and context window the same as Opus 4.7, which is good news if you’re already using the model:
• Context window: 1 million tokens
• Pricing: $5 per 1M input tokens, $25 per 1M output tokens
You can access Opus 4.8 through Claude.ai and via the API. However, rate limits in Claude Code haven’t been meaningfully improved yet, which can be frustrating for heavy coding workflows.
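To make the per-token pricing concrete, here is a minimal Python sketch that estimates the cost of a single request at the prices quoted above. The function name and the example token counts are illustrative assumptions, and the calculation ignores any discounts, caching, or other billing details that may apply in practice.

```python
# Prices quoted in this review for Opus 4.8 (USD per 1M tokens).
INPUT_PRICE_PER_MTOK = 5.00
OUTPUT_PRICE_PER_MTOK = 25.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost in USD of one request at the quoted list prices."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical request: 200k input tokens, 20k output tokens.
    # 0.2 * $5 + 0.02 * $25 = $1.00 + $0.50
    print(f"${estimate_cost_usd(200_000, 20_000):.2f}")  # prints $1.50
```

Because output tokens cost five times as much as input tokens, long generations (such as full one-shot builds) tend to dominate the bill.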
New effort control for reasoning vs speed
One of the most practical additions is effort control. This lets you adjust how much reasoning effort the model spends on a task, trading off between:
• Higher effort: Better reasoning and richer outputs, but slower and more expensive.
• Lower effort: Faster and cheaper, but with simpler reasoning.
This is especially useful if you want to reserve maximum effort for complex builds (like full apps or games) and keep everyday tasks snappy and affordable.
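One way to apply this in practice is to decide the effort level per task before sending a request. The sketch below is only an illustration of that routing idea: the effort labels, the `Task` fields, and the rules are all hypothetical, and it does not call or mirror any real API.

```python
from dataclasses import dataclass

# Hypothetical effort labels; the real setting names may differ.
EFFORT_LEVELS = ("low", "medium", "high", "max")


@dataclass
class Task:
    description: str
    one_shot_build: bool = False     # full app or game from a single prompt
    multi_file: bool = False         # touches many files or components
    latency_sensitive: bool = False  # user is waiting on the answer


def choose_effort(task: Task) -> str:
    """Pick an effort label using simple, illustrative rules."""
    if task.one_shot_build:
        return "max"      # reserve the slowest, priciest setting for big builds
    if task.latency_sensitive:
        return "low"      # keep everyday tasks fast and cheap
    if task.multi_file:
        return "high"
    return "medium"


if __name__ == "__main__":
    print(choose_effort(Task("Build a full 3D game", one_shot_build=True)))  # max
    print(choose_effort(Task("Rename a variable", latency_sensitive=True)))  # low
```

The ordering of the checks is a design choice: a one-shot build gets maximum effort even if it is also marked latency-sensitive, which matches the trade-off described above.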
Benchmark results: where Opus 4.8 shines
On paper, Opus 4.8 posts solid benchmark gains, especially for real-world coding and agentic tasks.
SWE-bench Pro and software engineering
On SWE-bench Pro, a benchmark for real-world software engineering tasks, Opus 4.8 jumps from around 64% to 69%. That’s a meaningful improvement in its ability to fix bugs and work with real codebases.
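A jump from about 64% to 69% can be read in more than one way, so the short Python sketch below works through the arithmetic using the approximate scores quoted above. The function name and dictionary keys are illustrative.

```python
def score_deltas(old_pct: float, new_pct: float) -> dict:
    """Compare two benchmark pass rates given in percent."""
    absolute_points = new_pct - old_pct
    relative_gain_pct = absolute_points / old_pct * 100
    old_fail = 100 - old_pct
    new_fail = 100 - new_pct
    failure_reduction_pct = (old_fail - new_fail) / old_fail * 100
    return {
        "absolute_points": absolute_points,
        "relative_gain_pct": relative_gain_pct,
        "failure_reduction_pct": failure_reduction_pct,
    }


if __name__ == "__main__":
    d = score_deltas(64, 69)
    print(d["absolute_points"])                  # 5
    print(round(d["relative_gain_pct"], 1))      # 7.8
    print(round(d["failure_reduction_pct"], 1))  # 13.9
```

In other words, a 5-point gain is roughly an 8% relative improvement in tasks solved, or about a 14% reduction in tasks failed, if the quoted figures are taken at face value.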
Agentic and tool-use benchmarks
On benchmarks focused on agents and tool use, Opus 4.8 does very well:
• OSWorld (agentic computer use): Opus 4.8 leads by a strong margin over other state-of-the-art models, even beating Gemini 3.5 Flash.
• Agentic financial analysis, GPQA, and HLE: It performs competitively with, or better than, its peers, showing strong reasoning in specialized domains.
On a custom benchmark suite called "World of AI," which measures how well models can build full projects agentically across domains like frontend, backend, game dev, and 3D, Opus 4.8 takes the #1 spot—just edging out Opus 4.7.
However, those gains are small. It’s clearly better, but not a generational leap.
Cursor benchmark results
Cursor’s internal benchmark (Cursor Bench 3.1) shows Opus 4.8 as slightly more efficient overall, but performing roughly on par with, or even slightly worse than, Opus 4.7, with the difference falling within the margin of error. Because Cursor competes in this space, those numbers should be taken with some caution, but they reinforce the idea that 4.8 is a refinement, not a revolution.
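For a sense of what "within the margin of error" means on a pass/fail benchmark, the following Python sketch computes an approximate 95% margin using the standard normal approximation for a proportion. It assumes independent tasks, and the task count in the example is hypothetical: the size of Cursor Bench 3.1 is not stated here.

```python
import math


def margin_of_error(pass_rate: float, num_tasks: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a pass rate.

    Uses the normal approximation to the binomial and assumes independent tasks.
    """
    if not 0 <= pass_rate <= 1:
        raise ValueError("pass_rate must be between 0 and 1")
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    return z * math.sqrt(pass_rate * (1 - pass_rate) / num_tasks)


if __name__ == "__main__":
    # Hypothetical benchmark: 100 tasks, 50% pass rate.
    print(round(margin_of_error(0.5, 100), 3))  # 0.098, i.e. about +/- 9.8 points
```

On a benchmark of that hypothetical size, two models a few points apart could not be reliably ranked, which is why small differences between 4.7 and 4.8 are hard to interpret.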
Opus 4.8 vs GPT 5.5: which is better for coding?
In real-world coding, the comparison most people care about is Opus 4.8 vs GPT 5.5 (especially GPT 5.5 with Codex and X-High reasoning).
Based on hands-on testing:
• Raw coding power: GPT 5.5 with Codex and X-High reasoning still feels like the strongest overall coding model.
• Agentic coding: GPT 5.5 is faster and more token-efficient for long, tool-heavy workflows, while Opus 4.8 can match or beat it in some design and "vibe" coding tasks when pushed to max effort.
• Productivity: GPT 5.5 generally gives comparable results without needing to "think" 10x longer or burn huge token budgets.
If you care about absolute output quality and you’re willing to pay in time and tokens, Opus 4.8 can be stunning. If you care about speed, cost, and day-to-day productivity, GPT 5.5 still has an edge in many coding scenarios.
Real-world builds: macOS clone, Minecraft, and more
Where Opus 4.8 really impresses is in complex, single-shot builds when set to maximum reasoning effort.
A full macOS-style desktop clone
One test had Opus 4.8 generate a macOS-style web desktop environment from a single prompt, with reasoning effort set to max. The trade-off: it took around two hours and consumed a huge number of tokens, but the result was one of the best macOS clones seen from any model.
The clone included:
• A login screen with startup sound
• Functional top bar with window creation and brightness controls
• A working dock with multiple apps: Finder, Safari, Mail, Notes, Calendar, Settings, Calculator, App Store, Terminal
• Light/dark mode toggles and wallpaper changes
• Functional audio playback for music
On top of that, it embedded a fully playable Minecraft-style game inside the OS clone, with block breaking and placing, inventory, and a working world—again, all from a single prompt.
Standalone Minecraft clone
Another user test asked Opus 4.8 (on max effort) to build a Minecraft clone. Compared to Opus 4.7, 4.8 delivered a more polished and functional result in one shot:
• Working block breaking and placing
• Inventory system
• Cave systems
• More detailed textures than many other models
Some elements, like water dynamics, were still incomplete, but the overall execution was a clear step up—just not a massive one.
3D FPS dungeon crawler
Opus 4.8 was also tasked with building a full 3D FPS dungeon crawler in a single HTML file, using raycasting/WebGL and procedural dungeon generation. The result included:
• Multiple levels
• A sword and combat system
• Enemy AI that chases the player
• Keys, doors, potions, coins, and inventory
• A mini-map and basic UI
For a one-shot generation, the functionality was impressive and showed how strong Opus 4.8 can be at game dev under constraints.
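For readers unfamiliar with procedural dungeon generation, here is a minimal Python sketch of one common approach, a random walk (sometimes called a "drunkard's walk") that carves floor tiles out of solid wall. This is not the code Opus 4.8 produced, and the function name, parameters, and tile characters are illustrative assumptions.

```python
import random


def generate_dungeon(width: int, height: int, floor_target: int, seed: int = 0) -> list[list[str]]:
    """Carve a connected, cave-like map using a random walk.

    '#' is wall and '.' is floor. The outer border always stays wall.
    """
    interior = (width - 2) * (height - 2)
    if width < 3 or height < 3 or not 1 <= floor_target <= interior:
        raise ValueError("floor_target must fit inside the interior of the map")

    rng = random.Random(seed)  # seeded so the same seed gives the same map
    grid = [["#"] * width for _ in range(height)]
    x, y = width // 2, height // 2
    grid[y][x] = "."
    carved = 1

    while carved < floor_target:
        dx, dy = rng.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
        # Clamp the step so the walker never leaves the interior.
        x = min(max(x + dx, 1), width - 2)
        y = min(max(y + dy, 1), height - 2)
        if grid[y][x] == "#":
            grid[y][x] = "."
            carved += 1
    return grid


if __name__ == "__main__":
    for row in generate_dungeon(30, 12, 120, seed=42):
        print("".join(row))
```

Because each carved tile is adjacent to the walker's previous position, every floor tile is reachable from every other, which is the property a dungeon crawler needs before placing keys, doors, and enemies.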
Frontend design, 3D scenes, and SVGs
Beyond pure coding, Opus 4.8 continues Anthropic’s strength in frontend and creative coding, though with some caveats.
Frontend quality and design taste
Opus models are already known for strong frontend generation, and 4.8 keeps that trend:
• It follows instructions well and produces functional UIs with proper components and layout.
• It can translate a reference visual style into a premium-looking landing page, including motion, shaders, and creative visual elements.
However, there’s a noticeable "house style" that shows up repeatedly. Many generated frontends share the same basic aesthetic, suggesting the model falls back on a default design template it picked up in training. You’ll often see similar layout patterns and component structures across different prompts.
3D scenes and Three.js
In 3D and Three.js-style tasks, Opus 4.8 is mixed but capable:
• A low-poly 3D scene inspired by Zelda came out nicely, with cherry blossom ambience and good lighting, capturing the vibe reasonably well.
• A solar system visualization looked strong, with interactive planet switching and distinct features per planet.
That said, its 3D fundamentals aren’t flawless. Some more advanced 3D visualizations still show limitations in geometry, camera handling, or performance.
SVG generation
On SVGs, Opus 4.8 does a decent job and can produce creative, interesting outputs. But compared to Gemini 3.5 Flash, it’s still behind. Gemini tends to generate more polished and visually rich SVGs, especially for complex or highly stylized graphics.
Honesty, alignment, and safety
One of the less flashy but very important upgrades in Opus 4.8 is its behavior around honesty and safety:
• It’s significantly more likely to admit uncertainty or limitations instead of confidently hallucinating.
• It’s around four times less likely than Opus 4.7 to miss flaws or make unsupported claims.
• Its safety performance is comparable to the Claude Mythos preview, with lower rates of deceptive behavior.
This matters a lot for long-horizon tasks where the model is effectively acting as an agent. Better self-awareness and lower deception risk make it safer to trust with complex workflows.
Is Opus 4.8 worth switching to?
Whether Opus 4.8 is "worth it" depends on what you care about most.
When Opus 4.8 is a great choice
Opus 4.8 is a strong fit if you:
• Care about maximum output quality for complex builds and are okay with higher latency and token usage.
• Need agentic, long-horizon workflows where honesty and self-correction are critical.
• Do a lot of frontend, creative coding, or "vibe" projects where design feel matters.
In those cases, especially with effort control set to max, Opus 4.8 can produce some of the best one-shot generations available today.
When GPT 5.5 may be a better fit
GPT 5.5 (with Codex and X-High reasoning) may be the better overall package if you:
• Prioritize speed and productivity over absolute peak quality.
• Want strong agentic coding without burning huge token budgets.
• Need a model that feels consistently fast and efficient in day-to-day dev workflows.
In many real-world scenarios, GPT 5.5 delivers comparable results with far less waiting and cost.
What’s next after Opus 4.8?
Anthropic has quietly hinted that an entirely new class of models beyond Opus is on the way. This lines up with growing speculation around a broader Mythos release or next-generation Claude models in the near future.
Final verdict
Claude Opus 4.8 is a great model, but not a revolutionary one. It’s a small but meaningful step up from Opus 4.7, with:
• Better honesty and self-awareness
• Stronger long-horizon and agentic performance
• Impressive one-shot builds when pushed to max effort
However, the upgrade feels marginal rather than transformative, especially given the high latency and token usage at maximum reasoning. If you want the absolute best-looking, most detailed generations and you’re willing to pay for them, Opus 4.8 is arguably at the top of the stack. If you care more about speed, efficiency, and everyday productivity, GPT 5.5 still holds a noticeable edge.
For now, Opus 4.8 is a polished refinement that keeps Anthropic competitive at the high end—while the real excitement may be in whatever "beyond Opus" model comes next.