Anthropic’s Claude Opus 4.7 Is Here — And It’s Not Even Their Strongest Model

15 May 2026
Anthropic has released Claude Opus 4.7, a major upgrade focused on coding, agents, and vision — while openly admitting it’s still holding back an even more powerful model called Mythos. Here’s what’s new, how it stacks up against GPT 5.4 and Gemini 3.1 Pro, and why this launch feels like a turning point for real-world AI workflows.

Anthropic just rolled out Claude Opus 4.7, and it’s a much bigger deal than a routine version bump. It’s faster, more reliable for coding and agents, sees more detail in images, and comes with new cybersecurity safeguards. But the most interesting part? Anthropic is openly saying this is not even its most powerful model — that title belongs to a still-limited system called Mythos.

In other words, Opus 4.7 is both a flagship upgrade and a preview of the safety infrastructure Anthropic thinks it needs for the next wave of models.

What Actually Launched: Claude Opus 4.7

Claude Opus 4.7 is now generally available across all Claude products and major cloud platforms. Pricing stays the same as Opus 4.6: $5 per million input tokens and $25 per million output tokens. Developers can access it using the model ID claude-opus-4.7.
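
For reference, here is a minimal sketch of calling the new model through the Anthropic Python SDK. The model ID is the one quoted above; the prompt and token limit are illustrative placeholders, not recommendations.

```python
# Minimal sketch: calling Opus 4.7 via the Anthropic Python SDK.
# The model ID comes from the launch notes above; the prompt and
# max_tokens values are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4.7",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Summarize the failing tests in this repo."}
    ],
)
print(response.content[0].text)
```

At the unchanged pricing, a call that consumes 10,000 input tokens and 1,000 output tokens costs about $0.05 + $0.025 = $0.075.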

Anthropic is positioning Opus 4.7 as the new default for serious work, especially software engineering and agent-style workflows. It’s meant to fix some of the biggest complaints about Opus 4.6: drifting logic in long coding sessions, messy tool use, and losing the thread on multi-step tasks.

According to Anthropic, users are now comfortable handing off coding work that used to require close supervision. The model is designed to:

• Plan more carefully before acting
• Follow instructions more precisely
• Check and verify its own work before reporting back

Coding, Agents, and Benchmarks: Where Opus 4.7 Pulls Ahead

Opus 4.7 isn’t just a small bump over 4.6 — the coding benchmarks show a real jump in capability, especially for agentic and tool-heavy workflows.

Big gains over Opus 4.6

Key benchmark improvements include:

• SWE-bench Pro: 53.4 → 64.3
• SWE-bench Verified: 80.8 → 87.6
• Cursorbench: 58 → 70
• MCP Atlas (tool use at scale): 75.8 → 77.3
• Rakuten SWE-bench (production tasks): resolves 3× more production tasks than Opus 4.6

That last one matters a lot. Production tasks are where models either save you time or quietly create extra work. A 3× jump there suggests Opus 4.7 is much more capable of actually shipping working changes, not just writing plausible code.

How it stacks up against GPT 5.4 and Gemini 3.1 Pro

Anthropic is clearly using this release to plant a flag in the coding and agents race. On key benchmarks:

• SWE-bench Pro: Claude Opus 4.7 at 64.3 vs GPT 5.4 at 57.7 and Gemini 3.1 Pro at 54.2
• MCP Atlas: Opus 4.7 at 77.3 vs GPT 5.4 at 68.1 and Gemini 3.1 Pro at 73.9

On pure reasoning, though, the frontier models are basically tied. On GPQA Diamond:

• Opus 4.7: 94.2
• GPT 5.4 Pro: 94.4
• Gemini 3.1 Pro: 94.3

That tells you where the competition is shifting. Raw reasoning is no longer the main differentiator; real-world execution (coding, tools, long workflows) is where models are trying to pull ahead. If you follow the broader model race, this aligns with the trends covered in recent multi-vendor AI news roundups.

How Opus 4.7 feels in practice

Early feedback suggests Opus 4.7 doesn’t feel like a flashy “new brain” so much as a more dependable coworker. The big differences users are noticing:

• Fewer wasted steps
• Better upfront planning before it starts coding
• Cleaner handling of long, multi-step sessions
• Less tendency to get stuck or spin in circles on tricky tasks
• Smoother recovery when something goes wrong

For people using AI daily, that kind of stability matters more than one-off, impressive answers. Trust comes from whether the model stays sharp over hours and days of work.

New Effort Levels, Vision Upgrade, and Design Automation

X-High: a new default for serious coding

Anthropic introduced a new effort level called X-High, which sits between High and Max. In practice:

• High: faster, but sometimes too shallow on hard problems
• Max: stronger, but slower and more expensive
• X-High: intended sweet spot for deep but still practical work

Claude Code now defaults to X-High across all plans, and Anthropic recommends starting with High or X-High for coding and agentic workflows. This is another sign that the company is optimizing for long, complex tasks rather than quick chat responses.
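
How the effort level is actually surfaced in the API isn't spelled out here, so the request shape below is an assumption. As a sketch of the idea only:

```python
# Sketch of selecting an effort level per request. The "effort" field is
# an assumed, undocumented request shape (hence extra_body), not a
# confirmed parameter; check the current API reference before relying on it.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4.7",
    max_tokens=4096,
    extra_body={"effort": "x-high"},  # hypothetical field and value
    messages=[{"role": "user", "content": "Refactor the payment module."}],
)
```

In Claude Code itself, no configuration should be needed: X-High is already the default there across all plans.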

Much sharper vision and document understanding

Opus 4.7 also gets a significant vision upgrade. It can now process images up to 2,576 pixels on the long edge (about 3.75 megapixels), which Anthropic says is more than 3× the detail of earlier Claude models.

This matters for things like:

• Dense UI screenshots and dashboards
• Technical diagrams and charts
• Scanned pages and PDFs
• Images with tiny labels or small text

With the higher resolution, the model can stop guessing and actually read what’s on the screen. Anthropic reports a 21% drop in document reasoning errors compared with Opus 4.6.
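
If you're feeding in large screenshots or scans, one practical preprocessing step is to downscale so the long edge fits the stated ceiling. A minimal sketch using Pillow, with the 2,576 px figure taken from Anthropic's number above:

```python
# Downscale an image so its long edge fits the 2,576 px ceiling reported
# for Opus 4.7. Pillow does the resizing; the limit is the article's
# figure, not a documented API constant.
from PIL import Image

MAX_LONG_EDGE = 2576

def fit_long_edge(src: str, dst: str) -> None:
    img = Image.open(src)
    long_edge = max(img.size)
    if long_edge > MAX_LONG_EDGE:
        scale = MAX_LONG_EDGE / long_edge
        img = img.resize(
            (round(img.width * scale), round(img.height * scale)),
            Image.Resampling.LANCZOS,  # high-quality downsampling filter
        )
    img.save(dst)

fit_long_edge("dashboard.png", "dashboard_fit.png")
```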

Toward design and digital creation

There’s also a bigger story behind this vision upgrade. Reports around Opus 4.7 suggested Anthropic is preparing tools that can generate websites, landing pages, presentations, and prototypes directly from natural language prompts.

Even the hint of this was enough to rattle parts of the design software market. Companies like Figma, Adobe, Wix, and GoDaddy reportedly came under investor pressure as markets digested the idea that users might soon be able to describe what they want and get polished visual output back quickly.

If Opus 4.7 and its successors can reliably turn text into working designs and interfaces, that cuts right into the middle of traditional design workflows — not just at the brainstorming stage, but at the production layer.

Instruction Following, Memory, and Developer Controls

Stricter instruction following (and why you may need to retune prompts)

Anthropic says Opus 4.7 is substantially better at following instructions. The catch: it’s also more literal.

Older prompts that relied on the model “filling in the gaps” may behave differently now. Where earlier versions might have loosely interpreted or skipped parts of a request, Opus 4.7 tends to follow what you wrote more exactly.

That’s great if your prompts are well-structured. But if your workflows depended on soft assumptions or vague instructions, you may see odd behavior. Anthropic is explicitly recommending teams retune prompts and harnesses when moving from 4.6 to 4.7.

Better long-term memory and project continuity

Memory is another area that’s getting stronger. Opus 4.7 is better at:

• Using file-system-based memory
• Remembering important notes across long, multi-session work
• Keeping track of project files and context over time

That should make setups like project workspaces, codebases, and tools such as CLAUDE.md files more reliable, with less need to restate everything from scratch every session. For agent-style workflows, this kind of continuity is crucial: weak memory is one of the biggest reasons agents still break down in real-world use.
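
As a rough illustration of the pattern (not Anthropic's implementation), a file-system memory loop in a custom harness can be as simple as reading a notes file into each session and appending to it afterward. The file name and format here are illustrative choices:

```python
# Illustrative file-system memory loop for a long-running agent harness.
# The notes path and Markdown format are arbitrary choices for this
# sketch, not a documented Anthropic convention.
from pathlib import Path

NOTES = Path("project_memory/notes.md")

def load_memory() -> str:
    """Read persisted notes so a new session starts with prior context."""
    return NOTES.read_text() if NOTES.exists() else ""

def save_memory(new_notes: str) -> None:
    """Append whatever the model flagged as worth keeping this session."""
    NOTES.parent.mkdir(parents=True, exist_ok=True)
    with NOTES.open("a") as f:
        f.write(new_notes.rstrip() + "\n")

# Typical loop: prepend load_memory() to the system prompt, run the
# session, then call save_memory() with the model's end-of-session notes.
```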

New developer features: task budgets, tokenizer changes, and reviews

Alongside the model, Anthropic is rolling out several developer-facing updates:

Task budgets (API public beta)
• Let developers set limits on token spending for longer runs
• Particularly important because Opus 4.7 uses an updated tokenizer

With the new tokenizer, the same input can map to roughly 1.0–1.35× as many tokens, depending on content. On top of that, the model tends to “think” more at higher effort levels, especially late in agentic sessions, which can increase output tokens.

Anthropic says that in internal coding evals, overall token efficiency is still favorable because Opus 4.7 reaches good answers in fewer turns. But if you’re running real API workloads, this is something you’ll want to monitor closely.
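
As a back-of-the-envelope check on what the multiplier alone does to input cost, using the $5 per million input price and the 1.0–1.35× range from above (the monthly workload figure is made up):

```python
# Back-of-the-envelope input-cost estimate under the new tokenizer.
# The 1.0-1.35x multiplier and $5/M input price come from the article;
# the 200M-token monthly workload is a made-up example.
INPUT_PRICE_PER_M = 5.00           # USD per million input tokens
OLD_MONTHLY_INPUT_TOKENS = 200e6   # hypothetical workload measured on Opus 4.6

for multiplier in (1.0, 1.35):     # best and worst case of the stated range
    tokens = OLD_MONTHLY_INPUT_TOKENS * multiplier
    cost = tokens / 1e6 * INPUT_PRICE_PER_M
    print(f"x{multiplier:.2f}: {tokens / 1e6:.0f}M tokens -> ${cost:,.0f}/month")
```

In this example, a workload that cost about $1,000 a month of input on 4.6 could land anywhere up to roughly $1,350 on 4.7, before accounting for the fewer-turns efficiency Anthropic cites.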

Claude Code /ultra review
• A new command that starts a dedicated review session
• Reads through changes, flags bugs and design issues like a careful human reviewer
• Pro and Max users get three free ultra reviews to try it

Auto Mode for Max users
• Auto Mode is now extended to Max plans
• Gives Claude more freedom to make permission decisions and push longer tasks forward with fewer interruptions

All of this fits the direction of the launch: Anthropic isn’t just making Claude smarter, it’s trying to make it more autonomous and useful inside longer, more complex workflows. If you’re exploring agent stacks or multi-model setups, this release sits nicely alongside the kind of orchestration patterns discussed in pieces like top AI models for agents and orchestrators.

Beyond Coding: Knowledge Work and Finance

Anthropic is also pitching Opus 4.7 as a stronger model for broader knowledge work, not just engineering.

In internal testing, the company says Opus 4.7 performs better than 4.6 as a finance analyst, producing:

• More rigorous analysis and models
• More professional presentations
• Better integration across multi-step tasks

Anthropic also claims Opus 4.7 is state of the art on:

• A finance agent evaluation
• GDP-val AA, a third-party benchmark focused on economically valuable knowledge work in areas like finance and legal

The message is clear: Opus 4.7 isn’t just a coding model. It’s meant to feel like a serious professional assistant across the board.

The Secret Super Model: Mythos and Anthropic’s Safety Strategy

The most unusual part of this launch is how open Anthropic is about what Opus 4.7 is not. It’s not the company’s strongest model.

That title belongs to Claude Mythos preview, which Anthropic says is more capable — especially in cybersecurity — but too risky for broad release right now.

Project Glasswing and cyber safeguards

This ties directly into Anthropic’s Project Glasswing. The plan, as the company describes it, is to:

• Keep Mythos preview limited for now
• Test new cybersecurity safeguards on less capable models first
• Use Opus 4.7 as the first live deployment of those safeguards

Anthropic says Opus 4.7’s cyber abilities are below Mythos level, and that it even experimented with reducing those capabilities during training. At the same time, Opus 4.7 now includes safeguards that automatically detect and block requests related to prohibited or high-risk cybersecurity use.

So Opus 4.7 is both a capability upgrade and a real-world testbed for the safety systems Anthropic believes it will need before releasing Mythos-class models more broadly.

Why Mythos is being held back

The concern around Mythos is that it may be too good at finding vulnerabilities — good enough that long-buried software weaknesses could suddenly become exploitable at scale.

Once models cross that line, the conversation changes. It’s no longer just about productivity and automation. It becomes an infrastructure and security issue.

Anthropic is trying to walk a narrow line:

• Powerful enough for legitimate defenders and security professionals
• Not so open and capable that it becomes a large-scale abuse engine

To support legitimate use, the company is launching a cyber verification program. Security professionals doing vulnerability research, penetration testing, and red teaming can apply for access in a more controlled way.

Alignment, honesty, and trade-offs

On safety and alignment, Anthropic says Opus 4.7 has a similar overall profile to Opus 4.6, with low rates of:

• Deception
• Sycophancy (telling users what they want to hear)
• Cooperation with misuse

The company reports improvements in honesty and resistance to malicious prompt injection, though Opus 4.7 is modestly weaker on a few measures — for example, it can sometimes give overly detailed harm-reduction advice around controlled substances.

Overall, Anthropic describes Opus 4.7 as largely well-aligned and trustworthy, though not perfect. Interestingly, it also says Mythos preview remains the best-aligned model it has trained according to its own evaluations, which makes the decision to hold it back even more striking.

Put together, Opus 4.7 feels like a turning point: Anthropic is tightening its grip on the coding and agents race, pushing harder into vision and design automation, and at the same time signaling that a more powerful — and potentially more dangerous — class of models is already here, just not fully unleashed yet.
