OpenAI’s GPT‑5.4 Pro might now be the smartest AI model in the world

04 Jun 2026 07:07

OpenAI’s new GPT‑5.4 Pro model is setting records on cutting‑edge reasoning, math, web browsing, and real‑world work benchmarks—but it comes with a high price tag and serious cybersecurity implications. Here’s what’s actually new, how it compares to Gemini and Claude, and what it could mean for knowledge workers and AI safety.

OpenAI’s GPT‑5.4 Pro is here, and it’s not just another small upgrade. Across a new generation of harder benchmarks—math, web browsing, computer use, and real professional work—it’s starting to look like the smartest general-purpose AI model available right now.

At the same time, GPT‑5.4 Pro is expensive, and its rapidly growing capabilities are forcing serious conversations about cybersecurity risks and the future of white-collar jobs.

How GPT‑5.4 Pro stacks up against Gemini and Claude

Traditional benchmarks like GPQA are now mostly saturated: top models all score similarly, and the differences don’t tell you much about how they’ll perform in real work. GPT‑5.4 Pro stands out on a newer wave of harder, more realistic tests.

Beating Google at its own game: web browsing

On the BrowseComp benchmark, which tests whether a model can pull in and use fresh, real-time information from across the web, GPT‑5.4 Pro scores around 89.3%. That’s slightly ahead of Google’s Gemini 3.1 Pro, even though you’d expect Google to dominate anything tied to search and web data.

This doesn’t mean Gemini is suddenly obsolete, but it does show that OpenAI has pushed browsing and retrieval further than many expected, and sooner than many rivals probably hoped.

Price: powerful, but not cheap

All this performance comes at a cost. GPT‑5.4 Pro is one of the most expensive mainstream models on the market. Pricing is roughly:

• Around $30 per million input tokens
• Around $180 per million output tokens

That’s steep for heavy users, especially if you’re running long reasoning chains or agentic workflows. The non-reasoning GPT‑5.4 variants look more reasonable and, in some cases, may even be more cost-effective than Claude Opus 4.6 at similar performance levels.
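To make the per-token prices concrete, here is a minimal sketch that estimates the bill for a single workload. The rates are the approximate figures quoted above; the function name and the example token counts are illustrative assumptions, not part of any OpenAI API.

```python
# Approximate GPT-5.4 Pro prices quoted in this article (USD per million tokens).
INPUT_USD_PER_MILLION = 30.0
OUTPUT_USD_PER_MILLION = 180.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a workload from its input and output token counts."""
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    return input_cost + output_cost


# Hypothetical agentic run: 2M tokens read, 500k tokens generated.
print(f"${estimate_cost_usd(2_000_000, 500_000):.2f}")  # prints $150.00
```

In this hypothetical run the output side accounts for $90 of the $150 total, even though it is a quarter of the input volume, which is why long reasoning chains get expensive quickly.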

For many teams, the winner over the next year won’t just be “the smartest model,” but the best price‑to‑performance option, especially for workloads that burn through millions or billions of tokens.

FrontierMath: research‑level problems and a “move 37” moment

One of the most impressive results for GPT‑5.4 Pro is on FrontierMath, a benchmark built from research-level math problems designed by professional mathematicians. These aren’t textbook exercises; they’re novel, hard questions that were kept private specifically to prevent models from memorizing them.

When FrontierMath launched, top models were scoring around 2%. GPT‑5.4 Pro now dominates the benchmark, including on Tier 4 (the hardest problems), where it significantly outperforms Claude Opus 4.6. OpenAI has consistently led on this benchmark across multiple model generations.

A 20‑year unsolved problem cracked

Perhaps the most striking anecdote: a mathematician reported that GPT‑5.4 solved a problem he had been stuck on for 20 years. This wasn’t part of any benchmark set and wasn’t available on the internet, so the model couldn’t have seen the solution during training.

He compared the moment to AlphaGo’s famous “move 37” in 2016—the first time many experts felt AI had done something genuinely superhuman in a complex domain. In this case, GPT‑5.4 produced a solution that the researcher described as “very nice, clean, and almost human.”

On the hardest tier of FrontierMath, GPT‑5.4 Pro solved around 38% of problems over 10 runs, and one of those runs included this previously unsolved question. That’s not just a small numerical bump; it looks like a qualitative shift in what AI can do in advanced mathematics.

ARC-AGI-2: pushing toward human‑level reasoning

Another key benchmark is ARC-AGI-2, designed to test abstract reasoning and fluid intelligence rather than memorized knowledge. It’s widely regarded as one of the toughest and most respected reasoning benchmarks.

Human baseline performance is around 85%. GPT‑5.4 Pro (high reasoning) is now in the 83–84% range, just shy of that human baseline, though each task can cost $30–$50 in compute. It’s also neck‑and‑neck with Google’s large Gemini Deep Think models, which were designed specifically for heavy reasoning and math.

ARC-AGI-2 is expensive to run at scale, and some other models may be more cost‑efficient. But in terms of raw reasoning power, GPT‑5.4 Pro is clearly at the very top of the field.
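One rough way to compare cost‑efficiency is to divide the cost per attempt by the success rate, which gives the expected spend per solved task. The sketch below applies that to the per‑task cost and accuracy quoted above. It assumes a single attempt per task and uses the midpoint of the quoted accuracy range; the function name is illustrative.

```python
def cost_per_solved_task(cost_per_attempt_usd: float, success_rate: float) -> float:
    """Expected spend per solved task, assuming one attempt per task."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt_usd / success_rate


# Figures quoted in the article: $30-$50 per task, 83-84% accuracy.
# 0.835 is the midpoint of that accuracy range (an assumption for illustration).
for cost in (30.0, 50.0):
    solved_cost = cost_per_solved_task(cost, 0.835)
    print(f"${cost:.0f} per attempt -> ${solved_cost:.2f} per solved task")
```

A cheaper model with a lower score can still come out ahead on this metric, which is the sense in which other models may be more cost‑efficient.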

Real work benchmarks: slide decks, models, and legal memos

Traditional academic benchmarks don’t always tell you how a model will perform on the kind of work people actually do in offices. That’s where new “professional services” benchmarks come in.

On the APEX-Agents benchmark, built to measure performance on real professional services work, GPT‑5.4 Pro is now at the top of the leaderboard. This benchmark focuses on long-horizon deliverables like:

• Slide decks
• Financial models
• Legal analysis and memos
• Market and business analysis

The tasks were designed by real investment bankers, consultants, and lawyers across 33 simulated work worlds, with 480 tasks in total. Models have to complete tasks like a junior employee would: no asking for help, no back‑and‑forth clarifications, and no internet access.

From 24% to 52% in weeks

When the benchmark launched in January 2026, the best model scored about 24%. Within roughly 6–8 weeks, GPT‑5.4 Pro pushed that to 52% on the hardest tasks—more than doubling performance.

Yes, 52% still means the model fails almost half of all tasks. And the setup is stricter than real life: no clarifying questions, no external tools, no messy context. In a real workflow with iteration and human oversight, performance would likely be higher.

But the speed of improvement is the story. Jumping from 24% to 52% on realistic, expert‑designed work tasks in a matter of weeks is exactly the kind of curve people have been warning about when they talk about rapid AI capability gains and job disruption for junior white‑collar roles.
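The jump is easy to misread, because percentage points and relative improvement are different things. A minimal sketch of the arithmetic, using the 24% and 52% scores and the 6–8 week window quoted above (the function name is illustrative):

```python
def improvement(before_pct: float, after_pct: float, weeks: float) -> tuple[float, float, float]:
    """Return (gain in percentage points, relative multiple, points gained per week)."""
    points = after_pct - before_pct
    multiple = after_pct / before_pct
    return points, multiple, points / weeks


# The article gives a 6-8 week window, so compute both ends of that range.
for weeks in (6, 8):
    points, multiple, per_week = improvement(24.0, 52.0, weeks)
    print(f"{weeks} weeks: +{points:.0f} points, {multiple:.2f}x, {per_week:.1f} points/week")
```

That works out to a 28‑point gain, or about 2.17 times the launch score.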

OpenAI’s GDPval: beating human professionals on cost and speed

OpenAI also runs its own internal benchmark, called GDPval, to measure how close its models are to real knowledge workers. It covers 44 occupations across nine major industries that make up most of US GDP, including:

• Writing sales presentations
• Building spreadsheets and financial models
• Scheduling and operations tasks
• Legal and compliance work
• Financial and market analysis

On this benchmark, GPT‑5.4 matches or beats a human professional about 83% of the time, and GPT‑5.4 Pro does so about 82% of the time. Every model on the chart clears the human baseline overall.

The more striking part: these models complete tasks roughly 100× faster and 100× cheaper than human experts, at least under the benchmark’s controlled conditions.

There are important caveats. Each task is well-defined and one‑shot: the model gets a single attempt, with a clear prompt. Real jobs involve messy context, back‑and‑forth, politics, and long‑term relationships. Still, the message is clear: we’re no longer talking about just passing high‑school exams. These systems are now competitive with professionals on a wide range of everyday knowledge work.

Finance, Office apps, and creative work

OpenAI is explicitly positioning GPT‑5.4 Thinking as an AI optimized for finance workflows. It’s marketed as ideal for financial reasoning and Excel‑based modeling, and OpenAI is working with industry practitioners to tune it for real finance tasks that normally take analysts hours or days.

Side‑by‑side comparisons with Claude Opus 4.6 show GPT‑5.4 often producing more detailed, structured breakdowns of complex financial questions. And thanks to integrations with tools like Word and PowerPoint via extensions, it can now live directly inside the apps many professionals already use.

Creative writing is finally back

Earlier GPT‑5.x models were heavily optimized for math and coding, and many users complained that they felt like “giant calculators” rather than conversational assistants. Even OpenAI leadership has admitted they over‑optimized for technical performance and hurt the model’s creative and conversational abilities.

GPT‑5.4 seems to fix that. On human‑rated creative writing benchmarks, GPT‑5.4 high‑reasoning variants are now near the top, currently ranked second on early voting data, and could climb higher as more evaluations come in. In practice, users are reporting that it’s much better at storytelling, tone, and engaging, human‑like conversation than GPT‑5.2 or early GPT‑5.3.

Coding, games, and fully looped agents

On the coding side, GPT‑5.4 is strong enough to build surprisingly complex projects from lightly specified prompts. Examples include:

• A theme park simulation game created from a single, loosely defined prompt, using an interactive browser for automated playtesting and image generation.
• A tactical RPG built over multiple turns, again using the browser for testing and AI image tools for the game’s visual style.

What’s new here isn’t just that the model can write code. It can now participate in a full loop:

1. Take in images or screenshots
2. Generate or modify code
3. Run and test the code via tools like Playwright
4. Observe the results and iterate

This kind of closed feedback loop is exactly what you want for powerful AI agents that can build, test, and refine software with minimal human intervention.
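The generate, test, observe, iterate part of that loop can be sketched in a few lines. This is a toy illustration, not OpenAI’s implementation. The model call and the test runner are hypothetical stand‑ins (`fake_model`, `fake_test_runner`), screenshot input is omitted, and a real setup would drive a browser through a tool such as Playwright instead of calling `exec`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    passed: bool
    feedback: str


def run_agent_loop(
    generate: Callable[[str, Optional[str]], str],
    run_checks: Callable[[str], CheckResult],
    task: str,
    max_iterations: int = 5,
) -> tuple[str, int]:
    """Generate code, test it, and feed failures back until the checks pass."""
    feedback: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        code = generate(task, feedback)   # step 2: generate or modify code
        result = run_checks(code)         # step 3: run and test it
        if result.passed:
            return code, iteration
        feedback = result.feedback        # step 4: observe and iterate
    raise RuntimeError(f"no passing solution after {max_iterations} iterations")


def fake_model(task: str, feedback: Optional[str]) -> str:
    # Hypothetical stand-in for a model call: the first draft has a bug,
    # and the bug is fixed once any feedback arrives.
    body = "a - b" if feedback is None else "a + b"
    return f"def add(a, b):\n    return {body}\n"


def fake_test_runner(code: str) -> CheckResult:
    # Hypothetical stand-in for a test harness. exec() is for demo purposes only.
    namespace: dict = {}
    exec(code, namespace)
    actual = namespace["add"](2, 3)
    if actual == 5:
        return CheckResult(True, "ok")
    return CheckResult(False, f"add(2, 3) returned {actual}, expected 5")


code, iterations = run_agent_loop(fake_model, fake_test_runner, "write add(a, b)")
print(f"passed after {iterations} iterations")
```

The design point is that the failure message goes back into the next generation call, so the loop can converge without a human reading the test output.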

Native computer use and OSWorld: AI that actually uses a PC

GPT‑5.4 is also the first general‑purpose OpenAI model with strong, native computer‑use capabilities. It can navigate software interfaces, read on‑screen content, and perform keyboard and mouse actions in real time.

On the OSWorld benchmark—which measures how well a model can operate a desktop environment via screenshots and actions—GPT‑5.4 hits about 75%, beating earlier GPT‑5.2 variants by a wide margin.

In demos, GPT‑5.4 can, for example, read data from an invoice and enter it into an internal system in real time. Unlike earlier agent demos that felt slow and brittle, this looks much closer to how a human would use a computer: fast, flexible, and able to adapt to small layout changes.

For developers building agents that interact with websites, SaaS tools, and enterprise software, GPT‑5.4 is currently one of the strongest options available.

Where GPT‑5.4 still struggles: novel engineering problems

It’s not all smooth progress. One of the more surprising findings in OpenAI’s own technical report is that GPT‑5.4 Thinking actually regresses on a benchmark called OPQA.

OPQA is an internal OpenAI benchmark built from real engineering problems the company has faced: unexpected performance regressions, training bugs, anomalous metrics, and other genuinely hard, novel debugging and systems questions. These are not textbook problems; they’re messy, real‑world engineering puzzles.

On OPQA, GPT‑5.2 Codex scored around 8%. GPT‑5.4 Thinking scores just 4%, dead last among the models tested and barely above zero. OpenAI didn’t hide this; they published it in their own report.

It’s not yet clear whether this is noise, an artifact of how the benchmark is constructed, or a sign that certain kinds of deep, systems‑level reasoning are still very hard for current architectures. But it’s an important reminder: even as models blow past humans on some tasks, they can still be surprisingly weak on others.

Cybersecurity: GPT‑5.4 crosses a new risk threshold

Perhaps the most serious part of GPT‑5.4’s launch has nothing to do with productivity and everything to do with security.

OpenAI now classifies GPT‑5.4 Thinking as “high” in its internal preparedness framework for cybersecurity. It’s the first general‑purpose model that required specific safety mitigations because of its offensive cyber capabilities.

On professional‑level capture‑the‑flag (CTF) challenges, GPT‑5.4 achieved an 88% success rate. In simulated network environments, it was able to:

• Exploit a vulnerable Azure web app to steal credentials
• Modify controls and move laterally through a network
• Execute complex, multi‑step attacks that resemble what skilled human attackers might do

Capabilities have climbed steeply across model generations. Roughly:

• GPT‑5.2: ~47% on certain cybersecurity benchmarks
• GPT‑5.3: ~80%
• GPT‑5.4: ~73–88%, depending on the task

The overall trend is upward, even though the low end of GPT‑5.4’s range sits below the GPT‑5.3 figure. OpenAI’s own framework says that a “high” rating means a model could potentially automate end‑to‑end cyberattacks against hardened targets. The next rung up, “critical,” would mean a model that could cause catastrophic, large‑scale damage autonomously, including attacks on critical infrastructure like power grids or water systems.

If GPT‑6 or GPT‑7 crosses that “critical” threshold, OpenAI and regulators will face a very different release decision. It’s easy to imagine requirements like ID verification or stricter access controls for the most capable models, similar to how we already gate gun purchases, driving, or access to financial systems.

We’re not there yet—but GPT‑5.4 is the first mainstream model that makes that future feel close and concrete.

What GPT‑5.4 Pro means for the near future

GPT‑5.4 Pro shows that AI progress is no longer just about test scores on academic benchmarks. It’s about:

• Solving long‑standing research problems, like 20‑year‑old math questions
• Performing real professional work at near‑junior level, in a fraction of the time and at a fraction of the cost
• Operating computers and software like a human assistant
• Crossing new thresholds in cybersecurity capability

At the same time, it’s expensive, uneven (brilliant at some tasks, weak on others like OPQA), and raises serious questions about safety, access, and the future of work.

Over the next few years, the key questions won’t just be “How smart is the next model?” but also:

• Who gets access to the most capable systems, and under what controls?
• Which roles and industries are most exposed to rapid automation?
• How do we balance the upside of breakthroughs in science, finance, and productivity with the downside risks in security and misuse?

GPT‑5.4 Pro doesn’t answer those questions, but it makes them impossible to ignore.