OpenAI GPT‑5.4: the new all‑in‑one model for coding, agents, and knowledge work

04 Jun 2026 05:07
GPT‑5.4 merges OpenAI’s best reasoning, coding, and agent capabilities into a single flagship model with a 1M token context window. Here’s how it compares to Claude Opus, what’s new in tools and planning, and why it’s built for serious knowledge work—despite the steep pricing.

OpenAI has launched GPT‑5.4, and it’s not just another incremental update. This release finally merges OpenAI’s strongest coding, reasoning, and agentic features into one flagship model that can realistically act as your main AI coworker.

If you’ve been bouncing between different models for code, writing, and automation, GPT‑5.4 is designed to replace that patchwork setup with a single, high‑end system—especially for serious knowledge work and AI agents.

From split models to one unified flagship

Until now, OpenAI’s lineup was split across use cases. GPT‑5.2 was the go‑to for personality, writing, and general reasoning. GPT‑5.3 Codex was the specialist for coding and developer workflows. You had to pick the right model for each job.

Meanwhile, Anthropic’s Claude Opus 4.6 showed what a unified model could look like: strong world knowledge, great reasoning, a pleasant personality, and top‑tier coding in a single system. That “all‑in‑one” approach is exactly what GPT‑5.4 is aiming to match—and surpass.

GPT‑5.4 essentially combines GPT‑5.2 and GPT‑5.3 Codex into one model that can:

• Write and debug complex code
• Handle long‑form reasoning and knowledge work
• Act as an AI agent across tools, browsers, and apps
• Do creative writing with a more natural personality

Two variants: GPT‑5.4 thinking vs GPT‑5.4 Pro

OpenAI released two versions:

GPT‑5.4 thinking – The main workhorse model. It’s cheaper than Pro and, interestingly, often scores better on OpenAI’s own real‑world benchmarks. For most people, this is the version to use.

GPT‑5.4 Pro – A more expensive, “smarter” variant aimed at the most demanding workloads. Despite being positioned as the premium option, it actually scores slightly lower than thinking on OpenAI’s GDP‑Val benchmark for knowledge work.

Early testers say 5.4 thinking is already more than enough for many power users and developers, to the point of replacing their previous reliance on Pro‑tier models.

Performance: how GPT‑5.4 stacks up

OpenAI shared benchmark comparisons against its own previous models and, for once, also against Anthropic and Google.

Agentic computer use (OSWorld)

On OSWorld, a benchmark for operating a computer via tools and UI actions:

• GPT‑5.4 thinking: ~75% accuracy
• GPT‑5.3 Codex: ~74%
• Claude Opus 4.6: ~72.7%

More interesting than the raw score is efficiency. GPT‑5.4 reaches higher accuracy with far fewer tool calls than GPT‑5.2—about 75% accuracy with 15 tool calls versus under 50% with 42 tool calls for 5.2. Fewer tool calls means lower token usage, faster runs, and cheaper agents.
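To make the efficiency gap concrete, here is a small sketch that turns the figures quoted above into a rough "accuracy points per tool call" number. That metric is an illustrative proxy chosen for this example, not something OpenAI reports, and the GPT‑5.2 accuracy is entered as 50% even though the source only says "under 50%".

```python
# Approximate figures quoted in the paragraph above.
runs = {
    "GPT-5.4": {"accuracy": 0.75, "tool_calls": 15},
    # The article says "under 50%", so 0.50 is an upper bound for GPT-5.2.
    "GPT-5.2": {"accuracy": 0.50, "tool_calls": 42},
}


def accuracy_per_call(run):
    """Accuracy points earned per tool call (a rough efficiency proxy)."""
    return run["accuracy"] * 100 / run["tool_calls"]


# How many times more tool calls GPT-5.2 needed than GPT-5.4.
call_ratio = runs["GPT-5.2"]["tool_calls"] / runs["GPT-5.4"]["tool_calls"]

for name, run in runs.items():
    print(f"{name}: {accuracy_per_call(run):.2f} accuracy points per tool call")
print(f"GPT-5.2 used {call_ratio:.1f}x as many tool calls")
```

Because every tool call adds tokens and latency, a ratio like this gives a first-order feel for how much cheaper an agent run might be, though real savings depend on how large each call's payload is.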

Real‑world knowledge work (GDP‑Val)

GDP‑Val is OpenAI’s own benchmark for “real” knowledge work—tasks that could actually move economic output, like complex analysis, document work, and professional workflows.

• GPT‑5.4 thinking scores 83%
• GPT‑5.3 Codex scores 13 points lower, at roughly 70%
• Claude Opus 4.6 scores 78%

So on OpenAI’s home turf—structured, professional tasks—GPT‑5.4 thinking comes out ahead, even beating the Pro variant on this specific metric.

Built for knowledge workers and AI agents

GPT‑5.4 is clearly aimed at people who use AI as a serious productivity tool, not just for chat. Think analysts, operators, founders, and teams running complex automations or agent workflows.

It’s optimized for tasks like:

• Reading and summarizing long PDF documents
• Creating slide decks and structured reports
• Working with spreadsheets and data tables
• Searching the web and using a browser
• Operating a full computer environment via tools

If you’re already deep into AI agents, it’s designed to be a strong default “brain” behind systems like OpenClaw, custom tool‑calling stacks, or agent platforms.

Massive 1M token context window

One of the biggest upgrades: GPT‑5.4 now supports a 1 million token context window, matching Claude’s headline feature.

In practical terms, that means you can:

• Load huge document sets (reports, contracts, books, codebases)
• Keep long, multi‑step workflows in a single conversation
• Run complex agents that need to “remember” a lot of state

It’s powerful—but not cheap. Using the full context window will be expensive, so you’ll want to be smart about chunking, retrieval, and caching if you’re building serious apps on top of it.
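As a rough illustration of being "smart about chunking", the sketch below picks documents to fit a token budget and estimates the input cost. The 4‑characters‑per‑token ratio is a crude assumption (a real tokenizer should replace it), the helper names are hypothetical, and the $2.50 per 1M input tokens figure is the approximate GPT‑5.4 price quoted in this article's pricing section.

```python
INPUT_PRICE_PER_M = 2.50    # USD per 1M input tokens (approximate, from this article)
CONTEXT_LIMIT = 1_000_000   # tokens, per the article
CHARS_PER_TOKEN = 4         # crude assumption; use a real tokenizer in practice


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)


def select_within_budget(docs, budget_tokens=CONTEXT_LIMIT):
    """Keep documents in order, skipping any that would exceed the budget."""
    kept, used = [], 0
    for doc in docs:
        cost = estimate_tokens(doc)
        if used + cost > budget_tokens:
            continue  # does not fit; a later, smaller document still might
        kept.append(doc)
        used += cost
    return kept, used


def input_cost_usd(tokens: int) -> float:
    """Estimated input cost for one request of the given size."""
    return tokens / 1_000_000 * INPUT_PRICE_PER_M


docs = ["a" * 400, "b" * 4000, "c" * 40]
kept, used = select_within_budget(docs, budget_tokens=200)
print(len(kept), "documents kept,", used, "tokens used")
print(f"A full {CONTEXT_LIMIT:,}-token prompt costs about ${input_cost_usd(CONTEXT_LIMIT):.2f} in input alone")
```

The point of the greedy selection is only to show the budgeting idea; a production system would more likely rank chunks by relevance (retrieval) before packing them.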

New planning and reasoning behavior

GPT‑5.4 introduces a more explicit planning step for complex tasks. Instead of immediately diving into execution, it can first outline a plan for how it intends to solve the problem.

This is similar to the “plan first” features in tools like Cursor: you see the model’s strategy before it burns tokens writing code or generating long outputs. That gives you a chance to correct the direction early, saving time and cost.

This planning behavior is now built directly into ChatGPT for GPT‑5.4, and it’s especially useful for coding, multi‑step workflows, and agentic tasks.
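If you are building your own agent rather than using ChatGPT, the same plan‑first idea can be approximated in application code. The following is a minimal sketch, not OpenAI's implementation: the three callables are hypothetical stand‑ins for a model call that drafts a plan, a human or policy review, and a step executor.

```python
from typing import Callable, List


def plan_then_execute(
    task: str,
    make_plan: Callable[[str], List[str]],
    review: Callable[[List[str]], List[str]],
    run_step: Callable[[str], str],
) -> List[str]:
    """Ask for a plan, let a reviewer edit it, then run only the approved steps."""
    proposed = make_plan(task)    # cheap: a short outline, no long output yet
    approved = review(proposed)   # human or policy check; may drop or reorder steps
    return [run_step(step) for step in approved]


# Hypothetical stand-ins for a real model and a real reviewer.
def fake_make_plan(task: str) -> List[str]:
    return [f"inspect inputs for: {task}", "write code", "deploy to production"]


def fake_review(steps: List[str]) -> List[str]:
    # Correct the direction early: drop anything that touches production.
    return [s for s in steps if "production" not in s]


def fake_run_step(step: str) -> str:
    return f"done: {step}"


results = plan_then_execute("fix login bug", fake_make_plan, fake_review, fake_run_step)
print(results)
```

Keeping the review step as a separate callable means the expensive execution phase only ever sees steps that someone (or some rule) has signed off on.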

Stronger vision and computer control

GPT‑5.4 also comes with upgraded vision and UI control capabilities. It can:

• Interpret screenshots
• Issue mouse and keyboard commands
• Use libraries like Playwright to control a browser
• Perform structured actions in apps like Gmail or calendar tools

In OpenAI’s demos, GPT‑5.4 was able to:

• Open Gmail, scan sent emails, star and label them
• Create calendar invites from email content
• Perform bulk data entry by extracting structured data (e.g., from JSON) into forms—at what appears to be real‑time speed
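The bulk data entry demo boils down to mapping structured records onto form fields. Here is a minimal sketch of that mapping step on its own, with no browser involved. The field names and the sample record are invented for illustration and do not come from OpenAI's demo.

```python
import json
from typing import Dict, List

# Hypothetical mapping from JSON keys to form field names.
FIELD_MAP = {"name": "full_name", "email": "email_address", "company": "organization"}


def to_form_payloads(raw_json: str, field_map: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn a JSON array of records into one dict of form values per record."""
    records = json.loads(raw_json)
    payloads = []
    for record in records:
        missing = [key for key in field_map if key not in record]
        if missing:
            # Fail loudly instead of submitting a half-filled form.
            raise ValueError(f"record {record!r} is missing keys: {missing}")
        payloads.append({field_map[key]: str(record[key]) for key in field_map})
    return payloads


sample = '[{"name": "Ada", "email": "ada@example.com", "company": "Acme"}]'
payloads = to_form_payloads(sample, FIELD_MAP)
print(payloads)
```

In an agent setup, each payload would then be handed to whatever fills the form, for example a browser automation layer; validating before that hand‑off keeps bad records from turning into bad submissions.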

The main bottleneck now is less the model than the web itself: many websites still block automated or agentic access. As more publishers and platforms adapt, these capabilities will become more useful in real‑world workflows.

Impressive coding and game‑building demos

OpenAI showcased GPT‑5.4’s coding abilities with some eye‑catching demos, all reportedly built from single, lightly specified prompts.

Theme park simulation game
GPT‑5.4 generated a full theme park sim with:

• Adjustable speed controls
• Park design tools
• Logic for funds, guests, happiness, cleanliness, and ratings
• Simple but functional visual assets and moving “guests”

2D RPG battle game
It also built a retro‑style 2D RPG battle interface with:

• Polished pixel‑art style assets
• Multiple characters
• Turn‑based actions like “attack” and “end turn”

These aren’t production‑ready games, but they’re strong examples of how far single‑prompt prototyping has come, especially for developers using 5.4 inside coding environments or agent frameworks.

Pricing: powerful but expensive

Here’s the painful part: GPT‑5.4 is not cheap.

Input pricing (approximate, per 1M tokens):

• GPT‑5.2 input: $1.75 → GPT‑5.4 input: $2.50
• GPT‑5.2 Pro input: $21 → GPT‑5.4 Pro input: $30

Output tokens are also more expensive, especially for Pro:

• GPT‑5.4 output: ~$15 per 1M tokens (vs $14 for 5.2)
• GPT‑5.4 Pro output: ~$180 per 1M tokens (vs $168 for 5.2 Pro)

You can reduce costs by caching inputs, but output tokens will remain the main driver of your bill. For hobby projects, this will sting. For serious production workloads, you’ll want to design prompts and workflows with token efficiency in mind from day one.
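To see how these prices translate into a per‑request bill, here is a small estimator built on the approximate figures above. The dictionary keys are labels chosen for this sketch, not confirmed API model identifiers, and the calculation ignores cached‑input discounts.

```python
# Approximate USD prices per 1M tokens, as quoted in this article.
PRICES = {
    "gpt-5.2": {"input": 1.75, "output": 14.00},
    "gpt-5.4": {"input": 2.50, "output": 15.00},
    "gpt-5.2-pro": {"input": 21.00, "output": 168.00},
    "gpt-5.4-pro": {"input": 30.00, "output": 180.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one request, without any caching discount."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Example workload: 10,000 input tokens and 4,000 output tokens per request.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 10_000, 4_000):.4f} per request")
```

Multiplying the per‑request figure by expected daily volume is usually the quickest way to decide whether a workload belongs on the thinking tier or can justify Pro.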

Prompting GPT‑5.4 vs Claude models

One important detail: GPT‑5.4 responds best to prompting styles that are different from what works for Claude Opus and other Anthropic models.

If you’re running multi‑model setups or tools like OpenClaw, it’s worth:

• Grabbing OpenAI’s official GPT‑5.4 prompting guide
• Letting your agent download and study that guide
• Maintaining separate prompt templates for GPT‑5.4 and for Opus/Sonnet

With models shipping at this pace—Opus 4.5 → 4.6, Sonnet 4.6, GPT‑5.3 Codex, now GPT‑5.4—keeping your prompts versioned and model‑specific is becoming a core part of building reliable AI systems.
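One way to keep prompts versioned and model‑specific is a small registry keyed by model family and template version. This is a sketch under assumptions: the class names, family labels, and placeholder prompt text are all hypothetical, and the real template contents should come from each vendor's prompting guide.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class PromptTemplate:
    family: str   # e.g. "gpt-5.4" or "claude-opus"; labels chosen for this sketch
    version: int  # bump whenever the template text changes
    system: str   # the system prompt text itself


class TemplateRegistry:
    def __init__(self) -> None:
        self._items: Dict[Tuple[str, int], PromptTemplate] = {}

    def register(self, template: PromptTemplate) -> None:
        key = (template.family, template.version)
        if key in self._items:
            raise ValueError(f"{key} already registered; bump the version instead")
        self._items[key] = template

    def get(self, family: str, version: Optional[int] = None) -> PromptTemplate:
        """Return a pinned version, or the newest one for that family."""
        if version is not None:
            return self._items[(family, version)]
        versions = [v for (f, v) in self._items if f == family]
        if not versions:
            raise KeyError(f"no templates registered for {family!r}")
        return self._items[(family, max(versions))]


registry = TemplateRegistry()
registry.register(PromptTemplate("gpt-5.4", 1, "<placeholder: prompt written for GPT-5.4>"))
registry.register(PromptTemplate("gpt-5.4", 2, "<placeholder: revised GPT-5.4 prompt>"))
registry.register(PromptTemplate("claude-opus", 1, "<placeholder: prompt written for Opus>"))

print(registry.get("gpt-5.4").version)     # newest GPT-5.4 template
print(registry.get("gpt-5.4", 1).version)  # pinned older version
```

Refusing to overwrite an existing version forces every prompt change to be an explicit bump, which makes it easier to trace a regression back to the template that caused it.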

Why models are shipping so fast now

Both Anthropic and OpenAI seem to have locked in stable, repeatable pre‑training cycles. In practice, that means:

• New model families are “baked” continuously in the background
• When performance crosses a certain threshold, they cut a snapshot and ship it
• We get frequent, meaningful upgrades instead of rare, giant jumps

Less than a year ago, OpenAI was struggling with this. GPT‑4.5 was powerful but slow, huge, and expensive to run, and it never really became a mainstream default. The 5.x family, by contrast, is fast, efficient, and consistently strong across use cases.

GPT‑5.4 is the clearest sign yet that OpenAI’s training pipeline is back on track—and that we should expect this rapid release cadence to continue.

Early reactions from power users

Early testers who had access to GPT‑5.4 for about a week have shared some strong opinions:

What they love

• Many call it the best overall model available right now
• 5.4 thinking is good enough to replace Pro models for most tasks
• Coding performance inside tools like Codex is described as “insanely reliable”
• It’s a better general‑purpose agent and writes clearer documentation

What still needs work

• Front‑end/UI design “taste” is still behind Opus 4.6 and Gemini 3.1 Pro
• It can miss obvious real‑world context (e.g., suggesting tourist spots packed with spring breakers for a “relaxing” trip)
• In some agent setups, it stops short of fully completing tasks

OpenAI leadership has publicly acknowledged these issues and says they’re working on quick fixes, especially around task completion and real‑world context awareness.

Who should actually use GPT‑5.4?

GPT‑5.4 is overkill for casual chatting, but it’s a strong fit if you:

• Run AI agents for real work (operations, research, outreach, data tasks)
• Build products on top of LLMs and need a single, high‑end default
• Do heavy coding, automation, or tool‑calling workflows
• Work with large document sets or long‑running projects

If cost is a concern, GPT‑5.4 thinking is the sweet spot: frontier‑level performance without Pro‑tier pricing. For most developers and power users, that’s the version to start with.

Overall, GPT‑5.4 feels like a turning point for OpenAI: one unified model that can realistically handle coding, knowledge work, and agents at the same time—finally catching up to, and in many areas surpassing, the unified approach pioneered by Claude Opus.
