GPT‑5.4 thinking is here: what’s new and how it actually performs
OpenAI has rolled out a new wave of ChatGPT models: GPT‑5.4 thinking, GPT‑5.4 Pro, and just before that, GPT‑5.3 instant. Together, they’re meant to cover everything from fast everyday chats to deep research and heavy knowledge work.
This article walks through what’s new in GPT‑5.4 thinking, how it compares to earlier GPT‑5.x models and rivals like Gemini and Claude, and what it’s actually like to use for research, documents, spreadsheets, coding, and everyday writing.
GPT‑5.4, 5.4 Pro, and 5.3 instant: how the new models fit together
OpenAI’s current lineup can be a bit confusing, so it helps to separate the roles of each model:
GPT‑5.3 instant is the fast, lightweight model. It’s designed to respond almost instantly and is ideal for quick questions, brainstorming, and simple tasks where you care more about speed than deep reasoning.
GPT‑5.4 thinking is the new general‑purpose flagship. It takes a bit longer to respond because it “thinks” more, but in exchange you get better reasoning, stronger research, and more reliable structured outputs like spreadsheets, documents, and code.
GPT‑5.4 Pro sits above both as a research‑grade model. It’s available on higher‑tier plans, is more expensive to run, and is aimed at advanced analysis and high‑end research rather than everyday tasks.
One confusing detail: there was no GPT‑5.3 thinking model. That means version numbers for instant and thinking models won’t always line up—future instant models might be ahead of or behind the thinking models numerically.
Big upgrade for knowledge work: docs, slides, and spreadsheets
GPT‑5.2 already made a noticeable leap in “knowledge work” tasks like building spreadsheets, structured documents, and presentations. GPT‑5.4 thinking pushes that further.
Research to presentation in one flow
In testing, GPT‑5.4 thinking was asked to research whether consumer AI products have reduced hallucinations over time. The prompt specified a structured output with three sections: a plan, findings with citations, and a final checklist.
On standard thinking effort, it took under a minute to produce a detailed research plan and findings. The output followed the requested structure closely and included citations, without using the separate “deep research” tool.
From there, the same chat was used to generate a PowerPoint presentation based on the research. GPT‑5.4 thinking produced a 15‑slide deck (exactly as requested), including references to the original sources. The first design was functional but basic; a single follow‑up prompt asking for a more modern redesign produced a cleaner, more minimal look while preserving all content.
Complex Excel spreadsheets with formulas and summaries
GPT‑5.4 thinking also handled a more demanding spreadsheet task: generating a multi‑sheet Excel workbook with formulas, summary pages, and charts. It took around 10 minutes to produce a downloadable Excel file.
Inside Excel, the workbook included:
• A summary page with key metrics
• Multiple data sheets
• Working formulas, visible in the formula bar
• A chart built from the generated data
As always, you need to spot‑check numbers and formulas, because LLMs can still hallucinate or miscalculate. But for scaffolding complex spreadsheets that you can then refine manually, GPT‑5.4 thinking is a major time saver.
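One way to make that spot‑check routine is to recompute the summary metrics from the raw rows and compare them with what the workbook reports. The sketch below is only an illustration: the `check_summary` helper, the `amount` column, and all the numbers are hypothetical, and it assumes you have already copied the data rows and the reported summary values out of the generated file.

```python
import math

def check_summary(rows, reported, rel_tol=1e-9):
    """Recompute summary metrics from raw rows and return any mismatches.

    The result maps each mismatching metric name to a
    (reported_value, recomputed_value) pair.
    """
    values = [row["amount"] for row in rows]
    recomputed = {
        "total": sum(values),
        "average": sum(values) / len(values),
        "max": max(values),
    }
    return {
        name: (reported[name], recomputed[name])
        for name in recomputed
        if not math.isclose(reported[name], recomputed[name], rel_tol=rel_tol)
    }

# Hypothetical data rows and summary values copied from a generated workbook.
rows = [{"amount": 120.0}, {"amount": 80.0}, {"amount": 100.0}]
reported = {"total": 300.0, "average": 100.0, "max": 125.0}  # "max" is deliberately wrong

mismatches = check_summary(rows, reported)
print(mismatches)
```

An empty result means the summary agrees with the rows for the metrics checked. It says nothing about whether the underlying data itself is correct.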
If you’re interested in AI inside spreadsheets more broadly, there’s a deeper look at current tools in this guide to the best AI tools in Excel.
Native computer use: ChatGPT that can actually do things
One of the headline changes in GPT‑5.4 thinking is that computer use is now native to the model. Instead of relying on a separate “agent” model to control a browser or apps, GPT‑5.4 thinking can directly perform computer actions as part of a normal chat.
In practice, that means it can:
• Handle data entry in web apps
• Manage emails and calendars
• Perform web‑based workflows without you clicking anything
This turns ChatGPT into more of an actual assistant that can operate software on your behalf, not just generate text about what you should do. It’s still early days, and you’ll want to supervise anything that touches real data, but the direction is clear: general‑purpose models are becoming full computer agents.
Tool use and cost: more efficient than 5.2 despite higher pricing
GPT‑5.4 thinking is priced slightly higher per token than GPT‑5.2, but it uses tools more efficiently, which matters most in tool‑calling and function‑calling workflows.
Because it uses fewer tokens to decide which tools to call and how to call them, the total cost of a complex workflow can actually be lower on GPT‑5.4 than on GPT‑5.2, even though the base rate is higher. For developers and teams building multi‑tool agents or API workflows, this can translate into real savings.
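The arithmetic behind that claim is easy to sketch. The prices and token counts below are hypothetical, chosen only to show how a higher per‑token rate can still produce a lower bill. They are not OpenAI's actual pricing, and `workflow_cost` and `break_even_token_ratio` are illustrative helpers, not part of any API.

```python
def workflow_cost(tokens: int, price_per_million: float) -> float:
    """Total cost in dollars for a workflow that consumes `tokens` tokens."""
    return tokens / 1_000_000 * price_per_million

def break_even_token_ratio(old_price: float, new_price: float) -> float:
    """Fraction of the old token count below which the pricier model is cheaper."""
    return old_price / new_price

# Hypothetical figures: the newer model costs 20% more per token
# but needs 30% fewer tokens to finish the same tool-calling workflow.
old_cost = workflow_cost(tokens=1_000_000, price_per_million=10.0)
new_cost = workflow_cost(tokens=700_000, price_per_million=12.0)

print(f"older model: ${old_cost:.2f}")
print(f"newer model: ${new_cost:.2f}")
print("newer model is cheaper overall:", new_cost < old_cost)
print(f"break-even token ratio: {break_even_token_ratio(10.0, 12.0):.3f}")
```

With these made‑up numbers, the newer model comes out ahead whenever it uses less than about 83% of the tokens the older model needs. To see whether the saving holds for you, measure the token counts of your own workflow.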
Coding: 5.4 thinking catches up to dedicated code models
OpenAI recently introduced GPT‑5.3 Codex, a model specialized for coding. GPT‑5.4 thinking is now positioned as a general‑purpose model that matches the coding quality of GPT‑5.3 Codex, so you don’t have to switch models when moving between natural language and code.
Building a web app in one prompt
One test asked GPT‑5.4 thinking to build a website that compares top AI tools, with specific UI requirements: rounded cards, a dark/light mode toggle, filters, and more. The code was run directly in ChatGPT’s canvas mode.
The result:
• Light mode looked clean and well‑designed
• The layout followed the prompt’s structure and styling requests
• Filtering worked, and tool information was mostly accurate
However, there were still some issues:
• External links to tool websites didn’t work
• A comparison feature (e.g., Claude vs ChatGPT) couldn’t be selected
So while the app was close to usable from a single prompt, it wasn’t perfect. Minor bugs would still need manual fixing. Compared to GPT‑5.2, though, the first‑shot result was noticeably more reliable.
Simulation test: 5.4 vs 5.2
Another coding test involved building a day‑to‑night simulation. GPT‑5.4 thinking produced a working version on the first try, with a clear visual transition from dawn to night. GPT‑5.2, by contrast, produced something more “cinematic” but less aligned with the intended behavior.
Interestingly, GPT‑5.1 had previously produced something closer to GPT‑5.4’s result, suggesting that 5.4 is consolidating the strengths of earlier versions while improving reliability.
Despite benchmark wins, the tester still felt that Anthropic’s Opus 4.6 is currently the strongest coding model in real‑world use. A broader head‑to‑head comparison across GPT, Claude, Gemini, and Grok is still needed to confirm how GPT‑5.4 stacks up in everyday coding workflows. For a more general comparison of newer GPT models against Claude, see this real‑world GPT vs Claude test.
Benchmarks vs Gemini and Claude: slightly ahead, but close
OpenAI’s official blog focuses on comparisons between its own models, but one chart shared online compared GPT‑5.4 thinking to:
• Anthropic Claude Opus 4.6
• Google Gemini 3.1 Pro
Across a range of benchmarks, GPT‑5.4 thinking is generally state‑of‑the‑art, with a slight edge in some tests and near‑parity in others. On paper, that makes it the best overall model available right now. In practice, the gap is small, and user preference will still depend heavily on the specific task: coding, writing style, reasoning, or tool integration.
Hallucinations: another 33% reduction
GPT‑5.4 thinking is claimed to hallucinate 33% less than GPT‑5.2. That doesn’t mean it never makes things up, but it continues the slow march toward more reliable outputs.
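It helps to be clear about what a 33% reduction means if it is read as a relative figure: the new rate is 67% of the old one, not 33 percentage points lower. The short sketch below assumes that reading and uses a made‑up baseline rate purely for illustration. The article does not give actual hallucination rates for either model.

```python
def rate_after_reduction(baseline_rate: float, relative_reduction: float) -> float:
    """Apply a relative reduction (0.33 means 33%) to a baseline rate."""
    return baseline_rate * (1.0 - relative_reduction)

# Hypothetical baseline: 6% of answers contain a hallucination.
baseline = 0.06
new_rate = rate_after_reduction(baseline, 0.33)

print(f"baseline rate: {baseline:.2%}")
print(f"rate after a 33% relative reduction: {new_rate:.2%}")
```

Under that assumed baseline, roughly four answers in a hundred would still contain a hallucination, which is why verification still matters.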
In the research example on AI hallucination trends, GPT‑5.4 produced a solid plan, citations, and a structured checklist in under a minute, without needing the slower deep research mode. For high‑stakes work, you should still verify claims and sources, but each new generation appears to move closer to a “1% hallucination” goal.
Thinking effort: control how hard it thinks
GPT‑5.4 thinking introduces a visible “thinking effort” control in the ChatGPT interface. You can choose between levels like Standard and Heavy depending on how complex your task is.
• Standard is the default and is usually enough for everyday tasks and moderate research.
• Heavy can produce deeper analysis and more careful reasoning, but it can take significantly longer to respond.
A useful detail: you can change your instructions mid‑run without breaking the process. For example, if you initially ask for 10 sources and then decide you want 15, you can just say so while it’s still working. GPT‑5.4 will incorporate the new requirement and continue, instead of forcing you to restart the chat.
Writing and tone: still not the best daily writer
Most people use ChatGPT for everyday writing—emails, blog posts, scripts, and headlines—so writing quality and tone matter a lot.
In testing, GPT‑5.4 thinking was asked to write hooks and intros for a YouTube video about the new model, using account‑level system instructions that specify a particular tone and explicitly ban em dashes.
The results were mixed:
• Some runs ignored the no‑em‑dash rule entirely
• The tone felt more promotional and “AI‑ish” than desired
• Even with prior tuning, the outputs didn’t match the requested personal style
With one or two follow‑up prompts, it’s possible to steer GPT‑5.4 closer to the right voice. But out of the box, the tester still preferred Gemini and Claude for straightforward, conversational, non‑hypey writing that doesn’t require extra setup or custom projects.
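If you rely on a rule like the dash ban, it can be checked mechanically before you accept a draft. This is a minimal sketch with a hypothetical `find_em_dashes` helper and a made‑up draft string. It only detects the character and does not judge tone.

```python
EM_DASH = "\u2014"  # the em dash character, written as an escape

def find_em_dashes(text: str) -> list[int]:
    """Return the index of every em dash found in `text`."""
    return [i for i, ch in enumerate(text) if ch == EM_DASH]

# Hypothetical model output that breaks the rule once.
draft = "GPT-5.4 is here \u2014 and it changes everything."
positions = find_em_dashes(draft)

if positions:
    print(f"rule violated at positions: {positions}")
else:
    print("no em dashes found")
```

A failed check is a cue to send a follow‑up prompt or fix the draft by hand.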
ChatGPT for Excel and other ecosystem updates
Alongside GPT‑5.4, OpenAI has also released a ChatGPT for Excel add‑in. If you have a paid ChatGPT plan, you can install it directly into Excel to bring GPT‑style assistance into your spreadsheets.
Combined with GPT‑5.4’s stronger spreadsheet generation and formula handling, this makes Excel a much more powerful environment for analysis, reporting, and automation—especially if you’re comfortable spot‑checking and refining what the model produces.
Who should use GPT‑5.4 thinking right now?
Based on early testing, GPT‑5.4 thinking is most compelling if you:
• Do a lot of research and knowledge work (reports, briefs, structured analysis)
• Need to turn research into presentations and spreadsheets quickly
• Build AI‑powered tools or agents that rely on tool calling and efficient token use
• Want a single model that can handle both natural language and serious coding
GPT‑5.4 Pro is overkill for most everyday tasks, but valuable if you’re doing deep research or high‑stakes analysis. GPT‑5.3 instant remains a great choice when you just want fast answers and don’t need heavy reasoning.
The bottom line: GPT‑5.4 thinking is a meaningful upgrade over GPT‑5.2, especially for structured work and coding, and it’s now a serious contender—or leader—among state‑of‑the‑art models. But for pure writing style and tone, alternatives like Claude and Gemini may still feel more natural without extra tuning.