Why everyone is arguing about whether AI progress has slowed down
AI has moved so fast over the last few years that even experts can’t agree on what’s actually happening anymore. Is progress slowing down after GPT‑4, or are we in the middle of the steepest part of the curve? Are coding agents just a narrow niche, or the key to everything else?
This article unpacks that debate in plain language: how AI progress has really changed since GPT‑4, what “thinking time” and coding agents are doing, and whether AI is already helping build the next generation of AI systems.
Did AI progress really slow down after GPT‑4?
One side of the debate claims that the big leaps were from GPT‑2 → GPT‑3 → GPT‑4, powered mainly by “pre‑training scaling” – throwing more data and compute at larger models. According to this view, by around 2025 that approach hit diminishing returns. Capability gains from simply scaling pre‑training started to flatten, and labs had to pivot to more targeted tricks: post‑training, fine‑tuning, and optimizing for specific benchmarks.
The other side argues that this is backwards. From a practical, user-facing perspective, GPT‑2 and GPT‑3 were mostly curiosities. The real shock came later: GPT‑3.5 Turbo, GPT‑4, and especially the new wave of “reasoning” models like OpenAI’s o1/o3 and Anthropic’s Opus 4.x. These models didn’t just feel like bigger autocomplete engines – they started solving complex coding tasks, mathematical problems, and multi-step reasoning in ways that felt qualitatively different.
So how can both views exist? It comes down to what you measure. If you only look at the underlying pre‑training scaling curves, you might see a slowdown. If you look at what users can now actually do – especially in coding, math, and automation – the curve looks like it just went vertical.
From scaling to “thinking”: why inference-time compute matters
Early large language models mainly improved by being trained bigger and longer. That’s pre‑training. But the recent breakthroughs have leaned heavily on what happens after that: post‑training and inference-time compute.
Post‑training includes techniques like fine‑tuning on specific tasks, reinforcement learning from human feedback, and optimizing for particular benchmarks. This is where models are taught how to behave, not just what to know.
Inference-time compute is newer in the mainstream conversation. Instead of forcing a model to answer in one short pass, you let it “think” longer: generate intermediate reasoning steps, explore multiple solution paths, and use more tokens before committing to an answer. This is sometimes exposed as “heavy thinking” modes or “slow but smart” settings.
Critics frame this as a kind of hack: since pure scaling hit limits, labs bolted on more compute at inference to squeeze out extra performance. But from another angle, it’s more like moving from propeller planes to jet engines. You’re not just making the same thing bigger; you’re changing how the system reasons.
On several independent benchmarks, that shift coincided with dramatic jumps: reasoning-focused models went from barely registering on some tasks to scoring near 100% within months. For people actually using these models day to day, that feels less like a slowdown and more like a phase change.
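To make "explore multiple solution paths" concrete, here is a minimal sketch of one common inference-time trick: sample several answers and take a majority vote. Everything in it is an assumption for illustration. `sample_answer` is a hypothetical stand-in for a single model call, and the 60% per-sample accuracy is an invented number, not a measurement of any real system.

```python
import random
from collections import Counter


def sample_answer(rng: random.Random) -> int:
    """Hypothetical stand-in for one sampled reasoning path.

    Assumption: the "model" lands on the right answer (42) 60% of the
    time and otherwise picks one of four nearby wrong answers.
    """
    if rng.random() < 0.6:
        return 42
    return rng.choice([40, 41, 43, 44])


def answer_with_budget(n_samples: int, seed: int = 0) -> int:
    """Spend more inference-time compute: sample n paths, then vote."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(rng) for _ in range(n_samples))
    # most_common(1) returns the answer with the highest vote count.
    return votes.most_common(1)[0][0]


def accuracy(n_samples: int, trials: int = 2000) -> float:
    """Fraction of trials in which the voted answer is correct."""
    correct = sum(
        answer_with_budget(n_samples, seed=t) == 42 for t in range(trials)
    )
    return correct / trials


if __name__ == "__main__":
    for n in (1, 5, 15):
        print(f"{n:>2} samples per question -> accuracy {accuracy(n):.2f}")
```

In this toy setup the weights never change. Only the number of samples per question does, and accuracy tends to rise with it. That is the basic trade: more tokens spent at answer time in exchange for fewer wrong answers. Real reasoning models use more elaborate strategies than simple voting.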
Are coding agents just a “narrow” use case?
A big flashpoint in the debate is coding. One narrative says: yes, AI coding tools have improved, but they’re a narrow, niche application. Useful, but not world-changing.
The counterargument is that calling coding “narrow” misses the point. Software directly contributes over a trillion dollars to US GDP, and more than half of the global economy runs on software in some form. If you accelerate coding, you accelerate everything that depends on it: finance, logistics, healthcare, media, manufacturing, and more.
Recent coding agents don’t just autocomplete functions. Given a clear spec and the right permissions, they can:
• Plan a multi-step software project
• Create repositories, write code, and wire up APIs
• Run tests, fix failing cases, and iterate
• Deploy to hosting platforms like Vercel or similar
Real users report building full websites and internal tools this way: describing the outcome in natural language, granting the agent access to GitHub and hosting, going to sleep, and waking up to a working app. It’s not magic – there’s still scoping, back-and-forth, and verification – but the human is doing zero low-level coding.
If you’re curious how AI is already reshaping day-to-day work, this ties closely to how tools are taking control of the workday rather than just taking jobs.
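The write, test, fix, iterate cycle from the list above can be sketched in a few lines. This is a toy under stated assumptions, not how any particular product works: `propose_code` is a hypothetical stand-in for a model call that simply replays two canned drafts, and the "project" is a single `add` function.

```python
from typing import Optional, Tuple

# Hypothetical canned drafts standing in for model output.
# A real agent would call an LLM with the spec and the latest test failure.
CANDIDATES = [
    "def add(a, b):\n    return a - b\n",  # buggy first draft
    "def add(a, b):\n    return a + b\n",  # corrected draft
]


def propose_code(spec: str, feedback: Optional[str], attempt: int) -> str:
    """Return the next candidate; this stub ignores spec and feedback."""
    return CANDIDATES[min(attempt, len(CANDIDATES) - 1)]


def run_tests(source: str) -> Optional[str]:
    """Run the candidate against the tests.

    Returns None when every test passes, otherwise a failure message.
    """
    namespace: dict = {}
    try:
        # Only safe here because the drafts are hard-coded above.
        exec(source, namespace)
        assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"
        assert namespace["add"](-1, 1) == 0, "add(-1, 1) should be 0"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def agent_loop(spec: str, max_attempts: int = 3) -> Tuple[Optional[str], int]:
    """Generate, test, and retry until tests pass or the budget runs out."""
    feedback: Optional[str] = None
    for attempt in range(max_attempts):
        source = propose_code(spec, feedback, attempt)
        feedback = run_tests(source)
        if feedback is None:
            return source, attempt + 1
    return None, max_attempts


if __name__ == "__main__":
    code, attempts = agent_loop("Write add(a, b) that returns the sum.")
    print(f"passed after {attempts} attempt(s)" if code else "gave up")
```

The important design choice is that the tests, not the model, decide when the loop stops. The attempt budget is what keeps an overnight run from spinning forever. Anything that executes model-written code for real should also do so in a sandbox rather than with a bare `exec`.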
How professionals actually use AI coding tools
One criticism is that the “AI builds my app while I sleep” story doesn’t match how most professional engineers use these tools. Surveys of hundreds of working programmers show a more conservative pattern:
• They write very clear specs for small, bounded tasks.
• The model generates code for that piece.
• They run unit tests and integration tests.
• Around 20% of the time, the model’s output is wrong enough that they abandon it and code the feature themselves.
That’s accurate for many enterprise workflows today. But it doesn’t invalidate the more aggressive use cases – it just shows a spectrum. Enterprises are risk‑sensitive and optimize for reliability; solo builders and startups often optimize for speed and experimentation.
This is where disruption theory matters. Historically, new technologies rarely start by replacing expert workflows. Digital cameras were mocked by professional photographers; Wikipedia was dismissed by encyclopedia editors; early MP3s offended audiophiles. The people with the highest standards are the last to switch. Meanwhile, casual and mid‑tier users adopt the new thing because it’s better than nothing, not better than the best.
Applied to AI coding, that means:
• Top-tier engineers compare AI to their own best work and see limitations.
• Non-experts compare AI to “I can’t build this at all” and see a revolution.
• Over time, the tools improve until even experts quietly rely on them.
Is AI actually helping build better AI?
Another controversial claim is that labs deliberately pushed coding capabilities first so AI could help build the next generation of AI – a kind of soft recursive self-improvement loop.
Skeptics argue this is “vibey nonsense”: current models can automate tedious coding, but they’re not inventing new fundamental architectures or rewriting the math of machine learning. Conceptual breakthroughs still come from human researchers.
But there’s growing evidence that AI is already a meaningful part of the AI development stack:
• AlphaEvolve (Google DeepMind) – An evolutionary coding agent powered by large language models. According to DeepMind, AlphaEvolve has improved data center efficiency, chip design, and AI training pipelines – including the training of the very language models that power AlphaEvolve itself.
• AutoML and architecture search – As far back as 2017, Google’s AutoML systems were automatically designing neural network architectures that matched or beat human-designed ones for tasks like image recognition.
• Anthropic’s internal tooling – Leadership at Anthropic has stated that AI now writes the majority of the code for Anthropic’s own products, including the systems that support new model development.
• Sakana AI’s research – Before being acquired by Google DeepMind, Sakana AI demonstrated systems that could generate research ideas, code up experiments, run them, and write papers end-to-end. Their Darwin Gödel-style systems explicitly focused on agents that rewrite their own scaffolding code to improve their ability to code further.
None of this means AI is autonomously inventing entirely new paradigms of intelligence. But it does mean that:
• AI is optimizing the software and infrastructure used to train AI.
• AI is accelerating the coding work around new models and experiments.
• The human research loop is being tightened by AI tools at multiple layers.
That’s a softer, more incremental form of recursive improvement – but it’s real, and it compounds.
Beyond coding: math, science, and real-world impact
Coding isn’t the only domain where recent models have surprised people.
Mathematics. Large language models have started solving problems that were previously open in collections like Terence Tao’s curated Erdős problem lists. Models have also achieved gold-medal performance on the International Mathematical Olympiad (IMO) by reading problems in natural language, something many experts thought was decades away.
Scientific discovery. DeepMind’s work on tools like AlphaFold and GNoME shows AI predicting protein structures and proposing millions of new crystalline materials, the equivalent of centuries of human experimentation. These systems don’t just regurgitate known facts; they propose new candidates, some of which are then tested in robotic labs.
Enterprise and geopolitics. On the business side, AI is already reshaping customer support, marketing, content creation, analytics, and internal tooling. Some sectors of SaaS have seen major valuation hits as investors price in AI-driven disruption. Governments are fighting over access to top models and GPUs; there are reports of models being used in military planning and operations. And as of early 2026, over a billion people are estimated to use AI chatbots monthly.
In other words, even if you believe pure model scaling has slowed, the system-level impact of AI – across software, science, and geopolitics – is still accelerating.
What investors are really nervous about
Another point of confusion is the state of the AI industry itself. Some argue that post‑GPT‑4, AI labs are struggling: big, expensive training runs that don’t deliver proportional gains; a shift to incremental post‑training tweaks; and investors getting nervous about where sustainable revenue will come from.
There is some truth here. Several high-profile mega-model projects reportedly under-delivered relative to their cost. Benchmarks can be gamed. And public markets do worry about whether AI spending will translate into durable profits.
But at the same time:
• Tech giants like Microsoft, Google, Amazon, and Meta are collectively planning well over $300 billion in AI infrastructure build-out.
• OpenAI has closed some of the largest private funding rounds in history, with valuations and capital raises that dwarf typical tech startups.
• Anthropic’s revenue has reportedly grown at roughly 10x per year for multiple years, reaching into the tens of billions with hundreds of enterprise customers spending over $1M annually.
So yes, investors are nervous – but they’re also writing some of the biggest checks in tech history. That combination usually signals not a dying hype cycle, but a high-stakes transition from speculative promise to entrenched infrastructure.
Why smart people disagree so strongly about AI right now
Underneath all of this is a deeper issue: we’re in a phase where reality is changing faster than our mental models. Reasonable, smart observers can look at the same landscape and come away with opposite conclusions.
Some focus on:
• The flattening of traditional scaling curves
• The messiness of real-world deployments
• The gap between hypey narratives and cautious enterprise adoption
Others focus on:
• The lived experience of building real products with agents
• Independent benchmarks that suddenly go from near-zero to saturation
• Concrete examples of AI accelerating AI development, science, and software
Both perspectives capture part of the truth. Pre‑training scaling alone is no longer the whole story. But neither is “it’s all just vibes and benchmarks.” The action has moved into how models are used, how they’re scaffolded, and how they’re integrated into workflows – especially in coding and research.
If you’re trying to navigate this moment as a developer, founder, or knowledge worker, the most practical stance is:
• Don’t assume the hype is fully accurate.
• Don’t assume the skeptics are fully right either.
• Get hands-on with the latest tools and agents in your own domain.
• Compare them not to perfection, but to what you could do without them.
That’s where the real signal is – not in abstract arguments, but in what you can actually build in a weekend now that would have taken a team months just a couple of years ago.
And if you’re interested in how different frontier models stack up for creative and technical tasks, you might like our head-to-head experiments such as ChatGPT vs Grok on recreating complex games in 90 minutes.