DeepSeek V4 is here: open-source powerhouse with 1M token context

24 May 2026 10:37

DeepSeek V4 has arrived with two massive open-source models, a 1M token context window, and performance that rivals top closed models. Here’s what’s new, how it performs, and why its agentic capabilities and efficiency really stand out.

DeepSeek has launched V4, and it’s another big moment for open-source AI. The new release brings two huge models, a 1 million token context window, aggressive pricing, and performance that in many cases comes close to — or even rivals — top closed models.

Two Massive Models, Fully Open-Source

DeepSeek is sticking to its open-source roots. With V4, they’re not just releasing the fine-tuned models, but also the base model weights, making it much easier for developers and companies to fine-tune their own variants.

The lineup includes two models:

DeepSeek V4 Pro – ~1.6 trillion parameters, the flagship model meant to compete with top closed-source systems.
DeepSeek V4 Flash – ~284 billion parameters, a more manageable but still very capable model.

Both models support a 1 million token context window, which is particularly useful for long documents, large codebases, and complex multi-step tasks.

Performance, Pricing, and Hardware

DeepSeek positions V4 as being only about 3–6 months behind leading closed models in capability, which is impressive for an open-weight release. In many benchmarks, the Pro model is competitive with top-tier systems, especially in agentic and tool-using scenarios.

Pricing and Context Costs

Both Pro and Flash offer the full 1M context, and pricing is notably lower than what most Western providers charge at similar performance levels. The rough figures mentioned are:

Input tokens (V4): around $0.15 per million tokens.
Output tokens: around $3.50–$4.00 per million tokens.
Cache hits vs. misses: input tokens served from cache (hits) are billed well below uncached input (misses), with misses around $1.75 per million in some cases.

The catch: V4 tends to generate very long chains of thought and detailed reasoning, which means it can be token-hungry. Even with good per-token pricing, heavy usage can add up if you let it think out loud too much.
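
To make the token-hungry point concrete, here is a minimal cost sketch in Python. It uses the rough per-million figures quoted above as defaults, and it assumes that reasoning tokens are billed at the output rate, which this article does not confirm. The `Pricing` and `estimate_cost` names are illustrative, not part of any DeepSeek SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    # Rough figures quoted in this article, in USD per million tokens.
    input_per_million: float = 0.15
    output_per_million: float = 4.00


def estimate_cost(input_tokens: int, reasoning_tokens: int,
                  answer_tokens: int, pricing: Pricing = Pricing()) -> float:
    """Estimate the cost of one request in USD.

    Assumption: chain-of-thought (reasoning) tokens are billed like
    ordinary output tokens.
    """
    billed_output = reasoning_tokens + answer_tokens
    input_cost = input_tokens / 1_000_000 * pricing.input_per_million
    output_cost = billed_output / 1_000_000 * pricing.output_per_million
    return input_cost + output_cost


if __name__ == "__main__":
    # Same prompt and same answer length, different amounts of "thinking".
    terse = estimate_cost(100_000, reasoning_tokens=2_000, answer_tokens=1_000)
    verbose = estimate_cost(100_000, reasoning_tokens=40_000, answer_tokens=1_000)
    print(f"terse:   ${terse:.3f}")
    print(f"verbose: ${verbose:.3f}")
```

Under these assumptions, the output side dominates quickly once the model reasons at length, so capping or monitoring reasoning length is the main lever on spend.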

Hardware and the Chinese NPU Angle

DeepSeek confirms that V4 has been validated on both NVIDIA GPUs and Huawei Ascend NPUs, a notable milestone for the Chinese AI hardware ecosystem. They don’t disclose the exact training hardware this time, but they do say that:

Huawei Ascend NPUs are already good enough to handle inference loads.
DeepSeek is currently compute-constrained, and service capacity for Pro is limited.
Once the Ascend 950 "super nodes" launch in the second half of the year, they expect Pro pricing to drop significantly.

Big Efficiency Gains: FLOPs and KV Cache

One of the most impressive aspects of DeepSeek V4 is how efficiently it handles long context and KV caching.

Both Pro (1.6T) and Flash (284B) were trained on roughly 32–33 trillion tokens. Despite the huge size, DeepSeek has managed to dramatically cut compute and memory requirements compared to its previous generation (V3.2):

V4 Pro uses about 27% of the FLOPs of DeepSeek V3.2 for a 1M token context.
For KV cache memory, V4 Pro uses about 10% of V3.2’s requirements at 1M tokens.
V4 Flash is roughly one-third the size of V3.2 but uses only about 10% of the FLOPs and around 7% of the KV cache at 1M tokens.

These gains are partly driven by architectural changes such as compressed sparse attention, which significantly reduces memory usage for the KV cache while still handling very long sequences.
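
The percentages above are easier to reason about as scaling factors. The sketch below simply applies them to a baseline you supply. The baseline value in the usage example is a made-up placeholder, not a measured V3.2 figure, and the function names are illustrative.

```python
# Relative cost at a 1M-token context, expressed as a fraction of
# DeepSeek V3.2, using the approximate figures quoted in this article.
RELATIVE_TO_V32 = {
    "V4 Pro": {"flops": 0.27, "kv_cache": 0.10},
    "V4 Flash": {"flops": 0.10, "kv_cache": 0.07},
}


def scaled(baseline: float, model: str, resource: str) -> float:
    """Scale a V3.2 baseline (any unit) to the given V4 model."""
    return baseline * RELATIVE_TO_V32[model][resource]


def reduction_factor(model: str, resource: str) -> float:
    """How many times smaller the requirement is compared with V3.2."""
    return 1.0 / RELATIVE_TO_V32[model][resource]


if __name__ == "__main__":
    hypothetical_v32_kv_gb = 100.0  # placeholder baseline, not a real measurement
    for model in RELATIVE_TO_V32:
        kv = scaled(hypothetical_v32_kv_gb, model, "kv_cache")
        fewer_flops = reduction_factor(model, "flops")
        print(f"{model}: ~{kv:.0f} GB KV cache, ~{fewer_flops:.1f}x fewer FLOPs")
```

Read this way, "27% of the FLOPs" is roughly a 3.7x reduction, and "10% of the KV cache" is a 10x reduction in memory for the same context length.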

Benchmarks: Knowledge vs. Agentic Capabilities

DeepSeek’s own benchmarks split performance into two broad areas: knowledge & reasoning and agentic capabilities (planning, tool use, multi-step workflows).

Knowledge and Reasoning

On classic knowledge and reasoning benchmarks, V4 Pro is strong but not always the top performer:

It tends to land around the level of models like Gemini 3.1 Pro or recent Claude Opus releases, depending on the specific benchmark.
On some factual question-answering tasks, such as simple QA and verified QA benchmarks, Gemini 3.1 Pro still comes out ahead.
For Chinese-language tasks, V4 performs extremely well, often beating models from other labs, with Gemini 3.1 Pro ahead only in certain cases.

As always, the real test is how it performs on your own workloads.

Agentic and Tool-Using Strengths

Where DeepSeek V4 really shines is in agentic use cases — tasks that require planning, multi-step reasoning, and tool usage:

On agentic benchmarks, V4 Pro stands out as one of the strongest models.
Interestingly, V4 Flash also performs very well here, coming surprisingly close to Pro on many agentic tasks.

This opens up some powerful hybrid setups. For example, you could:

Use a model like Claude Opus or another top planner for high-level planning.
Then hand off implementation and coding to DeepSeek V4 Pro or Flash, which are faster and cheaper while still very capable.
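
The hand-off described above can be sketched as a small planner/executor loop. This is a minimal illustration, not a recommended architecture: `run_hybrid` and the `Model` type are hypothetical names, and each model is represented as a plain function from prompt to text so that any client (a hosted planner, a DeepSeek V4 endpoint, or a local stub) can be plugged in.

```python
from typing import Callable, List

# A "model" here is any function that maps a prompt to a text reply.
Model = Callable[[str], str]


def run_hybrid(task: str, planner: Model, executor: Model) -> List[str]:
    """Ask the planner for steps, then send each step to the executor."""
    plan = planner(f"Break this task into numbered steps, one per line:\n{task}")
    # Keep non-empty lines only; each one is treated as a single step.
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    results = []
    for step in steps:
        prompt = f"Overall task: {task}\nImplement this step: {step}"
        results.append(executor(prompt))
    return results


if __name__ == "__main__":
    # Stub models so the sketch runs without any API access.
    def fake_planner(prompt: str) -> str:
        return "1. Write the parser\n2. Add unit tests"

    def fake_executor(prompt: str) -> str:
        return "[code for] " + prompt.splitlines()[-1]

    for output in run_hybrid("Build a CSV-to-JSON CLI", fake_planner, fake_executor):
        print(output)
```

Keeping the models behind a simple callable makes it cheap to swap Pro for Flash on the executor side and compare cost and quality on the same plan.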


Real-World Tests: Coding, Web Apps, and 3D Visualizations

Early hands-on tests with DeepSeek V4 show a model that’s fast at inference but deliberate when it “thinks.” It often spends a long time in chain-of-thought mode, generating detailed reasoning before outputting final code or text.

Complex Website Generation

When asked to build a detailed website with very specific instructions, V4:

Took around two minutes of "thinking" before producing the final code.
Produced a functional site with interactive elements like toggle buttons that actually worked — something many open-weight models often fail at.
Followed detailed design instructions well, generating a polished layout when guided precisely.

However, when prompts were left vague, the model tended to produce more generic, low-effort "AI slop" designs. The takeaway: the more specific your instructions, the better V4 performs.

3D Scene with Progressive Creation

In another test, V4 was asked to build a progressively created "voxel pagoda garden" using Three.js:

It correctly chose Three.js as the right tool for the job.
The model took about four minutes to think and generate the code.
The final result implemented the requested functionality, though the visual design wasn’t particularly strong.

Again, functionality and instruction-following were solid, while aesthetics were more average.

Real-Time ISS Tracker

Another test asked V4 to build an app that tracks the real-time position of the International Space Station using an external API and updates every five seconds:

The app successfully called the API and displayed the ISS’s latitude and longitude.
It rendered the Earth with reasonably accurate continents, which many models struggle with.
It added a countdown-style tracker for the next update and allowed zooming in and out.

There were still some quirks — such as duplicated location components and slightly off positioning — but for a first attempt without any custom agent harness, the result was impressive.
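
For reference, the core of such a tracker is a short polling loop. The sketch below is in Python rather than the browser code V4 generated, and it assumes the public Open Notify endpoint and its `iss_position` response shape; the article does not say which API the test actually used. The `sleep` and `fetch` parameters are injectable so the loop can be exercised without network access.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator, Tuple

# Assumed endpoint; the test described above may have used a different API.
ISS_URL = "http://api.open-notify.org/iss-now.json"

Position = Tuple[float, float]


def parse_position(payload: dict) -> Position:
    """Extract (latitude, longitude) from an Open Notify style payload."""
    pos = payload["iss_position"]
    return float(pos["latitude"]), float(pos["longitude"])


def fetch_position(url: str = ISS_URL, timeout: float = 10.0) -> Position:
    """Fetch the current ISS position over HTTP."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_position(json.load(resp))


def poll(fetch: Callable[[], Position], interval: float = 5.0,
         updates: int = 3,
         sleep: Callable[[float], None] = time.sleep) -> Iterator[Position]:
    """Yield `updates` positions, waiting `interval` seconds between them."""
    for i in range(updates):
        yield fetch()
        if i < updates - 1:
            sleep(interval)


if __name__ == "__main__":
    for lat, lon in poll(fetch_position):
        print(f"ISS at lat={lat:.4f}, lon={lon:.4f}")
```

A real app would add error handling for failed requests and would deduplicate the UI components that display the position, which is where the quirks noted above showed up.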

Agent Harnesses and Future Potential

Right now, DeepSeek hasn’t released its own dedicated "agent harness" for V4. However, the model’s behavior suggests it could be very strong when paired with an external orchestration layer.

You should be able to plug V4 into existing frameworks and harnesses such as:

Cloud-based code agents that manage multi-step coding tasks.
Open-source orchestration frameworks like OpenAI-compatible tool routers or custom agent stacks.
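
As a rough idea of what the thinnest possible harness looks like, here is a tool-dispatch sketch. It assumes V4 is served behind an OpenAI-compatible chat API that emits tool calls in the usual `{"id", "function": {"name", "arguments"}}` shape, which this article does not confirm. The `tool` decorator, `dispatch` function, and `count_lines` tool are hypothetical names for illustration.

```python
import json
from typing import Any, Callable, Dict

# Registry of callable tools, keyed by the name the model will use.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as a tool under its own name."""
    TOOLS[fn.__name__] = fn
    return fn


@tool
def count_lines(text: str) -> int:
    """Toy tool: count the lines in a piece of text."""
    return len(text.splitlines())


def dispatch(tool_call: dict) -> dict:
    """Run one OpenAI-style tool call and build the reply message."""
    spec = tool_call["function"]
    name = spec["name"]
    if name not in TOOLS:
        content = json.dumps({"error": f"unknown tool: {name}"})
    else:
        # Arguments arrive as a JSON-encoded string in this format.
        args = json.loads(spec["arguments"] or "{}")
        content = json.dumps({"result": TOOLS[name](**args)})
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": content}


if __name__ == "__main__":
    call = {
        "id": "call_1",
        "type": "function",
        "function": {"name": "count_lines",
                     "arguments": json.dumps({"text": "a\nb\nc"})},
    }
    print(dispatch(call))
```

In a full loop, the returned message is appended to the conversation and sent back to the model until it stops requesting tools.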

Given its long context, efficient KV cache, and strong tool-using behavior, V4 looks particularly promising for:

Large codebase refactors and analysis.
Complex web apps that require multiple tools and APIs.
Research workflows that involve long documents and iterative reasoning.

Why DeepSeek V4 Matters

DeepSeek V4 lands on the same day as other major closed releases like GPT 5.5, but for the open-source ecosystem, V4 may be the more important event:

It delivers near-frontier performance with fully open weights, including base models.
It pushes efficiency in FLOPs and KV cache for 1M context to new levels.
It shows strong agentic capabilities, especially in coding and tool use.
It signals growing maturity in the Chinese AI hardware stack with validation on Huawei Ascend NPUs.

If you care about building your own stacks, fine-tuning, or running powerful models on your own infrastructure, DeepSeek V4 is a release worth paying close attention to. The next wave of experiments will likely explore how far its agentic abilities can go once plugged into robust tool and workflow systems.