How DeepSeek cut prices with prompt caching (and how you can too)

10 Jun 2026 12:37 5,038 views
DeepSeek’s massive price cut isn’t magic—it’s architecture plus smart prompt design. This guide explains how prompt caching works, why DeepSeek can offer such low cached prices, and the practical habits you need to actually see those savings in your own agents and apps.

DeepSeek just slashed prices on its V4 Pro model by around 75% at the exact moment other major labs are raising theirs. Google’s Gemini 3.5 Flash is pricier than its preview, and OpenAI’s GPT‑5.5 costs roughly double GPT‑5.4 per token. So how is one provider cutting prices while everyone else is going up—and how can you, as a builder, claw back some of those savings even if your provider is getting more expensive?

The answer is prompt caching. On DeepSeek, it’s baked into the architecture. On other providers, it’s all about how you structure your prompts and agents. This article breaks down how caching works, why DeepSeek can charge so little for cached tokens, and the practical rules you should follow to avoid accidentally paying full price on every turn.

How an LLM request actually runs: prefill vs decode

Every large language model request has two very different phases under the hood: prefill and decode. Understanding the difference is key to understanding caching.

Phase 1: Prefill (compute-bound)

In the prefill phase, the model takes your entire prompt—system message, tools, context, conversation history—and processes all of it in parallel. Every token can attend to every earlier token, and this attention operation grows roughly with the square of the number of tokens.

Because of that, prefill is mostly limited by raw compute. GPUs are crunching through a big batch of work before you see a single output token. On long prompts (say 10,000 tokens), most of your time-to-first-token is spent in this prefill step.

Phase 2: Decode (memory-bandwidth-bound)

Once prefill is done, the model enters decode. Now it generates tokens one at a time. Each step doesn’t require much compute, but every step has to reread the full model weights from high-bandwidth memory (HBM). That makes decode limited by memory bandwidth, not compute.

The key takeaway: if you can skip prefill for a request—by reusing work from an earlier, identical prefix—you cut both latency and cost at the same time. Prompt caching is exactly about skipping that expensive prefill.

What actually gets cached: KV cache explained

To see what’s being reused, we need to look at how transformers process tokens. For every token, the model produces three vectors: a query (Q), a key (K), and a value (V).

The query is used immediately and then discarded. The key and value vectors, however, describe how that token will influence attention for all future tokens. They don’t depend on future tokens—only on the token itself, its position, and the model’s weights. That means once they’re computed, they stay valid for any continuation where that token stays in the same position.

Modern inference engines store these key and value vectors in what’s called the KV cache. If a new request shares its first N tokens with a previous request, the K and V vectors for those tokens are bit-identical. Recomputing them would be pure waste.

This isn’t a tiny micro-optimization. A recent paper, “Don’t Break the Cache,” measured 41–80% cost savings across 500 agent sessions on three different providers just by preserving cache hits. For long-running agents, caching is the difference between affordable and painful.

Why DeepSeek can offer ultra-cheap cached tokens

DeepSeek’s big price cut on V4 Pro isn’t a marketing stunt or a subsidy; it’s an architectural choice. The core idea: they made the KV cache so small and cheap to store that it’s faster and cheaper to load from disk than to recompute prefill on a GPU.

Multi-Head Latent Attention (MLA)

Starting with V2, DeepSeek introduced an attention variant called Multi-Head Latent Attention (MLA). Compared to standard multi-head attention, MLA shrinks the KV cache by around 93%. That’s a huge reduction in how much data needs to be stored per token.

Because the cache is so compact, it no longer needs to live in expensive HBM on the GPU. Instead, DeepSeek stores it on a distributed disk array—basically a pool of relatively cheap disks across nodes.

With MLA, streaming the KV cache from disk is actually faster than recomputing prefill from scratch on the GPU. That’s why DeepSeek can price cached tokens at a fraction of uncached tokens: they’re literally renting disks instead of burning GPU compute and HBM bandwidth.

If you want a broader look at how this fits into DeepSeek’s overall strategy and pricing, you may also find this breakdown of DeepSeek’s $10 billion valuation and rational pricing helpful.

Architecture is only half the story: you have to keep the cache alive

Even with a great caching system on the provider side, you only save money if your prompts and agents are designed to hit the cache. That’s where most people lose out: small, seemingly harmless changes can invalidate the cache and force a full, expensive prefill.

A useful mental model is to think of your prompt as layered from most stable to most volatile. This is how Anthropic structures Claude Code, and it’s a good general pattern for any agent or coding assistant.

The four-layer prompt structure for cache-friendly agents

Claude Code organizes each request into four layers, ordered from most stable (rarely changes) to most volatile (changes every turn):

Layer 1: Static system prompt and tool definitions

This includes the core system instructions and the definitions of all tools the agent can call. It’s the most stable layer and can be cached across all sessions. Changing anything here forces a full cache rebuild.

Layer 2: Project context (e.g., cloud.md)

This is project-level context: files like cloud.md or other configuration that describes the repository, coding style, or rules. It’s cached within a project, so as long as it stays the same, many sessions can reuse it.

Layer 3: Session context

This layer holds session-specific state: what this particular agent run is working on, current goals, or task-level notes. It’s cached within a session.

Layer 4: Conversation history

This is the actual back-and-forth: user messages, tool calls, and assistant replies. It’s the most volatile layer and grows with each turn.

The rule of thumb: changes at a lower layer (like conversation messages) should not force changes at higher layers (like the system prompt). If you keep the top layers stable, you preserve more of the cache.

Five common ways you accidentally bust the cache

Once you understand the layers, you can see how easy it is to break caching mid-session. Here are five common pitfalls that apply to most providers, not just Anthropic.

1. Switching models mid-conversation

Each model has its own cache. If you’re 100,000 tokens into a conversation with a large model (say Claude Opus) and then switch to a smaller one (like Haiku) for a simple question, the smaller model has to rebuild the cache from zero.

That means the “cheaper” model can actually be more expensive for that turn than just letting the original model answer, because you lose all your cached prefill work.

Instead of switching models inside a session, pick your model upfront. If you truly need a different model for a subtask, spin up a separate sub-agent that starts its own conversation with its own cache.

2. Adding or removing tools

Tool definitions live in the system prompt layer. Any change to the tool list—adding a new tool, removing one, or even tweaking parameters—changes that layer and invalidates the entire cache.

Even something like an MCP (Model Context Protocol) server crashing and reconnecting can look like a new tool definition to the model, which again busts the cache.

Best practice: connect all your tools and MCP servers at the start of the session and don’t touch the tool list mid-run.

3. Putting timestamps or dynamic data in the system prompt

It’s tempting to embed things like “Current time: 2026-06-10 14:32” directly into the system prompt. But if that value updates every minute, you’ve just turned your most stable layer into your most volatile one.

Every time the timestamp changes, the system prompt changes, and every downstream cache hit is invalidated. You end up paying full price repeatedly for what should be shared work.

Instead, keep the system prompt static and pass dynamic information like time as regular messages (more on that below).

4. Naive conversation compaction

Compaction is when you summarize a long conversation so the prompt doesn’t grow forever. The naive way is to call a separate summarization endpoint with a different system prompt and no tools, then replace the history with that summary.

The problem: that separate call has zero cache overlap with the main conversation. You pay full price to reprocess the entire history just to summarize it.

Also, if you compact mid-task, you may lose useful context or break the flow for the model. Compaction should happen at natural breaks between tasks, not in the middle of one.

5. Upgrading your agent’s system prompt or toolset

If you ship a new version of your agent that changes the system prompt or tool definitions, the first request after the upgrade has to rebuild the cache from scratch. That’s unavoidable—but it’s worth being aware of.

Plan for a cache rebuild after upgrades, and avoid frequent, small tweaks to the system prompt in production if you care about cost.

The golden rule: use messages, not system prompt edits

All of these pitfalls point to one unifying principle: when you want to update the model about what’s happening, prefer adding messages over editing the system prompt.

Editing the system prompt or tools touches the most expensive, most shared layer of the cache. Adding a message only affects the conversation layer, which is expected to change every turn.

Examples of cache-friendly updates

Here are some concrete patterns that Claude Code uses and that you can copy:

  • Time of day: Instead of updating the system prompt with the current time, insert a short system-style note as the next message, e.g., “System: The current local time is 3:15 PM.”

  • File edits: When a user edits a file, don’t rewrite the prompt. Append a system reminder like “System: The file main.py was updated; please use the latest version.”

  • State changes: Any time the world changes (flags, modes, user preferences), record it as a message instead of mutating the prefix.

This pattern lets the model stay fully informed while preserving the expensive cached prefix.

Two Claude Code features designed around caching

Anthropic’s Claude Code is a good case study in what cache-aware design looks like in practice. Two of its features—plan mode and cache-safe compaction—are built specifically to avoid breaking the cache.

Plan mode without changing tools

Plan mode is a state where the agent plans but doesn’t execute destructive actions. The obvious implementation would be to swap the toolset to a read-only subset when plan mode is on. But that would change the system prompt and bust the cache every time the user toggles it.

Instead, Claude Code implements plan mode as two tools: enter plan mode and exit plan mode. The tool definitions themselves never change. When the user toggles plan mode, the agent receives a system-style message in the conversation explaining that it is now in or out of plan mode.

Result: the cache stays intact, even as behavior changes.

Cache-safe compaction

The naive compaction approach, as mentioned earlier, uses a separate API call with a different system prompt and no tools. That destroys cache overlap with the main conversation.

Claude Code does it differently. It uses the exact same system prompt, tools, and prefix as the parent conversation and simply appends a final user message asking the model to summarize or compact the history.

From the API’s perspective, it’s just “the last request plus one more message,” so the cache is preserved. Anthropic has exposed this primitive in their public API, so if you’re building your own agent loop, you can follow the same pattern.

If you’re interested in how Claude and DeepSeek are competing on price and capabilities, you might also want to read this comparison of DeepSeek V4 vs GPT‑5.5 and the emerging AI stack war.

Two useful behaviors: editing project context and rewinding

There are a couple of subtle behaviors around project context and rewinding that are worth knowing about if you’re trying to keep your cache healthy.

Editing project context files mid-session

In Claude Code, editing a file like cloud.md during a session does not immediately bust the cache—but it also doesn’t take effect right away. The file is read once when the session starts and then held in memory.

To apply changes, you need to either:

  • Restart the session, or

  • Trigger a compaction or similar reset command.

If you change a rule in cloud.md and the agent keeps behaving as before, that’s why: it’s still using the cached, in-memory version.

Rewind to reuse earlier cache entries

Rewind (often exposed as a /re or similar command) truncates the conversation back to an earlier turn. The remaining history is exactly what the cache was built from at that point.

That means the next request can reuse the earlier cache entry instead of starting over. Anthropic explicitly recommends using rewind (or a combination of rewind and compaction) when you want to abandon a path entirely and go back to a known-good state.

Practical checklist: how to actually save money with caching

Putting it all together, here’s a simple discipline you can follow to get real cost savings from prompt caching, whether you’re on DeepSeek or another provider:

  • Pick your model at the start. Avoid switching models mid-session. Use sub-agents if you need different models for different tasks.

  • Connect all tools and MCP servers upfront. Don’t add, remove, or reconfigure tools during a session unless you’re okay paying for a full cache rebuild.

  • Keep the system prompt static. No timestamps, no constantly changing metadata. Put dynamic information into messages instead.

  • Use messages to reflect world changes. File edits, mode toggles, time of day, user settings—all should be appended as messages, not baked into the prefix.

  • Compact at natural task boundaries. And when you do, try to use the same system prompt and tools so you preserve cache overlap.

  • Use rewind to abandon bad paths. Truncate back to a good state to reuse existing cache entries instead of starting a brand-new session.

On the provider side, DeepSeek’s MLA and disk-based KV caching explain how they can offer such aggressive discounts on cached tokens. On your side, it comes down to prompt discipline. If you respect the cache, you can turn rising list prices into manageable, even dramatically lower, effective costs for your agents and applications.

Share:

Comments

No comments yet. Be the first to share your thoughts!

More in DeepSeek