Save 98% on AI agent tokens with smarter MCP patterns

04 Jun 2026 10:37

MCP servers can quietly burn through your context window before your AI agent even starts working. This guide walks through 10 practical patterns—from code execution to tool search, grouping, and TOON—that can cut token usage by up to 98% while making agents cheaper, faster, and more reliable.

Modern AI agents are powerful, but they can be surprisingly wasteful with tokens. If you plug in multiple MCP servers, your context window can get flooded with tool definitions and formatted data before the agent even sends a single useful message.

The good news: with a few smart patterns, you can cut token usage by up to 98% while making your agents cheaper, faster, and often more accurate. This guide walks through 10 practical techniques you can stack together in your own MCP setup.

Why MCP agents burn so many tokens

MCP (Model Context Protocol) servers expose tools to your AI agent—things like Google Drive, Salesforce, web scrapers, or internal APIs. By default, the model gets full tool definitions for every connected server: names, descriptions, input schemas, and more.

That means:

Large multi-server setups can add tens of thousands of tokens of tool metadata before any real work starts.
Selection accuracy drops as the number of tools grows (often after 30–50 tools).
Intermediate data from tools (like formatted web pages or JSON) further bloats the context.

The techniques below all aim to fix the same core problem: only send the model what it actually needs, when it needs it.

1. Use code execution instead of loading every tool

The most powerful pattern is to treat your MCP servers like a file system inside a sandboxed code environment. Instead of loading all tool definitions into the context, you let the model explore and execute code that calls tools as needed.

Here’s how it works conceptually:

Each tool is represented as a file (for example, a TypeScript file).
Each MCP server becomes a folder (e.g., /google-drive, /salesforce).
The agent discovers tools by listing and reading files only when required.

Anthropic’s example shows the impact clearly: moving a Google Drive document into Salesforce with direct tool calls can consume ~150,000 tokens of context. With code execution, only the result of the code reaches the model, around 2,000 tokens, which is roughly a 98% reduction.

Extra benefits:

Pre-filtering in code: Filter or aggregate large datasets before they ever hit the model.
Loops and conditionals in code: Avoid round-tripping to the model for every small step.
Better privacy: Sensitive intermediate data (emails, phone numbers) can stay inside the execution environment.

The trade-off is complexity: you need a real sandbox with isolation, resource limits, and a secure way to expose MCP tools as files. But if you’re building serious agents, this pattern is worth the investment. For more ideas on this style of workflow, see these Claude Code workflow tricks.
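As a rough illustration of the file-system idea, the sketch below builds a tiny tool tree in a temporary directory and walks through the discovery steps an agent would take: list the servers, list one server’s tools, then read a single tool file. The folder layout, file contents, and helper names (build_tree, list_servers, list_tools, read_tool) are all hypothetical and are not Anthropic’s implementation; a real setup would also need the sandboxing described above.

```python
import tempfile
from pathlib import Path

# Hypothetical tool tree: one folder per MCP server, one file per tool.
TOOL_TREE = {
    "google-drive": {
        "getDocument.ts": "// Fetch a document by ID\nexport async function getDocument(id: string) {}",
    },
    "salesforce": {
        "updateRecord.ts": "// Update a record\nexport async function updateRecord(id: string, data: object) {}",
    },
}


def build_tree(root: Path) -> None:
    """Write the hypothetical tool files to disk."""
    for server, tools in TOOL_TREE.items():
        (root / server).mkdir(parents=True, exist_ok=True)
        for filename, source in tools.items():
            (root / server / filename).write_text(source)


def list_servers(root: Path) -> list[str]:
    """Step 1: the agent sees only folder names."""
    return sorted(p.name for p in root.iterdir() if p.is_dir())


def list_tools(root: Path, server: str) -> list[str]:
    """Step 2: the agent lists files for the one server it cares about."""
    return sorted(p.name for p in (root / server).iterdir())


def read_tool(root: Path, server: str, tool: str) -> str:
    """Step 3: only now does a full tool definition enter the context."""
    return (root / server / tool).read_text()


ROOT = Path(tempfile.mkdtemp())
build_tree(ROOT)
print(list_servers(ROOT))
print(list_tools(ROOT, "salesforce"))
print(read_tool(ROOT, "salesforce", "updateRecord.ts"))
```

Only the definitions the agent actually reads cost tokens; the rest of the tree stays on disk.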

2. Add tool search so agents discover tools on demand

Another powerful approach is Anthropic’s tool search pattern. Instead of loading every tool definition up front, you give the model:

A small set of core tools.
A special "search" tool that lets it query a catalog of all available tools.

When the agent needs a new capability, it calls the search tool (similar to how Claude does file search) and loads only the relevant tools. Anthropic documents two variants:

Regex-based search for simple, structured matching.
BM25-based search for natural language ranking of tools.

Implementation-wise, you:

Add the search tool to your tool list.
Mark tools you don’t want loaded up front with a flag (e.g., defer_loading: true) so they are only loaded when a search returns them.

In Anthropic’s benchmarks, a typical multi-server setup might include ~55,000 tokens of tool definitions. Tool search cuts that by more than 85%, and it also improves tool selection accuracy once you go beyond a few dozen tools.
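To make the regex variant concrete, here is a minimal sketch of a search tool over an in-memory catalog. It is not Anthropic’s implementation: the catalog entries, the search_tools function, and the CORE_TOOLS list are hypothetical names used only for illustration.

```python
import re

# Hypothetical catalog: tool name -> one-line description.
CATALOG = {
    "drive_get_document": "Fetch a Google Drive document by ID",
    "salesforce_update_record": "Update a Salesforce record",
    "web_scrape_page": "Scrape a web page and return its text",
}

# Only the search tool itself is loaded up front in this sketch.
CORE_TOOLS = ["search_tools"]


def search_tools(pattern: str, limit: int = 5) -> list[str]:
    """Return names of catalog tools whose name or description matches the regex."""
    rx = re.compile(pattern, re.IGNORECASE)
    hits = [
        name
        for name, description in CATALOG.items()
        if rx.search(name) or rx.search(description)
    ]
    return hits[:limit]


# The agent starts with the core tools and adds matches on demand.
loaded = list(CORE_TOOLS)
loaded += search_tools(r"salesforce")
print(loaded)
```

A BM25 variant would replace the regex match with a ranking function over the same catalog; the on-demand loading flow stays the same.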

3. Scope loading with tool groups

If you want to stay closer to MCP’s original design—where servers tell clients what tools exist—scope loading is a clean middle ground. Instead of exposing every tool at once, you group related tools and only load the groups you actually need.

For example, imagine an MCP server that exposes:

E-commerce data tools
Finance data tools
Social media data tools

You can define these as separate groups and configure your client to load just the relevant group(s) for a given session.

Bright Data’s open source MCP server uses this pattern heavily, with more than 60 tools across 11 groups. You can:

Specify groups via a URL parameter (e.g., ?groups=ecommerce,finance).
Or set them via environment variables in your local config.

This gives you efficient token usage: only tools in the selected groups are loaded, and you can still combine multiple groups in a single session when needed.
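The group selection can be sketched as a small filter on the server side. This is an assumption-laden illustration, not Bright Data’s code: the group names, tool names, example URL, and the tools_for_url helper are hypothetical, and the sketch simply loads nothing when no groups parameter is present.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical groups and tool names.
TOOL_GROUPS = {
    "ecommerce": ["product_search", "product_reviews"],
    "finance": ["stock_quote"],
    "social": ["profile_lookup", "post_search"],
}


def tools_for_url(url: str) -> list[str]:
    """Return the tools for the comma-separated groups named in ?groups=..."""
    query = parse_qs(urlparse(url).query)
    requested = query.get("groups", [""])[0].split(",")
    tools: list[str] = []
    for group in requested:
        # Unknown or empty group names contribute no tools.
        tools.extend(TOOL_GROUPS.get(group.strip(), []))
    return tools


print(tools_for_url("https://mcp.example.com/mcp?groups=ecommerce,finance"))
```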

4. Load only specific tools for a session

You can take grouping one step further and specify the exact tools that should be available in a session. Instead of saying "load the finance group," you say "load these four tools by name."

In the Bright Data setup, this is done via a TOOLS environment variable where you list the tool names. If your app only needs 4 tools out of 60, you load just those 4—and you only pay tokens for those definitions.

The catch: you need to already know what tools exist on the server. This pattern is ideal for production agents with a clearly defined job, where discovery has already happened during development.
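A minimal sketch of explicit tool selection follows. It assumes the TOOLS variable holds a comma-separated list, which the text above does not specify; the tool names and the select_tools helper are hypothetical. Failing loudly on an unknown name reflects the catch mentioned above: the list is only useful if it matches what the server actually exposes.

```python
import os

# Hypothetical full tool set exposed by the server.
ALL_TOOLS = {
    "product_search": {"description": "Search products"},
    "product_reviews": {"description": "Fetch product reviews"},
    "stock_quote": {"description": "Get a stock quote"},
}


def select_tools(env=None) -> dict:
    """Return only the tools named in TOOLS (assumed comma-separated)."""
    env = os.environ if env is None else env
    names = [n.strip() for n in env.get("TOOLS", "").split(",") if n.strip()]
    unknown = [n for n in names if n not in ALL_TOOLS]
    if unknown:
        raise ValueError(f"Unknown tools requested: {unknown}")
    return {n: ALL_TOOLS[n] for n in names}


# Pass an explicit mapping here so the example does not depend on the shell.
print(list(select_tools({"TOOLS": "stock_quote, product_search"})))
```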

5. Use dynamic context loading in three stages

Dynamic context loading borrows the idea from Claude skills: reveal information to the model gradually, only when it proves relevant.

A simple three-level pattern looks like this:

Level 1 – Servers only: The model sees just the list of available MCP servers (e.g., "You can use: WebData, CRM, Docs").
Level 2 – Tool summaries: Once the model picks a server, it gets a list of that server’s tools with one-line summaries.
Level 3 – Full tool details: Only when it chooses a specific tool do you send the full name, description, and input schema.

This approach ensures that only truly relevant tool details enter the context. It also composes nicely with the grouping and explicit tool selection patterns above—they operate at different layers.
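The three levels can be sketched as three lookups over one registry, each returning a little more detail than the last. The registry contents and the function names (level1_servers, level2_summaries, level3_details) are hypothetical; a real client would fetch this data from the connected servers rather than from a hard-coded dictionary.

```python
# Hypothetical registry: server -> tool -> summary and full input schema.
REGISTRY = {
    "WebData": {
        "scrape_page": {
            "summary": "Fetch a page as text",
            "schema": {
                "type": "object",
                "properties": {"url": {"type": "string"}},
                "required": ["url"],
            },
        },
    },
    "CRM": {
        "find_contact": {
            "summary": "Look up a contact by email",
            "schema": {
                "type": "object",
                "properties": {"email": {"type": "string"}},
                "required": ["email"],
            },
        },
    },
}


def level1_servers() -> list[str]:
    """Level 1: server names only."""
    return sorted(REGISTRY)


def level2_summaries(server: str) -> dict[str, str]:
    """Level 2: one-line summaries for the chosen server."""
    return {name: tool["summary"] for name, tool in REGISTRY[server].items()}


def level3_details(server: str, tool: str) -> dict:
    """Level 3: the full definition for a single chosen tool."""
    entry = REGISTRY[server][tool]
    return {"name": tool, "description": entry["summary"], "input_schema": entry["schema"]}


print(level1_servers())
print(level2_summaries("WebData"))
print(level3_details("WebData", "scrape_page"))
```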

6. Package reusable skills for your agents

Skills are reusable capability bundles you can drop into different agents. Anthropic popularized this with the SKILL.md format: each skill is a folder containing a SKILL.md file with a YAML header plus Markdown instructions.

Bright Data ships a set of skills in this format, compatible with more than 40 coding agents via the Open Agent Skill Ecosystem. Each skill can describe:

What the skill does.
How and when the agent should use it.
Any relevant MCP tools or workflows.

By encapsulating complex behavior into skills, you reduce the need to repeat long instructions in every prompt, which helps keep your context lean and consistent across agents.

7. Use programmatic tool calling to hide intermediate steps

Programmatic tool calling is another Anthropic feature that pairs well with code execution. Instead of having the model call tools directly and see every intermediate result, you let it write code that calls tools as normal Python functions.

The key idea: intermediate tool outputs stay out of the model’s context. Only the final code output is sent back to the model. That means:

The model reasons over a few lines of summarized output instead of hundreds of kilobytes of raw data.
Multi-step workflows (like deep search or browsing) become far more efficient.

To enable this, you typically:

Add the code execution tool to your tools list.
Mark tools that can be called from code with something like allowed_callers: ["code_execution"].

Anthropic notes that on agentic search benchmarks (like BrowseComp and DeepSearchQA), adding programmatic tool calling on top of basic search was the key to unlocking full agent performance.

One limitation today: tools exposed via MCP connectors generally can’t be called programmatically yet. This pattern works best with tools you define directly in your own application.
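The effect on context size can be shown with a small simulation. Everything here is a stand-in: get_expenses plays the role of a tool exposed as a Python function, and model_written_script plays the role of code the model would write. Neither is part of any real API, and character counts are only a rough proxy for tokens.

```python
import json


def get_expenses(team: str) -> list[dict]:
    """Stand-in for a real tool; returns 200 rows per team."""
    return [{"team": team, "item": f"item-{i}", "amount": 10 * i} for i in range(1, 201)]


def model_written_script() -> str:
    """Stand-in for model-written code: loops over tools, returns only a summary."""
    totals = {}
    for team in ["sales", "support"]:
        rows = get_expenses(team)  # raw rows stay inside the sandbox
        totals[team] = sum(row["amount"] for row in rows)
    return json.dumps(totals)


# What direct tool calls would have put in the context (size in characters).
raw_chars = sum(len(json.dumps(get_expenses(t))) for t in ["sales", "support"])

# What actually reaches the model with programmatic calling.
result = model_written_script()
print(result)
print(f"raw: {raw_chars} chars, returned: {len(result)} chars")
```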

8. Consider a layered MCP server architecture

For large setups with many teams and servers, a layered MCP design can keep your main agent’s context clean. You can think of it as a sub-agent architecture with three layers:

Discovery: Knows what MCP servers and tools exist.
Planning: Decides which tools and steps are needed to solve a task.
Execution: Runs the actual tool calls and workflows.

The top-level orchestrator (your main agent) interacts mostly with the planning layer, sending high-level goals and receiving summarized results. Most of the noisy details—tool schemas, intermediate outputs, retries—stay inside the sub-agent layers.

This pattern is more complex and usually only makes sense when:

You have many underlying MCP servers.
Different teams own different tools and you want a clean interface in front of them.

For smaller setups, the simpler patterns above are usually enough.

9. Strip and trim output to save tokens

So far we’ve focused on what gets loaded up front (tool definitions and instructions). But you can also save a lot of tokens by cleaning up the data your tools return before sending it to the model, because tool results count against the model’s context just like definitions do.

Practical tactics include:

Remove heavy formatting: Strip Markdown, HTML, and extra styling from web results and documents. Plain text is usually enough for the model.
Light parsing of search results: For web search tools, keep just the top organic results and drop ads, related searches, and boilerplate.
Summarize before forwarding: If a tool returns a very long document, run a lightweight summarization step outside the model or via a smaller model before passing it into your main agent.

The exact savings depend on the data, but this kind of output trimming can cut a meaningful number of tokens from every response—especially for web-heavy MCP servers.
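The first two tactics can be sketched with the Python standard library alone. The helper names (html_to_text, top_organic) and the shape of the search results, including the "type" field, are assumptions made for this example; real search tools will label results differently.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping script and style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    """Strip tags and styling, keeping only the text content."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def top_organic(results: list[dict], limit: int = 3) -> list[dict]:
    """Keep only the first few organic results, with title and URL."""
    organic = [r for r in results if r.get("type") == "organic"]
    return [{"title": r["title"], "url": r["url"]} for r in organic[:limit]]


page = "<div><style>p{color:red}</style><p>Hello <b>world</b></p><script>track()</script></div>"
print(html_to_text(page))
```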

10. Use TOON instead of JSON for flat data

JSON is convenient, but it’s token-inefficient: every field name is repeated for every record. If you’re returning large lists of similar objects, this adds up quickly.

TOON (Token-Oriented Object Notation) solves this by declaring field names once at the top and then listing rows like a CSV:

Field names are listed a single time.
Each subsequent row is just a list of values in the same order.

For flat, uniform data (like product lists, transactions, or simple logs), TOON can reduce tokens by 30–60% compared to standard JSON.

Limitations:

TOON works best for flat, tabular data.
Deeply nested structures (like full LinkedIn profiles with nested experience, education, and skills) don’t compress as well and may not benefit much.

Still, if your MCP tools often return large, uniform datasets, TOON is an easy win.
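Here is a simplified sketch of the idea for flat rows. It follows the header-then-rows shape described above but is not a full TOON encoder: the to_toon helper is hypothetical, it does no quoting or escaping, and it rejects rows whose fields differ. The sample data is made up, and character counts are only a rough proxy for tokens.

```python
import json


def to_toon(name: str, rows: list[dict]) -> str:
    """Encode uniform, flat rows as a field header followed by value rows."""
    if not rows:
        raise ValueError("need at least one row")
    fields = list(rows[0].keys())
    if any(list(row.keys()) != fields for row in rows):
        raise ValueError("all rows must share the same fields in the same order")
    # Field names appear once, in the header.
    header = name + "[" + str(len(rows)) + "]{" + ",".join(fields) + "}:"
    # Each row is just values, in header order (no escaping in this sketch).
    lines = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *lines])


products = [
    {"id": 1, "name": "Widget", "price": 9.99},
    {"id": 2, "name": "Gadget", "price": 24.5},
]

toon = to_toon("products", products)
print(toon)
print(len(json.dumps(products)), "chars as JSON vs", len(toon), "chars as TOON")
```

The gap widens as the number of rows grows, since JSON repeats every field name per record while the header is paid for once.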

How to combine these techniques in practice

You’ll get the best results by stacking several of these patterns rather than relying on just one. A practical setup might look like this:

Connection layer: Use tool groups (and, in production, explicit tool lists) to limit what loads per session.
Discovery layer: Add tool search for tools that don’t fit neatly into groups or that change often.
Execution layer: Use programmatic tool calling for multi-step workflows and heavy data processing.
Output layer: Strip formatting from tool outputs, lightly parse web data, and use TOON for flat tabular responses.
Advanced setups: If you’re ambitious, move to full code execution with MCP and a layered sub-agent architecture, replacing most direct tool calls.

All of this plays nicely with modern agent stacks and complements broader agent design principles like those covered in AI agent fundamentals.

Open source MCP servers and next steps

Many of these patterns are already available in open source form. Bright Data’s MCP server, for example, is MIT-licensed on GitHub, ships with more than 60 tools across 11 groups, and offers a generous free tier (5,000 requests per month) for prototyping.

If you’re building AI agents on top of MCP, start by measuring where your tokens are going: tool definitions, intermediate outputs, or final responses. Then progressively introduce grouping, dynamic loading, programmatic calling, and output trimming. With a bit of careful design, you can dramatically cut token usage while making your agents more robust and capable at the same time.