The Model Isn’t the Problem
Context is the bottleneck. Models are the distraction.
Today we published a benchmark. We gave six LLMs between 25 and 150 tool definitions and measured how often they picked the right one. The results were counterintuitive enough that they’re worth sitting with.
The most expensive model lost. Claude Sonnet, at $0.028 per call, was the least accurate at every toolset size. Claude Haiku outperformed it at roughly a third of the cost. The two cheapest models in the test were also the two most accurate. The correlation between price and tool-calling performance wasn’t just weak. It was inverted.
And every model degraded. Not just the cheap ones. All six got worse as the toolset grew. The degradation started between 25 and 50 tools, which is roughly what you get when you connect two or three MCP servers.
I’m not writing this to dunk on any particular model. I’m writing it because the pattern points at something more fundamental than which provider to pick.
The benchmark question everyone is asking is the wrong one
“Which model should I use?” is a reasonable question. Models differ in capability, cost, latency, and context window size. Those differences matter.
But the benchmark data suggests that for tool-calling specifically, the question misses the point. The reason accuracy degrades at higher tool counts isn’t that one model is smarter than another. It’s that every model is doing the same thing: reading every tool definition in the context window and picking the one that seems most relevant. That’s a semantic matching problem over a growing candidate pool. It gets harder as the pool grows, regardless of which model you’re using.
You could swap in a better model and see marginal improvement. Or you could change what you’re asking the model to do. One of those has a ceiling. The other doesn’t.
MCP made this worse
MCP solved a real problem. Before it, tool integrations were bespoke, fragile, and expensive to build. Now you can connect a GitHub MCP server, a Jira MCP server, a Kubernetes MCP server, and your agent has tools for all three. That’s genuinely useful.
But the default behavior is to load all of those tools into the context window at once. Connect ten services and you’re at 80 to 150 tools before you’ve written a single line of agent logic. The model sees all of them, all the time, and has to figure out which ones are relevant to whatever the user just asked.
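The accumulation is mechanical. A minimal sketch of the default pattern, with hypothetical server and tool names, shows how every connected server's definitions end up flattened into one request payload:

```python
# Illustrative sketch (not the benchmark harness): the default MCP pattern
# concatenates every connected server's tool definitions into one request.
# Server and tool names here are hypothetical.

def collect_tools(servers: dict[str, list[str]]) -> list[dict]:
    """Flatten every server's tools into a single list of tool definitions."""
    return [
        {"name": f"{server}_{tool}", "description": f"{tool} on {server}"}
        for server, tools in servers.items()
        for tool in tools
    ]

servers = {
    "github": ["create_issue", "merge_pr", "list_repos"],
    "jira": ["create_ticket", "move_ticket"],
    "kubernetes": ["get_pods", "scale_deployment"],
}

tools = collect_tools(servers)
print(len(tools))  # 7 here; real servers expose dozens each, so this climbs fast
```

Nothing in this path asks whether a tool is relevant. Every definition rides along on every request.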
This is the context architecture most agents are running right now. It’s also the architecture our benchmark measured failing.
The OpenAI models hit a hard wall at 128 tools. Not a degradation curve. A limit. GPT-4o and GPT-5.4 Mini both returned errors at 150 tools because OpenAI’s API won’t accept more than 128 tool definitions per request.
That limit looks like a constraint. It might also be a signal. If the benchmark data is right that accuracy degrades sharply past 50 tools, a hard ceiling at 128 is arguably OpenAI saying: past this point, the results aren’t reliable enough to ship. The limit forces you to think about what you’re loading into the context window. Maybe that’s the point.
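If you take the limit as a signal, the cheapest response is to fail before the API does. A minimal guard, assuming the 128-definition cap described above (the function and error text are our own sketch, not OpenAI's SDK):

```python
# Fail fast when the toolset exceeds the per-request cap, instead of
# discovering it as an API error at runtime. The cap value comes from
# the limit discussed above; everything else is a sketch.

OPENAI_TOOL_LIMIT = 128

def check_toolset(tools: list[dict], limit: int = OPENAI_TOOL_LIMIT) -> list[dict]:
    """Raise before sending a request the API would reject anyway."""
    if len(tools) > limit:
        raise ValueError(
            f"{len(tools)} tool definitions exceeds the {limit}-per-request "
            "limit; scope the toolset before sending."
        )
    return tools
```

A guard like this doesn't fix the accuracy problem, but it surfaces the architectural one: if you're tripping it, the model was already swimming in candidates.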
The failure mode is structural, not probabilistic
Here’s what the benchmark is actually measuring: how well can a model do semantic search over a list of tool definitions?
That’s not really an intelligence task. It’s a retrieval task. And retrieval degrades with scale. We already know this from RAG. The more candidates in the pool, the harder it is to surface the right one, even with good embeddings and reranking. Handing that retrieval problem to an LLM doesn’t change the underlying dynamic.
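Treating tool selection as retrieval makes the dynamic concrete. A toy sketch: production systems would use embeddings and reranking, but plain word overlap stands in here so the example stays self-contained, and the tool names are hypothetical.

```python
# Tool selection as a retrieval problem: score every candidate against
# the query, keep the top k. Word overlap is a stand-in for embeddings.

def score(query: str, description: str) -> float:
    """Crude relevance: fraction of query words found in the description."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    return len(q & d) / len(q) if q else 0.0

def top_k(query: str, tools: dict[str, str], k: int = 3) -> list[str]:
    """Return the k tool names whose descriptions best match the query."""
    ranked = sorted(tools, key=lambda name: score(query, tools[name]), reverse=True)
    return ranked[:k]

tools = {
    "datadog_query_metrics": "query metrics and monitors in datadog",
    "grafana_query_dashboard": "query dashboards and panels in grafana",
    "github_create_issue": "create an issue in a github repository",
}

print(top_k("query datadog metrics", tools, k=1))  # ['datadog_query_metrics']
```

The scorer doesn't get smarter as the pool grows; it just has more near-misses to rank. That's the same squeeze the LLM is under when it does this implicitly over 150 definitions.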
The cross-service confusion errors tell the same story. Datadog versus Grafana. Linear versus Jira. GitHub versus GitLab. These aren’t subtle errors. They’re cases where the model is navigating a crowded candidate pool and defaulting to whichever option looks most plausible.
Now think about what this looks like at enterprise scale. A large organization doesn’t run two overlapping observability tools. It runs five, accumulated across acquisitions, team preferences, and vendor lock-in that never got cleaned up. Datadog and Grafana and New Relic and Dynatrace and some homegrown thing the platform team built in 2019. Multiple project trackers. Multiple documentation systems. Multiple deployment pipelines. The heterogeneity isn’t an edge case in enterprise environments. It’s the default. And every redundant service you add to the candidate pool compounds the confusion the model is already experiencing.
This is a structural problem. Swapping models is a local fix for a systemic failure.
Context isn’t a starting condition. It’s a workflow.
The way most agents treat context: assemble it upfront, load it into the window, run the agent. Context is a static artifact. You prepare it, then you use it.
That model made sense when agents were single-turn. Ask a question, get an answer. The context you needed at the start was the context you needed throughout.
Agents that actually do things don’t work that way. Each step changes what’s relevant. The agent queries a repository and learns it’s owned by a specific team. That fact changes which tools matter next. It finds a deployment linked to an incident. That changes which services are relevant. The context at step five is different from the context at step one, and it should be.
If you treat context as static, you have two options: load everything upfront and pay the candidate pool penalty the whole way through, or load too little and watch the agent fail when it encounters something it doesn’t have context for.
The alternative is to treat context as a workflow within the agent workflow. A continuous loop: discover what’s relevant, scope the tools accordingly, act, update the context based on what you learned, repeat. Not a fixed input at the start. A living layer that evolves as the agent moves through a task.
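The loop above can be sketched in a few lines. Everything here is illustrative: the word-overlap scoper is a stand-in for real retrieval, and a real agent would call tools and fold their results back into context rather than just recording a string.

```python
# Context as a workflow: rescope the active toolset at every step
# from everything learned so far, instead of fixing it upfront.

def select_tools(query: str, tools: dict[str, str], k: int) -> list[str]:
    """Toy scoper: rank tools by word overlap with the query, keep k."""
    q = set(query.lower().split())
    ranked = sorted(tools, key=lambda t: len(q & set(tools[t].split())), reverse=True)
    return ranked[:k]

def run_agent(task: str, tools: dict[str, str], steps: int = 3) -> list[list[str]]:
    """Return the (small) active toolset chosen at each step."""
    facts: list[str] = []
    history: list[list[str]] = []
    for step in range(steps):
        # 1. Discover: the query reflects everything learned so far.
        query = " ".join([task, *facts])
        # 2. Scope: surface a small candidate pool, not the full registry.
        active = select_tools(query, tools, k=2)
        history.append(active)
        # 3. Act + 4. Update: a real agent would execute a tool here and
        # fold its result into context; we just record a learned fact.
        facts.append(f"step {step} used {active[0]}")
    return history
```

The point of the structure, not the toy logic: the candidate pool the model sees at each step is bounded by `k`, no matter how many tools are registered in total.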
This is harder to build. It’s also the only approach that scales. A static context layer that works for 10 tools starts failing at 50 and breaks at 150. A dynamic one keeps the candidate pool small at every step, regardless of how many total services are connected.
What this means for how you build
The benchmark data points toward a practical conclusion: keep the active toolset small. Not by connecting fewer services, but by only surfacing the tools that apply to the current context.
At 25 tools, the models in our benchmark were in the mid-to-high 80s on accuracy. That’s a reasonable operating range for a production agent. The goal is to stay there regardless of how many total services are connected. That requires a context layer that can scope the toolset dynamically, rather than a model smart enough to navigate an unlimited candidate pool.
Better models will keep coming. Context windows will keep growing. And the temptation will be to treat those improvements as a reason to load more into the window and let the model sort it out.
The benchmark says that doesn’t work. The ceiling for that approach is already visible, and it’s lower than most people building agents expect.
The full benchmark data and methodology, along with the open source framework we used to run it, are at SixDegree.

