<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Context Shift]]></title><description><![CDATA[Weekly insights on MCP architecture, agent infrastructure, and context engineering - written from the trenches of building sixdegree.ai 

No hype. Just the hard-won stuff that actually matters when you're shipping.]]></description><link>https://contextshift.io</link><image><url>https://contextshift.io/img/substack.png</url><title>Context Shift</title><link>https://contextshift.io</link></image><generator>Substack</generator><lastBuildDate>Wed, 29 Apr 2026 10:56:42 GMT</lastBuildDate><atom:link href="https://contextshift.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Craig Tracey]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[contextshift@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[contextshift@substack.com]]></itunes:email><itunes:name><![CDATA[Craig Tracey]]></itunes:name></itunes:owner><itunes:author><![CDATA[Craig Tracey]]></itunes:author><googleplay:owner><![CDATA[contextshift@substack.com]]></googleplay:owner><googleplay:email><![CDATA[contextshift@substack.com]]></googleplay:email><googleplay:author><![CDATA[Craig Tracey]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Twelve Rules for Building AI Agents That Actually Work]]></title><description><![CDATA[What agents are, how the loop works, and the mental models that matter.]]></description><link>https://contextshift.io/p/twelve-rules-for-building-ai-agents</link><guid isPermaLink="false">https://contextshift.io/p/twelve-rules-for-building-ai-agents</guid><dc:creator><![CDATA[Craig Tracey]]></dc:creator><pubDate>Sun, 29 Mar 2026 17:45:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/32d30129-8ef4-4d87-bf10-1107e9540d97_1369x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Everyone&#8217;s building agents. Most are building them wrong.</p><p>Not because they lack skill, but because they&#8217;re missing the right mental models. Before you write a line of code, you need to understand what agents actually are and how they differ from everything else you&#8217;ve built.</p><p>These are the rules that would have saved us a lot of pain.</p><div><hr></div><h2>1. Understand the Loop</h2><p>An agent is not a chatbot with tools. It&#8217;s not RAG with extra steps. It&#8217;s a system that perceives, reasons, and acts in a loop until a goal is achieved.</p><p><strong>Chatbot: </strong>Question &gt; Response &gt; Single answer</p><p><strong>RAG: </strong>Question &gt; Retrieve &gt; Response &gt; Answer with context</p><p><strong>Agent: </strong>Goal &gt; Reason &gt; Act &gt; Observe &gt; Repeat &gt; Task accomplished</p><p>A chatbot answers. An agent accomplishes.</p><p>Our first &#8220;agent&#8221; was basically a chatbot with a for-loop. Took us three weeks to realize we&#8217;d reinvented the agentic loop badly.</p><div><hr></div><h2>2. Context Is Working Memory</h2><p>Most modern models offer 1M+ tokens. Sounds like a lot. It isn&#8217;t.</p><p>Every turn of the loop adds to context: the goal, every tool call and result, every reasoning step, every error and retry. A complex task might burn 50K tokens before you&#8217;ve done anything interesting.</p><p>Context is not free storage. It&#8217;s working memory. The more you stuff in, the worse the model reasons. Performance degrades well before you hit the limit. Researchers call it the &#8220;lost in the middle&#8221; effect.</p><p>The best agents use the least context to accomplish the goal.</p><div><hr></div><h2>3. Tools Are Your Interface</h2><p>An LLM can only think. It can&#8217;t do. 
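</p><p>To act, it needs tools it can call. As a rough sketch (the shape follows common function-calling conventions; the tool itself and its fields are made up), here&#8217;s everything the model actually sees about a tool: a name, a description, and a parameter schema.</p><pre><code># Illustrative tool definition. The model never sees your code,
# only this metadata, and decides from it when and how to call the tool.
search_services = {
    "name": "search_services",
    "description": (
        "Search for services by name, owner, or tag. "
        "Returns matching services with IDs and basic metadata."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Name, owner, or tag to match"},
            "limit": {"type": "integer", "description": "Max results", "default": 10},
        },
        "required": ["query"],
    },
}
</code></pre><p>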
Tools bridge that gap.</p><p>The temptation is to give agents every tool they might need. GitHub, AWS, Slack, Jira, databases. Pile them on.</p><p>Don&#8217;t.</p><p>Every tool is a decision point. Every decision point is a chance for the model to choose wrong. We&#8217;ve seen noticeable degradation starting around 10-25 tools, with severe issues at 100+.</p><p>Start minimal. Add tools only when you hit a wall.</p><div><hr></div><h2>4. Tool Descriptions Are Prompts</h2><p>The model decides which tool to use based on the description. Vague descriptions lead to wrong choices.</p><p><code>Bad: "Gets service information"</code></p><p><code>Good: "Search for services by name, owner, or tag. Returns a list of matching services with their IDs, names, and basic metadata. Use this when you need to find services. Do not use this to get detailed information about a specific service you already know."</code></p><p>The description is a prompt. Write it like one.</p><div><hr></div><h2>5. Agents Thrive with Structure</h2><p>LLMs generate text. Agents need structured data.</p><p>Use constrained decoding when available. It forces the model to output valid JSON at the token level. More reliable, and faster because you never retry.</p><div><hr></div><h2>6. Plan for Hallucination</h2><p>Hallucination isn&#8217;t a bug you can fix. It&#8217;s just how these things work.</p><p>In agent systems, hallucination shows up as invented tool names, fabricated parameters, false confidence, and imagined results.</p><p>We had a case where our prompt included an example UUID with the explicit instruction: &#8220;THIS IS AN EXAMPLE UUID. DO NOT USE THIS VALUE.&#8221;</p><p>The agent used it anyway. Repeatedly.</p><p>You can&#8217;t prompt your way out of this. You engineer around it: validate everything, fail gracefully, reduce opportunity, add reflection, and escalate when stakes are high.</p><div><hr></div><h2>7. Prompts Are Code</h2><p>The system prompt is the most important code you write. It&#8217;s also the least tested.</p><p>Treat it like code: version control it, review changes with rigor, test in isolation, run regression tests, iterate based on failures.</p><p>A prompt change can break your agent just as easily as a code change.</p><div><hr></div><h2>8. Curate Memory</h2><p>Chat history is a log of what was said. Memory is what the agent knows and can use.</p><p>These are different.</p><p>Agent memory has layers: working memory (current context), episodic memory (what happened before), and semantic memory (what&#8217;s true about the world).</p><p>Most agents only implement working memory. That&#8217;s fine for simple tasks. Complex agents need more.</p><div><hr></div><h2>9. Evaluate Continuously</h2><p>&#8220;It seems to work&#8221; is not evaluation. Agents are probabilistic. They might work 80% of the time. You need to know that number.</p><p>Evaluation requires a test set, a metric, and a baseline. And it&#8217;s not something you do once. Every prompt change, tool change, or model upgrade requires re-evaluation.</p><div><hr></div><h2>10. Design for Security</h2><p>Agents that can act in the real world amplify risks. Prompt injection, unauthorized tool use, data leakage. These aren&#8217;t theoretical.</p><p>Least-privilege everything. Validate outputs. Gate high-risk actions. Log everything. Never handle secrets directly.</p><p>Over-privileged agents are the top reason enterprise pilots fail.</p><div><hr></div><h2>11. Instrument for Observability</h2><p>Agents are black boxes by nature. 
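</p><p>The only way in is to record what every turn of the loop actually did. A minimal sketch, assuming JSON-lines traces and made-up field names:</p><pre><code>import json, time, uuid

# One trace record per loop turn: what the agent tried, what came back,
# and what it cost. Field names here are illustrative, not a standard.
def log_step(run_id, step, tool, args, result, tokens):
    record = {
        "run_id": run_id,
        "step": step,
        "tool": tool,
        "args": args,
        "result_summary": str(result)[:200],  # keep the trace small
        "tokens": tokens,
        "ts": time.time(),
    }
    with open("agent_traces.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

log_step(str(uuid.uuid4()), 1, "search_services", {"query": "billing"}, ["svc-123"], 842)
</code></pre><p>Replay is just reading that file back in order.</p><p>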
Without traces, you can&#8217;t diagnose why a loop failed or where hallucinations compounded.</p><p>Implement from day one: full trajectory logging, structured traces, metrics dashboards, and replay capabilities.</p><p>This turns &#8220;it sometimes works&#8221; into something you can actually debug.</p><div><hr></div><h2>12. Optimize for Cost and Latency</h2><p>Agents multiply inferences. In production, this often determines whether the thing is viable at all.</p><p>We benchmarked agent performance across models and tasks. The results surprised us: a &#8220;smarter&#8221; model that costs 10x more per token often isn&#8217;t 10x better. Sometimes it&#8217;s worse because it overthinks.</p><p>Use cheaper models for simple steps. Reserve expensive models for complex reasoning. Cache aggressively. Set budgets. Kill runaway agents.</p><div><hr></div><h2>When to Break the Rules</h2><p>Agents are not always the answer. They&#8217;re slow, expensive, and unpredictable.</p><p>Use an agent when the task requires multiple dependent steps, the path isn&#8217;t known in advance, and human-like reasoning adds value.</p><p>Don&#8217;t use an agent when a deterministic script would work, latency is critical, or the cost of failure is high.</p><p>A well-designed API call beats an agent for predictable tasks. A simple chain beats a full agent when the path is mostly known. An agent beats both when you genuinely don&#8217;t know what you need until you start exploring.</p><div><hr></div><p>These rules aren&#8217;t exciting. They&#8217;re not the cool demos you see on Twitter. But they&#8217;re what separates agents that work from agents that almost work.</p><p>Master the loop. Respect context. Secure your tools. Plan for hallucination. Treat prompts as code. Curate memory. Evaluate continuously. Instrument everything. Watch your costs.</p><p>Start simple. Instrument everything. Iterate.</p><div><hr></div><p><em>Originally published on <a href="https://sixdegree.ai/blog/building-agents-fundamentals">sixdegree.ai</a>. If you&#8217;re building agents into your infrastructure, <a href="https://sixdegree.ai">let&#8217;s talk</a>.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://contextshift.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Context Shift! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Model Isn’t the Problem]]></title><description><![CDATA[Context is the bottleneck. 
Models are the distraction.]]></description><link>https://contextshift.io/p/the-model-isnt-the-problem</link><guid isPermaLink="false">https://contextshift.io/p/the-model-isnt-the-problem</guid><dc:creator><![CDATA[Craig Tracey]]></dc:creator><pubDate>Sun, 22 Mar 2026 15:15:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d740a191-5156-40f1-9c3c-7a87350151fb_1312x736.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today we published a benchmark. We gave six LLMs between 25 and 150 tool definitions and measured how often they picked the right one. The results were counterintuitive enough that they&#8217;re worth sitting with.</p><p>The most expensive model lost. Claude Sonnet, at $0.028 per call, was the least accurate at every toolset size. Claude Haiku outperformed it at 3x lower cost. The two cheapest models in the test were also the two most accurate. The correlation between price and tool-calling performance wasn&#8217;t just weak. It was inverted.</p><p>And every model degraded. Not just the cheap ones. All six got worse as the toolset grew. The degradation started between 25 and 50 tools, which is roughly what you get when you connect two or three MCP servers.</p><p>I&#8217;m not writing this to dunk on any particular model. I&#8217;m writing it because the pattern points at something more fundamental than which provider to pick.</p><div><hr></div><h2><strong>The benchmark question everyone is asking is the wrong one</strong></h2><p>&#8220;Which model should I use?&#8221; is a reasonable question. Models differ in capability, cost, latency, and context window size. Those differences matter.</p><p>But the benchmark data suggests that for tool-calling specifically, the question misses the point. The reason accuracy degrades at higher tool counts isn&#8217;t that one model is smarter than another. It&#8217;s that every model is doing the same thing: reading every tool definition in the context window and picking the one that seems most relevant. That&#8217;s a semantic matching problem over a growing candidate pool. It gets harder as the pool grows, regardless of which model you&#8217;re using.</p><p>You could swap in a better model and see marginal improvement. Or you could change what you&#8217;re asking the model to do. One of those has a ceiling. The other doesn&#8217;t.</p><div><hr></div><h2><strong>MCP made this worse</strong></h2><p>MCP solved a real problem. Before it, tool integrations were bespoke, fragile, and expensive to build. Now you can connect a GitHub MCP server, a Jira MCP server, a Kubernetes MCP server, and your agent has tools for all three. That&#8217;s genuinely useful.</p><p>But the default behavior is to load all of those tools into the context window at once. Connect ten services and you&#8217;re at 80 to 150 tools before you&#8217;ve written a single line of agent logic. The model sees all of them, all the time, and has to figure out which ones are relevant to whatever the user just asked.</p><p>This is the context architecture most agents are running right now. It&#8217;s also the architecture our benchmark measured failing.</p><p>The OpenAI models hit a hard wall at 128 tools. Not a degradation curve. A limit. GPT-4o and GPT-5.4 Mini both returned errors at 150 tools because OpenAI&#8217;s API won&#8217;t accept more than 128 tool definitions per request.</p><p>That limit looks like a constraint. It might also be a signal. 
If the benchmark data is right that accuracy degrades sharply past 50 tools, a hard ceiling at 128 is arguably OpenAI saying: past this point, the results aren&#8217;t reliable enough to ship. The limit forces you to think about what you&#8217;re loading into the context window. Maybe that&#8217;s the point.</p><div><hr></div><h2><strong>The failure mode is structural, not probabilistic</strong></h2><p>Here&#8217;s what the benchmark is actually measuring: how well can a model do semantic search over a list of tool definitions?</p><p>That&#8217;s not really an intelligence task. It&#8217;s a retrieval task. And retrieval degrades with scale. We already know this from RAG. The more candidates in the pool, the harder it is to surface the right one, even with good embeddings and reranking. Handing that retrieval problem to an LLM doesn&#8217;t change the underlying dynamic.</p><p>The cross-service confusion errors tell the same story. Datadog versus Grafana. Linear versus Jira. GitHub versus GitLab. These aren&#8217;t subtle errors. They&#8217;re cases where the model is navigating a crowded candidate pool and defaulting to whichever option looks most plausible.</p><p>Now think about what this looks like at enterprise scale. A large organization doesn&#8217;t run two overlapping observability tools. It runs five, accumulated across acquisitions, team preferences, and vendor lock-in that never got cleaned up. Datadog and Grafana and New Relic and Dynatrace and some homegrown thing the platform team built in 2019. Multiple project trackers. Multiple documentation systems. Multiple deployment pipelines. The heterogeneity isn&#8217;t an edge case in enterprise environments. It&#8217;s the default. And every redundant service you add to the candidate pool compounds the confusion the model is already experiencing.</p><p>This is a structural problem. Swapping models is a local fix for a systemic failure.</p><div><hr></div><h2><strong>Context isn&#8217;t a starting condition. It&#8217;s a workflow.</strong></h2><p>The way most agents treat context: assemble it upfront, load it into the window, run the agent. Context is a static artifact. You prepare it, then you use it.</p><p>That model made sense when agents were single-turn. Ask a question, get an answer. The context you needed at the start was the context you needed throughout.</p><p>Agents that actually do things don&#8217;t work that way. Each step changes what&#8217;s relevant. The agent queries a repository and learns it&#8217;s owned by a specific team. That fact changes which tools matter next. It finds a deployment linked to an incident. That changes which services are relevant. The context at step five is different from the context at step one, and it should be.</p><p>If you treat context as static, you have two options: load everything upfront and pay the candidate pool penalty the whole way through, or load too little and watch the agent fail when it encounters something it doesn&#8217;t have context for.</p><p>The alternative is to treat context as a workflow within the agent workflow. A continuous loop: discover what&#8217;s relevant, scope the tools accordingly, act, update the context based on what you learned, repeat. Not a fixed input at the start. A living layer that evolves as the agent moves through a task.</p><p>This is harder to build. It&#8217;s also the only approach that scales. A static context layer that works for 10 tools starts failing at 50 and breaks at 150. 
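</p><p>A rough sketch of the dynamic version, with made-up services and tool names (this is the shape of the idea, not a real API):</p><pre><code># Rescope the toolset every turn based on what the agent has learned,
# instead of loading every connected service upfront. Illustrative only.
ALL_TOOLS = {
    "github": ["get_repo", "list_pull_requests"],
    "datadog": ["query_metrics", "list_monitors"],
    "jira": ["search_issues", "create_issue"],
}

def scope_tools(relevant_services, limit=25):
    """Surface only the tools for services relevant to the current step."""
    tools = [t for s in relevant_services for t in ALL_TOOLS.get(s, [])]
    return tools[:limit]

# Turn 1: the task mentions a repository, so only GitHub tools are in play.
relevant = {"github"}
print(scope_tools(relevant))

# Turn 2: a result links the repo to an incident, so the scope shifts.
relevant.add("datadog")
print(scope_tools(relevant))
</code></pre><p>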
A dynamic one keeps the candidate pool small at every step, regardless of how many total services are connected.</p><div><hr></div><h2><strong>What this means for how you build</strong></h2><p>The benchmark data points toward a practical conclusion: keep the active toolset small. Not by connecting fewer services, but by only surfacing the tools that apply to the current context.</p><p>At 25 tools, the models in our benchmark were in the mid-to-high 80s on accuracy. That&#8217;s a reasonable operating range for a production agent. The goal is to stay there regardless of how many total services are connected. That requires a context layer that can scope the toolset dynamically, rather than a model smart enough to navigate an unlimited candidate pool.</p><p>Better models will keep coming. Context windows will keep growing. And the temptation will be to treat those improvements as a reason to load more into the window and let the model sort it out.</p><p>The benchmark says that doesn&#8217;t work. The ceiling for that approach is already visible, and it&#8217;s lower than most people building agents expect.</p><div><hr></div><p>The full benchmark data and methodology, along with the open source framework we used to run it, are at <a href="https://sixdegree.ai/blog/mcp-tool-overload">SixDegree</a>.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://contextshift.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Context Shift! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[First Principles of AI Context]]></title><description><![CDATA[If we're going to make AI actually work, we need to talk about what's underneath it.]]></description><link>https://contextshift.io/p/first-principles-of-ai-context</link><guid isPermaLink="false">https://contextshift.io/p/first-principles-of-ai-context</guid><dc:creator><![CDATA[Craig Tracey]]></dc:creator><pubDate>Sat, 14 Mar 2026 04:12:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b934b06f-7ff1-4f96-899e-2d39db69d517_1369x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every few weeks someone publishes a benchmark showing that the latest model is smarter, faster, more capable. Context windows are getting massive. A million tokens, two million, more on the horizon. And that&#8217;s genuinely impressive.</p><p>But it raises a question nobody seems to be asking: what are we filling those windows with?</p><p>Right now, the answer is mostly everything. Dump in the docs. Stuff in the chat history. Append the tool definitions. Hope the model figures out what matters.</p><p>Bigger windows don&#8217;t solve the context problem. They just give you more room to be wrong. A million tokens of unfocused, unstructured context isn&#8217;t better than ten thousand tokens of the right context. 
It&#8217;s worse, because the model has to work harder to find the signal in the noise, and you&#8217;re paying for every token of that noise.</p><p>I&#8217;ve spent the last two-plus years building agent infrastructure, and I keep landing on the same conclusion: the bottleneck isn&#8217;t the model and it isn&#8217;t the window size. It&#8217;s the quality and structure of what goes into the window. Until we treat context as an engineering problem, not just a capacity problem, we&#8217;re going to keep building impressive demos that fall apart in production.</p><p>Here are the first principles I keep coming back to.</p><div><hr></div><h2><strong>The Context Exists. The Relations Don't.</strong></h2><p>There&#8217;s a reason AI coding tools are so far ahead of everything else. Code has explicit structure: dependencies, type systems, call graphs. The model can follow the relationships. It can reason about how things connect.</p><p>Now think about everything else we&#8217;re trying to point AI at. Your operations. Your organization. Your business processes. There&#8217;s no relationship graph. No map connecting a customer complaint to the team responsible to the system that caused it.</p><p>Without structure, the model guesses. A bigger window just means it has more room to guess in.</p><p>The structure already exists inside your systems. Before you can get real value from AI, you need to connect it.</p><h2><strong>Semantics are probability, not truth.</strong></h2><p>This is the thing that&#8217;s easy to forget when a model gives you a confident, well-formatted answer: it doesn&#8217;t <em>know</em> anything. It&#8217;s predicting the most likely next token. When you ask it to interpret your data, it&#8217;s giving you the most probable interpretation, not necessarily the correct one.</p><p>That distinction doesn&#8217;t matter much when you&#8217;re generating a summary or drafting an email. It matters enormously when an agent is deciding which team to page at 3am, or which customer account is affected by an outage, or whether a support ticket is related to a known incident.</p><p>You can see this play out in real time with tool calls. An agent without enough context doesn&#8217;t just pick the wrong tool. It tries one, fails, tries another, fails again, and loops. It&#8217;s not being stupid. It&#8217;s doing exactly what you&#8217;d expect from a system that&#8217;s navigating by probability without a map. It doesn&#8217;t have the connective tissue to know that <em>this</em> entity means <em>that</em> tool, so it guesses, checks the result, and guesses again. It&#8217;s brute-forcing a path through a graph it can&#8217;t see.</p><p>Probability is useful. But decisions need ground truth. And ground truth comes from structure: explicit relationships that say <em>this</em> is connected to <em>that</em>, defined by rules, not inferred by a model.</p><p>The more we rely on agents to take real action, the less we can afford to let them operate on vibes.</p><h2><strong>Facts without relationships are a dead end.</strong></h2><p>RAG was supposed to solve the context problem. Ground the model in your data. Retrieve relevant chunks. It works for question answering.</p><p>And even that takes a surprising amount of effort. Chunking strategies, embedding model selection, reranking, relevance tuning, keeping the index fresh as your data changes. RAG pipelines are deceptively expensive to build well and even harder to maintain. 
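</p><p>Just writing the knobs down makes the point. Every entry below is a decision you tune and keep current, and the values and model names here are placeholders:</p><pre><code># Illustrative sketch of what a "simple" RAG pipeline accumulates.
rag_config = {
    "chunking":  {"strategy": "by_heading", "max_tokens": 512, "overlap": 64},
    "embedding": {"model": "some-embedding-model", "dimensions": 1024},
    "retrieval": {"top_k": 20},
    "reranking": {"model": "some-reranker", "keep": 5},
    "freshness": {"reindex_every_hours": 6},
}
</code></pre><p>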
That&#8217;s a lot of investment for a system that tops out at retrieval.</p><p>And when teams hit the ceiling of what vanilla RAG could do, where did they turn to improve it? You guessed it. Graphs. GraphRAG exists because people kept running into the same wall: retrieval without relationships isn&#8217;t enough.</p><p>But the moment you want an agent to <em>do</em> something, retrieval isn&#8217;t enough. Knowing &#8220;there was an incident last Tuesday&#8221; is a fact. Knowing that the incident affected three customers, was caused by a change made by a specific team, and is related to two open support tickets? That&#8217;s a graph. That&#8217;s the difference between an agent that can answer questions and one that can actually reason about what to do next.</p><p>We keep trying to solve a graph problem with a search engine. Vector similarity tells you what&#8217;s textually related. It can&#8217;t tell you what&#8217;s causally connected, what depends on what, or what breaks if something changes. And because similarity is probabilistic, it&#8217;ll happily surface content that <em>looks</em> related but isn&#8217;t, with no way to tell the difference.</p><h2><strong>Context has to discover itself.</strong></h2><p>Here&#8217;s where it gets hard. You can&#8217;t manually build and maintain a map of how everything in your world connects. But look at what we&#8217;re doing today to try.</p><p>We write longer prompts. We craft system instructions. We maintain AGENTS.md and CLAUDE.md files. We build onboarding documents that try to explain our world to the model in prose. We hand-author tool descriptions and few-shot examples. We create elaborate prompt chains that try to steer the model toward the right context at the right time.</p><p>All of these are manual. All of them go stale. And all of them are fundamentally trying to solve the same problem: teaching the model what it should already be able to see.</p><p>And here&#8217;s the kicker. What are we writing all of this context in? Natural language. Prose. The very thing we just established is interpreted probabilistically, not precisely. We&#8217;re using semantics to provide context to a system that processes semantics as probability. We&#8217;re bootstrapping truth from a medium that doesn&#8217;t guarantee it.</p><p>It works at small scale. When you have five tools and one domain, you can write enough context by hand to get by. But it breaks the moment your environment grows. More tools, more systems, more relationships, more change. The rate of change is faster than any human process can keep up with.</p><p>The only context that stays accurate is context that builds itself, continuously, from the systems that are already running. The relationships already exist inside your tools and platforms. They&#8217;re just not structured in a way that AI can use.</p><p>The job isn&#8217;t data entry. The job is discovery.</p><h2><strong>Structure needs rules, not just data.</strong></h2><p>This one took me a while to internalize. You can ingest every piece of data from every system you touch and still have nothing useful. Data without interpretation is noise, and a model will happily interpret that noise for you. Confidently, probabilistically, and sometimes wrong.</p><p>Structure emerges from rules. A project <em>is owned by</em> a team. A customer <em>is served by</em> a product. An alert <em>relates to</em> an incident. These aren&#8217;t things you discover statistically. They&#8217;re things you define. 
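</p><p>A minimal sketch of what &#8220;define&#8221; can look like in practice, with made-up entity and relation names:</p><pre><code># Relationships as explicit, typed rules rather than statistical inference.
ALLOWED_RELATIONS = {
    ("project",  "owned_by",   "team"),
    ("customer", "served_by",  "product"),
    ("alert",    "relates_to", "incident"),
}

def add_edge(graph, source, kind, target, source_type, target_type):
    """Only admit edges whose shape matches a defined rule."""
    if (source_type, kind, target_type) not in ALLOWED_RELATIONS:
        raise ValueError(f"undefined relationship: {source_type} {kind} {target_type}")
    graph.setdefault((source, kind), set()).add(target)
    return graph

graph = {}
add_edge(graph, "checkout-latency-alert", "relates_to", "INC-2041", "alert", "incident")
print(graph)
</code></pre><p>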
And once defined, they make relationships queryable, composable, and trustworthy. Not probable. True.</p><p>Without rules, you have data. With rules, you have structure an agent can trust.</p><h2><strong>Agents need context before tools.</strong></h2><p>MCP gave agents a standard way to call tools. That was a genuine breakthrough. But tools without context are blind.</p><p>Think about how an agent actually decides which tool to call. It reads the tool&#8217;s name and description and picks the one that seems most relevant. Semantics again. The entire tool selection process is probabilistic. The agent isn&#8217;t matching against a schema or following a rule. It&#8217;s making its best guess.</p><p>Give an agent access to hundreds of tools and watch what happens. It picks the wrong ones. It hallucinates capabilities. It takes action without understanding what it&#8217;s acting on. And every one of those irrelevant tool definitions is eating up your context window, crowding out the information the agent actually needs. Each failed tool call burns tokens, adds latency, and pushes useful context further out of reach.</p><p>The fix isn&#8217;t better prompting. The fix is context first, tools second. The agent needs to understand what&#8217;s relevant to the current task before it gets access to the tools that apply.</p><p>This is the order of operations that most agent architectures get backwards.</p><div><hr></div><h2><strong>Why this matters now</strong></h2><p>We&#8217;re about to get 10 million token context windows. The temptation will be to treat that as a solution. Just throw everything in and let the model sort it out.</p><p>That won&#8217;t work. It&#8217;ll just be expensive, slow, and probabilistically wrong in ways that are hard to debug. The context problem isn&#8217;t about capacity. It&#8217;s about knowing what matters, how things connect, and what&#8217;s relevant right now. With certainty, not just likelihood.</p><p>MCP is taking off. Agent frameworks are proliferating. Everyone is building tool integrations. But almost nobody is building the context layer underneath: the thing that decides what goes into the window and why.</p><p>That&#8217;s the gap. And it&#8217;s the gap that will determine whether AI agents become genuinely useful or remain expensive toys that work great in demos.</p><p>I started this newsletter because I think the people building in this space need a place to think through these problems together. Not hype. Not product announcements. Just the hard, specific questions that come with making AI systems work for real.</p><div><hr></div><p>This is the problem I'm building toward solving with <a href="https://sixdegree.ai">sixdegree.ai</a>. More on that soon - and more on the specific patterns that actually work in production.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://contextshift.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Context Shift! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Welcome to Context Shift]]></title><description><![CDATA[MCP, agents, and the context layer nobody talks about]]></description><link>https://contextshift.io/p/welcome-to-context-shift</link><guid isPermaLink="false">https://contextshift.io/p/welcome-to-context-shift</guid><dc:creator><![CDATA[Craig Tracey]]></dc:creator><pubDate>Sat, 14 Mar 2026 03:06:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ad7aac34-6ef9-4de1-b59c-47f9fa1f3a3d_1312x736.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You&#8217;re building AI agents. You&#8217;ve got the LLM calls working. The demo is impressive.</p><p>Then you ship it.</p><p>And you discover the real problem was never the model &#8212; it was the context.</p><p>Which tools should the agent see right now? How do you keep it from calling a GitLab endpoint when the user is asking about a GitHub repo? What happens when you load 200 MCP tools into a single session and the model starts hallucinating capabilities it doesn&#8217;t have?</p><p>These are the problems nobody talks about at AI conferences. They&#8217;re too specific. Too operational. Too boring &#8212; until they take down your production system on a Tuesday afternoon.</p><h2><strong>What this newsletter is</strong></h2><p>Context Shift is a weekly newsletter for engineers building with MCP and agent infrastructure. Each issue delivers practitioner-tested insights on:</p><ul><li><p><strong>MCP architecture</strong> &#8212; what the spec says, what it doesn&#8217;t, and what actually works</p></li><li><p><strong>Agent context management</strong> &#8212; the hard problem of getting the right information to the right model at the right time</p></li><li><p><strong>The infrastructure layer</strong> &#8212; the unglamorous plumbing that makes AI systems work in the real world</p></li></ul><p>No hype. No &#8220;AI is changing everything&#8221; takes. Occasionally I&#8217;ll surface something worth reading from elsewhere in the ecosystem, but mostly this is first-hand &#8212; the specific, hard-won knowledge that comes from shipping context-aware systems.</p><h2><strong>Who I am</strong></h2><p>I&#8217;m Craig, founder of <a href="https://sixdegree.ai/">sixdegree.ai</a> &#8212; a live system intelligence platform that delivers real-time business context to AI agents via MCP. Every week I&#8217;m in the code, solving the same problems you&#8217;re solving: tool discovery, context windows, entity resolution, agent orchestration.</p><p>The things I write about here are things I&#8217;ve built, broken, and rebuilt. This newsletter is the one I wish existed when I started.</p><h2><strong>Subscribe</strong></h2><p>If you&#8217;re building agents, designing MCP servers, or architecting the context layer underneath your AI systems &#8212; subscribe. 
Every issue is written to save you at least one bad production decision.</p><p>Let&#8217;s get into it.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://contextshift.io/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Context Shift! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item></channel></rss>