Twelve Rules for Building AI Agents That Actually Work
What agents are, how the loop works, and the mental models that matter.
Everyone’s building agents. Most are building them wrong.
Not because they lack skill, but because they’re missing the right mental models. Before you write a line of code, you need to understand what agents actually are and how they differ from everything else you’ve built.
These are the rules that would have saved us a lot of pain.
1. Understand the Loop
An agent is not a chatbot with tools. It’s not RAG with extra steps. It’s a system that perceives, reasons, and acts in a loop until a goal is achieved.
Chatbot: Question > Response > Single answer
RAG: Question > Retrieve > Response > Answer with context
Agent: Goal > Reason > Act > Observe > Repeat > Task accomplished
A chatbot answers. An agent accomplishes.
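In code, the difference is just the loop. A minimal sketch, with the model and the tool runner injected as plain callables (call_model and run_tool here are hypothetical stand-ins, not a real SDK):

```python
# Minimal agentic loop: reason, act, observe, repeat, with a turn budget.
def run_agent(goal, call_model, run_tool, max_turns=10):
    context = [{"role": "user", "content": goal}]
    for _ in range(max_turns):
        step = call_model(context)                    # Reason: pick the next action
        if step["type"] == "final":
            return step["content"]                    # Goal accomplished, exit loop
        result = run_tool(step["tool"], step["args"])  # Act: execute the chosen tool
        context.append({"role": "assistant", "content": str(step)})
        context.append({"role": "tool", "content": str(result)})  # Observe
    raise RuntimeError("turn budget exhausted without reaching the goal")
```

The turn budget matters: without it, a confused model loops forever.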
Our first “agent” was basically a chatbot with a for-loop. Took us three weeks to realize we’d reinvented the agentic loop badly.
2. Context Is Working Memory
Modern models offer large context windows, some over a million tokens. Sounds like a lot. It isn't.
Every turn of the loop adds to context: the goal, every tool call and result, every reasoning step, every error and retry. A complex task might burn 50K tokens before you’ve done anything interesting.
Context is not free storage. It's working memory. The more you stuff in, the worse the model reasons. Performance degrades well before you hit the limit: models recall information buried in the middle of a long context far worse than information at the edges, an effect researchers call "lost in the middle."
The best agents use the least context to accomplish the goal.
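One way to keep working memory small is to compact the history as the loop runs. A rough sketch that keeps the goal and the most recent turns and elides the middle (production systems usually summarize the middle rather than drop it):

```python
def compact_context(messages, keep_recent=6):
    """Keep the goal and the most recent turns; replace the middle with a marker.

    A crude sketch. Summarizing the elided turns preserves more signal.
    """
    if len(messages) <= keep_recent + 1:
        return messages
    goal = messages[:1]                       # first message holds the goal
    recent = messages[-keep_recent:]
    dropped = len(messages) - 1 - keep_recent
    marker = [{"role": "system", "content": f"[{dropped} earlier turns elided]"}]
    return goal + marker + recent
```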
3. Tools Are Your Interface
An LLM can only think. It can’t do. Tools bridge that gap.
The temptation is to give agents every tool they might need. GitHub, AWS, Slack, Jira, databases. Pile them on.
Don’t.
Every tool is a decision point. Every decision point is a chance for the model to choose wrong. We’ve seen noticeable degradation starting around 10-25 tools, with severe issues at 100+.
Start minimal. Add tools only when you hit a wall.
4. Tool Descriptions Are Prompts
The model decides which tool to use based on the description. Vague descriptions lead to wrong choices.
Bad: "Gets service information"
Good: "Search for services by name, owner, or tag. Returns a list of matching services with their IDs, names, and basic metadata. Use this when you need to find services. Do not use this to get detailed information about a specific service you already know."
The description is a prompt. Write it like one.
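Here's the good description above wired into a tool schema. The shape follows the common function-calling convention of a name, a description, and JSON Schema parameters; the exact format depends on your provider:

```python
# The description is the prompt the model reads when deciding whether to call this.
search_services_tool = {
    "name": "search_services",
    "description": (
        "Search for services by name, owner, or tag. Returns a list of "
        "matching services with their IDs, names, and basic metadata. "
        "Use this when you need to find services. Do not use this to get "
        "detailed information about a specific service you already know."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Service name, owner, or tag to match",
            },
        },
        "required": ["query"],
    },
}
```

Note the description says when to use the tool and when not to. That negative guidance is what keeps the model from reaching for it by default.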
5. Agents Thrive with Structure
LLMs generate text. Agents need structured data.
Use constrained decoding when available. It forces the model to output valid JSON at the token level. More reliable, and faster because you never retry.
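When constrained decoding isn't available, validate on the way out instead. A minimal sketch; the expected action shape (a tool name plus args) is an assumption, not a standard:

```python
import json

def parse_action(raw: str) -> dict:
    """Validate that the model's raw output is the JSON action we expect."""
    action = json.loads(raw)                  # raises ValueError on invalid JSON
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    for required in ("tool", "args"):
        if required not in action:
            raise ValueError(f"missing required field: {required}")
    if not isinstance(action["args"], dict):
        raise ValueError("args must be an object")
    return action
```

Fail loudly here and the loop can retry with the error in context, instead of silently acting on garbage.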
6. Plan for Hallucination
Hallucination isn't a bug you can fix. It's inherent to how these models generate text.
In agent systems, hallucination shows up as invented tool names, fabricated parameters, false confidence, and imagined results.
We had a case where our prompt included an example UUID with the explicit instruction: “THIS IS AN EXAMPLE UUID. DO NOT USE THIS VALUE.”
The agent used it anyway. Repeatedly.
You can’t prompt your way out of this. You engineer around it: validate everything, fail gracefully, reduce opportunity, add reflection, and escalate when stakes are high.
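"Validate everything" can be as blunt as refusing any identifier the agent didn't get from a real tool result. A sketch inspired by our UUID incident (the `_id` naming convention and the helper itself are hypothetical):

```python
import uuid

def validate_tool_args(args, known_ids, example_ids=frozenset()):
    """Reject hallucinated or placeholder identifiers before a tool call runs."""
    for key, value in args.items():
        if key.endswith("_id"):
            uuid.UUID(value)  # must at least parse as a UUID; raises if not
            if value in example_ids:
                raise ValueError(f"{key} is a placeholder from the prompt: {value}")
            if value not in known_ids:
                raise ValueError(f"{key} not seen in any earlier tool result: {value}")
    return args
```

The example-UUID check is exactly the guard our prompt instruction failed to be.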
7. Prompts Are Code
The system prompt is the most important code you write. It’s also the least tested.
Treat it like code: version control it, review changes with rigor, test in isolation, run regression tests, iterate based on failures.
A prompt change can break your agent just as easily as a code change.
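A prompt regression test can be embarrassingly simple and still catch real breakage. A sketch, assuming your prompt has a few invariant phrases the agent depends on (the phrases here are made up):

```python
# Hypothetical check run in CI: every prompt change must preserve these invariants.
REQUIRED_PHRASES = [
    "You may only call the tools listed below",
    "Respond with a single JSON object",
]

def check_prompt(prompt: str) -> list[str]:
    """Return the invariants a candidate prompt violates. Empty list = pass."""
    return [phrase for phrase in REQUIRED_PHRASES if phrase not in prompt]
```

This catches the most common failure: someone "tidies" the prompt and deletes the line the whole agent depended on.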
8. Curate Memory
Chat history is a log of what was said. Memory is what the agent knows and can use.
These are different.
Agent memory has layers: working memory (current context), episodic memory (what happened before), and semantic memory (what’s true about the world).
Most agents only implement working memory. That’s fine for simple tasks. Complex agents need more.
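The three layers above can be sketched as one structure (a hypothetical shape, not a standard API):

```python
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    working: list = field(default_factory=list)   # current context window
    episodic: list = field(default_factory=list)  # summaries of past runs
    semantic: dict = field(default_factory=dict)  # durable facts about the world

    def end_episode(self, summary: str) -> None:
        """Archive a summary of the run and clear working memory for the next one."""
        self.episodic.append(summary)
        self.working.clear()

    def learn(self, key: str, fact: str) -> None:
        """Promote something observed during a run into durable knowledge."""
        self.semantic[key] = fact
```

The important move is the promotion step: working memory is disposable, but what the agent learned should outlive the run.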
9. Evaluate Continuously
“It seems to work” is not evaluation. Agents are probabilistic. They might work 80% of the time. You need to know that number.
Evaluation requires a test set, a metric, and a baseline. And it’s not something you do once. Every prompt change, tool change, or model upgrade requires re-evaluation.
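A minimal eval harness is just a labeled test set, a pass rate, and a comparison against the last known-good number. A sketch:

```python
def evaluate(agent, test_set, baseline_pass_rate):
    """Run the agent over a labeled test set and flag regressions vs. baseline."""
    passes = sum(1 for case in test_set if agent(case["input"]) == case["expected"])
    pass_rate = passes / len(test_set)
    return {
        "pass_rate": pass_rate,
        "regression": pass_rate < baseline_pass_rate,  # block the ship if True
    }
```

Exact-match scoring is the crudest possible metric; for open-ended tasks you'd swap in a rubric or an LLM judge. The structure stays the same.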
10. Design for Security
Agents that can act in the real world amplify risks. Prompt injection, unauthorized tool use, data leakage. These aren’t theoretical.
Least-privilege everything. Validate outputs. Gate high-risk actions. Log everything. Never handle secrets directly.
In our experience, over-privileged agents are one of the top reasons enterprise pilots fail.
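Gating high-risk actions can start as small as an allowlist plus a human-approval flag. A sketch (the tool names are illustrative):

```python
# Tools that must never run without a human in the loop (illustrative names).
HIGH_RISK = {"delete_database", "send_email", "deploy"}

def gate(tool_name, args, approved_by_human=False):
    """Least privilege: high-risk tools require explicit human approval."""
    if tool_name in HIGH_RISK and not approved_by_human:
        raise PermissionError(f"{tool_name} requires human approval")
    return tool_name, args
```

Put the gate between the model's decision and the tool's execution, never inside the prompt. Prompts are requests; gates are guarantees.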
11. Instrument for Observability
Agents are black boxes by nature. Without traces, you can’t diagnose why a loop failed or where hallucinations compounded.
Implement from day one: full trajectory logging, structured traces, metrics dashboards, and replay capabilities.
This turns “it sometimes works” into something you can actually debug.
12. Optimize for Cost and Latency
Agents multiply inference calls: one task can mean dozens of model invocations. In production, this often determines whether the thing is viable at all.
We benchmarked agent performance across models and tasks. The results surprised us: a “smarter” model that costs 10x more per token often isn’t 10x better. Sometimes it’s worse because it overthinks.
Use cheaper models for simple steps. Reserve expensive models for complex reasoning. Cache aggressively. Set budgets. Kill runaway agents.
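Budgets only work if something enforces them mid-run. A sketch of a cost tracker that kills the agent when it overspends (the pricing numbers are placeholders):

```python
class BudgetExceeded(RuntimeError):
    pass

class CostBudget:
    """Track spend across a run; raise to kill the agent when it overspends."""

    def __init__(self, max_usd: float):
        self.max_usd = max_usd
        self.spent = 0.0

    def charge(self, tokens: int, usd_per_1k: float) -> None:
        """Record the cost of one model call; abort if the budget is blown."""
        self.spent += tokens / 1000 * usd_per_1k
        if self.spent > self.max_usd:
            raise BudgetExceeded(f"spent ${self.spent:.2f} of ${self.max_usd:.2f}")
```

Call `charge()` after every model invocation inside the loop; the exception is your kill switch for runaway agents.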
When to Break the Rules
Agents are not always the answer. They’re slow, expensive, and unpredictable.
Use an agent when the task requires multiple dependent steps, the path isn’t known in advance, and human-like reasoning adds value.
Don’t use an agent when a deterministic script would work, latency is critical, or the cost of failure is high.
A well-designed API call beats an agent for predictable tasks. A simple chain beats a full agent when the path is mostly known. An agent beats both when you genuinely don’t know what you need until you start exploring.
These rules aren’t exciting. They’re not the cool demos you see on Twitter. But they’re what separates agents that work from agents that almost work.
Master the loop. Respect context. Secure your tools. Plan for hallucination. Treat prompts as code. Curate memory. Evaluate continuously. Instrument everything. Watch your costs.
Start simple. Iterate.
Originally published on sixdegree.ai. If you’re building agents into your infrastructure, let’s talk.

