Agent reliability is a guardrails problem, not a model problem

15 Jun 2026 · ~5 min · building in public

I've spent two years running an autonomous multi-agent system unattended across more than 40 of my own projects: agents that write code, deploy sites, edit DNS, change money-adjacent config, and pick their own next task without me watching. That is long enough to learn the one thing every demo hides: in production, agents almost never fail because the model was dumb. They fail in the operating layer around the model.

Swapping in a smarter model fixes a category of mistakes I rarely hit. It does nothing for the ones that actually bite.

What actually breaks

The real failures repeat, and none of them are reasoning failures:

Unbounded scope. An agent given "fix the build" decides the cleanest fix is to delete a directory, or reaches for a tool it was never meant to touch.
Runaway cost. A retry loop that looked fine in testing quietly burns through a budget at 3am because nothing capped it.
Destructive actions. rm -rf, a force-push to main, a SQL DROP: each is one token away from a harmless command, and the model has no inherent sense that this one is different.
Leaked secrets. A key ends up in a log, a commit, or an outbound call, because nothing was watching the bytes leave.
No halt. When the agent goes off the rails, there's nothing standing between its decision and the real world.

Every one of those is the same shape: the agent decided to do something it should never have been allowed to do, and there was no layer to refuse it. A better model still decides things. The question is what happens between the decision and the action.
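To make "between the decision and the action" concrete, here is a minimal sketch of the runaway-cost case: a retry loop that has to ask a budget for permission before every attempt. The names (`Budget`, `BudgetExceeded`, `retry_with_budget`) and the flat per-call cost estimate are assumptions for illustration, not the API of any particular library.

```python
import time
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when the next attempt would exceed the call or cost cap."""


@dataclass
class Budget:
    max_calls: int          # hard cap on attempts, including the first
    max_cost_usd: float     # hard cap on total estimated spend
    calls: int = 0
    cost_usd: float = 0.0

    def charge(self, estimated_cost_usd: float) -> None:
        # Check BEFORE spending, so the cap is a refusal and not a report.
        if self.calls + 1 > self.max_calls:
            raise BudgetExceeded(f"call cap of {self.max_calls} reached")
        if self.cost_usd + estimated_cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost cap of ${self.max_cost_usd:.2f} reached")
        self.calls += 1
        self.cost_usd += estimated_cost_usd


def retry_with_budget(task, budget, cost_per_call_usd, delay_s=0.0):
    """Retry `task` until it succeeds or the budget refuses another attempt."""
    while True:
        budget.charge(cost_per_call_usd)  # raises BudgetExceeded at the cap
        try:
            return task()
        except Exception:
            time.sleep(delay_s)  # fixed delay, kept simple for the sketch
```

With `Budget(max_calls=3, max_cost_usd=1.0)`, a task that always fails is attempted three times and then the loop ends with `BudgetExceeded` instead of running until morning. The design choice that matters is that the cap lives outside the task and outside the model, so neither can reason its way past it.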

Why a smarter model doesn't save you

Capability and safety are different axes. A more capable model is, if anything, more dangerous when unbounded — it's more willing to take a big, confident, irreversible action. Reliability isn't "the model is right more often." It's "when the model is wrong, nothing catastrophic happens." Those are bought separately, and the second one you build yourself.

You don't make an agent reliable by trusting it more. You make it reliable by needing to trust it less.

The three properties of agents that actually ship

Across every system of mine that has run unattended without an incident, the same three properties show up — and they're architectural, not prompt-level:

Bounded scope. The agent has an explicit allowlist of what it can do. Everything else is refused by default, not by hope. This single property prevents more incidents than any other.
Guardrails as code. Invariants that run on every action: deny destructive commands, block secrets, cap cost and call count, validate outputs against a schema. These are not instructions in a prompt, which the model can talk itself out of; they are checks in the execution path, which it cannot bypass.
A hard stop on violation. When a guardrail trips, the action is halted before it runs — not logged after the damage. The failure mode is "nothing happened," not "we'll roll it back."
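The three properties above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the code of agent-guardrails: the `Guard` class, the `GuardrailViolation` exception, and the regex patterns are all hypothetical, and the patterns are deliberately narrow examples, not a complete deny-list.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict


class GuardrailViolation(Exception):
    """Raised before execution; the wrapped tool is never called."""


# Illustrative patterns only (assumption): a real deny-list would be broader.
DESTRUCTIVE = [
    re.compile(r"\brm\s+-(rf|fr)\b"),
    re.compile(r"\bgit\s+push\b.*(--force|-f)\b"),
    re.compile(r"\bdrop\s+(table|database)\b", re.IGNORECASE),
]
SECRETS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]


@dataclass
class Guard:
    allowed_tools: frozenset   # bounded scope: anything not listed is refused
    max_calls: int             # cap on executed actions
    calls: int = 0

    def check(self, tool: str, argument: str) -> None:
        """Run every invariant; raise on the first violation."""
        if tool not in self.allowed_tools:
            raise GuardrailViolation(f"tool {tool!r} is not on the allowlist")
        if self.calls >= self.max_calls:
            raise GuardrailViolation(f"call cap of {self.max_calls} reached")
        if any(p.search(argument) for p in DESTRUCTIVE):
            raise GuardrailViolation("destructive command refused")
        if any(p.search(argument) for p in SECRETS):
            raise GuardrailViolation("argument looks like it contains a secret")

    def run(self, tool: str, argument: str,
            tools: Dict[str, Callable[[str], str]]) -> str:
        self.check(tool, argument)      # hard stop: raises before anything runs
        self.calls += 1
        return tools[tool](argument)
```

In this sketch, `Guard(allowed_tools=frozenset({"shell"}), max_calls=20).run("shell", "rm -rf build/", tools)` raises `GuardrailViolation` and the shell function is never invoked, while `"npm run build"` passes through. Because `check` runs before the tool call and not after it, the failure mode is "nothing happened".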

Notice none of this is AI. It's the boring operations discipline we already know from databases, payments, and infra — applied to a new kind of actor that happens to make its own decisions. The teams shipping agents reliably aren't using secret models. They wrapped the model in a layer that says no.

The layer, extracted

I kept rebuilding this layer per project until I pulled the minimum useful version out into one small file. You describe what the agent is allowed to do; the layer refuses everything else before the action executes.

agent-guardrails (@paulodevries/agent-guardrails) · MIT-licensed · zero dependencies · open-sourcing shortly

Wrap an agent's actions with invariants, cost caps, and a bounded tool scope. A blocking violation halts the action before it runs. ~3 KB, no deps, no telemetry, built from the patterns above.

It's deliberately tiny. The point isn't the code — it's the idea that an agent's safety should live in a layer you own and can read in one sitting, not scattered through prompts you hope the model honors.

If you're building with agents

Audit one thing this week: when your agent does something it shouldn't, what stops it? If the honest answer is "the prompt asks it not to," you don't have a reliability layer — you have a suggestion. The gap between those two is where production incidents live.
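One way to run that audit is as a test: feed actions that must never happen into whatever sits in front of your executor, and record which ones get through. A minimal sketch, assuming your layer can be called as `gate(action, execute)`; the two gates and the `FORBIDDEN` list are hypothetical stand-ins.

```python
from typing import Callable, Iterable, List

Action = str
Gate = Callable[[Action, Callable[[Action], None]], None]


def audit(gate: Gate, forbidden: Iterable[Action]) -> List[Action]:
    """Return every forbidden action that reached the executor."""
    leaked: List[Action] = []
    for action in forbidden:
        try:
            # The "executor" is a spy that records the action, never runs it.
            gate(action, leaked.append)
        except Exception:
            pass  # a refusal by exception counts as stopped
    return leaked


def prompt_only_gate(action: Action, execute: Callable[[Action], None]) -> None:
    # Stand-in for "the prompt asks it not to": nothing in the path can refuse.
    execute(action)


def allowlist_gate(action: Action, execute: Callable[[Action], None]) -> None:
    allowed = {"npm run build", "npm test"}
    if action not in allowed:
        raise PermissionError(f"refused: {action}")
    execute(action)


FORBIDDEN = ["rm -rf /", "git push --force origin main", "DROP TABLE users"]

print(audit(prompt_only_gate, FORBIDDEN))  # all three actions leak through
print(audit(allowlist_gate, FORBIDDEN))    # []
```

An empty list is the result you want. Anything else is a list of incidents that have not happened yet.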

I'm figuring this out in public, one system at a time. If you're wrestling with the same problem, I'd genuinely like to compare notes.

Paulo de Vries

Senior CRO & product designer who builds AI · Telegraaf, NRC, dentsu · Amsterdam