What ten years of CRO taught me about building reliable AI agents

16 Jun 2026 · ~5 min · building in public

For ten years my job was to answer one question for big brands — will this actually make more people convert? — and to never answer it with my opinion. You don't guess in conversion-rate optimization. You form a hypothesis, run the experiment, and let the data tell you you were wrong, which it usually does. Then two years ago I started building autonomous AI agents, and I watched them fail in a way that felt eerily familiar: confidently, unpredictably, and exactly when no one was measuring.

Agents fail the way landing pages fail

A page that looks perfect in the mockup tanks in the test. An agent that demos beautifully deletes a directory in production. Same shape: the thing performed in the controlled case and misbehaved in the wild — because nobody instrumented the wild. The mistake in both worlds is identical. It's trusting the demo: the polished version you wanted to be true.

"It worked when I tried it" is not a test

In CRO, "I'm pretty sure the green button wins" is worth nothing — you ship both and you measure. Agents are non-deterministic, a coin you can't watch flip, so "it worked when I tried it" is even weaker than a hunch about a button. One good run tells you almost nothing. You have to assert on behaviour, across many runs: which tools it called, whether it stayed in scope and budget, whether it actually finished — and whether it does that consistently, not one time in three. That isn't an AI idea. It's the A/B test, pointed at a new kind of subject.

Guardrails are just respecting the funnel

Every CRO has a rule burned in: you do not push an untested change straight onto the live checkout, because the downside is asymmetric — a small lift is never worth a chance of breaking the money. Agents have the same asymmetry, only sharper. So the instinct transfers directly: bound what the agent is allowed to do, cap what it can spend, and refuse the destructive action before it runs — not after you roll it back. I pulled that layer out into a small library, agent-guardrails, but the idea came from a decade of not breaking other people's checkouts.

A system that corrects itself

The deepest habit the work builds is the loop: hypothesize, measure, update, repeat — and trust the measurement over your ego. When I built AcePilot, the agent system I run across more than forty of my own projects, the part I'm proudest of isn't the agents. It's the calibration — the scorecards that compare what the system predicted to what actually happened, and adjust. That's not novel AI engineering. It's the experimentation loop I'd been running on conversion funnels, aimed at a system that now runs itself.

Reliability isn't a smarter model. It's the discipline of refusing to trust what you haven't measured — and building the instruments that let you measure it.

The conversion brain is the agent brain

The teams shipping agents that actually hold up aren't using a secret model. They treat the agent the way a CRO treats a funnel: instrument it, bound it, measure it across many runs, and let the data — not the demo, not the vibe — decide whether it ships. A more capable model just makes a more confident actor; it doesn't make a measured one. The measurement you build yourself.

I used to think my conversion career and my AI building were two separate things. They're the same skill wearing two costumes: never trust what you can't measure, and build the instrument that measures it. If you're building agents and you're tired of "it worked on my machine," that's the edge I'm working at — I'd genuinely like to compare notes.

Paulo de Vries

Senior CRO & product designer who builds AI · Telegraaf, NRC, dentsu · Amsterdam · LinkedIn