Question 1

What is agent-eval?

Accepted Answer

agent-eval is a zero-dependency JavaScript library that turns a recorded AI-agent run into a pass/fail report you can put in CI. It asserts which tools were used, whether scope and budget held, whether the agent finished, and what its output contained, then scores consistency across many runs to catch flakiness. It is about 4 KB, requires Node 18+, is MIT-licensed, and works in any test runner.

Question 2

How do you test a non-deterministic AI agent?

Accepted Answer

You don't assert the exact output text; you assert on behaviour. Record what the agent did (its actions and final output) and check invariants: that it used only the tools you allowed, stayed within a cost and call budget, actually finished, and that its output contained the real answer. Then run it many times and score the pass rate: one green run isn't a test; a stable pass rate is. A check that throws fails closed: it counts as a failure rather than being silently ignored.
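
The approach above can be sketched without any library. This is a minimal illustration, not agent-eval's actual API: the run shape (`actions`, `output`, `cost`), the helper names `checkRun` and `passRate`, the budget numbers, and the sample runs are all assumptions made for the example.

```javascript
// Hypothetical recorded-run shape (an assumption, not agent-eval's format):
// { actions: [{ tool: "search" }, ...], output: string | null, cost: number }

const ALLOWED = ["search", "read_file"];

// Each check is a named predicate over one recorded run.
const checks = [
  { name: "only allowed tools", fn: (run) => run.actions.every((a) => ALLOWED.includes(a.tool)) },
  { name: "within budget", fn: (run) => run.cost <= 0.05 && run.actions.length <= 5 },
  { name: "finished", fn: (run) => typeof run.output === "string" && run.output.length > 0 },
  // Throws a TypeError when output is null; checkRun turns that into a failure.
  { name: "output contains the answer", fn: (run) => run.output.includes("42") },
];

// Run every check. A check that throws is recorded as a failure (fail closed).
function checkRun(run, checks) {
  const results = checks.map(({ name, fn }) => {
    try {
      return { name, pass: fn(run) === true };
    } catch (err) {
      return { name, pass: false, error: String(err) };
    }
  });
  return { pass: results.every((r) => r.pass), results };
}

// Fraction of recorded runs that pass every check.
function passRate(runs, checks) {
  if (runs.length === 0) return 0;
  const passed = runs.filter((run) => checkRun(run, checks).pass).length;
  return passed / runs.length;
}

const runs = [
  { actions: [{ tool: "search" }], output: "The answer is 42.", cost: 0.01 },
  { actions: [{ tool: "search" }, { tool: "read_file" }], output: "42", cost: 0.02 },
  { actions: [{ tool: "shell" }], output: "The answer is 42.", cost: 0.01 }, // out-of-scope tool
  { actions: [{ tool: "search" }], output: null, cost: 0.01 }, // never finished
];

console.log(passRate(runs, checks)); // 0.5
```

In CI you would compare the pass rate against a threshold you choose, rather than treating a single green run as proof.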

Question 3

How is agent-eval different from agent-guardrails?

Accepted Answer

agent-guardrails is the runtime half — it refuses unsafe actions before they run. agent-eval is the test-time half — it verifies a recorded run after the fact, in CI. Same action shape, same mental model: guardrails stop the agent live; eval proves in your test suite that it behaved.
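
To illustrate the split, here is a minimal sketch in which a runtime guard and a test-time check share one action shape. Neither function is taken from agent-guardrails or agent-eval: `guard`, `verifyRun`, the `{ tool, args }` action shape, and the allowlist are hypothetical names chosen for the example.

```javascript
// Hypothetical shared action shape: { tool: string, args: object }
const ALLOWED_TOOLS = ["search", "read_file"];

// Runtime half: called before an action executes; refuses by throwing.
function guard(action) {
  if (!ALLOWED_TOOLS.includes(action.tool)) {
    throw new Error(`refused: tool "${action.tool}" is not allowed`);
  }
  return action;
}

// Test-time half: inspects a recorded run after the fact and reports violations.
function verifyRun(run) {
  const violations = run.actions.filter((a) => !ALLOWED_TOOLS.includes(a.tool));
  return { pass: violations.length === 0, violations };
}

// Live: the guard blocks the out-of-scope action before it runs.
let refused = false;
try {
  guard({ tool: "shell", args: { cmd: "ls" } });
} catch (err) {
  refused = true;
}

// In CI: the same rule is checked against a recording.
const recorded = { actions: [{ tool: "search", args: { q: "docs" } }], output: "done" };
console.log(refused, verifyRun(recorded).pass); // true true
```

Because both halves read the same action objects, one allowlist can drive the live refusal and the after-the-fact check.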

Built-in checks

Check	What it asserts
usedTools(list)	every listed tool was used at least once
usedOnlyTools(list)	the agent stayed inside an allowlist (no out-of-scope calls)
didNotUseTools(list)	none of these tools were touched
outputContains / Omits / Matches	assert on the final output (substrings, banned strings, RegExp)
withinBudget({cost,calls})	cost and call count stayed within bounds
maxSteps(n)	the agent took no more than n actions
finished(predicate?)	it actually produced an output
custom(name, fn)	your own predicate over (output, run)
judge(name, fn)	LLM-as-judge: you supply the model call, so it stays zero-dependency
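
The judge(name, fn) row relies on the caller passing in the function that talks to a model, which is how the library can avoid bundling a client. A rough sketch of that idea follows; `makeJudge`, `askModel`, the prompt wording, and the PASS/FAIL verdict format are assumptions for illustration and not agent-eval's real signature.

```javascript
// Hypothetical helper: wraps a caller-supplied model call as a named check.
// askModel(prompt) must return a Promise that resolves to the model's text reply.
function makeJudge(name, askModel) {
  return async function judgeCheck(run) {
    try {
      const reply = await askModel(
        `Answer PASS or FAIL. Is this a correct, complete answer?\n\n${run.output}`
      );
      return { name, pass: reply.trim().toUpperCase().startsWith("PASS") };
    } catch (err) {
      // A failed or malformed model call counts as a failure (fail closed).
      return { name, pass: false, error: String(err) };
    }
  };
}

// Stub standing in for a real model client, so the sketch runs offline.
const stubModel = async (prompt) => (prompt.includes("42") ? "PASS" : "FAIL");

const judgeAnswer = makeJudge("answer is correct", stubModel);
judgeAnswer({ output: "The answer is 42." }).then((result) => {
  console.log(result.pass); // true
});
```

Swapping `stubModel` for a function that calls your provider of choice is the only change needed to use a real model; the check itself never imports a client.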
