agent-eval
Assert what your AI agent actually did. Check a recorded run in your tests/CI: did it use the right tools, stay in budget, finish, and produce an output you can assert on? Then score consistency across many runs to catch flakiness.
npm i @paulodevries/agent-evalWhy
Agents are non-deterministic, so "it worked when I tried it" isn't a test. You need to assert on behaviour — which tools were called, whether scope held, what it cost, whether it actually finished — and whether it does all that consistently, not one run in three. agent-eval turns a recorded run into a pass/fail report you can put in CI.
Quickstart
import { evaluate, assertPass, usedOnlyTools, outputContains, withinBudget, finished }
from '@paulodevries/agent-eval';
// A "run" is what your agent loop already produces:
const run = {
actions: [
{ tool: 'search', input: 'opening hours', cost: 0.002 },
{ tool: 'read_file', input: 'hours.md', cost: 0.001 },
],
output: 'Open 08:00–18:00, Mon–Fri. Source: hours.md',
};
const report = await evaluate(run, [
usedOnlyTools(['search', 'read_file']), // stayed in scope
outputContains('08:00'), // got the real answer
withinBudget({ cost: 0.5, calls: 10 }), // affordable
finished(), // actually answered
]);
assertPass(report); // throws in CI if it didn't pass
Built-in checks
| usedTools(list) | every listed tool was used at least once |
| usedOnlyTools(list) | the agent stayed inside an allowlist (no out-of-scope calls) |
| didNotUseTools(list) | none of these tools were touched |
| outputContains / Omits / Matches | assert on the final output (substrings, banned strings, RegExp) |
| withinBudget({cost,calls}) | cost & call count stayed within bounds |
| maxSteps(n) | the agent took no more than n actions |
| finished(predicate?) | it actually produced an output |
| custom(name, fn) | your own predicate over (output, run) |
| judge(name, fn) | LLM-as-judge — you supply the model call, so it stays zero-dependency |
Catch flaky agents with scoreRuns(runs, checks) → pass rate + mean score across many runs. A check that throws fails closed, never silently.
Part of a two-piece reliability toolkit
agent-guardrails →
Refuses unsafe actions before they run — bounded scope, cost caps, deny-lists, secret-blocking.
agent-eval
Verifies a run after the fact — the assertions that make an agent safe to ship in CI.
Same action shape, same mental model. The thinking behind both: Agent reliability is a guardrails problem, not a model problem →. Feedback + PRs welcome.