Paulo de Vries · Open source · MIT · zero dependencies

agent-eval

Assert what your AI agent actually did. Check a recorded run in your tests/CI: did it use the right tools, stay in budget, finish, and produce an output you can assert on? Then score consistency across many runs to catch flakiness.

npm i @paulodevries/agent-eval

~4 KB · Node 18+ · works in any test runner · shipping to npm + GitHub now

Why

Agents are non-deterministic, so "it worked when I tried it" isn't a test. You need to assert on behaviour — which tools were called, whether scope held, what it cost, whether it actually finished — and whether it does all that consistently, not one run in three. agent-eval turns a recorded run into a pass/fail report you can put in CI.

Quickstart

import { evaluate, assertPass, usedOnlyTools, outputContains, withinBudget, finished }
  from '@paulodevries/agent-eval';

// A "run" is what your agent loop already produces:
const run = {
  actions: [
    { tool: 'search', input: 'opening hours', cost: 0.002 },
    { tool: 'read_file', input: 'hours.md', cost: 0.001 },
  ],
  output: 'Open 08:00–18:00, Mon–Fri. Source: hours.md',
};

const report = await evaluate(run, [
  usedOnlyTools(['search', 'read_file']), // stayed in scope
  outputContains('08:00'),                // got the real answer
  withinBudget({ cost: 0.5, calls: 10 }), // affordable
  finished(),                             // actually answered
]);

assertPass(report); // throws in CI if it didn't pass

Built-in checks

usedTools(list)                   every listed tool was used at least once
usedOnlyTools(list)               the agent stayed inside an allowlist (no out-of-scope calls)
didNotUseTools(list)              none of these tools were touched
outputContains / Omits / Matches  assert on the final output (substrings, banned strings, RegExp)
withinBudget({cost,calls})        cost & call count stayed within bounds
maxSteps(n)                       the agent took no more than n actions
finished(predicate?)              it actually produced an output
custom(name, fn)                  your own predicate over (output, run)
judge(name, fn)                   LLM-as-judge: you supply the model call, so it stays zero-dependency
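To make the idea of a check concrete, here is a minimal sketch of what an allowlist check and a budget check could compute over the run shape from the Quickstart. It is not agent-eval's source: the names sketchUsedOnlyTools and sketchWithinBudget, and the { name, pass } result shape, are assumptions made purely for illustration.

```javascript
// Hypothetical stand-ins for illustration only, NOT the library's code.
// Assumption: a check is a function (run) => { name, pass }.
const run = {
  actions: [
    { tool: 'search', input: 'opening hours', cost: 0.002 },
    { tool: 'read_file', input: 'hours.md', cost: 0.001 },
  ],
  output: 'Open 08:00–18:00, Mon–Fri. Source: hours.md',
};

// Allowlist: every action's tool must appear in the allowed list.
function sketchUsedOnlyTools(allowed) {
  const allow = new Set(allowed);
  return (run) => ({
    name: 'usedOnlyTools',
    pass: run.actions.every((action) => allow.has(action.tool)),
  });
}

// Budget: summed cost and number of calls must both stay within bounds.
function sketchWithinBudget({ cost, calls }) {
  return (run) => {
    const totalCost = run.actions.reduce((sum, a) => sum + (a.cost ?? 0), 0);
    return {
      name: 'withinBudget',
      pass: totalCost <= cost && run.actions.length <= calls,
    };
  };
}

// Apply each check to the recorded run and collect the results.
const results = [
  sketchUsedOnlyTools(['search', 'read_file']),
  sketchWithinBudget({ cost: 0.5, calls: 10 }),
].map((check) => check(run));

console.log(results);
```

Both sketch checks read only the recorded run, which is why this style of assertion can run in CI without calling a model again.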

Catch flaky agents with scoreRuns(runs, checks), which reports the pass rate and mean score across many runs. A check that throws fails closed: the error counts as a failure, never as a silent pass.
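The consistency scoring described here can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: sketchScoreRuns is a hypothetical name, checks are simplified to synchronous (run) => boolean functions, and a run's "score" is assumed to be the fraction of checks it passed.

```javascript
// Hypothetical sketch, NOT agent-eval's scoreRuns.
// Assumptions: checks are synchronous (run) => boolean functions, and a
// run's score is the fraction of checks that passed.
function sketchScoreRuns(runs, checks) {
  const perRun = runs.map((run) => {
    const passed = checks.filter((check) => {
      try {
        return check(run) === true;
      } catch {
        return false; // fail closed: a throwing check counts as a failure
      }
    }).length;
    return {
      pass: passed === checks.length,
      score: checks.length ? passed / checks.length : 0,
    };
  });

  const total = runs.length;
  return {
    passRate: total ? perRun.filter((r) => r.pass).length / total : 0,
    meanScore: total ? perRun.reduce((sum, r) => sum + r.score, 0) / total : 0,
  };
}

// Three recorded runs of the same task: one clean, one with no output,
// and one that called an out-of-scope tool.
const runs = [
  { actions: [{ tool: 'search' }], output: 'Open 08:00' },
  { actions: [{ tool: 'search' }], output: undefined },
  { actions: [{ tool: 'delete_file' }], output: 'Open 08:00' },
];

const checks = [
  (run) => typeof run.output === 'string' && run.output.length > 0,
  (run) => run.output.includes('08:00'), // throws when output is undefined
  (run) => run.actions.every((a) => a.tool === 'search'),
];

const summary = sketchScoreRuns(runs, checks);
console.log(summary);
```

In this sample only the first run passes every check, so the pass rate is about one in three, while the mean score is higher because the other two runs still pass some checks. The gap between the two numbers is what separates an agent that is occasionally wrong from one that is broadly off.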

Part of a two-piece reliability toolkit

Runtime

agent-guardrails

Refuses unsafe actions before they run — bounded scope, cost caps, deny-lists, secret-blocking.

Test-time · you are here

agent-eval

Verifies a run after the fact — the assertions that make an agent safe to ship in CI.

Same action shape, same mental model. The thinking behind both: "Agent reliability is a guardrails problem, not a model problem". Feedback + PRs welcome.