Guard your LLM agents in CI. Snapshot tests that catch behavioral regressions when models, prompts, or vendors change.
📚 Documentation: agentprdiff.dev · ⚡ Quickstart · 🤖 AI-agent adoption · 📦 PyPI
You upgraded Claude. You tweaked a system prompt. You swapped `gpt-4o` for `gpt-4o-mini` in the cheap path. Which of your agent's behaviors just changed? `agentprdiff` tells you, before the PR merges.
pip install agentprdiff

Don't have Python 3.10+ yet? Step-by-step install instructions for macOS, Windows, and Linux.

Multiple Python versions on your machine? If `pip install` reports `No matching distribution found` even after installing Python 3.10+, use `python3.12 -m pip install agentprdiff` (substitute your installed 3.10+ binary). This sidesteps `$PATH` confusion when Homebrew's Python and the system Python coexist. Full troubleshooting: Installation guide.
Adopting with an AI coding agent? Point Claude Code, Cursor, Aider, or any agentic IDE at `AGENTS.md`, a step-by-step adoption playbook the agent reads directly. Humans driving the adoption: see `docs/ai-driven-adoption.md` for copy-paste prompt templates. The canonical file layout (what's mandatory, what's recommended, what's optional) is at `docs/suite-layout.md`.
Unit tests assume determinism. Agents aren't deterministic, but they do have behaviors you rely on — a specific tool gets called, a refund amount is quoted, a latency budget is respected, a safety guardrail fires. When a model or prompt changes, those behaviors drift. Today most teams find out in production.
agentprdiff turns those behaviors into versioned, diffable baselines you check into git, and a CI command that fails the build when they regress.
It is not a framework. Your agent stays exactly the way it is. agentprdiff records what it did, lets you assert what should be true about what it did, and compares runs across time.
# suite.py
from agentprdiff import case, suite
from agentprdiff.graders import contains, tool_called, latency_lt_ms, semantic

from my_agent import run  # your agent, unchanged

support = suite(
    name="customer_support",
    agent=run,
    cases=[
        case(
            name="refund_happy_path",
            input="I want a refund for order #1234",
            expect=[
                contains("refund"),
                tool_called("lookup_order"),
                semantic("agent acknowledges the refund and explains the timeline"),
                latency_lt_ms(10_000),
            ],
        ),
    ],
)

agentprdiff init
agentprdiff record suite.py   # save this run as the baseline
agentprdiff check suite.py    # in CI: diff vs baseline, exit 1 on regression

That's the whole product. Five CLI commands (`init`, `record`, `check`, `review`, `scaffold`). One Python file. Zero framework lock-in.
- Case + Suite model: tiny, opinionated, no magic.
- 10 batteries-included graders: `contains`, `contains_any`, `regex_match`, `tool_called`, `tool_sequence`, `no_tool_called`, `output_length_lt`, `latency_lt_ms`, `cost_lt_usd`, `semantic` (LLM-as-judge with pluggable backend).
- Baseline store: JSON files under `.agentprdiff/baselines/`, meant to be committed. Reviewers see trace changes in pull requests.
- Diff engine: per-case `TraceDelta` with assertion pass/fail changes, cost delta, latency delta, tool-sequence changes, and a unified output diff.
- CI-ready CLI: exit 1 on regression, `--json-out` for artifact archiving, Rich-formatted terminal output.
- Zero SDK lock-in: works with OpenAI, Anthropic, Gemini, Bedrock, LangChain, LangGraph, LlamaIndex, Vercel AI SDK, and custom wrappers. If you can wrap your agent in a function, `agentprdiff` can test it.
- One-line SDK adapters: `with instrument_client(client) as trace:` automatically records every LLM and tool call when you're on the OpenAI Python SDK (sync or async; `AsyncOpenAI` is supported by the same context manager), any OpenAI-compatible provider (Groq / Gemini / OpenRouter / Ollama / vLLM / Together / Fireworks / DeepInfra), or the Anthropic SDK. No manual `Trace` wiring required.
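To make the diff-engine bullet concrete, here is a minimal sketch of what a per-case delta could contain. It is not agentprdiff's implementation: `RunSummary`, `diff_runs`, and their fields are hypothetical names invented for illustration, and only the standard library is used.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class RunSummary:
    """Hypothetical, simplified record of one agent run (not the library's Trace)."""
    output: str
    cost_usd: float
    latency_ms: float
    tools: list[str] = field(default_factory=list)


def diff_runs(baseline: RunSummary, current: RunSummary) -> dict:
    """Compare two runs and report the deltas a reviewer would care about."""
    output_diff = list(
        difflib.unified_diff(
            baseline.output.splitlines(),
            current.output.splitlines(),
            fromfile="baseline",
            tofile="current",
            lineterm="",
        )
    )
    return {
        "cost_delta_usd": round(current.cost_usd - baseline.cost_usd, 6),
        "latency_delta_ms": current.latency_ms - baseline.latency_ms,
        "tools_changed": baseline.tools != current.tools,
        "output_diff": output_diff,
    }


baseline = RunSummary("Your refund is on its way.", 0.0012, 340, ["lookup_order"])
current = RunSummary("I cannot help with that.", 0.0009, 290, [])
delta = diff_runs(baseline, current)
print("\n".join(delta["output_diff"]))
```

A cheaper, faster run is not automatically a better one: here cost and latency both drop, yet the tool call disappears and the output changes, which is exactly the kind of drift a reviewer wants surfaced.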
| | Unit tests | LLM-as-judge eval | agentprdiff |
|---|---|---|---|
| Deterministic pass/fail | yes | no | yes (when assertions are deterministic) |
| Catches behavioral drift | no | yes | yes |
| Runs in CI on every PR | yes | too expensive | yes |
| Human-readable diff of what changed | n/a | rare | yes |
| Works without API keys | yes | no | yes (deterministic graders + fake judge) |
The value is in the combination: deterministic assertions for the 80% of behaviors you can encode as rules ("this tool was called", "this word appeared", "cost stayed under $0.02"), plus a semantic grader for the 20% that need a judge — with a fake-judge fallback so your CI stays green and free when API keys aren't available.
- Write a `Suite` alongside your agent code.
- Run `agentprdiff record` once on a known-good version. Commit the resulting `.agentprdiff/baselines/` directory.
- In CI, on every PR, run `agentprdiff check`. If any assertion regresses, or cost/latency budgets are breached, the job fails.
- When behavior intentionally changes, the PR author re-runs `agentprdiff record`, commits the new baseline, and explains the change in the PR description. Reviewers see the before/after in the diff.
This is the same loop as Jest snapshot tests or VCR cassettes — applied to LLM agents.
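The record/check loop can be sketched in a few lines of plain Python. This is an illustration of the snapshot idea only, under assumed names: the `record` and `check` functions, the one-JSON-file-per-case layout, and the `assertions` dictionary are hypothetical and do not mirror agentprdiff's actual storage format.

```python
import json
import tempfile
from pathlib import Path


def record(baseline_dir: Path, case_name: str, result: dict) -> Path:
    """Overwrite the baseline file for one case (hypothetical layout)."""
    baseline_dir.mkdir(parents=True, exist_ok=True)
    path = baseline_dir / f"{case_name}.json"
    path.write_text(json.dumps(result, indent=2, sort_keys=True))
    return path


def check(baseline_dir: Path, case_name: str, result: dict) -> int:
    """Return a CI-style exit code: 0 if nothing regressed, 1 otherwise."""
    baseline = json.loads((baseline_dir / f"{case_name}.json").read_text())
    regressed = [
        name
        for name, passed in baseline["assertions"].items()
        if passed and not result["assertions"].get(name, False)
    ]
    for name in regressed:
        print(f"REGRESSION {case_name}: {name} passed in baseline, fails now")
    return 1 if regressed else 0


with tempfile.TemporaryDirectory() as tmp:
    baselines = Path(tmp) / "baselines"
    good = {"assertions": {"contains_refund": True, "tool_called_lookup_order": True}}
    bad = {"assertions": {"contains_refund": False, "tool_called_lookup_order": True}}
    record(baselines, "refund_happy_path", good)
    exit_ok = check(baselines, "refund_happy_path", good)
    exit_regressed = check(baselines, "refund_happy_path", bad)
```

Because the baseline is a sorted, indented JSON file, re-recording after an intentional change produces a small, readable git diff.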
agentprdiff doesn't read your agent's API key — your agent does, through whatever env var it already uses. Set that locally (in .env, your shell, direnv, whatever) and as a GitHub Actions secret in CI. The scaffold's workflow YAML has the right shape; you fill in the env var name to match your agent.
The semantic() grader is the one piece of agentprdiff that can use an API key directly — for the LLM judge. Without one, it silently falls back to keyword matching. Set ANTHROPIC_API_KEY (cheaper) or OPENAI_API_KEY if you want a real judge in CI; leave both unset to keep CI free with fake_judge.
See AGENTS.md → API keys for the full setup (local options, CI secrets, what never to do).
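As a rough sketch of the fallback described above, the snippet below picks a judge backend from the environment and includes a crude keyword matcher for the offline case. Everything here is an assumption for illustration: `pick_judge` and `keyword_judge` are hypothetical names, the precedence of `ANTHROPIC_API_KEY` over `OPENAI_API_KEY` is a guess, and the keyword heuristic is not claimed to match what `fake_judge` actually does.

```python
import os
from collections.abc import Mapping


def keyword_judge(rubric: str, output: str) -> bool:
    """Crude offline stand-in: pass if any longer rubric word appears in the output."""
    keywords = {w.strip(".,").lower() for w in rubric.split() if len(w) > 5}
    return any(k in output.lower() for k in keywords)


def pick_judge(env: Mapping[str, str] = os.environ) -> str:
    """Name the backend to use, based on which key is set (assumed precedence)."""
    if env.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if env.get("OPENAI_API_KEY"):
        return "openai"
    return "fake"


rubric = "agent acknowledges the refund and explains the timeline"
verdict_hit = keyword_judge(rubric, "Your refund will arrive in 5 days.")
verdict_miss = keyword_judge(rubric, "I cannot help with that.")
backend_without_keys = pick_judge({})
```

A keyword fallback is far weaker than a real judge, so it is worth treating a green CI run without API keys as a smoke test, not as a semantic verdict.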
A common first-day question. Short version:
- `record`: overwrites baselines in place. Re-recording an intentional change shows up as a regular git diff in your PR; that's the review surface.
- `check`: creates a new timestamped directory under `.agentprdiff/runs/` on every invocation. It's gitignored by default, so it never reaches CI; clean local history any time with `rm -rf .agentprdiff/runs/`. `--json-out PATH` overwrites a single file at PATH.
- `review`: same comparison as `check`, but renders one verbose panel per case and always exits 0. Designed for local iteration loops; not meant for CI. Writes to the same `.agentprdiff/runs/` directory.
- `scaffold`: never overwrites. Skips files that already exist (`[skip]`) and writes the rest.
- `init`: idempotent; running it twice does nothing the second time.
See AGENTS.md → Rerun semantics for examples.
Skip the copy-paste from AGENTS.md:
agentprdiff scaffold ai_content_summary --recipe sync-openai

Writes the canonical layout (`suites/__init__.py`, `_eval_agent.py`, `_stubs.py`, `<name>.py`, `<name>_cases.md`, `suites/README.md`, and `.github/workflows/agentprdiff.yml`) with TODO markers where you wire in your agent. The `<name>_cases.md` file is a case dossier: reviewer-facing prose with one block per case (what it tests, input, assertions in plain English, file:line references to production code, and the application impact if the case regresses). Three recipes:

- `sync-openai` (default): uses `instrument_client` from the OpenAI adapter with a sync `OpenAI()` client.
- `async-openai`: same `instrument_client`, paired with an `asyncio.run` bridge so an `AsyncOpenAI` agent works with agentprdiff's sync runner. The adapter detects the async client at entry; there is no separate API.
- `stubbed`: substitutes a single LLM helper instead of the SDK client. Best for summarization / classification / embedding-prep agents; see `docs/adapters.md`.
The generated workflow includes permissions: contents: read so GHAS doesn't flag it. Pre-existing files are never overwritten.
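The "never overwrites" guarantee is easy to picture as a skip-if-exists writer. The sketch below is illustrative only: `write_if_missing` is a hypothetical helper, and the `[write]` log label is invented here (only `[skip]` appears in the tool's described output).

```python
import tempfile
from pathlib import Path


def write_if_missing(root: Path, files: dict[str, str]) -> list[str]:
    """Create each file unless it already exists; report what happened."""
    log = []
    for rel, content in files.items():
        target = root / rel
        if target.exists():
            log.append(f"[skip] {rel}")  # leave hand-edited files alone
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        log.append(f"[write] {rel}")
    return log


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    files = {"suites/__init__.py": "", "suites/demo.py": "# TODO: wire in your agent\n"}
    first = write_if_missing(root, files)
    (root / "suites/demo.py").write_text("# edited by hand\n")
    second = write_if_missing(root, files)  # second run touches nothing
    preserved = (root / "suites/demo.py").read_text()
```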
You have two paths. Most agents need the first.
If your agent uses the OpenAI Python SDK — sync OpenAI or async AsyncOpenAI, including any OpenAI-compatible provider (Groq, Gemini, OpenRouter, Ollama, vLLM, Together, Fireworks, DeepInfra) — or the Anthropic SDK, the SDK adapter captures every model and tool call automatically:
from openai import OpenAI
from agentprdiff.adapters.openai import instrument_client, instrument_tools

TOOL_MAP = {"lookup_order": lookup_order, "send_email": send_email}

def my_agent(query: str):
    client = OpenAI()
    with instrument_client(client) as trace:
        tools = instrument_tools(TOOL_MAP, trace)
        # ... your existing tool-calling loop, untouched ...
        # the only swap: TOOL_MAP[fn](**args) → tools[fn](**args)
        return final_text, trace

For `AsyncOpenAI`, the same `instrument_client` works: it inspects `client.chat.completions.create` at entry and installs an awaitable patched method when the underlying one is `async def`. `instrument_tools` mirrors this per tool: `async def` tools come back awaitable, sync tools stay sync. The `with` block is still a regular `with`:
import asyncio
from openai import AsyncOpenAI
from agentprdiff.adapters.openai import instrument_client, instrument_tools

async def my_agent_async(query: str):
    client = AsyncOpenAI()
    with instrument_client(client) as trace:
        tools = instrument_tools(TOOL_MAP, trace)
        response = await client.chat.completions.create(...)
        # ... await tools[name](**args) for async tools, tools[name](**args) for sync ...
        return final_text, trace

def my_agent(query: str):
    return asyncio.run(my_agent_async(query))

The patch is scoped to the specific client instance and reversed when the `with` block exits; no global SDK state is touched. Anthropic adopters use `agentprdiff.adapters.anthropic` with the same shape (sync clients today; async Anthropic is on the roadmap).
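The instance-scoped, reversible patch described above follows a general pattern that can be sketched without any SDK. The code below is not the adapter's source: `record_calls`, `FakeClient`, and `FakeAsyncClient` are hypothetical names, and the real adapter targets `client.chat.completions.create` rather than an arbitrary method. It only shows how a context manager can shadow one method on one object, pick a sync or async wrapper by inspecting the original, and restore it on exit.

```python
import asyncio
import inspect
from contextlib import contextmanager


@contextmanager
def record_calls(obj, method_name: str):
    """Patch one method on one instance, log its calls, and undo the patch on exit."""
    original = getattr(obj, method_name)
    had_own_attr = method_name in vars(obj)
    calls = []

    if inspect.iscoroutinefunction(original):
        async def patched(*args, **kwargs):
            result = await original(*args, **kwargs)
            calls.append({"args": args, "kwargs": kwargs, "result": result})
            return result
    else:
        def patched(*args, **kwargs):
            result = original(*args, **kwargs)
            calls.append({"args": args, "kwargs": kwargs, "result": result})
            return result

    setattr(obj, method_name, patched)  # instance attribute shadows the class method
    try:
        yield calls
    finally:
        if had_own_attr:
            setattr(obj, method_name, original)
        else:
            delattr(obj, method_name)  # class method becomes visible again


class FakeClient:
    def create(self, prompt: str) -> str:
        return prompt.upper()


class FakeAsyncClient:
    async def create(self, prompt: str) -> str:
        return prompt.upper()


client, other = FakeClient(), FakeClient()
with record_calls(client, "create") as calls:
    client.create("hi")
    other.create("untouched")  # a different instance is not recorded


async def main():
    aclient = FakeAsyncClient()
    with record_calls(aclient, "create") as acalls:  # still a regular `with`
        await aclient.create("hi")
    return acalls


async_calls = asyncio.run(main())
```

Patching the instance instead of the class is what keeps other clients in the same process unaffected.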
See docs/adapters.md for the full reference, including pricing overrides, custom provider tags, and recipes for nested agents.
If you're not on either SDK, or you want full control, build the Trace yourself — agentprdiff doesn't require any monkey-patching:
from agentprdiff import Trace, LLMCall, ToolCall

def my_agent(query: str) -> tuple[str, Trace]:
    trace = Trace(suite_name="", case_name="", input=query)
    # ... call your model, record what happened ...
    trace.record_llm_call(LLMCall(
        provider="anthropic",
        model="claude-sonnet-4-6",
        prompt_tokens=120, completion_tokens=80,
        cost_usd=0.0012, latency_ms=340,
    ))
    # ... call a tool, record what happened ...
    trace.record_tool_call(ToolCall(name="lookup_order", arguments={"id": "1234"}))
    return final_output, trace

Agents that return just an output still work: agentprdiff wraps them and captures wall-clock latency. You can backfill richer instrumentation incrementally, assertion by assertion.
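Wrapping an output-only agent to capture wall-clock latency could look roughly like the sketch below. It is an assumption-laden illustration, not the library's wrapper: `with_latency` is a hypothetical name, the returned dictionary stands in for a real `Trace`, and the tuple check is a simplification.

```python
import time
from typing import Any, Callable


def with_latency(agent: Callable[[str], Any]) -> Callable[[str], tuple]:
    """Wrap an agent so each call also yields a minimal timing record."""
    def wrapped(query: str) -> tuple:
        start = time.perf_counter()
        result = agent(query)
        elapsed_ms = (time.perf_counter() - start) * 1000
        # If the agent already returned (output, trace), pass it through unchanged.
        if isinstance(result, tuple) and len(result) == 2:
            return result
        return result, {"input": query, "latency_ms": elapsed_ms}
    return wrapped


def echo_agent(query: str) -> str:
    """Stand-in for an agent that returns only its final output."""
    return f"echo: {query}"


output, info = with_latency(echo_agent)("hello")
```

`time.perf_counter()` is used because it is monotonic, so the measurement is not affected by system clock adjustments.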
# .github/workflows/agents.yml
name: agent-regression
on: [pull_request]
permissions:
  contents: read   # least-privilege; GHAS flags workflows without this.
jobs:
  agentprdiff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.11" }
      - run: pip install -e ".[dev]"
      - run: agentprdiff check suites/*.py --json-out artifacts/agentprdiff.json
      - uses: actions/upload-artifact@v4
        if: always()
        with: { name: agentprdiff, path: artifacts/ }

If you use `--json-out artifacts/...`, add `artifacts/agentprdiff*.json` (or the broader `artifacts/`) to your project's `.gitignore`. The CI artifact upload doesn't prevent a contributor from accidentally running `git add` on it locally.
See docs/ci-integration.md for GitLab, CircleCI, and Buildkite.
A runnable end-to-end demo, no API keys needed:
git clone https://github.com/vnageshwaran-de/agentprdiff
cd agentprdiff
pip install -e ".[dev]"
cd examples/quickstart
agentprdiff init
agentprdiff record suite.py
agentprdiff check suite.py # exit 0
# now break the agent and watch agentprdiff catch it
sed -i "s/refund/noundr/g" agent.py   # GNU sed; on macOS/BSD use: sed -i '' "s/refund/noundr/g" agent.py
agentprdiff check suite.py   # exit 1; see the diff

Iterating on a single failing case shouldn't require commenting out the rest. `record`, `check`, and `review` all accept `--case` and `--skip` for narrowing a run:
# Discover what's available.
agentprdiff check suite.py --list
# Single case (case-insensitive substring).
agentprdiff check suite.py --case refund_happy_path
# Glob across cases.
agentprdiff check suite.py --case "*order*"
# Multiple patterns (repeated flag or comma-separated).
agentprdiff check suite.py --case refund --case policy
agentprdiff check suite.py --case refund,policy
# Everything except slow cases.
agentprdiff check suite.py --skip slow
agentprdiff check suite.py --case ~slow # equivalent
# Qualify by suite when names collide across suites.
agentprdiff check suite.py --case "billing:refund*"

A filter that matches zero cases exits 2 and prints the available case names; `--list` is the discoverable counterpart. The selection summary (`running 2 of 4 cases in <suite>: ...`) is printed before each suite runs so a partial match is never silent.
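The filter rules shown in the commands above can be approximated with `fnmatch`. This is a sketch of the described behavior under assumed semantics, not the CLI's actual matcher: `matches` and `select` are hypothetical names, and details such as how suite qualifiers interact with negation are guesses.

```python
from fnmatch import fnmatchcase


def _split(patterns):
    """Expand comma-separated values so 'a,b' behaves like two flags."""
    return [p.strip() for raw in patterns for p in raw.split(",") if p.strip()]


def matches(pattern: str, suite: str, case: str) -> bool:
    """Case-insensitive match: glob if the pattern has wildcards, else substring."""
    pattern, suite, case = pattern.lower(), suite.lower(), case.lower()
    if ":" in pattern:  # optional "suite:case" qualifier
        suite_pattern, pattern = pattern.split(":", 1)
        if not fnmatchcase(suite, suite_pattern):
            return False
    if any(ch in pattern for ch in "*?["):
        return fnmatchcase(case, pattern)
    return pattern in case


def select(cases, include=(), skip=()):
    """Filter (suite, case) pairs; a '~pat' include is treated like a skip."""
    include, skip = _split(include), _split(skip)
    skip += [p[1:] for p in include if p.startswith("~")]
    include = [p for p in include if not p.startswith("~")]
    return [
        (suite, case)
        for suite, case in cases
        if (not include or any(matches(p, suite, case) for p in include))
        and not any(matches(p, suite, case) for p in skip)
    ]


CASES = [
    ("billing", "refund_happy_path"),
    ("billing", "refund_slow_path"),
    ("support", "policy_question"),
    ("support", "order_lookup"),
]
picked = select(CASES, ["refund,policy"])
not_slow = select(CASES, ["~slow"])
nothing = select(CASES, ["nope"])  # a real CLI would exit 2 here
```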
agentprdiff check is built for CI: a compact summary table and exit 1 on regression. While you're iterating on a single case, that's the wrong shape — you want to see everything about that one case, and you don't want your shell going red between every keystroke. That's agentprdiff review:
# Verbose per-case panel: input, every assertion's was→now verdict,
# cost/latency/token deltas, tool-sequence diff, output diff.
agentprdiff review suite.py --case refund_happy_path
# Same filter syntax as check / record — globs, negation, multi-pattern.
agentprdiff review suite.py --case "*refund*"
agentprdiff review suite.py --skip slow

`review` runs the same comparison `check` does (and writes to the same `.agentprdiff/runs/` directory) but always exits 0, even on regression, so it sits cleanly inside watcher loops (`entr`, `watchexec`, `fzf` previews). Use `check` when you want CI's exit semantics locally; reach for `review` while you're working. Think `pytest -k`.
agentprdiff is alpha (0.2.x). The core model, CLI, and OpenAI / Anthropic SDK adapters are stable. The OpenAI adapter covers both sync OpenAI and async AsyncOpenAI clients via the same instrument_client context manager. Async Anthropic, LangChain/LangGraph adapters, and a JS companion package for the Vercel AI SDK are on the 0.3 roadmap. See CHANGELOG.md.
Feedback, bug reports, and PRs extremely welcome. Open an issue or @ me.
MIT. See LICENSE.