A LangGraph ReAct agent paired with agentprdiff snapshot tests to demonstrate how to catch behavioral regressions when models, prompts, or tools change.
Tutorial narrative: Build → Record → Break → Catch → Fix
| File | Purpose |
|---|---|
agent.py |
LangGraph ReAct agent with 3 tools: lookup_order, process_refund, check_policy |
suite.py |
12 agentprdiff test cases across 5 suites covering all 10 built-in graders |
AGENTS.md |
Persistent instructions for AI coding agents working in this repo |
.github/workflows/agentprdiff.yml |
CI: runs agentprdiff check on every PR |
- Python 3.11+
- An Anthropic API key (
ANTHROPIC_API_KEY)
# 1. Enter the project
cd video-tutorials/customer_support_agent
# 2. Install dependencies
pip install -r requirements.txt
# 3. Set your API key
echo "ANTHROPIC_API_KEY=sk-ant-..." > .env
# 4. Smoke test the agent manually
python agent.py
# 5. Record baselines (run once on the known-good version)
agentprdiff record suite.py
# 6. Check for regressions
agentprdiff check suite.pyOpen agent.py. The agent is a standard LangGraph StateGraph with three nodes:
HumanMessage → [agent] → [tools] → [agent] → ... → AIMessage
Three mock tools simulate a real backend (no real database needed):
lookup_order(order_id)— returns order status, item, category, amountprocess_refund(order_id, reason)— approves refunds on delivered orders onlycheck_policy(category)— returns the return/refund policy for an item category
python agent.pyYou'll see three queries answered: order status, refund request, and policy lookup.
Open suite.py. It contains 5 suites covering distinct behavior categories:
| Suite | Cases | Key graders demonstrated |
|---|---|---|
refund_flow |
3 | tool_sequence, no_tool_called, regex_match, semantic |
policy_queries |
3 | tool_called, output_length_lt, contains_any |
order_status |
2 | no_tool_called (agent doesn't over-call) |
multi_step_reasoning |
2 | tool_sequence (3 steps), cost_lt_usd |
out_of_scope |
2 | no_tool_called (all tools), graceful fallback |
All 10 agentprdiff graders are used:
contains · contains_any · regex_match · tool_called · tool_sequence ·
no_tool_called · output_length_lt · latency_lt_ms · cost_lt_usd · semantic
agentprdiff record suite.pyThis runs every case once and writes JSON snapshots to .agentprdiff/baselines/.
Commit these files — they are the "known good" reference for CI.
git add .agentprdiff/baselines/
git commit -m "chore: record initial agentprdiff baselines"Swap the model to an older, less capable one:
export ANTHROPIC_MODEL=claude-3-haiku-20240307Now run the check:
agentprdiff check suite.pyYou'll see failures like:
FAIL refund_flow / refund_happy_path
tool_sequence(["lookup_order", "process_refund"]) — FAILED
actual sequence: ["lookup_order"] ← haiku skipped the refund step
FAIL multi_step_reasoning / full_refund_journey
semantic(...) — FAILED
judge: "agent acknowledged the issue but did not process the refund"
This is the core value of agentprdiff: a model swap that looks safe silently changes behavior.
Option A — Fix the regression (revert the model swap):
export ANTHROPIC_MODEL=claude-3-5-haiku-20241022
agentprdiff check suite.py # passes againOption B — Accept the new behavior (intentional change):
agentprdiff record suite.py
git add .agentprdiff/baselines/
git commit -m "chore: update baselines after model change"
# Write a ## Behavior Change section in your PR descriptionEvery PR that touches this directory triggers the GitHub Actions workflow.
If agentprdiff check exits non-zero, the PR is blocked. Reviewers see
the baseline diff in the uploaded artifact.
| Grader | What it checks | Example in suite.py |
|---|---|---|
contains(text) |
Output contains substring | contains("refund") |
contains_any([...]) |
Output contains at least one substring | contains_any(["30 days", "30-day"]) |
regex_match(pattern) |
Output matches regex | regex_match(r"REF-\d+") |
tool_called(name) |
Tool was called at least once | tool_called("lookup_order") |
tool_sequence([...]) |
Tools were called in this exact order | tool_sequence(["lookup_order", "process_refund"]) |
no_tool_called(name) |
Tool was never called | no_tool_called("process_refund") |
output_length_lt(n) |
Output is fewer than n characters | output_length_lt(400) |
latency_lt_ms(ms) |
End-to-end latency under budget | latency_lt_ms(10_000) |
cost_lt_usd(usd) |
Token cost under budget | cost_lt_usd(0.05) |
semantic(description) |
LLM-as-judge checks intent | semantic("agent confirms refund approved") |
See AGENTS.md for the full set of rules AI coding agents must follow when
modifying this project — including when to re-record baselines, code style,
and what they must never touch.