Evaluation¶
Kwasi has a maintained evaluation system that measures how the agent is actually doing — is intent classified correctly, is the right domain agent selected, and are the right tools called for the right tasks — and catches regressions as prompts and models change. It also closes a feedback loop: real production traffic is mined into new test cases.
Everything lives under eval/. It is built on
pydantic-evals (pinned to match pydantic-ai).
One source of truth, two view layers
The corpus (YAML), the task (the routing / agent-run callable), and the scorer are shared. Only the reporting backend differs: pydantic-evals reports to Logfire Experiments, and a parallel runner reports to Langfuse Experiments. Because both backends use the same scorer, they cannot score the same run differently.
What gets evaluated¶
| Layer | Question it answers | Cost | CI gate |
|---|---|---|---|
| Intent + routing | Right domain classified, right agent selected? | Free (keyword path, no LLM) | ✅ pytest |
| Real-world routing | Does it route your actual phrasings correctly? | Free | ✅ pytest |
| Tool selection | Are the right tools called for the task? | Tokens (runs the real agent) | Manual / nightly |
| Response quality | Did it actually do the right thing? (hard tiers T3–T5) | Cheap (mini-model judge) | Manual / nightly |
The corpus is stratified into tiers (from research/kwasi-eval-framework.md):
T1 single-tool · T2 routing · T3 multi-turn · T4 orchestration · T5 proactive/edge.
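Because case IDs such as T1-01 carry their tier as a prefix, filtering by tier or by a single case is easy to picture. The following is a minimal sketch, assuming every ID follows the `T<n>-<nn>` pattern; `tier_of` and `select_cases` are hypothetical helpers for illustration, not the code in eval/run.py.

```python
from typing import Iterable, List, Optional

# Tier labels as listed in the corpus description.
TIERS = {
    "T1": "single-tool",
    "T2": "routing",
    "T3": "multi-turn",
    "T4": "orchestration",
    "T5": "proactive/edge",
}


def tier_of(case_id: str) -> str:
    """Return the tier prefix of a case ID such as 'T1-01' (assumed ID format)."""
    tier = case_id.split("-", 1)[0]
    if tier not in TIERS:
        raise ValueError(f"unknown tier in case id: {case_id!r}")
    return tier


def select_cases(
    case_ids: Iterable[str],
    tier: Optional[str] = None,
    test: Optional[str] = None,
) -> List[str]:
    """Keep cases matching an exact ID (test) and/or a tier; no filter keeps all."""
    selected = []
    for case_id in case_ids:
        if test is not None and case_id != test:
            continue
        if tier is not None and tier_of(case_id) != tier:
            continue
        selected.append(case_id)
    return selected
```

A filtered pass like this covers only part of the corpus, so its scores are not directly comparable with a full sweep.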
File map¶
eval/
  datasets/
    intent_routing.yaml        # synthetic routing corpus (message → expected categories + agent)
    intent_routing_real.yaml   # MINED + SANITIZED real-usage routing cases (separate signal)
    tool_selection.yaml        # 41 end-to-end scenarios (T1–T5) + read-only To Do cases
  tasks.py                     # classify_and_route: the live routing pipeline, isolated to score
  tool_selection.py            # run_agent_turn (real agent, in-memory DB, no E2B) + score_scenario + ToolSelection
  evaluators.py                # IntentMatch (set-equality + Jaccard), AgentMatch, shared category_scores
  langfuse_experiment.py       # Langfuse Dataset + Experiment runner (both phases)
  mine_traces.py               # trace→dataset miner (with conversational-fragment filter)
  report.py                    # scorecard + JSON artifact
  run.py                       # CLI entry point
tests/
  test_eval_intent_routing.py  # CI gate (synthetic + real-world routing)
  test_eval_tool_selection.py  # CI gate for the scenario scorer (pure)
  test_eval_mine_traces.py     # CI gate for the miner's pure logic
The corpus YAML is hand-maintained data (the original migration from
scripts/run_eval.py was one-time; that runner is retired). Mining drafts are
gitignored.
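The file map describes IntentMatch as set-equality plus Jaccard. The sketch below shows what those two measures compute on a pair of category sets. It is an illustration only: the function names, the returned dictionary shape, and the convention that two empty sets score 1.0 are assumptions, not the contents of evaluators.py.

```python
from typing import Dict, Iterable, Set, Union


def jaccard(expected: Set[str], predicted: Set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|.

    Assumption: two empty sets count as a perfect match (1.0).
    """
    union = expected | predicted
    if not union:
        return 1.0
    return len(expected & predicted) / len(union)


def intent_match(
    expected: Iterable[str], predicted: Iterable[str]
) -> Dict[str, Union[bool, float]]:
    """Score predicted categories: strict set equality plus a graded Jaccard score."""
    exp, pred = set(expected), set(predicted)
    return {"exact": exp == pred, "jaccard": jaccard(exp, pred)}
```

Set equality gives a strict pass/fail, while Jaccard gives partial credit when the classifier finds some but not all of the expected categories.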
Running it¶
# Free, CI-gated routing eval (no LLM/embedding spend)
uv run python -m eval.run # synthetic scorecard
uv run python -m eval.run --dataset eval/datasets/intent_routing_real.yaml
uv run pytest tests/test_eval_intent_routing.py # the gate itself
# Real-agent tool-selection (costs tokens; side-effect-free)
uv run python -m eval.run --phase tools --test T1-01 --judge # one cheap case
uv run python -m eval.run --phase tools --tier T2 --judge # a whole tier
uv run python -m eval.run --phase tools --trace --langfuse --judge # full sweep → BOTH platforms, one run
# Mine real production traffic for new cases (read-only; needs prod env)
railway run uv run python -m eval.mine_traces --limit 500
Useful flags: --trace (→ Logfire), --langfuse (→ Langfuse), --judge
(LLM judge on T3–T5), --tier / --test (filter), --max-concurrency
(default 3, to respect rate limits), --semantic (also score embedding-fallback
routing cases — costs embeddings).
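A concurrency cap such as --max-concurrency is commonly implemented with a semaphore. The sketch below shows that general technique with asyncio; it is not taken from eval/run.py, and `run_bounded` and `run_case` are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Sequence


async def run_bounded(
    cases: Sequence[Any],
    run_case: Callable[[Any], Awaitable[Any]],
    max_concurrency: int = 3,
) -> List[Any]:
    """Run run_case over all cases with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(case: Any) -> Any:
        # Each case waits here until one of the slots is free.
        async with semaphore:
            return await run_case(case)

    # gather preserves input order in its result list.
    return list(await asyncio.gather(*(guarded(case) for case in cases)))
```

A low cap trades wall-clock time for fewer rate-limit errors from the model provider.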
Where results land¶
=== "Logfire"
--trace runs appear under Evals → Datasets / Experiments as Local
datasets named intent_routing / tool_selection, with per-case OTEL traces.
Needs LOGFIRE_TOKEN.
=== "Langfuse"
--langfuse pushes to Datasets kwasi-intent-routing / kwasi-tool-selection
and creates Experiment runs named model · git-sha · timestamp — with
run-over-run comparison plus cost/latency charts, so you can see quality
move across prompt/model changes. Needs LANGFUSE_PUBLIC_KEY /
LANGFUSE_SECRET_KEY.
--trace --langfuse together runs the agent once and populates both natively
(the Logfire run's per-case outputs are reused for the Langfuse experiment — only
the evaluators/judge re-run, not the agent). The default run and the pytest gate
send nothing anywhere (free, cred-free).
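The Langfuse run names described above combine model, git SHA, and timestamp. A minimal sketch of building such a name follows. The 7-character short SHA and the UTC ISO-style timestamp format are assumptions made for illustration; the real runner may format these differently.

```python
from datetime import datetime, timezone


def run_name(model: str, git_sha: str, when: datetime) -> str:
    """Build a 'model · git-sha · timestamp' label for one experiment run.

    Assumptions: the SHA is shortened to 7 characters and the timestamp
    is rendered in UTC as YYYY-MM-DDTHH:MM:SSZ.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return " · ".join([model, git_sha[:7], stamp])
```

Encoding the model and commit in the name makes it possible to tell at a glance which prompt or model change a run belongs to.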
Datasets, runs & experiments — reading the two platforms¶
The vocabulary is shared across both platforms:
| Term | Meaning |
|---|---|
| Dataset | A named bag of items (test cases): input + expected_output + metadata. Stable; you grow it over time. |
| Item | One test case. |
| Run / Experiment | One scored pass of a task over (some of) the items. Immutable, timestamped, tagged model · git-sha · timestamp. Many runs over a dataset = the comparison history. |
"Experiment" and "dataset run" are the same thing — a scored execution.
Langfuse. The Datasets tab holds the items; Experiments holds the
runs. Click a run for per-item outputs + scores; tick two or more runs to
compare side-by-side; the cost/latency charts are per-run aggregates.
Watch out for non-comparable runs — a run over a filtered subset (e.g.
--tier T1) or on a different model isn't apples-to-apples with a full sweep.
Delete stray runs with delete_dataset_run (items are kept).
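The comparability caveat can be made concrete with a small check: two runs are apples-to-apples only when they used the same model and covered the same items. This is a sketch under that assumption; `RunInfo` and `comparable` are hypothetical and not part of the Langfuse SDK.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class RunInfo:
    """Hypothetical summary of one experiment run."""

    name: str
    model: str
    item_ids: FrozenSet[str]


def comparable(a: RunInfo, b: RunInfo) -> bool:
    """True when both runs used the same model over the same set of items."""
    return a.model == b.model and a.item_ids == b.item_ids
```

A run over a --tier T1 subset would fail this check against a full sweep even on the same model, because the item sets differ.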
Logfire. Distinguishes Hosted datasets (managed in Logfire via API/UI,
listed by its datasets API) from Local datasets (defined in your
pydantic-evals code). Code-run experiments do not link to a hosted dataset by
name — they surface as a Local dataset named after the Dataset(name=...)
(intent_routing / tool_selection), each case openable as a full OTEL trace.
This project uses Local datasets; there are no hosted ones.
Cost/latency in unified mode
With --trace --langfuse, the agent runs once via the Logfire path and the
Langfuse run is mirrored from cached outputs — so for those runs, trust
Logfire for cost/latency; Langfuse's numbers reflect only the judge re-run.
The CI sweep uses --langfuse only, so it runs the agent itself and its
Langfuse cost/latency is accurate. Match models across runs (set MODEL_NAME
/ MINI_MODEL_NAME) or the comparison mixes models.
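The unified mode amounts to "run the expensive task once, score the cached outputs for each backend". The sketch below illustrates that pattern with plain dictionaries. `run_once` and `score_cached` are hypothetical names, and the real runner also reports to each platform, which is omitted here.

```python
from typing import Any, Callable, Dict


def run_once(cases: Dict[str, Any], task: Callable[[Any], Any]) -> Dict[str, Any]:
    """Run the (expensive) task exactly once per case and cache outputs by case ID."""
    return {case_id: task(inputs) for case_id, inputs in cases.items()}


def score_cached(
    outputs: Dict[str, Any],
    expected: Dict[str, Any],
    scorer: Callable[[Any, Any], Any],
) -> Dict[str, Any]:
    """Score cached outputs without re-running the task; safe to call per backend."""
    return {
        case_id: scorer(output, expected[case_id])
        for case_id, output in outputs.items()
    }
```

Since the second backend only re-scores cached outputs, any cost or latency it records excludes the agent run itself, which is why those numbers are only trustworthy on the platform that actually ran the agent.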
Automation (CI)¶
Two GitHub Actions workflows (.github/workflows/):
ci.yml — per-PR routing gate (free). Beyond pytest, two labelled steps run
the routing scorecards with explicit thresholds, so the eval is visible and fails on
a routing regression:
- name: Routing eval — synthetic # --min-intent 0.9 --min-agent 0.9
- name: Routing eval — real-world # --dataset …intent_routing_real.yaml --min-agent 0.85
No spend (keyword path, no creds).
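A threshold gate of this kind reduces to comparing each aggregate score against its minimum and failing the step if any falls short. The sketch below assumes scores are fractions keyed by metric name ("intent", "agent"); the `gate` function and those keys are illustrative, not the actual flag handling in eval/run.py.

```python
from typing import Dict, List


def gate(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one failure message per metric that falls below its minimum.

    A metric missing from scores is treated as 0.0 so it cannot pass silently.
    """
    failures = []
    for metric, minimum in thresholds.items():
        actual = scores.get(metric, 0.0)
        if actual < minimum:
            failures.append(f"{metric}: {actual:.2f} < {minimum:.2f}")
    return failures
```

A CLI wrapper could print the failures and exit non-zero when the list is non-empty, which is what makes the CI step go red on a routing regression.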
eval-sweep.yml — tool-selection sweep → Langfuse. Runs the real-agent sweep
weekly (Mon 06:00 UTC) and on demand via workflow_dispatch (a Run workflow
button in the Actions tab, inputs tier / judge / trace). It runs on the
production model and reports to Langfuse; cases are skipped when their requires secrets are
absent. It is not part of per-PR CI because it costs tokens.
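The skip rule for missing secrets can be sketched as a lookup of each required name in the environment. This assumes each case lists the secret names it needs; `missing_secrets` and `should_skip` are hypothetical helpers, and the exact shape of the requires data in the corpus is not shown on this page.

```python
from typing import List, Mapping, Sequence


def missing_secrets(requires: Sequence[str], env: Mapping[str, str]) -> List[str]:
    """Return the names in requires that are unset or empty in env."""
    return [name for name in requires if not env.get(name)]


def should_skip(requires: Sequence[str], env: Mapping[str, str]) -> bool:
    """A case is skipped when at least one of its required secrets is missing."""
    return bool(missing_secrets(requires, env))
```

In a real run, `env` would typically be `os.environ`; passing it in as a parameter keeps the check pure and easy to test.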
Repo secrets (mirrored from Railway) drive the sweep:
| Secret | Purpose |
|---|---|
| GOOGLE_API_KEY, MODEL_NAME, MINI_MODEL_NAME | agent + judge models (match prod) |
| LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST | report to Langfuse |
| LOGFIRE_TOKEN | optional: also create a Logfire experiment (trace: true) |
| OUTLOOK_REFRESH_TOKEN / OUTLOOK_CLIENT_ID / OUTLOOK_CLIENT_SECRET | enable the read-only Microsoft To Do cases |
Adding more per-service secrets (GITHUB_TOKEN, GMAIL_REFRESH_TOKEN,
GOOGLE_MAPS_API_KEY, WEATHERAPI_KEY, TAVILY_API_KEY) widens coverage.
The feedback loop¶
The highest-value test cases come from real usage where routing actually went wrong.
flowchart LR
P[production interactions] --> M[mine_traces.py]
M -->|flags routing disagreements| D[gitignored draft]
D --> H[human review + PII sanitize]
H --> C[intent_routing_real.yaml]
C --> G[CI gate]
C --> R[router fixes]
R --> P
mine_traces.py reads the interactions table, recomputes the keyword route per
message, and flags rows where that route disagrees with the domain implied by the
tools the agent actually used. It filters conversational fragments (follow-ups,
acknowledgments) so only self-contained intents surface.
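The disagreement check can be sketched as follows. Everything named here is an assumption for illustration: the tool-to-domain map, the tool names in the usage note, the acknowledgment list, and the three-word fragment threshold are invented stand-ins, and "disagreement" is modelled as the tool-implied domains not being covered by the keyword route.

```python
from typing import Iterable, Mapping, Set

# Hypothetical stand-in for the conversational-fragment filter.
ACKS = {"ok", "okay", "thanks", "thank you", "yes", "no", "got it"}


def is_fragment(message: str, min_words: int = 3) -> bool:
    """Treat acknowledgments and very short messages as non-self-contained."""
    text = message.strip().lower()
    return text in ACKS or len(text.split()) < min_words


def implied_domains(
    tools_used: Iterable[str], tool_domain: Mapping[str, str]
) -> Set[str]:
    """Domains implied by the tools the agent actually called."""
    return {tool_domain[tool] for tool in tools_used if tool in tool_domain}


def flag_disagreement(
    message: str,
    keyword_route: Iterable[str],
    tools_used: Iterable[str],
    tool_domain: Mapping[str, str],
) -> bool:
    """Flag a row when tool-implied domains are not covered by the keyword route."""
    if is_fragment(message):
        return False
    implied = implied_domains(tools_used, tool_domain)
    return bool(implied) and not implied <= set(keyword_route)
```

For example, with a hypothetical map `{"search_inbox": "email"}`, the message "what's the latest on my emails" routed to `{"news"}` but served by `search_inbox` would be flagged for human review.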
What the first prod run found
Mining 500 real interactions surfaced 4 genuine routing bugs, since fixed. All were naive
substring false-matches in _INTENT_KEYWORDS:
- "pr" / "ci" (github) matched "sprint", "spring", "april", "specific"
- "latest on" (news) matched "latest on my emails"
- "book" (calendar) matched "a book", "facebook", "notebook"
It also revealed the corpus was blind to Microsoft To Do — the dominant real workflow — now covered with read-only scenarios.
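The substring false-matches above are easy to reproduce, and one common mitigation is word-boundary matching. The sketch below contrasts the two. It illustrates the general technique and is not necessarily how the router was fixed; `substring_match` and `word_match` are hypothetical names.

```python
import re


def substring_match(keyword: str, message: str) -> bool:
    """Naive check: the keyword appears anywhere, even inside another word."""
    return keyword in message.lower()


def word_match(keyword: str, message: str) -> bool:
    """Match the keyword only as a whole word or phrase, using \\b boundaries."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, message.lower()) is not None
```

Word boundaries stop "pr" from firing on "sprint" and "book" from firing on "facebook", but they do not help with "latest on my emails" or "a book", where the keyword is a genuine whole word used in a different sense. Those cases need extra context, such as the surrounding words or another domain taking priority.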
Current baselines¶
- Routing: synthetic 29/29 (100%), real-world 7/7 (100%) — free, CI-gated.
- Tool-selection sweep (T1–T5): ~98% assertion pass rate, LLM judge passing on every judged case, 0 E2B sandboxes spawned, 0 rate-limit errors.
Design decisions¶
- PII never enters git. Mining drafts are gitignored; promoted real-world cases are hand-sanitized (names/emails/personal details paraphrased, routing-relevant phrasing preserved).
- No real side effects. The eval nulls e2b_api_key so execute_python / delegation degrade gracefully instead of spawning billed sandboxes, runs against a throwaway in-memory SQLite DB, and excludes mutating Microsoft To Do ops (they would hit the real Graph).
- Substring checks are brittle → the LLM judge is preferred for the hard tiers.
- Real-world cases are a separate dataset so real-world routing accuracy is its own signal, distinct from the synthetic suite.
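The "no real side effects" decision can be pictured as deriving eval-only settings from the production ones. In the sketch below, the `Settings` class, its `db_path` field, and this simplified `execute_python` are hypothetical; only the e2b_api_key name and the in-memory SQLite idea come from the text above.

```python
import sqlite3
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; field names other than e2b_api_key are assumed."""

    e2b_api_key: Optional[str]
    db_path: str


def eval_settings(base: Settings) -> Settings:
    """Derive side-effect-free settings: no sandbox key, throwaway in-memory DB."""
    return replace(base, e2b_api_key=None, db_path=":memory:")


def execute_python(code: str, settings: Settings) -> str:
    """Degrade gracefully when no sandbox key is configured."""
    if settings.e2b_api_key is None:
        return "sandbox unavailable: no E2B key configured"
    raise NotImplementedError("real sandbox call omitted in this sketch")
```

Opening `sqlite3.connect(":memory:")` with these settings gives each run a fresh database that disappears when the connection closes, so nothing written during an eval can leak into real data.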
Known limitations¶
The offline CLI harness can't faithfully reproduce some production-only behaviour;
those cases are marked eval_skip in the corpus with a documented re-enable path:
| Skipped case | Why | Covered by |
|---|---|---|
| T5-01 (destructive-action friction) | friction comes from the Telegram approval gate, which the CLI eval bypasses | tests/test_approval.py |
| T4-06 (read-later relevance injection) | runs in handle_message preprocessing; the eval calls agent.run() directly | — (future harness work) |
| T3-05 (reminder correction) | set_reminder degrades on the CLI interface (needs a Telegram chat) | — |
| T4-04 (task dedup) | dedup is vacuous against the empty in-memory eval DB | — |
Future fidelity work (each re-enables one or more of the above): a gated-interface mode, replicating the relevance-injection preprocessing, and DB state-seeding.
See eval/README.md for the operational quick-reference that ships next to the code.