Evaluation

Kwasi has a maintained evaluation system that measures how the agent is actually doing — is intent classified correctly, is the right domain agent selected, and are the right tools called for the right tasks — and catches regressions as prompts and models change. It also closes a feedback loop: real production traffic is mined into new test cases.

Everything lives under eval/. It is built on pydantic-evals (pinned to match pydantic-ai).

One source of truth, two view layers

The corpus (YAML), the task (the routing / agent-run callable), and the scorer are shared. Only the reporting backend differs — pydantic-evals → Logfire Experiments, and a parallel runner → Langfuse Experiments. The two can never score the same run differently.


What gets evaluated

| Layer | Question it answers | Cost | CI gate |
| --- | --- | --- | --- |
| Intent + routing | Right domain classified, right agent selected? | Free (keyword path, no LLM) | ✅ pytest |
| Real-world routing | Does it route your actual phrasings correctly? | Free | ✅ pytest |
| Tool selection | Are the right tools called for the task? | Tokens (runs the real agent) | Manual / nightly |
| Response quality | Did it actually do the right thing? (hard tiers T3–T5) | Cheap (mini-model judge) | Manual / nightly |

The corpus is stratified into tiers (from research/kwasi-eval-framework.md): T1 single-tool · T2 routing · T3 multi-turn · T4 orchestration · T5 proactive/edge.
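The tier scheme can be sketched as a small lookup plus a filter. This is an illustrative sketch only: it assumes case IDs carry their tier as a prefix (for example `T1-01`), and the helper names `tier_of` and `select_cases` are hypothetical, not taken from the eval code.

```python
# Tier labels as listed in the text above.
TIERS = {
    "T1": "single-tool",
    "T2": "routing",
    "T3": "multi-turn",
    "T4": "orchestration",
    "T5": "proactive/edge",
}

# The table above applies the judge to the hard tiers T3-T5.
JUDGED_TIERS = {"T3", "T4", "T5"}


def tier_of(case_id: str) -> str:
    """Return the tier prefix of an ID such as 'T1-01' (assumed ID format)."""
    tier = case_id.split("-", 1)[0]
    if tier not in TIERS:
        raise ValueError(f"unknown tier in case id: {case_id!r}")
    return tier


def select_cases(case_ids, tier=None, test=None):
    """Keep cases matching an optional tier and/or an exact case ID."""
    selected = []
    for case_id in case_ids:
        if tier is not None and tier_of(case_id) != tier:
            continue
        if test is not None and case_id != test:
            continue
        selected.append(case_id)
    return selected
```

Keeping the tier in the ID means a filter needs no extra metadata lookup, at the cost of coupling IDs to the stratification.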


File map

eval/
  datasets/
    intent_routing.yaml        # synthetic routing corpus (message → expected categories + agent)
    intent_routing_real.yaml   # MINED + SANITIZED real-usage routing cases (separate signal)
    tool_selection.yaml        # 41 end-to-end scenarios (T1–T5) + read-only To Do cases
  tasks.py                     # classify_and_route — the live routing pipeline, isolated to score
  tool_selection.py            # run_agent_turn (real agent, in-memory DB, no E2B) + score_scenario + ToolSelection
  evaluators.py                # IntentMatch (set-equality + Jaccard), AgentMatch, shared category_scores
  langfuse_experiment.py       # Langfuse Dataset + Experiment runner (both phases)
  mine_traces.py               # trace→dataset miner (with conversational-fragment filter)
  report.py                    # scorecard + JSON artifact
  run.py                       # CLI entry point
tests/
  test_eval_intent_routing.py  # CI gate (synthetic + real-world routing)
  test_eval_tool_selection.py  # CI gate for the scenario scorer (pure)
  test_eval_mine_traces.py     # CI gate for the miner's pure logic

The corpus YAML is hand-maintained data (the original migration from scripts/run_eval.py was one-time; that runner is retired). Mining drafts are gitignored.
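For readers unfamiliar with the scoring named in the file map, here is a minimal sketch of a set-equality plus Jaccard score over intent categories. It is not the code in evaluators.py; the function names and the returned keys (`exact`, `jaccard`) are assumptions made for illustration.

```python
def jaccard(expected: set, predicted: set) -> float:
    """Size of the intersection over size of the union."""
    if not expected and not predicted:
        # Two empty sets agree completely; avoid dividing by zero.
        return 1.0
    return len(expected & predicted) / len(expected | predicted)


def intent_match(expected, predicted) -> dict:
    """Score predicted categories against expected ones.

    'exact' is strict set equality; 'jaccard' gives partial credit.
    """
    exp, pred = set(expected), set(predicted)
    return {
        "exact": 1.0 if exp == pred else 0.0,
        "jaccard": jaccard(exp, pred),
    }
```

Reporting both numbers is useful because set equality alone hides near misses: a prediction that finds one of two expected categories fails the exact check but still earns half credit on Jaccard.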


Running it

# Free, CI-gated routing eval (no LLM/embedding spend)
uv run python -m eval.run                                              # synthetic scorecard
uv run python -m eval.run --dataset eval/datasets/intent_routing_real.yaml
uv run pytest tests/test_eval_intent_routing.py                        # the gate itself

# Real-agent tool-selection (costs tokens; side-effect-free)
uv run python -m eval.run --phase tools --test T1-01 --judge           # one cheap case
uv run python -m eval.run --phase tools --tier T2 --judge              # a whole tier
uv run python -m eval.run --phase tools --trace --langfuse --judge     # full sweep → BOTH platforms, one run

# Mine real production traffic for new cases (read-only; needs prod env)
railway run uv run python -m eval.mine_traces --limit 500

Useful flags: --trace (→ Logfire), --langfuse (→ Langfuse), --judge (LLM judge on T3–T5), --tier / --test (filter), --max-concurrency (default 3, to respect rate limits), --semantic (also score embedding-fallback routing cases — costs embeddings).
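A concurrency cap like --max-concurrency is commonly implemented with a semaphore. The sketch below shows that general pattern with asyncio; it is an assumption about the technique, not a copy of how eval/run.py does it, and `run_bounded` is a hypothetical name.

```python
import asyncio


async def run_bounded(items, worker, max_concurrency: int = 3):
    """Run worker(item) for every item, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        # Each call waits here until one of the slots is free.
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(item) for item in items))
```

A caller would drive it with something like `asyncio.run(run_bounded(cases, run_case))`, where `run_case` is whatever coroutine executes a single scenario.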


Where results land

Logfire. --trace runs appear under Evals → Datasets / Experiments as Local datasets named intent_routing / tool_selection, with per-case OTEL traces. Needs LOGFIRE_TOKEN.

Langfuse. --langfuse pushes to Datasets kwasi-intent-routing / kwasi-tool-selection and creates Experiment runs named model · git-sha · timestamp, with run-over-run comparison plus cost/latency charts, so you can see quality move across prompt/model changes. Needs LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY.

--trace --langfuse together runs the agent once and populates both natively (the Logfire run's per-case outputs are reused for the Langfuse experiment — only the evaluators/judge re-run, not the agent). The default run and the pytest gate send nothing anywhere (free, cred-free).
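The run-once, score-twice idea can be sketched as follows. This is a simplified illustration under stated assumptions: `run_unified`, the case dictionary keys, and the backend names are hypothetical, and a real reporter would send the scores to its platform instead of returning them.

```python
def run_unified(cases, task, scorer, backends):
    """Run the expensive task once per case, then score per backend.

    cases:    list of dicts with 'id', 'input' and 'expected' keys (assumed shape)
    task:     the callable under test (the agent run, in the text above)
    scorer:   shared scoring function, called as scorer(output, expected)
    backends: names of the reporting backends to produce scores for
    """
    # The only place the task is invoked: outputs are cached by case ID.
    outputs = {case["id"]: task(case["input"]) for case in cases}

    reports = {}
    for backend in backends:
        # Every backend re-scores the same cached outputs with the same
        # scorer, so scores can differ only if the scorer is nondeterministic.
        reports[backend] = {
            case["id"]: scorer(outputs[case["id"]], case["expected"])
            for case in cases
        }
    return reports
```

Because only scoring is repeated, the second backend sees no agent cost or latency of its own, which is why per-run cost figures on the mirrored side should be read with care.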


Datasets, runs & experiments — reading the two platforms

The vocabulary is shared across both platforms:

| Term | Meaning |
| --- | --- |
| Dataset | A named bag of items (test cases): input + expected_output + metadata. Stable; you grow it over time. |
| Item | One test case. |
| Run / Experiment | One scored pass of a task over (some of) the items. Immutable, timestamped, tagged model · git-sha · timestamp. Many runs over a dataset = the comparison history. |

"Experiment" and "dataset run" are the same thing — a scored execution.

Langfuse. The Datasets tab holds the items; Experiments holds the runs. Click a run for per-item outputs + scores; tick two or more runs to compare side-by-side; the cost/latency charts are per-run aggregates. Watch out for non-comparable runs — a run over a filtered subset (e.g. --tier T1) or on a different model isn't apples-to-apples with a full sweep. Delete stray runs with delete_dataset_run (items are kept).

Logfire. Distinguishes Hosted datasets (managed in Logfire via API/UI, listed by its datasets API) from Local datasets (defined in your pydantic-evals code). Code-run experiments do not link to a hosted dataset by name — they surface as a Local dataset named after the Dataset(name=...) (intent_routing / tool_selection), each case openable as a full OTEL trace. This project uses Local datasets; there are no hosted ones.

Cost/latency in unified mode

With --trace --langfuse, the agent runs once via the Logfire path and the Langfuse run is mirrored from cached outputs — so for those runs, trust Logfire for cost/latency; Langfuse's numbers reflect only the judge re-run. The CI sweep uses --langfuse only, so it runs the agent itself and its Langfuse cost/latency is accurate. Match models across runs (set MODEL_NAME / MINI_MODEL_NAME) or the comparison mixes models.


Automation (CI)

Two GitHub Actions workflows (.github/workflows/):

ci.yml — per-PR routing gate (free). Beyond pytest, two labelled steps run the routing scorecards with explicit thresholds, so the eval is visible and fails on a routing regression:

- name: Routing eval — synthetic      # --min-intent 0.9 --min-agent 0.9
- name: Routing eval — real-world     # --dataset …intent_routing_real.yaml --min-agent 0.85

No spend (keyword path, no creds).
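A threshold gate of this kind reduces to comparing aggregate scores against minimums and failing the job when any falls short. The sketch below assumes metric names `intent` and `agent` to mirror the --min-intent / --min-agent flags; `gate` and `exit_code` are hypothetical helpers, not the CLI's actual internals.

```python
def gate(scores: dict, thresholds: dict) -> list:
    """Return one failure message per metric below its minimum."""
    failures = []
    for metric, minimum in thresholds.items():
        # A metric that was never scored counts as 0.0, so it fails loudly.
        value = scores.get(metric, 0.0)
        if value < minimum:
            failures.append(f"{metric}: {value:.3f} < {minimum:.3f}")
    return failures


def exit_code(failures: list) -> int:
    """Non-zero exit makes the CI step fail."""
    return 1 if failures else 0
```

In a CLI entry point the result would typically be passed to `sys.exit`, so a routing regression turns the labelled step red rather than only changing a number in the log.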

eval-sweep.yml: tool-selection sweep → Langfuse. Runs the real-agent sweep weekly (Mon 06:00 UTC) and on demand via workflow_dispatch (a Run workflow button in the Actions tab, inputs tier / judge / trace). It runs on the production model and reports to Langfuse; cases whose requires secrets are absent are skipped rather than failed. It is not part of per-PR CI because it costs tokens.

Repo secrets (mirrored from Railway) drive the sweep:

| Secret | Purpose |
| --- | --- |
| GOOGLE_API_KEY, MODEL_NAME, MINI_MODEL_NAME | agent + judge models (match prod) |
| LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST | report to Langfuse |
| LOGFIRE_TOKEN | optional; also create a Logfire experiment (trace: true) |
| OUTLOOK_REFRESH_TOKEN / OUTLOOK_CLIENT_ID / OUTLOOK_CLIENT_SECRET | enable the read-only Microsoft To Do cases |

Adding more per-service secrets (GITHUB_TOKEN, GMAIL_REFRESH_TOKEN, GOOGLE_MAPS_API_KEY, WEATHERAPI_KEY, TAVILY_API_KEY) widens coverage.
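Skipping cases whose secrets are missing can be sketched as a check against the environment. The sketch assumes each case lists the environment variables it needs under a `requires` key; that shape, and the helper names, are assumptions for illustration.

```python
import os


def missing_secrets(required, env=None):
    """Names from `required` that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def split_runnable(cases, env=None):
    """Partition cases into runnable IDs and skipped IDs with reasons."""
    runnable, skipped = [], {}
    for case in cases:
        missing = missing_secrets(case.get("requires", []), env)
        if missing:
            # Record which secrets were absent so the skip is explainable.
            skipped[case["id"]] = missing
        else:
            runnable.append(case["id"])
    return runnable, skipped
```

Recording the missing names, rather than silently dropping the case, makes it clear from a sweep log which secret to add in order to widen coverage.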


The feedback loop

The highest-value test cases come from real usage where routing actually went wrong.

flowchart LR
    P[production interactions] --> M[mine_traces.py]
    M -->|flags routing disagreements| D[gitignored draft]
    D --> H[human review + PII sanitize]
    H --> C[intent_routing_real.yaml]
    C --> G[CI gate]
    C --> R[router fixes]
    R --> P

mine_traces.py reads the interactions table, recomputes the keyword route per message, and flags rows where that route disagrees with the domain implied by the tools the agent actually used. It filters conversational fragments (follow-ups, acknowledgments) so only self-contained intents surface.
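The disagreement check described above can be sketched in a few lines. Everything concrete here is an assumption: the tool names, the tool-to-domain map, the fragment heuristic, and the choice to flag a row when the tools imply a domain the keyword route did not pick. The real miner may define each of these differently.

```python
# Hypothetical tool -> domain map; the real tool names are not given in the text.
TOOL_DOMAINS = {
    "list_todo_tasks": "todo",
    "search_email": "email",
    "create_event": "calendar",
}

# Hypothetical acknowledgment words used by the fragment filter.
ACKNOWLEDGMENTS = {"thanks", "ok", "okay", "yes", "no", "sure"}


def is_fragment(message: str) -> bool:
    """Crude stand-in filter: very short messages or leading acknowledgments."""
    words = message.lower().split()
    return len(words) < 3 or words[0].strip(",.!") in ACKNOWLEDGMENTS


def implied_domains(tools_used) -> set:
    """Domains implied by the tools the agent actually called."""
    return {TOOL_DOMAINS[t] for t in tools_used if t in TOOL_DOMAINS}


def find_disagreements(rows, keyword_route):
    """Flag rows whose tools imply a domain the keyword route missed.

    rows:          dicts with 'message' and 'tools' keys (assumed shape)
    keyword_route: callable mapping a message to a set of domains
    """
    flagged = []
    for row in rows:
        if is_fragment(row["message"]):
            continue
        missed = implied_domains(row["tools"]) - keyword_route(row["message"])
        if missed:
            flagged.append({"message": row["message"], "missed": sorted(missed)})
    return flagged
```

The flagged rows are only candidates: they still need human review and sanitizing before anything is promoted into the real-world dataset.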

What the first prod run found

Mining 500 real interactions surfaced 4 genuine routing bugs, all since fixed and all naive substring false-matches in _INTENT_KEYWORDS:

  • "pr" / "ci" (github) matched "sprint", "spring", "april", "specific"
  • "latest on" (news) matched "latest on my emails"
  • "book" (calendar) matched "a book", "facebook", "notebook"

It also revealed the corpus was blind to Microsoft To Do — the dominant real workflow — now covered with read-only scenarios.
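To see why these substring matches misfire, compare a plain `in` check with a word-boundary regex. The source does not say how the bugs were fixed, so the word-boundary approach below is one common remedy offered as an assumption, and both function names are hypothetical.

```python
import re


def substring_match(keyword: str, message: str) -> bool:
    """Naive check: true whenever the keyword appears anywhere."""
    return keyword in message.lower()


def word_match(keyword: str, message: str) -> bool:
    """Match the keyword only as a whole word or phrase."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, message.lower()) is not None
```

Word boundaries remove the embedded-substring cases: `word_match("pr", "plan the sprint")` and `word_match("book", "open facebook")` are both false, while `word_match("pr", "review my pr")` stays true. They do not help with phrase-level false matches such as "a book" or "latest on my emails", where the keyword really is a whole word; those need additional context such as neighbouring words or a negative pattern.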


Current baselines

  • Routing: synthetic 29/29 (100%), real-world 7/7 (100%) — free, CI-gated.
  • Tool-selection sweep (T1–T5): ~98% assertion pass rate, LLM judge passing on every judged case, 0 E2B sandboxes spawned, 0 rate-limit errors.

Design decisions

  • PII never enters git. Mining drafts are gitignored; promoted real-world cases are hand-sanitized (names/emails/personal details paraphrased, routing-relevant phrasing preserved).
  • No real side effects. The eval nulls e2b_api_key so execute_python / delegation degrade gracefully instead of spawning billed sandboxes, runs against a throwaway in-memory SQLite DB, and excludes mutating Microsoft To Do ops (they would hit the real Graph).
  • Substring checks are brittle → the LLM judge is preferred for the hard tiers.
  • Real-world cases are a separate dataset so real-world routing accuracy is its own signal, distinct from the synthetic suite.
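The "no real side effects" decision can be illustrated with a settings override and an in-memory SQLite database. Only e2b_api_key and execute_python are names from the text; the `Settings` class, its `db_path` field, and the helper functions are hypothetical stand-ins, and the stub tool does not reflect the real implementation.

```python
import sqlite3
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings object; only e2b_api_key is named in the text."""
    e2b_api_key: Optional[str]
    db_path: str


def eval_settings(base: Settings) -> Settings:
    """Copy settings with the sandbox key nulled and a throwaway in-memory DB."""
    return replace(base, e2b_api_key=None, db_path=":memory:")


def execute_python(code: str, settings: Settings) -> str:
    """Stand-in tool: degrade gracefully instead of spawning a sandbox."""
    if settings.e2b_api_key is None:
        return "sandbox unavailable in eval mode"
    raise NotImplementedError("real sandbox call is out of scope for this sketch")


def open_db(settings: Settings) -> sqlite3.Connection:
    """Open the database; ':memory:' gives a fresh DB that vanishes on close."""
    return sqlite3.connect(settings.db_path)
```

Deriving the eval settings from the base settings, rather than mutating them, keeps the production configuration untouched even if the eval runs in the same process.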

Known limitations

The offline CLI harness can't faithfully reproduce some production-only behaviour; those cases are marked eval_skip in the corpus with a documented re-enable path:

| Skipped case | Why | Covered by |
| --- | --- | --- |
| T5-01 (destructive-action friction) | friction comes from the Telegram approval gate, which the CLI eval bypasses | tests/test_approval.py |
| T4-06 (read-later relevance injection) | runs in handle_message preprocessing; the eval calls agent.run() directly | None yet (future harness work) |
| T3-05 (reminder correction) | set_reminder degrades on the CLI interface (needs a Telegram chat) | |
| T4-04 (task dedup) | dedup is vacuous against the empty in-memory eval DB | |

Future fidelity work (each re-enables one or more of the above): a gated-interface mode, replicating the relevance-injection preprocessing, and DB state-seeding.


See eval/README.md for the operational quick-reference that ships next to the code.