
Kwasi Codebase Curriculum — A Pedagogic Walkthrough

Context

You built Kwasi to learn the modern AI stack by doing. Along the way, with heavy AI assistance, the codebase grew into something substantial: a multi-interface agent with planning, memory, reflection, observability, multimodal I/O, and a wearable-data pipeline. You now want to step back and understand what you actually built — not as the person who wrote each commit, but as the engineer who needs to defend, extend, and benchmark it against the state of the art.

This curriculum is a guided tour of the codebase, ordered to build understanding from the inside out: the runtime spine first, then routing, then tools, then memory, then interfaces, then observability, finishing with a state-of-the-art comparison so you can see clearly which techniques you've internalised, which you're using cosmetically, and which gaps would be worth closing next.

Where it lives: the curriculum will be saved to docs/curriculum.md in the repo so it sits next to the existing docs and you can return to it anytime. We'll walk through one session per sitting; each session is self-contained.

Format of each session:

  • Concept primer — the technique in general (e.g. "what is a Pydantic AI agent loop") with the data-scientist-friendly framing
  • In Kwasi — the specific files, functions, and line numbers where it's implemented
  • Why these choices — the design tradeoffs the code embodies
  • Exercises — 2–3 small things to try (read this code, trace this call, modify this config) to lock the concept in
  • State of the art note — one paragraph: how this compares to what frontier teams are doing in 2026


Session 1 — Orientation & The Runtime Spine

Goal: see the whole system from 30,000 ft, then drop into the two most central files: app/main.py (the FastAPI process) and app/agent.py (the brain).

1.1 The 30,000-ft view

  • Read docs/index.md and docs/architecture.md (the mermaid diagram is the ground truth — return to it whenever you're lost)
  • The five layers, top to bottom: Interfaces (Telegram/CLI/WhatsApp/HTTP) → Planning gate (Spec 008) → Intent routing → Pydantic AI agent → Tools / Storage / LLM
  • Three model roles (MODEL_NAME, MINI_MODEL_NAME, REFLECTION_MODEL_NAME) and why they're separated

1.2 FastAPI lifespan as the orchestrator (app/main.py)

  • lifespan() is the startup/shutdown context manager — every long-lived asyncio task is born and killed here
  • Walk through what gets initialised: build_deps() → anchor cache pre-warm → Outlook MSAL persistence → Telegram bot → 13 background loops
  • Look at app/main.py:243–1487 (loop functions) and notice the pattern: while True: await sleep; do work — no Celery, no Redis queue, just asyncio
  • Concept: this is the "majestic monolith" pattern for a single-user system — defensible because the loops are low-frequency
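
Here is a minimal sketch of that lifespan + background-loop shape (loop and task names are hypothetical, not the actual app/main.py code):

```python
import asyncio
from contextlib import asynccontextmanager
from fastapi import FastAPI

async def reminder_loop() -> None:
    # The shape of every Kwasi loop: sleep, then do one unit of work, forever.
    while True:
        await asyncio.sleep(60)
        # ... check due reminders, send any that are due ...

@asynccontextmanager
async def lifespan(app: FastAPI):
    tasks = [asyncio.create_task(reminder_loop())]   # the real lifespan starts ~13 of these
    try:
        yield                                        # the app serves requests while loops run
    finally:
        for t in tasks:
            t.cancel()                               # shutdown cancels every long-lived task

app = FastAPI(lifespan=lifespan)
```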

1.3 The single AgentDeps factory (app/agent.py:82–115)

  • build_deps() is the only place AgentDeps is constructed — every interface shares one instance
  • Per-request mutation of telegram_chat_id / active_categories happens via dataclasses.replace (shallow copy) — not by mutating the singleton
  • Why: storage pools, MCP credentials, and HTTP clients are expensive to build; sharing them is correct, and the per-request copy keeps state clean
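
A minimal sketch of that shared-singleton-plus-shallow-copy idea (field names are illustrative, not the real AgentDeps definition):

```python
import dataclasses

@dataclasses.dataclass
class AgentDeps:
    storage: object                          # stands in for the expensive shared pool/clients
    telegram_chat_id: str | None = None
    active_categories: frozenset[str] = frozenset()

DEPS = AgentDeps(storage=object())           # the singleton that build_deps() would construct once

def deps_for_request(chat_id: str, categories: frozenset[str]) -> AgentDeps:
    # Shallow copy: the storage pool and clients are shared, only per-request
    # fields are overridden, and the singleton itself is never mutated.
    return dataclasses.replace(DEPS, telegram_chat_id=chat_id, active_categories=categories)
```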

1.4 Pydantic AI in 5 minutes

  • The agent loop: model decides → call tool → tool returns → model decides → ... → final text response
  • Agent(deps_type=AgentDeps, retries=3) (app/agent.py:118)
  • Tools registered with @agent.tool (sync access to RunContext[AgentDeps])
  • System prompt registered with @agent.system_prompt (dynamic — runs once per agent.run())
  • agent.run() vs agent.run_stream() — the latter yields partial responses, used for Telegram's live "typing" effect
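
A compact, hypothetical example of that loop (a toy deps type and tool, not Kwasi's; the result attribute is .output in current pydantic-ai releases, .data in older ones):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pydantic_ai import Agent, RunContext

@dataclass
class DemoDeps:
    user_name: str

agent = Agent("google-gla:gemini-2.5-flash", deps_type=DemoDeps, retries=3)

@agent.system_prompt
def persona(ctx: RunContext[DemoDeps]) -> str:
    # Dynamic system prompt: evaluated once per agent.run().
    return f"You are a concise assistant for {ctx.deps.user_name}."

@agent.tool
async def get_time(ctx: RunContext[DemoDeps]) -> str:
    """Return the current UTC time."""
    return datetime.now(timezone.utc).isoformat()

async def main() -> None:
    result = await agent.run("What time is it?", deps=DemoDeps(user_name="Kwasi"))
    print(result.output)   # final text after the model -> tool -> model loop finishes
```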

Exercises:

  1. Open app/main.py, find every asyncio.create_task in lifespan(). Make a list. You should find ~13.
  2. In app/agent.py, count the @agent.tool decorators. That's your native tool surface.
  3. Trace one path: python -m app.main --cli → cli_loop() → process_cli_message() → agent.run(). What's the minimum stack between user input and LLM call?

State of the art note: Pydantic AI is one of three "production-grade" Python agent frameworks in 2026 (the others being LangGraph and OpenAI's Agents SDK). It wins on type safety (every tool argument is a Pydantic field) and provider portability (swap Gemini for Claude with one env var). It loses to LangGraph on graph-shaped agent topologies (yours is sequential, so this doesn't bite you).


Session 2 — Intent Routing & Multi-Step Planning

Goal: understand the two pre-LLM filters that decide which agent and how many steps before any expensive call happens.

2.1 The intent router (app/tools/router.py)

  • classify_intent(message, context_hint) (lines 668–739)
  • Stage 1 — keyword match against _INTENT_KEYWORDS (a hand-tuned dict mapping phrases → categories): zero LLM cost, microsecond latency
  • Stage 2 — semantic fallback: embed the message, cosine-compare against pre-computed "anchor embeddings" per domain (one short paragraph per domain, embedded at startup), threshold ≥0.60
  • Stage 3 — context inheritance: if no match and the previous turn was in a single non-utility domain, inherit it (handles "anything else?")
  • Concept primer: this is an embedding-based nearest-centroid classifier — same family as k-NN. Cheap and explainable; an LLM-based router would cost a full inference per message.
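
A sketch of that Stage 2 check (the embedding helper and threshold handling are assumptions; the real logic lives in classify_intent):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

async def semantic_route(message: str, anchors: dict[str, list[float]]) -> str | None:
    vec = await embed_text(message)              # assumed embedding helper (see Session 4)
    best_domain, best_score = None, 0.0
    for domain, anchor_vec in anchors.items():   # one pre-computed anchor embedding per domain
        score = cosine(vec, anchor_vec)
        if score > best_score:
            best_domain, best_score = domain, score
    return best_domain if best_score >= 0.60 else None   # below threshold: fall through to Stage 3
```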

2.2 Agent dispatch (app/routing/agents.py)

  • Twelve domain agents (email, calendar, memory, github, jira, drive, slack, news, meetings, diagnostics, health, utility) + briefing_agent + full_agent
  • Each domain agent is built once at import time with only its relevant tools — fewer tools = better LLM accuracy and lower per-call tokens
  • select_agent(categories) (line 535+) does the dispatch; multi-domain matches go through _get_composed_agent(frozenset(categories)) which is @lru_cache(maxsize=32)'d
  • Why pre-build: the cost of Agent(...) construction (tool schema generation) is non-trivial. Doing it per-request would add latency for nothing.
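
A sketch of the cached composition (TOOLS_BY_DOMAIN, build_agent, and DOMAIN_AGENTS are illustrative stand-ins for what agents.py actually does):

```python
from functools import lru_cache

@lru_cache(maxsize=32)
def _get_composed_agent(categories: frozenset[str]):
    # frozenset is hashable, so each domain combination pays the Agent(...)
    # construction and tool-schema cost at most once per process.
    tools = [tool for domain in sorted(categories) for tool in TOOLS_BY_DOMAIN[domain]]
    return build_agent(tools)

def select_agent(categories: set[str]):
    if len(categories) == 1:
        return DOMAIN_AGENTS[next(iter(categories))]      # pre-built at import time
    return _get_composed_agent(frozenset(categories))     # cached multi-domain composition
```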

2.3 Multi-step planning (Spec 008, app/planning/)

  • classify_complexity(text) — regex pre-filter on connectives ("and then", "after that", "also send"), ≥30 chars. Skips planning entirely for simple messages — zero LLM cost on the common case.
  • generate_plan(text, deps) — calls a tool-less _planner_agent with output_type=ExecutionPlan. Returns None if needs_planning=False or fewer than 2 steps.
  • execute_plan(plan, deps, send_progress) — runs steps sequentially, each through the same classify_intent + select_agent machinery
  • Scratchpad threading: each step's output (truncated to 300 chars) is prepended to the next step's message — later steps build on earlier results without re-fetching
  • Resume on failure: a failed step saves a PendingAction(action_type="plan_resume") carrying remaining steps + scratchpad — Confirm restarts from the failure point
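
A rough sketch of the scratchpad threading (plan/step shapes and the sync/async details of the routing helpers are guessed from the description above, not copied from app/planning/):

```python
async def execute_plan(plan, deps, send_progress):
    scratchpad: list[str] = []
    for i, step in enumerate(plan.steps):
        # Each step sees the truncated outputs of earlier steps prepended to its message.
        context = "\n".join(scratchpad)
        message = f"{context}\n\n{step.instruction}" if context else step.instruction
        agent = select_agent(classify_intent(message, None))   # same routing as a normal turn
        result = await agent.run(message, deps=deps)
        scratchpad.append(f"Step {i + 1} result: {result.output[:300]}")   # 300-char cap per step
        await send_progress(i + 1, len(plan.steps))
```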

Exercises:

  1. Open app/tools/router.py:193+ and read the _INTENT_KEYWORDS dict for one domain (say email). Notice the bias toward verbs and concrete nouns.
  2. In Telegram, send "what's the time?" then "and what about in Tokyo?" — observe how the second message inherits utility/datetime context.
  3. Send a 3-step request ("check email, then summarise the top 3, then save them as notes") and watch the plan preview render.

State of the art note: Frontier agents (Devin, Claude Code, Cursor's agent mode) lean heavier on planning — they often run a "thinking" pass with extended reasoning before tool selection. Yours is closer to the router-then-actor pattern that dominated 2024–2025 agentic systems and is still the right choice for narrow personal-assistant scope. The piece that's missing relative to SOTA: no reflection/self-critique loop within a turn (your reflection happens nightly, not mid-conversation).


Session 3 — Tools, Skills, and the Approval Gate

Goal: understand the three layers between the LLM and the outside world.

3.1 Tool inventory tour (app/tools/, ~30 native tools)

Group by purpose:

  • Web/research: search_web (Tavily), summarize_url, browse_web (Playwright), deep_research (skill)
  • Productivity (MCP): Gmail/Outlook email (search, read, draft, send), calendars (Google + Outlook), Microsoft To Do, Google Drive
  • Memory: notes, tasks, reminders, scheduled tasks, user_facts, journal entries, semantic search
  • Code/DevOps: GitHub (PyGitHub), Jira, Slack, Logfire (diagnostics)
  • Multimodal: transcribe_audio (Gemini STT), analyze_image (Gemini Vision), synthesize_speech (edge-tts)
  • Maps/transit: search_places, get_directions, check_transit_status
  • Code execution: execute_python (E2B sandbox)
  • Health: get_recent_health, get_sleep_summary, get_hrv_trend, get_health_snapshot

3.2 The asyncio.to_thread pattern (app/interfaces/mcp/client.py)

  • Gmail/Outlook/Drive/GitHub/Slack SDKs are all synchronous — calling them directly would block the event loop
  • Each MCP wrapper does await asyncio.to_thread(sync_fn, *args) — offloads to Python's default thread pool
  • Concept: this is the standard escape hatch for sync libraries in async Python. It's correct here because these calls are I/O-bound (network), not CPU-bound.
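
The wrapper shape, in miniature (function names are illustrative, not the real MCP client code):

```python
import asyncio

def _search_gmail_sync(query: str) -> list[dict]:
    # Imagine this calls the synchronous Gmail SDK and blocks on network I/O.
    ...

async def search_gmail(query: str) -> list[dict]:
    # Offload the blocking call to the default thread pool so the event loop
    # stays free for other Telegram messages and background loops.
    return await asyncio.to_thread(_search_gmail_sync, query)
```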

3.3 The skills file-drop registry (app/skills/)

  • @skill decorator (just appends to a list)
  • load_skills(agent) walks app/skills/*.py, imports each, registers decorated functions on the agent
  • Idempotent (calling twice is a no-op)
  • Built-in skills: read_later, travel_briefing, cv, research, meeting_notes
  • Why: extending Kwasi without touching agent.py reduces merge-conflict surface and makes skills genuinely modular
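
A sketch of the registry (a few lines, matching the description above; the real decorator and loader live in app/skills/__init__.py):

```python
import importlib
import pkgutil

_SKILLS: list = []
_registered: set[str] = set()

def skill(fn):
    _SKILLS.append(fn)          # the decorator just records the function
    return fn

def load_skills(agent, package: str = "app.skills") -> None:
    pkg = importlib.import_module(package)
    for mod in pkgutil.iter_modules(pkg.__path__):          # walk app/skills/*.py
        importlib.import_module(f"{package}.{mod.name}")    # importing triggers @skill
    for fn in _SKILLS:
        if fn.__name__ not in _registered:                  # calling load_skills twice is a no-op
            agent.tool(fn)                                   # register on the agent
            _registered.add(fn.__name__)
```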

3.4 The approval gate (app/approval.py)

This is one of the cleverest pieces in the codebase — read it carefully.

  • Problem: how do you let the user veto a destructive tool call (send email, delete note) without pausing the LLM mid-run?
  • Solution: the gated tool returns the string [APPROVAL_PENDING:<uuid>] instead of executing. The agent treats this as a normal tool result, writes a "your action is pending approval" response, and exits cleanly.
  • The bot post-processes the response, finds the sentinel, looks up the PendingAction, and renders Confirm/Cancel/Edit buttons
  • On Confirm, an ACTION_REGISTRY lookup finds the actual executor and runs it (no second LLM call)
  • Why approval.py is a leaf module: it imports only from app/memory/ports. Both agent.py and bot.py import it, but neither imports the other. This breaks a circular-import chain that would form if the gate lived in bot.py.
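
A sketch of the sentinel flow (helper names are assumptions; the real gate is in app/approval.py):

```python
import uuid

async def approval_gate(deps, action_type: str, payload: dict) -> str:
    if deps.interface != "telegram":
        # CLI/WhatsApp/API paths bypass the gate and execute directly.
        return await execute_action(action_type, payload)
    action_id = str(uuid.uuid4())
    await deps.storage.save_pending_action(action_id, action_type, payload)
    # The LLM sees this as an ordinary tool result and wraps up its turn;
    # the bot later finds the sentinel, loads the PendingAction, and renders buttons.
    return f"[APPROVAL_PENDING:{action_id}]"
```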

Exercises:

  1. Find the @skill decorator in app/skills/__init__.py. It's ~3 lines. Marvel at the simplicity.
  2. Pick one MCP tool (say gmail_read_email_wrapper) and trace it from agent.py registration → client.py wrapper → underlying asyncio.to_thread call.
  3. In app/approval.py, find approval_gate() and follow what happens when deps.interface != "telegram" (CLI/WhatsApp/API path). Notice the bypass.

State of the art note: The sentinel pattern is unusual — most agent frameworks (LangGraph, OpenAI Assistants) handle approval via "interrupt" primitives that pause the run. Yours is simpler and stateless, which is genuinely a smart design choice for a single-user system. The cost: you can't ask the LLM to react to the user's edit ("I changed my mind, send to Bob instead"); you re-run the agent with a [REVISION] prompt instead. For Kwasi's scope this is fine.


Session 4 — Memory I: Storage, Embeddings, Context Injection

Goal: understand how Kwasi remembers — in three time scales (per-request, per-day, per-lifetime).

4.1 The StoragePort protocol (app/memory/ports.py)

  • A Python Protocol (structural typing) defines the storage interface — both adapters implement it
  • Models: Interaction, Note, Task, Reminder, UserContext, UserFact, ScheduledTask, AlertRule, PendingAction, PendingIntention, AgentLearning, JournalEntry, ReadLaterItem, NewsTopic, SeenStory, AuditEntry, SemanticSearchResult, plus HealthSample from app/health/models.py
  • Why a protocol, not an ABC: ducks. SQLite/Postgres adapters don't inherit from anything — they just implement the methods.
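
A toy version of the structural-typing idea (two methods only; the real StoragePort declares many more):

```python
from typing import Protocol

class StoragePort(Protocol):
    async def save_note(self, text: str, tags: list[str]) -> int: ...
    async def find_relevant_notes(self, query: str, limit: int) -> list[dict]: ...

class SQLiteStorage:
    # No inheritance from StoragePort: implementing the same method signatures
    # is enough for type checkers to accept it wherever a StoragePort is expected.
    async def save_note(self, text: str, tags: list[str]) -> int: ...
    async def find_relevant_notes(self, query: str, limit: int) -> list[dict]: ...
```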

4.2 Two adapters (app/memory/adapters/)

  • SQLite (sqlite.py): aiosqlite, embeddings stored as JSON-encoded text, cosine similarity computed in Python
  • Postgres (postgres.py): asyncpg + pgvector with HNSW index, embeddings as halfvec(3072) (16-bit; full-precision vector(3072) exceeds pgvector's 2000-dim HNSW limit), cosine via <=> operator
  • Both implement hybrid search: keyword (ILIKE) + semantic in parallel, fused via Reciprocal Rank Fusion (RRF, k=60)
  • Concept primer: HNSW (Hierarchical Navigable Small World) is the dominant ANN index for high-dim vectors — sub-linear search with high recall. Cosine via <=> is the standard pgvector idiom.
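
A sketch of the RRF step over two ranked id lists (both adapters implement the same idea over real rows):

```python
def rrf_fuse(keyword_ids: list[int], semantic_ids: list[int], k: int = 60) -> list[int]:
    scores: dict[int, float] = {}
    for ranked in (keyword_ids, semantic_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            # Each list contributes 1 / (k + rank); documents found by both
            # lists accumulate both contributions and float to the top.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```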

4.3 Embeddings (app/tools/embedding.py)

  • Model: Gemini gemini-embedding-001 (3072 dims), fallback to gemini-embedding-2-preview on 404
  • Direct REST call via httpx — no google-genai SDK dependency for this hot path
  • Fire-and-forget on write: row is INSERTed first, then embed_text is called and the embedding is UPDATEd onto the row. If embedding fails, the row exists without one — keyword search still works.
  • Why fire-and-forget: the user-facing operation (saving a note) shouldn't block on a 200ms embedding call
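
A sketch of that write path (storage method names and embed_text are assumptions; the real code is split across the adapters and app/tools/embedding.py):

```python
import asyncio

async def _embed_and_update(storage, note_id: int, text: str) -> None:
    try:
        vec = await embed_text(text)                       # ~200 ms REST call to Gemini
        await storage.update_note_embedding(note_id, vec)
    except Exception:
        pass   # row already exists without an embedding; keyword search still finds it

async def save_note(storage, text: str) -> int:
    note_id = await storage.insert_note(text)              # user-visible write returns fast
    asyncio.create_task(_embed_and_update(storage, note_id, text))   # fire-and-forget
    return note_id
```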

4.4 Semantic context injection (app/utils/message_utils.py)

Before every agent.run(), three retrieval layers prepend XML-tagged context to the user message:

  1. find_relevant_notes (≥0.6 cosine) — top 2, with recency boost (+0.05 if <90 days old). Threshold lowered from 0.75 in May 2026 after measurement showed real matches sat in the 0.55-0.70 band.
  2. find_relevant_summaries (≥0.6 cosine) — top 2, matches notes prefixed Summary:. Lowered from 0.70 for the same reason.
  3. find_relevant_read_later — not semantic, just tag overlap (substring match), top 3 newest

All three share a 1,000-token budget; layers fill in priority order until exhausted.

  • Why XML tags: the model can distinguish <context type="notes"> from instructions — reduces "the model treated my retrieved fact as a command" failure mode
  • Why datetime in user turn, not system prompt: keeps the system prompt byte-identical across requests, qualifying for Gemini's implicit prompt cache
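
A simplified sketch of the budget-capped injection (layer shapes and the 4-chars-per-token estimate are assumptions; see inject_context() for the real accounting):

```python
def inject_context(message: str, layers: list[tuple[str, list[str]]],
                   budget_tokens: int = 1000) -> str:
    blocks: list[str] = []
    used = 0
    for tag, items in layers:              # layers arrive in priority order
        for item in items:
            cost = len(item) // 4          # rough 4-chars-per-token estimate
            if used + cost > budget_tokens:
                # Budget exhausted: stop adding context, keep what already fits.
                return "\n".join(blocks + [message]) if blocks else message
            blocks.append(f'<context type="{tag}">{item}</context>')
            used += cost
    return "\n".join(blocks + [message]) if blocks else message
```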

Exercises:

  1. Open app/memory/adapters/sqlite.py:1479+ and read the cosine computation. Then postgres.py:1586+ for the SQL version. Same idea, two implementations.
  2. Trace one note save: save_note tool → adapter save_note() → INSERT → embed_text() → UPDATE. Find where it can fail safely.
  3. In message_utils.py, find inject_context() and read how the budget accounting works.

State of the art note: The retrieval pattern (semantic + keyword + tag-overlap, with budget-capped XML-tagged injection) is essentially RAG done well for a single-user assistant. What you're missing relative to frontier RAG: no re-ranking (a cross-encoder pass to re-order the top-K), no query rewriting (HyDE / sub-query generation), no structured retrieval (over a knowledge graph). For a personal assistant where the corpus is small (your own notes), these are likely premature. For a multi-user system at scale, you'd want them.


Session 5 — Memory II: The Three-Tier Pipeline & Reflection Engine

Goal: understand how raw conversation becomes lasting structured memory.

5.1 Short-term memory: the message_history (app/utils/message_utils.py)

  • Loaded via fetch_message_history() per turn. Default mode: last 10 interactions chronological. With ENABLE_SEMANTIC_HISTORY=true: last 3 verbatim + top 3 semantically-relevant older interactions (recency-boosted). Falls back to chronological on any retrieval failure.
  • build_message_history() then enforces a 6,000-token budget, dropping oldest first
  • Converted to Pydantic AI ModelRequest/ModelResponse pairs
  • The retrieval step is wrapped in @observe(name="message_history_retrieval") so it shows up under each turn in Langfuse with metadata {mode, recent_count, semantic_count, semantic_enabled}
  • Why a token budget, not a count: 10 short messages and 10 long messages cost wildly different tokens
  • Why two modes: chronological is reliable but doesn't surface relevant older context (e.g. "what did we discuss about X two weeks ago?"). Semantic mode trades pure recency for relevance, with a recency tail to preserve dialog coherence.
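
A sketch of the budget enforcement (interaction shape and the 4-chars-per-token estimate are assumptions; see build_message_history for the real version):

```python
def enforce_history_budget(interactions: list[dict], budget_tokens: int = 6000) -> list[dict]:
    kept: list[dict] = []
    used = 0
    for turn in reversed(interactions):                  # walk newest-first
        cost = (len(turn["user"]) + len(turn["assistant"])) // 4
        if used + cost > budget_tokens:
            break                                        # everything older gets dropped
        kept.append(turn)
        used += cost
    return list(reversed(kept))                          # restore chronological order
```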

5.2 The three-tier write pipeline

| Tier | Latency | What's written | Where |
|---|---|---|---|
| Post-message | seconds | explicit user facts | extract_facts_from_exchange (post_conversation.py:55) |
| Post-session | ~30 min after last message | session summary as Summary: <topic> notes | summarise_session (post_conversation.py:96) |
| Nightly | 2 AM UTC | full profile rewrite, intentions, learnings, full-history clusters | ReflectionService.run() (reflection.py:213) |

5.3 The session-close timer trick (app/interfaces/telegram/bot.py)

  • Module-level _session_tasks: dict[str, asyncio.Task] — one pending task per chat
  • Every new message cancels the previous timer and schedules a fresh 30-min one
  • After 30 min of silence, summarise_session() runs once
  • Concept: this is a debounce. Same pattern as keystroke debouncing in UIs.
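
The debounce in miniature (helper names are illustrative; the real timer logic lives in bot.py):

```python
import asyncio

_session_tasks: dict[str, asyncio.Task] = {}

def schedule_session_close(chat_id: str, delay_s: float = 30 * 60) -> None:
    previous = _session_tasks.get(chat_id)
    if previous is not None:
        previous.cancel()                        # every new message resets the timer

    async def _close_after_silence() -> None:
        await asyncio.sleep(delay_s)
        await summarise_session(chat_id)         # runs once, only after 30 min of quiet

    _session_tasks[chat_id] = asyncio.create_task(_close_after_silence())
```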

5.4 The reflection engine (app/memory/reflection.py)

Four outputs from a single LLM call:

  1. ---PROFILE--- — narrative markdown, 6 sections, ≤550 words, capped at 4,800 chars before injection
  2. ---FACTS--- — JSON array of new/changed UserFact records
  3. ---INTENTIONS--- — JSON array of soft commitments ("I should call the dentist") with follow_up_days
  4. ---LEARNINGS--- — JSON array of behavioral rules ("don't ask before saving notes") with category

Critical design choice: the prompt receives the existing profile, facts, intentions, and learnings as input, so the LLM only emits new records. Without this, every nightly run would re-emit the same facts and the dedup logic would have to handle it — much more expensive and error-prone.

Topic clustering (_summarise_conversations, reflection.py:635+): looks back 30 days, identifies up to 7 topics, writes each as a Summary: <topic> note. These are then findable by find_relevant_summaries on the next conversation.
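
A sketch of how a single delimited completion can be split into the four outputs (parsing details are assumed; only the delimiters come from the list above):

```python
import json
import re

def parse_reflection(text: str) -> dict:
    # re.split with a capturing group yields [prefix, name, body, name, body, ...]
    parts = re.split(r"---(PROFILE|FACTS|INTENTIONS|LEARNINGS)---", text)
    sections = {name.lower(): body.strip() for name, body in zip(parts[1::2], parts[2::2])}
    return {
        "profile": sections.get("profile", ""),
        "facts": json.loads(sections.get("facts", "[]")),
        "intentions": json.loads(sections.get("intentions", "[]")),
        "learnings": json.loads(sections.get("learnings", "[]")),
    }
```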

5.5 Where memory is read

build_system_prompt() (app/agent.py:734+) on every request:

  • Fetches UserContext (the narrative profile) → injected as ## Your Memory of This User
  • Fetches all UserFact records → injected as ## What I Know About You, grouped by category
  • Fetches active AgentLearning records → injected as ## Behavioral Guidelines

Exercises:

  1. Open app/memory/post_conversation.py and read _FACT_EXTRACTION_PROMPT (lines 16–28). Notice how strict it is — only explicit facts.
  2. In reflection.py, find _REFLECTION_PROMPT and trace how existing facts are interpolated to prevent duplicates.
  3. Trigger a reflection manually: curl -X POST $URL/reflect -H "X-Reflection-Secret: ..." and check the response counts.

State of the art note: The three-tier write pipeline is more sophisticated than what most personal-assistant projects ship — including some commercial ones. The frontier comparison is MemGPT / generative agents (Stanford 2023) which use a similar reflection/summarisation hierarchy. Where you diverge from SOTA: no vector forgetting (old facts never decay; you'd need TTLs or relevance-decay scoring at scale) and no episodic vs semantic memory split (everything goes in the same pile). For one user with thousands of facts, fine; for millions, you'd partition.


Session 6 — Interfaces & Multimodality

Goal: see how the same agent serves four very different surfaces.

6.1 Telegram (primary, app/interfaces/telegram/bot.py)

  • handle_message (text), handle_voice_message, handle_photo_message, handle_document_message
  • python-telegram-bot long-polling inside the FastAPI lifespan (no webhook needed)
  • Live-edit streaming: edits the placeholder message every 1.5s with the growing buffer
  • Voice-reply trigger regex (_VOICE_TRIGGER_RE): "tell me", "read it", "say that", "speak to me" → reply with TTS audio if response ≤500 words
  • Allowlist enforcement on every handler (ALLOWED_TELEGRAM_USER_IDS)
  • Audit log written after each interaction
  • Post-conversation asyncio.create_task() calls fire and forget

6.2 CLI (app/interfaces/cli/client.py)

  • Pure REPL — no approval gate, no streaming, no multimodal
  • Same AgentDeps, same agent — proves the interface abstraction works

6.3 WhatsApp webhook (app/interfaces/whatsapp/webhook.py)

  • Meta platform requires HTTPS webhook (no polling option)
  • Signature verification + dedup by message ID
  • Same intent routing + agent path; voice gets text reply (no TTS)

6.4 External API (POST /message in app/main.py)

  • For Android HTTP Shortcuts and other clients
  • X-API-Token auth, accepts text + optional image
  • Uses BRIEFING_CHAT_ID as user_id so history unifies with Telegram
  • Response delivered to Telegram and returned in the JSON body

6.5 Multimodal pipelines

  • STT: Gemini gemini-2.5-flash directly (model name derived from MODEL_NAME by stripping provider prefix)
  • Vision: same Gemini model, takes raw image bytes + mime type
  • TTS: edge-tts (Microsoft Neural voices, free, no API key) — TTS_VOICE defaults to en-GB-RyanNeural
  • PDF: routed to Gemini Vision (handles PDFs natively)
  • E2B code execution: execute_python runs in an ephemeral cloud VM; chart outputs return as [CHART_PNG:<b64>] markers that the bot extracts and sends as photos

Exercises:

  1. Run the CLI (uv run python -m app.main --cli) and watch the same agent answer with no approval prompts.
  2. Send Kwasi a voice note saying "what's on my calendar tomorrow?" and trace the path: download → STT → agent → TTS.
  3. Send the text "tell me the weather" — observe the voice-reply trigger fire on a typed message.

State of the art note: Single-codepath-multi-interface is increasingly the dominant pattern — the alternative ("one process per channel") is dead for systems this size. Your multimodal stack (all Gemini for input, edge-tts for output) is genuinely cost-optimal for a personal assistant. Frontier alternatives (Whisper-large-v3 for STT, ElevenLabs for TTS) give better quality but cost 10–100× more. You picked the right knee of the price/quality curve.


Session 7 — Background Loops & Observability

Goal: understand the "everything that happens when the user isn't talking to Kwasi" half of the system.

7.1 The 13 background loops (all in app/main.py lifespan)

Categorise by purpose:

  • Proactive comms: morning briefing, evening recap, weekly recap, weekly prep, read-later digest, journal digest, email intelligence
  • Reactive: reminders, alerts + intentions, meeting prep, user scheduled tasks
  • Maintenance: nightly reflection, approval expiry / audit pruning

Patterns to notice:

  • Every loop is gated by an env var (BRIEFING_CHAT_ID, TELEGRAM_TOKEN, etc.) — missing config disables the loop cleanly
  • Dedup via the context table acting as a KV store: keys like system:briefing (today's date), system:meeting_prep:<event_id> prevent duplicate sends across container restarts
  • All cron-shaped loops use croniter for "next fire time" math (sketched below)
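
A sketch of the shared loop shape (key names, settings fields, and storage helpers are illustrative; the real loops are in app/main.py):

```python
import asyncio
from datetime import date, datetime, timezone
from croniter import croniter

async def briefing_loop(deps, cron: str = "0 7 * * *") -> None:
    if not deps.settings.briefing_chat_id:
        return                                           # missing config disables the loop cleanly
    while True:
        now = datetime.now(timezone.utc)
        next_fire = croniter(cron, now).get_next(datetime)
        await asyncio.sleep((next_fire - now).total_seconds())
        key, today = "system:briefing", date.today().isoformat()
        if await deps.storage.get_context(key) == today:
            continue                                     # already sent today (e.g. container restarted)
        await send_morning_briefing(deps)
        await deps.storage.set_context(key, today)       # context table doubling as a dedup KV
```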

7.2 Observability (Spec 009, app/observability.py)

The key trick: one OTEL tracer provider, two backends (sketched below).

  • init_observability(settings, app) configures Logfire AND Langfuse on the same global provider
  • Agent.instrument_all() (Pydantic AI) emits spans once; both processors observe them
  • Division of labor:
      • Logfire owns infrastructure: FastAPI routes, loop spans, exceptions, latency by file
      • Langfuse owns LLM telemetry: per-generation tokens/cost, prompt versions, sessions, scores
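
The idea, illustrated with plain OTEL primitives (the real init_observability goes through the Logfire and Langfuse SDKs rather than raw OTLP exporters, and the endpoints below are hypothetical):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# Two processors on one provider: every span is emitted once, observed twice.
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="https://logfire.example/v1/traces")))
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="https://langfuse.example/v1/traces")))
trace.set_tracer_provider(provider)
```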

7.3 Per-turn trace grouping

  • langfuse_root_span(name, session_id, user_id, ...) is an async context manager wrapping a whole turn (message + agent run + post-conv tasks)
  • All agent.run() calls inside nest under one Langfuse trace — without this you'd get one trace per LLM call instead of one per user turn
  • session_id shape: "telegram:{chat_id}", "cli:local", etc. — Langfuse aggregates by session

7.4 Asynchronous trace scoring

  • When the user taps Confirm/Cancel/Edit minutes after the original turn, score_trace(action.trace_id, name, value) writes a quality score to the original (already-closed) trace
  • PendingAction.trace_id is captured at gate time via _current_otel_trace_id() precisely so this delayed score lands on the right trace
  • Three score names: user_approval (1.0/0.0), user_edit, agent_error

7.5 Managed prompts (app/prompts.py + prompts.lock.json)

  • Nine prompts (persona, tone_calibration, morning_briefing, evening_recap, weekly_recap, weekly_prep, journal_digest, email_intel, reflection) are managed in Langfuse UI
  • get_prompt(name, fallback) returns the production-labeled Langfuse version when reachable, else the code constant
  • prompts.lock.json pins each constant's sha256
  • check_drift() runs at startup — warns if code has been edited without sync_prompts.py --push
  • scripts/sync_prompts.py is the only path that writes to Langfuse (--check / --push / --pull)
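
A sketch of the drift check (the lock-file layout, a flat name-to-sha256 map, is an assumption; the real logic is check_drift() plus scripts/sync_prompts.py):

```python
import hashlib
import json

def check_drift(prompt_constants: dict[str, str], lock_path: str = "prompts.lock.json") -> list[str]:
    with open(lock_path) as f:
        lock = json.load(f)
    drifted = []
    for name, text in prompt_constants.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if lock.get(name) != digest:
            drifted.append(name)    # code constant edited without sync_prompts.py --push
    return drifted
```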

Exercises:

  1. Open the Logfire dashboard, find a recent telegram/message span, look at its attributes (est_tokens_history, context_layers, etc.).
  2. Open Langfuse, find a recent trace by session_id="telegram:<your_chat_id>". Look for an associated user_approval score.
  3. Run uv run python scripts/sync_prompts.py --check. If it exits 1, you have drift — --push would resync.

State of the art note: Two independent observability backends sharing one OTEL provider is above-average production hygiene — many teams ship with only one (or worse, none). What's frontier: also wiring prompt-level evals (Promptfoo, Langfuse evaluations) and online A/B testing of prompt variants. You have the trace scoring infrastructure; running A/B tests on it would be the natural next step.


Session 8 — Health Data, Self-Extending Skills, and the State-of-the-Art Audit

Goal: the recently-shipped Spec 010 (health) and a frank comparison of Kwasi against frontier personal-assistant work.

8.1 Health data ingest (Spec 010)

  • The constraint: neither Samsung Health nor Google Health Connect exposes a server-side API. Data lives on-device.
  • The rejected option: Terra/Rook/Thryve (paid aggregators, ~$400/mo, still need an Android piece)
  • The chosen option: write a tiny Kotlin Android app (bridge-android/) that reads Health Connect locally via WorkManager every 15 min and POSTs to /health/ingest
  • Sideloaded — never hits Play Store, sidesteps Google's sensitive-permissions review entirely

8.2 The endpoint design (app/routers/health.py)

  • POST /health/ingest — X-Health-Secret auth, accepts a batch of HealthSample rows
  • Idempotent on (metric_type, start_time, source_device) — bridge can retry safely after network blips
  • Single normalised table: health_samples with metric_type (text) + value (jsonb) — handles 11 metric types without a schema change per type
  • 4 read-only tools (get_recent_health, get_sleep_summary, get_hrv_trend, get_health_snapshot) on a dedicated health_agent
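
A sketch of the idempotent write (column names follow the description above; it assumes a unique constraint on the dedup triple, and the real endpoint lives in app/routers/health.py):

```python
import json

async def ingest_samples(conn, samples: list[dict]) -> None:
    # asyncpg-style parameterised insert; duplicates from bridge retries are silently skipped.
    for s in samples:
        await conn.execute(
            """
            INSERT INTO health_samples (metric_type, start_time, source_device, value)
            VALUES ($1, $2, $3, $4)
            ON CONFLICT (metric_type, start_time, source_device) DO NOTHING
            """,
            s["metric_type"], s["start_time"], s["source_device"], json.dumps(s["value"]),
        )
```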

8.3 What's not yet shipped on this spec

  • Phase 3: wire health into morning briefing context, nightly reflection, and the alert-rule engine (e.g. "alert me if HRV drops 20% below baseline")

8.4 Spec 011 (self-extending skills) — currently a draft

  • Read specs/011-*.md if it exists — this is the next frontier piece you've been thinking about
  • The idea: let the agent itself author new skills based on observed patterns, with code review approval gates

8.5 State-of-the-art audit — what you've nailed, what's missing

Solidly modern (frontier or near-frontier):

  ✅ Pydantic AI agent framework with type-safe tool registration
  ✅ Domain-scoped tool sets (reduces "too many tools" failure mode)
  ✅ Hybrid retrieval (semantic + keyword, RRF fusion)
  ✅ Three-tier memory hierarchy (session → daily → permanent)
  ✅ Cache-friendly system prompt design (datetime in user turn)
  ✅ Dual observability with shared OTEL provider
  ✅ Managed prompts with drift detection
  ✅ Multi-step planning with scratchpad threading and resume
  ✅ Approval gate via stateless sentinel pattern
  ✅ MCP for external integrations
  ✅ Push-model wearable ingest (sidesteps both privacy and OAuth issues)

Common patterns you've skipped (intentionally or not):

  ⚠️ No re-ranking pass on retrieved context (cross-encoder or LLM-as-reranker)
  ⚠️ No HyDE / query rewriting for retrieval
  ⚠️ No mid-turn self-critique loop (reflection happens nightly, not within a request)
  ⚠️ No structured output schemas on most tools (the planner uses one — others don't)
  ⚠️ No prompt evals or A/B testing harness on top of your scoring infrastructure

Frontier patterns worth considering next:

  🔮 Computer use (Claude's screen-grab tool use) — let Kwasi see and click, not just call APIs
  🔮 Sub-agent delegation with shared workspace — your planner could spawn parallel sub-agents instead of sequential steps
  🔮 Memory consolidation via graph extraction (Mem0, Zep) — extract entity relationships from conversations into a knowledge graph
  🔮 Continuous fine-tuning of a small classifier on your routing decisions — your keyword router could become a learned router with zero LLM cost
  🔮 Agentic evaluation — have a separate "judge" agent score the main agent's responses, feed scores into Langfuse


Where to save this

After approval, this curriculum will be written to docs/curriculum.md and linked from docs/index.md so it's discoverable alongside the other docs. We then walk through one session per sitting; you can interrupt for "wait, explain that" or "show me the code" at any point.

How to run the walkthrough

For each session, the rhythm is:

  1. You read the section (5–10 min)
  2. I open the actual files at the line numbers cited and we trace through together
  3. You do the exercises (10–20 min) — I'm here to answer questions
  4. We discuss the state-of-the-art note — what would you change with another year of work?

Total time across all 8 sessions: roughly 6–10 hours, comfortably spread over a week or two.