# Kwasi Codebase Curriculum — A Pedagogic Walkthrough

## Context
You built Kwasi to learn the modern AI stack by doing. Along the way, with heavy AI assistance, the codebase grew into something substantial: a multi-interface agent with planning, memory, reflection, observability, multimodal I/O, and a wearable-data pipeline. You now want to step back and understand what you actually built — not as the person who wrote each commit, but as the engineer who needs to defend, extend, and benchmark it against the state of the art.
This curriculum is a guided tour of the codebase, ordered to build understanding from the inside out: the runtime spine first, then routing, then tools, then memory, then interfaces, then observability, finishing with a state-of-the-art comparison so you can see clearly which techniques you've internalised, which you're using cosmetically, and which gaps would be worth closing next.
Where it lives: the curriculum will be saved to docs/curriculum.md in the repo so it sits next to the existing docs and you can return to it anytime. We'll walk through one session per sitting; each session is self-contained.
Format of each session:
- Concept primer — the technique in general (e.g. "what is a Pydantic AI agent loop") with the data-scientist-friendly framing
- In Kwasi — the specific files, functions, and line numbers where it's implemented
- Why these choices — the design tradeoffs the code embodies
- Exercises — 2–3 small things to try (read this code, trace this call, modify this config) to lock the concept in
- State of the art note — one paragraph: how this compares to what frontier teams are doing in 2026
## Session 1 — Orientation & The Runtime Spine
Goal: see the whole system from 30,000 ft, then drop into the two most central files: app/main.py (the FastAPI process) and app/agent.py (the brain).
### 1.1 The 30,000-ft view

- Read `docs/index.md` and `docs/architecture.md` (the mermaid diagram is the ground truth — return to it whenever you're lost)
- The five layers, top to bottom: Interfaces (Telegram/CLI/WhatsApp/HTTP) → Planning gate (Spec 008) → Intent routing → Pydantic AI agent → Tools / Storage / LLM
- Three model roles (`MODEL_NAME`, `MINI_MODEL_NAME`, `REFLECTION_MODEL_NAME`) and why they're separated
### 1.2 FastAPI lifespan as the orchestrator (`app/main.py`)

- `lifespan()` is the startup/shutdown context manager — every long-lived asyncio task is born and killed here
- Walk through what gets initialised: `build_deps()` → anchor cache pre-warm → Outlook MSAL persistence → Telegram bot → 13 background loops
- Look at `app/main.py:243–1487` (loop functions) and notice the pattern: `while True: await sleep; do work` — no Celery, no Redis queue, just asyncio
- Concept: this is the "majestic monolith" pattern for a single-user system — defensible because the loops are low-frequency
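The loop-in-lifespan pattern can be sketched framework-free — the `example_loop` body, the short timings, and the `lifespan()` shape here are illustrative stand-ins, not Kwasi's actual loops:

```python
import asyncio
import contextlib

results: list[str] = []

async def example_loop(interval_s: float) -> None:
    # The pattern from app/main.py in miniature: sleep, then do one unit of work, forever.
    while True:
        await asyncio.sleep(interval_s)
        results.append("tick")  # stand-in for "send briefing", "check reminders", ...

@contextlib.asynccontextmanager
async def lifespan():
    # Startup: spawn the loop and keep the handle so shutdown can cancel it.
    task = asyncio.create_task(example_loop(0.01))
    try:
        yield
    finally:
        task.cancel()  # cooperative shutdown
        with contextlib.suppress(asyncio.CancelledError):
            await task

async def main() -> None:
    async with lifespan():
        await asyncio.sleep(0.05)  # simulate the app serving requests

asyncio.run(main())
print(len(results))
```

Cancelling and awaiting the task in the `finally` block is what makes shutdown clean: the loop dies with the process instead of leaking.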
### 1.3 The single `AgentDeps` factory (`app/agent.py:82–115`)

- `build_deps()` is the only place `AgentDeps` is constructed — every interface shares one instance
- Per-request mutation of `telegram_chat_id`/`active_categories` happens via `dataclasses.replace` (shallow copy) — not by mutating the singleton
- Why: storage pools, MCP credentials, and HTTP clients are expensive to build; sharing them is correct, and the per-request copy keeps state clean
### 1.4 Pydantic AI in 5 minutes

- The agent loop: model decides → call tool → tool returns → model decides → ... → final text response
- `Agent(deps_type=AgentDeps, retries=3)` (`app/agent.py:118`)
- Tools registered with `@agent.tool` (sync access to `RunContext[AgentDeps]`)
- System prompt registered with `@agent.system_prompt` (dynamic — runs once per `agent.run()`)
- `agent.run()` vs `agent.run_stream()` — the latter yields partial responses, used for Telegram's live "typing" effect
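The loop in the first bullet can be made concrete with a toy, framework-free sketch — `fake_model`, the `get_time` tool, and the string protocol are inventions for illustration, not Pydantic AI's internals:

```python
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "get_time": lambda _: "09:30",  # hypothetical tool
}

def fake_model(history: list[str]) -> str:
    # Stand-in for the LLM: first asks for a tool, then produces the final answer.
    if not any(h.startswith("tool_result:") for h in history):
        return "tool_call:get_time:"
    return "final:It is 09:30."

def run(user_message: str) -> str:
    # The agent loop: keep asking the model until it emits a final response.
    history = [f"user:{user_message}"]
    while True:
        decision = fake_model(history)
        if decision.startswith("final:"):
            return decision.removeprefix("final:")
        _, name, arg = decision.split(":", 2)
        history.append(f"tool_result:{TOOLS[name](arg)}")

print(run("what time is it?"))  # → It is 09:30.
```

Pydantic AI's real loop adds typed tool schemas, retries, and validation on top, but the control flow is this shape.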
Exercises:
1. Open `app/main.py`, find every `asyncio.create_task` in `lifespan()`. Make a list. You should find ~13.
2. In `app/agent.py`, count the `@agent.tool` decorators. That's your native tool surface.
3. Trace one path: `python -m app.main --cli` → `cli_loop()` → `process_cli_message()` → `agent.run()`. What's the minimum stack between user input and LLM call?
State of the art note: Pydantic AI is one of three "production-grade" Python agent frameworks in 2026 (the others being LangGraph and OpenAI's Agents SDK). It wins on type safety (every tool argument is a Pydantic field) and provider portability (swap Gemini for Claude with one env var). It loses to LangGraph on graph-shaped agent topologies (yours is sequential, so this doesn't bite you).
## Session 2 — Intent Routing & Multi-Step Planning
Goal: understand the two pre-LLM filters that decide which agent and how many steps before any expensive call happens.
### 2.1 The intent router (`app/tools/router.py`)

- `classify_intent(message, context_hint)` (lines 668–739)
- Stage 1 — keyword match against `_INTENT_KEYWORDS` (a hand-tuned dict mapping phrases → categories): zero LLM cost, microsecond latency
- Stage 2 — semantic fallback: embed the message, cosine-compare against pre-computed "anchor embeddings" per domain (one short paragraph per domain, embedded at startup), threshold ≥0.60
- Stage 3 — context inheritance: if no match and the previous turn was in a single non-utility domain, inherit it (handles "anything else?")
- Concept primer: this is an embedding-based nearest-centroid classifier — same family as k-NN. Cheap and explainable; an LLM-based router would cost a full inference per message.
### 2.2 Agent dispatch (`app/routing/agents.py`)

- Twelve domain agents (email, calendar, memory, github, jira, drive, slack, news, meetings, diagnostics, health, utility) + `briefing_agent` + `full_agent`
- Each domain agent is built once at import time with only its relevant tools — fewer tools = better LLM accuracy and lower per-call tokens
- `select_agent(categories)` (line 535+) does the dispatch; multi-domain matches go through `_get_composed_agent(frozenset(categories))`, which is `@lru_cache(maxsize=32)`'d
- Why pre-build: the cost of `Agent(...)` construction (tool schema generation) is non-trivial. Doing it per-request would add latency for nothing.
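The caching idea can be sketched like this (`build_agent` is a cheap stand-in for the real `Agent(...)` construction):

```python
from functools import lru_cache

def build_agent(categories: frozenset[str]) -> dict:
    # Stand-in for Agent(...) construction: pretend this is expensive.
    return {"categories": sorted(categories)}

@lru_cache(maxsize=32)
def get_composed_agent(categories: frozenset[str]) -> dict:
    # frozenset is hashable and order-insensitive, so {"email","jira"} and
    # {"jira","email"} hit the same cache entry.
    return build_agent(categories)

a = get_composed_agent(frozenset({"email", "jira"}))
b = get_composed_agent(frozenset({"jira", "email"}))
assert a is b  # second call is a cache hit
assert get_composed_agent.cache_info().hits == 1
```

`frozenset` is the key trick: it makes the category combination usable as a cache key regardless of order.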
### 2.3 Multi-step planning (Spec 008, `app/planning/`)

- `classify_complexity(text)` — regex pre-filter on connectives ("and then", "after that", "also send"), ≥30 chars. Skips planning entirely for simple messages — zero LLM cost on the common case.
- `generate_plan(text, deps)` — calls a tool-less `_planner_agent` with `output_type=ExecutionPlan`. Returns `None` if `needs_planning=False` or fewer than 2 steps.
- `execute_plan(plan, deps, send_progress)` — runs steps sequentially, each through the same `classify_intent + select_agent` machinery
- Scratchpad threading: each step's output (truncated to 300 chars) is prepended to the next step's message — later steps build on earlier results without re-fetching
- Resume on failure: a failed step saves a `PendingAction(action_type="plan_resume")` carrying remaining steps + scratchpad — Confirm restarts from the failure point
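The pre-filter in the first bullet reduces to a two-condition check; a sketch (the connective list here is a guess at the spirit of the real one, not copied from `app/planning/`):

```python
import re

# Hypothetical connective list; the real pattern in app/planning/ may differ.
_CONNECTIVE_RE = re.compile(r"\b(and then|after that|also send)\b", re.IGNORECASE)

def classify_complexity(text: str) -> bool:
    """True → worth a planning LLM call; False → skip planning entirely."""
    return len(text) >= 30 and bool(_CONNECTIVE_RE.search(text))

assert classify_complexity("check email and then summarise the top three") is True
assert classify_complexity("what's the time?") is False  # too short, no connective
assert classify_complexity("and then") is False          # connective but < 30 chars
```

Both conditions must hold, which is what keeps the common single-intent message on the zero-cost path.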
Exercises:
1. Open `app/tools/router.py:193+` and read the `_INTENT_KEYWORDS` dict for one domain (say email). Notice the bias toward verbs and concrete nouns.
2. In Telegram, send "what's the time?" then "and what about in Tokyo?" — observe how the second message inherits utility/datetime context.
3. Send a 3-step request ("check email, then summarise the top 3, then save them as notes") and watch the plan preview render.
State of the art note: Frontier agents (Devin, Claude Code, Cursor's agent mode) lean heavier on planning — they often run a "thinking" pass with extended reasoning before tool selection. Yours is closer to the router-then-actor pattern that dominated 2024–2025 agentic systems and is still the right choice for narrow personal-assistant scope. The piece that's missing relative to SOTA: no reflection/self-critique loop within a turn (your reflection happens nightly, not mid-conversation).
## Session 3 — Tools, Skills, and the Approval Gate
Goal: understand the three layers between the LLM and the outside world.
### 3.1 Tool inventory tour (`app/tools/`, ~30 native tools)
Group by purpose:
- Web/research: `search_web` (Tavily), `summarize_url`, `browse_web` (Playwright), `deep_research` (skill)
- Productivity (MCP): Gmail/Outlook email (search, read, draft, send), calendars (Google + Outlook), Microsoft To Do, Google Drive
- Memory: notes, tasks, reminders, scheduled tasks, `user_facts`, journal entries, semantic search
- Code/DevOps: GitHub (PyGitHub), Jira, Slack, Logfire (diagnostics)
- Multimodal: `transcribe_audio` (Gemini STT), `analyze_image` (Gemini Vision), `synthesize_speech` (edge-tts)
- Maps/transit: `search_places`, `get_directions`, `check_transit_status`
- Code execution: `execute_python` (E2B sandbox)
- Health: `get_recent_health`, `get_sleep_summary`, `get_hrv_trend`, `get_health_snapshot`
### 3.2 The `asyncio.to_thread` pattern (`app/interfaces/mcp/client.py`)
- Gmail/Outlook/Drive/GitHub/Slack SDKs are all synchronous — calling them directly would block the event loop
- Each MCP wrapper does `await asyncio.to_thread(sync_fn, *args)` — offloads to Python's default thread pool
- Concept: this is the standard escape hatch for sync libraries in async Python. It's correct here because these calls are I/O-bound (network), not CPU-bound.
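A self-contained sketch of the pattern, with `time.sleep` standing in for a blocking SDK call:

```python
import asyncio
import time

def slow_sync_call(query: str) -> str:
    # Stand-in for a blocking SDK call (e.g. a synchronous Gmail request).
    time.sleep(0.05)
    return f"results for {query!r}"

async def wrapper(query: str) -> str:
    # Offload to the default thread pool so the event loop stays responsive.
    return await asyncio.to_thread(slow_sync_call, query)

async def main() -> list[str]:
    # Two blocking calls now overlap instead of serialising the event loop.
    return await asyncio.gather(wrapper("inbox"), wrapper("calendar"))

results = asyncio.run(main())
print(results)
```

Called directly (without `to_thread`), the two `time.sleep` calls would block the loop and run back to back; via the thread pool they run concurrently.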
### 3.3 The skills file-drop registry (`app/skills/`)

- `@skill` decorator (just appends to a list)
- `load_skills(agent)` walks `app/skills/*.py`, imports each, registers decorated functions on the agent
- Idempotent (calling twice is a no-op)
- Built-in skills: `read_later`, `travel_briefing`, `cv`, `research`, `meeting_notes`
- Why: extending Kwasi without touching `agent.py` reduces merge-conflict surface and makes skills genuinely modular
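The whole registry mechanism is small enough to sketch in full. Names mirror the description above, but the bodies are illustrative — the real `load_skills` walks `app/skills/*.py` with importlib rather than using a pre-populated list:

```python
from typing import Callable

_SKILLS: list[Callable] = []

def skill(fn: Callable) -> Callable:
    # The entire decorator: append and return the function unchanged.
    _SKILLS.append(fn)
    return fn

def load_skills(agent_tools: dict[str, Callable]) -> None:
    # Idempotent: registering an already-present skill is a no-op.
    for fn in _SKILLS:
        agent_tools.setdefault(fn.__name__, fn)

@skill
def read_later(url: str) -> str:  # toy skill body, not the real one
    return f"saved {url}"

tools: dict[str, Callable] = {}
load_skills(tools)
load_skills(tools)  # second call changes nothing
assert list(tools) == ["read_later"]
assert tools["read_later"]("https://example.com") == "saved https://example.com"
```

`setdefault` is what buys idempotency: re-registration can never clobber an existing entry.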
### 3.4 The approval gate (`app/approval.py`)
This is one of the cleverest pieces in the codebase — read it carefully.
- Problem: how do you let the user veto a destructive tool call (send email, delete note) without pausing the LLM mid-run?
- Solution: the gated tool returns the string `[APPROVAL_PENDING:<uuid>]` instead of executing. The agent treats this as a normal tool result, writes a "your action is pending approval" response, and exits cleanly.
- The bot post-processes the response, finds the sentinel, looks up the `PendingAction`, and renders Confirm/Cancel/Edit buttons
- On Confirm, an `ACTION_REGISTRY` lookup finds the actual executor and runs it (no second LLM call)
- Why `approval.py` is a leaf module: it imports only from `app/memory/ports`. Both `agent.py` and `bot.py` import it, but neither imports the other. This breaks a circular-import chain that would form if the gate lived in `bot.py`.
Exercises:
1. Find the `@skill` decorator in `app/skills/__init__.py`. It's ~3 lines. Marvel at the simplicity.
2. Pick one MCP tool (say `gmail_read_email_wrapper`) and trace it from `agent.py` registration → `client.py` wrapper → underlying `asyncio.to_thread` call.
3. In `app/approval.py`, find `approval_gate()` and follow what happens when `deps.interface != "telegram"` (the CLI/WhatsApp/API path). Notice the bypass.
State of the art note: The sentinel pattern is unusual — most agent frameworks (LangGraph, OpenAI Assistants) handle approval via "interrupt" primitives that pause the run. Yours is simpler and stateless, which is genuinely a smart design choice for a single-user system. The cost: you can't ask the LLM to react to the user's edit ("I changed my mind, send to Bob instead"); you re-run the agent with a [REVISION] prompt instead. For Kwasi's scope this is fine.
## Session 4 — Memory I: Storage, Embeddings, Context Injection
Goal: understand how Kwasi remembers — in three time scales (per-request, per-day, per-lifetime).
### 4.1 The `StoragePort` protocol (`app/memory/ports.py`)

- A Python `Protocol` (structural typing) defines the storage interface — both adapters implement it
- Models: `Interaction`, `Note`, `Task`, `Reminder`, `UserContext`, `UserFact`, `ScheduledTask`, `AlertRule`, `PendingAction`, `PendingIntention`, `AgentLearning`, `JournalEntry`, `ReadLaterItem`, `NewsTopic`, `SeenStory`, `AuditEntry`, `SemanticSearchResult`, plus `HealthSample` from `app/health/models.py`
- Why a protocol, not an ABC: duck typing. The SQLite/Postgres adapters don't inherit from anything — they just implement the methods.
### 4.2 Two adapters (`app/memory/adapters/`)

- SQLite (`sqlite.py`): aiosqlite, embeddings stored as JSON-encoded text, cosine similarity computed in Python
- Postgres (`postgres.py`): asyncpg + pgvector with an HNSW index, embeddings as `halfvec(3072)` (16-bit; full-precision `vector(3072)` exceeds pgvector's 2000-dim HNSW limit), cosine via the `<=>` operator
- Both implement hybrid search: keyword (ILIKE) + semantic in parallel, fused via Reciprocal Rank Fusion (RRF, k=60)
- Concept primer: HNSW (Hierarchical Navigable Small World) is the dominant ANN index for high-dimensional vectors — sub-linear search with high recall. Cosine via `<=>` is the standard pgvector idiom.
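RRF itself is a one-liner per document: each ranked list contributes `1/(k + rank)` to a document's score. A sketch with hypothetical note IDs:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword = ["note-7", "note-2", "note-9"]   # ILIKE hits, hypothetical IDs
semantic = ["note-2", "note-5", "note-7"]  # cosine hits

fused = rrf([keyword, semantic])
assert fused[0] == "note-2"  # ranked in both lists → wins
```

The appeal of RRF is that it needs only ranks, not comparable scores — ILIKE has no score at all, and cosine scores live on a different scale, yet fusion still works.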
### 4.3 Embeddings (`app/tools/embedding.py`)

- Model: Gemini `gemini-embedding-001` (3072 dims), falling back to `gemini-embedding-2-preview` on 404
- Direct REST call via `httpx` — no `google-genai` SDK dependency on this hot path
- Fire-and-forget on write: the row is INSERTed first, then `embed_text` is called and the embedding is UPDATEd onto the row. If embedding fails, the row exists without one — keyword search still works.
- Why fire-and-forget: the user-facing operation (saving a note) shouldn't block on a 200ms embedding call
### 4.4 Semantic context injection (`app/utils/message_utils.py`)

Before every `agent.run()`, three retrieval layers prepend XML-tagged context to the user message:
- `find_relevant_notes` (≥0.6 cosine) — top 2, with recency boost (+0.05 if <90 days old). Threshold lowered from 0.75 in May 2026 after measurement showed real matches sat in the 0.55–0.70 band.
- `find_relevant_summaries` (≥0.6 cosine) — top 2, matches notes prefixed `Summary:`. Lowered from 0.70 for the same reason.
- `find_relevant_read_later` — not semantic, just tag overlap (substring match), top 3 newest
All three share a 1,000-token budget; layers fill in priority order until exhausted.
- Why XML tags: the model can distinguish `<context type="notes">` from instructions — reduces the "the model treated my retrieved fact as a command" failure mode
- Why datetime in the user turn, not the system prompt: keeps the system prompt byte-identical across requests, qualifying for Gemini's implicit prompt cache
Exercises:
1. Open `app/memory/adapters/sqlite.py:1479+` and read the cosine computation. Then `postgres.py:1586+` for the SQL version. Same idea, two implementations.
2. Trace one note save: `save_note` tool → adapter `save_note()` → INSERT → `embed_text()` → UPDATE. Find where it can fail safely.
3. In `message_utils.py`, find `inject_context()` and read how the budget accounting works.
State of the art note: The retrieval pattern (semantic + keyword + tag-overlap, with budget-capped XML-tagged injection) is essentially RAG done well for a single-user assistant. What you're missing relative to frontier RAG: no re-ranking (a cross-encoder pass to re-order the top-K), no query rewriting (HyDE / sub-query generation), no structured retrieval (over a knowledge graph). For a personal assistant where the corpus is small (your own notes), these are likely premature. For a multi-user system at scale, you'd want them.
## Session 5 — Memory II: The Three-Tier Pipeline & Reflection Engine
Goal: understand how raw conversation becomes lasting structured memory.
### 5.1 Short-term memory: the `message_history` (`app/utils/message_utils.py`)

- Loaded via `fetch_message_history()` per turn. Default mode: last 10 interactions, chronological. With `ENABLE_SEMANTIC_HISTORY=true`: last 3 verbatim + top 3 semantically relevant older interactions (recency-boosted). Falls back to chronological on any retrieval failure.
- `build_message_history()` then enforces a 6,000-token budget, dropping oldest first
- Converted to Pydantic AI `ModelRequest`/`ModelResponse` pairs
- The retrieval step is wrapped in `@observe(name="message_history_retrieval")` so it shows up under each turn in Langfuse with metadata `{mode, recent_count, semantic_count, semantic_enabled}`
- Why a token budget, not a count: 10 short messages and 10 long messages cost wildly different tokens
- Why two modes: chronological is reliable but doesn't surface relevant older context (e.g. "what did we discuss about X two weeks ago?"). Semantic mode trades pure recency for relevance, with a recency tail to preserve dialog coherence.
### 5.2 The three-tier write pipeline

| Tier | Latency | What's written | Where |
|---|---|---|---|
| Post-message | seconds | explicit user facts | `extract_facts_from_exchange` (`post_conversation.py:55`) |
| Post-session | ~30 min after last message | session summary as `Summary: <topic>` notes | `summarise_session` (`post_conversation.py:96`) |
| Nightly | 2 AM UTC | full profile rewrite, intentions, learnings, full-history clusters | `ReflectionService.run()` (`reflection.py:213`) |
### 5.3 The session-close timer trick (`app/interfaces/telegram/bot.py`)

- Module-level `_session_tasks: dict[str, asyncio.Task]` — one pending task per chat
- Every new message cancels the previous timer and schedules a fresh 30-min one
- After 30 min of silence, `summarise_session()` runs once
- Concept: this is a debounce. Same pattern as keystroke debouncing in UIs.
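The debounce can be sketched with short timeouts — 0.05s stands in for the real 30 minutes, and `_close_session_later` stands in for `summarise_session()`:

```python
import asyncio

_session_tasks: dict[str, asyncio.Task] = {}
closed: list[str] = []

async def _close_session_later(chat_id: str, delay_s: float) -> None:
    await asyncio.sleep(delay_s)
    closed.append(chat_id)  # stand-in for summarise_session()

def on_message(chat_id: str, delay_s: float = 0.05) -> None:
    # Debounce: each message cancels the pending timer and starts a fresh one.
    if (old := _session_tasks.get(chat_id)) is not None:
        old.cancel()
    _session_tasks[chat_id] = asyncio.create_task(_close_session_later(chat_id, delay_s))

async def main() -> None:
    for _ in range(3):        # three quick messages...
        on_message("chat-1")
        await asyncio.sleep(0.01)
    await asyncio.sleep(0.1)  # ...then silence

asyncio.run(main())
assert closed == ["chat-1"]   # summarised exactly once
```

Cancel-and-reschedule means only the timer started by the *last* message ever survives long enough to fire.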
### 5.4 The reflection engine (`app/memory/reflection.py`)

Four outputs from a single LLM call:
1. `---PROFILE---` — narrative markdown, 6 sections, ≤550 words, capped at 4,800 chars before injection
2. `---FACTS---` — JSON array of new/changed `UserFact` records
3. `---INTENTIONS---` — JSON array of soft commitments ("I should call the dentist") with `follow_up_days`
4. `---LEARNINGS---` — JSON array of behavioral rules ("don't ask before saving notes") with `category`
Critical design choice: the prompt receives the existing profile, facts, intentions, and learnings as input, so the LLM only emits new records. Without this, every nightly run would re-emit the same facts and the dedup logic would have to handle it — much more expensive and error-prone.
Topic clustering (`_summarise_conversations`, `reflection.py:635+`): looks back 30 days, identifies up to 7 topics, and writes each as a `Summary: <topic>` note. These are then findable by `find_relevant_summaries` in the next conversation.
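Splitting one LLM response on the four markers is straightforward; a sketch (the real parser in `reflection.py` may differ in details, and the sample response is invented):

```python
import json

MARKERS = ("---PROFILE---", "---FACTS---", "---INTENTIONS---", "---LEARNINGS---")

def parse_reflection(raw: str) -> dict[str, str]:
    """Split one LLM response into its four marker-delimited sections."""
    sections: dict[str, str] = {}
    current = None
    for line in raw.splitlines():
        if line.strip() in MARKERS:
            current = line.strip().strip("-")  # "---FACTS---" → "FACTS"
            sections[current] = ""
        elif current:
            sections[current] += line + "\n"
    return {k: v.strip() for k, v in sections.items()}

raw = """---PROFILE---
Prefers concise answers.
---FACTS---
[{"key": "dentist", "value": "Dr. Lee"}]
---INTENTIONS---
[]
---LEARNINGS---
[]"""

parsed = parse_reflection(raw)
assert parsed["PROFILE"] == "Prefers concise answers."
assert json.loads(parsed["FACTS"])[0]["key"] == "dentist"
```

Text markers plus per-section JSON is a pragmatic middle ground: one LLM call, four outputs, and only the structured sections need to parse as JSON.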
### 5.5 Where memory is read

`build_system_prompt()` (`app/agent.py:734+`) on every request:
- Fetches `UserContext` (the narrative profile) → injected as `## Your Memory of This User`
- Fetches all `UserFact` records → injected as `## What I Know About You`, grouped by category
- Fetches active `AgentLearning` records → injected as `## Behavioral Guidelines`
Exercises:
1. Open `app/memory/post_conversation.py` and read `_FACT_EXTRACTION_PROMPT` (lines 16–28). Notice how strict it is — only explicit facts.
2. In `reflection.py`, find `_REFLECTION_PROMPT` and trace how existing facts are interpolated to prevent duplicates.
3. Trigger a reflection manually: `curl -X POST $URL/reflect -H "X-Reflection-Secret: ..."` and check the response counts.
State of the art note: The three-tier write pipeline is more sophisticated than what most personal-assistant projects ship — including some commercial ones. The frontier comparison is MemGPT / generative agents (Stanford 2023) which use a similar reflection/summarisation hierarchy. Where you diverge from SOTA: no vector forgetting (old facts never decay; you'd need TTLs or relevance-decay scoring at scale) and no episodic vs semantic memory split (everything goes in the same pile). For one user with thousands of facts, fine; for millions, you'd partition.
## Session 6 — Interfaces & Multimodality
Goal: see how the same agent serves four very different surfaces.
### 6.1 Telegram (primary, `app/interfaces/telegram/bot.py`)

- `handle_message(text)`, `handle_voice_message`, `handle_photo_message`, `handle_document_message`
- `python-telegram-bot` long-polling inside the FastAPI lifespan (no webhook needed)
- Live-edit streaming: edits the placeholder message every 1.5s with the growing buffer
- Voice-reply trigger regex (`_VOICE_TRIGGER_RE`): "tell me", "read it", "say that", "speak to me" → reply with TTS audio if the response is ≤500 words
- Allowlist enforcement on every handler (`ALLOWED_TELEGRAM_USER_IDS`)
- Audit log written after each interaction
- Post-conversation `asyncio.create_task()` calls fire and forget
### 6.2 CLI (`app/interfaces/cli/client.py`)

- Pure REPL — no approval gate, no streaming, no multimodal
- Same `AgentDeps`, same agent — proves the interface abstraction works
### 6.3 WhatsApp webhook (`app/interfaces/whatsapp/webhook.py`)
- Meta platform requires HTTPS webhook (no polling option)
- Signature verification + dedup by message ID
- Same intent routing + agent path; voice gets text reply (no TTS)
### 6.4 External API (`POST /message` in `app/main.py`)

- For Android HTTP Shortcuts and other clients
- `X-API-Token` auth; accepts text + optional image
- Uses `BRIEFING_CHAT_ID` as `user_id` so history unifies with Telegram
- Response delivered to Telegram and returned in the JSON body
### 6.5 Multimodal pipelines

- STT: Gemini `gemini-2.5-flash` directly (model name derived from `MODEL_NAME` by stripping the provider prefix)
- Vision: same Gemini model, takes raw image bytes + mime type
- TTS: `edge-tts` (Microsoft Neural voices, free, no API key) — `TTS_VOICE` defaults to `en-GB-RyanNeural`
- PDF: routed to Gemini Vision (handles PDFs natively)
- E2B code execution: `execute_python` runs in an ephemeral cloud VM; chart outputs return as `[CHART_PNG:<b64>]` markers that the bot extracts and sends as photos
Exercises:
1. Run the CLI (`uv run python -m app.main --cli`) and watch the same agent answer with no approval prompts.
2. Send Kwasi a voice note saying "what's on my calendar tomorrow?" and trace the path: download → STT → agent → TTS.
3. Send the text "tell me the weather" — observe the voice-reply trigger fire on a typed message.
State of the art note: Single-codepath-multi-interface is increasingly the dominant pattern — the alternative ("one process per channel") is dead for systems this size. Your multimodal stack is all Gemini for input and edge-tts for output, which is genuinely cost-optimal for a personal assistant. Frontier alternatives (Whisper-large-v3 for STT, ElevenLabs for TTS) give better quality but cost 10–100× more. You picked the right knee of the price/quality curve.
## Session 7 — Background Loops & Observability
Goal: understand the "everything that happens when the user isn't talking to Kwasi" half of the system.
### 7.1 The 13 background loops (all in `app/main.py` lifespan)

Categorise by purpose:
- Proactive comms: morning briefing, evening recap, weekly recap, weekly prep, read-later digest, journal digest, email intelligence
- Reactive: reminders, alerts + intentions, meeting prep, user scheduled tasks
- Maintenance: nightly reflection, approval expiry / audit pruning
Patterns to notice:
- Every loop is gated by an env var (`BRIEFING_CHAT_ID`, `TELEGRAM_TOKEN`, etc.) — missing config disables the loop cleanly
- Dedup via the context table acting as a KV store: keys like `system:briefing` (today's date) and `system:meeting_prep:<event_id>` prevent duplicate sends across container restarts
- All cron-shaped loops use `croniter` for "next fire time" math
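The KV-dedup idea, sketched with an in-memory dict standing in for the context table:

```python
import datetime

context_kv: dict[str, str] = {}  # stand-in for the context table as KV store

def should_send_briefing(today: datetime.date) -> bool:
    """Dedup across restarts: only send if today's key isn't already recorded."""
    key = "system:briefing"
    if context_kv.get(key) == today.isoformat():
        return False  # already sent today
    context_kv[key] = today.isoformat()  # record so a restart won't re-send
    return True

day = datetime.date(2026, 5, 1)
assert should_send_briefing(day) is True                             # first send of the day
assert should_send_briefing(day) is False                            # restart mid-day → suppressed
assert should_send_briefing(day + datetime.timedelta(days=1)) is True  # next day fires again
```

Because the real key lives in the database rather than process memory, the guard survives container restarts — which is the whole point.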
### 7.2 Observability (Spec 009, `app/observability.py`)
The cleverest piece: one OTEL tracer provider, two backends.
- `init_observability(settings, app)` configures Logfire AND Langfuse on the same global provider
- `Agent.instrument_all()` (Pydantic AI) emits spans once; both processors observe them
- Division of labor:
- Logfire owns infrastructure: FastAPI routes, loop spans, exceptions, latency by file
- Langfuse owns LLM telemetry: per-generation tokens/cost, prompt versions, sessions, scores
### 7.3 Per-turn trace grouping

- `langfuse_root_span(name, session_id, user_id, ...)` is an async context manager wrapping a whole turn (message + agent run + post-conversation tasks)
- All `agent.run()` calls inside it nest under one Langfuse trace — without this you'd get one trace per LLM call instead of one per user turn
- `session_id` shape: `"telegram:{chat_id}"`, `"cli:local"`, etc. — Langfuse aggregates by session
### 7.4 Asynchronous trace scoring

- When the user taps Confirm/Cancel/Edit minutes after the original turn, `score_trace(action.trace_id, name, value)` writes a quality score to the original (already-closed) trace
- `PendingAction.trace_id` is captured at gate time via `_current_otel_trace_id()` precisely so this delayed score lands on the right trace
- Three score names: `user_approval` (1.0/0.0), `user_edit`, `agent_error`
### 7.5 Managed prompts (`app/prompts.py` + `prompts.lock.json`)

- Nine prompts (`persona`, `tone_calibration`, `morning_briefing`, `evening_recap`, `weekly_recap`, `weekly_prep`, `journal_digest`, `email_intel`, `reflection`) are managed in the Langfuse UI
- `get_prompt(name, fallback)` returns the production-labeled Langfuse version when reachable, else the code constant
- `prompts.lock.json` pins each constant's sha256
- `check_drift()` runs at startup — warns if code has been edited without `sync_prompts.py --push`
- `scripts/sync_prompts.py` is the only path that writes to Langfuse (`--check` / `--push` / `--pull`)
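The drift check reduces to comparing a sha256 of the code constant against the pinned hash; a sketch (the lockfile shape and the prompt text here are assumptions, not copied from the repo):

```python
import hashlib
import json

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical constant and lockfile mirroring the prompts.lock.json idea.
PERSONA_PROMPT = "You are Kwasi, a helpful personal assistant."
lockfile = json.dumps({"persona": sha256(PERSONA_PROMPT)})

def check_drift(name: str, constant: str, lock_json: str) -> bool:
    """True if the code constant no longer matches its pinned hash."""
    pinned = json.loads(lock_json).get(name)
    return pinned != sha256(constant)

assert check_drift("persona", PERSONA_PROMPT, lockfile) is False
assert check_drift("persona", PERSONA_PROMPT + " Be brief.", lockfile) is True  # edited → drift
```

Pinning hashes rather than full prompt text keeps the lockfile small while still catching any byte-level edit that bypassed the sync script.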
Exercises:
1. Open the Logfire dashboard, find a recent `telegram/message` span, and look at its attributes (`est_tokens_history`, `context_layers`, etc.).
2. Open Langfuse and find a recent trace by `session_id="telegram:<your_chat_id>"`. Look for an associated `user_approval` score.
3. Run `uv run python scripts/sync_prompts.py --check`. If it exits 1, you have drift — `--push` would resync.
State of the art note: Two independent observability backends sharing one OTEL provider is above-average production hygiene — many teams ship with only one (or worse, none). What's frontier: also wiring prompt-level evals (Promptfoo, Langfuse evaluations) and online A/B testing of prompt variants. You have the trace scoring infrastructure; running A/B tests on it would be the natural next step.
## Session 8 — Health Data, Self-Extending Skills, and the State-of-the-Art Audit
Goal: the recently-shipped Spec 010 (health) and a frank comparison of Kwasi against frontier personal-assistant work.
### 8.1 Health data ingest (Spec 010)
- The constraint: neither Samsung Health nor Google Health Connect exposes a server-side API. Data lives on-device.
- The rejected option: Terra/Rook/Thryve (paid aggregators, ~$400/mo, still need an Android piece)
- The chosen option: write a tiny Kotlin Android app (`bridge-android/`) that reads Health Connect locally via WorkManager every 15 min and POSTs to `/health/ingest`
- Sideloaded — it never hits the Play Store, sidestepping Google's sensitive-permissions review entirely
### 8.2 The endpoint design (`app/routers/health.py`)

- `POST /health/ingest` — `X-Health-Secret` auth, accepts a batch of `HealthSample` rows
- Idempotent on `(metric_type, start_time, source_device)` — the bridge can retry safely after network blips
- Single normalised table: `health_samples` with `metric_type` (text) + `value` (jsonb) — handles 11 metric types without a schema change per type
- 4 read-only tools (`get_recent_health`, `get_sleep_summary`, `get_hrv_trend`, `get_health_snapshot`) on a dedicated `health_agent`
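Idempotency on a natural key is one `UNIQUE` constraint plus `ON CONFLICT DO NOTHING`; a SQLite sketch (the real table is Postgres with a jsonb `value`, and the column set here is simplified):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE health_samples (
        metric_type   TEXT,
        start_time    TEXT,
        source_device TEXT,
        value         TEXT,
        UNIQUE (metric_type, start_time, source_device)
    )
""")

def ingest(rows: list[tuple[str, str, str, str]]) -> None:
    # ON CONFLICT DO NOTHING makes retries after network blips harmless.
    conn.executemany(
        "INSERT INTO health_samples VALUES (?, ?, ?, ?) "
        "ON CONFLICT (metric_type, start_time, source_device) DO NOTHING",
        rows,
    )

batch = [("hrv", "2026-05-01T07:00:00Z", "watch-1", '{"ms": 52}')]
ingest(batch)
ingest(batch)  # bridge retry: same batch again
count = conn.execute("SELECT count(*) FROM health_samples").fetchone()[0]
assert count == 1
```

Pushing dedup into the database means the bridge's retry logic can stay dumb: re-POST the whole batch and let the constraint sort it out.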
### 8.3 What's not yet shipped on this spec
- Phase 3: wire health into morning briefing context, nightly reflection, and the alert-rule engine (e.g. "alert me if HRV drops 20% below baseline")
### 8.4 Spec 011 (self-extending skills) — currently a draft

- Read `specs/011-*.md` if it exists — this is the next frontier piece you've been thinking about
- The idea: let the agent itself author new skills based on observed patterns, with code-review approval gates
### 8.5 State-of-the-art audit — what you've nailed, what's missing

Solidly modern (frontier or near-frontier):
- ✅ Pydantic AI agent framework with type-safe tool registration
- ✅ Domain-scoped tool sets (reduces the "too many tools" failure mode)
- ✅ Hybrid retrieval (semantic + keyword, RRF fusion)
- ✅ Three-tier memory hierarchy (session → daily → permanent)
- ✅ Cache-friendly system prompt design (datetime in the user turn)
- ✅ Dual observability with a shared OTEL provider
- ✅ Managed prompts with drift detection
- ✅ Multi-step planning with scratchpad threading and resume
- ✅ Approval gate via stateless sentinel pattern
- ✅ MCP for external integrations
- ✅ Push-model wearable ingest (sidesteps both privacy and OAuth issues)

Common but you've skipped (intentionally or not):
- ⚠️ No re-ranking pass on retrieved context (cross-encoder or LLM-as-reranker)
- ⚠️ No HyDE / query rewriting for retrieval
- ⚠️ No mid-turn self-critique loop (reflection happens nightly, not within a request)
- ⚠️ No structured output schemas on most tools (the planner uses one — others don't)
- ⚠️ No prompt evals or A/B testing harness on top of your scoring infrastructure

Frontier patterns worth considering next:
- 🔮 Computer use (Claude's screen-grab tool use) — let Kwasi see and click, not just call APIs
- 🔮 Sub-agent delegation with a shared workspace — your planner could spawn parallel sub-agents instead of sequential steps
- 🔮 Memory consolidation via graph extraction (Mem0, Zep) — extract entity relationships from conversations into a knowledge graph
- 🔮 Continuous fine-tuning of a small classifier on your routing decisions — your keyword router could become a learned router with zero LLM cost
- 🔮 Agentic evaluation — have a separate "judge" agent score the main agent's responses and feed the scores into Langfuse
## Where to save this
After approval, this curriculum will be written to docs/curriculum.md and linked from docs/index.md so it's discoverable alongside the other docs. We then walk through one session per sitting; you can interrupt for "wait, explain that" or "show me the code" at any point.
## How to run the walkthrough

For each session, the rhythm is:
1. You read the section (5–10 min)
2. I open the actual files at the line numbers cited and we trace through together
3. You do the exercises (10–20 min) — I'm here to answer questions
4. We discuss the state-of-the-art note — what would you change with another year of work?
Total time across all 8 sessions: roughly 6–10 hours, comfortably spread over a week or two.