
Memory & Reflection

Kwasi has four layers of memory working together: short-term conversation history, a permanent facts store updated in real time, session summaries created after a ~30-minute quiet period, and a long-term narrative profile built nightly by the Reflection Engine.


Memory Architecture

graph TD
    subgraph ShortTerm["Short-Term (per session)"]
        H[Last 10 interactions\nfetched from storage per request]
        H --> MH[message_history\nModelRequest / ModelResponse pairs]
    end

    subgraph PostConv["Post-Conversation (fire-and-forget, bot.py)"]
        PCE[extract_facts_from_exchange\nruns immediately after every message]
        SCS[summarise_session\nruns 30 min after last message in chat]
        PCE --> UF
        SCS --> SN["Summary: topic notes"]
    end

    subgraph PermanentFacts["Permanent Facts (never pruned)"]
        UF[(user_facts table\nkey-value pairs\nupdated_at on each write)]
        WU[User explicit:\n'remember that my home is...']
        WA[Agent proactive:\nuser mentions address mid-conversation]
        WR[Reflection auto-extraction:\nnightly LLM pass extracts new facts]
        WU --> UF
        WA --> UF
        WR --> UF
    end

    subgraph LongTerm["Long-Term Narrative (nightly 2 AM)"]
        I[All interactions in last 24h] --> RS[ReflectionService]
        RS --> LLM[LLM — tool-less agent]
        LLM --> UC[UserContext\nmarkdown profile ≤550 words\n6 sections]
        LLM --> NF[New facts JSON array]
        LLM --> SN2[Conversation cluster notes\nSummary: topic]
        UC --> CT[(context table)]
        NF --> UF
        SN2 --> Notes[(notes table)]
        SN --> Notes
    end

    subgraph SystemPrompt["Every Request — build_system_prompt"]
        UF --> FS["'What I Know About You'\ngrouped by category\nalways present if facts exist"]
        CT --> PS["'Your Memory of This User'\nnarrative profile"]
        Notes --> SRJ[find_relevant_summaries\ninjected as context if similar]
        FS --> Agent[Agent run]
        PS --> Agent
        MH --> Agent
        SRJ --> Agent
    end

Memory timeline

| Tier | Latency | What gets saved |
| --- | --- | --- |
| Post-message | Seconds | Explicit facts stated in the current exchange (user_facts) |
| Post-session | ~30 minutes | Session summary clustered into Summary: <topic> notes (searchable immediately) |
| Nightly | 2 AM UTC | Full profile rewrite, intention extraction, behavioral learnings, full-history clustering |

Short-Term Memory

On every message (from any interface), the handler fetches recent interactions for that user via fetch_message_history() in app/utils/message_utils.py, then passes them to build_message_history(), which applies a token budget before sending them to the model:

recent = await fetch_message_history(
    storage=deps.storage,
    user_id=user_id,
    message=text,
    settings=deps.settings,
)
message_history = build_message_history(recent)  # drops oldest if > 6,000 est. tokens

Two retrieval modes, controlled by ENABLE_SEMANTIC_HISTORY:

  • Chronological (default) — last recent_count + semantic_count interactions ordered newest-first. Same behaviour as before this feature shipped.
  • Semantic (ENABLE_SEMANTIC_HISTORY=true) — last 3 verbatim turns (preserves dialog coherence) plus the top 3 semantically relevant older interactions, with a small recency boost on similarity so equally relevant recent matches surface first. Falls back to chronological on any failure (embed error, search error, hydrate error).

Both modes are wrapped in @observe(name="message_history_retrieval") so the retrieval step appears as a child observation under telegram.turn in Langfuse with metadata {mode, recent_count, semantic_count, semantic_enabled}.

build_message_history() then estimates tokens at ~4 chars/token and drops the oldest interactions first when the total exceeds TOKEN_HISTORY_BUDGET = 6000. The resulting list is converted into Pydantic AI ModelRequest / ModelResponse pairs and passed as message_history to agent.run() or agent.run_stream().

Why a token budget instead of a fixed count? A hard limit of 10 interactions behaves inconsistently — 10 one-line exchanges cost far fewer tokens than 10 multi-paragraph exchanges. The budget-based approach keeps context window usage predictable regardless of message length.
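
A minimal sketch of that trimming step, assuming interactions arrive newest-first and expose user_message / agent_response strings (the helper names here are illustrative, not the real build_message_history() internals):

TOKEN_HISTORY_BUDGET = 6000

def estimate_tokens(text: str) -> int:
    # The heuristic used throughout: roughly 4 characters per token.
    return len(text) // 4

def trim_to_budget(interactions: list[dict]) -> list[dict]:
    # Keep the newest interactions until the budget is exhausted,
    # which drops the oldest ones first.
    kept: list[dict] = []
    total = 0
    for item in interactions:  # newest-first
        cost = estimate_tokens(item["user_message"]) + estimate_tokens(item["agent_response"])
        if total + cost > TOKEN_HISTORY_BUDGET:
            break
        kept.append(item)
        total += cost
    return kept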

Tunable settings (all in app/config.py):

  • ENABLE_SEMANTIC_HISTORY (bool, default false)
  • SEMANTIC_HISTORY_RECENT_COUNT (default 3)
  • SEMANTIC_HISTORY_SEMANTIC_COUNT (default 3)
  • SEMANTIC_HISTORY_THRESHOLD (default 0.6)
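
For orientation, these could be declared with pydantic-settings roughly as below (a sketch only; the actual layout of app/config.py may differ):

from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Env vars map case-insensitively: ENABLE_SEMANTIC_HISTORY, etc.
    enable_semantic_history: bool = False
    semantic_history_recent_count: int = 3
    semantic_history_semantic_count: int = 3
    semantic_history_threshold: float = 0.6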


Semantic Context Injection

Before every agent run, up to three retrieval layers prepend relevant context to the user message. All three layers share a single 1,000-token budget (CONTEXT_TOKEN_BUDGET) — they are filled in priority order until the budget is exhausted.

flowchart TD
    A[User message arrives] --> B{GOOGLE_API_KEY set?}
    B -- No --> Z[Send raw message to agent]
    B -- Yes --> C[budget = 1,000 tokens]
    C --> D[Layer 1: find_relevant_notes\n≥0.6 similarity · top 2 · +recency boost\nexcludes Summary and Research prefixes]
    D --> E{Notes found\nand budget > 0?}
    E -- No --> G
    E -- Yes --> F[Append XML block\nbudget -= note_tokens]
    F --> G[Layer 2: find_relevant_summaries\n≥0.6 similarity · top 2 · +recency boost\nmatches Summary: prefix only]
    G --> H{Summaries found\nand budget > 0?}
    H -- No --> J
    H -- Yes --> I[Append XML block\nbudget -= summary_tokens]
    I --> J[Layer 3: find_relevant_read_later\ntag-overlap matching · up to 3 items]
    J --> K{Items found\nand budget > 0?}
    K -- No --> M
    K -- Yes --> L[Append XML block]
    L --> M[Prepend datetime: local time in user timezone]
    M --> Z
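
A minimal sketch of the shared-budget fill, assuming each layer is an async callable that returns a ready-made XML block or an empty string (signatures are illustrative):

from typing import Awaitable, Callable

CONTEXT_TOKEN_BUDGET = 1000

# (message, user_id) -> XML context block, or "" when nothing matches
Layer = Callable[[str, str], Awaitable[str]]

async def build_context(message: str, user_id: str, layers: list[Layer]) -> str:
    # Layers are tried in priority order (notes, summaries, read-later);
    # each appended block eats into the single shared budget.
    budget = CONTEXT_TOKEN_BUDGET
    blocks: list[str] = []
    for layer in layers:
        if budget <= 0:
            break
        block = await layer(message, user_id)
        if block:
            blocks.append(block)
            budget -= len(block) // 4  # same ~4 chars/token estimate
    return "\n".join(blocks)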

Each layer wraps its output in XML tags so the model can distinguish retrieved memory from instructions:

<context type="notes">
- Note title: first 200 chars of content
Build on these — don't repeat them verbatim.
</context>

<context type="summaries">
- Topic name: first 200 chars of summary
Use for continuity — don't reference it explicitly.
</context>

<context type="read_later">
- "Article title" (https://...)
  Summary: ...
  Your note: ...
Mention these and include the URL.
</context>

Datetime in the user turn: the current local time is prepended as [Thursday, April 24, 2026 — 09:15 AM Europe/Paris] to the user turn (not the system prompt). This keeps the ~3,000–5,000 token system prompt identical across requests, qualifying it for Gemini's implicit prompt cache. A changing timestamp in the system prompt would bust the cache on every request.
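
A sketch of that prefix, assuming the user's IANA timezone name is available:

from datetime import datetime
from zoneinfo import ZoneInfo

def datetime_prefix(tz: str) -> str:
    # Produces e.g. [Thursday, April 24, 2026 — 09:15 AM Europe/Paris]
    now = datetime.now(ZoneInfo(tz))
    return now.strftime(f"[%A, %B %d, %Y — %I:%M %p {tz}]")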

Logfire span attributes: each Telegram message span records est_tokens_history, est_tokens_context, est_tokens_user_turn, context_layers, and history_interactions so token distribution is observable in Logfire per request.


Permanent User Facts

The user_facts table stores specific, verifiable facts about the user — the kind that gets pruned from, or never makes it into, a 550-word narrative profile: addresses, phone numbers, names, dietary restrictions, and similar.

How facts get written

| Path | Trigger | source value | Latency |
| --- | --- | --- | --- |
| Explicit | User says "remember that my home is X" | "agent" | Immediate |
| Proactive | Agent calls remember_fact mid-conversation | "agent" | Immediate |
| Post-conversation | extract_facts_from_exchange() fires after every message; a mini-model extracts explicitly stated facts, pre-checks the existing value, skips if unchanged, and upserts if new or different | "agent" | Seconds |
| Auto-extraction | Nightly reflection extracts new facts from the last 24h of conversations | "reflection" | 2 AM UTC |

Agent tools

All three tools are registered on every domain agent (via _UTILITY_FNS) — not just the memory agent. The agent may learn personal facts during any conversation, regardless of domain.

| Tool | What it does |
| --- | --- |
| remember_fact(key, value, category) | Upsert a fact by key. Overwrites the previous value if the key already exists. |
| recall_facts(query="") | List all facts (empty query) or search keys + values. Returns facts grouped by category. |
| forget_fact(key) | Delete a fact by key. |

Key naming: snake_case, descriptive. Examples: home_address, workplace, partner_name, preferred_transport, dietary_restrictions, manager_name, morning_routine, birthday.

Categories: location, personal, preference, work, health, general.
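
As a sketch of the registration pattern: Pydantic AI accepts plain functions as tools at construction time. The tool bodies and model string below are illustrative:

from pydantic_ai import Agent

def remember_fact(key: str, value: str, category: str = "general") -> str:
    """Upsert a fact by key (body elided for illustration)."""
    ...

def recall_facts(query: str = "") -> str:
    """List or search stored facts (body elided)."""
    ...

def forget_fact(key: str) -> str:
    """Delete a fact by key (body elided)."""
    ...

_UTILITY_FNS = [remember_fact, recall_facts, forget_fact]

# Every domain agent gets the same utility tools, not just the memory agent.
memory_agent = Agent("google-gla:gemini-2.5-flash", tools=_UTILITY_FNS)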

How facts get read

build_system_prompt() fetches all user facts on every request and injects them as a structured section before the reflection profile:

## What I Know About You
Verified permanent facts — treat these as ground truth.

*Location*
- Home Address: 124 Avenue Perretti, Neuilly-sur-Seine
- Workplace: La Défense, Paris

*Personal*
- Partner Name: Ana

The system prompt _MEMORY_INSTRUCTIONS instructs the agent:

  • Check permanent facts before saying you don't know something about the user
  • Save personal information proactively without asking permission
  • Priority order: permanent facts → reflection profile → recall_facts search → search_history
  • Never say "I don't have your address" without checking all four sources


Post-Conversation Memory Pipeline

app/memory/post_conversation.py runs two background functions after every handle_message call in bot.py via asyncio.create_task(). Both are fire-and-forget — they never raise and never block the response to the user.
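
A minimal sketch of that fire-and-forget contract (names are illustrative):

import asyncio
import logging

logger = logging.getLogger(__name__)

async def _shielded(coro) -> None:
    try:
        await coro
    except Exception:
        # Background memory work must never break the reply path.
        logger.exception("post-conversation task failed")

def fire_and_forget(coro) -> asyncio.Task:
    # asyncio.create_task() returns immediately; the caller keeps a
    # reference so the task is not garbage-collected mid-flight.
    return asyncio.create_task(_shielded(coro))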

Immediate fact extraction

extract_facts_from_exchange(user_message, agent_response, storage, settings) fires immediately after every Telegram message. It sends the last exchange (user + agent, capped at 1000 chars each) to mini_model_name with a prompt that extracts only explicitly stated facts:

"By the way, I moved to Berlin last month."
→ [{"key": "home_city", "value": "Berlin", "category": "location"}]

"Hi, how's it going?"
→ []

Before upserting each fact:

  1. get_user_fact(key) is called to check the existing value
  2. If the value is identical — skip (no write, no re-embedding)
  3. If the value is different or absent — save_user_fact() upserts it

This means a fact like "I moved to Munich" correctly overwrites an existing home_city = "Berlin" within seconds of being stated, rather than at 2 AM.
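
A sketch of the pre-check, assuming async storage helpers with these approximate signatures:

async def upsert_if_changed(storage, fact: dict) -> bool:
    existing = await storage.get_user_fact(fact["key"])
    if existing is not None and existing.value == fact["value"]:
        return False  # identical value: skip the write and the re-embedding
    await storage.save_user_fact(
        key=fact["key"],
        value=fact["value"],
        category=fact["category"],
        source="agent",
    )
    return True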

Session-close summarisation

summarise_session(storage, settings, lookback_hours=3) fires after a 30-minute quiet period per chat. bot.py maintains a module-level _session_tasks: dict[str, asyncio.Task] — one pending task per chat_id. Each new message cancels the previous task and reschedules, so the timer resets on every reply.
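
A minimal sketch of that reschedule pattern, using the documented summarise_session() and an illustrative task body:

import asyncio

_session_tasks: dict[str, asyncio.Task] = {}
QUIET_PERIOD_SECONDS = 30 * 60

def reschedule_session_summary(chat_id: str, storage, settings) -> None:
    # Cancel the pending timer for this chat so the 30-minute
    # countdown restarts on every reply.
    previous = _session_tasks.get(chat_id)
    if previous is not None and not previous.done():
        previous.cancel()

    async def _wait_then_summarise() -> None:
        try:
            await asyncio.sleep(QUIET_PERIOD_SECONDS)
            await summarise_session(storage, settings, lookback_hours=3)
        except asyncio.CancelledError:
            pass  # a newer message reset the timer

    _session_tasks[chat_id] = asyncio.create_task(_wait_then_summarise())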

When the timer fires:

  1. Fetches all interactions from the last 3 hours via get_interactions_since()
  2. If fewer than 2 interactions — returns immediately (nothing to summarise)
  3. Runs mini_model_name on the conversation (capped at 50 turns × 150 chars)
  4. Clusters into 1–3 topic summaries (returns [] for trivial/one-liner sessions)
  5. Upserts each as Note(title="Summary: <topic>") — same format as the nightly reflection

These notes are immediately available to find_relevant_summaries() in message_utils.py, which injects matching summaries as context into the next conversation — no waiting until 2 AM.


Long-Term Memory — The Reflection Engine

The Reflection Engine (app/memory/reflection.py) runs nightly at 2 AM UTC. It produces four outputs from the last 24 hours of interactions:

  1. An updated narrative profile (UserContext) — six sections, ≤550 words
  2. A structured facts list — JSON array of new/changed facts, saved to user_facts
  3. A structured intentions list — JSON array of newly detected personal commitments, saved to pending_intentions
  4. A structured learnings list — JSON array of behavioral corrections, saved to agent_learnings

Reflection cycle

flowchart TD
    A([2 AM UTC trigger]) --> B[Fetch interactions\nsince 24h ago]
    B --> C{Any interactions?}
    C -- No --> D([Skip — log and return])
    C -- Yes --> E[Fetch existing UserContext\nfrom context table]
    E --> EF[Fetch existing user_facts + intentions + learnings\nto avoid re-extraction]
    EF --> F[Build reflection prompt\nexisting profile + facts + intentions + learnings + conversations]
    F --> G[Call LLM — tool-less Agent]
    G --> H{Parse output\n---PROFILE--- / ---FACTS--- / ---INTENTIONS--- / ---LEARNINGS--- markers}
    H --> I[Updated profile markdown\n≤550 words, 6 sections]
    H --> J[New facts JSON array\nonly new or changed facts]
    H --> JI[New intentions JSON array]
    H --> JL[New learnings JSON array]
    I --> K[Save UserContext\nto context table]
    J --> L[Save each UserFact\nto user_facts\nsource='reflection']
    JI --> LI[Save each PendingIntention\nto pending_intentions]
    JL --> LL[Save each AgentLearning\nto agent_learnings]
    K --> M([Sleep 24 hours])
    L --> M
    LI --> M
    LL --> M
    M --> A

The reflection prompt

The LLM is given four inputs:

  1. The existing profile (or "start fresh" if none exists)
  2. The existing facts — so it knows what's already stored and avoids duplicates
  3. The last 24h of conversations — formatted as [timestamp]\nUser: ...\nKwasi: ...
  4. Today's date (for temporal context)

It must produce output in this exact format:

---PROFILE---
## Personal
...
## Goals & Values
...
## Communication Style
...
## Patterns & Rhythms
...
## Preferences
...
## Notes
...

---FACTS---
[{"key": "home_address", "value": "...", "category": "location"}]

---INTENTIONS---
[{"text": "call doctor about knee", "follow_up_days": 3, "context": "Mentioned knee pain was getting worse"}]

---LEARNINGS---
[{"rule": "always confirm before deleting any item", "category": "decision_making"}]

Profile rules: Six sections with word budgets. Specific values (addresses, names) belong in FACTS, not in prose. Maximum 550 words. Return unchanged if nothing new was learned.

Facts rules: JSON array of objects with key, value, category. Only include facts that are new or have changed since the existing facts list. Empty array [] if nothing new.

Intentions rules: JSON array of objects with text and follow_up_days. Only include soft commitments not already in the existing intentions list. Empty array [] if nothing new. Examples: "I should call the dentist", "I want to start running again", "I need to finish that report".

Parsing and fallback

_parse_reflection_output(text) returns a tuple[str, list[dict], list[dict], list[dict]] (profile, facts, intentions, learnings):

  • Splits from the end: ---LEARNINGS--- first, then ---INTENTIONS---, then ---FACTS---, then ---PROFILE---
  • Profile text = everything between ---PROFILE--- and ---FACTS---
  • Facts = JSON parsed from the section between ---FACTS--- and ---INTENTIONS---
  • Intentions = JSON parsed from between ---INTENTIONS--- and ---LEARNINGS---
  • Learnings = JSON parsed from after ---LEARNINGS---
  • If markers are absent: the entire output is treated as the profile and no structured data is extracted (backwards-compatible fallback with a logged warning)
  • Markdown code fences around the JSON arrays are stripped automatically
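
A sketch of that end-first splitting (the real parser also logs a warning on fallback):

import json

def parse_reflection_output(text: str) -> tuple[str, list, list, list]:
    def split_tail(src: str, marker: str) -> tuple[str, str]:
        head, sep, tail = src.rpartition(marker)
        return (head, tail) if sep else (src, "")

    rest, learnings_raw = split_tail(text, "---LEARNINGS---")
    rest, intentions_raw = split_tail(rest, "---INTENTIONS---")
    rest, facts_raw = split_tail(rest, "---FACTS---")
    profile = rest.replace("---PROFILE---", "").strip()

    def load_array(raw: str) -> list:
        # Strip optional ```json fences before parsing.
        raw = raw.strip().strip("`").removeprefix("json").strip()
        return json.loads(raw) if raw else []

    return (profile, load_array(facts_raw),
            load_array(intentions_raw), load_array(learnings_raw))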

Where the profile is used

Every time the agent runs, build_system_prompt() fetches the current UserContext from storage and injects it after the permanent facts section, under a "## Your Memory of This User (narrative)" header that clarifies it should actively shape tone and priority, not just inform. The profile is capped at 4,800 characters (~1,200 tokens) before injection — if it has grown beyond that, it is truncated with a [...profile truncated for context] notice. This is fail-safe: if storage is down or the context table is empty, the agent still runs — just without the profile.

Triggering reflection manually

curl -X POST https://your-app.railway.app/reflect \
  -H "X-Reflection-Secret: your-secret"

The response includes facts_added, facts_updated, intentions_added, and learnings_added — counts of new records saved that cycle.


Semantic Search

Beyond keyword matching, Kwasi can find content by meaning using vector embeddings. When a note, interaction, or saved article is written to storage, an embedding is generated in the background via embed_text() (app/tools/embedding.py) using Gemini gemini-embedding-001 (3072 dimensions). If the embedding API is unavailable, the record is saved without an embedding — keyword search still works.
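
A sketch of that write path, assuming an async embed_text() and illustrative storage helpers:

async def save_note_with_embedding(storage, note) -> None:
    try:
        # gemini-embedding-001 returns a 3072-dimension vector.
        note.embedding = await embed_text(f"{note.title}\n{note.content}")
    except Exception:
        note.embedding = None  # degrade gracefully; keyword search still works
    await storage.save_note(note)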

The semantic_search tool

Available on memory_agent, briefing_agent, and the full agent.

semantic_search(query: str, sources: list[str] | None = None, limit: int = 5)
# sources: "notes" | "interactions" | "read_later" — default: all three

Returns results grouped by source with a similarity percentage. Requires GOOGLE_API_KEY.

Recency weighting

find_relevant_notes(), find_relevant_summaries(), and the semantic context injection layer apply a small recency boost when ranking results. Notes written within the last 90 days receive up to +0.05 added to their cosine similarity score — so a note from last week at 0.78 similarity scores higher than an equivalent note from 18 months ago at 0.78. The boost decays linearly to 0 at 90 days. Notes without a created_at timestamp get no boost and sort by raw similarity only.

This means equally-relevant context from recent conversations surfaces first without entirely suppressing older material.
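
The boost itself is small enough to express directly (a sketch matching the numbers above):

from datetime import datetime, timezone

MAX_BOOST = 0.05
DECAY_DAYS = 90

def recency_boost(created_at: datetime | None) -> float:
    if created_at is None:
        return 0.0  # undated rows rank on raw similarity alone
    age_days = (datetime.now(timezone.utc) - created_at).days
    return MAX_BOOST * max(0.0, 1 - age_days / DECAY_DAYS)

# ranking score = cosine similarity + recency_boost(row.created_at)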

Where semantic search is used

| Scenario | Tool |
| --- | --- |
| "Find anything about my career plans" | semantic_search (all sources) |
| "That conversation where I mentioned burnout" | semantic_search(sources=["interactions"]) as a fallback after search_history |
| "Find that article I saved about AI regulation" | semantic_search(sources=["read_later"]) — primary; no keyword search exists for read_later |
| "Find anything about X" (unknown source) | find_everything — runs keyword + semantic search in parallel |
| Note/task not found by keyword | semantic_search as an automatic fallback |

Backfilling existing data

Rows written before semantic search was added have embedding = NULL. To embed them:

curl -X POST https://your-app.railway.app/embed-backfill \
  -H "X-Reflection-Secret: your-secret"
# Returns: {"notes": 12, "interactions": 847, "read_later": 5}

Rate-limited to ~10 rows/second. For large histories, this may take several minutes — the endpoint processes synchronously.
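
A sketch of that pacing loop, with hypothetical storage helpers standing in for the real queries:

import asyncio

async def backfill_embeddings(storage) -> dict[str, int]:
    counts = {"notes": 0, "interactions": 0, "read_later": 0}
    for table in counts:
        for row in await storage.rows_missing_embedding(table):  # hypothetical helper
            row.embedding = await embed_text(row.searchable_text)
            await storage.update_row(table, row)
            counts[table] += 1
            await asyncio.sleep(0.1)  # pace at ~10 rows/second
    return counts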


Information Priority

When the user asks about themselves, the agent checks in this order:

1. Permanent facts (user_facts table — always in system prompt)
        ↓ not found
2. Reflection profile (UserContext — always in system prompt)
        ↓ not found
3. recall_facts(query) — active search over user_facts
        ↓ not found
4. search_history(query) — keyword search over past conversations
        ↓ not found
5. semantic_search(query) — meaning-based search over notes, interactions, saved articles
        ↓ not found
6. Acknowledge it's not known — offer to remember it now

What Gets Logged

Every interaction (from any handler) is saved via log_interaction():

| Field | Content |
| --- | --- |
| user_message | The text sent to the agent (or the transcript for voice) |
| agent_response | The full response returned |
| tools_used | JSON array of tool call records extracted from the agent result |
| channel | "telegram", "cli", or "whatsapp" |
| user_id | User ID as a string (Telegram user ID, WhatsApp phone number, or None for CLI) |
| created_at | UTC timestamp |

Logging is fire-and-forget — a failure never blocks the response.


Bootstrap: Seeding Initial Facts

After deploying, seed the facts store by telling Kwasi key facts in one message:

"Remember these facts about me: - My home address is 124 Avenue Perretti, Neuilly-sur-Seine - I work at [company] at [address] - My partner is [name] - I prefer public transport"

Kwasi will call remember_fact for each one. From that point, every prompt includes them permanently — no conversation history needed, no reflection cycle required.