
Memory & Reflection

Kwasi has four layers of memory working together: short-term conversation history, a permanent facts store updated in real time, session summaries created after a ~30-minute quiet period, and a long-term narrative profile built nightly by the Reflection Engine.


Memory Architecture

graph TD
    subgraph ShortTerm["Short-Term (per session)"]
        H[Last 10 interactions\nfetched from storage per request]
        H --> MH[message_history\nModelRequest / ModelResponse pairs]
    end

    subgraph PostConv["Post-Conversation (fire-and-forget, bot.py)"]
        PCE[extract_facts_from_exchange\nruns immediately after every message]
        SCS[summarise_session\nruns 30 min after last message in chat]
        PCE --> UF
        SCS --> SN["Summary: topic notes"]
    end

    subgraph PermanentFacts["Permanent Facts (never pruned)"]
        UF[(user_facts table\nkey-value pairs\nupdated_at on each write)]
        WU[User explicit:\n'remember that my home is...']
        WA[Agent proactive:\nuser mentions address mid-conversation]
        WR[Reflection auto-extraction:\nnightly LLM pass extracts new facts]
        WU --> UF
        WA --> UF
        WR --> UF
    end

    subgraph LongTerm["Long-Term Narrative (nightly 2 AM)"]
        I[All interactions in last 24h] --> RS[ReflectionService]
        RS --> LLM[LLM — tool-less agent]
        LLM --> UC[UserContext\nmarkdown profile ≤550 words\n6 sections]
        LLM --> NF[New facts JSON array]
        LLM --> SN2[Conversation cluster notes\nSummary: topic]
        UC --> CT[(context table)]
        NF --> UF
        SN2 --> Notes[(notes table)]
        SN --> Notes
    end

    subgraph SystemPrompt["Every Request — build_system_prompt"]
        UF --> FS["'What I Know About You'\ngrouped by category\nalways present if facts exist"]
        CT --> PS["'Your Memory of This User'\nnarrative profile"]
        Notes --> SRJ[find_relevant_summaries\ninjected as context if similar]
        FS --> Agent[Agent run]
        PS --> Agent
        MH --> Agent
        SRJ --> Agent
    end

Memory timeline

| Tier | Latency | What gets saved |
| --- | --- | --- |
| Post-message | Seconds | Explicit facts stated in the current exchange (user_facts) |
| Post-session | ~30 minutes | Session summary clustered into Summary: <topic> notes (searchable immediately) |
| Nightly | 2 AM UTC | Full profile rewrite, intention extraction, behavioral learnings, full-history clustering |

Short-Term Memory

On every message (from any interface), the handler fetches recent interactions for that user via fetch_message_history() in app/utils/message_utils.py, then passes them to build_message_history(), which applies a token budget before sending them to the model:

recent = await fetch_message_history(
    storage=deps.storage,
    user_id=user_id,
    message=text,
    settings=deps.settings,
)
message_history = build_message_history(recent)  # drops oldest if > 6,000 est. tokens

Two retrieval modes, controlled by ENABLE_SEMANTIC_HISTORY:

  • Chronological (default) — last recent_count + semantic_count interactions ordered newest-first. Same behaviour as before this feature shipped.
  • Semantic (ENABLE_SEMANTIC_HISTORY=true) — last 3 verbatim turns (preserves dialog coherence) plus the top 3 semantically relevant older interactions, with a small recency boost on similarity so equally relevant recent matches surface first. Falls back to chronological on any failure (embed error, search error, hydrate error).

Both modes are wrapped in @observe(name="message_history_retrieval") so the retrieval step appears as a child observation under telegram.turn in Langfuse with metadata {mode, recent_count, semantic_count, semantic_enabled}.

build_message_history() then estimates tokens at ~4 chars/token and drops the oldest interactions first when the total exceeds TOKEN_HISTORY_BUDGET = 6000. The resulting list is converted into Pydantic AI ModelRequest / ModelResponse pairs and passed as message_history to agent.run() or agent.run_stream().

Why a token budget instead of a fixed count? A hard limit of 10 interactions behaves inconsistently — 10 one-line exchanges cost far fewer tokens than 10 multi-paragraph exchanges. The budget-based approach keeps context window usage predictable regardless of message length.
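
A minimal sketch of that trimming step, assuming interactions arrive newest-first and expose user_message / agent_response strings (the helper names here are illustrative, not the real build_message_history() internals):

TOKEN_HISTORY_BUDGET = 6000

def estimate_tokens(text: str) -> int:
    # The heuristic used throughout: roughly 4 characters per token.
    return len(text) // 4

def trim_to_budget(interactions: list[dict]) -> list[dict]:
    # Keep the newest interactions until the budget is exhausted,
    # which drops the oldest ones first.
    kept: list[dict] = []
    total = 0
    for item in interactions:  # newest-first
        cost = estimate_tokens(item["user_message"]) + estimate_tokens(item["agent_response"])
        if total + cost > TOKEN_HISTORY_BUDGET:
            break
        kept.append(item)
        total += cost
    return kept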

Tunable settings (all in app/config.py):

  • ENABLE_SEMANTIC_HISTORY (bool, default false)
  • SEMANTIC_HISTORY_RECENT_COUNT (default 3)
  • SEMANTIC_HISTORY_SEMANTIC_COUNT (default 3)
  • SEMANTIC_HISTORY_THRESHOLD (default 0.6)
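
For orientation, these could be declared with pydantic-settings roughly as below (a sketch only; the actual layout of app/config.py may differ):

from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    # Env vars map case-insensitively: ENABLE_SEMANTIC_HISTORY, etc.
    enable_semantic_history: bool = False
    semantic_history_recent_count: int = 3
    semantic_history_semantic_count: int = 3
    semantic_history_threshold: float = 0.6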


Semantic Context Injection

Before every agent run, up to three retrieval layers prepend relevant context to the user message. All three layers share a single 1,000-token budget (CONTEXT_TOKEN_BUDGET) — they are filled in priority order until the budget is exhausted.

flowchart TD
    A[User message arrives] --> B{GOOGLE_API_KEY set?}
    B -- No --> Z[Send raw message to agent]
    B -- Yes --> C[budget = 1,000 tokens]
    C --> D[Layer 1: find_relevant_notes\n≥0.6 similarity · top 2 · +recency boost\nexcludes Summary and Research prefixes]
    D --> E{Notes found\nand budget > 0?}
    E -- No --> G
    E -- Yes --> F[Append XML block\nbudget -= note_tokens]
    F --> G[Layer 2: find_relevant_summaries\n≥0.6 similarity · top 2 · +recency boost\nmatches Summary: prefix only]
    G --> H{Summaries found\nand budget > 0?}
    H -- No --> J
    H -- Yes --> I[Append XML block\nbudget -= summary_tokens]
    I --> J[Layer 3: find_relevant_read_later\ntag-overlap matching · up to 3 items]
    J --> K{Items found\nand budget > 0?}
    K -- No --> M
    K -- Yes --> L[Append XML block]
    L --> M[Prepend datetime: local time in user timezone]
    M --> Z
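
A minimal sketch of the shared-budget fill, assuming each layer is an async callable that returns a ready-made XML block or an empty string (signatures are illustrative):

from typing import Awaitable, Callable

CONTEXT_TOKEN_BUDGET = 1000

# (message, user_id) -> XML context block, or "" when nothing matches
Layer = Callable[[str, str], Awaitable[str]]

async def build_context(message: str, user_id: str, layers: list[Layer]) -> str:
    # Layers are tried in priority order (notes, summaries, read-later);
    # each appended block eats into the single shared budget.
    budget = CONTEXT_TOKEN_BUDGET
    blocks: list[str] = []
    for layer in layers:
        if budget <= 0:
            break
        block = await layer(message, user_id)
        if block:
            blocks.append(block)
            budget -= len(block) // 4  # same ~4 chars/token estimate
    return "\n".join(blocks)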

Each layer wraps its output in XML tags so the model can distinguish retrieved memory from instructions:

<context type="notes">
- Note title: first 200 chars of content
Build on these — don't repeat them verbatim.
</context>

<context type="summaries">
- Topic name: first 200 chars of summary
Use for continuity — don't reference it explicitly.
</context>

<context type="read_later">
- "Article title" (https://...)
  Summary: ...
  Your note: ...
Mention these and include the URL.
</context>

Datetime in the user turn: the current local time is prepended as [Thursday, April 24, 2026 — 09:15 AM Europe/Paris] to the user turn (not the system prompt). This keeps the ~3,000–5,000 token system prompt identical across requests, qualifying it for Gemini's implicit prompt cache. A changing timestamp in the system prompt would bust the cache on every request.
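
A sketch of that prefix, assuming the user's IANA timezone name is available:

from datetime import datetime
from zoneinfo import ZoneInfo

def datetime_prefix(tz: str) -> str:
    # Produces e.g. [Thursday, April 24, 2026 — 09:15 AM Europe/Paris]
    now = datetime.now(ZoneInfo(tz))
    return now.strftime(f"[%A, %B %d, %Y — %I:%M %p {tz}]")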

Logfire span attributes: each Telegram message span records est_tokens_history, est_tokens_context, est_tokens_user_turn, context_layers, and history_interactions so token distribution is observable in Logfire per request.


Permanent User Facts

The user_facts table stores specific, verifiable facts about the user — the kind that gets pruned from, or never makes it into, a 550-word narrative profile: addresses, phone numbers, names, dietary restrictions, and similar.

How facts get written

| Path | Trigger | source value | Latency |
| --- | --- | --- | --- |
| Explicit | User says "remember that my home is X" | "agent" | Immediate |
| Proactive | Agent calls remember_fact mid-conversation | "agent" | Immediate |
| Post-conversation | extract_facts_from_exchange() fires after every message; a mini-model extracts explicitly stated facts, pre-checks the existing value, skips if unchanged, and upserts if new or different | "agent" | Seconds |
| Auto-extraction | Nightly reflection extracts new facts from the last 24h of conversations | "reflection" | 2 AM UTC |

Agent tools

All three tools are registered on every domain agent (via _UTILITY_FNS) — not just the memory agent. The agent may learn personal facts during any conversation, regardless of domain.

| Tool | What it does |
| --- | --- |
| remember_fact(key, value, category) | Upsert a fact by key. Overwrites the previous value if the key already exists. |
| recall_facts(query="") | List all facts (empty query) or search keys + values. Returns facts grouped by category. |
| forget_fact(key) | Delete a fact by key. |

Key naming: snake_case, descriptive. Examples: home_address, workplace, partner_name, preferred_transport, dietary_restrictions, manager_name, morning_routine, birthday.

Categories: location, personal, preference, work, health, general.
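
As a sketch of the registration pattern: Pydantic AI accepts plain functions as tools at construction time. The tool bodies and model string below are illustrative:

from pydantic_ai import Agent

def remember_fact(key: str, value: str, category: str = "general") -> str:
    """Upsert a fact by key (body elided for illustration)."""
    ...

def recall_facts(query: str = "") -> str:
    """List or search stored facts (body elided)."""
    ...

def forget_fact(key: str) -> str:
    """Delete a fact by key (body elided)."""
    ...

_UTILITY_FNS = [remember_fact, recall_facts, forget_fact]

# Every domain agent gets the same utility tools, not just the memory agent.
memory_agent = Agent("google-gla:gemini-2.5-flash", tools=_UTILITY_FNS)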

How facts get read

build_system_prompt() fetches all user facts on every request and injects them as a structured section before the reflection profile:

## What I Know About You
Verified permanent facts — treat these as ground truth.

*Location*
- Home Address: 124 Avenue Perretti, Neuilly-sur-Seine
- Workplace: La Défense, Paris

*Personal*
- Partner Name: Ana

The system prompt _MEMORY_INSTRUCTIONS instructs the agent:

  • Check permanent facts before saying you don't know something about the user
  • Save personal information proactively without asking permission
  • Priority order: permanent facts → reflection profile → recall_facts search → search_history
  • Never say "I don't have your address" without checking all four sources


Post-Conversation Memory Pipeline

app/memory/post_conversation.py runs two background functions after every handle_message call in bot.py via asyncio.create_task(). Both are fire-and-forget — they never raise and never block the response to the user.
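
A minimal sketch of that fire-and-forget contract (names are illustrative):

import asyncio
import logging

logger = logging.getLogger(__name__)

async def _shielded(coro) -> None:
    try:
        await coro
    except Exception:
        # Background memory work must never break the reply path.
        logger.exception("post-conversation task failed")

def fire_and_forget(coro) -> asyncio.Task:
    # asyncio.create_task() returns immediately; the caller keeps a
    # reference so the task is not garbage-collected mid-flight.
    return asyncio.create_task(_shielded(coro))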

Immediate fact extraction

extract_facts_from_exchange(user_message, agent_response, storage, settings) fires immediately after every Telegram message. It sends the last exchange (user + agent, capped at 1000 chars each) to mini_model_name with a prompt that extracts only explicitly stated facts:

"By the way, I moved to Berlin last month."
→ [{"key": "home_city", "value": "Berlin", "category": "location"}]

"Hi, how's it going?"
→ []

Before upserting each fact:

  1. get_user_fact(key) is called to check the existing value
  2. If the value is identical — skip (no write, no re-embedding)
  3. If the value is different or absent — save_user_fact() upserts it

This means a fact like "I moved to Munich" correctly overwrites an existing home_city = "Berlin" within seconds of being stated, rather than at 2 AM.
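
A sketch of the pre-check, assuming async storage helpers with these approximate signatures:

async def upsert_if_changed(storage, fact: dict) -> bool:
    existing = await storage.get_user_fact(fact["key"])
    if existing is not None and existing.value == fact["value"]:
        return False  # identical value: skip the write and the re-embedding
    await storage.save_user_fact(
        key=fact["key"],
        value=fact["value"],
        category=fact["category"],
        source="agent",
    )
    return True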

Session-close summarisation

summarise_session(storage, settings, lookback_hours=3) fires after a 30-minute quiet period per chat. bot.py maintains a module-level _session_tasks: dict[str, asyncio.Task] — one pending task per chat_id. Each new message cancels the previous task and reschedules, so the timer resets on every reply.
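
A minimal sketch of that reschedule pattern, using the documented summarise_session() and an illustrative task body:

import asyncio

_session_tasks: dict[str, asyncio.Task] = {}
QUIET_PERIOD_SECONDS = 30 * 60

def reschedule_session_summary(chat_id: str, storage, settings) -> None:
    # Cancel the pending timer for this chat so the 30-minute
    # countdown restarts on every reply.
    previous = _session_tasks.get(chat_id)
    if previous is not None and not previous.done():
        previous.cancel()

    async def _wait_then_summarise() -> None:
        try:
            await asyncio.sleep(QUIET_PERIOD_SECONDS)
            await summarise_session(storage, settings, lookback_hours=3)
        except asyncio.CancelledError:
            pass  # a newer message reset the timer

    _session_tasks[chat_id] = asyncio.create_task(_wait_then_summarise())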

When the timer fires:

  1. Fetches all interactions from the last 3 hours via get_interactions_since()
  2. If fewer than 2 interactions — returns immediately (nothing to summarise)
  3. Runs mini_model_name on the conversation (capped at 50 turns × 150 chars)
  4. Clusters into 1–3 topic summaries (returns [] for trivial/one-liner sessions)
  5. Upserts each as Note(title="Summary: <topic>") — same format as the nightly reflection

These notes are immediately available to find_relevant_summaries() in message_utils.py, which injects matching summaries as context into the next conversation — no waiting until 2 AM.


Long-Term Memory — The Reflection Engine

The Reflection Engine (app/memory/reflection.py) runs nightly at 2 AM UTC. It produces four outputs from the last 24 hours of interactions:

  1. An updated narrative profile (UserContext) — six sections, ≤550 words
  2. A structured facts list — JSON array of new/changed facts, saved to user_facts
  3. A structured intentions list — JSON array of newly detected personal commitments, saved to pending_intentions
  4. A structured learnings list — JSON array of behavioral corrections, saved to agent_learnings

Reflection cycle

flowchart TD
    A([2 AM UTC trigger]) --> B[Fetch interactions\nsince 24h ago]
    B --> C{Any interactions?}
    C -- No --> D([Skip — log and return])
    C -- Yes --> E[Fetch existing UserContext\nfrom context table]
    E --> EF[Fetch existing user_facts + intentions + learnings\nto avoid re-extraction]
    EF --> F[Build reflection prompt\nexisting profile + facts + intentions + learnings + conversations]
    F --> G[Call LLM — tool-less Agent]
    G --> H{Parse output\n---PROFILE--- / ---FACTS--- / ---INTENTIONS--- / ---LEARNINGS--- markers}
    H --> I[Updated profile markdown\n≤550 words, 6 sections]
    H --> J[New facts JSON array\nonly new or changed facts]
    H --> JI[New intentions JSON array]
    H --> JL[New learnings JSON array]
    I --> K[Save UserContext\nto context table]
    J --> L[Save each UserFact\nto user_facts\nsource='reflection']
    JI --> LI[Save each PendingIntention\nto pending_intentions]
    JL --> LL[Save each AgentLearning\nto agent_learnings]
    K --> M([Sleep 24 hours])
    L --> M
    LI --> M
    LL --> M
    M --> A

The reflection prompt

The LLM is given four inputs:

  1. The existing profile (or "start fresh" if none exists)
  2. The existing facts — so it knows what's already stored and avoids duplicates
  3. The last 24h of conversations — formatted as [timestamp]\nUser: ...\nKwasi: ...
  4. Today's date (for temporal context)

It must produce output in this exact format:

---PROFILE---
## Personal
...
## Goals & Values
...
## Communication Style
...
## Patterns & Rhythms
...
## Preferences
...
## Notes
...

---FACTS---
[{"key": "home_address", "value": "...", "category": "location"}]

---INTENTIONS---
[{"text": "call doctor about knee", "follow_up_days": 3, "context": "Mentioned knee pain was getting worse"}]

---LEARNINGS---
[{"rule": "always confirm before deleting any item", "category": "decision_making"}]

Profile rules: Six sections with word budgets. Specific values (addresses, names) belong in FACTS, not in prose. Maximum 550 words. Return unchanged if nothing new was learned.

Facts rules: JSON array of objects with key, value, category. Only include facts that are new or have changed since the existing facts list. Empty array [] if nothing new.

Intentions rules: JSON array of objects with text and follow_up_days. Only include soft commitments not already in the existing intentions list. Empty array [] if nothing new. Examples: "I should call the dentist", "I want to start running again", "I need to finish that report".

Parsing and fallback

_parse_reflection_output(text) returns a tuple[str, list[dict], list[dict], list[dict]] (profile, facts, intentions, learnings):

  • Splits from the end: ---LEARNINGS--- first, then ---INTENTIONS---, then ---FACTS---, then ---PROFILE---
  • Profile text = everything between ---PROFILE--- and ---FACTS---
  • Facts = JSON parsed from the section between ---FACTS--- and ---INTENTIONS---
  • Intentions = JSON parsed from between ---INTENTIONS--- and ---LEARNINGS---
  • Learnings = JSON parsed from after ---LEARNINGS---
  • If markers are absent: the entire output is treated as the profile and no structured data is extracted (backwards-compatible fallback with a logged warning)
  • Markdown code fences around the JSON arrays are stripped automatically
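
A sketch of that end-first splitting (the real parser also logs a warning on fallback):

import json

def parse_reflection_output(text: str) -> tuple[str, list, list, list]:
    def split_tail(src: str, marker: str) -> tuple[str, str]:
        head, sep, tail = src.rpartition(marker)
        return (head, tail) if sep else (src, "")

    rest, learnings_raw = split_tail(text, "---LEARNINGS---")
    rest, intentions_raw = split_tail(rest, "---INTENTIONS---")
    rest, facts_raw = split_tail(rest, "---FACTS---")
    profile = rest.replace("---PROFILE---", "").strip()

    def load_array(raw: str) -> list:
        # Strip optional ```json fences before parsing.
        raw = raw.strip().strip("`").removeprefix("json").strip()
        return json.loads(raw) if raw else []

    return (profile, load_array(facts_raw),
            load_array(intentions_raw), load_array(learnings_raw))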

Where the profile is used

Every time the agent runs, build_system_prompt() fetches the current UserContext from storage and injects it after the permanent facts section, under a "## Your Memory of This User (narrative)" header that clarifies it should actively shape tone and priority, not just inform. The profile is capped at 4,800 characters (~1,200 tokens) before injection — if it has grown beyond that, it is truncated with a [...profile truncated for context] notice. This is fail-safe: if storage is down or the context table is empty, the agent still runs — just without the profile.

Triggering reflection manually

curl -X POST https://your-app.railway.app/reflect \
  -H "X-Reflection-Secret: your-secret"

The response includes facts_added, facts_updated, intentions_added, and learnings_added — counts of new records saved that cycle.


Semantic Search

Beyond keyword matching, Kwasi can find content by meaning using vector embeddings. When a note, interaction, or saved article is written to storage, an embedding is generated in the background via embed_text() (app/tools/embedding.py) using Gemini gemini-embedding-001 (3072 dimensions). If the embedding API is unavailable, the record is saved without an embedding — keyword search still works.
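
A sketch of that write path, assuming an async embed_text() and illustrative storage helpers:

async def save_note_with_embedding(storage, note) -> None:
    try:
        # gemini-embedding-001 returns a 3072-dimension vector.
        note.embedding = await embed_text(f"{note.title}\n{note.content}")
    except Exception:
        note.embedding = None  # degrade gracefully; keyword search still works
    await storage.save_note(note)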

The semantic_search tool

Available on memory_agent, briefing_agent, and the full agent.

semantic_search(query: str, sources: list[str] | None = None, limit: int = 5)
# sources: "notes" | "interactions" | "read_later" — default: all three

Returns results grouped by source with a similarity percentage. Requires GOOGLE_API_KEY.

Recency weighting

find_relevant_notes(), find_relevant_summaries(), and the semantic context injection layer apply a small recency boost when ranking results. Notes written within the last 90 days receive up to +0.05 added to their cosine similarity score — so a note from last week at 0.78 similarity scores higher than an equivalent note from 18 months ago at 0.78. The boost decays linearly to 0 at 90 days. Notes without a created_at timestamp get no boost and sort by raw similarity only.

This means equally-relevant context from recent conversations surfaces first without entirely suppressing older material.
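
The boost itself is small enough to express directly (a sketch matching the numbers above):

from datetime import datetime, timezone

MAX_BOOST = 0.05
DECAY_DAYS = 90

def recency_boost(created_at: datetime | None) -> float:
    if created_at is None:
        return 0.0  # undated rows rank on raw similarity alone
    age_days = (datetime.now(timezone.utc) - created_at).days
    return MAX_BOOST * max(0.0, 1 - age_days / DECAY_DAYS)

# ranking score = cosine similarity + recency_boost(row.created_at)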

Where semantic search is used

| Scenario | Tool |
| --- | --- |
| "Find anything about my career plans" | semantic_search (all sources) |
| "That conversation where I mentioned burnout" | semantic_search(sources=["interactions"]) as a fallback after search_history |
| "Find that article I saved about AI regulation" | semantic_search(sources=["read_later"]) — primary; no keyword search exists for read_later |
| "Find anything about X" (unknown source) | find_everything — runs keyword + semantic search in parallel |
| Note/task not found by keyword | semantic_search as an automatic fallback |

Backfilling existing data

Rows written before semantic search was added have embedding = NULL. To embed them:

curl -X POST https://your-app.railway.app/embed-backfill \
  -H "X-Reflection-Secret: your-secret"
# Returns: {"notes": 12, "interactions": 847, "read_later": 5}

Rate-limited to ~10 rows/second. For large histories, this may take several minutes — the endpoint processes synchronously.
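
A sketch of that pacing loop, with hypothetical storage helpers standing in for the real queries:

import asyncio

async def backfill_embeddings(storage) -> dict[str, int]:
    counts = {"notes": 0, "interactions": 0, "read_later": 0}
    for table in counts:
        for row in await storage.rows_missing_embedding(table):  # hypothetical helper
            row.embedding = await embed_text(row.searchable_text)
            await storage.update_row(table, row)
            counts[table] += 1
            await asyncio.sleep(0.1)  # pace at ~10 rows/second
    return counts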


Information Priority

When the user asks about themselves, the agent checks in this order:

1. Permanent facts (user_facts table — always in system prompt)
        ↓ not found
2. Reflection profile (UserContext — always in system prompt)
        ↓ not found
3. recall_facts(query) — active search over user_facts
        ↓ not found
4. search_history(query) — keyword search over past conversations
        ↓ not found
5. semantic_search(query) — meaning-based search over notes, interactions, saved articles
        ↓ not found
6. Acknowledge it's not known — offer to remember it now

What Gets Logged

Every interaction (from any handler) is saved via log_interaction():

| Field | Content |
| --- | --- |
| user_message | The text sent to the agent (or the transcript for voice) |
| agent_response | The full response returned |
| tools_used | JSON array of tool call records extracted from the agent result |
| channel | "telegram", "cli", or "whatsapp" |
| user_id | User ID as a string (Telegram user ID, WhatsApp phone number, or None for CLI) |
| created_at | UTC timestamp |

Logging is fire-and-forget — a failure never blocks the response.


Bootstrap: Seeding Initial Facts

After deploying, seed the facts store by telling Kwasi key facts in one message:

"Remember these facts about me: - My home address is 124 Avenue Perretti, Neuilly-sur-Seine - I work at [company] at [address] - My partner is [name] - I prefer public transport"

Kwasi will call remember_fact for each one. From that point, every prompt includes them permanently — no conversation history needed, no reflection cycle required.