Memory & Reflection¶
Kwasi has four layers of memory working together: short-term conversation history, a permanent facts store updated in real time, session summaries created within ~30 minutes of a conversation ending, and a long-term narrative profile rebuilt nightly by the Reflection Engine.
Memory Architecture¶
```mermaid
graph TD
subgraph ShortTerm["Short-Term (per session)"]
H[Last 10 interactions\nfetched from storage per request]
H --> MH[message_history\nModelRequest / ModelResponse pairs]
end
subgraph PostConv["Post-Conversation (fire-and-forget, bot.py)"]
PCE[extract_facts_from_exchange\nruns immediately after every message]
SCS[summarise_session\nruns 30 min after last message in chat]
PCE --> UF
SCS --> SN["Summary: topic notes"]
end
subgraph PermanentFacts["Permanent Facts (never pruned)"]
UF[(user_facts table\nkey-value pairs\nupdated_at on each write)]
WU[User explicit:\n'remember that my home is...']
WA[Agent proactive:\nuser mentions address mid-conversation]
WR[Reflection auto-extraction:\nnightly LLM pass extracts new facts]
WU --> UF
WA --> UF
WR --> UF
end
subgraph LongTerm["Long-Term Narrative (nightly 2 AM)"]
I[All interactions in last 24h] --> RS[ReflectionService]
RS --> LLM[LLM — tool-less agent]
LLM --> UC[UserContext\nmarkdown profile ≤550 words\n6 sections]
LLM --> NF[New facts JSON array]
LLM --> SN2[Conversation cluster notes\nSummary: topic]
UC --> CT[(context table)]
NF --> UF
SN2 --> Notes[(notes table)]
SN --> Notes
end
subgraph SystemPrompt["Every Request — build_system_prompt"]
UF --> FS["'What I Know About You'\ngrouped by category\nalways present if facts exist"]
CT --> PS["'Your Memory of This User'\nnarrative profile"]
Notes --> SRJ[find_relevant_summaries\ninjected as context if similar]
FS --> Agent[Agent run]
PS --> Agent
MH --> Agent
SRJ --> Agent
end
```
Memory timeline¶
| Tier | Latency | What gets saved |
|---|---|---|
| Post-message | Seconds | Explicit facts stated in the current exchange (user_facts) |
| Post-session | ~30 minutes | Session summary clustered into Summary: <topic> notes (searchable immediately) |
| Nightly | 2 AM UTC | Full profile rewrite, intention extraction, behavioral learnings, full-history clustering |
Short-Term Memory¶
On every message (from any interface), the handler fetches recent interactions for that user via fetch_message_history() in app/utils/message_utils.py, then passes them to build_message_history(), which applies a token budget before sending them to the model:
```python
recent = await fetch_message_history(
storage=deps.storage,
user_id=user_id,
message=text,
settings=deps.settings,
)
message_history = build_message_history(recent)  # drops oldest if > 6,000 est. tokens
```
Two retrieval modes, controlled by ENABLE_SEMANTIC_HISTORY:
- Chronological (default) — the last `recent_count + semantic_count` interactions, ordered newest-first. Same behaviour as before this feature shipped.
- Semantic (`ENABLE_SEMANTIC_HISTORY=true`) — the last 3 verbatim turns (preserving dialog coherence) plus the top-3 semantically relevant older interactions, with a small recency boost on similarity so equally relevant recent matches surface first. Falls back to chronological on any failure (embed error, search error, hydrate error).
Both modes are wrapped in @observe(name="message_history_retrieval") so the retrieval step appears as a child observation under telegram.turn in Langfuse with metadata {mode, recent_count, semantic_count, semantic_enabled}.
build_message_history() then estimates tokens at ~4 chars/token and drops the oldest interactions first when the total exceeds TOKEN_HISTORY_BUDGET = 6000. The resulting list is converted into Pydantic AI ModelRequest / ModelResponse pairs and passed as message_history to agent.run() or agent.run_stream().
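As a rough sketch (not the actual app/utils/message_utils.py code), the trimming loop might look like this; the interaction attribute names are assumptions borrowed from the logging table later in this page:

```python
# Hedged sketch of budget-based history trimming.
TOKEN_HISTORY_BUDGET = 6000

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # ~4 characters per token heuristic

def trim_to_budget(interactions: list) -> list:
    """Walk newest-to-oldest, keeping interactions until the budget is hit."""
    kept, total = [], 0
    for item in reversed(interactions):  # newest first
        cost = estimate_tokens(item.user_message) + estimate_tokens(item.agent_response)
        if total + cost > TOKEN_HISTORY_BUDGET:
            break  # everything older than this point is dropped
        kept.append(item)
        total += cost
    return list(reversed(kept))  # restore chronological order
```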
Why a token budget instead of a fixed count? A hard limit of 10 interactions behaves inconsistently — 10 one-line exchanges cost far fewer tokens than 10 multi-paragraph exchanges. The budget-based approach keeps context window usage predictable regardless of message length.
Tunable settings (all in app/config.py):
- ENABLE_SEMANTIC_HISTORY (bool, default false)
- SEMANTIC_HISTORY_RECENT_COUNT (default 3)
- SEMANTIC_HISTORY_SEMANTIC_COUNT (default 3)
- SEMANTIC_HISTORY_THRESHOLD (default 0.6)
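A hedged sketch of the semantic mode's merge-and-fallback behaviour follows; the storage methods `recent_interactions` and `similar_interactions` are illustrative names, not the real API:

```python
from app.tools.embedding import embed_text  # module named in this page

async def fetch_semantic_history(storage, user_id: str, message: str, settings):
    """Recent verbatim turns plus semantically similar older interactions."""
    try:
        query_vec = await embed_text(message)
        recent = await storage.recent_interactions(
            user_id, limit=settings.semantic_history_recent_count)
        older = await storage.similar_interactions(
            user_id, query_vec,
            limit=settings.semantic_history_semantic_count,
            threshold=settings.semantic_history_threshold,
            exclude_ids=[i.id for i in recent])
        return older + recent  # relevant history first, verbatim turns last
    except Exception:
        # Any failure (embed, search, hydrate) degrades to chronological.
        return await storage.recent_interactions(user_id, limit=6)  # recent + semantic counts
```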
Semantic Context Injection¶
Before every agent run, up to three retrieval layers prepend relevant context to the user message. All three layers share a single 1,000-token budget (CONTEXT_TOKEN_BUDGET) — they are filled in priority order until the budget is exhausted.
```mermaid
flowchart TD
A[User message arrives] --> B{GOOGLE_API_KEY set?}
B -- No --> Z[Send raw message to agent]
B -- Yes --> C[budget = 1,000 tokens]
C --> D[Layer 1: find_relevant_notes\n≥0.6 similarity · top 2 · +recency boost\nexcludes Summary and Research prefixes]
D --> E{Notes found\nand budget > 0?}
E -- No --> G
E -- Yes --> F[Append XML block\nbudget -= note_tokens]
F --> G[Layer 2: find_relevant_summaries\n≥0.6 similarity · top 2 · +recency boost\nmatches Summary: prefix only]
G --> H{Summaries found\nand budget > 0?}
H -- No --> J
H -- Yes --> I[Append XML block\nbudget -= summary_tokens]
I --> J[Layer 3: find_relevant_read_later\ntag-overlap matching · up to 3 items]
J --> K{Items found\nand budget > 0?}
K -- No --> M
K -- Yes --> L[Append XML block]
L --> M[Prepend datetime: local time in user timezone]
M --> Z
```
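In code, the priority-ordered budget fill might look roughly like this; the `find_relevant_*` signatures are simplified assumptions:

```python
CONTEXT_TOKEN_BUDGET = 1000  # shared across all three layers

def estimate_tokens(text: str) -> int:
    return len(text) // 4

async def build_context_blocks(user_id: str, message: str, deps) -> str:
    """Fill layers in priority order until the budget is exhausted."""
    budget = CONTEXT_TOKEN_BUDGET
    blocks: list[str] = []
    layers = [
        ("notes", find_relevant_notes),
        ("summaries", find_relevant_summaries),
        ("read_later", find_relevant_read_later),
    ]
    for tag, finder in layers:
        if budget <= 0:
            break
        body = await finder(user_id, message, deps)  # "" when nothing matches
        cost = estimate_tokens(body)
        if body and cost <= budget:
            blocks.append(f'<context type="{tag}">\n{body}\n</context>')
            budget -= cost
    return "\n".join(blocks)
```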
Each layer wraps its output in XML tags so the model can distinguish retrieved memory from instructions:
```xml
<context type="notes">
- Note title: first 200 chars of content
Build on these — don't repeat them verbatim.
</context>
<context type="summaries">
- Topic name: first 200 chars of summary
Use for continuity — don't reference it explicitly.
</context>
<context type="read_later">
- "Article title" (https://...)
Summary: ...
Your note: ...
Mention these and include the URL.
</context>
```
Datetime in the user turn: the current local time is prepended as [Thursday, April 24, 2026 — 09:15 AM Europe/Paris] to the user turn (not the system prompt). This keeps the ~3,000–5,000 token system prompt identical across requests, qualifying it for Gemini's implicit prompt cache. A changing timestamp in the system prompt would bust the cache on every request.
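For illustration, the prefix can be produced with the standard library alone, assuming the format shown above:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def datetime_prefix(tz_name: str) -> str:
    """Render the local-time prefix prepended to the user turn."""
    now = datetime.now(ZoneInfo(tz_name))
    return now.strftime(f"[%A, %B %d, %Y — %I:%M %p {tz_name}]")

# datetime_prefix("Europe/Paris")
# → "[Thursday, April 24, 2026 — 09:15 AM Europe/Paris]"
```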
Logfire span attributes: each Telegram message span records est_tokens_history, est_tokens_context, est_tokens_user_turn, context_layers, and history_interactions so token distribution is observable in Logfire per request.
Permanent User Facts¶
The user_facts table stores specific, verifiable facts about the user — the kind of details that get pruned from, or never make it into, a 550-word narrative profile: addresses, phone numbers, names, dietary restrictions, and similar.
How facts get written¶
| Path | Trigger | `source` value | Latency |
|---|---|---|---|
| Explicit | User says "remember that my home is X" | `"agent"` | Immediate |
| Proactive | Agent calls `remember_fact` mid-conversation | `"agent"` | Immediate |
| Post-conversation | `extract_facts_from_exchange()` fires after every message; a mini-model extracts explicitly stated facts. Pre-checks the existing value — skips if unchanged, upserts if new or different. | `"agent"` | Seconds |
| Auto-extraction | Nightly reflection extracts new facts from the last 24h of conversations | `"reflection"` | 2 AM UTC |
Agent tools¶
All three tools are registered on every domain agent (via _UTILITY_FNS) — not just the memory agent. The agent may learn personal facts during any conversation, regardless of domain.
| Tool | What it does |
|---|---|
| `remember_fact(key, value, category)` | Upsert a fact by key. Overwrites the previous value if the key already exists. |
| `recall_facts(query="")` | List all facts (empty query) or search keys + values. Returns facts grouped by category. |
| `forget_fact(key)` | Delete a fact by key. |
Key naming: snake_case, descriptive. Examples: home_address, workplace, partner_name, preferred_transport, dietary_restrictions, manager_name, morning_routine, birthday.
Categories: location, personal, preference, work, health, general.
How facts get read¶
build_system_prompt() fetches all user facts on every request and injects them as a structured section before the reflection profile:
```markdown
## What I Know About You
Verified permanent facts — treat these as ground truth.
*Location*
- Home Address: 124 Avenue Perretti, Neuilly-sur-Seine
- Workplace: La Défense, Paris
*Personal*
- Partner Name: Ana
```
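A minimal sketch of how that section could be assembled from user_facts rows (the fact attribute names are assumptions):

```python
from collections import defaultdict

def render_facts_section(facts: list) -> str:
    """Group facts by category and render the block shown above."""
    grouped: dict[str, list] = defaultdict(list)
    for fact in facts:
        grouped[fact.category].append(fact)
    lines = [
        "## What I Know About You",
        "Verified permanent facts — treat these as ground truth.",
    ]
    for category, items in sorted(grouped.items()):
        lines.append(f"*{category.title()}*")
        for f in items:
            lines.append(f"- {f.key.replace('_', ' ').title()}: {f.value}")
    return "\n".join(lines)
```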
The system prompt _MEMORY_INSTRUCTIONS instructs the agent:
- Check permanent facts before saying you don't know something about the user
- Save personal information proactively without asking permission
- Priority order: permanent facts → reflection profile → recall_facts search → search_history
- Never say "I don't have your address" without checking all four sources
Post-Conversation Memory Pipeline¶
app/memory/post_conversation.py runs two background functions after every handle_message call in bot.py via asyncio.create_task(). Both are fire-and-forget — they never raise and never block the response to the user.
Immediate fact extraction¶
extract_facts_from_exchange(user_message, agent_response, storage, settings) fires immediately after every Telegram message. It sends the last exchange (user + agent, capped at 1000 chars each) to mini_model_name with a prompt that extracts only explicitly stated facts:
"By the way, I moved to Berlin last month."
→ [{"key": "home_city", "value": "Berlin", "category": "location"}]
"Hi, how's it going?"
→ []
Before upserting each fact:
1. get_user_fact(key) is called to check the existing value
2. If value is identical — skip (no write, no re-embedding)
3. If value is different or absent — save_user_fact() upserts it
This means a fact like "I moved to Munich" correctly overwrites an existing home_city = "Berlin" within seconds of being stated, rather than at 2 AM.
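An illustrative version of that check-then-upsert flow; whether `get_user_fact` and `save_user_fact` live on the storage object is an assumption (the prose names the functions but not their exact home):

```python
async def upsert_extracted_fact(storage, key: str, value: str, category: str) -> None:
    """Skip identical values; upsert new or changed ones."""
    existing = await storage.get_user_fact(key)
    if existing is not None and existing.value == value:
        return  # identical value: no write, no re-embedding
    await storage.save_user_fact(
        key=key, value=value, category=category, source="agent")
```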
Session-close summarisation¶
summarise_session(storage, settings, lookback_hours=3) fires after a 30-minute quiet period per chat. bot.py maintains a module-level _session_tasks: dict[str, asyncio.Task] — one pending task per chat_id. Each new message cancels the previous task and reschedules, so the timer resets on every reply.
When the timer fires:
1. Fetches all interactions from the last 3 hours via get_interactions_since()
2. If fewer than 2 interactions — returns immediately (nothing to summarise)
3. Runs mini_model_name on the conversation (capped at 50 turns × 150 chars)
4. Clusters into 1–3 topic summaries (returns [] for trivial/one-liner sessions)
5. Upserts each as Note(title="Summary: <topic>") — same format as the nightly reflection
These notes are immediately available to find_relevant_summaries() in message_utils.py, which injects matching summaries as context into the next conversation — no waiting until 2 AM.
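The per-chat debounce can be pictured like this: a hedged sketch of the cancel-and-reschedule pattern, not the exact bot.py code:

```python
import asyncio

from app.memory.post_conversation import summarise_session  # module named above

_session_tasks: dict[str, asyncio.Task] = {}  # one pending task per chat_id

def schedule_session_summary(chat_id: str, storage, settings) -> None:
    """Reset the 30-minute quiet-period timer on every new message."""
    if (previous := _session_tasks.get(chat_id)) is not None:
        previous.cancel()  # a new message supersedes the old timer

    async def _wait_then_summarise() -> None:
        try:
            await asyncio.sleep(30 * 60)  # the quiet period
            await summarise_session(storage, settings, lookback_hours=3)
        except asyncio.CancelledError:
            pass  # timer was reset before it fired

    _session_tasks[chat_id] = asyncio.create_task(_wait_then_summarise())
```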
Long-Term Memory — The Reflection Engine¶
The Reflection Engine (app/memory/reflection.py) runs nightly at 2 AM UTC. It produces four outputs from the last 24 hours of interactions:
- An updated narrative profile (UserContext) — six sections, ≤550 words
- A structured facts list — JSON array of new/changed facts, saved to
user_facts - A structured intentions list — JSON array of newly detected personal commitments, saved to
pending_intentions - A structured learnings list — JSON array of behavioral corrections, saved to
agent_learnings
Reflection cycle¶
```mermaid
flowchart TD
A([2 AM UTC trigger]) --> B[Fetch interactions\nsince 24h ago]
B --> C{Any interactions?}
C -- No --> D([Skip — log and return])
C -- Yes --> E[Fetch existing UserContext\nfrom context table]
E --> EF[Fetch existing user_facts + intentions + learnings\nto avoid re-extraction]
EF --> F[Build reflection prompt\nexisting profile + facts + intentions + learnings + conversations]
F --> G[Call LLM — tool-less Agent]
G --> H{Parse output\n---PROFILE--- / ---FACTS--- / ---INTENTIONS--- / ---LEARNINGS--- markers}
H --> I[Updated profile markdown\n≤550 words, 6 sections]
H --> J[New facts JSON array\nonly new or changed facts]
H --> JI[New intentions JSON array]
H --> JL[New learnings JSON array]
I --> K[Save UserContext\nto context table]
J --> L[Save each UserFact\nto user_facts\nsource='reflection']
JI --> LI[Save each PendingIntention\nto pending_intentions]
JL --> LL[Save each AgentLearning\nto agent_learnings]
K --> M([Sleep 24 hours])
L --> M
LI --> M
LL --> M
M --> A
```
The reflection prompt¶
The LLM is given four inputs:
1. The existing profile (or "start fresh" if none exists)
2. The existing facts, intentions, and learnings — so it knows what's already stored and avoids re-extracting duplicates
3. The last 24h of conversations — formatted as [timestamp]\nUser: ...\nKwasi: ...
4. Today's date (for temporal context)
It must produce output in this exact format:
```text
---PROFILE---
## Personal
...
## Goals & Values
...
## Communication Style
...
## Patterns & Rhythms
...
## Preferences
...
## Notes
...
---FACTS---
[{"key": "home_address", "value": "...", "category": "location"}]
---INTENTIONS---
[{"text": "call doctor about knee", "follow_up_days": 3, "context": "Mentioned knee pain was getting worse"}]
---LEARNINGS---
[{"rule": "always confirm before deleting any item", "category": "decision_making"}]
Profile rules: Six sections with word budgets. Specific values (addresses, names) belong in FACTS, not in prose. Maximum 550 words. Return unchanged if nothing new was learned.
Facts rules: JSON array of objects with key, value, category. Only include facts that are new or have changed since the existing facts list. Empty array [] if nothing new.
Intentions rules: JSON array of objects with text and follow_up_days. Only include soft commitments not already in the existing intentions list. Empty array [] if nothing new. Examples: "I should call the dentist", "I want to start running again", "I need to finish that report".
Parsing and fallback¶
_parse_reflection_output(text) returns a tuple[str, list[dict], list[dict], list[dict]] (profile, facts, intentions, learnings):
- Splits from the end: ---LEARNINGS--- first, then ---INTENTIONS---, then ---FACTS---, then ---PROFILE---
- Profile text = everything between ---PROFILE--- and ---FACTS---
- Facts = JSON parsed from the section between ---FACTS--- and ---INTENTIONS---
- Intentions = JSON parsed from between ---INTENTIONS--- and ---LEARNINGS---
- Learnings = JSON parsed from after ---LEARNINGS---
- If markers are absent: entire output treated as profile, no structured data extracted (backwards-compatible fallback with a logged warning)
- Markdown code fences around the JSON arrays are stripped automatically
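A hedged sketch of that end-first parse; the real `_parse_reflection_output` may differ in details such as how it strips fences:

```python
import json
import logging

def _split_tail(text: str, marker: str) -> tuple[str, str]:
    """Split on the last occurrence of marker; (text, '') if absent."""
    head, sep, tail = text.rpartition(marker)
    return (head, tail) if sep else (text, "")

def parse_reflection_output(text: str) -> tuple[str, list, list, list]:
    """Return (profile, facts, intentions, learnings), splitting from the end."""
    rest, learnings_raw = _split_tail(text, "---LEARNINGS---")
    rest, intentions_raw = _split_tail(rest, "---INTENTIONS---")
    rest, facts_raw = _split_tail(rest, "---FACTS---")
    if "---PROFILE---" not in rest:
        logging.warning("reflection output missing markers; treating all as profile")
        return text.strip(), [], [], []  # backwards-compatible fallback

    def load(raw: str) -> list:
        raw = raw.strip()
        if raw.startswith("```"):  # strip markdown code fences
            raw = raw.strip("`").removeprefix("json").strip()
        return json.loads(raw) if raw else []

    profile = rest.split("---PROFILE---", 1)[1].strip()
    return profile, load(facts_raw), load(intentions_raw), load(learnings_raw)
```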
Where the profile is used¶
Every time the agent runs, build_system_prompt() fetches the current UserContext from storage and injects it after the permanent facts section, under a "## Your Memory of This User (narrative)" header that clarifies it should actively shape tone and priority, not just inform. The profile is capped at 4,800 characters (~1,200 tokens) before injection — if it has grown beyond that, it is truncated with a [...profile truncated for context] notice. This is fail-safe: if storage is down or the context table is empty, the agent still runs — just without the profile.
Triggering reflection manually¶
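By analogy with the `/embed-backfill` endpoint below, which authenticates with the same `X-Reflection-Secret` header, the manual trigger presumably looks like the following; the `/reflect` path is an assumption, not a documented route:

```bash
# Hypothetical path: only the auth header and the response fields are
# documented here; /reflect is a guess modeled on /embed-backfill.
curl -X POST https://your-app.railway.app/reflect \
  -H "X-Reflection-Secret: your-secret"
```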
The response includes facts_added, facts_updated, intentions_added, and learnings_added — counts of new records saved that cycle.
Semantic Search¶
Beyond keyword matching, Kwasi can find content by meaning using vector embeddings. When a note, interaction, or saved article is written to storage, an embedding is generated in the background via embed_text() (app/tools/embedding.py) using Gemini gemini-embedding-001 (3072 dimensions). If the embedding API is unavailable, the record is saved without an embedding — keyword search still works.
The semantic_search tool¶
Available on memory_agent, briefing_agent, and the full agent.
```python
semantic_search(query: str, sources: list[str] | None = None, limit: int = 5)
# sources: "notes" | "interactions" | "read_later" — default: all three
```
Returns results grouped by source with a similarity percentage. Requires GOOGLE_API_KEY.
Recency weighting¶
find_relevant_notes(), find_relevant_summaries(), and the semantic context injection layer apply a small recency boost when ranking results. Notes written within the last 90 days receive up to +0.05 added to their cosine similarity score — so a note from last week at 0.78 similarity scores higher than an equivalent note from 18 months ago at 0.78. The boost decays linearly to 0 at 90 days. Notes without a created_at timestamp get no boost and sort by raw similarity only.
This means equally-relevant context from recent conversations surfaces first without entirely suppressing older material.
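Numerically, the boost is a small linear decay added to cosine similarity; this sketch assumes timezone-aware `created_at` values:

```python
from datetime import datetime, timezone

RECENCY_WINDOW_DAYS = 90
MAX_BOOST = 0.05

def boosted_score(similarity: float, created_at: datetime | None) -> float:
    """Cosine similarity plus a linearly decaying recency boost."""
    if created_at is None:
        return similarity  # no timestamp: rank by raw similarity only
    age_days = (datetime.now(timezone.utc) - created_at).days
    if age_days >= RECENCY_WINDOW_DAYS:
        return similarity  # boost has fully decayed
    return similarity + MAX_BOOST * (1 - age_days / RECENCY_WINDOW_DAYS)
```

With these numbers, a week-old note at 0.78 similarity scores ≈0.826, while an 18-month-old note stays at 0.78, matching the example above.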
Where semantic search is used¶
| Scenario | Tool |
|---|---|
| "Find anything about my career plans" | semantic_search (all sources) |
| "That conversation where I mentioned burnout" | semantic_search(sources=["interactions"]) as fallback after search_history |
| "Find that article I saved about AI regulation" | semantic_search(sources=["read_later"]) — primary, no keyword search exists for read_later |
| "Find anything about X" (unknown source) | find_everything — runs keyword + semantic in parallel |
| Note/task not found by keyword | semantic_search as automatic fallback |
Backfilling existing data¶
Rows written before semantic search was added have embedding = NULL. To embed them:
```bash
curl -X POST https://your-app.railway.app/embed-backfill \
-H "X-Reflection-Secret: your-secret"
# Returns: {"notes": 12, "interactions": 847, "read_later": 5}
```
Rate-limited to ~10 rows/second. For large histories, this may take several minutes — the endpoint processes synchronously.
Information Priority¶
When the user asks about themselves, the agent checks in this order:
```text
1. Permanent facts (user_facts table — always in system prompt)
   ↓ not found
2. Reflection profile (UserContext — always in system prompt)
   ↓ not found
3. recall_facts(query) — active search over user_facts
   ↓ not found
4. search_history(query) — keyword search over past conversations
   ↓ not found
5. semantic_search(query) — meaning-based search over notes, interactions, saved articles
   ↓ not found
6. Acknowledge it's not known — offer to remember it now
```
What Gets Logged¶
Every interaction (from any handler) is saved via log_interaction():
| Field | Content |
|---|---|
| `user_message` | The text sent to the agent (or the transcript for voice) |
| `agent_response` | The full response returned |
| `tools_used` | JSON array of tool-call records extracted from the agent result |
| `channel` | `"telegram"`, `"cli"`, or `"whatsapp"` |
| `user_id` | User ID as a string (Telegram user ID, WhatsApp phone number, or `None` for CLI) |
| `created_at` | UTC timestamp |
Logging is fire-and-forget — a failure never blocks the response.
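A sketch of the fire-and-forget pattern; whether `log_interaction` is a method on the storage object is an assumption:

```python
import asyncio
import logging

async def _log_safely(storage, **fields) -> None:
    """A logging failure must never break the reply to the user."""
    try:
        await storage.log_interaction(**fields)
    except Exception:
        logging.exception("interaction logging failed")

# After the reply is sent (handler code, illustrative):
# asyncio.create_task(_log_safely(storage, user_message=text,
#     agent_response=reply, tools_used=tools, channel="telegram",
#     user_id=str(user_id)))
```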
Bootstrap: Seeding Initial Facts¶
After deploying, seed the facts store by telling Kwasi key facts in one message:
"Remember these facts about me: - My home address is 124 Avenue Perretti, Neuilly-sur-Seine - I work at [company] at [address] - My partner is [name] - I prefer public transport"
Kwasi will call remember_fact for each one. From that point, every prompt includes them permanently — no conversation history needed, no reflection cycle required.