How We Rebuilt Cereby's Memory to Feel Like It Actually Knows You
From tool-call roulette to automatic detection: a parallel classification pipeline that saves what matters without slowing down chat.
The memory that was not really there
A student tells Cereby they are studying for the MCAT. The next day, they open a new chat and Cereby has no idea. Not because the information was lost in transit, not because of a database bug. Because the chat model simply decided not to save it.
That was the original memory system. We gave the chat model a save_memory tool alongside quiz generation, note creation, flashcards, and a dozen other capabilities, and trusted it to notice save-worthy facts while also answering the user's actual question. It did not. The save tool was called on roughly 5 to 10 percent of messages that contained durable, save-worthy information. Explicit "remember this" commands worked reliably. Implicit facts, preferences, and goals slipped through.
Our four-layer context post describes how Cereby assembles each request: system instructions, session metadata, user memory, recent summaries, and the live thread. The architecture was right. Layer 3 (user memory) just had almost nothing in it.
What the old system looked like end to end
When the model did save, it had no awareness of existing memories. A user who mentioned their major three times could end up with three near-identical rows. Retrieval made this worse: the retriever fetched all memories, computed embeddings in the application layer, and ranked by cosine similarity. At 50 memories that was acceptable; at 500 it added 200 to 400ms to every chat request.
The shape we landed on
The fix was to stop asking the chat model to do two jobs at once. A cheap, fast classifier runs in parallel with the main chat response and decides what to save. The chat model focuses on answering.
We chose Gemini 2.5 Flash Lite as the classifier. It responds in 1 to 2 seconds and costs a small fraction of what the main model does. At roughly 300 input tokens and 50 output tokens per message, classification costs about $0.00003 per call. At 1,000 messages per user per month, that is $0.03 per user per month.
How the classifier works
The classifier receives the user's message and decides whether it contains a durable, save-worthy fact. We defined six categories: preference ("I prefer short explanations"), fact ("I'm a sophomore at UCLA"), goal ("I want to ace my physics final"), learning_style ("Visual explanations help me"), schedule ("My quiz is next Thursday"), and general as a catch-all.
The prompt explicitly excludes task requests, greetings, questions about Cereby, and temporary conversational context. The classifier returns structured JSON with a content string, a category label, and a confidence score between 0 and 1 for each candidate. Anything below 0.7 is discarded.
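In TypeScript, the whole classifier reduces to one structured-output call. Here is a minimal sketch assuming the @google/genai SDK; the prompt is heavily abbreviated and the function name is illustrative, not our production code:

```typescript
import { GoogleGenAI, Type } from "@google/genai";

const CATEGORIES = ["preference", "fact", "goal", "learning_style", "schedule", "general"];

type MemoryCandidate = { content: string; category: string; confidence: number };

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function classifyMessage(message: string): Promise<MemoryCandidate[]> {
  const response = await ai.models.generateContent({
    model: "gemini-2.5-flash-lite",
    // The real prompt also spells out the exclusions: task requests, greetings,
    // questions about Cereby, temporary conversational context.
    contents: `Extract durable, save-worthy facts about the user from this message.\n\nMessage: ${message}`,
    config: {
      responseMimeType: "application/json",
      responseSchema: {
        type: Type.ARRAY,
        items: {
          type: Type.OBJECT,
          properties: {
            content: { type: Type.STRING },
            category: { type: Type.STRING, enum: CATEGORIES },
            confidence: { type: Type.NUMBER },
          },
          required: ["content", "category", "confidence"],
        },
      },
    },
  });

  const candidates: MemoryCandidate[] = JSON.parse(response.text ?? "[]");
  // Below-threshold candidates are discarded; for most messages this is [].
  return candidates.filter((c) => c.confidence >= 0.7);
}
```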
In practice, the classifier returns an empty array for about 85% of messages. Most chat is questions and task requests, not self-disclosure. For those cases we skip the embedding computation, similarity check, and insert entirely. The median added overhead for the no-save case is under 50ms.
Deduplication and retrieval
Before saving any candidate, we compute its embedding (a 1536-dimensional vector) and ask the database for the user's nearest existing memory using pgvector's cosine distance operator. If the top match has similarity above 0.92, we skip the save. This catches rephrasings: "I study biology" and "I'm a biology major" reach about 0.94 cosine similarity and are correctly identified as duplicates.
We arrived at 0.92 after testing on a corpus of 200 memory pairs. At 0.90 or below, distinct memories in the same domain (like "studying biology" and "biology quiz next week") started colliding. Above 0.95, legitimate duplicates slipped through.
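In code, the write path looks roughly like this, assuming node-postgres; embed() stands in for the embedding call, and the table and column names are illustrative:

```typescript
import { Pool } from "pg";

// Hypothetical embedding helper; any 1536-dimensional model works here.
declare function embed(text: string): Promise<number[]>;

const pool = new Pool();
const DUPLICATE_THRESHOLD = 0.92;

async function saveIfNovel(userId: string, content: string, category: string): Promise<boolean> {
  // pgvector accepts the '[0.1,0.2,...]' text format, which JSON.stringify produces.
  const embedding = JSON.stringify(await embed(content));

  // <=> is pgvector's cosine distance operator, so similarity = 1 - distance.
  const { rows } = await pool.query(
    `SELECT 1 - (embedding <=> $2::vector) AS similarity
       FROM memories
      WHERE user_id = $1 AND is_active
      ORDER BY embedding <=> $2::vector
      LIMIT 1`,
    [userId, embedding]
  );

  if (rows[0]?.similarity > DUPLICATE_THRESHOLD) return false; // a rephrasing of an existing memory

  await pool.query(
    `INSERT INTO memories (user_id, content, category, embedding)
     VALUES ($1, $2, $3, $4::vector)`,
    [userId, content, category, embedding]
  );
  return true;
}
```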
Retrieval is now a single database call backed by an HNSW index (a structure that enables fast approximate nearest-neighbor search). The function takes a user ID, a query embedding, and a result limit, and returns the nearest memories sorted by cosine similarity. One SQL call, one network round trip, index-backed. Retrieval latency dropped from 200-400ms at 500 memories to under 2ms, regardless of memory count.
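Something like the following captures that shape (function and table names assumed). The ORDER BY on the raw distance operator is what lets Postgres serve the scan from the HNSW index:

```typescript
import { Pool } from "pg";
const pool = new Pool();

// One-time setup: retrieval lives in the database as a SQL function.
await pool.query(`
  CREATE OR REPLACE FUNCTION match_memories(p_user_id uuid, p_query vector(1536), p_limit int)
  RETURNS TABLE (content text, category text, similarity double precision) AS $$
    SELECT content, category, 1 - (embedding <=> p_query)
      FROM memories
     WHERE user_id = p_user_id AND is_active
     ORDER BY embedding <=> p_query
     LIMIT p_limit;
  $$ LANGUAGE sql STABLE;
`);

// Per request: one round trip, one index scan.
async function getRelevantMemories(userId: string, queryEmbedding: number[], limit = 10) {
  const { rows } = await pool.query(
    "SELECT * FROM match_memories($1, $2::vector, $3)",
    [userId, JSON.stringify(queryEmbedding), limit]
  );
  return rows; // nearest memories, sorted by cosine similarity
}
```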
The 3-second race and the inline badge
We wanted the "Memory updated" badge to appear inline without blocking the chat response and without a separate polling request. The solution is a race.
The detection promise starts at the top of the request handler, immediately after confirming that memory is enabled for that user. It runs in parallel with the main chat processing. Just before serializing the final response, we race the detection promise against a 3-second timer. If detection finished in time, the saved-memory metadata is merged into the response JSON and the badge appears. If the timer wins, the response goes out without it. The detection continues in the background and the memory is still saved; the user just does not see the badge for that particular message.
The classifier typically responds in 1 to 2 seconds. About 90% of the time, detection finishes before the main chat response is ready.
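In JavaScript terms, the race is literally a Promise.race. A sketch with hypothetical helper names (detectAndSaveMemories wraps the classify, dedupe, and insert steps above):

```typescript
type SavedMemory = { content: string; category: string };

// Hypothetical helpers standing in for the real pipeline pieces.
declare function detectAndSaveMemories(userId: string, msg: string): Promise<SavedMemory[]>;
declare function generateChatResponse(userId: string, msg: string): Promise<Record<string, unknown>>;

export async function handleChat(userId: string, userMessage: string, memoryEnabled: boolean) {
  // Start detection immediately so it overlaps the slower chat generation.
  const detection: Promise<SavedMemory[] | null> = memoryEnabled
    ? detectAndSaveMemories(userId, userMessage).catch(() => null) // memory is best-effort
    : Promise.resolve(null);

  const chatResponse = await generateChatResponse(userId, userMessage);

  // Just before serializing, give detection up to 3 more seconds.
  const timer = new Promise<null>((resolve) => setTimeout(resolve, 3_000, null));
  const saved = await Promise.race([detection, timer]);

  // If the timer wins, detection keeps running in the background and the memory
  // still saves; this response just ships without the badge metadata.
  return Response.json({
    ...chatResponse,
    ...(saved?.length ? { savedMemories: saved } : {}),
  });
}
```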
Schema, limits, and management
We added an embedding column (1536-dimensional vector), a category column, and an is_active boolean for soft-delete support. An HNSW index on the embedding column enables the nearest-neighbor search. Existing memories were backfilled by a one-time migration script.
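The migration reduces to a few DDL statements; a sketch with assumed names (the backfill script, which computes embeddings for existing rows, is omitted):

```typescript
import { Pool } from "pg";
const pool = new Pool();

await pool.query(`
  CREATE EXTENSION IF NOT EXISTS vector;

  ALTER TABLE memories
    ADD COLUMN embedding vector(1536),
    ADD COLUMN category  text,
    ADD COLUMN is_active boolean NOT NULL DEFAULT true;

  -- Cosine-distance HNSW index: this is what makes ORDER BY embedding <=> ... index-backed.
  CREATE INDEX memories_embedding_hnsw_idx
    ON memories USING hnsw (embedding vector_cosine_ops);
`);
```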
Tier limits align storage with subscription value: Free users cap at 50 memories, Lite at 100, Pro at 500, and Premium is unlimited. Auto-detected saves are silently dropped when the limit is reached. Explicit saves return an upgrade prompt, which is a natural moment of value recognition: the user has already seen Cereby remember 50 things about them.
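Enforcement is a small guard in the save path, sketched here with an assumed countActiveMemories helper:

```typescript
// Tier caps from the table below; Infinity models "unlimited".
const MEMORY_LIMITS: Record<string, number> = {
  free: 50,
  lite: 100,
  pro: 500,
  premium: Infinity,
};

declare function countActiveMemories(userId: string): Promise<number>;

async function checkMemoryCap(userId: string, tier: string, explicitSave: boolean) {
  const count = await countActiveMemories(userId);
  if (count < (MEMORY_LIMITS[tier] ?? MEMORY_LIMITS.free)) return { allowed: true as const };

  // At the cap: auto-detected saves drop silently; explicit saves get the upgrade prompt.
  return { allowed: false as const, showUpgradePrompt: explicitSave };
}
```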
The forget tool now soft-deletes rows. AI-initiated forgetting is reversible from the settings panel; user-initiated deletion from the management UI is permanent.
Before and after
| Dimension | Old system | New system |
|---|---|---|
| Detection method | Model tool-call (inconsistent) | Parallel LLM classifier (automatic) |
| Save rate | ~5-10% of save-worthy messages | ~85-90% of save-worthy messages |
| Deduplication | None | Embedding similarity > 0.92 |
| Retrieval | O(n) in-memory cosine | pgvector HNSW (~O(log n)) |
| Retrieval latency (500 memories) | 200-400ms | < 2ms |
| Categories | None | 6 categories with filtering |
| Tier limits | None | Free: 50, Lite: 100, Pro: 500, Premium: unlimited |
| User feedback | Only on explicit save | Inline "Memory updated" badge |
| Management UI | Flat list plus delete | Categories, search, inline edit, manual add |
A student who mentions their major, their upcoming quiz, and their preference for concise explanations across three separate conversations will find all three facts waiting the next time they open a chat, without ever saying "remember."
What this taught us
Making the chat model responsible for both answering and saving was a design error, not a minor oversight. A cheap classifier running in parallel does one job well and costs almost nothing.
Embeddings belong in the database. Computing similarity in the application layer was expedient but did not scale. Moving vectors into the database and indexing them with HNSW made retrieval latency effectively flat as memory counts grow and eliminated a class of latency bugs.
Deduplication thresholds need empirical tuning. Our first guess of 0.90 was too aggressive and blocked distinct memories in the same domain. 0.92 was the sweet spot in our corpus, and it is the kind of parameter that needs periodic review as the embedding model changes.
What's next
Two things are near the top of the backlog. Memory consolidation: when a user accumulates several related memories ("studying for MCAT", "MCAT is in June", "nervous about MCAT verbal"), merge them into a single richer memory automatically. Transparency: show users which memories influenced a particular response ("Cereby used: your preference for bullet points, your biology major"). Temporal decay for schedule-category memories and category-weighted retrieval are also on the list.
