How Cereby's Four-Layer Context Makes AI Feel Continuous
Why a strict context stack beats a grab bag of facts: predictable shape, safe trimming, and continuity across sessions.
The problem with grab-bag prompts
Every time a user asks Cereby something, the server has to answer a quieter question first: what exactly do we send to the model?
Early on, the answer was something like "everything relevant." Preferences. Recent work. The current conversation. Whatever fields the product team had added since the last release. The result was a prompt that grew organically, appended to over time, with no consistent ordering and no principle for what to cut when token budgets ran out. Quality was hard to test because the shape of the prompt changed between releases. Trimming was guesswork. And when the model gave a strange answer, nobody could confidently point to what had or had not been included.
That is not a model problem. It is a contract problem.
The shape we landed on
We call it the four-layer context model. In practice it is five layers plus system instructions and the current turn, but the label stuck because four captures the parts that vary between requests.
Order encodes priority. System and session anchors sit near the top, where they are hardest to displace. Long-tail material (older transcript turns, older summaries) sits near the bottom, where it is safest to trim first when the budget runs short.
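To make that concrete, here is a minimal sketch of how a stack like this might be represented. The layer names, the ContextBlock shape, and the assemble helper are illustrative, not Cereby's actual code; the point is that position in the stack is the whole policy.

```python
# A minimal sketch of the layer stack, with order doubling as trim policy.
# Names and shapes are illustrative, not Cereby's actual implementation.
from dataclasses import dataclass
from enum import IntEnum


class Layer(IntEnum):
    """Prompt position, top to bottom. Trimming works bottom-up through
    the trimmable layers; anchored layers are never cut."""
    SYSTEM_INSTRUCTIONS = 0   # anchored
    SESSION_METADATA = 1      # anchored
    USER_MEMORY = 2           # top hits anchored, lower hits trimmable
    RECENT_SUMMARIES = 3      # trimmable, oldest first
    TRANSCRIPT = 4            # trimmable, oldest turns first
    CURRENT_TURN = 5          # anchored: always sent


@dataclass
class ContextBlock:
    layer: Layer
    text: str
    tokens: int


def assemble(blocks: list[ContextBlock]) -> str:
    # Sorting by layer gives the prompt its fixed, testable shape.
    return "\n\n".join(b.text for b in sorted(blocks, key=lambda b: b.layer))
```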
What each layer actually does
Session metadata is built fresh from the request: User-Agent, timezone, local time, location when the client sends it, account tier and coin balance. It is what lets the model answer "what time does the library close tomorrow?" without guessing. Because it is computed per request, it never goes into the long-term memory store. When a user turns off memory, this layer still works correctly.
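A sketch of what building that layer per request could look like. The header names (X-Timezone, X-Location) and the request and account objects are assumptions for illustration; what matters is that every field is derived from the request and nothing is read from, or written to, the memory store.

```python
# A sketch of per-request session metadata. Everything here is computed
# from the request itself and is never persisted, so it keeps working
# when the user turns memory off. Header and field names are illustrative.
from datetime import datetime
from zoneinfo import ZoneInfo


def build_session_metadata(request, account) -> str:
    tz = request.headers.get("X-Timezone", "UTC")
    now = datetime.now(ZoneInfo(tz))
    lines = [
        f"User agent: {request.headers.get('User-Agent', 'unknown')}",
        f"Timezone: {tz}",
        f"Local time: {now:%Y-%m-%d %H:%M}",
        f"Account tier: {account.tier}",
        f"Coin balance: {account.coins}",
    ]
    if request.headers.get("X-Location"):  # only when the client sends it
        lines.append(f"Location: {request.headers['X-Location']}")
    return "\n".join(lines)
```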
User memory is retrieved from Cereby's memory store using embeddings against the current message, capped by a token budget. We send what is relevant to the ask, not everything we have ever learned.
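A budget-capped retrieval pass might look like the following sketch. The memory record shape, the precomputed vectors, and the token counts are stand-ins for the real store and embedding model; the mechanism is rank by similarity, then stop at the budget.

```python
# A sketch of budget-capped memory retrieval: rank stored memories by
# cosine similarity to the current message's embedding, then take hits
# until the token budget is spent.
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve_memories(message_vec, memories, budget_tokens: int) -> list[str]:
    ranked = sorted(memories, key=lambda m: cosine(message_vec, m["vec"]), reverse=True)
    picked, spent = [], 0
    for m in ranked:
        if spent + m["tokens"] > budget_tokens:
            break                  # the budget governs, not total recall
        picked.append(m["text"])
        spent += m["tokens"]
    return picked
```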
Recent conversation summaries are short, user-focused lines about past chats, generated when sessions start or continue rather than shipped as raw transcripts. Sending the last several gives the model cross-session continuity at a fraction of the token cost of full history.
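The summaries layer itself can then be as simple as taking the last few records, as in this sketch (the record shape and the keep count are illustrative):

```python
# A sketch of the summaries layer: keep the last several per-session
# summaries, newest last so the model reads them in chronological order.
def summaries_layer(summaries: list[dict], keep: int = 5) -> str:
    recent = summaries[-keep:]     # last several sessions only
    return "\n".join(f"- {s['text']}" for s in recent)
```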
Where it got complicated
The trim policy is where most of the hard decisions live. When a conversation runs long and the budget bites, something has to go. Without an ordering principle, the answer was always arbitrary, and bad trim decisions showed up as the model appearing to "forget" the start of the chat. With layer order as policy, the answer is mechanical: trim from the oldest transcript tail first, then older summaries, while leaving system instructions, session metadata, and top memory hits untouched. Order makes the tradeoff explicit and each trim decision explainable.
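Expressed against the stack sketch above, the trim rule is a short loop. This is a sketch, not the production code; it assumes the candidate blocks within each layer arrive oldest-first from the store.

```python
# A sketch of the mechanical trim rule: when the assembled prompt is over
# budget, drop the oldest transcript turns first, then the oldest
# summaries. System instructions, session metadata, the current turn,
# and top memory hits are never candidates. Layer and ContextBlock come
# from the stack sketch above.
def trim(blocks: list["ContextBlock"], budget: int) -> list["ContextBlock"]:
    total = sum(b.tokens for b in blocks)
    # Trim order: transcript tail first, then older summaries.
    for layer in (Layer.TRANSCRIPT, Layer.RECENT_SUMMARIES):
        candidates = [b for b in blocks if b.layer == layer]  # oldest first
        for victim in candidates:
            if total <= budget:
                return blocks
            blocks.remove(victim)
            total -= victim.tokens
    return blocks
```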
The memory-off and temporary-chat paths required care for a related reason. Users who disable memory expect no durable recall. But they still expect "what time is it tomorrow in Tokyo?" to work. Session metadata had to remain strictly non-memory, or toggling privacy would silently break date and locale grounding.
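Put together, the memory toggle gates exactly one layer. A sketch reusing the helpers above:

```python
# A sketch of the memory-off invariant: disabling memory removes only the
# retrieval layer. Session metadata is rebuilt from the request every
# time, so "what time is it tomorrow in Tokyo?" still grounds correctly.
def build_context(request, account, memory_enabled: bool,
                  message_vec, memories, budget_tokens: int) -> list[str]:
    parts = [build_session_metadata(request, account)]  # never a memory read
    if memory_enabled:
        parts += retrieve_memories(message_vec, memories, budget_tokens)
    return parts
```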
What this taught us
Ephemeral environment facts belong in their own layer, not in the memory store. Mixing them in breaks privacy modes in ways that are invisible until a user turns memory off and discovers the model still knows what time zone they are in.
Encoding trim priority in layer order turns an editorial judgment into a mechanical rule. That matters most at 2 a.m. when something is behaving strangely and the on-call needs to reason about what the model actually saw.
