Optimizing Cereby AI: From 5-8 Seconds to Sub-Second Responses
How multi-layered caching, context compression, and parallel queries turned a slow learning assistant into a fast one.
It was fast enough, until it wasn't
When we first shipped Cereby AI, we were proud of the architecture. A plugin-based design with a dedicated Context Aggregator, an Intent Classifier, a Tool Orchestrator. The pieces were clean and the system worked. What we had not reckoned with was how those pieces would stack in wall-clock time once real users started asking follow-up questions back to back.
Every request was taking 5-8 seconds. Of that, context aggregation alone was eating 2-3 seconds. The request path looked like this:
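In rough pseudocode, with illustrative names standing in for the real handler and queries:

```typescript
// A rough sketch of the original request path; the function names are
// placeholders, not the actual Cereby code.
declare function classifyIntent(message: string): Promise<string>;
declare function fetchQuizHistory(userId: string): Promise<unknown>;
declare function fetchWeakPoints(userId: string): Promise<unknown>;
declare function fetchNotes(userId: string): Promise<unknown>;
declare function fetchCalendarEvents(userId: string): Promise<unknown>;
declare function fetchStudyPlan(userId: string): Promise<unknown>;
declare function generateResponse(intent: string, context: object, message: string): Promise<string>;

async function handleMessage(userId: string, message: string): Promise<string> {
  const intent = await classifyIntent(message); // an AI call, on every single request

  // Context aggregation: five sequential database queries, ~2-3 seconds total
  const quizHistory    = await fetchQuizHistory(userId);
  const weakPoints     = await fetchWeakPoints(userId);
  const notes          = await fetchNotes(userId);
  const calendarEvents = await fetchCalendarEvents(userId);
  const studyPlan      = await fetchStudyPlan(userId);

  // ~15,000 tokens of mostly irrelevant history shipped to the model
  const context = { quizHistory, weakPoints, notes, calendarEvents, studyPlan };
  return generateResponse(intent, context, message); // the main AI call
}
```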
The bottlenecks were structural: five sequential queries with no caching, intents re-classified on every request even when nothing had changed, and ~15,000 tokens of history sent to the AI each time, most of it irrelevant to the specific question. Follow-up questions, where a one-second response would have felt natural, hit the wall hardest.
The shape we landed on
The answer was a three-tier caching stack, context compression, and parallelized queries. Each layer reduces load on the one below it.
Most requests never reach the bottom. A cache hit at tier one costs nothing. A miss falls through to the database cache. Only a cold or invalidated entry triggers full aggregation, and even then the queries run in parallel.
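Here is a minimal sketch of that fall-through, assuming a shared key format and placeholder cache objects rather than the real implementation:

```typescript
// Illustrative fall-through across the cache layers; the cache handles and
// aggregateContext are placeholders, not the actual Cereby implementation.
interface UserContext { [key: string]: unknown }

interface CacheLayer {
  get(key: string): Promise<UserContext | undefined>;
  set(key: string, value: UserContext, opts?: { ttlSeconds?: number }): Promise<void>;
}

declare const memoryCache: CacheLayer; // tier 1: in-process LRU
declare const dbCache: CacheLayer;     // tier 2: database-backed, shared across instances
declare function aggregateContext(userId: string, intent: string): Promise<UserContext>;

async function getContext(userId: string, intent: string): Promise<UserContext> {
  const key = `context:${userId}:${intent}`;

  // Tier 1: sub-millisecond on a hit
  const hot = await memoryCache.get(key);
  if (hot) return hot;

  // Tier 2: survives restarts and is visible to every instance
  const warm = await dbCache.get(key);
  if (warm) {
    await memoryCache.set(key, warm, { ttlSeconds: 120 });
    return warm;
  }

  // Cold or invalidated: full aggregation, with the queries run in parallel
  const fresh = await aggregateContext(userId, intent);
  await memoryCache.set(key, fresh, { ttlSeconds: 120 }); // 2-minute TTL for context
  await dbCache.set(key, fresh, { ttlSeconds: 300 });     // 5-minute TTL
  return fresh;
}
```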
Caching in three tiers
Tier one is an in-memory LRU cache with a 2-minute TTL for context and a 5-minute TTL for intent classifications. Access is sub-millisecond. Tier two is a database-backed cache that persists across restarts and works across multiple instances, with a 5-minute TTL and automatic cleanup of expired entries. Cache invalidation fires on meaningful data changes: a completed quiz, an updated note, a new calendar event.
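The invalidation side can be sketched like this; the cache handles and the quiz-completion hook are placeholders for wherever those writes actually happen:

```typescript
// Illustrative invalidation wiring, assuming both tiers expose a prefix delete.
declare const memoryCache: { deletePrefix(prefix: string): void };
declare const dbCache: { deletePrefix(prefix: string): Promise<void> };

async function invalidateContext(userId: string): Promise<void> {
  // Drop every cached context entry for this user, in both tiers at once.
  memoryCache.deletePrefix(`context:${userId}:`);
  await dbCache.deletePrefix(`context:${userId}:`);
}

// Called from the write paths that change what the assistant should know about:
// a completed quiz, an updated note, a new calendar event.
async function onQuizCompleted(userId: string): Promise<void> {
  // ...persist the quiz result first...
  await invalidateContext(userId);
}
```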
Tier three is intent classification caching specifically. Users tend to ask related questions in a session, and many intents are structurally similar ("what are my weakest topics in calculus?" vs. "where am I struggling in calculus?"). Caching the classification step reduced redundant AI calls for intent by 30-40%, with a direct reduction in both latency and cost.
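A minimal sketch of that caching step, assuming a simple normalized-text key; this version only deduplicates near-identical phrasings, whereas the real matching of structurally similar questions can be looser:

```typescript
// The underlying classifier call we want to avoid repeating.
declare function classifyIntentWithAI(message: string): Promise<string>;

const intentCache = new Map<string, { intent: string; expiresAt: number }>();
const INTENT_TTL_MS = 5 * 60 * 1000; // matches the 5-minute intent TTL above

// Normalize so near-identical phrasings share a cache entry.
function intentKey(userId: string, message: string): string {
  return `${userId}:${message.toLowerCase().replace(/[^a-z0-9\s]/g, "").trim()}`;
}

async function classifyIntent(userId: string, message: string): Promise<string> {
  const key = intentKey(userId, message);
  const hit = intentCache.get(key);
  if (hit && hit.expiresAt > Date.now()) return hit.intent;

  const intent = await classifyIntentWithAI(message);
  intentCache.set(key, { intent, expiresAt: Date.now() + INTENT_TTL_MS });
  return intent;
}
```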
Context compression: cutting token usage in half
The other big lever was what we sent to the AI once we did need a live response. A student asking about their calculus weak points does not need their entire quiz history across every subject in the payload. The compressor reads the classified intent and filters accordingly: quiz history trimmed to the relevant subject, weak points limited to high-severity entries in the relevant area, a top-N cap per category to keep the tail from growing back. The result was a drop from ~15,000 tokens per request to ~5,000-8,000.
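A sketch of that filtering, with assumed field names and a hypothetical top-N cap; the actual compressor covers more categories than these two:

```typescript
interface QuizResult { subject: string; completedAt: string; score: number }
interface WeakPoint  { subject: string; topic: string; severity: "low" | "medium" | "high" }

interface StudyContext {
  quizHistory: QuizResult[];
  weakPoints: WeakPoint[];
}

interface Intent { type: string; subject?: string }

const TOP_N = 10; // cap per category so the tail doesn't grow back

function compressContext(context: StudyContext, intent: Intent): StudyContext {
  const subject = intent.subject;
  return {
    // Quiz history trimmed to the relevant subject, capped at TOP_N entries
    quizHistory: context.quizHistory
      .filter(q => !subject || q.subject === subject)
      .slice(0, TOP_N),
    // Weak points limited to high-severity entries in the relevant area
    weakPoints: context.weakPoints
      .filter(w => (!subject || w.subject === subject) && w.severity === "high")
      .slice(0, TOP_N),
  };
}
```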
What surprised us most about the 40-50% token reduction was its downstream impact. Smaller payloads mean faster inference and lower per-request cost, and because the AI works with more focused context, response quality held or improved.
From sequential to parallel queries
Context aggregation was sequential by default: each query started only after the previous one finished, so total time was the sum of all five, roughly 2.9 seconds at the median. Switching to parallel fetching brought that down to the duration of the slowest query (roughly 800ms), a 65% reduction independent of caching. Adding missing database indexes on user + completion date, user + event date, and user + subject dropped individual query times by another 40-60%.
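The change itself is small. Assuming the five fetchers from the earlier sketch, it amounts to something like:

```typescript
declare function fetchQuizHistory(userId: string): Promise<unknown>;
declare function fetchWeakPoints(userId: string): Promise<unknown>;
declare function fetchNotes(userId: string): Promise<unknown>;
declare function fetchCalendarEvents(userId: string): Promise<unknown>;
declare function fetchStudyPlan(userId: string): Promise<unknown>;

async function aggregateContext(userId: string) {
  // All five queries start at once; total time is now the slowest query,
  // not the sum of all five.
  const [quizHistory, weakPoints, notes, calendarEvents, studyPlan] = await Promise.all([
    fetchQuizHistory(userId),
    fetchWeakPoints(userId),
    fetchNotes(userId),
    fetchCalendarEvents(userId),
    fetchStudyPlan(userId),
  ]);
  return { quizHistory, weakPoints, notes, calendarEvents, studyPlan };
}
```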
Before and after
| Metric | Before | After (cache hit) |
|---|---|---|
| Average response time | 5-8 seconds | 1-2 seconds |
| Context loading time | 2-3 seconds | 200-400ms |
| Database queries per request | 5-7 | 0-1 |
| Token usage per request | ~15,000 | ~5,000-8,000 |
| Cache hit rate | 0% | 60-70% |
| Cost per request | $0.15-0.25 | $0.05-0.10 |
Follow-up questions now feel instant. Database load dropped 60%. API costs dropped 40%.
What this taught us
Caching pays for itself fastest at the layer with the most redundancy. For us that was context aggregation: identical context, rebuilt identically, on every request from an active user. The in-memory tier alone solved most of the problem.
Token reduction compounds. Cutting context size by half reduces cost and latency on every single AI call, not just the cached ones. If you are sending large payloads to an LLM, filtering them to what matters is probably the cheapest optimization you can make.
Want the full architecture picture? Read Cereby AI System Design.
