Six Cuts to the Cereby Orchestration Layer: Lazy Context, Smarter Budgets, and Fewer Wasted Calls
How we restructured the core request pipeline to stop fetching data the model never uses, stop re-compressing content it already compressed, and stop treating every model's context window the same.
The work nobody sees
In our system design post we described the pipeline that takes an uploaded file and turns it into compressed, cite-able context. That pipeline works. But "working correctly" hid a separate problem: profiling the request path revealed that the orchestration layer around it was doing a lot of work whose results were immediately thrown away.
Consider a file-chat request before these changes:

1. The context aggregator fires seven parallel database queries to build the full conversational context.
2. The intent classifier runs and identifies a file question.
3. The compressor re-runs the full two-phase compression pipeline on content it already compressed for the previous turn.
4. The file handler builds the prompt and the model answers.

Steps 1 and 3 are the problem. The aggregator fetches data the file handler never uses; the compressor re-does work it already did. The final answer is correct, but the student pays in latency and we pay in API costs. Six targeted changes fix this, all sharing one theme: the cheapest operation is the one you skip.
1. Lazy context fetching: classify first, fetch second
This is the highest-impact change. The intent classifier had a fast Phase 0 path (pure regex, no LLM call) that could detect file intents, but it ran after the context aggregator, so the seven database queries always fired first. We extracted Phase 0 into a standalone method and moved it before context aggregation:
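The sketch below shows the shape of the reordered flow; every identifier in it is an illustrative placeholder rather than the project's actual method names.

```typescript
// Placeholder declarations -- the real services live elsewhere.
declare function classifyPhase0(message: string): string | null;
declare function buildMinimalContext(req: { userId: string; message: string }, intent: string): unknown;
declare function aggregateContext(userId: string): Promise<unknown>;
declare function classifyWithContext(message: string, ctx: unknown): Promise<string>;
declare function routeToHandler(intent: string, ctx: unknown): Promise<unknown>;

// Sketch of the classify-first request path.
async function handleChatRequest(req: { userId: string; message: string }) {
  // Phase 0: pure regex intent detection -- no LLM call, no DB queries.
  const phase0Intent = classifyPhase0(req.message);

  if (phase0Intent !== null) {
    // Fast path: build only the minimal context the handler needs.
    // The seven-query aggregator never fires.
    const minimalContext = buildMinimalContext(req, phase0Intent);
    return routeToHandler(phase0Intent, minimalContext);
  }

  // Phase 0 returned null: the existing flow runs unchanged --
  // full context aggregation, then classification.
  const fullContext = await aggregateContext(req.userId);
  const intent = await classifyWithContext(req.message, fullContext);
  return routeToHandler(intent, fullContext);
}
```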
When Phase 0 resolves the intent (file questions, context-performance queries, casual messages), the controller builds a minimal context object and routes directly to the handler. The seven-query aggregator never fires. When Phase 0 returns null, the existing flow runs unchanged.
The change is gated behind a feature flag with per-request path logging so we can measure the hit rate and detect regressions before removing the flag.
2. Compression result caching
Every follow-up re-ran the full two-phase compression pipeline: chunking, keyword scoring, Phase 1 candidate selection, and embedding API calls for Phase 2 reranking. Three questions about Chapter 7 meant three nearly identical runs.
We added a cache layer keyed on three components:
compression:{contentHash}:{keywordsHash}:{maxTokens}
Content hash is MD5 of the first 2,000 characters plus total length. Keywords hash is MD5 of the sorted significant query keywords; two questions with the same keywords produce the same hash. Max tokens is included because different budgets produce different compressions.
TTL is 10 minutes: long enough to cover a follow-up session, short enough that re-uploads expire naturally. We also bumped the in-memory cache manager from 100 to 200 max entries to accommodate compression results alongside embeddings.
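As a sketch, assuming Node's built-in crypto module, the key construction looks roughly like this; the helper name and argument layout are ours, only the key format comes from the description above.

```typescript
import { createHash } from "node:crypto";

const md5 = (s: string) => createHash("md5").update(s).digest("hex");

// Illustrative key builder matching the scheme described above.
function compressionCacheKey(content: string, keywords: string[], maxTokens: number): string {
  // Content hash: first 2,000 characters plus total length, so a
  // re-uploaded file with different content lands on a different key.
  const contentHash = md5(content.slice(0, 2000) + ":" + content.length);

  // Keyword hash: sorted, so two questions with the same significant
  // keywords hit the same cached compression regardless of word order.
  const keywordsHash = md5([...keywords].sort().join(","));

  // Different budgets produce different compressions, so maxTokens is
  // part of the key rather than a lookup-time filter.
  return `compression:${contentHash}:${keywordsHash}:${maxTokens}`;
}
```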
3. Model-aware token budgets
The file context budget was hardcoded at 8,000 tokens, a number chosen for 32K-context GPT-4. Students on Claude Sonnet (200K) or Gemini Flash (1M) got the same window as students on GPT-4o-mini. We wired an existing gateway helper into the file context manager:
effectiveBudget = min(16,000, max(8,000, availableTokens * 0.15))
| Model | Available tokens | File budget |
|---|---|---|
| GPT-4o (128K) | 112,000 | 16,000 |
| Claude Sonnet (200K) | 184,000 | 16,000 |
| Gemini Flash (1M) | 984,000 | 16,000 (capped) |
| Unknown / fallback (32K) | 16,000 | 8,000 (unchanged) |
The 16,000 cap prevents runaway token usage on large-context models. The 8,000 floor preserves existing behavior when no model ID is available.
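As a sketch (the function name is ours; the constants come from the formula above):

```typescript
// Model-aware file-context budget: 15% of the model's available tokens,
// floored at the old 8,000 default and capped at 16,000.
function fileContextBudget(availableTokens?: number): number {
  if (availableTokens == null) {
    // No model ID available: preserve the existing fixed budget.
    return 8000;
  }
  return Math.min(16000, Math.max(8000, Math.floor(availableTokens * 0.15)));
}

// fileContextBudget(112_000) === 16000  // GPT-4o: 15% would be 16,800, capped
// fileContextBudget(184_000) === 16000  // Claude Sonnet: capped
// fileContextBudget(16_000)  === 8000   // 32K fallback: floor applies
```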
4. Query-weighted cross-file budget allocation
The old multi-file logic split the budget equally: floor(totalBudget / fileCount). A textbook and an unrelated syllabus each got 4,000 tokens regardless of which one the question was about.
The compressor already had a private relevance scoring method (keyword overlap, scoring 0.3 to 1.0) used only in a less-common path. We exposed it and rewired the main file context builder:
For each file:
score = relevanceScore(file, query) // 0.3 – 1.0
allocation = max(400, floor(totalBudget * score / totalScore))
With two files scoring 0.9 and 0.3 against a 16,000-token budget, the textbook gets 12,000 and the syllabus gets 4,000. Single-file queries are unaffected. The 400-token floor guarantees every file gets at least a minimal representation.
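A sketch of the split, assuming the relevance scores are already computed; the types and names are illustrative:

```typescript
interface ScoredFile {
  id: string;
  relevance: number; // keyword-overlap score, 0.3 to 1.0
}

// Proportional allocation with a 400-token floor per file.
function allocateFileBudgets(files: ScoredFile[], totalBudget: number): Map<string, number> {
  const totalScore = files.reduce((sum, f) => sum + f.relevance, 0);
  const allocations = new Map<string, number>();
  for (const f of files) {
    allocations.set(f.id, Math.max(400, Math.floor((totalBudget * f.relevance) / totalScore)));
  }
  // When many files hit the 400-token floor, the sum can slightly exceed
  // totalBudget -- see "Things that bit us" below.
  return allocations;
}

// allocateFileBudgets(
//   [{ id: "textbook", relevance: 0.9 }, { id: "syllabus", relevance: 0.3 }],
//   16000,
// ) // -> textbook: 12000, syllabus: 4000
```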
5. Prefetch embeddings on upload
The first question about a newly uploaded file paid the full embedding API latency. For a 50-page PDF with 200 paragraph-level chunks, that is 200 embeddings to compute; even batched, the calls add around 500ms.
We added a fire-and-forget background task to the parse endpoint: after pages are extracted, a second task splits them into paragraph chunks and batch-embeds in groups of 50. By the time the student types their first question, the embeddings are already cached.
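A sketch of the fire-and-forget hook, with the chunking and embedding helpers passed in; all names here are assumptions, not the actual service code:

```typescript
// Kick off embedding prefetch after parsing without blocking the parse
// response; a failure only means the first query embeds on demand.
function prefetchEmbeddings(
  pages: string[],
  embedBatch: (chunks: string[]) => Promise<void>,
): void {
  void (async () => {
    try {
      const chunks = pages.flatMap(splitIntoParagraphChunks);
      // Batch-embed in groups of 50, as described above.
      for (let i = 0; i < chunks.length; i += 50) {
        await embedBatch(chunks.slice(i, i + 50));
      }
    } catch (err) {
      console.warn("Embedding prefetch failed; falling back to on-demand embedding", err);
    }
  })();
}

// Illustrative paragraph chunker: split pages on blank lines.
function splitIntoParagraphChunks(page: string): string[] {
  return page.split(/\n\s*\n/).filter((p) => p.trim().length > 0);
}
```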
While investigating, we found a || vs ?? bug: the embedding service set ttl=0 intending "permanent," but the cache manager treated 0 as falsy and fell back to the 5-minute default. Embeddings had been silently expiring for months. We fixed it with nullish coalescing and a permanent-entry path where ttl === 0 sets expiry to infinity.
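The shape of the fix, with illustrative names standing in for the cache manager internals:

```typescript
const DEFAULT_TTL_MS = 5 * 60 * 1000;

// Before: `ttlMs || DEFAULT_TTL_MS` treated 0 as falsy, so a ttl of 0
// ("store permanently") silently fell back to the five-minute default.
function computeExpiryBuggy(ttlMs?: number): number {
  return Date.now() + (ttlMs || DEFAULT_TTL_MS);
}

// After: nullish coalescing only substitutes the default for null or
// undefined, and ttl === 0 becomes a genuinely permanent entry.
function computeExpiry(ttlMs?: number): number {
  const effectiveTtl = ttlMs ?? DEFAULT_TTL_MS;
  return effectiveTtl === 0 ? Number.POSITIVE_INFINITY : Date.now() + effectiveTtl;
}
```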
6. Options object for the request handler
The main controller method took 20 positional parameters. Every call site was a wall of undefined placeholders (14 in the quiz tutor path alone):
// Before: count the undefined placeholders
handle(userId, message, model, false,
undefined, history, undefined, undefined, false,
undefined, style, tone, true, undefined,
undefined, undefined, undefined, undefined,
{ prompt }, undefined)
// After: named fields, omit what you don't need
handle({
userId, message, model,
useCache: false,
history,
responseStyle: style,
tone,
textOnly: true,
tutorOptions: { prompt },
})
Three call sites updated; the method body is identical. Zero behavioral change, but it made the lazy context restructuring significantly cleaner.
Things that bit us
The embedding TTL bug had been quietly hurting every file-chat flow, so the fix improved cache hit rates across all of them, not just the prefetch path.

For proportional allocation, when many files hit the 400-token floor, the sum of allocations can slightly exceed the total budget. At two to five files this is negligible; at twenty or more, the floor guarantee matters more than strict budget adherence.
Before and after
| Dimension | Before | After |
|---|---|---|
| DB queries on file-chat | 7 parallel queries per request | 0 queries when Phase 0 resolves file intent |
| Compression on follow-ups | Full two-phase pipeline every turn | Cache hit when query keywords overlap (10-min TTL) |
| File budget on GPT-4o | Fixed 8,000 tokens | 16,000 tokens (model-aware, capped) |
| Multi-file allocation | Equal split (50/50) | Proportional by query relevance (roughly 75/25 in typical use) |
| First-query embedding latency | ~500ms (on-demand) | ~0ms (prefetched at upload) |
| Request handler call sites | 20 positional params, undefined walls | Typed options object, omit what you do not need |
What this taught us
The cheapest query is the one you skip. The context aggregator was well-optimized (parallel queries, 5-minute cache), but for file-chat, the right move was not faster aggregation. It was no aggregation at all. Classification is cheaper than fetching, and it tells you whether fetching is needed.
Cache the output, not just the input. We already cached embeddings. The expensive operation was not computing one embedding; it was the full compression pipeline that uses embeddings. Caching the final compressed string, keyed on the query keywords, eliminated the entire pipeline on follow-ups.
Budgets are product decisions with model-shaped inputs. 8,000 tokens was right for 32K models. It is wrong for 1M models. The budget should scale with the model, but with a cap: students do not benefit from 150,000 tokens of file context when attention degrades past 16,000.
Refactor the signature before restructuring the body. The options object change was zero behavioral change, but it made the lazy context restructuring significantly easier. Mechanical refactors that reduce coupling pay forward.
What's next
Persistent embedding storage is the most important open item. The in-memory cache loses prefetched embeddings on every server restart, so moving to a database or Redis would make the prefetch benefit durable.
Each file is still compressed independently, even when several are pinned. A shared retrieval index across all pinned files would let the compressor allocate tokens to the most relevant chunks regardless of which file they live in.
Casual messages ("hi", "thanks") still trigger full aggregation because the casual handler uses context for calendar events. A lightweight path that skips the aggregator there would close the last class of wasted queries.
