Six Cuts to the Cereby Orchestration Layer: Lazy Context, Smarter Budgets, and Fewer Wasted Calls
How we restructured the core request pipeline to stop fetching data the model never uses, stop re-compressing content it already compressed, and stop treating every model's context window the same.
The work nobody sees
In our system design post we described the pipeline that takes an uploaded file and turns it into compressed, cite-able context. That pipeline works. But "working correctly" hid a separate problem: profiling the request path revealed that the orchestration layer around it was doing a lot of work whose results were immediately thrown away.
Consider a file-chat request before these changes:

1. The context aggregator fires seven parallel database queries to build the full conversational context.
2. The intent classifier runs and identifies a file question.
3. The compressor re-runs the full two-phase compression pipeline on content it already compressed for the previous turn.
4. The file handler builds the prompt and the model answers.

Steps 1 and 3 are the problem. The aggregator fetches data the file handler never uses; the compressor re-does work it already did. The final answer is correct, but the student pays in latency and we pay in API costs. Six targeted changes fix this, all sharing one theme: the cheapest operation is the one you skip.
1. Lazy context fetching: classify first, fetch second
This is the highest-impact change. The intent classifier had a fast Phase 0 path (pure regex, no LLM call) that could detect file intents, but it ran after the context aggregator, so the seven database queries always fired first. We extracted Phase 0 into a standalone method and moved it before context aggregation:
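The sketch below shows the shape of the reordered flow; every identifier in it is an illustrative placeholder rather than the project's actual method names.

```typescript
// Placeholder declarations -- the real services live elsewhere.
declare function classifyPhase0(message: string): string | null;
declare function buildMinimalContext(req: { userId: string; message: string }, intent: string): unknown;
declare function aggregateContext(userId: string): Promise<unknown>;
declare function classifyWithContext(message: string, ctx: unknown): Promise<string>;
declare function routeToHandler(intent: string, ctx: unknown): Promise<unknown>;

// Sketch of the classify-first request path.
async function handleChatRequest(req: { userId: string; message: string }) {
  // Phase 0: pure regex intent detection -- no LLM call, no DB queries.
  const phase0Intent = classifyPhase0(req.message);

  if (phase0Intent !== null) {
    // Fast path: build only the minimal context the handler needs.
    // The seven-query aggregator never fires.
    const minimalContext = buildMinimalContext(req, phase0Intent);
    return routeToHandler(phase0Intent, minimalContext);
  }

  // Phase 0 returned null: the existing flow runs unchanged --
  // full context aggregation, then classification.
  const fullContext = await aggregateContext(req.userId);
  const intent = await classifyWithContext(req.message, fullContext);
  return routeToHandler(intent, fullContext);
}
```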
When Phase 0 resolves the intent (file questions, context-performance queries, casual messages), the controller builds a minimal context object and routes directly to the handler. The seven-query aggregator never fires. When Phase 0 returns null, the existing flow runs unchanged.
The change is gated behind a feature flag with per-request path logging so we can measure the hit rate and detect regressions before removing the flag.
2. Compression result caching
Every follow-up re-ran the full two-phase compression pipeline: chunking, keyword scoring, Phase 1 candidate selection, and embedding API calls for Phase 2 reranking. Three questions about Chapter 7 meant three nearly identical runs.
We added a cache layer keyed on three components:
compression:{contentHash}:{keywordsHash}:{maxTokens}
Content hash is MD5 of the first 2,000 characters plus total length. Keywords hash is MD5 of the sorted significant query keywords; two questions with the same keywords produce the same hash. Max tokens is included because different budgets produce different compressions.
TTL is 10 minutes: long enough to cover a follow-up session, short enough that re-uploads expire naturally. We also bumped the in-memory cache manager from 100 to 200 max entries to accommodate compression results alongside embeddings.
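As a sketch, assuming Node's built-in crypto module, the key construction looks roughly like this; the helper name and argument layout are ours, only the key format comes from the description above.

```typescript
import { createHash } from "node:crypto";

const md5 = (s: string) => createHash("md5").update(s).digest("hex");

// Illustrative key builder matching the scheme described above.
function compressionCacheKey(content: string, keywords: string[], maxTokens: number): string {
  // Content hash: first 2,000 characters plus total length, so a
  // re-uploaded file with different content lands on a different key.
  const contentHash = md5(content.slice(0, 2000) + ":" + content.length);

  // Keyword hash: sorted, so two questions with the same significant
  // keywords hit the same cached compression regardless of word order.
  const keywordsHash = md5([...keywords].sort().join(","));

  // Different budgets produce different compressions, so maxTokens is
  // part of the key rather than a lookup-time filter.
  return `compression:${contentHash}:${keywordsHash}:${maxTokens}`;
}
```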
3. Model-aware token budgets
The file context budget was hardcoded at 8,000 tokens, a number chosen for 32K-context GPT-4. Students on Claude Sonnet (200K) or Gemini Flash (1M) got the same window as students on GPT-4o-mini. We wired an existing gateway helper into the file context manager:
effectiveBudget = min(16,000, max(8,000, availableTokens * 0.15))
| Model | Available tokens | File budget |
|---|---|---|
| GPT-4o (128K) | 112,000 | 16,000 |
| Claude Sonnet (200K) | 184,000 | 16,000 |
| Gemini Flash (1M) | 984,000 | 16,000 (capped) |
| Unknown / fallback (32K) | 16,000 | 8,000 (unchanged) |
The 16,000 cap prevents runaway token usage on large-context models. The 8,000 floor preserves existing behavior when no model ID is available.
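As a sketch (the function name is ours; the constants come from the formula above):

```typescript
// Model-aware file-context budget: 15% of the model's available tokens,
// floored at the old 8,000 default and capped at 16,000.
function fileContextBudget(availableTokens?: number): number {
  if (availableTokens == null) {
    // No model ID available: preserve the existing fixed budget.
    return 8000;
  }
  return Math.min(16000, Math.max(8000, Math.floor(availableTokens * 0.15)));
}

// fileContextBudget(112_000) === 16000  // GPT-4o: 15% would be 16,800, capped
// fileContextBudget(184_000) === 16000  // Claude Sonnet: capped
// fileContextBudget(16_000)  === 8000   // 32K fallback: floor applies
```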
4. Query-weighted cross-file budget allocation
The old multi-file logic split the budget equally: floor(totalBudget / fileCount). A textbook and an unrelated syllabus each got 4,000 tokens regardless of which one the question was about.
The compressor already had a private relevance scoring method (keyword overlap, scoring 0.3 to 1.0) used only in a less-common path. We exposed it and rewired the main file context builder:
For each file:
score = relevanceScore(file, query) // 0.3 – 1.0
allocation = max(400, floor(totalBudget * score / totalScore))
With two files scoring 0.9 and 0.3 against a 16,000-token budget, the textbook gets 12,000 and the syllabus gets 4,000. Single-file queries are unaffected. The 400-token floor guarantees every file gets at least a minimal representation.
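A sketch of the split, assuming the relevance scores are already computed; the types and names are illustrative:

```typescript
interface ScoredFile {
  id: string;
  relevance: number; // keyword-overlap score, 0.3 to 1.0
}

// Proportional allocation with a 400-token floor per file.
function allocateFileBudgets(files: ScoredFile[], totalBudget: number): Map<string, number> {
  const totalScore = files.reduce((sum, f) => sum + f.relevance, 0);
  const allocations = new Map<string, number>();
  for (const f of files) {
    allocations.set(f.id, Math.max(400, Math.floor((totalBudget * f.relevance) / totalScore)));
  }
  // When many files hit the 400-token floor, the sum can slightly exceed
  // totalBudget -- see "Things that bit us" below.
  return allocations;
}

// allocateFileBudgets(
//   [{ id: "textbook", relevance: 0.9 }, { id: "syllabus", relevance: 0.3 }],
//   16000,
// ) // -> textbook: 12000, syllabus: 4000
```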
5. Prefetch embeddings on upload
The first question about a newly uploaded file paid the full embedding API latency. For a 50-page PDF with 200 paragraph-level chunks, that is 200 embeddings to compute; even batched, the calls add around 500ms.
We added a fire-and-forget background task to the parse endpoint: after pages are extracted, a second task splits them into paragraph chunks and batch-embeds in groups of 50. By the time the student types their first question, the embeddings are already cached.
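A sketch of the fire-and-forget hook, with the chunking and embedding helpers passed in; all names here are assumptions, not the actual service code:

```typescript
// Kick off embedding prefetch after parsing without blocking the parse
// response; a failure only means the first query embeds on demand.
function prefetchEmbeddings(
  pages: string[],
  embedBatch: (chunks: string[]) => Promise<void>,
): void {
  void (async () => {
    try {
      const chunks = pages.flatMap(splitIntoParagraphChunks);
      // Batch-embed in groups of 50, as described above.
      for (let i = 0; i < chunks.length; i += 50) {
        await embedBatch(chunks.slice(i, i + 50));
      }
    } catch (err) {
      console.warn("Embedding prefetch failed; falling back to on-demand embedding", err);
    }
  })();
}

// Illustrative paragraph chunker: split pages on blank lines.
function splitIntoParagraphChunks(page: string): string[] {
  return page.split(/\n\s*\n/).filter((p) => p.trim().length > 0);
}
```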
While investigating, we found a || vs ?? bug: the embedding service set ttl=0 intending "permanent," but the cache manager treated 0 as falsy and fell back to the 5-minute default. Embeddings had been silently expiring for months. We fixed it with nullish coalescing and a permanent-entry path where ttl === 0 sets expiry to infinity.
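The shape of the fix, with illustrative names standing in for the cache manager internals:

```typescript
const DEFAULT_TTL_MS = 5 * 60 * 1000;

// Before: `ttlMs || DEFAULT_TTL_MS` treated 0 as falsy, so a ttl of 0
// ("store permanently") silently fell back to the five-minute default.
function computeExpiryBuggy(ttlMs?: number): number {
  return Date.now() + (ttlMs || DEFAULT_TTL_MS);
}

// After: nullish coalescing only substitutes the default for null or
// undefined, and ttl === 0 becomes a genuinely permanent entry.
function computeExpiry(ttlMs?: number): number {
  const effectiveTtl = ttlMs ?? DEFAULT_TTL_MS;
  return effectiveTtl === 0 ? Number.POSITIVE_INFINITY : Date.now() + effectiveTtl;
}
```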
6. Options object for the request handler
The main controller method took 20 positional parameters. Every call site was a wall of undefined placeholders (14 in the quiz tutor path alone):
// Before: count the undefined placeholders
handle(userId, message, model, false,
undefined, history, undefined, undefined, false,
undefined, style, tone, true, undefined,
undefined, undefined, undefined, undefined,
{ prompt }, undefined)
// After: named fields, omit what you don't need
handle({
userId, message, model,
useCache: false,
history,
responseStyle: style,
tone,
textOnly: true,
tutorOptions: { prompt },
})
Three call sites updated; the method body is identical. Zero behavioral change, but it made the lazy context restructuring significantly cleaner.
Things that bit us
The embedding TTL bug had been quietly hurting every file-chat flow, so the fix improved cache hit rates across all of them, not just the prefetch path.

For proportional allocation, when many files hit the 400-token floor, the sum of allocations can slightly exceed the total budget. At two to five files this is negligible; at twenty or more, the floor guarantee matters more than strict budget adherence.
Before and after
| Dimension | Before | After |
|---|---|---|
| DB queries on file-chat | 7 parallel queries per request | 0 queries when Phase 0 resolves file intent |
| Compression on follow-ups | Full two-phase pipeline every turn | Cache hit when query keywords overlap (10-min TTL) |
| File budget on GPT-4o | Fixed 8,000 tokens | 16,000 tokens (model-aware, capped) |
| Multi-file allocation | Equal split (50/50) | Proportional by query relevance (roughly 75/25 in typical use) |
| First-query embedding latency | ~500ms (on-demand) | ~0ms (prefetched at upload) |
| Request handler call sites | 20 positional params, undefined walls | Typed options object, omit what you do not need |
What this taught us
The cheapest query is the one you skip. The context aggregator was well-optimized (parallel queries, 5-minute cache), but for file-chat, the right move was not faster aggregation. It was no aggregation at all. Classification is cheaper than fetching, and it tells you whether fetching is needed.
Cache the output, not just the input. We already cached embeddings. The expensive operation was not computing one embedding; it was the full compression pipeline that uses embeddings. Caching the final compressed string, keyed on the query keywords, eliminated the entire pipeline on follow-ups.
Budgets are product decisions with model-shaped inputs. 8,000 tokens was right for 32K models. It is wrong for 1M models. The budget should scale with the model, but with a cap: students do not benefit from 150,000 tokens of file context when attention degrades past 16,000.
Refactor the signature before restructuring the body. The options object change was zero behavioral change, but it made the lazy context restructuring significantly easier. Mechanical refactors that reduce coupling pay forward.
What's next
Persistent embedding storage is the most important open item. The in-memory cache loses prefetched embeddings on every server restart, so moving to a database or Redis would make the prefetch benefit durable.
Each file is still compressed independently, even when several are pinned. A shared retrieval index across all pinned files would let the compressor allocate tokens to the most relevant chunks regardless of which file they live in.
Casual messages ("hi", "thanks") still trigger full aggregation because the casual handler uses context for calendar events. A lightweight path that skips the aggregator there would close the last class of wasted queries.
