Hierarchical Context Compression: Cutting AI Costs by 90% Without Losing Quality (Part 1)
How we made File Chat affordable to operate by stopping ourselves from sending the whole textbook every time someone asked a question.
The query that cost $1.50
Cereby's File Chat lets students ask questions about uploaded study materials: PDFs, PowerPoints, textbooks spanning hundreds of pages. In the first version, a student who asked "explain the chain rule in chapter 5" got a correct answer. They also triggered this:
A 100-page textbook, sent in full, consumed 150,000+ tokens per question; at the per-token rates we were paying, that is the $1.50 in the heading. Documents over roughly 85 pages hit the model's context limit and failed outright. And on any given question, 80-90% of the pages we sent had nothing to do with what the student asked.
This was not a model problem. It was a naive input problem.
The shape we landed on
The solution has two phases. The first runs once, at upload time. The second runs on every query.
At upload, we preprocess each page and store the results: a 3-5 sentence AI-generated summary, extracted keywords and topics, token counts for both full content and the summary, and a baseline importance score derived from page position, length, and keyword density. This costs roughly $0.01-0.02 for a 100-page document and runs once.
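To make the upload-time pass concrete, here is a minimal sketch of the per-page record. The `summarize` and `count_tokens` callables stand in for an LLM call and a tokenizer; the field names, weights, and exact importance formula are illustrative assumptions, not our production values.

```python
import re
from collections import Counter
from dataclasses import dataclass

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
             "is", "are", "for", "on", "with", "that", "this"}

@dataclass
class PageMetadata:
    page_number: int
    summary: str          # the 3-5 sentence AI-generated summary
    keywords: list[str]
    full_tokens: int
    summary_tokens: int
    importance: float     # baseline score from position, length, keyword density

def extract_keywords(text: str, top_n: int = 10) -> list[str]:
    # Plain frequency analysis: no NLP model, just counts over cleaned words.
    words = [w for w in re.findall(r"[a-z]{4,}", text.lower())
             if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(top_n)]

def preprocess_page(page_number: int, total_pages: int, text: str,
                    summarize, count_tokens) -> PageMetadata:
    """Runs once per page at upload time."""
    keywords = extract_keywords(text)
    summary = summarize(text)  # one cheap LLM call per page
    # Illustrative baseline: page position, page length, and keyword density,
    # each normalized to [0, 1]; production weights will differ.
    position = 1.0 - 0.3 * (page_number / max(total_pages, 1))
    length = min(len(text) / 3000, 1.0)
    density = min(sum(text.lower().count(k) for k in keywords)
                  / max(len(text.split()), 1), 1.0)
    importance = 0.4 * position + 0.3 * length + 0.3 * density
    return PageMetadata(page_number, summary, keywords,
                        count_tokens(text), count_tokens(summary), importance)
```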
At query time, we compress the context in five steps:

1. Extract keywords and topics from the student's question.
2. Score every page's relevance against the question, using the stored keywords and summaries blended with the baseline importance score.
3. Assign each page to a tier: full content, summary only, or omit (steps 2 and 3 are sketched below).
4. Assemble the context within a fixed token budget, in page order, downgrading or dropping pages that do not fit.
5. Append a note listing the page ranges that were omitted.
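The scoring and tier assignment can be as simple as a weighted keyword-overlap score and two cutoffs. This sketch reuses `PageMetadata` and `extract_keywords` from above; the blend weights and thresholds are illustrative, not our production values.

```python
def score_page(meta: PageMetadata, query_keywords: set[str]) -> float:
    # Relevance = keyword overlap between query and page, blended with
    # the baseline importance computed at upload time.
    if not query_keywords:
        return meta.importance
    overlap = len(query_keywords & set(meta.keywords)) / len(query_keywords)
    return 0.7 * overlap + 0.3 * meta.importance

def assign_tiers(pages: list[PageMetadata], query: str,
                 full_cutoff: float = 0.5,
                 summary_cutoff: float = 0.2) -> dict[int, str]:
    query_keywords = set(extract_keywords(query))
    tiers = {}
    for meta in pages:
        s = score_page(meta, query_keywords)
        tiers[meta.page_number] = ("full" if s >= full_cutoff
                                   else "summary" if s >= summary_cutoff
                                   else "omit")
    return tiers
```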
The AI sees the most relevant pages in full, a sentence-level sketch of moderately relevant pages, and nothing from the irrelevant ones. It does not see "the first N pages" or a random sample. It sees what the query actually needed.
The assembled context includes a note listing which page ranges were omitted, so the model knows what it did not see. When a document has no metadata (legacy uploads, processing failures), the system falls back to simple truncation rather than erroring.
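Assembly then walks the pages in order under a token budget and appends the omission note at the end. The budget, labels, and downgrade behavior below are illustrative; a real implementation would also wire in the truncation fallback mentioned above for documents without metadata.

```python
def format_ranges(pages: list[int]) -> str:
    # Collapse sorted page numbers into "3-7, 12" style ranges for the note.
    ranges, start, prev = [], pages[0], pages[0]
    for p in pages[1:]:
        if p == prev + 1:
            prev = p
            continue
        ranges.append(f"{start}-{prev}" if start != prev else str(start))
        start = prev = p
    ranges.append(f"{start}-{prev}" if start != prev else str(start))
    return ", ".join(ranges)

def assemble_context(pages: list[PageMetadata], tiers: dict[int, str],
                     get_page_text, token_budget: int = 8000) -> str:
    parts, omitted, used = [], [], 0
    for meta in sorted(pages, key=lambda m: m.page_number):
        tier = tiers[meta.page_number]
        if tier == "full" and used + meta.full_tokens <= token_budget:
            parts.append(f"[Page {meta.page_number}]\n"
                         f"{get_page_text(meta.page_number)}")
            used += meta.full_tokens
        elif tier != "omit" and used + meta.summary_tokens <= token_budget:
            # A "full" page downgrades to its summary if the budget is tight.
            parts.append(f"[Page {meta.page_number} (summary)]\n{meta.summary}")
            used += meta.summary_tokens
        else:
            omitted.append(meta.page_number)
    if omitted:
        parts.append(f"[Pages {format_ranges(omitted)} omitted as not "
                     "relevant to this question.]")
    return "\n\n".join(parts)
```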
Before and after
| Metric | Before | After |
|---|---|---|
| Token usage per query | 150,000 | 5,000-8,000 (~95% reduction) |
| Cost per query | $1.50-2.00 | $0.05-0.10 (93% reduction) |
| Response time | 15-20 seconds | 2-3 seconds (85% faster) |
| Max document size | ~85 pages | 500+ pages (6x increase) |
| Preprocessing cost (100-page doc) | n/a | ~$0.01-0.02, one time |
Accuracy held. Against full-context responses across representative queries, the compressed version scored 4.6/5 vs 4.7/5 overall. Relevance actually improved (4.7 vs 4.5) because the model was no longer processing noise. Completeness dipped slightly when compression excluded a page that would have added peripheral context. That tradeoff is acceptable at the cost and speed numbers above.
What this taught us
Preprocessing pays for itself after the first query. The ~$0.02 one-time summarization cost is dwarfed by the ~$1.40-1.95 saved on a single query; at scale it is effectively free.
Simple heuristics are enough for structured academic content. We considered more complex NLP models for keyword and relevance scoring. Frequency analysis and pattern matching worked well enough that the added complexity was not justified.
Three tiers is the right number. We tried two (full or omit) and four (adding a partial-content tier). Full, summary, and omit gave the best balance between coverage and token efficiency.
Summaries have to be information-dense. Early summaries from looser prompts were too vague to contribute meaningfully to relevance scoring. Tighter prompts with specific output requirements improved both the scoring and the assembled context.
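For a sense of what "specific output requirements" means here, a prompt in that spirit might look like the following. The wording is a hypothetical reconstruction, not our production prompt.

```python
# Hypothetical upload-time summary prompt in the tighter style described
# above; the exact wording is illustrative.
SUMMARY_PROMPT = """Summarize this textbook page in 3-5 sentences.
- Name every concept, definition, formula, or example the page introduces.
- Use the page's own terminology verbatim so summaries match keyword search.
- No filler like "this page discusses"; every sentence must state a fact.

Page {page_number} of {total_pages}:
{page_text}
"""
```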
Part 2 covers what changed when we moved from the initial implementation to the version running in production today, including the edge cases that only appeared under real student traffic.
