Query-Aware Smart Compression: Solving the Single-Page Truncation Problem (part 2)
How a simple cut-at-N-tokens approach became a scoring pipeline that finds the right content instead of just the first content.
When the first fix reveals the next problem
Our Hierarchical Context Compression post described how we cut token usage by 92% by choosing which pages to include from a large document. Select the three to five most relevant pages, include them in full, skip the rest. That works well when content is spread across many pages.
Then a student uploaded a research paper with a dense single-column layout. Page 12 was a 15,000-token literature review covering dozens of studies and findings. She asked which studies supported the connection between sleep deprivation and cognitive decline. Our system flagged page 12 as highly relevant, allocated it 8,000 tokens, and then did what our code had always done with overflow: it took the first 8,000 tokens and discarded the rest.
The answer to her question was in tokens 10,000 to 12,000. We sent the model everything except that.
Why simple truncation fails
When a page exceeds its token budget, a position-biased cut looks correct from the outside: the model receives a complete block of text, produces a fluent answer, and nothing in the response signals that anything was missing. The user only discovers the problem if she already knows the answer.
Across 1,000 queries against documents with oversized pages, 23% were silently degraded: of those, 18% produced outright wrong answers and 64% produced incomplete ones, and user satisfaction dropped from 4.6 to 2.8. The cut ignored what the user actually asked. Early sections got included by default; later sections got dropped by default.
The shape we landed on
The key decision: select by relevance score, but output in original document order. The model sees coherent prose, not a relevance-ranked scramble.
How the pieces work
Chunking splits on paragraph boundaries first (200 to 500 tokens each), falling back to sentence grouping for dense pages with no breaks. Selection is a greedy pass: sort by score, add chunks until the token budget runs out. We tested more complex optimization approaches and the quality difference did not justify the overhead.
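Below is a minimal sketch of the chunking and greedy selection, assuming a whitespace-based `count_tokens` stand-in; the function names and exact packing rules are illustrative, not our production code.

```python
import re

def count_tokens(text):
    # Stand-in tokenizer: whitespace split. A real tokenizer would be used in practice.
    return len(text.split())

def chunk_page(text, max_tokens=500):
    """Split on paragraph boundaries first; fall back to sentence grouping
    when a paragraph is too dense to fit in a single chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        # Oversized paragraphs get split into sentences before packing.
        units = [para] if count_tokens(para) <= max_tokens else re.split(r"(?<=[.!?])\s+", para)
        for unit in units:
            n = count_tokens(unit)
            if current and current_len + n > max_tokens:
                chunks.append(" ".join(current))
                current, current_len = [], 0
            current.append(unit)
            current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def select_chunks(scored_chunks, token_budget):
    """Greedy pass: sort by score, add chunks until the token budget runs out.
    scored_chunks is a list of (original_index, text, score) tuples."""
    selected, used = [], 0
    for idx, text, score in sorted(scored_chunks, key=lambda c: c[2], reverse=True):
        cost = count_tokens(text)
        if used + cost <= token_budget:
            selected.append((idx, text, score))
            used += cost
    return selected
```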
Scoring is where the interesting trade-off lives. Each chunk gets a relevance score from three weighted components: keyword overlap with the query (60%), topic matching via case-insensitive substring comparison (30%), and a small position bias for the first and last chunks (10%). That 10% position component is not a crutch: it reflects something real about how documents are organized, and it acts as a tie-breaker when keyword evidence is thin.
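A sketch of that scoring, with the weights from above; the keyword and topic heuristics are simplified stand-ins, and `topics` is assumed to be a list of topic strings produced elsewhere.

```python
def score_chunk(chunk, query, topics, index, total_chunks):
    """Relevance = 60% keyword overlap + 30% topic match + 10% position bias."""
    chunk_lower = chunk.lower()
    query_words = {w for w in query.lower().split() if len(w) > 3}

    # Keyword overlap: fraction of query words that appear in the chunk.
    keyword_score = (
        sum(1 for w in query_words if w in chunk_lower) / len(query_words)
        if query_words else 0.0
    )

    # Topic match: case-insensitive substring comparison against known topics.
    topic_score = (
        sum(1 for t in topics if t.lower() in chunk_lower) / len(topics)
        if topics else 0.0
    )

    # Small bias toward the first and last chunks, which often carry framing
    # and conclusions; in practice it mostly breaks ties.
    position_score = 1.0 if index in (0, total_chunks - 1) else 0.0

    return 0.6 * keyword_score + 0.3 * topic_score + 0.1 * position_score
```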
Assembly re-sorts selected chunks by original position, then joins them with gap markers.
```
[Content compressed: Showing 8/25 most relevant sections]
Introduction text...
[... 4 sections omitted ...]
Relevant content about sleep deprivation and cognitive decline...
[... 2 sections omitted ...]
More relevant findings...
[... 10 sections omitted ...]
```
The gap markers matter. Without them, the model tries to infer connections across invisible gaps, which is one of the cleaner paths to hallucination.
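Here is a minimal sketch of the assembly step, reusing the `(index, text, score)` tuples from the selection sketch above; the marker wording is illustrative.

```python
def assemble(selected, total_chunks):
    """Re-sort selected chunks by original position, then join them with
    gap markers so the model knows where content was omitted."""
    selected = sorted(selected, key=lambda c: c[0])  # back to document order
    parts = [f"[Content compressed: Showing {len(selected)}/{total_chunks} most relevant sections]"]
    prev_idx = -1
    for idx, text, _score in selected:
        gap = idx - prev_idx - 1
        if gap > 0:
            parts.append(f"[... {gap} sections omitted ...]")
        parts.append(text)
        prev_idx = idx
    trailing = total_chunks - 1 - prev_idx
    if trailing > 0:
        parts.append(f"[... {trailing} sections omitted ...]")
    return "\n\n".join(parts)
```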
What the numbers showed
| Metric | Old truncation | Query-aware | Change |
|---|---|---|---|
| Correct answers | 52% | 94% | +42 points |
| Partial answers | 31% | 5% | -26 points |
| Failed answers | 17% | 1% | -16 points |
| Avg relevance score | 0.41 | 0.89 | +117% |
| User satisfaction | 2.8 / 5 | 4.7 / 5 | +68% |
| Token efficiency | 63% | 94% | +31 points |
The concrete test case: a 25,000-token research paper, a question about methodology. The old approach included the first 8,000 tokens (introduction and background) and missed the methodology section entirely. The new approach identified the methodology chunks as high-relevance, included 7 chunks totaling 7,550 tokens, and answered correctly.
Two edge cases came up during the build. Some pages have no natural breaks at all. For those, the sentence-based fallback takes over. And when query keywords do not appear in any chunk (usually because the query is phrased very differently from the document), the system falls back to position-based sampling as a floor.
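A sketch of that position-based floor, reusing `count_tokens` from the chunking sketch; the sampling stride here is an illustration, not the production rule.

```python
def fallback_selection(chunks, token_budget):
    """When no chunk matches the query keywords, sample chunks evenly across
    the page so coverage is spread out rather than front-loaded."""
    selected, used = [], 0
    step = max(1, len(chunks) // 8)
    for idx in range(0, len(chunks), step):
        cost = count_tokens(chunks[idx])
        if used + cost > token_budget:
            break
        selected.append((idx, chunks[idx], 0.0))
        used += cost
    return selected
```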
What this taught us
The failure mode that cost the most was the silent one. Answers that are wrong in obvious ways get reported. Answers that are fluent but incomplete, because the relevant section sat 2,000 tokens past the cut, do not, and that is the harder class of bug to catch without measuring for it deliberately.
Respecting semantic boundaries matters more than tight token accounting. Cutting mid-paragraph to hit an exact budget costs more in coherence than it saves in tokens. The same logic applies to selection: a greedy pass over scored chunks turned out to be good enough, and the latency savings over a more involved optimization were real.
What's next
The keyword scorer is fast and interpretable, but it misses synonyms and paraphrases. The next version replaces lexical scoring with embedding-based semantic similarity, which should close the gap when the query and the relevant content use different vocabulary for the same concept.
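For a sense of the direction, here is a hedged sketch of embedding-based scoring; sentence-transformers and the model name are placeholders we have not committed to, not part of the current system.

```python
# Assumption: sentence-transformers as one possible backend; nothing here is final.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_scores(chunks, query):
    """Score each chunk by cosine similarity to the query embedding, so
    paraphrases and synonyms count even when no keywords overlap."""
    query_emb = model.encode(query, convert_to_tensor=True)
    chunk_embs = model.encode(chunks, convert_to_tensor=True)
    return util.cos_sim(query_emb, chunk_embs)[0].tolist()
```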
Query-aware compression handles oversized individual pages. Hierarchical Context Compression handles page selection across large documents. Cereby AI System Design covers how both fit into the broader orchestration layer.
