Accurate Citations with Compressed Context: A Two-Stage Verification System
How we made aggressive compression and accurate citations work at the same time.
A compression success that created a citation problem
Our previous work on hierarchical and query-aware compression showed that we could reduce a 150,000-token document to around 8,000 tokens without meaningfully degrading the quality of AI responses. That solved the cost and latency problem. It immediately created a different one.
When a student asks about cellular respiration and the AI quotes the textbook, someone needs to know which page, even though the AI only saw pages 45 and 46 out of a hundred. If the cited quote does not actually appear on those pages, the whole system becomes untrustworthy. An incorrect citation in an educational context is not a minor inconvenience. It can support an accusation of academic dishonesty, cause a student to cite the wrong source in a paper, or simply destroy trust in the product the first time someone checks.
The obvious fix, giving the AI the full document, defeats the compression work entirely. So we built something else.
The two-stage architecture
The core idea is a clean separation: the AI generates responses from compressed content, and a separate verification pass checks every extracted quote against the full original document stored in the database. The AI never needs to see the whole file. The citation, though, is always validated against it.
The AI's response goes out the door only after every quote has a confirmed location in the source. Quotes the system cannot locate are marked unverified rather than silently dropped or fabricated.
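To make the flow concrete, here is a minimal sketch of that verification gate in Python. Every function it calls (load_compressed_context, generate_response, load_full_document, extract_quotes, locate_in_source) is a hypothetical placeholder standing in for a pipeline stage, not the name of anything in our codebase.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    quote: str
    location: str | None  # e.g. "page 45", a timestamp, or a section heading
    verified: bool

def answer_with_citations(query: str, doc_id: str) -> tuple[str, list[Citation]]:
    # Stage 1: generate from the compressed context only (roughly 5-8% of the original).
    compressed = load_compressed_context(doc_id, query)  # hypothetical helper
    response = generate_response(query, compressed)      # hypothetical helper

    # Stage 2: verify every extracted quote against the full original document,
    # which lives in the database and was never shown to the model.
    full_document = load_full_document(doc_id)           # hypothetical helper
    citations = []
    for quote in extract_quotes(response):
        location = locate_in_source(quote, full_document)
        # Quotes the search cannot locate are flagged, never silently dropped.
        citations.append(Citation(quote, location, verified=location is not None))
    return response, citations
```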
How each stage works
The compression pipeline feeds the AI a context that is typically 5-8% of the original document: for a 100-page biology textbook and the query "explain cellular respiration," that might mean pages 45 and 46 in full plus summaries of surrounding chapters. The system prompt instructs the model to mark its quotes explicitly, which makes extraction in stage two more reliable.
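For illustration, the quote-marking convention could look something like the snippet below. The instruction wording and the build_stage_one_prompt helper are assumptions made for this sketch, not the production prompt.

```python
# Hypothetical wording; the production prompt is not reproduced in this post.
QUOTE_MARKING_INSTRUCTION = (
    "When you quote the source material, reproduce it word-for-word and wrap it "
    "in double quotation marks. Never put paraphrased text inside quotation marks."
)

def build_stage_one_prompt(compressed_context: str, query: str) -> str:
    # The model only ever sees the compressed context, never the full document.
    return (
        f"{QUOTE_MARKING_INSTRUCTION}\n\n"
        f"Context:\n{compressed_context}\n\n"
        f"Question: {query}"
    )
```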
Stage two takes the response text and pattern-matches quoted spans, then searches the full original document retrieved from the database. For a PDF that means recording a page number; for video it means a timestamp; for a web article a section heading. A concrete example: the AI response contains "converts glucose into ATP energy." Stage two searches the full 150,000-token original, finds that phrase on page 45, and attaches page: 45 to the citation. The AI never had access to pages 1 through 44 or 47 through 100.
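A possible implementation of the extraction and page lookup, assuming the original PDF text is stored page by page; the regex and the pages mapping are illustrative, not our actual schema.

```python
import re

def extract_quotes(response_text: str) -> list[str]:
    # Pull out spans the model wrapped in double quotation marks
    # (10+ characters, to skip incidental quoted words).
    return [m.strip() for m in re.findall(r'"([^"]{10,})"', response_text)]

def locate_page(quote: str, pages: dict[int, str]) -> int | None:
    # pages maps page number -> extracted text of that page in the full original.
    for page_number, page_text in pages.items():
        if quote in page_text:
            return page_number
    return None
```

Run against the example above, locate_page("converts glucose into ATP energy", pages) returns 45 when that phrase appears in the stored text for page 45, regardless of what the model was shown.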
Fallback strategies
Pattern matching on quoted text works most of the time. When it does not, the system falls back to phrase-based extraction: key phrases from the AI response are searched against the full document, and the best matches become citations with a lower confidence score so downstream systems and users can see the difference. A normalization pass also reconciles minor punctuation or whitespace differences before the system gives up on an exact match.
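Here is one way the normalization pass and the lower-confidence fallback could fit together. The same lookup applies whether the candidate text is a marked quote or a key phrase pulled from an unmarked response; the confidence values (1.0, 0.9, 0.6) and the five-word phrase window are assumptions for the sketch, not our tuned thresholds.

```python
import re

# Punctuation that often differs between model output and extracted source text
# (curly vs. straight quotes, dashes).
_PUNCT = re.compile("[\u2018\u2019\u201c\u201d'\".,;:\u2013\u2014-]")

def normalize(text: str) -> str:
    # Lower-case, replace volatile punctuation, and collapse whitespace.
    return re.sub(r"\s+", " ", _PUNCT.sub(" ", text.lower())).strip()

def locate_with_fallback(candidate: str, pages: dict[int, str]) -> tuple[int | None, float]:
    # 1. Exact match against the raw page text.
    for page_number, page_text in pages.items():
        if candidate in page_text:
            return page_number, 1.0

    # 2. Exact match after normalizing punctuation and whitespace on both sides.
    norm = normalize(candidate)
    for page_number, page_text in pages.items():
        if norm in normalize(page_text):
            return page_number, 0.9

    # 3. Phrase fallback: count five-word windows from the candidate that appear
    #    on each page and take the best-covered page at reduced confidence.
    words = norm.split()
    best_page, best_hits = None, 0
    for page_number, page_text in pages.items():
        page_norm = normalize(page_text)
        hits = sum(1 for i in range(max(0, len(words) - 4))
                   if " ".join(words[i:i + 5]) in page_norm)
        if hits > best_hits:
            best_page, best_hits = page_number, hits
    return (best_page, 0.6) if best_hits else (None, 0.0)
```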
Performance on real traffic
We validated citation accuracy on 5,000 production responses. Extraction completes in under one second, even for 200-page documents.
| Metric | Result |
|---|---|
| Correct page numbers | 96.3% |
| Correct timestamps | 94.8% |
| Correct sections | 97.1% |
| Quote accuracy | 98.2% |
| False citations (fabricated, not caught) | 0.3% |
False citations are cases where the AI attributed something to a source that does not contain it. The 0.3% that slip through are cases where a phrase appears in the document but in a different context than the AI implied.
What this taught us
The most important thing is the one the architecture makes obvious: never verify against what the AI saw. Verify against the source of truth. Compression is a rendering concern, not a storage concern. The full document stays in the database, and citation search always reads from there.
The fallback hierarchy matters more than we expected. A small percentage of AI responses do not mark quotes the way the system prompt asks. Having phrase-based extraction as a fallback, with honest confidence scoring, means those responses still get citations rather than returning nothing.
What's next
Visual citation highlighting in the document viewer, so a user can click a citation and see the source text highlighted in place. A cross-file citation mode for queries that span multiple uploads, showing where sources agree or conflict. And interactive correction, where users can flag wrong citations and the system learns from the corrections.
