Cereby AI System Design: From File Upload to Grounded Answer
How a raw file becomes something the model can cite, without burning the token budget or the student's coin balance.
The contract hidden in every file upload
Students do not chat with an AI in a vacuum. They paste a screenshot of a textbook page, drag in a forty-page PDF, or pin three sets of lecture notes and ask "quiz me on Chapter 7." Every one of those gestures carries an implicit contract: the assistant will read that material, understand it, and ground its answer in it.
Honoring that contract at scale forced us to solve a chain of problems most chatbot architectures sidestep. Input is heterogeneous: PDFs with selectable text, PDFs that are scanned images, photos of whiteboards, .docx exports, .pptx slide decks, audio recordings, video lectures. Token economics are brutal: a 200-page textbook cannot ride in the prompt, and students were burning $1.50 to $2.00 in tokens per question before we fixed it. Correctness matters: the model must cite this file, not hallucinate from training data. And perceived speed matters too: students abandon if the first token takes longer than a few seconds, so parsing and compression had to stay off the critical path.
The answer was a four-stage pipeline: detect file type, extract text (with an OCR fallback), compress to a token budget, and inject grounded context into the prompt. File chat went from a feature students avoided to a default study surface.
Why "just attach the file" does not work
The naive approach is to dump every byte into the system prompt. It broke in three ways: token overflow (a single PDF chapter consumed the entire context window), cost explosion (students with large files burned through their coin balance in two or three questions), and relevance dilution (answers drifted toward whatever happened to be near the truncation boundary rather than the page the student asked about).
The shape we landed on
We broke the problem into four discrete stages, each with its own failure modes and tuning knobs.
The key insight that made iteration tractable: parsing and compression are separate concerns with separate failure modes. A PDF that parses cleanly can still blow the budget. A scanned image that needs OCR can still fit in a tiny context window once extracted. Keeping these stages independent let us fix one without regressing the others.
Stage 1: Upload and storage
Media files (video and audio) route to transcription endpoints. Documents and images route to the file upload endpoint, which validates the user's tier and coin balance, stores the file in cloud storage, and hands the URL to the parser. We store first and parse second so re-parses do not require a second upload.
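A minimal sketch of that flow, in TypeScript. The dependency names (`validateTierAndBalance`, `storeFile`, `parseFile`) and shapes are hypothetical; the point is the ordering: validate, store, then parse against the stored URL.

```typescript
// Illustrative only: handler and dependency names are hypothetical.
interface UploadDeps {
  validateTierAndBalance: (user: { tier: string; coins: number }) => void; // throws if not allowed
  storeFile: (file: { name: string; bytes: Uint8Array }) => Promise<{ resourceId: string; url: string }>;
  parseFile: (url: string, opts: { forceOcr: boolean }) => Promise<string>;
}

async function handleDocumentUpload(
  user: { tier: string; coins: number },
  file: { name: string; bytes: Uint8Array },
  deps: UploadDeps,
  forceOcr = false,
) {
  deps.validateTierAndBalance(user);            // reject before paying for storage or parsing
  const stored = await deps.storeFile(file);    // store first...
  const text = await deps.parseFile(stored.url, { forceOcr }); // ...parse second, from the stored URL
  return { resourceId: stored.resourceId, parsedText: text };  // re-parses reuse resourceId, no re-upload
}
```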
One UX decision here shaped the whole OCR story: the client carries a "force OCR" flag. Some PDFs have selectable text that is garbage: corrupted fonts, ligature artifacts, copy-paste nonsense. Rather than guessing, we let the user toggle "Use OCR for file parsing" and re-parse in place. The resource ID stays the same; only the parsed content changes.
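A sketch of the re-parse path, assuming a hypothetical `ocrParse` function backed by the Vision extractor. The resource record keeps its ID; only the extracted text is replaced.

```typescript
// Illustrative: the resource shape and ocrParse signature are stand-ins.
interface ParsedResource { resourceId: string; url: string; text: string; usedOcr: boolean; }

async function reparseWithOcr(
  resource: ParsedResource,
  ocrParse: (url: string) => Promise<string>,
): Promise<ParsedResource> {
  const text = await ocrParse(resource.url);   // re-extract from the already-stored file
  return { ...resource, text, usedOcr: true }; // same resourceId, new parsed content
}
```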
Stage 2: Intent classification
Once a file is stored (or when a user sends a plain message or pins existing material), the system decides what to do with it. The path diverges sharply depending on what arrived.
Files short-circuit classification. When at least one attachment is present, a dedicated file-intent detector runs first: before the chit-chat guard, before the pattern table, before the LLM classifier. It uses regex and keyword patterns in priority order. A student who uploads a PDF and says "quiz me on Chapter 7" gets routed through the file-quiz path, not the general quiz tool. The same underlying generation logic runs, but parameters fill from compressed file text rather than from the user's profile. If nothing specific matches, the default is file Q&A: the model answers grounded in the attachment. The LLM tool picker is skipped entirely, because the presence of an attachment is a strong, unambiguous signal that does not benefit from the latency of an extra model call.
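A minimal sketch of that detector. The specific patterns and tool names below are illustrative, but the structure matches the description: ordered patterns, first match wins, file Q&A as the fallback.

```typescript
// Illustrative pattern table: real patterns and tool names differ.
type FileTool = "file-quiz" | "file-flashcards" | "file-summary" | "file-qa";

const FILE_INTENT_PATTERNS: Array<{ tool: FileTool; pattern: RegExp }> = [
  { tool: "file-quiz", pattern: /\b(quiz|test)\s+me\b/i },
  { tool: "file-flashcards", pattern: /\bflash\s?cards?\b/i },
  { tool: "file-summary", pattern: /\bsummar(y|ize|ise)\b/i },
];

function detectFileIntent(message: string): FileTool {
  for (const { tool, pattern } of FILE_INTENT_PATTERNS) {
    if (pattern.test(message)) return tool; // priority order: first match wins
  }
  return "file-qa"; // nothing specific matched: answer grounded in the attachment
}
```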
Pins are context, not routing signals. Pinned materials (notes, quizzes, flashcards) merge into the user's selected context, but they do not trigger the file-intent detector. The full LLM classification pipeline runs, with one exception: if the user has pinned exactly one assessment-type item and the message matches performance-oriented wording ("how did I do," "weak points"), the controller routes directly to performance analysis. Otherwise, the LLM classifier sees the pinned content and can incorporate it into parameter extraction ("create a quiz on this pinned note's topic"), but the pin does not force a tool.
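A sketch of that one shortcut, with a hypothetical pinned-item shape and illustrative wording patterns:

```typescript
// Illustrative: item shape and wording patterns are stand-ins.
interface PinnedItem { title: string; isAssessment: boolean; }

const PERFORMANCE_WORDING = /\b(how did i do|weak points?|weaknesses)\b/i;

function routePinnedMessage(
  message: string,
  pins: PinnedItem[],
): "performance-analysis" | "llm-classification" {
  const assessments = pins.filter((p) => p.isAssessment);
  if (assessments.length === 1 && PERFORMANCE_WORDING.test(message)) {
    return "performance-analysis"; // the single documented exception
  }
  return "llm-classification"; // pins ride along as context but do not force a tool
}
```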
For a deeper look at the six-phase classification pipeline, multi-factor confidence scoring, the Tool Registry, and disambiguation, see Inside Cereby's Intent Classifier.
Stage 3: Compress
This is where the economics live. Parsed content, potentially hundreds of pages, needs to fit inside a default budget of roughly 8,000 tokens shared across all attached files, with a per-file floor of roughly 400 tokens so nothing is completely starved.
The file context manager orchestrates this in three steps. First, budget allocation: total budget divided by file count, with the per-file floor applied. If a student pins five files, each gets about 1,600 tokens. If they pin one, it gets the full 8,000. Second, a fast path: if the estimated token count is already under budget, the full text passes through untouched. Most single-page screenshots and short notes skip compression entirely. Third, the compression path: the file content compressor runs when content exceeds the budget.
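The arithmetic is small enough to show directly. A sketch, using the budget numbers from the text and a deliberately crude character-based token estimate as a placeholder:

```typescript
// Budget numbers come from the text; the token estimator is a rough placeholder.
const TOTAL_BUDGET = 8000;
const PER_FILE_FLOOR = 400;

function perFileBudget(fileCount: number): number {
  return Math.max(PER_FILE_FLOOR, Math.floor(TOTAL_BUDGET / fileCount));
}

// Very rough estimate (~4 characters per token), good enough to decide
// whether compression is needed at all.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function fitsWithoutCompression(text: string, fileCount: number): boolean {
  return estimateTokens(text) <= perFileBudget(fileCount); // fast path: pass through untouched
}

// perFileBudget(5) === 1600, perFileBudget(1) === 8000, matching the text.
```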
We have shipped two generations of the compressor.
Hierarchical Context Compression (first generation): page-level selection. We score pages by query relevance, keep the top-scoring pages in full, and summarize the rest into one-line descriptions. On a representative large-textbook flow, this achieved roughly 92% token reduction while keeping the pages the student actually asked about.
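A sketch of the page-level pass, with a naive keyword-overlap scorer standing in for the real relevance scorer and truncation standing in for the one-line summaries:

```typescript
// Illustrative: scoring and summarization are placeholders for the real components.
interface Page { number: number; text: string; }

function keywordScore(page: Page, query: string): number {
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 2);
  const haystack = page.text.toLowerCase();
  return terms.filter((t) => haystack.includes(t)).length;
}

function compressPages(pages: Page[], query: string, keepTopN: number): string {
  const ranked = [...pages].sort((a, b) => keywordScore(b, query) - keywordScore(a, query));
  const keep = new Set(ranked.slice(0, keepTopN).map((p) => p.number));
  return pages
    .map((p) =>
      keep.has(p.number)
        ? `[Page ${p.number}]\n${p.text}`                // top-scoring pages kept in full
        : `[Page ${p.number}] ${p.text.slice(0, 80)}...` // the rest reduced to a one-liner
    )
    .join("\n\n");
}
```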
Query-Aware Smart Compression (current): when a single page still exceeds the budget (common with dense reference material), we drop to chunk-level selection. The page is split into semantic chunks, each scored against the user's query using an embedding service, and the top-scoring chunks are assembled into the context window. This prevents the failure mode where the answer lives in paragraph 47 of a long page but truncation cuts off at paragraph 12.
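A sketch of the chunk-level pass, assuming an injected `embed` function in place of the embedding service and a blank-line split in place of real semantic chunking:

```typescript
// Illustrative: embed() stands in for the embedding service; chunking is naive.
type Embed = (text: string) => Promise<number[]>;

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

async function selectChunks(
  pageText: string,
  query: string,
  budgetTokens: number,
  embed: Embed,
): Promise<string> {
  const chunks = pageText.split(/\n\s*\n/); // placeholder for semantic chunking
  const queryVec = await embed(query);
  const scored = await Promise.all(
    chunks.map(async (chunk) => ({ chunk, score: cosineSimilarity(await embed(chunk), queryVec) })),
  );
  scored.sort((a, b) => b.score - a.score);

  const picked: string[] = [];
  let used = 0;
  for (const { chunk } of scored) {
    const cost = Math.ceil(chunk.length / 4); // rough token estimate
    if (used + cost > budgetTokens) continue;
    picked.push(chunk);
    used += cost;
  }
  return picked.join("\n\n"); // paragraph 47 survives even when paragraph 12 does not
}
```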
For pinned content with embedded sources (a quiz that references a textbook chapter, for example), the same compressor runs on the embedded source text with a separate 6,000-token budget, producing a "Source material (relevant excerpts)" block appended to the item's content.
Stage 4: Inject and ground
Compressed file content enters the prompt in a clearly delimited block within the user message:
```
Attached Files:
- Chapter_7.pdf (42 pages)
===== FILE CONTENT =====
[compressed, query-relevant excerpts]
===== END OF FILE CONTENT =====
User Query: Explain the proof on page 12
```
Alongside the content, the prompt includes strict grounding rules: answer only from the attached file content, cite specific pages and sections, and explicitly state when the files do not contain enough information.
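A sketch of the assembly step that produces the block above. The grounding-rule wording here is simplified; only the delimiters and header line come from the text.

```typescript
// Illustrative assembly: field names are hypothetical.
interface AttachedFile { name: string; pageCount: number; compressedText: string; }

function buildFileGroundedMessage(files: AttachedFile[], userQuery: string): string {
  const header = files.map((f) => `- ${f.name} (${f.pageCount} pages)`).join("\n");
  const body = files.map((f) => f.compressedText).join("\n\n");
  return [
    "Attached Files:",
    header,
    "===== FILE CONTENT =====",
    body,
    "===== END OF FILE CONTENT =====",
    `User Query: ${userQuery}`,
  ].join("\n");
}

// Simplified stand-in for the strict grounding rules described above.
const GROUNDING_RULES =
  "Answer only from the attached file content. Cite specific pages and sections. " +
  "If the files do not contain enough information, say so explicitly.";
```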
For pinned materials, the server hydrates lightweight payloads from the database, compresses embedded source text against the current query, and merges the result into the layered prompt stack alongside session metadata, memory, and thread history. Files that produced empty text (password-protected PDFs, failed parses) are filtered out before the prompt is assembled.
The failures that shaped the pipeline
Almost none of the incident time during this work came from inference. It came from the kind of small misalignment that does not show up in a design doc.
Garbage text from "selectable" PDFs was the first one. Some academic publishers embed fonts that produce valid Unicode but nonsensical text when extracted. We added the "force OCR" toggle so users can switch to Vision-based extraction without re-uploading.
OCR cost spikes on large PDFs came next. A 200-page scanned PDF with OCR enabled triggers 200 Vision calls, processed four pages at a time. We added per-file page-count limits tied to the user's tier, and dimensioned logging (pages sent, chunks selected, model tier) so cost spikes show up in monitoring before they show up in invoices.
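A sketch of both guards. The tier names and page limits are illustrative, not the real plan structure; the logged dimensions are the ones named above.

```typescript
// Illustrative tier limits; real values differ.
const OCR_PAGE_LIMITS: Record<string, number> = { free: 20, plus: 100, pro: 300 };

function assertOcrPageLimit(tier: string, pageCount: number): void {
  const limit = OCR_PAGE_LIMITS[tier] ?? OCR_PAGE_LIMITS.free;
  if (pageCount > limit) {
    throw new Error(`OCR is limited to ${limit} pages on the ${tier} tier (file has ${pageCount}).`);
  }
}

// Dimensioned logging: pages sent, chunks selected, model tier, so cost
// spikes show up in dashboards before they show up in invoices.
function logOcrRun(dims: { pagesSent: number; chunksSelected: number; modelTier: string }): void {
  console.log(JSON.stringify({ event: "ocr_run", ...dims }));
}
```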
Compression regressions hit us twice. Changes to the chunking logic caused the compressor to select irrelevant chunks. We caught both through token accounting: total tokens consumed per file-chat message is logged, and a spike relative to query complexity flags a regression before users report it.
Server-side PDF rasterization varied across serverless environments in ways we did not anticipate. We documented the known failure modes and made the OCR path opt-in rather than default specifically because of this fragility.
Empty parse results on upload were the cleanest fix. When a parser returns no extractable text (encrypted PDF, empty image), the orchestrator now returns a clear error and gates the chat call on whether attachments contain readable text. Sending the model a prompt with an empty file block and getting a hallucinated answer was one of the simplest problems to fix and one of the most important.
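A sketch of that gate, with a hypothetical attachment shape and error wording. Unreadable files are dropped; if nothing readable remains, the chat call never happens.

```typescript
// Illustrative gate: attachment shape and error wording are stand-ins.
interface Attachment { name: string; parsedText: string; }

function gateFileChat(attachments: Attachment[]): { readable: Attachment[]; error?: string } {
  const readable = attachments.filter((a) => a.parsedText.trim().length > 0);
  if (attachments.length > 0 && readable.length === 0) {
    // Never send the model an empty file block and let it improvise an answer.
    const names = attachments.map((a) => a.name).join(", ");
    return { readable, error: `No readable text could be extracted from: ${names}. Try the OCR toggle.` };
  }
  return { readable }; // files with empty text are dropped before the prompt is assembled
}
```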
Before and after
| Dimension | Before | After |
|---|---|---|
| File chat cost | Whole-document prompts, $1.50 to $2.00 per question on large PDFs | Page- and chunk-aware compression; order-of-magnitude token reduction on typical textbooks |
| OCR accuracy | Client-side Tesseract only; failed on equations, multi-column, diagrams | OpenAI Vision server-side; Tesseract retained for lightweight editor-flow OCR |
| Parse coverage | PDF and images only | PDF, images, .docx, .xlsx, .pptx, audio (Whisper), video (transcript + Whisper) |
| Compression strategy | Truncate at N tokens | Hierarchical (page-level) then query-aware (chunk-level) with embedding-based scoring |
| User control | None | OCR toggle, explicit coin costs, model tier selection |
| Error handling | Silent failures, hallucinated answers from empty context | Readable-text gate, clear error messages, dimensioned logging |
Qualitatively: file chat became a default study surface. Students started uploading entire course readers and asking follow-ups across sessions, behavior we did not see when every question cost a dollar and answers cited the wrong page.
What this taught us
Parsing and compression need to stay separate. Conflating extraction with budget management makes both harder to debug: a file that parses perfectly can still blow the budget, and a file that needs OCR can still fit in 400 tokens.
OCR is a product feature, not a fallback. Treating Vision-based extraction as a first-class path (with its own toggle, cost model, and monitoring) caught quality issues that automatic fallback would have hidden.
Token budgets are a UX decision. The 8,000-token default, the per-file floor, the coin cost: these are product choices, not implementation details. Students notice when compression is too aggressive before we do.
Design for the empty case. Gating the chat call on readable content is one of the simplest checks we shipped and one of the most consequential: a missing gate means a hallucinated answer, and users do not distinguish that from the model being wrong.
