Inside Cereby's Intent Classifier: How We Route Natural Language to the Right Tool
How we went from a growing if/else chain to a layered pipeline with deterministic gates, LLM inference, multi-factor confidence scoring, and structured disambiguation.
The if/else chain that stopped working
Every request Cereby handles begins with one question: what does the user actually want? "Quiz me on this," "explain photosynthesis," "add my classes to my calendar." Each sentence has to resolve to one of our registered tools, or to no tool at all if the user is just chatting.
When we launched, the answer was a function with a growing if/else chain. "Quiz" meant quiz generation. "Explain" meant concept explanation. It worked until the real world showed up.
The cracks were specific. "Cells" could mean Biology, Chemistry, or Computer Science. A keyword matcher has no way to decide, so we routed to the wrong subject roughly one in four times for polysemous terms. "Quiz me on this" without conversation history was unresolvable. "Explain derivatives then quiz me on them" was two actions, and we picked one and dropped the other. A user with a pinned quiz asking "how did I do" got a generic explanation instead of performance analysis, because the classifier had no concept of pinned materials. And the system never said "I don't know." Every input produced an action, even when the match was terrible.
The cost was measurable. About 23% of requests needed correction or rephrasing. Around 15% of users abandoned when the wrong tool ran. "Cereby doesn't understand me" was a top-three support category.
What we built instead
Rather than patching keywords, we designed a six-phase pipeline where earlier phases are deterministic and cheap, and later phases use LLM inference with structured confidence scoring.
Phases 1 through 4 are pure functions: no network calls, sub-millisecond latency. They handle the majority of unambiguous requests. The LLM in Phase 5 only runs when the message is genuinely ambiguous, and even then we constrain it to structured JSON output with logprobs so we can verify its confidence independently.
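In code, the pipeline is essentially an ordered list of classifiers where the first definitive result wins. A minimal sketch of that shape, with illustrative names and types rather than our actual interfaces:

```typescript
// Sketch of the phase chain: deterministic phases run first, and the LLM
// classifier at the end only runs if none of them produce a result.
interface Classification {
  actionType: string;
  parameters: Record<string, unknown>;
  confidence: number;
  needsClarification?: boolean;
}

interface RequestContext {
  files: string[];
  pinnedItems: string[];
  history: string[];
}

type Phase = (message: string, ctx: RequestContext) => Promise<Classification | null>;

async function classify(
  message: string,
  ctx: RequestContext,
  phases: Phase[], // [fileIntent, contextIntent, ..., patternRegistry, llmClassifier]
): Promise<Classification> {
  for (const phase of phases) {
    const result = await phase(message, ctx); // Phases 1-4 resolve without any network call
    if (result) return result;                // first definitive match short-circuits the rest
  }
  // Nothing matched and the LLM declined to pick a tool: treat as plain chat
  return { actionType: 'none', parameters: {}, confidence: 1.0 };
}
```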
The tool registry
Before classification can happen, every tool must be registered. The registry holds two parallel maps: one for tool metadata (name, description, parameters, examples) and one for execution handlers. Registration enforces that the metadata ID and handler ID match at startup. A mismatch throws immediately.
Fourteen tools are currently registered, covering quiz and flashcard generation, notes, summaries, schedule creation, performance analysis, content search, memory, podcast creation, and personalization. The registry auto-generates the intent classification prompt that the LLM sees in Phase 5, iterating over every registered tool and emitting its name, ID, description, parameters, example phrases, and whether it requires confirmation. Registering a new tool automatically updates the classification prompt.
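A simplified sketch of that registration contract and the prompt generation, with illustrative shapes and field names rather than our exact code:

```typescript
// Illustrative tool registry: metadata and handlers live in parallel maps,
// and registration fails fast if their IDs disagree.
interface ToolMetadata {
  id: string;
  name: string;
  description: string;
  parameters: Record<string, string>; // parameter name -> description
  examples: string[];
  requiresConfirmation: boolean;
}

type ToolHandler = (params: Record<string, unknown>) => Promise<unknown>;

class ToolRegistry {
  private metadata = new Map<string, ToolMetadata>();
  private handlers = new Map<string, ToolHandler>();

  register(meta: ToolMetadata, handlerId: string, handler: ToolHandler): void {
    if (meta.id !== handlerId) {
      throw new Error(`Tool ID mismatch: metadata "${meta.id}" vs handler "${handlerId}"`);
    }
    this.metadata.set(meta.id, meta);
    this.handlers.set(handlerId, handler);
  }

  // The Phase 5 classification prompt is derived from the registry, so a new
  // tool shows up in the prompt without touching the classifier itself.
  buildClassificationPrompt(): string {
    return [...this.metadata.values()]
      .map(t =>
        `${t.name} (${t.id}): ${t.description}\n` +
        `  parameters: ${Object.keys(t.parameters).join(', ') || 'none'}\n` +
        `  examples: ${t.examples.join(' | ')}\n` +
        `  requires confirmation: ${t.requiresConfirmation}`,
      )
      .join('\n\n');
  }
}
```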
Phase 1: file intent detection
When files are attached, the file intent detector runs first and can short-circuit the entire pipeline. It checks a prioritized list of file intent patterns. Each entry specifies a required file type (PDF for schedule import, for example), regex patterns matching the message, and keywords with reference checks ("this," "the document," "the PDF").
If no specific file intent matches but files are present and the message is a question, the detector defaults to a general file Q&A intent. The controller then routes to a file-handling pipeline that parses and prepares the file content.
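A minimal sketch of the detector, assuming illustrative pattern entries, keyword lists, and confidence values:

```typescript
// Illustrative file intent patterns, checked in priority order.
interface FileIntentPattern {
  actionType: string;
  requiredFileType?: string;   // e.g. 'pdf' for schedule import
  patterns: RegExp[];          // regexes matched against the message
  referenceKeywords: string[]; // "this", "the document", "the PDF"
}

const FILE_INTENT_PATTERNS: FileIntentPattern[] = [
  {
    actionType: 'schedule_creation',
    requiredFileType: 'pdf',
    patterns: [/add .*(to|into) (my )?calendar/i, /import (my )?(schedule|classes)/i],
    referenceKeywords: ['this', 'the document', 'the pdf'],
  },
  // ...more specific file intents here...
];

function detectFileIntent(message: string, fileTypes: string[]) {
  if (fileTypes.length === 0) return null;
  const lower = message.toLowerCase();
  for (const entry of FILE_INTENT_PATTERNS) {
    const typeOk = !entry.requiredFileType || fileTypes.includes(entry.requiredFileType);
    const textOk =
      entry.patterns.some(p => p.test(message)) ||
      entry.referenceKeywords.some(k => lower.includes(k));
    if (typeOk && textOk) return { actionType: entry.actionType, confidence: 0.95 };
  }
  // No specific intent matched, but files are present and the message reads like a question
  if (/\?|what|how|why|explain/i.test(message)) {
    return { actionType: 'file_qa', confidence: 0.8 };
  }
  return null;
}
```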
Why files gate everything: a user who uploads a syllabus PDF and says "add to calendar" wants schedule creation with file context, not a generic schedule flow. File presence changes the semantics of the request before any other signal has a chance to run.
Phase 2: pinned material and context intent
Cereby's @-mention system lets users pin materials (notes, quizzes, flashcards) to a conversation. These fundamentally change how classification works.
The context intent detector handles one specific but high-value case: a user pins a single learning tool and asks about their performance on it. Patterns like "how did I do on this quiz," "analyze my performance," or "what questions did I get wrong" trigger a dedicated performance analysis path with confidence 0.9.
There is an intentional guard: if two or more performance-type items are pinned, the detector returns nothing and lets the request fall through to the general performance analysis tool, which can merge data across items into a comprehensive report card.
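A sketch of that check, with illustrative patterns and item shapes; only the 0.9 confidence and the two-or-more guard mirror the behavior described above:

```typescript
// Illustrative context intent check: exactly one pinned performance item plus
// performance-analysis language routes straight to that item's analysis.
const PERFORMANCE_PATTERNS = [
  /how did i do/i,
  /analy[sz]e my performance/i,
  /what (questions )?did i get wrong/i,
];

interface PinnedItem {
  id: string;
  kind: 'quiz' | 'flashcards' | 'note';
}

function detectContextIntent(message: string, pinned: PinnedItem[]) {
  const performanceItems = pinned.filter(p => p.kind === 'quiz' || p.kind === 'flashcards');

  // Guard: two or more performance items fall through to the general
  // performance analysis tool, which merges them into one report card.
  if (performanceItems.length !== 1) return null;

  if (PERFORMANCE_PATTERNS.some(p => p.test(message))) {
    return {
      actionType: 'performance_analysis',
      parameters: { itemId: performanceItems[0].id },
      confidence: 0.9,
    };
  }
  return null;
}
```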
Pinned materials also affect the pipeline at two other points. First, the system prompt builder appends a "Pinned context" section, instructing the model to use only pinned content and attached files rather than the user's broader data. This prevents the LLM from hallucinating a subject when the user says "quiz on this" with a pinned note. Second, in the controller, if the user has pinned content and their intent maps to a content-based action (quiz, notes, flashcards), the clarification gate is bypassed even when confidence is below the threshold. The pinned material itself provides sufficient context.
Phase 4: deterministic pattern registry
The pattern registry runs ordered pattern groups before the LLM. Two examples show why ordering matters. Search patterns cover 20-plus regex forms, including "what did [someone] mean by [quote]" with optional source scoping. Podcast patterns must appear before the visualization guard, because "create a podcast on geographic data" would otherwise trip that guard on the word "geographic" before the podcast group gets a chance to fire. The remaining groups cover schedule creation, resource creation, and learning goals.
Each group returns the action type, extracted parameters, and a fixed confidence (typically 0.90 to 0.95). The first group that matches wins.
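A condensed sketch of the registry shape; the regexes and confidences here are examples, but the first-match-wins loop and the fixed per-group confidence mirror the real behavior:

```typescript
// Illustrative pattern registry: groups are checked in a fixed order,
// and the first matching group wins.
interface PatternGroup {
  actionType: string;
  patterns: RegExp[];
  confidence: number; // fixed per group, typically 0.90-0.95
  extractParams: (message: string) => Record<string, string>;
}

const PATTERN_GROUPS: PatternGroup[] = [
  {
    actionType: 'schedule_creation',
    patterns: [/create .*(schedule|study plan)/i, /plan my (week|semester)/i],
    confidence: 0.95,
    extractParams: () => ({}),
  },
  {
    // Must precede the visualization guard, so "create a podcast on
    // geographic data" is not caught on the word "geographic".
    actionType: 'podcast_creation',
    patterns: [/(create|make) .*podcast (on|about) (.+)/i],
    confidence: 0.9,
    extractParams: m => ({ topic: m.replace(/.*podcast (on|about)\s*/i, '').trim() }),
  },
  // ...search, resource creation, learning goals, visualization guard last...
];

function matchPatterns(message: string) {
  for (const group of PATTERN_GROUPS) {
    if (group.patterns.some(p => p.test(message))) {
      return {
        actionType: group.actionType,
        parameters: group.extractParams(message),
        confidence: group.confidence,
      };
    }
  }
  return null; // fall through to Phase 5, the LLM classifier
}
```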
Phase 5: LLM classification
When deterministic phases do not match, we call the LLM. Three design decisions define this phase.
First, a dedicated model. Intent classification always runs on a small, fast model regardless of the user's chosen chat model. It completes in roughly 200ms and keeps classification cost fixed.
Second, structured output. We request JSON-mode output and define the expected schema in the prompt: action type, parameters, confidence, up to three alternatives, whether clarification is needed, a clarification question, context-derived subject and topic, and an optional follow-up action. The schema is baked into the prompt that the registry generates.
Third, logprobs for verification. We request token-level log-probabilities (a measure of how confident the model is in each word it produces) and extract the mean across the response. This gives us an independent confidence signal, separate from the model's self-reported confidence field. Some providers reject logprobs or structured output; we catch the error and retry without them.
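A sketch of that call, assuming an OpenAI-compatible chat completions endpoint; the model name is a placeholder for whatever small classifier model is configured:

```typescript
import OpenAI from 'openai';

const client = new OpenAI();

async function classifyWithLLM(systemPrompt: string, message: string) {
  const messages = [
    { role: 'system' as const, content: systemPrompt },
    { role: 'user' as const, content: message },
  ];

  const completion = await client.chat.completions
    .create({
      model: 'gpt-4o-mini',                       // placeholder: a small, fixed classifier model
      messages,
      response_format: { type: 'json_object' },   // structured JSON output
      logprobs: true,                              // token-level log-probabilities for verification
    })
    .catch(() =>
      // Some providers reject logprobs or JSON mode: retry as a plain request.
      client.chat.completions.create({ model: 'gpt-4o-mini', messages }),
    );

  const choice = completion.choices[0];
  const tokens = choice.logprobs?.content ?? [];
  const meanLogprob =
    tokens.length > 0 ? tokens.reduce((sum, t) => sum + t.logprob, 0) / tokens.length : null;

  return { raw: choice.message.content ?? '{}', meanLogprob };
}
```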
The system prompt is assembled from four components:
System prompt = tool definitions (from registry)
+ user context summary (weak points, events, recent materials, pinned content)
+ conversation history (last N messages + extracted concepts/subjects/topics)
+ session metadata (current date/timezone for relative date resolution)
The prompt includes roughly 25 few-shot examples covering edge cases: learning goals vs. explanations, search vs. explain, pinned context, deictic resolution ("quiz on this"), memory commands, personalization, casual conversation, and stated-fact vs. creation-request disambiguation.
Phase 6: post-processing and confidence scoring
After the LLM returns, three things happen.
Parameter extraction and merge. A regex-based extractor pulls subject and topics from the original message. These override the LLM's parameters when present, because the user's explicit words ("quiz on kubernetes") should beat the LLM's interpretation. For deictic references ("quiz on this"), placeholders are stripped and the LLM's context-derived subject fills the gap.
Rule-based overrides. Post-processor corrections catch systematic LLM mistakes: converting a Q&A classification to performance analysis when performance-specific language is detected, or converting a "stating a fact" message back to no action.
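A minimal sketch of these first two steps, with illustrative regexes and action names:

```typescript
// Illustrative post-processing: regex-extracted parameters beat the LLM's,
// and simple rules correct systematic misclassifications.
const SUBJECT_PATTERN = /(?:quiz|flashcards?|notes?) (?:me )?on ([a-z0-9 -]+)/i;
const PERFORMANCE_LANGUAGE = /how did i do|my (score|results)|questions i got wrong/i;

function postProcess(
  message: string,
  llm: { actionType: string; parameters: Record<string, string> },
) {
  const result = { ...llm, parameters: { ...llm.parameters } };

  // 1. Merge: the user's explicit words override the LLM's interpretation.
  //    Deictic placeholders ("this", "that", "it") are never used as a subject,
  //    so the LLM's context-derived subject survives for "quiz on this".
  const subject = SUBJECT_PATTERN.exec(message)?.[1]?.trim();
  if (subject && !/^(this|that|it)$/i.test(subject)) {
    result.parameters.subject = subject;
  }

  // 2. Override: convert a Q&A classification to performance analysis when
  //    performance-specific language is present.
  if (result.actionType === 'file_qa' && PERFORMANCE_LANGUAGE.test(message)) {
    result.actionType = 'performance_analysis';
  }
  return result;
}
```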
Multi-factor confidence scoring. This is where the real confidence work happens.
Multi-factor confidence scoring
The confidence scorer computes five independent factors and blends them with the LLM's self-reported confidence:
| Factor | Weight | How it's calculated |
|---|---|---|
| Pattern match | 0.25 | Regex keyword families per action type. No matches → low baseline, one match → good, two or more → very high |
| Model confidence | 0.25 | Derived from mean token log-probability, clamped to [0, 1]. Falls back to a moderate default if logprobs are unavailable |
| Historical accuracy | 0.20 | Success rate for recent classifications of the same user and action type; correction records reduce the rate |
| Parameter completeness | 0.15 | Fraction of required parameters present (quiz generation requires a subject, for example) |
| Context alignment | 0.15 | Do topics align with known weak points? Does the subject match recent quiz or note history? |
The weighted sum produces an overall score. The final confidence blends this multi-factor score with the LLM's self-reported confidence, giving more weight to the multi-factor score because LLMs tend to overestimate their certainty.
When the blended confidence drops below our threshold and no clarification question was already set by the LLM, the system flags the result as needing clarification and attaches a generic question. This prevents low-confidence classifications from silently executing the wrong tool.
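A sketch of the scorer; the factor weights match the table above, while the blend ratio and threshold are placeholders:

```typescript
// Illustrative multi-factor confidence scoring and blending.
interface ConfidenceFactors {
  patternMatch: number;          // 0-1, regex keyword families
  modelConfidence: number;       // 0-1, derived from mean token logprob
  historicalAccuracy: number;    // 0-1, recent success rate for this user + action
  parameterCompleteness: number; // 0-1, fraction of required parameters present
  contextAlignment: number;      // 0-1, match with weak points / recent history
}

const WEIGHTS: Record<keyof ConfidenceFactors, number> = {
  patternMatch: 0.25,
  modelConfidence: 0.25,
  historicalAccuracy: 0.2,
  parameterCompleteness: 0.15,
  contextAlignment: 0.15,
};

const CLARIFICATION_THRESHOLD = 0.6; // placeholder value

function scoreConfidence(factors: ConfidenceFactors, llmSelfReported: number) {
  const multiFactor = (Object.keys(WEIGHTS) as (keyof ConfidenceFactors)[])
    .reduce((sum, k) => sum + WEIGHTS[k] * factors[k], 0);

  // Blend, weighting the multi-factor score more heavily than the LLM's
  // self-reported confidence, which tends to run high. 0.7/0.3 is illustrative.
  const blended = 0.7 * multiFactor + 0.3 * llmSelfReported;

  return {
    confidence: blended,
    needsClarification: blended < CLARIFICATION_THRESHOLD,
  };
}
```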
Disambiguation and clarification
When confidence is low or the LLM flags ambiguity, three systems collaborate.
The ambiguity detector runs ordered checks: reference ambiguity ("this," "that," "it" without recent activity to resolve against); subject ambiguity (polysemous terms like "cells," "waves," "bonds," "function" mapped to multiple disciplines); action ambiguity (short generic messages without action words, with low confidence); parameter ambiguity (required parameters missing for the classified action); and scope ambiguity (broad subject without specific topics, with low confidence).
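As one example of these checks, the subject-ambiguity detector can be sketched as a lookup over polysemous terms; the term lists here are illustrative:

```typescript
// Illustrative subject-ambiguity check: polysemous terms mapped to the
// disciplines they could belong to.
const POLYSEMOUS_TERMS: Record<string, string[]> = {
  cells: ['Biology', 'Chemistry', 'Computer Science'],
  waves: ['Physics', 'Oceanography'],
  bonds: ['Chemistry', 'Finance'],
  function: ['Mathematics', 'Computer Science'],
};

function detectSubjectAmbiguity(message: string, resolvedSubject?: string) {
  if (resolvedSubject) return null; // pinned material or context already disambiguated it
  const lower = message.toLowerCase();
  for (const [term, disciplines] of Object.entries(POLYSEMOUS_TERMS)) {
    if (lower.includes(term)) {
      return {
        kind: 'subject_ambiguity' as const,
        term,
        candidates: disciplines, // surfaced downstream as tappable alternatives
      };
    }
  }
  return null;
}
```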
The clarification generator takes that detection (or the LLM's alternatives) and produces a structured response: a natural-language question, up to three action alternatives with descriptions and reasoning, and context hints that mention pinned item titles when present.
The controller checks whether clarification is needed or confidence is below threshold. If true, it returns a clarification response carrying the alternatives. The client renders these as tappable options. One tap selects the right action.
The pinned bypass applies here too: if the user has pinned content and the action is content-based, the clarification gate is skipped entirely. The pinned material provides the missing context, so asking "which topic?" would be redundant.
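Putting the gate and the bypass together, roughly (the threshold and action IDs are placeholders):

```typescript
// Illustrative controller gate: low confidence or an LLM ambiguity flag asks
// for clarification, unless pinned content already supplies the context for a
// content-based action.
const CONTENT_BASED_ACTIONS = new Set([
  'quiz_generation',
  'notes_creation',
  'flashcard_generation',
]);
const THRESHOLD = 0.6;

function shouldAskForClarification(result: {
  actionType: string;
  confidence: number;
  needsClarification: boolean;
  hasPinnedContent: boolean;
}): boolean {
  // Pinned bypass: the pinned material is the missing context, so asking
  // "which topic?" would be redundant.
  if (result.hasPinnedContent && CONTENT_BASED_ACTIONS.has(result.actionType)) {
    return false;
  }
  return result.needsClarification || result.confidence < THRESHOLD;
}
```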
Multi-intent and follow-up actions
Users do not think in single actions. "Explain photosynthesis then quiz me on it" requires detecting the sequence ("then" signals dependency), classifying the primary action (explanation), and attaching the secondary action as a follow-up (quiz generation).
The LLM is prompted to return a follow-up action when the user clearly chains two actions. The follow-up is validated against the set of registered tools and attached to the response payload for the client to trigger after the primary action completes.
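A sketch of that validation step, with illustrative field names:

```typescript
// Illustrative follow-up handling: the LLM's follow-up action is only kept if
// it names a registered tool; the client triggers it after the primary action.
function attachFollowUp(
  classification: { actionType: string; followUpAction?: string },
  registeredToolIds: Set<string>,
) {
  const followUp = classification.followUpAction;
  if (followUp && registeredToolIds.has(followUp)) {
    return { ...classification, followUpAction: followUp };
  }
  // Unknown or missing follow-up: drop it rather than queue an unregistered tool.
  const { followUpAction, ...primaryOnly } = classification;
  return primaryOnly;
}
```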
Operational details
If the LLM call throws (network failure, malformed JSON, provider rejection), a fallback classifier runs keyword-priority matching. The system always returns something usable.
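A sketch of that fallback path, with illustrative keyword priorities and confidences:

```typescript
// Illustrative fallback: keyword-priority matching when the LLM call fails.
const FALLBACK_KEYWORDS: [string, string[]][] = [
  ['quiz_generation', ['quiz', 'test me']],
  ['flashcard_generation', ['flashcard', 'flash card']],
  ['concept_explanation', ['explain', 'what is']],
];

function fallbackClassify(message: string) {
  const lower = message.toLowerCase();
  for (const [actionType, keywords] of FALLBACK_KEYWORDS) {
    if (keywords.some(k => lower.includes(k))) {
      return { actionType, parameters: {}, confidence: 0.5, fallback: true };
    }
  }
  return { actionType: 'none', parameters: {}, confidence: 0.3, fallback: true };
}
```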
Every core service (classifier, registry, confidence scorer, ambiguity detector, clarification generator, orchestrator) uses the singleton pattern to avoid re-instantiating AI clients, database connections, and compiled regex patterns on every request.
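The pattern itself is the standard lazy singleton, roughly:

```typescript
// Illustrative singleton accessor: the AI client, database connection, and
// compiled regexes are created once and reused across requests.
class IntentClassifier {
  private static instance: IntentClassifier | null = null;

  static getInstance(): IntentClassifier {
    if (!IntentClassifier.instance) {
      IntentClassifier.instance = new IntentClassifier();
    }
    return IntentClassifier.instance;
  }

  private constructor() {
    // expensive setup happens exactly once here
  }
}
```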
Before and after
| Metric | Before | After | Change |
|---|---|---|---|
| Classification accuracy | ~77% | ~95% | +18pp |
| Requests needing clarification | ~23% | ~7% | -70% |
| User corrections (rephrase/retry) | ~15% | ~3% | -80% |
| Avg. confidence score | N/A | 0.87 | New |
| Complex multi-intent success | ~45% | ~89% | +98% |
The qualitative version is shorter. Pinned context eliminated "what topic?" loops: users pin a note and say "quiz on this," and it works on the first try. Deterministic patterns removed model-dependent variance: "create a schedule" always routes to schedule creation regardless of LLM behavior. Confidence scoring made failures visible: instead of silently running the wrong tool, low-confidence results surface alternatives the user picks in one tap. Support tickets citing "AI doesn't understand me" dropped by roughly 31%.
What this taught us
Self-reported LLM confidence is not enough. LLMs overestimate their certainty. Blending self-reported confidence with logprobs, pattern match strength, and historical accuracy produces a score that actually correlates with correctness.
Pinned context is the strongest signal. When a user explicitly attaches material, the classifier should trust that context over everything else. The pinned-bypass rule eliminated most false clarifications.
Ordered pattern groups prevent priority collisions. "Create a podcast on geographic trends" was being caught by the visualization guard before we ordered the patterns correctly. Explicit priority ordering (schedule, search, podcast, resource, learning goal, visualization) fixed an entire class of misroutes.
What's next
Two things we know are under-solved. First, several action types (flashcard generation, in particular) still rely entirely on the LLM because we haven't built deterministic pattern groups for them. Second, follow-up actions today are attached to the response for the client to trigger sequentially; server-side orchestration with dependency tracking would let us pipeline independent follow-ups for lower perceived latency.
Questions: engineering@cereby.ai. Product: cereby.ai.
