From First-Match-Wins to Parallel Scoring: How We Fixed Cereby's Misclassification Problem
How we rebuilt intent classification from a sequential cascade into a parallel candidate pipeline with ambiguity-aware scoring, eliminating the class of misclassifications where regex patterns overrode the LLM's correct answer.
A fast system that was wrong in the right places
A user typed "create a algebra test i can study on." Cereby handed back a flashcard set.
The classifier had seen the word "study," matched a regex pattern, returned createflashcards at confidence 0.90, and never asked the LLM for a second opinion. The word that mattered was "test." The system got credit for being fast and the user lost a coin.
In our previous post, we described a six-phase pipeline that brought classification accuracy from roughly 77% to roughly 95% and cut user corrections by 80%. It had one structural flaw: regex patterns returned early on first match, and for anything with competing signals, the LLM never got a vote. The Semantic Scoring Pipeline fixes this by running every classifier in parallel and scoring the results. Instead of asking "which phase handles this," we ask "whose case is strongest."
The original pipeline's blind spots
The old pipeline had a strict priority order: Phase 4's regex patterns returned early on the first match, and Phase 5's LLM classification never ran for those messages. Three failure modes came out of that.
Keyword collision. The opening example. "Study" beat "test" because the LLM never got a vote.
Dead follow-up actions. "Create a quiz then flashcard on algebra." The MultiIntentParser detected two actions, stored the second as followUpAction, and stopped. The controller never read that field. The quiz generated. The flashcard quietly vanished.
Confidence that answered the wrong question. The confidence scorer ran after the winner was already chosen. It measured how well the winner fit (pattern match strength, parameter completeness, context alignment), not whether the winner was better than the alternatives. A regex match with a high pattern score and complete parameters could score well even when the underlying classification was wrong.
The shape we landed on
We kept the structural early returns. File intents, context intents, unclear detection: these rely on unambiguous signals (a file is attached or it isn't) and short-circuiting is the right call. Everything else moved into a parallel pipeline.
Before this, each source returned its own shape with no way to compare them. Now every classifier produces a ClassificationCandidate with a common set of fields including the source, an ambiguity score, and an optional followUpAction. The unified type is what makes scoring possible.
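A minimal sketch of that shape (only source, ambiguityScore, and followUpAction are discussed in this post; the remaining field names here are simplified for illustration, not the exact definition):

```typescript
// Sketch of the unified candidate type. Fields other than source, ambiguityScore,
// and followUpAction are illustrative stand-ins, not the exact schema.
type CandidateSource = "regex" | "llm" | "multi-intent";

interface FollowUpAction {
  action: string;                      // e.g. "createflashcards"
  parameters: Record<string, string>;  // e.g. { subject: "algebra" }
}

interface ClassificationCandidate {
  source: CandidateSource;             // which classifier produced this candidate
  action: string;                      // proposed action type, e.g. "generatequiz"
  baseConfidence: number;              // the classifier's own confidence, 0..1
  ambiguityScore: number;              // antiSignalsFound / (signalsFound + antiSignalsFound)
  parameters: Record<string, string>;  // extracted parameters, e.g. { subject: "algebra" }
  followUpAction?: FollowUpAction;     // second action detected by the multi-intent parser
}
```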
Ambiguity scoring: the key mechanism
Every action type has two keyword lists: signals (keywords that support the action) and anti-signals (keywords that suggest a different action).
createflashcards:
  signals: ["flashcards", "study cards", "memorize", "review cards"]
  antiSignals: ["test", "quiz", "questions on", "create a", "make a", "grade"]

generatequiz:
  signals: ["quiz", "test me", "test on", "questions on", "test i can"]
  antiSignals: ["flashcards", "study cards", "memorize", "review cards"]
The ambiguity score (antiSignalsFound / (signalsFound + antiSignalsFound)) is 0 for LLM candidates, which read the full sentence rather than matching isolated keywords. The full scoring formula:
finalScore = baseConfidence
- (0.35 x ambiguityScore) // penalty for competing signals
+ contextBonus // up to +0.10 for matching active subjects
+ parameterBonus // up to +0.10 for complete parameters
+ agreementBonus // +0.15 if another source agrees
// capped at [0, 1]
The ambiguity penalty of 0.35 is calibrated so a regex match with ambiguity above 0.5 drops below an LLM match with moderate confidence. We tested 0.25, 0.30, 0.35, and 0.40 against our corpus of misclassified messages. At 0.25, some regex candidates with moderate ambiguity still beat the LLM. At 0.35, every wrong result flipped without degrading a correct one. We would not have found that number by reasoning about it.
The agreement bonus of 0.15 rewards consensus. When regex and the LLM independently pick the same action, both get boosted, and unambiguous cases resolve with higher confidence than before despite passing through a scoring layer.
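Put together, the scorer is only a few lines. The sketch below uses the ClassificationCandidate shape from above; the helper names and bonus conditions are simplified (parameter completeness is reduced to "has a subject"), but the weights and the ambiguity formula are the real ones:

```typescript
// Sketch of the scoring step. Weights match the formula above; names and the
// bonus conditions are simplified for illustration.
interface ScoringContext {
  activeSubjects: string[];            // subjects currently active in the session
}

// How ambiguityScore is computed for regex candidates (0 for LLM candidates).
function ambiguity(signalsFound: number, antiSignalsFound: number): number {
  const total = signalsFound + antiSignalsFound;
  return total === 0 ? 0 : antiSignalsFound / total;
}

function scoreCandidate(
  candidate: ClassificationCandidate,
  others: ClassificationCandidate[],
  ctx: ScoringContext
): number {
  const ambiguityPenalty = 0.35 * candidate.ambiguityScore;          // penalty for competing signals
  const contextBonus = ctx.activeSubjects.includes(candidate.parameters.subject ?? "")
    ? 0.10
    : 0;                                                             // up to +0.10 for matching active subjects
  const parameterBonus = candidate.parameters.subject ? 0.10 : 0;    // up to +0.10 for complete parameters
  const agreementBonus = others.some(
    (o) => o.source !== candidate.source && o.action === candidate.action
  )
    ? 0.15
    : 0;                                                             // +0.15 if another source agrees

  const score =
    candidate.baseConfidence - ambiguityPenalty + contextBonus + parameterBonus + agreementBonus;
  return Math.min(1, Math.max(0, score));                            // capped at [0, 1]
}
```

Plugging the worked example below into this sketch reproduces the same split: 0.90 - 0.35 x 0.67 ≈ 0.67 for the regex candidate versus 0.88 + 0.10 = 0.98 for the LLM candidate.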
Worked examples
The misclassification case:
"create a algebra test i can study on"
Regex candidate: createflashcards
  baseConfidence: 0.90
  ambiguityPenalty: -0.35 x 0.67 = -0.23 (signals: "study" / antiSignals: "test", "create a")
  parameterBonus: +0.00 (no subject extracted)
  agreementBonus: +0.00 (LLM disagrees)
  FINAL: 0.67

LLM candidate: generatequiz
  baseConfidence: 0.88
  ambiguityPenalty: 0.00
  parameterBonus: +0.10 (subject: "algebra")
  agreementBonus: +0.00 (regex disagrees)
  FINAL: 0.98

Winner: generatequiz (correct)
An unambiguous case: "create a podcast about biology." Both sources pick createpodcast, no anti-signals, agreement bonus of +0.15 on each. Both scores cap at 1.0. No regression.
Multi-intent confirmation flow
The second fix closed the dead followUpAction path. The new flow:
User: "create a quiz then flashcard on algebra"
Pipeline produces a multi-intent candidate with followUpAction.
Scorer evaluates it alongside single-intent candidates.
Controller executes quiz, then appends:
[Quiz generated and displayed]
"I also have a flashcard set on algebra ready to create.
Want me to go ahead?"
User: "yes"
Controller executes pendingAction. Flashcards generated.
This reused the existing pendingAction mechanism already wired for schedule confirmations. CerebyResponse already had followUpAction and pendingAction; shouldExecutePendingAction() already existed in the utils. Connecting them for multi-intent cases required no new UI components and no new API fields.
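In code, the wiring amounts to a few lines in the controller. This sketch simplifies the real signatures: executeAction() and confirmationPrompt() are stand-in helpers, CerebyResponse is reduced to the two fields that matter here, and where the pending action is stored is simplified to "on the previous response."

```typescript
// Sketch of the controller-side wiring. shouldExecutePendingAction() and the
// followUpAction/pendingAction fields exist as described in the post; the rest
// (executeAction, confirmationPrompt, this CerebyResponse shape) is illustrative.
interface CerebyResponse {
  message: string;
  pendingAction?: FollowUpAction;
}

declare function shouldExecutePendingAction(message: string): boolean;
declare function executeAction(action: FollowUpAction): Promise<CerebyResponse>;
declare function confirmationPrompt(action: FollowUpAction): string;

async function handleWinningCandidate(
  winner: ClassificationCandidate,
  priorResponse: CerebyResponse | null,
  userMessage: string
): Promise<CerebyResponse> {
  // "yes" (or equivalent) after a confirmation prompt: run the stored follow-up.
  if (priorResponse?.pendingAction && shouldExecutePendingAction(userMessage)) {
    return executeAction(priorResponse.pendingAction);
  }

  // Execute the primary action as usual.
  const response = await executeAction({ action: winner.action, parameters: winner.parameters });

  // If the multi-intent parser found a second action, stash it and ask
  // instead of dropping it on the floor.
  if (winner.followUpAction) {
    response.pendingAction = winner.followUpAction;
    response.message += "\n\n" + confirmationPrompt(winner.followUpAction);
  }
  return response;
}
```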
Clarification, latency, and compatibility
Two conditions now trigger a clarification prompt instead of a guess: best candidate scores below 0.5, or the top two candidates (different action types) are within 0.1 of each other. Previously, clarification only fired when the LLM explicitly set clarificationNeeded: true.
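As a sketch, with illustrative threshold names:

```typescript
// The two clarification triggers. Constant names are illustrative; the 0.5 and
// 0.1 thresholds are the ones described above.
const MIN_CONFIDENT_SCORE = 0.5;  // below this, ask instead of guessing
const MIN_SEPARATION = 0.1;       // top two closer than this => ambiguous

interface ScoredCandidate {
  candidate: ClassificationCandidate;
  score: number;
}

function needsClarification(scored: ScoredCandidate[]): boolean {
  const ranked = [...scored].sort((a, b) => b.score - a.score);
  const best = ranked[0];
  if (!best || best.score < MIN_CONFIDENT_SCORE) return true;  // condition 1: weak best candidate

  // condition 2: runner-up proposes a different action and is within 0.1
  const second = ranked[1];
  return (
    second !== undefined &&
    second.candidate.action !== best.candidate.action &&
    best.score - second.score < MIN_SEPARATION
  );
}
```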
The old pipeline skipped the LLM for regex-matched messages (roughly 60% of traffic). The new pipeline always calls it, but in parallel.
| Phase | Old pipeline | New pipeline |
|---|---|---|
| Regex patterns | ~1ms | ~1ms (parallel) |
| LLM classification | ~200ms (only if regex missed) | ~200ms (always, parallel) |
| Multi-intent parsing | ~1ms (separate step) | ~1ms (parallel) |
| Scoring | N/A | ~1ms |
| Total for regex-matched | ~1ms | ~201ms |
| Total for LLM-required | ~201ms | ~201ms |
A misclassification costs the user a coin, a rephrase, and 3 to 5 seconds of wasted generation. We consider 200ms an acceptable tradeoff for the 60% of traffic that previously short-circuited.
The existing matchIntentPatterns() function is untouched; we added matchAllIntentPatterns() alongside it. The IntentClassificationResult interface is unchanged. If the LLM call fails, the pipeline falls back to whatever candidates remain, which is strictly better than the old fallbackClassification function with its own separate keyword logic.
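The fan-out itself is small. In the sketch below, matchAllIntentPatterns() is the real function; classifyWithLLM() and parseMultiIntent() are stand-in names for the other two sources, and Promise.allSettled is what lets the regex and multi-intent candidates survive an LLM failure:

```typescript
// Sketch of the parallel fan-out. Only matchAllIntentPatterns() is named in the
// post; the other two declarations are placeholders for the real classifiers.
declare function matchAllIntentPatterns(message: string): ClassificationCandidate[];
declare function classifyWithLLM(message: string): Promise<ClassificationCandidate[]>;
declare function parseMultiIntent(message: string): ClassificationCandidate[];

async function collectCandidates(message: string): Promise<ClassificationCandidate[]> {
  const results = await Promise.allSettled([
    Promise.resolve(matchAllIntentPatterns(message)),  // ~1ms
    classifyWithLLM(message),                          // ~200ms, now always called
    Promise.resolve(parseMultiIntent(message)),        // ~1ms
  ]);

  // If the LLM call rejects, scoring proceeds with whatever candidates remain.
  return results
    .filter((r): r is PromiseFulfilledResult<ClassificationCandidate[]> => r.status === "fulfilled")
    .flatMap((r) => r.value);
}
```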
Verification
We tested the scoring formula against 47 messages that were previously misclassified.
| Scenario | Count | Old winner | New winner | Correct? |
|---|---|---|---|---|
| Regex matched wrong keyword | 31 | Regex (early return) | LLM (ambiguity penalty flipped it) | 31/31 |
| Multi-intent dropped second action | 9 | First action only | First + followUp confirmation | 9/9 |
| Close confidence, wrong pick | 7 | Higher raw confidence | Context-aligned candidate | 6/7 |
The one failure: a message where both the regex and the LLM agreed on the wrong action. The agreement bonus amplified a shared mistake. When both classifiers are wrong, scoring cannot help.
We also ran the full existing test suite: 261 tests across 30 test files. Zero regressions.
What this taught us
Ambiguity is measurable. The ratio of anti-signals to total signals is a reliable proxy for misclassification risk, and the scoring weight that made it work (0.35) came from testing against real failures, not from intuition.
Existing infrastructure often has the right fields but missing wiring. followUpAction, pendingAction, and shouldExecutePendingAction() all existed before this work. The multi-intent confirmation flow required zero new data structures.
What's next
Two concrete items on the backlog. The ActionChainer module already detects data-flow dependencies ("analyze my weak points then quiz me on them"); wiring it into the pipeline would let the second action use output from the first automatically. For the 200ms penalty on regex-matched messages, an LLM fast-path cache could skip the LLM call for patterns where regex and LLM have agreed more than 95% of the time historically.
Postscript: Referent Resolution for Pinned Content
Three weeks later, a user pinned "Quiz: the Algebra" and asked "what can you say about my performance on this." Cereby replied: "Algebra: 0.00%, Stable. It looks like you're just getting started in this subject."
The classifier was correct: analyzeperformance. The bug was one layer down. The LLM read the pinned context section (which emitted only "quiz: Quiz: the Algebra"), extracted subject: "Algebra" from the title, and analyzePerformance(context, "Algebra") filtered quizHistory by that string and found zero records. The extraction looked locally reasonable given the prompt it saw. Nothing in the parallel scorer catches this.
Why a single fix wasn't enough
We had three options, and shipped all three.
Option A (fix the prompt): give the LLM richer pinned-item descriptors so it stops extracting subjects from titles. Root cause repair, but trusts the LLM to follow new instructions.
Option B (fix the override): overrideForPinnedPerformance already wiped parameters.subject when the LLM returned askquestions with a pinned assessment. Extending it to also cover analyzeperformance catches the case the original guard missed.
Option C (fix the analyzer): analyzeAllSubjects had a pinned-content branch that bypassed quizHistory lookup. Its sibling analyzePerformance(subject) did not. Extracting the shared detection logic into tryBuildFromPinned gave both paths the same behavior.
Each fixes the bug in isolation and fails differently when something else breaks. Defense in depth, because referent resolution is fragile by nature.
What we changed
The pinned context section of the prompt now emits full per-item metadata:
- quiz "Quiz: the Algebra" (id=12345, subject=Algebra, attempts=2, avgScore=60%, topics=[Linear Equations, Quadratic Functions])
- note "AP Biology Overview" (id=note-abc, subject=Biology, words=1820, lastUpdated="2 days ago")
- flashcard-set "Midterm Practice" (id=42, subject=Calculus, sessions=1, avgScore=75%)
plus an instruction: when the user says "this," "that," or "it," the referent is one of the pinned items above. For performance queries on a pinned assessment, leave parameters.subject empty. We added few-shot examples showing the desired empty-parameter output.
The override guard and the shared tryBuildFromPinned helper were implemented as described in the options above. Prompt plus override plus handler-side defense is now our default posture for "action was right, referent was wrong" bugs.
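Roughly the shape of those two handler-side defenses, with the context and quiz types reduced to the fields that matter here (the real signatures differ):

```typescript
// Rough shape of the handler-side defenses. Only the function names come from
// the codebase; the types and signatures here are simplified stand-ins.
interface PinnedQuiz {
  id: string;
  title: string;
  subject: string;
  attempts: number;
  avgScore: number;
}

interface PerformanceContext {
  pinnedQuiz?: PinnedQuiz;
  quizHistory: { subject: string; score: number }[];
}

// Option B, broadened: if the LLM extracts a subject for a performance query
// while an assessment is pinned, wipe it so the pinned item wins.
function overrideForPinnedPerformance(
  action: string,
  parameters: { subject?: string },
  ctx: PerformanceContext
): void {
  const performanceActions = ["askquestions", "analyzeperformance"];
  if (ctx.pinnedQuiz && performanceActions.includes(action)) {
    delete parameters.subject;
  }
}

// Option C: shared pinned-content detection. Both analyzeAllSubjects and
// analyzePerformance(subject) call this before falling back to a quizHistory lookup.
function tryBuildFromPinned(
  ctx: PerformanceContext
): { title: string; attempts: number; avgScore: number } | null {
  if (!ctx.pinnedQuiz) return null;
  return {
    title: ctx.pinnedQuiz.title,
    attempts: ctx.pinnedQuiz.attempts,
    avgScore: ctx.pinnedQuiz.avgScore,
  };
}
```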
Verification
Three new test files cover the fix: one for enriched pinned descriptors in the context builder, one for the broadened override guard (LLM-returned analyzeperformance with subject="Algebra" now wipes the subject; generatequiz is untouched), and one that confirms analyzePerformance(context, "Algebra") returns the pinned-quiz analysis (60% avg over 2 attempts) rather than an empty subject-filter result.
The original bug case ("performance on this" with pinned "Quiz: the Algebra") now returns the actual attempt history with sparkline trajectories, weak-point breakdown, and topic-level scores.
Questions: engineering@cereby.ai. Product: cereby.ai. Previous post: Inside Cereby's Intent Classifier.
