How We Fix Typos Without Wild Guesses
A small, honest normalization layer for the hot path: phrase rules, lexicon-backed fuzzy repair, and a conservative tie policy that leaves the original message intact for the model.
The problem with trusting raw text
Cereby runs a fast path before the language model. Pattern matchers, heuristics, and lightweight gates make routing decisions based on what the user typed. When the user types cleanly, this works. When they type "flaskcards" instead of "flashcards," or open with "test me on chapter 3" instead of the canonical "quiz on chapter 3," the matchers miss. The model would have understood both variants without blinking. The deterministic matchers do not.
The naive approach is to just ask the model to handle it. But that costs tokens and latency on every message, including the ones that the fast path could have routed in microseconds. The opposite extreme, running a general-purpose spellchecker over every message, is worse: a spellchecker "fixing" arbitrary words into the wrong intent keywords creates false positives that are quiet and hard to debug.
We needed something narrower. Only correct toward vocabulary that matters for routing. Only correct when the evidence points at a single best answer. And never touch what the model sees.
The pipeline
Normalization runs in two ordered stages.
Stage one: phrase and synonym rules. An ordered list of regular-expression rules rewrites multi-word spans into forms the matchers already understand. "Test me on" becomes "quiz on." Longer patterns run before single-token work so context beats isolated word fixes.
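In code, stage one is just an ordered substitution pass. A minimal sketch, with an illustrative rule list rather than the production table:

```python
import re

# Ordered phrase/synonym rules. The second rule is hypothetical; the real table
# is longer and sorted so that longer, more specific patterns run first.
PHRASE_RULES = [
    (re.compile(r"\btest me on\b", re.IGNORECASE), "quiz on"),
    (re.compile(r"\bgo over\b", re.IGNORECASE), "review"),
]

def apply_phrase_rules(text: str) -> str:
    """Rewrite multi-word spans into forms the downstream matchers understand."""
    for pattern, replacement in PHRASE_RULES:
        text = pattern.sub(replacement, text)
    return text

# apply_phrase_rules("Test me on chapter 3") -> "quiz on chapter 3"
```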
Stage two: token scan. For each word-like token (alphabetic, bounded length), the pipeline asks three questions in order. Is it already in the canonical set? Keep it. Is it in the blocklist? Keep it. Neither? Run the fuzzy step.
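Stage two, sketched under the same caveat. `fuzzy_repair` is a placeholder here for the lexicon lookup described in the next section; returning `None` means "leave the token alone":

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z]+")
MAX_TOKEN_LEN = 20  # illustrative bound on what counts as a word-like token

def fuzzy_repair(token: str, canonical: set[str]) -> str | None:
    """Placeholder for the fuzzy step sketched below; None means abstain."""
    return None

def scan_tokens(text: str, canonical: set[str], blocklist: set[str]) -> str:
    """Stage two: ask the three questions, in order, for every word-like token."""
    def repair(match: re.Match) -> str:
        token = match.group(0)
        lower = token.lower()
        if len(lower) > MAX_TOKEN_LEN:
            return token                    # not a candidate for repair
        if lower in canonical:
            return token                    # already canonical: keep it
        if lower in blocklist:
            return token                    # explicitly protected: keep it
        return fuzzy_repair(lower, canonical) or token  # fuzzy step, may abstain
    return TOKEN_RE.sub(repair, text)
```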
Fuzzy repair: Damerau-Levenshtein with pruning
For tokens that are neither canonical nor blocklisted, we look for a match in the intent lexicon using Damerau-Levenshtein distance (minimum edits via insert, delete, substitute, or transposition of adjacent characters). The transposition operation, which counts an adjacent swap as one edit instead of two, is what distinguishes it from plain Levenshtein; together with the other edit types it handles garbled inputs like "flaskcards" and "fsalchcards" correctly.
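One common way to implement it is the restricted (optimal string alignment) variant, sketched below; it covers exactly the four operations above:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein: insert, delete, substitute,
    plus transposition of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent transposition
    return d[-1][-1]

# A swapped adjacent pair costs one edit here; plain Levenshtein would charge two.
```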
Two things keep this cheap:
Length pruning. If the absolute difference between the token length and a candidate's length already exceeds the per-token maximum edit distance, the dynamic programming step is skipped entirely. Keywords in the intent lexicon are short, and the length window is small, so the number of candidates that reach the DP step is typically tiny.
Strict thresholds by token length. Short tokens get a tighter cap on allowed distance because collisions are more common. A two-character token that is one edit away from a keyword is more likely to be a different word than a typo.
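Putting the two together, the candidate search might look like the sketch below. Lexicon contents and threshold values are illustrative, not the production numbers, and it reuses the `damerau_levenshtein` sketch above:

```python
INTENT_LEXICON = {"quiz", "flashcards", "review", "summary", "chapter"}

def max_distance_for(token: str) -> int:
    """Shorter tokens get a tighter cap, because one-edit collisions are common."""
    if len(token) <= 3:
        return 0   # never fuzz very short tokens
    if len(token) <= 5:
        return 1
    return 2

def score_candidates(token: str) -> list[tuple[str, int]]:
    """Score lexicon entries, skipping the DP whenever the length gap alone rules one out."""
    cap = max_distance_for(token)
    scored = []
    for word in INTENT_LEXICON:
        if abs(len(word) - len(token)) > cap:
            continue                              # length pruning: no DP table built
        dist = damerau_levenshtein(token, word)   # sketch from the previous section
        if dist <= cap:
            scored.append((word, dist))
    return scored
```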
The policy layer: when not to correct
Edit distance alone does not decide the outcome. After candidates are scored, a policy layer decides whether to adopt a correction, break a tie, or abstain:
| Situation | Decision |
|---|---|
| Exactly one minimum-distance winner | Adopt the correction |
| Multiple candidates tie | Apply deterministic tie-breakers (longest unique candidate, optional plural/singular agreement) |
| Still ambiguous after tie-breaking | Leave the token unchanged |
| Token is in the blocklist | Leave the token unchanged regardless of edit distance |
The blocklist handles tokens that look like typos of intent keywords but mean something else in context. The tie-abstention rule is the more important one in practice. A skipped correction is invisible to the user. A wrong correction silently routes them to the wrong intent. We picked the trade-off deliberately.
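The policy table as code, as a sketch: the tie-breaker ordering and the plural/singular check shown here are illustrative choices, not the exact production rules.

```python
def choose_correction(token: str,
                      scored: list[tuple[str, int]],
                      blocklist: set[str]) -> str | None:
    """Apply the policy table; None means leave the token unchanged."""
    if token in blocklist or not scored:
        return None                                  # abstain
    best = min(dist for _, dist in scored)
    winners = sorted(word for word, dist in scored if dist == best)
    if len(winners) == 1:
        return winners[0]                            # exactly one minimum-distance winner
    # Illustrative tie-breakers: a unique longest candidate, then plural/singular agreement.
    longest = max(len(w) for w in winners)
    by_length = [w for w in winners if len(w) == longest]
    if len(by_length) == 1:
        return by_length[0]
    same_plurality = [w for w in by_length
                      if w.endswith("s") == token.endswith("s")]
    if len(same_plurality) == 1:
        return same_plurality[0]
    return None                                      # still ambiguous: keep the original
```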
What this taught us
Abstention is a product decision. The normalizer leaving a token unchanged is not a failure. It is the right answer when the evidence does not support a confident correction. A skipped correction is invisible to the user; a wrong one silently routes them somewhere they did not ask to go.
Scope the algorithm. Edit distance is the right tool inside a curated lexicon. Applied to arbitrary English, it will find creative and incorrect matches. The length-gap pruning that keeps it cheap is not an optimization bolted on later; without it, every medium-length token would run a full DP comparison against every lexicon entry on every message.
