Performance analysis without the pin
How we rebuilt the chat surface so users could actually ask about a subject and get an answer.
The placeholder that should not have been there
A user types "analyze my performance in math." We had a real answer ready. A whole pipeline of attempt normalization, topic aggregation, weak-point detection, and study-consistency scoring, all of it sitting one prop away from being rendered. Instead the chat showed:
Per-source attempt detail and question-by-question review will appear here once the pinned assessment finishes loading. If this persists, pin the assessment again from the @ menu.
The product was telling the user to do something they had no reason to know about. That message is technically accurate. The system was waiting for context that would never come, because nobody pins a quiz before asking a question about it. The natural thing to do, the thing every user actually does, is type the subject and expect a real answer.
That gap is where this work started.
Two paths, one component
The detailed UI was built for one path: pinned content. When a user @-tagged a quiz in the input, the chat already had its full identifier and could resolve attempt history directly from local storage or the server. The renderer asked a single question, "did the user pin something?" If yes, render the rich accordion with per-source breakdown. If no, render the placeholder above and wait.
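In sketch form, that branch looked something like this; the component and data names here are illustrative, not the codebase's:

```ts
// Illustrative sketch of the old single-path renderer; names are hypothetical.
interface SourceSummary {
  id: string;
  title: string;
  attempts: number;
  avgScore: number; // 0-100
}

function renderPerformanceDetail(pinnedSummaries: SourceSummary[]): string {
  // The single question the renderer asked: did the user pin something?
  if (pinnedSummaries.length === 0) {
    // The subject-tool path always landed here, even with real data in hand.
    return "Per-source attempt detail will appear here once the pinned assessment finishes loading.";
  }
  // Rich per-source breakdown, one entry per pinned assessment.
  return pinnedSummaries
    .map(s => `${s.title}: ${s.attempts} attempts, avg ${s.avgScore}%`)
    .join("\n");
}
```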
This was reasonable when pinning was the only way performance analysis was triggered. But the chat agent also exposed a tool for analyzing performance by subject. When a user typed "how am I doing in algebra," the tool ran. The aggregate analyzer produced real data. The narrative came back. Then the rich UI looked at the pinned-summary count, saw zero, and showed the placeholder anyway.
Two execution paths converged on the same component, with one path always losing.
The shape we landed on
The fix is additive, not a rewrite. The pinned flow stayed byte-for-byte unchanged. We added a second producer of the same per-source summary shape, gated on a different condition.
The handler decides which producer runs. The component does not change. A reader of the rendering branch sees the same data type no matter where it came from. Pinned users notice nothing. Everyone else stops seeing the placeholder.
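A minimal sketch of that gate, reusing the SourceSummary shape from the sketch above; the producer names are assumptions:

```ts
// Hedged sketch of the handler-level gate. Producer names are assumptions;
// both resolve to the same SourceSummary[] shape the component already renders.
async function resolveSummaries(
  pinnedIds: string[],
  subjects: string[],
  producers: {
    fromPins: (ids: string[]) => Promise<SourceSummary[]>;
    fromSubjects: (subjects: string[]) => Promise<SourceSummary[]>;
  },
): Promise<SourceSummary[]> {
  // Original path, byte-for-byte unchanged: pinned content wins when it exists.
  if (pinnedIds.length > 0) return producers.fromPins(pinnedIds);
  // New path: classify the user's catalog against the named subjects.
  return producers.fromSubjects(subjects);
}
```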
First version: keyword matching, and why it failed
The first cut of the subject classifier was deliberately dumb. Pull every quiz, flashcard, and exam belonging to the user. Walk each one. Check whether the requested subject appeared as a substring in the title, in the tags, in the embedded learning-content metadata, or in the inner quiz topic. Aliases were normalized before searching: "math" became "Mathematics", "bio" became "Biology", and the search tried both forms.
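Reconstructed as a sketch (alias table abbreviated, all names hypothetical):

```ts
// Sketch of the retired keyword classifier. Alias table abbreviated.
const SUBJECT_ALIASES: Record<string, string> = {
  math: "Mathematics",
  bio: "Biology",
};

interface MaterialMeta {
  id: string;
  title: string;
  tags: string[];
  topics: string[];
}

function keywordMatch(material: MaterialMeta, subject: string): boolean {
  const lower = subject.toLowerCase();
  // Search both the raw form and the normalized alias.
  const forms = [lower, (SUBJECT_ALIASES[lower] ?? lower).toLowerCase()];
  const haystack = [material.title, ...material.tags, ...material.topics]
    .join(" ")
    .toLowerCase();
  return forms.some(form => haystack.includes(form));
}

// The failure described below, in one line:
// keywordMatch({ id: "qgeo", title: "Quiz: geometry", tags: [], topics: [] }, "math") === false
```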
This worked when the connection was lexical. The user said "algebra," and a quiz titled "Algebra Diagnostic" matched. The user said "biology," and a quiz tagged with biology matched. We shipped it, opened the debug page on a real account, and tested.
Then we typed "math."
Zero matches.
The user had a quiz titled "Quiz: geometry." Geometry is mathematics. Every literate human knows that. A substring search does not.
That was the moment we knew keyword matching was the wrong primitive.
LLM classification as the primary
We ripped the keyword path out and replaced it with a single call to Gemini 2.5 Flash-Lite.
The prompt is plain: here are the subjects the user asked about, here are titles, topics, and tags for every quiz and flashcard they own; for each subject, return which materials belong to it. The model has been doing this kind of light classification at human-grade quality for two years. Flash Lite does it cheaply and quickly enough to call on every analyze-performance request.
A few decisions worth naming.
The model is hardcoded. Whatever the user picked for chat does not flow into the classifier. Classification is independent of conversation style and should not burn the user's chosen budget. Flash Lite is cheap enough that we can call it on every relevant request without thinking about it.
The temperature is 0.1. We want the same materials surfaced for the same query.
There is no keyword fallback. If the call fails or the response is malformed, all named subjects are reported as unmatched, and the UI shows an empty state with suggestion chips for generating new material on those subjects. We considered a hybrid where keyword matching backfilled the LLM, but adding a less-accurate fallback to a more-accurate primary makes the user experience harder to reason about. Strict failure surfaces real problems; soft failure hides them.
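Put together, the call looks roughly like this, reusing the MaterialMeta shape from the keyword sketch. The gateway surface and prompt wording are assumptions; the model id, temperature, and strict-failure behavior follow the decisions above, and the brace-extraction step stands in for the codebase's parseLLMJson:

```ts
// Hedged sketch of the classifier call. The `complete` gateway signature is an
// assumption; model id, temperature, and failure behavior are as described.
async function classifySubjects(
  subjects: string[],
  materials: MaterialMeta[],
  complete: (req: { model: string; temperature: number; prompt: string }) => Promise<string>,
): Promise<Record<string, string[]>> {
  const prompt = [
    `Subjects: ${JSON.stringify(subjects)}`,
    `Materials: ${JSON.stringify(materials)}`,
    "For each subject, return a JSON object mapping the subject name",
    "to the ids of the materials that belong to it.",
  ].join("\n");

  const raw = await complete({
    model: "gemini-2.5-flash-lite", // hardcoded: independent of the user's chat model
    temperature: 0.1,               // same materials surfaced for the same query
    prompt,
  });

  // Stand-in for parseLLMJson: tolerate prose or fences around the JSON body.
  const match = raw.match(/\{[\s\S]*\}/);
  try {
    if (match) return JSON.parse(match[0]);
  } catch {
    // fall through to strict failure
  }
  // No keyword fallback: a failed call reports every subject as unmatched,
  // and the UI renders the named empty state with suggestion chips.
  const unmatched: Record<string, string[]> = {};
  for (const s of subjects) unmatched[s] = [];
  return unmatched;
}
```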
After the swap, "math" found both the algebra quiz and the geometry quiz. "Calculus" would find a "Derivatives Practice" quiz. "Literature" would find "Romeo and Juliet Test." The same primitive that lets the chat understand a user's request now decides which of their materials count toward that request.
A walkthrough
A student getting ready for finals has three materials in their account:
- quiz "Quiz: the Algebra" (id=qalg, 3 attempts, avgScore 60%, topics=[Linear Equations, Quadratic Functions])
- quiz "Quiz: geometry" (id=q
geo, 1 attempt, avgScore 80%, topics=[Triangles, Circles])flashcards "Spanish Vocab Set 1" (id=fces, 2 sessions, tags=[spanish])
Nothing pinned, nothing tagged manually. They open the chat and type:
analyze my performance in math and science
The handler splits the request into two subjects, sees no pinned content, and hands the classifier the full material catalog.
classifier input:
subjects: ["Mathematics", "Science"]
materials: [qalg, qgeo, fces] // titles, topics, tags as above
classifier response (Gemini 2.5 Flash-Lite, temperature 0.1):
{
"Mathematics": ["qalg", "qgeo"],
"Science": []
}
The component builds per-source summaries from the matched IDs:
Mathematics (2)
qalg → 3 attempts, avgScore 60%, sparkline + topic breakdown, openByDefault: true
qgeo → 1 attempt, avgScore 80%, sparkline + topic breakdown, openByDefault: false
Science (0)
empty-state: "We couldn't find any quizzes or flashcards related to Science yet. Want me to make some?"
chips: [create quiz on Science, generate flashcards on Science, build learning path]
Both math cards render under an "In Mathematics (2)" header with the higher-attempt-count quiz expanded. The Science block renders below as a named empty state instead of falling back to the comprehensive readout. If somebody asks about Science, we owe them an honest answer about Science; the chips turn the empty answer into a starting point. Tapping one submits the prompt as a fresh chat turn, and two minutes later the student has practice material they did not have when they opened the chat.
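The mapping from matched ids to those render-ready cards is mechanical; a sketch, with field names assumed and the openByDefault rule following the walkthrough (highest attempt count expands first):

```ts
// Sketch of the id-to-summary mapping. Field names are assumptions;
// openByDefault follows the walkthrough: highest attempt count expands first.
interface MatchedMaterial {
  id: string;
  title: string;
  attempts: number;
  avgScore: number;
  topics: string[];
}

function toSummaries(matchedIds: string[], catalog: Map<string, MatchedMaterial>) {
  const picked = matchedIds.flatMap(id => {
    const m = catalog.get(id);
    return m ? [m] : []; // drop any id that is not actually in the catalog
  });
  const maxAttempts = Math.max(0, ...picked.map(m => m.attempts));
  return picked.map(m => ({ ...m, openByDefault: m.attempts === maxAttempts }));
}
```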
The contrast case is the one that killed keyword matching. The same student types "how am I doing in math":
substring scan:
"math" in "Quiz: the Algebra" → false
"math" in "Quiz: geometry" → false
"math" in "Spanish Vocab Set 1" → false
matched: [] → renders placeholder
Gemini 2.5 Flash-Lite:
{ "Mathematics": ["qalg", "qgeo"] } → renders both quizzes
Same materials, same metadata, two different answers. No title literally contains "math," so the substring scan returns nothing. The classifier returns both quizzes because geometry is mathematics. That gap is the entire reason the primitive changed.
Before and after
| Before | After |
|---|---|
| "Analyze my performance in math" returned a placeholder asking the user to pin first | The same query returns the full per-source breakdown, automatically |
| Subject coverage was substring matching on titles and tags | Subject coverage is semantic, via Gemini 2.5 Flash-Lite |
| "math" missed any quiz titled "Geometry" or "Algebra" | "math" surfaces both, plus calculus, statistics, and anything else the model recognizes as math |
| Single subject only | "math, science" runs a classification per subject and renders separate groups |
| No matches showed a placeholder | No matches show the subject by name and three chips that generate fresh material |
What this taught us
The hardest fix was finding the one prop where rendering decided to give up. The pipeline had been right for months; the bug was a single boolean question asked at the wrong altitude.
LLMs are absurdly good at the boring classification tasks we keep solving with regex. A Flash Lite call costs less and works better than the heuristics we used to write. If a feature needs a fuzzy "is X about Y" decision, that is now a service call, not a function.
`responseFormat: jsonObject` is not portable. Different providers have different surface areas, and the gateway does not paper over them. `parseLLMJson` exists in the codebase because we learned that lesson before; grep before writing the next workaround.
Pinned-only flows are easy to ship and easy to forget. When the natural-language version of the same query lands on a different code path, users do not know which form is "correct," so they pick the most natural one. That is the path that has to work.
Empty states with concrete next actions are worth more than empty states with apologies. "We couldn't find Algebra materials, here are three buttons that make some" is product. "No data" is a wall.
What's next
Two things on the backlog. We do not have semantic search on materials yet, so the classification call ships every relevant material's metadata. Embedding each material once at write time and pre-filtering with a similarity match would shrink the prompt and the latency, especially for power users with hundreds of quizzes. That is one migration and a backfill job. The other lever is auto-tagging at creation time. Quiz titles vary in quality, and "Untitled Quiz 3" still classifies because the inner content carries enough signal, but tagging at write time would remove the long tail of cases where the model has to guess.
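For the embedding pre-filter, the shape we have in mind is roughly this; nothing here exists yet, and every name is hypothetical:

```ts
// Hypothetical sketch of the planned pre-filter. Embeddings would be computed
// once at write time and stored alongside each material's metadata.
interface EmbeddedMaterial {
  id: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na * nb) || 1);
}

// Keep only the k most similar materials before building the classifier prompt,
// so a power user's hundreds of quizzes do not all ship in the request.
function prefilter(query: number[], materials: EmbeddedMaterial[], k = 50): EmbeddedMaterial[] {
  return materials
    .map(m => ({ m, score: cosineSimilarity(query, m.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(x => x.m);
}
```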
