Seeing Page Freezes in Our Error Dashboard
How a forty-line client component made one of the most frustrating user-facing failures finally show up where we already look.
The bug we could not see
A user emailed us a screenshot. It was Chrome's "Page Unresponsive" dialog with the two buttons that tell you the day is over: "Wait" and "Exit page." No console error. No network failure. No row in our error dashboard. From our side, the session looked perfectly healthy until it ended, which is the worst possible shape for a bug to have.
The dashboard we look at every morning is built around thrown exceptions. A component crashes, we get a row. A fetch fails, we get a row. A render loop blows the stack, we get a row. But a freeze is none of those things. The main thread is doing exactly what JavaScript told it to do. There is nothing to throw. The browser silently watches a heartbeat go missing for long enough that it puts up a dialog, and that dialog never dispatches an event we can listen for.
So we had no idea how often it was happening, on which pages, after which interactions, or whether one user was hitting it ten times an hour while everyone else was fine. We had a screenshot.
What we actually wanted
A freeze is invisible from inside the page that froze, but only while it is freezing. The moment the main thread comes back, the page can look around and notice that more time has passed than it should have. That is the trick the whole detector turns on.
We wanted three things from a freeze report:
The first was simply for a freeze to show up in the same dashboard everything else does. No new admin page, no new pipeline, no new place to remember to look on Monday morning. If we already triage React render errors and API failures in one place, freezes belong there too.
The second was useful grouping. A 7-second freeze and a 17-second freeze on the same page are almost always the same bug. The user clicked the same button, our code did the same expensive thing, the only difference was how much data happened to be loaded that day. If those two reports show up as separate rows because their durations are different, the dashboard fills with noise and the underlying bug looks rare.
The third was a hint at what was running. "Something blocked the main thread on this page for nine seconds" is a starting point. "Something blocked the main thread on this page for nine seconds, and during that window these long tasks ran with attribution to this script" is a starting point and a flashlight.
How a page notices it stopped
The shape we landed on is small enough to fit on a coffee shop napkin.
The browser gives every page a function called requestAnimationFrame whose job is "call me back before the next paint." On a healthy page that fires roughly sixty times a second. The detector is a single function that schedules itself with requestAnimationFrame, records the wall-clock time on each tick, and notices when more than five seconds have passed since the last tick fired. If that happens, the main thread was blocked, and we know exactly how long for: the gap.
That is the entire freeze detector in one sentence. Everything else is making the report useful.
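As a sketch in TypeScript (names and the callback shape are ours for illustration, not the production code), the heartbeat looks like this:

```typescript
const FREEZE_THRESHOLD_MS = 5_000;

// Pure check: given the last tick time and now, was the gap long enough
// to count as a freeze? Returns the gap in ms, or null if healthy.
function freezeGapMs(lastTickMs: number, nowMs: number): number | null {
  const gap = nowMs - lastTickMs;
  return gap > FREEZE_THRESHOLD_MS ? gap : null;
}

// Schedules itself with requestAnimationFrame. On a healthy page the gap
// between ticks is ~16 ms; a multi-second gap means the main thread was
// blocked, and the gap itself is the freeze duration.
function startFreezeDetector(onFreeze: (gapMs: number) => void): void {
  let lastTick = performance.now();
  const tick = (now: number) => {
    const gap = freezeGapMs(lastTick, now);
    if (gap !== null) onFreeze(gap);
    lastTick = now;
    requestAnimationFrame(tick);
  };
  requestAnimationFrame(tick);
}
```

The gap check is kept as a pure function so the threshold logic can be tested without a browser.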
For attribution we run a PerformanceObserver on the browser's "long task" stream, which fires for any single piece of work that holds the main thread for more than fifty milliseconds. We keep a small rolling buffer of the most recent twenty long tasks so that when a freeze is reported, the report carries a list of what the main thread was busy doing right before it choked. The browser sometimes attaches a script URL to a long task; when it does, that URL goes in the report too.
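A sketch of that observer, with the buffer capped at twenty entries. The `containerSrc` attribution field is what the Long Tasks API exposes when it can name a source; the record shape and names here are illustrative:

```typescript
type LongTaskRecord = {
  startTime: number;
  duration: number;
  script?: string;
};

const MAX_BUFFERED_TASKS = 20;
const recentLongTasks: LongTaskRecord[] = [];

// Rolling buffer: push the new item, then drop the oldest past the cap.
function pushCapped<T>(buf: T[], item: T, cap: number): void {
  buf.push(item);
  while (buf.length > cap) buf.shift();
}

function observeLongTasks(): void {
  const observer = new PerformanceObserver((list) => {
    for (const entry of list.getEntries()) {
      // Long task entries carry an attribution array; containerSrc is the
      // closest thing to "which script" the browser will give us here.
      const attribution = (entry as any).attribution?.[0];
      pushCapped(
        recentLongTasks,
        {
          startTime: entry.startTime,
          duration: entry.duration,
          script: attribution?.containerSrc || undefined,
        },
        MAX_BUFFERED_TASKS
      );
    }
  });
  observer.observe({ type: "longtask", buffered: true });
}
```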
For grouping, we normalize the page's URL before putting it in the report's message. A pathname like /admin/error-groups/7a30071b1c2d3e4f becomes /admin/error-groups/, so two freezes on different error-group pages share a fingerprint. The dashboard's existing fingerprinter lower-cases the message and replaces digits with a placeholder, which means freezes of different durations on the same normalized route also collapse into one group. That is the property we wanted: one row per kind of freeze, not one row per freeze.
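The normalization can be as small as dropping id-shaped path segments. This is a sketch of the heuristic, not an exact reproduction of our rules:

```typescript
// Drop path segments that look like record ids (long hex strings, uuids,
// or pure numbers), so /admin/error-groups/7a30071b1c2d3e4f and
// /admin/error-groups/9f2b... normalize to the same route.
function normalizeRoute(pathname: string): string {
  return pathname
    .split("/")
    .map((seg) =>
      /^[0-9a-f-]{8,}$/i.test(seg) || /^\d+$/.test(seg) ? "" : seg
    )
    .join("/")
    .replace(/\/{2,}/g, "/");
}
```

Ids in the middle of a path collapse the same way, so `/users/12345/settings` and `/users/67890/settings` also share a fingerprint.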
The boring detail that made it work
The first version of the detector reported a freeze every time the user came back from another tab.
When a tab is hidden, the browser stops scheduling animation frames, because there is no point painting something nobody is looking at. As soon as the user switches back, the next animation frame fires and notices that yes, in fact, several seconds have elapsed since the last tick. The detector was technically correct. The main thread had been idle. It had also not been doing anything wrong.
The fix is a small flag tied to the visibilitychange event. When the page becomes visible again, the detector marks the next tick as one to ignore. The first frame after a visibility transition is a known-bogus measurement and we throw it away. Without that flag the dashboard would fill with phantom freezes every time anyone alt-tabbed.
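In sketch form, the guard is one boolean and one decision (hypothetical names):

```typescript
let ignoreNextTick = false;

// When the tab becomes visible again, the next rAF gap measures time spent
// hidden, not a freeze -- mark it to be thrown away.
function armVisibilityGuard(): void {
  document.addEventListener("visibilitychange", () => {
    if (document.visibilityState === "visible") {
      ignoreNextTick = true;
    }
  });
}

// Called by the heartbeat before reporting a gap.
function shouldReportGap(gapMs: number, thresholdMs = 5_000): boolean {
  if (ignoreNextTick) {
    ignoreNextTick = false; // consume the flag: only the first tick is bogus
    return false;
  }
  return gapMs > thresholdMs;
}
```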
This is the kind of detail that does not appear in the design doc until the second draft of the design doc, because you have to actually run the thing once to find it. It is also the difference between a detector that operators trust and a detector that gets disabled in week two because it cried wolf.
A second guardrail sits at build time. The detector only arms itself in production builds. Local development runs hit hot-module reloads, debugger pauses, and React's strict-mode double renders, all of which can look like multi-second main-thread gaps to a heartbeat that does not know any better. None of those are real freezes. Gating the whole detector to production keeps the signal clean.
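The gate can be a single predicate, sketched here assuming a bundler that inlines `NODE_ENV` at build time (as webpack and Next.js do):

```typescript
// Arm only in production bundles, and only in a real browser (not during
// server-side rendering, where there is no main thread to watch).
function shouldArmDetector(
  nodeEnv: string | undefined,
  hasWindow: boolean
): boolean {
  return nodeEnv === "production" && hasWindow;
}

// Typical call site (armDetector is a hypothetical entry point):
// if (shouldArmDetector(process.env.NODE_ENV, typeof window !== "undefined")) {
//   armDetector();
// }
```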
What the report carries
When a freeze fires, the row that shows up in the dashboard has the shape of any other client error, with a few specifics.
| Field | What it carries |
|---|---|
| Message | "Page unresponsive: main thread blocked for 17s on /admin/error-groups/" |
| Level | warn for 5 to 15 seconds, error for 15 seconds and above |
| Scope | pagefreeze |
| Context | The exact freeze duration, the original pathname, the normalized route, and the rolling buffer of recent long tasks with their script attribution |
The fifteen-second cutoff for error level was not picked at random. It is roughly where Chrome itself decides the dialog needs to go up. Below that the page has frozen badly enough that we want to know, but the user probably did not see a system prompt. Above that we are in "Wait or Exit" territory, which is the bug class we cared about in the first place.
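As code, the mapping from gap to report fields is two small pure functions (a sketch; the 15-second line matches where Chrome tends to raise its dialog):

```typescript
const DIALOG_THRESHOLD_MS = 15_000;

function freezeLevel(gapMs: number): "warn" | "error" {
  // Below ~15 s the user likely saw jank but no system prompt; above it,
  // Chrome's "Wait / Exit page" dialog is probably on screen.
  return gapMs >= DIALOG_THRESHOLD_MS ? "error" : "warn";
}

function freezeMessage(gapMs: number, normalizedRoute: string): string {
  const seconds = Math.round(gapMs / 1000);
  return `Page unresponsive: main thread blocked for ${seconds}s on ${normalizedRoute}`;
}
```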
Stack traces are deliberately empty. There is no thrown exception to capture, only an observed gap. The long-task buffer is the closest thing to a stack we have, and it is more useful in practice than a synthetic one would be.
What this taught us
requestAnimationFrame is the only API that knows both when the main thread is blocked and when it has come back. Anything you build on setTimeout or setInterval either misses the freeze or misreports its duration, because timers are themselves at the mercy of the event loop they are trying to measure.
The visibility flag is not optional. Any page-level main-thread monitor that does not handle tab backgrounding will spam reports the first time a user comes back from lunch, and the second-week reaction to that is always to disable the monitor.
Grouping is what makes a new signal usable, not noise. The detector itself is small. The work that mattered was making sure two freezes that are obviously the same bug land in the same dashboard row, so triage is "here is one freeze on the editor page, ten occurrences this week" instead of "here are ten different rows you have to mentally merge."
A freeze that the page never recovers from is still invisible. If Chrome kills the tab, our heartbeat dies with it, and there is no second chance to phone home. Catching those would mean a separate watchdog running in a Web Worker. We have not built it yet, because we want to see how much of our freeze population is recoverable first. The ones we cannot see are by definition the ones we cannot count, and we would rather act on real numbers than a guess at the dark matter.
What's next
The first month of data will tell us whether five seconds is the right floor. If we are drowning in 5-to-7 second reports that are all the same one slow render path, we will fix that path and possibly raise the threshold. If a long tail of 30-second-plus freezes shows up, that is the signal to build the Web Worker watchdog and start catching the unrecoverable ones too.
Either way, the next time someone emails us a screenshot of the "Page Unresponsive" dialog, we will already have the row open.
