Voice-Powered Learning: Cereby's New Audio Capabilities
Designing study workflows for speaking and listening, not only typing and reading
The problem with typing everywhere
Most learning products assume you're sitting at a desk. Real study happens while commuting, walking, or cooking: contexts where typing is the bottleneck and eyes-on-screen time is scarce. Long, complex questions never get asked because pulling out a keyboard mid-commute is friction enough to kill the thought.
So we added voice as a transport layer, not a second product.
Two pipelines, one assistant
We shipped two capabilities. Dictation (push-to-talk) lets a learner speak a message, see a transcript, edit if needed, and send it through the same assistant that handles typed messages. The routing, tools, and memory behave identically. The microphone is pre-processing for the message bus, nothing more.
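To make "pre-processing for the message bus" concrete, here is a minimal sketch of the dictation path feeding the same send path as typed input. The names and endpoints (`transcribe`, `sendMessage`, `/api/stt`, `DraftMessage`) are illustrative assumptions, not Cereby's actual API.

```typescript
// Sketch: dictation is pre-processing; the send path is identical to typing.
interface DraftMessage {
  text: string;                 // transcript or typed text, identical downstream
  source: "typed" | "dictated";
}

// Hypothetical STT call: audio blob in, transcript out.
async function transcribe(audio: Blob): Promise<string> {
  const body = new FormData();
  body.append("audio", audio);
  const res = await fetch("/api/stt", { method: "POST", body });
  if (!res.ok) throw new Error(`STT failed: ${res.status}`);
  return ((await res.json()) as { transcript: string }).transcript;
}

// One send path: routing, tools, and memory see the same message shape
// regardless of how the text was produced.
async function sendMessage(draft: DraftMessage): Promise<void> {
  await fetch("/api/messages", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(draft),
  });
}
```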
The second capability is AI-generated podcasts. A learner picks a note, a thread, or a block of study material. A script generation pass rewrites it for listening: shorter sentences, explicit transitions, fewer tables. A TTS (text-to-speech) job then produces audio that lands in a library with playback speed control.
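A sketch of that two-stage pipeline: a script pass rewrites the source material for listening, then a TTS job renders audio. The endpoints and types below (`/api/podcast/script`, `/api/podcast/tts`, `PodcastEpisode`) are assumptions for illustration, not the production implementation.

```typescript
interface PodcastEpisode {
  id: string;
  audioUrl: string;
  durationSec: number;
}

// Stage 1: rewrite for the ear (shorter sentences, explicit transitions, fewer tables).
async function generateScript(sourceText: string): Promise<string> {
  const res = await fetch("/api/podcast/script", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ sourceText, style: "listenable" }),
  });
  if (!res.ok) throw new Error(`Script pass failed: ${res.status}`);
  return ((await res.json()) as { script: string }).script;
}

// Stage 2: synthesize audio from the rewritten script.
async function synthesizeEpisode(script: string): Promise<PodcastEpisode> {
  const res = await fetch("/api/podcast/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ script }),
  });
  if (!res.ok) throw new Error(`TTS job failed: ${res.status}`);
  return (await res.json()) as PodcastEpisode;
}

async function generatePodcast(sourceText: string): Promise<PodcastEpisode> {
  const script = await generateScript(sourceText);
  return synthesizeEpisode(script); // lands in the learner's library for playback
}
```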
Both paths depend on the same learner context already in Cereby. That dependency is the point.
Decisions that shaped the design
Hold-to-record rather than always-on. We use a push-to-talk interaction so learners opt in per utterance. Always-on microphones generate accidental sends; hold-to-talk slows you down by a second and cuts misfires dramatically. The client shows obvious recording state so there is no ambiguity about whether the mic is live.
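A minimal hold-to-record sketch using the browser MediaRecorder API: recording runs only while the button is held, and the visible recording state is toggled on the button itself. The DOM wiring and the `onRecorded` callback are illustrative, not the production client.

```typescript
async function attachPushToTalk(
  button: HTMLButtonElement,
  onRecorded: (audio: Blob) => void,
): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: Blob[] = [];

  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = () => {
    onRecorded(new Blob(chunks, { type: recorder.mimeType }));
    chunks.length = 0;
  };

  // Recording only while held: pointerdown starts, pointerup stops.
  button.addEventListener("pointerdown", () => {
    if (recorder.state !== "inactive") return;
    button.classList.add("recording"); // obvious "mic is live" state
    recorder.start();
  });
  button.addEventListener("pointerup", () => {
    button.classList.remove("recording");
    if (recorder.state === "recording") recorder.stop();
  });
}
```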
Preview before send. In noisy environments and on mobile, speech-to-text makes mistakes. Treating the transcript as an editable draft rather than an instant send is a trust feature, not a nicety. A visible transcript before submission means a bad STT result gets caught before it reaches the routing layer.
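A sketch of treating the transcript as an editable draft: the text lands in the composer and nothing reaches the routing layer until the learner explicitly confirms. The element ids and the injected `send` function are assumptions for illustration.

```typescript
function showTranscriptDraft(
  transcript: string,
  send: (text: string) => Promise<void>, // same send path the typed composer uses
): void {
  const composer = document.querySelector<HTMLTextAreaElement>("#composer");
  const sendButton = document.querySelector<HTMLButtonElement>("#send");
  if (!composer || !sendButton) return;

  composer.value = transcript; // editable draft: a bad STT result gets fixed here
  composer.focus();

  sendButton.onclick = () => {
    const text = composer.value.trim();
    if (text.length > 0) void send(text); // explicit confirm, never instant send
  };
}
```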
Podcast generation is priced at 12 Coins per episode. Visible metering matters: audio generation is more expensive than a text reply, and learners should see that cost before they generate a long episode, not discover it after retrying twice.
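One way visible metering can look in code, as a sketch: the 12-Coin price comes from the post, while the balance lookup and confirmation callback are hypothetical.

```typescript
const EPISODE_COST_COINS = 12;

// Returns true only if the learner sees the price and explicitly accepts it.
async function confirmEpisodeGeneration(
  getBalance: () => Promise<number>,
  askLearner: (message: string) => Promise<boolean>,
): Promise<boolean> {
  const balance = await getBalance();
  if (balance < EPISODE_COST_COINS) return false; // not enough Coins; no job started
  return askLearner(
    `Generating this episode costs ${EPISODE_COST_COINS} Coins (you have ${balance}). Continue?`,
  );
}
```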
What we learned operating it
Bad audio is usually bad prose for listening, not bad TTS. The script pass matters more than voice selection. A paragraph that works on a screen reads badly aloud because it is too dense, full of parentheticals, and assumes the eye can scan back. Rewriting for the ear is a distinct step.
Long podcasts have higher failure risk than short ones. A silent drop in a text assistant is annoying; a silent drop after waiting twenty seconds for audio erodes trust fast. Timeouts and partial jobs therefore get explicit failure states, and retries are kept cheap.
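A sketch of what "explicit failure states" can mean on the client: poll the job, surface "failed" with a reason instead of spinning forever, and make retry a plain re-submit. The endpoint shape and `PodcastJob` type are assumptions, not Cereby's actual job API.

```typescript
type PodcastJob =
  | { status: "queued" | "running" }
  | { status: "done"; audioUrl: string }
  | { status: "failed"; reason: string };

async function pollJob(jobId: string, intervalMs = 2000): Promise<PodcastJob> {
  for (;;) {
    const res = await fetch(`/api/podcast/jobs/${jobId}`);
    if (!res.ok) return { status: "failed", reason: `HTTP ${res.status}` };
    const job = (await res.json()) as PodcastJob;
    if (job.status === "done" || job.status === "failed") return job;
    await new Promise((r) => setTimeout(r, intervalMs)); // still queued or running
  }
}

// Usage: a failed job shows an error and offers a cheap retry instead of dropping silently.
async function generateWithRetryUi(startJob: () => Promise<string>): Promise<void> {
  const job = await pollJob(await startJob());
  if (job.status === "failed") {
    console.warn(`Episode generation failed: ${job.reason}. Offering retry.`);
  }
}
```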
Losing context parity was the problem we wanted to avoid most. Every time we considered making dictation "simpler" by limiting what tools it could call, we were sliding toward a voice-lite product that would not actually replace the keyboard. Keeping full parity was the right call.
Before and after
| Area | Before | After |
|---|---|---|
| Mobile study | High friction for long asks | Speak and listen on the go |
| Assistant parity | Text only | Same routing and context after STT |
| Modality mix | Read and write | Listen during dead time |
| Accessibility | Typing-bound | Speech input plus audio output |
What's next
Stronger offline and low-bandwidth behavior for playback. Clearer in-app signaling when STT confidence is low. Deeper observability on script quality and listen-through rates by style, so we can tune the script pass rather than guessing.
