Voice-Powered Learning: Cereby's New Audio Capabilities
Designing study workflows for speaking and listening, not only typing and reading
The problem with typing everywhere
Most learning products assume you're sitting at a desk. Real study happens while commuting, walking, or cooking: contexts where typing is the bottleneck and eyes-on-screen time is scarce. Long, complex questions never get asked because pulling out a keyboard mid-commute is friction enough to kill the thought.
So we added voice as a transport layer, not a second product.
Two pipelines, one assistant
We shipped two capabilities. Dictation (push-to-talk) lets a learner speak a message, see a transcript, edit if needed, and send it through the same assistant that handles typed messages. The routing, tools, and memory behave identically. The microphone is pre-processing for the message bus, nothing more.
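To make "pre-processing for the message bus" concrete, here is a minimal sketch of the dictation path feeding the same send path as typed input. The names and endpoints (`transcribe`, `sendMessage`, `/api/stt`, `DraftMessage`) are illustrative assumptions, not Cereby's actual API.

```typescript
// Sketch: dictation is pre-processing; the send path is identical to typing.
interface DraftMessage {
  text: string;                 // transcript or typed text, identical downstream
  source: "typed" | "dictated";
}

// Hypothetical STT call: audio blob in, transcript out.
async function transcribe(audio: Blob): Promise<string> {
  const body = new FormData();
  body.append("audio", audio);
  const res = await fetch("/api/stt", { method: "POST", body });
  if (!res.ok) throw new Error(`STT failed: ${res.status}`);
  return ((await res.json()) as { transcript: string }).transcript;
}

// One send path: routing, tools, and memory see the same message shape
// regardless of how the text was produced.
async function sendMessage(draft: DraftMessage): Promise<void> {
  await fetch("/api/messages", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(draft),
  });
}
```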
The second capability is AI-generated podcasts. A learner picks a note, a thread, or a block of study material. A script generation pass rewrites it for listening: shorter sentences, explicit transitions, fewer tables. A TTS (text-to-speech) job then produces audio that lands in a library with playback speed control.
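A sketch of that two-stage pipeline: a script pass rewrites the source material for listening, then a TTS job renders audio. The endpoints and types below (`/api/podcast/script`, `/api/podcast/tts`, `PodcastEpisode`) are assumptions for illustration, not the production implementation.

```typescript
interface PodcastEpisode {
  id: string;
  audioUrl: string;
  durationSec: number;
}

// Stage 1: rewrite for the ear (shorter sentences, explicit transitions, fewer tables).
async function generateScript(sourceText: string): Promise<string> {
  const res = await fetch("/api/podcast/script", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ sourceText, style: "listenable" }),
  });
  if (!res.ok) throw new Error(`Script pass failed: ${res.status}`);
  return ((await res.json()) as { script: string }).script;
}

// Stage 2: synthesize audio from the rewritten script.
async function synthesizeEpisode(script: string): Promise<PodcastEpisode> {
  const res = await fetch("/api/podcast/tts", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ script }),
  });
  if (!res.ok) throw new Error(`TTS job failed: ${res.status}`);
  return (await res.json()) as PodcastEpisode;
}

async function generatePodcast(sourceText: string): Promise<PodcastEpisode> {
  const script = await generateScript(sourceText);
  return synthesizeEpisode(script); // lands in the learner's library for playback
}
```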
Both paths depend on the same learner context already in Cereby. That dependency is the point.
Decisions that shaped the design
Hold-to-record rather than always-on. We use a push-to-talk interaction so learners opt in per utterance. Always-on microphones generate accidental sends; hold-to-talk slows you down by a second and cuts misfires dramatically. The client shows obvious recording state so there is no ambiguity about whether the mic is live.
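A minimal hold-to-record sketch using the browser MediaRecorder API: recording runs only while the button is held, and the visible recording state is toggled on the button itself. The DOM wiring and the `onRecorded` callback are illustrative, not the production client.

```typescript
async function attachPushToTalk(
  button: HTMLButtonElement,
  onRecorded: (audio: Blob) => void,
): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: Blob[] = [];

  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = () => {
    onRecorded(new Blob(chunks, { type: recorder.mimeType }));
    chunks.length = 0;
  };

  // Recording only while held: pointerdown starts, pointerup stops.
  button.addEventListener("pointerdown", () => {
    if (recorder.state !== "inactive") return;
    button.classList.add("recording"); // obvious "mic is live" state
    recorder.start();
  });
  button.addEventListener("pointerup", () => {
    button.classList.remove("recording");
    if (recorder.state === "recording") recorder.stop();
  });
}
```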
Preview before send. In noisy environments and on mobile, speech-to-text makes mistakes. Treating the transcript as an editable draft rather than an instant send is a trust feature, not a nicety. A visible transcript before submission means a bad STT result gets caught before it reaches the routing layer.
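A sketch of treating the transcript as an editable draft: the text lands in the composer and nothing reaches the routing layer until the learner explicitly confirms. The element ids and the injected `send` function are assumptions for illustration.

```typescript
function showTranscriptDraft(
  transcript: string,
  send: (text: string) => Promise<void>, // same send path the typed composer uses
): void {
  const composer = document.querySelector<HTMLTextAreaElement>("#composer");
  const sendButton = document.querySelector<HTMLButtonElement>("#send");
  if (!composer || !sendButton) return;

  composer.value = transcript; // editable draft: a bad STT result gets fixed here
  composer.focus();

  sendButton.onclick = () => {
    const text = composer.value.trim();
    if (text.length > 0) void send(text); // explicit confirm, never instant send
  };
}
```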
Podcast generation is priced at 12 Coins per episode. Visible metering matters: audio generation is more expensive than a text reply, and learners should see that cost before they generate a long episode, not discover it after retrying twice.
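One way visible metering can look in code, as a sketch: the 12-Coin price comes from the post, while the balance lookup and confirmation callback are hypothetical.

```typescript
const EPISODE_COST_COINS = 12;

// Returns true only if the learner sees the price and explicitly accepts it.
async function confirmEpisodeGeneration(
  getBalance: () => Promise<number>,
  askLearner: (message: string) => Promise<boolean>,
): Promise<boolean> {
  const balance = await getBalance();
  if (balance < EPISODE_COST_COINS) return false; // not enough Coins; no job started
  return askLearner(
    `Generating this episode costs ${EPISODE_COST_COINS} Coins (you have ${balance}). Continue?`,
  );
}
```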
What we learned operating it
Bad audio is usually bad prose for listening, not bad TTS. The script pass matters more than voice selection. A paragraph that works on a screen reads badly aloud because it is too dense, full of parentheticals, and assumes the eye can scan back. Rewriting for the ear is a distinct step.
Long podcasts have higher failure risk than short ones. A silent drop in a text assistant is annoying; a silent drop after waiting twenty seconds for audio erodes trust fast. Timeouts and partial jobs therefore get explicit failure states, and retries are kept cheap.
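A sketch of what "explicit failure states" can mean on the client: poll the job, surface "failed" with a reason instead of spinning forever, and make retry a plain re-submit. The endpoint shape and `PodcastJob` type are assumptions, not Cereby's actual job API.

```typescript
type PodcastJob =
  | { status: "queued" | "running" }
  | { status: "done"; audioUrl: string }
  | { status: "failed"; reason: string };

async function pollJob(jobId: string, intervalMs = 2000): Promise<PodcastJob> {
  for (;;) {
    const res = await fetch(`/api/podcast/jobs/${jobId}`);
    if (!res.ok) return { status: "failed", reason: `HTTP ${res.status}` };
    const job = (await res.json()) as PodcastJob;
    if (job.status === "done" || job.status === "failed") return job;
    await new Promise((r) => setTimeout(r, intervalMs)); // still queued or running
  }
}

// Usage: a failed job shows an error and offers a cheap retry instead of dropping silently.
async function generateWithRetryUi(startJob: () => Promise<string>): Promise<void> {
  const job = await pollJob(await startJob());
  if (job.status === "failed") {
    console.warn(`Episode generation failed: ${job.reason}. Offering retry.`);
  }
}
```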
Losing context parity was the problem we wanted to avoid most. Every time we considered making dictation "simpler" by limiting what tools it could call, we were sliding toward a voice-lite product that would not actually replace the keyboard. Keeping full parity was the right call.
Before and after
| Area | Before | After |
|---|---|---|
| Mobile study | High friction for long asks | Speak and listen on the go |
| Assistant parity | Text only | Same routing and context after STT |
| Modality mix | Read and write | Listen during dead time |
| Accessibility | Typing-bound | Speech input plus audio output |
What's next
Stronger offline and low-bandwidth behavior for playback. Clearer in-app signaling when STT confidence is low. Deeper observability on script quality and listen-through rates by style, so we can tune the script pass rather than guessing.
