# Why Cereby Supports Multiple AI Models (and How to Choose One)
From a fixed two-model stack to resilient, cost-aware routing across a curated multi-provider catalog.
## Two models was a ceiling, not a foundation
When we shipped Cereby's first AI features, the model strategy was simple: pick two defaults and move on. Simple is fine for early traffic. It becomes a problem as usage diversifies.
By the time we sat down to address it, three things were clearly wrong. First, most of our traffic was running through a single vendor, which meant their rate limits and outages were our rate limits and outages. Second, every task, whether a one-line clarification or a three-page document rewrite, went through the same latency and cost envelope. Short queries paid premium prices; hard queries sometimes landed on the wrong capability tier. Third, learners had no way to know which model they were using or why, which eroded trust whenever behavior changed unexpectedly.
Experimenting with new model families was also harder than it should have been. Adding a new provider meant touching client code in more than one place.
## The shape we landed on
We put a gateway in front of all model traffic and moved to a curated allowlist across multiple providers. The app talks to one integration contract regardless of which provider is serving the request. The allowlist and tier configuration live in one place that ops can update without touching client code.
The key insight is to separate "which model family" from "how the app talks to models." The gateway owns the provider-specific logic. The allowlist and tiers give product and ops room to move without engineering involvement.
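That separation can be sketched in a few lines. This is a minimal illustration of the idea, not Cereby's actual contract: the adapter names, model IDs, and config shape here are all hypothetical.

```python
# Sketch of a gateway with a curated allowlist, assuming hypothetical
# provider adapters ("acme", "beta") and model IDs. The real system's
# names and API shapes will differ.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class ModelEntry:
    model_id: str   # e.g. "acme/fast-1" (hypothetical)
    provider: str   # which adapter serves this entry
    tier: str       # "low" | "medium" | "high"

# The allowlist lives in configuration, not client code,
# so ops can add or rotate entries without redeploying the app.
ALLOWLIST = {
    "acme/fast-1":  ModelEntry("acme/fast-1", "acme", "low"),
    "beta/solid-2": ModelEntry("beta/solid-2", "beta", "medium"),
}

# Provider-specific logic stays behind one boundary: a map of adapters.
ProviderAdapter = Callable[[str, str], str]

def acme_adapter(model_id: str, prompt: str) -> str:
    return f"[acme:{model_id}] {prompt}"   # stand-in for a real API call

def beta_adapter(model_id: str, prompt: str) -> str:
    return f"[beta:{model_id}] {prompt}"

ADAPTERS: dict[str, ProviderAdapter] = {"acme": acme_adapter, "beta": beta_adapter}

def complete(model_id: str, prompt: str) -> str:
    """The single integration contract the app talks to."""
    entry = ALLOWLIST.get(model_id)
    if entry is None:
        # A bad or removed entry fails hard, not with silent quality drift.
        raise LookupError(f"model {model_id!r} is not on the allowlist")
    return ADAPTERS[entry.provider](entry.model_id, prompt)
```

The app only ever calls `complete`; which vendor answers is an allowlist detail.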
Not every feature uses the selector. Background pipelines and specialized paths that need determinism, throughput guarantees, or domain-specific quality stay pinned. When a pinned model is deprecated, the fix is an infra change, not a request for users to switch tiers.
## Tiers in the selector
Models are grouped into three tiers: low, medium, and high. The selector copy explains the rough performance and cost posture of each. Entitlements gate the premium tiers so upgrades are explicit rather than accidental.
Coins still track tokens. Tiers orient the choice but do not replace metering or change what you pay per token.
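The tier gate and the token meter are independent checks, which a short sketch makes concrete. The tier names come from the post; the helper names, the entitlement model, and the coin rate are assumptions for illustration.

```python
# Sketch of entitlement-gated tier selection plus per-token coin
# metering. Only the tier names (low/medium/high) come from the post;
# everything else here is a hypothetical stand-in.
TIER_ORDER = {"low": 0, "medium": 1, "high": 2}

def select_tier(requested: str, entitled_up_to: str) -> str:
    """Gate premium tiers on entitlement so upgrades are explicit,
    never accidental."""
    if TIER_ORDER[requested] > TIER_ORDER[entitled_up_to]:
        raise PermissionError(f"tier {requested!r} requires an upgrade")
    return requested

def coin_cost(tokens: int, coins_per_1k_tokens: int) -> int:
    """Coins still meter actual token usage; the tier only orients
    the choice. Rounds up via ceiling division."""
    return -(-tokens * coins_per_1k_tokens // 1000)
```

Note that `coin_cost` never looks at the tier: picking "low" does not cap your spend, it changes which models serve you, and the meter runs on tokens either way.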
## What this looks like in operations
Allowlist deploys became a routine ops lever. A bad entry shows up as hard 4xx or 5xx clusters, not silent quality drift. Monitoring per model ID matters because different entries can fail in different ways.
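Keying error counts by model ID rather than by provider is the point: one bad allowlist entry should stand out on its own. A minimal in-process sketch, assuming a hypothetical `record_response` hook; production would emit these as metrics to a monitoring system instead.

```python
# Sketch of per-model-ID failure tracking using an in-process counter.
# The hook name and key format are assumptions for illustration.
from collections import Counter

error_counts: Counter[str] = Counter()

def record_response(model_id: str, status: int) -> None:
    """Count hard failures (4xx/5xx) per model ID so a bad allowlist
    entry shows up as a distinct cluster instead of blending into
    aggregate error rates."""
    if status >= 400:
        error_counts[f"{model_id}:{status // 100}xx"] += 1

record_response("acme/fast-1", 200)   # success: not counted
record_response("acme/fast-1", 429)   # rate limit: counted as 4xx
record_response("beta/solid-2", 500)  # provider fault: counted as 5xx
```

A spike under a single `model_id:4xx` key points at a bad entry or exhausted key; spikes across many keys for one provider point at the provider.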
Spreading load across providers reduced the thundering-herd pressure on any single API key. Each provider's rate limits still apply and still move with that provider's traffic, but the blast radius of a throttle or outage is smaller when our traffic is not concentrated in one place.
Pinned paths need explicit ownership. When a pinned model reaches end of life, someone on the infrastructure side has to move it. That ownership has to be documented, not assumed.
## Before and after
| Area | Narrow two-model era | Multi-provider with tiers |
|---|---|---|
| Vendor risk | Concentrated on one provider | Spread across the allowlist |
| Task fit | Single latency/cost profile | Explicit tier selection |
| Learner trust | Opaque defaults | Visible tradeoffs, entitlement-aware premium |
| Operations | Model swaps required code changes | Allowlist updates without core rewrites |
## What this taught us
The copy in the selector matters as much as the tiers themselves. Without clear explanations and entitlement gates, users cannot reason about cost, and they will not trust the system the first time its behavior or their bill surprises them.
Not every pipeline should float with user model choice. The selector is the right tool for interactive surfaces. Background jobs and specialized pipelines are better served by stable, pinned defaults with explicit ownership. Provider-specific logic belongs behind one boundary; when it leaks into client code, adding or rotating a provider becomes an engineering project.
