Voice AI is becoming a full-stack problem

Cartesia's Sonic-3.5 and Ink-2 launch shows why voice AI is becoming a full-stack engineering problem, not just a nicer text-to-speech demo.

✍️jenuel.dev

Jun. 16, 2026. 8:04 AM

The next useful AI app may not look like a chatbot at all. It may sound like a calm support rep, a patient tutor, or a field assistant that can listen while someone is busy with both hands.

That is why Cartesia's new Sonic-3.5 and Ink-2 launch is worth paying attention to. The headline is not just another text-to-speech upgrade. The more important signal is that voice AI is turning into a full-stack engineering problem: speech-to-text, text-to-speech, latency, turn-taking, interruptions, safety checks, and tool calls all have to feel like one product.

For builders, this is the shift. A voice agent is not a language model with audio bolted on. It is a real-time system where every extra delay makes the product feel less intelligent.

What changed

Cartesia announced Sonic-3.5 for text-to-speech and Ink-2 for speech-to-text, positioning them as a paired stack for real-time voice agents. The company says Sonic-3.5 is built for naturalness, low latency, and 40+ languages, while Ink-2 focuses on transcription accuracy and fast turn-taking.

The most practical claim is the pipeline framing. Cartesia is selling STT and TTS as parts of the same real-time loop instead of two separate vendors that developers have to stitch together. Its launch page claims sub-90ms TTS latency and 100ms transcript latency, with native turn detection. If those numbers hold up in real applications, that matters more than a demo voice sounding slightly nicer.
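To put those numbers in context, here is a rough latency budget for a single turn. Only the 100 ms and 90 ms figures come from the launch claims quoted above. The LLM, tool-call, and network numbers are placeholder assumptions, not measurements, and the stage names are invented for illustration.

```python
# Rough latency budget for one voice-agent turn, in milliseconds.
# Only the STT and TTS figures come from the vendor claims quoted above;
# every other value is an assumption to replace with your own measurements.
BUDGET_MS = {
    "stt_final_transcript": 100,  # claimed transcript latency
    "llm_first_token": 350,       # assumption: depends on model and prompt
    "tool_call": 200,             # assumption: one backend lookup
    "tts_first_audio": 90,        # claimed upper bound ("sub-90ms")
    "network_overhead": 60,       # assumption: client <-> server round trips
}


def total_latency_ms(budget: dict[str, int]) -> int:
    """Sum the stages, assuming they run one after another in a turn."""
    return sum(budget.values())


def speech_layer_share(budget: dict[str, int]) -> float:
    """Fraction of the turn spent in STT plus TTS."""
    speech = budget["stt_final_transcript"] + budget["tts_first_audio"]
    return speech / total_latency_ms(budget)


if __name__ == "__main__":
    print(f"total: {total_latency_ms(BUDGET_MS)} ms")
    print(f"speech layer share: {speech_layer_share(BUDGET_MS):.0%}")
```

With these placeholder values the turn totals 800 ms and the speech layer is a little under a quarter of it. Real pipelines often overlap stages through streaming, so treat the sum as a pessimistic starting point and replace each entry with a measured value.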

Why developers should care

Voice agents fail in small moments. A half-second pause after every sentence feels robotic. Poor interruption handling makes users repeat themselves. Bad transcription turns a simple request into a support ticket. A beautiful generated voice is not enough if the agent cannot listen, stop, recover, and call tools quickly.

This is where the developer opportunity is. The best voice products will not be built by choosing the most impressive model in isolation. They will be built by measuring the whole conversation loop.

Support agents: detect intent, pull order data, answer naturally, and escalate when confidence drops.
Healthcare and field workflows: capture spoken notes while the user is working, then structure them for review instead of pretending the transcript is final truth.
Education apps: let students talk through a problem and interrupt the tutor when they are confused.
Internal tools: create voice interfaces for dashboards, incident updates, and hands-free task capture.

The common thread is not voice for novelty. It is voice where typing is slower, unsafe, or unnatural.
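The "escalate when confidence drops" step in the support-agent case could be sketched as a small decision function. This is a minimal illustration, not a recommended policy: the `Turn` fields, the thresholds, and the idea that an intent classifier returns a 0 to 1 confidence score are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    transcript: str
    intent: str
    intent_confidence: float  # 0.0 to 1.0, from a hypothetical intent classifier


# Assumed thresholds; tune them against logged conversations.
MIN_CONFIDENCE = 0.6
MAX_LOW_CONFIDENCE_TURNS = 2


def next_action(history: list[Turn]) -> str:
    """Return 'answer', 'clarify', or 'escalate' for the latest turn."""
    if not history:
        return "clarify"
    if history[-1].intent_confidence >= MIN_CONFIDENCE:
        return "answer"
    # Count consecutive low-confidence turns at the end of the history.
    low_streak = 0
    for turn in reversed(history):
        if turn.intent_confidence >= MIN_CONFIDENCE:
            break
        low_streak += 1
    return "escalate" if low_streak >= MAX_LOW_CONFIDENCE_TURNS else "clarify"
```

The agent asks one clarifying question after a single unclear turn and hands off after two in a row, so a user is never stuck in a loop of "Sorry, I did not catch that."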

The weak spots to test before shipping

I would not ship a production voice agent just because a model page says low latency. Builders should test the boring edge cases first.

Interruptions: can the user cut the agent off without the conversation state breaking?
Noisy audio: does transcription degrade gracefully in a cafe, car, warehouse, or cheap headset?
Accents and code-switching: does the system handle real users, not just studio samples?
Tool-call delay: what happens when the LLM or a backend API takes longer than the speech layer?
Consent and recording: is it clear when audio is captured, stored, or used for review?

The mistake is treating speech as a UI skin. Voice changes the trust model. People reveal more when they speak, and they notice awkward timing faster than they notice a slow web page.
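The interruption test above is easier to reason about with a toy model. The sketch below uses Python's standard `asyncio` to cancel playback when the user starts talking and to record what was actually heard. The `speak` and `run_turn` functions are hypothetical stand-ins: real TTS playback and voice-activity events would come from whichever speech stack is in use.

```python
import asyncio
from typing import Optional


async def speak(sentences: list[str], spoken: list[str]) -> None:
    """Stand-in for streaming TTS playback: one sentence per short sleep."""
    for sentence in sentences:
        await asyncio.sleep(0.1)  # pretend this is audio playing
        spoken.append(sentence)   # record only what was fully played


async def run_turn(sentences: list[str],
                   user_interrupts_after: Optional[float]) -> dict:
    """Play a reply, cancelling playback if the user starts talking."""
    spoken: list[str] = []
    playback = asyncio.create_task(speak(sentences, spoken))
    interrupted = False
    if user_interrupts_after is not None:
        # Stand-in for a voice-activity or turn-detection event.
        await asyncio.sleep(user_interrupts_after)
        if not playback.done():
            playback.cancel()
            interrupted = True
    try:
        await playback
    except asyncio.CancelledError:
        pass  # expected when the user barged in
    # Keep state consistent: remember what the user actually heard, so the
    # next model call does not assume the full reply was delivered.
    return {
        "interrupted": interrupted,
        "heard": spoken,
        "unheard": sentences[len(spoken):],
    }
```

For example, `asyncio.run(run_turn(reply, 0.15))` with a three-sentence `reply` stops after the first sentence and reports the other two as unheard. The design point is the return value: conversation state should reflect what was played, not what was generated.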

A practical builder checklist

If you are evaluating Sonic-3.5, Ink-2, or any competing voice stack, build a small benchmark around your own product instead of relying on generic leaderboard claims.

Measure time from user speech ending to agent response starting.
Track word error rate on your real vocabulary, including names, product terms, and acronyms.
Test barge-in behavior: interrupt the agent mid-sentence and see if it adapts.
Log every failed turn with audio, transcript, intent, tool call, and final response.
Decide when the agent should stop talking and ask a human to take over.

That last point is important. A good voice agent should not be endlessly confident. In many products, the trust-building moment is the handoff: 'I am not sure, so I am sending this to a person with the context attached.'
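The two measurement items in the checklist can be sketched in a few lines of plain Python. This is a minimal version under stated assumptions: word error rate is computed as word-level edit distance over reference length with only lowercasing as normalization, timestamps are in seconds, and the function names are illustrative rather than taken from any vendor SDK.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def response_latency_ms(speech_end_s: float, first_audio_s: float) -> float:
    """Gap between the user finishing and the agent's first audio, in ms."""
    return (first_audio_s - speech_end_s) * 1000.0


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("need at least one measurement")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]
```

Run `word_error_rate` on transcripts that contain your own names, product terms, and acronyms, and report `p95` latency alongside the mean, since users notice the slow turns rather than the average one.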

The bigger signal

The AI industry has spent years making models that can answer. The next competition is around systems that can participate. Voice makes that obvious because participation has rhythm: listening, pausing, interrupting, confirming, and acting.

Cartesia's launch is one signal of that shift. Whether or not its models become the default stack, the direction is clear: builders need to think less about isolated model calls and more about complete interaction loops.

For developers, the question is no longer 'Can I add a voice mode?' The better question is: 'Where would a fast, interruptible, trustworthy conversation make this product meaningfully better?'
