Building Production Voice AI: What We've Learned

The Demo Problem

Voice AI demos are remarkable. You speak, and an AI responds in natural language a moment later. It feels like the future.

Then you try to ship it.

In a demo, you control the environment. Good microphone, quiet room, predictable questions, no concurrent users. In production, users are calling from cars on highways, asking questions your demo never anticipated, and you have 10,000 of them doing it simultaneously.

Voice AI is a systems engineering problem disguised as an AI problem. Here's what we've learned building production voice systems.

The Latency Budget

Natural conversation has a rhythm. When you ask someone a question, you expect a response within about 500-800ms. If it takes longer, the conversation feels broken.

Your voice AI pipeline has three stages, each with a latency budget:

  • ASR (Automatic Speech Recognition): Converting audio to text. Target: 200-400ms.
  • LLM inference: Generating the response. Target: 300-500ms.
  • TTS (Text-to-Speech): Converting text to audio. Target: 100-300ms.

Total target: under 1 second end-to-end for the first audio chunk.

This is tight. Hitting it requires:

  • Streaming at every layer (don't wait for the full transcript before calling the LLM; don't wait for the full LLM response before starting TTS)
  • Choosing models optimized for speed, not just quality
  • Low-latency infrastructure (same region for all three components)
  • Aggressive caching for common responses

If any layer can't be streamed, you need to optimize that layer's latency instead.
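
To make the hand-off concrete, here's a minimal Python sketch of the streamed pipeline. The three stage functions are stand-ins for real ASR, LLM, and TTS clients (everything here is simulated); the point is that the first audio chunk goes out as soon as the first LLM tokens arrive, not after the full response is generated.

```python
import asyncio
from typing import AsyncIterator

# Stand-ins for real ASR / LLM / TTS clients. Each stage yields
# partial results so the next stage can start before it finishes.

async def transcript_stream() -> AsyncIterator[str]:
    """Yields partial transcripts as the ASR produces them."""
    for chunk in ["what's the", " weather in", " berlin today"]:
        await asyncio.sleep(0.1)   # simulated ASR latency
        yield chunk

async def llm_stream(prompt: str) -> AsyncIterator[str]:
    """Yields response tokens as the LLM generates them."""
    for token in ["It's ", "sunny ", "and ", "22 degrees."]:
        await asyncio.sleep(0.05)  # simulated inference latency
        yield token

async def tts_stream(text_chunks: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Synthesizes audio chunk-by-chunk as text arrives."""
    async for chunk in text_chunks:
        yield chunk.encode()       # placeholder for synthesized audio

async def main() -> None:
    # In a real system, endpointing decides when the user's turn is
    # over; here we just drain the simulated ASR stream.
    transcript = "".join([c async for c in transcript_stream()])

    # LLM tokens are piped straight into TTS: the first audio chunk
    # is ready long before the LLM finishes the response.
    async for audio_chunk in tts_stream(llm_stream(transcript)):
        print(f"play {len(audio_chunk)} bytes")  # send to the client

asyncio.run(main())
```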

STT Provider Comparison

Speech-to-text accuracy varies significantly by accent, background noise, and domain vocabulary. The main options:

Deepgram: Fastest in our testing. Excellent streaming support. Strong on diverse accents. Good for general use cases. Their Nova models have competitive accuracy with best-in-class latency.

OpenAI Whisper: Best raw accuracy on clean audio. Self-hostable (which matters for some use cases). The API version can be slow; the self-hosted version requires GPU infrastructure.

AssemblyAI: Strong on noisy audio. Good speaker diarization (who said what) for multi-speaker use cases. API is well-designed.

Google Speech-to-Text: Reliable and scalable. Strong multilingual support. Can be slower than Deepgram on streaming workloads.

Our default: Deepgram for most production voice AI. Switch to Whisper self-hosted when you need maximum accuracy and have the infrastructure budget.
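
For reference, here's what a streaming connection to Deepgram's live WebSocket endpoint can look like using the websockets package. This is a sketch, not a drop-in client: the query parameters, result shape, and header-passing argument should be checked against current Deepgram docs and your websockets version.

```python
import asyncio
import json
import websockets  # pip install websockets

DEEPGRAM_URL = (
    "wss://api.deepgram.com/v1/listen"
    "?model=nova-2&interim_results=true&endpointing=300"
)

async def stream_to_deepgram(audio_chunks, api_key: str) -> None:
    """Streams audio chunks to Deepgram and prints transcripts live."""
    headers = {"Authorization": f"Token {api_key}"}
    # Note: older websockets versions use extra_headers=,
    # newer ones use additional_headers=.
    async with websockets.connect(DEEPGRAM_URL, extra_headers=headers) as ws:

        async def sender() -> None:
            async for chunk in audio_chunks:   # raw audio bytes
                await ws.send(chunk)
            await ws.send(json.dumps({"type": "CloseStream"}))

        async def receiver() -> None:
            async for message in ws:
                result = json.loads(message)
                alt = result.get("channel", {}).get("alternatives", [{}])[0]
                if alt.get("transcript"):
                    kind = "final" if result.get("is_final") else "interim"
                    print(f"[{kind}] {alt['transcript']}")

        await asyncio.gather(sender(), receiver())
```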

Interruption Handling

Real conversations involve interruptions. Users don't always wait for the AI to finish speaking before they start talking. If your system doesn't handle this, it feels robotic.

Interruption handling requires:

Barge-in detection: The ability to detect when the user starts speaking while the AI is still talking, and stop the AI's audio output immediately.

State management: When interrupted, the AI needs to stop its current response and process what the user just said. This means canceling any in-flight TTS and LLM requests.

Graceful acknowledgment: Often the AI should acknowledge the interruption ("Of course, what were you saying?") rather than abruptly starting a new response.

This is harder than it sounds. The audio pipeline is stateful, and canceling mid-stream without audio artifacts requires careful implementation.
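
One workable pattern is to run each AI response as a cancelable task. The sketch below (a hypothetical TurnManager, with playback simulated by sleeps) shows the core moves: cancel in-flight work on barge-in, flush audio in the cancellation handler, then start the acknowledgment turn.

```python
import asyncio

class TurnManager:
    """Tracks the AI's in-flight response so barge-in can cancel it cleanly."""

    def __init__(self) -> None:
        self._current_turn: asyncio.Task | None = None

    def start_turn(self, coro) -> None:
        """Begin speaking a response, canceling any previous one first."""
        self.barge_in()
        self._current_turn = asyncio.create_task(coro)

    def barge_in(self) -> None:
        """User started talking: stop LLM + TTS output immediately."""
        if self._current_turn and not self._current_turn.done():
            self._current_turn.cancel()

async def speak_response(text: str) -> None:
    try:
        for word in text.split():
            await asyncio.sleep(0.2)  # stand-in for streaming TTS playback
            print(f"AI: {word}")
    except asyncio.CancelledError:
        # Flush the audio output buffer here to avoid artifacts,
        # then re-raise so the task is properly marked canceled.
        print("AI: (stopped mid-sentence)")
        raise

async def main() -> None:
    turns = TurnManager()
    turns.start_turn(speak_response("Your order ships on Tuesday via express"))
    await asyncio.sleep(0.7)  # VAD fires: user is speaking over the AI
    turns.barge_in()
    turns.start_turn(speak_response("Of course, what were you saying?"))
    await asyncio.sleep(2)

asyncio.run(main())
```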

Silence Detection

How do you know when the user has finished speaking? This seems obvious but is genuinely hard.

Naive approach: a fixed silence threshold (if no audio for 1.5 seconds, the turn is over). This works but fails when:

  • Users pause mid-sentence to think
  • Background noise creates false positives
  • Users have speech patterns with longer natural pauses

Better approach: combine voice activity detection (VAD) with endpoint detection models trained specifically on conversational patterns. These models predict whether a pause is a mid-sentence pause or a turn-ending pause based on context.

Deepgram's endpointing feature does this reasonably well. For high-stakes production systems, training your own endpoint model on data from your specific use case is worth the investment.
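
To make this concrete, here's a minimal VAD-gated endpointing sketch built on the open-source webrtcvad package. It still uses a hard-coded silence window, which is exactly the part an endpoint model replaces; the 800ms window and aggressiveness level are illustrative starting points, not recommendations.

```python
import webrtcvad  # pip install webrtcvad

SAMPLE_RATE = 16000  # webrtcvad supports 8/16/32/48 kHz
FRAME_MS = 30        # frames must be 10, 20, or 30 ms
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

vad = webrtcvad.Vad(2)  # aggressiveness 0 (lenient) to 3 (strict)

def detect_turn_end(frames, silence_ms: int = 800) -> bool:
    """Naive endpointing: the turn ends after `silence_ms` of non-speech.

    `frames` is an iterable of FRAME_BYTES-sized 16-bit PCM chunks.
    A production system would feed VAD output into an endpointing
    model rather than relying on this fixed threshold.
    """
    silence_frames_needed = silence_ms // FRAME_MS
    consecutive_silence = 0
    for frame in frames:
        if vad.is_speech(frame, SAMPLE_RATE):
            consecutive_silence = 0
        else:
            consecutive_silence += 1
            if consecutive_silence >= silence_frames_needed:
                return True
    return False
```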

Fallback to Text

Not all users can or want to use voice. Network conditions, environment, accessibility needs — there are many reasons a user might not be able to complete a voice interaction.

Every voice AI product should have a text fallback:

  • Display a text input alongside the voice interface
  • Allow users to switch mid-conversation
  • If STT fails consistently, offer text entry automatically

This also helps when voice recognition fails on specific terms. Users can type the word the AI keeps getting wrong.
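
The automatic fallback can be as simple as a consecutive-failure counter. A sketch, with an illustrative threshold and confidence cutoff:

```python
class SttFallbackTracker:
    """Offers text input after repeated transcription failures."""

    def __init__(self, max_failures: int = 2) -> None:
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def record_result(self, transcript: str | None, confidence: float) -> bool:
        """Returns True when the UI should surface the text input."""
        if not transcript or confidence < 0.5:  # empty or low-confidence
            self.consecutive_failures += 1
        else:
            self.consecutive_failures = 0
        return self.consecutive_failures >= self.max_failures
```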

Phone vs Browser Deployment

The deployment environment changes everything.

Browser-based voice AI: The user's browser captures audio via the Web Audio API. You have control over the UI. Latency is typically lower. Works well for desktop and mobile web.

Phone (PSTN) deployment: Audio is compressed and narrowband (8kHz sampling vs 16kHz+ in the browser). The latency budget differs: phone users are accustomed to some delay. Requires a telephony provider (Twilio, Vonage, Telnyx) that handles the SIP/PSTN layer. DTMF (touch-tone keypad input) needs to be handled as a fallback.

For phone deployment, Twilio's Media Streams is the standard approach for real-time audio streaming. Vonage is a good alternative with competitive pricing. Both require you to handle WebSockets between your application and their infrastructure.
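
As a skeleton, here's the receiving side of a Media Streams connection, assuming you've pointed a TwiML <Stream> verb at this server. The event names and base64 mu-law payload follow Twilio's documented message format; the handler signature varies across websockets versions (older ones also pass a path argument).

```python
import asyncio
import base64
import json
import websockets  # pip install websockets

async def handle_call(ws) -> None:
    """Receives Twilio Media Streams events for one phone call."""
    async for message in ws:
        event = json.loads(message)
        if event["event"] == "start":
            print(f"call started, stream {event['start']['streamSid']}")
        elif event["event"] == "media":
            # 8 kHz mu-law audio, base64-encoded, ~20 ms per frame
            audio = base64.b64decode(event["media"]["payload"])
            # forward `audio` to your streaming STT here
        elif event["event"] == "stop":
            print("call ended")

async def main() -> None:
    async with websockets.serve(handle_call, "0.0.0.0", 8080):
        await asyncio.Future()  # run until interrupted

asyncio.run(main())
```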

The Production Checklist

Before going live with any voice AI product:

  • End-to-end latency under 1 second for first audio chunk (tested under load)
  • Interruption handling implemented and tested
  • Silence detection tuned for your user base
  • Text fallback available
  • Error handling for STT failures (what does the user hear when transcription fails?)
  • Load testing at 10x expected peak concurrent calls
  • Logging of every conversation turn (for quality improvement and compliance)
  • Monitoring on latency, error rate, and cost per call
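
For the last two items, structured per-turn logs do double duty: they feed the latency and error-rate monitors and serve as the compliance record. A minimal sketch, with illustrative field names:

```python
import json
import logging
import time

log = logging.getLogger("voice_ai.turns")
logging.basicConfig(level=logging.INFO)

def log_turn(call_id: str, asr_ms: float, llm_ms: float, tts_ms: float,
             transcript: str, error: str | None = None) -> None:
    """Emits one structured record per conversation turn."""
    log.info(json.dumps({
        "call_id": call_id,
        "ts": time.time(),
        "latency_ms": {"asr": asr_ms, "llm": llm_ms, "tts": tts_ms,
                       "total": asr_ms + llm_ms + tts_ms},
        "transcript": transcript,
        "error": error,
    }))
```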

Voice AI is one of the most technically demanding surfaces to build well. But when it works — when latency is low, interruptions are handled gracefully, and the conversation feels natural — it's among the most powerful experiences you can deliver to users.
