A voice agent feels human at sub-1-second p95 latency. It feels janky at 1.5 seconds. It feels broken at 2.5 seconds. The threshold is psychological — the moment a user perceives a hang, they switch from listening for the answer to wondering whether the agent is still there. Below a second, the conversation flows. Above it, every turn is a small interruption.
Our production voice deployments hold p95 latency between 700ms and 950ms across English, Spanish, German, and Hindi conversations. Here is how the latency budget breaks down, where teams lose seconds, and the architectural tricks that buy them back.
The latency budget
End-to-end voice latency = STT finalisation + LLM generation + TTS first-audio. Real numbers from our LiveKit + Deepgram + Claude + ElevenLabs stack:
- STT finalisation (Deepgram streaming): 80-150ms after the user stops speaking
- LLM first token (Claude Sonnet, prompt cached): 200-400ms
- LLM full sentence (until first natural break): 300-600ms more
- TTS first audio chunk (ElevenLabs streaming): 250-400ms after first LLM token
- Network + telephony overhead (Twilio / LiveKit): 80-200ms
1. Stream the STT, do not wait for finalisation
The biggest single latency win is consuming the streaming STT transcript before the user has finished speaking. Most voice agents wait for the "final" transcript event from Deepgram or Whisper, then send it to the LLM. That waits for an endpoint-of-speech detection, which adds 200-400ms on top of the actual speech.
Instead, consume the interim transcript stream, run a short "is the user done speaking" classifier on each delta, and start the LLM call as soon as the classifier signals end-of-turn. The LLM is already generating by the time STT finalises.
2. Speculative LLM responses on partial transcripts
Take it further: kick off two or three speculative LLM calls on partial transcripts, then cancel the wrong ones once the user finishes. If the partial transcript at 600ms reads "can I change my appointment to..." — start an LLM call assuming a date is coming, and another assuming a cancellation request. When the final transcript arrives at 900ms, you have already saved 200-300ms because one of the speculative calls is already mid-generation.
This works because LLM API costs are negligible compared to the user experience cost of latency. We typically run 2 speculative calls per turn, with a 70-90% hit rate on the right one. The wasted tokens cost cents per thousand turns; the latency win is felt every turn.
3. TTS chunking and pre-warming
Most TTS providers (ElevenLabs, Cartesia, OpenAI) support streaming synthesis — they start generating audio while the text is still being received. Pipe the LLM tokens directly into the TTS endpoint as they generate, not after the full response is complete. The user starts hearing the first words while the LLM is still finishing the sentence.
- Use sentence-boundary chunking: send each completed sentence to TTS as it arrives from the LLM
- Pre-warm the TTS connection at conversation start so the first chunk has no cold-start penalty
- Cache and replay common phrases ("one moment please", greetings, hold messages) instead of re-synthesising every time
- For multilingual agents, pre-load the voice model for the detected language during STT, not after LLM generation
4. Cut LLM thinking time with prompt caching and smaller models
Voice does not need the same LLM size as a deep RAG query. For agent dialogue with structured tool calls, Claude Sonnet or Llama 3 8B is usually plenty — and 2-4× faster than the top-tier model. Reserve the larger model for the small fraction of turns that require deep reasoning, and route to it dynamically based on a classifier on the partial transcript.
Prompt caching cuts another 30-60% off first-token latency for the system prompt and tool definitions, which on a voice agent are repeated every turn. Anthropic and OpenAI both support caching now; the win is immediate and the cost reduction at scale (60-90% on cached portions) compounds.
5. Tool calls without blocking the user
If your voice agent needs to call a CRM, a calendar, or a billing system, those tool calls happen mid-conversation. The naive pattern is: user speaks → STT → LLM decides to call tool → wait for tool → continue. Latency death.
Better: have the agent emit a verbal acknowledgement ("let me check that for you") while the tool call runs in the background. The acknowledgement is a 1.5-second audio buffer that hides the entire tool round-trip — and unlike silence, it does not feel like a hang. The conversation stays warm even when the system is doing real work behind it.
Putting it together
Streaming STT consumption, speculative LLM calls, streaming TTS with sentence chunking, prompt caching, right-sized models, and verbal acknowledgements for tool calls. Each technique buys 100-400ms; together they take a 2.5-second baseline into the 700-900ms range where voice feels human.
The hard part is the instrumentation. You need per-step latency telemetry on every turn in production, and an alert on p95 regression. Without that, the latency creeps back up as the prompt grows, the corpus grows, or the model provider tweaks something — and the user-facing quality degrades without anyone noticing until churn shows up in the dashboard.
