A voice agent feels human at sub-1-second p95 latency. It feels janky at 1.5 seconds. It feels broken at 2.5 seconds. The threshold is psychological — the moment a user perceives a hang, they switch from listening for the answer to wondering whether the agent is still there. Below a second, the conversation flows. Above it, every turn is a small interruption.

Our production voice deployments hold p95 latency between 700ms and 950ms across English, Spanish, German, and Hindi conversations. Here is how the latency budget breaks down, where teams lose seconds, and the architectural tricks that buy them back.

The latency budget

End-to-end voice latency = STT finalisation + LLM generation + TTS first-audio. Real numbers from our LiveKit + Deepgram + Claude + ElevenLabs stack:

STT finalisation (Deepgram streaming): 80-150ms after the user stops speaking
LLM first token (Claude Sonnet, prompt cached): 200-400ms
LLM full sentence (until first natural break): 300-600ms more
TTS first audio chunk (ElevenLabs streaming): 250-400ms after first LLM token
Network + telephony overhead (Twilio / LiveKit): 80-200ms

1. Stream the STT, do not wait for finalisation

The biggest single latency win is consuming the streaming STT transcript before the user has finished speaking. Most voice agents wait for the "final" transcript event from Deepgram or Whisper, then send it to the LLM. That waits for an endpoint-of-speech detection, which adds 200-400ms on top of the actual speech.

Instead, consume the interim transcript stream, run a short "is the user done speaking" classifier on each delta, and start the LLM call as soon as the classifier signals end-of-turn. The LLM is already generating by the time STT finalises.

2. Speculative LLM responses on partial transcripts

Take it further: kick off two or three speculative LLM calls on partial transcripts, then cancel the wrong ones once the user finishes. If the partial transcript at 600ms reads "can I change my appointment to..." — start an LLM call assuming a date is coming, and another assuming a cancellation request. When the final transcript arrives at 900ms, you have already saved 200-300ms because one of the speculative calls is already mid-generation.

This works because LLM API costs are negligible compared to the user experience cost of latency. We typically run 2 speculative calls per turn, with a 70-90% hit rate on the right one. The wasted tokens cost cents per thousand turns; the latency win is felt every turn.

3. TTS chunking and pre-warming

Most TTS providers (ElevenLabs, Cartesia, OpenAI) support streaming synthesis — they start generating audio while the text is still being received. Pipe the LLM tokens directly into the TTS endpoint as they generate, not after the full response is complete. The user starts hearing the first words while the LLM is still finishing the sentence.

Use sentence-boundary chunking: send each completed sentence to TTS as it arrives from the LLM
Pre-warm the TTS connection at conversation start so the first chunk has no cold-start penalty
Cache and replay common phrases ("one moment please", greetings, hold messages) instead of re-synthesising every time
For multilingual agents, pre-load the voice model for the detected language during STT, not after LLM generation

4. Cut LLM thinking time with prompt caching and smaller models

Voice does not need the same LLM size as a deep RAG query. For agent dialogue with structured tool calls, Claude Sonnet or Llama 3 8B is usually plenty — and 2-4× faster than the top-tier model. Reserve the larger model for the small fraction of turns that require deep reasoning, and route to it dynamically based on a classifier on the partial transcript.

Prompt caching cuts another 30-60% off first-token latency for the system prompt and tool definitions, which on a voice agent are repeated every turn. Anthropic and OpenAI both support caching now; the win is immediate and the cost reduction at scale (60-90% on cached portions) compounds.

5. Tool calls without blocking the user

If your voice agent needs to call a CRM, a calendar, or a billing system, those tool calls happen mid-conversation. The naive pattern is: user speaks → STT → LLM decides to call tool → wait for tool → continue. Latency death.

Better: have the agent emit a verbal acknowledgement ("let me check that for you") while the tool call runs in the background. The acknowledgement is a 1.5-second audio buffer that hides the entire tool round-trip — and unlike silence, it does not feel like a hang. The conversation stays warm even when the system is doing real work behind it.

Putting it together

Streaming STT consumption, speculative LLM calls, streaming TTS with sentence chunking, prompt caching, right-sized models, and verbal acknowledgements for tool calls. Each technique buys 100-400ms; together they take a 2.5-second baseline into the 700-900ms range where voice feels human.

The hard part is the instrumentation. You need per-step latency telemetry on every turn in production, and an alert on p95 regression. Without that, the latency creeps back up as the prompt grows, the corpus grows, or the model provider tweaks something — and the user-facing quality degrades without anyone noticing until churn shows up in the dashboard.

Taggedvoice AI latencyvoice agent architectureSTT TTS pipelineDeepgram streamingElevenLabs voice latencyLiveKit voice agent

Voice Agents Under One Second — Latency Playbook

The latency budget

1. Stream the STT, do not wait for finalisation

2. Speculative LLM responses on partial transcripts

3. TTS chunking and pre-warming

4. Cut LLM thinking time with prompt caching and smaller models

5. Tool calls without blocking the user

Putting it together

More articles

Voice Agent ROI: The Real Cost Math Behind 4,000 Calls a Day

Hiring an AI Development Company in the USA in 2026: What to Ask, What to Verify

UK GDPR for AI Development: A Practical 2026 Guide

PIPEDA + Quebec Law 25 for AI in Canada: 2026 Compliance Checklist

Australian Privacy Act + APPs for AI Development in 2026

RAG vs Fine-Tuning in 2026: Cost, Latency, and When to Pick Which

Ready to ship the system this post describes?