Aiinfox logo
All articles
Voice AI February 26, 2026 8 min read

Voice Agents Under One Second — Latency Playbook

Voice agents that feel natural live under 1 second p95 latency. Here is the latency budget, the architecture, and the speculative tricks that get you there.

AE

Aiinfox Engineering

Senior engineering team · Aiinfox

A voice agent feels human at sub-1-second p95 latency. It feels janky at 1.5 seconds. It feels broken at 2.5 seconds. The threshold is psychological — the moment a user perceives a hang, they switch from listening for the answer to wondering whether the agent is still there. Below a second, the conversation flows. Above it, every turn is a small interruption.

Our production voice deployments hold p95 latency between 700ms and 950ms across English, Spanish, German, and Hindi conversations. Here is how the latency budget breaks down, where teams lose seconds, and the architectural tricks that buy them back.

The latency budget

End-to-end voice latency = STT finalisation + LLM generation + TTS first-audio. Real numbers from our LiveKit + Deepgram + Claude + ElevenLabs stack:

  • STT finalisation (Deepgram streaming): 80-150ms after the user stops speaking
  • LLM first token (Claude Sonnet, prompt cached): 200-400ms
  • LLM full sentence (until first natural break): 300-600ms more
  • TTS first audio chunk (ElevenLabs streaming): 250-400ms after first LLM token
  • Network + telephony overhead (Twilio / LiveKit): 80-200ms

1. Stream the STT, do not wait for finalisation

The biggest single latency win is consuming the streaming STT transcript before the user has finished speaking. Most voice agents wait for the "final" transcript event from Deepgram or Whisper, then send it to the LLM. That waits for an endpoint-of-speech detection, which adds 200-400ms on top of the actual speech.

Instead, consume the interim transcript stream, run a short "is the user done speaking" classifier on each delta, and start the LLM call as soon as the classifier signals end-of-turn. The LLM is already generating by the time STT finalises.

2. Speculative LLM responses on partial transcripts

Take it further: kick off two or three speculative LLM calls on partial transcripts, then cancel the wrong ones once the user finishes. If the partial transcript at 600ms reads "can I change my appointment to..." — start an LLM call assuming a date is coming, and another assuming a cancellation request. When the final transcript arrives at 900ms, you have already saved 200-300ms because one of the speculative calls is already mid-generation.

This works because LLM API costs are negligible compared to the user experience cost of latency. We typically run 2 speculative calls per turn, with a 70-90% hit rate on the right one. The wasted tokens cost cents per thousand turns; the latency win is felt every turn.

3. TTS chunking and pre-warming

Most TTS providers (ElevenLabs, Cartesia, OpenAI) support streaming synthesis — they start generating audio while the text is still being received. Pipe the LLM tokens directly into the TTS endpoint as they generate, not after the full response is complete. The user starts hearing the first words while the LLM is still finishing the sentence.

  • Use sentence-boundary chunking: send each completed sentence to TTS as it arrives from the LLM
  • Pre-warm the TTS connection at conversation start so the first chunk has no cold-start penalty
  • Cache and replay common phrases ("one moment please", greetings, hold messages) instead of re-synthesising every time
  • For multilingual agents, pre-load the voice model for the detected language during STT, not after LLM generation

4. Cut LLM thinking time with prompt caching and smaller models

Voice does not need the same LLM size as a deep RAG query. For agent dialogue with structured tool calls, Claude Sonnet or Llama 3 8B is usually plenty — and 2-4× faster than the top-tier model. Reserve the larger model for the small fraction of turns that require deep reasoning, and route to it dynamically based on a classifier on the partial transcript.

Prompt caching cuts another 30-60% off first-token latency for the system prompt and tool definitions, which on a voice agent are repeated every turn. Anthropic and OpenAI both support caching now; the win is immediate and the cost reduction at scale (60-90% on cached portions) compounds.

5. Tool calls without blocking the user

If your voice agent needs to call a CRM, a calendar, or a billing system, those tool calls happen mid-conversation. The naive pattern is: user speaks → STT → LLM decides to call tool → wait for tool → continue. Latency death.

Better: have the agent emit a verbal acknowledgement ("let me check that for you") while the tool call runs in the background. The acknowledgement is a 1.5-second audio buffer that hides the entire tool round-trip — and unlike silence, it does not feel like a hang. The conversation stays warm even when the system is doing real work behind it.

Putting it together

Streaming STT consumption, speculative LLM calls, streaming TTS with sentence chunking, prompt caching, right-sized models, and verbal acknowledgements for tool calls. Each technique buys 100-400ms; together they take a 2.5-second baseline into the 700-900ms range where voice feels human.

The hard part is the instrumentation. You need per-step latency telemetry on every turn in production, and an alert on p95 regression. Without that, the latency creeps back up as the prompt grows, the corpus grows, or the model provider tweaks something — and the user-facing quality degrades without anyone noticing until churn shows up in the dashboard.

Taggedvoice AI latencyvoice agent architectureSTT TTS pipelineDeepgram streamingElevenLabs voice latencyLiveKit voice agent

More articles

Voice AI

Voice Agent ROI: The Real Cost Math Behind 4,000 Calls a Day

Voice agents pencil at 10-30 cents per call when built right. They pencil at $1.20 a call when built wrong. Here is the actual cost model behind a production deployment doing 4,000 calls a day.

Jun 2026 · 12 minRead
Industry

Hiring an AI Development Company in the USA in 2026: What to Ask, What to Verify

Most AI vendors will not survive a real verification call. Here is what US CTOs and VPs of Engineering should actually ask before signing — and what evidence to insist on.

Jun 2026 · 12 minRead
Industry

UK GDPR for AI Development: A Practical 2026 Guide

Most UK GDPR posts read like a legal essay. This one is the engineering version — DPIAs, lawful bases, Article 22, ICO guidance, SCCs — written for CTOs shipping production AI.

Jun 2026 · 13 minRead
Industry

PIPEDA + Quebec Law 25 for AI in Canada: 2026 Compliance Checklist

PIPEDA is the federal floor. Quebec Law 25 is the strictest provincial overlay. OSFI E-23 sits on top for federally-regulated banks. Here is the engineering checklist that ties them together.

Jun 2026 · 12 minRead
Industry

Australian Privacy Act + APPs for AI Development in 2026

The Privacy Act sets the federal floor. APRA CPS 234 and CPS 230 add the financial-services overlay. The NDB clock is unforgiving. Here is the practical engineering checklist.

Jun 2026 · 12 minRead
Generative AI

RAG vs Fine-Tuning in 2026: Cost, Latency, and When to Pick Which

RAG is the default for most production AI in 2026. Fine-tuning is the right call about a third of the time it gets requested. Here is the honest cost math.

Jun 2026 · 12 minRead
Production AI, not slideware

Ready to ship the system this post describes?

30-minute scoping call. Senior engineers. Fixed-price scope in 72 hours.