Voice AI · June 2, 2026 · 12 min read

Voice Agent ROI: The Real Cost Math Behind 4,000 Calls a Day

Honest 2026 cost math on production voice agents — per-call STT/LLM/TTS/telephony costs, latency budget, the 4,000-calls-a-day case study, and when ROI does not pencil.

Manjeet Singh

Senior engineering team · Aiinfox

Voice agents are the AI category where the per-call unit economics either work cleanly or do not work at all. There is no middle ground. A correctly architected voice agent runs at roughly 7-30 cents per call on telephony, model, and infrastructure combined. A mis-architected one runs at $1.20 a call before you count the engineering time burned on tail-latency incidents. The difference between the two is not the LLM; it is the latency budget, the model choices at each pipeline stage, and the telephony provider.

I am writing this from the cost model behind a production voice agent we ship that handles 4,000+ calls a day for a B2C services brand, and the back-office voice tooling that has saved 1,400 hours a month for an EU insurance customer. Both engagements are profitable at the per-call rates below. Neither would be profitable if the pipeline were architected the way most voice-agent demos are.

The five components in the per-call cost stack

A production voice agent has five distinct cost components, each billed differently and each with different optimization levers. The per-call number is the sum of these five. Most vendors quote the LLM piece and ignore the rest, which is why the projected ROI in the pitch deck rarely matches the real bill at month three.

  • Telephony (inbound or outbound carrier minutes, DID/SIP termination): $0.008 to $0.030 per minute depending on geography and provider.
  • Speech-to-text (streaming STT, audio in): $0.004 to $0.012 per minute depending on provider and accuracy tier.
  • LLM inference (input + output tokens, plus prompt cache hits): $0.001 to $0.020 per turn depending on model and prompt size.
  • Text-to-speech (premium voice TTS, audio out): $0.012 to $0.030 per 1k characters of output speech.
  • Infrastructure (orchestration, observability, eval platform, session state): $0.005 to $0.015 per call amortised.

On a typical 90-second consumer support call, the math runs roughly: 1.5 minutes of telephony at $0.015 (~$0.023), 1.5 minutes of STT at $0.008 (~$0.012), 4 LLM turns averaging $0.004 each (~$0.016), 600 characters of TTS output at $0.018/1k (~$0.011), and $0.008 of infrastructure overhead. Total: roughly $0.07 per call for a tight architecture. That is the floor for English consumer voice in 2026.
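As a rough sketch, the arithmetic above can be written as a small calculator. The class and field names are illustrative assumptions, not any vendor's API; the rates are the ones quoted in the paragraph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallProfile:
    minutes: float    # call duration in minutes
    llm_turns: int    # number of LLM turns in the call
    tts_chars: int    # characters of synthesized speech


@dataclass(frozen=True)
class Rates:
    telephony_per_min: float
    stt_per_min: float
    llm_per_turn: float
    tts_per_1k_chars: float
    infra_per_call: float


def per_call_cost(profile: CallProfile, rates: Rates) -> dict:
    """Return the cost of one call, broken down by component."""
    parts = {
        "telephony": profile.minutes * rates.telephony_per_min,
        "stt": profile.minutes * rates.stt_per_min,
        "llm": profile.llm_turns * rates.llm_per_turn,
        "tts": profile.tts_chars / 1000 * rates.tts_per_1k_chars,
        "infra": rates.infra_per_call,
    }
    parts["total"] = sum(parts.values())
    return parts


# The 90-second consumer support call from the paragraph above.
support_call = CallProfile(minutes=1.5, llm_turns=4, tts_chars=600)
quoted_rates = Rates(0.015, 0.008, 0.004, 0.018, 0.008)
breakdown = per_call_cost(support_call, quoted_rates)
for component, usd in breakdown.items():
    print(f"{component:>10}: ${usd:.4f}")
```

Keeping the breakdown by component, rather than a single total, makes it easier to see which lever moves the bill when a rate or a call profile changes.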

Where the cost blows out

The vendors quoting $1.20 a call are not lying. They are just running an architecture that is two to three years out of date for production voice. The cost-blow-out failure modes are consistent.

  • Sending the full conversation history to the LLM on every turn instead of using prompt caching — 3-5x the input-token bill.
  • Using a flagship model for routing decisions a 7B classifier could make in 50ms — 10-20x the LLM bill on the turns that route.
  • Using a non-streaming STT that waits for the user to finish speaking before transcribing, which adds 800ms-1.5s of latency and leads impatient users to repeat themselves or hang up.
  • Premium-voice TTS on every turn including yes/no acknowledgements — 3-4x the TTS bill for no perceived quality gain.
  • No silence detection or VAD tuning, so the agent waits 2-3 seconds at every utterance boundary — calls drag on, telephony minutes accumulate.
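To see why the first failure mode is so expensive, here is a minimal sketch of billed input tokens over one call. The token counts (a 1,500-token system prompt, 150 new tokens per turn) and the 10% cached-token rate are assumptions for illustration only; real cache pricing, cache-write surcharges, and minimum cacheable prefix sizes vary by provider.

```python
def billed_input_tokens(system_tokens, tokens_per_turn, turns, cached_rate=1.0):
    """Effective input tokens billed across a call.

    cached_rate is the fraction of the normal price charged for tokens
    that were already sent on the previous turn (1.0 means no caching).
    The first turn is always billed in full.
    """
    total = 0.0
    for turn in range(1, turns + 1):
        prompt_tokens = system_tokens + tokens_per_turn * turn
        # Everything except this turn's new tokens was in the previous prompt.
        prefix = 0 if turn == 1 else prompt_tokens - tokens_per_turn
        new_tokens = prompt_tokens - prefix
        total += new_tokens + prefix * cached_rate
    return total


# Hypothetical 4-turn call: full history resent vs. cached prefix at 10% price.
uncached = billed_input_tokens(1500, 150, turns=4)
cached = billed_input_tokens(1500, 150, turns=4, cached_rate=0.1)
print(f"uncached: {uncached:.0f}, cached: {cached:.0f}, ratio: {uncached / cached:.1f}x")
```

Under these assumptions the uncached call bills 7,500 input tokens against about 2,640 with caching, a ratio of roughly 2.8x. Longer calls and larger system prompts widen the gap.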

The latency budget — and why it determines the cost ceiling

Conversation feels natural at sub-2-second p95 turn latency. Above 3 seconds, users start interrupting and the call structure breaks down. Above 5 seconds, abandonment climbs into double digits. The latency budget also sets the cost ceiling, because most choices that cut latency (a small routing model, a mid-size response model, aggressive prompt caching) also cut cost. See our [voice agents under one second post](/blog/voice-agents-sub-second-latency) for the underlying architecture pattern.

The canonical sub-2-second latency budget for a customer-service voice agent:

  • STT first-partial: 150-300ms (streaming STT with VAD tuning).
  • Routing decision: 50-150ms (small classifier model, prompt-cached).
  • LLM first-token: 300-600ms (mid-size model, aggressive prompt caching, streaming response).
  • TTS first-audio: 150-400ms (streaming TTS, low-latency voice tier).
  • Network and orchestration overhead: 100-200ms.

Summed budget: 750-1650ms, inside the 2-second p95 target with some headroom. The teams hitting sub-2-second consistently are the teams instrumenting each component with p50/p95/p99 and shaving 50ms here and 100ms there. The teams running at 4-5 seconds typically have one or two components taking 1.5+ seconds on their own and no observability to find which one.
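A minimal sketch of that instrumentation, assuming per-stage latency samples are already being collected somewhere. The stage keys and function names are hypothetical. Note that summing per-stage bounds only approximates end-to-end p95, so the full turn should be measured directly as well.

```python
import math

# Per-stage latency budget in milliseconds (low, high), from the list above.
BUDGET_MS = {
    "stt_first_partial": (150, 300),
    "routing": (50, 150),
    "llm_first_token": (300, 600),
    "tts_first_audio": (150, 400),
    "network_orchestration": (100, 200),
}


def budget_range(budget):
    """Sum the low and high bounds across all stages."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer such as 50, 95, or 99."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def stages_over_budget(samples_by_stage, budget, pct=95):
    """Return stages whose observed percentile exceeds the stage's upper bound."""
    over = {}
    for stage, samples in samples_by_stage.items():
        observed = percentile(samples, pct)
        if observed > budget[stage][1]:
            over[stage] = observed
    return over


# Made-up samples: the LLM stage has a slow tail, routing is healthy.
samples = {
    "llm_first_token": [400] * 18 + [1600] * 2,
    "routing": [80] * 20,
}
print(budget_range(BUDGET_MS))
print(stages_over_budget(samples, BUDGET_MS))
```

A per-stage check like this is what turns "the agent feels slow" into "LLM first-token p95 is over budget", which is the difference between a one-day fix and a multi-week hunt.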

The 4,000-calls-a-day reference deployment

The reference engagement: a multi-location consumer services brand handling outbound appointment reminders, inbound rescheduling, and post-service follow-up. Pre-deployment, the call volume was burning roughly 28 hours per location per week of front-desk time. Post-deployment, the voice agent handles 4,000+ calls a day across the locations, with human handoff on roughly 9% of calls (complex rescheduling, billing escalations, complaint handling).

  • Per-call cost: $0.09 fully loaded (telephony + STT + LLM + TTS + infrastructure).
  • Per-call equivalent human-handled cost: roughly $4.50 (loaded staff cost plus retry overhead).
  • Net per-call savings: $4.41, applied across 91% of calls that complete without human handoff.
  • Monthly cost savings at 4,000 calls/day x 30 days: roughly $480,000 vs. the human-only baseline.
  • Engagement build cost recovered: roughly 6 weeks post-launch.
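The monthly figure follows directly from the bullets above. A short sketch, with the function and parameter names as assumptions, and following the same simplification that savings are counted only on calls completed without human handoff:

```python
def monthly_savings(calls_per_day, days, human_cost, agent_cost, containment):
    """Savings counted only on calls the agent completes without handoff."""
    contained_calls = calls_per_day * days * containment
    return contained_calls * (human_cost - agent_cost)


savings = monthly_savings(
    calls_per_day=4000,
    days=30,
    human_cost=4.50,   # loaded staff cost plus retry overhead, per call
    agent_cost=0.09,   # fully loaded agent cost, per call
    containment=0.91,  # share of calls completed without handoff
)
print(f"${savings:,.0f} per month")
```

With these inputs the result is roughly $481,600 a month, in line with the rounded $480,000 above. A stricter model would also charge the agent's $0.09 against the 9% of calls that still reach a human.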

See the [outbound voice agent case study](/case-studies/voice-agent) for the architecture and the metrics that gated the rollout. The build was a 12-week engagement with eval-first delivery; the eval set was 280 representative call scripts covering happy-path, edge-case, refusal-required, and escalation categories.

The 1,400-hours-a-month EU insurance build

The second reference deployment is an EU insurance customer where the back-office team was burning roughly 1,400 hours per month on inbound policy-question calls — many of them repetitive (policy coverage lookup, claim status, document requests). The voice agent fronts the call, handles 73% of inquiries without escalation, and routes the remaining 27% to a human agent with full call context pre-populated.

The cost math is different here. The customer is not measuring per-call cost reduction — they are measuring back-office hours released back to the team for high-judgment work. 1,400 hours a month at the loaded cost per back-office FTE is roughly €52,000/month of reclaimed capacity, against a per-call cost stack that runs at €0.14 (premium voice tier, EU-resident infrastructure, GDPR-compliant audit trail). The ROI is real, but the unit economics work because the alternative cost (loaded back-office labour) is high. The same architecture in a country with cheaper labour might not pencil.

When voice agent ROI does not pencil

Voice agent ROI does not pencil in three recurring cases, and an honest vendor will tell a buyer in the first call if their use case is in one of them.

  • Low call volume (under ~200 calls/day). The fixed infrastructure cost (orchestration, observability, eval platform) does not amortise. A well-trained chatbot or async messaging may be the better channel.
  • Highly variable conversation structure with no repeatable patterns. If 60%+ of calls require human judgment, the voice agent is just a routing layer with extra cost — a smarter IVR is cheaper.
  • Markets with very low labour cost. If the per-call alternative is $0.40 of human time, a $0.09 agent call saves only about $0.31 per call, which may not justify the engagement build cost.
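These cases can be screened with a back-of-envelope break-even check. The sketch below is hypothetical: the fixed monthly platform cost and the build cost are placeholder inputs a buyer would supply, not figures from the deployments described here.

```python
def break_even_calls_per_day(fixed_monthly, human_cost, agent_cost,
                             containment, days=30):
    """Daily volume at which per-call savings cover fixed monthly costs.

    Returns None when the agent saves nothing per contained call.
    """
    saving_per_contained_call = human_cost - agent_cost
    if saving_per_contained_call <= 0 or containment <= 0:
        return None
    return fixed_monthly / (days * containment * saving_per_contained_call)


def payback_months(build_cost, monthly_net_savings):
    """Months to recover the build cost; None if net savings are not positive."""
    if monthly_net_savings <= 0:
        return None
    return build_cost / monthly_net_savings


# Placeholder inputs: $5,000/month fixed platform cost, low-labour-cost market.
low_labour = break_even_calls_per_day(
    fixed_monthly=5000, human_cost=0.40, agent_cost=0.09, containment=0.91
)
# Same fixed cost against the $4.50 human-handled call from the reference deployment.
high_labour = break_even_calls_per_day(
    fixed_monthly=5000, human_cost=4.50, agent_cost=0.09, containment=0.91
)
print(f"low-labour break-even:  {low_labour:.0f} calls/day")
print(f"high-labour break-even: {high_labour:.0f} calls/day")
```

The same fixed cost clears at a far lower daily volume when the human alternative is expensive, which is the whole argument of this section in two numbers.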

What buyers should ask in the procurement call

Voice agent procurement questions that surface the architecture quality fast:

  • Show me your per-call cost breakdown on a recent deployment, by component (telephony, STT, LLM, TTS, infra).
  • What is your p95 turn latency on production traffic? Show me the latency dashboard.
  • Which LLM provider do you use for the routing model versus the flagship model, and why?
  • How is prompt caching configured, and what is the cache hit rate on your production deployments?
  • What does the eval set look like for a voice deployment — happy path, edge cases, refusal categories, escalation triggers?
  • Show me the audit log for a single call — the full trace from STT input to TTS output, with timing and cost per step.

A vendor running a real production voice stack can answer all six in narrative on the call. A vendor reading from a deck will hedge on at least three of them. The hedge points are the architectural decisions that determine whether the per-call cost lands at $0.09 or $1.20.
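For the last question, it helps to know what a per-call trace could look like. This is one possible shape, not a standard: the field names, stage labels, and the sample values are all assumptions for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Step:
    stage: str        # e.g. "stt", "routing", "llm", "tts"
    started_ms: int   # offset from call start
    duration_ms: int
    cost_usd: float


@dataclass
class CallTrace:
    call_id: str
    steps: list = field(default_factory=list)

    def record(self, stage, started_ms, duration_ms, cost_usd):
        """Append one pipeline step to the trace."""
        self.steps.append(Step(stage, started_ms, duration_ms, cost_usd))

    def cost_by_stage(self):
        """Total cost per stage across every turn in the call."""
        totals = {}
        for step in self.steps:
            totals[step.stage] = totals.get(step.stage, 0.0) + step.cost_usd
        return totals

    def slowest(self):
        """The single slowest step, the first place to look on a bad call."""
        return max(self.steps, key=lambda s: s.duration_ms)

    def to_json(self):
        return json.dumps(asdict(self))


# One made-up turn of a call.
trace = CallTrace(call_id="example-call-001")
trace.record("stt", started_ms=0, duration_ms=220, cost_usd=0.0030)
trace.record("routing", started_ms=220, duration_ms=90, cost_usd=0.0002)
trace.record("llm", started_ms=310, duration_ms=480, cost_usd=0.0040)
trace.record("tts", started_ms=790, duration_ms=260, cost_usd=0.0027)
print(trace.slowest().stage, trace.cost_by_stage())
```

If a vendor's audit log cannot be reduced to something like this (stage, timing, cost, per step), the per-call cost breakdown they quote is likely an estimate rather than a measurement.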

What the regional buyer should weigh

Voice agent economics vary by region, mostly because labour cost (the alternative) and telephony cost (the bill) vary by region.

  • [US deployments](/ai-development-company-usa) — voice agent ROI works above ~300 calls/day in most service industries; loaded labour cost is high enough that the per-call savings compound fast.
  • [UK deployments](/ai-development-company-uk) — similar economics to the US; ICO/UK GDPR audit-trail requirements add infrastructure cost but the per-call math still pencils above ~400 calls/day.
  • [Canada deployments](/ai-development-company-canada) — PIPEDA + Law 25 add a small infrastructure layer; Quebec deployments may need a French-language voice tier with separate STT/TTS providers.
  • [Australia deployments](/ai-development-company-australia) — Privacy Act + APP-compliant data residency adds infrastructure cost; Australian-resident telephony is more expensive but the per-call math works above ~500 calls/day in financial services and healthcare.

Wrapping up — the per-call math is the engagement

Voice agents are not a category where the ROI conversation can be deferred to a phase-2 measurement. The unit economics either work in week one or they do not work at all. The architectural decisions that determine the per-call cost — the routing model, the STT provider, the prompt caching strategy, the TTS tier per turn — all get made in the first 14 days of the engagement. A vendor that does not have an opinion on each of those is a vendor that will deliver a working demo and a $0.90-per-call production bill.

If you are scoping a voice agent build and want a written cost model — per-call breakdown by component, latency budget per stage, and a projected monthly run-rate against your call volume — [book a discovery call](/contact-us). One conversation, one fixed-price scope inside 72 hours, and an honest read on whether the per-call math actually pencils for your specific traffic.
