Latency in a voice AI call is the gap between the caller finishing their sentence and the agent starting its response. Even 500ms of extra latency noticeably degrades call quality. At 1.5 seconds or more, callers start to feel like the agent is broken or distracted. This page explains where latency comes from and exactly how to reduce it.

The Latency Stack

Every agent turn goes through this pipeline:
Caller speaks → Transcription → LLM inference → TTS synthesis → Audio playback
Each stage adds latency, as do end-of-speech detection and network round-trips. The total perceived latency is roughly the sum of the rows below:
| Stage | Typical range | What it depends on |
| --- | --- | --- |
| End-of-speech detection | 200-500ms | Voice activity detection sensitivity |
| Transcription (STT) | 200-600ms | Transcriber model, audio length |
| LLM inference | 400ms - 2s | Model, prompt length, response length |
| TTS synthesis | 200-800ms | Voice provider, response length |
| Network round-trips | 50-200ms | Geographic proximity to DialNexa servers |
| Total | ~1.0-4.0s | All of the above |
A well-configured agent targeting a domestic Indian deployment should achieve end-to-end latency of 1.0-1.5 seconds. A poorly configured agent can easily hit 3-4 seconds.
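The total in the table can be checked with a short sketch. The figures are the typical ranges listed above; the dictionary and function names are illustrative only and are not part of any DialNexa API.

```python
# Typical per-stage latency ranges in milliseconds, copied from the table above.
STAGE_RANGES_MS = {
    "end_of_speech_detection": (200, 500),
    "transcription": (200, 600),
    "llm_inference": (400, 2000),
    "tts_synthesis": (200, 800),
    "network_round_trips": (50, 200),
}


def total_range_ms(stages):
    """Sum the best-case and worst-case latency across all stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = total_range_ms(STAGE_RANGES_MS)
print(f"best case: {best} ms, worst case: {worst} ms")
# prints "best case: 1050 ms, worst case: 4100 ms"
```

LLM inference alone accounts for 1.6 seconds of the spread between best and worst case, which is why model and prompt choices tend to matter most.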

Source 1: End-of-Speech Detection

Before transcription starts, the platform needs to know the caller has finished speaking. This uses voice activity detection (VAD).
Problem: VAD has a natural delay - it needs to wait to distinguish a pause within a sentence from the end of a turn.
Fixes:
  • Reduce the end-of-speech silence threshold if your callers tend to speak in short, complete sentences. A threshold that’s too long adds unnecessary pause.
  • Increase the threshold if the agent is frequently cutting the caller off mid-sentence (the caller pauses but hasn’t finished).
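The trade-off behind the silence threshold can be illustrated with a toy detector. This is a minimal sketch, assuming audio is already split into 20 ms frames flagged as speech or silence; the function name, frame size, and sample timings are hypothetical and do not describe DialNexa's actual VAD.

```python
def end_of_turn_ms(frames, silence_threshold_ms, frame_ms=20):
    """Return the time (ms) at which end of turn is declared, or None.

    frames: list of booleans, True where the frame contains speech.
    End of turn is declared once the caller has spoken and then stayed
    silent for at least silence_threshold_ms.
    """
    heard_speech = False
    silent_ms = 0
    for index, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silent_ms = 0
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= silence_threshold_ms:
                return (index + 1) * frame_ms
    return None


# 400 ms of speech, a 300 ms mid-sentence pause, 400 ms more speech, then silence.
frames = [True] * 20 + [False] * 15 + [True] * 20 + [False] * 50

print(end_of_turn_ms(frames, silence_threshold_ms=200))  # 600: fires during the pause
print(end_of_turn_ms(frames, silence_threshold_ms=500))  # 1600: waits for the real end
```

In this example the 200 ms threshold cuts the caller off mid-sentence, while the 500 ms threshold lets them finish but adds 500 ms of waiting after the last word (speech ends at 1100 ms).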

Source 2: Transcription Latency

Transcription is typically the fastest stage, but model choice matters.

Deepgram Nova 3

Fastest available transcriber. Optimized for real-time speech recognition. Recommended for most deployments. Latency: ~200-350ms.

Deepgram Flux

Optimized for difficult audio conditions (noisy environments, heavy accents). Slightly higher latency than Nova 3 but more accurate in those conditions. Latency: ~300-500ms.

Soniox

Well-suited for Indian English and regional accents. Use when caller accuracy is more important than raw speed. Latency: ~350-600ms.
Recommendation: Default to Deepgram Nova 3. Only switch if transcription accuracy is poor for your specific caller base.

Source 3: LLM Inference Latency

LLM inference is typically the largest source of latency, especially for longer prompts and longer responses. Model comparison (time to first token):
| Model | Speed | Quality | Best for |
| --- | --- | --- | --- |
| GPT-4o Mini | Fastest (~300-600ms) | Good | Default for most use cases |
| GPT-4o | Medium (~600ms - 1.2s) | Better reasoning | Complex multi-step conversations |
| Groq Llama 4 | Fast (~300-500ms) | Good | High-volume, latency-sensitive deployments |
| DeepSeek V3 | Fast - Medium | Good | Cost-optimised use cases |
GPT-4o Mini is the default model for good reason - it is significantly faster than GPT-4o while being accurate enough for the vast majority of voice agent use cases. Only upgrade to a larger model if you are seeing quality issues that GPT-4o Mini cannot handle.
Reducing LLM latency through prompt design:
  • Shorten your system prompt: Every token in the prompt increases processing time. Keep instructions concise. Use bullet points, not paragraphs.
  • Reduce response length: Instruct the agent to keep responses short, for example “Respond in 1-2 sentences only.” Shorter outputs take less time to generate and to synthesize.
  • Avoid complex tool chains: Each tool call adds an LLM round-trip, so a tool call that triggers another tool call roughly doubles the added delay. Flatten your tool logic where possible.
  • Pre-compute context: If your agent always needs business hours or product info, inject it as a dynamic variable at call start rather than having the agent fetch it via a tool mid-call.
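Pre-computing context might look like the following. This is a minimal sketch using Python's standard `string.Template`; the placeholder names (`business_hours`, `product_summary`) and the sample values are hypothetical, and DialNexa's own dynamic variable syntax may differ.

```python
from string import Template

# Hypothetical prompt template; the $business_hours and $product_summary
# placeholders stand in for dynamic variables filled at call start.
SYSTEM_PROMPT = Template(
    "You are a booking assistant.\n"
    "- Respond in 1-2 sentences only.\n"
    "- Business hours: $business_hours\n"
    "- Products: $product_summary\n"
)


def build_prompt(context):
    """Fill the template once at call start, so no mid-call tool call is needed."""
    return SYSTEM_PROMPT.substitute(context)


# Example values only; in practice these would come from your own systems.
prompt = build_prompt({
    "business_hours": "Mon-Sat, 9am-6pm IST",
    "product_summary": "Basic and Premium plans",
})
print(prompt)
```

Because the facts are already in the prompt, the agent can answer a question about opening hours in a single LLM pass instead of calling a tool and waiting for a second pass.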

Source 4: TTS Synthesis Latency

Text-to-speech synthesis is the final audio-producing stage. Both voice providers support streaming synthesis, meaning audio playback begins before the full response is synthesized.
| Provider | Streaming | Latency (first audio chunk) | Quality |
| --- | --- | --- | --- |
| Cartesia | Yes | ~80-150ms | Very natural |
| ElevenLabs | Yes | ~150-300ms | Highly natural |
Cartesia is the fastest TTS option in DialNexa. If latency is your primary concern, select a Cartesia voice.
The latency figures above are time to first audio chunk, not time to complete the utterance: with streaming, the first chunk plays back while later chunks are still being synthesized.
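The difference between time to first audio and time to complete an utterance can be shown with simple arithmetic. The sketch below uses made-up per-chunk synthesis times and a hypothetical helper; it models the idea only and does not call any TTS provider.

```python
def time_to_first_audio_ms(chunk_synthesis_ms, streaming):
    """Delay before the caller hears anything, given per-chunk synthesis times."""
    if not chunk_synthesis_ms:
        return 0
    if streaming:
        # Playback starts as soon as the first chunk is ready.
        return chunk_synthesis_ms[0]
    # Without streaming, playback waits for every chunk to be synthesized.
    return sum(chunk_synthesis_ms)


# Illustrative numbers only: four chunks taking 120 ms each to synthesize.
chunks = [120, 120, 120, 120]
print(time_to_first_audio_ms(chunks, streaming=True))   # 120
print(time_to_first_audio_ms(chunks, streaming=False))  # 480
```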

Source 5: Audio Cache

Audio Cache is DialNexa’s most powerful single optimization for latency reduction on static phrases.
How it works: You pre-specify phrases that your agent says frequently and predictably - greetings, confirmations, hold messages. DialNexa pre-synthesizes these phrases and caches the audio. When the agent needs to say one of these phrases, the cached audio is served instantly, with no TTS synthesis step.
Example phrases that benefit from caching:
  • “Thank you for calling. How can I help you today?”
  • “Please hold while I look that up.”
  • “I’ve successfully booked your appointment.”
  • “Is there anything else I can help you with?”
How to configure it: Go to your agent settings → Audio Cache → add the phrase → select the voice → save. DialNexa pre-generates the audio file using your selected TTS provider.
Cost benefit: Cached phrases also reduce TTS API costs because the audio is generated once and reused.
Audio Cache is only effective for phrases that are used verbatim. If the phrase contains dynamic content (e.g., a caller’s name or a specific date), it cannot be cached because it changes every call.
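The verbatim-match behaviour can be pictured as a lookup keyed by voice and exact text. The class below is a toy model written for illustration, not DialNexa's implementation; `AudioCache`, `fake_tts`, and the voice name are all hypothetical.

```python
class AudioCache:
    """Toy cache keyed by (voice, exact phrase); not DialNexa's implementation."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # function (voice, text) -> audio bytes
        self._store = {}

    def preload(self, voice, phrase):
        """Synthesize once, ahead of time, and keep the audio."""
        self._store[(voice, phrase)] = self._synthesize(voice, phrase)

    def get_audio(self, voice, text):
        """Return (audio, was_cached). Only verbatim matches hit the cache."""
        key = (voice, text)
        if key in self._store:
            return self._store[key], True
        return self._synthesize(voice, text), False


def fake_tts(voice, text):
    # Stand-in for a real TTS call.
    return f"{voice}:{text}".encode("utf-8")


cache = AudioCache(fake_tts)
cache.preload("voice-a", "Please hold while I look that up.")

_, static_hit = cache.get_audio("voice-a", "Please hold while I look that up.")
_, dynamic_hit = cache.get_audio("voice-a", "Please hold, Priya, while I look that up.")
print(static_hit, dynamic_hit)  # True False
```

The second lookup misses because inserting a name changes the text, so the phrase has to go through live synthesis.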

Network Latency

Network round-trip time between the caller’s phone, the carrier, and DialNexa’s servers adds latency that varies by geography. For Indian deployments: Enable India region routing to process calls through DialNexa’s India-based infrastructure. This reduces round-trip time for Indian callers significantly compared to routing through global (typically US-based) servers. See Indian Server Routing for setup.

Measuring Latency

Every call detail page shows per-turn latency:
  • Transcription latency: Time from end of caller speech to transcript ready
  • LLM latency: Time from transcript ready to LLM response received
  • TTS latency: Time from LLM response to first audio byte
  • Total turn latency: Sum of the above
To measure aggregate latency across many calls, configure a post-call analysis field:
{
  "name": "average_turn_latency_acceptable",
  "type": "boolean",
  "prompt": "Did the agent respond within a natural conversational delay on every turn? Flag as false if there were noticeable pauses."
}
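For numeric aggregates, the per-turn figures can also be summarized outside the platform. The sketch below assumes you have copied or exported per-turn latencies into a list of dictionaries; that data shape, the field names, and the sample values are hypothetical, since this page does not describe an export format.

```python
from statistics import mean, quantiles

# Hypothetical per-turn figures in ms, as read from call detail pages.
turns = [
    {"transcription": 250, "llm": 450, "tts": 120},
    {"transcription": 300, "llm": 700, "tts": 150},
    {"transcription": 280, "llm": 520, "tts": 110},
    {"transcription": 320, "llm": 1400, "tts": 140},
]


def summarize(turns):
    """Return mean, 95th-percentile, and max total turn latency in ms."""
    totals = [t["transcription"] + t["llm"] + t["tts"] for t in turns]
    # quantiles(..., n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(totals, n=20, method="inclusive")[-1]
    return {"mean_ms": mean(totals), "p95_ms": p95, "max_ms": max(totals)}


print(summarize(turns))
```

Tracking a high percentile alongside the mean helps surface the occasional slow turn that callers notice, which an average can hide.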

Latency Optimization Checklist

| Setting | Recommendation |
| --- | --- |
| Transcriber | Deepgram Nova 3 |
| LLM | GPT-4o Mini (or Groq Llama 4) |
| TTS | Cartesia |
| Audio Cache | Enable for all static phrases |
| India routing | Enable for Indian caller base |
| System prompt | Keep under 500 tokens |
| Response length instruction | “Respond in 1-2 sentences” |
| Tool calls | Minimize tool calls per turn |
Following all recommendations in this checklist should bring most agents to the 1.0-1.5 second total turn latency range for domestic Indian calls.