Latency - DialNexa Documentation

Latency in a voice AI call is the gap between the caller finishing their sentence and the agent starting its response. Even 500ms of extra latency noticeably degrades call quality. At 1.5 seconds or more, callers start to feel like the agent is broken or distracted. This page explains where latency comes from and exactly how to reduce it.

The Latency Stack

Every agent turn goes through this pipeline:

Caller speaks → Transcription → LLM inference → TTS synthesis → Audio playback

Each stage adds latency, as do end-of-speech detection and network round-trips. The total perceived latency is roughly the sum of the five components below:

Stage	Typical range	What it depends on
End-of-speech detection	200-500ms	Voice activity detection sensitivity
Transcription (STT)	200-600ms	Transcriber model, audio length
LLM inference	400ms - 2s	Model, prompt length, response length
TTS synthesis	200-800ms	Voice provider, response length
Network round-trips	50-200ms	Geographic proximity to DialNexa servers
Total	~1.0-4.0s	All of the above

A well-configured agent targeting a domestic Indian deployment should achieve end-to-end latency of 1.0-1.5 seconds. A poorly configured agent can easily hit 3-4 seconds.
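The total row in the table is the per-stage ranges added together. A minimal sketch of that arithmetic, using only the figures from the table above (the dictionary and function names are illustrative and not part of DialNexa):

```python
# Per-stage latency ranges in milliseconds, copied from the table above.
STAGE_RANGES_MS = {
    "end_of_speech_detection": (200, 500),
    "transcription": (200, 600),
    "llm_inference": (400, 2000),
    "tts_synthesis": (200, 800),
    "network_round_trips": (50, 200),
}


def total_latency_range_ms(stages):
    """Return (best_case, worst_case) by summing each stage's bounds."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = total_latency_range_ms(STAGE_RANGES_MS)
print(f"Total: {best / 1000:.2f}s to {worst / 1000:.2f}s")
```

The sums come out at 1,050ms and 4,100ms, which the table rounds to ~1.0-4.0s. Replacing one stage's range with the figures for your chosen model or provider shows how much that choice moves the total.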

Source 1: End-of-Speech Detection

Before transcription starts, the platform needs to know the caller has finished speaking. This uses voice activity detection (VAD).

Problem: VAD has an inherent delay, because it must wait long enough to distinguish a pause within a sentence from the end of a turn.

Fixes:

Reduce the end-of-speech silence threshold if your callers tend to speak in short, complete sentences. A threshold that’s too long adds unnecessary pause.
Increase the threshold if the agent is frequently cutting the caller off mid-sentence (the caller pauses but hasn’t finished).
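The trade-off between the two fixes can be illustrated with a toy endpointing loop. This is a simplified sketch, not DialNexa's VAD: it assumes audio has already been classified into 20ms speech/silence frames, and the function and parameter names are hypothetical.

```python
def detect_end_of_speech(frames, frame_ms=20, silence_threshold_ms=300):
    """Return the time (ms) at which the turn is declared finished, or None.

    frames: sequence of booleans, True where the frame contains speech.
    The turn ends once the caller has spoken and then stayed silent for
    at least silence_threshold_ms.
    """
    silence_ms = 0
    heard_speech = False
    for index, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= silence_threshold_ms:
                return (index + 1) * frame_ms
    return None


# 400ms of speech, a 200ms mid-sentence pause, 400ms of speech, then silence.
frames = [True] * 20 + [False] * 10 + [True] * 20 + [False] * 40

cut_off = detect_end_of_speech(frames, silence_threshold_ms=150)
complete = detect_end_of_speech(frames, silence_threshold_ms=300)
print(cut_off, complete)
```

With a 150ms threshold the turn is declared over at 560ms, inside the caller's mid-sentence pause. With a 300ms threshold the pause is tolerated and the turn ends at 1,300ms, which is 300ms after the caller actually stopped speaking at 1,000ms. That extra 300ms is the latency cost of the longer threshold.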

Source 2: Transcription Latency

Transcription is typically one of the faster stages, but model choice matters.

Deepgram Nova 3

Fastest available transcriber. Optimized for real-time speech recognition. Recommended for most deployments. Latency: ~200-350ms.

Deepgram Flux

Optimized for difficult audio conditions (noisy environments, heavy accents). Slightly higher latency than Nova 3 but more accurate in those conditions. Latency: ~300-500ms.

Soniox

Well-suited for Indian English and regional accents. Use when caller accuracy is more important than raw speed. Latency: ~350-600ms.

Recommendation: Default to Deepgram Nova 3. Only switch if transcription accuracy is poor for your specific caller base.

Source 3: LLM Inference Latency

LLM inference is typically the largest source of latency, especially for longer prompts and longer responses. Model comparison (time to first token):

Model	Speed	Quality	Best for
GPT-4o Mini	Fast (~300-600ms)	Good	Default for most use cases
GPT-4o	Medium (~600ms - 1.2s)	Better reasoning	Complex multi-step conversations
Groq Llama 4	Fast (~300-500ms)	Good	High-volume, latency-sensitive deployments
DeepSeek V3	Fast - Medium	Good	Cost-optimised use cases

GPT-4o Mini is the default model for good reason - it is significantly faster than GPT-4o while being accurate enough for the vast majority of voice agent use cases. Only upgrade to a larger model if you are seeing quality issues that GPT-4o Mini cannot handle.

Reducing LLM latency through prompt design:

Shorten your system prompt: Every token in the prompt increases processing time. Keep instructions concise. Use bullet points, not paragraphs.
Reduce response length: Instruct the agent to keep responses short: “Respond in 1-2 sentences only.” Shorter outputs have lower time-to-complete.
Avoid complex tool chains: A tool call that triggers another tool call doubles the LLM round-trips. Flatten your tool logic where possible.
Pre-compute context: If your agent always needs business hours or product info, inject it as a dynamic variable at call start rather than having the agent fetch it via a tool mid-call.
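As an illustration of the pre-compute approach, the sketch below fills placeholders in a system prompt once, before the call begins, so the agent never needs a mid-call tool round-trip for that information. It uses Python's standard `string.Template` purely for demonstration; DialNexa's own dynamic variable syntax is not shown on this page, and the variable names and business details here are hypothetical.

```python
from string import Template

# Hypothetical system prompt with placeholders filled once at call start.
SYSTEM_PROMPT = Template(
    "You are a booking assistant for $business_name.\n"
    "- Business hours: $business_hours\n"
    "- Respond in 1-2 sentences only."
)


def build_system_prompt(variables):
    """Fill the prompt placeholders before the call begins."""
    return SYSTEM_PROMPT.substitute(variables)


prompt = build_system_prompt(
    {"business_name": "Acme Clinic", "business_hours": "Mon-Sat, 9am-6pm"}
)
print(prompt)
```

The prompt stays short and bulleted, and the business hours are already in context when the first caller question arrives.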

Source 4: TTS Synthesis Latency

Text-to-speech synthesis is the final audio-producing stage. Both voice providers support streaming synthesis, meaning audio playback begins before the full response is synthesized.

Provider	Streaming	Latency (first audio chunk)	Quality
Cartesia	Yes	~80-150ms	Very natural
ElevenLabs	Yes	~150-300ms	Highly natural

Cartesia is the fastest TTS option in DialNexa. If latency is your primary concern, select a Cartesia voice.

The latency figures above represent time-to-first-audio, not time-to-complete-utterance.
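To see why time-to-first-audio is the figure that matters with streaming, consider a response synthesized in several chunks. The sketch below is a simplified model with made-up per-chunk timings, not measurements from either provider:

```python
def time_to_first_audio_ms(chunk_synthesis_ms, streaming):
    """Delay before the caller hears anything.

    chunk_synthesis_ms: synthesis time for each audio chunk, in order.
    With streaming, playback starts once the first chunk is ready;
    without it, playback waits for every chunk.
    """
    if not chunk_synthesis_ms:
        return 0
    return chunk_synthesis_ms[0] if streaming else sum(chunk_synthesis_ms)


chunks = [120, 110, 130, 100]  # illustrative per-chunk synthesis times (ms)
streamed = time_to_first_audio_ms(chunks, streaming=True)
buffered = time_to_first_audio_ms(chunks, streaming=False)
print(streamed, buffered)
```

In this example the caller hears audio after 120ms with streaming, compared with 460ms if playback waited for the whole utterance. The gap grows with response length, which is one more reason to keep responses short.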

Source 5: Audio Cache

Audio Cache is DialNexa’s most powerful single optimization for latency reduction on static phrases.

How it works: You pre-specify phrases that your agent says frequently and predictably, such as greetings, confirmations, and hold messages. DialNexa pre-synthesizes these phrases and caches the audio. When the agent needs to say one of them, the cached audio is served immediately, with no TTS synthesis delay.

Example phrases that benefit from caching:

“Thank you for calling. How can I help you today?”
“Please hold while I look that up.”
“I’ve successfully booked your appointment.”
“Is there anything else I can help you with?”

How to configure it: Go to your agent settings → Audio Cache → add the phrase → select the voice → save. DialNexa pre-generates the audio file using your selected TTS provider.

Cost benefit: Cached phrases also reduce TTS API costs because the audio is generated once and reused.

Audio Cache is only effective for phrases that are used verbatim. If the phrase contains dynamic content (e.g., a caller’s name or a specific date), it cannot be cached because it changes every call.
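The verbatim requirement follows from how a phrase cache is typically keyed. The toy cache below illustrates the idea only; how DialNexa stores and matches cached phrases internally is not described here, and the class, voice ID, and audio bytes are placeholders.

```python
class AudioCache:
    """Toy cache keyed by (voice, exact phrase text)."""

    def __init__(self):
        self._audio = {}

    def add(self, voice, phrase, audio_bytes):
        self._audio[(voice, phrase)] = audio_bytes

    def lookup(self, voice, phrase):
        """Return cached audio, or None if the phrase must be synthesized."""
        return self._audio.get((voice, phrase))


cache = AudioCache()
cache.add("voice-a", "Please hold while I look that up.", b"<pre-synthesized audio>")

# Exact match: served from cache, no synthesis needed.
hit = cache.lookup("voice-a", "Please hold while I look that up.")

# Same phrase with a caller's name inserted: no match, falls back to live TTS.
miss = cache.lookup("voice-a", "Please hold, Priya, while I look that up.")
```

Because the key includes the voice in this sketch, changing the agent's voice would also miss the cache, which is consistent with selecting a voice per phrase during configuration.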

Network Latency

Network round-trip time between the caller’s phone, the carrier, and DialNexa’s servers adds latency that varies by geography. For Indian deployments: Enable India region routing to process calls through DialNexa’s India-based infrastructure. This reduces round-trip time for Indian callers significantly compared to routing through global (typically US-based) servers. See Indian Server Routing for setup.

Measuring Latency

Every call detail page shows per-turn latency:

Transcription latency: Time from end of caller speech to transcript ready
LLM latency: Time from transcript ready to LLM response received
TTS latency: Time from LLM response to first audio byte
Total turn latency: Sum of the above
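The four figures relate to each other as differences between consecutive event timestamps. A minimal sketch, assuming you have those timestamps for a turn (the class and field names are hypothetical, not a DialNexa export format):

```python
from dataclasses import dataclass


@dataclass
class TurnTimestamps:
    """Hypothetical event times for one agent turn, in ms from call start."""
    speech_end: int
    transcript_ready: int
    llm_response_received: int
    first_audio_byte: int

    def breakdown(self):
        """Split the turn into the per-stage latencies listed above."""
        transcription = self.transcript_ready - self.speech_end
        llm = self.llm_response_received - self.transcript_ready
        tts = self.first_audio_byte - self.llm_response_received
        return {
            "transcription_ms": transcription,
            "llm_ms": llm,
            "tts_ms": tts,
            "total_ms": transcription + llm + tts,
        }


turn = TurnTimestamps(
    speech_end=10_000,
    transcript_ready=10_300,
    llm_response_received=10_800,
    first_audio_byte=10_950,
)
print(turn.breakdown())
```

For this example turn the breakdown is 300ms transcription, 500ms LLM, and 150ms TTS, for a 950ms total. Comparing each component against the typical ranges earlier on this page shows which stage to tune first.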

To track latency across many calls, configure a post-call analysis field that flags calls with noticeable pauses:

{
  "name": "average_turn_latency_acceptable",
  "type": "boolean",
  "prompt": "Did the agent respond within a natural conversational delay on every turn? Flag as false if there were noticeable pauses."
}
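Once the field is populated, the share of calls flagged as acceptable gives a simple aggregate signal. The sketch below assumes you have exported analysis results as a list of dictionaries; that export shape and the `call_id` key are assumptions, and only the field name comes from the configuration above.

```python
def acceptable_latency_rate(calls, field="average_turn_latency_acceptable"):
    """Share of analyzed calls where the field was true; None if no data."""
    values = [call[field] for call in calls if field in call]
    if not values:
        return None
    return sum(1 for value in values if value) / len(values)


# Hypothetical exported records for four calls.
calls = [
    {"call_id": "c1", "average_turn_latency_acceptable": True},
    {"call_id": "c2", "average_turn_latency_acceptable": False},
    {"call_id": "c3", "average_turn_latency_acceptable": True},
    {"call_id": "c4", "average_turn_latency_acceptable": True},
]
rate = acceptable_latency_rate(calls)
print(f"{rate:.0%} of calls had acceptable latency")
```

Tracking this rate before and after a configuration change, such as switching TTS provider, indicates whether the change helped across real traffic.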

Latency Optimization Checklist

Setting	Recommendation
Transcriber	Deepgram Nova 3
LLM	GPT-4o Mini (or Groq Llama 4)
TTS	Cartesia
Audio Cache	Enable for all static phrases
India routing	Enable for Indian caller base
System prompt	Keep under 500 tokens
Response length instruction	“Respond in 1-2 sentences”
Tool calls	Minimize tool calls per turn

Following all recommendations in this checklist should bring most agents to the 1.0-1.5 second total turn latency range for domestic Indian calls.

Related Pages

Reliability Overview - overall reliability architecture
Indian Server Routing - India-specific latency routing
Audio Cache - full Audio Cache configuration reference
