The Latency Stack
Every agent turn goes through this pipeline:

| Stage | Typical range | What it depends on |
|---|---|---|
| End-of-speech detection | 200-500ms | Voice activity detection sensitivity |
| Transcription (STT) | 200-600ms | Transcriber model, audio length |
| LLM inference | 400ms - 2s | Model, prompt length, response length |
| TTS synthesis | 200-800ms | Voice provider, response length |
| Network round-trips | 50-200ms | Geographic proximity to DialNexa servers |
| Total | ~1.0-4.0s | All of the above |
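
The total row is the sum of the per-stage ranges. A minimal sketch of that arithmetic, using the figures from the table above (the dictionary and function names are illustrative and not part of any DialNexa API):

```python
# Per-stage latency ranges in milliseconds, copied from the table above.
STAGE_RANGES_MS = {
    "end_of_speech_detection": (200, 500),
    "transcription": (200, 600),
    "llm_inference": (400, 2000),
    "tts_synthesis": (200, 800),
    "network_round_trips": (50, 200),
}

def total_range_ms(stages):
    """Sum the best-case and worst-case latency across all stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst

best, worst = total_range_ms(STAGE_RANGES_MS)
print(f"best case: {best} ms, worst case: {worst} ms")
# best case: 1050 ms, worst case: 4100 ms
```

The sums (about 1.05 s and 4.1 s) line up with the ~1.0-4.0s total, and LLM inference accounts for roughly half of the worst case.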
Source 1: End-of-Speech Detection
Before transcription starts, the platform needs to know the caller has finished speaking. This uses voice activity detection (VAD).

Problem: VAD has an inherent delay - it must wait long enough to distinguish a pause within a sentence from the end of a turn.

Fixes:

- Reduce the end-of-speech silence threshold if your callers tend to speak in short, complete sentences. A threshold that’s too long adds an unnecessary pause before the agent responds.
- Increase the threshold if the agent is frequently cutting the caller off mid-sentence (the caller pauses but hasn’t finished).
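
The trade-off between the two fixes can be seen in a toy endpointing loop. This is a simplified sketch, not DialNexa's actual VAD: it assumes audio has already been classified into fixed-length speech/silence frames, and the function and parameter names are hypothetical.

```python
def end_of_turn_ms(frames, frame_ms, silence_threshold_ms):
    """Return the time (ms) at which end of turn is declared, or None.

    frames is a sequence of booleans: True if the frame contains speech.
    The turn ends once silence following speech has lasted for
    silence_threshold_ms.
    """
    silence_ms = 0
    heard_speech = False
    for i, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0  # any speech resets the silence counter
        elif heard_speech:
            silence_ms += frame_ms
            if silence_ms >= silence_threshold_ms:
                return (i + 1) * frame_ms
    return None

# 100 ms frames: 300 ms of speech, a 200 ms mid-sentence pause,
# 300 ms more speech (ending at 800 ms), then silence.
frames = [True] * 3 + [False] * 2 + [True] * 3 + [False] * 10

print(end_of_turn_ms(frames, 100, 200))  # 500: fires during the mid-sentence pause
print(end_of_turn_ms(frames, 100, 400))  # 1200: survives the pause, responds 400 ms after speech ends
```

With the 200 ms threshold the caller is cut off mid-sentence; with 400 ms the turn is detected correctly, at the cost of 400 ms of added delay on every turn.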
Source 2: Transcription Latency
Transcription is typically the fastest stage, but model choice matters.
Deepgram Nova 3
Fastest available transcriber. Optimized for real-time speech recognition. Recommended for most deployments. Latency: ~200-350ms.
Deepgram Flux
Optimized for difficult audio conditions (noisy environments, heavy accents). Slightly higher latency than Nova 3 but more accurate in those conditions. Latency: ~300-500ms.
Soniox
Well-suited for Indian English and regional accents. Use when caller accuracy is more important than raw speed. Latency: ~350-600ms.
Source 3: LLM Inference Latency
LLM inference is typically the largest source of latency, especially for longer prompts and longer responses.

Model comparison (time to first token):

| Model | Speed | Quality | Best for |
|---|---|---|---|
| GPT-4o Mini | Fastest (~300-600ms) | Good | Default for most use cases |
| GPT-4o | Medium (~600ms - 1.2s) | Better reasoning | Complex multi-step conversations |
| Groq Llama 4 | Fast (~300-500ms) | Good | High-volume, latency-sensitive deployments |
| DeepSeek V3 | Fast - Medium | Good | Cost-optimised use cases |

Fixes:

- Shorten your system prompt: Every token in the prompt increases processing time. Keep instructions concise. Use bullet points, not paragraphs.
- Reduce response length: Instruct the agent to keep responses short, for example “Respond in 1-2 sentences only.” Shorter outputs finish generating sooner.
- Avoid complex tool chains: A tool call that triggers another tool call doubles the LLM round-trips. Flatten your tool logic where possible.
- Pre-compute context: If your agent always needs business hours or product info, inject it as a dynamic variable at call start rather than having the agent fetch it via a tool mid-call.
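
The pre-compute approach from the last bullet can be sketched as filling a prompt template once at call start. This uses Python's standard `string.Template`; the template text, variable names, and example values are hypothetical and do not reflect DialNexa's dynamic-variable syntax.

```python
from string import Template

# Hypothetical prompt template; the $placeholders stand in for dynamic variables.
SYSTEM_PROMPT = Template(
    "You are a receptionist for $business_name.\n"
    "- Business hours: $business_hours\n"
    "- Respond in 1-2 sentences only."
)

def build_prompt(call_context):
    """Fill the template once at call start so no mid-call tool fetch is needed."""
    return SYSTEM_PROMPT.substitute(call_context)

# Example values only; in practice these would come from your own data source.
prompt = build_prompt({
    "business_name": "Acme Dental",
    "business_hours": "Mon-Fri 9am-6pm",
})
print(prompt)
```

Because the business hours are already in the prompt, the agent can answer "When are you open?" in a single LLM round-trip instead of calling a tool and then generating a second response.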
Source 4: TTS Synthesis Latency
Text-to-speech synthesis is the final audio-producing stage. Both voice providers support streaming synthesis, meaning audio playback begins before the full response is synthesized.

| Provider | Streaming | Latency (first audio chunk) | Quality |
|---|---|---|---|
| Cartesia | Yes | ~80-150ms | Very natural |
| ElevenLabs | Yes | ~150-300ms | Highly natural |
Because both providers stream, the latency figures above represent time-to-first-audio, not time-to-complete-utterance.
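
The gap between time-to-first-audio and time-to-complete-utterance can be illustrated with a small calculation. The per-chunk synthesis times below are made-up numbers for illustration, not measurements from either provider.

```python
def playback_start_ms(chunk_synthesis_ms, streaming):
    """Time until the caller hears audio, given per-chunk synthesis times (ms)."""
    if streaming:
        # Playback begins as soon as the first chunk is ready.
        return chunk_synthesis_ms[0]
    # Without streaming, playback waits for every chunk to be synthesized.
    return sum(chunk_synthesis_ms)

chunks = [120, 110, 130, 100]  # hypothetical per-chunk synthesis times

print(playback_start_ms(chunks, streaming=True))   # 120
print(playback_start_ms(chunks, streaming=False))  # 460
```

With streaming, the caller-perceived delay depends only on the first chunk, which is why response length affects time-to-complete far more than time-to-first-audio.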
Source 5: Audio Cache
Audio Cache is DialNexa’s most powerful single optimization for latency reduction on static phrases.

How it works: You pre-specify phrases that your agent says frequently and predictably - greetings, confirmations, hold messages. DialNexa pre-synthesizes these phrases and caches the audio. When the agent needs to say one of these phrases, the cached audio is served instantly - zero TTS latency.

Example phrases that benefit from caching:

- “Thank you for calling. How can I help you today?”
- “Please hold while I look that up.”
- “I’ve successfully booked your appointment.”
- “Is there anything else I can help you with?”
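
Conceptually, a phrase cache of this kind works like a lookup table that is filled before the call. The sketch below is an illustration of the idea only: the class, the whitespace/case normalization, and the `fake_tts` stand-in are assumptions, not a description of how DialNexa implements Audio Cache.

```python
class AudioCache:
    """Toy phrase cache: pre-synthesize static phrases, serve them on a match."""

    def __init__(self, synthesize, phrases):
        self._synthesize = synthesize
        # Pre-synthesize up front so no TTS call happens for these phrases mid-call.
        self._audio = {self._key(p): synthesize(p) for p in phrases}

    @staticmethod
    def _key(text):
        # Assumption: normalize case and whitespace so trivial variations still hit.
        return " ".join(text.lower().split())

    def get_audio(self, text):
        """Return (audio, cache_hit) for the given text."""
        cached = self._audio.get(self._key(text))
        if cached is not None:
            return cached, True
        # Cache miss: fall back to live synthesis and pay the normal TTS latency.
        return self._synthesize(text), False


def fake_tts(text):
    # Stand-in for a real TTS call; returns bytes so the example runs offline.
    return text.encode("utf-8")

cache = AudioCache(fake_tts, ["Please hold while I look that up."])
print(cache.get_audio("Please hold while I look that up.")[1])  # True
print(cache.get_audio("Your balance is 42 dollars.")[1])        # False
```

Only phrases that are said verbatim benefit; anything containing per-call details, such as a name or an amount, falls through to live synthesis.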
Network Latency
Network round-trip time between the caller’s phone, the carrier, and DialNexa’s servers adds latency that varies by geography.

For Indian deployments: Enable India region routing to process calls through DialNexa’s India-based infrastructure. This reduces round-trip time for Indian callers significantly compared to routing through global (typically US-based) servers. See Indian Server Routing for setup.
Measuring Latency
Every call detail page shows per-turn latency:

- Transcription latency: Time from end of caller speech to transcript ready
- LLM latency: Time from transcript ready to LLM response received
- TTS latency: Time from LLM response to first audio byte
- Total turn latency: Sum of the above
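
The definitions above amount to differences between consecutive event timestamps. A minimal sketch, assuming you have those four timestamps for a turn (the `TurnTimestamps` type and its field names are hypothetical, not a DialNexa export format):

```python
from dataclasses import dataclass

@dataclass
class TurnTimestamps:
    """Hypothetical event times for one turn, in ms from call start."""
    speech_end: int
    transcript_ready: int
    llm_response: int
    first_audio_byte: int

def turn_latency(t):
    """Break a turn into the per-stage latencies defined above."""
    stages = {
        "transcription_ms": t.transcript_ready - t.speech_end,
        "llm_ms": t.llm_response - t.transcript_ready,
        "tts_ms": t.first_audio_byte - t.llm_response,
    }
    stages["total_ms"] = sum(stages.values())
    return stages

print(turn_latency(TurnTimestamps(10_000, 10_250, 10_800, 10_920)))
# {'transcription_ms': 250, 'llm_ms': 550, 'tts_ms': 120, 'total_ms': 920}
```

Comparing the three stage values across several turns shows which stage to tune first; in this example the LLM stage accounts for more than half of the turn.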
Latency Optimization Checklist
| Setting | Recommendation |
|---|---|
| Transcriber | Deepgram Nova 3 |
| LLM | GPT-4o Mini (or Groq Llama 4) |
| TTS | Cartesia |
| Audio Cache | Enable for all static phrases |
| India routing | Enable for Indian caller base |
| System prompt | Keep under 500 tokens |
| Response length instruction | “Respond in 1-2 sentences” |
| Tool calls | Minimize tool calls per turn |
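
Parts of the checklist can be checked mechanically. The sketch below assumes a plain dictionary describing an agent's settings; the key names are hypothetical, and the token count is a crude approximation (about 4 characters per token for English text) rather than a real tokenizer.

```python
RECOMMENDED = {
    "transcriber": {"Deepgram Nova 3"},
    "llm": {"GPT-4o Mini", "Groq Llama 4"},
    "tts": {"Cartesia"},
}
MAX_PROMPT_TOKENS = 500

def estimate_tokens(text):
    # Crude heuristic; use the model's real tokenizer for an exact count.
    return len(text) // 4

def checklist_warnings(config):
    """Return a list of checklist items the given config does not follow."""
    warnings = []
    for key, allowed in RECOMMENDED.items():
        if config.get(key) not in allowed:
            warnings.append(f"{key}: consider {' or '.join(sorted(allowed))}")
    if not config.get("audio_cache_enabled", False):
        warnings.append("audio_cache: enable for static phrases")
    if estimate_tokens(config.get("system_prompt", "")) > MAX_PROMPT_TOKENS:
        warnings.append("system_prompt: keep under 500 tokens")
    return warnings

config = {
    "transcriber": "Deepgram Nova 3",
    "llm": "GPT-4o",
    "tts": "Cartesia",
    "audio_cache_enabled": True,
    "system_prompt": "You are a booking assistant. Respond in 1-2 sentences only.",
}
print(checklist_warnings(config))
# ['llm: consider GPT-4o Mini or Groq Llama 4']
```

A warning is not necessarily a mistake: GPT-4o may be the right choice for complex multi-step conversations, so treat the output as a prompt to review the trade-off.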
Related Pages
- Reliability Overview - overall reliability architecture
- Indian Server Routing - India-specific latency routing
- Audio Cache - full Audio Cache configuration reference