Voice configuration controls how your agent sounds to callers. It covers the speech synthesis provider, the specific voice within that provider, and per-voice settings that adjust prosody, speed, and output volume. Getting voice right directly affects caller trust and call completion rates.

Voice providers

DialNexa integrates with four TTS (text-to-speech) providers: ElevenLabs, Cartesia, SmallestAI, and Sarvam AI. Each has a different catalog size, latency profile, and configuration surface.

ElevenLabs

Broad catalog with hundreds of voices. Natural prosody and expressive range. Best for agents where realism and variety matter more than raw latency. Supports voice cloning and voice import.
Default model: eleven_flash_v2_5
Settings: Voice Model, Speed, Stability, Volume

Cartesia

Smaller catalog optimized for low-latency delivery. Suitable for high-volume deployments where TTS delay is a primary cost.
Model: sonic-2
Settings: Speed

SmallestAI

Optimized for Indian languages. Includes Indian voice personas (Diya, Raman, Ananya, Aarav, and more). Best for agents targeting Indian callers in Hindi, Hinglish, or Indian English.
Models: lightning, lightning-large, lightning-v2
Settings: Voice Model, Voice

Sarvam AI

India-focused TTS provider. Strong support for Indian English and regional Indian language contexts.
Model: bulbul:v2
Default language: en-IN

Choosing a provider

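| Consideration  | ElevenLabs                  | Cartesia                 | SmallestAI              | Sarvam AI              |
|----------------|-----------------------------|--------------------------|-------------------------|------------------------|
| Voice variety  | Large catalog               | Smaller catalog          | Indian voices           | Indian English         |
| Latency        | Moderate                    | Lower                    | Low                     | Moderate               |
| Voice cloning  | Supported                   | Not supported            | Not supported           | Not supported          |
| Best for       | Brand voice, expressiveness | High-volume, low-latency | Indian language callers | Indian English callers |
| Language focus | Multilingual                | Multilingual             | Indian languages        | Indian English (en-IN) |
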
Use ElevenLabs when your use case requires a specific voice personality, cloning, or fine-grained prosody control. Use Cartesia when call volume is high and TTS latency is a primary constraint. Use SmallestAI or Sarvam AI for Indian-language or Indian-English call scenarios.
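
The guidance above can be read as a simple decision rule. The following is a minimal sketch of that rule, not part of DialNexa: the `Requirements` class, its field names, and `suggest_providers` are hypothetical names introduced only for illustration, and the ordering of checks is one reasonable interpretation of the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Requirements:
    """Hypothetical summary of what an agent needs from its TTS provider."""
    needs_voice_cloning: bool = False  # a cloned or imported brand voice
    indian_languages: bool = False     # Hindi, Hinglish, regional Indian languages
    indian_english: bool = False       # en-IN callers
    latency_critical: bool = False     # high volume, TTS delay is the main constraint


def suggest_providers(req: Requirements) -> List[str]:
    """Return provider names in rough order of fit, following the guidance above."""
    # Voice cloning is listed as supported only for ElevenLabs.
    if req.needs_voice_cloning:
        return ["ElevenLabs"]
    # Indian-language scenarios point to the India-focused providers.
    if req.indian_languages:
        return ["SmallestAI", "Sarvam AI"]
    if req.indian_english:
        return ["Sarvam AI", "SmallestAI"]
    # High call volume with latency as the primary constraint.
    if req.latency_critical:
        return ["Cartesia"]
    # Otherwise default to the broadest, most expressive catalog.
    return ["ElevenLabs"]


if __name__ == "__main__":
    print(suggest_providers(Requirements(latency_critical=True)))
```

Treat the result as a starting shortlist; previewing candidate voices with real prompt content remains the deciding step.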

The voice selector

The voice selector in your agent’s Speech Settings lets you browse, filter, and preview available voices. Filters available:
  • Provider — ElevenLabs or Cartesia
  • Gender — Male, Female, Neutral
  • Language — narrows to voices that perform well in the selected language
  • Use case — Conversational, Narration, Customer Support, and other catalog tags (ElevenLabs)
Previewing voices: each voice card has a preview clip. Click the play icon to hear a sample before selecting. Preview clips are short and pre-recorded; synthesis quality in production may vary based on your prompt content and phrasing patterns.
Nexa voice IDs: every voice in the DialNexa system has a stable identifier in the format vel_.... This ID is what the API and webhooks use to reference a voice. Copy a voice’s Nexa ID from the voice card in the selector. Use Nexa voice IDs when configuring agents programmatically so that display name changes on the provider side do not break your configuration.
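
When configuring agents programmatically, it can help to reject anything that is not a Nexa voice ID before it reaches your configuration. The sketch below is illustrative only: the `build_voice_config` helper and the `voice`, `provider`, and `voice_id` field names are assumptions, not the documented DialNexa API schema, and the only fact taken from this page is the `vel_` prefix. The ID in the usage example is a made-up placeholder.

```python
NEXA_VOICE_PREFIX = "vel_"  # the only part of the ID format stated in the docs


def build_voice_config(nexa_voice_id: str, provider: str) -> dict:
    """Build a hypothetical voice config fragment keyed by Nexa voice ID.

    The returned field names are assumptions for illustration; check the
    API reference for the real request schema.
    """
    # Reject display names and provider-native IDs: only the stable Nexa ID
    # survives renames on the provider side.
    if not nexa_voice_id.startswith(NEXA_VOICE_PREFIX):
        raise ValueError(f"expected a Nexa voice ID starting with {NEXA_VOICE_PREFIX!r}")
    if len(nexa_voice_id) == len(NEXA_VOICE_PREFIX):
        raise ValueError("Nexa voice ID has a prefix but no identifier part")
    return {"voice": {"provider": provider, "voice_id": nexa_voice_id}}


if __name__ == "__main__":
    # "vel_placeholder" is not a real voice; copy the actual ID from the voice card.
    print(build_voice_config("vel_placeholder", "ElevenLabs"))
```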

Voice and language interaction

The voice selector automatically scopes to voices compatible with your agent’s primary language. Configuring an agent with Hindi as the primary language surfaces Hindi-compatible voices.
Selecting a voice that does not support your agent’s primary language produces degraded or unintelligible audio. Verify voice-language compatibility before deploying.
Some voices support multiple languages. If your agent uses auto language switching, confirm that the selected voice supports all candidate languages, not just the primary one. See Supported Languages for per-voice-provider language coverage.
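
A pre-deployment check for this might look like the sketch below. It assumes you have already collected the language codes a voice supports (for example from the Supported Languages page); the function name and the language codes in the usage example are hypothetical.

```python
from typing import Iterable, List


def unsupported_languages(
    voice_languages: Iterable[str],
    primary_language: str,
    candidate_languages: Iterable[str] = (),
) -> List[str]:
    """Return the required languages that the voice does not cover.

    An empty list means the voice covers the primary language and every
    candidate language used for auto language switching.
    """
    supported = set(voice_languages)
    missing: List[str] = []
    # Check the primary language first, then each switching candidate,
    # keeping order and skipping duplicates.
    for lang in [primary_language, *candidate_languages]:
        if lang not in supported and lang not in missing:
            missing.append(lang)
    return missing


if __name__ == "__main__":
    # Hypothetical coverage data for one voice.
    missing = unsupported_languages({"hi", "en-IN"}, "hi", ["en-IN", "ta"])
    if missing:
        print("Voice does not cover:", ", ".join(missing))
```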

Voice settings

ElevenLabs settings

Voice Model: ElevenLabs offers multiple synthesis models that trade quality against latency. DialNexa defaults to Flash v2.5, which is optimized for real-time conversation. Unless you have a specific reason to use an alternate model, keep this at the default.

| Model                | ID                         | Latency      | Best for                                |
|----------------------|----------------------------|--------------|-----------------------------------------|
| Flash v2.5 (default) | eleven_flash_v2_5          | Lowest       | Real-time conversation                  |
| Flash v2             | eleven_flash_v2            | Low          | Real-time conversation, older model     |
| Turbo v2.5           | eleven_turbo_v2_5          | Low-moderate | Balance of speed and quality            |
| Turbo v2             | eleven_turbo_v2            | Moderate     | Quality, slightly older turbo tier      |
| Multilingual v2      | eleven_multilingual_v2     | Higher       | Highest quality, multi-language support |
| Multilingual STS v2  | eleven_multilingual_sts_v2 | Higher       | Speech-to-speech, multilingual          |
| English STS v2       | eleven_english_sts_v2      | Moderate     | Speech-to-speech, English               |

Speed: Controls speech rate. Range: 0.7 (slower) to 1.2 (faster). Default: 1.0. For customer support agents, slightly slower speech (0.9) often improves comprehension on mobile phone connections with variable audio quality.
Stability: Controls how consistent the voice sounds across utterances. Lower stability introduces variation, which is more expressive but less predictable. Higher stability produces flatter, more uniform delivery.
  • High stability (0.8 and above): recommended for transactional agents where consistent tone matters more than expressiveness
  • Low stability (0.3 to 0.5): recommended for conversational agents where natural variation sounds more human
Volume: Output gain adjustment. Range: -6 dB to +6 dB. Default: 0 dB. Increase it if callers report the agent is hard to hear. Decrease it if callers report clipping or distortion.
Volume adjustment applies to TTS output only. It does not affect the caller’s microphone gain or the transcription pipeline.
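
The ranges and model IDs above lend themselves to a client-side sanity check before settings are submitted. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the 0.0 to 1.0 bound on stability is an assumption, since this page gives recommended bands (0.3 to 0.5, 0.8 and above) but not the full range.

```python
from typing import List, Optional

# Model IDs as listed in the Voice Model table.
ELEVENLABS_MODELS = {
    "eleven_flash_v2_5", "eleven_flash_v2", "eleven_turbo_v2_5", "eleven_turbo_v2",
    "eleven_multilingual_v2", "eleven_multilingual_sts_v2", "eleven_english_sts_v2",
}
SPEED_RANGE = (0.7, 1.2)        # documented range, default 1.0
VOLUME_DB_RANGE = (-6.0, 6.0)   # documented range, default 0 dB
STABILITY_RANGE = (0.0, 1.0)    # ASSUMPTION: full range is not stated on this page


def validate_elevenlabs_settings(
    model: str = "eleven_flash_v2_5",
    speed: float = 1.0,
    stability: Optional[float] = None,
    volume_db: float = 0.0,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems: List[str] = []
    if model not in ELEVENLABS_MODELS:
        problems.append(f"unknown model ID: {model}")
    if not SPEED_RANGE[0] <= speed <= SPEED_RANGE[1]:
        problems.append(f"speed {speed} outside {SPEED_RANGE}")
    # Stability is optional here because no default is documented.
    if stability is not None and not STABILITY_RANGE[0] <= stability <= STABILITY_RANGE[1]:
        problems.append(f"stability {stability} outside {STABILITY_RANGE}")
    if not VOLUME_DB_RANGE[0] <= volume_db <= VOLUME_DB_RANGE[1]:
        problems.append(f"volume {volume_db} dB outside {VOLUME_DB_RANGE}")
    return problems


if __name__ == "__main__":
    # A support agent tuned for slower, consistent delivery.
    print(validate_elevenlabs_settings(speed=0.9, stability=0.8))
    print(validate_elevenlabs_settings(speed=1.5, volume_db=9))
```

Returning a list of problems rather than raising on the first one lets a settings form report every out-of-range field at once.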

Cartesia settings

Speed: Controls speech rate using the same principles as ElevenLabs speed. Adjust based on your caller population and expected call environment.

Audio Cache

DialNexa caches TTS output for repeated phrases across calls. When the agent produces the same phrase again (such as a greeting or a disclaimer), the cached audio is served instead of re-synthesizing. This reduces both latency and TTS cost.
Audio Cache is enabled by default for cascaded agents. Non-super-admin users should contact DialNexa support if they need it disabled. Disable it only if your agent’s phrasing is highly dynamic and cache hits are unlikely, or if you are testing voice changes and need fresh synthesis on every call.
Speech to Speech agents do not use Audio Cache because they do not send text through a separate TTS provider.
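
To make the idea concrete, here is a minimal sketch of phrase-level TTS caching. It is not DialNexa's implementation, which this page does not describe: the `AudioCache` class, the cache key (voice ID plus whitespace-normalized text), and the in-memory dictionary are all assumptions chosen for illustration. A real cache would presumably also key on voice settings such as speed and model.

```python
import hashlib
from typing import Callable, Dict


class AudioCache:
    """Illustrative in-memory cache in front of a TTS synthesis function."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize      # (voice_id, text) -> audio bytes
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(voice_id: str, text: str) -> str:
        # ASSUMPTION: collapse runs of whitespace so trivially different
        # renderings of the same phrase share one cache entry.
        normalized = " ".join(text.split())
        return hashlib.sha256(f"{voice_id}\x00{normalized}".encode("utf-8")).hexdigest()

    def get_audio(self, voice_id: str, text: str) -> bytes:
        key = self._key(voice_id, text)
        if key in self._store:
            self.hits += 1                 # repeated phrase: no synthesis call
            return self._store[key]
        self.misses += 1                   # first occurrence: synthesize and store
        audio = self._synthesize(voice_id, text)
        self._store[key] = audio
        return audio


if __name__ == "__main__":
    def fake_tts(voice_id: str, text: str) -> bytes:
        # Stand-in for a real provider call.
        return f"{voice_id}:{text}".encode("utf-8")

    cache = AudioCache(fake_tts)
    for _ in range(3):
        cache.get_audio("vel_placeholder", "Thank you for calling.")
    print(cache.hits, cache.misses)
```

The sketch also shows why highly dynamic phrasing defeats the cache: any change to the text produces a new key, so every utterance becomes a miss.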