Voice - DialNexa Documentation

Voice configuration controls how your agent sounds to callers. It covers the speech synthesis provider, the specific voice within that provider, and per-voice settings that adjust prosody, speed, and output volume. Getting voice right directly affects caller trust and call completion rates.

Voice providers

DialNexa integrates with four TTS (text-to-speech) providers: ElevenLabs, Cartesia, SmallestAI, and Sarvam AI. Each has a different catalog size, latency profile, and configuration surface.

ElevenLabs

Broad catalog with hundreds of voices. Natural prosody and expressive range. Best for agents where realism and variety matter more than raw latency. Supports voice cloning and voice import.
Default model: eleven_flash_v2_5
Settings: Voice Model, Speed, Stability, Volume

Cartesia

Smaller catalog optimized for low-latency delivery. Suitable for high-volume deployments where TTS delay is a primary cost.
Model: sonic-2
Settings: Speed

SmallestAI

Optimized for Indian languages. Includes Indian voice personas (Diya, Raman, Ananya, Aarav, and more). Best for agents targeting Indian callers in Hindi, Hinglish, or Indian English.
Models: lightning, lightning-large, lightning-v2
Settings: Voice Model, Voice

Sarvam AI

India-focused TTS provider. Strong support for Indian English and regional Indian language contexts.
Model: bulbul:v2
Default language: en-IN

Choosing a provider

Consideration	ElevenLabs	Cartesia	SmallestAI	Sarvam AI
Voice variety	Large catalog	Smaller catalog	Indian voices	Indian English
Latency	Moderate	Lower	Low	Moderate
Voice cloning	Supported	Not supported	Not supported	Not supported
Best for	Brand voice, expressiveness	High-volume, low-latency	Indian language callers	Indian English callers
Language focus	Multilingual	Multilingual	Indian languages	Indian English (`en-IN`)

Use ElevenLabs when your use case requires a specific voice personality, cloning, or fine-grained prosody control. Use Cartesia when call volume is high and TTS latency is a primary constraint. Use SmallestAI or Sarvam AI for Indian-language or Indian-English call scenarios.
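The guidance above can be read as a simple decision order. The following is a minimal sketch of that reading, not a DialNexa API: the `VoiceRequirements` fields and the `suggest_provider` function are hypothetical names, and the priority order (cloning first, then language focus, then latency) is an assumption drawn from the comparison table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceRequirements:
    """Hypothetical description of what an agent needs from its voice."""
    needs_cloning: bool = False        # brand voice or cloned voice required
    indian_languages: bool = False     # Hindi, Hinglish, regional languages
    indian_english_only: bool = False  # en-IN callers
    latency_critical: bool = False     # high volume, TTS delay is the constraint


def suggest_provider(req: VoiceRequirements) -> str:
    """Return a provider name following the comparison table (assumed order)."""
    # The table lists voice cloning as supported only by ElevenLabs.
    if req.needs_cloning:
        return "ElevenLabs"
    # Language focus narrows the choice before latency does.
    if req.indian_languages:
        return "SmallestAI"
    if req.indian_english_only:
        return "Sarvam AI"
    if req.latency_critical:
        return "Cartesia"
    # Default to the broadest catalog when nothing else constrains the choice.
    return "ElevenLabs"


print(suggest_provider(VoiceRequirements(latency_critical=True)))
```

Treat the result as a starting point for previewing voices, since real deployments often weigh several of these factors at once.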

The voice selector

The voice selector in your agent’s Speech Settings lets you browse, filter, and preview available voices. Filters available:

Provider — ElevenLabs or Cartesia
Gender — Male, Female, Neutral
Language — narrows to voices that perform well in the selected language
Use case — Conversational, Narration, Customer Support, and other catalog tags (ElevenLabs)

Previewing voices: each voice card has a preview clip. Click the play icon to hear a sample before selecting. Preview clips are short and pre-recorded; synthesis quality in production may vary based on your prompt content and phrasing patterns.

Nexa voice IDs: every voice in the DialNexa system has a stable identifier in the format vel_.... This ID is what the API and webhooks use to reference a voice. Copy a voice’s Nexa ID from the voice card in the selector. Use Nexa voice IDs when configuring agents programmatically so that display name changes on the provider side do not break your configuration.
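When configuring agents from code, a small guard can catch a display name passed where a Nexa voice ID is expected. This is a sketch under stated assumptions: only the `vel_` prefix comes from this page, while the `voice_id` field name, the helper names, and the placeholder ID are hypothetical and do not reflect the actual DialNexa API schema.

```python
NEXA_VOICE_ID_PREFIX = "vel_"


def is_nexa_voice_id(value: str) -> bool:
    """Prefix check only: the full ID format is not documented on this page."""
    return value.startswith(NEXA_VOICE_ID_PREFIX) and len(value) > len(NEXA_VOICE_ID_PREFIX)


def build_voice_config(voice_ref: str) -> dict:
    """Build a voice config fragment (hypothetical shape) keyed by Nexa voice ID."""
    if not is_nexa_voice_id(voice_ref):
        raise ValueError(
            f"{voice_ref!r} does not look like a Nexa voice ID; "
            "copy the ID from the voice card instead of using the display name"
        )
    return {"voice_id": voice_ref}  # 'voice_id' is an assumed field name


# 'vel_example123' is a made-up placeholder, not a real voice.
print(build_voice_config("vel_example123"))
```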

Voice and language interaction

The voice selector automatically scopes to voices compatible with your agent’s primary language. Configuring an agent with Hindi as the primary language surfaces Hindi-compatible voices.

Selecting a voice that does not support your agent’s primary language produces degraded or unintelligible audio. Verify voice-language compatibility before deploying.

Some voices support multiple languages. If your agent uses auto language switching, confirm that the selected voice supports all candidate languages, not just the primary one. See Supported Languages for per-voice-provider language coverage.
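The compatibility check described above amounts to a set comparison between the languages a voice supports and the languages the agent may speak. A minimal sketch follows, assuming you have already looked up the voice's language list (for example from Supported Languages); the function name and the language codes in the example are illustrative only.

```python
from typing import Iterable, List


def unsupported_languages(
    voice_languages: Iterable[str],
    primary_language: str,
    candidate_languages: Iterable[str] = (),
) -> List[str]:
    """Return the agent languages that the selected voice does not cover."""
    supported = set(voice_languages)
    # With auto language switching, every candidate must be covered,
    # not just the primary language.
    required = {primary_language, *candidate_languages}
    return sorted(required - supported)


missing = unsupported_languages(
    voice_languages=["hi", "en-IN"],
    primary_language="hi",
    candidate_languages=["en-IN", "ta"],
)
if missing:
    print("Voice does not support:", ", ".join(missing))
```

An empty result means the voice covers the primary language and every switching candidate.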

Voice settings

ElevenLabs settings

Voice Model: ElevenLabs offers multiple synthesis models that trade quality against latency. DialNexa defaults to Flash v2.5, which is optimized for real-time conversation. Unless you have a specific reason to use an alternate model, keep this at the default.

Model	ID	Latency	Best for
Flash v2.5 (default)	`eleven_flash_v2_5`	Lowest	Real-time conversation
Flash v2	`eleven_flash_v2`	Low	Real-time conversation, older model
Turbo v2.5	`eleven_turbo_v2_5`	Low-moderate	Balance of speed and quality
Turbo v2	`eleven_turbo_v2`	Moderate	Quality, slightly older turbo tier
Multilingual v2	`eleven_multilingual_v2`	Higher	Highest quality, multi-language support
Multilingual STS v2	`eleven_multilingual_sts_v2`	Higher	Speech-to-speech, multilingual
English STS v2	`eleven_english_sts_v2`	Moderate	Speech-to-speech, English

Speed: Controls speech rate. Range: 0.7 (slower) to 1.2 (faster). Default: 1.0. For customer support agents, slightly slower speech (0.9) often improves comprehension on mobile phone connections with variable audio quality.

Stability: Controls how consistent the voice sounds across utterances. Lower stability introduces variation, which sounds more expressive but less predictable. Higher stability produces flatter, more uniform delivery.

High stability (0.8 and above): recommended for transactional agents where consistent tone matters more than expressiveness
Low stability (0.3 to 0.5): recommended for conversational agents where natural variation sounds more human

Volume: Output gain adjustment. Range: -6 dB to +6 dB. Default: 0 dB. Increase if callers report the agent is hard to hear. Decrease if callers report clipping or distortion.

Volume adjustment applies to TTS output only. It does not affect the caller’s microphone gain or the transcription pipeline.
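The documented ranges can be checked before a configuration is saved. The sketch below is illustrative rather than a DialNexa schema: the class and field names are hypothetical, the speed and volume ranges come from this page, and the 0.0 to 1.0 stability range is an assumption because this page gives only recommended bands. The decibel conversion uses the standard amplitude relation, gain = 10^(dB/20).

```python
from dataclasses import dataclass
from typing import List

SPEED_RANGE = (0.7, 1.2)        # from this page
VOLUME_DB_RANGE = (-6.0, 6.0)   # from this page
STABILITY_RANGE = (0.0, 1.0)    # assumption: full range is not stated here


@dataclass(frozen=True)
class ElevenLabsVoiceSettings:
    """Hypothetical container for the four ElevenLabs settings."""
    stability: float            # no default is documented, so it is required
    model: str = "eleven_flash_v2_5"
    speed: float = 1.0
    volume_db: float = 0.0

    def problems(self) -> List[str]:
        """Return one message per setting that falls outside its range."""
        issues = []
        for name, value, (low, high) in (
            ("speed", self.speed, SPEED_RANGE),
            ("stability", self.stability, STABILITY_RANGE),
            ("volume_db", self.volume_db, VOLUME_DB_RANGE),
        ):
            if not low <= value <= high:
                issues.append(f"{name}={value} is outside [{low}, {high}]")
        return issues


def db_to_linear_gain(volume_db: float) -> float:
    """Convert a dB adjustment to a linear amplitude multiplier."""
    return 10 ** (volume_db / 20)


support_agent = ElevenLabsVoiceSettings(stability=0.8, speed=0.9)
print(support_agent.problems())
print(round(db_to_linear_gain(6.0), 3))
```

The conversion shows why the range is narrow: +6 dB roughly doubles the output amplitude, and -6 dB roughly halves it.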

Cartesia settings

Speed: Controls speech rate using the same principles as ElevenLabs speed. Adjust based on your caller population and expected call environment.

Audio Cache

DialNexa caches TTS output for repeated phrases across calls. When the agent produces the same phrase again (such as a greeting or a disclaimer), the cached audio is served instead of re-synthesizing. This reduces both latency and TTS cost.

Audio Cache is enabled by default for cascaded agents. Non-super-admin users should contact DialNexa support if they need it disabled. Disable it only if your agent’s phrasing is highly dynamic and cache hits are unlikely, or if you are testing voice changes and need fresh synthesis on every call.

Speech to Speech agents do not use Audio Cache because they do not send text through a separate TTS provider.
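To make the caching idea concrete, here is a minimal in-memory sketch of phrase-level TTS caching. It is a conceptual illustration only and not how DialNexa implements Audio Cache: the `PhraseAudioCache` class, the cache key (voice ID plus whitespace-normalized text), and the `synthesize` callback are all assumptions.

```python
import hashlib
from typing import Callable, Dict


class PhraseAudioCache:
    """Serve previously synthesized audio for repeated (voice, phrase) pairs."""

    def __init__(self, synthesize: Callable[[str, str], bytes]) -> None:
        self._synthesize = synthesize      # called only on a cache miss
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(voice_id: str, text: str) -> str:
        # Collapse whitespace so trivially different spacing still hits.
        normalized = " ".join(text.split())
        return hashlib.sha256(f"{voice_id}\x00{normalized}".encode("utf-8")).hexdigest()

    def get_audio(self, voice_id: str, text: str) -> bytes:
        key = self._key(voice_id, text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        audio = self._synthesize(voice_id, text)
        self._store[key] = audio
        return audio


def fake_synthesize(voice_id: str, text: str) -> bytes:
    """Stand-in for a real TTS call; returns placeholder bytes."""
    return f"{voice_id}:{text}".encode("utf-8")


cache = PhraseAudioCache(fake_synthesize)
cache.get_audio("vel_example123", "Thanks for calling.")
cache.get_audio("vel_example123", "Thanks  for calling.")  # same phrase, extra space
print(cache.hits, cache.misses)
```

A fixed greeting or disclaimer hits the cache on every call after the first, while highly dynamic phrasing mostly misses, which matches the guidance on when disabling the cache makes sense.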
