Speech to Speech Agents in DialNexa use a realtime speech model to listen to the caller and speak back directly. They are best for latency-sensitive calls where natural turn taking matters more than separate control over speech to text, text to speech, and Audio Cache settings.

[Figure: DialNexa create-agent modal with Speech to Speech selected as the agent type.]

What This Page Helps You Do

This page helps you decide whether Speech to Speech is the right agent type, choose between available OpenAI realtime and Gemini model paths, configure the agent, and verify the first test call before routing live traffic.

Before You Begin

You need:
  • Access to a DialNexa workspace where Speech to Speech is enabled
  • Permission to create or edit agents
  • A test phone route or web call route
  • A short caller script for comparing OpenAI realtime and Gemini models fairly
  • Any functions or integrations already configured if the realtime model must take actions during the call

When To Use Speech To Speech Agents

Use Speech to Speech Agents when fast spoken turns are central to the call experience. Good candidates include web calls, interruption-heavy sales conversations, short support triage, and demos where first audio timing is easy for users to notice. Use a cascaded Single Prompt Agent or Conversational Flow Agent instead when you need to tune the transcriber, pick a separate voice provider, use fallback STT, rely on Audio Cache, or audit each branch in a visual flow.
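
The guidance above can be sketched as a small decision helper. This is an illustration only: the `CallRequirements` fields and the returned labels are hypothetical names invented for this example, not DialNexa settings or API values.

```python
from dataclasses import dataclass


@dataclass
class CallRequirements:
    # Hypothetical flags that mirror the criteria named in this section.
    latency_sensitive: bool = False
    needs_transcriber_tuning: bool = False
    needs_separate_voice_provider: bool = False
    needs_fallback_stt: bool = False
    needs_audio_cache: bool = False
    needs_visual_flow_audit: bool = False


def suggest_agent_type(req: CallRequirements) -> str:
    """Suggest an agent family from the stated requirements."""
    needs_cascaded_controls = any([
        req.needs_transcriber_tuning,
        req.needs_separate_voice_provider,
        req.needs_fallback_stt,
        req.needs_audio_cache,
        req.needs_visual_flow_audit,
    ])
    # Controls that only exist in the cascaded stack take priority over latency.
    if needs_cascaded_controls:
        return "cascaded"
    if req.latency_sensitive:
        return "speech_to_speech"
    # Assumption: with no strong signal, compare both on a test call.
    return "compare_both"
```

For example, `suggest_agent_type(CallRequirements(latency_sensitive=True, needs_audio_cache=True))` returns `"cascaded"`, because Audio Cache is only available in the cascaded stack.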

How Speech To Speech Differs From Cascaded Agents

| Control | Cascaded Single Prompt or Flow Agent | Speech to Speech Agent |
| --- | --- | --- |
| Listening | Uses a transcriber such as Deepgram or Soniox. | The realtime model listens directly. |
| Reasoning | Uses a text LLM such as OpenAI, Google, or Groq. | The realtime speech model handles the turn. |
| Speaking | Uses a separate TTS voice and voice model. | Uses compatible realtime voice options for the selected model path. |
| Audio Cache | Available for repeated TTS phrases. | Not used because there is no separate TTS cache. |
| Fallback STT | Can be configured for cascaded agents. | Not used. |
| Typical max duration | Up to 90 minutes where enabled. | Up to 60 minutes where enabled. |
| Pricing preview | Can show transcriber, LLM, voice engine, and telephony components. | Shows realtime model pricing plus telephony where applicable. |

OpenAI Realtime And Gemini Model Paths

Speech to Speech model availability depends on the workspace. The dashboard model selector is the source of truth for which realtime models are currently enabled.

| Model path | Use it when | What to validate before production |
| --- | --- | --- |
| OpenAI realtime models | You want to test direct speech behavior with OpenAI realtime options shown in your workspace. | Interruption handling, function calls, welcome timing, voice fit, first audio timing, and cost preview. |
| Gemini models | You want to compare Gemini Speech to Speech behavior, including Gemini voice choices where enabled. | Gemini voice fit, automatic activity detection, tool calls, voicemail behavior, welcome startup, long-call continuity, and visible INR pricing. |

[Figure: DialNexa Speech to Speech agent builder showing a realtime speech model selector and no separate transcriber selector.]

Gemini Speech to Speech support can appear as a Gemini live model such as gemini-3.1-flash-live-preview where enabled. OpenAI realtime options can appear in the same Speech to Speech model selector. Confirm the exact model names, rates, and voices in your workspace before planning production cost or quality.

[Figure: DialNexa Speech to Speech model menu comparing OpenAI realtime and Gemini model options with INR per-minute pricing.]

Gemini Speech To Speech Details

Gemini Speech to Speech uses a Gemini realtime model and compatible Gemini voices. It listens to caller audio, produces spoken audio directly, and does not require a separate transcriber or text to speech provider.

[Figure: DialNexa Speech to Speech agent editor with Gemini model pricing, prompt editor, voice controls, and global settings.]

[Figure: DialNexa Gemini Speech to Speech voice selector showing Gemini voices, voice IDs, filters, search, preview controls, and Use Voice action.]

| Gemini S2S behavior | What users should know |
| --- | --- |
| Model category | Gemini live models appear as Speech to Speech models, not cascaded text LLMs. |
| Voice choices | Gemini S2S uses Gemini-compatible voices. The visible voice list can include voices such as Aoede, Algieba, Alnilam, Autonoe, Callirrhoe, Charon, Zephyr, and Zubenelgenubi where enabled. |
| Turn taking | Gemini automatic activity detection handles interruptions and caller barge-in. Test short greetings and interruption-heavy scripts. |
| Tools | Gemini S2S can use configured agent tools where the selected model path supports them. |
| Long calls | Session resumption and context compression can help longer sessions continue through realtime connection limits, but long calls still need production-like tests. |
| Welcome audio | Gemini welcome audio can be prepared through the Gemini TTS path so the first spoken line starts faster when prewarm succeeds. |
| Pricing preview | The selector can show INR per-minute realtime model pricing. Confirm workspace pricing before using it for cost planning. |
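
For rough cost planning from a per-minute INR rate, a minimal sketch follows. It assumes simple linear per-minute billing with no rounding or minimum charge, which may not match actual DialNexa billing. The function name and every rate in it are placeholders, not published prices.

```python
def estimate_campaign_cost_inr(
    calls: int,
    avg_minutes_per_call: float,
    model_rate_inr_per_min: float,
    telephony_rate_inr_per_min: float = 0.0,
) -> float:
    """Estimate total cost in INR, assuming linear per-minute billing."""
    if calls < 0 or avg_minutes_per_call < 0:
        raise ValueError("calls and minutes must be non-negative")
    # Realtime model pricing plus telephony where applicable.
    per_minute = model_rate_inr_per_min + telephony_rate_inr_per_min
    return calls * avg_minutes_per_call * per_minute


# Placeholder rates for illustration only; read real rates from the model selector.
estimate = estimate_campaign_cost_inr(
    calls=100,
    avg_minutes_per_call=3,
    model_rate_inr_per_min=4.0,
    telephony_rate_inr_per_min=1.0,
)
print(estimate)  # prints 1500.0
```

Running the same estimate with the rates shown for each model path gives a like-for-like number to set beside the quality results from test calls.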

Set Up A Speech To Speech Agent

1. Create a new agent
   Open the Agents tab, click New Agent, and select Speech to Speech where it is available.

2. Choose the realtime model
   Select the OpenAI realtime or Gemini model option you want to test. Check the visible pricing preview before continuing.

3. Choose a compatible voice
   Pick from the voices available for the selected realtime model. For Gemini S2S, use the Gemini voice selector and listen to samples before saving.

4. Write a concise prompt
   Keep the role, goal, boundaries, tool rules, and closing behavior explicit. Realtime speech quality still depends on clear instructions.

5. Configure tools only when needed
   Add functions or dashboard integrations only when the live call needs them. Then test the tool path with real caller phrasing.

6. Publish and assign a route
   Publish the version, then assign it to the phone number, web call, batch call, or workflow route that should use it.

Verify The Result

After the first test call, review both the subjective call feel and the call evidence.

| Check | What good looks like |
| --- | --- |
| First audio timing | The agent begins speaking quickly enough for the route. |
| Interruption handling | The agent stops, listens, and recovers when the caller interrupts. |
| Tool behavior | Function calls use correct arguments and do not fire before required facts are collected. |
| Voice fit | The selected realtime voice sounds clear for names, amounts, dates, and the longest line in the script. |
| Call History | Transcript, recording, summary, status, and post-call fields support the result your team expects. |
| Cost preview | The selected model and telephony cost fit the campaign or route volume. |
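
The tool behavior check can be made concrete when reviewing function arguments by hand. The sketch below is a hypothetical review helper, not a DialNexa feature: it takes the fact names a function needs and the arguments seen on a call, then reports which required facts were missing or blank when the function fired.

```python
def missing_required_facts(required: list[str], arguments: dict[str, str]) -> list[str]:
    """Return required fact names that are absent or blank in the call arguments."""
    return [
        name for name in required
        if not str(arguments.get(name, "")).strip()
    ]


# Illustrative data only: a booking function that needs a name and a date.
required_facts = ["customer_name", "appointment_date"]
observed_arguments = {"customer_name": "Asha", "appointment_date": ""}

print(missing_required_facts(required_facts, observed_arguments))
# prints ['appointment_date']
```

An empty list means every required fact was present in the arguments. A non-empty list suggests the prompt's tool rules need tightening before publishing.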

Troubleshooting

  • Separate STT settings are missing. This is expected for Speech to Speech Agents. The realtime model listens directly, so separate STT settings are not used.
  • Audio Cache settings are missing. This is expected. Speech to Speech does not send text through a separate TTS provider, so there is no TTS cache to configure.
  • A required realtime model is missing. Model availability depends on workspace configuration. Check the model selector in the dashboard or contact DialNexa support.
  • You are unsure whether Speech to Speech is the right fit. Compare against a cascaded Single Prompt Agent using the same script. If separate STT, TTS, Audio Cache, or fallback STT controls matter more than latency, use the cascaded stack.
  • OpenAI realtime and Gemini results are hard to compare. Keep the prompt, function schema, route, and caller script identical when comparing the two model paths. Review function arguments in Call History before publishing.
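
One way to keep a model comparison fair is to write down each test run and confirm the fixed fields match before judging results. The sketch below uses plain dictionaries as test notes. The field names and the OpenAI model label are hypothetical placeholders, not DialNexa identifiers.

```python
# Fields that should be identical across both runs, per the guidance above.
FIXED_FIELDS = ("prompt", "function_schema", "route", "caller_script")


def mismatched_fields(run_a: dict, run_b: dict) -> list[str]:
    """Return the fixed fields whose values differ between two test runs."""
    return [f for f in FIXED_FIELDS if run_a.get(f) != run_b.get(f)]


shared = {
    "prompt": "You are a support triage agent...",
    "function_schema": {"name": "create_ticket", "required": ["issue"]},
    "route": "web-call-test",
    "caller_script": "script-v1",
}
# Only the model (and its compatible voice) should vary between runs.
openai_run = {**shared, "model": "openai-realtime-placeholder"}
gemini_run = {**shared, "model": "gemini-3.1-flash-live-preview"}

print(mismatched_fields(openai_run, gemini_run))  # prints []
```

A non-empty result names the fields to align before rerunning the comparison.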

Recap

Speech to Speech Agents are for realtime voice behavior. OpenAI realtime and Gemini models can both be valid choices where enabled, but the winning path should be proven with the same prompt, route, caller script, tool setup, and Call History review.

Types Of Agents

Choose between Single Prompt, Conversational Flow, and Speech to Speech.

Provider Selection Guide

Compare speech, model, voice, and telephony layers.

LLMs And Conversation Behavior

Understand model behavior and fallback settings.

Testing Agents

Test before publishing a realtime model path.