The two modes
Streaming (word-by-word)
The transcriber emits partial results as words are recognized, before the caller has finished speaking. The system can begin processing earlier, which can reduce the time to first byte of the agent's response.
Best for: low-latency applications, short utterances, turn-based conversations
Endpoint-based
The transcriber waits for a speech endpoint (a detected pause or end of utterance) before emitting a complete result. The transcript is more accurate but arrives later.
Best for: complex utterances, high-accuracy requirements, callers who speak in long sentences
How mode affects response latency
Response latency = transcription time + LLM processing time + TTS synthesis time + network latency
Transcription mode affects the first component.
- Streaming mode: the transcriber begins delivering text while the caller is still speaking, so the LLM pipeline can start processing partial input. When combined with a high Response Eagerness setting, the agent can start responding very quickly after the caller stops. Total perceived latency is lower.
- Endpoint-based mode: the complete transcript arrives only after the endpoint is detected, so the LLM starts processing once the full utterance is available. This typically adds 200 to 800 ms of latency, depending on the endpoint detection sensitivity.
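As a rough worked example of the latency formula, the sketch below sums the four components and applies the 200 to 800 ms range quoted for endpoint-based mode. The names `LatencyBudget` and `endpoint_range_ms` are hypothetical, and the component values are illustrative placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import Tuple

# Extra delay attributed to endpoint-based mode in this document (milliseconds).
ENDPOINT_EXTRA_MS = (200, 800)


@dataclass
class LatencyBudget:
    """Hypothetical per-component latencies, in milliseconds."""
    transcription_ms: float
    llm_ms: float
    tts_ms: float
    network_ms: float

    def total_ms(self) -> float:
        # Response latency = transcription + LLM + TTS + network
        return self.transcription_ms + self.llm_ms + self.tts_ms + self.network_ms


def endpoint_range_ms(streaming: LatencyBudget) -> Tuple[float, float]:
    """Estimate the endpoint-based total by adding the 200-800 ms range."""
    base = streaming.total_ms()
    low, high = ENDPOINT_EXTRA_MS
    return (base + low, base + high)


# Illustrative numbers only; measure your own pipeline.
streaming = LatencyBudget(transcription_ms=150, llm_ms=400, tts_ms=200, network_ms=50)
print(streaming.total_ms())          # 800
print(endpoint_range_ms(streaming))  # (1000, 1600)
```

With these placeholder values, switching to endpoint-based mode moves the estimate from 800 ms to somewhere between 1000 and 1600 ms.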
Streaming mode can cause the agent to respond before the caller is fully done speaking. If Response Eagerness is set too high, the agent will interrupt callers mid-sentence. Tune Response Eagerness alongside transcription mode.
Deepgram models and mode support
| Deepgram Model | Streaming | Endpoint-based | Notes |
|---|---|---|---|
| Nova-2 | Yes | Yes | General purpose, high accuracy |
| Nova-2 (Medical) | No | Yes | Specialized vocabulary |
| Nova-2 (Phone Call) | Yes | Yes | Optimized for telephone audio |
| Whisper (via Deepgram) | No | Yes | Highest accuracy, higher latency |
| Base | Yes | Yes | Lower cost, lower accuracy |
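If you need to validate a configuration in your own tooling, the table can be mirrored as a small lookup. This is a minimal sketch: `MODEL_MODES`, `supported_modes`, and `streaming_models` are hypothetical names, and the values are copied from the table above rather than fetched from any API.

```python
# Mode support copied from the table above: (streaming, endpoint_based).
MODEL_MODES = {
    "Nova-2": (True, True),
    "Nova-2 (Medical)": (False, True),
    "Nova-2 (Phone Call)": (True, True),
    "Whisper (via Deepgram)": (False, True),
    "Base": (True, True),
}


def supported_modes(model: str) -> list:
    """Return the transcription modes the given model supports."""
    streaming, endpoint_based = MODEL_MODES[model]
    modes = []
    if streaming:
        modes.append("streaming")
    if endpoint_based:
        modes.append("endpoint-based")
    return modes


def streaming_models() -> list:
    """Return the models that can run in streaming mode, in table order."""
    return [name for name, (streaming, _) in MODEL_MODES.items() if streaming]


print(supported_modes("Nova-2 (Medical)"))  # ['endpoint-based']
print(streaming_models())                   # ['Nova-2', 'Nova-2 (Phone Call)', 'Base']
```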
Choosing between modes
Use streaming mode when:
- Perceived response latency is a top priority
- Callers speak in short, clear phrases (command-style input)
- Your agents handle simple, transactional intents where partial transcripts are sufficient
- You have tuned Response Eagerness to prevent premature interruptions
Use endpoint-based mode when:
- Callers speak in long or complex sentences that benefit from full-context transcription
- Transcription accuracy is more important than latency (e.g., medical, legal contexts)
- Callers have accents or speech patterns that cause streaming partial transcripts to be unstable
- You are using a specialized model (medical, financial) that only supports endpoint-based mode
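The criteria in this section can be read as a simple decision rule. The sketch below is one possible encoding, not product behavior: `suggest_mode` and its parameters are hypothetical, and the choice to fall back to endpoint-based mode when nothing else applies is an assumption.

```python
def suggest_mode(
    latency_is_top_priority: bool,
    long_or_complex_utterances: bool,
    accuracy_over_latency: bool,
    unstable_partials: bool,
    model_supports_streaming: bool,
) -> str:
    """Suggest a transcription mode from the criteria listed in this section."""
    # A model without streaming support leaves only one option.
    if not model_supports_streaming:
        return "endpoint-based"
    # Any accuracy-related concern favors waiting for the full utterance.
    if long_or_complex_utterances or accuracy_over_latency or unstable_partials:
        return "endpoint-based"
    if latency_is_top_priority:
        return "streaming"
    # Assumed default: prefer accuracy when no criterion applies.
    return "endpoint-based"


# Short, command-style input on a streaming-capable model.
print(suggest_mode(True, False, False, False, True))  # streaming
# Medical dictation on a model that only supports endpoint-based mode.
print(suggest_mode(True, True, True, False, False))   # endpoint-based
```

Even when this rule suggests streaming, Response Eagerness still needs tuning to avoid premature interruptions.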
Response Eagerness relationship
Response Eagerness is a separate setting that controls how aggressively the agent interrupts or begins responding. It interacts directly with transcription mode:
- Streaming + High Eagerness: very fast responses, higher risk of interrupting callers
- Streaming + Low Eagerness: faster than endpoint-based, but the agent waits for more stable partial transcripts
- Endpoint-based + any Eagerness: the agent always waits for the full transcript before considering a response
Configuring transcription mode
Transcription mode is set per agent in Settings > Speech > Transcription.
Select a transcription model
Choose the Deepgram model from the Transcription Model dropdown. The available modes for that model are shown automatically.