Prerequisites
- An ElevenLabs-backed workspace (voice cloning is ElevenLabs only)
- Audio samples that meet the quality requirements below
- Workspace owner or admin role
Audio requirements
The quality of the cloned voice depends entirely on the quality of the input audio. Poor samples produce a voice that sounds thin, robotic, or inconsistent.

| Requirement | Specification |
|---|---|
| Minimum duration | 10 seconds of clean speech |
| Recommended duration | 1 to 3 minutes across multiple files |
| Format | MP3, WAV, M4A, FLAC |
| Sample rate | 16 kHz or higher |
| Channels | Mono or stereo (mono preferred) |
| Background noise | None or near-zero |
| Music or effects | Not allowed — speech only |
| Multiple speakers | Not allowed — single speaker per clone |
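
The measurable rows of this table can be checked locally before uploading. The following is a minimal sketch, not part of the product: it handles WAV files only (using Python's standard `wave` module), treats the 10-second minimum as a total across all files, and cannot detect background noise, music, or multiple speakers. The function names are hypothetical.

```python
import wave

MIN_TOTAL_SECONDS = 10.0   # minimum duration from the table (assumed to apply to the total)
MIN_SAMPLE_RATE = 16_000   # 16 kHz or higher


def inspect_wav(path):
    """Return (duration_seconds, sample_rate, channels) for one WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return wav.getnframes() / rate, rate, wav.getnchannels()


def check_samples(paths):
    """Return a list of problems; an empty list means the files pass these checks."""
    problems = []
    total_seconds = 0.0
    for path in paths:
        seconds, rate, channels = inspect_wav(path)
        total_seconds += seconds
        if rate < MIN_SAMPLE_RATE:
            problems.append(f"{path}: sample rate {rate} Hz is below {MIN_SAMPLE_RATE} Hz")
        if channels not in (1, 2):
            problems.append(f"{path}: {channels} channels (mono or stereo only)")
    if total_seconds < MIN_TOTAL_SECONDS:
        problems.append(f"total duration {total_seconds:.1f}s is below {MIN_TOTAL_SECONDS:.0f}s")
    return problems
```

Passing these checks says nothing about how clean the speech is, so listen to each file before uploading.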
Clone steps
Open Voice Cloning
In your workspace, go to Settings > Voices > Clone Voice. If you do not see this option, your workspace plan does not include voice cloning — contact support.
Name the voice
Enter a name for the cloned voice. This name appears in the voice selector and in the API. Choose something descriptive, such as “Brand Voice - Female EN” rather than “Clone 1”.
Upload audio samples
Drag and drop your audio files into the upload area, or click Choose Files. You can upload multiple files. The uploader accepts MP3, WAV, M4A, and FLAC up to 25 MB per file.
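
If you prepare files with a script, the uploader's format and size limits can be checked beforehand. This is a sketch under two assumptions: that "25 MB" means 25 × 1024 × 1024 bytes, and that the format is judged by file extension. The helper name `upload_problem` is hypothetical.

```python
import os

ALLOWED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac"}
MAX_BYTES = 25 * 1024 * 1024  # assumption: "25 MB" interpreted as 25 MiB


def upload_problem(filename, size_bytes):
    """Return a description of why the file would be rejected, or None if it looks fine."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        return f"{filename}: unsupported format '{extension}'"
    if size_bytes > MAX_BYTES:
        return f"{filename}: {size_bytes} bytes exceeds the 25 MB limit"
    return None
```

For a file on disk, call it as `upload_problem(path, os.path.getsize(path))`.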
Submit for processing
Click Create Voice. Processing typically takes 30 seconds to 3 minutes depending on the total duration of uploaded audio, and can take up to 5 minutes for uploads longer than 3 minutes. You do not need to stay on the page; you will receive a workspace notification when the clone is ready.
Using the cloned voice in an agent
After cloning, the voice appears in the voice selector under My Voices (or your workspace’s custom voice section). Select it the same way you would any catalog voice. The cloned voice has a Nexa voice ID (vel_...) that you can use in API-configured agents.
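
As a rough illustration of passing the voice ID to an API-configured agent, the sketch below builds a configuration fragment. The field name `voice_id` and the helper `agent_voice_config` are assumptions made for this example, not the documented API schema; only the `vel_` prefix comes from this page. Check the API reference for the actual request shape.

```python
import json


def agent_voice_config(voice_id):
    """Build a hypothetical agent config fragment that references a cloned voice."""
    # Nexa voice IDs start with "vel_" according to this page.
    if not voice_id.startswith("vel_"):
        raise ValueError(f"expected a Nexa voice ID starting with 'vel_', got {voice_id!r}")
    # "voice_id" is an assumed field name, used here for illustration only.
    return {"voice_id": voice_id}


if __name__ == "__main__":
    # Replace the placeholder with the ID shown for your cloned voice.
    print(json.dumps(agent_voice_config("vel_..."), indent=2))
```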
Processing time
| Audio duration uploaded | Typical processing time |
|---|---|
| Under 1 minute | 30 to 60 seconds |
| 1 to 3 minutes | 1 to 3 minutes |
| Over 3 minutes | 3 to 5 minutes |
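
For scripts that need a rough wait estimate, the table can be expressed as a lookup. This is a sketch only: the function name is hypothetical, and the boundary handling (exactly 1 minute and exactly 3 minutes both fall in the middle row) is an assumption, since the table does not say.

```python
def typical_processing_seconds(audio_seconds):
    """Return (low, high) typical processing time in seconds for a given upload duration."""
    if audio_seconds < 60:        # under 1 minute of audio
        return (30, 60)
    if audio_seconds <= 180:      # 1 to 3 minutes of audio
        return (60, 180)
    return (180, 300)             # over 3 minutes of audio
```

These are typical values, not guarantees, so any timeout built on them should allow some headroom.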
Limitations
- Language support: cloned voices inherit language capability from the ElevenLabs cloning system. A voice cloned from English audio primarily performs well in English. For non-English synthesis, use the Multilingual v2 model, but expect some accent bleed from the source recordings.
- Quality expectations: cloning does not produce a perfect replica. The output is a synthetic approximation. Longer and more varied source audio produces better results. A 30-second sample will produce noticeably lower quality than 2 minutes of varied speech.
- Usage rights: you are responsible for ensuring you have the rights to clone the voice in the audio you upload. DialNexa does not verify consent or ownership of uploaded recordings.
- Editing or re-cloning: you cannot modify a clone after creation. To improve quality, delete the existing clone and create a new one with better audio.
- Number of clones: the number of cloned voices per workspace is subject to your plan limits. Check Settings > Billing for your current usage.
Troubleshooting
The clone sounds robotic or flat
This usually means the source audio was too short or lacked variety. Upload at least 1 minute of natural, conversational speech covering multiple sentence types.
Processing failed with no error message
The most common cause is an unsupported audio format or a corrupt file. Convert your audio to WAV at 16 kHz and retry.
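
One way to do the conversion is with the `ffmpeg` command-line tool, which this page does not mention and which must be installed separately. The sketch below wraps it from Python; the function names are hypothetical. Downmixing to mono with `-ac 1` is a choice made here because the requirements table lists mono as preferred.

```python
import subprocess


def ffmpeg_to_wav_command(src, dst, sample_rate=16_000):
    """Build the ffmpeg argument list that converts src to a mono WAV at sample_rate."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst if it already exists
        "-i", src,                # input file
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-ac", "1",               # downmix to a single (mono) channel
        dst,
    ]


def convert_to_wav(src, dst):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_to_wav_command(src, dst), check=True)
```

If ffmpeg itself fails on the input, the source file is probably corrupt and needs to be re-exported or re-recorded.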
The voice sounds like a different person
Background noise in the source audio causes the model to capture ambient characteristics rather than the speaker. Use a noise-cleaned version of the recording.
Non-English synthesis sounds heavily accented
Switch the agent’s Voice Model to Multilingual v2. Some accent from the source recordings may still carry over, especially if the speaker in the source audio was a non-native speaker.