What You Can Learn
A/B testing produces a side-by-side comparison across:
- Call outcomes: Completion rate, call duration, transfer rate, hang-up timing
- Post-call analysis fields: Sentiment scores, goal completion flags, custom extraction fields
- Transcript quality: Language naturalness, error handling, off-script recovery
- Model behavior: How different LLMs or temperatures handle edge cases and ambiguous inputs
Prerequisites
- Two published agent versions. These can differ by prompt, LLM, voice, or any other configuration.
- A phone number or route to apply the split to.
- Enough inbound or outbound call volume to reach statistical significance. Low-volume routes may need days or weeks to produce reliable results.
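To gauge whether a route has enough volume, you can estimate the calls each version needs before starting. The sketch below uses the standard sample-size approximation for a two-sided two-proportion z-test. The function names, the 5% significance level, and the 80% power are illustrative assumptions, not platform settings.

```python
import math
from statistics import NormalDist


def calls_needed_per_version(p_a: float, p_b: float,
                             alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate calls per version needed to detect completion rate p_a vs p_b."""
    if p_a == p_b:
        raise ValueError("p_a and p_b must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_a + p_b) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p_a * (1 - p_a) + p_b * (1 - p_b))) ** 2
    return math.ceil(numerator / (p_a - p_b) ** 2)


def days_to_reach(calls_needed: int, daily_calls: float, share: float) -> int:
    """Days until a version receiving `share` of daily traffic reaches calls_needed."""
    return math.ceil(calls_needed / (daily_calls * share))


# Hypothetical scenario: 60% baseline completion, hoping to detect 70%.
needed = calls_needed_per_version(0.60, 0.70)
print(needed, days_to_reach(needed, daily_calls=100, share=0.5))
```

Smaller expected differences require many more calls, which is why low-volume routes can take weeks.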
Set Up a Traffic Split
Open the phone number or route
Navigate to Phone Numbers and click the number you want to test on. For outbound routes, open the relevant workflow or campaign.
Enable A/B Testing
In the number or route detail view, find the A/B Testing section and toggle it on.
Select Version A and Version B
Choose the two published agent versions to compare. Version A is typically your current production version; Version B is the variant you are testing.
Set the traffic split
Set the percentage of traffic each version receives. A 50/50 split produces the fastest results because both versions accumulate calls at the same rate. To limit exposure to a new version, start by sending it only 10% or 20% of traffic and keep the rest on the stable version.
Traffic is split per call at random. Individual callers are not “sticky” to a version across calls unless you implement caller ID-based routing in your flow.
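The difference between per-call random assignment and caller-sticky assignment can be illustrated with a short sketch. This is not the platform's implementation. Both function names are hypothetical, and hashing the caller ID is just one possible way to build sticky routing in your own flow.

```python
import hashlib
import random


def pick_version_per_call(percent_b: int, rng: random.Random) -> str:
    """Independent random assignment: each call is drawn separately."""
    return "B" if rng.random() * 100 < percent_b else "A"


def pick_version_sticky(caller_id: str, percent_b: int) -> str:
    """Deterministic assignment: the same caller ID always maps to the same version."""
    digest = hashlib.sha256(caller_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "B" if bucket < percent_b else "A"


rng = random.Random()
caller = "+15550100"  # hypothetical caller ID

# The same caller may land on different versions across calls...
print([pick_version_per_call(20, rng) for _ in range(5)])
# ...whereas the hashed bucket never changes for a given caller ID.
print([pick_version_sticky(caller, 20) for _ in range(5)])
```

Sticky assignment keeps a repeat caller's experience consistent, but it means a few high-frequency callers can skew one version's results.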
Read Results
Open the A/B test results view from the number or route detail page. Results update in real time as calls complete.
Metrics Table
| Metric | Description |
|---|---|
| Calls (A / B) | Total calls handled by each version |
| Avg Duration | Mean call duration per version |
| Completion Rate | Percentage of calls that ended with a defined success outcome (based on post-call analysis fields) |
| Transfer Rate | Percentage of calls that triggered a transfer action |
| Avg Sentiment | Mean sentiment score from post-call analysis (if sentiment is enabled) |
| Custom Fields | Any post-call extraction fields you have defined, compared as averages or distributions |
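If you export call data and want to reproduce these metrics yourself, the aggregation could look like the sketch below. The `CallRecord` shape and its field names are assumptions made for illustration. They do not reflect an actual export format.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class CallRecord:
    # Hypothetical record shape; adapt to whatever your export contains.
    version: str                       # "A" or "B"
    duration_seconds: float
    completed: bool                    # success outcome from post-call analysis
    transferred: bool
    sentiment: Optional[float] = None  # None when sentiment is not enabled


def summarize(calls: list[CallRecord], version: str) -> dict:
    """Compute the per-version metrics listed in the table above."""
    rows = [c for c in calls if c.version == version]
    if not rows:
        return {"calls": 0}
    sentiments = [c.sentiment for c in rows if c.sentiment is not None]
    return {
        "calls": len(rows),
        "avg_duration": mean(c.duration_seconds for c in rows),
        "completion_rate": sum(c.completed for c in rows) / len(rows),
        "transfer_rate": sum(c.transferred for c in rows) / len(rows),
        "avg_sentiment": mean(sentiments) if sentiments else None,
    }
```

Calls without a sentiment score are left out of the sentiment average rather than counted as zero.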
Transcript Sampling
The results view shows a random sample of transcripts from each version side by side. Review these manually to assess qualitative differences in language, error handling, and caller experience.
Statistical significance is not automatically calculated. Use a standard two-proportion z-test or chi-squared test on completion rates once each version has at least 100 calls.
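A two-proportion z-test on completion rates needs only the call and completion counts for each version. The following is a minimal sketch using the Python standard library. The function name and the example counts are illustrative, not real results.

```python
import math


def two_proportion_z_test(success_a: int, total_a: int,
                          success_b: int, total_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in completion rates."""
    if total_a <= 0 or total_b <= 0:
        raise ValueError("each version needs at least one call")
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return 0.0, 1.0  # identical all-success or all-failure outcomes
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical counts: 120 of 200 calls completed on A, 144 of 200 on B.
z, p = two_proportion_z_test(120, 200, 144, 200)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A p-value below your chosen threshold (0.05 is a common convention) suggests the difference is unlikely to be due to chance alone.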
End the Test
When you have enough data to make a decision:
Review results
Confirm that one version outperforms the other on your key metric (completion rate, sentiment, or another post-call analysis field).
Disable A/B testing
Open the A/B Testing section and toggle it off. You will be prompted to select which version to keep as the active version.
Limitations
- Each phone number supports only one traffic split at a time, and a split compares exactly two versions. You cannot run a three-way split.
- Both versions must be published. Draft versions cannot be included in a split.
- Outbound batch campaigns do not support per-call A/B splits through the UI. For batch A/B testing, create two separate batches with different agent versions and compare results manually.
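For a manual comparison of two batches, one option is a chi-squared test on a 2x2 table of completed and not-completed calls per batch. The sketch below is an illustration, not a built-in feature. The function name and counts are hypothetical, and it applies no continuity correction.

```python
import math


def chi_squared_2x2(completed_a: int, total_a: int,
                    completed_b: int, total_b: int) -> tuple[float, float]:
    """Return (chi2, p-value) for a 2x2 completed / not-completed table (1 degree of freedom)."""
    a, b = completed_a, total_a - completed_a  # batch A: completed, not completed
    c, d = completed_b, total_b - completed_b  # batch B: completed, not completed
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0, 1.0  # a row or column is empty, so there is nothing to compare
    chi2 = n * (a * d - b * c) ** 2 / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-squared survival function for 1 df
    return chi2, p_value


# Hypothetical batches: 120 of 200 calls completed in batch A, 144 of 200 in batch B.
chi2, p = chi_squared_2x2(120, 200, 144, 200)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

Because the batches run separately, try to schedule them over comparable days and times so that timing differences do not masquerade as version differences.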