
TwiML™ Voice: <ConversationRelay>


Legal notice and public beta

ConversationRelay, including the <ConversationRelay> TwiML noun and API, uses artificial intelligence or machine learning technologies. By enabling or using any features or functionalities within Programmable Voice that Twilio identifies as using artificial intelligence or machine learning technology, you acknowledge and agree to certain terms. Your use of these features or functionalities is subject to the terms of the Predictive and Generative AI or ML Features Addendum.

ConversationRelay isn't compliant with the Payment Card Industry (PCI) and doesn't support Voice workflows that are subject to PCI.

ConversationRelay is currently available as a Public Beta product, and Twilio may change the information in this document at any time. This means that some features aren't yet implemented, and others may change before the product becomes Generally Available. Public Beta products aren't covered by a Twilio Service Level Agreement. Learn more about Twilio's beta product support here.

Info

Before using ConversationRelay, you need to complete the onboarding steps and agree to the Predictive and Generative AI/ML Features Addendum. See the ConversationRelay Onboarding Guide for more details.

The <ConversationRelay> TwiML noun under the <Connect> verb routes a call to Twilio's ConversationRelay service, providing advanced AI-powered voice interactions. ConversationRelay handles the complexities of live, synchronous voice calls, such as Speech-to-Text (STT) and Text-to-Speech (TTS) conversions, session management, and low-latency communication with your application. This approach allows your system to focus on processing conversational AI logic and sending back responses effectively.

In a typical setup, <ConversationRelay> connects to your AI application through a WebSocket, allowing real-time and event-based interaction. Your application receives transcribed caller speech in structured messages and sends responses as text, which ConversationRelay converts to speech and plays back to the caller. This setup is commonly used for customer service, virtual assistants, and other scenarios that require real-time, AI-based voice interactions.


Basic usage

Before you can use <ConversationRelay>, make sure you've completed the onboarding steps and configured your Twilio account accordingly.

WebSocket security

To ensure the secure operation of <ConversationRelay>, your WebSocket server must validate incoming requests using the Twilio signature. For detailed guidance on setting up signature validation, see Configure your WebSocket server.
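As a rough illustration of what signature validation involves, the sketch below recomputes a signature with Node's built-in crypto module and compares it with the value received in the X-Twilio-Signature header. It assumes the WebSocket handshake is signed like a webhook request without POST parameters (base64-encoded HMAC-SHA1 of the request URL, keyed with your auth token). The function names are hypothetical, and the linked guide remains the authority on the exact URL to sign. In practice the Twilio helper library's validateRequest function is the usual choice.

```javascript
const crypto = require('crypto');

// Assumption: the handshake carries no POST parameters, so the signed
// string is just the URL. Returns base64(HMAC-SHA1(authToken, url)).
function computeSignature(authToken, url) {
  return crypto.createHmac('sha1', authToken).update(url, 'utf8').digest('base64');
}

// Compares the received header value with the expected signature in
// constant time. A missing or wrong-length header is rejected up front
// because timingSafeEqual requires equal-length buffers.
function isValidSignature(authToken, url, headerSignature) {
  const expected = Buffer.from(computeSignature(authToken, url));
  const received = Buffer.from(String(headerSignature || ''));
  return expected.length === received.length &&
    crypto.timingSafeEqual(expected, received);
}
```

A server would call isValidSignature during the HTTP upgrade and refuse the connection when it returns false.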

Generating TwiML for <ConversationRelay>

Connect a Programmable Voice call to Twilio's ConversationRelay service.

const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const connect = response.connect({
  action: 'https://myhttpserver.com/connect_action'
});
connect.conversationRelay({
  url: 'wss://mywebsocketserver.com/websocket',
  welcomeGreeting: 'Hi! Ask me anything!'
});

console.log(response.toString());

Output

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect action="https://myhttpserver.com/connect_action">
    <ConversationRelay url="wss://mywebsocketserver.com/websocket" welcomeGreeting="Hi! Ask me anything!" />
  </Connect>
</Response>

  • action (optional): The URL that Twilio will request when the <Connect> verb ends.
  • url (required): The URL of your WebSocket server (must use the wss:// protocol).
  • welcomeGreeting (optional): The message automatically played to the caller after we answer the call and establish the WebSocket connection.

When the TwiML execution is complete, Twilio will make a callback to the action URL with call information and the return parameters from ConversationRelay.


<ConversationRelay> attributes

The <ConversationRelay> noun supports the following attributes:

| Attribute name | Description | Default value | Required |
|---|---|---|---|
| url | The URL to your WebSocket server (must use wss://). | | Required |
| welcomeGreeting | The message automatically played to the caller after we answer the call and establish the WebSocket connection. | | Optional |
| welcomeGreetingInterruptible | Specifies if the caller can interrupt the welcomeGreeting with speech. Values can be "none", "dtmf", "speech", or "any". For backward compatibility, Boolean values are also accepted: true = "any" and false = "none". | "any" | Optional |
| language | The language code (for example, "en-US") that applies to both Speech-to-Text (STT) and Text-to-Speech (TTS). Setting this attribute is equivalent to setting both ttsLanguage and transcriptionLanguage. | "en-US" | Optional |
| ttsLanguage | The default language code to use for TTS when the text token message doesn't specify a language. If you set both attributes, this one overrides the language attribute. You can modify this via the ttsLanguage field in the language message you send through the Service Provider Interface (SPI). | | Optional |
| ttsProvider | The provider for TTS. Available choices are "Google", "Amazon", and "ElevenLabs". | "Google" | Optional |
| voice | The voice used for TTS. Choices vary based on the ttsProvider. For details, refer to the Twilio TTS Voices. We list additional voices available for ConversationRelay below. | "en-US-Journey-O" (Google), "Joanna-Neural" (Amazon) | Optional |
| transcriptionLanguage | The language code to use for STT when the session starts. If you set both attributes, this one overrides the language attribute for the transcription language. You can modify this via the transcriptionLanguage field in the language message you send through the SPI. | | Optional |
| transcriptionProvider | The provider for STT (Speech Recognition). Available choices are "Google" and "Deepgram". | "Google" | Optional |
| speechModel | The speech model used for STT. Choices vary based on the transcriptionProvider. Refer to the provider's documentation for an accurate list. | "telephony" (Google), "nova-2-general" (Deepgram) | Optional |
| profanityFilter | Specifies whether to filter profanities out of the speech transcription. | "true" | Optional |
| interruptible | Specifies if caller speech can interrupt TTS playback. Values can be "none", "dtmf", "speech", or "any". For backward compatibility, Boolean values are also accepted: true = "any" and false = "none". | "any" | Optional |
| dtmfDetection | Specifies whether the system sends Dual-tone multi-frequency (DTMF) keypresses over the WebSocket. Set to true to turn on DTMF events. | false | Optional |
| preemptible | Specifies if the TTS of the current talk cycle can allow text tokens from the subsequent talk cycle to interrupt. | false | Optional |
| hints | A comma-separated list of words or phrases that helps Speech-to-Text recognition for uncommon words, product names, or domain-specific terminology. Works similarly to the hints attribute in <Gather>. | | Optional |

Additional TTS voices available for ConversationRelay

Available ElevenLabs voices

We've added TTS provider support for ElevenLabs, which provides additional natural-sounding voice synthesis. Use the interface below to search and filter through a wide selection of voices by language, accent, age, and more. Each voice entry includes a voiceID that you can copy and paste into your <ConversationRelay> configuration.

How to Use ElevenLabs Voices

  1. Search or Filter: Use the tool below to locate a voice that matches your requirements (for example, language, accent, category, age, gender, tag).
  2. Copy the voiceID: From the search results, copy the unique identifier (for example, NYC9WEgkq1u4jiqBseQ9).
  3. Configure <ConversationRelay>: In your TwiML, explicitly set ttsProvider="ElevenLabs" and use the copied voiceID in the voice attribute.

Example:

<Connect>
  <ConversationRelay url="wss://example.com/websocket" ttsProvider="ElevenLabs" voice="NYC9WEgkq1u4jiqBseQ9" ... />
</Connect>

Info

Since Google is the default ttsProvider, you must explicitly set ttsProvider="ElevenLabs" to use an ElevenLabs voice.

If you don't explicitly specify the voice attribute in your <ConversationRelay> configuration, ConversationRelay automatically applies a default voice based on the language setting (as defined by the language or ttsLanguage attribute) and the selected TTS provider (default is Google). Below is the complete list of default voice settings:

{
  "vi-VN": {"ttsProvider": "google", "voice": "vi-VN-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "ja-JP": {"ttsProvider": "google", "voice": "ja-JP-Standard-A", "asrProvider": "google", "speechModel": "telephony"},
  "fi-FI": {"ttsProvider": "google", "voice": "fi-FI-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "uk-UA": {"ttsProvider": "google", "voice": "uk-UA-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "en-US": {"ttsProvider": "google", "voice": "en-US-Chirp3-HD-Aoede", "asrProvider": "google", "speechModel": "telephony"},
  "en-IN": {"ttsProvider": "google", "voice": "en-IN-Standard-E", "asrProvider": "google", "speechModel": "long"},
  "ta-IN": {"ttsProvider": "google", "voice": "ta-IN-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "nl-BE": {"ttsProvider": "google", "voice": "nl-BE-Standard-A", "asrProvider": "google", "speechModel": "telephony"},
  "zh-CN": {"ttsProvider": "google", "voice": "zh-CN-Neural2-B", "asrProvider": "deepgram", "speechModel": "nova-2-general"},
  "ar-XA": {"ttsProvider": "google", "voice": "ar-XA-Wavenet-D", "asrProvider": "google", "speechModel": "long"},
  "te-IN": {"ttsProvider": "google", "voice": "te-IN-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "nl-NL": {"ttsProvider": "google", "voice": "nl-NL-Standard-A", "asrProvider": "google", "speechModel": "telephony"},
  "hi-IN": {"ttsProvider": "google", "voice": "hi-IN-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "bg-BG": {"ttsProvider": "google", "voice": "bg-BG-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "en-AU": {"ttsProvider": "google", "voice": "en-AU-Standard-A", "asrProvider": "google", "speechModel": "telephony"},
  "es-US": {"ttsProvider": "google", "voice": "es-US-Standard-A", "asrProvider": "google", "speechModel": "telephony"},
  "kn-IN": {"ttsProvider": "google", "voice": "kn-IN-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "cs-CZ": {"ttsProvider": "google", "voice": "cs-CZ-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "de-DE": {"ttsProvider": "google", "voice": "de-DE-Standard-A", "asrProvider": "google", "speechModel": "telephony"},
  "hu-HU": {"ttsProvider": "google", "voice": "hu-HU-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "ml-IN": {"ttsProvider": "google", "voice": "ml-IN-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "zh-TW": {"ttsProvider": "google", "voice": "zh-TW-Neural2-B", "asrProvider": "deepgram", "speechModel": "nova-2-general"},
  "zh-HK": {"ttsProvider": "google", "voice": "zh-HK-Neural2-B", "asrProvider": "deepgram", "speechModel": "nova-2-general"},
  "ko-KR": {"ttsProvider": "google", "voice": "ko-KR-Standard-B", "asrProvider": "google", "speechModel": "telephony"},
  "pt-BR": {"ttsProvider": "google", "voice": "pt-BR-Standard-D", "asrProvider": "google", "speechModel": "telephony"},
  "es-ES": {"ttsProvider": "google", "voice": "es-ES-Standard-A", "asrProvider": "google", "speechModel": "telephony"},
  "fr-CA": {"ttsProvider": "google", "voice": "fr-CA-Standard-A", "asrProvider": "google", "speechModel": "telephony"},
  "it-IT": {"ttsProvider": "google", "voice": "it-IT-Standard-A", "asrProvider": "google", "speechModel": "telephony"},
  "pl-PL": {"ttsProvider": "google", "voice": "pl-PL-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "ru-RU": {"ttsProvider": "google", "voice": "ru-RU-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "pt-PT": {"ttsProvider": "google", "voice": "pt-PT-Standard-A", "asrProvider": "google", "speechModel": "telephony"},
  "ro-RO": {"ttsProvider": "google", "voice": "ro-RO-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "sv-SE": {"ttsProvider": "google", "voice": "sv-SE-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "id-ID": {"ttsProvider": "google", "voice": "id-ID-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "mr-IN": {"ttsProvider": "google", "voice": "mr-IN-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "da-DK": {"ttsProvider": "google", "voice": "da-DK-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "tr-TR": {"ttsProvider": "google", "voice": "tr-TR-Standard-A", "asrProvider": "google", "speechModel": "long"},
  "fr-FR": {"ttsProvider": "google", "voice": "fr-FR-Standard-A", "asrProvider": "google", "speechModel": "telephony"},
  "en-GB": {"ttsProvider": "google", "voice": "en-GB-Standard-A", "asrProvider": "google", "speechModel": "telephony"},
  "th-TH": {"ttsProvider": "google", "voice": "th-TH-Standard-A", "asrProvider": "google", "speechModel": "long"}
}

Our internal configuration defines these default settings and updates them periodically. Refer to the Twilio TTS Voices documentation for a complete and current list of supported languages, default voices, and detailed settings.

By understanding these defaults, you can decide when it's necessary to explicitly set the voice parameter to achieve the desired auditory experience for your application.

Other TTS Providers and Voices

For additional voices from Google or Amazon (including generative options), refer to our Twilio TTS Voices documentation. Each provider offers a variety of languages and styles, enabling you to tailor your application's voice experience to your specific needs.


Include nested elements within <ConversationRelay> for more granular configuration. For more information on configuring ConversationRelay, refer to the ConversationRelay Onboarding Guide.

The <Language> element maps a language code to specific TTS and STT settings. Use this element to configure multiple languages for your session.

Example

Connect a Programmable Voice call to Twilio's ConversationRelay service.

const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const connect = response.connect();
const conversationrelay = connect.conversationRelay({
  url: 'wss://mywebsocketserver.com/websocket'
});
conversationrelay.language({
  code: 'sv-SE',
  ttsProvider: 'amazon',
  voice: 'Elin-Neural',
  transcriptionProvider: 'google',
  speechModel: 'long'
});
conversationrelay.language({
  code: 'en-US',
  ttsProvider: 'google',
  voice: 'en-US-Journey-O'
});

console.log(response.toString());

Output

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <ConversationRelay url="wss://mywebsocketserver.com/websocket">
      <Language code="sv-SE" ttsProvider="amazon" voice="Elin-Neural" transcriptionProvider="google" speechModel="long"/>
      <Language code="en-US" ttsProvider="google" voice="en-US-Journey-O" />
    </ConversationRelay>
  </Connect>
</Response>

Attributes

| Attribute name | Description of attributes | Default value | Required |
|---|---|---|---|
| code | The language code (for example, "en-US") that applies to both STT and TTS. | | Required |
| ttsProvider | The provider for TTS. Choices are "Google", "Amazon", and "ElevenLabs". | Inherited from <ConversationRelay> | Optional |
| voice | The voice used for TTS. Choices vary based on the ttsProvider. | Inherited from <ConversationRelay> | Optional |
| transcriptionProvider | The provider for STT. Choices are "Google" and "Deepgram". | Inherited from <ConversationRelay> | Optional |
| speechModel | The speech model used for STT. Choices vary based on the transcriptionProvider. | Inherited from <ConversationRelay> | Optional |
| language | The language code for the session (for example, "en-US"). | "en-US" | Optional |
| customParameter | Custom parameters to be sent in the setup message. | | Optional |

Notes

  • If you specify the same language code in both <ConversationRelay> and <Language>, the settings in <Language> take precedence.
  • ConversationRelay provides default settings for commonly used languages.

The <Parameter> element allows you to send custom parameters from the TwiML directly into the initial "setup" message sent over the WebSocket. These parameters appear under the customParameters field in the JSON message.

Example

Connect a Programmable Voice call to Twilio's ConversationRelay service.

const VoiceResponse = require('twilio').twiml.VoiceResponse;

const response = new VoiceResponse();
const connect = response.connect();
const conversationrelay = connect.conversationRelay({
  url: 'wss://mywebsocketserver.com/websocket'
});
conversationrelay.parameter({
  name: 'foo',
  value: 'bar'
});
conversationrelay.parameter({
  name: 'hint',
  value: 'Annoyed customer'
});

console.log(response.toString());

Output

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <ConversationRelay url="wss://mywebsocketserver.com/websocket">
      <Parameter name="foo" value="bar"/>
      <Parameter name="hint" value="Annoyed customer"/>
    </ConversationRelay>
  </Connect>
</Response>

Resulting Setup Message

{
  "type": "setup",
  "sessionId": "VX00000000000000000000000000000000",
  "callSid": "CA00000000000000000000000000000000",
  "...": "...",
  "customParameters": {
    "foo": "bar",
    "hint": "Annoyed customer"
  }
}

Language settings and their default values

Language settings refer to configurations for both Text-to-Speech and Speech-to-Text:

  • Text-to-Speech (TTS) settings:
    • ttsLanguage
    • ttsProvider
    • voice
  • Speech-to-Text (STT) settings:
    • transcriptionLanguage
    • transcriptionProvider
    • speechModel

Configure language settings

Configure language settings in two places:

  1. Attributes of <ConversationRelay>: These serve as the default settings used when the session starts.
  2. Within <Language> Elements: Each <Language> element configures settings for a specific language code. You can include multiple <Language> elements to support multiple languages.

Handle defaults and overrides

  • In <ConversationRelay>, the ttsLanguage attribute overrides the language attribute for the default TTS language.
  • In <ConversationRelay>, the transcriptionLanguage attribute overrides the language attribute for the STT language.
  • If a <Language> element specifies the same code attribute as in <ConversationRelay>, the <Language> element's settings take precedence.
  • The system uses default values when you don't provide specific settings.

Default Values

  • language: Defaults to en-US if not specified.
  • ttsProvider: Defaults to Google if not specified.
  • transcriptionProvider: Defaults to Google if not specified.
  • If you set the ttsProvider attribute without the voice attribute, the system uses a default voice for that provider.
  • If you set the transcriptionProvider attribute without the speechModel attribute, the system uses a default model for that provider.
  • If you set the voice attribute without the ttsProvider attribute, the system infers the provider from the default or specified ttsProvider.
  • If you set the speechModel attribute without the transcriptionProvider attribute, the system infers the provider from the default or specified transcriptionProvider.
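The override and default rules above can be sketched as a small function. This is only an illustration of the documented precedence, not Twilio's implementation. The function name resolveSessionSettings is hypothetical, and the sketch leaves out the per-provider default voices and speech models.

```javascript
// Resolves the session's starting settings from <ConversationRelay> attributes.
// ttsLanguage and transcriptionLanguage override language; language falls
// back to en-US; both providers fall back to Google.
function resolveSessionSettings(attrs = {}) {
  const language = attrs.language || 'en-US';
  return {
    ttsLanguage: attrs.ttsLanguage || language,
    transcriptionLanguage: attrs.transcriptionLanguage || language,
    ttsProvider: attrs.ttsProvider || 'Google',
    transcriptionProvider: attrs.transcriptionProvider || 'Google',
  };
}

// Example: language sets both directions, ttsLanguage overrides the TTS side.
console.log(resolveSessionSettings({ language: 'sv-SE', ttsLanguage: 'en-US' }));
```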

For Speech-to-Text (STT) settings:

  • At session start, the service uses the transcriptionLanguage attribute to initiate the STT session.
  • If the combination of the transcriptionProvider and speechModel attributes is invalid, the call disconnects, and the system reports an error in the action callback and error notifications.
  • You can change the transcriptionLanguage attribute during the session via the language message you send through the Service Provider Interface (SPI).

For Text-to-Speech (TTS) settings:

  • When the lang property is present in the text token message from the SPI, the service uses it to select the TTS voice.
  • If the combination of the ttsProvider and voice attributes is invalid, the system sends an error message over the SPI.
  • If you don't specify the lang property in the text token, the service uses the current TTS language settings.

Service Provider Interface (SPI) specification

ConversationRelay interacts with your application server via a WebSocket connection specified by the url attribute. Messages exchanged follow this Service Provider Interface (SPI) specification.

ConversationRelay validates all incoming SPI messages to ensure they conform to the expected format. If validation fails, Twilio returns error 64107 with details about the validation failure. The following validation rules apply:

For text message type:

  • The token field can't be null or missing.
  • If lang is provided, it must be one of the supported languages.

For play message type:

  • The source field must contain a valid URL.

For sendDigits message type:

  • The digits field can't be null or empty.
  • The digits field must only contain the characters 0-9, w, #, and *.

For language message type:

  • Either ttsLanguage or transcriptionLanguage must be present.
  • If provided, ttsLanguage must be one of the supported languages.
  • If provided, transcriptionLanguage must be one of the supported languages.

For clear message type:

  • No additional attributes or parameters may be present.

Info

ConversationRelay validates messages but continues the session even when it returns an error 64107 for non-conforming requests. These validation messages are informative only.
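Checking messages against the same rules before sending them can help catch mistakes early. The following is a minimal sketch of such a pre-flight check. The function name validateOutgoingMessage and the supportedLanguages allow-list are assumptions made for illustration, so substitute the languages your configuration actually supports.

```javascript
// Returns a list of problems; an empty list means the message passes the
// rules listed above. supportedLanguages is a hypothetical allow-list.
function validateOutgoingMessage(msg, supportedLanguages = ['en-US', 'sv-SE']) {
  const problems = [];
  const isSupported = (lang) => supportedLanguages.includes(lang);

  switch (msg.type) {
    case 'text':
      if (msg.token === null || msg.token === undefined) problems.push('token is required');
      if (msg.lang !== undefined && !isSupported(msg.lang)) problems.push('lang is not supported');
      break;
    case 'play':
      // The URL constructor throws when the value is not a valid URL.
      try { new URL(msg.source); } catch { problems.push('source must be a valid URL'); }
      break;
    case 'sendDigits':
      if (!msg.digits) problems.push('digits is required');
      else if (!/^[0-9w#*]+$/.test(msg.digits)) problems.push('digits may only contain 0-9, w, #, and *');
      break;
    case 'language':
      if (msg.ttsLanguage === undefined && msg.transcriptionLanguage === undefined) {
        problems.push('ttsLanguage or transcriptionLanguage is required');
      }
      if (msg.ttsLanguage !== undefined && !isSupported(msg.ttsLanguage)) {
        problems.push('ttsLanguage is not supported');
      }
      if (msg.transcriptionLanguage !== undefined && !isSupported(msg.transcriptionLanguage)) {
        problems.push('transcriptionLanguage is not supported');
      }
      break;
    case 'clear':
      if (Object.keys(msg).some((key) => key !== 'type')) {
        problems.push('clear takes no additional fields');
      }
      break;
    default:
      break; // Other message types have no rules listed in this section.
  }
  return problems;
}
```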

Messages from ConversationRelay to your application

ConversationRelay sends the setup message immediately after establishing the WebSocket connection.

{
  "type": "setup",
  "sessionId": "VX00000000000000000000000000000000",
  "callSid": "CA00000000000000000000000000000000",
  "from": "+14151234567",
  "to": "+18881234567",
  "direction": "inbound",
  "...": "...",
  "customParameters": {
    "foo": "bar"
  }
}

ConversationRelay sends the prompt message when the caller says something.

{
  "type": "prompt",
  "voicePrompt": "Hi! Can you tell me about life?",
  "lang": "en-US",
  "last": true
}

ConversationRelay sends the dtmf message when you turn on DTMF detection and the caller presses a key.

{
  "type": "dtmf",
  "digit": "1"
}

ConversationRelay sends the interrupt message when the caller interrupts TTS playback by speaking.

{
  "type": "interrupt",
  "utteranceUntilInterrupt": "Life is a complex set of",
  "durationUntilInterruptMs": "460"
}

ConversationRelay sends the error message when an error occurs during the session.

{
  "type": "error",
  "description": "Invalid message received: { \"foo\" : \"bar\" }"
}
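One way an application might route these incoming messages is a switch on the type field. The sketch below is an assumption-laden illustration: handleRelayMessage and the session object are hypothetical names, the replies are placeholders where a real application would call its LLM, and the function returns the messages to send back instead of writing to a socket so that it stays independent of any WebSocket library.

```javascript
// Routes one raw WebSocket frame by its "type" field and returns the list
// of messages the application would send back to ConversationRelay.
function handleRelayMessage(raw, session) {
  const msg = JSON.parse(raw);
  switch (msg.type) {
    case 'setup':
      // Remember call details and any <Parameter> values for later turns.
      session.callSid = msg.callSid;
      session.customParameters = msg.customParameters || {};
      return [];
    case 'prompt':
      // Placeholder reply; a real application would query its LLM here.
      return [{ type: 'text', token: `You said: ${msg.voicePrompt}`, last: true }];
    case 'dtmf':
      return [{ type: 'text', token: `You pressed ${msg.digit}.`, last: true }];
    case 'interrupt':
      // Keep what the caller actually heard before interrupting.
      session.lastHeard = msg.utteranceUntilInterrupt;
      return [];
    case 'error':
      session.errors = (session.errors || []).concat(msg.description);
      return [];
    default:
      return []; // Ignore message types this sketch does not know about.
  }
}
```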

Messages from your application to ConversationRelay

Send a text message containing text tokens, and ConversationRelay converts them into speech.

{
  "type": "text",
  "token": "Hello world!",
  "last": false
}

  • token attribute (Required): The text that ConversationRelay converts to speech.
  • last attribute (Optional, default is false): Indicates whether this is the last token in the current message.

Best practices

  • Use streaming text tokens for smoother TTS playback.
  • Set "last": true when you have sent the final token of a message.

Send a play message to play media to the caller.

{
  "type": "play",
  "source": "https://api.twilio.com/cowbell.mp3",
  "loop": 1,
  "preemptible": false
}

  • source attribute (Required): The URL of the media to play.
  • loop attribute (Optional, default is 1): Number of times to play the media. A value of 0 plays it 1,000 times (maximum).
  • preemptible attribute (Optional, default is false): If set to true, subsequent text or play messages will stop this media playback.

Send a sendDigits message to send DTMF digits to the caller. ConversationRelay sends the digits in the same way as the digits attribute of Twilio's <Play> verb.

{
  "type": "sendDigits",
  "digits": "9www4085551212"
}

Send a language message to change the transcription and TTS language during the session.

{
  "type": "language",
  "ttsLanguage": "sv-SE",
  "transcriptionLanguage": "en-US"
}

Info

This affects future TTS and STT sessions.

Send an end message to end the ConversationRelay session and return control of the call to Twilio.

{
  "type": "end",
  "handoffData": "{\"reasonCode\":\"live-agent-handoff\", \"reason\": \"The caller wants to talk to a real person\"}"
}

  • handoffData attribute (Optional): A string containing data to pass back in the action callback.
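Because handoffData is a string, structured data has to be serialized before it goes into the message. A minimal sketch, using a hypothetical helper named buildEndMessage:

```javascript
// Builds an "end" message. Objects are serialized with JSON.stringify so
// that handoffData is always a string; strings are passed through as-is.
function buildEndMessage(handoff) {
  const message = { type: 'end' };
  if (handoff !== undefined) {
    message.handoffData = typeof handoff === 'string' ? handoff : JSON.stringify(handoff);
  }
  return message;
}

console.log(JSON.stringify(buildEndMessage({ reasonCode: 'live-agent-handoff' })));
```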

Result of TwiML execution

<Connect> action URL callback

When an action URL is specified in the <Connect> verb, ConversationRelay will make a request to that URL when the <Connect> verb ends. The request includes call information and session details.

Example Payloads

Session ended by application example

{
  "AccountSid": "AC00000000000000000000000000000000",
  "CallSid": "CA00000000000000000000000000000000",
  "CallStatus": "in-progress",
  "From": "client:caller",
  "To": "test:conversationrelay",
  "Direction": "inbound",
  "ApplicationSid": "AP00000000000000000000000000000000",
  "SessionId": "VX00000000000000000000000000000000",
  "SessionStatus": "ended",
  "SessionDuration": "25",
  "HandoffData": "{\"reason\": \"The caller requested to talk to a real person\"}"
}

Error occurred during session example

{
  "AccountSid": "AC00000000000000000000000000000000",
  "CallSid": "CA00000000000000000000000000000000",
  "CallStatus": "in-progress",
  "From": "client:caller",
  "To": "test:conversationrelay",
  "Direction": "inbound",
  "ApplicationSid": "AP00000000000000000000000000000000",
  "SessionId": "VX00000000000000000000000000000000",
  "SessionStatus": "failed",
  "SessionDuration": "10",
  "ErrorCode": "39001",
  "ErrorMessage": "Network connection to WebSocket server failed."
}

Session completed normally (caller hung up) example

{
  "AccountSid": "AC00000000000000000000000000000000",
  "CallSid": "CA00000000000000000000000000000000",
  "CallStatus": "completed",
  "From": "client:caller",
  "To": "test:conversationrelay",
  "Direction": "inbound",
  "ApplicationSid": "AP00000000000000000000000000000000",
  "SessionId": "VX00000000000000000000000000000000",
  "SessionStatus": "completed",
  "SessionDuration": "35"
}
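An action handler might branch on SessionStatus to decide what to do next with the call. The sketch below covers only the three statuses shown in the example payloads. The function name and the "next" labels are hypothetical, and the payload is assumed to have already been parsed into an object by your web framework.

```javascript
// Summarizes an action callback payload into a next step for the call.
function summarizeConnectCallback(payload) {
  let handoff = null;
  if (payload.HandoffData) {
    // HandoffData is a string; try JSON first and fall back to the raw text.
    try { handoff = JSON.parse(payload.HandoffData); } catch { handoff = payload.HandoffData; }
  }
  switch (payload.SessionStatus) {
    case 'ended': // The application sent an end message.
      return { next: 'handoff', handoff };
    case 'failed': // An error occurred during the session.
      return { next: 'fallback', error: `${payload.ErrorCode}: ${payload.ErrorMessage}` };
    case 'completed': // The caller hung up; nothing left to do on the call.
      return { next: 'none' };
    default:
      return { next: 'unknown' };
  }
}
```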

  • Streaming Text Tokens: For smoother TTS playback, stream text tokens incrementally and set "last": true when the message is complete.
  • Error Handling: Monitor for error messages sent over the SPI to handle any issues promptly.
  • Language Switching: Use the language message to switch languages dynamically during a session.
  • Session Management: Use the end message to gracefully end sessions when your application logic determines it's appropriate.
  • Don't "wait" to send text tokens back from the Large Language Model (LLM); send them as you receive them.
  • To achieve the lowest latency, use the streaming (partial completion) mode of your LLM API so that each word goes out as a separate text token, and set "last": true on the final token. This enables ConversationRelay to identify the first sayable string.
  • Don't trim LLM tokens; the tokens need to have spaces between them.

Handling punctuation and last: true in ConversationRelay TTS

  1. Setting the last flag properly
    • When sending text tokens that include punctuation, ensure the final token in the message includes "last": true.
    • Without a final token marked with "last": true, ConversationRelay assumes additional tokens are forthcoming and may stop reading at the first punctuation mark (for example, period, comma, or question mark).
  2. Using partial completions for streaming
    • In streaming mode, send each text token incrementally with "last": false.

    • When the LLM indicates that the response is complete (for example, when response.finish_reason() equals "stop"), mark that final token with "last": true.

      Example:

      { "type": "text", "token": "Hello", "last": false }
      { "type": "text", "token": " world", "last": false }
      { "type": "text", "token": "!", "last": true }
  3. Handling complete responses (non-streaming)
    • In non-streaming mode, when the entire response is generated as a single complete sentence, mark the token with "last": true.

      Example:

      { "type": "text", "token": "Hello world!", "last": true }
  4. Punctuation handling in longer messages
    • For messages with complex punctuation, consider breaking the response into smaller chunks.
    • Each chunk should have the appropriate punctuation, with only the final token of the overall message marked with "last": true.

By following these guidelines, you can ensure that ConversationRelay processes and speaks your full message smoothly without unexpected pauses or truncation.
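When tokens arrive from a streaming LLM, the application usually learns that a token was the final one only after the stream ends. One possible way to still mark it is to hold back a single token, as in the sketch below. The createTokenRelay helper and its send callback are hypothetical names. Tokens are forwarded untrimmed, so their spacing is preserved.

```javascript
// Forwards streamed tokens as text messages, holding back one token so the
// final token can be sent with "last": true when the stream finishes.
function createTokenRelay(send) {
  let pending = null;
  return {
    // Call for every token the LLM emits.
    push(token) {
      if (pending !== null) send({ type: 'text', token: pending, last: false });
      pending = token;
    },
    // Call once when the LLM signals that the response is complete.
    finish() {
      if (pending !== null) {
        send({ type: 'text', token: pending, last: true });
        pending = null;
      }
    },
  };
}

// Usage: collect the messages that would be written to the WebSocket.
const outbox = [];
const demoRelay = createTokenRelay((message) => outbox.push(JSON.stringify(message)));
['Hello', ' world', '!'].forEach((token) => demoRelay.push(token));
demoRelay.finish();
console.log(outbox.join('\n'));
```

The cost of this design is a delay of one token, which is typically small compared with waiting for the complete response.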

Prompt Engineering for voice responses in ConversationRelay

When setting up system prompts for Large Language Models (LLMs) in ConversationRelay, consider these best practices so that the output works well with Text-to-Speech (TTS):

  • Explicit Number Formatting: Encourage the LLM to spell out numbers (for example, "two" instead of "2") to avoid misinterpretation by TTS.
  • Avoid Special Characters: Avoid bullet points, asterisks, or special symbols, as these can cause pauses or mispronunciations in voice output.
  • Focus on Conversational Tone: Design prompts to produce conversational, naturally flowing responses, which translate more effectively to TTS.

These prompt adjustments help improve the LLM-generated tokens' compatibility with voice output in ConversationRelay, enhancing clarity and consistency for users.

WebSocket message compliance for ConversationRelay

All WebSocket messages from ConversationRelay to your API follow the strict formats defined in these docs. Your application must also adhere to these specifications when sending messages back to ConversationRelay.

  • Use One-Way Communication: Managed API services like AWS API Gateway often support two-way communication, but this approach doesn't work with ConversationRelay. Two-way communication may cause your application to send back non-conforming messages, causing us to terminate the session.
  • Explicit Message Handling: Use one-way communication only, ensuring precise control over what and when you send messages to ConversationRelay.

Following these practices helps maintain session stability and ensures compatibility with ConversationRelay's message handling.

Text normalization best practices

When working with Text-to-Speech (TTS) in ConversationRelay, proper text normalization is crucial for delivering clear and natural spoken responses. This is especially important when using ElevenLabs voices, which may have difficulty with certain formats. Consider the following guidelines:

  • Numbers and Units: Write numbers as words for improved pronunciation. For example, write "twenty dollars and fifty cents" instead of "$20.50".
  • Dates: Spell out dates completely (for example, "March twenty-eighth, two thousand twenty-five" instead of "03/28/2025").
  • Email Addresses: Replace or spell out special characters. For instance, write "user at example dot com" rather than "user@example.com" since the "@" sign may be mispronounced.
  • Names: Use consistent formatting for names throughout the call to avoid variations (for example, always use "Anna" rather than alternating between "A-nna" and "Ah-nna").
  • Abbreviations: Spell out abbreviations that should be pronounced fully (for example, "Doctor" instead of "Dr.").
  • Special Characters: Replace special characters with their spoken equivalents (for example, "percent" instead of "%").
  • Punctuation: Use appropriate punctuation to create natural pauses and intonation.
  • Acronyms and Initialisms: Insert spaces between letters if they must be spelled out (for example, "H T T P" for letter-by-letter pronunciation).

For detailed text normalization guidelines, refer to ElevenLabs' text normalization best practices.
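Some of the guidelines above can be applied in code before text is sent to TTS. The following is a minimal sketch under stated assumptions: `normalize_for_tts` and its helpers are hypothetical names, the replacement tables are deliberately tiny, and number and date spelling is left out because it usually needs a dedicated number-to-words routine.

```python
import re

# Illustrative tables only; a real deployment needs a fuller set.
ABBREVIATIONS = {"Dr.": "Doctor", "Mr.": "Mister"}
SYMBOLS = {"%": " percent", "&": " and "}


def spell_email(match: re.Match) -> str:
    """Turn user@example.com into 'user at example dot com'."""
    local = match.group(1).replace(".", " dot ")
    domain = match.group(2).replace(".", " dot ")
    return f"{local} at {domain}"


def normalize_for_tts(text: str, spelled_acronyms: frozenset = frozenset()) -> str:
    """Apply email, abbreviation, symbol, and acronym rules, then tidy whitespace."""
    # Handle email addresses first, before "." and "@" are touched by other rules.
    text = re.sub(r"\b([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)", spell_email, text)
    for abbreviation, spoken in ABBREVIATIONS.items():
        text = text.replace(abbreviation, spoken)
    for symbol, spoken in SYMBOLS.items():
        text = text.replace(symbol, spoken)
    # Insert spaces between the letters of acronyms that must be spelled out.
    for acronym in spelled_acronyms:
        text = re.sub(rf"\b{re.escape(acronym)}\b", " ".join(acronym), text)
    return re.sub(r"\s+", " ", text).strip()
```

For example, `normalize_for_tts("Email Dr. Lee at user@example.com about the 15% fee.")` returns "Email Doctor Lee at user at example dot com about the 15 percent fee."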

TTS voice quality and latency trade-offs

Text-to-Speech (TTS) voice quality varies significantly by provider and voice type. While generative voices often offer higher fidelity and more natural-sounding responses, they can introduce additional latency and process TTS at a slower rate.

  • Quality versus Latency: Generative voices may produce more realistic responses but could add latency, affecting real-time applications.
  • Test Before Production: We recommend testing various providers and voices to determine the best balance of quality and responsiveness for your specific use case.

Selecting the right TTS voice involves balancing quality and performance, so thorough testing is essential before production deployment.

STT provider and model variability

Speech-to-Text (STT) quality and latency can vary depending on the provider and the environment. Google and Deepgram each offer unique strengths for different scenarios, such as clean versus noisy audio environments.

  • Environment Sensitivity: Some models may perform better in noisy environments, while others excel with clean audio. Test in environments that reflect your actual use case.
  • Customize for Optimal Recognition: We encourage testing various combinations of STT providers and speech models to find the best quality and responsiveness for your application.

Optimizing STT performance requires careful selection based on environment and model capabilities, so thorough testing is essential for achieving the best results.

WebSocket reconnection logic for ConversationRelay

In the event of a WebSocket connection error in ConversationRelay, implement reconnection logic by initiating a new <Connect><ConversationRelay> request:

  • Re-establish the Connection in ConversationRelay: If you lose the WebSocket connection, handle the disconnect in your <Connect> element's action URL callback by returning new TwiML containing <Connect><ConversationRelay> to restore the session.
  • Validate Call Consistency in ConversationRelay: Ensure the callSid remains the same to confirm continuity of the original call session.

This approach helps maintain session stability and consistency following any connection disruptions.
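The reconnection steps above could be sketched as a framework-agnostic handler for the action URL callback. Treat the details as assumptions: `CallSid` is the standard Twilio request parameter, but the `SessionStatus` parameter and its `failed` value are assumed here, and `handle_action_callback`, `expected_call_sid`, and the URLs are hypothetical names for illustration.

```python
import xml.etree.ElementTree as ET

XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>'


def build_relay_twiml(ws_url: str, action_url: str) -> str:
    """Return TwiML that connects the call to ConversationRelay again."""
    response = ET.Element("Response")
    # Keep the action URL so a later disconnect reaches this handler too.
    connect = ET.SubElement(response, "Connect", {"action": action_url})
    ET.SubElement(connect, "ConversationRelay", {"url": ws_url})
    return XML_DECLARATION + ET.tostring(response, encoding="unicode")


def build_hangup_twiml() -> str:
    """Return TwiML that ends the call."""
    response = ET.Element("Response")
    ET.SubElement(response, "Hangup")
    return XML_DECLARATION + ET.tostring(response, encoding="unicode")


def handle_action_callback(
    params: dict, expected_call_sid: str, ws_url: str, action_url: str
) -> str:
    """Decide which TwiML to return when the <Connect> action URL is requested."""
    # Validate call consistency: only continue the call we were tracking.
    if params.get("CallSid") != expected_call_sid:
        return build_hangup_twiml()
    # Assumption: a "failed" session status signals a lost WebSocket connection.
    if params.get("SessionStatus") == "failed":
        return build_relay_twiml(ws_url, action_url)
    return build_hangup_twiml()
```

In production you would likely also cap the number of reconnection attempts per call so that a persistently failing WebSocket server doesn't cause an endless loop.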

Language and voice selection behavior with SPI messages

When you switch languages during a session via an SPI message, ConversationRelay uses the pre-set voice associated with the new language that you configured in the initial TwiML setup. If you didn't configure a specific voice for that language, ConversationRelay will use its default voice for the selected language.

  • Initial Language Configuration: To maintain a consistent voice experience across languages, define the desired voice for each supported language explicitly in the TwiML configuration. This setup ensures that when you change the language via an SPI message during the session, the specified voice is used. This prevents the system from defaulting to its own voice.
  • Voice on Language Change: If you issue a language change and haven't set a specific voice for that language in TwiML, the system will use its default voice for that language. For example, switching from English (en-GB) with a male voice to Spanish (es-ES) may result in a female voice if the default Spanish voice is female.
  • No Voice Updates Mid-Session: Once the session starts, you can't modify voice and language configurations set in TwiML through SPI messages.

This setup ensures consistent voice behavior for each language by configuring it in TwiML before the call begins.
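A sketch of this configuration follows. It assumes the nested `<Language>` element with `code`, `ttsProvider`, and `voice` attributes and a language-switch message with `ttsLanguage` and `transcriptionLanguage` fields, as described elsewhere in these docs; the provider and voice values are placeholders, not real voice names, and the helper functions are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Placeholder values: substitute a real provider and voice for each language.
VOICES = {
    "en-GB": {"ttsProvider": "PROVIDER_FOR_EN_GB", "voice": "VOICE_FOR_EN_GB"},
    "es-ES": {"ttsProvider": "PROVIDER_FOR_ES_ES", "voice": "VOICE_FOR_ES_ES"},
}


def build_twiml(ws_url: str, default_language: str, voices: dict) -> str:
    """Declare a voice for every supported language before the session starts."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    relay = ET.SubElement(
        connect, "ConversationRelay", {"url": ws_url, "language": default_language}
    )
    for code, settings in voices.items():
        ET.SubElement(relay, "Language", {"code": code, **settings})
    return ET.tostring(response, encoding="unicode")


def language_switch_message(code: str, voices: dict) -> str:
    """Build the mid-session language switch, refusing unconfigured languages."""
    if code not in voices:
        # Without a configured voice, the default voice for the language is used.
        raise ValueError(f"No voice configured in TwiML for {code}")
    return json.dumps(
        {"type": "language", "ttsLanguage": code, "transcriptionLanguage": code}
    )
```

Raising an error for an unconfigured language is a design choice in this sketch: it surfaces a missing voice during development instead of letting the caller hear an unexpected default voice.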

Choosing between streaming and non-streaming mode for LLM responses

ConversationRelay supports both streaming and non-streaming modes for sending LLM responses. Each mode has unique trade-offs in latency and response fluidity:

  • Non-Streaming Mode: In non-streaming mode, you send the full LLM response to ConversationRelay at once. This approach adds some latency to the initial response but can provide a smoother and more consistent TTS experience, with less variability in the pacing and fluidity of the speech output.
    • Latency Trade-Off: Initial response time (time to first token) is longer in non-streaming mode, which may affect perceived responsiveness for the first reply.
    • When to Use: Non-streaming mode can be effective if the session prioritizes TTS consistency over immediate response speed.
  • Streaming Mode: Streaming mode sends text tokens incrementally as they're generated by the LLM, allowing ConversationRelay to start speaking sooner. This provides quicker initial responses but may introduce slight variability in TTS pacing and fluidity.
    • Responsiveness: Streaming mode minimizes latency in initial responses, making it ideal for real-time interactions where responsiveness is critical.
    • Goal of Streaming Mode: While streaming mode may not be perfect in all cases, we recommend it for most applications, and improvements are ongoing to enhance fluidity in TTS output.
  • Recommendation: Both modes are viable, and you can experiment with streaming or non-streaming to find the best balance of response time and TTS quality for your use case. For non-streaming, send text in bulk once the LLM response is complete. For streaming, send text tokens in smaller chunks as they become available.
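The difference between the two modes can be sketched with a single generator. This is an illustration under assumptions: the `type`, `token`, and `last` fields follow the text message format described elsewhere in these docs, and `text_messages` is a hypothetical helper that takes any iterable of LLM output tokens.

```python
import json
from typing import Iterable, Iterator


def text_messages(tokens: Iterable[str], streaming: bool) -> Iterator[str]:
    """Yield JSON text messages for either streaming or non-streaming mode."""
    if not streaming:
        # Non-streaming: wait for the complete response, then send it in bulk.
        full_text = "".join(tokens)
        if full_text:
            yield json.dumps({"type": "text", "token": full_text, "last": True})
        return
    # Streaming: forward tokens as they arrive, holding one back so the
    # final token can be marked with "last": true.
    previous = None
    for token in tokens:
        if previous is not None:
            yield json.dumps({"type": "text", "token": previous, "last": False})
        previous = token
    if previous is not None:
        yield json.dumps({"type": "text", "token": previous, "last": True})
```

Holding back one token costs very little latency and avoids having to know in advance which token is the final one.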

For errors, such as messages that ConversationRelay doesn't understand, we will respond with an error message.

If your application sends 10 consecutive messages that ConversationRelay can't identify, we terminate the WebSocket connection with status code 1007 and the reason "Too many consecutive malformed messages." In that case, we also report error 64105, "WebSocket Ended."


If the WebSocket disconnects unexpectedly in ConversationRelay, we don't reconnect, and the call disconnects with a failed status.


ConversationRelay, including the <ConversationRelay> TwiML noun and API, uses artificial intelligence or machine learning technologies.

Our AI Nutrition Facts for ConversationRelay provide an overview of the AI feature you're using, so you can better understand how the AI is working with your data. The AI Nutrition Labels below detail the ConversationRelay AI qualities. For more information and the glossary regarding the AI Nutrition Facts Label, refer to our AI Nutrition Facts page.

Deepgram AI nutrition facts

AI Nutrition Facts

ConversationRelay (STT and TTS) - Programmable Voice - Deepgram

Description
Generate speech to text in real-time through a WebSocket API in Programmable Voice.
Privacy Ladder Level
N/A
Feature is Optional
Yes
Model Type
Automatic Speech Recognition
Base Model
Deepgram Nova2

Trust Ingredients

Base Model Trained with Customer Data
No

ConversationRelay uses the Default Base Model provided by the Model Vendor. The Base Model is not trained using Customer Data.

Customer Data is Shared with Model Vendor
No

ConversationRelay uses the Default Base Model provided by the Model Vendor. The Base Model is not trained using Customer Data.

Training Data Anonymized
N/A

Base Model is not trained using any Customer Data.

Data Deletion
N/A

Customer Data is not stored or retained in the Base Model.

Human in the Loop
Yes

Customer can view and listen to the input and output in the customer's own terminal.

Data Retention
N/A

Compliance

Logging & Auditing
Yes

Customer can view and listen to the input and output in the customer's own terminal.

Guardrails
Yes

Customer can view and listen to the input and output in the customer's own terminal.

Input/Output Consistency
Yes

Customer is responsible for human review.

Other Resources
Learn more about this label at nutrition-facts.ai

Google AI nutrition facts

AI Nutrition Facts

ConversationRelay (STT and TTS) - Programmable Voice - Google AI

Description
Generate speech to text in real-time and convert text into natural-sounding speech through a WebSocket API in Programmable Voice.
Privacy Ladder Level
N/A
Feature is Optional
Yes
Model Type
Generative and Predictive - Automatic Speech Recognition and Text-to-Speech
Base Model
Google Speech-to-Text; Google Text-to-Speech

Trust Ingredients

Base Model Trained with Customer Data
No

ConversationRelay uses the Default Base Model provided by the Model Vendor. The Base Model is not trained using Customer Data.

Customer Data is Shared with Model Vendor
No

ConversationRelay uses the Default Base Model provided by the Model Vendor. The Base Model is not trained using Customer Data.

Training Data Anonymized
N/A

Base Model is not trained using any Customer Data.

Data Deletion
N/A

Customer Data is not stored or retained in the Base Model.

Human in the Loop
Yes

Customer can view and listen to the input and output in the customer's own terminal.

Data Retention
N/A

Compliance

Logging & Auditing
Yes

Customer can view and listen to the input and output in the customer's own terminal.

Guardrails
Yes

Customer can view and listen to the input and output in the customer's own terminal.

Input/Output Consistency
Yes

Customer is responsible for human review.

Other Resources
Learn more about this label at nutrition-facts.ai

Amazon AI nutrition facts

AI Nutrition Facts

ConversationRelay (STT and TTS) - Programmable Voice - Amazon AI

Description
Convert text into natural-sounding speech through a WebSocket API in Programmable Voice.
Privacy Ladder Level
N/A
Feature is Optional
Yes
Model Type
Generative and Predictive
Base Model
Amazon Polly Text-to-Speech

Trust Ingredients

Base Model Trained with Customer Data
No

ConversationRelay uses the Default Base Model provided by the Model Vendor. The Base Model is not trained using Customer Data.

Customer Data is Shared with Model Vendor
No

ConversationRelay uses the Default Base Model provided by the Model Vendor. The Base Model is not trained using Customer Data.

Training Data Anonymized
N/A

Base Model is not trained using any Customer Data.

Data Deletion
N/A

Customer Data is not stored or retained in the Base Model.

Human in the Loop
Yes

Customer can view and listen to the input and output in the customer's own terminal.

Data Retention
N/A

Compliance

Logging & Auditing
Yes

Customer can view and listen to the input and output in the customer's own terminal.

Guardrails
Yes

Customer can view and listen to the input and output in the customer's own terminal.

Input/Output Consistency
Yes

Customer is responsible for human review.

Other Resources
Learn more about this label at nutrition-facts.ai

ElevenLabs nutrition facts

AI Nutrition Facts

ConversationRelay (STT and TTS) - Programmable Voice - ElevenLabs

Description
Convert text into a human-sounding voice using speech synthesis technology from ElevenLabs.
Privacy Ladder Level
N/A
Feature is Optional
Yes
Model Type
Predictive
Base Model
ElevenLabs Text-To-Speech: Flash 2 and Flash 2.5

Trust Ingredients

Base Model Trained with Customer Data
No

The Base Model is not trained using any Customer Data.

Customer Data is Shared with Model Vendor
No

Programmable Voice uses the default Base Model provided by the Model Vendor. The Base Model is not trained using customer data.

Training Data Anonymized
N/A

Base Model is not trained using any Customer Data.

Data Deletion
N/A

The Base Model is not trained using any Customer Data.

Human in the Loop
Yes

Customers can view text input and listen to the audio output.

Data Retention
Customer can review TwiML logs, including <Say> Logs, to debug and troubleshoot for up to 30 days.

Compliance

Logging & Auditing
Yes

Customers can view text input and listen to the audio output.

Guardrails
Yes

Customers can view text input and listen to the audio output.

Input/Output Consistency
Yes

Customer is responsible for human review.

Other Resources
Learn more about this label at nutrition-facts.ai