Real-time Transcriptions for Video
Legal notice
Real-time Transcriptions uses artificial intelligence or machine learning technologies. If you enable or use any of the features or functionalities within Twilio Video that Twilio identifies as using artificial intelligence or machine learning technology, you acknowledge and agree that your use of these features or functionalities is subject to the terms of the Predictive and Generative AI/ML Features Addendum.
Real-time Transcriptions converts speech from any participant in a Video Room into text and sends that text to the Video Client SDKs (JavaScript, iOS, and Android). Your application can render the text in any style and format. Twilio supports multiple speech models, so you can choose the one that best fits your use case.
Your app can implement transcriptions in two ways:
- Start when your app creates a Video Room.
- Start, stop, or restart on demand while the Room is active.
You turn on Real-time Transcriptions at the Video Room level, so Twilio transcribes every participant's speech. You configure the spoken language and speech model, and those settings remain in effect until the Room ends.
While active, Twilio delivers the transcribed text, along with the Participant SID, to every participant in the Room.
If you turn on partial results, the transcription engine delivers interim results so that your app can refresh its interface in near real-time.
You can also set a default configuration in the Twilio Console.
The Transcriptions resource is a subresource of the Room resource and represents a Room's transcript.
/v1/Rooms/{RoomNameOrSid}/Transcriptions/{TranscriptionTtid}
| Property | Type | Description |
|---|---|---|
| `ttid` | TTID | The Twilio Type ID of the Transcriptions resource. It is assigned at creation time using the following format: `video_transcriptions_{uuidv7-special-encoded}`. |
| `room_sid` | SID&lt;RM&gt; | The SID of the Room instance that is the parent of the Transcriptions resource. |
| `account_sid` | SID&lt;AC&gt; | The SID of the Account that owns the Transcriptions resource. |
| `status` | enum&lt;string&gt; | The status of the Transcriptions resource. It can be: `started`, `stopped`, or `failed`. The resource is created with status `started` by default. |
| `configuration` | object&lt;string-map&gt; | Key-value map with the configuration parameters applied to audio track transcriptions. For the list of properties, see the Parameters for the configuration object section. |
| `date_created` | string&lt;date-time&gt; | The date and time in UTC when the resource was created, specified in ISO 8601 format. |
| `date_updated` | string&lt;date-time&gt; | The date and time in UTC when the resource was last updated, specified in ISO 8601 format. Null if the resource hasn't been updated since creation. |
| `start_time` | string&lt;date-time&gt; | The date and time in UTC when the resource entered the `started` state with the first participant in the Room, specified in ISO 8601 format. |
| `end_time` | string&lt;date-time&gt; | The date and time in UTC when transcription processing paused in the Room, specified in ISO 8601 format. This happens when the resource is set to `stopped` or the last participant leaves the Room. |
| `duration` | integer | The cumulative time in seconds that the Transcriptions resource has been in the `started` state with at least one participant in the Room. This is independent of whether audio tracks are published or muted. |
| `url` | string&lt;uri&gt; | The absolute URL of the resource. |
If Twilio detects an internal error that prevents it from generating transcriptions, the Transcriptions resource changes its status to `failed`, and the Twilio Console receives a debug event with the details of the failure. After a failure, you can't restart that resource.
Parameters for the configuration object
| Name | Type | Necessity | Default | Description |
|---|---|---|---|---|
| `transcriptionEngine` | string | Optional | `"google"` | The transcription engine that Twilio uses. For the possible values, see the transcription engine table. |
| `speechModel` | string | Optional | `"telephony"` | The provider-supported recognition model that the transcription engine uses. For the possible values, see the speech model table. |
| `languageCode` | string | Optional | `"en-US"` | The language code that the transcription engine uses, specified in BCP-47 format. This attribute ensures that the transcription engine understands and processes the spoken language. |
| `partialResults` | Boolean | Optional | `false` | Whether to send partial results. When `true`, the transcription engine sends interim results as the transcription progresses, providing more immediate feedback before the final result is available. |
| `profanityFilter` | Boolean | Optional | `true` | Whether the server tries to filter profanities, replacing all but the initial character in each filtered word with asterisks. Google provides this feature. |
| `hints` | string | Optional | None | A list of words or phrases that the transcription provider can expect to encounter during a transcription. Using the `hints` attribute can improve the provider's recognition of words or phrases expected during the video call. You can provide up to 500 words or phrases, each entry separated by a comma and each up to 100 characters. Separate the words in a phrase with spaces. |
| `enableAutomaticPunctuation` | Boolean | Optional | `true` | Whether the provider adds punctuation to the transcribed text. When enabled, the transcription engine inserts punctuation marks such as periods, commas, and question marks, improving the readability of the transcribed text. |
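For illustration, here's a sketch of a configuration object for long-form English conversations with partial results and hints. The keys are the documented parameters above; the values are example choices, not recommendations, and the REST API expects the object serialized as a JSON string:

```javascript
// Illustrative transcription configuration (example values, not defaults).
const transcriptionsConfiguration = {
  transcriptionEngine: 'google',
  speechModel: 'long',              // long-form conversational audio
  languageCode: 'en-US',
  partialResults: true,             // stream interim results for live captions
  profanityFilter: true,
  enableAutomaticPunctuation: true,
  // Comma-separated list: up to 500 entries, each up to 100 characters.
  hints: 'Twilio, Video Room, WebRTC'
};

// Pass the object as a JSON string, for example in the
// TranscriptionsConfiguration or Configuration REST parameter.
console.log(JSON.stringify(transcriptionsConfiguration));
```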
The following table lists the possible values for the transcriptionEngine and the associated speechModel properties.
| Transcription engine | Speech model | Use case | Example |
|---|---|---|---|
| `google` | `telephony` | Use this model for telephone call audio. | |
| `google` | `medical_conversation` | Use this model for conversations between a medical provider and a patient. | |
| `google` | `long` | Use this model for any type of long-form content. | Media, spontaneous speech, and conversations |
| `google` | `short` | Use this model for short utterances that are a few seconds in length. Consider using this model instead of the command and search model. | Commands or other single, short, directed speech |
| `google` | `telephony_short` | Use this model for short or even single-word utterances in audio that originated from a phone call. | Customer service, teleconferencing, and kiosk applications |
| `google` | `medical_dictation` | Use this model to transcribe notes dictated by a medical professional. | |
| `google` | `chirp_telephony` | Use this model for telephone call audio in multiple languages. It uses the Google Universal Speech Model (USM). | |
| `google` | `chirp` | Use this model for audio content in multiple languages. It uses the Google Universal Speech Model (USM). | |
| `deepgram` | `nova-3` | Recommended for meetings, captioning, and noisy or far-field audio. | |
| `deepgram` | `nova-2` | Recommended for languages that `nova-3` doesn't support. | |
Info
- The `google` transcription engine corresponds to the Google Speech-to-Text V2 API.
- Speech models support a limited range of languages. For valid combinations, see the provider documentation.
To create a Video Room with Real-time Transcriptions enabled, add the following two parameters to the Room POST request:
| Parameter | Type | Description |
|---|---|---|
| `TranscribeParticipantsOnConnect` | Boolean | Whether to start Real-time Transcriptions when Participants connect. Default is `false`. |
| `TranscriptionsConfiguration` | object | Key-value configuration settings for the transcription engine. To learn more, see Parameters for the configuration object. |
To turn on transcriptions for the Video Room, set the `TranscribeParticipantsOnConnect` parameter to `true`.
```bash
curl -X POST "https://video.twilio.com/v1/Rooms" \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN \
  --data-urlencode 'TranscriptionsConfiguration={"languageCode": "en-US", "partialResults": true}' \
  --data-urlencode "TranscribeParticipantsOnConnect=true" \
  --data-urlencode "UniqueName=test"
```
1{2"account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",3"audio_only": false,4"date_created": "2025-06-12T15:42:32Z",5"date_updated": "2025-06-12T15:42:32Z",6"duration": null,7"empty_room_timeout": 5,8"enable_turn": true,9"end_time": null,10"large_room": false,11"links": {12"participants": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Participants",13"recording_rules": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/RecordingRules",14"recordings": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Recordings",15"transcriptions": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions"16},17"max_concurrent_published_tracks": 170,18"max_participant_duration": 14400,19"max_participants": 50,20"media_region": "us1",21"record_participants_on_connect": false,22"sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",23"status": "in-progress",24"status_callback": null,25"status_callback_method": "POST",26"type": "group",27"unique_name": "test",28"unused_room_timeout": 5,29"url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",30"video_codecs": [31"VP8",32"H264"33]34}
To create a Transcription, send a POST request to the following URI:
/v1/Rooms/{RoomNameOrSid}/Transcriptions
| Parameter | Type | Description |
|---|---|---|
| `RoomSid` | SID&lt;RM&gt; | The SID of the parent Room in which you create the Transcriptions resource. |
| Parameter | Type | Description |
|---|---|---|
| `Configuration` | object&lt;string-map&gt; | Object with key-value configurations. |
```bash
curl -X POST "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions" \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN \
  --data-urlencode 'Configuration={"languageCode": "en-US", "partialResults": true, "profanityFilter": true, "speechModel": "long"}'
```
1{2"account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",3"configuration": {4"languageCode": "EN-us",5"partialResults": "true",6"profanityFilter": "true",7"speechModel": "long"8},9"date_created": "2025-07-22T14:14:35Z",10"date_updated": null,11"duration": null,12"end_time": null,13"room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",14"start_time": null,15"status": "started",16"ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",17"url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"18}
To stop a Transcriptions resource, send a POST request to the following resource instance URI:
/v1/Rooms/{RoomSid}/Transcriptions/{ttid}
To stop transcriptions on a Room, set the Status parameter to stopped.
| Parameter | Type | Description |
|---|---|---|
| `ttid` | TTID | The TTID of the Transcriptions resource to update. |
| `RoomSid` | SID&lt;RM&gt; | The SID of the parent Room of the Transcriptions resource you update. |
| Parameter | Type | Description |
|---|---|---|
| `Status` | enum&lt;string&gt; | The new status of the Transcriptions resource. Can be: `started` or `stopped`. There is no state transition if the resource's status already has the same value or if the parameter is missing. |
```bash
curl -X POST "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN \
  --data-urlencode "Status=stopped"
```
1{2"account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",3"configuration": {4"languageCode": "EN-us",5"partialResults": "true"6},7"date_created": "2025-07-22T12:55:30Z",8"date_updated": "2025-07-22T12:56:02Z",9"duration": null,10"end_time": null,11"room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",12"start_time": null,13"status": "stopped",14"ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",15"url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"16}
To restart a stopped Transcriptions resource in a Room, send a POST request to the following resource instance URI:
/v1/Rooms/{RoomSid}/Transcriptions/{ttid}
To restart transcription, set the Status to started.
| Parameter | Type | Description |
|---|---|---|
| `ttid` | TTID | The TTID of the Transcriptions resource to update. The current implementation supports a single Transcriptions resource, but this might change in future implementations. |
| `RoomSid` | SID&lt;RM&gt; | The SID of the parent Room of the Transcriptions resource you update. |
| Parameter | Type | Description |
|---|---|---|
| `Status` | enum&lt;string&gt; | The status of the Transcriptions resource. To restart transcriptions, set to `started`. If the parameter has the same value as the current status or is missing, no state transition occurs. |
```bash
curl -X POST "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN \
  --data-urlencode "Status=started"
```
1{2"account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",3"configuration": {4"languageCode": "EN-us",5"partialResults": "true"6},7"date_created": "2025-07-22T12:57:24Z",8"date_updated": null,9"duration": null,10"end_time": null,11"room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",12"start_time": null,13"status": "started",14"ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",15"url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"16}
To fetch a list of Transcriptions resources in a Room, send a GET request to the following resource URI:
/v1/Rooms/{RoomSid}/Transcriptions
Real-time Transcriptions supports only a single instance of the Transcriptions resource per Room, so the list only has a single item.
| Parameter | Type | Description |
|---|---|---|
| `RoomSid` | SID&lt;RM&gt; | The SID of the parent Room that contains the Transcriptions resources. |
```bash
curl -X GET "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions" \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN
```
1{2"meta": {3"first_page_url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions?PageSize=50&Page=0",4"key": "transcriptions",5"next_page_url": null,6"page": 0,7"page_size": 50,8"previous_page_url": null,9"url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions?PageSize=50&Page=0"10},11"transcriptions": [12{13"account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",14"configuration": {},15"date_created": "2025-07-22T11:05:41Z",16"date_updated": null,17"duration": null,18"end_time": null,19"room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",20"start_time": null,21"status": "started",22"ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",23"url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"24}25]26}
To fetch a Transcriptions resource in a Room, send a GET request to the following resource instance URI:
/v1/Rooms/{RoomSid}/Transcriptions/{ttid}
| Parameter | Type | Description |
|---|---|---|
| `ttid` | TTID | The TTID of the Transcriptions resource being requested. |
| `RoomSid` | SID&lt;RM&gt; | The SID of the parent Room from which you fetch the Transcriptions resource. |
```bash
curl -X GET "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" \
  -u $TWILIO_ACCOUNT_SID:$TWILIO_AUTH_TOKEN
```
1{2"account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",3"configuration": {4"LanguageCode": "EN-us",5"ProfanityFilter": "true"6},7"date_created": null,8"date_updated": null,9"duration": null,10"end_time": null,11"links": {12"transcriptions": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"13},14"room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",15"start_time": null,16"status": "created",17"ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"18}
Twilio delivers transcribed text to the client SDKs through callback events.
The schema of the JSON delivery format contains a version number. Each event contains the transcription of a single utterance and details of the participant who generated the audio.
```yaml
properties:
  type:
    const: extension_transcriptions

  version:
    description: |
      Version of the transcriptions protocol used by this message. It is semver compliant.

  track:
    $ref: /Server/State/RemoteTrack
    description: |
      Audio track from which the transcription has been generated.

  participant:
    $ref: /Server/State/Participant
    description: |
      The participant who published the audio track from which the
      transcription has been generated.

  sequence_number:
    type: integer
    description: |
      Sequence number. Starts with one and increments monotonically. A sequence
      counter is defined for each track to allow the receiver to identify
      missing messages.

  timestamp:
    type: string
    description: |
      Absolute time from the real-time transcription. It is
      conformant with UTC ISO 8601.

  partial_results:
    type: boolean
    description: |
      Whether the transcription is a final or a partial result.

  stability:
    type: double
    description: |
      Indicates how likely it is that this partial result transcript won't be
      updated again. The range is from `0.0` (unstable) to `1.0` (stable).
      This field is only provided when `partialResults` is `true`.

  language_code:
    type: string
    description: |
      Language code of the transcribed text. It is conformant with BCP-47.

  transcription:
    type: string
    description: |
      Utterance transcription.
```
1{2"version": "1.0",3"language_code": "en-US",4"partial_results": false,5"participant": "PA00000000000000000000000000000000",6"sequence_number": 3,7"timestamp": "2025-01-01T12:00:00.000000000Z",8"track": "MT00000000000000000000000000000000",9"transcription": "This is a test",10"type": "extension_transcriptions"11}
When you set the `partialResults` parameter to `true`, the transcription engine provides a series of partial results as it determines the text corresponding to the spoken utterance.
The `stability` property indicates the probability that a partial result changes before the delivery of the final result. This value ranges from `0.0` (unstable) to `1.0` (stable). In general, treat partial results with stability less than `0.9` as preliminary and temporary. When building an app element that displays the transcribed text as captions or subtitles, filter out partial results with a stability value less than `0.9`. This avoids text flickering as the app receives partial results.
JavaScript
To turn on the flow of transcript events, set the `receiveTranscriptions` parameter in `ConnectOptions` to `true`. This parameter defaults to `false`. With Real-time Transcriptions enabled for the Room and `receiveTranscriptions` set to `true`, callback events containing the transcribed text start to flow.
```javascript
import { connect } from 'twilio-video';

const room = await connect(token, {
  name: 'my-room',
  receiveTranscriptions: true
});

room.on('transcription', (transcriptionEvent) => {
  console.log(`${transcriptionEvent.participant}: ${transcriptionEvent.transcription}`);
});
```
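Building on that example, the following sketch applies the stability filtering described in the partial results section above. The `partial_results`, `stability`, and `transcription` fields come from the delivery schema; `captionsEl` is a hypothetical DOM element, not part of the SDK:

```javascript
// Minimal sketch: show captions only for final or high-stability results.
// Assumes `room` was connected with receiveTranscriptions: true.
const captionsEl = document.getElementById('captions'); // hypothetical element

room.on('transcription', (event) => {
  // Skip interim results that are still likely to change, to avoid flicker.
  if (event.partial_results && event.stability < 0.9) {
    return;
  }
  captionsEl.textContent = event.transcription;
});
```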
iOS
To receive transcription events, set the `receiveTranscriptions` parameter in `TVIConnectOptions` to `true`. This parameter defaults to `false`. To fetch this value, use the `isReceiveTranscriptionsEnabled` getter.
With Real-time Transcriptions enabled for the Room and `receiveTranscriptions` set to `true`, the `transcriptionReceived(room:transcription:)` method in the `RoomDelegate` protocol delivers callback events containing the transcribed text.
```swift
let options = ConnectOptions(token: accessToken, block: { (builder) in
    builder.roomName = "test"
    builder.isReceiveTranscriptionsEnabled = true
})
```
Android
To receive transcription events, set the `receiveTranscriptions` parameter in `ConnectOptions` to `true`. This parameter defaults to `false`. To check the setting, call `isReceiveTranscriptionsEnabled()`.
With Real-time Transcriptions enabled for the Room and `receiveTranscriptions` set to `true`, the `onTranscription(@NonNull Room room, @NonNull JSONObject json)` method of the `Room.Listener` interface delivers callback events containing the transcribed text.
```java
ConnectOptions connectOptions = new ConnectOptions.Builder(accessToken)
    .receiveTranscriptions(true)
    .build();

Video.connect(context, connectOptions, roomListener);
```
To enable and configure Real-time Transcriptions in the Twilio Console, complete the following steps.
- Log in to the Twilio Console.
- Go to Video > Manage > Room Settings.
- Scroll to Realtime Transcriptions.
- Click Accept for the Predictive and Generative AI/ML Features Addendum.
- Click Enabled for the Automatically turn on Realtime Transcriptions by default in Rooms setting.
- Click Save.
AI Nutrition Facts
Real-time Transcriptions for Video uses third-party artificial intelligence and machine learning technologies.
To improve your understanding of how AI handles your data, Twilio's AI Nutrition Facts provide an overview of the AI feature you're using. The following Speech to Text Transcriptions Nutrition Facts label outlines the AI qualities of Real-time Transcriptions for Video.
AI Nutrition Facts
Speech to Text Transcriptions - Programmable Voice, Twilio Video, and Conversational Intelligence
- Description
- Generate speech to text voice transcriptions (real-time and post-call) in Programmable Voice, Twilio Video, and Conversational Intelligence.
- Privacy Ladder Level
- N/A
- Feature is Optional
- Yes
- Model Type
- Generative and Predictive - Automatic Speech Recognition
- Base Model
- Deepgram Speech-to-Text, Google Speech-to-Text, Amazon Transcribe
- Base Model Trained with Customer Data
- No
- Customer Data is Shared with Model Vendor
- No
- Training Data Anonymized
- N/A
- Data Deletion
- Yes
- Human in the Loop
- Yes
- Data Retention
- Until the customer deletes
- Logging & Auditing
- Yes
- Guardrails
- Yes
- Input/Output Consistency
- Yes
- Other Resources
- https://www.twilio.com/docs/conversational-intelligence
Trust Ingredients
Conversational Intelligence, Programmable Voice, and Twilio Video only use the default Base Model provided by the Model Vendor. The Base Model is not trained using customer data.
Base Model is not trained using any customer data.
Transcriptions are deleted by the customer using the Conversational Intelligence API or when a customer account is deprovisioned.
The customer views output in the Conversational Intelligence API or Transcript Viewer.
Compliance
The customer can listen to the input (recording) and view the output (transcript).
The customer is responsible for human review.
Learn more about this label at nutrition-facts.ai
- To use the Google `medical_conversation` model, set `enableAutomaticPunctuation` to `true`.
- When a Room reaches the `MaxParticipantDuration` time limit, Transcriptions stop. As a workaround, set the Room's `MaxParticipantDuration` parameter to a value that exceeds the expected lifetime of the Room; this value defaults to four hours. See the sketch after this list.
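Here's a minimal sketch of that workaround, assuming Node.js 18+ (for global `fetch`) and an ES module context (for top-level `await`); the Room name and the eight-hour value are illustrative:

```javascript
// Create a Room whose participant time limit exceeds its expected lifetime
// so that Real-time Transcriptions don't stop mid-session.
const auth = Buffer.from(
  `${process.env.TWILIO_ACCOUNT_SID}:${process.env.TWILIO_AUTH_TOKEN}`
).toString('base64');

const response = await fetch('https://video.twilio.com/v1/Rooms', {
  method: 'POST',
  headers: {
    'Authorization': `Basic ${auth}`,
    'Content-Type': 'application/x-www-form-urlencoded'
  },
  body: new URLSearchParams({
    UniqueName: 'all-day-workshop',           // illustrative Room name
    TranscribeParticipantsOnConnect: 'true',
    MaxParticipantDuration: '28800'           // 8 hours, instead of the 4-hour default
  })
});

console.log(await response.json());
```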