Real-time Transcriptions for Video
Beta
Real-Time Transcriptions is in Public Beta. The information in this document could change. We might add or update features before the product becomes Generally Available. Beta products don't have a Service Level Agreement (SLA). Learn more about beta product support.
Real-Time Transcriptions Public Beta is not HIPAA eligible.
Legal notice
Real-Time Transcriptions uses artificial intelligence or machine learning technologies. By enabling or using any of the features or functionalities within Twilio Video that are identified as using artificial intelligence or machine learning technology, you acknowledge and agree that your use of these features or functionalities is subject to the terms of the Predictive and Generative AI/ML Features Addendum.
Real-Time Transcriptions converts speech from any participant in a Video Room into text and sends that text to the Video Client SDKs (JavaScript, iOS, and Android). Your application can render the text in any style and format. Twilio supports multiple speech models and you can choose the model that best fits your use case.
Your app can implement transcriptions in two ways:
- Start automatically when your app creates a Video Room.
- Start, stop, or restart on demand while the Room is active.
You enable Real-Time Transcriptions at the Video Room level, so every participant is transcribed. You configure the spoken language and speech model, and the settings remain in effect until the Room ends. You can also set a default configuration in the Twilio Console.
When transcription is active, Twilio delivers the transcribed text, along with the Participant SID, to every participant in the Room.
If you enable partial results, the transcription engine delivers interim results so that your app can refresh the UI in near real time.
Transcription is a subresource of the Room resource that represents a Room's transcript. The resource URI is:
/v1/Rooms/{RoomNameOrSid}/Transcriptions/{TranscriptionTtid}
Property | Type | Description |
---|---|---|
ttid | ttid | The Twilio Type ID of the Transcription resource. It is assigned at creation time using the following format: video_transcriptions_{uuidv7-special-encoded}. |
room_sid | SID<RM> | The SID of the Room instance that is the parent of the Transcription resource. |
account_sid | SID<AC> | The SID of the account that owns the Transcription resource. |
status | enum<string> | The status of the Transcription resource. Can be: started, stopped, or failed. The resource is created with status started by default. |
configuration | object<string-map> | Key-value map of configuration parameters applied to audio track transcriptions. The following body parameters are supported: PartialResults, LanguageCode, TranscriptionEngine, ProfanityFilter, SpeechModel, Hints, EnableAutomaticPunctuation. |
date_created | string<date-time> | The date and time in GMT when the resource was created, specified in ISO 8601 format. |
date_updated | string<date-time> | The date and time in GMT when the resource was last updated, specified in ISO 8601 format. |
start_time | string<date-time> | The date and time in GMT when the resource last transitioned to state started, specified in ISO 8601 format. |
end_time | string<date-time> | The date and time in GMT when the resource last transitioned to state stopped, specified in ISO 8601 format. |
duration | integer | The time in seconds the transcription has been in state started. |
url | string<uri> | The absolute URL of the resource. |
The Transcription resource transitions to status failed if an internal error prevents transcriptions from being generated. The Twilio Console receives a debug event with the details of the failure. The resource can't be restarted once a failure is detected.
Transcription configuration properties
Name | Type | Optional or Required | Description |
---|---|---|---|
transcriptionEngine | string | Optional | The transcription engine to use, among those supported by Twilio. Default is "google". |
speechModel | string | Optional | The recognition model used by the transcription engine, among those supported by the provider. Default is Google's "telephony". |
languageCode | string | Optional | The language code used by the transcription engine, specified in BCP-47 format. Default is "en-US". This attribute helps the transcription engine correctly understand and process the spoken language. |
partialResults | boolean | Optional | Whether to send partial results. Default is false. When enabled, the transcription engine sends interim results as the transcription progresses, providing more immediate feedback before the final result is available. |
profanityFilter | boolean | Optional | Whether the provider attempts to filter out profanities, replacing all but the initial character of each filtered word with asterisks. Google-specific feature. Default is true. |
hints | string | Optional | A list of words or phrases that the transcription provider can expect to encounter during the video call. Hints can improve the provider's recognition of expected words or phrases. Provide up to 500 words or phrases as a comma-separated list, with each entry up to 100 characters and the words in a phrase separated by spaces. |
enableAutomaticPunctuation | boolean | Optional | Whether the provider adds punctuation to the transcribed text. Default is true. When enabled, the transcription engine automatically inserts punctuation marks such as periods, commas, and question marks, improving the readability of the transcribed text. |
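For example, a configuration that selects the long speech model and supplies recognition hints might look like the following sketch (the hint values are illustrative):

```json
{
  "transcriptionEngine": "google",
  "speechModel": "long",
  "languageCode": "en-US",
  "partialResults": true,
  "hints": "Twilio, Video Room, SIP trunking, access token"
}
```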
The following table lists the possible values for the transcriptionEngine and associated speechModel properties.
Transcription engine | Speech model | Description |
---|---|---|
google | telephony | Use this model for audio that originated from an audio phone call, typically recorded at an 8 kHz sampling rate. |
google | medical_conversation | Use this model for conversations between a medical provider, for example, a doctor or nurse, and a patient. |
google | long | Use this model for any type of long form content, such as media or spontaneous speech and conversations. Consider using this model instead of the video or the default model, especially if they aren't available in your target language. |
google | short | Use this model for short utterances that are a few seconds in length. It is useful for trying to capture commands or other single-shot directed speech use cases. Consider using this model instead of the command and search model. |
google | telephony_short | Dedicated version of the telephony model for short or even single-word utterances for audio that originated from a phone call, typically recorded at an 8 kHz sampling rate. Useful for utterances only a few seconds long in customer service, teleconferencing, and automated kiosk applications. |
google | medical_dictation | Use this model to transcribe notes dictated by a medical professional, for example, a doctor dictating notes about a patient's blood test results. |
google | chirp_2 | Use the next generation of Google's Universal large Speech Model (USM), powered by large language model technology. It supports streaming and batch operation, and provides transcription and translation of diverse linguistic content with multilingual capabilities. |
google | chirp_telephony | Universal large Speech Model (USM) fine-tuned for audio that originated from a phone call (typically recorded at an 8 kHz sampling rate). |
google | chirp | Use our Universal large Speech Model (USM), for state-of-the-art non-streaming transcriptions in diverse linguistic content and multilingual capabilities. |
deepgram | nova-2 | Recommended for most use cases. |
Notes:
- The
google
transcription engine corresponds to the Google Speech-to-Text V2 API. - Not all languages are available on all speech models. For valid combinations, see the following provider documentation:
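For example, to use the Deepgram engine, pair it with its supported model. This is a sketch; validate engine and model combinations against the provider documentation:

```json
{
  "transcriptionEngine": "deepgram",
  "speechModel": "nova-2",
  "languageCode": "en-US"
}
```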
To create a Video Room with Real-Time Transcriptions automatically enabled, add the following two parameters to the Room POST request:
Parameter | Type | Description |
---|---|---|
TranscribeParticipantsOnConnect | boolean | Whether to start real-time transcriptions when Participants connect. Default is false . |
TranscriptionsConfiguration | object | Key-value configuration settings for the transcription engine. For more information, see Transcription configuration properties. |
To automatically enable transcriptions on the Video Room, set the TranscribeParticipantsOnConnect parameter to true.
Example:
curl -X POST "https://video.twilio.com/v1/Rooms" \\--data-urlencode 'TranscriptionsConfiguration={"languageCode": "EN-us", "partialResults": true}' \\--data-urlencode "TranscribeParticipantsOnConnect=true" \\-u $API_Key_Sid:$API_Key_Secret
Response:
1{2"account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",3"audio_only": false,4"date_created": "2025-06-12T15:42:32Z",5"date_updated": "2025-06-12T15:42:32Z",6"duration": null,7"empty_room_timeout": 5,8"enable_turn": true,9"end_time": null,10"large_room": false,11"links": {12"participants": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Participants",13"recording_rules": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/RecordingRules",14"recordings": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Recordings",15"transcriptions": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions"16},17"max_concurrent_published_tracks": 170,18"max_participant_duration": 14400,19"max_participants": 50,20"media_region": "us1",21"record_participants_on_connect": false,22"sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",23"status": "in-progress",24"status_callback": null,25"status_callback_method": "POST",26"type": "group",27"unique_name": "test",28"unused_room_timeout": 5,29"url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",30"video_codecs": [31"VP8",32"H264"33]34}
Create a Transcription with a POST request to the resource URI:
/v1/Rooms/{RoomNameOrSid}/Transcriptions
Path parameters
Parameter | Type | Description |
---|---|---|
RoomSid | SID<RM> | The ID of the parent room where the Transcription resource is created. |
Request body parameters
Parameter | Type | Description |
---|---|---|
Configuration | object<string-map> | Object with key-value configurations. See property configuration of the Transcription resource above for a description of supported keys. |
Example:
```bash
curl -X POST "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions" \
  --data-urlencode 'Configuration={"languageCode": "en-US", "partialResults": true, "profanityFilter": true, "speechModel": "long"}' \
  -u $API_Key_Sid:$API_Key_Secret
```
Response:
1{2"account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",3"configuration": {4"languageCode": "EN-us",5"partialResults": "true",6"profanityFilter": "true",7"speechModel": "long"8},9"date_created": "2025-07-22T14:14:35Z",10"date_updated": null,11"duration": null,12"end_time": null,13"room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",14"start_time": null,15"status": "started",16"ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",17"url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"18}
Update a Transcription resource with a POST request to the resource instance URI:
/v1/Rooms/{RoomSid}/Transcriptions/{ttid}
To stop transcriptions on a Room, set the Status to stopped.
Path parameters
Parameter | Type | Description |
---|---|---|
ttid | ttid | The TTID of the Transcription resource being updated. The current implementation supports a single Transcription resource per Room, but this might change in future implementations. |
RoomSid | SID<RM> | The ID of the parent room where the Transcription resource is updated. |
Request body parameters
Parameter | Type | Description |
---|---|---|
Status | enum<string> | New status of the Transcription resource. Can be: started , stopped . There is no state transition if the resource property status already has the same value or if the parameter is missing. |
Example:
```bash
curl -X POST "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" \
  --data-urlencode "Status=stopped" \
  -u $API_Key_Sid:$API_Key_Secret
```
Response:
1{2"account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",3"configuration": {4"languageCode": "EN-us",5"partialResults": "true"6},7"date_created": "2025-07-22T12:55:30Z",8"date_updated": "2025-07-22T12:56:02Z",9"duration": null,10"end_time": null,11"room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",12"start_time": null,13"status": "stopped",14"ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",15"url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"16}
To restart transcriptions on a Room that has a Transcription resource in the stopped state, update it with a POST request to the resource instance URI:
/v1/Rooms/{RoomSid}/Transcriptions/{ttid}
To restart transcription, set the Status to started.
Path parameters
Parameter | Type | Description |
---|---|---|
ttid | ttid | The TTID of the Transcription resource being updated. The current implementation supports a single Transcription resource per Room, but this might change in future implementations. |
RoomSid | SID<RM> | The ID of the parent room where the Transcription resource is updated. |
Request body parameters
Parameter | Type | Description |
---|---|---|
Status | enum<string> | New status of the Transcription resource. Use started to restart transcriptions. There is no state transition if the resource property status already has the same value or if the parameter is missing. |
Example:
1"https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" \2--data-urlencode "Status=started" \3-u ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Response:
1{2"account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",3"configuration": {4"languageCode": "EN-us",5"partialResults": "true"6},7"date_created": "2025-07-22T12:57:24Z",8"date_updated": null,9"duration": null,10"end_time": null,11"room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",12"start_time": null,13"status": "started",14"ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",15"url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"16}
Retrieve the Transcription resource in a Room with a GET request to the resource URI:
/v1/Rooms/{RoomSid}/Transcriptions
Real-Time Transcriptions supports only a single instance of the Transcription resource per Room, so the list will always have a single item.
Path parameters
Parameter | Type | Description |
---|---|---|
RoomSid | SID<RM> | The ID of the parent room from where Transcription resources are retrieved. |
Example:
```bash
curl -X GET "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions" \
  -u $API_Key_Sid:$API_Key_Secret
```
Response:
1{2"meta": {3"first_page_url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions?PageSize=50&Page=0",4"key": "transcriptions",5"next_page_url": null,6"page": 0,7"page_size": 50,8"previous_page_url": null,9"url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions?PageSize=50&Page=0"10},11"transcriptions": [12{13"account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",14"configuration": {},15"date_created": "2025-07-22T11:05:41Z",16"date_updated": null,17"duration": null,18"end_time": null,19"room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",20"start_time": null,21"status": "started",22"ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",23"url": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"24}25]26}
Retrieve a specific Transcription resource with a GET request to the instance resource URI:
/v1/Rooms/{RoomSid}/Transcriptions/{ttid}
Path parameters
Parameter | Type | Description |
---|---|---|
ttid | ttid | The TTID of the Transcription resource being requested. |
RoomSid | SID<RM> | The ID of the parent room where the Transcription resource is retrieved. |
Example:
```bash
curl -X GET "https://video.twilio.com/v1/Rooms/$sid/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX" \
  -u $API_Key_Sid:$API_Key_Secret
```
Response:
1{2"account_sid": "ACXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",3"configuration": {4"LanguageCode": "EN-us",5"ProfanityFilter": "true"6},7"date_created": null,8"date_updated": null,9"duration": null,10"end_time": null,11"links": {12"transcriptions": "https://video.twilio.com/v1/Rooms/RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/Transcriptions/video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"13},14"room_sid": "RMXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",15"start_time": null,16"status": "created",17"ttid": "video_extension_XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"18}
Twilio delivers transcribed text to the client SDKs through callback events.
The schema of the JSON delivery format contains a version number. Each event contains the transcription of a single utterance and details of the participant who generated the audio.
```yaml
properties:
  type:
    const: extension_transcriptions

  version:
    description: |
      Version of the transcriptions protocol used by this message. It is semver compliant

  track:
    $ref: /Server/State/RemoteTrack
    description: |
      Audio track from where the transcription has been generated.

  participant:
    $ref: /Server/State/Participant
    description: |
      The participant who published the audio track from where the
      transcription has been generated.

  sequence_number:
    type: integer
    description: |
      Sequence number. Starts with one and increments monotonically. A sequence
      counter is defined for each track to allow the receiver to identify
      missing messages.

  timestamp:
    type: string
    description: |
      Absolute time from the real-time-transcription. It is
      conformant with UTC ISO 8601.

  partial_results:
    type: boolean
    description: |
      Whether the transcription is a final or a partial result.

  language_code:
    type: string
    description: |
      Language code of the transcribed text. It is conformant with BCP-47.

  transcription:
    type: string
    description: |
      Utterance transcription
```
Example:
1{2"version": "1.0",3"language_code": "en-US",4"partial_results": false,5"participant": "PA00000000000000000000000000000000",6"sequence_number": 3,7"timestamp": "2025-01-01T12:00:00.000000000Z",8"track": "MT00000000000000000000000000000000",9"transcription": "This is a test",10"type": "extension_transcriptions"11}
To enable the flow of transcript events, set the receiveTranscriptions parameter in connectOptions to true. The default value is false. Once set to true, and provided that Real-Time Transcriptions is enabled at the Room level, callback events containing the transcribed text start to flow.
Example:
```javascript
import { connect } from 'twilio-video';

const room = await connect(token, {
  name: 'my-room',
  receiveTranscriptions: true
});

room.on('transcription', (transcriptionEvent) => {
  console.log(`${transcriptionEvent.participant}: ${transcriptionEvent.transcription}`);
});
```
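If partial results are enabled, several interim events can arrive for the same utterance before the final one. The following is a minimal sketch of replacing interim captions as events arrive, assuming the event fields from the schema above and a hypothetical updateCaption UI helper:

```javascript
// Sketch: replace interim captions per audio track until the final result
// arrives. updateCaption(trackSid, text, isFinal) is a hypothetical UI helper.
const lastSequence = new Map(); // track SID -> last seen sequence_number

room.on('transcription', (event) => {
  const previous = lastSequence.get(event.track) ?? 0;
  if (event.sequence_number <= previous) {
    return; // drop duplicate or out-of-order events
  }
  lastSequence.set(event.track, event.sequence_number);

  // partial_results === true marks an interim result that a later event
  // for the same track will supersede.
  updateCaption(event.track, event.transcription, !event.partial_results);
});
```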
To enable the flow of transcript events, set the receiveTranscriptions parameter in TVIConnectOptions to true. The default value is false. You can retrieve the current value with the isReceiveTranscriptionsEnabled getter. Once set to true, and provided that Real-Time Transcriptions is enabled at the Room level, callback events containing the transcribed text are delivered via the transcriptionReceived(room:transcription:) method in the RoomDelegate protocol.
Example:
```swift
let options = ConnectOptions(token: accessToken, block: { (builder) in
    builder.roomName = "test"
    builder.isReceiveTranscriptionsEnabled = true
})
```
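A minimal sketch of the corresponding delegate callback follows. The payload type is shown as a dictionary for illustration only; it carries the fields described in the event schema above, and you should consult the SDK for the exact signature:

```swift
// Sketch: receive transcription events through the RoomDelegate protocol.
// The payload type here is an assumption for illustration purposes.
class RoomObserver: NSObject, RoomDelegate {
    func transcriptionReceived(room: Room, transcription: [String: Any]) {
        let participant = transcription["participant"] as? String ?? "unknown"
        let text = transcription["transcription"] as? String ?? ""
        print("\(participant): \(text)")
    }
}
```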
To receive transcription events, set the receiveTranscriptions parameter in ConnectOptions to true. The default value is false. To check the current setting, call isReceiveTranscriptionsEnabled(). Once set to true, and Real-Time Transcriptions is enabled for the Room, callback events containing the transcribed text are delivered through the onTranscription(@NonNull Room room, @NonNull JSONObject json) method of the Room.Listener interface.
Example:
```java
ConnectOptions connectOptions = new ConnectOptions.Builder(accessToken)
    .receiveTranscriptions(true)
    .build();

Video.connect(context, connectOptions, roomListener);
```
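A minimal sketch of the corresponding listener callback follows; the remaining Room.Listener methods must also be implemented for the code to compile and are omitted here for brevity:

```java
// Sketch: handle transcription events in a Room.Listener implementation.
// The JSON payload carries the fields described in the event schema above.
Room.Listener roomListener = new Room.Listener() {
    @Override
    public void onTranscription(@NonNull Room room, @NonNull JSONObject json) {
        String participant = json.optString("participant");
        String text = json.optString("transcription");
        Log.d("Transcription", participant + ": " + text);
    }

    // Other Room.Listener callbacks (onConnected, onDisconnected, and so on)
    // omitted for brevity.
};
```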
To enable and configure Real-Time Transcriptions in the Twilio Console, navigate to the Video Room settings page.

AI Nutrition Facts
Real-Time Transcriptions for Video uses third-party artificial intelligence and machine learning technologies.
Twilio's AI Nutrition Facts provide an overview of the AI feature you're using, so you can better understand how the AI is working with your data. The following Speech to Text Transcriptions - Nutrition Facts label outlines the AI qualities of Real-Time Transcriptions for Video. For more information, see Twilio's AI Nutrition Facts page.
AI Nutrition Facts
Speech to Text Transcriptions - Programmable Voice, Twilio Video, and Conversational Intelligence
- Description: Generate speech to text voice transcriptions (real-time and post-call) in Programmable Voice, Twilio Video, and Conversational Intelligence.
- Privacy Ladder Level: N/A
- Feature is Optional: Yes
- Model Type: Generative and Predictive - Automatic Speech Recognition
- Base Model: Deepgram Speech-to-Text, Google Speech-to-Text, Amazon Transcribe
- Base Model Trained with Customer Data: No
- Customer Data is Shared with Model Vendor: No
- Training Data Anonymized: N/A
- Data Deletion: Yes
- Human in the Loop: Yes
- Data Retention: Until the customer deletes
- Logging & Auditing: Yes
- Guardrails: Yes
- Input/Output Consistency: Yes
- Other Resources: https://www.twilio.com/docs/conversational-intelligence
Trust Ingredients
Conversational Intelligence, Programmable Voice, and Twilio Video only use the default Base Model provided by the Model Vendor. The Base Model is not trained using customer data.
Base Model is not trained using any customer data.
Transcriptions are deleted by the customer using the Conversational Intelligence API or when a customer account is deprovisioned.
The customer views output in the Conversational Intelligence API or Transcript Viewer.
Compliance
The customer can listen to the input (recording) and view the output (transcript).
The customer is responsible for human review.
Learn more about this label at nutrition-facts.ai