This guide helps you better understand the different ways you can use the Azure Communication Services offering of real-time transcription through the Call Automation SDKs.
Prerequisites
- Azure account with an active subscription. For details, see Create an account for free.
- Azure Communication Services resource. See Create an Azure Communication Services resource.
- Create and connect Azure AI services to your Azure Communication Services resource.
- Create a custom subdomain for your Azure AI services resource.
- Create a new web service application using the Call Automation SDK.
Set up a WebSocket Server
Azure Communication Services requires your server application to set up a WebSocket server to stream transcription in real time. WebSocket is a standardized protocol that provides a full-duplex communication channel over a single TCP connection. You can optionally use Azure WebApps to create an application that receives transcripts over a WebSocket connection. Follow this quickstart.
Establish a call
In this quickstart, we assume that you're already familiar with starting calls. If you need to learn more about starting and establishing calls, you can follow our quickstart. For the purposes of this quickstart, we're going through the process of starting transcription for both incoming calls and outbound calls.
When working with real-time transcription, you have a few options for when and how to start transcription:
Option 1 - Starting at time of answering or creating a call
Option 2 - Starting transcription during an ongoing call
Option 3 - Starting transcription when connecting to an Azure Communication Services Rooms call
In this tutorial, we're demonstrating options 2 and 3: starting transcription during an ongoing call or when connecting to a Rooms call. By default, startTranscription is set to false at the time of answering or creating a call.
Create a call and provide the transcription details
Define the TranscriptionOptions for ACS to specify when to start the transcription, the locale for transcription, and the WebSocket connection for sending the transcript.
var createCallOptions = new CreateCallOptions(callInvite, callbackUri)
{
CallIntelligenceOptions = new CallIntelligenceOptions() { CognitiveServicesEndpoint = new Uri(cognitiveServiceEndpoint) },
TranscriptionOptions = new TranscriptionOptions(new Uri(""), "en-US", false, TranscriptionTransport.Websocket)
};
CreateCallResult createCallResult = await callAutomationClient.CreateCallAsync(createCallOptions);
Sentiment Analysis (Preview)
Track the emotional tone of conversations in real time to support customer and agent interactions, and enable supervisors to intervene when necessary. Available in public preview through createCall, answerCall, and startTranscription.
Create a call with Sentiment Analysis enabled
// Define transcription options with sentiment analysis enabled
var transcriptionOptions = new TranscriptionOptions
{
IsSentimentAnalysisEnabled = true
};
var callIntelligenceOptions = new CallIntelligenceOptions
{
CognitiveServicesEndpoint = new Uri(cognitiveServiceEndpoint)
};
var createCallOptions = new CreateCallOptions(callInvite, new Uri("https://test"))
{
CallIntelligenceOptions = callIntelligenceOptions,
TranscriptionOptions = transcriptionOptions
};
CreateCallResult createCallResult = await callAutomationClient.CreateCallAsync(createCallOptions);
Answer a call with Sentiment Analysis enabled
// Define transcription options with sentiment analysis enabled
var transcriptionOptions = new TranscriptionOptions
{
IsSentimentAnalysisEnabled = true
};
var answerCallOptions = new AnswerCallOptions(incomingCallContext, callbackUri)
{
TranscriptionOptions = transcriptionOptions
};
var answerCallResult = await client.AnswerCallAsync(answerCallOptions);
PII Redaction (Preview)
Automatically identify and mask sensitive information, such as names, addresses, or identification numbers, to ensure privacy and regulatory compliance. Available in createCall, answerCall, and startTranscription.
Answer a call with PII Redaction enabled
var transcriptionOptions = new TranscriptionOptions
{
PiiRedactionOptions = new PiiRedactionOptions
{
IsEnabled = true,
RedactionType = RedactionType.MaskWithCharacter
},
};
var options = new AnswerCallOptions(incomingCallContext, callbackUri)
{
TranscriptionOptions = transcriptionOptions,
};
//Answer call request
var answerCallResult = await client.AnswerCallAsync(options);
Note
With PII redaction enabled you’ll only receive the redacted text.
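Because only the redacted text arrives, downstream code may want to detect which transcript segments were masked. The helper below is an illustrative sketch, not part of any SDK; it assumes the MaskWithCharacter redaction shown in the payloads later in this article, which replaces sensitive text with runs of asterisks.

```javascript
// Illustrative helper (not part of the Call Automation SDK): flag transcript
// text that contains masked characters so downstream code can treat it as
// redacted. Assumes MaskWithCharacter redaction, which substitutes '*' runs.
function containsRedactedPii(text) {
  return /\*{2,}/.test(text); // two or more consecutive mask characters
}
```

For example, `containsRedactedPii("My date of birth is *********.")` returns true for the sample payload shown later in this article.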
Real-time language detection (Preview)
Automatically detect spoken languages to enable natural, human-like communication and eliminate manual language selection. Available in createCall, answerCall, and startTranscription.
Create a call with Real-time language detection enabled
var transcriptionOptions = new TranscriptionOptions
{
Locales = new List<string> { "en-US", "fr-FR", "hi-IN" }
};
var createCallOptions = new CreateCallOptions(callInviteOption, new Uri("https://test"))
{
TranscriptionOptions = transcriptionOptions
};
//CreateCall request
var createCallRequest = await client.CreateCallAsync(createCallOptions);
Note
To stop language identification after it has started, use the updateTranscription API and explicitly set the language you want to use for the transcript. This disables automatic language detection and locks transcription to the specified language.
Connect to a Rooms call and provide transcription details
If you're connecting to an ACS room and want to use transcription, configure the transcription options as follows:
var transcriptionOptions = new TranscriptionOptions(
transportUri: new Uri(""),
locale: "en-US",
startTranscription: false,
transcriptionTransport: TranscriptionTransport.Websocket)
{
// Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
SpeechRecognitionModelEndpointId = "YourCustomSpeechRecognitionModelEndpointId"
};
var connectCallOptions = new ConnectCallOptions(new RoomCallLocator("roomId"), callbackUri)
{
CallIntelligenceOptions = new CallIntelligenceOptions()
{
CognitiveServicesEndpoint = new Uri(cognitiveServiceEndpoint)
},
TranscriptionOptions = transcriptionOptions
};
var connectResult = await client.ConnectCallAsync(connectCallOptions);
Start Transcription
Once you're ready to start the transcription, you can make an explicit call to Call Automation to start transcribing the call.
// Start transcription with options
var transcriptionOptions = new StartTranscriptionOptions
{
OperationContext = "startMediaStreamingContext",
IsSentimentAnalysisEnabled = true,
// Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
SpeechRecognitionModelEndpointId = "YourCustomSpeechRecognitionModelEndpointId"
};
// Start transcription
await callMedia.StartTranscriptionAsync(transcriptionOptions);
// Alternative: Start transcription without options
// await callMedia.StartTranscriptionAsync();
Get mid-call summaries (Preview)
Enhance your call workflows with real-time summarization. By enabling summarization in your transcription options, ACS can automatically generate concise mid-call recaps, including decisions, action items, and key discussion points, without waiting for the call to end. This helps teams stay aligned and enables faster follow-ups during live conversations.
// Define transcription options with call summarization enabled
var transcriptionOptions = new TranscriptionOptions
{
SummarizationOptions = new SummarizationOptions
{
Locale = "en-US"
}
};
// Answer call with transcription options
var answerCallOptions = new AnswerCallOptions(incomingCallContext, callbackUri)
{
TranscriptionOptions = transcriptionOptions
};
var answerCallResult = await client.AnswerCallAsync(answerCallOptions);
Additional Headers:
The correlation ID and call connection ID are now included in the WebSocket headers for improved traceability: x-ms-call-correlation-id and x-ms-call-connection-id.
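As a sketch of how your server might use these headers, the helper below (a hypothetical function, not part of any SDK) pulls both IDs out of the upgrade request's header map, assuming lowercased header names as Node.js exposes them:

```javascript
// Hypothetical helper: extract the tracing IDs that Call Automation adds to
// the WebSocket upgrade request. `headers` is a plain object mapping
// lowercased header names to values (the shape Node.js provides).
function extractCallTracingIds(headers) {
  return {
    correlationId: headers["x-ms-call-correlation-id"],
    callConnectionId: headers["x-ms-call-connection-id"],
  };
}
```

Logging these two values alongside your transcript data makes it straightforward to correlate a WebSocket session with the call it belongs to.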
Receiving Transcription Stream
When transcription starts, your websocket receives the transcription metadata payload as the first packet.
{
"kind": "TranscriptionMetadata",
"transcriptionMetadata": {
"subscriptionId": "aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e",
"locale": "en-us",
"callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
"correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"
}
}
Receiving Transcription data
After the metadata, the next packets your web socket receives will be TranscriptionData for the transcribed audio.
{
"kind": "TranscriptionData",
"transcriptionData": {
"text": "Testing transcription.",
"format": "display",
"confidence": 0.695223331451416,
"offset": 2516998782481234400,
"words": [
{
"text": "testing",
"offset": 2516998782481234400
},
{
"text": "testing",
"offset": 2516998782481234400
}
],
"participantRawID": "8:acs:",
"resultStatus": "Final"
}
}
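The SDK's StreamingDataParser handles these packets for you; purely for illustration, a minimal JavaScript sketch of dispatching on the kind field of the payload shapes shown above might look like this:

```javascript
// Minimal dispatch over the packet shapes shown above. For illustration only;
// the Call Automation SDK's StreamingDataParser does this parsing for you.
function handleTranscriptionPacket(message) {
  const packet = JSON.parse(message);
  switch (packet.kind) {
    case "TranscriptionMetadata":
      // First packet: capture the IDs for correlation and logging.
      return {
        type: "metadata",
        callConnectionId: packet.transcriptionMetadata.callConnectionId,
      };
    case "TranscriptionData":
      // Subsequent packets: transcribed text for each audio segment.
      return {
        type: "data",
        text: packet.transcriptionData.text,
        isFinal: packet.transcriptionData.resultStatus === "Final",
      };
    default:
      return { type: "unknown" };
  }
}
```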
Receiving Transcription Stream with AI capabilities enabled (Preview)
When transcription is enabled during a call, Azure Communication Services emits metadata that describes the configuration and context of the transcription session. This includes details such as the locale, call connection ID, sentiment analysis settings, and PII redaction preferences. Developers can use this payload to verify transcription setup, audit configurations, or troubleshoot issues related to real-time transcription features enhanced by AI.
{
"kind": "TranscriptionMetadata",
"transcriptionMetadata": {
"subscriptionId": "863b5e55-de0d-4fc3-8e58-2d68e976b5ad",
"locale": "en-US",
"callConnectionId": "02009180-9dc2-429b-a3eb-d544b7b6a0e1",
"correlationId": "62c8215b-5276-4d3c-bb6d-06a1b114651b",
"speechModelEndpointId": null,
"locales": [],
"enableSentimentAnalysis": true,
"piiRedactionOptions": {
"enable": true,
"redactionType": "MaskWithCharacter"
}
}
}
Receiving Transcription data with AI capabilities enabled (Preview)
After the initial metadata packet, your WebSocket connection will begin receiving TranscriptionData events for each segment of transcribed audio. These packets include the transcribed text, confidence score, timing information, and, if enabled, sentiment analysis and PII redaction. This data can be used to build real-time dashboards, trigger workflows, or analyze conversation dynamics during the call.
{
"kind": "TranscriptionData",
"transcriptionData": {
"text": "My date of birth is *********.",
"format": "display",
"confidence": 0.8726407289505005,
"offset": 309058340,
"duration": 31600000,
"words": [],
"participantRawID": "4:+917020276722",
"resultStatus": "Final",
"sentimentAnalysisResult": {
"sentiment": "neutral"
}
}
}
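The duration in this payload appears to be expressed in 100-nanosecond ticks (the sample value of 31600000 corresponds to about 3.16 seconds of speech); that interpretation is an assumption worth verifying against the current service documentation. If it holds for your payloads, a small conversion helper suffices:

```javascript
// Assumption: duration (and offset) values are in 100-nanosecond ticks, as
// the sample value of 31600000 (about 3.16 s) suggests. Verify against the
// current service documentation before relying on this.
function ticksToSeconds(ticks) {
  return ticks / 10_000_000; // ten million 100-ns ticks per second
}
```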
Handling transcription stream in the web socket server
using WebServerApi;
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddEndpointsApiExplorer();
builder.Services.AddSwaggerGen();
var app = builder.Build();
app.UseWebSockets();
app.Map("/ws", async context =>
{
if (context.WebSockets.IsWebSocketRequest)
{
using var webSocket = await context.WebSockets.AcceptWebSocketAsync();
await HandleWebSocket.Echo(webSocket);
}
else
{
context.Response.StatusCode = StatusCodes.Status400BadRequest;
}
});
app.Run();
Updates to your code for the websocket handler
using Azure.Communication.CallAutomation;
using System.Net.WebSockets;
using System.Text;
namespace WebServerApi
{
public class HandleWebSocket
{
public static async Task Echo(WebSocket webSocket)
{
var buffer = new byte[1024 * 4];
var receiveResult = await webSocket.ReceiveAsync(
new ArraySegment<byte>(buffer), CancellationToken.None);
while (!receiveResult.CloseStatus.HasValue)
{
string msg = Encoding.UTF8.GetString(buffer, 0, receiveResult.Count);
var response = StreamingDataParser.Parse(msg);
if (response != null)
{
if (response is AudioMetadata audioMetadata)
{
Console.WriteLine("***************************************************************************************");
Console.WriteLine("MEDIA SUBSCRIPTION ID-->"+audioMetadata.MediaSubscriptionId);
Console.WriteLine("ENCODING-->"+audioMetadata.Encoding);
Console.WriteLine("SAMPLE RATE-->"+audioMetadata.SampleRate);
Console.WriteLine("CHANNELS-->"+audioMetadata.Channels);
Console.WriteLine("LENGTH-->"+audioMetadata.Length);
Console.WriteLine("***************************************************************************************");
}
if (response is AudioData audioData)
{
Console.WriteLine("***************************************************************************************");
Console.WriteLine("DATA-->"+audioData.Data);
Console.WriteLine("TIMESTAMP-->"+audioData.Timestamp);
Console.WriteLine("IS SILENT-->"+audioData.IsSilent);
Console.WriteLine("***************************************************************************************");
}
if (response is TranscriptionMetadata transcriptionMetadata)
{
Console.WriteLine("***************************************************************************************");
Console.WriteLine("TRANSCRIPTION SUBSCRIPTION ID-->"+transcriptionMetadata.TranscriptionSubscriptionId);
Console.WriteLine("LOCALE-->"+transcriptionMetadata.Locale);
Console.WriteLine("CALL CONNECTION ID--?"+transcriptionMetadata.CallConnectionId);
Console.WriteLine("CORRELATION ID-->"+transcriptionMetadata.CorrelationId);
Console.WriteLine("LOCALES-->" + transcriptionMetadata.Locales);
Console.WriteLine("PII REDACTION OPTIONS ISENABLED-->" + transcriptionMetadata.PiiRedactionOptions?.IsEnabled);
Console.WriteLine("PII REDACTION OPTIONS - REDACTION TYPE-->" + transcriptionMetadata.PiiRedactionOptions?.RedactionType);
Console.WriteLine("***************************************************************************************");
}
if (response is TranscriptionData transcriptionData)
{
Console.WriteLine("***************************************************************************************");
Console.WriteLine("TEXT-->"+transcriptionData.Text);
Console.WriteLine("FORMAT-->"+transcriptionData.Format);
Console.WriteLine("OFFSET-->"+transcriptionData.Offset);
Console.WriteLine("DURATION-->"+transcriptionData.Duration);
Console.WriteLine("PARTICIPANT-->"+transcriptionData.Participant.RawId);
Console.WriteLine("CONFIDENCE-->"+transcriptionData.Confidence);
Console.WriteLine("SENTIMENT ANALYSIS RESULT-->" + transcriptionData.SentimentAnalysisResult?.Sentiment);
foreach (var word in transcriptionData.Words)
{
Console.WriteLine("TEXT-->"+word.Text);
Console.WriteLine("OFFSET-->"+word.Offset);
Console.WriteLine("DURATION-->"+word.Duration);
}
Console.WriteLine("***************************************************************************************");
}
}
await webSocket.SendAsync(
new ArraySegment<byte>(buffer, 0, receiveResult.Count),
receiveResult.MessageType,
receiveResult.EndOfMessage,
CancellationToken.None);
receiveResult = await webSocket.ReceiveAsync(
new ArraySegment<byte>(buffer), CancellationToken.None);
}
await webSocket.CloseAsync(
receiveResult.CloseStatus.Value,
receiveResult.CloseStatusDescription,
CancellationToken.None);
}
}
}
Update Transcription
For situations where your application allows users to select their preferred language, you may also want to capture the transcription in that language. To do this, the Call Automation SDK allows you to update the transcription locale.
UpdateTranscriptionOptions updateTranscriptionOptions = new UpdateTranscriptionOptions(locale)
{
OperationContext = "UpdateTranscriptionContext",
//Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
SpeechRecognitionModelEndpointId = "YourCustomSpeechRecognitionModelEndpointId"
};
await client.GetCallConnection(callConnectionId).GetCallMedia().UpdateTranscriptionAsync(updateTranscriptionOptions);
Stop Transcription
When your application needs to stop listening for the transcription, you can use the StopTranscription request to let Call Automation know to stop sending transcript data to your web socket.
StopTranscriptionOptions stopOptions = new StopTranscriptionOptions()
{
OperationContext = "stopTranscription"
};
await callMedia.StopTranscriptionAsync(stopOptions);
Create a call and provide the transcription details
Define the TranscriptionOptions for ACS to specify when to start the transcription, the locale for transcription, and the web socket connection for sending the transcript.
CallInvite callInvite = new CallInvite(target, caller);
CallIntelligenceOptions callIntelligenceOptions = new CallIntelligenceOptions()
.setCognitiveServicesEndpoint(appConfig.getCognitiveServiceEndpoint());
TranscriptionOptions transcriptionOptions = new TranscriptionOptions(
appConfig.getWebSocketUrl(),
TranscriptionTransport.WEBSOCKET,
"en-US",
false,
"your-endpoint-id-here" // speechRecognitionEndpointId
);
CreateCallOptions createCallOptions = new CreateCallOptions(callInvite, appConfig.getCallBackUri());
createCallOptions.setCallIntelligenceOptions(callIntelligenceOptions);
createCallOptions.setTranscriptionOptions(transcriptionOptions);
Response<CreateCallResult> result = client.createCallWithResponse(createCallOptions, Context.NONE);
return result.getValue().getCallConnectionProperties().getCallConnectionId();
Sentiment Analysis (Preview)
Track the emotional tone of conversations in real time to support customer and agent interactions, and enable supervisors to intervene when necessary. Available in public preview through createCall, answerCall, and startTranscription.
Create a call with Sentiment Analysis enabled
CallInvite callInvite = new CallInvite(target, caller);
CallIntelligenceOptions callIntelligenceOptions = new CallIntelligenceOptions()
.setCognitiveServicesEndpoint(cognitiveServicesEndpoint);
TranscriptionOptions transcriptionOptions = new TranscriptionOptions("en-ES")
.setTransportUrl(websocketUriHost)
.setEnableSentimentAnalysis(true) // Enable sentiment analysis
.setLocales(locales);
CreateCallOptions createCallOptions = new CreateCallOptions(callInvite, callbackUri.toString())
.setCallIntelligenceOptions(callIntelligenceOptions)
.setTranscriptionOptions(transcriptionOptions);
// Create call request
Response<CreateCallResult> result = client.createCallWithResponse(createCallOptions, Context.NONE);
Answer a call with Sentiment Analysis enabled
TranscriptionOptions transcriptionOptions = new TranscriptionOptions("en-ES")
.setTransportUrl(websocketUriHost)
.setEnableSentimentAnalysis(true) // Enable sentiment analysis
.setLocales(locales);
AnswerCallOptions answerCallOptions = new AnswerCallOptions(data.getString("incomingCallContext"), callbackUri)
.setCallIntelligenceOptions(callIntelligenceOptions)
.setTranscriptionOptions(transcriptionOptions);
// Answer call request
Response<AnswerCallResult> answerCallResponse = client.answerCallWithResponse(answerCallOptions, Context.NONE);
PII Redaction (Preview)
Automatically identify and mask sensitive information, such as names, addresses, or identification numbers, to ensure privacy and regulatory compliance. Available in createCall, answerCall, and startTranscription.
Answer a call with PII Redaction enabled
PiiRedactionOptions piiRedactionOptions = new PiiRedactionOptions()
.setEnabled(true)
.setRedactionType(RedactionType.MASK_WITH_CHARACTER);
TranscriptionOptions transcriptionOptions = new TranscriptionOptions("en-ES")
.setTransportUrl(websocketUriHost)
.setPiiRedactionOptions(piiRedactionOptions)
.setLocales(locales);
AnswerCallOptions answerCallOptions = new AnswerCallOptions(data.getString("incomingCallContext"), callbackUri)
.setCallIntelligenceOptions(callIntelligenceOptions)
.setTranscriptionOptions(transcriptionOptions);
// Answer call request
Response<AnswerCallResult> answerCallResponse = client.answerCallWithResponse(answerCallOptions, Context.NONE);
Note
With PII redaction enabled you’ll only receive the redacted text.
Real-time language detection (Preview)
Automatically detect spoken languages to enable natural, human-like communication and eliminate manual language selection. Available in createCall, answerCall, and startTranscription.
Create a call with Real-time language detection enabled
TranscriptionOptions transcriptionOptions = new TranscriptionOptions("en-US")
.setTransportUrl(websocketUriHost)
.setLocales(Arrays.asList("en-US", "fr-FR", "hi-IN"));
CreateCallOptions createCallOptions = new CreateCallOptions(callInvite, callbackUri.toString())
.setTranscriptionOptions(transcriptionOptions);
// Create call request
Response<CreateCallResult> createCallResult = client.createCallWithResponse(createCallOptions, Context.NONE);
Note
To stop language identification after it has started, use the updateTranscription API and explicitly set the language you want to use for the transcript. This disables automatic language detection and locks transcription to the specified language.
Connect to a Rooms call and provide transcription details
If you're connecting to an ACS room and want to use transcription, configure the transcription options as follows:
TranscriptionOptions transcriptionOptions = new TranscriptionOptions(
appConfig.getWebSocketUrl(),
TranscriptionTransport.WEBSOCKET,
"en-US",
false,
"your-endpoint-id-here" // speechRecognitionEndpointId
);
ConnectCallOptions connectCallOptions = new ConnectCallOptions(new RoomCallLocator("roomId"), appConfig.getCallBackUri())
.setCallIntelligenceOptions(
new CallIntelligenceOptions()
.setCognitiveServicesEndpoint(appConfig.getCognitiveServiceEndpoint())
)
.setTranscriptionOptions(transcriptionOptions);
ConnectCallResult connectCallResult = Objects.requireNonNull(client
.connectCallWithResponse(connectCallOptions)
.block())
.getValue();
Start Transcription
Once you're ready to start the transcription, you can make an explicit call to Call Automation to start transcribing the call.
//Option 1: Start transcription with options
StartTranscriptionOptions transcriptionOptions = new StartTranscriptionOptions()
.setOperationContext("startMediaStreamingContext");
client.getCallConnection(callConnectionId)
.getCallMedia()
.startTranscriptionWithResponse(transcriptionOptions, Context.NONE);
// Alternative: Start transcription without options
// client.getCallConnection(callConnectionId)
// .getCallMedia()
// .startTranscription();
Get mid-call summaries (Preview)
Enhance your call workflows with real-time summarization. By enabling summarization in your transcription options, ACS can automatically generate concise mid-call recaps, including decisions, action items, and key discussion points, without waiting for the call to end. This helps teams stay aligned and enables faster follow-ups during live conversations.
SummarizationOptions summarizationOptions = new SummarizationOptions()
.setEnableEndCallSummary(true)
.setLocale("en-US");
TranscriptionOptions transcriptionOptions = new TranscriptionOptions("en-ES")
.setTransportUrl(websocketUriHost)
.setSummarizationOptions(summarizationOptions)
.setLocales(locales);
AnswerCallOptions answerCallOptions = new AnswerCallOptions(data.getString("incomingCallContext"), callbackUri)
.setCallIntelligenceOptions(callIntelligenceOptions)
.setTranscriptionOptions(transcriptionOptions);
// Answer call request
Response<AnswerCallResult> answerCallResponse = client.answerCallWithResponse(answerCallOptions, Context.NONE);
Additional Headers:
The correlation ID and call connection ID are now included in the WebSocket headers for improved traceability: x-ms-call-correlation-id and x-ms-call-connection-id.
Receiving Transcription Stream
When transcription starts, your websocket receives the transcription metadata payload as the first packet.
{
"kind": "TranscriptionMetadata",
"transcriptionMetadata": {
"subscriptionId": "aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e",
"locale": "en-us",
"callConnectionId": "65c57654=f12c-4975-92a4-21668e61dd98",
"correlationId": "65c57654=f12c-4975-92a4-21668e61dd98"
}
}
Receiving Transcription data
After the metadata, the next packets your web socket receives will be TranscriptionData for the transcribed audio.
{
"kind": "TranscriptionData",
"transcriptionData": {
"text": "Testing transcription.",
"format": "display",
"confidence": 0.695223331451416,
"offset": 2516998782481234400,
"words": [
{
"text": "testing",
"offset": 2516998782481234400
},
{
"text": "testing",
"offset": 2516998782481234400
}
],
"participantRawID": "8:acs:",
"resultStatus": "Final"
}
}
Receiving Transcription Stream with AI capabilities enabled (Preview)
When transcription is enabled during a call, Azure Communication Services emits metadata that describes the configuration and context of the transcription session. This includes details such as the locale, call connection ID, sentiment analysis settings, and PII redaction preferences. Developers can use this payload to verify transcription setup, audit configurations, or troubleshoot issues related to real-time transcription features enhanced by AI.
{
"kind": "TranscriptionMetadata",
"transcriptionMetadata": {
"subscriptionId": "863b5e55-de0d-4fc3-8e58-2d68e976b5ad",
"locale": "en-US",
"callConnectionId": "02009180-9dc2-429b-a3eb-d544b7b6a0e1",
"correlationId": "62c8215b-5276-4d3c-bb6d-06a1b114651b",
"speechModelEndpointId": null,
"locales": [],
"enableSentimentAnalysis": true,
"piiRedactionOptions": {
"enable": true,
"redactionType": "MaskWithCharacter"
}
}
}
Receiving Transcription data with AI capabilities enabled (Preview)
After the initial metadata packet, your WebSocket connection will begin receiving TranscriptionData events for each segment of transcribed audio. These packets include the transcribed text, confidence score, timing information, and, if enabled, sentiment analysis and PII redaction. This data can be used to build real-time dashboards, trigger workflows, or analyze conversation dynamics during the call.
{
"kind": "TranscriptionData",
"transcriptionData": {
"text": "My date of birth is *********.",
"format": "display",
"confidence": 0.8726407289505005,
"offset": 309058340,
"duration": 31600000,
"words": [],
"participantRawID": "4:+917020276722",
"resultStatus": "Final",
"sentimentAnalysisResult": {
"sentiment": "neutral"
}
}
}
Handling transcription stream in the web socket server
package com.example;
import org.glassfish.tyrus.server.Server;
import java.io.BufferedReader;
import java.io.InputStreamReader;
public class App {
public static void main(String[] args) {
Server server = new Server("localhost", 8081, "/ws", null, WebSocketServer.class);
try {
server.start();
System.out.println("Web socket running on port 8081...");
System.out.println("wss://localhost:8081/ws/server");
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
reader.readLine();
} catch (Exception e) {
e.printStackTrace();
} finally {
server.stop();
}
}
}
Updates to your code for the websocket handler
package com.example;
import javax.websocket.OnMessage;
import javax.websocket.Session;
import javax.websocket.server.ServerEndpoint;
import com.azure.communication.callautomation.models.streaming.StreamingData;
import com.azure.communication.callautomation.models.streaming.StreamingDataParser;
import com.azure.communication.callautomation.models.streaming.media.AudioData;
import com.azure.communication.callautomation.models.streaming.media.AudioMetadata;
import com.azure.communication.callautomation.models.streaming.transcription.TranscriptionData;
import com.azure.communication.callautomation.models.streaming.transcription.TranscriptionMetadata;
import com.azure.communication.callautomation.models.streaming.transcription.Word;
@ServerEndpoint("/server")
public class WebSocketServer {
@OnMessage
public void onMessage(String message, Session session) {
StreamingData data = StreamingDataParser.parse(message);
if (data instanceof AudioMetadata) {
AudioMetadata audioMetaData = (AudioMetadata) data;
System.out.println("----------------------------------------------------------------");
System.out.println("SUBSCRIPTION ID: --> " + audioMetaData.getMediaSubscriptionId());
System.out.println("ENCODING: --> " + audioMetaData.getEncoding());
System.out.println("SAMPLE RATE: --> " + audioMetaData.getSampleRate());
System.out.println("CHANNELS: --> " + audioMetaData.getChannels());
System.out.println("LENGTH: --> " + audioMetaData.getLength());
System.out.println("----------------------------------------------------------------");
}
if (data instanceof AudioData) {
AudioData audioData = (AudioData) data;
System.out.println("----------------------------------------------------------------");
System.out.println("DATA: --> " + audioData.getData());
System.out.println("TIMESTAMP: --> " + audioData.getTimestamp());
System.out.println("IS SILENT: --> " + audioData.isSilent());
System.out.println("----------------------------------------------------------------");
}
if (data instanceof TranscriptionMetadata) {
TranscriptionMetadata transcriptionMetadata = (TranscriptionMetadata) data;
System.out.println("----------------------------------------------------------------");
System.out.println("TRANSCRIPTION SUBSCRIPTION ID: --> " + transcriptionMetadata.getTranscriptionSubscriptionId());
System.out.println("LOCALE: --> " + transcriptionMetadata.getLocale());
System.out.println("CALL CONNECTION ID: --> " + transcriptionMetadata.getCallConnectionId());
System.out.println("CORRELATION ID: --> " + transcriptionMetadata.getCorrelationId());
// Check for PII Redaction Options locale
if (transcriptionMetadata.getPiiRedactionOptions() != null &&
transcriptionMetadata.getPiiRedactionOptions().getLocale() != null) {
System.out.println("PII Redaction Locale: --> " + transcriptionMetadata.getPiiRedactionOptions().getLocale());
}
// Check for detected locales
if (transcriptionMetadata.getLocales() != null) {
System.out.println("Detected Locales: --> " + transcriptionMetadata.getLocales());
}
System.out.println("----------------------------------------------------------------");
}
if (data instanceof TranscriptionData) {
TranscriptionData transcriptionData = (TranscriptionData) data;
System.out.println("----------------------------------------------------------------");
System.out.println("TEXT: --> " + transcriptionData.getText());
System.out.println("FORMAT: --> " + transcriptionData.getFormat());
System.out.println("CONFIDENCE: --> " + transcriptionData.getConfidence());
System.out.println("OFFSET: --> " + transcriptionData.getOffset());
System.out.println("DURATION: --> " + transcriptionData.getDuration());
System.out.println("RESULT STATUS: --> " + transcriptionData.getResultStatus());
for (Word word : transcriptionData.getWords()) {
System.out.println("Text: --> " + word.getText());
System.out.println("Offset: --> " + word.getOffset());
System.out.println("Duration: --> " + word.getDuration());
}
System.out.println("SENTIMENT:-->" + transcriptionData.getSentimentAnalysisResult().getSentiment());
System.out.println("LANGUAGE IDENTIFIED:-->" + transcriptionData.getLanguageIdentified());
System.out.println("----------------------------------------------------------------");
}
}
}
Update Transcription
For situations where your application allows users to select their preferred language, you may also want to capture the transcription in that language. To do this, the Call Automation SDK allows you to update the transcription locale.
UpdateTranscriptionOptions transcriptionOptions = new UpdateTranscriptionOptions()
.setLocale(newLocale)
.setOperationContext("transcriptionContext")
.setSpeechRecognitionEndpointId("your-endpoint-id-here");
client.getCallConnection(callConnectionId)
.getCallMedia()
.updateTranscriptionWithResponse(transcriptionOptions, Context.NONE);
Stop Transcription
When your application needs to stop listening for the transcription, you can use the StopTranscription request to let Call Automation know to stop sending transcript data to your web socket.
// Option 1: Stop transcription with options
StopTranscriptionOptions stopTranscriptionOptions = new StopTranscriptionOptions()
.setOperationContext("stopTranscription");
client.getCallConnection(callConnectionId)
.getCallMedia()
.stopTranscriptionWithResponse(stopTranscriptionOptions, Context.NONE);
// Alternative: Stop transcription without options
// client.getCallConnection(callConnectionId)
// .getCallMedia()
// .stopTranscription();
Create a call and provide the transcription details
Define the TranscriptionOptions for ACS to specify when to start the transcription, the transcription locale, and the WebSocket connection for sending the transcript.
const transcriptionOptions = {
transportUrl: "",
transportType: "websocket",
locale: "en-US",
startTranscription: false,
speechRecognitionModelEndpointId: "YOUR_CUSTOM_SPEECH_RECOGNITION_MODEL_ID"
};
const options = {
callIntelligenceOptions: {
cognitiveServicesEndpoint: process.env.COGNITIVE_SERVICES_ENDPOINT
},
transcriptionOptions: transcriptionOptions
};
console.log("Placing outbound call...");
acsClient.createCall(callInvite, process.env.CALLBACK_URI + "/api/callbacks", options);
Sentiment Analysis (Preview)
Track the emotional tone of conversations in real time to support customer and agent interactions, and enable supervisors to intervene when necessary. Available in public preview through createCall, answerCall, and startTranscription.
Create a call with Sentiment Analysis enabled
const transcriptionOptions = {
transportUrl: "",
transportType: "websocket",
locale: "en-US",
startTranscription: false,
enableSentimentAnalysis: true,
speechRecognitionModelEndpointId: "YOUR_CUSTOM_SPEECH_RECOGNITION_MODEL_ID"
};
const options = {
callIntelligenceOptions: {
cognitiveServicesEndpoint: process.env.COGNITIVE_SERVICES_ENDPOINT
},
transcriptionOptions: transcriptionOptions
};
console.log("Placing outbound call...");
acsClient.createCall(callInvite, process.env.CALLBACK_URI + "/api/callbacks", options);
Answer a call with Sentiment Analysis enabled
const transcriptionOptions: TranscriptionOptions = {
transportUrl: transportUrl,
transportType: "websocket",
startTranscription: true,
enableSentimentAnalysis: true
};
const answerCallOptions: AnswerCallOptions = {
callIntelligenceOptions: {
cognitiveServicesEndpoint: process.env.COGNITIVE_SERVICES_ENDPOINT
},
transcriptionOptions: transcriptionOptions,
enableLoopbackAudio: true
};
await acsClient.answerCall(incomingCallContext, callbackUri, answerCallOptions);
PII Redaction (Preview)
Automatically identify and mask sensitive information, such as names, addresses, or identification numbers, to ensure privacy and regulatory compliance. Available in createCall, answerCall, and startTranscription.
Answer a call with PII Redaction enabled
const transcriptionOptions: TranscriptionOptions = {
transportUrl: transportUrl,
transportType: "websocket",
startTranscription: true,
piiRedactionOptions: {
enable: true,
redactionType: "maskWithCharacter"
}
};
const answerCallOptions: AnswerCallOptions = {
callIntelligenceOptions: {
cognitiveServicesEndpoint: process.env.COGNITIVE_SERVICES_ENDPOINT
},
transcriptionOptions: transcriptionOptions,
enableLoopbackAudio: true
};
await acsClient.answerCall(incomingCallContext, callbackUri, answerCallOptions);
Note
With PII redaction enabled, you receive only the redacted text.
Real-time language detection (Preview)
Automatically detect spoken languages to enable natural, human-like communication and eliminate manual language selection. Available in createCall, answerCall, and startTranscription.
Create a call with Real-time language detection enabled
const transcriptionOptions: TranscriptionOptions = {
transportUrl: transportUrl,
transportType: "websocket",
startTranscription: true,
locales: ["es-ES", "en-US"]
};
const createCallOptions: CreateCallOptions = {
callIntelligenceOptions: {
cognitiveServicesEndpoint: process.env.COGNITIVE_SERVICES_ENDPOINT
},
transcriptionOptions: transcriptionOptions,
operationContext: "CreatePSTNCallContext",
enableLoopbackAudio: true
};
Note
To stop language identification after it has started, use the updateTranscription API and explicitly set the language you want to use for the transcript. This disables automatic language detection and locks transcription to the specified language.
Connect to a Rooms call and provide transcription details
If you're connecting to an ACS room and want to use transcription, configure the transcription options as follows:
const transcriptionOptions = {
transportUri: "",
locale: "en-US",
transcriptionTransport: "websocket",
startTranscription: false,
speechRecognitionModelEndpointId: "YOUR_CUSTOM_SPEECH_RECOGNITION_MODEL_ID"
};
const callIntelligenceOptions = {
cognitiveServicesEndpoint: process.env.COGNITIVE_SERVICES_ENDPOINT
};
const connectCallOptions = {
callIntelligenceOptions: callIntelligenceOptions,
transcriptionOptions: transcriptionOptions
};
const callLocator = {
id: roomId,
kind: "roomCallLocator"
};
const connectResult = await client.connectCall(callLocator, callBackUri, connectCallOptions);
Start Transcription
Once you're ready to start the transcription, you can make an explicit call to Call Automation to start transcribing the call.
const startTranscriptionOptions = {
locale: "en-AU",
operationContext: "startTranscriptionContext"
};
// Start transcription with options
await callMedia.startTranscription(startTranscriptionOptions);
// Alternative: Start transcription without options
// await callMedia.startTranscription();
Get mid call summaries (Preview)
Enhance your call workflows with real-time summarization. By enabling summarization in your transcription options, ACS can automatically generate concise mid-call recaps—including decisions, action items, and key discussion points—without waiting for the call to end. This helps teams stay aligned and enables faster follow-ups during live conversations.
const transcriptionOptions: TranscriptionOptions = {
transportUrl: transportUrl,
transportType: "websocket",
startTranscription: true,
summarizationOptions: {
enableEndCallSummary: true,
locale: "es-ES"
}
};
const answerCallOptions: AnswerCallOptions = {
callIntelligenceOptions: {
cognitiveServicesEndpoint: process.env.COGNITIVE_SERVICES_ENDPOINT
},
transcriptionOptions: transcriptionOptions,
enableLoopbackAudio: true
};
await acsClient.answerCall(incomingCallContext, callbackUri, answerCallOptions);
Additional Headers:
The correlation ID and call connection ID are now included in the WebSocket headers x-ms-call-correlation-id and x-ms-call-connection-id for improved traceability.
Receiving Transcription Stream
When transcription starts, your WebSocket server receives the transcription metadata payload as the first packet.
{
"kind": "TranscriptionMetadata",
"transcriptionMetadata": {
"subscriptionId": "aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e",
"locale": "en-us",
"callConnectionId": "65c57654-f12c-4975-92a4-21668e61dd98",
"correlationId": "65c57654-f12c-4975-92a4-21668e61dd98"
}
}
Receiving Transcription Data
After the metadata, the next packets your WebSocket receives contain TranscriptionData for the transcribed audio.
{
"kind": "TranscriptionData",
"transcriptionData": {
"text": "Testing transcription.",
"format": "display",
"confidence": 0.695223331451416,
"offset": 2516998782481234400,
"words": [
{
"text": "testing",
"offset": 2516998782481234400
},
{
"text": "testing",
"offset": 2516998782481234400
}
],
"participantRawID": "8:acs:",
"resultStatus": "Final"
}
}
Receiving Transcription Stream with AI capabilities enabled (Preview)
When transcription is enabled during a call, Azure Communication Services emits metadata that describes the configuration and context of the transcription session. This includes details such as the locale, call connection ID, sentiment analysis settings, and PII redaction preferences. Developers can use this payload to verify transcription setup, audit configurations, or troubleshoot issues related to real-time transcription features enhanced by AI.
{
"kind": "TranscriptionMetadata",
"transcriptionMetadata": {
"subscriptionId": "863b5e55-de0d-4fc3-8e58-2d68e976b5ad",
"locale": "en-US",
"callConnectionId": "02009180-9dc2-429b-a3eb-d544b7b6a0e1",
"correlationId": "62c8215b-5276-4d3c-bb6d-06a1b114651b",
"speechModelEndpointId": null,
"locales": [],
"enableSentimentAnalysis": true,
"piiRedactionOptions": {
"enable": true,
"redactionType": "MaskWithCharacter"
}
}
}
Receiving Transcription data with AI capabilities enabled (Preview)
After the initial metadata packet, your WebSocket connection will begin receiving TranscriptionData events for each segment of transcribed audio. These packets include the transcribed text, confidence score, timing information, and, if enabled, sentiment analysis and PII redaction. This data can be used to build real-time dashboards, trigger workflows, or analyze conversation dynamics during the call.
{
"kind": "TranscriptionData",
"transcriptionData": {
"text": "My date of birth is *********.",
"format": "display",
"confidence": 0.8726407289505005,
"offset": 309058340,
"duration": 31600000,
"words": [],
"participantRawID": "4:+917020276722",
"resultStatus": "Final",
"sentimentAnalysisResult": {
"sentiment": "neutral"
}
}
}
Handling the transcription stream in the WebSocket server
import WebSocket from 'ws';
import { streamingData } from '@azure/communication-call-automation/src/util/streamingDataParser';
const wss = new WebSocket.Server({ port: 8081 });
wss.on('connection', (ws) => {
console.log('Client connected');
ws.on('message', (packetData) => {
const decoder = new TextDecoder();
const stringJson = decoder.decode(packetData);
console.log("STRING JSON =>", stringJson);
const response = streamingData(packetData);
const kind = response?.kind;
if (kind === "TranscriptionMetadata") {
console.log("--------------------------------------------");
console.log("Transcription Metadata");
console.log("CALL CONNECTION ID: -->", response.callConnectionId);
console.log("CORRELATION ID: -->", response.correlationId);
console.log("LOCALE: -->", response.locale);
console.log("SUBSCRIPTION ID: -->", response.subscriptionId);
console.log("SPEECH MODEL ENDPOINT: -->", response.speechRecognitionModelEndpointId);
console.log("IS SENTIMENT ANALYSIS ENABLED: -->", response.enableSentimentAnalysis);
if (response.piiRedactionOptions) {
console.log("PII REDACTION ENABLED: -->", response.piiRedactionOptions.enable);
console.log("PII REDACTION TYPE: -->", response.piiRedactionOptions.redactionType);
}
if (response.locales) {
response.locales.forEach((language) => {
console.log("LOCALE DETECTED: -->", language);
});
}
console.log("--------------------------------------------");
} else if (kind === "TranscriptionData") {
console.log("--------------------------------------------");
console.log("Transcription Data");
console.log("TEXT: -->", response.text);
console.log("FORMAT: -->", response.format);
console.log("CONFIDENCE: -->", response.confidence);
console.log("OFFSET IN TICKS: -->", response.offsetInTicks);
console.log("DURATION IN TICKS: -->", response.durationInTicks);
console.log("RESULT STATE: -->", response.resultState);
if (response.participant?.phoneNumber) {
console.log("PARTICIPANT PHONE NUMBER: -->", response.participant.phoneNumber);
}
if (response.participant?.communicationUserId) {
console.log("PARTICIPANT USER ID: -->", response.participant.communicationUserId);
}
if (response.words?.length) {
response.words.forEach((word) => {
console.log("WORD TEXT: -->", word.text);
console.log("WORD DURATION IN TICKS: -->", word.durationInTicks);
console.log("WORD OFFSET IN TICKS: -->", word.offsetInTicks);
});
}
if (response.sentimentAnalysisResult) {
console.log("SENTIMENT: -->", response.sentimentAnalysisResult.sentiment);
}
console.log("LANGUAGE IDENTIFIED: -->", response.languageIdentified);
console.log("--------------------------------------------");
}
});
ws.on('close', () => {
console.log('Client disconnected');
});
});
console.log('WebSocket server running on port 8081');
Update Transcription
For situations where your application allows users to select their preferred language, you may also want to capture the transcription in that language. To do this, the Call Automation SDK allows you to update the transcription locale.
async function updateTranscriptionAsync() {
const options: UpdateTranscriptionOptions = {
operationContext: "updateTranscriptionContext",
speechRecognitionModelEndpointId: "YOUR_CUSTOM_SPEECH_RECOGNITION_MODEL_ID"
};
await acsClient
.getCallConnection(callConnectionId)
.getCallMedia()
.updateTranscription("en-au", options);
}
Stop Transcription
When your application needs to stop listening for the transcription, you can use the StopTranscription request to let Call Automation know to stop sending transcript data to your WebSocket.
const stopTranscriptionOptions = {
operationContext: "stopTranscriptionContext"
};
// Stop transcription with options
await callMedia.stopTranscription(stopTranscriptionOptions);
// Alternative: Stop transcription without options
// await callMedia.stopTranscription();
Create a call and provide the transcription details
Define the TranscriptionOptions for ACS to specify when to start the transcription, the transcription locale, and the WebSocket connection for sending the transcript.
transcription_options = TranscriptionOptions(
transport_url="WEBSOCKET_URI_HOST",
transport_type=TranscriptionTransportType.WEBSOCKET,
locale="en-US",
start_transcription=False,
#Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
speech_recognition_model_endpoint_id="YourCustomSpeechRecognitionModelEndpointId"
)
call_connection_properties = call_automation_client.create_call(
target_participant,
CALLBACK_EVENTS_URI,
cognitive_services_endpoint=COGNITIVE_SERVICES_ENDPOINT,
source_caller_id_number=source_caller,
transcription=transcription_options
)
Sentiment Analysis (Preview)
Track the emotional tone of conversations in real time to support customer and agent interactions, and enable supervisors to intervene when necessary. Available in public preview through createCall, answerCall, and startTranscription.
Create a call with Sentiment Analysis enabled
transcription_options = TranscriptionOptions(
transport_url=self.transport_url,
transport_type=StreamingTransportType.WEBSOCKET,
locale="en-US",
start_transcription=False,
enable_sentiment_analysis=True
)
call_connection_properties = await call_automation_client.create_call(
target_participant=[target_participant],
callback_url=CALLBACK_EVENTS_URI,
cognitive_services_endpoint=COGNITIVE_SERVICES_ENDPOINT,
source_caller_id_number=source_caller,
transcription=transcription_options
)
Answer a call with Sentiment Analysis enabled
transcription_options = TranscriptionOptions(
transport_url=self.transport_url,
transport_type=StreamingTransportType.WEBSOCKET,
locale="en-US",
start_transcription=False,
enable_sentiment_analysis=True
)
answer_call_result = await call_automation_client.answer_call(
incoming_call_context=incoming_call_context,
transcription=transcription_options,
cognitive_services_endpoint=COGNITIVE_SERVICES_ENDPOINT,
callback_url=callback_uri,
enable_loopback_audio=True,
operation_context="answerCallContext"
)
PII Redaction (Preview)
Automatically identify and mask sensitive information, such as names, addresses, or identification numbers, to ensure privacy and regulatory compliance. Available in createCall, answerCall, and startTranscription.
Answer a call with PII Redaction enabled
transcription_options = TranscriptionOptions(
transport_url=self.transport_url,
transport_type=StreamingTransportType.WEBSOCKET,
locale=["en-US", "es-ES"],
start_transcription=False,
pii_redaction=PiiRedactionOptions(
enable=True,
redaction_type=RedactionType.MASK_WITH_CHARACTER
)
)
answer_call_result = await call_automation_client.answer_call(
incoming_call_context=incoming_call_context,
transcription=transcription_options,
cognitive_services_endpoint=COGNITIVE_SERVICES_ENDPOINT,
callback_url=callback_uri,
enable_loopback_audio=True,
operation_context="answerCallContext"
)
Note
With PII redaction enabled, you receive only the redacted text.
Real-time language detection (Preview)
Automatically detect spoken languages to enable natural, human-like communication and eliminate manual language selection. Available in createCall, answerCall, and startTranscription.
Create a call with Real-time language detection enabled
transcription_options = TranscriptionOptions(
transport_url=self.transport_url,
transport_type=StreamingTransportType.WEBSOCKET,
locale=["en-US", "es-ES","hi-IN"],
start_transcription=False,
enable_sentiment_analysis=True,
)
call_connection_properties = await call_automation_client.create_call(
target_participant=[target_participant],
callback_url=CALLBACK_EVENTS_URI,
cognitive_services_endpoint=COGNITIVE_SERVICES_ENDPOINT,
source_caller_id_number=source_caller,
transcription=transcription_options
)
Note
To stop language identification after it has started, use the updateTranscription API and explicitly set the language you want to use for the transcript. This disables automatic language detection and locks transcription to the specified language.
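As a sketch of what that looks like in code, here's a hypothetical helper that wraps the update_transcription call shape shown in the Update Transcription section later in this article. The helper name and operation context are illustrative; client is your existing CallAutomationClient.

```python
# Hypothetical helper: lock transcription to one explicit locale, which
# disables automatic language detection for the rest of the call.
# The update_transcription call shape mirrors the Update Transcription
# section of this article; everything else here is illustrative.
def lock_transcription_language(client, call_connection_id, locale="en-US"):
    client.get_call_connection(
        call_connection_id=call_connection_id
    ).update_transcription(
        locale=locale,  # a single explicit locale stops auto-detection
        operation_context="lockLanguageContext",
    )
```

Call it once detection has settled on the caller's language, for example from your TranscriptionData handler.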
Connect to a Rooms call and provide transcription details
If you're connecting to an ACS room and want to use transcription, configure the transcription options as follows:
transcription_options = TranscriptionOptions(
transport_url="",
transport_type=TranscriptionTransportType.WEBSOCKET,
locale="en-US",
start_transcription=False,
#Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
speech_recognition_model_endpoint_id = "YourCustomSpeechRecognitionModelEndpointId"
)
connect_result = client.connect_call(
room_id="roomid",
CALLBACK_EVENTS_URI,
cognitive_services_endpoint=COGNITIVE_SERVICES_ENDPOINT,
operation_context="connectCallContext",
transcription=transcription_options
)
Start Transcription
Once you're ready to start the transcription, you can make an explicit call to Call Automation to start transcribing the call.
# Start transcription without options
call_connection_client.start_transcription()
# Option 1: Start transcription with locale and operation context
# call_connection_client.start_transcription(locale="en-AU", operation_context="startTranscriptionContext")
# Option 2: Start transcription with operation context
# call_connection_client.start_transcription(operation_context="startTranscriptionContext")
Get mid call summaries (Preview)
Enhance your call workflows with real-time summarization. By enabling summarization in your transcription options, ACS can automatically generate concise mid-call recaps—including decisions, action items, and key discussion points—without waiting for the call to end. This helps teams stay aligned and enables faster follow-ups during live conversations.
transcription_options = TranscriptionOptions(
transport_url=self.transport_url,
transport_type=StreamingTransportType.WEBSOCKET,
locale="en-US",
start_transcription=False,
summarization=SummarizationOptions(
enable_end_call_summary=True,
locale="en-US"
)
)
answer_call_result = await call_automation_client.answer_call(
incoming_call_context=incoming_call_context,
transcription=transcription_options,
cognitive_services_endpoint=COGNITIVE_SERVICES_ENDPOINT,
callback_url=callback_uri,
enable_loopback_audio=True,
operation_context="answerCallContext"
)
await call_connection_client.summarize_call(
operation_context=self.operation_context,
operation_callback_url=self.operation_callback_url,
summarization=transcription_options.summarization
)
Additional Headers:
The correlation ID and call connection ID are now included in the WebSocket headers x-ms-call-correlation-id and x-ms-call-connection-id for improved traceability.
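If you want these IDs in your logs, read them from the handshake headers when the connection is accepted. A minimal sketch: the helper below is illustrative and not part of any SDK, and the header mapping is whatever your WebSocket server exposes for the incoming handshake.

```python
# Illustrative helper: pull the ACS tracing IDs out of the WebSocket
# handshake headers. The header names come from this article; the helper
# itself is an assumption, not an SDK API.
def get_acs_tracing_ids(headers):
    return {
        "correlation_id": headers.get("x-ms-call-correlation-id"),
        "call_connection_id": headers.get("x-ms-call-connection-id"),
    }

# Example with a plain dict standing in for the handshake headers:
ids = get_acs_tracing_ids({
    "x-ms-call-correlation-id": "62c8215b-5276-4d3c-bb6d-06a1b114651b",
    "x-ms-call-connection-id": "02009180-9dc2-429b-a3eb-d544b7b6a0e1",
})
print(ids["correlation_id"])
```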
Receiving Transcription Stream
When transcription starts, your WebSocket server receives the transcription metadata payload as the first packet.
{
"kind": "TranscriptionMetadata",
"transcriptionMetadata": {
"subscriptionId": "aaaa0a0a-bb1b-cc2c-dd3d-eeeeee4e4e4e",
"locale": "en-us",
"callConnectionId": "65c57654-f12c-4975-92a4-21668e61dd98",
"correlationId": "65c57654-f12c-4975-92a4-21668e61dd98"
}
}
Receiving Transcription Data
After the metadata, the next packets your WebSocket receives contain TranscriptionData for the transcribed audio.
{
"kind": "TranscriptionData",
"transcriptionData": {
"text": "Testing transcription.",
"format": "display",
"confidence": 0.695223331451416,
"offset": 2516998782481234400,
"words": [
{
"text": "testing",
"offset": 2516998782481234400
},
{
"text": "testing",
"offset": 2516998782481234400
}
],
"participantRawID": "8:acs:",
"resultStatus": "Final"
}
}
Receiving Transcription Stream with AI capabilities enabled (Preview)
When transcription is enabled during a call, Azure Communication Services emits metadata that describes the configuration and context of the transcription session. This includes details such as the locale, call connection ID, sentiment analysis settings, and PII redaction preferences. Developers can use this payload to verify transcription setup, audit configurations, or troubleshoot issues related to real-time transcription features enhanced by AI.
{
"kind": "TranscriptionMetadata",
"transcriptionMetadata": {
"subscriptionId": "863b5e55-de0d-4fc3-8e58-2d68e976b5ad",
"locale": "en-US",
"callConnectionId": "02009180-9dc2-429b-a3eb-d544b7b6a0e1",
"correlationId": "62c8215b-5276-4d3c-bb6d-06a1b114651b",
"speechModelEndpointId": null,
"locales": [],
"enableSentimentAnalysis": true,
"piiRedactionOptions": {
"enable": true,
"redactionType": "MaskWithCharacter"
}
}
}
Receiving Transcription data with AI capabilities enabled (Preview)
After the initial metadata packet, your WebSocket connection will begin receiving TranscriptionData events for each segment of transcribed audio. These packets include the transcribed text, confidence score, timing information, and, if enabled, sentiment analysis and PII redaction. This data can be used to build real-time dashboards, trigger workflows, or analyze conversation dynamics during the call.
{
"kind": "TranscriptionData",
"transcriptionData": {
"text": "My date of birth is *********.",
"format": "display",
"confidence": 0.8726407289505005,
"offset": 309058340,
"duration": 31600000,
"words": [],
"participantRawID": "4:+917020276722",
"resultStatus": "Final",
"sentimentAnalysisResult": {
"sentiment": "neutral"
}
}
}
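The offset and duration values in these payloads are tick counts in 100-nanosecond units, matching the offsetInTicks and durationInTicks names used by the SDK parsers. A minimal conversion sketch:

```python
# Convert ACS tick values (100-nanosecond units) to seconds.
TICKS_PER_SECOND = 10_000_000

def ticks_to_seconds(ticks):
    return ticks / TICKS_PER_SECOND

# The sample payload above reports a duration of 31600000 ticks:
print(ticks_to_seconds(31_600_000))  # 3.16 seconds
```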
Handling the transcription stream in the WebSocket server
import asyncio
import json
import websockets
from azure.communication.callautomation._shared.models import identifier_from_raw_id
async def handle_client(websocket, path):
print("Client connected")
try:
async for message in websocket:
json_object = json.loads(message)
kind = json_object['kind']
if kind == 'TranscriptionMetadata':
print("Transcription metadata")
print("-------------------------")
print("Subscription ID:", json_object['transcriptionMetadata']['subscriptionId'])
print("Locale:", json_object['transcriptionMetadata']['locale'])
print("Call Connection ID:", json_object['transcriptionMetadata']['callConnectionId'])
print("Correlation ID:", json_object['transcriptionMetadata']['correlationId'])
print("Locales:", json_object['transcriptionMetadata']['locales'])
print("PII Redaction Options:", json_object['transcriptionMetadata']['piiRedactionOptions'])
if kind == 'TranscriptionData':
participant = identifier_from_raw_id(json_object['transcriptionData']['participantRawID'])
word_data_list = json_object['transcriptionData']['words']
print("Transcription data")
print("-------------------------")
print("Text:", json_object['transcriptionData']['text'])
print("Format:", json_object['transcriptionData']['format'])
print("Confidence:", json_object['transcriptionData']['confidence'])
print("Offset:", json_object['transcriptionData']['offset'])
print("Duration:", json_object['transcriptionData']['duration'])
print("Participant:", participant.raw_id)
print("Result Status:", json_object['transcriptionData']['resultStatus'])
print("Sentiment Analysis Result:", json_object['transcriptionData']['sentimentAnalysisResult'])
for word in word_data_list:
print("Word:", word['text'])
print("Offset:", word['offset'])
print("Duration:", word['duration'])
except websockets.exceptions.ConnectionClosedOK:
print("Client disconnected")
except websockets.exceptions.ConnectionClosedError as e:
print("Connection closed with error: %s", e)
except Exception as e:
print("Unexpected error: %s", e)
start_server = websockets.serve(handle_client, "localhost", 8081)
print('WebSocket server running on port 8081')
asyncio.get_event_loop().run_until_complete(start_server)
asyncio.get_event_loop().run_forever()
Update Transcription
For situations where your application allows users to select their preferred language, you may also want to capture the transcription in that language. To do this, the Call Automation SDK allows you to update the transcription locale.
await call_automation_client.get_call_connection(
call_connection_id=call_connection_id
).update_transcription(
operation_context="UpdateTranscriptionContext",
locale="en-au",
#Only add the SpeechRecognitionModelEndpointId if you have a custom speech model you would like to use
speech_recognition_model_endpoint_id = "YourCustomSpeechRecognitionModelEndpointId"
)
Stop Transcription
When your application needs to stop listening for the transcription, you can use the StopTranscription request to let Call Automation know to stop sending transcript data to your WebSocket.
# Stop transcription without options
call_connection_client.stop_transcription()
# Alternative: Stop transcription with operation context
# call_connection_client.stop_transcription(operation_context="stopTranscriptionContext")
Event codes
Event | Code | Subcode | Message |
---|---|---|---|
TranscriptionStarted | 200 | 0 | Action completed successfully. |
TranscriptionStopped | 200 | 0 | Action completed successfully. |
TranscriptionUpdated | 200 | 0 | Action completed successfully. |
TranscriptionFailed | 400 | 8581 | Action failed, StreamUrl isn't valid. |
TranscriptionFailed | 400 | 8565 | Action failed due to a bad request to Cognitive Services. Check your input parameters. |
TranscriptionFailed | 400 | 8565 | Action failed due to a request to Cognitive Services timing out. Try again later or check for any issues with the service. |
TranscriptionFailed | 400 | 8605 | Custom speech recognition model for Transcription is not supported. |
TranscriptionFailed | 400 | 8523 | Invalid Request, locale is missing. |
TranscriptionFailed | 400 | 8523 | Invalid Request, only locales that contain region information are supported. |
TranscriptionFailed | 405 | 8520 | Transcription functionality is not supported at this time. |
TranscriptionFailed | 405 | 8520 | UpdateTranscription is not supported for connection created with Connect interface. |
TranscriptionFailed | 400 | 8528 | Action is invalid, call already terminated. |
TranscriptionFailed | 405 | 8520 | Update transcription functionality is not supported at this time. |
TranscriptionFailed | 405 | 8522 | Request not allowed when Transcription url not set during call setup. |
TranscriptionFailed | 405 | 8522 | Request not allowed when Cognitive Service Configuration not set during call setup. |
TranscriptionFailed | 400 | 8501 | Action is invalid when call is not in Established state. |
TranscriptionFailed | 401 | 8565 | Action failed due to a Cognitive Services authentication error. Check your authorization input and ensure it's correct. |
TranscriptionFailed | 403 | 8565 | Action failed due to a forbidden request to Cognitive Services. Check your subscription status and ensure it's active. |
TranscriptionFailed | 429 | 8565 | Action failed, requests exceeded the number of allowed concurrent requests for the cognitive services subscription. |
TranscriptionFailed | 500 | 8578 | Action failed, not able to establish WebSocket connection. |
TranscriptionFailed | 500 | 8580 | Action failed, transcription service was shut down. |
TranscriptionFailed | 500 | 8579 | Action failed, transcription was canceled. |
TranscriptionFailed | 500 | 9999 | Unknown internal server error. |
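The codes in this table can drive simple diagnostics in your callback handler. A sketch, assuming the failure event JSON carries a resultInformation object with code and subCode fields, which is the general shape of Call Automation failure events; the helper name and messages are illustrative.

```python
# Illustrative helper: map a TranscriptionFailed callback event to a
# human-readable diagnosis using the code/subcode table above.
# Assumes the event JSON carries data.resultInformation.{code, subCode}.
def describe_transcription_failure(event):
    info = event.get("data", {}).get("resultInformation", {})
    code, subcode = info.get("code"), info.get("subCode")
    if code == 401 and subcode == 8565:
        return "Cognitive Services authentication error: check your key and endpoint."
    if code == 429 and subcode == 8565:
        return "Concurrent request limit reached for the Cognitive Services subscription."
    if code == 500 and subcode == 8578:
        return "WebSocket connection could not be established: verify your transport URL."
    return f"Transcription failed with code {code}, subcode {subcode}."
```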
Known issues
- For 1:1 calls with ACS users using Client SDKs, startTranscription = True isn't currently supported.