Does live chat avatar synthesis support WordBoundary events?

Mindaugas Giedraitis 0 Reputation points

I was trying to set up the WordBoundary event callback for my live chat avatar synthesis, but the callback is never invoked: the avatar speaks in the front end, but I get no events.

That brings me to the question: are these events even supported for live chat avatars?

If so, what am I doing wrong?

P.S. The voice I am using is "en-US-JennyMultilingualNeural", and here's the code that sets up the synthesizer and its connection objects.

def _set_up_speech_synthesizer(
        self,
        connection: AvatarConnection,
        session_description: RTCConnectionDescription,
    ) -> None:
        speech_config = speechsdk.SpeechConfig(
            # self.speech_wss_endpoint_tts = "wss://westeurope.tts.speech.microsoft.com/cognitiveservices/websocket/v1?enableTalkingAvatar=true"
            endpoint=f"{self.speech_wss_endpoint_tts}"
        )

        speech_config.authorization_token = connection.speech_token
        # Required for WordBoundary event sentences.
        speech_config.set_property(
            property_id=speechsdk.PropertyId.SpeechServiceResponse_RequestWordBoundary,
            value="true",
        )

        speech_synthesizer = speechsdk.SpeechSynthesizer(
            speech_config=speech_config,
        )

        def speech_synthesizer_word_boundary_callback(
            event: speechsdk.SpeechSynthesisWordBoundaryEventArgs,
        ):
            print("WordBoundary event:")
            print("\tBoundaryType: {}".format(event.boundary_type))
            print("\tAudioOffset: {}ms".format((event.audio_offset + 5000) / 10000))
            print("\tDuration: {}".format(event.duration))
            print("\tText: {}".format(event.text))
            print("\tTextOffset: {}".format(event.text_offset))
            print("\tWordLength: {}".format(event.word_length))

        speech_synthesizer.synthesis_word_boundary.connect(speech_synthesizer_word_boundary_callback)
        print(f"synthesis_word_boundary callback connected: {speech_synthesizer.synthesis_word_boundary.is_connected()}")

        connection.speech_synthesizer = speech_synthesizer
        ice_token_obj = json.loads(connection.ice_token)
        avatar_config = self._create_avatar_config(
            session_description=session_description,
            url=ice_token_obj["Urls"][0],
            username=ice_token_obj["Username"],
            password=ice_token_obj["Password"],
        )
        speech_synthesizer_connection = speechsdk.Connection.from_speech_synthesizer(
            connection.speech_synthesizer
        )
        speech_synthesizer_connection.connected.connect(
            lambda evt: print("TTS Avatar service connected.")
        )

        def tts_disconnected_callback(event: speechsdk.ConnectionEventArgs):
            print("TTS Avatar service disconnected.")
            connection.speech_synthesizer_connection = None
            connection.speech_synthesizer_connected = False

        speech_synthesizer_connection.disconnected.connect(tts_disconnected_callback)
        speech_synthesizer_connection.set_message_property(
            "speech.config", "context", json.dumps(avatar_config)
        )
        connection.speech_synthesizer_connection = speech_synthesizer_connection
        connection.speech_synthesizer_connected = True

        speech_sythesis_result = connection.speech_synthesizer.speak_text_async(
            ""
        ).get()
        if speech_sythesis_result is None:
            raise Exception(
                f"Speech synthesis result is None for connection {connection.connection_id}"
            )
        print(f"Result id for avatar connection: {speech_sythesis_result.result_id}")
        if speech_sythesis_result.reason == speechsdk.ResultReason.Canceled:
            cancellation_details = speech_sythesis_result.cancellation_details
            print(f"Speech synthesis canceled: {cancellation_details.reason}")
            if cancellation_details.reason == speechsdk.CancellationReason.Error:
                connection.status = AvatarConnectionStatus.FAILED
                print(f"Error details: {cancellation_details.error_details}")
                raise Exception(cancellation_details.error_details)
        turn_start_message = (
            connection.speech_synthesizer.properties.get_property_by_name(
                "SpeechSDKInternal-ExtraTurnStartMessage"
            )
        )
        if not turn_start_message:
            raise Exception(
                f"Turn start message is empty for connection {connection.connection_id}"
            )
        try:
            connection.speech_synthesizer_remote_sdp = json.loads(turn_start_message)[
                "webrtc"
            ]["connectionString"]
        except Exception as e:
            print(
                f"Error parsing turn start message: {turn_start_message}, {type(turn_start_message)}"
            )
            raise Exception(f"Error parsing turn start message: {e}") from e

Amira Bedhiafi 36,716 Reputation points Volunteer Moderator

2025-08-08T16:06:49.5833333+00:00

Hello !

Thank you for posting on Microsoft Learn.

Have you checked whether your voice (en-US-JennyMultilingualNeural) supports word boundary events? Some voices might have limitations or require specific configuration to trigger these events.

Also make sure that speak_text_async() is called only after the synthesizer is fully connected and operational, because the synthesis_word_boundary callback may not fire if no synthesis is active when the call is made.

It would also be helpful to test with a more basic configuration (such as without the avatar setup) to see if WordBoundary events are working in isolation.
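
A minimal sketch of such an isolated check, with no avatar configuration at all. The names ticks_to_ms, describe_word_boundary and run_isolated_check are hypothetical helpers, the key and region are placeholders you would supply, and the conversion assumes audio_offset is reported in 100-nanosecond ticks:

```python
from datetime import timedelta
from types import SimpleNamespace


def ticks_to_ms(ticks: int) -> float:
    """Convert an audio offset in 100-nanosecond ticks to milliseconds."""
    return ticks / 10_000


def describe_word_boundary(event) -> str:
    """Format the fields of a WordBoundary event on one line."""
    return (
        f"{event.boundary_type} text={event.text!r} "
        f"audio_offset={ticks_to_ms(event.audio_offset):.1f}ms "
        f"duration={event.duration} "
        f"text_offset={event.text_offset} word_length={event.word_length}"
    )


def run_isolated_check(key: str, region: str, text: str) -> int:
    """Synthesize `text` without avatar setup; return the WordBoundary event count."""
    # Imported lazily so the helpers above work without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_synthesis_voice_name = "en-US-JennyMultilingualNeural"
    # audio_config=None keeps the audio in the result instead of playing it.
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config, audio_config=None
    )

    events = []

    def on_word_boundary(event):
        events.append(event)
        print(describe_word_boundary(event))

    synthesizer.synthesis_word_boundary.connect(on_word_boundary)

    result = synthesizer.speak_text_async(text).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        raise RuntimeError(f"Synthesis did not complete: {result.reason}")
    return len(events)


# Stand-in event (not produced by the SDK) to show the formatting helper.
fake_event = SimpleNamespace(
    boundary_type="Word",
    text="Hello",
    audio_offset=500_000,
    duration=timedelta(milliseconds=300),
    text_offset=0,
    word_length=5,
)
print(describe_word_boundary(fake_event))
```

If run_isolated_check returns a non-zero count for the same voice, the voice and callback wiring are fine and the difference most likely lies in the avatar-specific setup.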

Mindaugas Giedraitis 0 Reputation points

2025-08-08T16:36:34.04+00:00

Yes, I did (here). I've also tried "en-US-NovaTurboMultilingualNeural" just to make sure that it's not the voice. Both of these support what I need according to the docs.

If they indeed require some specific configuration, I've yet to find that mentioned in any documentation page. The example I was using is here.

I can get through the whole conversation with the avatar, but I've never received a single WordBoundary event.

Ravada Shivaprasad 1,115 Reputation points Microsoft External Staff Moderator

2025-08-16T03:21:20.8966667+00:00

Hi Mindaugas Giedraitis

Based on your setup and the voice you're using (en-US-NovaTurboMultilingualNeural), the WordBoundary event not firing is a known limitation of the local Neural Text-to-Speech (TTS) container. Although this voice supports WordBoundary events, they are not triggered when synthesis is performed locally via the container.

To receive WordBoundary events reliably, you must use the Azure-hosted TTS service instead. This can be done by configuring your synthesizer with an Azure subscription key and region, which ensures full event support including WordBoundary. For example, using SpeechConfig.FromSubscription("YourSubscriptionKey", "YourRegion") (the C# form; in Python, speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")) will allow the SDK to connect to the cloud service where WordBoundary events are properly emitted.

This behavior has been confirmed across multiple SDK versions, including 1.30 and 1.36.0. If your application depends on these events for features like avatar lip-sync or real-time text highlighting, switching to the Azure endpoint is the recommended solution. Additionally, enabling SDK logging can help verify whether the event is being suppressed or misconfigured. Let me know if you'd like help refactoring your code or setting up a test using the Azure endpoint.

Hope it helps!

Thank you

Mindaugas Giedraitis 0 Reputation points

2025-08-25T07:46:14.8+00:00

Hi, Ravada Shivaprasad,

Thank you for your answer. The thing is, I am using an Azure endpoint (not a container). To be precise, this one:

speech_config = speechsdk.SpeechConfig( endpoint="wss://westeurope.tts.speech.microsoft.com/cognitiveservices/websocket/v1?enableTalkingAvatar=true" )

I guess the problem could be that I am creating the config object from an endpoint rather than from a region and key (or token) pair. However, that approach would pose a different problem: there's no way to pass the "enableTalkingAvatar=true" parameter, which is needed for the avatar video synthesis to work.
Perhaps you know a workaround for that?
Otherwise, I'll consider avatar video synthesis and WordBoundary events incompatible for now.

Edited: endpoint value was not copied in.
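
For reference, a minimal sketch of keeping the region configurable while still carrying the query parameter, by building the endpoint string from the region. build_avatar_tts_endpoint is a hypothetical helper, and the URL pattern is simply generalized from the westeurope endpoint above; that other regions follow the same pattern is an assumption. This only shows how the endpoint can be assembled and does not by itself make WordBoundary events fire:

```python
from urllib.parse import urlencode


def build_avatar_tts_endpoint(region: str, enable_talking_avatar: bool = True) -> str:
    """Build the TTS websocket endpoint for a region (pattern assumed from westeurope)."""
    base = f"wss://{region}.tts.speech.microsoft.com/cognitiveservices/websocket/v1"
    if not enable_talking_avatar:
        return base
    # The avatar flag travels as a query parameter on the endpoint URL.
    return f"{base}?{urlencode({'enableTalkingAvatar': 'true'})}"


endpoint = build_avatar_tts_endpoint("westeurope")
print(endpoint)
```

The resulting string would be passed as endpoint= to speechsdk.SpeechConfig, exactly as in the snippet above, with the authorization token set afterwards as in the original code.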
