How to use SSML

With SSML, developers can control aspects of the speech synthesis output such as rate, pitch, volume, prosody, and pronunciation. For non-developers, the audio content generation tool provides a GUI for modifying the speech synthesis output.

For the full SSML reference, see the SSML document. The samples here show how to use some of the SSML features supported by Azure TTS.

Use SSML audio

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='zh-CN'><voice xml:lang='zh-CN' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (zh-CN, XiaoxiaoNeural)'>您好晓晓！<audio src='https://file-examples.com/storage/feb8f98f1d627c0dc94b8cf/2017/11/file_example_MP3_700KB.mp3'>This is fallback audio.</audio></voice></speak>
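
When the voice text, clip URL, or fallback text comes from user input, it can help to build the SSML in code so that special characters are escaped. The following is a minimal Python sketch using only the standard library; the helper name `build_audio_ssml` and its parameters are hypothetical and are not part of any Azure SDK.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_audio_ssml(voice_name, lang, text, audio_url, fallback_text):
    """Return SSML that speaks `text`, then plays the clip at `audio_url`.

    `fallback_text` is placed inside <audio>, as in the sample above.
    This is an illustrative helper, not an Azure API.
    """
    ssml = (
        f"<speak version='1.0' xmlns='{SSML_NS}' xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice_name)}>"
        f"{escape(text)}"
        f"<audio src={quoteattr(audio_url)}>{escape(fallback_text)}</audio>"
        "</voice></speak>"
    )
    # Raises xml.etree.ElementTree.ParseError if the markup is not well-formed.
    ET.fromstring(ssml)
    return ssml
```

The returned string can then be passed to whatever synthesis call you already use for SSML input.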

Use background audio with Neural voice

Example: <speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-us'><mstts:backgroundaudio src='https://file-examples.com/storage/feb8f98f1d627c0dc94b8cf/2017/11/file_example_MP3_700KB.mp3' volume='0.7' fadein='3000' fadeout='4000'/><voice name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)' >Hi, this is a demo of using background audio</voice></speak>
The audio specified with the <mstts:backgroundaudio> tag is mixed with the TTS-synthesized audio.
The content of the voice tags is synthesized by TTS in sequence, while the background audio plays in parallel, mixed with the TTS output.
The fadein value is in milliseconds; the fade-in starts at the point where the TTS audio begins.
The fadeout value is in milliseconds; the fade-out starts at the point where the TTS audio ends.
The default volume is 1; each background sample is scaled to the original sample value multiplied by volume.
If the background audio is shorter than the TTS audio plus the fade-out, it loops; if it is longer, it stops when the fade-out finishes.
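
The timing rules above can be sketched in Python as follows. This is only an illustration of the described behavior, not the service's actual mixer: the function names are hypothetical, the fades are assumed to be linear, and the audio is simplified to one sample per millisecond.

```python
def background_gain(t_ms, tts_ms, volume=1.0, fadein_ms=0, fadeout_ms=0):
    """Gain applied to the background sample at time t_ms (linear fades assumed)."""
    if t_ms < 0 or t_ms >= tts_ms + fadeout_ms:
        return 0.0  # the background stops once the fade-out has finished
    gain = volume
    if fadein_ms > 0 and t_ms < fadein_ms:
        gain *= t_ms / fadein_ms  # ramp up from the TTS begin point
    if fadeout_ms > 0 and t_ms >= tts_ms:
        gain *= 1.0 - (t_ms - tts_ms) / fadeout_ms  # ramp down from the TTS end point
    return gain


def background_track(bg_samples, tts_ms, volume=1.0, fadein_ms=0, fadeout_ms=0):
    """Return the scaled background samples, looping the clip if it is too short."""
    if not bg_samples:
        return []
    total_ms = tts_ms + fadeout_ms  # the background runs until the fade-out ends
    return [
        bg_samples[t % len(bg_samples)]
        * background_gain(t, tts_ms, volume, fadein_ms, fadeout_ms)
        for t in range(total_ms)
    ]
```

For example, a two-sample clip with `tts_ms=10` and `fadeout_ms=4` is looped to fill 14 milliseconds, and each sample is multiplied by `volume` and by the fade factor at that point in time.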

Use phoneme to change pronunciations

IPA phone set

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)'><phoneme alphabet="ipa" ph="ʃaʊˈmi">pecan</phoneme> is awesome!</voice></speak>

SAPI phone set

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)'>Hello, <phoneme alphabet='sapi' ph='jh iy 1 - n iy'>Jeanne</phoneme></voice></speak>

UPS phone set

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)'>Hello, <phoneme alphabet='ups' ph='jh i n i'>Jeanne</phoneme></voice></speak>
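
The three samples above differ only in the alphabet and ph attributes of the phoneme tag. As a rough sketch, a small Python helper can generate that tag and escape the attribute values; the name `phoneme` and the restriction to the three alphabets shown here are assumptions for illustration, not part of any Azure SDK.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Alphabets used in the samples above; other values are rejected in this sketch.
SUPPORTED_ALPHABETS = ("ipa", "sapi", "ups")


def phoneme(word, ph, alphabet="ipa"):
    """Wrap `word` in an SSML <phoneme> element with the given pronunciation."""
    if alphabet not in SUPPORTED_ALPHABETS:
        raise ValueError(f"unsupported alphabet: {alphabet!r}")
    fragment = (
        f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
        f"{escape(word)}</phoneme>"
    )
    ET.fromstring(fragment)  # raises ParseError if the fragment is not well-formed
    return fragment
```

The returned fragment is meant to be placed inside a voice element, for example `"Hello, " + phoneme("Jeanne", "jh iy 1 - n iy", alphabet="sapi")`.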

For more complete documentation, see Pronunciation with SSML.

Azure TTS: Empower every person and every organization on the planet to have a delightful digital voice!
Azure Custom Voice: Build your one-of-a-kind Custom Voice and close-to-human Neural TTS in the cloud and at the edge!

Azure Speech Document

Create Custom Neural Voice

Speech SDK

Azure Speech Containers
