How to use SSML

ForrestGumb edited this page Aug 22, 2024 · 7 revisions

With SSML, developers can control aspects of the speech synthesis output such as rate, pitch, volume, other prosody settings, and pronunciation. For non-developers, the audio content generation tool provides a GUI for adjusting the speech synthesis output.

For the full SSML reference, see the SSML document. This page provides additional samples of the SSML features supported by Azure TTS.

Use SSML audio

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='zh-CN'><voice xml:lang='zh-CN' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (zh-CN, XiaoxiaoNeural)'>您好晓晓!<audio src='https://file-examples.com/storage/feb8f98f1d627c0dc94b8cf/2017/11/file_example_MP3_700KB.mp3'>This is fallback audio.</audio></voice></speak>
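
Because the whole SSML document is usually sent as a single string, a stray quote or an unclosed tag is easy to miss. The following is a minimal Python sketch, not part of the Azure SDK, that checks an SSML string is well-formed XML and lists each audio element together with the text inside it. In the SSML specification, that inner text is the alternate content rendered when the audio file cannot be played. The helper name audio_fallbacks and the example.com URL are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Default namespace used by the <speak> element in the samples on this page.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def audio_fallbacks(ssml: str) -> list:
    """Return (src, fallback text) pairs for every <audio> element.

    Raises xml.etree.ElementTree.ParseError if the string is not
    well-formed XML. This is a local check only; it does not call any service.
    """
    root = ET.fromstring(ssml)
    pairs = []
    for audio in root.iter(f"{{{SSML_NS}}}audio"):
        pairs.append((audio.get("src", ""), (audio.text or "").strip()))
    return pairs


# Placeholder URL; replace it with a reachable audio file.
ssml = (
    "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>"
    "<voice name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)'>"
    "Hello!"
    "<audio src='https://example.com/clip.mp3'>This is fallback audio.</audio>"
    "</voice></speak>"
)

for src, fallback in audio_fallbacks(ssml):
    print(f"audio src={src!r} fallback={fallback!r}")
```

A check like this only confirms that the XML parses; whether the service accepts the audio format or can reach the URL still has to be verified against the service itself.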

Use background audio with Neural voice

  • Example: <speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-us'><mstts:backgroundaudio src='https://file-examples.com/storage/feb8f98f1d627c0dc94b8cf/2017/11/file_example_MP3_700KB.mp3' volume='0.7' fadein='3000' fadeout='4000'/><voice name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)' >Hi, this is a demo of using background audio</voice></speak>
  • The audio specified with the <mstts:backgroundaudio> tag is mixed with the synthesized TTS audio.
  • The content of the voice tags is synthesized by TTS in sequence, while the background audio plays in parallel, mixed with the TTS audio.
  • fadein is a duration in milliseconds; the fade-in starts at the point where the TTS audio begins.
  • fadeout is a duration in milliseconds; the fade-out starts at the point where the TTS audio ends.
  • volume defaults to 1; each background sample is scaled to the original sample value * volume.
  • If the background audio is shorter than the TTS audio plus the fade-out, it loops; if it is longer, it stops when the fade-out finishes.
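
The rules in the list above can be sketched as a small gain calculation. This is an illustrative Python model, not the service's implementation: it assumes linear fades (the page does not state the fade curve), and the function names background_gain and loop_position are hypothetical.

```python
def background_gain(t_ms: float, tts_ms: float, volume: float = 1.0,
                    fadein_ms: float = 0.0, fadeout_ms: float = 0.0) -> float:
    """Gain applied to a background sample at time t_ms (0 = TTS start).

    Assumption: fades are linear. The background plays from the TTS start
    until the fade-out, which begins at the TTS end, has finished.
    """
    end_ms = tts_ms + fadeout_ms
    if t_ms < 0 or t_ms >= end_ms:
        return 0.0                      # background is silent outside this window
    gain = volume                       # scaled sample = original sample * volume
    if fadein_ms > 0 and t_ms < fadein_ms:
        gain *= t_ms / fadein_ms        # ramp up from the TTS start
    if fadeout_ms > 0 and t_ms > tts_ms:
        gain *= (end_ms - t_ms) / fadeout_ms  # ramp down after the TTS end
    return gain


def loop_position(t_ms: float, background_ms: float) -> float:
    """Position inside the background file when it loops to cover the window."""
    return t_ms % background_ms


# Values from the sample: volume 0.7, fadein 3000 ms, fadeout 4000 ms,
# with a hypothetical 10-second TTS output.
for t in (0, 1500, 5000, 12000, 14000):
    print(t, background_gain(t, tts_ms=10000, volume=0.7,
                             fadein_ms=3000, fadeout_ms=4000))
```

With these inputs the background stays at full volume (0.7) between the end of the fade-in and the end of the TTS audio, and a file shorter than the 14-second window would wrap around via loop_position.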

Use phoneme to change pronunciations

IPA phone set

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)'><phoneme alphabet="ipa" ph="ʃaʊˈmi">pecan</phoneme> is awesome!</voice></speak>

SAPI phone set

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)'>Hello, <phoneme alphabet='sapi' ph='jh iy 1 - n iy'>Jeanne</phoneme></voice></speak>

UPS phone set

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)'>Hello, <phoneme alphabet='ups' ph='jh i n i'>Jeanne</phoneme></voice></speak>
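
When phoneme tags are generated from code, the ph value and the display text both need XML escaping, since phone strings and names can contain quotes or ampersands. Below is a minimal Python sketch of such a helper. The function name phoneme is hypothetical, and the alphabet check covers only the three alphabets shown on this page.

```python
from xml.sax.saxutils import escape, quoteattr

# Alphabets demonstrated on this page; the service may accept others.
KNOWN_ALPHABETS = {"ipa", "sapi", "ups"}


def phoneme(text: str, ph: str, alphabet: str = "ipa") -> str:
    """Wrap text in an SSML <phoneme> element with escaped content."""
    if alphabet not in KNOWN_ALPHABETS:
        raise ValueError(f"unexpected alphabet: {alphabet!r}")
    # quoteattr adds the surrounding quotes and escapes the attribute value;
    # escape handles &, < and > in the element text.
    return (f"<phoneme alphabet={quoteattr(alphabet)} ph={quoteattr(ph)}>"
            f"{escape(text)}</phoneme>")


# Reproduces the SAPI sample above (with double-quoted attributes).
print("Hello, " + phoneme("Jeanne", "jh iy 1 - n iy", alphabet="sapi"))
```

The returned fragment still has to be placed inside the voice and speak elements, as in the samples above.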

For more detail, see the document on Pronunciation with SSML.