How to use SSML

szhaomsft edited this page Dec 23, 2019 · 7 revisions

For the full SSML reference, see the SSML document. This page provides common samples of the SSML features supported by Azure TTS.

Use SSML audio

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='zh-CN'><voice xml:lang='zh-CN' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (zh-CN, XiaoxiaoNeural)'>您好晓晓!<audio src='https://file-examples.com/wp-content/uploads/2017/11/file_example_MP3_700KB.mp3'>This is fallback audio.</audio></voice></speak>
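
In the sample above, the text inside the audio element is the fallback content, which is spoken when the audio file cannot be loaded or played. A minimal Python sketch for checking that such an SSML string is well formed and for listing its audio sources with their fallback text is shown here. The helper name audio_fallbacks is hypothetical, and only the Python standard library is used; nothing is sent to the TTS service.

```python
import xml.etree.ElementTree as ET

# Default SSML namespace, as declared on the <speak> element in the sample.
SSML_NS = "http://www.w3.org/2001/10/synthesis"

ssml = (
    "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='zh-CN'>"
    "<voice name='Microsoft Server Speech Text to Speech Voice (zh-CN, XiaoxiaoNeural)'>"
    "您好晓晓!"
    "<audio src='https://file-examples.com/wp-content/uploads/2017/11/file_example_MP3_700KB.mp3'>"
    "This is fallback audio.</audio></voice></speak>"
)


def audio_fallbacks(ssml_text):
    """Return (src, fallback text) for every <audio> element (hypothetical helper).

    Raises xml.etree.ElementTree.ParseError if the SSML is not well-formed XML.
    """
    root = ET.fromstring(ssml_text)
    return [
        (audio.get("src"), (audio.text or "").strip())
        for audio in root.iter(f"{{{SSML_NS}}}audio")
    ]


for src, fallback in audio_fallbacks(ssml):
    print(src, "->", fallback)
```

Parsing the string locally catches unbalanced tags or quoting mistakes before the request is made.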

Use background audio with Neural voice

  • Example: <speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='en-us'><mstts:backgroundaudio src='https://file-examples.com/wp-content/uploads/2017/11/file_example_MP3_700KB.mp3' volume='0.7' fadein='3000' fadeout='4000'/><voice name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)' >Hi, this is a demo of using background audio</voice></speak>
  • The audio specified with the <mstts:backgroundaudio> tag is mixed with the TTS synthesized waves.
  • The content of the voice tags is synthesized by TTS in sequence. The background audio is mixed in and played in parallel with the TTS waves.
  • fadein: the fade-in starts at the point where TTS begins. The value is its duration in milliseconds.
  • fadeout: the fade-out starts at the point where TTS ends. The value is its duration in milliseconds.
  • volume: the default value is 1. Each background sample is scaled to its original value multiplied by volume.
  • If the background audio is shorter than the TTS output plus the fadeout, it loops; if it is longer, it stops when the fadeout finishes.
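
The timing rules in the list above can be sketched as a gain function over time. This is only an illustration of the described behavior, not the service's implementation: the function names background_gain and background_position are hypothetical, and the linear shape of the fade ramps is an assumption, since the page does not specify the curve.

```python
def background_gain(t_ms, tts_ms, volume=1.0, fadein_ms=0, fadeout_ms=0):
    """Gain applied to a background sample at time t_ms (hypothetical sketch).

    tts_ms is the length of the synthesized speech. The fade-in starts when
    TTS begins and the fade-out starts when TTS ends. Linear ramps are an
    assumption made for this example.
    """
    # Background audio is silent before TTS starts and after the fade-out finishes.
    if t_ms < 0 or t_ms >= tts_ms + fadeout_ms:
        return 0.0
    gain = volume  # scaled sample = original sample * volume
    if fadein_ms > 0 and t_ms < fadein_ms:
        gain *= t_ms / fadein_ms
    if fadeout_ms > 0 and t_ms >= tts_ms:
        gain *= 1.0 - (t_ms - tts_ms) / fadeout_ms
    return gain


def background_position(t_ms, background_ms):
    """Position inside the background clip, looping when the clip is too short."""
    return t_ms % background_ms


# Values from the example: volume='0.7' fadein='3000' fadeout='4000',
# with an assumed 10-second TTS output.
for t in (0, 1500, 5000, 12000, 14000):
    print(t, background_gain(t, 10000, volume=0.7, fadein_ms=3000, fadeout_ms=4000))
```

With these settings the background reaches its full scaled level 3 seconds into the speech and fades to silence 4 seconds after the speech ends.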

Use phoneme to change pronunciations

IPA phoneset

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)'><phoneme alphabet="ipa" ph="ʃaʊˈmi">pecan</phoneme> is awesome!</voice></speak>

SAPI phoneset

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)'>Hello, <phoneme alphabet='sapi' ph='jh iy 1 - n iy'>Jeanne</phoneme></voice></speak>

UPS phoneset

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)'>Hello, <phoneme alphabet='ups' ph='jh i n i'>Jeanne</phoneme></voice></speak>

https://microsoft.sharepoint.com/teams/SpeechOutputTeam/Shared%20Documents/Design%20Docs/UPSWhitePaper(3.2).doc?web=1

https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/speech-synthesis-markup#use-phonemes-to-improve-pronunciation
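
Phoneme strings often contain characters that are awkward to quote by hand, such as IPA stress marks or apostrophes. A minimal Python sketch that builds the phoneme samples above with the standard library, so that escaping and tag balancing are handled by the XML serializer, could look like this. The function phoneme_ssml and its parameters are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace of the xml: prefix


def phoneme_ssml(voice, lang, before, word, alphabet, ph, after=""):
    """Build an SSML string with one <phoneme> element (hypothetical helper).

    alphabet is 'ipa', 'sapi', or 'ups', matching the samples on this page.
    """
    # Serialize SSML elements in the default namespace, without a prefix.
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": lang},
    )
    voice_el = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": voice})
    voice_el.text = before  # text spoken before the overridden word
    phoneme = ET.SubElement(
        voice_el, f"{{{SSML_NS}}}phoneme", {"alphabet": alphabet, "ph": ph}
    )
    phoneme.text = word   # the written word whose pronunciation is overridden
    phoneme.tail = after  # text spoken after the overridden word
    return ET.tostring(speak, encoding="unicode")


print(
    phoneme_ssml(
        "Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)",
        "en-US",
        "Hello, ",
        "Jeanne",
        "sapi",
        "jh iy 1 - n iy",
    )
)
```

The same call works for the IPA and UPS samples by changing the alphabet and ph arguments.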