[coqui-tts] production deployment and aframe component #9
An example for French with the following text: https://github.com/c-frame/sponsorship/assets/112249/8dfe093a-86eb-4e9a-a812-29afec0cbbd9
I would love to help sponsor this work - I'll look into GitHub's system. I am a huge fan of open source speech tech - my favorite right now is Festival Lite + Wasm, but the ecosystem isn't there yet in WASM today for audio output. Would your method work as an open source cross-platform polyfill for JavaScript Speech Synthesis support?
Does the output of this work require a server?
You will need a server running Docker, a small VPS with Ubuntu 22.04 for example. Currently on the Meta browser on Quest 1 (it's not updated anymore), with the speechSynthesis API I'm using:

```js
const msg = new SpeechSynthesisUtterance(chunk);
msg.pitch = this.data.pitch;
msg.rate = this.data.rate;
msg.voice = this.voice;
speechSynthesis.speak(msg);
```

Pitch and rate wouldn't do anything in the polyfill implementation that would go through coqui tts.
Thank you for the detailed explanation - I'm looking for a polyfill-capable solution to add speech synthesis to any website in a free and open source way, and an Ubuntu Linux server based solution won't provide that (it requires two computers). I'm designing for local-first, offline-capable AR/VR websites. I totally see the benefits of speech running on a separate computer (like if it is truly an AI entity), and there are many great server-based solutions today that are already free and open source. If these AI models are what you're personally aiming for, then maybe they will be able to run locally on WebGPU cheaply in a year or two, and our goals may cross. It looks like you're closely following the pattern of the standard JavaScript API, speak(utterance), and it's neat you're expanding the API for yourself. Thanks for explaining again - your open source activity is wonderful and this sponsorship system you have is neat. If you do work like this one day that runs locally in this domain, I'm excited to sponsor it. Hopefully new headsets like the Quest 3 or Deckard will have some extra juice to do TTS in a web worker.
Thanks for the kind words.
Thanks, that looks like the scale I'm aiming for! Maybe we will end up with a good open source ecosystem that can run locally or remotely. Artificial / virtual speech is so fundamentally important for so many things - AI communication, or even just humans wanting to read a book with their ears instead of their eyes. I'm very excited to see how this space grows and becomes standard and friendly - thanks for the ggml model tip!
The YourTTS model says really weird things in French when there is ":" or "?" in the text. :D

```js
text = text.replaceAll(": ", "").replaceAll("?", "").replaceAll("etc.", "ète cétéra");
```
With the talking mistakes:
With the line removing punctuation and forcing how to pronounce "etc.":
I'll now be testing new voices with the new YourTTS checkpoint, or generate my own.
You may know that the Coqui company shut down, just after they started allowing commercial use of XTTS output with a yearly license. So currently you can't purchase a license to use the XTTS model commercially. That's unfortunate. You can still use the inference engine (it's MPL-2.0 licensed) and the XTTS model (CPML, so no commercial use) though. Today I'm using the OpenAI TTS voices https://platform.openai.com/docs/guides/text-to-speech, which work well in English and French with the same voice. I'm using it through a small nodejs process with the fastify framework.
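For reference, a minimal sketch of what such a nodejs proxy could look like - this is not the actual code from the project; the route path, port, and the `tts-1`/`alloy` model and voice choices are assumptions:

```js
// Hypothetical sketch: a small Fastify server that proxies text to the
// OpenAI TTS endpoint and streams the audio back. Requires Node 18+
// (global fetch) and OPENAI_API_KEY in the environment.
import Fastify from 'fastify';

const fastify = Fastify();

fastify.post('/tts', async (request, reply) => {
  const { text } = request.body;
  const response = await fetch('https://api.openai.com/v1/audio/speech', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    // Model and voice are illustrative choices, not the project's settings.
    body: JSON.stringify({ model: 'tts-1', voice: 'alloy', input: text }),
  });
  reply.type('audio/mpeg');
  return Buffer.from(await response.arrayBuffer());
});

await fastify.listen({ port: 3000 });
```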
Open source TTS Piper: https://rhasspy.github.io/piper-samples/ |
I'm currently using the speechSynthesis API for text-to-speech, but this API doesn't work in VR on the Meta browser. Also the voice differs from one platform to another; using a male voice on a female avatar is funny, but not for a customer :-)
The API is a bit tricky because the voices list loads asynchronously; you can read more in this article (7 Dec 2021, so some information may not be accurate anymore).
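To illustrate that async quirk, here is a minimal sketch (not code from this project) of waiting for the voices list with the standard API:

```js
// getVoices() can return an empty array until the browser fires
// the 'voiceschanged' event, so wrap it in a promise.
function loadVoices() {
  return new Promise((resolve) => {
    const voices = speechSynthesis.getVoices();
    if (voices.length > 0) {
      resolve(voices);
    } else {
      speechSynthesis.addEventListener(
        'voiceschanged',
        () => resolve(speechSynthesis.getVoices()),
        { once: true }
      );
    }
  });
}

// Usage: pick a voice once the list is actually populated.
loadVoices().then((voices) => console.log(voices.map((v) => v.name)));
```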
I'm working on a Coqui CPU integration with the official Docker image, integrating it into my existing server without a GPU.
The "tts_models/multilingual/multi-dataset/your_tts" model (article) is actually quite good for English and French (That's funny for French that you have a enough good result with speaker_id="male-pt-3\n" and language_id="fr-fr")
The backend part will consist of a docker-compose file and one or several docker containers to generate the audio from text, suitable for production usage (several users communicating with a gpt-3.5 agent at the same time in different rooms); a rough sketch follows below.
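Something along these lines - a hypothetical compose file, not the project's actual one; the image name, server script, and port follow the Coqui TTS Docker documentation, but treat the details as assumptions:

```yaml
version: "3"
services:
  coqui-tts:
    image: ghcr.io/coqui-ai/tts-cpu   # official CPU-only image
    entrypoint: python3
    command: >
      TTS/server/server.py
      --model_name tts_models/multilingual/multi-dataset/your_tts
    ports:
      - "5002:5002"                   # the demo server listens on 5002
    restart: unless-stopped
```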
I'm also working on an aframe component that splits the text on punctuation into chunks, does a fetch call for each chunk to the coqui-tts service, and plays the audio chunks sequentially; a sketch of this flow is below. For the fetch call and playing the audio file received, see their code.
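Roughly like this - a hypothetical sketch of the chunk/fetch/play loop, not the component's actual code; the `/api/tts` endpoint and its query parameters mirror the Coqui demo server, and the splitting regex is an assumption:

```js
// Hypothetical service URL; replace with your own deployment.
const TTS_URL = 'https://example.com/api/tts';

// Split on sentence-ending punctuation, keeping non-empty chunks.
function splitIntoChunks(text) {
  return text.split(/(?<=[.!?;])\s+/).filter((chunk) => chunk.length > 0);
}

// Fetch each chunk as audio and play the results one after another.
async function speakChunks(text) {
  for (const chunk of splitIntoChunks(text)) {
    const params = new URLSearchParams({
      text: chunk,
      speaker_id: 'male-pt-3\n', // speaker id as quoted above, newline included
      language_id: 'fr-fr',
    });
    const response = await fetch(`${TTS_URL}?${params}`);
    const blob = await response.blob();
    const audio = new Audio(URL.createObjectURL(blob));
    await new Promise((resolve) => {
      audio.addEventListener('ended', resolve, { once: true });
      audio.play();
    });
  }
}
```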
I'm working on it for my current project. When I'm done implementing it, I'll release it in a private repo with instructions on how to self-host it and use the aframe component, for my $10-tier monthly sponsors. Access to the repo will become public 4 months later.
Resources:
Alternatives: