
[coqui-tts] production deployment and aframe component #9

Open
vincentfretin opened this issue Jul 16, 2023 · 12 comments
Labels: enhancement (New feature or request), sponsors needed

vincentfretin commented Jul 16, 2023

I'm currently using the speechSynthesis API for text to speech, but this API doesn't work in VR in the Meta browser. Also, the voice is different from one platform to another; using a male voice on a female avatar is funny, but not for a customer :-)
The API is a bit tricky because the voices list loads asynchronously; you can read more in this article (7 Dec. 2021, so some information may not be accurate anymore).

I'm working on a coqui CPU integration with the official docker image, integrating it into my existing server without a GPU.
The "tts_models/multilingual/multi-dataset/your_tts" model (article) is actually quite good for English and French. (Funnily enough for French, you get a good enough result with speaker_id="male-pt-3\n" and language_id="fr-fr".)

The backend part will consist of a docker-compose file and one or several docker containers to generate the audio from text, suitable for production usage (several users communicating with a gpt-3.5 agent at the same time in different rooms); see the sketch below:

  • nginx+gunicorn for WSGI (coqui-tts is a Python app)
  • probably the Varnish proxy cache to queue similar requests for the same text, generate the wav once, and cache the wav in memory for a few minutes to reply to all requests
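
A minimal docker-compose sketch of that layout, assuming the official ghcr.io/coqui-ai/tts-cpu image and its tts-server entry point; service names, ports, and the nginx config path are placeholders, not the final setup:

version: "3"
services:
  coqui-tts:
    image: ghcr.io/coqui-ai/tts-cpu
    # tts-server is the HTTP server shipped with the TTS package.
    entrypoint: tts-server --model_name tts_models/multilingual/multi-dataset/your_tts --port 5002
    expose:
      - "5002"
  nginx:
    image: nginx:stable
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - coqui-tts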

I'm also working on an aframe component that splits the text on punctuation into chunks, makes a fetch call for each chunk to the coqui tts service, and plays the audio chunks sequentially. For the fetch call and playing the received audio file, see their code.
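
A rough sketch of that loop (not the actual component); the /api/tts endpoint and its query parameter follow the coqui tts-server API, but the base URL is a placeholder:

// Split text into chunks on punctuation, fetch a wav per chunk,
// and play the chunks one after the other.
async function speakText(text, baseUrl) {
  const chunks = text.split(/(?<=[.!?;:])\s+/).filter((s) => s.length > 0);
  for (const chunk of chunks) {
    const url = `${baseUrl}/api/tts?text=${encodeURIComponent(chunk)}`;
    const res = await fetch(url);
    const blob = await res.blob();
    const audio = new Audio(URL.createObjectURL(blob));
    // Wait for the current chunk to finish before starting the next one.
    await new Promise((resolve, reject) => {
      audio.onended = resolve;
      audio.onerror = reject;
      audio.play().catch(reject);
    });
  }
}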

I'm working on it for my current project. When I'm done implementing it, I'll open source it in a private repo with instructions on how to self-host it and use the aframe component, for my $10-tier monthly sponsors. Access to the repo will become public 4 months later.

Resources:

Alternatives:

@vincentfretin vincentfretin self-assigned this Jul 16, 2023
@vincentfretin vincentfretin moved this from Todo to In Progress in Sponsorship Jul 16, 2023
vincentfretin commented:

An example for French with the following text:
"Ici, vous pouvez voir trois modules actuellement fermés. Le module blanc est le réfectoire, qui sert pour le déjeuner. Dans le module orange, vous trouverez la machine à café, ainsi que les vestiaires et les casiers. Enfin, le module bleu est destiné aux toilettes et aux douches."
with "tts_models/multilingual/multi-dataset/your_tts" model, speaker_id="male-pt-3\n" and language_id="fr-fr"

https://github.com/c-frame/sponsorship/assets/112249/8dfe093a-86eb-4e9a-a812-29afec0cbbd9
(sound only, not a video)
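
For reference, roughly how such a sample can be requested from tts-server; the /api/tts query parameters follow the coqui server API, but the host and port are placeholders:

async function fetchSample() {
  const params = new URLSearchParams({
    text: "Ici, vous pouvez voir trois modules actuellement fermés.",
    speaker_id: "male-pt-3\n",
    language_id: "fr-fr",
  });
  const res = await fetch(`http://localhost:5002/api/tts?${params}`);
  // Wrap the returned wav in an audio element, ready to play.
  return new Audio(URL.createObjectURL(await res.blob()));
}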

KooIaIa commented Jul 16, 2023

I would love to help sponsor this work; I'll look into GitHub's system. I am a huge fan of open source speech tech. My favorite right now is Festival Lite + WASM, but the WASM ecosystem for audio output isn't there yet today.

Would your method work as an open source, cross-platform polyfill for the JavaScript SpeechSynthesis API?

KooIaIa commented Jul 16, 2023

Does the output of this work require a server?

vincentfretin commented:

You will need a server running docker, for example a small VPS with Ubuntu 22.04.
It will work everywhere; it's just a fetch call to the hosted coqui tts webservice, then the client plays the downloaded wav file with an audio element.

Currently in the Meta browser on Quest 1 (which isn't updated anymore), window.speechSynthesis and window.SpeechSynthesisUtterance are undefined. So I guess yes, we can do a polyfill; I can go in that direction.
But I'll also write an alternative API so you can force it to always go through the coqui server, even if window.speechSynthesis is defined.
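
A minimal sketch of the polyfill idea, assuming the browser lacks the API entirely; TTS_ENDPOINT is a placeholder, and only the parts of the API I use are covered:

const TTS_ENDPOINT = "https://example.com/api/tts"; // placeholder

if (!window.speechSynthesis) {
  window.SpeechSynthesisUtterance = class {
    constructor(text) {
      this.text = text;
      this.pitch = 1;
      this.rate = 1;
      this.voice = null;
    }
  };
  window.speechSynthesis = {
    speaking: false,
    getVoices: () => [],
    speak(utterance) {
      this.speaking = true;
      fetch(`${TTS_ENDPOINT}?text=${encodeURIComponent(utterance.text)}`)
        .then((res) => res.blob())
        .then((blob) => {
          const audio = new Audio(URL.createObjectURL(blob));
          // Mirror the speaking flag the way the native API does.
          audio.onended = () => { this.speaking = false; };
          return audio.play();
        });
    },
    cancel() {},
  };
}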

With the speechSynthesis API, I'm currently using the speechSynthesis.onvoiceschanged callback to select the preferred voice and the speechSynthesis.speaking flag to know if it's currently speaking, and this to speak:

const msg = new SpeechSynthesisUtterance(chunk);
msg.pitch = this.data.pitch;
msg.rate = this.data.rate;
msg.voice = this.voice;
speechSynthesis.speak(msg);
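
For completeness, the voice selection part looks roughly like this (a sketch; the language-matching criterion is an assumption):

let preferredVoice = null;
speechSynthesis.onvoiceschanged = () => {
  // The voices list is only populated asynchronously on some platforms.
  const voices = speechSynthesis.getVoices();
  preferredVoice = voices.find((v) => v.lang.startsWith("fr")) || voices[0];
};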

Pitch and rate wouldn't do anything in the polyfill implementation that goes through coqui tts.
I see there are other flags in the speechSynthesis API, like pending and paused, and some other methods that I'm not currently using. Do you use a specific part of the API that I haven't listed above?

KooIaIa commented Jul 17, 2023

Thank you for the detailed explanation. I'm looking for a polyfill-capable solution to add speech synthesis to any website in a free and open source way, and an Ubuntu Linux server based solution will not provide that (it requires two computers). I'm designing for local-first, offline-capable AR/VR websites. I totally see the benefits of speech running on a separate computer (like if it is truly an AI entity), and there are many great server-based solutions today that are already free and open source. If these AI models are what you're personally aiming for, then maybe they will be able to run locally on WebGPU cheaply in a year or two, and our goals may cross.

It looks like you're closely following the pattern of the standard JavaScript API, speak(utterance), yeah, and it's neat that you're expanding the API for yourself. Thanks for explaining again; your open source activity is wonderful and this sponsorship system you have is neat. If one day you do work like this that runs locally in this domain, I'm excited to sponsor it. Hopefully new headsets like the Quest 3 or Deckard will have some extra juice to do TTS in a web worker.

vincentfretin commented:

Thanks for the kind words.
You should keep an eye on the implementation of the bark TTS model running with ggml. You should normally be able to run it with WebAssembly, like with whisper.cpp.

KooIaIa commented Jul 17, 2023

Thanks, that looks like the scale I'm aiming for! Maybe we will end up with a good open source ecosystem that can run locally or remotely. Artificial/virtual speech is so fundamentally important for so many things, AI communication or even just humans wanting to read a book with their ears instead of their eyes. I'm very excited to see how this space grows and becomes standard and friendly. Thanks for the ggml model tip!

vincentfretin commented:

The YourTTS model says really weird things in French when there is ":" or "?" in the text. :D
It also doesn't know how to pronounce abbreviations like "etc.".
So I cheat:

text = text.replaceAll(": ", "").replaceAll("?", "").replaceAll("etc.", "ète cétéra");

vincentfretin commented:

With the talking mistakes:
yourtts_example1.webm

With the line removing punctuation and forcing how to pronounce "etc.":
yourtts_example2.webm

I'll now test new voices with the new YourTTS checkpoint, or generate my own.

vincentfretin commented:

You may know that the Coqui company shut down, just after they started allowing commercial use of the xtts output with a yearly license. So currently you can't purchase a license to use the xtts model commercially. That's unfortunate.

You can still use the inference engine (it's MPL-2.0 licensed) and the XTTS model (CPML, so no commercial use) though.
FYI, there was a streaming server for xtts:
https://github.com/coqui-ai/xtts-streaming-server

Today I'm using the openai tts voices (https://platform.openai.com/docs/guides/text-to-speech); they work well in English and French with the same voice. I'm using them through a small nodejs process with the fastify framework.
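
A minimal sketch of such a process, assuming fastify v4 and the documented OpenAI /v1/audio/speech endpoint; the /tts route, port, and default voice here are placeholders:

import Fastify from "fastify";

const fastify = Fastify();

// Proxy a text string to OpenAI TTS and stream the mp3 back.
fastify.post("/tts", async (request, reply) => {
  const { text, voice = "alloy" } = request.body;
  const res = await fetch("https://api.openai.com/v1/audio/speech", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model: "tts-1", voice, input: text }),
  });
  reply.header("Content-Type", "audio/mpeg");
  return reply.send(Buffer.from(await res.arrayBuffer()));
});

await fastify.listen({ port: 3000 });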

@vincentfretin vincentfretin moved this from In Progress to Todo in Sponsorship Mar 14, 2024
vincentfretin commented:

Open source TTS Piper: https://rhasspy.github.io/piper-samples/
For English, en_US-amy-medium.onnx is great.
For French, fr_FR-upmc-medium.onnx with the jessica and pierre voices is best.
Inference speed is fast.
I think it's finally a real open source alternative to Google TTS or OpenAI TTS for French.
I'll do more testing with it.
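
A rough sketch of driving Piper from nodejs via its CLI; the --model and --output_file flags follow the Piper README, and the model path and output file are placeholders:

import { execFile } from "node:child_process";

// Piper reads the text on stdin and writes a wav file to --output_file.
function piperTTS(text, modelPath, outFile, callback) {
  const child = execFile(
    "piper",
    ["--model", modelPath, "--output_file", outFile],
    callback
  );
  child.stdin.write(text);
  child.stdin.end();
}

piperTTS("Bonjour tout le monde.", "fr_FR-upmc-medium.onnx", "out.wav",
  (err) => { if (err) console.error(err); });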
