[coqui-tts] production deployment and aframe component #9
An example for French with the following text: https://github.com/c-frame/sponsorship/assets/112249/8dfe093a-86eb-4e9a-a812-29afec0cbbd9
I would love to help sponsor this work - I'll look into GitHub's system. I am a huge fan of open source speech tech - my favorite right now is Festival Lite + Wasm, but the ecosystem isn't there yet in WASM today for audio output. Would your method work as an open source cross-platform polyfill for JavaScript Speech Synthesis support?
Does the output of this work require a server?
You will need a server running Docker, a small VPS with Ubuntu 22.04 for example. Currently on the Meta browser on Quest 1 (it's not updated anymore), with the speechSynthesis API I'm using:

```js
const msg = new SpeechSynthesisUtterance(chunk);
msg.pitch = this.data.pitch;
msg.rate = this.data.rate;
msg.voice = this.voice;
speechSynthesis.speak(msg);
```

Pitch and rate wouldn't do anything in the polyfill implementation that would go through coqui tts.
Thank you for the detailed explanation - I'm looking for a polyfill-capable solution to add speech synthesis to any website in a free and open source way, and an Ubuntu Linux server based solution won't provide that (it requires two computers). I'm designing for local-first, offline-capable AR/VR websites. I totally see the benefits of speech running on a separate computer (like if it is truly an AI entity), and there are many great server-based solutions today that are already free and open source. If these AI models are what you're personally aiming for, then maybe they will be able to run locally on WebGPU cheaply in a year or two, and our goals may cross. It looks like you're closely following the pattern of the standard JavaScript API, speak(utterance), and it's neat you're expanding the API for yourself. Thanks for explaining again - your open source activity is wonderful and this sponsorship system you have is neat. If you do work like this one day that runs locally in this domain, I'm excited to sponsor it. Hopefully new headsets like the Quest 3 or Deckard will have some extra juice to do TTS in a web worker.
Thanks for the kind words.
Thanks, that looks like the scale I'm aiming for! Maybe we will end up with a good open source ecosystem that can run locally or remotely. Artificial / virtual speech is so fundamentally important for so many things - AI communication, or even just humans wanting to read a book with their ears instead of their eyes. I'm very excited to see how this space grows and becomes standard and friendly - thanks for the ggml model tip!
The YourTTS model says really weird things in French when there is ":" or "?" in the text. :D

```js
text = text.replaceAll(": ", "").replaceAll("?", "").replaceAll("etc.", "ète cétéra");
```
With the talking mistakes:
With the line removing punctuation and forcing how to pronounce "etc.":
I'll now be testing new voices with the new YourTTS checkpoint, or generate my own.
You may know that the Coqui company shut down, just after they started allowing commercial use of XTTS output with a yearly license. So currently you can't purchase a license to use the XTTS model commercially. That's unfortunate. You can still use the inference engine (it's MPL-2.0 licensed) and the XTTS model (CPML, so no commercial use) though. Today I'm using the OpenAI TTS voices https://platform.openai.com/docs/guides/text-to-speech, which work well in English and French with the same voice. I'm using it through a small nodejs process with the fastify framework.
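For reference, a minimal sketch of what such a nodejs proxy could look like - this is not the actual code from the project; the route path, port, and the `tts-1`/`alloy` model and voice choices are assumptions:

```js
// Hypothetical sketch: a small Fastify server that proxies text to the
// OpenAI TTS endpoint and streams the audio back. Requires Node 18+
// (global fetch) and OPENAI_API_KEY in the environment.
import Fastify from 'fastify';

const fastify = Fastify();

fastify.post('/tts', async (request, reply) => {
  const { text } = request.body;
  const response = await fetch('https://api.openai.com/v1/audio/speech', {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json',
    },
    // Model and voice are illustrative choices, not the project's settings.
    body: JSON.stringify({ model: 'tts-1', voice: 'alloy', input: text }),
  });
  reply.type('audio/mpeg');
  return Buffer.from(await response.arrayBuffer());
});

await fastify.listen({ port: 3000 });
```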
Open source TTS Piper: https://rhasspy.github.io/piper-samples/ |
I'm currently using the speechSynthesis API for text-to-speech, but this API doesn't work in VR on the Meta browser. Also the voice differs from one platform to another; using a male voice on a female avatar is funny, but not for a customer :-)
The API is a bit tricky because the voices list loads asynchronously; you can read more in this article (7 Dec 2021, so some information may not be accurate anymore).
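To illustrate that async quirk, here is a minimal sketch (not code from this project) of waiting for the voices list with the standard API:

```js
// getVoices() can return an empty array until the browser fires
// the 'voiceschanged' event, so wrap it in a promise.
function loadVoices() {
  return new Promise((resolve) => {
    const voices = speechSynthesis.getVoices();
    if (voices.length > 0) {
      resolve(voices);
    } else {
      speechSynthesis.addEventListener(
        'voiceschanged',
        () => resolve(speechSynthesis.getVoices()),
        { once: true }
      );
    }
  });
}

// Usage: pick a voice once the list is actually populated.
loadVoices().then((voices) => console.log(voices.map((v) => v.name)));
```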
I'm working on a Coqui CPU integration with the official Docker image, integrating it into my existing server without a GPU.
The "tts_models/multilingual/multi-dataset/your_tts" model (article) is actually quite good for English and French (That's funny for French that you have a enough good result with speaker_id="male-pt-3\n" and language_id="fr-fr")
The backend part will consist of a docker-compose file and one or several docker containers to generate the audio from text, suitable for production usage (several users communicating with a gpt-3.5 agent at the same time in different rooms); a rough sketch follows below.
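Something along these lines - a hypothetical compose file, not the project's actual one; the image name, server script, and port follow the Coqui TTS Docker documentation, but treat the details as assumptions:

```yaml
version: "3"
services:
  coqui-tts:
    image: ghcr.io/coqui-ai/tts-cpu   # official CPU-only image
    entrypoint: python3
    command: >
      TTS/server/server.py
      --model_name tts_models/multilingual/multi-dataset/your_tts
    ports:
      - "5002:5002"                   # the demo server listens on 5002
    restart: unless-stopped
```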
I'm also working on an aframe component that splits the text on punctuation into chunks, does a fetch call for each chunk to the coqui-tts service, and plays the audio chunks sequentially; a sketch of this flow is below. For the fetch call and playing the audio file received, see their code.
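Roughly like this - a hypothetical sketch of the chunk/fetch/play loop, not the component's actual code; the `/api/tts` endpoint and its query parameters mirror the Coqui demo server, and the splitting regex is an assumption:

```js
// Hypothetical service URL; replace with your own deployment.
const TTS_URL = 'https://example.com/api/tts';

// Split on sentence-ending punctuation, keeping non-empty chunks.
function splitIntoChunks(text) {
  return text.split(/(?<=[.!?;])\s+/).filter((chunk) => chunk.length > 0);
}

// Fetch each chunk as audio and play the results one after another.
async function speakChunks(text) {
  for (const chunk of splitIntoChunks(text)) {
    const params = new URLSearchParams({
      text: chunk,
      speaker_id: 'male-pt-3\n', // speaker id as quoted above, newline included
      language_id: 'fr-fr',
    });
    const response = await fetch(`${TTS_URL}?${params}`);
    const blob = await response.blob();
    const audio = new Audio(URL.createObjectURL(blob));
    await new Promise((resolve) => {
      audio.addEventListener('ended', resolve, { once: true });
      audio.play();
    });
  }
}
```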
I'm working on it for my current project. When I'm done implementing it, I'll release it in a private repo with instructions on how to self-host it and use the aframe component, for my $10-tier monthly sponsors. Access to the repo will become public 4 months later.
Resources:
Alternatives: