A simple application to clone voices. Audio is output to the default sound device instead of being saved to a file.
Currently following along with the fine-tuning video in the references section and architecting the code so that it can easily be reused on other datasets without a lot of manual steps.
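For illustration, "play to the default sound device instead of saving" can be done with a library such as sounddevice. This is only a sketch of the idea; sounddevice and the parameters below are assumptions, not confirmed details of this application:

```python
import numpy as np
import sounddevice as sd

# Play one second of a 440 Hz tone on the default output device rather
# than writing it to disk. The 24 kHz rate matches what XTTS models
# commonly output, but both values here are illustrative.
sample_rate = 24000
t = np.linspace(0, 1.0, sample_rate, endpoint=False)
audio = (0.2 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

sd.play(audio, samplerate=sample_rate)
sd.wait()  # block until playback finishes
```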
Available Utilities
- Audacity labeling and audio splitting automation
- Faster Whisper LJSpeech dataset creation
There are some utilities available that use the Audacity scripting API to automate labeling audio and splitting it out into individual clips.
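As a rough sketch of how that scripting works (illustrative only, not the repository's actual utility code; the pipe paths are the Linux defaults for mod-script-pipe, and command names should be double-checked against your Audacity version's scripting reference):

```python
import os

# Linux default pipe paths for Audacity's mod-script-pipe (the module must
# be enabled in Audacity's preferences). Windows uses the named pipes
# \\.\pipe\ToSrvPipe and \\.\pipe\FromSrvPipe instead.
TO_PIPE = f"/tmp/audacity_script_pipe.to.{os.getuid()}"
FROM_PIPE = f"/tmp/audacity_script_pipe.from.{os.getuid()}"

def do_command(command: str) -> str:
    """Send one scripting command to Audacity and read its reply."""
    with open(TO_PIPE, "w") as to_pipe:
        to_pipe.write(command + "\n")
    response = ""
    with open(FROM_PIPE) as from_pipe:
        line = from_pipe.readline()
        while line and line.strip():
            response += line
            line = from_pipe.readline()
    return response

# Select a region, then drop a label on it.
print(do_command("Select: Start=0 End=5"))
print(do_command("AddLabel:"))
```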
Before you run the application, make sure you have a dataset loaded in lib/assets/training_data in this repository.
A valid LJSpeech dataset can be retrieved from the coqui-ai-tts repository here.
Important
This project uses Poetry to manage dependencies and Python versions. Before using this project, you will need to install it.
```bash
# Create venv
python3 -m venv venv
# Activate
source venv/bin/activate
# Install dependencies
poetry install
# Run the application
poetry run python main.py
```
According to nanomonad, who references the author of coqui-tts, a fine-tuned xtts_v2 should have decent samples after 1.3 epochs, assuming that training and testing data were set up properly.
Audio datasets should conform to the following format:
```
project-root
| lib
| --> assets
|     --> training_data
|         --> $speaker_name
|             --> wavs/*.wav
```
training_data can contain a variety of speakers: simply create a folder named after your speaker, and fill up its wavs folder.
A speaker's dataset folder should conform to the LJSpeech dataset specification, which requires wav files to be in a wavs directory, along with a metadata.csv file that contains transcriptions of your wav audio.
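For reference, LJSpeech metadata rows are pipe-separated (file_id|transcription|normalized_transcription). Below is a minimal sketch that sanity-checks a speaker folder against that layout; the my_speaker path is a placeholder:

```python
from pathlib import Path

def validate_speaker_dataset(speaker_dir: Path) -> None:
    """Check that an LJSpeech-style speaker folder is internally consistent."""
    wavs_dir = speaker_dir / "wavs"
    metadata = speaker_dir / "metadata.csv"
    assert wavs_dir.is_dir(), f"missing {wavs_dir}"
    assert metadata.is_file(), f"missing {metadata}"

    wav_ids = {p.stem for p in wavs_dir.glob("*.wav")}
    for line in metadata.read_text(encoding="utf-8").splitlines():
        # LJSpeech rows: file_id|transcription|normalized_transcription
        file_id = line.split("|", 1)[0]
        if file_id not in wav_ids:
            print(f"metadata row {file_id!r} has no matching wav file")

validate_speaker_dataset(Path("lib/assets/training_data/my_speaker"))
```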
There's been some simple experimentation done to see what constitutes a good dataset. Currently, it seems like you can increase the quality in these ways:
- Combining multiple wav files into one wav file, for a single speaker
- Increasing the number of speaker samples available in the dataset
- Removing audio clips that do not have the tone of the speaker you wish to capture
- Until more research is done, it's recommended to keep audio clips for training below 10 seconds (see the duration check sketched after this list)
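To apply the under-10-seconds guideline from the list above, a quick pass with Python's built-in wave module can flag overly long clips. The threshold and path below are just the values suggested in this README:

```python
import wave
from pathlib import Path

MAX_SECONDS = 10.0  # guideline from the list above

wavs_dir = Path("lib/assets/training_data/my_speaker/wavs")  # placeholder
for wav_path in sorted(wavs_dir.glob("*.wav")):
    with wave.open(str(wav_path), "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    if duration > MAX_SECONDS:
        print(f"{wav_path.name}: {duration:.1f}s - consider splitting or removing")
```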
Models that are fine-tuned on a specific speaker are going to pick up on little idiosyncrasies in a speaker's voice. If you want uniform, consistent output from the model, then you need to ensure that your audio dataset only contains the inflections you wish the output audio to capture.
Through testing, it seems that deleting audio clips that contain unwanted speaker inflection, and increasing the number of samples with the desired inflection and tone, can increase the quality of the output. In this case, a subjective assessment is made of whether the output closely matches the input speaker's voice.
In order to fine-tune the model, you will need to download the model from Hugging Face here.
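With the huggingface_hub client, the download looks roughly like the snippet below. The coqui/XTTS-v2 repo id is an assumption based on the model name; use whatever the link above actually points to:

```python
from huggingface_hub import snapshot_download

# Fetch every file in the model repo into a local directory.
model_dir = snapshot_download(repo_id="coqui/XTTS-v2", local_dir="models/xtts_v2")
print(model_dir)
```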
To do real-time audio streaming, you should have an xtts config json. The one used in this project was first grabbed from here.
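Loading that config and streaming audio with the coqui-tts API looks roughly like this sketch; the paths are placeholders and the exact signatures may differ between TTS versions:

```python
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the xtts config json mentioned above, then the model weights.
config = XttsConfig()
config.load_json("models/xtts_v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="models/xtts_v2", use_deepspeed=False)
if torch.cuda.is_available():
    model.cuda()

# Conditioning latents come from one or more reference wavs of the speaker.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["lib/assets/training_data/my_speaker/wavs/sample.wav"]
)

# inference_stream yields audio chunks as they are generated, which is
# what makes real-time playback possible.
for chunk in model.inference_stream("Hello there!", "en", gpt_cond_latent, speaker_embedding):
    print(chunk.shape)  # each chunk is a torch tensor of audio samples
```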
Helpful links
tl;dr for CUDA on WSL: visit this site and click Linux -> x86_64 (or whatever your architecture is) -> WSL-Ubuntu -> 2.0.
Although it says Ubuntu, it should also theoretically work for Debian. YMMV.
There seems to be an issue with the version of deepspeed used ("0.10.3") on AMD CPUs. This needs more investigation. The current plan is to let the user flag whether or not they want to use deepspeed.
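One hypothetical shape for that flag (not yet in the codebase):

```python
import argparse

# Hypothetical sketch of the planned opt-in flag; not yet implemented here.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--use-deepspeed",
    action="store_true",
    help="enable deepspeed-accelerated inference (may misbehave on AMD CPUs)",
)
args = parser.parse_args()

# Later, thread the flag through to the model loader, e.g.:
# model.load_checkpoint(config, checkpoint_dir=..., use_deepspeed=args.use_deepspeed)
```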
You should go here and make sure that you're able to leverage the CUDA toolkit before running this program.
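A quick way to confirm that PyTorch can actually see the toolkit:

```python
import torch

print(torch.cuda.is_available())  # True if CUDA is usable from PyTorch
print(torch.version.cuda)         # CUDA version PyTorch was built against
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```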
Text length exceeding 250 characters
This might be an issue with the length of the audio files in your training dataset.
Please try using the util script delete-long-audio-files.sh to take out the long audio files from your dataset.
Afterwards, you can use the diff_ljspeech_metadata_csv.py file to create a new metadata.csv with the deleted files excluded.
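The same idea in a few lines, in case you want to adapt it; this is a sketch of the approach, not the actual contents of diff_ljspeech_metadata_csv.py, and the path is a placeholder:

```python
from pathlib import Path

# Keep only metadata rows whose wav file still exists after the cleanup.
dataset = Path("lib/assets/training_data/my_speaker")
rows = (dataset / "metadata.csv").read_text(encoding="utf-8").splitlines()
kept = [
    row for row in rows
    if (dataset / "wavs" / (row.split("|", 1)[0] + ".wav")).is_file()
]
(dataset / "metadata.new.csv").write_text("\n".join(kept) + "\n", encoding="utf-8")
```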
Training runs land in timestamped directories such as:
run/training/GPT_XTTS_v2.0_LJSpeech_FT-single_channel_wavs-July-28-2024_01+59PM-9f8773b