Does anyone have a sample of audio that has been generated with a trained model? #29
Replies: 4 comments 11 replies
-
I believe
-
It's pretty easy to train, so I suggest giving it a shot yourself! Here's a sample from a training run on the LibriTTS-R dataset, trained for 500k steps on a single A100. The total training time was about 36 hours. I used the semantic tokens from the 2k-token variant of the HuBERT model available here, but you can experiment with other semantic representations depending on your use case. Report back if you get some good results! 🚀
-
@lucasnewman I just tried to replicate what you found. I used "L9 km2000 (Expresso)". I noticed the following:
I used a female audio sample (from the training split) to generate semantic tokens; no matter what cond I use, it generates a female voice similar to the original.
It seems I have to train my own TextToSemantic to avoid point 2. But this is what I got so far :)
-
Hey folks, for those interested in experimenting with Voicebox, I've got a small pretrained model up on HF 🚀

Here are the model specs: 12 layers

It was trained on random 5-second crops of the LibriTTS-R train-clean-100 & train-clean-360 subsets for 150k steps on a single A100 with a batch size of 112, a learning rate of 2e-4, a max gradient norm of 0.2, 5k warmup steps, and a conditional drop probability of 0.2. It trained for roughly 24 hours.

I've also put up a notebook that demonstrates how to use both unconditional generation and in-filling / style transfer, and I left a couple of examples to listen to in the notebook as well.

Note that this doesn't include a text-to-semantic model -- it currently derives the semantic tokens from other audio samples. You can generate samples on an A100 with good quality (~32 ODE steps) at about 2x realtime, and the notebook should run on any GPU on Colab, or even on the CPU if you switch the accelerator from "cuda" to "cpu".
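For anyone who wants to reproduce a similar setup, here is a rough sketch of how the hyperparameters quoted above could be written down, together with a warmup schedule and a random-crop helper. This is not the actual training code: the `TrainConfig`, `lr_at`, and `random_crop` names are made up for illustration, the linear shape of the warmup is an assumption (the post only gives the number of warmup steps), and the sample rate is a caller-supplied placeholder because the post does not state one.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values quoted in the post; the field names are hypothetical.
    steps: int = 150_000
    batch_size: int = 112
    learning_rate: float = 2e-4
    max_grad_norm: float = 0.2
    warmup_steps: int = 5_000
    cond_drop_prob: float = 0.2
    crop_seconds: float = 5.0


def lr_at(step: int, cfg: TrainConfig) -> float:
    """Learning rate at a given step.

    ASSUMPTION: linear warmup from 0 to the peak rate, then constant.
    The post does not say which warmup or decay shape was used.
    """
    if step >= cfg.warmup_steps:
        return cfg.learning_rate
    return cfg.learning_rate * step / cfg.warmup_steps


def random_crop(samples, sample_rate: int, cfg: TrainConfig, rng: random.Random):
    """Return a random crop of cfg.crop_seconds from a sequence of samples.

    Clips shorter than the crop length are returned unchanged.
    """
    crop_len = int(cfg.crop_seconds * sample_rate)
    if len(samples) <= crop_len:
        return samples
    start = rng.randrange(len(samples) - crop_len + 1)
    return samples[start:start + crop_len]
```

Calling `lr_at(step, TrainConfig())` inside a training loop gives the rate for that step, and `random_crop` can be applied to each waveform before batching. Passing an explicit `random.Random` instance keeps the crops reproducible across runs.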
-
Does anyone have a sample of audio that has been generated with a trained model?
Would love to hear a sample of some generated audio along with some info about: