Does anyone have a sample of audio that has been generated with a trained model? #29

KonstantineGoudz · 2023-09-29T18:20:13Z

Does anyone have a sample of audio that has been generated with a trained model?
Would love to hear a sample of some generated audio, along with some info about:

how much audio was used to train
how long it took to train the model

ADD-eNavarro · 2023-10-20T10:36:10Z

I believe https://github.com/chenht2010/Voicebox does.
No idea about the two questions though.

0 replies

lucasnewman · 2023-10-20T16:10:26Z

It's pretty easy to train, so I suggest giving it a shot yourself!

Here's a sample from a training run on the LibriTTS-R dataset, trained for 500k steps on a single A100. The total training time was about 36 hours.

I used the semantic tokens from the 2k-token variant of the HuBERT model available here, but you can experiment with other semantic representations depending on your use case.

Report back if you get some good results! 🚀

1 reply

Jon-Zbw Jan 2, 2024

Can you share your training code? Thanks.

Alrowithi · 2023-11-26T08:24:48Z

@lucasnewman I just tried to replicate what you found. I used "L9 km2000 (Expresso)".
I trained the model on raw audio from LibriTTS-R for 1M updates on 8 A100 GPUs with FP16.
I used voicebox-pytorch 0.4.0.

I noticed the following:

The model is not actually following the cond audio; instead, it learns the voice mainly from the semantic tokens (I guess there is too much speaker info within the tokens?).

I used a female audio sample (from the training split) to generate semantic tokens, and no matter what cond I use, it generates a female voice similar to the original.

I also noticed that the decoded audio sounds very fast (AFAIU this is because the semantic token sequence is short), so my dirty solution was to repeat each token twice, which produces something reasonably good.

It seems I have to train my own TextToSemantic to avoid point 2.

But this is what I got so far :)
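
The token-repetition workaround mentioned above can be sketched in a few lines. This is only an illustration in plain Python with made-up token ids, not the code used in the run; the function name `repeat_tokens` is hypothetical. For tensors, `torch.repeat_interleave(tokens, 2, dim=-1)` does the same thing.

```python
def repeat_tokens(token_ids, factor=2):
    """Repeat every token `factor` times in place, preserving order.

    Doubling the sequence length this way roughly halves the speaking
    rate when each token maps to a fixed number of output frames.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return [tok for tok in token_ids for _ in range(factor)]


semantic_tokens = [5, 17, 17, 42]  # made-up ids, for illustration only
stretched = repeat_tokens(semantic_tokens)
print(stretched)  # [5, 5, 17, 17, 17, 17, 42, 42]
```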

6 replies

lucasnewman Nov 28, 2023

@Alrowithi I put up a PR here that should help with the conditioning issue you were hitting. Note you'll need to retrain the network if you want to try it out 👍

Alrowithi Nov 29, 2023

Wow! that was fast :).
I'll start a new run later today and report back my findings 👍

Alrowithi Feb 16, 2024

So I had great success conditioning on aligned characters through the UnitY2 aligner from the SeamlessM4T repo (it seems they did the same in the AudioBox paper). It works great, and I'm currently running training on much bigger data.

Edit:
I forgot to mention that audio conditioning got much better when I scaled the data: moving from LibriTTS-R to CV-Eng gave better results on speaker conditioning.

ex3ndr Feb 16, 2024

I get much, much worse voice quality when scaling to Common Voice.

Also, isn't this aligner for graphemes and not phonemes?

Alrowithi Feb 16, 2024

Quality is indeed worse compared to the samples Lucas provided here, but I'm going to worry about that later. My current goal is to get good zero-shot performance and support for multilingual speech.
Regarding the aligner: yes, it is indeed grapheme-based. The alignment it generates is actually great, and it supports around 38 languages, so that's a big plus. I also noticed Meta used it to train AudioBox TTS; they mention it in section 5.1 of the paper.

Also, I tried to train a duration model on the alignments generated by UnitY2, following the same architecture as in the VoiceBox paper. It converges, but the results were not that good (the generated durations are very short). I will investigate it later this week.
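
For readers unfamiliar with duration targets, here is a minimal sketch of how per-character durations can be derived from a frame-level alignment. The aligner's actual output format is not shown in this thread, so the input here (one character index per audio frame) is an assumption, and `durations_from_alignment` is a hypothetical helper.

```python
def durations_from_alignment(frame_to_char, num_chars):
    """Count how many frames are assigned to each character.

    frame_to_char: one character index per audio frame (assumed format).
    num_chars:     length of the text, so unaligned characters get 0.
    """
    durations = [0] * num_chars
    for char_idx in frame_to_char:
        durations[char_idx] += 1
    return durations


text = "hello"
frame_to_char = [0, 0, 1, 2, 2, 2, 3, 4, 4, 4]  # made-up alignment
durations = durations_from_alignment(frame_to_char, len(text))
print(list(zip(text, durations)))

# Sanity check: durations must add up to the number of frames. Comparing
# this sum for predicted vs. ground-truth durations is a quick way to
# spot a model that predicts durations that are too short.
assert sum(durations) == len(frame_to_char)
```

Indexing by character position rather than by character value keeps doubled letters (the two "l"s above) as separate targets.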

lucasnewman · 2023-12-01T19:09:49Z

Hey folks, for those interested in experimenting with Voicebox, I've got a small pretrained model up on HF 🚀

Here are the model specs:

12 layers
512 embed
16 heads
140M parameters including the vocoder

It was trained on random 5 second crops of the LibriTTS-R train-clean-100 & train-clean-360 subsets for 150k steps on a single A100 with a batch size of 112, learning rate of 2e-4, max gradient norm of 0.2, 5k warmup steps, and 0.2 conditional drop probability. It trained for roughly 24 hours.
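
To make those numbers easier to reuse, here is a small sketch that collects them in one place. The values come from the paragraph above; the `TrainConfig` class, the field names, and the linear shape of the warmup are assumptions for illustration, not the actual training script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    steps: int = 150_000
    batch_size: int = 112
    crop_seconds: float = 5.0
    learning_rate: float = 2e-4
    max_grad_norm: float = 0.2
    warmup_steps: int = 5_000
    cond_drop_prob: float = 0.2


def lr_at(step, cfg):
    """Learning rate at `step`: linear warmup, then constant (assumed shape)."""
    scale = min(1.0, step / cfg.warmup_steps)
    return cfg.learning_rate * scale


def audio_hours_seen(cfg):
    """Total hours of cropped audio processed, counting repeats across epochs."""
    return cfg.steps * cfg.batch_size * cfg.crop_seconds / 3600


cfg = TrainConfig()
print(lr_at(2_500, cfg))             # halfway through warmup
print(round(audio_hours_seen(cfg)))  # 23333
```

Note that the second figure counts crops the model has seen, not unique hours in the dataset.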

I've also put up a notebook that demonstrates how to use both unconditional generation and in-filling / style transfer, and I left a couple of examples to listen to in the notebook as well.

Note that this doesn't include a text-to-semantic model -- it currently derives the semantic tokens from other audio samples.

You can generate samples on an A100 with good quality (~32 ODE steps) at about 2x realtime, and the notebook should run on any GPU on Colab or even the CPU if you switch the accelerator from "cuda" to "cpu".
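
As a rough picture of what "ODE steps" means here: sampling integrates a learned velocity field from noise at t=0 to data at t=1, and the step count trades quality for speed. The sketch below uses a fixed-step Euler solver on a toy field in plain Python; the solver the notebook actually uses may differ, and `euler_sample` is a hypothetical name.

```python
def euler_sample(velocity, x0, steps=32):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for the trained network: a constant field pointing from the
# starting noise to a fixed target, i.e. a straight-line path.
noise = [0.0, 1.0, -2.0]
target = [1.0, 1.0, 1.0]


def straight_line_field(x, t):
    return [b - a for a, b in zip(noise, target)]


sample = euler_sample(straight_line_field, noise, steps=32)
print(sample)
```

With a real model, each step is one forward pass of the network, so halving the step count roughly halves generation time.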

4 replies

rodrigoGA Jan 4, 2024

Great job
Where can I find a pre-trained model for text-to-semantic?
I am only interested in evaluating the text-only feature

wassimseif Jan 10, 2024

@lucasnewman
First, thank you so much for all your open source efforts!
Following some of the tips you left here and there, I have the setups below. Setup one is still training, and I'll update this thread once I get some results, but setup two, the one I wrote following your tips, is not working.

Setup One (with SpearTTS pretrained model)

I trained a SpearTTS model on mls_english (pretraining only).
TTS config

dim: 256
num_text_token_ids: 32100
source_depth: 6
target_depth: 6
heads: 8
dim_head: 64
attn_dropout: 0.5
ff_mult: 2
ff_dropout: 0.5

I loaded this TTS model into Voicebox training and am currently training on a small subset of mls_english.

Setup Two (without TTS)

(I think) this is the setup you used for the pretrained model you published on HF?

I have a Dataset that reads .flac files and returns the data, more or less similar to this (I think I got it from you as well :) )

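import torch
import torch.nn.functional as F
import torchaudio
from einops import rearrange, reduce
from torchaudio.functional import resample

def __getitem__(self, idx):
    ...  # elided in the original; this part resolves `file` and `max_length`
    data, sample_hz = torchaudio.load(file)
    if data.shape[0] > 1:
        # the audio has more than 1 channel, convert to mono
        data = reduce(data, "c ... -> 1 ...", "mean")
    data = resample(data, sample_hz, self.target_sample_hz)
    audio_length = data.size(1)
    if audio_length > max_length:
        # take a random crop of max_length samples
        max_start = audio_length - max_length
        # randint's upper bound is exclusive, so add 1 to allow the last start
        start = torch.randint(0, max_start + 1, (1,)).item()
        data = data[:, start : start + max_length]
    else:
        # right-pad shorter clips with zeros up to max_length
        data = F.pad(data, (0, max_length - audio_length), "constant")

    # drop the channel dimension: (1, n) -> (n,)
    data = rearrange(data, "1 ... -> ...")
    return data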

Training setup is similar to this
VoiceBox Config

condition_on_text: True
dim: 512
dim_cond_embed: 512
num_text_token_ids: 32100
num_cond_tokens: 2001
depth: 12
dim_head: 64
heads: 16
ff_mult: 4
attn_qk_norm: False
use_gateloop_layers: False
num_register_tokens: 0

voicebox = get_voicebox_model(config)
cfm_wrapper = ConditionalFlowMatcherWrapper(voicebox=voicebox, cond_drop_prob=0.2)

Now when you try to train with such a setup, it fails with an AssertionError from voicebox_pytorch here, because:

if self.condition_on_text:  # <--- this will be True
    # <--- the check below will be False: we don't have a TTS model, and where do
    #      we get the semantic_token_ids? Should the dataset return them?
    if exists(self.text_to_semantic) or exists(semantic_token_ids):
        assert not exists(phoneme_ids), 'phoneme ids are not needed for conditioning with spear-tts text-to-semantic'

        if not exists(semantic_token_ids):
            assert input_is_raw_audio
            wav2vec = self.text_to_semantic.wav2vec
            wav2vec_input = resample(raw_audio, input_sampling_rate, wav2vec.target_sample_hz)
            semantic_token_ids = wav2vec(wav2vec_input).clone()

        cond_token_ids = semantic_token_ids
    else:
        assert exists(phoneme_ids)  # <--- this will be triggered because phoneme_ids is None
        cond_token_ids = phoneme_ids

I did a small workaround where I do this

wav2vec = get_wav2vec_models(config) # hubert_base_ls960_L9_km2000_expresso.bin, 24_000
cfm_wrapper = ConditionalFlowMatcherWrapper(voicebox=voicebox, cond_drop_prob=0.2)
cfm_wrapper.wav2vec = wav2vec

and then in the training code I added this

if self.condition_on_text:
    if exists(self.text_to_semantic) or exists(semantic_token_ids):
        assert not exists(
            phoneme_ids
        ), "phoneme ids are not needed for conditioning with spear-tts text-to-semantic"

        if not exists(semantic_token_ids):
            assert input_is_raw_audio
            wav2vec = self.text_to_semantic.wav2vec
            wav2vec_input = resample(
                raw_audio, input_sampling_rate, wav2vec.target_sample_hz
            )
            semantic_token_ids = wav2vec(wav2vec_input).clone()

        cond_token_ids = semantic_token_ids
###################################
    elif hasattr(self, "wav2vec"):
        wav2vec = self.wav2vec
        wav2vec_input = resample(
            raw_audio, input_sampling_rate, wav2vec.target_sample_hz
        )
        semantic_token_ids = wav2vec(wav2vec_input).clone()

        cond_token_ids = semantic_token_ids
###################################
    else:
        assert exists(phoneme_ids)

        cond_token_ids = phoneme_ids
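
To see the branching in isolation, here is a simplified, self-contained sketch of the decision logic above. It is not the library's code: `pick_conditioning` is a hypothetical function that only reports which source would supply the conditioning tokens, with the extra `wav2vec` argument standing in for the workaround branch.

```python
def pick_conditioning(text_to_semantic=None, semantic_token_ids=None,
                      phoneme_ids=None, wav2vec=None):
    """Return the name of the source that would supply conditioning tokens."""
    if text_to_semantic is not None or semantic_token_ids is not None:
        assert phoneme_ids is None, "phoneme ids are not needed here"
        if semantic_token_ids is not None:
            return "semantic_token_ids"
        return "text_to_semantic.wav2vec"
    if wav2vec is not None:
        # workaround branch: a wav2vec model attached directly to the wrapper
        return "wav2vec"
    # original fallback: without phoneme ids this assertion fires
    assert phoneme_ids is not None, "no conditioning source available"
    return "phoneme_ids"


# Setup two before the workaround: nothing is provided, so it fails.
try:
    pick_conditioning()
except AssertionError as err:
    print("fails:", err)

# Setup two after the workaround: the attached wav2vec model is used.
print(pick_conditioning(wav2vec=object()))
```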

I am also waiting for that result now. If you have any advice or feedback on these approaches, that would be wonderful! It would also be great if you could share more details about your training setup.

wassimseif Jan 18, 2024

Also, @lucasnewman, I'm super interested in how you were able to fit a batch size of 112 on one A100. Any explanation would be much appreciated :)

ex3ndr Feb 2, 2024

@wassimseif I think if you cast your data to float16/bfloat16, a batch size of 112 would fit on a single A100.
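
A back-of-the-envelope sketch of why half precision helps: each element takes 2 bytes instead of 4, so activations of the same shape need half the memory. The batch size (112) and embedding width (512) come from this thread; the 500-frame sequence length is a made-up placeholder, and real memory use also depends on parameters, optimizer state, and attention buffers. In PyTorch, mixed precision is commonly enabled with `torch.autocast`.

```python
BYTES_PER_ELEMENT = {"float32": 4, "float16": 2, "bfloat16": 2}


def tensor_megabytes(shape, dtype):
    """Memory in MiB for one dense tensor of the given shape and dtype."""
    num_elements = 1
    for dim in shape:
        num_elements *= dim
    return num_elements * BYTES_PER_ELEMENT[dtype] / 2**20


# (batch, frames, embed); 500 frames is a hypothetical sequence length
activation_shape = (112, 500, 512)
for dtype in ("float32", "bfloat16"):
    print(dtype, tensor_megabytes(activation_shape, dtype), "MiB per activation")
```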
