
Mel model #44

Open
lixuyuan102 opened this issue Feb 2, 2024 · 10 comments

Comments

@lixuyuan102

May I ask whether this implementation has been tested on Mel spectrograms? I used a Transformer with only a convolutional positional-encoding layer at the beginning, and I got discontinuous generation results.

@lucidrains
Owner

no, it is not well tested for mel

always welcome contributions

@ex3ndr

ex3ndr commented Feb 2, 2024

I am working on a Mel version in my reimplementation. AMA!

@atmbb

atmbb commented Feb 15, 2024

@ex3ndr
Does your model work well on zero-shot tasks?

@ex3ndr

ex3ndr commented Feb 15, 2024

@atmbb which task?

@atmbb

atmbb commented Feb 15, 2024

@ex3ndr
Thanks for the reply.
I meant style transfer as in Figure 4 of the paper (the zero-shot TTS task).

Diverse sampling works well.
But in the style-transfer task, the generated speech does not follow the prompt's speaking style in my model.

@ex3ndr

ex3ndr commented Feb 15, 2024

@atmbb I just restarted training my model from scratch. I am now at step 27651, training on just two 4090s with a batch size of 16 * 8 per GPU (quite small compared to the original paper). It somewhat follows the prompt, but it is too early to tell.

In my previous run I trained for 400k steps, and it followed prompts correctly.

@ex3ndr

ex3ndr commented Feb 15, 2024

@atmbb I remembered one thing: ALiBi requires longer training sequences than are easily available. I trained on segments of at most 5 seconds, and the audio style collapsed after ~5 seconds. I saw the same problem where the audio was conditioned well for a few seconds and then collapsed; I then figured out that with longer conditioning audio I had fewer "valid" seconds. ALiBi starts to work at around 300k iterations in my case, but longer-context training is still required.

The funny thing is that they mention degradation with longer conditioning, and they saw degradation start at ~15 seconds, which is exactly the authors' training context length.

Looking at the ALiBi coefficients, I think it needs a training context of ~2k positions or more to generalize well. No one should expect ALiBi to generalize after just 500 positions (~5 seconds): the coefficients are just not steep enough to vanish.
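A quick sketch of the point about slope steepness, using the geometric slope schedule from the ALiBi paper; the head count and distances here are illustrative, not values from this thread:

```python
# Illustrative sketch: ALiBi head slopes and the additive attention bias
# they produce at different distances, showing why the shallowest heads
# barely attenuate anything within a ~500-frame (~5 s) training context.
import math

def alibi_slopes(n_heads: int):
    """Slopes 2^(-8/n), 2^(-16/n), ... (n assumed to be a power of 2)."""
    return [2 ** (-8 * (i + 1) / n_heads) for i in range(n_heads)]

slopes = alibi_slopes(8)
shallowest = min(slopes)  # 2^-8 for 8 heads
for dist in (500, 2000):
    bias = -shallowest * dist  # additive bias applied before the softmax
    print(f"distance {dist}: bias {bias:.3f}, exp(bias) {math.exp(bias):.4f}")
```

At distance 500 the shallowest head's bias is only about -1.95 (exp ≈ 0.14, far from vanished), while at 2000 it is about -7.8 (exp ≈ 0.0004), consistent with the observation that ~2k-position contexts are needed before these coefficients meaningfully suppress distant positions.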

@lixuyuan102
Author

Using a different backbone network for the ODE model than the one in the Voicebox paper (a Transformer with only convolutional positional encoding), I have achieved good zero-shot performance. However, a multi-layer Transformer with a single convolutional positional-encoding layer still does not work on Mel in my experiments. I speculate that the original paper may have used multiple convolutional positional-encoding layers before the Transformer module. I'll try to contribute the code that worked well on Mel.
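The speculation above (a stack of convolutional positional-encoding layers before the Transformer) could be sketched roughly as follows in PyTorch; the module name, kernel size, depth, and group count are all illustrative assumptions, not details taken from the Voicebox paper:

```python
# Hypothetical sketch: depthwise-ish convolutional positional encoding
# stacked `depth` times and added residually to the input, prior to
# the Transformer blocks. All hyperparameters are illustrative.
import torch
import torch.nn as nn

class ConvPositionEmbed(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 31, depth: int = 2):
        super().__init__()
        assert kernel_size % 2 == 1  # odd kernel keeps sequence length
        self.layers = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size,
                          padding=kernel_size // 2, groups=dim // 4),
                nn.GELU(),
            )
            for _ in range(depth)
        ])

    def forward(self, x):  # x: (batch, time, dim)
        pos = self.layers(x.transpose(1, 2)).transpose(1, 2)
        return x + pos     # residual add of positional features

x = torch.randn(1, 100, 64)
print(ConvPositionEmbed(64)(x).shape)  # torch.Size([1, 100, 64])
```

Whether one or several such layers are used is exactly the open question in the comment above; this sketch just makes the "multiple layers before the Transformer" variant concrete via the `depth` parameter.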

@ex3ndr

ex3ndr commented Feb 15, 2024

zero-shot.zip
Trained for 50k steps, and performance is now reasonable. Style is followed as it should be. (Quality is still not great; it needs a few more days of training.)

@ex3ndr

ex3ndr commented Mar 21, 2024

I have published beta: https://github.com/ex3ndr/supervoice
It collapses on long sentences, some voices are distorted (I bet it is just undertrained), and my GPT phonemizer network doesn't support numbers yet.
