Paper | Project page | Code
Given raw audio as input, we optimize our model to predict future samples from the current signal context.
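Concretely, this objective can be sketched as a contrastive loss: a step-specific transformation of each context vector predicts a future latent, and the model must tell the true future latent apart from distractors sampled from the same sequence. The sketch below is a simplification under assumed names (`contrastive_future_loss`, `step_proj`, `num_negatives`); the paper predicts several steps ahead and weights the negative term, which is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_future_loss(z: torch.Tensor, c: torch.Tensor,
                            step_proj: nn.Linear, k: int = 1,
                            num_negatives: int = 10) -> torch.Tensor:
    """Contrastive future prediction: distinguish the true latent z_{t+k}
    from negatives sampled from the same sequence, given context c_t.

    z: (batch, time, dim) encoder outputs
    c: (batch, time, dim) context network outputs
    step_proj: step-specific affine transform applied to c_t
    """
    B, T, D = z.shape
    preds = step_proj(c[:, : T - k])           # predictions for step k  (B, T-k, D)
    targets = z[:, k:]                         # true future latents     (B, T-k, D)

    # Positive logits: agreement between prediction and the true future latent.
    pos = (preds * targets).sum(-1)            # (B, T-k)

    # Negatives: latents drawn uniformly from the same sequence.
    neg_idx = torch.randint(0, T, (B, T - k, num_negatives), device=z.device)
    negs = torch.gather(
        z.unsqueeze(2).expand(B, T, num_negatives, D), 1,
        neg_idx.unsqueeze(-1).expand(-1, -1, -1, D))
    neg = (preds.unsqueeze(2) * negs).sum(-1)  # (B, T-k, num_negatives)

    # Binary cross-entropy: true future vs. distractors (the paper sums this
    # loss over several step sizes k and weights the negative term).
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos))
            + F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))
```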
Our model takes the raw audio signal as input and applies two convolutional networks. The encoder network embeds the audio signal in a low-frequency latent feature space, with each frame covering about 30 ms of audio at a 10 ms stride, and the context network combines multiple encoder time-steps (210 ms in total) into contextualized representations (Figure 1). Both networks are then used to compute the objective function.
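A minimal PyTorch sketch of the two networks follows. The class names are ours; the layer configuration (five encoder layers, nine kernel-3 context layers, 512 channels, 16 kHz input) follows the setup reported in the paper, while normalization and other training details are omitted:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a raw 16 kHz waveform to latent frames: each frame sees ~30 ms
    of signal, and frames are emitted every 10 ms (stride 160 samples)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        layers, in_ch = [], 1
        for kernel, stride in [(10, 5), (8, 4), (4, 2), (4, 2), (4, 2)]:
            layers += [nn.Conv1d(in_ch, dim, kernel, stride), nn.ReLU()]
            in_ch = dim
        self.net = nn.Sequential(*layers)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:  # (B, samples)
        return self.net(wav.unsqueeze(1))                  # (B, dim, frames)

class ContextNetwork(nn.Module):
    """Mixes multiple encoder frames into one contextualized representation:
    nine causal kernel-3 convolutions span 19 frames, i.e. ~210 ms."""
    def __init__(self, dim: int = 512, num_layers: int = 9):
        super().__init__()
        blocks = []
        for _ in range(num_layers):
            # Left-only padding keeps the network causal, so c_t depends only
            # on current and past frames, as future prediction requires.
            blocks += [nn.ConstantPad1d((2, 0), 0.0),
                       nn.Conv1d(dim, dim, kernel_size=3), nn.ReLU()]
        self.net = nn.Sequential(*blocks)

    def forward(self, z: torch.Tensor) -> torch.Tensor:    # (B, dim, frames)
        return self.net(z)
```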
After training, we feed the representations produced by the context network to the acoustic model in place of log-mel filterbank features.
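Continuing the hypothetical classes above (the released code exposes its own model and checkpoint-loading API, not shown here), feature extraction might look like:

```python
# Extract contextualized features for a downstream acoustic model.
encoder, context = Encoder(), ContextNetwork()
# ... load pretrained weights here ...
encoder.eval()
context.eval()

wav = torch.randn(1, 16000)   # one second of 16 kHz audio (placeholder input)
with torch.no_grad():
    z = encoder(wav)          # latent frames,            (1, 512, 98)
    c = context(z)            # contextualized features,  (1, 512, 98)
# c replaces log-mel filterbank features as the acoustic model's input
```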