
# wav2vec: Unsupervised Pre-training for Speech Recognition

Paper | Project page | Code

*Figure 1: wav2vec architecture.*

Given an audio signal as input, we optimize our model to predict future samples from a given signal context.

Our model takes a raw audio signal as input and applies two convolutional networks. The encoder network embeds the audio signal in a latent space, mapping roughly 30 ms of audio to each low-frequency feature vector with a stride of 10 ms, and the context network combines multiple time steps of encoder output (a receptive field of about 210 ms in total) to obtain contextualized representations (Figure 1). Both networks are then used to compute the objective function.
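As a rough illustration, here is a minimal PyTorch sketch of the two networks using the layer sizes reported in the paper (five encoder layers with kernels 10, 8, 4, 4, 4 and strides 5, 4, 2, 2, 2; nine context layers with kernel 3 and stride 1). The class and helper names are ours, and details such as normalization and padding are simplified relative to the released code.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, kernel, stride, padding=0):
    # conv -> normalization -> nonlinearity, one block of either network
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, kernel, stride=stride, padding=padding),
        nn.GroupNorm(1, c_out),
        nn.ReLU(),
    )

class Wav2VecSketch(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Encoder f: X -> Z; ~30 ms receptive field, 10 ms stride at 16 kHz.
        kernels, strides = [10, 8, 4, 4, 4], [5, 4, 2, 2, 2]
        blocks, c_in = [], 1
        for k, s in zip(kernels, strides):
            blocks.append(conv_block(c_in, dim, k, s))
            c_in = dim
        self.encoder = nn.Sequential(*blocks)
        # Context network g: Z -> C; nine kernel-3 layers widen the
        # receptive field to roughly 210 ms of raw audio.
        self.context = nn.Sequential(
            *[conv_block(dim, dim, 3, 1, padding=1) for _ in range(9)]
        )

    def forward(self, wav):        # wav: (batch, 1, samples)
        z = self.encoder(wav)      # latent representations z_t
        c = self.context(z)        # contextualized representations c_t
        return z, c

z, c = Wav2VecSketch()(torch.randn(1, 1, 16000))
print(z.shape, c.shape)  # both (1, 512, ~100) for one second of audio
```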

*Figure: wav2vec objective.*
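Concretely, the objective is a contrastive loss: for each step size $k = 1, \dots, K$, the model must distinguish the true future latent $\mathbf{z}_{i+k}$ from distractor samples $\tilde{\mathbf{z}}$ drawn from a proposal distribution $p_n$:

$$
\mathcal{L}_k = -\sum_{i=1}^{T-k} \Big( \log \sigma\big(\mathbf{z}_{i+k}^{\top} h_k(\mathbf{c}_i)\big) + \lambda\, \mathbb{E}_{\tilde{\mathbf{z}} \sim p_n} \big[ \log \sigma\big(-\tilde{\mathbf{z}}^{\top} h_k(\mathbf{c}_i)\big) \big] \Big),
$$

where $\sigma$ is the sigmoid, $h_k(\mathbf{c}_i) = W_k \mathbf{c}_i + \mathbf{b}_k$ is a step-specific affine projection of the context vector, and $\lambda$ weights the negative term. The total loss sums over the step sizes, $\mathcal{L} = \sum_{k=1}^{K} \mathcal{L}_k$.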

After training, we feed the representations produced by the context network to the acoustic model in place of log-mel filterbank features.
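For reference, the released fairseq checkpoints can be used roughly as follows to extract these features; the exact API may differ across fairseq versions, and the checkpoint path is a placeholder.

```python
import torch
from fairseq.models.wav2vec import Wav2VecModel

# Load a released pre-trained checkpoint (path is a placeholder).
cp = torch.load('/path/to/wav2vec_large.pt')
model = Wav2VecModel.build_model(cp['args'], task=None)
model.load_state_dict(cp['model'])
model.eval()

wav = torch.randn(1, 16000)          # one second of 16 kHz audio
z = model.feature_extractor(wav)     # encoder output z_t
c = model.feature_aggregator(z)      # context output c_t, fed to the
                                     # acoustic model instead of filterbanks
```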