Paper | Project page | Code
Given raw audio as input, we optimize our model to predict future samples from the current signal context.
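Concretely, this objective can be sketched as a contrastive loss: a step-specific transformation of each context vector predicts a future latent, and the model must tell the true future latent apart from distractors sampled from the same sequence. The sketch below is a simplification under assumed names (`contrastive_future_loss`, `step_proj`, `num_negatives`); the paper predicts several steps ahead and weights the negative term, which is omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def contrastive_future_loss(z: torch.Tensor, c: torch.Tensor,
                            step_proj: nn.Linear, k: int = 1,
                            num_negatives: int = 10) -> torch.Tensor:
    """Contrastive future prediction: distinguish the true latent z_{t+k}
    from negatives sampled from the same sequence, given context c_t.

    z: (batch, time, dim) encoder outputs
    c: (batch, time, dim) context network outputs
    step_proj: step-specific affine transform applied to c_t
    """
    B, T, D = z.shape
    preds = step_proj(c[:, : T - k])           # predictions for step k  (B, T-k, D)
    targets = z[:, k:]                         # true future latents     (B, T-k, D)

    # Positive logits: agreement between prediction and the true future latent.
    pos = (preds * targets).sum(-1)            # (B, T-k)

    # Negatives: latents drawn uniformly from the same sequence.
    neg_idx = torch.randint(0, T, (B, T - k, num_negatives), device=z.device)
    negs = torch.gather(
        z.unsqueeze(2).expand(B, T, num_negatives, D), 1,
        neg_idx.unsqueeze(-1).expand(-1, -1, -1, D))
    neg = (preds.unsqueeze(2) * negs).sum(-1)  # (B, T-k, num_negatives)

    # Binary cross-entropy: true future vs. distractors (the paper sums this
    # loss over several step sizes k and weights the negative term).
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos))
            + F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))
```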
Our model takes the raw audio signal as input and applies two convolutional networks. The encoder network embeds the audio signal in a low-frequency latent feature space, with each frame covering about 30 ms of audio at a 10 ms stride, and the context network combines multiple encoder time-steps (210 ms in total) into contextualized representations (Figure 1). Both networks are then used to compute the objective function.
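A minimal PyTorch sketch of the two networks follows. The class names are ours; the layer configuration (five encoder layers, nine kernel-3 context layers, 512 channels, 16 kHz input) follows the setup reported in the paper, while normalization and other training details are omitted:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a raw 16 kHz waveform to latent frames: each frame sees ~30 ms
    of signal, and frames are emitted every 10 ms (stride 160 samples)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        layers, in_ch = [], 1
        for kernel, stride in [(10, 5), (8, 4), (4, 2), (4, 2), (4, 2)]:
            layers += [nn.Conv1d(in_ch, dim, kernel, stride), nn.ReLU()]
            in_ch = dim
        self.net = nn.Sequential(*layers)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:  # (B, samples)
        return self.net(wav.unsqueeze(1))                  # (B, dim, frames)

class ContextNetwork(nn.Module):
    """Mixes multiple encoder frames into one contextualized representation:
    nine causal kernel-3 convolutions span 19 frames, i.e. ~210 ms."""
    def __init__(self, dim: int = 512, num_layers: int = 9):
        super().__init__()
        blocks = []
        for _ in range(num_layers):
            # Left-only padding keeps the network causal, so c_t depends only
            # on current and past frames, as future prediction requires.
            blocks += [nn.ConstantPad1d((2, 0), 0.0),
                       nn.Conv1d(dim, dim, kernel_size=3), nn.ReLU()]
        self.net = nn.Sequential(*blocks)

    def forward(self, z: torch.Tensor) -> torch.Tensor:    # (B, dim, frames)
        return self.net(z)
```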
After training, we feed the representations produced by the context network to the acoustic model in place of log-mel filterbank features.
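Continuing the hypothetical classes above (the released code exposes its own model and checkpoint-loading API, not shown here), feature extraction might look like:

```python
# Extract contextualized features for a downstream acoustic model.
encoder, context = Encoder(), ContextNetwork()
# ... load pretrained weights here ...
encoder.eval()
context.eval()

wav = torch.randn(1, 16000)   # one second of 16 kHz audio (placeholder input)
with torch.no_grad():
    z = encoder(wav)          # latent frames,            (1, 512, 98)
    c = context(z)            # contextualized features,  (1, 512, 98)
# c replaces log-mel filterbank features as the acoustic model's input
```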