Skip to content

Speech Emotion Classification with novel Parallel CNN-Transformer model built with PyTorch, plus thorough explanations of CNNs, Transformers, and everything in between

License

Notifications You must be signed in to change notification settings

Jon-Zbw/transformer-cnn-emotion-recognition

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Parallel is All You Want: Combining Spatial and Temporal Feature Representions of Speech Emotion by Parallelizing CNNs and Transformer-Encoders

80.44% Accuracy

Abstract

In this notebook, I'm going to build upon my Intro to Speech Audio Classification repo and build two parallel convolutional neural networks (CNN) in parallel with a Transformer encoder network to classify audio data. We're working on the RAVDESS dataset to classify emotions from one of 8 classes. We combine the CNN for spatial feature representation and the Transformer for temporal feature representation. We augment the training data by increasing variation in the dataset to reduce overfitting; we use Additive White Gaussian Noise (AWGN) to augment the RAVDESS dataset three-fold for a total of 4320 audio samples.

We harness the image-classification and spatial feature representation power of the CNN by treating mel spectrograms as grayscale images; their width is a time scale, their height is a frequency scale. The value of each pixel in the mel spectrogram is the intensity of the audio signal at a particular mel frequency at a time step.

Because of the sequential nature of the data, we will also use the Transformer to try and model as accurately as possible the temporal relationships between pitch transitions in emotions.

This notebook takes inspirations from a variety of recent advances in deep learning and network architectures; in particular, stacked and parallel CNN networks combined with multi-head self-attention layers from the Transformer Encoder. I hypothesize that the expansion of CNN filter channel dimensions and reduction of feature maps will provide the most expressive feature representation at the lowest computaitonal cost, while the Transformer-Encoder is used with the hypothesis that the network will learn to predict frequency distributions of different emotions according to the global structure of the mel spectrogram of each emotion. With the strength of the CNN in spatial feature representation and Transformer in sequence encoding, I manage to achieve a 80.44% accuracy on a hold-out test set from the RAVDESS dataset.

Table of Contents

Appendices

References

Notable mentions:

Cite

If you find this work useful in your own research, please cite as follows:

@misc{Zenkov-Transformer-CNN-SER,
  author = {Zenkov, Ilia},
  title = {transformer-cnn-emotion-recognition},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/IliaZenkov/transformer-cnn-emotion-recognition}},
}

Licence

License: MIT

About

Speech Emotion Classification with novel Parallel CNN-Transformer model built with PyTorch, plus thorough explanations of CNNs, Transformers, and everything in between

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 86.4%
  • Python 13.6%