Strong Generalized Speech Emotion Recognition Based on Effective Data Augmentation.
Huawei Tao, Shuai Shan, Ziyi Hu, Chunhua Zhu, Hongyi Ge. Published in: Entropy (Basel, Switzerland) (2022)
The scarcity of labeled samples limits the development of speech emotion recognition (SER). Data augmentation is an effective way to address sample sparsity, yet data augmentation algorithms have received little study in the field of SER. This paper analyzes the effectiveness of classical acoustic data augmentation methods in SER and, based on that analysis, proposes a strongly generalized SER model built on effective data augmentation. The model uses a multi-channel feature extractor consisting of multiple sub-networks to extract emotional representations. Each kind of augmented data found to improve SER performance is fed into a sub-network, and the emotional representation is obtained by weighted fusion of the sub-networks' output feature maps. To make the model robust to unseen speakers, adversarial training is employed to generalize the emotional representations: a discriminator estimates the Wasserstein distance between the feature distributions of different speakers, and the feature extractor is trained adversarially against it so that it learns speaker-invariant emotional representations. Simulation experiments on the IEMOCAP corpus show that the proposed method outperforms related SER algorithms by 2-9%, demonstrating its effectiveness.
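The two mechanisms named in the abstract, weighted fusion of sub-network feature maps and a discriminator-based Wasserstein estimate between speaker feature distributions, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The softmax normalization of the fusion weights, the function names, and the use of precomputed critic scores are all assumptions made for illustration; the abstract does not specify how the weights are parameterized or how the discriminator is built.

```python
import numpy as np


def fuse_feature_maps(feature_maps, logits):
    """Weighted sum of same-shaped feature maps, one per sub-network.

    Assumption: the fusion weights come from a softmax over learnable
    logits, so they are positive and sum to 1.
    """
    maps = np.stack(feature_maps, axis=0)            # shape (K, ...)
    logits = np.asarray(logits, dtype=float)
    weights = np.exp(logits - np.max(logits))        # numerically stable softmax
    weights = weights / weights.sum()
    fused = np.tensordot(weights, maps, axes=1)      # contract over the K axis
    return fused, weights


def wasserstein_estimate(critic_scores_a, critic_scores_b):
    """Critic-based estimate of the Wasserstein-1 distance.

    Assumption: the scores come from a (hypothetical) 1-Lipschitz critic
    applied to features of two speaker groups, as in WGAN-style training.
    The critic would be trained to maximize this value, and the feature
    extractor to minimize it.
    """
    return float(np.mean(critic_scores_a) - np.mean(critic_scores_b))


# Toy usage: three sub-networks, each producing a 4x8 feature map.
rng = np.random.default_rng(0)
toy_maps = [rng.normal(size=(4, 8)) for _ in range(3)]
fused, weights = fuse_feature_maps(toy_maps, logits=[0.0, 0.0, 0.0])
gap = wasserstein_estimate([1.0, 2.0, 3.0], [0.0, 1.0, 2.0])
```

With equal logits the fusion reduces to a plain average of the feature maps; in a trained model the logits would be learned jointly with the sub-networks.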
Keyphrases
- machine learning
- electronic health record
- big data
- deep learning
- autism spectrum disorder
- working memory
- randomized controlled trial
- depressive symptoms
- systematic review
- computed tomography
- magnetic resonance imaging
- magnetic resonance
- oxidative stress
- single molecule
- borderline personality disorder
- neural network