Paper reading: Hierarchical sparse coding framework for speech emotion recognition

Introduction

This paper was published in Speech Communication (impact factor 1.768) in 2018.
For the feature representation stage of speech emotion recognition, there are generally two approaches:

Hand-crafted feature encoding, e.g. AVEC. (I am not sure what this is.)
Automatically learned features.

The paper adopts a sparse coding framework and builds a hierarchical sparse coding (HSC) scheme on top of it. This is the paper's contribution.

Automatic speech recognition (ASR) systems have a relatively high error rate.

Features commonly used for SER (speech emotion recognition) include:

pitch
energy
rhythm
spectral coefficients
statistical variations, such as mean, median and skewness

(I am not sure of the standard Chinese translations for any of the terms above.)
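As an illustration of the statistical variations listed above, here is a minimal sketch that summarises a frame-level contour (for example pitch) with mean, median and skewness. The function names and the sample pitch values are hypothetical, and the paper may compute its functionals differently.

```python
import statistics

def skewness(values):
    """Population skewness: third central moment divided by variance^1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # a constant contour has no asymmetry
    return m3 / m2 ** 1.5

def functionals(contour):
    """Collapse a variable-length contour into a fixed set of statistics."""
    return {
        "mean": statistics.mean(contour),
        "median": statistics.median(contour),
        "skewness": skewness(contour),
    }

# Hypothetical pitch contour in Hz, one value per frame.
pitch_hz = [110.0, 112.0, 115.0, 111.0, 180.0]
print(functionals(pitch_hz))
```

Statistics like these turn an utterance of any length into a fixed-size descriptor, which is what a later regression or coding stage needs.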

The paper additionally wants to include features related to the human auditory system, such as:

loudness
accents
harmonicity
timbre texture
voice quality

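A brief introduction to sparse coding. Put simply, a signal x is represented with a set of vectors, and only a small number of them are actually used to represent x (while the set itself is very large). The underlying principle is that "the vast majority of sensory data, such as natural images, can be represented as a superposition of a small number of basic elements"; hence the name sparse coding.
The HSC pipeline is shown in the flowchart:
[Figure: HSC flowchart]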
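To make the idea of sparse coding concrete, here is a minimal sketch that represents a vector with only a few atoms from an overcomplete dictionary. It uses greedy matching pursuit purely for illustration; this is an assumption and not necessarily the solver used in the paper, and the toy dictionary `D` is made up.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matching_pursuit(x, dictionary, n_nonzero):
    """Greedily approximate x with at most n_nonzero unit-norm atoms."""
    residual = list(x)
    coeffs = [0.0] * len(dictionary)
    for _ in range(n_nonzero):
        # Pick the atom most correlated with what is still unexplained.
        scores = [dot(residual, atom) for atom in dictionary]
        k = max(range(len(dictionary)), key=lambda i: abs(scores[i]))
        if abs(scores[k]) < 1e-12:
            break  # nothing left to explain
        coeffs[k] += scores[k]
        residual = [r - scores[k] * a for r, a in zip(residual, dictionary[k])]
    return coeffs, residual

# Overcomplete toy dictionary: 6 unit-norm atoms in a 3-dimensional space.
s = 1 / math.sqrt(2)
D = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [s, s, 0.0],
    [s, 0.0, s],
    [0.0, s, s],
]
x = [2.0, 2.0, 0.0]  # a multiple of the atom at index 3
coeffs, residual = matching_pursuit(x, D, n_nonzero=2)
active = [i for i, c in enumerate(coeffs) if abs(c) > 1e-9]
print("active atoms:", active)
```

Although the dictionary has more atoms than the signal has dimensions, only one coefficient ends up non-zero here, which is the sense in which the code is sparse.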

Global descriptor extraction layer

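There are two main approaches to extracting perception-related features: (1) analyse the audio to identify the various perceptual factors in it; (2) analyse audio in which the perceptual characteristics are pronounced.
Commonly used features are: MFCC (Mel-Frequency Cepstral Coefficients; the most widely used, yet the only one not explained in detail, presumably because the authors consider it too common to need explanation), LFPC (Log Frequency Power Coefficients), LPCC (Linear Prediction Cepstral Coefficients; shown to work well for Chinese), HFCC (Human Factor Cepstral Coefficients), LFCC (Linear Frequency Cepstral Coefficients), PLP (Perceptual Linear Prediction), MELBS (Mel Bank Spectral Coefficients), and LFBS (Linear Frequency Bank Spectral Coefficients).
Sound perception is generally described along four dimensions: pitch (the position of the signal in the spectrum, i.e. the melody of speech), loudness, phonetic identification, and voice quality (timbre: what lets a listener tell two sounds apart when pitch and loudness are the same. Timbre is also related to emotion; a gentle voice and a sharp voice are considered to have different timbres).
The F200 feature set is shown in the figure.
[Figure: F200 feature set]
The FPH features are shown in the figure below: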

In total, the authors arrive at 200 + 43 = 243 features, which are extracted in the first layer (feature extraction).

Sparse coding layer

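This section presents the model used in the paper. The methods involved are ODL (online dictionary learning) and random dictionary construction, which are used to build a codebook, together with an algorithm and a soft-thresholding method. In the end, each input audio signal is converted into a linear combination of p features.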
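The combination of a random dictionary and soft thresholding mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the Gaussian sampling of atoms, the threshold `lam` and the choice of 8 atoms are all hypothetical, and the paper's actual encoder (and its ODL-trained codebook) may differ.

```python
import random

def soft_threshold(v, lam):
    """Shrink v toward zero by lam; anything with |v| <= lam becomes 0."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def random_dictionary(n_atoms, dim, seed=0):
    """Build a codebook of unit-norm atoms with Gaussian random entries."""
    rng = random.Random(seed)
    atoms = []
    for _ in range(n_atoms):
        atom = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = sum(a * a for a in atom) ** 0.5
        atoms.append([a / norm for a in atom])
    return atoms

def encode(x, dictionary, lam):
    """Project x onto every atom, then soft-threshold each response."""
    return [
        soft_threshold(sum(a * b for a, b in zip(atom, x)), lam)
        for atom in dictionary
    ]

# Hypothetical sizes: a 243-dimensional descriptor and 8 codebook atoms.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(243)]
D = random_dictionary(n_atoms=8, dim=243)
code = encode(x, D, lam=0.5)
print(sum(1 for c in code if c != 0.0), "of", len(code), "coefficients are non-zero")
```

Raising `lam` zeroes out more coefficients, so the threshold directly controls how sparse the resulting code is.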

Experiments

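Datasets: the VAM-Audio database and the AVEC2012 database.
Evaluation metric: Pearson's correlation coefficient (CC).
The final results are shown in the figure below. With a well-chosen feature set, performance on valence and arousal is good and surpasses the strong existing method listed in the fourth row.
[Figure: experimental results]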
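For reference, the Pearson's correlation coefficient used as the evaluation metric can be computed as in the sketch below. The function name and the sample values are hypothetical and are not results from the paper.

```python
import math

def pearson_cc(pred, gold):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(pred)
    if n != len(gold) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0 or var_g == 0:
        raise ValueError("CC is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_g)

# Hypothetical predicted vs. annotated arousal values.
predicted = [0.1, 0.4, 0.35, 0.8]
annotated = [0.0, 0.5, 0.3, 0.9]
print(round(pearson_cc(predicted, annotated), 3))
```

CC ranges from -1 to 1 and measures how well predictions follow the trend of the annotations, regardless of their scale or offset.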
