"Depression Assessment by Fusing High and Low Level Features from Audio, Video, and Text" 论文笔记
The Introduction and Related Work sections are skipped.
Feature extraction
The current DAIC-WOZ dataset [10] includes audio recordings, audio features, interview transcripts, video features, pixel coordinates for 68 2D facial landmarks, world coordinates for 68 3D facial landmarks, gaze vector, head-pose vector, Histogram of Oriented Gradients (HOG) features, emotion, and Action Unit (AU) labels. (The paper works with the features provided by the DAIC-WOZ dataset.)
High level features are those which can be translated to common sense knowledge; for instance head motion, blinking, facial expressions, AUs, and text related features can be annotated with a high degree of certainty by a human expert.
Low level features, on the other hand, are derived from image processing algorithms, which extract descriptors from an image but cannot be directly translated to human knowledge. The features of this kind extracted here are the Landmark Motion History Images combined with LBP and HOG, Landmark Motion Magnitude, and most of the audio-based features.
Visual features
Landmark motion history images
I did not fully understand this part, so here is a rough summary.
Landmark Motion History Images (LMHI) are computed on the provided 2D landmark coordinates, without the actual intensity values of the video frames. The LMHI encodes the motion of facial features into a grayscale image: the most recent motion maps to white pixels, the earliest motion to the darkest gray, and motion in between to intermediate gray values. It extends the optical-flow work of Ptucha and Savakis [23]. The facial landmarks used for the LMHI are divided into four groups: eyebrows, eyes, nose, and mouth. Before computation, the facial coordinates are calibrated with an affine transformation and normalized by aligning the contour points.
The gray value is defined as: gray value of frame i = (maximum pixel value 255 / total number of frames) × i.
The final LMHI is the average of the computed LMHIs.
LBP and HOG features are then computed on top of the LMHI.
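A minimal sketch of this idea in Python; the `(T, 68, 2)` input layout, canvas size, and coordinate normalization are my assumptions, not the paper's exact procedure:

```python
import numpy as np

def landmark_motion_history_image(landmarks, size=(128, 128)):
    """Minimal LMHI sketch: encode 2D landmark motion over time as a
    grayscale image. `landmarks` is assumed to be a (T, N, 2) array of
    already-aligned 2D coordinates (one row of N landmarks per frame)."""
    T = len(landmarks)
    mhi = np.zeros(size, dtype=np.uint8)
    # Map all coordinates into the canvas.
    pts = landmarks.reshape(-1, 2)
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    scale = (np.array(size[::-1]) - 1) / np.maximum(maxs - mins, 1e-6)
    for i, frame in enumerate(landmarks):
        gray = int(255 * (i + 1) / T)       # gray value = 255 / T * frame index
        xy = ((frame - mins) * scale).astype(int)
        mhi[xy[:, 1], xy[:, 0]] = gray      # newer motion overwrites older
    return mhi
```

LBP and HOG descriptors can then be computed on the returned image, e.g. with `skimage.feature.local_binary_pattern` and `skimage.feature.hog`.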
For a deeper dive, it is probably worth reading the work of Ptucha and Savakis [23] to understand the optical-flow-based analysis.
Landmark motion magnitude
Landmark Motion Magnitude (LMM) was applied here to the 2D landmarks. For the extraction of the landmark activity features, the vectors that displace each landmark from one frame to the next were calculated based on the landmark coordinates. From these vectors, unwanted global head motion was removed by subtracting the average motion vector of a landmark group representing the nose.
The maximum magnitude from each of the following 5 landmark groups was selected for each frame: right eyebrow {18-22}, left eyebrow {23-27}, mouth center {51-53, 62-64, 66-68, 57-59}, right mouth corner {49, 61}, and left mouth corner {55, 65}.
The five LMM time-signals, corresponding to the five regions defined above, were used for statistical and spectral feature extraction.
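A sketch of the signal construction, assuming a `(T, 68, 2)` landmark array; the nose indices used to cancel head motion are an assumption, since the paper only says "a landmark group representing the nose":

```python
import numpy as np

# 0-based index groups for the 68-landmark scheme (the paper's lists are 1-based).
NOSE = np.arange(27, 36)                                  # assumed nose group
GROUPS = {
    "right_eyebrow":      np.arange(17, 22),              # {18-22}
    "left_eyebrow":       np.arange(22, 27),              # {23-27}
    "mouth_center":       np.array([50, 51, 52, 56, 57, 58,
                                    61, 62, 63, 65, 66, 67]),
    "right_mouth_corner": np.array([48, 60]),             # {49, 61}
    "left_mouth_corner":  np.array([54, 64]),             # {55, 65}
}

def lmm_signals(landmarks):
    """Return the five LMM time-signals from a (T, 68, 2) landmark array."""
    disp = np.diff(landmarks, axis=0)                     # frame-to-frame vectors
    head = disp[:, NOSE].mean(axis=1, keepdims=True)      # global head motion
    disp = disp - head                                    # remove head motion
    mag = np.linalg.norm(disp, axis=-1)                   # per-landmark magnitude
    return {name: mag[:, idx].max(axis=1) for name, idx in GROUPS.items()}
```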
The resulting features in total are as follows:

| Feature | Description |
| --- | --- |
| Variance | The variance of the time intervals between any two subsequent spikes or transients, used as an index of movement, based on the assumption that rhythmic movements would be associated with near-zero variance. |
| Energy ratio | The energy ratio of the autocorrelation sequence, calculated as the ratio of the energy contained in the last 75% of the samples of the autocorrelation sequence to the energy contained in the first 25%, used as a measure of the motion manifested as quasiperiodic spikes (randomness). |
| Median | The median of the signal, based on the P2 algorithm [13]. |
| Standard deviation | The standard deviation of the signal. |
| Interquartile range | The interquartile range (i.e. the difference between the 75th and 25th percentiles), based on the P2 algorithm [13]. |
| Skewness | The skewness of the sample distribution, defined as the ratio of the 3rd central moment to the 3/2-th power of the 2nd central moment of the samples. |
| Kurtosis | The kurtosis of the sample distribution, defined as the ratio of the 4th central moment to the square of the 2nd central moment of the samples, minus 3. |
| Shannon entropy | The Shannon entropy of the energy in bins, H = -Σᵢ xᵢ log(xᵢ), where xᵢ is the normalized energy of the i-th of 10 equally sized consecutive bins taken from the signal (see the original paper for the exact formula). |
| Spectral power frequency | The 25% spectral power frequency, i.e. the upper bound of the frequency band starting at 0 Hz that contains 25% of the total spectral power. |
| Dominant frequency | The dominant frequency, i.e. the signal frequency associated with the highest power. |
| Spectral roll-off | The spectral roll-off, i.e. the frequency below which 80% of the spectral power lies. |
| Spectral centroid | The spectral centroid, i.e. the power-weighted mean frequency of the spectrum. |
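A rough numpy/scipy sketch of several of these descriptors; it uses exact percentiles instead of the streaming P2 algorithm, a natural-log entropy, and an assumed sampling rate `fs`:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def statistical_features(x, bins=10):
    """A few of the statistical descriptors for one LMM time-signal."""
    e = np.array([b.sum() for b in np.array_split(x ** 2, bins)])
    p = e / e.sum()                                   # normalized bin energies
    return {
        "median":   np.median(x),
        "std":      np.std(x),
        "iqr":      np.percentile(x, 75) - np.percentile(x, 25),
        "skewness": skew(x),
        "kurtosis": kurtosis(x),                      # excess kurtosis (minus 3)
        "entropy":  -(p * np.log(p + 1e-12)).sum(),   # Shannon entropy of bins
    }

def spectral_features(x, fs=30.0):
    """Spectral descriptors from the power spectrum of the signal."""
    spec = np.abs(np.fft.rfft(x - x.mean())) ** 2     # power spectrum
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    cum = np.cumsum(spec) / spec.sum()
    return {
        "spf_25":   freqs[np.searchsorted(cum, 0.25)],   # 25% spectral power freq.
        "dominant": freqs[np.argmax(spec)],              # highest-power frequency
        "rolloff":  freqs[np.searchsorted(cum, 0.80)],   # 80% spectral roll-off
        "centroid": (freqs * spec).sum() / spec.sum(),   # power-weighted mean freq.
    }
```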
Head motion
Head motion was computed based on the horizontal and vertical deviations of specific reference points (landmarks 2, 4, 14, 16) between consecutive frames.
Several statistical indices were subsequently derived from the six time series, in the form of the mean, median, and standard deviation of a) velocity and displacement separated on the X and Y axes, and b) velocity and displacement magnitude, resulting in a total of 18 features.
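A compact sketch of these 18 features; averaging the four reference landmarks into a single track and deriving velocity as displacement times the frame rate are assumptions:

```python
import numpy as np

REF = np.array([1, 3, 13, 15])   # landmarks 2, 4, 14, 16 (0-based)

def head_motion_features(landmarks, fs=30.0):
    """18 features: mean/median/std of 6 head-motion time series."""
    ref = landmarks[:, REF].mean(axis=1)          # (T, 2) reference-point track
    disp = np.diff(ref, axis=0)                   # per-frame displacement
    vel = disp * fs                               # velocity (assumed fs frame rate)
    series = [disp[:, 0], disp[:, 1], np.linalg.norm(disp, axis=1),
              vel[:, 0], vel[:, 1], np.linalg.norm(vel, axis=1)]
    feats = []
    for s in series:                              # 6 series x 3 stats = 18
        feats += [s.mean(), np.median(s), s.std()]
    return np.array(feats)
```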
Blinking rate
Blinking rate was extracted using the 2D landmarks in order to segment and mark out the eyeball perimeter.
The area defined by each set of landmarks was computed over the entire recording. The time series were filtered to remove spikes and smooth out highly variable segments. Sharp decreases were considered blinks.
Detection of downward peaks was performed using a gradient peak detection algorithm utilizing the following parameters: minimum peak distance, peak duration, derivative amplitude, and derivative peak distance. The resulting feature was the blink frequency, i.e. the number of blinks per minute.
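A sketch of the blink counter; the eye-area signal is assumed precomputed (e.g. the shoelace area of the six eye landmarks per frame), and the thresholds below are illustrative rather than the paper's tuned parameters:

```python
import numpy as np
from scipy.signal import medfilt, find_peaks

def blink_rate(eye_area, fs=30.0):
    """Blinks per minute from a per-frame eye-area time series."""
    smooth = medfilt(eye_area, kernel_size=5)      # remove spikes
    grad = np.gradient(smooth)
    # Sharp area decreases appear as downward peaks of the derivative.
    peaks, _ = find_peaks(-grad,
                          height=grad.std(),        # derivative amplitude
                          distance=int(0.2 * fs))   # minimum peak distance
    minutes = len(eye_area) / fs / 60.0
    return len(peaks) / minutes
```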
Emotions, AUs, Gaze & Pose
The statistical measures chosen to represent this variability were: minimum, maximum, mean, mode, median, range, mean deviation, variance, standard deviation, skewness, and kurtosis (11 statistics in total). These indices were calculated for each of the 10 prelabeled emotions and for each of 19 AUs, resulting in 110 (10×11) and 209 (19×11) features, respectively. The same set of statistical indices was computed for the provided gaze and pose vectors.
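A sketch of the 11 statistics applied to each channel; computing the mode of a continuous signal requires a convention, so values are rounded first (an assumption):

```python
import numpy as np
from scipy.stats import skew, kurtosis

def eleven_stats(x):
    """min, max, mean, mode, median, range, mean deviation,
    variance, std, skewness, kurtosis for one channel."""
    vals, counts = np.unique(np.round(x, 3), return_counts=True)
    mode = vals[np.argmax(counts)]                 # mode of rounded values
    return np.array([
        x.min(), x.max(), x.mean(), mode, np.median(x),
        x.max() - x.min(),                         # range
        np.abs(x - x.mean()).mean(),               # mean deviation
        x.var(), x.std(), skew(x), kurtosis(x),
    ])
```

Stacking `eleven_stats` over the 10 emotion channels and 19 AU channels would yield the 110- and 209-dimensional feature sets described above.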
Audio and text features (skipped)
Feature selection
The value of individual features was assessed by documenting the effect of removing each feature or set of features on the resulting F1 score on the development set.
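A sketch of this removal test with scikit-learn; the `feature_groups` mapping and the Decision Tree settings are assumptions, not the paper's exact setup:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

def ablation_scores(X_train, y_train, X_dev, y_dev, feature_groups):
    """Drop each feature group in turn and record the development-set F1."""
    scores = {}
    for name, cols in feature_groups.items():
        keep = np.setdiff1d(np.arange(X_train.shape[1]), cols)
        clf = DecisionTreeClassifier(random_state=0)
        clf.fit(X_train[:, keep], y_train)
        scores[name] = f1_score(y_dev, clf.predict(X_dev[:, keep]))
    return scores
```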
Classification
Four different approaches were implemented for classification: Gender-based, Feature-Level Fusion, Decision Fusion, and the Posterior-Probability Model. A Decision Tree classification algorithm was applied in each case.
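A sketch of one of the four approaches, Decision Fusion, with one Decision Tree per modality; the majority-vote fusion rule here is an assumption, since the paper's exact combination scheme is not reproduced:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def decision_fusion(train_modalities, y_train, test_modalities):
    """Train one tree per modality and fuse binary predictions by vote."""
    votes = []
    for X_tr, X_te in zip(train_modalities, test_modalities):
        clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_train)
        votes.append(clf.predict(X_te))
    votes = np.stack(votes)                        # (n_modalities, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int) # majority vote (0/1 labels)
```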
Experimental results and conclusion (skipped)