Sequence discriminative training

DNN acoustic models for speech recognition are usually trained with a cross-entropy (CE) loss. CE can be viewed as a form of KL divergence that measures the gap between the target distribution and the distribution produced by the model being trained, and it is cheap to compute, which is why it is so widely used as a loss function. However, recognition accuracy is evaluated at the sequence level, by WER, CER, or PER, so the frame-level loss and the training goal do not match. For this reason [1] proposed training with sequence discriminative objective functions. Sequence discriminative training incorporates the pronunciation lexicon and the language model during training and discriminates between whole sequences, raising the probability of the correct sentence relative to competing sentences and thereby improving the recognition rate of the trained model. In practice, a CE-trained model is first used to produce alignments [2] and generate lattices, and sequence discriminative training is then run over those lattices [3][8][10][11]. More recently, lattice-free sequence training has also been used [4]. This article focuses on lattice-based sequence discriminative training; the lattice-free variant will be covered in a later article. In what follows we abbreviate sequence discriminative training as sequence training. (Note that sequence-to-sequence training is unrelated to this article: seq2seq refers to the encoder-decoder model from machine translation, now increasingly applied to speech recognition, and has already been covered in detail in an earlier article.)

Currently the most popular sequence training objectives are Maximum Mutual Information (MMI), boosted MMI (bMMI), and state-level minimum Bayes risk (sMBR). The MMI objective is as follows [5]:

$$\mathcal{F}_{\mathrm{MMI}} = \sum_{u} \log P(S_u \mid O_u) = \sum_{u} \log \frac{p(O_u \mid S_u)^{\kappa}\, P(W_u)}{\sum_{W} p(O_u \mid S_W)^{\kappa}\, P(W)}$$

where $O_u$ is the observation sequence of utterance $u$, $S_u$ is the senone sequence corresponding to the reference transcription $W_u$, and $S_W$ is the senone sequence of hypothesis $W$.

The bMMI and sMBR objective functions are as follows [5]:

$$\mathcal{F}_{\mathrm{bMMI}} = \sum_{u} \log \frac{p(O_u \mid S_u)^{\kappa}\, P(W_u)}{\sum_{W} p(O_u \mid S_W)^{\kappa}\, P(W)\, e^{-b\, A(W, W_u)}}$$

$$\mathcal{F}_{\mathrm{sMBR}} = \sum_{u} \frac{\sum_{W} p(O_u \mid S_W)^{\kappa}\, P(W)\, A(W, W_u)}{\sum_{W'} p(O_u \mid S_{W'})^{\kappa}\, P(W')}$$

where $b$ is the boosting factor and $A(W, W_u)$ is the accuracy of hypothesis $W$ against the reference $W_u$ (state-level accuracy in the sMBR case).

Examining the MMI objective: for each utterance, given the reference transcription $W_u$, training computes the probability of the corresponding senone (tied-state) sequence, and the objective sums over all utterances $u$ so that the total log-posterior of the references is maximized.
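To make this concrete, here is a minimal Python sketch (my own illustration, not code from [5]) that evaluates the MMI objective for a single utterance. It approximates the intractable denominator sum over all word sequences with an n-best list of hypotheses; the function name and the example scores are hypothetical:

```python
import math

def mmi_objective(ref, nbest, kappa=0.1):
    """Toy MMI objective for one utterance.

    ref:   (log_p_acoustic, log_p_lm) of the reference transcription W_u
    nbest: list of (log_p_acoustic, log_p_lm) pairs standing in for the
           denominator's sum over all word sequences (include the reference)
    kappa: acoustic scale, typically the inverse of the LM scale
    """
    # Numerator: scaled acoustic log-likelihood plus LM log-probability.
    num = kappa * ref[0] + ref[1]
    # Denominator: log-sum-exp over the competing hypotheses.
    scores = [kappa * la + llm for la, llm in nbest]
    m = max(scores)
    den = m + math.log(sum(math.exp(s - m) for s in scores))
    return num - den  # log P(S_u | O_u); training maximizes this

# Example: the reference scores well acoustically, so its posterior is high.
ref = (-120.0, -8.0)
hyps = [ref, (-150.0, -7.0), (-160.0, -9.5)]
print(mmi_objective(ref, hyps, kappa=0.1))
```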

Some work instead models the word sequence $W$ directly; the MMI and sMBR objectives are then written as follows [2]:

$$\mathcal{F}_{\mathrm{MMI}} = \sum_{u} \log \frac{p(X_u \mid W_u)^{\kappa}\, P(W_u)}{\sum_{W} p(X_u \mid W)^{\kappa}\, P(W)}$$

$$\mathcal{F}_{\mathrm{sMBR}} = \sum_{u} \frac{\sum_{W} p(X_u \mid W)^{\kappa}\, P(W)\, A(W, W_u)}{\sum_{W'} p(X_u \mid W')^{\kappa}\, P(W')}$$

where $W_u$ is the correct (reference) transcription, $\kappa$ is the scale, $\delta(\cdot,\cdot)$ is the Kronecker delta that defines the state-level accuracy $A(W, W_u) = \sum_{t=1}^{T} \delta(s_t^{W}, s_t^{W_u})$, and $T$ is the total number of frames. MMI training thus amounts to maximizing the numerator probability while minimizing the denominator probability.
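The state-level accuracy and the sMBR objective can be sketched the same way. Again this is an illustrative approximation over an n-best list, with hypothetical names and scores:

```python
import math

def state_accuracy(hyp_states, ref_states):
    # A(W, W_u): the Kronecker-delta sum over the T frames, i.e. the
    # number of frames whose senone label matches the reference.
    return sum(1 for s, r in zip(hyp_states, ref_states) if s == r)

def smbr_objective(hyps, ref_states, kappa=0.1):
    """Toy sMBR objective: the expected state accuracy under the scaled
    posterior, with an n-best list standing in for the full sum.
    hyps: list of (log_p_acoustic, log_p_lm, state_sequence) triples."""
    scores = [kappa * la + llm for la, llm, _ in hyps]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return sum((w / z) * state_accuracy(states, ref_states)
               for w, (_, _, states) in zip(weights, hyps))

ref = [3, 3, 7, 7, 2]                       # reference senone labels, T = 5
hyps = [(-50.0, -4.0, [3, 3, 7, 7, 2]),     # fully correct hypothesis
        (-55.0, -3.5, [3, 3, 7, 1, 2])]     # one wrong frame
print(smbr_objective(hyps, ref))            # 4.5: both weighted equally here
```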

For comparison, the CE objective is given here as well:

$$\mathcal{F}_{\mathrm{CE}} = -\sum_{u}\sum_{t=1}^{T_u}\sum_{s} \delta(s, s_{ut}) \log y_{ut}(s)$$

where $\delta(\cdot,\cdot)$ again denotes the Kronecker delta, $s_{ut}$ is the reference senone at frame $t$ of utterance $u$, and $y_{ut}(s)$ is the network's posterior for senone $s$; the inner sum reduces to $-\log y_{ut}(s_{ut})$.
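As a quick sanity check, the CE objective is straightforward to compute from the network posteriors; a minimal sketch, assuming a (T, S) posterior matrix:

```python
import numpy as np

def ce_objective(posteriors, ref_states):
    """Frame-level CE: -sum_t sum_s delta(s, s_t) log y_t(s).
    The Kronecker delta selects the reference senone at each frame,
    so the sum reduces to -sum_t log y_t(s_t).
    posteriors: (T, S) array of network outputs y_t(s)
    ref_states: length-T integer array of reference senone labels s_t"""
    t = np.arange(len(ref_states))
    return -np.log(posteriors[t, ref_states]).sum()

y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])             # T = 2 frames, S = 3 senones
print(ce_objective(y, np.array([0, 1])))    # -log(0.7) - log(0.8)
```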


Having introduced sequence training and the MMI objective, we now look at the MMI training procedure.

Given input $x$ and model parameters $\theta$, the gradient of the loss function $\mathcal{L}(x, \theta)$ decomposes as follows:

$$\frac{\partial \mathcal{L}(x, \theta)}{\partial \theta} = \sum_{t}\sum_{k} \frac{\partial \mathcal{L}(x, \theta)}{\partial a_t(k)}\, \frac{\partial a_t(k)}{\partial \theta}$$

By the chain rule, the gradient of the loss with respect to the parameters factors into two parts: the derivative of the loss with respect to the activations $a_t(k)$, called the outer derivatives, and the derivative of the activations with respect to the network parameters $\theta$, called the inner derivatives. The only difference between MMI and CE lies in the outer derivatives; the inner derivatives depend only on the network type (DNN, LSTM, etc.), not on the objective.
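In practice this split means the outer derivatives are computed outside the network and then injected into ordinary back-propagation. Here is a PyTorch-flavored sketch of the idea (my own illustration: the model, the shapes, and the random occupancy tensors are placeholders, and the MMI outer derivative they stand in for is introduced just below):

```python
import torch

# A stand-in acoustic model; T frames and S senones are hypothetical sizes.
T, S = 100, 500
model = torch.nn.Linear(40, S)
feats = torch.randn(T, 40)

# Forward pass up to the activations a_t(k).
activations = model(feats)

# Outer derivatives: for sequence training these come from the lattice
# forward-backward pass, e.g. kappa * (gamma_num - gamma_den) for MMI;
# random tensors stand in for real occupancies here.
kappa = 0.1
gamma_num = torch.rand(T, S); gamma_num /= gamma_num.sum(1, keepdim=True)
gamma_den = torch.rand(T, S); gamma_den /= gamma_den.sum(1, keepdim=True)
outer = kappa * (gamma_num - gamma_den)

# Inner derivatives: inject the outer error signal and let autograd
# backpropagate it through the network. The objective is maximized,
# so the descent direction uses the negated outer derivative.
activations.backward(-outer)
print(model.weight.grad.shape)   # gradients w.r.t. the parameters theta
```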

The MMI outer derivative is as follows [6]:

$$\frac{\partial \mathcal{F}_{\mathrm{MMI}}}{\partial a_{ut}(s)} = \kappa\left(\gamma_{ut}^{\mathrm{num}}(s) - \gamma_{ut}^{\mathrm{den}}(s)\right)$$

where $\kappa$ is the scale factor, usually taken as the inverse of the language-model scale, and $\gamma_{ut}^{\mathrm{num}}(s)$ and $\gamma_{ut}^{\mathrm{den}}(s)$ are the occupation probabilities of state $s$ at time $t$ in the numerator and denominator lattices, respectively. They can be computed by running the forward-backward algorithm over the WFST-generated lattices [9], as follows:


$$\gamma_t(s) = \frac{\alpha_t(s)\,\beta_t(s)}{\sum_{s'} \alpha_t(s')\,\beta_t(s')}$$
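A minimal sketch of this computation: real systems run the forward-backward pass over the arcs of the pruned lattice, but the same recursion on a dense HMM-style trellis, assumed here for simplicity, shows how the occupancies fall out of the forward scores α and backward scores β:

```python
import numpy as np
from scipy.special import logsumexp

def occupancies(loglikes, log_trans, log_init):
    """Forward-backward computation of gamma_t(s) in the log domain.
    loglikes:  (T, S) per-frame acoustic log-likelihoods
    log_trans: (S, S) log transition scores, log_trans[i, j] = i -> j
    log_init:  (S,)  log initial-state scores"""
    T, S = loglikes.shape
    alpha = np.empty((T, S))
    beta = np.empty((T, S))
    alpha[0] = log_init + loglikes[0]
    for t in range(1, T):                    # forward pass
        alpha[t] = logsumexp(alpha[t - 1][:, None] + log_trans, axis=0) + loglikes[t]
    beta[T - 1] = 0.0
    for t in range(T - 2, -1, -1):           # backward pass
        beta[t] = logsumexp(log_trans + (loglikes[t + 1] + beta[t + 1])[None, :], axis=1)
    # gamma_t(s) = alpha_t(s) beta_t(s) / sum_s' alpha_t(s') beta_t(s')
    log_gamma = alpha + beta
    log_gamma -= logsumexp(log_gamma, axis=1, keepdims=True)
    return np.exp(log_gamma)                 # each row sums to 1

gamma = occupancies(np.log(np.random.rand(6, 4)),
                    np.log(np.full((4, 4), 0.25)),
                    np.log(np.full(4, 0.25)))
print(gamma.sum(axis=1))                     # all ones
```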

However, when computing the numerator and denominator lattices, the numerator lattice is usually merged into the denominator lattice to guarantee that the denominator always contains the reference path (which pruning might otherwise remove), as shown in the merge-lattice step of the figure below:

(Figure: the numerator lattice is merged into the denominator lattice, the merge-lattice step.)

The MMI derivative therefore becomes [7]:

$$\frac{\partial \mathcal{F}_{\mathrm{MMI}}}{\partial a_{ut}(s)} = \kappa\left(\gamma_{ut}^{\mathrm{num}}(s) - \tilde{\gamma}_{ut}^{\mathrm{den}}(s)\right)$$

where $\tilde{\gamma}^{\mathrm{den}}$ is the occupancy computed over the merged denominator lattice.


Above we described how, for the MMI objective, the numerator and denominator occupancies $\gamma$ of state $s$ at each frame $t$ of an utterance $u$ are computed. Next is the overall flow of training a neural network with the MMI objective (a minimal code sketch follows the list):

1) Randomly shuffle all training utterances.

2) For each utterance U, compute the state occupancies of the numerator and denominator lattices, as shown below:

(Figure: computing state occupancies from the numerator and denominator lattices of utterance U.)

3) Forward-score utterance U in mini-batches, compute the error signal and the inner gradients, and back-propagate to update the parameters θ with an optimization strategy such as SGD or ASGD.

4) Return to step 2) and loop over all training audio.

5) Repeat for multiple epochs.

6) Done.
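Here is a schematic skeleton of steps 1) through 6). The names utt.num_lattice, utt.den_lattice, lattice_occupancies, and the model/optimizer methods are hypothetical placeholders, not any real toolkit's API:

```python
import random

def mmi_train(utterances, model, optimizer, num_epochs=4, kappa=0.1):
    """Schematic skeleton of lattice-based MMI training (steps 1-6 above);
    the lattice and model interfaces are hypothetical placeholders."""
    for _ in range(num_epochs):                       # 5) multiple epochs
        random.shuffle(utterances)                    # 1) shuffle utterances
        for utt in utterances:                        # 2) + 4) loop over audio
            # 2) occupancies via lattice forward-backward (numerator lattice
            #    already merged into the denominator lattice)
            gamma_num = lattice_occupancies(utt.num_lattice, model)
            gamma_den = lattice_occupancies(utt.den_lattice, model)
            # 3) forward scoring, outer error signal, inner gradients, update
            activations = model.forward(utt.features)
            outer = kappa * (gamma_num - gamma_den)
            model.backward(activations, -outer)       # back-propagate error
            optimizer.step(model)                     # SGD / ASGD update
    # 6) done
```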


This completes the overall flow of training a neural network with sequence discriminative training.



=================================================================================================


[1] Maximum mutual information estimation of hidden Markov model parameters for speech recognition

[2] Sequence discriminative distributed training of long short-term memory recurrent neural networks

[3] Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling

[4] Purely sequence-trained neural networks for ASR based on lattice-free MMI

[5] Error back propagation for sequence training of context-dependent deep networks for conversational speech transcription

[6] Asynchronous stochastic optimization for sequence training of deep neural networks: Towards big data

[7] Asynchronous stochastic optimization for sequence training of deep neural networks

[8] A log-linear discriminative modeling framework for speech recognition

[9] Discriminative training and acoustic modeling for automatic speech recognition

[10] MMIE training of large vocabulary recognition systems

[11] Discriminative training for large vocabulary speech recognition