Paper review: END-TO-END AUDIOVISUAL SPEECH RECOGNITION

END-TO-END AUDIOVISUAL SPEECH RECOGNITION

code: https://github.com/tstafylakis/Lipreading-ResNet

Tip: this paper is easy to understand, and it ranks 8th on the lipreading dataset leaderboard. It can be found on Papers with Code.

Summary

The authors want to design a high-accuracy algorithm to classify 500 words, and the results are good. The main contributions of this paper are as follows:

  1. They design a new end-to-end framework that uses only raw images and raw waveforms as input.
  2. The network remains effective when the noise level is high.

Abstract (translated)

Several end-to-end deep learning approaches have recently been presented which extract either audio or visual features from the input images or audio signals and perform speech recognition. However, research on end-to-end audiovisual models is very limited. This paper presents an end-to-end audiovisual model based on residual networks and Bidirectional Gated Recurrent Units (BGRUs). To the best of the authors' knowledge, this is the first audiovisual fusion model that simultaneously learns to extract features directly from image pixels and audio waveforms and performs within-context word recognition on a large publicly available dataset (LRW). The model consists of two streams, one per modality, which extract features directly from the mouth region and the raw waveform. The temporal dynamics in each stream/modality are modelled by a 2-layer BGRU, and the fusion of the streams/modalities takes place via another 2-layer BGRU. A slight improvement in classification rate over end-to-end audio-only and MFCC-based models is reported in clean audio conditions and at low noise levels. In the presence of high levels of noise, the end-to-end audiovisual model significantly outperforms both audio-only models.

Research Objective

The authors design an audiovisual framework for speech recognition.

Background and Problems

  • Background

    • Not stated in the paper.
  • Brief introduction to previous methods

    • Traditional audiovisual fusion systems consist of two stages: feature extraction from the image and audio signals, and combination of the features for joint classification [1, 2, 3].
    • Recently, several deep learning approaches for audiovisual fusion have been presented which aim to replace the feature extraction stage with deep bottleneck architectures. They often use MFCCs as the audio input.
  • Problem Statement

    • Previous work usually has two stages: feature extraction and joint classification.
    • Some so-called end-to-end works use MFCCs as input, so they are not truly end to end from raw data.

Main work

  1. Building on other end-to-end work, the authors modify the framework so that it takes raw data as input.
  2. The experiments show that classification of 500 words from the LRW database achieves state-of-the-art performance, and in high-noise environments the improvement is significant.


Related work

  • It is covered in the introduction section of the paper.

Method(s)

  • Visual stream:
      1. The visual stream uses a 3D CNN front-end + ResNet-34 + a 2-layer BGRU.
  • Audio stream:
      1. The audio stream uses a 1D ResNet (followed by a 2-layer BGRU), since the input is a raw audio waveform with only one dimension.
  • Fusion stream:
      1. The outputs of the audio and visual streams are fused by another 2-layer BGRU (see the sketch after this list).
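
A minimal PyTorch sketch of this two-stream design as I understand it from the paper (my own illustration, not the authors' released code; the layer sizes, the pooling to a common number of time steps, and the last-step classification are assumptions):

```python
# Sketch only: 3D-conv front-end + per-frame ResNet-34 + 2-layer BGRU (visual),
# 1D conv encoder over the raw waveform + 2-layer BGRU (audio),
# and a 2-layer BGRU over the concatenated features (fusion).
import torch
import torch.nn as nn
from torchvision.models import resnet34


class VisualStream(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        # 3D convolution over (T, H, W) to capture short-term mouth motion.
        self.frontend3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # Per-frame ResNet-34 trunk; first conv adapted to 64-channel feature maps,
        # final fc removed so it outputs a 512-dim feature per frame.
        trunk = resnet34(weights=None)
        trunk.conv1 = nn.Conv2d(64, 64, kernel_size=7, stride=2, padding=3, bias=False)
        trunk.fc = nn.Identity()
        self.trunk = trunk
        self.bgru = nn.GRU(512, hidden, num_layers=2, batch_first=True, bidirectional=True)

    def forward(self, x):                        # x: (B, 1, T, 96, 96) grayscale mouth crops
        f = self.frontend3d(x)                   # (B, 64, T, H', W')
        b, c, t, h, w = f.shape
        f = f.transpose(1, 2).reshape(b * t, c, h, w)
        f = self.trunk(f).reshape(b, t, 512)     # one 512-dim feature per frame
        out, _ = self.bgru(f)                    # (B, T, 2 * hidden)
        return out


class AudioStream(nn.Module):
    def __init__(self, hidden=256, steps=29):
        super().__init__()
        # 1D convolutional encoder over the raw waveform; the paper uses a 1D
        # ResNet, plain strided convs keep this sketch short.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=80, stride=4, padding=38), nn.BatchNorm1d(64), nn.ReLU(inplace=True),
            nn.Conv1d(64, 256, kernel_size=3, stride=2, padding=1), nn.BatchNorm1d(256), nn.ReLU(inplace=True),
            nn.Conv1d(256, 512, kernel_size=3, stride=2, padding=1), nn.BatchNorm1d(512), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(steps),         # assumed: pool to the same number of steps as the video frames
        )
        self.bgru = nn.GRU(512, hidden, num_layers=2, batch_first=True, bidirectional=True)

    def forward(self, x):                        # x: (B, 1, samples) raw waveform
        f = self.encoder(x).transpose(1, 2)      # (B, steps, 512)
        out, _ = self.bgru(f)
        return out


class AudioVisualModel(nn.Module):
    def __init__(self, num_words=500, hidden=256):
        super().__init__()
        self.visual = VisualStream(hidden)
        self.audio = AudioStream(hidden)
        # Fusion: a 2-layer BGRU over the concatenated per-step features of both streams.
        self.fusion = nn.GRU(4 * hidden, hidden, num_layers=2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_words)

    def forward(self, video, audio):             # both streams must yield the same number of time steps
        fused, _ = self.fusion(torch.cat([self.visual(video), self.audio(audio)], dim=-1))
        return self.classifier(fused[:, -1])     # predict the word from the last step (a simplification)
```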

Experiment

  • Dataset

    • LRW database. The database consists of short segments (1.16 seconds) from BBC programs, mainly news and talk shows.
  • Preprocessing

    • The video input is the mouth region, cropped to 96×96 pixels. The audio input is the raw waveform, which is also preprocessed.
  • Training

    • Data augmentation (adding noise to the audio) and pretraining of the ResNet are used. Training is divided into two stages: first an initialisation stage, where parts of the network are trained as single streams; then the whole audiovisual network is trained end to end (see the sketch after this list).
  • Results

    • The results show that the audiovisual network gives only a small improvement over the audio-only model on clean data, but as the noise level increases it becomes significantly more robust.
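
A minimal sketch of the noise augmentation and two-stage schedule described above (my own illustration, not the authors' training script; `AudioVisualModel` refers to the sketch in the Method(s) section, and the SNR levels, learning rate, and noise source are assumptions):

```python
# Sketch only: mix noise into the clean waveform at a random SNR, then train.
import random

import torch


def add_noise(waveform, noise, snr_db):
    """Mix a random segment of `noise` into `waveform` at the target SNR (dB).

    Assumes 1D (or (1, samples)) tensors and that `noise` is longer than `waveform`.
    """
    start = random.randint(0, noise.numel() - waveform.numel())
    segment = noise[start:start + waveform.numel()]
    signal_power = waveform.pow(2).mean()
    noise_power = segment.pow(2).mean().clamp_min(1e-8)
    scale = torch.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return waveform + scale * segment


def train_epoch(model, loader, optimizer, criterion, noise=None):
    """One pass over the data; `noise` is an optional 1D noise waveform (e.g. babble noise)."""
    model.train()
    for video, audio, label in loader:
        if noise is not None:
            snr_db = random.choice([-5, 0, 5, 10, 15, 20])        # assumed SNR levels
            audio = torch.stack([add_noise(a, noise, snr_db) for a in audio])
        loss = criterion(model(video, audio), label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


# Stage 1 (initialisation): train each stream on its own, e.g. with a temporary
# linear classifier on top of its BGRU output, then copy the weights over.
# Stage 2 (end to end): train the whole audiovisual network jointly, e.g.:
# model = AudioVisualModel(num_words=500)                          # from the Method(s) sketch
# optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)        # assumed learning rate
# criterion = torch.nn.CrossEntropyLoss()
# for epoch in range(epochs):
#     train_epoch(model, train_loader, optimizer, criterion, noise=babble_noise)
```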


Conclusion

  • Main contribution
  1. The authors design an effective end-to-end audiovisual network for word-level speech classification.
  • Further work
  1. Extending the approach to sentence-level recognition.
  2. Investigating a fusion mechanism that learns to weight each modality based on the noise level (a speculative sketch follows below).
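
As a thought experiment on that last point, a toy version of noise-aware fusion might look like the following (purely my speculation, not something from the paper): a small gating network predicts a weight per modality at each time step, so the model can down-weight the audio features when they look noisy.

```python
# Speculative sketch of noise-aware gated fusion (not from the paper).
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        # One score per modality per time step, normalised with a softmax.
        self.gate = nn.Linear(2 * feat_dim, 2)

    def forward(self, visual_feats, audio_feats):      # both: (B, T, feat_dim)
        scores = self.gate(torch.cat([visual_feats, audio_feats], dim=-1))
        weights = torch.softmax(scores, dim=-1)        # (B, T, 2), sums to 1 per step
        fused = weights[..., 0:1] * visual_feats + weights[..., 1:2] * audio_feats
        return fused                                   # (B, T, feat_dim), fed to the fusion BGRU
```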

Reference(optional)

Takeaways for me

  • This kind of paper is easy to read: the hyper-parameters are described carefully and the related code has been published, so it is worth reading.
  • The writing style of this paper is not great: it does not give background on the area at the beginning, perhaps because it is a conference paper. I will not adopt this writing style.
  • I can read this kind of paper comfortably. As a next step, maybe we can read all the related papers on Papers with Code first, and leave the papers without code until after the ones with code have been read.