Almost Unsupervised Text to Speech and Automatic Speech Recognition

  • Abstract:

    • An almost unsupervised method: using only a few hundred paired text-speech examples plus additional unpaired data, it builds both TTS and ASR.
    • components:
      • 1. a denoising auto-encoder
      • 2. dual transformation: TTS converts text y into speech x and ASR is then trained on (x, y), and vice versa
      • 3. bidirectional sequence modeling, which mainly tackles error propagation in the long speech and text sequences during training
      • 4. a unified model that contains both TTS and ASR
  • Introduction:

    • Reviews prior work on ASR and TTS in low-resource and zero-resource scenarios.
    • One line of work uses large amounts of labeled speech-text data and transfer learning to synthesize a specific speaker's voice; it relies on two well-trained ASR and TTS models.
    • methods:
      • 1. self-supervised learning on unpaired speech and text data, to build comprehension and modeling ability in the speech and text domains; implemented with a denoising auto-encoder
      • 2. training procedure (dual transformation; see the sketch after this list):
        • 1. TTS synthesizes speech x from text y, then ASR is trained on (x, y)
        • 2. ASR recognizes speech x into text y, then TTS is trained on (y, x)
      • 3. since speech and text sequences are longer than in other seq2seq tasks, error propagation is more severe; bidirectional sequence modeling is used to curb it
      • 4. a Transformer-based unified model structure that combines TTS and ASR
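      • A minimal schematic of one dual transformation round, as referenced above (my pseudocode, not the authors' implementation; the Seq2SeqModel interface and method names are assumed):

```python
from typing import Any, Protocol

class Seq2SeqModel(Protocol):
    # Hypothetical interface for either model; the paper's code is not shown in these notes.
    def generate(self, source: Any) -> Any: ...
    def train_step(self, source: Any, target: Any) -> float: ...

def dual_transformation_round(tts: Seq2SeqModel, asr: Seq2SeqModel,
                              unpaired_text: Any, unpaired_speech: Any) -> None:
    # TTS synthesizes pseudo speech x_hat for unpaired text y;
    # ASR then trains on the pseudo pair (x_hat, y).
    x_hat = tts.generate(unpaired_text)
    asr.train_step(source=x_hat, target=unpaired_text)

    # ASR transcribes unpaired speech x into pseudo text y_hat;
    # TTS then trains on the pseudo pair (y_hat, x).
    y_hat = asr.generate(unpaired_speech)
    tts.train_step(source=y_hat, target=unpaired_speech)
```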
  • Background:

    • 2.1. Sequence to Sequence Learning 
      • Based on the encoder-decoder framework:
        • The encoder reads the source sequence and generates a set of representations. 
        • the decoder estimates the conditional probability of each target element given the source representations and its preceding elements. 
        • The attention mechanism (Bahdanau et al., 2015) is further introduced between the encoder and decoder in order to determine which source representation to focus on when predicting the current element, and is an important component for sequence to sequence learning. 
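        • In symbols (the standard seq2seq factorization, not specific to this paper): $$ P(y \mid x) = \prod_{t=1}^{T_y} P(y_t \mid y_{<t}, x; \theta) $$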
    • 2.2. TTS and ASR based on the Encoder-Decoder Framework 
      • End-to-end TTS and ASR; Transformer is used to handle long sequences.
  • Our Method:

    • 3.1. Denoising Auto-Encoder 

      • Given large amounts of unpaired data, to build a better understanding of speech and text, a denoising auto-encoder is used to reconstruct each speech or text sequence from a corrupted version of itself.

      • The denoising auto-encoder is a typical self-supervised technique, widely used in unsupervised learning.
      • loss:

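        A hedged reconstruction of the DAE loss (notation mine: C is the corruption/masking operation, S and T the speech and text domains, and the superscripts pick the corresponding encoder/decoder):

        $$ \mathcal{L}^{dae} = \mathcal{L}_S\big(x \mid C(x); \theta_{enc}^S, \theta_{dec}^S\big) + \mathcal{L}_T\big(y \mid C(y); \theta_{enc}^T, \theta_{dec}^T\big) $$

        where \mathcal{L}_S is a regression loss (e.g., MSE) on speech features and \mathcal{L}_T a negative log-likelihood on text.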

    • 3.2. Dual Transformation 
      • 1. TTS synthesizes speech x from text y, then ASR is trained on (x, y)
      • 2. ASR recognizes speech x into text y, then TTS is trained on (y, x)

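        A hedged sketch of the dual transformation objective (notation mine, mirroring the DAE loss above): with \hat{y} the ASR transcription of unpaired speech x and \hat{x} the TTS synthesis of unpaired text y,

        $$ \mathcal{L}^{dt} = \mathcal{L}_S\big(x \mid \hat{y}; \theta_{enc}^T, \theta_{dec}^S\big) + \mathcal{L}_T\big(y \mid \hat{x}; \theta_{enc}^S, \theta_{dec}^T\big) $$

        The pseudo pairs (\hat{x}, y) and (x, \hat{y}) are refreshed as both models improve, so each model keeps training on increasingly accurate labels produced by the other.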

    • 3.3. Bidirectional Sequence Modeling
      • The right part of a generated sequence is usually of lower quality than the left part, and in a low- or zero-resource setting the degradation is even worse.
      • To address this, the authors use bidirectional sequence modeling, generating speech and text sequences both left-to-right and right-to-left.
      • Moreover, with scarce data in the unsupervised setting, bidirectional sequence modeling also acts as a form of data augmentation.

      • Unlike the conventional decoder, which uses a zero vector as the start element for training and inference, the model learns four start embeddings in total: two for speech generation and two for text generation (one per decoding direction).
      • The source sequence is reversed as well, to keep it consistent with the reversed target sequence.
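      • A hedged way to write the bidirectional training (notation mine, with R(\cdot) denoting sequence reversal): each loss term is computed twice, once left-to-right and once on reversed sequences, $$ \mathcal{L}^{bi} = \mathcal{L}(y \mid x) + \mathcal{L}\big(R(y) \mid R(x)\big), $$ applied to both the DAE and dual transformation losses; this is also where the data-augmentation effect noted above comes from.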
    • 3.4. Model Structure
      • Unified Training Flow:

        [figure: unified training flow]

      • Transformer Module:
        • embedding size: 256; hidden size: 256; FFT: 1025
      • In/Out Module:
        • The post-net consists of a 5-layer 1-dimensional convolutional network with hidden size of 256, which aims to refine the quality of the generated mel-spectrograms. 
        • a phoneme embedding
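        • A minimal PyTorch sketch of these In/Out pieces (sizes from the notes; the kernel size, batch-norm + tanh recipe, and mel/phoneme dimensions are my assumptions, following the common Tacotron 2-style post-net):

```python
import torch
import torch.nn as nn

N_PHONEMES = 80   # assumed phoneme vocabulary size (not given in the notes)
N_MELS = 80       # assumed number of mel bins (not given in the notes)
HIDDEN = 256      # embedding / hidden size from the notes

# Text-side input module: a phoneme embedding.
phoneme_embedding = nn.Embedding(N_PHONEMES, HIDDEN)

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # Conv1d + batch norm + tanh, a common post-net recipe.
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=5, padding=2),
        nn.BatchNorm1d(out_ch),
        nn.Tanh(),
    )

# Speech-side output module: 5-layer 1-D convolutional post-net, hidden size 256.
post_net = nn.Sequential(
    conv_block(N_MELS, HIDDEN),
    conv_block(HIDDEN, HIDDEN),
    conv_block(HIDDEN, HIDDEN),
    conv_block(HIDDEN, HIDDEN),
    nn.Conv1d(HIDDEN, N_MELS, kernel_size=5, padding=2),  # back to mel bins
)

# The post-net refines the coarse decoder output, here as a residual (a common choice).
mel_coarse = torch.randn(2, N_MELS, 100)          # (batch, mel bins, frames)
mel_refined = mel_coarse + post_net(mel_coarse)
print(mel_refined.shape)                          # torch.Size([2, 80, 100])
```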
  • Experiments and Results:

    • 4.1. Training and Evaluation Setup

      • LJSpeech: 13,100 clips (12,500 train + 300 dev + 300 test), about 24 hours of audio

      • evaluation: MOS (mean opinion score) for TTS and PER (phoneme error rate) for ASR
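      • For reference, PER is the standard edit-distance rate over phonemes: with S, D, I the substitution, deletion, and insertion counts against a reference of N phonemes, $$ \mathrm{PER} = \frac{S + D + I}{N} \times 100\% $$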
    • 4.2. Results
      • PER and MOS:

        [table: PER and MOS results]

    • 4.3. Analyses
      • Different Components of Our Method:
        • we conduct ablation studies by gradually adding each component to the baseline Pair-200 system to check the performance changes.

          [table: ablation results]

      • Visualization of Mel-Spectrograms 
        • [figure: mel-spectrogram visualization]

      • Varying Paired Data 
        • [figure: performance with varying amounts of paired data]

      • Different Masking Probabilities in DAE 
        • [figure: results when varying the masking probability in the DAE]
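        • A minimal illustration of the masking corruption being varied (my sketch; the paper's exact corruption scheme is not reproduced in these notes):

```python
import random

def corrupt(seq: list, p_mask: float = 0.3, mask_token: str = "<mask>") -> list:
    # Replace each element with a mask token independently with probability p_mask.
    return [mask_token if random.random() < p_mask else tok for tok in seq]

# Example: corrupting a phoneme sequence before DAE reconstruction.
print(corrupt(["HH", "AH", "L", "OW"], p_mask=0.5))
```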

  • Related Work:

    • TTS and ASR:

      • TTS: Deep Voice; Tacotron; ClariNet

    • Zero-/Low-resource TTS and ASR:

  • Phoneme embeddings; encoder: two Transformer layers; single machine with 4 GPUs, total batch size 512; training took three days.
  • Open question: how exactly are the 200 paired examples used?

  • 1. text to phoneme:

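    For the text-to-phoneme step, one common choice (my example; the notes do not say which tool the authors used) is the g2p_en package:

```python
# pip install g2p-en
from g2p_en import G2p

g2p = G2p()
print(g2p("Almost unsupervised text to speech"))
# e.g. ['AO1', 'L', 'M', 'OW2', 'S', 'T', ' ', 'AH0', 'N', ...]
```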