One-shot VC by Separating Speaker and Content Representations with Instance Normalization

Conference: Interspeech 2019
Affiliation: National Taiwan University
Authors: Ju-chieh Chou, Hung-yi Lee

Abstract

motivation: one-shot VC, where both the source and target speakers may be unseen during training, and conversion uses only a single utterance of each.
idea: disentangle the speaker and content representations with instance normalization (IN).
finding: the model learns meaningful speaker representations without any supervision.

1. Introduction

A speech signal carries two kinds of information: (1) static information, i.e. the speaker and vocal-tract characteristics, which barely change within an utterance; (2) the linguistic content, which varies over time. The proposed model consists of a speaker encoder, a content encoder (instance normalization without affine transformation normalizes the channel statistics and removes speaker information), and a decoder (adaptive instance normalization, whose affine parameters are provided by the speaker encoder, so speaker information is controlled only by the speaker encoder). No speaker labels are provided during training, yet the speaker encoder still learns meaningful speaker representations.
Factorized representations

GAN-based methods can remove certain attributes from an utterance, but they need extra computation to train a discriminator. This paper replaces adversarial training with instance normalization.

Note: instance normalization normalizes each sample (e.g., a single image) individually, whereas batch normalization normalizes over the whole batch.
IN has been used in image style transfer [28].

2. Proposed Approach

2.1 VAE

The loss has two parts: the decoder reconstruction loss $L_{rec}$ and the KL-style regularization loss on the content encoder $L_{kl}$.
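A sketch of the objective reconstructed from the description above, assuming the usual VAE-style setup in which the posterior variance is fixed so that the KL term reduces to an L2 penalty on the content code (the choice of norms and the weight $\lambda_{kl}$ are my notation; $E_s$, $E_c$, $D$ denote the speaker encoder, content encoder, and decoder):

$$L_{rec} = \mathbb{E}_{x \sim p(x)}\big[\lVert D(E_s(x), E_c(x)) - x \rVert_1\big], \qquad L_{kl} = \mathbb{E}_{x \sim p(x)}\big[\lVert E_c(x) \rVert_2^2\big]$$

The total training loss is then $L = L_{rec} + \lambda_{kl} L_{kl}$.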

2.2 IN for feature disentanglement

Instance normalization
For each channel of the content encoder's convolutional feature map, compute the mean and standard deviation over the time axis, then normalize that channel with them.
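Written out explicitly (notation is mine: $M_c[w]$ is the value of channel $c$ at time step $w$, $W$ is the segment length, and $\epsilon$ is a small constant for numerical stability):

$$\mu_c = \frac{1}{W}\sum_{w=1}^{W} M_c[w], \qquad \sigma_c = \sqrt{\frac{1}{W}\sum_{w=1}^{W}\big(M_c[w] - \mu_c\big)^2 + \epsilon}, \qquad M_c'[w] = \frac{M_c[w] - \mu_c}{\sigma_c}$$

There is no learnable affine transformation after the normalization, which is what strips the per-channel (speaker-dependent) statistics.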

AdaIN
In the decoder, each feature map is first instance-normalized, and then the speaker encoder supplies the per-channel affine parameters (scale and bias) that are applied on top; this is adaptive instance normalization (AdaIN).
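A minimal sketch of the two normalization variants, assuming feature maps of shape (batch, channels, time); the function names and tensor layout are my assumptions, not the paper's code:

```python
import torch

def instance_norm(x, eps=1e-5):
    # x: (batch, channels, time). Normalize each channel of each sample
    # over the time axis; no learnable affine, so channel-wise (speaker)
    # statistics are removed.
    mu = x.mean(dim=2, keepdim=True)
    sigma = x.std(dim=2, unbiased=False, keepdim=True)
    return (x - mu) / (sigma + eps)

def adaptive_instance_norm(x, gamma, beta, eps=1e-5):
    # gamma, beta: (batch, channels) affine parameters predicted from the
    # speaker embedding; they re-inject speaker information channel-wise.
    x = instance_norm(x, eps)
    return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)
```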

3. Implementation Details

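The original notes only referenced the paper's architecture figure here. As a stand-in, a minimal sketch of how the three modules fit together for training and for one-shot conversion, under the assumptions of Section 2 (all module interfaces and names are mine, not the paper's implementation):

```python
import torch
import torch.nn as nn

class OneShotVC(nn.Module):
    # Es: speaker encoder, Ec: content encoder, D: decoder with AdaIN layers.
    def __init__(self, speaker_encoder, content_encoder, decoder):
        super().__init__()
        self.Es = speaker_encoder
        self.Ec = content_encoder
        self.D = decoder

    def forward(self, mel):
        # Reconstruction path used during training; no speaker labels needed.
        return self.D(self.Es(mel), self.Ec(mel))

    @torch.no_grad()
    def convert(self, src_mel, tgt_mel):
        # One-shot conversion: content code from the source utterance,
        # speaker embedding from a single target utterance.
        return self.D(self.Es(tgt_mel), self.Ec(src_mel))
```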

4. Experiments

4.1 Dataset

VCTK: 80 speakers for training, 9 for validation, 20 for testing; utterances shorter than 128 frames are removed (50 ms window / 12.5 ms shift); about 16,000 training utterances.
During training the segment length is fixed to 128 frames to suit the convolutional layers; at inference the model can handle utterances of arbitrary length.
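A minimal sketch of the fixed-length segment sampling implied above (the 128-frame length comes from the notes; the array layout and helper name are my assumptions):

```python
import numpy as np

def sample_segment(mel, segment_len=128):
    # mel: (n_frames, n_mels) spectrogram of one utterance; utterances
    # shorter than segment_len are assumed to have been filtered out.
    start = np.random.randint(0, mel.shape[0] - segment_len + 1)
    return mel[start:start + segment_len]
```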

4.2 Evaluation of disentanglement

An ablation test measures how much speaker information remains in the content encoder output; judging from the accuracies reported below, a speaker classifier is trained on the latent representation and its classification accuracy is used as the measure.
(1) The classification accuracy on the $E_c$ output is not as low as expected. Because the speaker encoder controls the decoder's channel information (via AdaIN), the model already tends to learn speaker information through the speaker encoder.
(2) To confirm this, IN is also applied to the speaker encoder; the classification accuracy on the content code then increases, indicating that more speaker information has leaked into the content encoder. The average pooling + IN in the speaker encoder prevent it from capturing speaker information, so more speaker information flows through the content encoder instead.
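A minimal sketch of such a probing classifier, assuming the encoders are frozen and a simple linear speaker classifier is trained on the latent codes (the classifier design and names are my assumptions; the notes only record the resulting accuracies):

```python
import torch.nn as nn

class SpeakerProbe(nn.Module):
    # Linear classifier trained on frozen latent codes; higher accuracy
    # means more speaker information remains in the representation.
    def __init__(self, latent_dim, n_speakers):
        super().__init__()
        self.fc = nn.Linear(latent_dim, n_speakers)

    def forward(self, z):
        # z: (batch, latent_dim), e.g. a time-averaged content code
        # or the speaker embedding itself.
        return self.fc(z)
```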

4.3 Speaker embedding visualization

Utterances from speakers seen and unseen during training are passed through the speaker encoder, and the resulting embeddings are visualized in two dimensions:
(1) Different speakers are clearly separated;
(2) Both seen and unseen speakers form well-separated clusters;
(3) Measuring accuracy with the same classification setup as in the disentanglement test, the speaker embeddings reach 0.9973 for seen speakers and 0.9998 for unseen speakers.
This shows the speaker encoder behaves as intended.
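A minimal sketch of this kind of 2-D visualization, assuming the embeddings are precomputed and using t-SNE as the projection (the notes do not record which projection method the paper uses, so t-SNE is an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_speaker_embeddings(embs, speaker_ids):
    # embs: (n_utts, emb_dim) speaker-encoder outputs; speaker_ids: (n_utts,)
    points = TSNE(n_components=2).fit_transform(embs)
    for spk in np.unique(speaker_ids):
        mask = speaker_ids == spk
        plt.scatter(points[mask, 0], points[mask, 1], s=5, label=str(spk))
    plt.title("Speaker embeddings (2-D projection)")
    plt.show()
```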

4.4 Objective evaluation

  1. The global variance (GV) for each frequency index (ref [35]) is used to visualize the spectral distribution; 100 utterances are randomly sampled for the target and for the converted speech (a sketch of the GV computation follows this list). Noted conclusion: the converted speech does not match the target's sentences (???).

  2. Spectrogram example: heatmaps compare the source and converted speech (fundamental-frequency / spectral structure), showing that the linguistic content is preserved.
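A minimal sketch of the per-frequency global variance computation referenced in item 1 (the exact acoustic feature and whether it is computed in the log domain are not recorded in the notes, so this is only the generic form):

```python
import numpy as np

def global_variance(spectrograms):
    # spectrograms: list of (n_frames, n_freq_bins) arrays, one per utterance.
    # For each utterance, take the variance over time for every frequency bin,
    # then average over utterances.
    gv_per_utt = [spec.var(axis=0) for spec in spectrograms]
    return np.mean(gv_per_utt, axis=0)  # shape: (n_freq_bins,)
```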

4.5 Subjective MOS test

There is no baseline model; the paper directly presents histograms of the converted speech's similarity to the source speaker and to the target speaker.