Audio samples from "Learning Disentangled Phone and Speaker Representations in a Semi-Supervised VQ-VAE Paradigm"

Authors: Jennifer Williams, Yi Zhao, Erica Cooper, and Junichi Yamagishi

Abstract: We present a new approach to disentangle speaker voice and phone content by introducing new components to the VQ-VAE architecture for speech synthesis. The original VQ-VAE does not generalize well to unseen speakers or content. To alleviate this problem, we have incorporated a speaker encoder and speaker VQ codebook that learns global speaker characteristics entirely separate from the existing sub-phone codebooks. We also compare two training methods: self-supervised with global conditions and semi-supervised with speaker labels. Adding a speaker VQ component improves objective measures of speech synthesis quality (estimated MOS, speaker similarity, ASR-based intelligibility) and provides learned representations that are meaningful. Our speaker VQ codebook indices can be used in a simple speaker diarization task and perform slightly better than an x-vector baseline. Additionally, phones can be recognized from sub-phone VQ codebook indices in our semi-supervised VQ-VAE better than self-supervised with global conditions.
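To make the two kinds of codebook in the abstract concrete, here is a minimal sketch of vector-quantization lookup. It is not the authors' implementation. It assumes Euclidean nearest-neighbour lookup, one sub-phone code per frame, and a single global speaker code per utterance. The function names, toy codebooks, and vectors are all hypothetical; the actual encoders, codebook sizes, and training losses are not given on this page.

```python
import math

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (Euclidean distance)."""
    return min(range(len(codebook)), key=lambda i: math.dist(vector, codebook[i]))

def quantize_utterance(frame_vectors, speaker_vector, phone_codebook, speaker_codebook):
    """Quantize an utterance with two separate codebooks (assumed layout).

    - each frame-level vector is mapped to a sub-phone codebook index
    - one utterance-level vector is mapped to a single speaker codebook index
    """
    phone_indices = [nearest_code(f, phone_codebook) for f in frame_vectors]
    speaker_index = nearest_code(speaker_vector, speaker_codebook)
    return phone_indices, speaker_index

# Toy 2-dimensional codebooks and encoder outputs (hypothetical values).
phone_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
speaker_codebook = [[-1.0, -1.0], [1.0, -1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
speaker_vec = [0.8, -0.7]

phone_ids, speaker_id = quantize_utterance(frames, speaker_vec, phone_codebook, speaker_codebook)
print(phone_ids, speaker_id)  # [0, 1, 2] 1
```

In this sketch the speaker index never depends on the frame-level vectors, which is the sense in which the two codebooks are kept separate.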

(There are 7 columns of audio samples. Please scroll to the right if needed.)

Each column after Natural corresponds to a variant of the VQ-VAE system. The testing conditions (1-4) indicate whether the speakers and the text were seen or unseen during training. For example, in Condition1, both the speakers and the text were seen during training. In Condition4, both the speakers and the text were unseen. For the +Speaker and +Adversarial systems, (S) denotes softmax and (AS) denotes angular-softmax.
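The four testing conditions form a 2x2 grid over speakers and text. As a small illustrative sketch, the mapping can be written as a lookup table; the condition names and seen/unseen values come from the section headings on this page, while the dictionary layout and the helper name `describe` are hypothetical.

```python
# Seen/unseen status of speakers and text for each testing condition,
# taken from the section headings on this page.
CONDITIONS = {
    "Condition1": {"speakers_seen": True, "text_seen": True},
    "Condition2": {"speakers_seen": True, "text_seen": False},
    "Condition3": {"speakers_seen": False, "text_seen": True},
    "Condition4": {"speakers_seen": False, "text_seen": False},
}

def describe(condition):
    """Return a label such as 'seen speakers / unseen text' for a condition name."""
    flags = CONDITIONS[condition]
    speakers = "seen" if flags["speakers_seen"] else "unseen"
    text = "seen" if flags["text_seen"] else "unseen"
    return f"{speakers} speakers / {text} text"

for name in CONDITIONS:
    print(f"{name} ({describe(name)})")
```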

These examples are sampled from the objective evaluation for Table 1 in the paper.

Condition1 (seen speakers / seen text):

Natural	VQ-VAE	+Global	+Speaker (S)	+Speaker (AS)	+Adversarial (S)	+Adversarial (AS)
p246_003 (male): Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.

p269_005 (female): She can scoop these things into three red bags and we can go meet her Wednesday at the train station.

Condition2 (seen speakers / unseen text):

Natural	VQ-VAE	+Global	+Speaker (S)	+Speaker (AS)	+Adversarial (S)	+Adversarial (AS)
p246_158 (male): I'd love to be like Peter.

p269_153 (female): That could mean the difference between life and death in action.

Condition3 (unseen speakers / seen text):

Natural	VQ-VAE	+Global	+Speaker (S)	+Speaker (AS)	+Adversarial (S)	+Adversarial (AS)
p285_013 (male): Some have accepted it as a miracle without physical explanation.

p310_009 (female): There is according to legend a boiling pot of gold at one end.

Condition4 (unseen speakers / unseen text):

Natural	VQ-VAE	+Global	+Speaker (S)	+Speaker (AS)	+Adversarial (S)	+Adversarial (AS)
p285_034 (male): Mr. Ferguson became a minister after seven years as a journalist.

p300_032 (female): The music industry hasn't changed at all.