Audio samples from "Exploring Disentanglement with Multilingual and Monolingual VQ-VAE"

Paper: arXiv preprint, submitted to the Speech Synthesis Workshop 2021 (SSW11)

Authors: Jennifer Williams, Jason Fong, Erica Cooper, and Junichi Yamagishi

Abstract:

This work examines the content and usefulness of disentangled phone and speaker representations from two separately trained VQ-VAE systems: one trained on multilingual data and another trained on monolingual data. We explore the multi- and monolingual models using four small proof-of-concept tasks: copy-synthesis, voice transformation, linguistic code-switching, and content-based privacy masking. From these tasks, we reflect on how disentangled phone and speaker representations can be used to manipulate speech in a meaningful way. Our experiments demonstrate that the VQ representations are suitable for these tasks, including creating new voices by mixing speaker representations together. We also present our novel technique to conceal the content of targeted words within an utterance by manipulating phone VQ codes, while retaining speaker identity and intelligibility of surrounding words. Finally, we discuss recommendations for further increasing the viability of disentangled representations.

Multilingual:

SIWIS Copy-Synthesis
SIWIS Voice Transformation
SIWIS Linguistic Code-Switching

Monolingual Copy-Synthesis (seen):

	VCTK-003	VCTK-004	VCTK-005	VCTK-006
Natural (M, p345)
Synth (M, p345)
Natural (F, p229)
Synth (F, p229)

Monolingual Copy-Synthesis (unseen):

	VCTK-031	VCTK-032	VCTK-033	VCTK-034
Natural (M, p260)
Synth (M, p260)
Natural (F, p300)
Synth (F, p300)

Monolingual Voice Transformation:

* indicates an item that was not used for the evaluation in the paper.

	VCTK-p229_005	VCTK-p253_005	VCTK-p256_009	VCTK-p302_009
VCTK Natural
Speaker Code 67
Speaker Code 109
Speaker Code 242
Speaker Code 109+242
* Speaker Code 5
* Speaker Code 30
* Speaker Code 220
* Speaker Code 30+220

Content-Based Masking:

* indicates an item that was not used for the evaluation in the paper.

	* VCTK-003	* VCTK-004	VCTK-005	VCTK-006
Natural (F, p229)
	--	--	--	--
	"blue cheese"	"plastic snake"	"these things"	"sunlight strikes"
SSN-Mask (F, p229)
Reverse-Mask (F, p229)
	--	--	--	--
	"fresh snowpeas"	"big toy frog"	"three red bags"	"raindrops in the air"
SSN-Mask (F, p229)
Reverse-Mask (F, p229)

	"blue cheese"	"plastic snake"	"these things"	"sunlight strikes"
Natural (M, p246)
	--	--	--	--
SSN-Mask (M, p246)
Reverse-Mask (M, p246)
	--	--	--	--
	"fresh snowpeas"	"big toy frog"	"three red bags"	"raindrops in the air"
SSN-Mask (M, p246)
Reverse-Mask (M, p246)