Audio samples from "Exploring Disentanglement with Multilingual and Monolingual VQ-VAE"

Paper: arXiv paper - submitted to Speech Synthesis Workshop 2021 (SSW11)

Authors: Jennifer Williams, Jason Fong, Erica Cooper, and Junichi Yamagishi

Abstract:

This work examines the content and usefulness of disentangled phone and speaker representations from two separately trained VQ-VAE systems: one trained on multilingual data and another trained on monolingual data. We explore the multi- and monolingual models using four small proof-of-concept tasks: copy-synthesis, voice transformation, linguistic code-switching, and content-based privacy masking. From these tasks, we reflect on how disentangled phone and speaker representations can be used to manipulate speech in a meaningful way. Our experiments demonstrate that the VQ representations are suitable for these tasks, including creating new voices by mixing speaker representations together. We also present our novel technique to conceal the content of targeted words within an utterance by manipulating phone VQ codes, while retaining speaker identity and intelligibility of surrounding words. Finally, we discuss recommendations for further increasing the viability of disentangled representations.

Monolingual Copy-Synthesis (seen):

VCTK-003, VCTK-004, VCTK-005, VCTK-006
Natural (M, p345)
Synth (M, p345)
Natural (F, p229)
Synth (F, p229)


Monolingual Copy-Synthesis (unseen):

VCTK-031, VCTK-032, VCTK-033, VCTK-034
Natural (M, p260)
Synth (M, p260)
Natural (F, p300)
Synth (F, p300)





Monolingual Voice Transformation:

* indicates an item that was not used for the evaluation in the paper.

VCTK-p229_005, VCTK-p253_005, VCTK-p256_009, VCTK-p302_009
VCTK Natural
Speaker Code 67
Speaker Code 109
Speaker Code 242
Speaker Code 109+242
* Speaker Code 5
* Speaker Code 30
* Speaker Code 220
* Speaker Code 30+220
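
The "109+242" and "30+220" rows above are voices made by mixing two speaker codes. This page does not say how the mixing is computed, so the following is only a minimal sketch under the assumption that mixing is a weighted average of two speaker codebook vectors. The function name `mix_speaker_codes`, the 256-entry codebook size, and the 8-dimensional embeddings are all hypothetical stand-ins, not details of the trained models.

```python
import numpy as np


def mix_speaker_codes(codebook, idx_a, idx_b, weight=0.5):
    """Interpolate two speaker codebook vectors.

    ASSUMPTION: mixing is a weighted average; the page does not state the rule.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * codebook[idx_a] + (1.0 - weight) * codebook[idx_b]


# Toy stand-in for a learned speaker codebook (size and dimension are assumed).
rng = np.random.default_rng(0)
speaker_codebook = rng.standard_normal((256, 8))

# Equal-weight mix of speaker codes 109 and 242, as in the "109+242" row.
mixed = mix_speaker_codes(speaker_codebook, 109, 242)
```

In a real system the mixed vector would be passed to the decoder in place of a single speaker embedding; that step depends on the model and is omitted here.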




Content-Based Masking:

* indicates an item that was not used for the evaluation in the paper.

* VCTK-003, * VCTK-004, VCTK-005, VCTK-006
Natural (F, p229)
-- -- -- --
"blue cheese", "plastic snake", "these things", "sunlight strikes"
SSN-Mask (F, p229)
Reverse-Mask (F, p229)
-- -- -- --
"fresh snowpeas", "big toy frog", "three red bags", "raindrops in the air"
SSN-Mask (F, p229)
Reverse-Mask (F, p229)


Natural (M, p246)
-- -- -- --
"blue cheese", "plastic snake", "these things", "sunlight strikes"
SSN-Mask (M, p246)
Reverse-Mask (M, p246)
-- -- -- --
"fresh snowpeas", "big toy frog", "three red bags", "raindrops in the air"
SSN-Mask (M, p246)
Reverse-Mask (M, p246)
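
The masking samples above conceal targeted words by manipulating phone VQ codes. As a rough illustration only, the sketch below assumes that "Reverse-Mask" means reversing the order of the phone codes inside the target word's frame span; the page itself does not define the operation. The function name `reverse_mask`, the code values, and the span boundaries are hypothetical.

```python
import numpy as np


def reverse_mask(phone_codes, start, end):
    """Return a copy with the codes in frames [start, end) reversed in time.

    ASSUMPTION: this is one possible reading of "Reverse-Mask", not a
    confirmed description of the method used for the samples.
    """
    codes = np.asarray(phone_codes)
    if not 0 <= start <= end <= len(codes):
        raise ValueError("span out of range")
    masked = codes.copy()
    masked[start:end] = codes[start:end][::-1]
    return masked


# Made-up phone code indices for a short utterance; frames 2..5 are the target word.
utterance = np.array([12, 12, 40, 7, 7, 93, 5, 5])
masked = reverse_mask(utterance, 2, 6)
```

Codes outside the span are left untouched, which matches the stated goal of keeping the surrounding words intelligible.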






Multilingual Copy-Synthesis (seen):

* indicates an item that was not used for the evaluation in the paper.

SIWIS-English, SIWIS-French, SIWIS-German, * SIWIS-Italian
Natural (M)
Synth (M)
Natural (F)
Synth (F)


Multilingual Copy-Synthesis (unseen):

SIWIS-English, SIWIS-French, SIWIS-German, * SIWIS-Italian
Natural (M)
Synth (M)
Natural (F)
Synth (F)




Multilingual Voice Transformation:

* indicates an item that was not used for the evaluation in the paper.

SIWIS-English, SIWIS-French, SIWIS-German, * SIWIS-Italian
SIWIS Natural
Speaker Code 85
Speaker Code 192
Speaker Code 238
Speaker Code 131+248
-- -- -- --
* Speaker Code 42




Linguistic Code-Switching:

Reference speech is concatenated audio, and synthetic speech is generated from concatenated VQ phone codes.

English-French, French-English, English-German, German-English
Reference (M)
Synth (M)
-- -- -- --
Reference (F)
Synth (F)
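
As noted above, the synthetic code-switched samples are generated from concatenated VQ phone codes. A minimal sketch of that concatenation step follows, assuming each utterance's phone codes are available as a 1-D sequence of integer codebook indices. The function name `concat_phone_codes` and the example code values are hypothetical, and decoding the joined sequence to audio is model-specific and not shown.

```python
import numpy as np


def concat_phone_codes(*segments):
    """Join per-language phone code sequences end to end, in the order given."""
    arrays = [np.asarray(seg, dtype=np.int64) for seg in segments]
    if any(arr.ndim != 1 for arr in arrays):
        raise ValueError("each segment must be a 1-D sequence of code indices")
    return np.concatenate(arrays)


# Made-up phone code indices for an English segment and a French segment.
english_codes = [3, 3, 17, 42]
french_codes = [8, 25, 25]

# English-French code-switched sequence, to be decoded with one speaker code.
switched = concat_phone_codes(english_codes, french_codes)
```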