Audio samples from "Analysis of Voice Conversion and Code-Switching Synthesis Using VQ-VAE"

Paper: Submitted to IEEE SLT 2023

Authors: Shuvayanti Das, Jennifer Williams, and Catherine Lai

Abstract: This paper presents an analysis of speech synthesis quality achieved by simultaneously performing voice conversion and linguistic code-switching using multilingual VQ-VAE speech synthesis in German, French, English and Italian. In this paper, we utilize VQ code indices representing phone information from VQ-VAE to perform code-switching and a VQ speaker code to perform voice conversion in a single system with a neural vocoder. Our analysis examines several aspects of code-switching including the number of language switches and the number of words involved in each switch. We found that speech synthesis quality degrades after increasing the number of language switches within an utterance and decreasing the number of words. We also found some evidence of accent transfer when performing voice conversion across languages as observed when a speaker's original language differs from the language of a synthetic target utterance. We present results from our listening tests and discuss the inherent difficulties of assessing accent transfer in speech synthesis. Our work highlights some of the limitations and strengths of using a semi-supervised end-to-end system like VQ-VAE for handling multilingual synthesis. Our work provides insight into why multilingual speech synthesis is challenging and we suggest some directions for expanding work in this area.

Note: Two samples from male speakers exhibit a loud buzz caused by a vocoder failure; please lower your volume before listening.


Code-Switching with Varying Linguistic Units

Unit-4, Switches-8 | Unit-8, Switches-4 | Unit-8, Switches-8 | Unit-8, Switches-12
Speaker 19 (female):
Speaker 17 (male):
Speaker 24 (female):
Speaker 14 (male):

Accent Transfer (combining code-switching with voice conversion):

English - French | French - English
Speaker 19 (female):
Speaker 35 (male):
Speaker 33 (male):
Speaker 25 (female):
Speaker 15 (female):
Speaker 16 (male):
Speaker 22 (female):
Speaker 26 (male):