A versatile voice cloning system

Different interfaces of the proposed voice cloning system

Figure (a) shows a typical text-to-speech (TTS) interface, while Figure (b) shows a typical voice conversion (VC) interface.

Using deep learning, we developed a novel speech synthesis system, called NAUTILUS, that can generate speech in a target voice from either a text input or a reference utterance of an arbitrary source speaker. The proposed system can clone new voices from the speech of target speakers without requiring transcriptions. Moreover, depending on the availability of additional data for the target speaker, the cloning strategy can be adjusted to take advantage of that data, improving the overall performance of the text-to-speech (TTS) and/or voice conversion (VC) systems as well as speaker similarity by capturing unique and subtle characteristics. Evaluations show that our system achieves state-of-the-art performance for both TTS and VC when cloning with just five minutes of speech. It can also switch between TTS and VC while maintaining highly consistent speaker characteristics, which is useful for many practical applications such as movie dubbing, video games, and personalized voice avatars.

While TTS and VC are not new technologies, the ability to clone new voices from untranscribed speech has major implications for their applications and social impact: millions of voices can be cloned inexpensively and efficiently by analyzing speech utterances recorded with or without the active involvement of the target speakers. This is a prerequisite for the mass adoption of personalized speech synthesis technology by the general public, as opposed to limited use in industrial and corporate settings. However, given its high speaker similarity and data efficiency, there is a risk that such technology might be used with malicious intent, which could cause significant harm to society.

  • Title: NAUTILUS: A Versatile Voice Cloning System
  • Authors: Hieu-Thi Luong, Junichi Yamagishi
  • Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • DOI: https://doi.org/10.1109/TASLP.2020.3034994
  • Publication year: 2020

Department of Informatics, LUONG HIEU THI