Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

Salman Faroz
Published in Voice Tech Podcast · Jul 17, 2020


Many people think cloning someone’s voice requires hours of recordings of that voice. With the technique this paper proposes, we do not: given just five seconds of someone’s voice, we can clone it. Demo

paper link

How does it work?

There are three components (a sketch of how they fit together follows the list):

1. Speaker Encoder

2. Synthesizer

3. Vocoder
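At inference time the three components form a simple pipeline: the encoder turns a short reference clip into a speaker embedding, the synthesizer turns text plus that embedding into a mel spectrogram, and the vocoder turns the spectrogram into a waveform. A minimal Python sketch, where the function and argument names are hypothetical stand-ins for the three trained models described below:

```python
# Hypothetical end-to-end pipeline; each argument stands in for one
# of the three trained components described in this article.
def clone_voice(reference_audio, text, encoder, synthesizer, vocoder):
    embedding = encoder(reference_audio)  # ~5 s clip -> speaker embedding
    mel = synthesizer(text, embedding)    # text + embedding -> mel spectrogram
    return vocoder(mel)                   # mel spectrogram -> speech waveform
```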

Speaker Encoder

It is a neural network trained on a few thousand speakers, so it learns to capture the essence of a speaker’s voice. This training needs to happen only once, at the beginning; after that, we can use the learned representation for cloning.
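For intuition, here is a minimal PyTorch sketch of such an encoder, loosely following the paper’s LSTM-based d-vector design; the layer sizes and names are illustrative assumptions, not the authors’ exact configuration:

```python
# Minimal sketch of a speaker encoder: an LSTM over mel frames whose
# final hidden state is projected and L2-normalized into a "d-vector".
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels=40, hidden=256, emb_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mels):  # mels: (batch, frames, n_mels)
        _, (h, _) = self.lstm(mels)
        emb = self.proj(h[-1])           # final hidden state of last layer
        return F.normalize(emb, dim=1)   # unit-length speaker embedding

# ~1.6 s of audio as 160 mel frames (hypothetical shapes)
embedding = SpeakerEncoder()(torch.randn(1, 160, 40))
print(embedding.shape)  # torch.Size([1, 256])
```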


Synthesizer

It takes text as input (together with the speaker embedding) and gives us a mel spectrogram, a representation of the target voice and its intonation.
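In the paper, the synthesizer is conditioned on the speaker by concatenating the embedding with the text encoder’s output at every timestep, before attention and decoding. A hedged sketch of that conditioning step, with illustrative shapes and names:

```python
# Tile the speaker embedding across the text-encoder timesteps and
# concatenate, so the decoder sees speaker identity at every step.
import torch

def condition_on_speaker(text_encodings, speaker_emb):
    """text_encodings: (batch, chars, enc_dim); speaker_emb: (batch, emb_dim)."""
    batch, chars, _ = text_encodings.shape
    tiled = speaker_emb.unsqueeze(1).expand(batch, chars, -1)
    return torch.cat([text_encodings, tiled], dim=-1)

enc = condition_on_speaker(torch.randn(2, 50, 512), torch.randn(2, 256))
print(enc.shape)  # torch.Size([2, 50, 768]) -> fed to the attention decoder
```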

Vocoder

The neural vocoder is implemented with DeepMind’s WaveNet technique. Unlike a conventional vocoder, it takes acoustic parameters (e.g., a mel spectrogram) and converts them into a speech waveform. WaveNet Paper
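WaveNet’s key ingredient is a stack of dilated causal convolutions whose receptive field grows exponentially with depth. A minimal, illustrative sketch of that stack (not the full gated architecture):

```python
# Stack of dilated causal 1-D convolutions; dilation doubles per layer,
# so six layers see 2^6 = 64 past samples. Left-padding keeps it causal.
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, layers=6):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2**i)
            for i in range(layers)
        )

    def forward(self, x):  # x: (batch, channels, samples)
        for conv in self.convs:
            pad = conv.dilation[0]  # pad on the left only -> no future leakage
            x = torch.relu(conv(nn.functional.pad(x, (pad, 0))))
        return x

out = DilatedCausalStack()(torch.randn(1, 32, 16000))
print(out.shape)  # torch.Size([1, 32, 16000])
```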

Evaluation

There are two main things to focus on during evaluation (a similarity-measurement sketch follows the list):

1. Naturalness

2. Speaker Similarity
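The paper’s scores come from human ratings (mean opinion scores), but a common objective proxy for speaker similarity is the cosine similarity between speaker-encoder embeddings of the reference and the synthesized audio. A small sketch, assuming 256-dimensional embeddings:

```python
# Cosine similarity between two speaker embeddings: 1.0 means identical
# direction (same speaker, ideally), values near 0 mean unrelated voices.
import numpy as np

def cosine_similarity(emb_a, emb_b):
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

print(cosine_similarity(np.random.rand(256), np.random.rand(256)))
```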

VCTK — contains 44 hours of clean speech from 109 speakers

LibriSpeech — consists of the union of the two “clean” training sets, comprising 436 hours of speech from 1,172 speakers

Here, the model is trained on VCTK and tested on LibriSpeech; in a second experiment, it is trained on LibriSpeech and tested on VCTK.

Trained on     Tested on      Naturalness    Similarity
VCTK           LibriSpeech    4.28 ± 0.05    1.82 ± 0.08
LibriSpeech    VCTK           4.01 ± 0.06    2.77 ± 0.08


As the table shows, the cross-dataset results, especially the similarity scores, are not great yet, but for a task this complex they are a promising start.

Summary

This model combines a neural speaker encoder, a synthesizer, and DeepMind’s WaveNet vocoder. Given a voice sample as short as five seconds plus some input text, it converts that text into speech with that particular speaker’s voice and intonation.

“Our intelligence is what makes us human, and AI is an extension of that quality.” — Yann LeCun
