Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
Many people assume that cloning someone's voice requires hours of recordings of that voice. This paper shows that, with transfer learning, it does not: given just five seconds of someone's speech, the model can synthesize new speech in their voice. Demo
How does it work?
There are three components
1. Speaker Encoder
2. Synthesizer
3. Vocoder
Speaker Encoder
It is a neural network trained on a few thousand speakers, so it learns to capture the essence of a human voice in a fixed-size embedding. This training happens only once, at the beginning; afterwards, that learned representation is reused to clone new voices.
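The idea of a fixed-size speaker embedding can be illustrated with a toy sketch: frame-level features are projected, averaged over time, and L2-normalized into one vector per utterance. The real encoder is a trained LSTM; the random projection below is purely a hypothetical stand-in.

```python
import numpy as np

def embed_utterance(frames, weights):
    """Toy stand-in for the trained speaker encoder (the real model is
    an LSTM trained with a speaker-verification loss).
    frames: (n_frames, n_mels) mel features; weights: (n_mels, dim)."""
    # Project each frame, then average over time ("d-vector" style pooling).
    frame_embeds = np.tanh(frames @ weights)        # (n_frames, dim)
    utterance_embed = frame_embeds.mean(axis=0)     # (dim,)
    # L2-normalize so embeddings lie on the unit hypersphere,
    # which makes cosine similarity a natural way to compare speakers.
    return utterance_embed / np.linalg.norm(utterance_embed)

rng = np.random.default_rng(0)
mels = rng.standard_normal((100, 40))   # ~a few seconds of 40-band mel frames (made up)
W = rng.standard_normal((40, 256)) * 0.1
e = embed_utterance(mels, W)
print(e.shape)   # a single 256-dimensional speaker embedding
```

Whatever the utterance length, the time-axis average collapses it to one vector, which is what lets a five-second clip stand in for a speaker.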
Synthesizer
It takes text as input, conditioned on the speaker embedding, and outputs a mel spectrogram, a representation that captures the speaker's voice and intonation.
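To make the synthesizer's output concrete, here is a rough NumPy-only sketch of what a mel spectrogram is: frame the waveform, take the magnitude STFT, and warp the linear frequency bins onto the perceptual mel scale. The frame sizes and band count below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):           # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):          # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(wav, sr=16000, n_fft=400, hop=160, n_mels=40):
    # Frame the waveform, apply a Hann window, take the magnitude STFT...
    frames = [wav[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(wav) - n_fft, hop)]
    mag = np.abs(np.fft.rfft(frames, axis=1))   # (n_frames, n_fft//2 + 1)
    # ...then collapse linear frequency bins onto the mel scale and take log.
    return np.log(mag @ mel_filterbank(n_mels, n_fft, sr).T + 1e-6)

sr = 16000
t = np.arange(sr) / sr                          # 1 second of audio
wav = np.sin(2 * np.pi * 440.0 * t)             # a pure 440 Hz tone
S = mel_spectrogram(wav, sr)
print(S.shape)                                  # (time frames, mel bands)
```

The synthesizer predicts exactly this kind of time-by-mel-bands matrix from text, and the vocoder then turns it back into a waveform.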
Vocoder
The neural vocoder is based on DeepMind's WaveNet. Unlike a conventional vocoder, it takes acoustic parameters (e.g. a mel spectrogram) as input and converts them into a speech waveform. WaveNet Paper
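One concrete WaveNet detail worth seeing: rather than regressing raw amplitudes, WaveNet predicts each audio sample as a 256-way classification over mu-law-companded values. Below is a minimal sketch of that companding step alone (not the network itself):

```python
import numpy as np

MU = 255  # 8-bit mu-law, giving the 256 output classes WaveNet predicts

def mulaw_encode(x, mu=MU):
    """Compress audio in [-1, 1] into mu+1 discrete levels."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return ((y + 1) / 2 * mu).astype(int)       # integer class in [0, mu]

def mulaw_decode(q, mu=MU):
    """Invert the companding back to an approximate waveform sample."""
    y = 2 * q.astype(float) / mu - 1
    return np.sign(y) * ((1 + mu) ** np.abs(y) - 1) / mu

x = np.linspace(-1, 1, 9)                       # a few sample amplitudes
q = mulaw_encode(x)
x_hat = mulaw_decode(q)
print(np.max(np.abs(x - x_hat)))                # small reconstruction error
```

Mu-law allocates more quantization levels to quiet sounds, matching how human hearing works, which is why 256 classes are enough for intelligible speech.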
Evaluation
Two metrics are the main focus of evaluation:
1. Naturalness
2. Speaker Similarity
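In the paper, both metrics are scored by human listeners as mean opinion scores (MOS). As a purely illustrative objective proxy for speaker similarity, one can also compare speaker-encoder embeddings of the synthesized and reference audio by cosine similarity; the vectors below are random stand-ins, not real embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings; for unit-norm
    embeddings this reduces to a plain dot product."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
ref = rng.standard_normal(256)                  # reference speaker (made up)
clone = ref + 0.1 * rng.standard_normal(256)    # hypothetical good clone: ref + noise
other = rng.standard_normal(256)                # an unrelated speaker

print(cosine_similarity(ref, clone))            # close to 1: same speaker
print(cosine_similarity(ref, other))            # close to 0: different speaker
```

A cloned voice whose embedding sits close to the reference speaker's embedding is, by this proxy, a good clone.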
VCTK — contains 44 hours of clean speech from 109 speakers
Librispeech — consists of the union of the two “clean” training sets, comprising 436 hours of speech from 1,172 speakers
Here, the model is trained on VCTK and tested on Librispeech; in a second experiment, it is trained on Librispeech and tested on VCTK.
Trained on     Tested on      Naturalness    Similarity
VCTK           Librispeech    4.28 ± 0.05    1.82 ± 0.08
Librispeech    VCTK           4.01 ± 0.06    2.77 ± 0.08
As you can see, the cross-dataset results are not especially strong, but for a task this complex they are a promising start.
Summary
The model combines a neural speaker encoder with a synthesizer and DeepMind's WaveNet vocoder. It takes a voice sample, which can be as short as five seconds, plus some input text, and converts that text into speech carrying the sampled speaker's voice and intonation.
“Our intelligence is what makes us human, and AI is an extension of that quality.” — Yann LeCun