Audio Samples for RTVC-5 Voice Cloning Model

Repository: CorentinJ/Real-Time-Voice-Cloning @5425557

Description: RTVC-5 uses a synthesizer trained on the mic1 recordings from the VCTK dataset. Speakers p240 and p260 are held out of the training set. As in RTVC-4, silence in the raw recordings is removed using voice activity detection (VAD). The vocoder is trained on ground-truth mel spectrograms from the mic1 data.
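The page does not say which VAD implementation was used, so the following is only a minimal sketch of the general idea: split the waveform into short frames and drop the frames whose energy is too low. The function name `remove_silence`, the frame length, and the RMS threshold are all illustrative assumptions, and a plain energy threshold stands in here for a real VAD.

```python
import math

def remove_silence(samples, sample_rate, frame_ms=30, threshold=0.01):
    """Drop frames whose RMS energy is below `threshold`.

    Hypothetical helper: a simple energy gate, not the VAD used for RTVC-5.
    `samples` is a list of floats in [-1.0, 1.0].
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    kept = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        if rms >= threshold:
            kept.extend(frame)
    return kept

# Toy signal: 0.1 s of silence, 0.1 s of a 440 Hz tone, 0.1 s of silence.
sr = 16000
silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(1600)]
trimmed = remove_silence(silence + tone + silence, sr, frame_ms=10)
```

On this toy input only the tone frames survive. Real recordings contain breath noise and room tone, so a practical pipeline would likely need a smoothed decision over neighbouring frames rather than a hard per-frame cut.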


RTVC-5 Model Overview

	Name	Model	Steps	Batch Size	Datasets Used	Speakers	Audio Duration
Speaker Encoder:	Pretrained	GE2E	1,564,501	64	LibriSpeech train-other-500, VoxCeleb1 Dev A-D, VoxCeleb2 Dev A-H	8,371	3,201 hours
Synthesizer:	VCTK_Taco2_242k	Tacotron 2	242,000	12	VCTK	109	44 hours
Vocoder:	VCTK_GT_733k	WaveRNN	733,000	80	VCTK	109	44 hours
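To put the table's numbers in proportion, the sketch below multiplies steps by batch size and divides audio duration by speaker count. The helper names are made up for illustration, and what counts as one "batch item" is an assumption that may differ between the three models, so the products are rough scale indicators rather than comparable sample counts.

```python
# Figures copied from the overview table: (steps, batch size, speakers, hours).
models = {
    "Speaker Encoder": (1_564_501, 64, 8371, 3201),
    "Synthesizer": (242_000, 12, 109, 44),
    "Vocoder": (733_000, 80, 109, 44),
}

def batch_items_seen(steps, batch_size):
    """Total batch items processed, assuming a constant batch size."""
    return steps * batch_size

def minutes_per_speaker(hours, speakers):
    """Average audio per speaker in minutes."""
    return hours * 60 / speakers

for name, (steps, batch, speakers, hours) in models.items():
    seen = batch_items_seen(steps, batch)
    avg = minutes_per_speaker(hours, speakers)
    print(f"{name}: {seen:,} batch items, {avg:.1f} min of audio per speaker")
```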

Voice Cloning Results

All speakers are unseen during training. The first row is the reference audio used to compute the speaker embedding. The rows below it are synthesized from that speaker embedding.
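As a rough illustration of what "computing a speaker embedding" can look like, the sketch below averages per-window embeddings into one unit-length vector and compares vectors by cosine similarity, which is one way to judge how close a synthesized clip is to its reference. This is an assumption about the general GE2E-style approach, not a description of the repository's code. The 3-dimensional vectors are toy values, and all function names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def utterance_embedding(partial_embeds):
    """Average per-window embeddings, then scale the mean to unit length."""
    dim = len(partial_embeds[0])
    count = len(partial_embeds)
    mean = [sum(e[i] for e in partial_embeds) / count for i in range(dim)]
    return l2_normalize(mean)

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

# Toy data standing in for encoder outputs on windows of the reference audio.
reference_windows = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
reference_embed = utterance_embedding(reference_windows)

# Toy embeddings for a clip in the same voice and a clip in a different voice.
same_speaker = [0.8, 0.2, 0.1]
other_speaker = [0.0, 0.2, 0.9]

sim_same = cosine_similarity(reference_embed, same_speaker)
sim_other = cosine_similarity(reference_embed, other_speaker)
```

With these toy values the same-voice clip scores much closer to the reference than the different-voice clip. Real embeddings have far more dimensions, and similarity scores should be read alongside listening tests such as the samples on this page.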

	VCTK p240	VCTK p260	LibriSpeech 1320	LibriSpeech 3575	LibriSpeech 6829	LibriSpeech 8230
	Reference:


	Synthesized:
	0: Take a look at these pages for crooked creek drive.
Google:
RTVC-4:
RTVC-5:

	1: There are several listings for gas station.
Google:
RTVC-4:
RTVC-5:

	2: Here's the forecast for the next four days.
Google:
RTVC-4:
RTVC-5:

	3: Here is some information about the Gospel of John.
Google:
RTVC-4:
RTVC-5:

	4: His motives were more pragmatic and political.
Google:
RTVC-4:
RTVC-5:

	5: She had three brothers and two sisters.
Google:
RTVC-4:
RTVC-5:

	6: This work reflects a quest for lost identity, a recuperation of an unknown past.
Google:
RTVC-4:
RTVC-5:

	7: There were many editions of these works still being used in the nineteenth century.
Google:
RTVC-4:
RTVC-5:

	8: Modern birds are classified as coelurosaurs by nearly all palaeontologists.
Google:
RTVC-4:
RTVC-5:

	9: He was being fitted for ruling the state, in the words of his biographer.
Google:
RTVC-4:
RTVC-5: