Microsoft has announced an AI voice simulator capable of accurately imitating a person’s voice after listening to them speak for just three seconds.
The VALL-E language model was trained on 60,000 hours of English speech from 7,000 different speakers to synthesize “high-quality personalised speech” from any unseen speaker.
Given a recording of a person’s voice, the system can make it sound as though that person is saying anything, even imitating the original speaker’s emotional tone and acoustic environment.
Experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot text-to-speech (TTS) system in speech naturalness and speaker similarity. It can also preserve the speaker’s emotion and the acoustic environment of the prompt recording during synthesis.
Potential applications include generating entire audiobooks in an author’s voice from a single sample recording, producing natural-sounding voice-overs for videos, and filling in an actor’s dialogue when the original film recording is corrupted.
As with other deepfake technology that imitates a person’s visual likeness in videos, there is the potential for misuse.
The VALL-E software used to generate the fake speech is not currently available for public use, with Microsoft citing the potential for misuse of the model, such as spoofing voice identification or impersonating a specific speaker.
Microsoft said it would abide by its Responsible AI Principles as it continues to develop VALL-E, and would also consider ways to detect synthesized speech in order to mitigate such risks.
Microsoft trained VALL-E on voice recordings in the public domain, mostly from LibriVox audiobooks, and the speakers who were imitated took part in the experiments willingly.