Microsoft’s VALL-E: An AI Tool That Can Mimic Anyone’s Voice

Microsoft's new AI bot VALL-E can mimic a voice from only a three-second audio sample

A team of Microsoft researchers has created an innovative text-to-speech AI model named VALL-E. Once trained, it can closely imitate a person's voice, and it needs only a three-second audio sample of that voice to do so.

Moreover, the researchers claim that once VALL-E learns a specific voice, it can synthesize audio of that person saying anything, in a way that attempts to preserve the speaker's emotional tone as well as the acoustic environment of the recording. The developers suggest that VALL-E could be used for high-quality text-to-speech applications; for speech editing, in which a recording of a person's voice is altered by editing its text transcript; and, in conjunction with other generative AI models like GPT-3, to create content.

Microsoft's VALL-E builds on EnCodec, an audio codec technique that Meta revealed in October 2022. In contrast to conventional text-to-speech systems, which typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. It analyzes how a person sounds, breaks that information down into discrete components called tokens, and then uses its training data to match what it "knows" about how that voice would sound if it spoke phrases outside the three-second sample.
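To make the idea of "discrete audio codec codes" more concrete, here is a minimal Python sketch. It is not VALL-E or EnCodec: the fixed 16-level `CODEBOOK` and the `encode` and `decode` helpers are hypothetical stand-ins, and a real neural codec learns its codebooks from data. The sketch only shows how a waveform can be turned into a sequence of integer tokens and then approximately rebuilt from them.

```python
import math

# Hypothetical toy codebook: 16 evenly spaced amplitude levels in [-1, 1].
# A real neural codec learns its codebooks; this one is fixed for illustration.
CODEBOOK = [-1.0 + 2.0 * i / 15 for i in range(16)]


def encode(samples):
    """Map each sample to the index of the nearest codebook entry (a token)."""
    return [
        min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - s))
        for s in samples
    ]


def decode(tokens):
    """Turn tokens back into approximate sample values by codebook lookup."""
    return [CODEBOOK[t] for t in tokens]


# A short sine wave stands in for a snippet of recorded speech.
wave = [math.sin(2 * math.pi * n / 8) for n in range(8)]
tokens = encode(wave)      # a list of small integers, one per sample
restored = decode(tokens)  # close to, but not identical to, the original wave
```

In this toy setup, a model that generates speech would predict the integer tokens rather than the raw waveform, and a decoder would turn those tokens back into sound.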

Microsoft trained VALL-E's speech synthesis capabilities on LibriLight, an audio library assembled by Meta, the parent company of Facebook. It contains 60,000 hours of English-language speech from more than 7,000 speakers, mostly extracted from LibriVox public domain audiobooks. For VALL-E to produce a good result, the voice in the three-second sample must closely resemble a voice in the training data.

In addition to preserving a speaker's vocal timbre and emotional tone, VALL-E can also imitate the "acoustic environment" of the sample audio. For example, if the sample came from a phone call, the synthesized output will simulate the acoustic and frequency characteristics of a phone call; in other words, it will sound like a phone call too. Furthermore, Microsoft's samples (included in the "Synthesis of Diversity" section) demonstrate that VALL-E can generate variations in voice tone by changing the random seed used in the generation process. Microsoft AI Research aims to create artificial intelligence systems that complement human reasoning to augment and enrich our experience and competencies.
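The effect of the random seed mentioned above can be illustrated with a small Python sketch. This is an assumption-laden toy, not Microsoft's code: `TOKEN_IDS`, `WEIGHTS`, and `sample_tokens` are hypothetical, and the fixed probabilities stand in for whatever a real model would predict at each step. It shows only that sampling from the same probabilities with a different seed can yield a different token sequence, while reusing a seed reproduces the same one.

```python
import random

# Hypothetical probabilities over four toy audio tokens.
# A real model would predict a fresh distribution at every step.
TOKEN_IDS = [0, 1, 2, 3]
WEIGHTS = [0.4, 0.3, 0.2, 0.1]


def sample_tokens(seed, length=20):
    """Draw a token sequence using a random generator fixed by `seed`."""
    rng = random.Random(seed)
    return rng.choices(TOKEN_IDS, weights=WEIGHTS, k=length)


take_a = sample_tokens(seed=1)
take_b = sample_tokens(seed=2)        # a different seed will generally give a different take
take_a_again = sample_tokens(seed=1)  # the same seed reproduces take_a exactly
```

In a speech model, each distinct token sequence would decode to a slightly different rendition of the same sentence.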

Analytics Insight