Meta has unveiled a new AI tool, dubbed ‘Voicebox’, which it claims represents a breakthrough in AI-powered speech generation. However, the company won’t be releasing it to the public just yet, citing the potential risks of misuse.
This type of technology could be used in the future to help creators easily edit audio tracks, allow visually impaired people to hear written messages from friends in their voices, and enable people to speak any foreign language in their own voice.
Meta
Voicebox is multilingual: it can produce speech in six languages, having been trained on more than 50,000 hours of recorded speech and transcripts from public domain audiobooks in English, French, Spanish, German, Polish, and Portuguese.
In the future, multipurpose generative AI models like Voicebox could give natural-sounding voices to virtual assistants and non-player characters in the metaverse.
They could allow visually impaired people to hear written messages from friends read by AI in their voices, give creators new tools to easily create and edit audio tracks for videos, and much more.
Meta
Below is a short summary of the capabilities of Meta’s Voicebox:
- In-context text-to-speech synthesis — Voicebox can match the audio style of a sample as short as two seconds and use it for text-to-speech generation.
- Speech editing and noise reduction — Voicebox can recreate a portion of the speech in an audio clip interrupted by noise or replace misspoken words without re-recording an entire speech. For example, you can identify a segment of audio that’s interrupted by a dog barking, crop it, and instruct Voicebox to re-generate that segment.
- Cross-lingual style transfer — When given a sample of someone’s speech and a passage of text in English, French, German, Spanish, Polish, or Portuguese, Voicebox can produce a reading of the text in any of those languages, even when the sample speech and the text are in different languages. This capability could be used in the future to help people communicate naturally and authentically, even if they don’t speak the same languages.
- Diverse speech sampling — Having learned from diverse data, Voicebox can generate speech more representative of how people talk in the real world.
Meta has provided an in-depth overview discussing how Voicebox works but has opted not to make its model or code publicly available for now due to the potential risks of misuse. “While we believe it is important to be open with the AI community and to share our research to advance the state of the art in AI, it’s also necessary to strike the right balance between openness with responsibility,” Meta said.
Meta may be concerned that Voicebox could be used to create believable “deepfake” sound clips of famous or influential people saying things they never said. That could be particularly problematic when combined with powerful visual AI technology that can replace a person’s face with someone else’s.
To mitigate such abuse, Meta has built a “highly effective” classifier to distinguish between authentic speech and audio generated with Voicebox. The company has also published a 32-page research paper, available to the general public, with more details on the classifier and on its general approach to developing Voicebox.