Meta has announced “Voicebox,” a generative AI model that the company says could transform speech synthesis. According to a Meta blog post, Voicebox is the first model that can generalize to speech-generation tasks it was not specifically trained for while still performing well.
Unlike generative models that produce images or text, Voicebox produces high-quality audio clips. It can create speech in a variety of styles, either from scratch or by modifying an existing sample. The model supports speech synthesis in six languages: English, French, German, Spanish, Polish, and Portuguese. Voicebox also offers noise removal, content editing, style conversion, and diverse sample generation.
Voicebox is distinguished by its learning approach. Rather than generating audio autoregressively, it learns directly from raw audio and the accompanying transcriptions. Because it is not autoregressive, the model can modify any portion of a given sample, not just the end of a clip, which makes it more flexible and versatile.
According to Meta, Voicebox is trained to predict a speech segment when given the surrounding speech and the transcript of that segment. Once the model has learned to fill in speech from context, it can be applied to a variety of speech-generation tasks, including producing only the portion of a recording that needs to change rather than regenerating the entire recording.
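To make the infilling idea concrete, here is a minimal sketch of how one training example for such a task might be assembled. It is an illustration only, not Meta's implementation: the names `make_infilling_example` and `random_span`, the list-of-floats frame representation, and the choice to zero out hidden frames are all assumptions made for this example. The model itself is omitted; it would receive the context, the mask, and the transcript, and be trained to reproduce the target.

```python
import random
from typing import List, Tuple

# Hypothetical representation: one acoustic feature vector per frame.
Frame = List[float]


def make_infilling_example(
    frames: List[Frame],
    mask_start: int,
    mask_len: int,
) -> Tuple[List[Frame], List[bool], List[Frame]]:
    """Split an utterance into visible context and a hidden target span.

    Returns (context, mask, target):
      context - a copy of the frames with the hidden span zeroed out
      mask    - True where a frame is hidden from the model
      target  - the original frames inside the hidden span
    """
    if mask_len <= 0:
        raise ValueError("mask_len must be positive")
    if mask_start < 0 or mask_start + mask_len > len(frames):
        raise ValueError("masked span must lie inside the utterance")

    mask_end = mask_start + mask_len
    mask = [mask_start <= i < mask_end for i in range(len(frames))]
    # Zeroing hidden frames is an assumption of this sketch.
    context = [[0.0] * len(f) if m else list(f) for f, m in zip(frames, mask)]
    target = [list(f) for f in frames[mask_start:mask_end]]
    return context, mask, target


def random_span(num_frames: int, rng: random.Random) -> Tuple[int, int]:
    """Pick a random contiguous span (start, length) to hide."""
    mask_len = rng.randint(1, num_frames)
    mask_start = rng.randint(0, num_frames - mask_len)
    return mask_start, mask_len


# Toy utterance: 6 frames of 2-dimensional features.
utterance = [[float(i), float(i) + 0.5] for i in range(6)]
context, mask, target = make_infilling_example(utterance, mask_start=2, mask_len=3)
```

In this framing, editing a recording would amount to choosing the span by hand instead of at random: only the frames to be regenerated are hidden, and everything else is passed through as context.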
Because of this adaptability, Voicebox performs well across a range of applications, including in-context text-to-speech synthesis, cross-lingual style transfer, speech denoising and editing, and diverse speech sampling. The model's performance and adaptability open up new avenues for creative audio generation and advanced speech manipulation.