Meta introduced Audiobox, its latest foundation research model for audio generation. The family includes specialized variants such as Audiobox Speech and Audiobox Sound, which create voices and sound effects by combining voice inputs with natural-language prompts to serve a wide range of audio needs. Underlying these variants is Audiobox SSL, a self-supervised model that forms the common foundation for all Audiobox versions.
Audiobox also lets users pair a voice input with a text style prompt to synthesize speech in different environments or emotional registers, for example speaking in a cathedral, or speaking sadly at a slower pace.
Accepting both text and voice inputs makes Audiobox considerably more controllable than earlier Meta models such as Voicebox. Users can describe and manipulate sound effects through text prompts, widening the range of controllable attributes. When the two inputs are combined, the voice input fixes the speaker's fundamental timbre, while the text prompt adjusts the remaining attributes, such as style and environment.
Audiobox inherits Voicebox's guided audio-generation training objective and flow-matching modeling method, which together enable audio infilling: users can edit a recording in place, for example inserting varied thunder claps into a rain soundscape, making the model more versatile.
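The flow-matching idea behind this can be sketched as follows. This is a minimal illustration of the optimal-transport conditional flow-matching target from the literature that Voicebox builds on, plus a masked loss to suggest how infilling restricts generation to edited regions. The function names, the mel-frame dimensionality, and the masking scheme are illustrative assumptions, not Audiobox's actual implementation:

```python
import numpy as np

def fm_target(x0, x1, t, sigma_min=1e-4):
    """Conditional flow-matching interpolant and regression target (sketch).

    x0: noise sample; x1: clean audio features (e.g. a spectrogram frame);
    t: scalar time in [0, 1]. Returns x_t, the point on the noise-to-data
    path, and u_t, the constant velocity that the model v_theta(x_t, t)
    is trained to regress.
    """
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return x_t, u_t

def masked_fm_loss(v_pred, u_t, mask):
    """Infilling-style loss: penalize only masked (to-be-generated) frames,
    while unmasked context frames merely condition the model."""
    return float(np.mean(mask * (v_pred - u_t) ** 2))

# Usage sketch: the target for one training step on a random frame.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(80)                        # hypothetical 80-dim mel frame
x0 = rng.standard_normal(80)                        # Gaussian noise sample
x_t, u_t = fm_target(x0, x1, t=rng.uniform())
mask = (rng.uniform(size=80) < 0.3).astype(float)   # 30% of the frame masked
loss = masked_fm_loss(u_t + 0.1, u_t, mask)         # toy model prediction
```

At inference time, such models integrate the learned vector field from noise toward data; for infilling, the unmasked context is held fixed so only the masked span (the thunder claps in the example above) is synthesized.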