A machine learning model called Neural Acoustic Fields (NAFs) has been developed by researchers from MIT and the IBM Watson AI Lab to forecast what sounds a listener will hear in various 3D settings. The machine-learning model can mimic what a listener would hear at different locations by simulating how any sound in a room would travel across the space using spatial acoustic information.
The neural acoustic fields system can understand the underlying 3D geometry of a room from sound recordings by precisely modeling the acoustics of a scene. The researchers may utilize the acoustic data collected by their system to create realistic visual reconstructions of a space, akin to how people use sound to infer the elements of their physical surroundings.
This approach might aid artificial intelligence agents in better comprehending their surroundings in addition to its potential uses in virtual and augmented reality. According to Yilun Du, a graduate student in the Department of Electrical Engineering and Computer Science (EECS) and co-author of a paper describing the model, an underwater exploration robot could sense things that are farther away by simulating the acoustic properties of the sound in its environment.
Du admits that most researchers have so far concentrated solely on simulating vision. These models often combine a neural renderer with an implicit representation that has been trained in order to capture and render visuals of a scene simultaneously. By leveraging the multiview consistency between visual observations, these methods can extrapolate images of the same scene from unique viewpoints. However, because humans have a multimodal perception, sound is just as essential as vision, which opens up an attractive research area on improving how sound is used to describe the environment.
Previous studies on capturing a location’s acoustics required careful planning and hand-design of its acoustic function, which cannot be applied to arbitrary scenes. According to the study report from MIT, despite recent improvements in learned implicit functions that have produced ever better visual world representations, learning spatial auditory representations has not made similar strides. A variant of machine-learning model known as the implicit neural representation has been employed in computer vision research to produce continuous, smooth reconstructions of 3D scenes from images. These models make use of neural networks, which are composed of layers of linked nodes, or neurons, that analyze data to perform a task.
The MIT researchers first applied a similar model to represent how sound continuously propagates through a scene, but the approach failed.
This inspired the team to develop neural acoustic fields, an implicit model that reflects how sounds travel in a spatial environment. Neural acoustic fields encode impulse responses in the Fourier frequency domain to capture their complex signal structure.
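To make the Fourier-domain encoding concrete, here is a minimal sketch of representing a toy impulse response as log-magnitude time-frequency frames. The short-time Fourier transform helper, the window and hop sizes, and the synthetic "room" response are all illustrative choices, not details taken from the NAF paper.

```python
import numpy as np

def stft_log_magnitude(signal, win=256, hop=128):
    """Short-time Fourier transform, returned as log-magnitude frames.

    Neural acoustic fields represent impulse responses in the frequency
    domain; this toy helper mirrors that encoding (parameters are
    illustrative, not the paper's).
    """
    window = np.hanning(win)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * window
        spectrum = np.fft.rfft(frame)
        frames.append(np.log(np.abs(spectrum) + 1e-8))
    return np.stack(frames, axis=1)  # shape: (freq_bins, time_frames)

# Toy "room impulse response": a direct pulse plus decaying noisy reverb.
rng = np.random.default_rng(0)
n = 2048
ir = np.zeros(n)
ir[0] = 1.0
ir += 0.3 * rng.standard_normal(n) * np.exp(-np.arange(n) / 400.0)

spec = stft_log_magnitude(ir)
print(spec.shape)  # (129, 15): 129 frequency bins across 15 time frames
```

A grid like `spec` is the kind of quantity a network can be trained to predict per emitter-listener pair, rather than predicting the raw waveform directly.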
Acoustic propagation in a scene is modeled as a linear time-invariant system, which lets neural acoustic fields continuously map every emitter and listener location pair to a neural impulse response function that can then be applied to any sound. Using sound instead of visuals also let the team sidestep the vision model’s dependence on photometric consistency, the phenomenon whereby an object looks roughly the same from any vantage point; that property does not hold for sound. Instead, the neural acoustic fields method exploits the reciprocal nature of sound (exchanging the locations of the source and the listener has no effect on how the sound is perceived) together with the influence of local elements, such as furniture or carpeting, on how sound travels and bounces. The model samples locations at random and learns from a grid of objects and architectural features.
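The reciprocity property described above can be sketched with a tiny stand-in network. The network weights, sizes, and the symmetrization trick below are hypothetical illustrations of the constraint (swapping source and listener must not change the prediction); the actual NAF architecture enforces this differently.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((6, 32))    # toy weights: 3D emitter + 3D listener in
W2 = rng.standard_normal((32, 129))  # out: one column of log-magnitude bins

def raw_field(emitter, listener):
    """Hypothetical tiny network mapping an (emitter, listener) position
    pair to one frequency column of a predicted impulse response."""
    x = np.concatenate([emitter, listener])
    return np.tanh(x @ W1) @ W2

def acoustic_field(emitter, listener):
    """Enforce acoustic reciprocity by symmetrizing over the pair: one
    simple way to guarantee f(a, b) == f(b, a)."""
    return 0.5 * (raw_field(emitter, listener) + raw_field(listener, emitter))

a = np.array([1.0, 0.0, 0.5])
b = np.array([3.0, 2.0, 0.5])
assert np.allclose(acoustic_field(a, b), acoustic_field(b, a))
```

The point of the sketch is the invariant, not the architecture: however the pair is encoded, the learned field must give the same answer when source and listener trade places.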
Researchers feed the final neural acoustic fields model both visual information about an acoustic setting and spectrograms that demonstrate what an audio piece would sound like with the emitter and listener positioned at specific points around the room. The model then predicts what the audio would sound like at any location in the scene to which the listener might move.
The machine learning model produces an impulse response that depicts how a sound would change as it spreads throughout the environment. The researchers then apply this impulse response to various sounds to hear how they ought to change as a person moves about a room.
The researchers found that their methodology consistently produced more precise sound models when compared to other techniques for modeling acoustic data. Their model also had a far higher degree of generalization to new locations in a scene than previous approaches since it incorporated local geometric information.
Additionally, researchers discovered that incorporating the acoustic knowledge their model picks up into a computer vision model can improve the visual reconstruction of the scene. In other words, a neural acoustic fields model could be used backwards to enhance or even build a visual map from scratch.
The researchers intend to continue improving the model so that it can be generalized to new scenes. They also plan to apply the method to more complex impulse responses and larger scenes, such as entire buildings or even a whole town or city.
The MIT-IBM Watson AI Lab’s principal research staff member Chuang Gan believes that this new method may open novel opportunities to develop multimodal immersive experiences for metaverse applications.
The research team also mention the limitations of their neural acoustic fields model. Their method, like previous spatial acoustic field coding studies, does not model phase. While a magnitude-only approximation may be sufficient for many tasks, for tasks that depend on phase it may not reproduce believable spatial acoustic effects in a compact and continuous manner. The NAF model also needs a precomputed acoustic field, which was likewise a prerequisite in earlier acoustic field studies. Though this is not a drawback for many applications, the researchers believe the ability to generalize from very small training samples could create new opportunities. Finally, like earlier research that uses implicit neural representations, the model is fitted to a particular scene; it remains unclear whether the acoustic field of a new scene can be predicted.