Researchers from Facebook AI and partner universities have developed AV-MAP, a framework that can infer a house's general room layout from a short video clip. The framework predicts the house's whole structure with 66% accuracy from a clip covering merely 25% of its floor plan.
AV-MAP stands out from current methods, which mostly require moving a camera or 3D sensor through the space to map the floor plan. These methods ignore the videos' audio, which provides complementary information about distant free space and rooms beyond the camera's reach: an echo in the hall, a dishwasher's hum, a running shower in the bathroom, and more. Hence, current methods cannot predict beyond the visual field captured in the video.
The team at Facebook AI, Carnegie Mellon University, and the University of Texas designed AV-MAP so that it does not need extensive movement to capture the house's layout. The basic intuition was to combine sound with the video input. Sound is inherently driven by geometry: reflections reveal the distances between rooms. Identifying meaningful sounds of activities or objects coming from different directions suggests plausible room layouts. For instance, sounds from the left and utensil sounds from the right indicate a drawing room on the left and a kitchen on the right.
AV-MAP uses a novel multimodal encoder-decoder framework that jointly learns audio-visual features to reconstruct a floor plan from a short video clip. The framework consists of three components: Top-Down Feature Extraction, Feature Alignment, and a Sequence Encoder-Decoder architecture. The feature extractor, a modified ResNet, obtains top-down, floor-plan-aligned features for each modality (ambisonic audio and RGB) independently at each time-step.
These extracted features are mapped to a common coordinate frame using the camera's relative motion. In the encoder, the entire feature sequence undergoes pixelwise self-attention and convolution operations. Finally, the two modalities are fused in the decoder via a series of self-attention and convolution layers. The AV-MAP model then predicts the interior structure of the environment and the rooms' semantic labels, such as bathroom, kitchen, and more.
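The pipeline above (per-modality feature extraction, motion-based alignment, pixelwise self-attention, and fused decoding) can be sketched in a toy NumPy script. All shapes, function names, and the simplified attention below are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Toy sketch of an AV-MAP-style pipeline. Shapes and the random "features"
# are illustrative assumptions; the real model uses learned ResNet features.
import numpy as np

rng = np.random.default_rng(0)
T, H, W, C = 4, 8, 8, 16  # time-steps, top-down map height/width, channels

def extract_features(modality):
    """Stand-in for the modified-ResNet extractor: one top-down
    feature map per time-step for a single modality."""
    return rng.standard_normal((T, H, W, C))

def align(features, motions):
    """Map each time-step's features into a common coordinate frame by
    shifting them with the camera's relative motion (integer offsets here)."""
    aligned = np.zeros_like(features)
    for t, (dy, dx) in enumerate(motions):
        aligned[t] = np.roll(features[t], shift=(dy, dx), axis=(0, 1))
    return aligned

def pixelwise_self_attention(seq):
    """Toy pixelwise attention over the time axis: each map location
    attends to the same location at every time-step."""
    scores = np.einsum('thwc,shwc->tshw', seq, seq) / np.sqrt(C)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return np.einsum('tshw,shwc->thwc', weights, seq)

def decode(audio_feats, rgb_feats, n_rooms=3):
    """Fuse the two modalities, then predict (a) an interior-structure map
    and (b) per-pixel room-label logits (random head here, for shape only)."""
    fused = np.concatenate([audio_feats, rgb_feats], axis=-1).mean(axis=0)
    interior = fused.mean(axis=-1)                # (H, W) structure map
    rooms = rng.standard_normal((H, W, n_rooms))  # (H, W, n_rooms) labels
    return interior, rooms

motions = [(0, 0), (1, 0), (1, 1), (2, 1)]  # hypothetical camera path
audio = align(extract_features("ambisonic"), motions)
rgb = align(extract_features("rgb"), motions)
interior, rooms = decode(pixelwise_self_attention(audio),
                         pixelwise_self_attention(rgb))
print(interior.shape, rooms.shape)  # (8, 8) (8, 8, 3)
```

The key design point the sketch mirrors is that each modality is processed and aligned independently, and fusion happens only at the decoding stage, so audio can fill in regions the camera never sees.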
The team created two experimental settings (active and passive) to test the framework using the Matterport3D and SoundSpaces datasets. These are datasets of 3D-modeled houses hosted in Facebook's AI Habitat. In the active setting, a virtual camera emits a known sound while moving through the rooms of a modeled home. In the passive setting, the model instead uses sounds made by objects and people inside the house. Overall, the researchers found that AV-MAP offers an 8% improvement in floor-plan accuracy over the state-of-the-art approach.
Read about the framework here.