AI-based frameworks and techniques aim to mimic human abilities and carry out tasks on people's behalf much more efficiently. Along similar lines, researchers have been trying to simulate focal adjustment in computer vision, the AI sub-field that enables computers to analyze digital images, similar to how human eyes take in the objects around them at a coarse level. The problem is challenging because modeling every fine-grained detail of the visual input and then adjusting the focal point is tedious.
Researchers from Microsoft have proposed a new architecture, FocalNets (Focal Modulation Networks), neural networks built around focal modulation, as a basis for better computer vision systems. Computer vision has advanced significantly with the help of transformers, specifically vision transformers. Their self-attention (SA) mechanism makes them well suited to vision tasks because it allows each token to quickly gather the information it needs from the other tokens.
However, self-attention lets each token interact only with a predefined set of tokens of a particular scope and granularity. Efficiency has also been a persistent concern, because the cost of the mechanism grows quadratically with the number of tokens, which is especially limiting for high-resolution inputs.
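To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch. It assumes non-overlapping square patches and global self-attention over all tokens; the patch size of 16 and the image sizes are illustrative choices, not figures from the paper.

```python
def num_tokens(height, width, patch_size):
    """Number of non-overlapping patches (tokens) an image is split into."""
    return (height // patch_size) * (width // patch_size)


def attention_pairs(n_tokens):
    """Token-to-token interactions computed by global self-attention."""
    return n_tokens * n_tokens


# Illustrative resolutions; patch_size=16 is an assumption for this sketch.
for side in (224, 448, 896):
    n = num_tokens(side, side, patch_size=16)
    print(f"{side}x{side}: {n} tokens, {attention_pairs(n)} pairwise interactions")
```

Doubling the image side quadruples the number of tokens and multiplies the number of pairwise interactions by sixteen, which is why high-resolution inputs are the hard case.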
In developing FocalNets, the researchers replaced self-attention entirely with a focal modulation module. The module is inspired by focal attention, a technique that aggregates coarse-grained visual context at multiple levels.
Focal modulation is a simple element-wise multiplication that lets the model interact with the input through a modulator. The modulator is computed by a two-step focal aggregation procedure. The first step, focal contextualization, extracts contexts at different levels of granularity, from local to global scales. The second step, gated aggregation, condenses the context features from all granularity levels into the modulator.
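The two steps and the final multiplication can be illustrated with a toy sketch on a 1-D sequence of scalar features. This is not the authors' implementation: stacked moving averages stand in for whatever learned operators produce the contexts, the gates are passed in by hand rather than predicted by the network, and all function names and kernel sizes are hypothetical.

```python
def local_average(values, kernel_size):
    """Average each position with its neighbours (windows are truncated at the edges)."""
    half = kernel_size // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def focal_contextualization(features, kernel_sizes):
    """Step 1: collect contexts at several granularities, from local to global."""
    contexts = []
    current = features
    for k in kernel_sizes:
        current = local_average(current, k)  # stacking widens the receptive field
        contexts.append(current)
    global_mean = sum(current) / len(current)
    contexts.append([global_mean] * len(features))  # coarsest level: one global context
    return contexts


def gated_aggregation(contexts, gates):
    """Step 2: weight every level by its per-position gate and sum into the modulator."""
    length = len(contexts[0])
    return [sum(g[i] * c[i] for g, c in zip(gates, contexts)) for i in range(length)]


def focal_modulation(queries, features, gates, kernel_sizes=(3, 5)):
    """Element-wise product of each query with its modulator."""
    contexts = focal_contextualization(features, kernel_sizes)
    modulator = gated_aggregation(contexts, gates)
    return [q * m for q, m in zip(queries, modulator)]


# Hand-picked gates (an assumption): one list per level, two local levels plus the global one.
features = [1.0, 2.0, 3.0, 4.0]
queries = [1.0, 1.0, 2.0, 2.0]
gates = [[0.5] * 4, [0.3] * 4, [0.2] * 4]
print(focal_modulation(queries, features, gates))
```

Note that no token-to-token attention matrix is built at any point: each position only multiplies its own query by a modulator assembled from pre-aggregated contexts.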
The model exhibits a dynamic, interpretable learning behavior when finding and identifying objects in photos and videos. Without any dense supervision, it learned to distinguish objects and adjust the focus in a manner consistent with human annotation.
The researchers also experimented with design choices borrowed from vision transformers. They used overlapped patch embeddings to downsample the input and observed an improvement regardless of model size. They also tried making FocalNets deeper but thinner. This change led to smaller models but a significantly higher time cost, because it increased the number of sequential blocks. The goal was to see how flexible FocalNets are and how they relate to other architectural designs, such as PoolFormer and depth-wise convolution.
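As a small illustration of what "overlapped" means here, the sketch below applies the standard convolution output-size formula to a patch embedding. A patch embedding overlaps when its kernel is larger than its stride. The kernel, stride, and padding values are assumptions chosen for illustration, not settings confirmed by the article.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Spatial output size of a strided convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


def patches_overlap(kernel_size, stride):
    """Neighbouring patches share pixels when the kernel is wider than the stride."""
    return kernel_size > stride


# Assumed settings: both variants downsample a 224-pixel side by a factor of 4.
plain = conv_output_size(224, kernel_size=4, stride=4)
overlapped = conv_output_size(224, kernel_size=7, stride=4, padding=2)
print(plain, patches_overlap(4, 4))
print(overlapped, patches_overlap(7, 4))
```

Under these assumed settings both variants yield the same number of tokens per side, so the overlap changes what each token sees without changing the sequence length.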
FocalNets were tested on standard tasks such as ImageNet classification, and their performance was compared against other vision networks, including Vision Transformers (ViT), ConvNeXt, and Swin Transformers. The findings show that FocalNets consistently outperformed the others. The attention-free focal modulation architecture considerably improved dense visual prediction on high-resolution image inputs.
The research paper aims to introduce the community to a newer class of computer vision systems, built on FocalNets, that work accurately even with high-resolution visual inputs. The main concern is that the model may be biased toward its training data, since it is trained on a massive collection of web-crawled images. With datasets this large, the negative impact of any biased content may be amplified. Nevertheless, FocalNets are a significant step forward for computer vision applications. The researchers plan a more comprehensive study of whether focal modulation can be extended to other domains, such as natural language processing (NLP).