Last month, Meta released data2vec, its first high-performance self-supervised algorithm that works across multiple modalities. The name data2vec is a play on “word2vec,” a Google technology for language embedding released in 2013. Word2vec is an example of a neural network designed for one specific type of input, in this case text: it predicts how words cluster together.
In the paper “Data2vec: A General Framework for Self-supervised Learning in Speech, Vision, and Language,” the research team explains that data2vec is trained by predicting the model’s representations of the full input data given only a partial view of that input. Meta AI has released the data2vec source code and pre-trained models for speech and natural-language processing on GitHub under the permissive MIT license.
Earlier AI systems learned from labeled data. That changed with the advent of self-supervised learning, which lets machines learn about their surroundings by observing them and then working out the structure of images, speech, or text. This technique lets computers tackle new, complex data tasks more efficiently, such as understanding text in a wider range of spoken languages.
However, most existing models are proficient at only a single task. For example, a facial recognition system cannot generate text, nor can a credit card fraud detection system help detect tumors in patients. Put simply, a state-of-the-art system built for one application stays confined to that niche, and its capabilities may not transfer elsewhere. Self-supervised learning research today almost always concentrates on a single modality. As a result, researchers who work on one modality frequently use a completely different technique from those who specialize in another.
This gap motivated Meta to develop data2vec, which unifies the learning process so that the same method can train a neural network to recognize images, text, or speech. On the primary ImageNet computer vision benchmark, data2vec outperformed existing methods across a variety of model sizes. On speech, it outperformed two prior Meta AI self-supervised algorithms, wav2vec 2.0 and HuBERT. On text, it was tested on the popular GLUE benchmark suite and found to be on par with RoBERTa, a reimplementation of BERT.
Data2vec employs a single model that operates in two modes: teacher and student. The student mode of data2vec learns from the teacher mode, and the model parameters are updated at each training step. In teacher mode, the model is given a full sample, be it an image, speech, or text, and produces a representation of that input. In student mode, the model is given a block-wise masked version of the same sample and is tasked with predicting the representations of the whole input while seeing only a portion of it. Because the prediction targets are the model’s internal representations of the input rather than pixels, words, or sounds, the same method works across modalities.
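The teacher and student setup can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not Meta’s implementation: the random arrays stand in for real encoder outputs, the helper names `block_mask`, `masked_regression_loss`, and `ema_update` are hypothetical, and mean squared error is used for simplicity (the paper’s actual loss and target construction may differ). The paper describes the teacher’s weights as an exponential moving average of the student’s; the decay value `tau` below is illustrative only.

```python
import numpy as np

def block_mask(num_steps, start, length):
    """Boolean mask with one contiguous block of masked (True) time steps."""
    mask = np.zeros(num_steps, dtype=bool)
    mask[start:start + length] = True
    return mask

def masked_regression_loss(student_pred, teacher_target, mask):
    """Mean squared error computed only at the masked time steps (assumed loss)."""
    diff = student_pred[mask] - teacher_target[mask]
    return float(np.mean(diff ** 2))

def ema_update(teacher_w, student_w, tau):
    """Teacher weights track the student as an exponential moving average."""
    return tau * teacher_w + (1.0 - tau) * student_w

rng = np.random.default_rng(0)
mask = block_mask(num_steps=10, start=3, length=4)

# Stand-ins for encoder outputs: 10 time steps, 8 features each.
teacher_target = rng.normal(size=(10, 8))  # teacher mode: representations of the full input
student_pred = rng.normal(size=(10, 8))    # student mode: predictions from the masked input
loss = masked_regression_loss(student_pred, teacher_target, mask)

# After the student takes a gradient step, the teacher drifts toward it.
teacher_w = np.zeros((8, 8))
student_w = np.ones((8, 8))
teacher_w = ema_update(teacher_w, student_w, tau=0.9)
```

Only the masked positions contribute to the loss, which is what forces the student to infer the hidden portion from its visible context.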
Because data2vec relies on the Transformer’s self-attention mechanism, the representations are contextualized, i.e., each one encodes a particular time step together with information from the rest of the sample. This is the most significant distinction between this work and prior approaches, whose targets lacked context.
Unlike other Transformer-based models such as Google’s BERT and OpenAI’s GPT-3, data2vec does not focus on producing a particular type of output data. Instead, it focuses on the inner neural network layers that represent the data before a final output is produced. This is possible because of the self-attention mechanism, which allows inputs to interact with each other (i.e., it calculates the attention of all other inputs with respect to one input).
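For readers unfamiliar with self-attention, the following is a minimal single-head sketch of scaled dot-product attention in NumPy. It is a generic illustration of the mechanism, not code from data2vec; the weight matrices are random placeholders and the sizes (5 inputs, 4 features) are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # similarity of every input to every other input
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights              # each output mixes information from all inputs

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))  # 5 inputs (time steps), 4 features each
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Row `i` of `weights` shows how strongly input `i` attends to each of the other inputs, which is why every output representation carries context from the whole sample.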
The researchers trained data2vec on 960 hours of speech audio, millions of words from books and Wikipedia pages, and images from ImageNet-1K, using a combination of 16 Nvidia V100 and A100 GPUs. For images, Meta used the ViT, or Vision Transformer, a neural network built specifically for visual applications by Alexey Dosovitskiy and colleagues at Google last year. The ViT approach encodes an image as a series of patches, each spanning 16×16 pixels, which are fed into a linear transformation. For speech, a multi-layer 1-D convolutional neural network encodes the audio, converting 16 kHz waveforms into 50 Hz representations. For text, the input is pre-processed into sub-word units, which are embedded in distributional space through learned embedding vectors.
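The image and speech front ends can be made concrete with some shape arithmetic. The sketch below is illustrative and rests on assumptions not stated in the article: a 224×224 RGB input (a common ImageNet resolution), an arbitrary embedding width of 64, and convolution strides of (5, 2, 2, 2, 2, 2, 2), which are the values commonly cited for the wav2vec 2.0 feature encoder. The helper name `patchify` is hypothetical.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch_h, patch_w, channels)
    return grid.reshape(-1, patch * patch * c)

# Image side: 224x224 RGB (assumed size) -> 14 x 14 = 196 patches of 16*16*3 = 768 values.
image = np.zeros((224, 224, 3), dtype=np.float32)
patches = patchify(image)

# The linear transformation applied to each patch (random placeholder weights).
rng = np.random.default_rng(0)
projection = rng.normal(size=(768, 64)).astype(np.float32)
tokens = patches @ projection  # one embedding per patch

# Speech side: the product of the (assumed) convolution strides is the downsampling factor.
strides = (5, 2, 2, 2, 2, 2, 2)
downsample = int(np.prod(strides))  # 320
frame_rate_hz = 16000 / downsample  # 16 kHz waveform -> 50 frames per second
```

Under these assumptions, one second of 16 kHz audio becomes 50 feature vectors, which matches the 50 Hz figure above.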
Multi-modal systems have already been shown to be vulnerable to adversarial attacks. If the word “iPod” appears in the image, OpenAI’s CLIP model, which is trained on images and text, will mistakenly classify a picture of an apple as an iPod. It is unclear, however, whether data2vec has the same flaws.
According to the official statement, Meta has not specifically examined how data2vec responds to adversarial examples. However, because the current models are trained separately for each modality, the company believes that existing research on adversarial attacks for each modality would apply to data2vec as well.