AI pundits believe that the key to a successful artificial intelligence system is building one that matches humans’ ability to grasp and learn any language and perform any task. There have been many AI-based systems that possess the capability of thinking, planning, and learning about a task, parsing and representing insights gained from a dataset, and communicating using NLP and natural language generation algorithms. However, most of them cater to only a single type of task, e.g., solving quadratic equations, captioning an image, or playing chess.
DeepMind has taken advantage of recent developments in large-scale language modeling to create a single generalist agent that can handle more than just text. Earlier this month, DeepMind unveiled a novel “generalist” AI model called Gato. The agent operates as a multi-modal, multi-task, multi-embodiment network, which means that the same neural network (i.e., a single architecture with a single set of weights) can perform all of its tasks, even though they involve intrinsically distinct types of inputs and outputs.
DeepMind also published a paper titled ‘A Generalist Agent,’ which detailed the model’s capabilities and training procedure. DeepMind argues that the generalist agent can be tweaked with a bit more data to perform even better on a wider range of tasks. They point out that having a general-purpose agent reduces the need for hand-crafting policy models for each domain, increases the volume and diversity of training data, and allows for ongoing improvements at the data, compute, and model scale. A general-purpose agent can also be seen as a first step toward artificial general intelligence (AGI), the ultimate objective of AI research.
A modality, in layman’s terms, refers to the manner in which something occurs or is perceived. Most people associate the word with sensory modalities, such as vision and touch, which represent our major communication pathways. When a research topic or dataset contains several such modalities, it is referred to as multimodal. To make real progress in comprehending the world around us, AI must be able to interpret and reason about multimodal data.
According to the Alphabet-owned AI lab, Gato can play Atari video games, caption images, chat, and stack blocks with a real robot arm – performing 604 distinct tasks in all.
Though DeepMind’s preprint describing Gato is not especially detailed, it does reveal that the model is deeply rooted in the transformers used in natural language processing and text generation. It is trained not only on text but also on images, torques acting on robotic arms, button presses from computer games, and so on. Essentially, Gato ingests all of these input types and determines from context whether to produce intelligible text (for example, to chat, summarize, or translate), torques (for robotic arm actuators), or button presses (to play games).
Gato demonstrates the adaptability of transformer-based machine learning architectures by showing how they can be applied to a range of tasks. In contrast to earlier neural networks that were specialized for playing games, interpreting text, or captioning photos, Gato is versatile enough to accomplish all of these tasks on its own, with a single set of weights and a relatively simple architecture. Previous specialized networks required the integration of numerous modules in order to function, with the integration largely dictated by the problem to be solved.
To train Gato, the researchers gathered data from a variety of tasks and modalities. Training for vision and language was done using MassiveText, a large text dataset comprising web pages, books, news stories, and code, as well as vision-language datasets including ALIGN (Jia et al., 2021) and COCO Captions.
Once the data was serialized into a flat sequence of tokens, it was batched and processed by a transformer neural network. While any general sequence model can be used to predict the next token, the researchers chose a transformer for its simplicity and scalability. They employed a decoder-only transformer with 24 layers and an embedding size of 2048, for roughly 1.2 billion parameters. What makes Gato interesting is that it is far smaller than single-task systems like GPT-3: its “only” 1.2 billion weights put it in the same class as OpenAI’s GPT-2 language model, and nearly two orders of magnitude below GPT-3’s 175 billion weights. Parameters are the values a system learns from its training data, and their count is a rough measure of a model’s capacity for a problem such as text generation.
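Serializing such varied data into one token sequence requires discretizing continuous inputs, such as joint torques, into token bins before they enter the transformer. The sketch below illustrates the mu-law companding idea the paper describes; the constants and function name here are illustrative assumptions, not necessarily the exact published values.

```python
import numpy as np

# Sketch: map a continuous value (e.g. a joint torque) onto one of 1024
# discrete token bins via mu-law companding. Constants are illustrative.
def mu_law_tokenize(x, mu=100.0, m=256.0, n_bins=1024):
    # Compand: compress large magnitudes so small values keep resolution.
    y = np.sign(x) * np.log(np.abs(x) * mu + 1.0) / np.log(m * mu + 1.0)
    y = np.clip(y, -1.0, 1.0)
    # Map [-1, 1] uniformly onto integer bins 0 .. n_bins - 1.
    return int(round((y + 1.0) / 2.0 * (n_bins - 1)))
```

Text tokens, image-patch tokens, and discretized values of this kind can then be concatenated into a single flat sequence for the transformer to model.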
When a prompt is supplied, it is tokenized to form the initial sequence. The environment then produces the first observation, which is also tokenized and appended to the sequence. Gato then samples the action vector autoregressively, one token at a time. Once all tokens in the action vector have been sampled, the action is decoded and sent to the environment, which steps and produces a new observation, and the process repeats. According to the DeepMind researchers, the model always attends to all prior observations and actions within its context window of 1024 tokens. DeepMind emphasized in its paper that the loss is masked so that Gato only predicts action and text targets.
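The loop described above can be sketched in a few lines. Every class and function here is a hypothetical stand-in for illustration, not DeepMind’s code:

```python
# Toy stand-ins for the environment and the trained sequence model.
class DummyEnv:
    def __init__(self, steps=3):
        self.steps = steps
    def reset(self):
        self.t = 0
        return [0]                      # first observation
    def step(self, action):
        self.t += 1
        return [self.t], self.t >= self.steps  # (observation, done)

class DummyModel:
    def sample_next(self, context):
        return 7                        # constant "action token"

def tokenize_observation(obs):
    return list(obs)

def decode_action(tokens):
    return tokens

def run_episode(model, env, prompt_tokens, context_len=1024, action_len=2):
    # Tokenized prompt forms the initial sequence.
    sequence = list(prompt_tokens)
    obs = env.reset()
    done = False
    while not done:
        # Observation is tokenized and appended to the sequence.
        sequence.extend(tokenize_observation(obs))
        action_tokens = []
        for _ in range(action_len):
            # Sample action tokens autoregressively; the model only
            # attends over the most recent context_len tokens.
            token = model.sample_next(sequence[-context_len:])
            action_tokens.append(token)
            sequence.append(token)
        # Decode the full action vector and step the environment.
        obs, done = env.step(decode_action(action_tokens))
    return sequence
```

Running `run_episode(DummyModel(), DummyEnv(steps=2), [1])` interleaves observation and action tokens exactly as the paragraph above describes.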
The study showed that transformer sequence models work well as multi-task policies in real-world settings, including visual and robotic tasks. Gato also illustrates how prompting can be used to learn new tasks rather than training a model from scratch.
Gato was assessed on a range of tasks, including simulated control, robotic stacking, and ALE Atari games. Gato exceeded the 50% expert score threshold on 450 of the 604 tasks in the experiments. DeepMind also found that Gato’s performance improves as the number of parameters increases: alongside the main model, the researchers trained two smaller models with 79 million and 364 million parameters. The benchmark results show average performance rising steadily with parameter count. This phenomenon has previously been observed in large-scale language models, and it was thoroughly investigated in the 2020 paper “Scaling Laws for Neural Language Models.”
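A trend like this is typically characterized as a power law fitted in log-log space. The snippet below illustrates the method only; the scores are invented for the example and are not DeepMind’s benchmark numbers.

```python
import numpy as np

# Illustration only: fit score ≈ a * params^b to the three model sizes
# mentioned above, using made-up scores (not DeepMind's data).
params = np.array([79e6, 364e6, 1.2e9])
scores = np.array([35.0, 48.0, 60.0])   # hypothetical average scores

# Linear fit in log-log space recovers the power-law exponent b.
b, log_a = np.polyfit(np.log(params), np.log(scores), 1)

def predict(n_params):
    # Predicted score under the fitted power law.
    return np.exp(log_a) * n_params ** b
```

A positive fitted exponent corresponds to the steady improvement with scale that the researchers report.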
Demis Hassabis, the co-founder of DeepMind, congratulated the team in a tweet, saying, “Our most general agent yet!! Fantastic work from the team!”
However, not everyone is on board with Gato being touted as an AGI agent. According to David Pfau, a staff research scientist at DeepMind, the team amalgamated the policies of a group of individually trained agents into a single network, which is neither as surprising nor as exciting as the hype around Gato suggests.
The Gato model is unquestionably a significant step forward in AGI research. However, it also raises the question of how far that research has actually progressed.
AGI, after all, describes AI systems that can function fully autonomously while executing tasks that require human-level intellect. By that standard, DeepMind’s Gato is far from capable of general intelligence in any form: a generally intelligent system can acquire new skills without prior training, which is not the case with Gato.
Based on observations, the ‘AGI’ agent outperforms a dedicated machine learning program at directing a robotic Sawyer arm that stacks blocks. But the captions it generates for photographs are frequently subpar, often misgendering people. Its capacity to carry on an ordinary chat conversation with a human interlocutor is similarly dismal, occasionally yielding contradictory and illogical responses.
In their 40-page report, DeepMind reveals that when asked what the capital of France is, the system occasionally responds with ‘Marseille’ and on other occasions with ‘Paris.’ Such errors, according to the researchers, can presumably be rectified with additional scaling.
Gato also has memory limitations, making it challenging to learn a new activity by conditioning on a prompt, such as demonstrations of desired behavior. Because of accelerator memory constraints and the exceptionally long sequences produced by tokenized demonstrations, the maximum context length does not allow the agent to attend over a sufficiently informative context.
In addition, its performance on Atari 2600 video games is inferior to that of most specialized machine learning algorithms meant to compete in the benchmark Arcade Learning Environment.
Furthermore, a single AI system capable of performing several tasks isn’t new. In fact, Google recently began employing a system known as the Multitask Unified Model, or MUM, in Google Search to accomplish tasks ranging from detecting cross-lingual variations in a word’s spelling to linking a search query to an image. However, the variety of tasks addressed and the training approach may well be unique to Gato.
To summarize, while Gato brings a fresh spin to the AGI domain by performing multi-modal, multi-task activities, it still falls short of being labeled an AGI model.