BLOOM (BigScience Large Open-science Open-access Multilingual Language Model), a large language model with 176 billion parameters, was released by the BigScience project. A preliminary version of the model was made available on June 17, 2022. Created with the help of around 1,000 academics and researchers from across the world to challenge big tech’s dominance over large language models, BLOOM is open source and the first model of its size to be multilingual.
The BigScience research project was initiated in 2021. It brings together researchers from more than 250 institutions and more than 60 countries. While Hugging Face is the principal investigator on the project, it also includes researchers from GENCI, the IDRIS team at the CNRS, the Megatron team at NVIDIA, and the DeepSpeed team at Microsoft.
How is BLOOM competing with other large language models?
The researchers explain in their paper that large language models are algorithms that learn statistical correlations between billions of words and phrases in order to perform tasks such as summarizing, translating, answering queries, and categorizing material. The models are trained by hiding words in a text, having the model predict them, and adjusting values called parameters according to how the predictions compare with the actual words. BLOOM contains 176 billion parameters, roughly matching one of the best-known models of its kind, the 175-billion-parameter GPT-3, developed by OpenAI and licensed to Microsoft.
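To make the training idea concrete, here is a minimal sketch in Python. It is a toy and not BLOOM's actual training code: the twelve-word corpus, the table of scores, and the learning rate are all assumptions chosen for illustration. The sketch hides the word that follows each word in the corpus, predicts it, compares the prediction with the real word, and nudges the parameters accordingly.

```python
import math

# Toy corpus (hypothetical); real models train on billions of words.
corpus = "the cat sat on the mat the cat sat on the rug".split()
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}
V = len(vocab)

# Parameters: one score per (previous word, candidate next word) pair.
params = [[0.0] * V for _ in range(V)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_epoch(lr=0.5):
    """One pass over the corpus; returns the average prediction loss."""
    pairs = list(zip(corpus, corpus[1:]))
    total_loss = 0.0
    for prev, hidden in pairs:
        row = params[index[prev]]
        probs = softmax(row)                    # prediction for the hidden word
        target = index[hidden]
        total_loss += -math.log(probs[target])  # compare prediction with reality
        for j in range(V):
            grad = probs[j] - (1.0 if j == target else 0.0)
            row[j] -= lr * grad                 # tweak the parameters
    return total_loss / len(pairs)


def predict(prev):
    """Most likely next word after `prev` under the current parameters."""
    probs = softmax(params[index[prev]])
    return vocab[max(range(V), key=probs.__getitem__)]


losses = [train_epoch() for _ in range(50)]
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("after 'cat' the model predicts:", predict("cat"))
```

This toy has 36 parameters and looks only one word back. A model like BLOOM applies the same predict, compare, and adjust loop to 176 billion parameters and far longer contexts.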
Despite the impressive results these large language models have produced (such as writing articles), they do not comprehend the fundamentals of human communication and language, which can lead them to produce nonsensical output. Even more concerning, they can encourage abuse or self-harm and reflect the racial or gender stereotypes woven into the human-written content they learn from. Furthermore, training these models typically costs millions of dollars and has a massive carbon footprint.
Given the potential influence of such language models, a thorough understanding of how they are created, how they work, and how the wider community can improve them is essential. Popular models such as GPT-3 are not open source, which means that only a small number of people know how they work internally. Most large technology companies building cutting-edge large language models restrict outside use and keep their models’ inner workings secret, which makes it hard to hold them accountable.
The main motivation behind BLOOM was to challenge and change these norms of opacity and exclusivity!
Unlike previous large language models, BLOOM was created by hundreds of academics, including philosophers, lawyers, and ethicists, alongside staff members from Facebook and Google. About US$7 million worth of public computing time was used to train it: BigScience was given free access to France’s national Jean Zay supercomputer at the IDRIS facility outside Paris.
To make full use of the available computing power, the researchers supplemented the data collection with a multilingual web crawl, filtered for quality and lightly redacted for privacy. The team also tried to reduce the usual over-representation of porn sites, which might lead the model to learn sexist associations, without filtering out keywords in a way that would exclude content from open discussions of sexuality in frequently under-represented communities.
Despite these precautions, the researchers acknowledge that BLOOM will not be entirely free of bias, but they hope to improve on existing models by training it on diverse, high-quality sources. Importantly, since the model’s code and data collection are public, researchers can try to trace the causes of undesirable behaviors, which could improve subsequent versions. BLOOM also attempts to loosen the hold of major businesses over large language models: the project was developed in the open and uses an open license modeled on the Responsible AI License. BigScience created this license to discourage the use of BLOOM in high-risk settings such as law enforcement or health care, and to prevent its use to harm, defraud, exploit, or impersonate individuals. According to Danish Contractor, an AI researcher who volunteered for the project and co-created the license, it is an experiment in self-regulating large language models before laws catch up.
Addressing Availability
BLOOM can work with text in 46 natural languages and dialects, as well as 13 programming languages. The natural languages include French, Vietnamese, Mandarin, Indonesian, Catalan, 13 Indic languages (such as Hindi), and 20 African languages. Only a little over 30% of its training data was in English, which makes it an exception among large language models, where English usually dominates.
BLOOM can be asked to summarize or translate text, to produce code from instructions, and to follow prompts for novel tasks such as writing recipes, extracting data from news articles, or constructing sentences with a newly invented word, even though it was never explicitly trained on any of those tasks.
The fully trained BLOOM model has been made available for download for researchers who want to experiment with it or train it on fresh data for particular applications. However, downloading and running it calls for a sizable amount of hardware. BigScience has therefore also provided scaled-down, less resource-intensive versions of the model and developed a distributed system that will enable laboratories to share it across several servers. Hugging Face has even released a web application that allows anybody to query BLOOM without installing it.
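A rough back-of-the-envelope calculation shows why the hardware requirement is so large. The sketch below estimates the memory needed just to hold the model weights at a few common numeric precisions. The 176-billion figure comes from this article. The 560-million-parameter size used for a scaled-down variant is an assumption for illustration, and the estimate ignores activations and other runtime overhead.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}

BLOOM_PARAMS = 176_000_000_000  # parameter count stated in the article
SMALL_PARAMS = 560_000_000      # assumed size of a scaled-down variant


def weight_memory_gb(num_params, dtype):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, count in [("full model", BLOOM_PARAMS), ("scaled-down", SMALL_PARAMS)]:
    for dtype in BYTES_PER_PARAM:
        print(f"{name:12s} {dtype:9s} {weight_memory_gb(count, dtype):8.2f} GB")
```

Even at 16-bit precision the full model's weights come to roughly 352 GB, far more than a single accelerator typically holds, which is why sharing the model across several servers or falling back to a smaller variant matters in practice.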
Wrapping Up
Large language models are one of the hottest and most exciting research topics in the AI industry. As this trend dominates the sector, companies are racing to build ever larger (in terms of parameters) and more capable models. Cerebras Systems said last month that it had set a record for the largest AI model ever trained on a single device, in this instance a massive silicon wafer with hundreds of thousands of cores.
While some businesses have chosen to accept the fairness and privacy risks posed by large-scale language models, others have chosen to open source some of their language models, such as Yandex’s YaLM 100B.
At the same time, experts are questioning the enormous datasets and computing power used by models such as DeepMind’s Gopher and Chinchilla, OpenAI’s GPT-3, and Google’s LaMDA and PaLM.
While BLOOM aims to address all these concerns, it will also need to improve its performance before it goes mainstream.