Language is not just a communication tool but an expression of the different cultures, societies, and opinions found across the globe. At the same time, language is also the barrier that separates them. Thanks to translation technologies and artificial intelligence (AI) reshaping the linguistic world, people can now read in their preferred languages. Without translation, and more recently translation technology, the world would lose access to a significant portion of its cultural treasures.
Like other technologies, translation tools have evolved over time. Today, the most widely used approach is Machine Translation (MT), with Computer-Assisted Translation (CAT) tools also in prominent use. These technologies have undoubtedly delivered the seamless communication people have wanted for ages. Even so, they still have undeniable limitations.
Translation tools and technologies work on different principles and consequently deliver different results. Some offer more accurate output, while others support a larger number of languages. Moreover, most high-end translation tools remain inaccessible to billions of people and incompatible with hundreds of languages. As a result, many people cannot fully participate in online conversations and communities in their regional or native languages.
To remove some of these barriers and make more people part of the future metaverse, AI researchers at Meta have created 'No Language Left Behind-200', or NLLB-200, an AI model that extends machine translation to more of the world's languages. The company claims the model translates between 200 languages with, on average, 44% higher accuracy than previous systems. These languages include 55 African languages, as well as other low-resource languages such as Kamba and Lao that are supported poorly, or not at all, by existing translation tools.
No Language Left Behind (NLLB) is part of Meta's long-term effort to build language and machine translation tools. Launched in February 2022, the project builds advanced AI models that learn and decipher languages from fewer examples.
NLLB-200 is built to serve everyone: most existing AI systems are not designed to cater to hundreds of local languages, let alone provide real-time speech-to-speech translation. Covering 200 languages is a step toward overcoming data scarcity and acquiring more training data in local and regional languages. The new AI model also aims to overcome some of the modeling challenges of expansion the company faced in previous years.
This is not the first time Meta has developed a translation model. In 2020, it released M2M-100, a 100-language translation model with improved architectures and data acquisition practices. With NLLB-200, the company has now scaled to another 100 languages. The model can also be used to advance other technologies, such as building voice assistants for languages like Uzbek or creating movie subtitles in Oromo or Swahili. There are endless possibilities for extending its applications and democratizing access for people in virtual worlds.
Meta evaluated NLLB-200 with FLORES-200, a benchmark dataset that enables performance assessment across 40,000 different language directions. FLORES-200 was used to measure NLLB-200's output quality in each of the 200 languages and confirm that its translations are highly accurate.
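As a rough, illustrative check on where that figure comes from (this calculation is ours, not Meta's): with 200 languages, every ordered source-target pair counts as a distinct translation direction.

```python
# Illustrative back-of-the-envelope check: with 200 languages, every ordered
# (source, target) pair is a separate translation direction.
num_languages = 200
directions = num_languages * (num_languages - 1)  # exclude translating a language into itself
print(directions)  # 39800 -- roughly the "40,000 directions" FLORES-200 covers
```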
Adding to the upsides, Meta is open-sourcing the model, the FLORES-200 dataset, and the model-training code for all developers, and it has published a demo showing the open-source translator in action. The goal of open-sourcing is to help researchers improve their own work and advance machine translation more broadly. Since inaccessibility is a major drawback of other translation technologies and tools, Meta's AI could put the technology in the hands of ordinary people.
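The open-sourced checkpoints are also available through Hugging Face, so one quick way to try the translator is via the Transformers library. The snippet below is a minimal sketch, assuming the distilled facebook/nllb-200-distilled-600M checkpoint and FLORES-200 language codes (eng_Latn for English, swh_Latn for Swahili); exact identifiers and API details may vary across library versions.

```python
# Minimal sketch: translating English to Swahili with a distilled NLLB-200
# checkpoint via Hugging Face Transformers (assumes transformers and torch are installed).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"  # distilled variant of NLLB-200
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "No language should be left behind."
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start with the Swahili language token ("swh_Latn" in
# FLORES-200 notation) so the output is generated in the target language.
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("swh_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```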
Further, NLLB-200 will help promote native languages by letting people read content without going through an intermediary language. Languages like Mandarin, English, and Spanish dominate the web, and many readers elsewhere miss the sentiment or context of text written in languages other than their own. NLLB-200 bridges this gap and preserves that meaning, as people can now read in their preferred language.
As an incentive to use the AI model for impact, Meta is awarding grants of up to US$200,000 to researchers and nonprofit organizations, who are invited to use NLLB-200 to translate underrepresented languages.
Meta has also collaborated with the Wikimedia Foundation, the nonprofit behind Wikipedia, to offer translation services on the site. The model would help reduce the disparity between English articles and those in other languages, especially languages spoken outside of America and Europe. For instance, there are only about 3,260 Wikipedia articles in Lingala, a language spoken by 45 million people in the Democratic Republic of Congo, compared with 2.5 million articles in Swedish, which is spoken by far fewer people in Sweden and Finland.
Even though the AI model delivers more accurate and meaningful translations across more languages than before, there is still ample room for improvement. Two hundred languages cannot cover the entire language space, and the company faced several challenges in expanding the model from 100 to 200 languages. Because many of these languages are regional and low-resource, acquiring enough training data is difficult, and the model starts overfitting if trained for extended periods on such scarce data. These challenges will only grow as the number of languages increases. Long story short, there is a long road ahead for translation technologies, but NLLB-200 takes us one step forward in the right direction. Meta plans to strive for a more inclusive and connected world by breaking down linguistic and technological barriers and empowering people.