Koo has introduced KooBERT, a masked language model trained on data from the multilingual micro-blogging social media platform Koo India. The BERT-based pretrained model was built by Koo India in collaboration with AI4Bharat.
In his LinkedIn post, Harsh Singhal, Koo's Head of Machine Learning & AI, said, “KooBERT is a testament to our commitment to inclusivity, diversity, and multilingualism in AI. Trained on a large corpus of Koo's 10+ Indian languages, it’s a significant leap forward in democratizing AI for millions of non-English speakers in India and around the world.”
On the Koo platform, microblogs (called Koos) are limited to 400 characters and available in multiple languages, including Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. The model was trained on a masked language modeling objective over a dataset of multilingual Koos posted between January 2020 and November 2022.
The model can be used for downstream tasks such as toxicity detection and content classification in the supported Indic languages. It can also be used with the sentence-transformers library to create multilingual vector embeddings for other applications.
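Sentence-embedding libraries such as sentence-transformers typically derive a single sentence vector by mean-pooling a model's token embeddings under the attention mask. The article does not give KooBERT's checkpoint name or exact pipeline, so the sketch below illustrates only that generic pooling step, using dummy NumPy arrays in place of real encoder outputs:

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (seq_len, dim) array of per-token vectors.
    attention_mask:   (seq_len,) array of 1s for real tokens, 0s for padding.
    """
    mask = attention_mask[:, None].astype(float)          # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)        # sum over real tokens
    return summed / mask.sum()                            # divide by token count

# Dummy stand-in for a 3-token, 2-dim encoder output; the last token is padding.
tokens = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
mask = np.array([1, 1, 0])
embedding = mean_pool(tokens, mask)  # → array([2., 3.])
```

In practice, the per-token vectors would come from running KooBERT over tokenized Koos, and the resulting sentence embeddings could then feed similarity search or classification.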
As with any machine learning model, KooBERT has limitations and biases. It was trained on Koo social media data and may not generalize well to other domains. The model may also inherit biases present in its training data, which can affect its predictions.