Indian AI startup Sarvam AI has introduced OpenHathi-Hi-v0.1, the first release in its OpenHathi series of large language models. The model builds on Llama2-7B and boasts performance on Indic languages similar to GPT-3.5, at times even surpassing it.
OpenHathi extends the Llama2-7B tokenizer to a vocabulary of 48,000 tokens. The new tokens are integrated through a two-phase training process. The first phase is embedding alignment, which aligns the randomly initialized Hindi embeddings with the existing model. The second is bilingual language modeling, which teaches the model to attend across tokens in both languages.
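To make the embedding-alignment idea concrete, here is a minimal toy sketch in plain Python. It is not Sarvam AI's implementation: the function names `extend_embeddings` and `masked_update`, the toy sizes, and the choice to update only the newly added rows while leaving the original rows untouched are all assumptions made for illustration.

```python
import random

def extend_embeddings(embeddings, new_vocab_size, seed=0, scale=0.02):
    """Keep the original rows and append randomly initialised rows for new tokens."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    extended = [row[:] for row in embeddings]  # copy the pretrained rows unchanged
    for _ in range(new_vocab_size - len(embeddings)):
        extended.append([rng.gauss(0.0, scale) for _ in range(dim)])
    return extended

def masked_update(embeddings, grads, first_new_row, lr=0.1):
    """Apply one gradient step only to the added rows (an assumed training choice)."""
    for i in range(first_new_row, len(embeddings)):
        embeddings[i] = [w - lr * g for w, g in zip(embeddings[i], grads[i])]
    return embeddings

# Toy usage: a 2-token "pretrained" table grown to 5 tokens.
pretrained = [[1.0, 2.0], [3.0, 4.0]]
table = extend_embeddings(pretrained, new_vocab_size=5)
grads = [[1.0, 1.0] for _ in table]  # placeholder gradients
masked_update(table, grads, first_new_row=len(pretrained))
```

In this sketch the pretrained rows stay fixed while only the new rows move, which is one plausible way to bring random embeddings into line with an existing model before broader bilingual training.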
Sarvam AI’s evaluations cover not only standard natural language generation tasks but also practical, real-world use cases. In comparisons against GPT-3.5, with GPT-4 acting as the judge, the company reports that OpenHathi consistently performs better in Hindi, both in its native script and in Romanized form.
Sarvam AI developed the model with academic partners from AI4Bharat, who contributed language resources and benchmarking expertise. The model was also fine-tuned in collaboration with KissanAI, using conversational data from a bot that interacts with farmers in multiple languages.
Pratyush Kumar and Vivek Raghavan founded Sarvam AI in July 2023. The company has received $41 million in Series A funding in a round led by Lightspeed, with significant contributions from Peak XV Partners and Khosla Ventures.