Indian large language models gain momentum
2024-05-13
Indian large language models trained on Indic languages mark a significant step in applying AI to India's diverse linguistic landscape. By deploying these models, businesses and governments can communicate more effectively with citizens and customers in their preferred languages. This improves service delivery and also promotes inclusivity and accessibility for all language communities in the country.
India today has 22 official languages, numerous dialects and a heterogeneous terrain where tone, accent and vocabulary can vary every few kilometres. This cultural diversity lends itself to the development of local large language models (LLMs) that better suit the needs of a multilingual country.
Experts say the question of what training data goes into most global large language models (LLMs) is quite pertinent, especially in the context of India's linguistic diversity. English-centric training data may not adequately capture the nuances and complexities of the many languages spoken in India.
Indian large language models and datasets such as BharatGPT, Krutrim, OpenHathi and Vaani, which are built on Indic languages, are now being used by businesses and governments to better serve the needs of a diverse multilingual country.
Krutrim, backed by Ola and trained on a dataset of over two trillion tokens, represents a significant advancement in Indian large language models (LLMs). Its emphasis on a large representation of Indic language tokens is particularly noteworthy, as it reflects a commitment to addressing the linguistic diversity of India.
BharatGPT, a homegrown generative AI initiative, aims to address the same gap by training language models specifically on data from Indian languages, supporting a wider range of Indian languages beyond just English. Similarly, Indian AI startup Sarvam AI has released the OpenHathi LLM, marking a significant leap in the realm of Hindi language models.
Google-backed Project Vaani has 12 states and 80 districts represented in one of the largest datasets of Indian dialects. The data is expected to be open-sourced through platforms such as Bhashini, the national language translation initiative under the Ministry of Electronics and Information Technology, and to spur the development of automatic speech recognition and natural language processing technologies that better understand how Indians speak.
These language models also enable various applications, including customer support, content localization, and language translation services, to cater to the linguistic diversity of India. Moreover, they contribute to the democratization of information and access to services by breaking down language barriers.