
Can India Create Multi-Language AI Like ChatGPT?

By Shivani Shinde
January 08, 2024 13:40 IST


Illustration: Dominic Xavier/Rediff.com
 

2023 was all about OpenAI's ChatGPT chatbot, a software application designed to mimic human-like conversation based on user prompts.

The year ended with the promise that India-based large language models (LLMs) -- the backbone of tools like ChatGPT -- will be available soon.

Bhavish Aggarwal, CEO of Ola Electric, announced Krutrim, an LLM his company describes as 'India's own AI' (artificial intelligence).

Aggarwal is not the first to venture into Indian LLMs. Bhashini, a Government of India initiative, AI4Bharat at the Indian Institute of Technology-Madras, Sarvam.ai, and Project Vaani are among the other such projects.

Creating an LLM trained on Indian languages is not easy. Experts say each language in India has a nuance of its own, so creating a ChatGPT-like product is an ambitious challenge.

Billionaire Vinod Khosla, a pioneer in Silicon Valley AI investments and an early supporter of OpenAI, said about Sarvam's $41 million fund-raise: 'We need companies like Sarvam AI to develop deep expertise for building AI in and for India.'

To create an LLM, three things matter most: access to data in the local language, computing power, and continuous training on datasets.

All three are hurdles when building an Indian LLM. That is unlike ChatGPT, which was created primarily in English and had access to ample data.

Access to data in Indian languages is an issue to begin with.

"The ecosystem of Indian languages is way more difficult and confusing.

"When you try to enable a lot of people to do things, they need to be able to fall back on really trustworthy and easy to use standards," says Vivekanand Pani, co-founder and chief technology officer of Reverie Language Technologies, a Reliance Jio portfolio company.

"And today Indian languages definitely suffer big time," Pani adds.

Reverie has been working since 2009 on the inclusion of Indian languages in digital devices.

'Pain point'

"My biggest pain point when it comes to creating access in Indian languages is that [it is] unlike in English, in which data creation was possible because it had a great technological support ecosystem," explains Pani.

"In Indian languages we have standards that are ambiguous for people who are creating the basic typing tools or spell checks," adds Pani, who has worked in the Indian language segment for years.

English is the most popular language for Web content, representing 58.8 per cent of Web sites as of January 2023, according to a report by Statista, a global data and business intelligence platform.

The report said the United States and India, the countries with the most internet users after China, are also the world's biggest English-speaking markets.

ChatGPT took almost six years to get where it is today, despite English-language data being available in abundance.

Such data in English is digitised, and there are companies that help digitise text or offline data. That is not the case with Indian languages.

However, recent efforts by the Indian government have created tools to get data in Indian languages.

Bhashini and AI4Bharat have speech recognition and translation in 22 languages and both have text-to-speech capabilities.

Before Aggarwal announced Krutrim, a Bangalore-based startup made headlines for aiming to create an Indian AI. Sarvam.ai raised $41 million in a Series A funding round, one of the largest amounts ever raised by an Indian AI startup.

The firm aims to create a full-stack Generative AI (GenAI), a type of AI technology that can produce various types of content, including text, imagery, and audio.

Sarvam AI will focus on training AI models to support Indian languages and voice-first interfaces.

"Our intent is that 500 million Indians should be able to use GenAI. We believe that India cannot be just a user of OpenAI's ChatGPT. We need to understand the models and how one can be delivered in an Indian context," says Vivek Raghavan, co-founder of Sarvam.ai.

Year ahead

Raghavan feels GenAI will be used differently in India.

"I think in India if GenAI is going to be used then it will be through the medium of voice. We have made it very hard to type in our own languages and hence in many cases the interfaces will be voice," he says.

Pani and Raghavan believe that, as translation mechanisms have improved, more techniques now allow the creation of data in Indian languages.

Mayuresh A Nirhali, head of engineering & products at Reverie, says a format like ONDC, the open e-commerce network launched by the government, would be a better approach while building an Indian LLM.

"I think we should consider an ONDC type of format to bring government and corporates together in solving this problem," says Nirhali.

"Not just companies, even at the government level, there are different departments doing different things and following different approaches in sourcing data and building models," Nirhali adds.

Nirhali has a point about LLM projects getting a common platform like ONDC.

Beyond Bhashini and AI4Bharat, there are other efforts: the Indian Institute of Science in Bengaluru and the AI and Robotics Technology Park are partnering with Google to launch an LLM called Project Vaani.

Vaani expects to create a dataset of more than 150,000 hours of speech, part of which will be transcribed in Indian language scripts.

The dataset of natural speech and text from about 1 million people in 773 districts of India will be open source.

Despite the challenges, Pani believes that 2024 will be significant for Indian LLMs.

"India as a user base for such services is seen as a fastest growing region, which is why the interest," he says.

Krutrim is a significant step. Its biggest positive is the fact that it is trained on 2 trillion tokens.

The LLM will have generative support for 10 Indian languages and will support inputs in 22 languages.

Of course, Krutrim's ability will only be seen when it is available for all to test.

Feature Presentation: Ashish Narsale/Rediff.com
