Indian AI: What is it, and can we make one?

NEW DELHI:

Bhavish Aggarwal, cofounder of Ola Cabs, has announced a startup that he said is building “India’s own” foundational artificial intelligence (AI) model. The government, too, has been speaking about an India AI Programme. What exactly is ‘Indian’ AI? Mint decodes:

Does AI really differ by region?

What is colloquially called ‘Indian’ AI is, for now, aspirational; the term refers to the datasets that foundational AI models are trained on. Researchers argue that the largely West- and English-centric internet will teach AI programs biases and sensibilities that are mostly tuned to Western countries. That is why the Centre, researchers and industry veterans say AI in India would need to differ. Key variations would be in understanding non-English languages, and in grasping the nuances of India-centric cases of harm, societal bias and polity. Experts say such factors will make AI differ by region and culture.

How would India’s datasets differ?

The India AI Programme, the upcoming AI policy of the ministry of electronics and information technology (Meity), will be announced on 10 January. A key part will be the creation of datasets: a ministry-affiliated data governance office will handle anonymized and non-personal user data collected through Central government affiliates and private organizations. This dataset will include languages spoken in India. To be sure, the latest generation of foundational models from the likes of OpenAI and Google already include Indic language data, but they use English data for primary training.

What else would the India AI Programme include?

Meity says researchers will have access to this Indic language database, making it a research repository. The programme will also have a module to develop indigenous computing power, including data-centre capacity and custom silicon designs to improve efficiency and lower cost. The latter will take place through some form of public-private partnership.

What have private firms done so far here?

Startup Sarvam AI has showcased an open-source large language model for Hindi. Backed by venture capital firms Lightspeed and Peak XV, Sarvam’s AI model, called ‘OpenHathi-Hi-0.1’, is the first published model in India that is natively trained on non-English datasets. Hiranandani’s data centre firm Yotta has launched a partnership with Nvidia to scale cloud-based access for startups looking to train their own AI models. Ola’s Aggarwal claims to have built a foundational AI model for India, but has given few details.

What are the key challenges?

A big one is the lack of organized, publicly available datasets in Indic languages. This makes it difficult to source enough data for foundational models, which are typically trained on trillions of tokens of training data. While stakeholders are digitizing physical literature to increase the availability of structured Indic language data, doing so is significantly more expensive than sourcing data in English. Designing custom silicon and manufacturing it at scale, as India proposes, will cost well over $1 billion.