
India turns to AI in ‘a special effort’ to capture its 121 languages

  • Few of India’s many languages are covered by natural language processing, the branch of AI that enables computers to understand text and spoken words
  • Hundreds of millions of Indians are thus excluded from useful information and many economic opportunities. Governments and start-ups are trying to bridge this gap

People walk along a narrow street in a small town in Karnataka state. Photo: Shutterstock

For a few weeks this year, villagers in the southwestern Indian state of Karnataka read out dozens of sentences in their native Kannada language into an app as part of a project to build the country’s first AI-based chatbot for tuberculosis.

Kannada has more than 40 million native speakers in India. It is one of the country’s 22 official languages and one of more than 121 languages spoken by 10,000 people or more in the world’s most populous nation.

But few of these languages are covered by natural language processing (NLP), the branch of artificial intelligence that enables computers to understand text and spoken words.

Hundreds of millions of Indians are thus excluded from useful information and many economic opportunities.

“For AI tools to work for everyone, they need to also cater to people who don’t speak English or French or Spanish,” said Kalika Bali, principal researcher at Microsoft Research India. “But if we had to collect as much data in Indian languages as went into a large language model like GPT, we’d be waiting another 10 years. So what we can do is create layers on top of generative AI models such as ChatGPT or Llama.”

A rural area in Lakkundi, Karnataka. Villagers in the state are among thousands of speakers of different Indian languages generating speech data for tech firm Karya. Photo: Shutterstock

The villagers in Karnataka are among thousands of speakers of different Indian languages generating speech data for tech firm Karya, which is building data sets for firms such as Microsoft and Google to use in AI models for education, healthcare and other services.
