Building Local AI in Southeast Asia: Overcoming Challenges and Embracing Cultural Nuances


July 16, 2025

[Image: a futuristic city with a neon-colored letter on its side and a bright blue light above it. Credit: Ai-Mitsu]

Introduction

In November 2022, OpenAI’s launch of ChatGPT highlighted the bias of large language models (LLMs) towards Western, industrialized, wealthy, educated, and democratic countries. The bias showed in the default assumption that an LLM would speak a particular language and reflect a particular worldview, and that both would be Western. Developers in Southeast Asia, however, had already recognized the need for AI tools that communicate with the region in its many languages, a task complicated by the fact that over 1,200 languages are spoken there.

The Political Nature of Language in Southeast Asia

In a region where distant memories of ancient civilizations often clash with contemporary postcolonial narratives, language is deeply political. Even seemingly monolingual countries hide significant linguistic diversity: Cambodians speak nearly thirty languages, Thais around seventy, and Vietnamese over a hundred. This is a region where communities switch languages effortlessly, where non-verbal communication plays a substantial role, and where oral traditions often hold more cultural and historical nuances than written texts.

Challenges in Developing Local AI Models

Creating truly local AI models in such a linguistically diverse region presents numerous challenges. These include scarcity of high-quality, annotated data and limited access to the computational power needed to build and train models from scratch. In some cases, there are even more basic obstacles like the scarcity of native speakers, lack of common orthographic standards, and frequent power outages.

Adjusting Pre-trained Models

Due to these limitations, many regional AI developers have resorted to fine-tuning pre-existing models created by established foreign companies. This involves taking a model that has already been trained on large generic corpora and adjusting it for specific tasks using smaller, task-specific datasets. Between 2020 and 2023, language models like PhoBERT (Vietnamese), IndoBERT (Indonesian), and Typhoon (Thai) were developed on the basis of larger models like Google’s BERT, Meta’s RoBERTa (and later LLaMA), and France’s Mistral. Even the initial versions of SeaLLM, a set of models optimized for regional languages and published by Alibaba’s DAMO Academy, relied on the architectures of Meta, Mistral, and Google.
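The fine-tuning workflow described above can be sketched in miniature: the pre-trained base stays frozen, and only a small task-specific head is trained on a modest dataset. Everything in this sketch is an illustrative stand-in (a random projection in place of a foundation model, synthetic labelled data in place of a regional corpus); it is not the code of any model named in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pre-trained feature extractor: a frozen random
# projection followed by a nonlinearity (illustrative only).
W_base = rng.normal(size=(2, 16))

def features(x):
    # Frozen during fine-tuning: no gradient updates flow into W_base.
    return np.tanh(x @ W_base)

# Toy task-specific dataset standing in for a small labelled corpus.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Fine-tuning: train only a small logistic-regression head on top
# of the frozen features, using plain gradient descent.
w = np.zeros(16)
b = 0.0
lr = 0.5
for _ in range(300):
    z = features(X) @ w + b
    p = 1 / (1 + np.exp(-z))            # sigmoid probabilities
    grad = p - y                        # gradient of logistic loss
    w -= lr * features(X).T @ grad / len(X)
    b -= lr * grad.mean()

acc = ((features(X) @ w + b > 0) == (y == 1)).mean()
print(f"train accuracy: {acc:.2f}")
```

The design point is the one the paragraph makes: because only the small head is updated, the data and compute required are far below what training the base model from scratch would demand, which is exactly why resource-constrained developers reach for this approach.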

Qwen’s Impact

In 2024, Alibaba Cloud’s Qwen disrupted the Western dominance by offering Southeast Asian developers a broader range of options. A study by the Carnegie Endowment for International Peace found that five of the twenty-one regional models launched that year were based on Qwen.

Ideological Bias in Chinese-Trained Models

However, just as Southeast Asian developers previously had to account for the latent Western bias in foundational models, they now need to be vigilant about the ideological slant of models trained in China. Ironically, in attempting to localize AI and secure more autonomy for Southeast Asian communities, developers may initially become more dependent on larger foreign entities.

Addressing the Issue

Southeast Asian developers have started tackling this issue. Models like SEA-LION (covering eleven official regional languages), PhoGPT (Vietnamese), and MaLLaM (Malay) have been pre-trained from scratch on large generic datasets in the respective languages. This crucial, compute-intensive step in the machine learning process enables these models to be fine-tuned later for specific tasks.

Importance of Local Knowledge

To truly represent indigenous perspectives, a strong foundation of local knowledge is essential. A faithful representation of Southeast Asian viewpoints and values cannot exist without understanding the political aspects of language, traditional sense-making mechanisms, and historical dynamics.

Examples of Local Language Nuances

For instance, many indigenous communities perceive time and space differently from the modern framing that treats them as linear, divisible, and measurable for productivity. Balinese historical writings that defy conventional chronological models might be dismissed in Western terms as myths or legends, yet they still shape how these communities interpret the world.

Regional historians have warned that applying a Western lens to local texts risks misinterpreting indigenous perspectives. During the 18th and 19th centuries, colonial administrators in Indonesia often imposed their own interpretations on Javanese chronicles, which they accessed through translated reproductions. As a result, many biased observations of Southeast Asians by British and European sources came to be treated as valid historical accounts, internalizing the ethnic categorizations and stereotypes found in official documents. If such data is used to train AI, the outcome could deepen those prejudices.

Data vs. Knowledge

Language is inherently social and political, reflecting the relational experiences of its users. To ensure autonomy in the age of AI, technical capability to communicate in local languages is insufficient. Inherited biases must be consciously purged, assumptions about one’s identity questioned, and indigenous knowledge repositories in regional languages rediscovered.

About the Author

Elina Noor is a senior researcher in the Asia Program at the Carnegie Endowment for International Peace.

Copyright

Copyright: Project Syndicate, 1995 – 2025

www.project-syndicate.org