<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xmlns:thoughtbot="https://thoughtbot.com/feeds/">
  <title>Giant Robots Smashing Into Other Giant Robots</title>
  <subtitle>Written by thoughtbot, your expert partner for design and development.
</subtitle>
  <id>https://robots.thoughtbot.com/</id>
  <link href="https://thoughtbot.com/blog"/>
  <link href="https://feed.thoughtbot.com" rel="self"/>
  <updated>2026-05-18T00:00:00+00:00</updated>
  <author>
    <name>thoughtbot</name>
  </author>
<entry>
  <title>AI and minority languages</title>
  <link rel="alternate" href="https://thoughtbot.com/blog/ai-and-minority-languages"/>
  <author>
    <name>Ferdia Kenny</name>
  </author>
  <id>https://thoughtbot.com/blog/ai-and-minority-languages</id>
  <published>2026-05-18T00:00:00+00:00</published>
  <updated>2026-05-11T12:56:27Z</updated>
  <content type="html">&lt;p&gt;Last week I attended an excellent conference in the Irish consulate in San Francisco titled “&lt;a href="https://minoritylanguages.ai/"&gt;AI &amp;amp; Minority Languages: A Bay Area Perspective&lt;/a&gt;”.&lt;/p&gt;

&lt;p&gt;Our team at thoughtbot speaks over 25 different languages, from Swedish to Bekwarra, so this conference was of particular relevance. Here are my key takeaways from the discussions.&lt;/p&gt;
&lt;h2 id="who-gets-in-the-ark"&gt;
  
    Who gets in the ark?
  
&lt;/h2&gt;

&lt;p&gt;AI poses both an opportunity and an existential risk for minority languages. A minority language is typically one whose speakers are outnumbered by those of another language within a defined area. In this context, though, the term refers more broadly to any language that is not one of the world’s dominant languages, such as English, French, German, Spanish, Russian or Chinese (including the likes of Mandarin and Cantonese).&lt;/p&gt;

&lt;p&gt;On the one hand, new AI-powered tools like &lt;a href="https://www.abair.ie/"&gt;Abair&lt;/a&gt; from Trinity College Dublin (for Irish/Gaeilge), &lt;a href="https://projecteaina.cat/en/"&gt;Aina&lt;/a&gt; (for Catalan) and &lt;a href="https://vaani.iisc.ac.in/"&gt;Project Vaani&lt;/a&gt; (for a whole range of Indian languages and dialects), can create new ways for people to interact with minority languages and dialects.&lt;/p&gt;

&lt;p&gt;But the picture is not all rosy. AI is accelerating the loss of languages, with 97% of the world’s languages now categorised as “in danger”.&lt;/p&gt;

&lt;p&gt;That is partly because the health of a language is not just about how many people speak it. It’s about usability: what you can achieve using your language. If a minority language is no longer useful in the modern world, it becomes associated with the past. And languages that are associated with the past die out. With the proliferation of AI, for the first time in history, a language spoken by 20 million people could be in danger because it is about to be drowned out by the present technological wave.&lt;/p&gt;

&lt;p&gt;The old global divide was about access: do you have a device, a connection, an account? The new divide is about quality. If a doctor can use technology to get decision support in English, but a doctor looking for support in Swahili gets only noise, Swahili is going to be in danger.&lt;/p&gt;

&lt;p&gt;We’re at an inflection point and the implications of not making it into the ark are profound.&lt;/p&gt;
&lt;h2 id="what-are-the-problems"&gt;
  
    What are the problems?
  
&lt;/h2&gt;
&lt;h3 id="biased-data"&gt;
  
    Biased data
  
&lt;/h3&gt;

&lt;p&gt;The current imbalances largely occur due to gaps in the data. &lt;a href="https://www.economist.com/science-and-technology/2024/01/24/why-ai-needs-to-learn-new-languages"&gt;93% of GPT-3’s training data was in English&lt;/a&gt;. As a result, the language you speak determines your access to and effectiveness with the technology. If the data used to train models is only in a handful of primary languages, we will have less diversity and fewer useful living languages.&lt;/p&gt;
&lt;h3 id="defining-success-in-different-languages"&gt;
  
    Defining success in different languages
  
&lt;/h3&gt;

&lt;p&gt;Public success metrics for performance in general reasoning models nearly always relate to how proficient a model is &lt;em&gt;in English&lt;/em&gt;. But if you assess a model in terms of multilingual performance, the success rate is much lower.&lt;/p&gt;

&lt;p&gt;For example, &lt;a href="https://artificialanalysis.ai/models/multilingual?language=en%2Ces%2Cmy%2Cbn%2Cyo"&gt;Claude Sonnet 4.5 performs at 94% in general reasoning in English, but at only 76% in Yoruba&lt;/a&gt;. That is quite a disparity, and it likely increases when compared to even more marginalised languages.&lt;/p&gt;

&lt;p&gt;Furthermore, &lt;a href="https://arxiv.org/pdf/2412.03304"&gt;28% of questions that are asked of an LLM require culturally sensitive knowledge&lt;/a&gt;, which becomes increasingly difficult to accommodate the more marginalised a language is.&lt;/p&gt;
&lt;h3 id="accessibility"&gt;
  
    Accessibility
  
&lt;/h3&gt;

&lt;p&gt;Language is subtle and nuanced. There can be a difference between accuracy and understandability. An AI system can be technically correct, but it still might miss out on nuances in different languages.&lt;/p&gt;

&lt;p&gt;A simple example is that an AI system might say 25%, whereas humans might more commonly say “one in four”. It’s the same outcome, but expressed differently. Humans also use metaphors whereas AI tends to rely on literal descriptions. The question isn’t whether AI can do something, but how AI communicates meaning.&lt;/p&gt;

&lt;p&gt;A humorous example of this failing came during an ElevenLabs demo of an AI voice product which was tasked with creating some Irish folk tunes. While it completed the task well, the voice agent, despite speaking in the voice of an American man, decided to name itself “Aoife”. For those who don’t know, Aoife is a very popular Irish girl’s name. The system knew enough to select an Irish name, but it clearly didn’t understand the name itself. It missed the nuance.&lt;/p&gt;
&lt;h3 id="computational-power"&gt;
  
    Computational power
  
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/asrikun/"&gt;Dr. Mochamed Asri&lt;/a&gt; brought up one of the most fascinating issues discussed during the conference. He explained that, while models might be getting better, the tokeniser is seriously lagging behind. It focuses on English, and it needs to get much better to bring parity to different languages.&lt;/p&gt;

&lt;p&gt;Over 90% of model training data is in English. Because of this familiarity, the tokeniser can easily assign a single token to common English words. For example, the word “community” in English registers as a single token. But the word “masyarakat”, which is Indonesian for “community”, is not recognised, so it gets split into four parts: “mas”, “ya”, “ra” and “kat”. That’s four tokens, which means roughly four times the computational power, for the same meaning.&lt;/p&gt;
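
&lt;p&gt;To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokeniser in Python. The vocabulary is a made-up toy (an assumption for illustration, not the vocabulary of any real model), chosen so that it reproduces the split described above. Real tokenisers learn their vocabularies from training data and use more sophisticated algorithms.&lt;/p&gt;

```python
# Toy vocabulary: hypothetical, for illustration only. Real tokenisers learn
# tens of thousands of subword pieces from their training data.
TOY_VOCAB = {"community", "mas", "ya", "ra", "kat"}


def tokenise(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    Characters that match no vocabulary entry become single-character tokens.
    """
    tokens = []
    start = 0
    while start != len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end == start + 1:
                tokens.append(piece)
                start = end
                break
    return tokens


english = tokenise("community", TOY_VOCAB)
other = tokenise("masyarakat", TOY_VOCAB)

print(english, len(english))
print(other, len(other))
print("token ratio:", len(other) / len(english))
```

&lt;p&gt;With this toy vocabulary, “community” comes out as one token and “masyarakat” as four. Token count is only a rough proxy for compute, but it shows why the same meaning can cost more to process in one language than in another.&lt;/p&gt;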
&lt;h3 id="energy-gap"&gt;
  
    Energy gap
  
&lt;/h3&gt;

&lt;p&gt;Computing power leads us on to the energy gap. Unfortunately, the countries that need the most extra compute because of the tokeniser imbalance are the ones least equipped to provide the necessary energy, which prevents them from building their own models.&lt;/p&gt;

&lt;p&gt;For comparison, the United States has a population of ~350 million people and &lt;a href="https://www.statista.com/statistics/184246/us-electric-generating-capacity-from-2000/"&gt;an electrical capacity of ~1,200 GW&lt;/a&gt;. Indonesia, with a population of ~288 million people, &lt;a href="https://www.statista.com/statistics/865232/indonesia-electricity-generation-capacity/"&gt;has a capacity of only ~80 GW&lt;/a&gt;. Kenya, with a population of ~59 million people, has only &lt;a href="https://www.statista.com/statistics/1240951/installed-capacity-of-electricity-generation-in-kenya/"&gt;~4 GW of electrical capacity&lt;/a&gt;.&lt;/p&gt;
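
&lt;p&gt;Dividing capacity by population gives a rough per-person figure. The short Python sketch below only redoes that arithmetic using the rounded numbers quoted above; the names in it are my own, and the results are only as precise as those approximations.&lt;/p&gt;

```python
# Rounded figures quoted in the paragraph above: (capacity in GW, population in millions).
COUNTRIES = {
    "United States": (1200, 350),
    "Indonesia": (80, 288),
    "Kenya": (4, 59),
}


def kw_per_person(capacity_gw, population_millions):
    """Convert installed capacity and population into kilowatts per person."""
    # 1 GW = 1,000,000 kW and 1 million people = 1,000,000 people, so the factors cancel.
    return capacity_gw / population_millions


for name, (capacity_gw, population_millions) in COUNTRIES.items():
    print(f"{name}: {kw_per_person(capacity_gw, population_millions):.2f} kW per person")
```

&lt;p&gt;On these rounded numbers, that works out to roughly 3.4 kW of capacity per person in the United States, about 0.28 kW in Indonesia and about 0.07 kW in Kenya.&lt;/p&gt;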
&lt;h3 id="bystander-problem"&gt;
  
    Bystander problem
  
&lt;/h3&gt;

&lt;p&gt;AI can contribute to bystander syndrome. We might think someone else will work on adding and protecting our languages, and that there’s no need for us to take any action. But the reality is that the companies behind the big frontier models will not do this work for us.&lt;/p&gt;
&lt;h2 id="improving-models-but-more-to-do"&gt;
  
    Improving models but more to do
  
&lt;/h2&gt;

&lt;p&gt;The frontier models are improving, which is a step in the right direction.&lt;/p&gt;

&lt;p&gt;During his talk, Zach Parent of OpenAI demonstrated how ChatGPT 3.5 turbo performed much worse than ChatGPT 5.5 at Irish language tasks. The test demo was simple: he asked each version of the model a question in English but asked it to give the answer in Irish. Then he asked both to translate the Irish answer back to English. 5.5 gave a pretty accurate response, while 3.5 output mostly gibberish.&lt;/p&gt;

&lt;p&gt;This demonstrated how much the models have improved in roughly three years. But Zach highlighted that while everyone focuses on training the model, the gaps exist in the other steps of the process.&lt;/p&gt;

&lt;p&gt;The models improve as more data is put through them. The process needs to start with Automatic Speech Recognition (ASR) to turn the spoken word into text. The model then needs to understand the inputs before applying Text-to-Speech (TTS), followed by a real-time agent to make it conversational.&lt;/p&gt;

&lt;p&gt;Each of these steps requires high quality data that is not easy to capture. In particular, there is a real gap when moving from text to voice.&lt;/p&gt;
&lt;h2 id="how-to-fix-it"&gt;
  
    How to fix it
  
&lt;/h2&gt;
&lt;h3 id="high-quality-data"&gt;
  
    High quality data
  
&lt;/h3&gt;

&lt;p&gt;We need more high quality data and representation in the models. This is not just about having more translations, but actual high quality input.&lt;/p&gt;

&lt;p&gt;Processes like transcribing voice to text are manual and labour intensive, and they require native speakers. Once the transcription is done, that data can be ingested and the model can be trained on it.&lt;/p&gt;
&lt;h3 id="government-support"&gt;
  
    Government support
  
&lt;/h3&gt;

&lt;p&gt;That transcription piece won’t just happen on its own, and it can’t simply be automated. Governments, non-profits and universities need to step in to support this work.&lt;/p&gt;
&lt;h3 id="small-language-models-slms"&gt;
  
    Small Language Models (SLMs)
  
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.ibm.com/think/topics/small-language-models"&gt;Small language models&lt;/a&gt; offer some promise, especially in relation to compute power and the energy gap. Again, governments will need to provide supports to build SLMs and edge deployments.&lt;/p&gt;
&lt;h3 id="improved-tokeniser"&gt;
  
    Improved tokeniser
  
&lt;/h3&gt;

&lt;p&gt;Even as the models improve, we won’t get language parity until the tokeniser improves and becomes more diverse, reducing compute power constraints for minority languages.&lt;/p&gt;
&lt;h3 id="accountability"&gt;
  
    Accountability
  
&lt;/h3&gt;

&lt;p&gt;As a society, we need to hold the companies behind frontier models more accountable when they publish performance metrics. Applying pressure to disclose cross-language performance metrics will help highlight the gap and, hopefully, will lead to further action.&lt;/p&gt;
&lt;h3 id="get-creative"&gt;
  
    Get creative
  
&lt;/h3&gt;

&lt;p&gt;One of the most creative ways to drive a minority language and culture into the modern age came from the Iñupiat, an Alaska Native people. &lt;a href="https://nativefederation.org/2019/03/gloria-oneill/"&gt;Gloria O’Neill&lt;/a&gt; of the Cook Inlet Tribal Council explained how they created a puzzle-platformer video game called &lt;a href="https://www.neveralonegame.com/"&gt;Never Alone (Kisima Inŋitchuŋa)&lt;/a&gt;, which is based on a traditional story passed down through generations.&lt;/p&gt;

&lt;p&gt;The game was created in partnership with E-Line Media. With over 15 million players worldwide, the reception of Never Alone (Kisima Inŋitchuŋa) launched a movement of social-impact video games. It went on to win a Peabody Award for its storytelling and a BAFTA for Best Debut Game. Eight years later, Never Alone 2 is on the cusp of being released.&lt;/p&gt;

&lt;p&gt;I was inspired and blown away by the totally outside-the-box thinking of the Iñupiat people to preserve their language and culture.&lt;/p&gt;
&lt;h2 id="a-closing-thought"&gt;
  
    A closing thought
  
&lt;/h2&gt;

&lt;p&gt;AI has endangered more languages than ever before. However, there are still ways that we, as minority language speakers, can preserve our languages. We can create the high quality data required to train the models, push corporations and governments to accommodate and support minority languages, or think completely differently about how to make our language relevant in the AI age.&lt;/p&gt;

&lt;p&gt;We are at an inflection point and we need to take ownership and accountability to get our languages onto the ark.&lt;/p&gt;

</content>
  <summary>How AI creates both opportunities and existential risks for minority languages, and what we can do to protect them.</summary>
  <thoughtbot:auto_social_share>true</thoughtbot:auto_social_share>
</entry>
</feed>
