<aside> ℹ️ This presentation takes place at PAIRS 2026 Online on 17 February 2026, 16:00 UTC. Registered participants will receive Zoom links to join the session via e-mail.
🎙️ Thread on PAIRS Discussion Server (Discord) (register first)
</aside>
AI systems are increasingly used in education, healthcare, welfare administration, and crisis response. However, as these systems reach multilingual users, a critical issue emerges: AI performance and the effectiveness of built-in safeguards vary significantly across languages. Most evaluations still prioritise English, leaving other languages with weaker testing, lower reliability, and higher risks of harmful or misleading responses. This inconsistency undermines trust, safety, and the rights of individuals, especially when AI tools are used to inform or make decisions about people’s access to services, legal status, and basic needs.
In this presentation, I will:
- Introduce the Multilingual AI Lab (https://www.multilingualailab.com/), developed during my Mozilla Foundation Fellowship. It is an open research and practice platform that enables researchers, civil society organisations, and developers to conduct participatory benchmarking, multilingual safety evaluations, and documentation of AI harms and failures across languages.
- Present a humanitarian case study assessing the performance of large language models (LLMs) in refugee and migrant information scenarios. Working with Respond Crisis Translation (https://respondcrisistranslation.org/), we evaluated LLMs across four language pairs: Arabic vs. English, Farsi (Iranian Persian) vs. English, Pashto vs. English, and Kurdish (Sorani) vs. English, on scenarios involving legal aid, healthcare access, employment, asylum procedures, and essential services.
- Share key findings, showing where language disparities are greatest in terms of factual accuracy, actionability of information, tone and empathy, harmful content, and refusal rates, across language pairs and models. These results reveal how safeguards built into AI systems often function unevenly depending on the language.
- Compare human evaluation with LLM self-evaluation (LLM-as-judge), a method commonly used by AI labs to claim scalable and objective assessment. In our humanitarian scenarios, we show how LLM-as-judge often lacks contextual understanding of displacement, leading to false confidence in AI systems’ readiness for deployment in high-stakes settings.
- Explain our participatory methodology, including scenario co-design, evaluation rubric development, collaboration with Respond Crisis Translation, and how findings are used for advocacy. This offers a practical model of participatory AI evaluation, connecting research with real-world impact.
- Highlight the value of research tool-building for participatory AI, not only to measure performance but also to enable evidence-based advocacy, collectively identify system failures, and co-create safeguards with affected communities.