Yong Zheng-Xin
CS PhD @ Brown University
I am a final-year PhD student at Brown University advised by Stephen Bach. My research interests are in AI safety and alignment, and I am fortunate to have my PhD study supported by an Open Philanthropy grant for technical AI safety.
I am currently an Astra Research Fellow working with OpenAI, mentored by Miles Wang. I work on CoT monitorability/obfuscation as well as agentic safety evaluation. My other relevant work includes:
- Safety for reasoning models: Emergent self-jailbreaking behaviors by open-source reasoning models (ICLR 2026).
- Safety for multilingual models: Jailbreaking GPT-4 with low-resource languages (Best Paper, NeurIPS 2023 SoLaR). Cross-lingual generalization studies of detoxification (EMNLP 2024) and finetuning attacks (NAACL 2025).
Previously, I was a research scientist intern at Meta AI and a research collaborator at Cohere Labs. I contributed to multilingual frontier models, including instruction-following LLMs such as Aya and BLOOM (Best Paper, ACL 2024; ACL 2023) and ASR speech models (INTERSPEECH 2025).
Selected Publications (see all)
- ACL, 2024 (Best Paper Award)
- NeurIPS Workshop: Socially Responsible Language Modelling Research (SoLaR), 2023 (Best Paper Award)