Yong Zheng-Xin
Current: CS PhD @ Brown University, Astra Fellow
Past: Research Scientist Intern @ Meta AI, Research Collaborator @ Cohere Labs
I am a final-year PhD student at Brown University, advised by Prof. Stephen Bach. I am fortunate to have interned and collaborated at Meta (GenAI, FAIR) and Cohere Labs, and my research is funded by an Open Philanthropy grant for technical AI safety.
I work on making AI systems safe and helpful for everyone around the world. My recent research focuses on post-training, especially reasoning and safety alignment. In particular, I study surprising properties of reasoning chains-of-thought (CoTs), such as:
- Cross-lingual reasoning through test-time scaling (preprint).
- Self-jailbreaking, where models reason themselves out of safety guardrails after benign reasoning training (preprint).
- Predicting safety outcomes before models finish thinking (preprint).
A large part of my previous research was on multilingual LLMs and speech models, especially their alignment and capabilities in low-resource languages.
- I discovered that low-resource languages can jailbreak GPT-4 (⭑Best Paper Award, NeurIPS 2023 SoLaR Workshop; featured on New Scientist), with follow-up interpretability work on crosslingual detoxification (EMNLP 2024 Findings) and finetuning attacks (NAACL 2025 Findings). We also recently released a survey on multilingual AI safety (EMNLP 2025).
- I contributed to instruction-following models such as the Aya model (⭑Best Paper Award, ACL 2024) and studied how LLMs can learn low-resource languages through language adaptation (ACL 2023) and synthetic data (EMNLP 2024 Findings).
- I also worked on making speech models robust to new accents (INTERSPEECH 2025), which contributed to the Omnilingual ASR models (preprint).
Actively seeking full-time research roles in industry.
Featured work/preprints (see all)
- ACL, 2024 (Best Paper Award)
- NeurIPS Workshop: Socially Responsible Language Modelling Research (SoLaR), 2023 (Best Paper Award)