Before deploying Large Language Models (LLMs) for user-facing applications, developers align them to user preferences through a variety of procedures, such as Reinforcement Learning From Human Feedback (RLHF) and Direct Preference Optimization (DPO). Current evaluations of these procedures focus on benchmarks of instruction following, reasoning, and truthfulness. However, human preferences are not universal, and aligning to specific preference sets may have unintended effects. We explore how alignment impacts performance along three axes of global representation: English dialects, multilingualism, and opinions from and about countries worldwide.
Dialects: First, we observe that aligning an LLM with supervised fine-tuning and preference tuning creates a disparity in performance across English dialects. We show these results in Figure 2. We find that all base models perform roughly equivalently at intent detection across dialects. However, from Mistral to Mistral SFT, Indian English accuracy increased by 15.2%, Nigerian English accuracy increased by 20.3%, and American English accuracy increased by 29.3%. The SFT process thus introduced a 14% gap in performance between Indian English and US English.
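As a rough illustration of the measurement behind Figure 2, the sketch below computes per-dialect intent-detection accuracy from prediction records and reports the gap between US and Indian English. The record format, dialect labels, and intent labels are hypothetical and not the paper's exact setup.

```python
from collections import defaultdict

# Hypothetical evaluation records: (dialect, gold intent, predicted intent).
# The actual benchmark, prompts, and label set differ; this only shows how a
# per-dialect accuracy gap like the one in Figure 2 would be measured.
records = [
    ("en-US", "play_music", "play_music"),
    ("en-IN", "play_music", "set_alarm"),
    ("en-NG", "set_alarm", "set_alarm"),
    # ... one record per evaluation example
]

correct = defaultdict(int)
total = defaultdict(int)
for dialect, gold, pred in records:
    total[dialect] += 1
    correct[dialect] += int(gold == pred)

accuracy = {dialect: correct[dialect] / total[dialect] for dialect in total}
gap = accuracy.get("en-US", 0.0) - accuracy.get("en-IN", 0.0)
print(accuracy)
print(f"US vs. Indian English accuracy gap: {gap:.1%}")
```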
Language: We find that supervised fine-tuning (SFT) and preference tuning consistently improve multilingual capabilities for models such as Tulu and Starling. Interestingly, 14.1% of Tulu's training data is in a non-English language. Conversely, Zephyr, which loses performance across most languages, was trained on data that is 99.9% English. Even a small amount of multilingual data can be highly beneficial. We include these results in Figure 3.
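To illustrate how a figure like Tulu's 14.1% might be estimated, here is a minimal sketch that computes the non-English share of a training set with the `langdetect` package. The package choice and the error handling are assumptions; the paper's own language-identification method may differ.

```python
from langdetect import detect  # assumed dependency: pip install langdetect

def non_english_fraction(texts):
    """Estimate the fraction of texts whose detected language is not English."""
    counted = 0
    non_english = 0
    for text in texts:
        try:
            lang = detect(text)
        except Exception:
            continue  # texts too short or ambiguous to classify are skipped
        counted += 1
        non_english += int(lang != "en")
    return non_english / counted if counted else 0.0

# Hypothetical usage on instruction-tuning examples:
examples = [
    "Schreibe ein Gedicht über den Herbst.",   # German
    "Summarize this article in two sentences.",
]
print(f"{non_english_fraction(examples):.1%} non-English")
```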
Opinions: Finally, we test how the opinions of language models change with supervised fine-tuning and preference tuning. In particular, we explore how LLM responses to survey questions shift across stages of alignment relative to the responses of survey participants around the world. We showcase these results in Figure 4.
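As one concrete way to quantify closeness between a model and a country's survey respondents, the sketch below scores a single multiple-choice question with one minus the Jensen-Shannon distance between answer distributions. The metric, probabilities, and country set here are illustrative assumptions rather than the paper's exact pipeline.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical answer-option probabilities for one multiple-choice survey question.
model_dist = np.array([0.70, 0.20, 0.10])  # e.g. from the model's answer probabilities
country_dists = {
    "USA":     np.array([0.65, 0.25, 0.10]),
    "Jordan":  np.array([0.20, 0.30, 0.50]),
    "Nigeria": np.array([0.30, 0.40, 0.30]),
}

# Similarity = 1 - Jensen-Shannon distance (base 2 keeps it in [0, 1]).
similarity = {
    country: 1.0 - jensenshannon(model_dist, dist, base=2)
    for country, dist in country_dists.items()
}
closest = max(similarity, key=similarity.get)
print(similarity, "-> closest:", closest)
```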
We find that both supervised fine-tuning and preference tuning tend to steer models towards US preferences and opinions rather than those of other countries such as Jordan, China, or Nigeria. In fact, the base version of Llama 2 is more aligned with Nigeria than with the USA, whereas Llama 2 Chat is more aligned with the USA than with Nigeria. We also explore how reward models express opinions about countries and find a 0.849 Spearman correlation with US opinions about countries. Details of this experiment are included in the paper.
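The reward-model correlation can be computed along the lines of the sketch below: rank-correlate a reward model's scores for statements about each country with US respondents' favorability toward those countries. The numbers and variable names are illustrative placeholders, not the values from our experiments.

```python
from scipy.stats import spearmanr

# Illustrative placeholders: reward-model scores for country-related statements
# and the share of US respondents holding favorable views of each country.
countries       = ["Canada", "China", "Germany", "Jordan", "Nigeria"]
reward_scores   = [1.8, -0.4, 1.5, 0.2, 0.1]
us_favorability = [0.87, 0.20, 0.80, 0.45, 0.50]

rho, p_value = spearmanr(reward_scores, us_favorability)
print(f"Spearman correlation with US opinions: {rho:.3f} (p = {p_value:.3f})")
```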
Overall, we find that the procedures used to convert base LLMs into chatbot assistants have several unintended global impacts that make them less useful for global users. Specifically, we identify dialects, languages, and opinions as downstream areas in which to monitor impact. We hope our work encourages model developers to audit their LLMs for appropriateness and effectiveness for the global audiences they intend to serve.
Project repository: https://github.com/SALT-NLP/unintended-impacts-of-alignment