Language Model Personalization via Reward Factorization

1MIT    2Boston University    *Equal Contribution

Introduction

Large language models (LLMs) have revolutionized natural language processing, enabling applications from AI-assisted writing to conversational agents. However, one major limitation remains: personalization. Most current models assume a single universal preference model and fail to adapt to individual user needs. Traditionally, reinforcement learning from human feedback (RLHF) optimizes LLMs on preference data aggregated across many users. While this ensures broad alignment with human values, it cannot tailor responses to individual users.

For example, a business professional might prefer concise and formal responses, while a student may value detailed and explanatory answers. Training separate models per user is computationally expensive and requires large amounts of user-specific data—often impractical in real-world settings.

In this paper, we introduce a new approach to efficiently personalize LLMs using a low-dimensional reward factorization framework. Our method allows models to align with user-specific preferences using as few as 10 preference examples, significantly improving user satisfaction compared to standard RLHF.

In our human experiment, personalizing GPT-4o responses to each user's preferences achieved a 67% win rate against the non-personalized responses.

Our Solution: Reward Factorization for Personalization

Our framework, PReF, assumes that user preferences lie in a shared, low-dimensional space. Instead of learning a separate reward function for each user, we represent user-specific preferences as a linear combination of base reward functions:

$$ r_i(x, y) \;=\; \sum_{k=1}^{K} \lambda_{i,k}\, f_k(x, y) $$

where $f_1, \dots, f_K$ are shared base reward functions and $\lambda_i = (\lambda_{i,1}, \dots, \lambda_{i,K})$ are the weights of user $i$.

This factorization enables us to personalize responses efficiently without requiring extensive user data. The key insight is that personalization reduces to estimating a small set of user-specific weights $\lambda_i$, rather than learning an entirely new reward model per user.
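To make the factorization concrete, here is a minimal NumPy sketch of how a user-specific reward would be computed once the base reward scores and the user's weights are available. The function name, variable names, and numbers are illustrative, not taken from the paper.

```python
import numpy as np

def personalized_reward(base_rewards: np.ndarray, user_weights: np.ndarray) -> float:
    """Combine K base reward scores into one user-specific reward.

    base_rewards: shape (K,), the scores f_1..f_K of a single (prompt, response) pair.
    user_weights: shape (K,), the user's weights lambda_i.
    """
    return float(base_rewards @ user_weights)

# Illustrative example with three base dimensions (e.g. conciseness, formality, detail).
base_rewards = np.array([0.8, 0.2, 0.5])    # f_k(x, y) for one candidate response
user_weights = np.array([1.5, -0.5, 0.3])   # lambda_i inferred for this user
print(personalized_reward(base_rewards, user_weights))  # 1.25
```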

The Personalization Pipeline

Step 1: Learning Shared Base Reward Functions

We first collect a dataset where multiple users annotate response preferences. This dataset captures the diversity of user preferences across various tasks. Using logistic matrix factorization, we uncover shared preference dimensions, allowing us to represent user-specific reward functions as a weighted combination of these base functions. This structured approach reduces the need for large-scale individual user data while maintaining personalization accuracy.
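The sketch below shows one way such a logistic matrix factorization could be implemented in plain NumPy: each annotation is a (user, response-pair, choice) triple, and user weight vectors and per-pair features are fit jointly under a Bradley-Terry-style logistic likelihood via alternating SGD. This is a simplification under our own assumptions, not the paper's exact training procedure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_mf(prefs, n_users, n_pairs, k=5, lr=0.05, epochs=200, reg=1e-3, seed=0):
    """Logistic matrix factorization of a sparse preference matrix.

    prefs: list of (user_id, pair_id, label) triples, where label is 1 if the
           user preferred the first response of the pair and 0 otherwise.
    Returns:
        Lam: (n_users, k) user weight vectors (lambda_i).
        F:   (n_pairs, k) pair features, playing the role of f(x, y_a) - f(x, y_b).
    """
    rng = np.random.default_rng(seed)
    Lam = 0.1 * rng.standard_normal((n_users, k))
    F = 0.1 * rng.standard_normal((n_pairs, k))
    for _ in range(epochs):
        for idx in rng.permutation(len(prefs)):
            u, j, y = prefs[idx]
            p = sigmoid(Lam[u] @ F[j])      # predicted P(user u prefers first response)
            g = p - y                       # gradient of the logistic loss w.r.t. the logit
            grad_lam = g * F[j] + reg * Lam[u]
            grad_f = g * Lam[u] + reg * F[j]
            Lam[u] -= lr * grad_lam
            F[j] -= lr * grad_f
    return Lam, F
```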

Step 2: Estimating a New User's Preference Weights

Once the base functions are learned, we estimate a new user's preferences from only 10–20 preference examples. By showing the user carefully selected response pairs and recording their choices, we refine their preference model iteratively. We employ an active-learning strategy that selects the response pairs that most reduce uncertainty about the user's weights, minimizing the number of examples required while maintaining accurate personalization.
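As a concrete sketch of this step, the code below fits a new user's weights by regularized logistic regression on base-reward difference features, and picks the next query as the candidate pair whose outcome is most uncertain under the current estimate. The maximum-uncertainty heuristic and all names here are our own illustration; the paper's exact acquisition rule may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def estimate_user_weights(diffs, labels, k, reg=1.0, lr=0.1, steps=500):
    """Estimate a new user's weights lambda from a handful of comparisons.

    diffs:  (n, k) array; row t is f(x_t, y_a) - f(x_t, y_b) under the base reward functions.
    labels: (n,) array of 1/0 choices (1 = the user picked y_a).
    """
    lam = np.zeros(k)
    n = max(len(labels), 1)
    for _ in range(steps):
        p = sigmoid(diffs @ lam)
        grad = diffs.T @ (p - labels) / n + reg * lam  # logistic loss + Gaussian prior
        lam -= lr * grad
    return lam

def next_query(lam, candidate_diffs):
    """Pick the candidate comparison whose outcome is least predictable
    (predicted preference probability closest to 0.5)."""
    p = sigmoid(candidate_diffs @ lam)
    return int(np.argmin(np.abs(p - 0.5)))
```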

Step 3: Inference-Time Personalization

Instead of retraining or fine-tuning an LLM for each user, we dynamically adjust model outputs during inference. We achieve this using inference-time alignment algorithms, which condition the model's response generation on the user’s inferred preference weights. This enables real-time adaptation to individual user preferences without incurring the high computational costs of retraining.
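One simple form of inference-time alignment is best-of-N reranking under the personalized reward, sketched below. The two helpers, generate_candidates (samples N responses from the LLM) and base_reward_features (returns the K base reward scores of a response), are hypothetical stand-ins, and the paper's actual decoding-time algorithm may differ.

```python
import numpy as np

def personalize_at_inference(prompt, user_weights, generate_candidates, base_reward_features, n=8):
    """Best-of-N reranking under the user's personalized reward.

    generate_candidates(prompt, n) -> list of n candidate responses   (hypothetical helper)
    base_reward_features(prompt, y) -> np.ndarray of shape (K,)       (hypothetical helper)
    user_weights: np.ndarray of shape (K,), the user's lambda_i.
    """
    candidates = generate_candidates(prompt, n)
    scores = [float(base_reward_features(prompt, y) @ user_weights) for y in candidates]
    return candidates[int(np.argmax(scores))]
```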

Experimental Results
