Large language models (LLMs) have revolutionized natural language processing, enabling applications ranging from AI-assisted writing to conversational agents. However, one major limitation remains: personalization. Most current systems assume a single, universal preference model and fail to adapt to individual user needs. Reinforcement learning from human feedback (RLHF) traditionally optimizes LLMs with preference data aggregated across many users; while this ensures broad alignment with human values, it cannot tailor responses to individual users.
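Concretely, standard RLHF pipelines typically fit a single reward model to comparisons pooled across annotators, for instance by maximizing a Bradley-Terry log-likelihood (shown here only to illustrate the standard setup; the notation $r_{\theta}$, $x$, $y^{+}$, $y^{-}$ is introduced for exposition):

\[
\max_{\theta} \; \sum_{(x,\, y^{+},\, y^{-})} \log \sigma\big( r_{\theta}(x, y^{+}) - r_{\theta}(x, y^{-}) \big),
\]

where $x$ is a prompt, $y^{+}$ and $y^{-}$ are the preferred and dispreferred responses, and $\sigma$ is the logistic function. Because all annotators contribute to the same $r_{\theta}$, systematic disagreements between users are averaged away rather than modeled.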
For example, a business professional might prefer concise and formal responses, while a student may value detailed and explanatory answers. Training separate models per user is computationally expensive and requires large amounts of user-specific data—often impractical in real-world settings.
In this paper, we introduce an approach for efficiently personalizing LLMs based on a low-dimensional reward factorization framework. Our method aligns models with user-specific preferences from as few as 10 preference examples, significantly improving user satisfaction compared to standard RLHF.
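As an illustrative sketch (one natural instantiation of such a factorization, not necessarily the exact formulation developed in this paper; the symbols $\phi$, $w_{u}$, and $k$ are notation introduced here), the idea is to express each user's reward as a user-specific combination of a small set of shared reward features,

\[
r_{u}(x, y) = w_{u}^{\top} \phi(x, y), \qquad w_{u} \in \mathbb{R}^{k} \ \text{with small } k,
\]

so that adapting to a new user reduces to estimating the low-dimensional vector $w_{u}$ from a handful of that user's comparisons, rather than re-annotating data and retraining a full reward model per user.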
In our human experiment, personalizing GPT-4o responses to each user's preferences led to a 67% win rate over non-personalized responses.