When in-distribution gains fail: reward models under preference shift

A reward model can post a healthy gain on its own test set and still add almost nothing on a held-out safety dataset. A new study measures that gap and uses the RAIL benchmark to expose it.

Figure: How a weak-to-strong reward model is trained and tested, with Representation Anchoring shown in teal.

Key takeaways

Weak-to-strong training lets a small "teacher" model supervise a larger "student" model. It is a leading proposal for overseeing systems too capable for direct human checking.

A new paper from the National University of Singapore and collaborators shows that strong students can look successful on their training distribution while transferring poorly to other preference datasets.

The cause is representational: fine-tuning on weak labels can pull the model toward quirks of the source dataset rather than broadly useful preference features.

Their fix, Representation Anchoring, keeps the student close to the pretrained model's representations and improves out-of-distribution transfer without sacrificing in-distribution accuracy.

The RAIL dataset is used as one of three independent harmlessness benchmarks, both as training data and as a held-out target. In one safety setting, only the anchored model transferred to RAIL with a positive gain.

The practical lesson: a single in-distribution score overstates how aligned a reward model really is. Evaluate across datasets and domains.

Introduction

As AI systems take on tasks that humans cannot easily check, a hard question follows: how do you supervise a model that may know more than its supervisor? One leading answer is weak-to-strong generalization, where a weaker model produces the training signal for a stronger one. If the strong student can extract the useful structure from imperfect supervision, it can exceed the teacher rather than inherit its ceiling [1].

A recent paper, "When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift," tests whether that promise holds when the evaluation data does not match the training data [2]. The authors, from the National University of Singapore, VinUniversity, and Nanyang Technological University, focus on reward models, the components that score model responses inside preference-based alignment pipelines. Their result is a useful caution. A reward model can record a strong gain on its own held-out test set and still fail to carry that improvement to a different dataset in the same category.

This article walks through what the study measured, why the failure happens, and what the proposed fix does. It also explains where the RAIL dataset fits, since the authors use it as an independent harmlessness benchmark. The headline for anyone shipping aligned models is simple: in-distribution accuracy is necessary, but on its own it is weak evidence of safety.

Background: weak-to-strong supervision and reward models

Weak-to-strong generalization is a concrete version of the scalable oversight problem. A weak supervisor is trained on gold labels, then used to label data for a stronger student. The student never sees the gold labels directly. The hope is that the student's broader pretrained knowledge lets it generalize past the teacher's mistakes [1].

This idea matters because the alternative does not scale. Human labelers cannot reliably judge every output of a system that operates at machine speed and across a huge range of topics, so the field looks for ways to turn limited, trustworthy supervision into broad, reliable behavior. Reward modeling is a natural place to study the question, since the reward model is the compact artifact that encodes what counts as a good response.

A reward model is the object being trained here. It assigns a scalar score to a prompt and response pair, and preferences are derived from score differences using the standard Bradley-Terry formulation [4]. In reinforcement learning from human feedback, these scores stand in for human judgment, so the quality and robustness of the reward model shapes everything downstream.
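As a minimal sketch of the Bradley-Terry formulation mentioned above, the snippet below turns two scalar reward scores into a preference probability and a pairwise training loss. The function names are illustrative assumptions and do not come from the paper's code.

```python
import math


def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry probability that the first response is preferred.

    This is the logistic sigmoid of the score difference, computed in a
    numerically stable way for large positive or negative gaps.
    """
    gap = score_chosen - score_rejected
    if gap >= 0:
        return 1.0 / (1.0 + math.exp(-gap))
    exp_gap = math.exp(gap)
    return exp_gap / (1.0 + exp_gap)


def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood of the labeled preference, i.e. softplus(-gap)."""
    gap = score_chosen - score_rejected
    if gap >= 0:
        return math.log1p(math.exp(-gap))
    return -gap + math.log1p(math.exp(gap))
```

Only the difference between the two scores matters, which is why a reward model can shift all of its scores by a constant without changing any predicted preference.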

Two terms matter for the rest of this piece. In-distribution evaluation tests the model on held-out data from the same dataset it trained on. Out-of-distribution evaluation, often shortened to OOD, tests the model on a different dataset that shares the same broad goal, such as harmlessness, but differs in prompts, response styles, and annotation conventions. The study's central move is to insist on the second kind of test, which most prior weak-to-strong work skipped.

Note: why preference datasets differ even within one goal. Two harmlessness datasets can disagree in subtle ways. They use different prompts, different writing styles, and different labeling instructions. A reward model that latches onto the surface patterns of one dataset can look accurate there while missing the underlying notion of "harmless" that should carry across all of them.

How the study tested transfer

The authors define a zero-shot preference-domain shift protocol. Within a broad category such as helpfulness or harmlessness, they train a weak-to-strong reward model on one dataset, then evaluate it on the held-out split of that same dataset and on every other dataset in the category. No target-dataset examples are used during training, so target performance reflects genuine transfer rather than adaptation.
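The protocol can be pictured as filling in a source-by-target grid. The sketch below is one way to organize that loop. It assumes two hypothetical callables, `train_fn` and `eval_fn`, standing in for whatever training and evaluation code a team already has. It is not the authors' implementation.

```python
from typing import Callable, Dict, List, Tuple


def transfer_matrix(
    datasets: List[str],
    train_fn: Callable[[str], object],
    eval_fn: Callable[[object, str], float],
) -> Dict[Tuple[str, str], float]:
    """Train once per source dataset, then score on every dataset in the category.

    Keys are (source, target) pairs. The cell where source == target is the
    in-distribution result. Every other cell is zero-shot OOD, because the
    model never trains on any example from the target dataset.
    """
    results: Dict[Tuple[str, str], float] = {}
    for source in datasets:
        model = train_fn(source)  # hypothetical: trains on the source split only
        for target in datasets:
            # hypothetical: accuracy on the held-out split of the target
            results[(source, target)] = eval_fn(model, target)
    return results
```

With three datasets in a category, this yields nine cells: three in-distribution results on the diagonal and six transfer results off it.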

The datasets and models

For helpfulness, the study uses Anthropic Helpful from HH-RLHF, HelpSteer3-Preference, and UltraFeedback [5][7][8]. For harmlessness, it uses Anthropic Harmless from HH-RLHF, PKU-SafeRLHF, and RAIL, the values-grounded dataset from our own RAIL in the Wild research [5][6][3]. Each dataset takes a turn as the training source and as a held-out target. Experiments run on two weak-to-strong pairings: Llama-3.2-1B supervising Llama-3.1-8B, and Qwen3-1.7B supervising Qwen3-8B. All reward models are trained with LoRA, and results are averaged over three seeds [2].

Metrics that separate looking good from transferring well

Raw accuracy hides the problem the authors care about, so they report three transfer-aware metrics. Weak-to-Strong Raw Gain, or WRG, measures how much the strong student beats the weak teacher on the source dataset. Absolute OOD Gain, or AOG, measures that same student-over-teacher improvement on an unseen target dataset. Net Transfer Score, or NTS, subtracts any in-distribution regression from the OOD gain, so a model cannot earn credit for transfer that it bought by collapsing on its source domain [2].

Read together, the three metrics tell a fuller story. A method with high WRG but low AOG learned the source dataset, not the goal. A method with high AOG but a large in-distribution drop, reflected in a low NTS, traded away reliability where it was supposed to be strongest.
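To make the three metrics concrete, here is a small sketch under stated assumptions. WRG and AOG follow the descriptions above directly. For NTS, the precise formula is defined in the paper [2]. The version here assumes "in-distribution regression" means the amount by which WRG falls below zero, which may differ from the authors' definition. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Accuracies:
    """Pairwise accuracies for one (source, target) setting."""
    weak_source: float    # weak teacher on the source test split
    strong_source: float  # strong student on the source test split
    weak_target: float    # weak teacher on the unseen target dataset
    strong_target: float  # strong student on the unseen target dataset


def weak_to_strong_raw_gain(a: Accuracies) -> float:
    """WRG: student minus teacher on the source dataset."""
    return a.strong_source - a.weak_source


def absolute_ood_gain(a: Accuracies) -> float:
    """AOG: student minus teacher on the unseen target dataset."""
    return a.strong_target - a.weak_target


def net_transfer_score(a: Accuracies) -> float:
    """NTS (assumed form): AOG minus any in-distribution regression.

    Assumption: regression is max(0, -WRG), so only a negative source-domain
    gain is penalized. See the paper for the exact definition.
    """
    regression = max(0.0, -weak_to_strong_raw_gain(a))
    return absolute_ood_gain(a) - regression
```

Under this assumed form, a student that gains on the target but falls below the teacher on the source sees its transfer credit reduced by exactly the size of that drop.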

Key findings

Finding 1: in-distribution success can mask out-of-distribution failure

The central result is that strong in-distribution gains do not reliably predict transfer. Standard weak-to-strong training and a confidence-based variant both reach solid in-distribution numbers, then transfer unevenly to unseen datasets. In one harmlessness setting with the Qwen family, the standard method trained on Anthropic Harmless posted the best in-distribution raw gain in its group, yet transferred to the RAIL benchmark with an absolute OOD gain of only 0.52 [2]. The model had learned to score its own dataset well without carrying the safety signal across.

The pattern is not limited to safety data. A reward model trained on the HelpSteer3 helpfulness dataset performs well in-distribution, then loses substantial accuracy when judged on Anthropic Helpful, a different helpfulness set with its own style and labels [2]. Same goal, different distribution, weaker transfer.

Finding 2: the failure is representational, not just noisy labels

The authors argue the problem is not simply that weak labels are imperfect. Instead, fine-tuning on a single source dataset can pull the strong model's internal representations toward features specific to that dataset, away from the broadly useful preference representations the pretrained model already held. They call this representation drift, and they support it with the observation that preserving intermediate representations improves transfer. If the failure were only noisy labels, that intervention would not help as much as it does [2].

Finding 3: Representation Anchoring improves transfer without wrecking accuracy

Their proposed method, Representation Anchoring, adds a frozen copy of the pretrained strong model as a training-time reference. During fine-tuning, an anchoring term penalizes the student's response-token hidden states for drifting too far from the reference, while the usual preference loss still teaches the task. The reference is discarded at inference, so the deployed reward model keeps its standard form and adds no serving cost [2].

Concretely, the anchoring term measures the squared distance between the student's and the reference model's hidden states on the response tokens, averaged over those tokens, then adds it to the preference loss with a weight that controls its strength. The authors test two placements: anchoring the final layer, which sits closest to the scoring head, and anchoring middle layers, which leaves the upper layers freer to adapt. The final-layer version is the default, since it stores hidden states from only one layer and costs less [2].
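The description above can be sketched in a few lines of plain Python. This is an illustration, not the authors' code. It assumes hidden states arrive as one vector per token, that the squared distance is summed over the hidden dimension, and that a boolean mask marks the response tokens. The names `anchoring_term`, `anchored_loss`, and `weight` are hypothetical.

```python
from typing import Sequence


def anchoring_term(
    student: Sequence[Sequence[float]],
    reference: Sequence[Sequence[float]],
    response_mask: Sequence[bool],
) -> float:
    """Mean squared distance between student and reference hidden states,
    averaged over response tokens only (prompt tokens are skipped)."""
    total = 0.0
    count = 0
    for h_student, h_reference, is_response in zip(student, reference, response_mask):
        if not is_response:
            continue
        # Assumption: squared L2 distance, summed over the hidden dimension.
        total += sum((s - r) ** 2 for s, r in zip(h_student, h_reference))
        count += 1
    return total / count if count else 0.0


def anchored_loss(
    preference_loss: float,
    student: Sequence[Sequence[float]],
    reference: Sequence[Sequence[float]],
    response_mask: Sequence[bool],
    weight: float,
) -> float:
    """Preference loss plus the weighted anchoring penalty."""
    return preference_loss + weight * anchoring_term(student, reference, response_mask)
```

In a real training loop the reference hidden states would come from a frozen copy of the pretrained model run without gradients, and the same computation would be done on tensors rather than Python lists.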

The intuition is a balance. Let the model adapt to the weak preference signal, but do not let it distort the general features that make transfer possible. Across domains and both model families, the anchored model delivers the most consistent gains under both in-distribution and out-of-distribution evaluation.

Figure: Absolute out-of-distribution gain to the RAIL benchmark when training on Anthropic Harmless. Only the anchored model transfers above the weak teacher.

That single setting is the cleanest illustration, but it is not the whole picture, and the honest version includes a real trade-off. The table below shows the harmlessness results for the Llama student across two training sources, comparing the in-distribution raw gain with the gain transferred to RAIL.

Method	WRG (source: Anthropic Harmless)	AOG to RAIL (source: Anthropic Harmless)	WRG (source: PKU-SafeRLHF)	AOG to RAIL (source: PKU-SafeRLHF)
Naive W2S	3.09	0.00	4.91	2.26
Confidence-based	3.26	-0.31	5.08	2.47
SEAM	-6.66	-0.51	7.94	6.27
Anchor	3.39	0.31	5.49	4.62

Two patterns stand out. The anchored model is the only one that stays positive on both axes in both settings, holding its in-distribution gain while transferring to RAIL. SEAM, a baseline that uses the pretrained model to generate annotation rationales, can transfer strongly from PKU-SafeRLHF, reaching the highest gain to RAIL in that column, but it collapses in-distribution when trained on Anthropic Harmless, with a raw gain of -6.66 [2]. That is exactly the trade the Net Transfer Score is designed to catch.
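Both patterns can be checked mechanically from the table. The short script below copies the four reported values per method and tests which rows are strictly positive in every column. The variable and function names are illustrative, and nothing here goes beyond the numbers already shown.

```python
# Values copied from the table above, per method:
# (WRG on Anthropic Harmless, AOG to RAIL from Anthropic Harmless,
#  WRG on PKU-SafeRLHF,       AOG to RAIL from PKU-SafeRLHF)
RESULTS = {
    "Naive W2S": (3.09, 0.00, 4.91, 2.26),
    "Confidence-based": (3.26, -0.31, 5.08, 2.47),
    "SEAM": (-6.66, -0.51, 7.94, 6.27),
    "Anchor": (3.39, 0.31, 5.49, 4.62),
}


def positive_everywhere(row) -> bool:
    """True only if every in-distribution and transfer gain is above zero."""
    return all(value > 0 for value in row)


# Methods that gain on both axes in both training settings.
consistent = [name for name, row in RESULTS.items() if positive_everywhere(row)]

# Method with the largest transfer to RAIL when trained on PKU-SafeRLHF.
best_pku_transfer = max(RESULTS, key=lambda name: RESULTS[name][3])

print(consistent)
print(best_pku_transfer)
```

The first list contains only the anchored model, and the second lookup returns SEAM, matching the reading in the text.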

The trade-off in one view. "Looks aligned" means high accuracy on the dataset the model trained on: easy to report, easy to over-trust. "Aligned enough to transfer" means it holds up on independent datasets it never saw, measured by gains that survive an in-distribution regression check.

Finding 4: a lighter anchor transfers better than a heavy one

Two ablation studies clarify how the method behaves. Varying the anchoring weight across three settings shows that a lighter touch works best: as the coefficient decreases, in-distribution and out-of-distribution scores both improve consistently. Anchoring too hard pins the model to its pretrained features and blocks the adaptation it needs to learn the preference task [2].

The second ablation compares where to anchor. With a light coefficient, anchoring the final layer gives the strongest in-distribution gain and the best transfer to Anthropic Helpful, while a middle-layer variant edges it out on one target dataset [2]. The practical reading is that the last-layer default captures most of the benefit at the lowest cost, and that the choice of layer is a tuning knob rather than a make-or-break decision.

Where RAIL fits, and what this validates

RAIL appears in this study as an independent harmlessness benchmark, chosen alongside Anthropic Harmless and PKU-SafeRLHF [2]. That role matters. RAIL in the Wild built a measurable, eight-dimension framework for the normative behavior of language models and applied it to Anthropic's Values in the Wild dataset of more than 308,000 anonymized Claude conversations [3]. The result was a values-grounded preference signal rather than a narrow stylistic one.

When an external research team selects a dataset as one of the few benchmarks that define a safety category, it is treating that dataset as a credible, distinct measure of the underlying goal. The fact that several methods struggle to transfer to RAIL, while a method built specifically to preserve general preference features succeeds, is evidence that RAIL is testing something real and not redundant with the other harmlessness sets. For a framework whose purpose is to operationalize responsible AI evaluation, independent third-party use is among the strongest signals of utility.

The broader implication reaches every team that trains or audits reward models. If you evaluate only on your training distribution, you are measuring memorization as much as alignment. A more honest protocol trains on one source and reports transfer to independent datasets, ideally including a values-grounded set such as RAIL. This connects directly to the case for multidimensional measurement that runs through the eight dimensions of the RAIL framework, since a single aggregate score on a single dataset can hide failures that only appear under distribution shift.

In practice, this suggests a short checklist when validating a reward model. Train on your source data, then report transfer to at least one independent dataset in the same category. Prefer transfer-aware metrics that penalize source-domain regression, so a model cannot earn credit for gains it bought by getting worse where it should be strongest. And treat a values-grounded set as a distinct axis rather than a duplicate of style-based preference data. None of this asks you to abandon in-distribution accuracy. It asks only that a single number stop standing in for the whole claim.
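One way to turn that checklist into a routine gate is sketched below. The function name, the messages, and the rule that any non-positive gain counts as a failure are all assumptions made for illustration. Teams would set their own thresholds.

```python
from typing import Dict, List


def validation_issues(source_gain: float, ood_gains: Dict[str, float]) -> List[str]:
    """Return the checklist items a reward-model report fails.

    source_gain: student-over-teacher gain on the source dataset.
    ood_gains: student-over-teacher gain per independent target dataset.
    An empty list means the report passes this (illustrative) checklist.
    """
    issues: List[str] = []
    if not ood_gains:
        issues.append("no independent target dataset reported")
    if source_gain < 0:
        issues.append("in-distribution regression on the source dataset")
    for target, gain in ood_gains.items():
        if gain <= 0:
            issues.append(f"no positive transfer to {target}")
    return issues
```

For example, a report with a healthy source gain but no entries in `ood_gains` fails the first check, which is exactly the single-number claim the checklist is meant to rule out.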

Limitations and open questions

The authors are careful about scope, and that honesty is worth carrying into any summary. Their evaluation uses offline pairwise reward-model accuracy across a fixed set of helpfulness and harmlessness datasets. It does not measure best-of-n selection, online policy optimization, or the quality of open-ended generations that a reward model would ultimately shape [2]. Offline accuracy is a clean proxy, but it is still a proxy.

There is also a cost. Representation Anchoring adds a second forward pass through the frozen reference model during training, which raises memory and time requirements, even though it adds nothing at inference [2]. Finally, the experiments cover specific model scales and two preference categories. Whether the pattern and the fix hold across larger models and other normative axes, such as fairness or privacy, remains open. These are reasonable directions for follow-up work, including evaluations that bring more of RAIL's dimensions into the transfer test.

Conclusion

The clearest message of this work is methodological. In-distribution gains from weak-to-strong training can overstate how reliable a reward model will be once the data shifts, and the cause is partly representational rather than just noisy labels. Representation Anchoring offers a practical way to keep more of that reliability by preserving the pretrained model's general features during fine-tuning.

For practitioners, the takeaway is to test transfer, not just fit, and to include independent, values-grounded benchmarks when doing so. That RAIL serves as one of those benchmarks here is a small but real marker of the framework's role in the wider evaluation landscape.

About the author. Sumit Verma is a researcher at Responsible AI Labs, where he works on practical evaluation methods for the safety and values of language models. He is the lead author of RAIL in the Wild, which operationalized the eight-dimension RAIL framework on Anthropic's Values in the Wild dataset.

References

[1] Burns, C., et al. (2023). Weak-to-strong generalization: eliciting strong capabilities with weak supervision. arXiv:2312.09390. arxiv.org/abs/2312.09390

[2] Le, K., Cao, T., Nguyen, P., Nguyen, C.-D., Luu, A.-T., Chunyan, M., Ng, S.-K., and Nguyen, T. (2026). When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift. arXiv:2605.25629. arxiv.org/abs/2605.25629

[3] Verma, S., Prasun, P., Jaiswal, A., and Kumar, P. (2025). RAIL in the Wild: Operationalizing Responsible AI Evaluation Using Anthropic's Value Dataset. arXiv:2505.00204. arxiv.org/abs/2505.00204

[4] Bradley, R. A., and Terry, M. E. (1952). Rank analysis of incomplete block designs: the method of paired comparisons. Biometrika, 39(3/4), 324-345.

[5] Bai, Y., et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (HH-RLHF). arXiv:2204.05862. arxiv.org/abs/2204.05862

[6] Ji, J., et al. (2025). PKU-SafeRLHF: a safety alignment preference dataset.

[7] Wang, Z., et al. (2025). HelpSteer3-Preference: open human-annotated preference data across diverse tasks and languages.

[8] Cui, G., et al. (2024). UltraFeedback: boosting language models with scaled AI feedback. arXiv:2310.01377. arxiv.org/abs/2310.01377