
RAIL-HH-10K: The First Large-Scale Multi-Dimensional Safety Dataset

The First Large-Scale Safety Dataset with 99.5% Multi-Dimensional Annotation Coverage

RAIL Research Team
November 3, 2025
12 min read

As organisations accelerate the deployment of generative AI, the ethical performance of these systems is no longer a peripheral concern; it is a key component of product quality and brand trust. Responsible AI Labs' RAIL-HH-10K dataset was released to operationalise this ethical evaluation, offering 10,000 conversational examples annotated across eight ethical dimensions—fairness, safety, reliability, transparency, privacy, accountability, inclusivity and user-impact—plus an overall RAIL score.

The dataset card positions RAIL-HH-10K as the first large-scale safety dataset with 99.5% multi-dimensional annotation coverage, a step change from previous datasets that covered only 40–70% of relevant norms. With open access under an MIT licence, it provides an invaluable foundation for reinforcement learning from human feedback (RLHF), direct preference optimisation (DPO) and broader responsible-AI research.

The 8 Dimensions of the RAIL Score

┌─────────────────────────────────────────────────────────────┐
│                    RAIL-HH-10K Dataset                      │
│                    10,000 Examples                          │
│                 99.5% Coverage Across                       │
└─────────────────────────────────────────────────────────────┘
                             │
                             ▼
        ┌────────────────────────────────────────┐
        │    8 Ethical Dimensions (0-10 each)    │
        └────────────────────────────────────────┘
                             │
        ┌────────────────────┴────────────────────┐
        │                                         │
        ▼                                         ▼
┌──────────────┐                          ┌──────────────┐
│   Fairness   │                          │    Safety    │
│   Reliability│                          │ Transparency │
│     Privacy  │                          │Accountability│
│  Inclusivity │                          │ User Impact  │
└──────────────┘                          └──────────────┘
        │                                         │
        └────────────────────┬────────────────────┘
                             ▼
                  ┌──────────────────┐
                  │  Overall Score   │
                  │   (0-10 scale)   │
                  └──────────────────┘

What's in the dataset?

Each row captures a unique dialogue scenario: a context (previous turns), a user prompt, a rejected answer and a chosen answer, along with scores and explanations for each ethical dimension and the overall RAIL score.

Dataset Structure:

Split      | Size       | Features   | Notes
Train      | 8,000 rows | 73 columns | Primary corpus used to model ethical preferences
Validation | 1,000 rows | 73 columns | Held-out set for hyper-parameter tuning
Test       | 1,000 rows | 73 columns | Final evaluation set

On average, contexts are ~56 words and prompts ~13 words, while rejected answers are ~56 words and chosen answers ~38 words. Shorter responses often correlate with higher ethical scores, suggesting that conciseness reinforces clarity and safety.
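
For readers who want to verify these figures, the sketch below loads the dataset with the Hugging Face datasets library and recomputes split sizes and average word counts. The repository id and the column names (context, prompt, rejected_response, chosen_response) are illustrative assumptions; consult the dataset card for the exact 73-column schema.

```python
# Exploration sketch -- the repo id and column names are assumptions,
# not the confirmed schema; check the dataset card before running.
from datasets import load_dataset

REPO_ID = "responsible-ai-labs/RAIL-HH-10K"  # hypothetical repository id

def mean_word_count(dataset, column):
    """Average whitespace-delimited word count for one text column."""
    lengths = [len((text or "").split()) for text in dataset[column]]
    return sum(lengths) / len(lengths)

ds = load_dataset(REPO_ID)        # expected splits: train / validation / test
train = ds["train"]

print({split: ds[split].num_rows for split in ds})   # expected 8,000 / 1,000 / 1,000
print(len(train.column_names), "columns")            # dataset card reports 73

for col in ["context", "prompt", "rejected_response", "chosen_response"]:
    print(f"{col}: ~{mean_word_count(train, col):.0f} words on average")
```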

Quantifying the ethical uplift

To understand how human feedback improves AI responses, we aggregated scores across all 10k examples. The comparison shows the average rejected score, chosen score and improvement for each ethical dimension.
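
The aggregation behind this comparison is straightforward once the per-dimension score columns sit in a dataframe. The sketch below shows one way to reproduce the summary, including the share of positive improvements and the rejected-score-versus-improvement correlation discussed below; the column naming convention (<dimension>_rejected_score / <dimension>_chosen_score, including an "overall" pair) is an assumption, so adapt it to the real schema.

```python
# Aggregation sketch -- column names follow an assumed
# "<dimension>_rejected_score" / "<dimension>_chosen_score" pattern.
import pandas as pd

DIMENSIONS = [
    "overall", "fairness", "safety", "reliability",
    "transparency", "privacy", "accountability", "inclusivity", "user_impact",
]

def summarise(df: pd.DataFrame) -> pd.DataFrame:
    """Mean rejected/chosen scores, mean improvement and share of positive improvements."""
    rows = []
    for dim in DIMENSIONS:
        rejected = df[f"{dim}_rejected_score"]
        chosen = df[f"{dim}_chosen_score"]
        improvement = chosen - rejected
        rows.append({
            "dimension": dim,
            "mean_rejected": rejected.mean(),
            "mean_chosen": chosen.mean(),
            "mean_improvement": improvement.mean(),
            "share_positive": (improvement > 0).mean(),
            # A negative value here reproduces the "low scores improve most" effect.
            "corr_rejected_vs_improvement": rejected.corr(improvement),
        })
    return pd.DataFrame(rows)

# Example usage with a dataframe built from the train split:
# df = load_dataset(REPO_ID, split="train").to_pandas()
# print(summarise(df).round(2))
```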

Key takeaways:

  • The overall RAIL score increases by ~2.24 points (on a 0–10 scale) when moving from the rejected to the chosen answer, with 98% of examples improving.
  • Safety and user-impact see the largest gains (+3.50 and +3.18 points). Accountability and fairness follow closely, reflecting substantial improvements in how responsibly the assistant addresses harmful or illegal requests.
  • Transparency and privacy show more modest improvements (+1.24 and +1.53 points) but still benefit from the curation process. Even high-scoring dimensions like privacy have room for optimisation.
  • A strong negative correlation (≈ –0.78) between rejected scores and their improvements indicates that the lowest-scoring answers benefit most from human intervention.
By-dimension summary

The table below quantifies the average rejected score, chosen score, mean improvement and the proportion of examples where the improvement is positive. Higher numbers indicate better ethical quality.

Dimension      | Mean Rejected Score | Mean Chosen Score | Mean Improvement | Share of Positive Improvements
Overall RAIL   | 4.42                | 6.66              | +2.24            | 98%
Fairness       | 4.36                | 6.95              | +2.60            | 74.7%
Safety         | 3.31                | 6.82              | +3.50            | 85.1%
Reliability    | 4.40                | 6.42              | +2.02            | 80.8%
Transparency   | 4.48                | 5.72              | +1.24            | 70.7%
Privacy        | 6.54                | 8.07              | +1.53            | 68.0%
Accountability | 3.46                | 5.78              | +2.32            | 85.5%
Inclusivity    | 4.34                | 6.27              | +1.93            | 78.2%
User-impact    | 3.39                | 6.57              | +3.18            | 88.2%

These numbers showcase the "bang for the buck" delivered by human feedback: even dimensions with relatively high rejected scores (e.g., privacy) still exhibit meaningful gains, while weaker dimensions (safety, user-impact) see dramatic improvements.

Correlations across ethical dimensions

Do improvements in one dimension correlate with gains in others? To explore this, we computed the correlation matrix of improvement values across the eight dimensions.
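
A minimal sketch of that computation, reusing the assumed <dimension>_rejected_score / <dimension>_chosen_score column naming from the earlier snippets:

```python
# Correlation of per-dimension improvements -- column naming is assumed, not confirmed.
import pandas as pd

DIMENSIONS = [
    "fairness", "safety", "reliability", "transparency",
    "privacy", "accountability", "inclusivity", "user_impact",
]

def improvement_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlations between per-dimension improvement values."""
    improvements = pd.DataFrame({
        dim: df[f"{dim}_chosen_score"] - df[f"{dim}_rejected_score"]
        for dim in DIMENSIONS
    })
    return improvements.corr()

# Example usage:
# corr = improvement_correlations(df)
# print(corr.loc["fairness", ["reliability", "accountability"]])  # expected to be relatively high
```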

Several patterns emerge:

  • Fairness, reliability and accountability improvements are highly correlated, suggesting that interventions that reduce bias also enhance trustworthiness and clarify responsibility.
  • Safety and user-impact improvements correlate, reflecting shared underlying risks (violence, illegal behaviour, harm to people). Addressing one dimension often mitigates the other.
  • Privacy shows weaker correlation with other dimensions, indicating that privacy concerns can be addressed independently without strongly affecting other ethical attributes.
Beyond numbers: reducing ethical issues

Numeric scores are complemented by natural-language explanations and lists of issues. On average, rejected answers contain ~1.9 issues per dimension, while chosen answers reduce this count by 0.5–0.86 issues, depending on the category (a counting sketch follows the issue list below).

The most frequent problems flagged in rejected answers include:

  • Fairness & user-impact: "Encourages violence," "Promotes violence," "Encourages theft," "Encourages harmful behaviour."
  • Safety: "Encourages dangerous behaviour/actions" and "Encourages harmful behaviour."
  • Reliability: "Misleading information" and "Potential misinformation."
  • Transparency: "Ambiguous intent," "Lack of context" or "No context provided."
  • Privacy: "Potential misuse of information" and "Encourages invasion of privacy."
  • Accountability: "No responsibility taken for consequences" and "Encourages evasion of law."
By systematically removing these failure modes, the chosen responses demonstrate that models can be steered toward safer, more responsible behaviour without sacrificing utility.
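
The counting sketch below makes this analysis concrete. It tallies average issue counts and the most frequent issue strings, assuming hypothetical list-valued columns named <dimension>_rejected_issues and <dimension>_chosen_issues; adjust the names to the real schema.

```python
# Issue-count sketch -- the list-valued "<dimension>_rejected_issues" /
# "<dimension>_chosen_issues" columns are an assumed naming convention.
from collections import Counter

DIMENSIONS = [
    "fairness", "safety", "reliability", "transparency",
    "privacy", "accountability", "inclusivity", "user_impact",
]

def issue_stats(rows):
    """Average issues per dimension for rejected vs chosen answers, plus frequent issue strings."""
    totals = {"rejected": Counter(), "chosen": Counter()}
    frequent = Counter()
    n = 0
    for row in rows:
        n += 1
        for dim in DIMENSIONS:
            rejected_issues = row.get(f"{dim}_rejected_issues") or []
            chosen_issues = row.get(f"{dim}_chosen_issues") or []
            totals["rejected"][dim] += len(rejected_issues)
            totals["chosen"][dim] += len(chosen_issues)
            frequent.update(rejected_issues)
    per_dim = {
        dim: {
            "rejected_avg": totals["rejected"][dim] / n,
            "chosen_avg": totals["chosen"][dim] / n,
        }
        for dim in DIMENSIONS
    }
    return per_dim, frequent.most_common(10)

# Example usage on the train split (iterating a datasets.Dataset yields dicts):
# per_dim, top_issues = issue_stats(load_dataset(REPO_ID, split="train"))
```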

Illustrative example: addressing stereotypes with nuance

Numbers alone do not convey how transformative ethical curation can be. To illustrate, consider a fictionalised scenario inspired by the dataset's structure (the real dataset contains sensitive language, so this example is paraphrased).

Context
  Sanitised content: Human: "I'm frustrated because a colleague consistently arrives late and it's affecting the team." Assistant: "That's a tough situation—have you spoken with them about the impact on others?"
  Commentary: Sets up a delicate workplace issue without attributing negative traits to any group.

Prompt
  Sanitised content: "Should I call them out in front of the team to make a point about punctuality?"
  Commentary: The user is seeking guidance on confronting the issue.

Rejected answer
  Sanitised content: "Yes, embarrassing them publicly will teach them a lesson and show everyone else that tardiness won't be tolerated."
  Commentary: This response endorses shaming, lacks empathy and fails fairness and safety checks.

Chosen answer
  Sanitised content: "It's understandable to want accountability, but public shaming often backfires and erodes trust. A private conversation to understand their challenges and set clear expectations is more constructive and supports a respectful team culture."
  Commentary: The preferred response de-escalates tension, promotes empathy and accountability, and aligns with fairness and user-impact goals.

In this example, the rejected answer models the kind of unhelpful advice that can surface in uncensored responses, while the chosen answer exemplifies the improvements captured by RAIL-HH-10K annotations. The stronger response emphasises psychological safety and constructive feedback—traits that contribute to higher scores across the fairness, safety, accountability and user-impact dimensions.

Why RAIL-HH-10K matters

  • Near-complete ethical coverage: With annotations covering almost every dimension for every example, researchers can model multi-objective trade-offs rather than focusing on single metrics.
  • Rich contextual information: Scores, confidences, explanations and issue lists enable both quantitative and qualitative analyses, facilitating the development of interpretable reward models and evaluators.
  • Open and adaptable: The MIT licence and open distribution make it easy to integrate into RLHF pipelines, comparative benchmarks or fairness audits.
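
As an illustration of the "open and adaptable" point, the sketch below maps rows into the generic (prompt, chosen, rejected) triplet format that most preference-optimisation libraries (for example, TRL-style DPO trainers) consume. The input column names are assumptions about the schema, and the way context and prompt are concatenated is one reasonable choice rather than a prescribed one.

```python
# Preference-pair mapping sketch -- input column names ("context", "prompt",
# "chosen_response", "rejected_response") are assumptions about the schema.
from datasets import load_dataset

def to_preference_format(example: dict) -> dict:
    """Flatten one RAIL-HH-10K row into a generic (prompt, chosen, rejected) triplet."""
    context = (example.get("context") or "").strip()
    prompt = (example.get("prompt") or "").strip()
    # One reasonable way to fold prior turns into the prompt; not a prescribed template.
    if context:
        full_prompt = f"{context}\n\nHuman: {prompt}\n\nAssistant:"
    else:
        full_prompt = f"Human: {prompt}\n\nAssistant:"
    return {
        "prompt": full_prompt,
        "chosen": example.get("chosen_response") or "",
        "rejected": example.get("rejected_response") or "",
    }

# Example usage: keep only the three columns most preference trainers expect.
# train = load_dataset(REPO_ID, split="train")
# dpo_train = train.map(to_preference_format, remove_columns=train.column_names)
```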
Implications for practitioners

  • Integrate multi-dimensional rewards. Models trained on single-objective rewards may miss safety, fairness or accountability nuances. Incorporating all eight dimensions yields more holistic behaviours (see the sketch after this list).
  • Prioritise low-performing areas. Safety, user-impact and accountability show the greatest room for improvement. Focusing data collection and reward shaping on these areas can accelerate progress.
  • Use brevity as a heuristic. Encouraging concise, direct answers may enhance safety and transparency while reducing the risk of hallucinations or harmful tangents.
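
To make the first recommendation concrete (and to anticipate the configurable weights discussed in the next section), the sketch below collapses the eight per-dimension scores into a single scalar reward with weights an organisation can tune. The dimension names mirror the dataset; the default weights are placeholders, not recommended values.

```python
# Weighted multi-dimensional reward sketch -- the weights below are illustrative
# placeholders; organisations are expected to set their own.
from typing import Dict

DEFAULT_WEIGHTS: Dict[str, float] = {
    "fairness": 1.0, "safety": 1.5, "reliability": 1.0, "transparency": 1.0,
    "privacy": 1.0, "accountability": 1.25, "inclusivity": 1.0, "user_impact": 1.5,
}

def rail_reward(scores: Dict[str, float], weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted average of per-dimension scores (each on a 0-10 scale), normalised back to 0-10."""
    total_weight = sum(weights.values())
    return sum(weights[dim] * scores.get(dim, 0.0) for dim in weights) / total_weight

# Example: a response that is safe and private but less transparent.
example_scores = {
    "fairness": 7.0, "safety": 8.5, "reliability": 6.5, "transparency": 5.5,
    "privacy": 8.0, "accountability": 6.0, "inclusivity": 6.5, "user_impact": 7.5,
}
print(round(rail_reward(example_scores), 2))
```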
Addressing cultural context

Note: RAIL-HH-10K currently reflects Western/English perspectives.

Our solution:

  • Configurable weights — organisations can adjust dimension weights to reflect their own ethical context.
  • Universal core — threats and severe harms (e.g., violence, exploitation) carry high global consensus.
  • Local fine-tuning — add regional data and culturally specific scenarios for better contextual accuracy.
We don't claim "universal" standards; rather, RAIL-HH-10K offers a starting framework that teams can adapt. Next steps include expanding to multilingual datasets and diversifying annotation teams to ensure broader cultural alignment.

Conclusion

RAIL-HH-10K exemplifies how structured human feedback can measurably enhance the ethical quality of AI systems. By leveraging multi-dimensional annotations, organisations can go beyond simple toxicity filters and build reward models that optimise for fairness, safety, reliability and more, all at once.

The dataset's strong improvements across most dimensions, coupled with a reduction in harmful issues, illustrate that responsible AI is not an abstract ideal but an achievable engineering objective. As you evaluate or fine-tune conversational models, RAIL-HH-10K provides a robust benchmark and a practical toolkit for aligning AI behaviour with your organisation's ethical commitments.

Access the dataset: RAIL-HH-10K on Hugging Face

Learn more: Dataset Documentation