Evaluation

Evaluation is the core primitive of RAIL. You give it a piece of AI-generated text and it returns a score from 0 to 10 for each of 8 responsible-AI dimensions, plus an overall RAIL score.

How RAIL Scoring Works

[Diagram: AI-generated text (your content) flows into the RAIL Score API, which evaluates it against 8 responsible-AI criteria (Fairness, Safety, Reliability, Transparency, Privacy, Accountability, Inclusivity, User Impact) and returns 8 dimension scores (0–10) plus an overall RAIL score (0–10).]

Evaluation observes — it doesn't change. It tells you how responsible a piece of content is. It doesn't modify the content or block it. For automatic improvement, see Safe Regeneration. For enforcement, see Policy Engine.

The 8 RAIL Dimensions

Each dimension is scored independently on a 0–10 scale. The overall RAIL score is a weighted average across all 8.

| Dimension | What it measures | What it catches |
| --- | --- | --- |
| Fairness | Equitable treatment across demographic groups | Bias, stereotyping, double standards, differential treatment based on race, gender, religion, or other characteristics |
| Safety | Prevention of harmful, toxic, or dangerous content | Harmful instructions, insufficient warnings, toxic or violent content, promotion of self-harm |
| Reliability | Factual accuracy and appropriate epistemic calibration | Hallucinations, fabricated citations, factual errors stated as fact, inappropriate certainty on uncertain claims |
| Transparency | Honest communication of reasoning, limitations, and AI nature | Concealed limitations, fabricated reasoning, misleading certainty, failure to disclose when relevant |
| Privacy | Protection of personal information and sensitive data | PII exposure, unnecessary data disclosure, surveillance facilitation, insecure data handling suggestions |
| Accountability | Traceable reasoning that can be audited and corrected | Opaque conclusions without basis, circular reasoning, discouraging scrutiny or correction |
| Inclusivity | Accessible, inclusive language for diverse users | Exclusionary language, unexplained jargon, cultural assumptions, unnecessarily gendered defaults |
| User Impact | Positive value delivered relative to the user's actual need | Failing to address the real question, wrong level of detail, tone mismatch, unjustified refusals |
Privacy special case: When privacy is not applicable to a prompt/response (e.g., a question about JavaScript syntax), the score is forced to exactly 5.0 (neutral). This prevents privacy from unfairly dragging down the overall score in non-privacy contexts.
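As an illustration, the weighted average over the 8 dimensions can be sketched as follows. Equal weights are an assumption here; the service's actual per-dimension weights are not documented on this page. The forced-neutral privacy score of 5.0 is applied as described above.

```python
# Sketch of the overall RAIL score as a weighted average of the 8 dimensions.
# Equal weights are an ASSUMPTION; the service's actual weights are not
# documented here.
DIMENSIONS = ["fairness", "safety", "reliability", "transparency",
              "privacy", "accountability", "inclusivity", "user_impact"]

def overall_score(scores, weights=None):
    """Weighted average of the 8 dimension scores (0-10 each)."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    total = sum(weights[d] for d in DIMENSIONS)
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total

scores = {d: 8.0 for d in DIMENSIONS}
scores["privacy"] = 5.0  # forced neutral: privacy not applicable to this prompt
print(overall_score(scores))  # 7.625
```

With equal weights, the single neutral 5.0 pulls the overall down only modestly, which is exactly why the special case exists: a forced 0 would dominate the average in non-privacy contexts.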

How Scoring Works Internally

[Diagram: Evaluation Pipeline. Content + Parameters enter the Scoring Layer, which runs the ML Classifier (8 dimensions), the Safety Check (toxicity signals), the Privacy NLP (PII detection), and, in deep mode only, the LLM Judge (explanations + issues). Outputs: Overall Score (weighted avg), 8 dimension scores (0–10 each), Confidence (0–1 per dimension), and Explanations (deep only).]

The scoring pipeline runs in layers. Every evaluation — basic or deep — starts with the same core ML + NLP layer:

  1. DeBERTa-v3 classifier produces raw dimension scores from the content alone using an ONNX-optimised model — fast and deterministic.
  2. Perspective API augments the safety dimension with trained toxicity detection across several content categories.
  3. spaCy NLP analyses the privacy dimension by detecting PII patterns and data-handling language.
  4. Deep mode only: GPT-4o-mini generates natural-language explanations for each dimension score, identifies specific issues, and produces actionable improvement suggestions.
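The four layers above can be condensed into a single illustrative flow. Every helper below is a fixed-value stand-in for the real component (the DeBERTa-v3 classifier, Perspective API, spaCy, GPT-4o-mini), and how the signals combine (e.g., scaling safety by the toxicity signal) is an assumption, not the service's actual logic:

```python
# Illustrative sketch of the layered pipeline. All helpers are stand-ins;
# the combination logic is an ASSUMPTION for illustration only.
DIMS = ["fairness", "safety", "reliability", "transparency",
        "privacy", "accountability", "inclusivity", "user_impact"]

def classifier_scores(content):   # stand-in for the ONNX DeBERTa-v3 model
    return {d: 7.0 for d in DIMS}

def toxicity_signal(content):     # stand-in for Perspective API (0 = benign)
    return 0.1

def privacy_score(content):       # stand-in for spaCy PII analysis
    return 5.0

def score_content(content, mode="basic"):
    scores = classifier_scores(content)                 # layer 1: all 8 dims
    scores["safety"] *= 1.0 - toxicity_signal(content)  # layer 2: safety augmented
    scores["privacy"] = privacy_score(content)          # layer 3: privacy NLP
    result = {"dimension_scores": scores}
    if mode == "deep":
        # layer 4 (deep only): an LLM judge would add explanations here
        result["explanations"] = {d: "..." for d in DIMS}
    return result

print(round(score_content("example")["dimension_scores"]["safety"], 2))  # 6.3
```

The point of the layering is that basic and deep mode share layers 1–3, so deep mode only adds LLM latency and cost on top of the same deterministic core.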

Basic vs Deep Mode

| Feature | Basic | Deep |
| --- | --- | --- |
| Score per dimension | ✓ | ✓ |
| Confidence per dimension | ✓ | ✓ |
| Overall RAIL score | ✓ | ✓ |
| Per-dimension explanations | — | ✓ |
| Issue detection | — | ✓ |
| Improvement suggestions | — | ✓ |
| Typical latency | ~200ms | ~2–4s |
| Credit cost (all 8 dimensions) | 1.0 | 3.0 |

Use basic mode when:

  • Scoring high volumes of content (cost-sensitive)
  • Gating responses in real-time with a threshold check
  • Monitoring aggregate quality trends
  • Running inside a safe-regenerate loop

Use deep mode when:

  • Debugging why specific content scores poorly
  • Surfacing issues for human review
  • Building an audit trail with explanations
  • Evaluating samples for quality monitoring
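A minimal sketch of the real-time gating use case with basic mode. Only `client.eval(content=..., mode=...)` mirrors the call documented on this page; the stub client, the `rail_score.overall` field name, and the 7.0 threshold are assumptions for illustration:

```python
# Sketch of a real-time threshold gate using basic mode. The stub client and
# the `rail_score.overall` attribute are ASSUMPTIONS for illustration.
class _StubRailScore:
    overall = 8.2          # pretend overall RAIL score

class _StubResult:
    rail_score = _StubRailScore()

class _StubClient:
    def eval(self, content, mode="basic", dimensions=None):
        return _StubResult()   # a real client would call the RAIL Score API

client = _StubClient()
THRESHOLD = 7.0  # gate at the "Good" tier (>= 7.0)

result = client.eval(content="candidate response", mode="basic")
allowed = result.rail_score.overall >= THRESHOLD
print(allowed)  # True
```

A common pattern is to gate with cheap basic-mode checks and re-run only the failures in deep mode to collect explanations for human review.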

Score Tiers

Every dimension score and the overall RAIL score map to a human-readable tier. The SDK exposes these via result.rail_score.summary (both SDKs) and getScoreLabel(score) (JavaScript SDK).

| Range | Tier | Meaning |
| --- | --- | --- |
| ≥ 9.0 | Excellent | Meets the highest responsible-AI standards |
| ≥ 7.0 | Good | Safe and responsible — minor improvements possible |
| ≥ 5.0 | Needs Improvement | Issues present — review before production use |
| ≥ 3.0 | Poor | Significant concerns — substantial revision needed |
| < 3.0 | Critical | Serious violations — block from production immediately |
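The tier mapping above takes only a few lines. This mirrors what `getScoreLabel(score)` returns in the JavaScript SDK; the Python function name here is illustrative, not part of the SDK:

```python
# Tier mapping from the table above. Function name is illustrative;
# the JavaScript SDK exposes this as getScoreLabel(score).
def score_tier(score):
    if score >= 9.0:
        return "Excellent"
    if score >= 7.0:
        return "Good"
    if score >= 5.0:
        return "Needs Improvement"
    if score >= 3.0:
        return "Poor"
    return "Critical"

print(score_tier(7.4))  # Good
print(score_tier(2.1))  # Critical
```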

Scoring Specific Dimensions

You can evaluate a subset of dimensions to reduce cost and latency. Pass a dimensions array with any combination of the 8 dimension keys.

# Only score safety and fairness — cost: min(0.3 × 2, 1.0) = 0.6 credits
result = client.eval(
    content="...",
    mode="basic",
    dimensions=["safety", "fairness"]
)

# The response only includes the requested dimensions
print(result.dimension_scores["safety"].score)
print(result.dimension_scores["fairness"].score)
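The inline cost comment above implies a simple basic-mode pricing rule (0.3 credits per dimension, capped at the 1.0 full price). This sketch is inferred from that comment; deep-mode subset pricing is not documented here:

```python
# Basic-mode credit cost inferred from the inline comment above:
# 0.3 credits per requested dimension, capped at the all-8 price of 1.0.
# Deep-mode subset pricing is not documented on this page.
def basic_credit_cost(n_dimensions):
    return min(0.3 * n_dimensions, 1.0)

print(basic_credit_cost(2))  # 0.6
print(basic_credit_cost(8))  # 1.0
```

Because of the cap, requesting 4 or more dimensions in basic mode costs the same as requesting all 8.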

Caching

Identical evaluation requests within the cache window return the cached result immediately at zero credit cost. The cache key is a hash of content + mode + dimensions.

| Mode | Cache window | Indicator |
| --- | --- | --- |
| Basic | 5 minutes | from_cache: true |
| Deep | 3 minutes | from_cache: true |
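The cache key is documented only as a hash of content + mode + dimensions. A sketch of such a key follows; the hash algorithm (SHA-256), the JSON serialisation, and sorting the dimension list are all assumptions:

```python
import hashlib
import json

# Sketch of the documented cache key (hash of content + mode + dimensions).
# SHA-256, JSON serialisation, and dimension sorting are ASSUMPTIONS.
def cache_key(content, mode, dimensions=None):
    payload = json.dumps(
        {"content": content, "mode": mode,
         "dimensions": sorted(dimensions) if dimensions else None},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Identical requests map to the same key, so a repeat call within the
# cache window is served at zero credit cost.
k1 = cache_key("hello", "basic", ["safety", "fairness"])
k2 = cache_key("hello", "basic", ["fairness", "safety"])
print(k1 == k2)  # True
```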
