# Evaluation
Evaluation is the core primitive of RAIL. You give it a piece of AI-generated text and it returns a score from 0 to 10 for each of 8 responsible-AI dimensions, plus an overall RAIL score.
## How RAIL Scoring Works

Your content is scored against 8 responsible-AI criteria, producing 8 dimension scores (0–10) and a single 0–10 overall RAIL score.
Evaluation observes — it doesn't change. It tells you how responsible a piece of content is. It doesn't modify the content or block it. For automatic improvement, see Safe Regeneration. For enforcement, see Policy Engine.
## The 8 RAIL Dimensions
Each dimension is scored independently on a 0–10 scale. The overall RAIL score is a weighted average across all 8.
| Dimension | What it measures | What it catches |
|---|---|---|
| Fairness | Equitable treatment across demographic groups | Bias, stereotyping, double standards, differential treatment based on race, gender, religion, or other characteristics |
| Safety | Prevention of harmful, toxic, or dangerous content | Harmful instructions, insufficient warnings, toxic or violent content, promotion of self-harm |
| Reliability | Factual accuracy and appropriate epistemic calibration | Hallucinations, fabricated citations, factual errors stated as fact, inappropriate certainty on uncertain claims |
| Transparency | Honest communication of reasoning, limitations, and AI nature | Concealed limitations, fabricated reasoning, misleading certainty, failure to disclose when relevant |
| Privacy | Protection of personal information and sensitive data | PII exposure, unnecessary data disclosure, surveillance facilitation, insecure data handling suggestions |
| Accountability | Traceable reasoning that can be audited and corrected | Opaque conclusions without basis, circular reasoning, discouraging scrutiny or correction |
| Inclusivity | Accessible, inclusive language for diverse users | Exclusionary language, unexplained jargon, cultural assumptions, unnecessarily gendered defaults |
| User Impact | Positive value delivered relative to the user's actual need | Failing to address the real question, wrong level of detail, tone mismatch, unjustified refusals |
When content contains no privacy-relevant material, the privacy dimension defaults to 5.0 (neutral). This prevents privacy from unfairly dragging down the overall score in non-privacy contexts.

## How Scoring Works Internally
[Diagram: evaluation pipeline. The scoring layer produces 8 dimension scores (0–10 each) with a 0–1 confidence per dimension, drawing on toxicity signals and PII detection; a weighted average yields the overall score, and explanations + issues are generated in deep mode only.]
The scoring pipeline runs in layers. Every evaluation — basic or deep — starts with the same core ML + NLP layer:
1. DeBERTa-v3 classifier produces raw dimension scores from the content alone using an ONNX-optimised model — fast and deterministic.
2. Perspective API augments the safety dimension with trained toxicity detection across several content categories.
3. spaCy NLP analyses the privacy dimension by detecting PII patterns and data-handling language.
4. Deep mode only: GPT-4o-mini generates natural-language explanations for each dimension score, identifies specific issues, and produces actionable improvement suggestions.
## Basic vs Deep Mode
| Feature | Basic | Deep |
|---|---|---|
| Score per dimension | ✓ | ✓ |
| Confidence per dimension | ✓ | ✓ |
| Overall RAIL score | ✓ | ✓ |
| Per-dimension explanations | — | ✓ |
| Issue detection | — | ✓ |
| Improvement suggestions | — | ✓ |
| Typical latency | ~200ms | ~2–4s |
| Credit cost (all 8 dimensions) | 1.0 | 3.0 |
Use basic mode when:

- Scoring high volumes of content (cost-sensitive)
- Gating responses in real time with a threshold check
- Monitoring aggregate quality trends
- Running inside a safe-regenerate loop
Use deep mode when:

- Debugging why specific content scores poorly
- Surfacing issues for human review
- Building an audit trail with explanations
- Evaluating samples for quality monitoring
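A real-time threshold gate over basic-mode scores can be as simple as the sketch below. The `gate_content` name and the 7.0 default are illustrative; the `evaluate` callable stands in for a real call that returns the overall 0–10 score (e.g. a wrapper around `client.eval(content, mode="basic")`), and passing it in keeps the gate testable.

```python
def gate_content(content: str, evaluate, threshold: float = 7.0) -> dict:
    """Gate content on its overall RAIL score.

    `evaluate` maps content -> overall score (0-10). The 7.0 default
    corresponds to the "Good" tier in the table below.
    """
    score = evaluate(content)
    return {"passed": score >= threshold, "score": score}
```
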
## Score Tiers
Every dimension score and the overall RAIL score map to a human-readable tier. Both SDKs expose these via `result.rail_score.summary`; the JavaScript SDK also provides `getScoreLabel(score)`.
| Range | Tier | Meaning |
|---|---|---|
| ≥ 9.0 | Excellent | Meets the highest responsible-AI standards |
| ≥ 7.0 | Good | Safe and responsible — minor improvements possible |
| ≥ 5.0 | Needs Improvement | Issues present — review before production use |
| ≥ 3.0 | Poor | Significant concerns — substantial revision needed |
| < 3.0 | Critical | Serious violations — block from production immediately |
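The tier boundaries above reduce to a simple cascade. This is a sketch of the mapping logic, not the SDK's `getScoreLabel` implementation:

```python
def score_label(score: float) -> str:
    """Map a 0-10 RAIL score to its human-readable tier."""
    if score >= 9.0:
        return "Excellent"
    if score >= 7.0:
        return "Good"
    if score >= 5.0:
        return "Needs Improvement"
    if score >= 3.0:
        return "Poor"
    return "Critical"
```
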
## Scoring Specific Dimensions
You can evaluate a subset of dimensions to reduce cost and latency. Pass a dimensions array with any combination of the 8 dimension keys.
```python
# Only score safety and fairness — cost: min(0.3 × 2, 1.0) = 0.6 credits
result = client.eval(
    content="...",
    mode="basic",
    dimensions=["safety", "fairness"]
)

# The response only includes the requested dimensions
print(result.dimension_scores["safety"].score)
print(result.dimension_scores["fairness"].score)
```

## Caching
Identical evaluation requests within the cache window return the cached result immediately at zero credit cost. The cache key is a hash of content + mode + dimensions.
| Mode | Cache window | Indicator |
|---|---|---|
| Basic | 5 minutes | from_cache: true |
| Deep | 3 minutes | from_cache: true |
## Next Steps
- Safe Regeneration → Use evaluation scores to automatically improve content
- Evaluation API Reference → Full parameter reference and request/response schemas
- Python SDK: Evaluation → SDK methods, custom weights, and middleware patterns
- JavaScript SDK: Evaluation → TypeScript-first client API with full type support