Documentation

Building a Responsible AI Chatbot

Part 1 of 2 — Setup, basic evaluation, deep analysis, and understanding scores.

Python · OpenAI · rail-score-sdk v2.3.0 · 10 min read

1. The Setup

We are building a customer support chatbot for a fictional SaaS product called "CloudDash", a cloud monitoring dashboard. The chatbot answers questions about pricing, features, and troubleshooting. Along the way, we will add RAIL Score evaluation at every layer to ensure the chatbot's responses are safe, accurate, fair, and helpful.

Architecture Overview

User Message → Your Chatbot (Python) → OpenAI / Gemini → RAIL Score API → 8 Dimension Scores + Policy Check → Safe Response to User

Install dependencies

Terminal
pip install "rail-score-sdk[openai,google,langfuse]" openai google-genai

Environment variables

Create a .env file:

.env
RAIL_API_KEY=YOUR_RAIL_API_KEY
OPENAI_API_KEY=sk-your_openai_key
GEMINI_API_KEY=your_gemini_key

# Optional: for Part 2 (Langfuse observability)
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com
Get your RAIL API key: Sign up at responsibleailabs.ai/dashboard. The free tier includes 100 credits to follow this entire tutorial.
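With the keys in place, it's worth failing fast on any that are missing before the first API call. A minimal sketch — the `check_env` helper and the choice of required keys are ours, not part of the SDK:

```python
import os

# Keys required for Part 1; add the LANGFUSE_* keys when you reach Part 2.
REQUIRED_KEYS = ["RAIL_API_KEY", "OPENAI_API_KEY"]


def check_env(keys: list[str]) -> list[str]:
    """Return the names of any environment variables that are unset or empty."""
    return [k for k in keys if not os.getenv(k)]


missing = check_env(REQUIRED_KEYS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

If you keep the keys in a `.env` file as shown above, load it first (for example with `python-dotenv`'s `load_dotenv()`) so that `os.getenv` can see the values.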

2. Build the Basic Chatbot

Start with a basic chatbot using OpenAI directly — no RAIL integration yet. This is the foundation we will layer scoring onto.

chatbot.py
import openai
import os

openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

SYSTEM_PROMPT = """You are CloudDash Support, a helpful assistant for
CloudDash — a cloud monitoring dashboard. Answer questions about
pricing, features, setup, and troubleshooting. Be concise and accurate.
If you don't know something, say so."""


def chat(user_message: str, history: list[dict] | None = None) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if history:
        messages.extend(history)
    messages.append({"role": "user", "content": user_message})

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.3,
    )
    return response.choices[0].message.content


reply = chat("What pricing plans do you offer?")
print(reply)

This works, but we have zero visibility into response quality. Is this response safe? Is it factually accurate? Does it contain any bias? We have no way to know until we add RAIL Score.

3. Add RAIL Score Evaluation

The simplest way to add RAIL evaluation is with RailScoreClient. One call gives us scores across all 8 RAIL dimensions.

chatbot_with_eval.py
from rail_score_sdk import RailScoreClient
from chatbot import chat  # the basic chatbot from section 2
import os

rail = RailScoreClient(api_key=os.getenv("RAIL_API_KEY"))

reply = chat("What pricing plans do you offer?")

result = rail.eval(content=reply, mode="basic")

print(f"Overall Score: {result.rail_score.score}")
print(f"Confidence:    {result.rail_score.confidence}")
print()
for dim_name, dim_score in result.dimension_scores.items():
    print(f"  {dim_name:15s} {dim_score.score}")

Basic vs Deep Evaluation

Basic Mode (1 credit): Content → RAIL API → Overall + 8 Scores (~200ms)

Deep Mode (3 credits): Content → RAIL API (LLM Judge) → Scores + Explanations + Issues (~2-4s)

Interpreting the results

| Dimension | Score | What it means |
|---|---|---|
| Safety | 9.2 | No harmful content, appropriate for all users |
| User Impact | 9.0 | Directly answers the question at the right detail level |
| Inclusivity | 8.7 | Accessible language, no exclusionary terms |
| Fairness | 8.5 | Equitable treatment, no demographic bias |
| Accountability | 8.1 | Clear reasoning, traceable claims |
| Transparency | 8.0 | Honest representation of knowledge |
| Reliability | 7.8 | Mostly accurate, but pricing details are synthetic |
| Privacy | 5.0 | Not applicable — no PII involved |
Privacy = 5.0 means "not applicable." RAIL returns 5.0 (neutral) when privacy is irrelevant to the content being evaluated.
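In practice you'll want to turn these scores into a decision, and the neutral Privacy value means a naive "flag everything below threshold" check would produce false alarms. One way to handle it — a sketch, where the helper name and the lowercase dimension keys are illustrative, not the SDK's exact field names:

```python
NEUTRAL = 5.0  # RAIL returns 5.0 when a dimension is not applicable


def weak_dimensions(scores: dict[str, float], threshold: float = 7.0) -> list[str]:
    """Return dimension names scoring below `threshold`, skipping a neutral Privacy."""
    return [
        name for name, score in scores.items()
        if score < threshold and not (name == "privacy" and score == NEUTRAL)
    ]


# The scores from the table above, as plain floats:
example = {
    "safety": 9.2, "user_impact": 9.0, "inclusivity": 8.7, "fairness": 8.5,
    "accountability": 8.1, "transparency": 8.0, "reliability": 7.8, "privacy": 5.0,
}
print(weak_dimensions(example, threshold=8.0))  # prints ['reliability']
```

Privacy at exactly 5.0 is treated as "not applicable" and skipped; a Privacy score genuinely below threshold (say 4.0) would still be flagged.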

4. Deep Evaluation

Basic mode gives you scores. Deep mode gives you the why: per-dimension explanations, detected issues, and improvement suggestions.

deep_eval.py
result = rail.eval(content=reply, mode="deep")

print(f"Overall: {result.rail_score.score}")
print()

for dim_name, detail in result.dimension_scores.items():
    print(f"--- {dim_name} (score: {detail.score}) ---")
    print(f"  Explanation: {detail.explanation}")
    if detail.issues:
        print(f"  Issues: {', '.join(detail.issues)}")
    if detail.suggestions:
        print(f"  Suggestion: {detail.suggestions[0]}")
    print()

Basic vs Deep

| | Basic | Deep |
|---|---|---|
| Cost | 1 credit | 3 credits |
| Scores | Overall + 8 dimensions | Overall + 8 dimensions |
| Explanations | No | Yes, per dimension |
| Issue detection | No | Yes |
| Best for | High-volume, real-time checks | Debugging, auditing, post-hoc analysis |
Cost-saving tip: Use basic mode for every response in production, and deep mode selectively — e.g., when a basic score drops below your threshold, or as a periodic audit on a sample of responses.
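The threshold-triggered escalation can be sketched in a few lines. The `rail.eval` calls mirror the SDK usage shown earlier; `DEEP_THRESHOLD`, `needs_deep_eval`, and `evaluate_response` are names we made up for this tutorial:

```python
DEEP_THRESHOLD = 7.0  # tune to your own quality bar


def needs_deep_eval(overall_score: float, threshold: float = DEEP_THRESHOLD) -> bool:
    """Escalate to deep mode only when the cheap basic score looks suspect."""
    return overall_score < threshold


def evaluate_response(rail, content: str):
    basic = rail.eval(content=content, mode="basic")    # 1 credit, ~200ms
    if needs_deep_eval(basic.rail_score.score):
        return rail.eval(content=content, mode="deep")  # 3 credits, adds explanations
    return basic
```

Most responses pay only the 1-credit basic price; the 3-credit deep evaluation runs just for the responses where you actually need the explanations and issue list.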

You have the basics. Part 2 covers production features: drop-in LLM provider wrappers, policy enforcement (block/regenerate), multi-turn session tracking, and Langfuse observability.

Continue to Part 2