Building a Responsible AI Chatbot
Part 1 of 2 — Setup, basic evaluation, deep analysis, and understanding scores.
1. The Setup
We are building a customer support chatbot for a fictional SaaS product called "CloudDash", a cloud monitoring dashboard. The chatbot answers questions about pricing, features, and troubleshooting. Along the way, we will add RAIL Score evaluation at every layer to ensure the chatbot's responses are safe, accurate, fair, and helpful.
Architecture Overview
Install dependencies
```shell
pip install "rail-score-sdk[openai,google,langfuse]" openai google-genai
```
Environment variables
Create a .env file:
```
RAIL_API_KEY=YOUR_RAIL_API_KEY
OPENAI_API_KEY=sk-your_openai_key
GEMINI_API_KEY=your_gemini_key
# Optional: for Part 2 (Langfuse observability)
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com
```
2. Build the Basic Chatbot
Start with a basic chatbot using OpenAI directly — no RAIL integration yet. This is the foundation we will layer scoring onto.
```python
import os

import openai

openai_client = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

SYSTEM_PROMPT = """You are CloudDash Support, a helpful assistant for
CloudDash — a cloud monitoring dashboard. Answer questions about
pricing, features, setup, and troubleshooting. Be concise and accurate.
If you don't know something, say so."""

def chat(user_message: str, history: list[dict] | None = None) -> str:
    """Send a message (plus optional conversation history) and return the reply."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    if history:
        messages.extend(history)
    messages.append({"role": "user", "content": user_message})
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.3,
    )
    return response.choices[0].message.content

reply = chat("What pricing plans do you offer?")
print(reply)
```
This works, but we have zero visibility into response quality. Is the response safe? Factually accurate? Free of bias? We have no way to know until we add RAIL Score.
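The `chat()` helper accepts a `history` list for multi-turn conversations. As a minimal sketch of how to maintain it (the `build_history` helper is an illustration, not part of the tutorial code), prior turns can be assembled into the chat-completions message format like this:

```python
def build_history(turns: list[tuple[str, str]]) -> list[dict]:
    """Convert (user, assistant) turn pairs into chat-completions messages."""
    history: list[dict] = []
    for user_msg, assistant_msg in turns:
        history.append({"role": "user", "content": user_msg})
        history.append({"role": "assistant", "content": assistant_msg})
    return history

# A follow-up question that relies on the earlier exchange:
# chat("Does the Pro plan include alerting?",
#      history=build_history([("What pricing plans do you offer?", reply)]))
```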
3. Add RAIL Score Evaluation
The simplest way to add RAIL evaluation is with RailScoreClient. One call gives us scores across all 8 RAIL dimensions.
```python
import os

from rail_score_sdk import RailScoreClient

rail = RailScoreClient(api_key=os.getenv("RAIL_API_KEY"))

reply = chat("What pricing plans do you offer?")
result = rail.eval(content=reply, mode="basic")

print(f"Overall Score: {result.rail_score.score}")
print(f"Confidence: {result.rail_score.confidence}")
print()
for dim_name, dim_score in result.dimension_scores.items():
    print(f"  {dim_name:15s} {dim_score.score}")
```
Basic vs Deep Evaluation
- Basic Mode: 1 credit, ~200ms per call
- Deep Mode: 3 credits, ~2-4s per call
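For budgeting, a back-of-envelope estimator based on the per-call credit costs above (`monthly_credits` is a hypothetical helper for illustration, not part of the SDK):

```python
BASIC_CREDITS = 1  # basic mode, per the costs above
DEEP_CREDITS = 3   # deep mode

def monthly_credits(calls_per_day: int, deep_fraction: float) -> float:
    """Estimate credits over a 30-day month, given the share of deep-mode calls."""
    per_day = calls_per_day * (
        (1 - deep_fraction) * BASIC_CREDITS + deep_fraction * DEEP_CREDITS
    )
    return 30 * per_day

print(monthly_credits(1_000, 0.1))  # 1,000 calls/day, 10% deep → 36000.0
```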
Interpreting the results
| Dimension | Score | What it means |
|---|---|---|
| Safety | 9.2 | No harmful content, appropriate for all users |
| User Impact | 9.0 | Directly answers the question at the right detail level |
| Inclusivity | 8.7 | Accessible language, no exclusionary terms |
| Fairness | 8.5 | Equitable treatment, no demographic bias |
| Accountability | 8.1 | Clear reasoning, traceable claims |
| Transparency | 8.0 | Honest representation of knowledge |
| Reliability | 7.8 | Mostly accurate, but pricing details are synthetic |
| Privacy | 5.0 | Not applicable — no PII involved |
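In practice you rarely eyeball every dimension; a small helper can surface only the ones worth reviewing. A sketch using the illustrative scores from the table above (the 7.0 threshold is an arbitrary choice for this example, not a RAIL recommendation):

```python
def flag_low_dimensions(scores: dict[str, float], threshold: float = 7.0) -> list[str]:
    """Return dimension names scoring below the threshold, lowest first."""
    return sorted((d for d, s in scores.items() if s < threshold), key=scores.get)

scores = {
    "Safety": 9.2, "User Impact": 9.0, "Inclusivity": 8.7, "Fairness": 8.5,
    "Accountability": 8.1, "Transparency": 8.0, "Reliability": 7.8, "Privacy": 5.0,
}
print(flag_low_dimensions(scores))  # → ['Privacy']
```

Note that Privacy's 5.0 here means "not applicable", so a real pipeline might exclude inapplicable dimensions before flagging.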
4. Deep Evaluation
Basic mode gives you scores. Deep mode gives you the why: per-dimension explanations, detected issues, and improvement suggestions.
```python
result = rail.eval(content=reply, mode="deep")

print(f"Overall: {result.rail_score.score}")
print()
for dim_name, detail in result.dimension_scores.items():
    print(f"--- {dim_name} (score: {detail.score}) ---")
    print(f"  Explanation: {detail.explanation}")
    if detail.issues:
        print(f"  Issues: {', '.join(detail.issues)}")
    if detail.suggestions:
        print(f"  Suggestion: {detail.suggestions[0]}")
    print()
```
Basic vs Deep
| | Basic | Deep |
|---|---|---|
| Cost | 1 credit | 3 credits |
| Scores | Overall + 8 dimensions | Overall + 8 dimensions |
| Explanations | No | Yes, per dimension |
| Issue detection | No | Yes |
| Best for | High-volume, real-time checks | Debugging, auditing, post-hoc analysis |
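The trade-offs in the table suggest a simple routing rule between modes; a sketch of one possible policy (`choose_mode` and its thresholds are illustrative, not an SDK feature):

```python
def choose_mode(*, needs_explanations: bool, latency_budget_ms: int) -> str:
    """Use deep mode only when explanations are needed and latency allows (~2-4s)."""
    if needs_explanations and latency_budget_ms >= 4_000:
        return "deep"
    return "basic"

# e.g. an offline audit job can afford deep mode:
# result = rail.eval(content=reply,
#                    mode=choose_mode(needs_explanations=True, latency_budget_ms=10_000))
```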
You have the basics. Part 2 covers production features: drop-in LLM provider wrappers, policy enforcement (block/regenerate), multi-turn session tracking, and Langfuse observability.
Continue to Part 2