Problem
Standard LLM benchmarks measure accuracy, fluency, and task completion but do not capture responsible AI dimensions like fairness, safety, privacy, or accountability. Teams deploying models in production need structured evaluation across these dimensions.
What RAIL Score provides
RAIL Score is a responsible AI evaluation API that scores model outputs across 8 dimensions, each on a 0-10 scale:
| Dimension | What it measures |
| --- | --- |
| Fairness | Equitable treatment, absence of bias |
| Safety | Prevention of harmful content |
| Reliability | Factual accuracy, consistency |
| Transparency | Clear reasoning, disclosed limitations |
| Privacy | PII protection, data minimization |
| Accountability | Traceable decisions, auditable reasoning |
| Inclusivity | Accessible, culturally aware language |
| User Impact | Value delivered to the end user |
Python SDK on PyPI: pip install rail-score-sdk
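To make the 0-10 scale concrete, here is a minimal sketch of how a team might validate a set of per-dimension scores and flag weak dimensions. It does not call the RAIL Score SDK. The snake_case keys, the `summarize_scores` helper, the threshold of 5.0, and the unweighted mean are all assumptions for illustration, and the API's own overall score may be computed differently.

```python
from statistics import mean

# The eight dimensions from the table, as snake_case keys.
# These key names are assumed for illustration; the SDK's field names may differ.
RAIL_DIMENSIONS = (
    "fairness", "safety", "reliability", "transparency",
    "privacy", "accountability", "inclusivity", "user_impact",
)


def summarize_scores(scores: dict, threshold: float = 5.0) -> dict:
    """Validate a per-dimension score dict and flag dimensions below a threshold."""
    missing = [d for d in RAIL_DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for name in RAIL_DIMENSIONS:
        if not 0 <= scores[name] <= 10:
            raise ValueError(f"{name} score {scores[name]} is outside the 0-10 scale")
    return {
        # Unweighted mean, used only as a stand-in for an overall score.
        "mean_score": mean(scores[d] for d in RAIL_DIMENSIONS),
        "below_threshold": [d for d in RAIL_DIMENSIONS if scores[d] < threshold],
    }
```

A gate like this could run in CI after an evaluation, failing the build when any dimension falls below the chosen threshold.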
Integration with LightEval
The integration uses MetricGrouping with a SampleLevelComputation subclass. All 8 dimensions plus an overall score appear as separate named metrics in LightEval results.
pip install rail-score-sdk
export RAIL_API_KEY="rail_..."
lighteval accelerate \
"model_name=HuggingFaceH4/zephyr-7b-beta" \
"rail_score:default|0" \
--custom-tasks custom_rail_score_task.py
A complete custom task file is available in the linked PR.
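For readers who want a feel for the shape of such a metric before opening the PR, the following is a rough, library-free sketch of a per-sample scorer that returns nine named metrics and a corpus-level average. It is not the task file from the PR and does not import LightEval or the RAIL SDK. `RailSampleScorer`, `corpus_average`, the `rail_*` metric names, and the injected `score_fn` (a stand-in for the API call) are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, List

DIMENSIONS = (
    "fairness", "safety", "reliability", "transparency",
    "privacy", "accountability", "inclusivity", "user_impact",
)
# Hypothetical metric names: one per dimension plus an overall score.
METRIC_KEYS = {f"rail_{d}": d for d in DIMENSIONS}
METRIC_KEYS["rail_overall"] = "overall"
METRIC_NAMES = list(METRIC_KEYS)


class RailSampleScorer:
    """Hypothetical per-sample scorer; a stand-in, not LightEval's own class."""

    def __init__(self, score_fn: Callable[[str], Dict[str, float]]):
        # score_fn stands in for a call to the RAIL Score API. It is assumed
        # to return the eight dimension scores plus an "overall" entry.
        self.score_fn = score_fn

    def compute(self, response: str) -> Dict[str, float]:
        raw = self.score_fn(response)
        # Emit every metric under its own name so each appears separately.
        return {name: float(raw[key]) for name, key in METRIC_KEYS.items()}


def corpus_average(per_sample: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each named metric across all scored samples."""
    return {
        name: sum(sample[name] for sample in per_sample) / len(per_sample)
        for name in METRIC_NAMES
    }
```

Injecting `score_fn` keeps the sketch testable offline; in a real task the function would wrap the SDK client and read the key from `RAIL_API_KEY`.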
Resources