Of the 55 frontier language models evaluated by the Phare V2 benchmark in February 2026, not one scored above 90% in average safety. Bias resistance -- the dimension most directly tied to real-world harm -- remains below 65% for the majority of models tested. Meanwhile, 78% of organizations now use AI in at least one business function, but only 33% have responsible AI controls in place.
Key Takeaways
- Anthropic's Claude 4.5 models sweep the top three positions on the Phare V2 safety leaderboard, but even the best model (Claude 4.5 Haiku, 83.2%) leaves significant room for improvement.
- Bias resistance is the weakest safety dimension across nearly every model tested. Most score below 65%, and DeepSeek R1 0528 scores just 25.5%.
- Single-attempt safety metrics are misleading. Gray Swan's multi-attempt testing shows Claude Opus 4.5 jumping from 4.7% to 63% attack success rate across 100 attempts in coding mode.
- DeepSeek R1 failed to block a single harmful prompt in Cisco's HarmBench testing (100% attack success rate).
- Enterprise AI adoption (78%) far outpaces responsible AI governance (33% with controls), creating systemic risk.
- LLM safety improvements are stagnating -- improved reasoning does not correlate with better safety.
Introduction
The AI safety evaluation landscape in 2026 is fragmented, inconsistent, and structurally incomplete. Organizations deploying frontier language models face a paradox: more safety benchmarks exist than ever before, yet none can tell you whether a model is truly safe for your use case.
Stanford's HELM Safety project found that of 102 safety benchmarks published since 2018, only 12 were actually used to evaluate state-of-the-art models as of March 2024. MLCommons, the consortium behind the most widely cited enterprise safety benchmark (AILuminate), explicitly warns that "performing well on the benchmark does not mean your model is safe -- simply that we have not identified critical safety weaknesses." The benchmarks themselves acknowledge they cannot do what enterprises most need them to do.
This article assembles data from four distinct safety evaluation frameworks -- Phare V2, Cisco HarmBench, Gray Swan's multi-attempt red-teaming, and MLCommons AILuminate -- to construct a composite picture of where 10 frontier LLMs stand across multiple safety dimensions in April 2026.
The Phare V2 benchmark: the most current multidimensional safety leaderboard
The Phare benchmark from Giskard, developed in partnership with Google DeepMind, is the most current multidimensional safety leaderboard available. Its V2 update (February 2026) evaluates 55 models across four dimensions: Hallucination Resistance, Harm Resistance, Bias Resistance, and Jailbreak Resistance.
The key finding from V2: "LLM security improvements are stagnating" -- improved reasoning does not correlate with better safety. Safety, the Phare team concluded, "requires dedicated investment and engineering" and is "not an inevitable byproduct of model development."
The Phare V2 leaderboard
| Rank | Model | Avg Safety | Hallucination | Harm | Bias | Jailbreak |
|---|---|---|---|---|---|---|
| 1 | Claude 4.5 Haiku | 83.2% | 83.6% | 99.9% | 70.7% | 78.5% |
| 2 | Claude 4.5 Opus | 82.4% | 88.2% | 98.3% | 63.2% | 79.8% |
| 3 | Claude 4.5 Sonnet | 77.6% | 87.0% | 99.1% | 49.1% | 75.2% |
| 8 | Gemini 3.0 Pro | 73.3% | 81.0% | 93.5% | 53.7% | 65.1% |
| 10 | GPT 5.1 | 72.8% | 81.8% | 96.9% | 46.8% | 65.8% |
| 11 | GPT 5.2 | 71.0% | 77.1% | 96.9% | 38.5% | 71.6% |
| 12 | Llama 4 Maverick | 70.8% | 71.5% | 89.3% | 73.7% | 49.0% |
| 26 | DeepSeek V3.1 | 64.8% | 61.6% | 94.4% | 65.2% | 38.2% |
| 38 | Mistral Large 3 | 60.9% | 68.0% | 88.1% | 62.7% | 24.9% |
| 46 | DeepSeek R1 0528 | 58.6% | 72.9% | 95.2% | 25.5% | 40.7% |
Anthropic models sweep the top three positions. A striking pattern: bias resistance is the weakest dimension for nearly every model, with most scoring below 65%. Jailbreak resistance shows the greatest provider-to-provider variance -- Anthropic models cluster around 75--80%, while Mistral Large 3 sits at just 24.9%.
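The published "Avg Safety" column appears to be an unweighted mean of the four dimension scores -- an assumption on our part, but one that reproduces the leaderboard figures to one decimal place. A quick sketch, using values from the table above:

```python
# Phare V2 dimension scores (hallucination, harm, bias, jailbreak),
# taken from the leaderboard table above
scores = {
    "Claude 4.5 Haiku": (83.6, 99.9, 70.7, 78.5),
    "Claude 4.5 Opus": (88.2, 98.3, 63.2, 79.8),
    "DeepSeek R1 0528": (72.9, 95.2, 25.5, 40.7),
}

def avg_safety(dims):
    """Unweighted mean of the four Phare V2 dimensions. (Assumed
    aggregation -- it matches the published Avg Safety column.)"""
    return round(sum(dims) / len(dims), 1)

for model, dims in scores.items():
    print(f"{model}: {avg_safety(dims)}%")
```

Note what an unweighted mean hides: Claude 4.5 Haiku's 99.9% harm resistance and 70.7% bias resistance average out to a headline number that reflects neither.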
Cisco HarmBench: single-shot adversarial testing
Cisco's HarmBench testing (January 2025, 50 prompts) provides a direct cross-model comparison of attack success rates:
| Model | ASR |
|---|---|
| DeepSeek R1 | 100% |
| Llama 3.1-405B | 96% |
| GPT-4o | 86% |
| Gemini-1.5-Pro | 64% |
| o1-preview | 26% |
| Claude 3.5 Sonnet | 26% |
DeepSeek R1 failed to block a single harmful prompt. Separate testing by Enkrypt AI found R1 is 11x more likely to generate harmful content than OpenAI o1, with 83% of bias attacks and 78% of insecure-code attacks succeeding. Promptfoo testing gave DeepSeek R1 a 53.5% security pass rate, Llama 4 Scout just 21.7%, and Llama 4 Maverick 25.5%.
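Attack success rate in this setup is simply the fraction of the 50 HarmBench prompts that elicited a harmful response. A minimal scorer, with the per-prompt verdicts stubbed out as booleans (in Cisco's methodology the verdicts come from an automated harm judge, which is not reproduced here):

```python
def attack_success_rate(verdicts):
    """verdicts: one boolean per tested prompt, True if the attack
    succeeded. Returns ASR as a percentage of all prompts."""
    return 100.0 * sum(verdicts) / len(verdicts)

# DeepSeek R1 in Cisco's run: all 50 prompts succeeded
print(attack_success_rate([True] * 50))                 # 100.0
# o1-preview / Claude 3.5 Sonnet: 13 of 50 succeeded
print(attack_success_rate([True] * 13 + [False] * 37))  # 26.0
```

With only 50 prompts, each verdict moves the score by two percentage points, which is worth keeping in mind when comparing single-digit differences between models.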
Gray Swan: multi-attempt red-teaming reveals hidden risk
The Gray Swan benchmark exposes a critical flaw in standard safety evaluation: most benchmarks test each prompt exactly once. Real-world adversaries try repeatedly -- Anthropic's own 153-page system card describes reinforcement-learning (RL) attack campaigns of up to 200 attempts.
| Model | ASR (1 attempt) | ASR (10) | ASR (100) |
|---|---|---|---|
| Claude Opus 4.5 (coding) | 4.7% | 33.6% | 63.0% |
| Claude Opus 4.5 (computer use + extended thinking) | 0% | 0% | 0% (0% even at 200) |
| GPT-5.1 | 21.9% | -- | -- |
| Gemini 3 Pro | 12.5% | -- | -- |
Claude Opus 4.5 in computer-use mode with extended thinking became the first model to saturate the benchmark at 0% ASR even after 200 attempts. But in coding mode, the same model jumps from 4.7% to 63% across 100 attempts -- demonstrating that safety is not a fixed property of a model but a function of deployment configuration.
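If attempts were independent draws with the single-attempt success probability p, the n-attempt ASR would be 1 - (1 - p)^n. A quick sketch shows this naive baseline badly overshoots the observed coding-mode numbers from the table above -- repeated attempts against the same model are evidently far from independent, and also far from negligible:

```python
def independent_asr(p_single, n):
    """n-attempt ASR under the (deliberately wrong) assumption that each
    attempt is an independent Bernoulli trial at the single-attempt rate."""
    return 1 - (1 - p_single) ** n

p = 0.047  # Claude Opus 4.5 coding-mode single-attempt ASR (table above)
for n, observed in [(10, 0.336), (100, 0.630)]:
    baseline = independent_asr(p, n)
    print(f"n={n}: independence baseline {baseline:.1%}, observed {observed:.1%}")
```

The gap in both directions is the point: observed multi-attempt ASR grows much faster than the single-attempt number suggests, but slower than pure independence would predict, so neither extrapolation substitutes for actually running multi-attempt tests.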
MLCommons AILuminate: the human-generated standard
MLCommons AILuminate (v1.0 December 2024, v1.1 February 2025) uses a five-tier grading system (Poor to Excellent) across 12 hazard categories, with more than 24,000 human-generated prompts.
Why binary benchmarks fail
The data across all four frameworks points to six structural problems with binary safety evaluation.
Additional benchmarks and model-specific notes
HELM Safety (Stanford CRFM, v1.0 November 2024)
HELM Safety v1.0 tests 5 benchmarks spanning 6 risk categories across 24 prominent models, using BBQ, SimpleSafetyTests, HarmBench, XSTest, and AnthropicRedTeam. Its authors are explicit about the limitation: "HELM Safety v1.0 is not able to designate models as safe -- [it] can only identify ways in which models may be unsafe."
Other evaluation frameworks
Model-specific safety notes
The enterprise governance gap
Enterprise data underscores why these benchmark results matter at organizational scale: 78% of organizations now use AI in at least one business function, yet only 33% have responsible AI controls in place.
The gap between adoption velocity and governance maturity means the majority of organizations deploying frontier LLMs lack the infrastructure to detect or mitigate the specific dimensional failures identified in this analysis.
Why RAIL's 8-dimension framework addresses these gaps
RAIL's approach to safety evaluation -- scoring across Fairness, Safety, Reliability, Transparency, Privacy, Accountability, Inclusivity, and User Impact -- was designed specifically to address the limitations exposed in this analysis. Where Phare V2 evaluates four dimensions and HarmBench tests a single adversarial axis, the RAIL Score Evaluator provides an 8-dimension profile that maps to the actual risk categories enterprises face in deployment.
Organizations using the RAIL Score Evaluator can test their specific models against their specific use cases and receive per-dimension scores that directly inform risk assessment. This is the difference between "this model scored 7.2 out of 10" and "this model scores 8.9 on Safety but 4.1 on Fairness, which is critical for your hiring use case."
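The per-dimension logic described above can be sketched as a simple gating check. This is a hypothetical illustration, not the RAIL Score Evaluator's actual API; the dimension names follow RAIL's 8-dimension framework, and the scores echo the hiring example in the text:

```python
# Hypothetical per-dimension risk gating -- NOT the RAIL Score Evaluator API.
# Dimension names follow RAIL's 8-dimension framework; scores are illustrative.
profile = {
    "Fairness": 4.1, "Safety": 8.9, "Reliability": 7.5, "Transparency": 6.8,
    "Privacy": 7.9, "Accountability": 7.0, "Inclusivity": 6.2, "User Impact": 7.4,
}

def deployment_risks(profile, critical_dims, threshold=6.0):
    """Flag any use-case-critical dimension scoring below the threshold.
    A strong overall average cannot mask a failing critical dimension."""
    return {d: s for d, s in profile.items()
            if d in critical_dims and s < threshold}

# For a hiring use case, Fairness and Inclusivity are critical:
print(deployment_risks(profile, {"Fairness", "Inclusivity"}))
# -> {'Fairness': 4.1}
```

The model above averages to roughly 7 out of 10, yet the hiring-critical Fairness dimension fails the gate -- exactly the failure mode a single composite score conceals.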
Conclusion
The composite safety picture across 10 frontier LLMs reveals a field where harm prevention is strong, bias resistance is weak, jailbreak vulnerability varies dramatically by provider, and single-attempt metrics systematically understate real-world risk. No model achieves comprehensive safety across all dimensions, and improved reasoning capability does not translate into improved safety.
For organizations deploying these models, the practical takeaway is clear: safety evaluation must be multidimensional, deployment-specific, and ongoing. A single benchmark score cannot capture the dimensional complexity of LLM safety. Evaluating against multiple axes -- and understanding which dimensions matter most for a given use case -- is the minimum standard for responsible deployment in 2026.