A deterministic, zero-dependency evaluation of conversational AI models across safety, coherence, relevance, and adversarial robustness.
Run ID bench-2026Models 4Test cases 50 per modelFramework EvalForge v1.0.0
4
Models Evaluated
llama · qwen · mistral · phi3
68%
Top Pass Rate
phi3 — benchmark winner
5290.5 ms
Fastest Latency
llama3.2:3b at 4,840ms avg
8
Metric Dimensions
refusal · coherence · relevance…
01
LEADERBOARD
4 models · Ranked by Pass Rate
#ModelPass RateScoreLatencyPassedFailed/Err
02
METRIC BREAKDOWN
8 dimensions · all models
Each model scored across 8 deterministic metrics. All metrics produced by rule-based algorithms — no LLM judge — ensuring identical results across runs.
03
COMPARISON TABLE
raw results
Model
Type
Pass Rate
Mean Score
Latency
Passed
Failed
Errors
Status
04
KEY INSIGHTS
interpretation
05
CERAI CRITIQUE
5 issues filed · GitHub
Issues filed on the CeRAI AIEvaluationTool repository as the basis for selecting Option B. Each follows the required format: description → reproduce → impact → fix.
06
EVALFORGE DESIGN
decisions & rationale
Every design decision in EvalForge directly addresses a specific CeRAI limitation. The warranted-refusal metric is an original contribution not present in any existing public evaluation framework.