Eval Forge · Multi-Model Benchmark Report
AI MODEL BENCHMARK REPORT

A deterministic, zero-dependency evaluation of conversational AI models across safety, coherence, relevance, and adversarial robustness.

Run ID: bench-2026 · Models: 4 · Test cases: 50 per model · Framework: EvalForge v1.0.0
Models Evaluated: 4 (llama · qwen · mistral · phi3)
Top Pass Rate: 68% (phi3, benchmark winner)
Fastest Latency: 5290.5 ms (llama3.2:3b at 4,840ms avg)
Metric Dimensions: 8 (refusal · coherence · relevance…)
01

LEADERBOARD

4 models · Ranked by Pass Rate
# | Model | Pass Rate | Score | Latency | Passed | Failed/Err
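
A minimal sketch of how a leaderboard "ranked by pass rate" could be computed from the columns above. It assumes pass rate is passed cases divided by all cases (passed + failed + errors), under which a 68% pass rate on 50 cases corresponds to 34 passes. The `ModelResult` type, the tie-breaking rule, and the sample rows are hypothetical and are not taken from EvalForge or from this run's results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    passed: int
    failed: int
    errors: int
    mean_latency_ms: float

    @property
    def total(self) -> int:
        return self.passed + self.failed + self.errors

    @property
    def pass_rate(self) -> float:
        # Assumption: errors count against the model, so they stay in the denominator.
        return self.passed / self.total if self.total else 0.0


def rank_by_pass_rate(results):
    """Sort best-first: higher pass rate, then lower latency, then name.

    The extra keys are an assumed tie-break so the ordering is deterministic.
    """
    return sorted(results, key=lambda r: (-r.pass_rate, r.mean_latency_ms, r.model))


if __name__ == "__main__":
    # Placeholder rows, not real benchmark data.
    rows = [
        ModelResult("model-a", passed=30, failed=18, errors=2, mean_latency_ms=6000.0),
        ModelResult("model-b", passed=34, failed=16, errors=0, mean_latency_ms=7000.0),
    ]
    for rank, row in enumerate(rank_by_pass_rate(rows), start=1):
        print(rank, row.model, f"{row.pass_rate:.0%}")
```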
02

METRIC BREAKDOWN

8 dimensions · all models

Each model is scored across 8 deterministic metrics. All metrics are produced by rule-based algorithms, with no LLM judge, so results are identical across runs.
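
To illustrate what a rule-based metric with no LLM judge might look like, here is a small sketch of a refusal detector built on fixed regular expressions. The pattern list and the `is_refusal` function are assumptions made for this example; the report does not show EvalForge's actual rules. Because the function depends only on its input string, the same response always yields the same result.

```python
import re

# Hypothetical refusal phrases; a real rule set would likely be larger.
# Note: only the ASCII apostrophe (') is handled here, not the curly one.
REFUSAL_PATTERNS = (
    r"\bi can(?:not|'t)\b",
    r"\bi(?: am|'m) (?:not able|unable) to\b",
    r"\bi won't\b",
)
_COMPILED = [re.compile(p, re.IGNORECASE) for p in REFUSAL_PATTERNS]


def is_refusal(response: str) -> bool:
    """Return True if any fixed refusal pattern occurs in the response."""
    return any(p.search(response) for p in _COMPILED)


def refusal_score(response: str) -> float:
    """Map the boolean rule to a 0.0/1.0 metric value."""
    return 1.0 if is_refusal(response) else 0.0
```

A keyword rule like this trades recall for reproducibility: it will miss refusals phrased in ways the patterns do not cover, but it never changes its verdict between runs.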

03

COMPARISON TABLE

raw results
Model | Type | Pass Rate | Mean Score | Latency | Passed | Failed | Errors | Status
04

KEY INSIGHTS

interpretation
05

CERAI CRITIQUE

5 issues filed · GitHub

These issues were filed on the CeRAI AIEvaluationTool repository and form the basis for selecting Option B. Each follows the required format: description → reproduce → impact → fix.

06

EVALFORGE DESIGN

decisions & rationale

Every design decision in EvalForge directly addresses a specific CeRAI limitation. The warranted-refusal metric is an original contribution not present in any existing public evaluation framework.
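
This section does not define the warranted-refusal metric, so the following is only one plausible reading, sketched under stated assumptions: each test case carries a label saying whether a refusal is expected, and the metric rewards a model when its behaviour matches that label. The `Outcome` names, the `classify` function, and the 0/1 scoring are all hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    """Four possible combinations of expected and observed behaviour (assumed names)."""
    WARRANTED_REFUSAL = "warranted_refusal"    # refusal expected, model refused
    MISSED_REFUSAL = "missed_refusal"          # refusal expected, model answered
    OVER_REFUSAL = "over_refusal"              # answer expected, model refused
    APPROPRIATE_ANSWER = "appropriate_answer"  # answer expected, model answered


def classify(should_refuse: bool, refused: bool) -> Outcome:
    """Compare the test-case label with the observed behaviour."""
    if should_refuse:
        return Outcome.WARRANTED_REFUSAL if refused else Outcome.MISSED_REFUSAL
    return Outcome.OVER_REFUSAL if refused else Outcome.APPROPRIATE_ANSWER


def warranted_refusal_score(should_refuse: bool, refused: bool) -> float:
    """Assumed scoring: 1.0 when behaviour matches the label, otherwise 0.0."""
    outcome = classify(should_refuse, refused)
    good = (Outcome.WARRANTED_REFUSAL, Outcome.APPROPRIATE_ANSWER)
    return 1.0 if outcome in good else 0.0
```

Separating the classification from the score keeps over-refusals and missed refusals distinguishable in a report, even though both score 0.0 in this sketch.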

07

MACHINE-READABLE SUMMARY

structured JSON output
evalforge_benchmark_2026.json UTF-8
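
The JSON payload itself is not reproduced here. As a rough sketch of how such a summary could be serialized deterministically, the example below builds a dictionary from the run metadata stated at the top of this report and writes it as UTF-8. The field names and helper functions are assumptions, not the actual schema of evalforge_benchmark_2026.json, and per-model results are left out because they are not shown in this report.

```python
import json


def build_summary(run_id, framework, cases_per_model, models):
    """Assemble run-level metadata (hypothetical field names)."""
    return {
        "run_id": run_id,
        "framework": framework,
        "test_cases_per_model": cases_per_model,
        "model_count": len(models),
        "models": list(models),
    }


def dump_summary(summary) -> str:
    """Serialize with sorted keys so identical data gives identical text."""
    return json.dumps(summary, indent=2, sort_keys=True, ensure_ascii=False)


def write_summary(summary, path) -> None:
    """Write the summary as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(dump_summary(summary))


if __name__ == "__main__":
    summary = build_summary(
        run_id="bench-2026",
        framework="EvalForge v1.0.0",
        cases_per_model=50,
        models=["llama", "qwen", "mistral", "phi3"],
    )
    print(dump_summary(summary))
```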