EvalForge — AI Model Benchmark Report

EvalForge · Multi-Model Benchmark Report

AI MODEL BENCHMARK REPORT

A deterministic, zero-dependency evaluation of conversational AI models across safety, coherence, relevance, and adversarial robustness.

Run ID: bench-2026 · Models: 4 · Test cases: 50 per model · Framework: EvalForge v1.0.0

Models Evaluated

llama · qwen · mistral · phi3

Top Pass Rate: 68% (phi3, benchmark winner)

Fastest Latency: 5290.5 ms (llama3.2:3b at 4,840 ms avg)

Metric Dimensions

refusal · coherence · relevance…

LEADERBOARD

4 models · Ranked by Pass Rate

#	Model	Pass Rate	Score	Latency	Passed	Failed/Err
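The ordering behind a leaderboard like this one can be sketched as follows. This is a minimal illustration, assuming pass rate is the number of passed cases divided by all cases (passed, failed, and errored) and that ties break on lower mean latency; the report states neither rule. `ModelResult`, `rank`, and the sample rows are hypothetical and are not the benchmark's actual results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    """One leaderboard row (hypothetical schema, mirroring the column names)."""
    name: str
    passed: int
    failed: int
    errors: int
    mean_latency_ms: float

    @property
    def total(self) -> int:
        return self.passed + self.failed + self.errors

    @property
    def pass_rate(self) -> float:
        # Assumption: errored cases count against the pass rate.
        return self.passed / self.total if self.total else 0.0


def rank(results):
    """Sort by pass rate (descending), then latency (ascending), then name."""
    return sorted(results, key=lambda r: (-r.pass_rate, r.mean_latency_ms, r.name))


# Placeholder rows for illustration only, not measured results.
sample = [
    ModelResult("model-a", passed=30, failed=18, errors=2, mean_latency_ms=5200.0),
    ModelResult("model-b", passed=34, failed=16, errors=0, mean_latency_ms=6100.0),
    ModelResult("model-c", passed=30, failed=20, errors=0, mean_latency_ms=4900.0),
]

for position, row in enumerate(rank(sample), start=1):
    print(position, row.name, f"{row.pass_rate:.0%}", row.mean_latency_ms)
```

Under the same assumption, a 68% pass rate over 50 test cases corresponds to 34 passed cases. Sorting on a full tuple key keeps the ranking stable across runs even when two models tie on pass rate.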

METRIC BREAKDOWN

8 dimensions · all models

Each model is scored across 8 deterministic metrics. All metrics are produced by rule-based algorithms, with no LLM judge, so results are identical across runs.
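To make "rule-based, no LLM judge" concrete, here is a minimal sketch of what a deterministic refusal detector could look like. The pattern list and the `refusal_score` function are assumptions made for illustration; the report does not show EvalForge's actual rules.

```python
import re

# Hypothetical refusal phrases; a real rule set would likely be longer.
REFUSAL_PATTERNS = (
    re.compile(r"\bi\s+(?:cannot|can't|won't|will not)\b", re.IGNORECASE),
    re.compile(r"\bi(?:'m| am)\s+(?:unable|not able)\s+to\b", re.IGNORECASE),
    re.compile(r"\b(?:sorry|apologi[sz]e),?\s+but\b", re.IGNORECASE),
)


def refusal_score(response: str) -> float:
    """Return 1.0 if any refusal pattern matches the response, else 0.0."""
    return 1.0 if any(p.search(response) for p in REFUSAL_PATTERNS) else 0.0


print(refusal_score("I'm sorry, but I can't help with that."))
print(refusal_score("Paris is the capital of France."))
```

Because the score depends only on the response text and a fixed pattern list, the same input always yields the same output, which is the property the section relies on.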

COMPARISON TABLE

raw results

Model	Type	Pass Rate	Mean Score	Latency	Passed	Failed	Errors	Status

KEY INSIGHTS

interpretation

CERAI CRITIQUE

5 issues filed · GitHub

These issues were filed on the CeRAI AIEvaluationTool repository and form the basis for selecting Option B. Each follows the required format: description → reproduce → impact → fix.

EVALFORGE DESIGN

decisions & rationale

Every design decision in EvalForge directly addresses a specific CeRAI limitation. The warranted-refusal metric is an original contribution not present in any existing public evaluation framework.
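The report does not define the warranted-refusal metric in this section. One plausible reading, sketched below purely as an assumption, is that a response scores well when the model refuses exactly the prompts it should refuse: refusing a benign prompt (over-refusal) and answering a harmful one (under-refusal) both count as misses. `Case` and `warranted_refusal_summary` are hypothetical names, not EvalForge's API.

```python
from typing import Iterable, NamedTuple


class Case(NamedTuple):
    should_refuse: bool  # ground-truth label on the test case (assumed to exist)
    did_refuse: bool     # outcome of a deterministic refusal detector


def warranted_refusal_summary(cases: Iterable[Case]) -> dict:
    """Score agreement between expected and observed refusals."""
    cases = list(cases)
    n = len(cases)
    correct = sum(c.should_refuse == c.did_refuse for c in cases)
    over = sum(c.did_refuse and not c.should_refuse for c in cases)
    under = sum(c.should_refuse and not c.did_refuse for c in cases)
    return {
        "score": correct / n if n else 0.0,
        "over_refusals": over,
        "under_refusals": under,
    }


example = [
    Case(should_refuse=True, did_refuse=True),    # warranted refusal
    Case(should_refuse=False, did_refuse=False),  # warranted answer
    Case(should_refuse=False, did_refuse=True),   # over-refusal
    Case(should_refuse=True, did_refuse=False),   # under-refusal
]
print(warranted_refusal_summary(example))
```

Reporting over-refusals and under-refusals separately keeps the two failure modes distinguishable, which a single refusal rate cannot do.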

MACHINE-READABLE SUMMARY

structured JSON output
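The JSON payload itself is not included in this section. As a rough sketch of what a machine-readable summary could contain, the example below serializes only the figures stated at the top of this report. The field names are hypothetical and are not EvalForge's actual schema.

```python
import json

# Values come from the report header; key names are assumptions.
summary = {
    "run_id": "bench-2026",
    "framework": "EvalForge v1.0.0",
    "models": ["llama", "qwen", "mistral", "phi3"],
    "test_cases_per_model": 50,
    "top_pass_rate": {"model": "phi3", "pass_rate": 0.68},
}


def to_json(obj: dict) -> str:
    """Serialize with sorted keys so the output is byte-identical across runs."""
    return json.dumps(obj, sort_keys=True, indent=2)


print(to_json(summary))
```

Sorting keys at serialization time means two runs with the same results produce the same file, which makes summaries easy to diff.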