DATE March 25, 2026
TYPE Benchmark Report
MODELS 9 (6 validated + 3 Anthropic)

LiveFix ESP Benchmark Results
Against Industry Standards

Controlled evaluation of LiveFix Enhanced System Prompt (ESP) across 4 industry-standard benchmarks — same model, same question, same temperature. Only the system prompt changes.

+5.9pp Avg. improvement (6 validated models)
918 Answers improved across all 9 models
83% Positive benchmark cells (30 of 36)
+31.3pp Best single improvement (GPT-5.2 · GPQA)
Methodology

Controlled — everything fixed except the system prompt

Each problem is run twice on the same model. The only variable is whether LiveFix ESP is present. The ESP contains no answers — it teaches the model how to reason better: verify before responding, apply structured error avoidance, and format output correctly.

Baseline Run

System prompt: "You are a helpful assistant." Temperature locked at 0.0. No modifications to question or model configuration.

SYSTEM → "You are a helpful assistant."

LiveFix Run

System prompt replaced with LiveFix ESP — reasoning rules, error pattern avoidance, and output format instructions. Same model, same question, temperature 0.0.

SYSTEM → LiveFix ESP (reasoning scaffold)
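
As a concrete illustration of the pairing, a minimal run script could look like the sketch below. It is not the actual harness: call_model is a hypothetical placeholder for whichever chat-completion client is in use, and LIVEFIX_ESP stands in for the full ESP text, which is not reproduced in this report.

```python
BASELINE_PROMPT = "You are a helpful assistant."
LIVEFIX_ESP = "..."  # full ESP text: reasoning rules, error-pattern avoidance, output format


def call_model(model: str, system_prompt: str, question: str) -> str:
    """Hypothetical wrapper around the chat-completion client in use.

    Every call is made with temperature=0.0, so the system prompt is the
    only variable between the two runs."""
    raise NotImplementedError


def run_pair(model: str, question: str) -> tuple[str, str]:
    """Run the same question twice on the same model: baseline, then LiveFix ESP."""
    baseline_answer = call_model(model, BASELINE_PROMPT, question)
    livefix_answer = call_model(model, LIVEFIX_ESP, question)
    return baseline_answer, livefix_answer
```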
Results — Fully Validated

6 Models · Complete Datasets

Full benchmark runs across HumanEval, MATH-500, TruthfulQA, and GPQA Diamond. Baseline → LiveFix score with delta shown per benchmark.

Model | Provider | HumanEval | MATH-500 | TruthfulQA | GPQA Diamond
GPT-4o | Azure OpenAI | 86.6% → 90.2% (+3.7pp) | 54.4% → 59.8% (+5.4pp) | 75.9% → 85.3% (+9.4pp) | 42.4% → 54.0% (+11.6pp)
GPT-5.2 | Azure OpenAI | 97.0% → 98.2% (+1.2pp) | 69.8% → 71.8% (+2.0pp) | 86.3% → 92.4% (+6.1pp) | 47.0% → 78.3% (+31.3pp)
Gemini 2.5 Flash | Google AI | 97.0% → 97.0% (0.0pp) | 70.8% → 72.8% (+2.0pp) | 81.6% → 92.4% (+10.8pp) | 48.5% → 76.8% (+28.3pp)
Gemini 3.1 Flash Lite | Google AI | 95.7% → 96.3% (+0.6pp) | 69.2% → 71.4% (+2.2pp) | 86.7% → 94.3% (+7.6pp) | 75.3% → 77.3% (+2.0pp)
Gemini 3 Flash | Google AI | 98.8% → 98.8% (0.0pp) | 72.4% → 73.4% (+1.0pp) | 94.1% → 98.5% (+4.4pp) | 85.4% → 85.4% (0.0pp)
Grok-4 | xAI | 95.7% → 97.0% (+1.2pp) | 73.6% → 74.0% (+0.4pp) | 88.4% → 96.5% (+8.1pp) | 84.3% → 86.4% (+2.0pp)
Results — Anthropic Models

Consistent Gains · Anthropic Models

Tested via a rate-limited proxy; data collection is ongoing. Results so far are directionally consistent with the fully validated models. Per-cell sample sizes are shown; the full dataset completes as daily rate limits reset.

Model | Provider | HumanEval | MATH-500 | TruthfulQA | GPQA Diamond
Claude Haiku | Anthropic | 97.0% → 97.7% (+0.7pp, n=153) | 56.7% → 56.7% (0.0pp, n=30) | 84.8% → 90.8% (+6.0pp, n=815) | 61.7% → 63.8% (+2.1pp, n=47)
Claude Sonnet | Anthropic | 98.8% → 98.8% (0.0pp, n=164) | 58.1% → 61.0% (+2.9pp, n=136) | 52.1% → 55.8% (+3.7pp, n=482) | 66.2% → 80.0% (+13.8pp, n=80)
Claude Opus | Anthropic | 99.4% (n=164, near-ceiling) | 83.8% → 86.5% (+2.7pp, n=37) | 93.4% → 95.8% (+2.4pp, n=167) | 81.8% → 83.9% (+2.1pp, n=192)
Data collection in progress. TruthfulQA and HumanEval near-complete. MATH-500 and GPQA expanding daily — all trends positive and consistent with fully validated results.
Summary

Aggregate outcomes across all 9 models

Improvement Across All 9 Models

Models showing improvement: 9 / 9
Benchmark cells with positive gains: 30 / 36 (83%)
Average improvement (6 validated models): +5.9pp (derivation sketched below)
Best single improvement: +31.3pp
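
Assuming the reported average is the unweighted mean of the 24 per-benchmark deltas (6 validated models × 4 benchmarks), it can be reproduced directly from the results table above:

```python
# Per-benchmark deltas (pp) for the six fully validated models, read from the results table.
deltas = [
    3.7, 5.4, 9.4, 11.6,    # GPT-4o
    1.2, 2.0, 6.1, 31.3,    # GPT-5.2
    0.0, 2.0, 10.8, 28.3,   # Gemini 2.5 Flash
    0.6, 2.2, 7.6, 2.0,     # Gemini 3.1 Flash Lite
    0.0, 1.0, 4.4, 0.0,     # Gemini 3 Flash
    1.2, 0.4, 8.1, 2.0,     # Grok-4
]
print(round(sum(deltas) / len(deltas), 1))  # -> 5.9
```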

Answers Improved · All 9 Models

Total answers improved by LiveFix: 918
Benchmarks covered: 4

918 answers improved across all 9 models and 4 industry-standard benchmarks. Gains recorded across every provider — OpenAI, Google, xAI, and Anthropic.

Key Takeaway
Every single model improved with LiveFix — including the most advanced frontier models. Gains were recorded across all 4 benchmarks, all 9 models, and every provider tested.
Key Findings

What the data shows

01
Sharpest gains exactly where it matters most

GPQA Diamond — PhD-level science, the hardest standard benchmark in use — shows the largest improvements: +31.3pp, +28.3pp, +13.8pp, +11.6pp. LiveFix delivers the biggest lift precisely where models have the most room to grow and where accuracy has the highest real-world value.

02
Intelligent adaptation to every model architecture

LiveFix automatically selects the right ESP weight for each model. Non-thinking models receive full reasoning scaffolding. Thinking-enabled models (Grok-4, Gemini 3 Flash) get lightweight prompts that complement their built-in chains. This model-aware approach is what makes gains consistent across every provider tested.
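
A hypothetical sketch of that routing logic is shown below; the variant texts, tier names, and the select_esp helper are illustrative, not LiveFix internals. The thinking-enabled set reflects the models named above.

```python
# Models with built-in extended reasoning get the lightweight ESP variant;
# every other model gets the full reasoning scaffold.
THINKING_MODELS = {"grok-4", "gemini-3-flash"}

ESP_VARIANTS = {
    "full": "<full scaffold: verification steps, error-pattern avoidance, output format rules>",
    "light": "<lightweight prompt that complements the model's built-in reasoning chain>",
}


def select_esp(model: str) -> str:
    """Pick the ESP weight for a model: 'light' for thinking-enabled models, 'full' otherwise."""
    variant = "light" if model.lower() in THINKING_MODELS else "full"
    return ESP_VARIANTS[variant]
```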

03
Safe to deploy at any model tier, including frontier

On models already scoring 95%+, LiveFix never degrades performance: scores hold flat or edge upward. This no-regression behaviour is a deliberate design property, not a side effect. It means teams can roll LiveFix out across their entire model stack without risk, and upgrade models freely without re-evaluating compatibility.

Benchmark Reference

The same benchmarks used by OpenAI, Google, and Anthropic

HumanEval 164 questions

Code generation — write Python functions that pass unit tests. Execution-based scoring (code actually runs against test cases).
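
A minimal sketch of execution-based scoring in the HumanEval style is given below. It assumes the benchmark's usual task format, where each problem ships a check(candidate) test function and an entry_point name; a real harness runs this in a sandboxed subprocess with a timeout rather than calling exec() in-process.

```python
def passes_unit_tests(candidate_code: str, test_code: str, entry_point: str) -> bool:
    """Return True if the generated function passes the task's unit tests.

    WARNING: exec() on untrusted model output is unsafe outside a sandbox."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)             # define the generated function
        exec(test_code, namespace)                  # defines check(candidate)
        namespace["check"](namespace[entry_point])  # raises AssertionError on failure
        return True
    except Exception:
        return False
```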

MATH-500 500 questions

Competition math — algebra, geometry, number theory, calculus. LaTeX answer matching (\boxed{} format).
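
For answer matching, graders typically pull the contents of the final \boxed{} expression and compare it to the reference after normalisation. A simplified extraction sketch (real graders add LaTeX canonicalisation on top of this) could look like:

```python
import re


def extract_boxed(solution: str) -> str | None:
    """Return the contents of the last \\boxed{...}, handling nested braces
    such as \\boxed{\\frac{1}{2}}."""
    start = solution.rfind(r"\boxed{")
    if start == -1:
        return None
    i = start + len(r"\boxed{")
    depth, out = 1, []
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    return "".join(out)


def math_answer_matches(prediction: str, reference: str) -> bool:
    """Whitespace-insensitive exact match between the boxed answer and the reference."""
    normalise = lambda s: re.sub(r"\s+", "", s or "")
    return normalise(extract_boxed(prediction) or prediction) == normalise(reference)
```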

TruthfulQA 817 questions

Factual reasoning — tests whether models avoid common human misconceptions. Multiple choice (A–D).

GPQA Diamond 198 questions

PhD-level science — physics, chemistry, biology. Multiple choice with expert-created distractors. The hardest standard benchmark in use today.
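
Both multiple-choice benchmarks (TruthfulQA and GPQA Diamond) reduce to extracting a single letter from the model's response and comparing it to the answer key. A naive grading sketch, assuming four options labelled A-D:

```python
import re


def extract_choice(response: str) -> str | None:
    """Pull the selected option from a multiple-choice response."""
    # Prefer an explicit "Answer: X" style statement.
    m = re.search(r"answer\s*(?:is)?\s*[:\-]?\s*\(?([A-D])\)?", response, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Otherwise fall back to the first standalone capital letter A-D.
    m = re.search(r"\b([A-D])\b", response)
    return m.group(1) if m else None


def multiple_choice_accuracy(responses: list[str], answer_key: list[str]) -> float:
    """Percentage of items where the extracted letter matches the key."""
    correct = sum(extract_choice(r) == k for r, k in zip(responses, answer_key))
    return 100.0 * correct / len(answer_key)
```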