Controlled evaluation of LiveFix Enhanced System Prompt (ESP) across 4 industry-standard benchmarks — same model, same question, same temperature. Only the system prompt changes.
Each problem is run twice on the same model. The only variable is whether LiveFix ESP is present. The ESP contains no answers — it teaches the model how to reason better: verify before responding, apply structured error avoidance, and format output correctly.
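The paired-run protocol is simple enough to sketch. The snippet below is a minimal, illustrative harness rather than LiveFix's actual code: it assumes an OpenAI-compatible chat-completions client, and `LIVEFIX_ESP` is a placeholder for the real prompt text, which is not reproduced here.

```python
# Paired-run protocol: the same question goes to the same model twice at
# temperature 0.0; only the system prompt differs between the two runs.
from openai import OpenAI

client = OpenAI()  # provider-specific setup (Azure, proxies) omitted for brevity

BASELINE_PROMPT = "You are a helpful assistant."
LIVEFIX_ESP = "..."  # placeholder: reasoning rules, error-pattern avoidance, output format

def run_pair(model: str, question: str) -> tuple[str, str]:
    """Return (baseline_answer, livefix_answer) for one benchmark item."""
    answers = []
    for system_prompt in (BASELINE_PROMPT, LIVEFIX_ESP):
        resp = client.chat.completions.create(
            model=model,
            temperature=0.0,  # locked for both runs
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},  # question is never modified
            ],
        )
        answers.append(resp.choices[0].message.content)
    return answers[0], answers[1]
```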
System prompt: "You are a helpful assistant." Temperature locked at 0.0. No modifications to question or model configuration.
SYSTEM → "You are a helpful assistant."System prompt replaced with LiveFix ESP — reasoning rules, error pattern avoidance, and output format instructions. Same model, same question, temperature 0.0.
SYSTEM → LiveFix ESP (reasoning scaffold)Full benchmark runs across HumanEval, MATH-500, TruthfulQA, and GPQA Diamond. Baseline → LiveFix score with delta shown per benchmark.
| Model | Provider | HumanEval | MATH-500 | TruthfulQA | GPQA Diamond |
|---|---|---|---|---|---|
| GPT-4o | Azure OpenAI | 86.6% → 90.2% (+3.7pp) | 54.4% → 59.8% (+5.4pp) | 75.9% → 85.3% (+9.4pp) | 42.4% → 54.0% (+11.6pp) |
| GPT-5.2 | Azure OpenAI | 97.0% → 98.2% (+1.2pp) | 69.8% → 71.8% (+2.0pp) | 86.3% → 92.4% (+6.1pp) | 47.0% → 78.3% (+31.3pp) |
| Gemini 2.5 Flash | Google AI | 97.0% → 97.0% (0.0pp) | 70.8% → 72.8% (+2.0pp) | 81.6% → 92.4% (+10.8pp) | 48.5% → 76.8% (+28.3pp) |
| Gemini 3.1 Flash Lite | Google AI | 95.7% → 96.3% (+0.6pp) | 69.2% → 71.4% (+2.2pp) | 86.7% → 94.3% (+7.6pp) | 75.3% → 77.3% (+2.0pp) |
| Gemini 3 Flash | Google AI | 98.8% → 98.8% (0.0pp) | 72.4% → 73.4% (+1.0pp) | 94.1% → 98.5% (+4.4pp) | 85.4% → 85.4% (0.0pp) |
| Grok-4 | xAI | 95.7% → 97.0% (+1.2pp) | 73.6% → 74.0% (+0.4pp) | 88.4% → 96.5% (+8.1pp) | 84.3% → 86.4% (+2.0pp) |
The Anthropic models were tested via a rate-limited proxy, and data collection is ongoing. Trends are directionally consistent with the fully validated models above. Sample sizes (n) are shown per cell; the full dataset completes as daily rate limits reset.
| Model | Provider | HumanEval | MATH-500 | TruthfulQA | GPQA Diamond |
|---|---|---|---|---|---|
| Claude Haiku | Anthropic | 97.0% → 97.7% (+0.7pp, n=153) | 56.7% → 56.7% (0.0pp, n=30) | 84.8% → 90.8% (+6.0pp, n=815) | 61.7% → 63.8% (+2.1pp, n=47) |
| Claude Sonnet | Anthropic | 98.8% → 98.8% (0.0pp, n=164) | 58.1% → 61.0% (+2.9pp, n=136) | 52.1% → 55.8% (+3.7pp, n=482) | 66.2% → 80.0% (+13.8pp, n=80) |
| Claude Opus | Anthropic | 99.4% (n=164, near-ceiling) | 83.8% → 86.5% (+2.7pp, n=37) | 93.4% → 95.8% (+2.4pp, n=167) | 81.8% → 83.9% (+2.1pp, n=192) |
918 answers improved across all 9 models and 4 industry-standard benchmarks. Gains recorded across every provider — OpenAI, Google, xAI, and Anthropic.
GPQA Diamond — PhD-level science, the hardest standard benchmark in use — shows the largest improvements: +31.3pp, +28.3pp, +13.8pp, +11.6pp. LiveFix delivers the biggest lift precisely where models have the most room to grow and where accuracy has the highest real-world value.
LiveFix automatically selects the right ESP weight for each model. Non-thinking models receive full reasoning scaffolding. Thinking-enabled models (Grok-4, Gemini 3 Flash) get lightweight prompts that complement their built-in chains. This model-aware approach is what makes gains consistent across every provider tested.
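The selection step can be pictured as a small routing function. The sketch below is illustrative only: `THINKING_MODELS`, `FULL_ESP`, `LIGHTWEIGHT_ESP`, and `select_esp` are assumed names, not LiveFix's actual API, and the hard-coded model set stands in for whatever detection LiveFix performs internally.

```python
# Illustrative model-aware ESP selection: thinking-enabled models receive a
# lightweight prompt; everything else receives the full reasoning scaffold.
THINKING_MODELS = {"grok-4", "gemini-3-flash"}  # models with built-in reasoning chains

FULL_ESP = "..."         # placeholder: full scaffold + error-pattern avoidance + output format
LIGHTWEIGHT_ESP = "..."  # placeholder: verification and output-format rules only

def select_esp(model_name: str) -> str:
    """Pick the ESP variant appropriate for the target model."""
    if model_name.lower() in THINKING_MODELS:
        return LIGHTWEIGHT_ESP  # complement, rather than duplicate, the built-in chain
    return FULL_ESP             # non-thinking models get full scaffolding
```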
On models already scoring 95%+, LiveFix never drags performance down: scores hold flat or tick up slightly, with no regressions recorded. This is a deliberate design property, not a side effect. It means teams can roll LiveFix out across their entire model stack without risk, and upgrade models freely without re-evaluating compatibility.
HumanEval: code generation. Write Python functions that pass unit tests. Execution-based scoring: the generated code actually runs against the test cases (a scoring sketch follows below).
MATH-500: competition math covering algebra, geometry, number theory, and calculus. Answers matched in LaTeX `\boxed{}` format.
TruthfulQA: factual reasoning. Tests whether models avoid common human misconceptions. Multiple choice (A–D).
GPQA Diamond: PhD-level science across physics, chemistry, and biology. Multiple choice with expert-created distractors. The hardest standard benchmark in use today.
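For concreteness, execution-based scoring on HumanEval-style items boils down to running the model's completed function against the benchmark's unit tests. The sketch below is a simplification: real harnesses execute candidates in a sandboxed subprocess with a timeout rather than calling `exec()` in-process, and `passes_unit_tests` is an illustrative name, not part of any benchmark's official API.

```python
# Execution-based scoring (simplified): an item passes only if the candidate
# code defines the required function and every assertion in the test code holds.
def passes_unit_tests(candidate_code: str, test_code: str) -> bool:
    """Return True if the candidate solution passes the item's test suite."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function(s)
        exec(test_code, namespace)       # run the benchmark's assertions
        return True
    except Exception:
        return False                     # any error or failed assert means no credit
```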