Controlled evaluation of LiveFix Enhanced System Prompt (ESP) across 4 industry-standard benchmarks — same model, same question, same temperature. Only the system prompt changes.
Each problem is run twice on the same model. The only variable is whether LiveFix ESP is present. The ESP contains no answers — it teaches the model how to reason better: verify before responding, apply structured error avoidance, and format output correctly.
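The paired-run protocol is simple enough to sketch. The snippet below is a minimal, illustrative harness rather than LiveFix's actual code: it assumes an OpenAI-compatible chat-completions client, and `LIVEFIX_ESP` is a placeholder for the real prompt text, which is not reproduced here.

```python
# Paired-run protocol: the same question goes to the same model twice at
# temperature 0.0; only the system prompt differs between the two runs.
from openai import OpenAI

client = OpenAI()  # provider-specific setup (Azure, proxies) omitted for brevity

BASELINE_PROMPT = "You are a helpful assistant."
LIVEFIX_ESP = "..."  # placeholder: reasoning rules, error-pattern avoidance, output format

def run_pair(model: str, question: str) -> tuple[str, str]:
    """Return (baseline_answer, livefix_answer) for one benchmark item."""
    answers = []
    for system_prompt in (BASELINE_PROMPT, LIVEFIX_ESP):
        resp = client.chat.completions.create(
            model=model,
            temperature=0.0,  # locked for both runs
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},  # question is never modified
            ],
        )
        answers.append(resp.choices[0].message.content)
    return answers[0], answers[1]
```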
System prompt: "You are a helpful assistant." Temperature locked at 0.0. No modifications to question or model configuration.
SYSTEM → "You are a helpful assistant."System prompt replaced with LiveFix ESP — reasoning rules, error pattern avoidance, and output format instructions. Same model, same question, temperature 0.0.
SYSTEM → LiveFix ESP (reasoning scaffold)Full benchmark runs across HumanEval, MATH-500, TruthfulQA, and GPQA Diamond. Baseline → LiveFix score with delta shown per benchmark.
| Model | Provider | HumanEval | MATH-500 | TruthfulQA | GPQA Diamond |
|---|---|---|---|---|---|
| GPT-4o | Azure OpenAI | 86.6% → 90.2% (+3.7pp) | 54.4% → 59.8% (+5.4pp) | 75.9% → 85.3% (+9.4pp) | 42.4% → 54.0% (+11.6pp) |
| GPT-5.2 | Azure OpenAI | 97.0% → 98.2% (+1.2pp) | 69.8% → 71.8% (+2.0pp) | 86.3% → 92.4% (+6.1pp) | 47.0% → 78.3% (+31.3pp) |
| Gemini 2.5 Flash | Google AI | 97.0% → 97.0% (0.0pp) | 70.8% → 72.8% (+2.0pp) | 81.6% → 92.4% (+10.8pp) | 48.5% → 76.8% (+28.3pp) |
| Gemini 3.1 Flash Lite | Google AI | 95.7% → 96.3% (+0.6pp) | 69.2% → 71.4% (+2.2pp) | 86.7% → 94.3% (+7.6pp) | 75.3% → 77.3% (+2.0pp) |
| Gemini 3 Flash | Google AI | 98.8% → 98.8% (0.0pp) | 72.4% → 73.4% (+1.0pp) | 94.1% → 98.5% (+4.4pp) | 85.4% → 85.4% (0.0pp) |
| Grok-4 | xAI | 95.7% → 97.0% (+1.2pp) | 73.6% → 74.0% (+0.4pp) | 88.4% → 96.5% (+8.1pp) | 84.3% → 86.4% (+2.0pp) |
The Anthropic models were tested via a rate-limited proxy, and data collection is ongoing. Trends are directionally consistent with the fully validated models above. Sample sizes (n) are shown per cell; the full dataset completes as daily rate limits reset.
| Model | Provider | HumanEval | MATH-500 | TruthfulQA | GPQA Diamond |
|---|---|---|---|---|---|
| Claude Haiku | Anthropic | 97.0% → 97.7% (+0.7pp, n=153) | 56.7% → 56.7% (0.0pp, n=30) | 84.8% → 90.8% (+6.0pp, n=815) | 61.7% → 63.8% (+2.1pp, n=47) |
| Claude Sonnet | Anthropic | 98.8% → 98.8% (0.0pp, n=164) | 58.1% → 61.0% (+2.9pp, n=136) | 52.1% → 55.8% (+3.7pp, n=482) | 66.2% → 80.0% (+13.8pp, n=80) |
| Claude Opus | Anthropic | 99.4% (n=164, near-ceiling) | 83.8% → 86.5% (+2.7pp, n=37) | 93.4% → 95.8% (+2.4pp, n=167) | 81.8% → 83.9% (+2.1pp, n=192) |
918 answers improved across all 9 models and 4 industry-standard benchmarks. Gains recorded across every provider — OpenAI, Google, xAI, and Anthropic.
GPQA Diamond — PhD-level science, the hardest standard benchmark in use — shows the largest improvements: +31.3pp, +28.3pp, +13.8pp, +11.6pp. LiveFix delivers the biggest lift precisely where models have the most room to grow and where accuracy has the highest real-world value.
LiveFix automatically selects the right ESP weight for each model. Non-thinking models receive full reasoning scaffolding. Thinking-enabled models (Grok-4, Gemini 3 Flash) get lightweight prompts that complement their built-in chains. This model-aware approach is what makes gains consistent across every provider tested.
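The selection step can be pictured as a small routing function. The sketch below is illustrative only: `THINKING_MODELS`, `FULL_ESP`, `LIGHTWEIGHT_ESP`, and `select_esp` are assumed names, not LiveFix's actual API, and the hard-coded model set stands in for whatever detection LiveFix performs internally.

```python
# Illustrative model-aware ESP selection: thinking-enabled models receive a
# lightweight prompt; everything else receives the full reasoning scaffold.
THINKING_MODELS = {"grok-4", "gemini-3-flash"}  # models with built-in reasoning chains

FULL_ESP = "..."         # placeholder: full scaffold + error-pattern avoidance + output format
LIGHTWEIGHT_ESP = "..."  # placeholder: verification and output-format rules only

def select_esp(model_name: str) -> str:
    """Pick the ESP variant appropriate for the target model."""
    if model_name.lower() in THINKING_MODELS:
        return LIGHTWEIGHT_ESP  # complement, rather than duplicate, the built-in chain
    return FULL_ESP             # non-thinking models get full scaffolding
```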
On models already scoring 95%+, LiveFix never drags performance down: scores hold flat or tick up slightly, with no regressions recorded. This is a deliberate design property, not a side effect. It means teams can roll LiveFix out across their entire model stack without risk, and upgrade models freely without re-evaluating compatibility.
HumanEval: code generation. Write Python functions that pass unit tests. Execution-based scoring: the generated code actually runs against the test cases (a scoring sketch follows below).
MATH-500: competition math covering algebra, geometry, number theory, and calculus. Answers matched in LaTeX `\boxed{}` format.
TruthfulQA: factual reasoning. Tests whether models avoid common human misconceptions. Multiple choice (A–D).
GPQA Diamond: PhD-level science across physics, chemistry, and biology. Multiple choice with expert-created distractors. The hardest standard benchmark in use today.
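For concreteness, execution-based scoring on HumanEval-style items boils down to running the model's completed function against the benchmark's unit tests. The sketch below is a simplification: real harnesses execute candidates in a sandboxed subprocess with a timeout rather than calling `exec()` in-process, and `passes_unit_tests` is an illustrative name, not part of any benchmark's official API.

```python
# Execution-based scoring (simplified): an item passes only if the candidate
# code defines the required function and every assertion in the test code holds.
def passes_unit_tests(candidate_code: str, test_code: str) -> bool:
    """Return True if the candidate solution passes the item's test suite."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the candidate function(s)
        exec(test_code, namespace)       # run the benchmark's assertions
        return True
    except Exception:
        return False                     # any error or failed assert means no credit
```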