Our Failure Rate Is 4.3%
Most AI companies don't publish failure rates. We're going to.
LiveFix's correction layer was tested on 1,054 production clinical documents — real patient data, real extraction pipeline, real rules. Here are the results, including the parts that aren't impressive.
The raw numbers
LiveFix production evaluation — clinical document extraction

Total documents tested:  1,054
Documents passed:        1,009 (95.7%)
Documents failed:           45 (4.3%)
Bad input flagged:          51 (separated from correction failures)
Silent failures:             0

// Pipeline: 6 extraction steps per document
// Demographics, lab values, clinical notes,
// internal notes, follow-up actions, metadata
95.7% passed. 4.3% didn't. Every failure was flagged with a requires_human trust status. Zero silent failures.
The 51 bad-input documents are a separate category — source material too messy, ambiguous, or incomplete for any model to extract reliably. We separate these from correction failures because the distinction matters. A prompt failure is our problem. Bad input is a data quality problem.
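As a sketch of how a downstream consumer might act on that separation — the response shape, field names, and status values below are illustrative assumptions, not LiveFix's actual API:

```python
# Hypothetical response shape for illustration; the real LiveFix API
# may differ. Each extraction carries a trust status so the downstream
# system can route it explicitly instead of silently consuming it.

def route_extraction(result: dict) -> str:
    """Decide what to do with one extracted document."""
    status = result.get("trust_status")
    if status == "trusted":
        return "accept"              # safe to use downstream
    if status == "requires_human":
        return "human_review_queue"  # correction failure: our problem
    if status == "bad_input":
        return "data_quality_queue"  # source too messy for any model
    return "human_review_queue"      # unknown status: fail safe, never silent

# A flagged correction failure is never silently accepted:
print(route_extraction({"trust_status": "requires_human"}))  # human_review_queue
```

The point of the fail-safe default is the same as the article's thesis: an output with an unrecognized status should land in front of a human, not in production.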
Why we're publishing this
Three reasons.
1. Nobody else does. Search for failure rates from LLM tooling companies. You'll find "high accuracy," "enterprise-grade," "production-ready." You won't find numbers. When everyone claims reliability without data, the claims are meaningless.
The LLM tooling ecosystem — platforms like DeepEval, Promptfoo, LangSmith, Humanloop, Arize AI, and Langfuse — publishes evaluation metrics, benchmark scores, and feature comparisons. These are useful for choosing tools, but none of them publish production failure rates for their own systems. Guardrails AI shows you how to validate outputs; Confident AI shows you how to test them. Neither tells you what percentage of production traffic actually fails after all their tooling is applied. We think that number matters more than any benchmark. If you're evaluating an alternative to DeepEval or LangSmith, or comparing Promptfoo or Confident AI against LiveFix, ask one question: do they publish their own failure rates? We're the only platform that both corrects outputs at runtime and publishes what happens when it doesn't work.
2. 4.3% known failures beats 15–30% unknown failures. Without a correction layer, LLM errors are silent. The output looks correct. You don't know which responses to trust. With LiveFix, every failure is flagged. The 4.3% you know about is dramatically better than the 15–30% you don't.
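To make that arithmetic concrete over the run above — a back-of-envelope sketch using the 15–30% range as stated in the text, not a published head-to-head comparison:

```python
# Back-of-envelope: expected error counts over the 1,054-document run.
total_docs = 1054

# Without a correction layer: 15-30% of outputs are silently wrong
# (range from the text; varies by model and task).
silent_low = round(total_docs * 0.15)   # ~158 errors nobody flags
silent_high = round(total_docs * 0.30)  # ~316 errors nobody flags

# With LiveFix: 45 failures, every one flagged for human review.
flagged = 45

print(f"Unknown errors without correction: {silent_low}-{silent_high}")
print(f"Known, flagged failures with LiveFix: {flagged}")
```

Roughly 158–316 invisible errors versus 45 visible ones: the absolute count drops, but the bigger change is that every remaining error announces itself.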
3. Trust is earned, not claimed. If we're asking teams to route production traffic through our system, they deserve to know exactly how it performs — including where it doesn't.
What the 45 failures look like
The failures aren't random. They cluster into patterns:
- Genuinely ambiguous source material. Documents where even a human reviewer would struggle — contradictory information, unclear abbreviations, handwriting artifacts in scanned documents.
- Complex inference requirements. Cases where the correct extraction requires multi-step clinical reasoning, not just rule application. The correction layer catches formatting and rule violations, but can't upgrade the base model's reasoning.
- Edge case formats. Document layouts the system hasn't seen enough of — unusual lab report formats, non-standard ordering of sections.
Every failure was flagged. The downstream system knew not to trust these outputs. A human reviewed them.
What we're doing about it
The correction layer isn't static. Error patterns from flagged failures feed back through daily analysis cycles, and the correction logic adapts. Day 30 is measurably better than day 1.
But let's be honest about what daily improvement means: it's not instant. It's not per-request. And some failures won't be solved by better correction — they need better base models, better source data, or genuinely hard-coded edge case handling.
Our goal isn't 0% failure rate. That's not achievable for any system processing real-world documents. Our goal is: every failure is known, flagged, and routed correctly. The system should never be wrong and silent.
What this means for you
| | Without correction layer | With LiveFix |
| --- | --- | --- |
| Error rate | 15–30% (varies by model and task) | 4.3% |
| Known failures | 0% | 4.3% (every one flagged) |
| Silent failures | 15–30% | 0% |
| Trust level | You hope it's right | You know what to trust |
The question isn't whether your system will have failures. It will. The question is whether you'll know about them before your users do.
The honest caveats
- These results are from one domain — healthcare clinical extraction. Other domains may see different failure rates.
- The pipeline had multiple improvement cycles before this measurement. Day 1 performance would have been worse.
- These numbers come from budget models with correction, not premium models. Premium models plus correction would likely fail even less often, but we haven't published that comparison yet.
- 1,054 documents is meaningful but not massive. As we process more, these numbers will update — and we'll publish the updates.
We'll keep publishing. Quarterly at minimum. If the numbers get worse, we'll publish that too.
We're onboarding teams that want predictable AI output, not promises.
Request Early Access →