
What We Learned Processing Thousands of Clinical Documents

March 2026 · 8 min read

I run a healthcare AI startup that automates clinical inbox processing for physicians. We process thousands of clinical documents monthly — lab reports, pathology results, clinical notes — across 27 providers at 5 clinics using Athena Health EHR systems.

Here's what I've learned about LLM reliability that doesn't show up in benchmarks or blog posts about prompt engineering.

The errors that don't show up in testing

Every error pattern we've built correction logic for exists because it burned us in production first. Not in testing. In production, with real patient data, where a physician was relying on the output.

Decimal shifts are the most dangerous error. An A1c of 8.2% is clinically significant — it means poorly controlled diabetes. An A1c of 0.82% is physiologically impossible. The LLM confidently extracts 0.82. Valid number. Wrong by 10x. Passes every schema check. The physician would catch it — but the whole point is to save physicians time, not create new things for them to verify.
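A plausibility check for this class of error is cheap to bolt on. Here's a minimal sketch; the analyte names, range bounds, and `check_lab_value` helper are illustrative, not our production table:

```python
# Illustrative plausibility check: flag values outside a physiological range
# and test whether a one-decimal shift would bring them back in.
PLAUSIBLE_RANGES = {
    "a1c_percent": (3.0, 20.0),      # assumed bounds, for illustration
    "glucose_mg_dl": (20.0, 1500.0),
}

def check_lab_value(analyte: str, value: float):
    """Return ('ok', value) or ('needs_review', suggested_value_or_None)."""
    lo, hi = PLAUSIBLE_RANGES[analyte]
    if lo <= value <= hi:
        return ("ok", value)
    # A value exactly one decimal place off is the classic shift error.
    for candidate in (value * 10, value / 10):
        if lo <= candidate <= hi:
            return ("needs_review", candidate)
    return ("needs_review", None)

print(check_lab_value("a1c_percent", 8.2))   # ('ok', 8.2)
print(check_lab_value("a1c_percent", 0.82))  # needs_review, suggests ~8.2
```

The point isn't the range table. It's that a wrong-by-10x value that passes every schema check still fails a two-line physiological sanity test.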

Laterality swaps are invisible. "Left knee" vs "right knee." The model read the document correctly. Then swapped sides in the extraction. Both are valid anatomical references. The schema is happy. The patient gets the wrong follow-up.
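The fix we landed on is to verify the extracted side against the source text rather than trust the extraction. A simplified sketch (the real check handles abbreviations like "L/R" and multi-site documents):

```python
# Illustrative check: the extracted laterality must actually appear
# next to the body site in the source document.
import re

def laterality_matches_source(source_text: str, site: str, side: str) -> bool:
    """True if the source mentions `side` immediately before `site`."""
    pattern = rf"\b{re.escape(side)}\s+{re.escape(site)}\b"
    return re.search(pattern, source_text, flags=re.IGNORECASE) is not None

source = "MRI of the left knee shows a medial meniscus tear."
print(laterality_matches_source(source, "knee", "left"))   # True
print(laterality_matches_source(source, "knee", "right"))  # False
```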

Temporal confusion is subtle. A document mentions a patient's current medication (metformin) and a discontinued medication (glipizide). The model extracts glipizide as current. It's in the document. It's a real medication name. But the temporal context — "discontinued 6 months ago" — got lost in extraction.
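One way to catch this is to look for discontinuation cues in the same sentence as the medication before accepting it as current. This is a rough sketch with an assumed cue list, not our full temporal logic:

```python
# Illustrative check: reject a medication as "current" if a discontinuation
# cue appears in the same sentence where it is mentioned.
import re

DISCONTINUED_CUES = ("discontinued", "stopped", "no longer taking")

def is_plausibly_current(source_text: str, med: str) -> bool:
    """False if a discontinuation cue shares a sentence with the med."""
    for sentence in re.split(r"(?<=[.!?])\s+", source_text):
        lowered = sentence.lower()
        if med.lower() in lowered and any(c in lowered for c in DISCONTINUED_CUES):
            return False
    return True

note = "Current meds: metformin 500 mg BID. Glipizide discontinued 6 months ago."
print(is_plausibly_current(note, "metformin"))  # True
print(is_plausibly_current(note, "glipizide"))  # False
```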

The "helpful" LLM is a liability. This one surprised me most. The model tries to be helpful. A patient-facing note should summarize lab results. The model adds: "You may want to discuss adjusting your blood thinner with your doctor." Sounds reasonable. But the clinic's rules explicitly prohibit medication suggestions in patient communications. The model is being helpful in a way that creates legal liability.
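Rules like "no medication suggestions in patient communications" can be enforced mechanically before anything leaves the system. A toy version with two illustrative patterns (the real rule set is clinic-specific and much longer):

```python
# Illustrative clinic rule: block medication advice in patient-facing text,
# regardless of how helpful it sounds. Patterns here are examples only.
import re

FORBIDDEN_PATTERNS = [
    r"\badjust(?:ing)?\s+your\b.*\b(medication|blood thinner|dose)\b",
    r"\b(start|stop|increase|decrease)\s+(taking\s+)?your\b.*\bmedication\b",
]

def violates_patient_comm_rules(text: str) -> bool:
    return any(re.search(p, text, flags=re.IGNORECASE) for p in FORBIDDEN_PATTERNS)

draft = "You may want to discuss adjusting your blood thinner with your doctor."
print(violates_patient_comm_rules(draft))                 # True
print(violates_patient_comm_rules("Your A1c is 8.2%."))   # False
```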

What clinicians actually need

Before a physician trusts automated extraction, they need three things. Not features. Not dashboards. Three specific things:

1. Every field must be traceable to the source. If the extraction says A1c is 8.2%, the physician needs to know it came from line 14 of the lab report, not from the model's "understanding" of what the value should be. Hallucinated values are unacceptable — not as a preference, as a clinical requirement.

2. Uncertainty must be explicit. If the model isn't confident about a value, it needs to say so — not guess and present the guess as fact. A trust status of needs_review is infinitely more useful than a confidently wrong value. Physicians are trained to handle uncertainty. They're not trained to handle systems that look certain but aren't.

3. The system must never be wrong and silent. A physician can handle "I couldn't extract this field — please review manually." A physician cannot handle a system that silently puts 0.82 where 8.2 should be. The failure mode that destroys trust isn't being wrong. It's being wrong without flagging it.
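Those three requirements point at a field-level record rather than bare values: every extraction carries provenance and a trust status. A minimal sketch; the `ExtractedField` shape and status names are illustrative, not our actual schema:

```python
# Illustrative field record: value, trust status, and source provenance
# travel together, so nothing unverified is ever presented as fact.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractedField:
    name: str
    value: Optional[str]        # None when extraction failed outright
    trust: str                  # "verified" | "needs_review" | "failed"
    source_line: Optional[int]  # where in the document the value came from
    note: str = ""

a1c = ExtractedField("a1c", "8.2%", "verified", source_line=14)
side = ExtractedField("laterality", "left", "needs_review",
                      source_line=None, note="side not confirmed in source")

# Physician-facing rendering: unverified values are always flagged.
for f in (a1c, side):
    flag = "" if f.trust == "verified" else f" [{f.trust.upper()}]"
    print(f"{f.name}: {f.value}{flag}")
```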

Why we built the correction layer

We didn't set out to build a correction layer for the LLM industry. We set out to stop shipping wrong data to physicians.

The initial approach was the obvious one: use the best available model. We ran Sonnet 4.5 and Opus 4.6. Good accuracy — 95%+ on our rule validation suite. But at $4,500 per month for a startup doing $140K ARR, the cost wasn't sustainable.

Budget models were dramatically cheaper but unusable — 40–50% pass rate. Half the documents failing. Not an option.

So we built what we needed: a system that embeds verification criteria into the LLM's generation process. The model checks its own work during generation — catching the decimal shift, the laterality swap, the forbidden medication suggestion — and corrects them before the response reaches anyone.
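Our implementation is more involved, but the control flow is roughly this shape. In this sketch, `call_model` and the checkers are stand-ins I've made up for illustration; the real system embeds the criteria into generation rather than retrying after the fact:

```python
# Illustrative generate/verify/correct loop: run extraction, apply domain
# checks, retry with feedback, and flag loudly if it still fails.
def extract_with_verification(document, call_model, checkers, max_retries=2):
    feedback = ""
    for _ in range(max_retries + 1):
        result = call_model(document, feedback)
        violations = [msg for check in checkers for msg in check(document, result)]
        if not violations:
            return {"status": "verified", "result": result}
        feedback = "Fix these issues: " + "; ".join(violations)
    # Never wrong and silent: surface the failure instead of guessing.
    return {"status": "needs_review", "result": result, "issues": violations}

# Toy usage: a fake model that fixes a decimal shift once given feedback.
def fake_model(doc, feedback):
    return {"a1c": 8.2} if feedback else {"a1c": 0.82}

def a1c_check(doc, result):
    return [] if 3.0 <= result["a1c"] <= 20.0 else ["a1c out of plausible range"]

print(extract_with_verification("lab report...", fake_model, [a1c_check]))
# → {'status': 'verified', 'result': {'a1c': 8.2}}
```

The design choice that matters is the last line of the function: when correction fails, the output is downgraded to needs_review, never returned as clean.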

The result: budget models matching premium accuracy at 75% lower cost. 95.7% pass rate on 1,054 documents. Every failure flagged. Zero silent failures.

The patterns that generalize

Healthcare is where we built this, but the error patterns aren't unique to healthcare.

The correction layer doesn't know it's processing healthcare documents. It knows it's enforcing rules against LLM output. The rules are different per domain. The correction mechanism is the same.

Why general-purpose eval tools aren't enough for healthcare. Teams evaluating a DeepEval alternative or Promptfoo alternative for clinical workflows quickly discover that generic pass/fail metrics miss the domain-specific failures — wrong laterality, imprecise decimal values, temporal confusion — that matter in medicine.

The LLM tooling market has standardized around evaluation and observability. Platforms like DeepEval, Promptfoo, LangSmith, and Confident AI help teams test prompts, catch regressions, and monitor production traffic. Guardrails AI and NeMo Guardrails add runtime validation — checking outputs against rules before returning them to users.

These tools are designed for general-purpose LLM applications: chatbots, summarization, code generation. In healthcare, the requirements are different. A failed chatbot response is a bad user experience. A failed clinical extraction is a patient safety event. The error taxonomy — decimal precision, laterality, temporal confusion, forbidden content — isn't covered by standard evaluation metrics like hallucination scores or answer relevancy.

What healthcare needs is a correction layer that understands clinical-domain validation rules and applies them during generation — not after. The model needs to know, while it's generating, that 0.82 is impossible for an A1c value, that laterality must match the source document, and that medication suggestions are prohibited in patient communications. That's what we built.

What I'd tell another founder

If you're building LLM-powered products that handle real data for real users:

Test with real production data, not curated test sets. Our test set accuracy was always higher than production accuracy. Real documents are messy, inconsistent, and surprising in ways test sets aren't.

Measure failures, not just accuracy. 95.7% accuracy means 45 failures out of 1,054. You need to know what those 45 look like, why they failed, and whether they were flagged. Aggregate pass rates hide the important details.
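The arithmetic is worth doing explicitly, because an aggregate percentage hides the absolute count you actually have to triage:

```python
# The math behind "95.7% means 45 failures out of 1,054".
total_docs = 1054
pass_rate = 0.957
failures = total_docs - round(total_docs * pass_rate)
print(f"{failures} failing documents out of {total_docs}")  # 45 out of 1054
```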

Separate bad input from bad output. 51 of our documents were flagged as bad input — source material too messy for any model. If you count those as system failures, you're optimizing for the wrong thing.

Your users don't want 100% accuracy. They want to know what to trust. A physician who knows "this extraction is verified" and "this one needs review" can work efficiently. A physician who doesn't know which outputs to trust can't trust any of them.

We're onboarding teams building LLM products for real users.

Request Early Access →