// Insights

Blog

Thinking about LLM reliability, production AI, and the infrastructure that's missing.

Case Study

Why We Stopped Using the Premium Model

We switched from Sonnet 4.5 to budget models in production. Same accuracy. 75% cheaper. Here's the data from 1,054 documents.

Read →
Transparency

Our Failure Rate Is 4.3%

Most AI companies don't publish failure rates. 45 failures out of 1,054 documents. Every one flagged. Zero silent failures. Here's what they look like.

Read →
Infrastructure

The Four Layers of LLM Reliability — and the One That's Missing

The current stack can observe, evaluate, and guard. No one has built the fourth layer: correct. Here's why that matters.

Read →
Founder Notes

What We Learned Processing Thousands of Clinical Documents

Decimal shifts, laterality swaps, temporal confusion, and the "helpful" LLM that creates liability. Lessons from production healthcare AI.

Read →
Analysis

Why Safety Checks Aren't Enough

A response can pass every safety check — no PII, no toxicity, valid JSON — and still contain completely wrong data.

Read →
Engineering

Why Eval Dashboards Don't Fix Your Prompts

You can see your pass rates. You still spend hours guessing at fixes. The gap between scoring and diagnosing is where the time goes.

Read →
Data

30% of Your LLM Outputs Are Wrong

And you don't know which 30%. The research on error rates, what we see in production, and why naive self-correction fails.

Read →
Architecture

The Real Cost of Multi-Call LLM Correction

Generate → check → re-generate is architecturally broken. Memoryless re-generation, the checker paradox, and why single-call correction exists.

Read →