Thinking about LLM reliability, production AI, and the infrastructure that's missing.
We switched from Sonnet 4.5 to budget models in production. Same accuracy. 75% cheaper. Here's the data from 1,054 documents.
Transparency: Most AI companies don't publish failure rates. 45 failures out of 1,054 documents. Every one flagged. Zero silent failures. Here's what they look like.
Infrastructure: The current stack can observe, evaluate, and guard. Nobody has built the fourth layer: correct. Here's why that matters.
Founder Notes: Decimal shifts, laterality swaps, temporal confusion, and the "helpful" LLM that creates liability. Lessons from production healthcare AI.
Analysis: A response can pass every safety check — no PII, no toxicity, valid JSON — and still contain completely wrong data.
Engineering: You can see your pass rates, yet you still spend hours guessing at fixes. The gap between scoring and diagnosing is where the time goes.
Data: And you don't know which 30%. The research on error rates, what we see in production, and why naive self-correction fails.
Architecture: Generate → check → re-generate is architecturally broken. Memoryless re-generation, the checker paradox, and why single-call correction exists.