Most AI projects in healthcare don’t fail in evaluation. They fail somewhere between the pilot demo and the third Tuesday of go-live, when the on-call physician finds an edge case the validation set never saw.
This is a recurring pattern, and it’s the reason we write these notes.
The pilot is the easy part
A 90-day pilot in a sandboxed clinic with engaged champions and curated data tells you almost nothing about whether your system survives the open ocean of routine clinical workflow. We’ve seen models with great pilot numbers fall apart against three structural realities:
- Data drift on real schedules. The pilot ran on January’s data. Flu season hits in February.
- Workflow gravity. Whatever step you added gets dropped first when the schedule slips.
- Override fatigue. If the system asks for confirmation more often than it is actually wrong, clinicians stop reading the prompt.
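The first of these is the easiest to monitor mechanically. As a minimal sketch, assuming you can export one numeric input feature (say, daily visit counts) for both the pilot period and the live period, a Population Stability Index comparison flags when live data has moved away from what the pilot saw. PSI is our choice for illustration, not a method the deployments above necessarily used; the function name, the bin count, and the 0.2 rule-of-thumb threshold are all assumptions.

```python
import math
from typing import List, Sequence


def population_stability_index(
    baseline: Sequence[float], live: Sequence[float], bins: int = 10
) -> float:
    """PSI between two samples, using equal-width bins over the baseline range."""
    if not baseline or not live:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has no spread to bin over")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Live values outside the baseline range fall into the edge bins.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    p = proportions(baseline)
    q = proportions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Hypothetical data: pilot-period values vs. a shifted live period.
pilot = [float(v) for v in range(100)]
shifted = [v + 50.0 for v in pilot]
drift_detected = population_stability_index(pilot, shifted) > 0.2  # assumed threshold
```

A check like this says nothing about why the distribution moved; it only tells you that the pilot numbers may no longer describe the population the system is seeing.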
What we measure differently
We track three numbers from week one of every deployment, and they’re the only ones we put on a dashboard the clinical lead actually opens:
- Acceptance rate — what fraction of suggestions are taken without modification.
- Modification rate — what fraction are taken with edits (this is where the model is learning the local style).
- Override rate with reason codes — the only signal that tells you why the system is wrong.
The third metric is the one most teams skip, and it’s the one that tells you whether to retrain, retune the prompt, or pull the integration entirely.
Internal post-mortem, Q1 deployment
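The three numbers above can be computed from a simple event log. This is a sketch under stated assumptions, not the dashboard code from any real deployment: the `SuggestionEvent` shape, the three outcome labels, and the example reason codes are all hypothetical, and your integration will log something different.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

VALID_OUTCOMES = {"accepted", "modified", "overridden"}  # assumed labels


@dataclass(frozen=True)
class SuggestionEvent:
    outcome: str                       # one of VALID_OUTCOMES
    reason_code: Optional[str] = None  # only meaningful for overrides


def deployment_metrics(events: Iterable[SuggestionEvent]) -> Dict[str, object]:
    """Acceptance, modification, and override rates, plus override reason counts."""
    events = list(events)
    if not events:
        raise ValueError("no events to summarize")
    outcomes = Counter(e.outcome for e in events)
    unknown = set(outcomes) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    total = len(events)
    # Overrides logged without a reason are counted, not silently dropped.
    reasons = Counter(
        e.reason_code or "unspecified"
        for e in events
        if e.outcome == "overridden"
    )
    return {
        "acceptance_rate": outcomes["accepted"] / total,
        "modification_rate": outcomes["modified"] / total,
        "override_rate": outcomes["overridden"] / total,
        "override_reasons": dict(reasons),
    }


# Hypothetical week-one log: 6 accepted, 3 modified, 1 overridden.
week_one = (
    [SuggestionEvent("accepted")] * 6
    + [SuggestionEvent("modified")] * 3
    + [SuggestionEvent("overridden", "wrong_patient_context")]
)
summary = deployment_metrics(week_one)
```

Counting missing reason codes as "unspecified" is a deliberate choice here: a growing "unspecified" bucket is itself a sign that the reason-code prompt is being skipped.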
What this means for your roadmap
If you’re planning a healthcare AI deployment in 2026, the question we’d ask first isn’t about the model. It’s about the feedback loop you’ll have on day 30, day 60, and day 90 — and who owns the decision to keep going.
Build that first. The model is the easy part.