Production AI agents fail when they return HTTP 200s for broken outputs. The dashboard shows 99.4% uptime, but customers report broken features for weeks. This happens when models silently regress after variant swaps, yet pipelines continue returning success codes for unusable outputs. The reliability gap: traditional SRE metrics track throughput, …