RELIABILITY
Evals
Every agent runs a pinned test suite each hour against gold examples,
with drift, regressions, and pass rate posted to #cascade-evals.
Changes to prompts, tools, or models don't ship without a green eval.
RUNS THIS WEEK
2,880
Across 9 agents
OVERALL PASS
93.0%
+0.4 vs. last week
FLAGGED
2
Needs review
AVG LATENCY
1.2s
p50 per example
PASS RATE · LAST 30 RUNS
91%
Legend: Pass · Partial · Fail
Recent runs
| Agent | Run | Examples | Pass | Partial | Fail | Pass rate | Change | Status |
|---|---|---|---|---|---|---|---|---|
| Churn Predictor | 3h ago | 240 | 221 | 13 | 6 | 92% | ▲ +0.3 | DONE |
| Support Triage | 4h ago | 420 | 399 | 14 | 7 | 95% | ▲ +0.1 | DONE |
| Revenue Forensics | 5h ago | 120 | 101 | 13 | 6 | 84% | ▼ -1.4 | FLAGGED |
| Inventory Oracle | 6h ago | 180 | 169 | 8 | 3 | 94% | ▲ +0.2 | DONE |
| Cart Recovery Strategist | 7h ago | 500 | 480 | 14 | 6 | 96% | ▲ +0.4 | DONE |
| Compliance Watchdog | Mon | 160 | 154 | 4 | 2 | 96% | — 0.0 | DONE |
| Review Sentinel | Mon | 200 | 182 | 12 | 6 | 91% | ▲ +0.3 | DONE |
| Ad Spend Optimizer | Mon | 140 | 123 | 11 | 6 | 88% | ▼ -0.9 | FLAGGED |
| Launch Orchestrator | Mon | 80 | 70 | 6 | 4 | 87% | ▼ -0.2 | DONE |
| Churn Predictor | Yesterday | 240 | 216 | 16 | 8 | 90% | ▼ -0.4 | DONE |
| Support Triage | Yesterday | 420 | 395 | 16 | 9 | 94% | — 0.0 | DONE |
| Inventory Oracle | Yesterday | 180 | 167 | 9 | 4 | 93% | ▼ -0.1 | DONE |
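
The summary tiles above appear to be derivable from the table rows. The Python sketch below is a minimal illustration of that arithmetic, not the dashboard's actual implementation. It assumes that pass rate is Pass divided by total examples (partials are not counted as passes), and the names `EvalRun` and `summarize` are hypothetical. Under those assumptions the twelve rows reproduce the tiles: 2,880 examples in total, 9 distinct agents, 93.0% overall pass, and 2 flagged runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """One table row. Hypothetical structure, for illustration only."""
    agent: str
    passed: int
    partial: int
    failed: int
    status: str

    @property
    def examples(self) -> int:
        # Every example lands in exactly one of the three buckets.
        return self.passed + self.partial + self.failed

    @property
    def pass_rate(self) -> float:
        # Assumption: partial results do not count toward the pass rate.
        return 100.0 * self.passed / self.examples


# Counts copied from the "Recent runs" table, in the same order.
RUNS = [
    EvalRun("Churn Predictor", 221, 13, 6, "DONE"),
    EvalRun("Support Triage", 399, 14, 7, "DONE"),
    EvalRun("Revenue Forensics", 101, 13, 6, "FLAGGED"),
    EvalRun("Inventory Oracle", 169, 8, 3, "DONE"),
    EvalRun("Cart Recovery Strategist", 480, 14, 6, "DONE"),
    EvalRun("Compliance Watchdog", 154, 4, 2, "DONE"),
    EvalRun("Review Sentinel", 182, 12, 6, "DONE"),
    EvalRun("Ad Spend Optimizer", 123, 11, 6, "FLAGGED"),
    EvalRun("Launch Orchestrator", 70, 6, 4, "DONE"),
    EvalRun("Churn Predictor", 216, 16, 8, "DONE"),
    EvalRun("Support Triage", 395, 16, 9, "DONE"),
    EvalRun("Inventory Oracle", 167, 9, 4, "DONE"),
]


def summarize(runs: list[EvalRun]) -> dict:
    """Aggregate the rows into the four headline numbers."""
    total = sum(r.examples for r in runs)
    passed = sum(r.passed for r in runs)
    return {
        "examples": total,
        "agents": len({r.agent for r in runs}),
        "overall_pass": round(100.0 * passed / total, 1),
        "flagged": sum(r.status == "FLAGGED" for r in runs),
    }


print(summarize(RUNS))
```

Pooling the counts before dividing weights each run by its size, so a 500-example suite influences the overall figure more than an 80-example one. Averaging the per-run percentages instead would give a slightly different number.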