RELIABILITY

Evals

Every agent runs a pinned test suite against gold examples each hour, and drift, regressions, and pass rate are posted to #cascade-evals. Changes to prompts, tools, or models don't ship without a green eval.
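The "no green eval, no ship" rule can be sketched as a small gate. This is only an illustration: the page does not define what makes an eval green, so the 90% threshold, the `EvalResult` type, and the `is_green` / `can_ship` helpers are all hypothetical, as is the choice to count partial results as not passing.

```python
from dataclasses import dataclass
from typing import Iterable

# Hypothetical threshold: the source does not say what "green" means.
MIN_PASS_RATE = 0.90


@dataclass(frozen=True)
class EvalResult:
    agent: str
    passed: int
    partial: int
    failed: int

    @property
    def total(self) -> int:
        return self.passed + self.partial + self.failed

    @property
    def pass_rate(self) -> float:
        # Assumption: partial results do not count toward the pass rate.
        return self.passed / self.total if self.total else 0.0


def is_green(result: EvalResult, min_pass_rate: float = MIN_PASS_RATE) -> bool:
    """An eval is green if it ran at least one example and meets the threshold."""
    return result.total > 0 and result.pass_rate >= min_pass_rate


def can_ship(results: Iterable[EvalResult]) -> bool:
    """A change ships only if there is at least one eval and every eval is green."""
    results = list(results)
    return bool(results) and all(is_green(r) for r in results)
```

Under these assumed settings, a run with 221 passes out of 240 examples would be green, a run with 101 out of 120 would block the change, and a change with no eval results at all would not ship.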

RUNS THIS WEEK: 2,880 (across 9 agents)
OVERALL PASS: 93.0% (+0.4 vs. last week)
FLAGGED: 2 (needs review)
AVG LATENCY: 1.2s (p50 per example)
PASS RATE · LAST 30 RUNS: 91%
[Chart: pass rate over the last 30 runs; legend: Pass, Partial, Fail]

Recent runs

Agent                     When       Total  Pass  Partial  Fail  Pass rate  Change  Status
Churn Predictor           3h ago       240   221       13     6        92%    +0.3  DONE
Support Triage            4h ago       420   399       14     7        95%    +0.1  DONE
Revenue Forensics         5h ago       120   101       13     6        84%    -1.4  FLAGGED
Inventory Oracle          6h ago       180   169        8     3        94%    +0.2  DONE
Cart Recovery Strategist  7h ago       500   480       14     6        96%    +0.4  DONE
Compliance Watchdog       Mon          160   154        4     2        96%     0.0  DONE
Review Sentinel           Mon          200   182       12     6        91%    +0.3  DONE
Ad Spend Optimizer        Mon          140   123       11     6        88%    -0.9  FLAGGED
Launch Orchestrator       Mon           80    70        6     4        87%    -0.2  DONE
Churn Predictor           Yesterday    240   216       16     8        90%    -0.4  DONE
Support Triage            Yesterday    420   395       16     9        94%     0.0  DONE
Inventory Oracle          Yesterday    180   167        9     4        93%    -0.1  DONE
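The figures in these rows can be cross-checked with a few lines of Python. The sketch below reads each row as total, pass, partial, fail, displayed pass rate, change, and status, which is an interpretation of the flattened listing. The flagging rule (`MAX_DROP`, a drop of 0.5 points or more) is purely hypothetical: it happens to match the statuses shown here, but the page does not state how runs get flagged.

```python
# Rows as read from the listing: (agent, total, pass, partial, fail, shown %, change, status)
ROWS = [
    ("Churn Predictor", 240, 221, 13, 6, 92, +0.3, "DONE"),
    ("Support Triage", 420, 399, 14, 7, 95, +0.1, "DONE"),
    ("Revenue Forensics", 120, 101, 13, 6, 84, -1.4, "FLAGGED"),
    ("Inventory Oracle", 180, 169, 8, 3, 94, +0.2, "DONE"),
    ("Cart Recovery Strategist", 500, 480, 14, 6, 96, +0.4, "DONE"),
    ("Compliance Watchdog", 160, 154, 4, 2, 96, 0.0, "DONE"),
    ("Review Sentinel", 200, 182, 12, 6, 91, +0.3, "DONE"),
    ("Ad Spend Optimizer", 140, 123, 11, 6, 88, -0.9, "FLAGGED"),
    ("Launch Orchestrator", 80, 70, 6, 4, 87, -0.2, "DONE"),
    ("Churn Predictor", 240, 216, 16, 8, 90, -0.4, "DONE"),
    ("Support Triage", 420, 395, 16, 9, 94, 0.0, "DONE"),
    ("Inventory Oracle", 180, 167, 9, 4, 93, -0.1, "DONE"),
]

# Hypothetical rule, not stated in the source: flag a run whose pass rate
# dropped by at least this many percentage points.
MAX_DROP = 0.5


def pass_rate_pct(passed: int, total: int) -> float:
    """Pass rate in percent; partial and failed examples both count against it."""
    return 100.0 * passed / total


def inferred_status(change: float, max_drop: float = MAX_DROP) -> str:
    """Status under the assumed drop-based flagging rule."""
    return "FLAGGED" if change <= -max_drop else "DONE"


def check_rows(rows=ROWS) -> list[str]:
    """Return a list of human-readable inconsistencies (empty if none)."""
    problems = []
    for agent, total, passed, partial, failed, shown, change, status in rows:
        if passed + partial + failed != total:
            problems.append(f"{agent}: counts do not sum to {total}")
        if abs(pass_rate_pct(passed, total) - shown) >= 1:
            problems.append(f"{agent}: shown {shown}% does not match counts")
        if inferred_status(change) != status:
            problems.append(f"{agent}: status {status} not explained by assumed rule")
    return problems
```

The rate check uses a tolerance of one point rather than exact rounding because the dashboard's rounding mode is unknown (70 of 80 is 87.5%, shown as 87%).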