HydraBench: Agent Infrastructure Resilience
23 scenarios, 4 frameworks, 460 runs. HydraBench tests what most agent benchmarks ignore: does your infrastructure survive crashes, contain secrets, deliver handoffs, enforce permissions, and control cost?
A browser agent tried to exfiltrate our API keys on Tuesday. By Friday we’d also watched a research agent forget 22 sources’ worth of work, a pipeline lose an entire handoff to a crash, and a content agent spend $47 unsupervised. The agents were capable. The worlds we’d built for them weren’t.
Alignment as a runtime surface, policy enforcement without retraining. Team practices that ship.