Security

HydraBench: Agent Infrastructure Resilience

23 scenarios, 4 frameworks, 460 runs. HydraBench tests what most agent benchmarks ignore: does your infrastructure survive crashes, contain secrets, deliver handoffs, enforce permissions, and control cost?

Project Hydra: Designing a world for agents

A browser agent tried to exfiltrate our API keys on Tuesday. By Friday we’d also watched a research agent forget 22 sources’ worth of work, a pipeline lose an entire handoff to a crash, and a content agent spend $47 unsupervised. The agents were capable. The worlds we’d built for them weren’t.

Model-Adjacent Products, Part 4: Governance & Practice

Alignment as a runtime surface. Policy enforcement without retraining. Team practices that ship.