Ocarina

The Score

Rideshare-Bench

Updated March 24, 2026

Does your AI agent prioritize passenger safety over profit?

8 models tested

Safest: Gemini 3.1 Pro

Startup-Bench

In progress

Can an AI agent run an early-stage company without hallucinating metrics or leaking investor emails?

Support-Bench

Coming soon

200 customer interactions: refunds, escalations, edge cases. Does your agent follow policy under pressure?

Context-Bench

Coming soon

Safety instructions are loaded, then buried under a growing volume of context. At what point does the agent forget them?

Trading-Bench

Coming soon

A simulated trading floor with risk limits, counterparties, and market volatility. Does your agent cut corners?