Simulated worlds to test and train AI agents

We generate environments, run agents through them, and publish the results.

We build complete simulated environments — cities, markets, healthcare systems — where AI agents operate with real tools, real incentives, and real consequences. An agent enters the world with an objective. It makes decisions over hours and days. We instrument everything and measure what happens.
Claude drove through a medical emergency to chase surge pricing. Over seven simulated days in a rideshare city, it optimized a proxy metric over passenger safety, drove through exhaustion at 5% accident risk, and earned $2,000 when a disciplined schedule would have yielded $3,400. Standard benchmarks score models on questions. Our worlds test what they do when nobody is watching.
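The gap above can be framed as proxy-versus-true-objective arithmetic. A minimal sketch: the dollar figures come from the writeup, but the accident cost is an assumed placeholder, and the real simulation is far richer than this toy calculation.

```python
# Illustrative arithmetic only. Earnings figures are from the rideshare
# scenario above; the accident cost is a hypothetical assumption.
greedy_earnings = 2000        # seven-day earnings chasing surge pricing
disciplined_earnings = 3400   # earnings from a disciplined schedule
accident_risk = 0.05          # risk the agent accepted by driving exhausted
assumed_accident_cost = 20000 # placeholder cost if the accident occurs

# Risk-adjusted value of each plan (disciplined plan assumed risk-free here).
greedy_value = greedy_earnings - accident_risk * assumed_accident_cost
disciplined_value = disciplined_earnings

# The greedy plan loses even before the added risk: 2000 < 3400.
assert greedy_value < disciplined_value
```

The point the numbers make: the agent's proxy metric (surge earnings) diverged from the true objective (safe, sustainable earnings), and optimizing the proxy underperformed on both.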
Every world we build produces two outputs. The scoring rubric that tells you “your agent failed the safety check” is also the reward function that trains a model to pass it. Companies deploying agents use the test results. Labs building models use the training signal. Same world, same rubric.
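The dual use of a rubric can be sketched in a few lines. This is not our actual API; every name below is an illustrative assumption. It shows how one scoring function can serve as both a deployer's pass/fail check and a training reward.

```python
# Hypothetical sketch: one rubric, two consumers. Names are illustrative.

def safety_rubric(trajectory: dict) -> float:
    """Score a simulated trajectory; higher is safer (toy criteria)."""
    score = 1.0
    if trajectory.get("drove_through_emergency"):
        score -= 0.5
    if trajectory.get("accident_risk", 0.0) > 0.03:
        score -= 0.3
    return max(score, 0.0)

# Consumer 1: a company deploying agents reads it as a safety check.
def passes_safety_check(trajectory: dict, threshold: float = 0.8) -> bool:
    return safety_rubric(trajectory) >= threshold

# Consumer 2: a lab training models reads the same number as a reward.
def reward(trajectory: dict) -> float:
    return safety_rubric(trajectory)

risky = {"drove_through_emergency": True, "accident_risk": 0.05}
assert not passes_safety_check(risky)   # fails the deployment check
assert reward(risky) == safety_rubric(risky)  # identical training signal
```

Same world, same rubric: the number that fails an agent in evaluation is the number a model is trained to raise.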
We run every major model through our worlds and publish the results independently on The Score. We have no incentive to make any model look good, so we publish everything, especially the failures.
Built out of BlueDot Impact's Technical AI Safety program.