Harbor

Motivation

Why we built Harbor

Harbor is a framework for evaluating and optimizing agents and models in container environments.

When we released Terminal-Bench in May, we were surprised to see it used in ways we had not anticipated, such as prompt optimization, RL, SFT trace generation, and CI/CD agent testing.

We also learned that defining and managing containerized tasks at scale is hard. We built Harbor to make it easy.

Harbor provides:

  • Simple, modular interfaces for environments, agents, and tasks
  • All popular CLI agents pre-integrated
  • A registry of popular benchmarks and datasets
  • Integrations with cloud sandbox providers like Daytona, Modal, and E2B for horizontal scaling
  • Integrations with frameworks like SkyRL and GEPA for optimizing agents