Motivation
Why we built Harbor
Harbor is a framework for evaluating and optimizing agents and models in container environments.
When we released Terminal-Bench in May, we were surprised to see it used in ways we had not anticipated, such as prompt optimization, RL, SFT trace generation, and CI/CD agent testing.
We also learned that defining and managing containerized tasks at scale is hard. We built Harbor to make it easy.
Harbor provides:
- Simple, modular interfaces for environments, agents, and tasks
- All popular CLI agents pre-integrated
- A registry of popular benchmarks and datasets
- Integrations with cloud sandbox providers like Daytona, Modal, and E2B for horizontal scaling
- Integrations with frameworks like SkyRL and GEPA for optimizing agents