Motivation

Harbor is a framework for evaluating and optimizing agents and models in container environments.

When we released Terminal-Bench in May, we were surprised by how people used it: building custom evals, optimizing prompts, running RL, generating SFT traces, and testing agents in CI/CD.

We also learned that defining and managing containerized tasks at scale is hard. We built Harbor to make it easy.

Harbor provides:

Simple, modular interfaces for environments, agents, and tasks
All popular CLI agents pre-integrated
A registry of popular benchmarks and datasets
Integrations with cloud sandbox providers like Daytona, Modal, E2B, Runloop, Tensorlake, LangSmith, Blaxel, Novita Sandbox, and EC2 for horizontal scaling
Integrations with frameworks like SkyRL and GEPA for optimizing agents