Roadmap
The future of Harbor
Here's what we're working on and where Harbor is headed.
Benchmarks & Tasks
- Adapt all compatible benchmarks into the Harbor registry — Run any major benchmark through Harbor with a single command.
- Build new benchmarks — Develop the next generation of agent benchmarks, such as Terminal-Bench 3.0.
- Help researchers and companies build & share benchmarks — We want to work directly with users on this.
- Multi-turn conversation evaluation — Both step-by-step verification and simulated users (see the sketch after this list).
- Task creation tooling — Capture production agent state and map it to tasks.
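To make the multi-turn item above concrete, here is a rough sketch of what step-by-step verification with a simulated user could look like. Everything here is illustrative: the function and type names are not Harbor APIs, and the agent, simulated user, and verifier are passed in as plain callables (which could be backed by any LLM or scripted logic).

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative names only; not Harbor APIs.
Message = dict[str, str]  # {"role": ..., "content": ...}


@dataclass
class ConversationResult:
    transcript: list[Message]
    step_scores: list[float] = field(default_factory=list)


def run_simulated_conversation(
    agent: Callable[[list[Message]], str],            # agent under test
    simulated_user: Callable[[list[Message]], str],   # LLM-backed or scripted user
    verify_step: Callable[[list[Message]], float],    # per-turn check in [0, 1]
    max_turns: int = 5,
) -> ConversationResult:
    """Drive a multi-turn conversation and verify it step by step."""
    transcript: list[Message] = []
    scores: list[float] = []
    for _ in range(max_turns):
        user_msg = simulated_user(transcript)
        transcript.append({"role": "user", "content": user_msg})
        agent_msg = agent(transcript)
        transcript.append({"role": "assistant", "content": agent_msg})
        scores.append(verify_step(transcript))
    return ConversationResult(transcript=transcript, step_scores=scores)
```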
Integrations
- Integrate with more sandbox providers — e.g. Vercel Sandbox.
- Integrate with Tinker and SkyRL for training — RL on Harbor tasks.
- Support LLM-as-a-judge — First-class support for using language models as evaluation judges (sketched below).
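As a rough illustration of the LLM-as-a-judge item, the sketch below grades an answer against a rubric using any completion function you supply. The prompt, function names, and JSON response format are assumptions for the example, not Harbor's judge API.

```python
import json
from typing import Callable

# Illustrative only; not Harbor's judge API.
JUDGE_PROMPT = """You are grading an agent's answer against a rubric.
Rubric: {rubric}
Task: {task}
Agent answer: {answer}
Respond with JSON: {{"score": <0.0-1.0>, "reasoning": "<one sentence>"}}"""


def llm_judge(
    complete: Callable[[str], str],  # any chat/completion function
    task: str,
    answer: str,
    rubric: str,
) -> float:
    """Score an agent's answer with a language-model judge."""
    prompt = JUDGE_PROMPT.format(rubric=rubric, task=task, answer=answer)
    reply = complete(prompt)
    try:
        return float(json.loads(reply)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0  # treat unparseable judge output as a failed check
```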
Infrastructure
- Improve visualization and analysis tooling — Help people debug their jobs fast.
- Improve sandbox snapshotting — Capture the state of an environment for debugging (e.g. filesystem snapshots, asciinema recordings); see the sketch after this list.
- Build a hosted storage layer — A PyPI for Harbor tasks: make it easy to version, share, distribute, and run tasks, and to track metrics across them. This includes shareable private registries.
- Build hosted rollout infra — Managed infrastructure for running evaluations at scale.
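As a simplified illustration of filesystem snapshotting, the sketch below dumps a Docker-based sandbox's filesystem to a tar archive for later inspection. It calls the Docker SDK (docker-py) directly and assumes the sandbox is a local container; it is not Harbor's implementation.

```python
# Illustrative only; assumes a Docker-backed sandbox and docker-py installed.
import docker


def snapshot_filesystem(container_id: str, out_path: str) -> None:
    """Dump a container's filesystem to a tar archive on the host."""
    client = docker.from_env()
    container = client.containers.get(container_id)
    with open(out_path, "wb") as f:
        for chunk in container.export():  # streams a tar of the filesystem
            f.write(chunk)


# Example usage (container name is hypothetical):
# snapshot_filesystem("my-sandbox", "snapshots/my-sandbox.tar")
```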