Roadmap
The future of Harbor
Here's what we're working on and where Harbor is headed.
Benchmarks & Tasks
- Adapt all compatible benchmarks into the Harbor registry — Run any major benchmark through Harbor with a single command.
- Build new benchmarks — Develop the next generation of agent benchmarks, such as Terminal-Bench 3.0.
- Help researchers and companies build & share benchmarks — We want to work directly with users on this.
- Multi-turn conversation evaluation — Both step-by-step verification and simulated users (see the sketch after this list).
- Task creation tooling — Capture production agent state and map it to tasks.
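To make the multi-turn item above concrete, here is a rough sketch of what step-by-step verification with a simulated user could look like. Everything here is illustrative: the function and type names are not Harbor APIs, and the agent, simulated user, and verifier are passed in as plain callables (which could be backed by any LLM or scripted logic).

```python
from dataclasses import dataclass, field
from typing import Callable

# Illustrative names only; not Harbor APIs.
Message = dict[str, str]  # {"role": ..., "content": ...}


@dataclass
class ConversationResult:
    transcript: list[Message]
    step_scores: list[float] = field(default_factory=list)


def run_simulated_conversation(
    agent: Callable[[list[Message]], str],            # agent under test
    simulated_user: Callable[[list[Message]], str],   # LLM-backed or scripted user
    verify_step: Callable[[list[Message]], float],    # per-turn check in [0, 1]
    max_turns: int = 5,
) -> ConversationResult:
    """Drive a multi-turn conversation and verify it step by step."""
    transcript: list[Message] = []
    scores: list[float] = []
    for _ in range(max_turns):
        user_msg = simulated_user(transcript)
        transcript.append({"role": "user", "content": user_msg})
        agent_msg = agent(transcript)
        transcript.append({"role": "assistant", "content": agent_msg})
        scores.append(verify_step(transcript))
    return ConversationResult(transcript=transcript, step_scores=scores)
```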
Integrations
- Integrate with more sandbox providers — e.g. Vercel Sandbox.
- Integrate with Tinker and SkyRL for training — RL on Harbor tasks.
- Support LLM-as-a-judge — First-class support for using language models as evaluation judges (sketched below).
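As a rough illustration of the LLM-as-a-judge item, the sketch below grades an answer against a rubric using any completion function you supply. The prompt, function names, and JSON response format are assumptions for the example, not Harbor's judge API.

```python
import json
from typing import Callable

# Illustrative only; not Harbor's judge API.
JUDGE_PROMPT = """You are grading an agent's answer against a rubric.
Rubric: {rubric}
Task: {task}
Agent answer: {answer}
Respond with JSON: {{"score": <0.0-1.0>, "reasoning": "<one sentence>"}}"""


def llm_judge(
    complete: Callable[[str], str],  # any chat/completion function
    task: str,
    answer: str,
    rubric: str,
) -> float:
    """Score an agent's answer with a language-model judge."""
    prompt = JUDGE_PROMPT.format(rubric=rubric, task=task, answer=answer)
    reply = complete(prompt)
    try:
        return float(json.loads(reply)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0  # treat unparseable judge output as a failed check
```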
Infrastructure
- Improve visualization and analysis tooling — Help people debug their jobs fast.
- Improve sandbox snapshotting — Capture the state of an environment for debugging (e.g. filesystem snapshots, asciinema recordings); see the sketch after this list.
- Build a hosted storage layer — A PyPI for Harbor tasks: make it easy to version, share, distribute, and run tasks, and to track metrics across them. This includes shareable private registries.
- Build hosted rollout infra — Managed infrastructure for running evaluations at scale.
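As a simplified illustration of filesystem snapshotting, the sketch below dumps a Docker-based sandbox's filesystem to a tar archive for later inspection. It calls the Docker SDK (docker-py) directly and assumes the sandbox is a local container; it is not Harbor's implementation.

```python
# Illustrative only; assumes a Docker-backed sandbox and docker-py installed.
import docker


def snapshot_filesystem(container_id: str, out_path: str) -> None:
    """Dump a container's filesystem to a tar archive on the host."""
    client = docker.from_env()
    container = client.containers.get(container_id)
    with open(out_path, "wb") as f:
        for chunk in container.export():  # streams a tar of the filesystem
            f.write(chunk)


# Example usage (container name is hypothetical):
# snapshot_filesystem("my-sandbox", "snapshots/my-sandbox.tar")
```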