
Roadmap

The future of Harbor

Here's what we're working on and where Harbor is headed.

Benchmarks & Tasks

  • Adapt all compatible benchmarks into the Harbor registry — Run any major benchmark through Harbor with a single command.
  • Build new benchmarks — Develop the next generation of agent benchmarks, such as Terminal-Bench 3.0.
  • Help researchers and companies build & share benchmarks — We want to work directly with users on this.
  • Multi-turn conversation evaluation — Both step-by-step verification and simulated users.
  • Task creation tooling — Capture production agent state and map it to tasks.

Integrations

  • Integrate with more sandbox providers — e.g. the Vercel Sandbox.
  • Integrate with Tinker for training — RL on Harbor tasks.
  • Integrate with SkyRL for training — RL on Harbor tasks.
  • Support LLM-as-a-judge — Add first-class support for using language models as evaluation judges.
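
To illustrate the general pattern (this is not Harbor's API), here is a minimal LLM-as-a-judge sketch. The judge model, prompt wording, and PASS/FAIL criterion are illustrative assumptions.

```python
# Generic LLM-as-a-judge sketch; model name, prompt, and the
# PASS/FAIL convention are assumptions, not Harbor's interface.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an agent's answer against a reference.
Task: {task}
Reference answer: {reference}
Agent answer: {answer}
Reply with a single word: PASS or FAIL."""

def judge(task: str, reference: str, answer: str, model: str = "gpt-4o-mini") -> bool:
    """Ask a language model whether the agent's answer satisfies the task."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            task=task, reference=reference, answer=answer)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")
```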

Infrastructure

  • Improve visualization and analysis tooling — Help people debug their jobs fast.
    • Improved sandbox snapshotting — Capture the state of an environment for debugging (e.g. filesystem snapshots, asciinema recordings).
  • Build a hosted storage layer — A PyPI for Harbor tasks. Make it easy to version, share, distribute, and run tasks, and to track metrics. This includes shareable private registries.
  • Build hosted rollout infra — Managed infrastructure for running evaluations at scale.
