Running a dataset

Harbor is built by the creators of Terminal-Bench with evals as a core use case.

What is a dataset?

In Harbor, a dataset is a collection of tasks in the Harbor task format. A task is an agentic environment consisting of an instruction, a container environment, and a test script.

Datasets can be used to evaluate agents and models, to train models, or to optimize prompts and other aspects of an agent.

Viewing registered benchmarks

Harbor comes with a default registry defined in a registry.json file stored in the repository root.

To view all available datasets, you can use the following command:

harbor datasets list

Running a benchmark from the registry

To evaluate on Terminal-Bench, you can use the following command:

harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>"

Harbor will automatically download the dataset based on the registry definition (which points to version-controlled task definitions).

To evaluate on SWE-Bench Verified:

harbor run -d swe-bench-verified@1.0 -m "<model>" -a "<agent>"

If you leave off the version, Harbor will use the latest version of the dataset.

Running a local dataset

If you want to evaluate on a local dataset, you can use the following command:

harbor run -p "<path/to/dataset>" -m "<model>" -a "<agent>"

Analyzing results

Running the harbor run command creates a job, which is stored in the jobs directory by default.

The file structure looks something like this:

jobs/job-name
├── config.json               # Job config
├── result.json               # Job result
├── trial-name
│   ├── config.json           # Trial config
│   ├── result.json           # Trial result
│   ├── agent                 # Agent directory, contents depend on agent implementation
│   ├── recording.cast
│   ├── trajectory.json
│   └── verifier              # Verifier directory, contents depend on test.sh implementation
│       ├── ctrf.json
│       ├── reward.txt
│       ├── test-stderr.txt
│       └── test-stdout.txt
└── ...                       # More trials
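Because every trial follows the same layout, results are easy to aggregate with a short script. The sketch below averages rewards across trials by reading each trial's verifier/reward.txt, assuming the file holds a single numeric reward as suggested by the tree above (the exact contents of reward.txt may differ in your jobs); it builds a tiny synthetic job directory purely for demonstration.

```python
# Sketch: aggregate rewards across the trials of one Harbor job.
# Assumption: each trial stores a single float in verifier/reward.txt,
# matching the directory layout shown above.
import tempfile
from pathlib import Path

def mean_reward(job_dir: Path) -> float:
    """Average the reward.txt values across all trials in a job directory."""
    rewards = [
        float(p.read_text().strip())
        for p in sorted(job_dir.glob("*/verifier/reward.txt"))
    ]
    return sum(rewards) / len(rewards) if rewards else 0.0

# Build a tiny synthetic job directory to demonstrate.
job = Path(tempfile.mkdtemp())
for trial, reward in [("trial-a", "1.0"), ("trial-b", "0.0")]:
    verifier = job / trial / "verifier"
    verifier.mkdir(parents=True)
    (verifier / "reward.txt").write_text(reward)

print(mean_reward(job))  # prints 0.5
```

The same glob pattern extends naturally to comparing multiple jobs: loop over subdirectories of your jobs directory and compute one mean per job.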

Using the viewer

Harbor includes a web-based results viewer for browsing jobs, inspecting trials, and analyzing agent trajectories. To launch it, point it at your jobs directory:

harbor view jobs

This starts a local web server (default http://127.0.0.1:8080) where you can:

  • Browse jobs — Filter and search by agent, model, dataset, and date range.
  • Inspect trials — View trial results, rewards, durations, and errors for each task.
  • View trajectories — Step through the agent's execution including tool calls, observations, and multimodal content (text and images).
  • Analyze performance — See token usage breakdowns, timing metrics (environment setup, agent execution, verification), and verifier output.
  • Compare jobs — Select multiple jobs to view a side-by-side comparison matrix of task performance across agent/model combinations.
  • View artifacts — Browse files collected from the sandbox after each trial (see Artifact Collection).
  • Generate summaries — Use AI-powered summarization to analyze job failures.

The viewer supports keyboard navigation (j/k to move between rows, Enter to open, Esc to deselect).
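Trajectories can also be inspected outside the viewer by reading a trial's trajectory.json directly. The schema used below (a JSON list of steps, each with "role" and "content" fields) is an illustrative assumption; the real format depends on the agent implementation, as noted in the file tree above.

```python
# Sketch: step through a trajectory.json without the web viewer.
# Assumption: the file is a JSON list of {"role": ..., "content": ...}
# steps; the actual schema depends on the agent implementation.
import json
import tempfile
from pathlib import Path

def print_trajectory(path: Path) -> list[str]:
    """Render each trajectory step as 'index: [role] content'."""
    steps = json.loads(path.read_text())
    lines = [f"{i}: [{s['role']}] {s['content']}" for i, s in enumerate(steps)]
    for line in lines:
        print(line)
    return lines

# Synthetic trajectory for demonstration.
traj = Path(tempfile.mkdtemp()) / "trajectory.json"
traj.write_text(json.dumps([
    {"role": "assistant", "content": "ls"},
    {"role": "tool", "content": "README.md"},
]))
print_trajectory(traj)
```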

The viewer accepts the following options:

Option        Description
--port, -p    Port or port range (e.g., 8080 or 8080-8089). Default: 8080-8089
--host        Host to bind the server to. Default: 127.0.0.1
--dev         Run the frontend in development mode with hot reloading
