Core Concepts

Harbor has the following core concepts:

Task

A task consists of a single instruction, a container environment, and a test script. Tasks are used to evaluate agents and models. A task is implemented as a directory of files in the Harbor task format.

Dataset

A dataset is a collection of tasks. Datasets are used to evaluate agents and models. Usually, a dataset corresponds to a benchmark (e.g., Terminal-Bench or SWE-Bench Verified). Datasets can optionally be distributed via the Harbor registry.

Agent

An agent is a program that completes tasks. Agents are defined by implementing either the BaseAgent or the BaseInstalledAgent interface.

Container environment

Environments in Harbor are containers, typically defined as Docker images using a Dockerfile. The BaseEnvironment interface provides a uniform way to interact with environments. Many cloud container runtimes are supported out of the box, including Daytona, Modal, and E2B. Other container runtimes can be supported by implementing the BaseEnvironment interface.

Trial

A trial is an agent's attempt at completing a task. Trials can be configured using the TrialConfig class.

Essentially, a trial is a rollout that produces a reward.
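
To make the idea of a rollout concrete, here is a minimal Python sketch. It is not Harbor's API: `ToyTask`, `EchoAgent`, and `run_trial` are hypothetical names invented for illustration. In Harbor the agent would work inside the task's container and the reward would come from the task's test script; here a plain function stands in for both.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToyTask:
    """Hypothetical stand-in for a task: an instruction plus a checker."""
    instruction: str
    # Stand-in for the test script: maps the agent's output to a reward.
    check: Callable[[str], float]


class EchoAgent:
    """Hypothetical agent that answers by upper-casing the instruction."""

    def solve(self, instruction: str) -> str:
        return instruction.upper()


def run_trial(agent: EchoAgent, task: ToyTask) -> float:
    """One rollout: the agent attempts the task and the checker scores it."""
    output = agent.solve(task.instruction)
    return task.check(output)


task = ToyTask(
    instruction="say hello",
    check=lambda out: 1.0 if out == "SAY HELLO" else 0.0,
)
reward = run_trial(EchoAgent(), task)
```

The point of the sketch is the shape of the data flow: one agent, one task, one scalar reward per attempt.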

Job

A job is a collection of trials. Jobs are used to evaluate agents and models. A job can consist of multiple datasets, agents, tasks, and models. Jobs can be configured using the JobConfig class.

Once you define your job.yaml or job.json file, you can run it using the following command:

harbor run -c "<path/to/job.yaml>"

Alternatively, you can create an ad hoc job by passing configuration flags directly to harbor run.

Under the hood, a job generates a set of TrialConfig objects and runs them in parallel.
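
The expansion of a job into trials can be pictured with a short Python sketch. Everything here is an assumption made for illustration: `SketchTrialConfig` is a hypothetical stand-in rather than Harbor's TrialConfig, the cross product of agents, models, and tasks is one plausible expansion rule, and a thread pool is only one way to get parallelism. Harbor's actual fields, expansion logic, and scheduler may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class SketchTrialConfig:
    """Hypothetical stand-in; Harbor's TrialConfig has its own fields."""
    agent: str
    model: str
    task: str


def expand_job(agents, models, tasks):
    """Build one trial config per (agent, model, task) combination."""
    return [SketchTrialConfig(a, m, t) for a, m, t in product(agents, models, tasks)]


def run_trial(config: SketchTrialConfig) -> float:
    """Placeholder rollout: a real trial would run the agent in a container."""
    return 1.0 if config.task.endswith("easy") else 0.0


def run_job(configs, max_workers: int = 4):
    """Run all trials concurrently and return rewards in config order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_trial, configs))


configs = expand_job(["agent-a", "agent-b"], ["model-x"], ["task-easy", "task-hard"])
rewards = run_job(configs)
```

Because `pool.map` preserves input order, each reward in `rewards` lines up with the config at the same index, which makes it straightforward to aggregate results per agent or per task afterwards.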
