Adapters (Agent Guide)

Comprehensive adapter spec for AI agents building Harbor adapters. Contains full schemas, directory structures, commands, and validation criteria.

Looking for the human-readable guide?

This page is the comprehensive spec optimized for AI agents. For a concise walkthrough, see the Adapters (Human Guide).

Purpose

An adapter translates an existing benchmark into Harbor's task format. This document is the authoritative reference for building one. Follow steps 1-9 in order.

Check the Adapter List for available benchmarks. Contact Lin Shi or join Discord #adapters-announcements for coordination. The team covers API costs for parity experiments.

Quick Start

harbor dataset list                              # list available datasets
harbor adapter init                              # interactive scaffold
harbor adapter init my-adapter --name "My Name"  # non-interactive scaffold

Required Directory Structures

Generated task directory (one per task)

<adapter-name>/
└── <task-id>/
    ├── task.toml              # task configuration and metadata
    ├── instruction.md         # task instructions for the agent
    ├── environment/
    │   └── Dockerfile         # container environment definition
    ├── solution/
    │   └── solve.sh           # oracle solution script
    └── tests/
        ├── test.sh            # test execution script
        └── test_*.py          # (optional) pytest test files

Task naming requirement: Every generated task.toml must contain a name field. Harbor uses this field to identify the task when it's added to a dataset; tasks without a name cannot be registered. Adapter code is responsible for deriving a valid, unique, registry-safe name for every task: sanitize upstream identifiers (lowercase, replace spaces/slashes/special characters with hyphens). See §Step 8 Naming rules for the full naming contract, and the task format for the rest of the task structure. Each generated directory must contain at minimum task.toml, instruction.md, environment/Dockerfile, solution/solve.sh, and tests/test.sh.

Adapter code directory

Generated by harbor adapter init, this is a Python package using src layout:

harbor/adapters/<adapter-name>/
├── .python-version            # Python version (optional, created by uv init)
├── pyproject.toml             # Python package config (created by uv init)
├── README.md                  # final documentation (step 9)
├── adapter_metadata.json      # structured metadata (step 9)
├── parity_experiment.json     # parity results (step 6)
├── run_<adapter-name>.yaml    # reference config to run the full adapted dataset
└── src/
    └── <adapter_name>/         # adapter-name with dashes → underscores
        ├── __init__.py
        ├── adapter.py         # main logic: parse benchmark, generate task dirs
        ├── main.py            # CLI entry point (must support --output-dir)
        └── task-template/     # template files copied into each task
            ├── task.toml
            ├── instruction.md
            ├── environment/
            │   └── Dockerfile
            ├── solution/
            │   └── solve.sh
            └── tests/
                └── test.sh

Key requirements for main.py

  • Must support --output-dir to specify where generated tasks are written.
  • Must support --limit, --overwrite, and --task-ids flags.
  • Run via uv run python -m <adapter_name>.main --output-dir <path>; a sketch follows below.
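
A minimal sketch of such an entry point, assuming argparse; load_benchmark_tasks and generate_task are hypothetical helpers standing in for your adapter.py logic, not part of the Harbor API:

# src/<adapter_name>/main.py -- illustrative sketch only
import argparse
from pathlib import Path

from .adapter import load_benchmark_tasks, generate_task  # hypothetical helpers


def main() -> None:
    parser = argparse.ArgumentParser(description="Generate Harbor tasks")
    parser.add_argument("--output-dir", type=Path, required=True,
                        help="Directory where task directories are written")
    parser.add_argument("--limit", type=int, default=None,
                        help="Generate at most N tasks")
    parser.add_argument("--overwrite", action="store_true",
                        help="Overwrite existing task directories")
    parser.add_argument("--task-ids", nargs="*", default=None,
                        help="Generate only these upstream task ids")
    parser.add_argument("--split", default=None,
                        help="Optional subset, e.g. 'parity' (see Step 4)")
    args = parser.parse_args()

    tasks = load_benchmark_tasks(split=args.split)
    if args.task_ids:
        tasks = [t for t in tasks if t.id in set(args.task_ids)]
    if args.limit is not None:
        tasks = tasks[:args.limit]
    for task in tasks:
        generate_task(task, args.output_dir, overwrite=args.overwrite)


if __name__ == "__main__":
    main()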

Step 1. Understand the Original Benchmark

Identify these four components for every task in the benchmark:

| Component | What to find |
|-----------|--------------|
| Instructions | How tasks are described; what information agents receive |
| Environments | Docker setup, system dependencies, file structures |
| Tests | Evaluation method: deterministic unit tests, LLM-as-a-Judge, etc. |
| Solutions | Oracle/reference solutions; if none exist, whether LLM generation is feasible |

Study the benchmark's repository, documentation, and code structure.

Step complete when: You can describe, for each task, the instruction text, environment setup, test/verification method, and oracle solution.


Step 2. Fork and Develop Adapter Code

Run harbor adapter init to scaffold your adapter. The generated README.md doubles as a requirements checklist.

git clone https://github.com/{your-github-username}/harbor.git
cd harbor
git checkout -b {your-adapter-name}-adapter

Develop your adapter under adapters/{adapter-name}/. Refer to existing adapters in that directory.

Adapter component reference

| Component | Description |
|-----------|-------------|
| src/<adapter_name>/adapter.py | Core logic: parse benchmark data, generate task directories. |
| src/<adapter_name>/main.py | CLI entry point. Must support --output-dir, --limit, --overwrite, --task-ids. |
| src/<adapter_name>/task-template/ | Template files copied into each generated task. |
| parity_experiment.json | Parity results (see Step 6 for full schema). |
| run_{adapter-name}.yaml | Reference config to run the full adapted dataset. |
| README.md | Write last before PR submission. Fill in the README generated by harbor adapter init. |
| Metrics / Rewards | Harbor supports multiple float-valued metrics as rewards (RL-compatible). Use the same metrics as the original benchmark. |

Targeting Windows containers. Adapters that emit Windows-targeted tasks must set [environment].os = "windows" in task.toml and ship solve.bat / test.bat instead of .sh. Users who need PowerShell can call it from within a .bat file. Linux is the default and requires no change. See Windows tasks.

Task file reference

task.toml: Every task must include this configuration file. The name field is required for registry. The version field must stay "1.0". Adjust timeouts to match your benchmark's complexity.

version = "1.0"

[task]
name = "my-benchmark/task-001"

[metadata]
author_name = "Original benchmark authors' names"
author_email = "benchmark-authors@email.com"
difficulty = "medium"
category = "programming"
tags = ["debugging", "python"]

[agent]
timeout_sec = 1800.0

[verifier]
timeout_sec = 120.0

[environment]
build_timeout_sec = 600.0
cpus = 1
memory_mb = 2048
storage_mb = 10240

For LLM-as-a-Judge verifiers, add a [verifier.env] section to pass the judge model and API key:

[verifier.env]
OPENAI_API_KEY = "${OPENAI_API_KEY}"
MODEL_NAME = "openai/gpt-5-2025-08-07"

tests/test.sh: Must write a numeric reward (integer or float, 0 to 1) to /logs/verifier/reward.txt. Harbor mounts /logs/verifier/ at runtime. The script should exit 0 on success and non-zero on failure.

#!/bin/bash
pytest /tests/test_*.py
if [ $? -eq 0 ]; then
  echo 1 > /logs/verifier/reward.txt
else
  echo 0 > /logs/verifier/reward.txt
fi
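
Because Harbor accepts float rewards, benchmarks with partial credit can write a fractional score instead of 0/1. A minimal sketch of a Python verifier invoked from test.sh, where CASES and check_case are hypothetical stand-ins for your benchmark's real per-case checks:

#!/usr/bin/env python3
# Illustrative sketch: CASES and check_case are placeholders for
# benchmark-specific verification logic.
from pathlib import Path

CASES = ["case_1", "case_2", "case_3", "case_4"]


def check_case(case: str) -> bool:
    """Return True if the agent's output satisfies this case."""
    ...  # benchmark-specific check goes here


def main() -> None:
    passed = sum(1 for case in CASES if check_case(case))
    reward = passed / len(CASES)  # float in [0, 1]
    Path("/logs/verifier/reward.txt").write_text(f"{reward}\n")


if __name__ == "__main__":
    main()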

instruction.md: Write agent-actionable instructions, not raw benchmark descriptions. The agent reads this file to understand what to do. Include the goal, constraints, expected output location, and any files the agent should modify. Do not include test answers or oracle solutions.

environment/Dockerfile: Set up the container the agent will work in. Install system and Python dependencies, copy any benchmark-specific data files, and set the working directory. The agent and verifier both run inside this container. No terminal-bench or harbor canary string should occur here.

FROM python:3.13-slim
WORKDIR /workspace

RUN apt-get update && apt-get install -y \
    git \
    && rm -rf /var/lib/apt/lists/*

# Install benchmark-specific dependencies
# RUN pip install --no-cache-dir <packages>

# Copy task-specific files
# COPY . /workspace/

GPU tasks

If your benchmark includes tasks that require GPU (e.g., CUDA, Triton kernels), add a docker-compose.yaml in the task's environment/ directory with nvidia device reservations for Docker runs. For cloud/Modal runs, also set gpus in task.toml. See the featurebench adapter for a comprehensive example; it handles 44 GPU tasks across multiple repos with separate CPU/GPU/Modal config files.

Writing run_{adapter-name}.yaml

This config file serves as the single entry point for all experiments: oracle verification, parity runs, and general benchmarking. Keep the oracle agent as the default (uncommented) and include other agents as commented-out alternatives so anyone can quickly switch.

datasets:
  - path: datasets/<adapter-name>

# Default: oracle agent for verification
agents:
  - name: oracle

# Uncomment to run with other agents:
# agents:
#   - name: codex
#     model_name: openai/gpt-5-mini
#
# agents:
#   - name: claude-code
#     model_name: claude-sonnet-4-5-20250929

environment:
  type: docker
  delete: true

orchestrator:
  type: local
  n_concurrent_trials: 4

You can also create additional config files for different scenarios (e.g., parity subsets, CPU-only vs GPU, Modal). For example, featurebench provides featurebench_docker_cpu.yaml, featurebench_docker_gpu.yaml, featurebench_modal.yaml, and featurebench_parity.yaml.

Usage:

# Oracle verification (default)
harbor run -c adapters/<adapter-name>/run_<adapter-name>.yaml

# Switch agent by uncommenting the desired agent block

Rules

  • Prompt modifications (e.g., "write files in place without asking") are acceptable if applied to both the original benchmark and Harbor adapter.
  • Adapting a subset of tasks is acceptable (e.g., only SWE-Bench-Verified). Document all exclusions in the README.

Step complete when: main.py produces a valid task directory for each task containing task.toml, instruction.md, environment/Dockerfile, solution/solve.sh, and tests/test.sh.


Step 3. Verify Oracle Solutions

Run your adapter with the oracle agent and confirm 100% reward on all tasks. Validating oracle solutions is a straightforward way to check:

  1. Adaptation correctness: a wrong adaptation usually causes oracle failures.
  2. Oracle solution bugs: cross-validate by running oracle on the original benchmark side as well to determine whether a failure is due to the solution itself or the Harbor adaptation.
  3. Environment issues: Docker build failures or broken verification tests make tasks impossible for agents to solve, so catching these early is critical.

Run commands

| Method | Command | When to use |
|--------|---------|-------------|
| Single task | harbor trial start -p datasets/<adapter-name>/<task-id> -a <agent> -m <model> | Testing individual tasks |
| Entire dataset | harbor run -p datasets/<adapter-name> -a <agent> -m <model> | Full oracle verification |
| Config file | harbor run -c adapters/<adapter-name>/<config>.yaml -a <agent> -m <model> | Reproducible runs (see example configs) |
| Registry: single task | harbor run -t <organization-name>/<task-id> -a <agent> -m <model> | Post-publish single task |
| Registry: full dataset | harbor run -d <organization-name>/<adapter-name> -a <agent> -m <model> | Post-publish full dataset (after Step 8) |

Write a reference config YAML for your adapter to ensure reproducibility.

README ordering note: In the final adapter README, list the registry method (Option 5) first; it is the primary user-facing run method. Adapter code and local-path methods are for development/reproduction.

After oracle passes

  1. Create a WIP PR titled [WIP] Adapter: {adapter_name}.
  2. Include a screenshot of the terminal showing 100% oracle pass results.

Broken oracles in the original benchmark

If a fix is straightforward, propose it on the original fork and upstream repo as a GitHub Issue or PR, then document the fix clearly in the adapter README. This is more robust and transparent than patching on the Harbor side. For tasks that cannot be reliably fixed or verified, document them and exclude them from the dataset.

Benchmarks without oracle solutions

Usually we require adapter contributors to build the oracle solutions themselves with the help of AI. Building oracle solutions is a separate process from running parity experiments (Step 5), so they can progress in parallel. However, before running parity, you need to validate that the tasks are theoretically solvable by agents with no environment or test issues.

A recommended approach is to use a cheap agent and model to do a pass over all the tasks. For building oracle solutions, you can also directly use the successful agent solutions from parity experiments and then complete the remaining ones with more powerful AI plus human supervision. If things become more complicated than this, reach out to the team on Discord and discuss case-by-case.

Step complete when: All oracle solutions pass with 100% reward, and a WIP PR titled [WIP] Adapter: {adapter_name} is created with a screenshot of the passing results.


Step 4. Discuss Parity Plans and Implement Agents

Contact the team (e.g., Lin Shi on Discord) before running parity experiments. They determine agents, models, number of runs, and API key provisioning.

Agent implementation scenarios

Scenario 1: Compatible agents exist. The original benchmark already supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, Gemini-CLI). No extra work needed; run parity with identical settings on both sides. Example: ADEBench, where the original benchmark already supports Claude Code.

Scenario 2: LLM-based, no compatible agents. Fork the original benchmark, implement a Harbor-compatible agent there, and document it in the fork's README. Example: EvoEval, which forked the repo to add codex agent support for parity.

Scenario 3: Custom agents. The original benchmark uses custom agents unavailable in Harbor. Implement the custom agent under adapters/{name}/ and run parity experiments with it. Also run experiments with standard agents (Codex, Claude-Code) to show the adapter generalizes. There are two sub-cases here:

  • Some benchmarks require registering a separate dataset for CLI-agent compatibility. Example: BixBench, FinanceAgent (also demonstrates LLM-as-a-Judge verification).
  • Others do not need a separate dataset. Example: MedAgentBench supports a custom HTTPAgent matching the original GET/POST/FINISH interaction semantics.
  • For multi-agent workflows where multiple agents coordinate in parallel (e.g., via Redis messaging and Docker sidecars), see CooperBench. Note that multi-agent benchmarks may not be compatible with standard single-agent CLI agents.

Keep links to any forked repositories and document the approach in the README.

Large or expensive benchmarks

If running the full benchmark is too expensive, run parity on a representative subset. Requirements:

  • Document in README how the subset was selected and that parity ran on a subset.
  • Support --split parity in main.py to generate only the parity subset.
  • Ask the team to publish the parity subset under the parity tag so users can run -d <adapter_name>@parity. See Versioning below.
uv run python -m <adapter_name>.main --split parity --output-dir /path/to/output  # parity subset
uv run python -m <adapter_name>.main --output-dir /path/to/output                 # full dataset

Step complete when: Parity plan is agreed with the team (agents, models, number of runs), and any required agent implementations are working on both the original benchmark and Harbor sides.


Step 5. Run Parity Experiments

Definition

Parity is the property that the Harbor adaptation and the original benchmark measure the same quantity. Under identical conditions — same agent, model, prompts, and configuration — scores on both sides should be statistically indistinguishable. An adapter without demonstrated parity cannot be accepted, because its results would not be comparable to results reported against the upstream benchmark.

Matching criterion

The two sides match if and only if their run-score ranges overlap:

max(side_A_runs) >= min(side_B_runs)  AND  max(side_B_runs) >= min(side_A_runs)

Equivalently, neither side's lowest run may exceed the other side's highest run. Means that are close but whose run ranges do not overlap do not constitute a match. Non-overlap should be treated as a systematic adaptation error until evidence indicates otherwise.
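
The criterion is mechanical to check from the per-run score lists recorded in Step 6. A minimal sketch:

def ranges_overlap(side_a_runs: list[float], side_b_runs: list[float]) -> bool:
    """True if the run-score ranges of the two sides overlap."""
    return (max(side_a_runs) >= min(side_b_runs)
            and max(side_b_runs) >= min(side_a_runs))


# These ranges overlap, so the two sides match:
print(ranges_overlap([44.0, 45.5, 46.1], [43.8, 45.0, 45.6]))  # True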

Reporting format: mean ± sample SEM

All scalar-uncertainty fields in parity_experiment.json and the README parity table (original, harbor, and any Score ± … column) MUST be reported as mean ± sample standard error of the mean (sample SEM) — not sample standard deviation. Parity is a claim about how well each side estimates the true benchmark score; SEM is the uncertainty of that estimate and shrinks as runs are added, while sample std describes per-run spread and does not. Reporting sample std inflates apparent uncertainty and can mask a genuinely diverging adapter.

For n ≥ 2 runs with per-run scores x₁, x₂, …, xₙ and mean x̄:

sample SEM = sqrt( Σ (xᵢ - x̄)² / ( n (n - 1) ) )

Notes: SEM is undefined for n < 2 (require ≥ 2 runs per side; 3+ preferred). original_runs / harbor_runs are the source of truth — reviewers recompute from them to verify the reported string. Keep units consistent with the raw runs (don't mix 45.2 and 45.2%).
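
A short sketch of the computation, matching the formula above:

import math


def mean_and_sem(runs: list[float]) -> tuple[float, float]:
    """Return (mean, sample SEM); requires at least two runs."""
    n = len(runs)
    if n < 2:
        raise ValueError("sample SEM is undefined for fewer than 2 runs")
    mean = sum(runs) / n
    sem = math.sqrt(sum((x - mean) ** 2 for x in runs) / (n * (n - 1)))
    return mean, sem


mean, sem = mean_and_sem([44.0, 45.5, 46.1])
print(f"{mean:.1f} ± {sem:.1f}")  # 45.2 ± 0.6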

Checklist BEFORE any parity run

Required execution order

Full runs must be completed symmetrically. The required order is: sanity check on 5-10 tasks (both sides) → one full run (both sides) → scale to three runs (both sides). Completing three runs on one side before the other is not permitted, because an adaptation bug discovered later forces the work to be re-run and incurs avoidable cost.

Before starting a parity run, confirm each item below on both the original-benchmark side and the Harbor side, and record the values in the notes field of parity_experiment.json (or the parity notes section of the README) for reproducibility.

| # | Check | Requirement | Verification |
|---|-------|-------------|--------------|
| 1 | Harbor is on the latest upstream | main is not behind origin/main | Run git fetch origin && git status; pull if behind. Stale checkouts miss recent adapter and agent updates and are a common source of spurious parity failures. |
| 2 | Agent version | Pinned and identical on both sides | For example, codex@0.1.0. Avoid latest and unpinned installs. |
| 3 | Model ID | Exact dated identifier, identical on both sides | For example, claude-sonnet-4-5-20250929. Avoid floating aliases. |
| 4 | Installation script | Identical on both sides | Diff the install commands (package-manager or curl lines). |
| 5 | Execution / entry command | Identical flags, environment variables, and working directory | Diff the launch command used on each side. |
| 6 | Tools, arguments, and parameters | Identical on both sides | Includes temperature, max tokens, tool list, system prompt, turn limit, and timeouts. |
| 7 | Prompt modifications | Applied symmetrically | Any prompt change on one side must be applied on the other. |
| 8 | Ramp-up order | sanity (5-10 tasks, both sides) → 1 full run (both sides) → 3 runs (both sides) | See the execution-order callout above. |

Run command

harbor run -p datasets/<adapter-name> -a <agent> -m <model>

Debug playbook (when parity does not match)

If scores do not overlap per the matching criterion, work through the steps below in order. Each step narrows the hypothesis space; re-running without diagnosis is not an acceptable substitute.

| # | Action | When to run | Signal to look for | Decision |
|---|--------|-------------|--------------------|----------|
| 1 | Resolve errors | Always first | Crashes, timeouts, Docker build failures, or verifier errors in the logs | Fix infrastructure and re-run before interpreting scores. A failed run should not be treated as a low score. |
| 2 | Inspect agent trajectories | After the logs are clean | On successes, whether the success is legitimate. On failures, whether the failure reflects a genuine capability gap or a harness issue (e.g., incorrect path, misread test output). | If a harness issue is found, fix the adapter. Otherwise proceed. |
| 3 | Overlap analysis | After trajectories look reasonable | Per-task resolution on each side; list tasks resolved on A-only and on B-only. | A roughly symmetric split suggests sampling noise (proceed to step 4). A lopsided split suggests systematic adaptation error (proceed to step 7 or return to the adapter). |
| 4 | Distinguish randomness from systematic error | Disagreements persist across runs | Tasks that fail on Harbor in every run (or succeed in every run) while flipping on the original side indicate a systematic issue. Disagreements that shuffle across runs indicate noise. | Systematic issues typically originate in the environment, instruction, test, or solution. Otherwise proceed to step 5. |
| 5 | Characterize high-variance tasks individually | A small number of tasks show large variance but appear correctly adapted | Run those tasks 5-10 times in isolation to estimate their variance | Use the resulting distribution to decide whether variance alone explains the gap. Avoid scaling the full dataset to investigate a few noisy tasks. |
| 6 | Lock configuration for all repeats | Any time you re-run | Same Harbor commit, agent version, Docker base image, and environment variables | Look up and record these values before running. Mixing configurations across repeats invalidates the comparison. |
| 7 | Diff the agent implementation | Tasks appear correctly adapted but scores still diverge | Differences in wrappers, tool definitions, prompt construction, retry logic, turn limits, or tool-output handling | Any such difference is a candidate root cause and should be resolved before attributing the gap to the adapter. |
| 8 | Scale up only after fixes | After any corrective change | 5-10 task sanity check → 1 full run (both sides) → 3 runs (both sides) | Do not complete three runs on one side before the other. See the execution-order callout above. |

Common sources of adaptation error

Any asymmetry between the two sides that is not attributable to sampling noise constitutes adaptation error. Check the following in order:

  • Environment: Dockerfile, installed packages, working directory, file permissions, mounted paths.
  • Instruction: wording changes, missing context, leaked solutions, additional constraints.
  • Tests: differing assertions, tolerances, reward mappings, or timeouts.
  • Solution: oracle passes on one side but not the other.
  • Agent: installation, entry command, tools, arguments, system prompt, retry and turn logic.
  • Harbor version: stale checkout missing a recent adapter or feature fix (revisit checklist item 1).

Step complete when: multiple runs on both sides satisfy the overlap criterion, and any remaining gaps are documented with evidence that they arise from sampling noise rather than adaptation error.


Step 6. Record Parity Results

Create parity_experiment.json in your adapter directory. The file is a JSON array; each entry is one agent+model parity experiment.

parity_experiment.json field reference

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| adapter_name | string | Yes | Adapter name (e.g., "swe-bench") |
| agent | string | Yes | Agent with version (e.g., "codex@1.0") |
| model | string | Yes | Full model identifier (e.g., "gpt-5-2025-06-01") |
| date | string | Yes | Experiment date (e.g., "2025-06-15") |
| adapted_benchmark_size | integer | Yes | Total tasks converted by adapter (full set) |
| parity_benchmark_size | integer | Yes | Tasks used for parity. Equals adapted_benchmark_size if full set |
| number_of_runs | integer | Yes | Runs per side. Should be identical for original and Harbor |
| notes | string | No | Additional explanations |
| original_parity_repo | string | Yes | Fork URL for reproducing parity on original benchmark |
| adapter_pr | string[] | Yes | All adapter PR links in harbor repo |
| dataset_pr | string[] | Yes | All PR links in harbor-datasets repo |
| parity_pr | string[] | Yes | All PR links to HuggingFace parity dataset |
| metrics | object[] | Yes | Metric comparison objects (see below) |

metrics entry fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| benchmark_name | string | Yes | Original benchmark name |
| metric | string | Yes | Metric name (e.g., "pass@1", "resolve_rate") |
| original | string | Yes | mean ± sample SEM on original (e.g., "45.2 ± 1.3"). See Reporting format. |
| harbor | string | Yes | mean ± sample SEM on Harbor (e.g., "44.8 ± 1.1"). See Reporting format. |
| original_runs | number[] | Yes | Individual scores per run on original |
| harbor_runs | number[] | Yes | Individual scores per run on Harbor |

Example

[
  {
    "adapter_name": "my-benchmark",
    "agent": "codex@1.0",
    "model": "gpt-5-2025-06-01",
    "date": "2025-06-15",
    "adapted_benchmark_size": 500,
    "parity_benchmark_size": 500,
    "number_of_runs": 3,
    "notes": "None",
    "original_parity_repo": "https://github.com/user/my-benchmark-fork",
    "adapter_pr": ["https://github.com/harbor-framework/harbor/pull/123"],
    "dataset_pr": ["https://github.com/laude-institute/harbor-datasets/pull/45"],
    "parity_pr": ["https://huggingface.co/datasets/harborframework/parity-experiments/discussions/12"],
    "metrics": [
      {
        "benchmark_name": "my-benchmark",
        "metric": "pass@1",
        "original": "45.2 ± 1.3",
        "harbor": "44.8 ± 1.1",
        "original_runs": [44.0, 45.5, 46.1],
        "harbor_runs": [43.8, 45.0, 45.6]
      }
    ]
  }
]
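
Because reviewers recompute the reported strings from the raw run arrays, it is worth self-checking the file before submission. A hedged sketch (the one-decimal rounding is an assumption, not a Harbor rule):

import json
import math


def mean_sem(runs):
    n = len(runs)
    mean = sum(runs) / n
    return mean, math.sqrt(sum((x - mean) ** 2 for x in runs) / (n * (n - 1)))


with open("parity_experiment.json") as f:
    experiments = json.load(f)

for exp in experiments:
    for m in exp["metrics"]:
        for side in ("original", "harbor"):
            mean, sem = mean_sem(m[f"{side}_runs"])
            print(f'{m["metric"]} {side}: reported "{m[side]}", recomputed {mean:.1f} ± {sem:.1f}')
        overlap = (max(m["original_runs"]) >= min(m["harbor_runs"])
                   and max(m["harbor_runs"]) >= min(m["original_runs"]))
        print(f'{m["metric"]}: run-score ranges overlap = {overlap}')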

README parity table

Include this table in the adapter README. Scores are mean ± sample SEM as defined in Reporting format:

| Agent | Model | Metric | Runs | Dataset Size | Original (mean ± SEM) | Harbor (mean ± SEM) |
|-------|-------|--------|------|--------------|-----------------------|---------------------|
| codex@1.0 | gpt-5 | pass@1 | 5    | 2000 (100%)  | 45.2 ± 1.3            | 44.8 ± 1.1          |

Also include links to: original benchmark repo, forked repo (if applicable), dataset PR, HuggingFace parity PR, adapter PR.

Step complete when: parity_experiment.json is valid JSON, all required fields are populated, original and harbor are reported as mean ± sample SEM and are consistent with the raw original_runs / harbor_runs arrays, the run-score ranges overlap per the matching criterion, and the README includes the parity summary table with links. If scores diverge significantly, investigate before proceeding.


Step 7. Upload Parity Results

Upload parity and oracle results to harborframework/parity-experiments on HuggingFace.

Recommended: Use the parity upload skill to automate this. It handles sparse checkouts, LFS tracking, and HF-specific PR refs.

Required directory structure

adapters/
└── {adapter_name}/
    ├── README.md
    ├── config.yaml
    ├── original_parity/
    ├── harbor_parity/
    ├── oracle/
    └── results_collection/
        ├── result_{original/harbor}_trial1.json
        ├── result_{original/harbor}_trial2.json
        └── result_{original/harbor}_trial{N}.json

Step complete when: PR to the HuggingFace parity-experiments dataset is submitted with all result files in the expected directory structure.


Step 8. Register the Dataset

A dataset is a collection of tasks with a many-to-many relationship: the same task can appear in multiple datasets, and a dataset can aggregate tasks from multiple adapters. Both datasets and tasks are namespaced as {organization}/{name}, for example {organization}/{dataset} for a dataset and {organization}/{task-id} for a task.

Step 1. Generate the dataset directory with your adapter. Store it in the harbor-datasets GitHub repo, or the HuggingFace mirror if it's too large for GitHub.

git clone https://github.com/{your-github-username}/harbor-datasets.git
cd harbor/adapters/<adapter-name>
uv run python -m <adapter_name>.main --output-dir /path/to/harbor-datasets/datasets/<adapter-name>

Step 2. Create dataset.toml at the root of the dataset directory (e.g., harbor-datasets/datasets/<adapter-name>/dataset.toml).

cd /path/to/harbor-datasets/datasets/<adapter-name>
harbor init  # select "dataset"; enter the dataset name as <org>/<dataset>

Step 3. Edit dataset.toml to fill in the description: parity results summary, adapter author credits (first authors only if the list is long), and acknowledgments.

Step 4. Verify the dataset runs locally before submitting, using the -p (path) parameter:

harbor run -p /path/to/your/dataset

Note: Registry testing (-d) is only available after publishing. Use -p for all pre-publish testing.

Step 5. Open a PR to harbor-datasets with the tasks directory and dataset.toml. Request @Slimshilin for review. The Harbor team publishes the dataset to the registry after approval.

Step 6. After publishing, verify the dataset loads and runs from the registry:

harbor run -d <organization-name>/<adapter-name>

Naming rules

| Rule | Requirement |
|------|-------------|
| Dataset ID | <org>/<dataset_name>, e.g., openai/mmmlu. Entered interactively during harbor init. |
| Task ID | <org>/<task_name>. Every generated task.toml must contain a name field. Tasks without a name cannot be added to a dataset. |
| Choosing <org> | Prefer the benchmark's owning organization (e.g., openai/mmmlu). If there is no clear single owner or there are multiple, use the benchmark name itself as the organization (e.g., terminal-bench/terminal-bench). |
| Name stability | Task names must be unique within the dataset and stable across adapter runs. Unstable names churn registry digests on republish. |
| Fallback scheme | If the upstream benchmark lacks stable task identifiers, mint a deterministic scheme in adapter code (e.g., {dataset}-1, {dataset}-2, ...) derived from a reproducible sort of upstream tasks. |
| Sanitization | Sanitize upstream identifiers before using them as names: lowercase, replace spaces/slashes/special characters with hyphens, avoid leading/trailing separators. |

Agent instruction: before writing dataset.toml, verify every generated task.toml contains a name field. If any are missing, fix main.py and regenerate. Do not hand-edit generated task directories. Treat main.py as the source of truth for task names.
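
A minimal sketch of such a sanitizer and a deterministic fallback (the regex and naming pattern are illustrative choices, not a Harbor requirement):

import re


def sanitize_task_name(upstream_id: str) -> str:
    """Lowercase, replace spaces/slashes/special characters with hyphens,
    and trim leading/trailing separators."""
    name = re.sub(r"[^a-z0-9]+", "-", upstream_id.lower())
    return name.strip("-")


def fallback_names(dataset: str, upstream_ids: list[str]) -> dict[str, str]:
    """Deterministic {dataset}-<n> names from a reproducible sort of upstream ids."""
    return {uid: f"{dataset}-{i + 1}" for i, uid in enumerate(sorted(upstream_ids))}


print(sanitize_task_name("SWE Bench/Task 001"))  # swe-bench-task-001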

Versioning

Dataset versions are publish-time tags, not a field in dataset.toml. The Harbor team applies tags when publishing to the registry. Users resolve a specific version with -d <org>/<adapter_name>@<tag>. Every publish also receives the latest tag automatically, so -d <org>/<adapter_name> (no @) always points at the newest release.

| Tag | When to use |
|-----|-------------|
| v1.0 | Default for the first release |
| v1.1, v2.0, ... | Subsequent releases; previous tags stay pinned to their snapshots |
| verified, lite, ... | Mirror upstream naming when the original benchmark has named splits |
| parity | Parity subset (generated via --split parity) |

To request a version, state the desired tag(s) in your adapter PR description. To cut a new version later (e.g., a bug fix), open a follow-up PR and request the new tag.

Agent instruction: do not add a version key to dataset.toml to control the published version; that does nothing. Do not change version = "1.0" in task.toml; that's the task-config schema version and must stay "1.0". The only way to select a version is to request a tag in the PR description.

Step complete when: Dataset is published to the registry, harbor run -d <organization-name>/<adapter-name> passes oracle tests, and the PR to harbor-datasets is merged.


Step 9. Document and Submit

README requirements

The README is parsed by automation

Downstream tooling parses this README to generate parity summaries and registry metadata, so the template produced by harbor adapter init must be followed as written.

Required:

  • Fill in every section the template defines.
  • Place any additional context — caveats, deviations, or commentary — in the Notes section, or in the notes field of parity_experiment.json or adapter_metadata.json.

Not permitted:

  • Adding new top-level sections.
  • Renaming, reordering, or removing template sections.

Deviations from the template prevent automated parsing; when unsure where to place content, use the Notes section.

The required content must appear in the following template-defined locations:

| Content | Location |
|---------|----------|
| Benchmark bugs discovered and how they were handled | Notes |
| Special treatments (prompt modifications, environment adjustments) | Notes |
| Deviations from the original benchmark and the rationale | Notes |
| Agent implementation details (if custom agents were added) | Agents section |
| Known limitations | Notes |
| Reproduction scripts for parity experiments (both sides) | Parity result section |

Update the original benchmark fork's README

If you forked the original benchmark repository for parity (Scenario 2 or 3), update the fork's README to include reproduction scripts for running Harbor parity experiments. This makes it easy for others to reproduce results on the original benchmark side. Also include the reproduction steps in your adapter's README for completeness.

adapter_metadata.json schema

Create harbor/adapters/{adapter_name}/adapter_metadata.json.

Top-level fields:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| adapter_name | string | Yes | Adapter name |
| adapter_builders | string[] | Yes | Builder names with email, e.g., ["Jane Doe (jane@example.com)"] |
| original_benchmark | object[] | Yes | Original benchmark split descriptors |
| harbor_adapter | object[] | Yes | Harbor adapter split descriptors |

original_benchmark entry fields:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| split | string | Yes | Split name (use "full" if none) |
| size | integer | Yes | Number of tasks in Harbor context |
| harness | string | Yes | "agent", "llm", or "None" |
| supported_agents | string[] | Yes | Use agent@version format. ["None"] if none |
| adaptable | boolean | Yes | Whether this split can be converted |
| notes | string | No | Additional clarification. "None" if N/A |

harbor_adapter entry fields:

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| split | string | Yes | Corresponding split. "full" if collective |
| adapted_benchmark_size | integer | Yes | Tasks convertible by adapter |
| parity_benchmark_size | integer | Yes | Tasks used for parity |
| parity_sampling_rate | number | Yes | parity_benchmark_size / adapted_benchmark_size |
| registry_benchmark_size | integer | Yes | Exact task count in registry |
| added_agents | string[] | Yes | Custom agents added. ["None"] if none |
| parity_matching_agents | string[] | Yes | Agents with comparable scores (agent@version+model) |
| parity_unmatching_agents | string[] | Yes | Agents without comparable scores. ["None"] if all matched |
| parity_costs | string | Yes | Total USD (e.g., "$150") |
| notes | string | No | "None" if N/A |

If parity ran across three systems (Harbor ↔ Terminal-Bench ↔ Original), include a "tb_adapter" key with the same structure.

Example

[
  {
    "adapter_name": "my-benchmark",
    "adapter_builders": ["Jane Doe (jane@example.com)"],
    "original_benchmark": [
      {
        "split": "full",
        "size": 500,
        "harness": "agent",
        "supported_agents": ["codex@0.1.0"],
        "adaptable": true,
        "notes": "None"
      }
    ],
    "harbor_adapter": [
      {
        "split": "full",
        "adapted_benchmark_size": 500,
        "parity_benchmark_size": 100,
        "parity_sampling_rate": 0.2,
        "registry_benchmark_size": 500,
        "added_agents": ["None"],
        "parity_matching_agents": ["codex@0.1.0+gpt-5-2025-06-01"],
        "parity_unmatching_agents": ["None"],
        "parity_costs": "$150",
        "notes": "Parity on 100-task subset. Averaged over 3 trials."
      }
    ]
  }
]

Submit

  1. Change PR title from [WIP] Adapter: {adapter_name} to [Ready for Review] Adapter: {adapter_name}.
  2. Request review from @Slimshilin.

Step complete when: PR title is [Ready for Review] Adapter: {adapter_name}, README covers all required sections, adapter_metadata.json passes schema validation, and review is requested from @Slimshilin.


Reference: Terminal-Bench Migration

Important: The Harbor adapter must be isolated from the Terminal-Bench repo. Do not write a mechanical translation script. Write fresh adapter code following the Harbor process.

| Aspect | Terminal-Bench | Harbor |
|--------|----------------|--------|
| Config | task.yaml | task.toml |
| Instruction | In task.yaml | Separate instruction.md |
| Dockerfile | Root level | environment/Dockerfile |
| Solution | solution.sh | solution/solve.sh |
| Tests | run-tests.sh + tests/test_outputs.py | tests/test.sh |
| Docker Compose | docker-compose.yaml in task root | environment/docker-compose.yaml for GPU tasks; not needed otherwise |
| Verification | Exit code (pytest) | Reward file: /logs/verifier/reward.txt |
| Output dir | tasks/ | datasets/ |
| Registry | Dataset-level dataset_path | Task-level via dataset.toml + harbor init |
| CLI | tb run --dataset | harbor run -d / -t / -p |
| Metrics | Binary pass/fail | Float rewards, multiple metrics |

Important: If Terminal-Bench used a tweaked metric, re-implement the original benchmark's metrics instead.

Migration steps

  1. Convert task.yaml to task.toml + instruction.md
  2. Move files: Dockerfile → environment/, solution.sh → solution/solve.sh, run-tests.sh → tests/test.sh
  3. For GPU tasks, move docker-compose.yaml into environment/; for non-GPU tasks, remove it
  4. Update test scripts to write rewards to /logs/verifier/reward.txt (Harbor mounts /logs/verifier at runtime)
  5. Update adapter code: change output dir from tasks/ to datasets/, create subdirectories (environment/, solution/, tests/), split instruction into instruction.md, convert YAML generation to TOML
  6. Use harbor init + dataset.toml for registry (replaces the old registry.json)

Registry format conversion

Before (Terminal-Bench registry.json):

{
    "name": "my-adapter",
    "version": "head",
    "description": "...",
    "github_url": "https://github.com/laude-institute/terminal-bench-datasets.git",
    "dataset_path": "datasets/my-adapter",
    "task_id_subset": null
}

After (Harbor):

harbor init  # select "dataset", creates dataset.toml
# Edit dataset.toml with descriptions, authors, credits
# Then submit to Harbor team for publishing

See Step 8 for the full publishing workflow.

task.yaml → task.toml conversion example

Before:

instruction: |
  Your task instruction here...
author_email: original-bench-author@email.com
author_name: Original Bench Author Name
difficulty: hard
category: programming
tags:
  - debugging
  - python
parser_name: swebench
max_agent_timeout_sec: 3000.0
max_test_timeout_sec: 3000.0

After (task.toml):

version = "1.0"

[task]
name = "my-benchmark/task-001"

[metadata]
author_email = "original-bench-author@email.com"
author_name = "Original Bench Author Name"
difficulty = "hard"
category = "programming"
tags = ["debugging", "python"]

[agent]
timeout_sec = 3000.0

[verifier]
timeout_sec = 3000.0

After (instruction.md):

Your task instruction here...

test.sh conversion example

Before (Terminal-Bench):

#!/bin/bash
pytest tests/ > test_results.txt
if [ $? -eq 0 ]; then echo "PASSED" > /tmp/test_marker.txt; else echo "FAILED" > /tmp/test_marker.txt; fi

After (Harbor):

#!/bin/bash
pytest /tests/test_*.py
if [ $? -eq 0 ]; then echo 1 > /logs/verifier/reward.txt; else echo 0 > /logs/verifier/reward.txt; fi

Key differences:

  • Harbor mounts /logs/verifier for test outputs at runtime.
  • Write numeric reward (can be float) to /logs/verifier/reward.txt.
  • Can still use pytest, but final output must be the reward file.
