Adapters (Agent Guide)
Comprehensive adapter spec for AI agents building Harbor adapters. Contains full schemas, directory structures, commands, and validation criteria.
Looking for the human-readable guide?
This page is the comprehensive spec optimized for AI agents. For a concise walkthrough, see the Adapters (Human Guide).
Purpose
An adapter translates an existing benchmark into Harbor's task format. This document is the authoritative reference for building one. Follow steps 1-9 in order.
Check the Adapter List for available benchmarks. Contact Lin Shi or join Discord #adapters-announcements for coordination. The team covers API costs for parity experiments.
Quick Start
harbor dataset list # list available datasets
harbor adapter init # interactive scaffold
harbor adapter init my-adapter --name "My Name" # non-interactive scaffold
Required Directory Structures
Generated task directory (one per task)
<adapter-name>/
└── <task-id>/
    ├── task.toml         # task configuration and metadata
    ├── instruction.md    # task instructions for the agent
    ├── environment/
    │   └── Dockerfile    # container environment definition
    ├── solution/
    │   └── solve.sh      # oracle solution script
    └── tests/
        ├── test.sh       # test execution script
        └── test_*.py     # (optional) pytest test files
Task naming requirement: Every generated task.toml must contain a name field. Harbor uses this field to identify the task when it's added to a dataset; tasks without a name cannot be registered. Adapter code is responsible for deriving a valid, unique, registry-safe name for every task: sanitize upstream identifiers (lowercase, replace spaces/slashes/special characters with hyphens). See §Step 8 Naming rules for the full naming contract, and the task format for the rest of the task structure. Each generated directory must contain at minimum task.toml, instruction.md, environment/Dockerfile, solution/solve.sh, and tests/test.sh.
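The sanitization rule above (lowercase; spaces, slashes, and special characters become hyphens) can be sketched as a small helper. This is illustrative adapter code, not a Harbor-provided function; the name sanitize_task_name is an assumption:

```python
import re

def sanitize_task_name(upstream_id: str) -> str:
    """Derive a registry-safe task name from an upstream identifier:
    lowercase, collapse runs of spaces/slashes/special characters into
    single hyphens, and strip leading/trailing separators."""
    name = re.sub(r"[^a-z0-9]+", "-", upstream_id.lower())
    return name.strip("-")
```

For example, an upstream id like "SWE-bench/Task 001" becomes "swe-bench-task-001".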
Adapter code directory
Generated by harbor adapter init, this is a Python package using src layout:
harbor/adapters/<adapter-name>/
├── .python-version          # Python version (optional, created by uv init)
├── pyproject.toml           # Python package config (created by uv init)
├── README.md                # final documentation (step 9)
├── adapter_metadata.json    # structured metadata (step 9)
├── parity_experiment.json   # parity results (step 6)
├── run_<adapter-name>.yaml  # reference config to run the full adapted dataset
└── src/
    └── <adapter_name>/      # adapter-name with dashes → underscores
        ├── __init__.py
        ├── adapter.py       # main logic: parse benchmark, generate task dirs
        ├── main.py          # CLI entry point (must support --output-dir)
        └── task-template/   # template files copied into each task
            ├── task.toml
            ├── instruction.md
            ├── environment/
            │   └── Dockerfile
            ├── solution/
            │   └── solve.sh
            └── tests/
                └── test.sh
Key requirements for main.py
- Must support --output-dir to specify where generated tasks are written.
- Must support --limit, --overwrite, and --task-ids flags.
- Run via uv run python -m <adapter_name>.main --output-dir <path>.
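The flag contract above can be sketched as a minimal main.py skeleton. This is a sketch: the parser wiring reflects the required flags, while the generation body is adapter-specific and left as a placeholder:

```python
import argparse
from pathlib import Path

def build_parser() -> argparse.ArgumentParser:
    """CLI contract required of every adapter's main.py."""
    parser = argparse.ArgumentParser(description="Generate Harbor task directories")
    parser.add_argument("--output-dir", type=Path, required=True,
                        help="where generated task directories are written")
    parser.add_argument("--limit", type=int, default=None,
                        help="generate at most N tasks")
    parser.add_argument("--overwrite", action="store_true",
                        help="overwrite existing task directories")
    parser.add_argument("--task-ids", nargs="*", default=None,
                        help="generate only these upstream task ids")
    return parser

def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    # Adapter-specific: parse the benchmark, filter by args.task_ids and
    # args.limit, then write each task directory under args.output_dir.

if __name__ == "__main__":
    main()
```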
Step 1. Understand the Original Benchmark
Identify these four components for every task in the benchmark:
| Component | What to find |
|---|---|
| Instructions | How tasks are described; what information agents receive |
| Environments | Docker setup, system dependencies, file structures |
| Tests | Evaluation method: deterministic unit tests, LLM-as-a-Judge, etc. |
| Solutions | Oracle/reference solutions; if none exist, whether LLM generation is feasible |
Study the benchmark's repository, documentation, and code structure.
Step complete when: You can describe, for each task, the instruction text, environment setup, test/verification method, and oracle solution.
Step 2. Fork and Develop Adapter Code
Run harbor adapter init to scaffold your adapter. The generated README.md doubles as a requirements checklist.
git clone https://github.com/{your-github-username}/harbor.git
cd harbor
git checkout -b {your-adapter-name}-adapter
Develop your adapter under adapters/{adapter-name}/. Refer to existing adapters in that directory.
Adapter component reference
| Component | Description |
|---|---|
| src/<adapter_name>/adapter.py | Core logic: parse benchmark data, generate task directories. |
| src/<adapter_name>/main.py | CLI entry point. Must support --output-dir, --limit, --overwrite, --task-ids. |
| src/<adapter_name>/task-template/ | Template files copied into each generated task. |
| parity_experiment.json | Parity results (see Step 6 for the full schema). |
| run_{adapter-name}.yaml | Reference config to run the full adapted dataset. |
| README.md | Write last, before PR submission. Fill in the README generated by harbor adapter init. |
| Metrics / Rewards | Harbor supports multiple float-valued metrics as rewards (RL-compatible). Use the same metrics as the original benchmark. |
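The relationship between adapter.py and task-template/ can be sketched roughly as follows. This is illustrative only: real adapters also render instruction.md, solve.sh, and the tests from benchmark data, and the NAME_PLACEHOLDER token is an assumed template convention, not a Harbor requirement:

```python
import shutil
from pathlib import Path

def generate_task(template_dir: Path, task_id: str, name: str,
                  output_dir: Path, overwrite: bool = False) -> Path:
    """Copy task-template/ into <output_dir>/<task_id>, then patch the
    required [task].name field into task.toml."""
    task_dir = output_dir / task_id
    if task_dir.exists() and not overwrite:
        raise FileExistsError(task_dir)
    shutil.copytree(template_dir, task_dir, dirs_exist_ok=overwrite)
    toml_path = task_dir / "task.toml"
    toml_path.write_text(toml_path.read_text().replace("NAME_PLACEHOLDER", name))
    return task_dir
```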
Targeting Windows containers. Adapters that emit Windows-targeted tasks must set [environment].os = "windows" in task.toml and ship solve.bat/test.bat instead of .sh. Users who need PowerShell can call it from within a .bat file. Linux is the default and requires no change. See Windows tasks.
Task file reference
task.toml: Every task must include this configuration file. The name field is required for registry. The version field must stay "1.0". Adjust timeouts to match your benchmark's complexity.
version = "1.0"
[task]
name = "my-benchmark/task-001"
[metadata]
author_name = "Original benchmark authors' names"
author_email = "benchmark-authors@email.com"
difficulty = "medium"
category = "programming"
tags = ["debugging", "python"]
[agent]
timeout_sec = 1800.0
[verifier]
timeout_sec = 120.0
[environment]
build_timeout_sec = 600.0
cpus = 1
memory_mb = 2048
storage_mb = 10240
For LLM-as-a-Judge verifiers, add a [verifier.env] section to pass the judge model and API key:
[verifier.env]
OPENAI_API_KEY = "${OPENAI_API_KEY}"
MODEL_NAME = "openai/gpt-5-2025-08-07"
tests/test.sh: Must write a numeric reward (integer or float, 0 to 1) to /logs/verifier/reward.txt. Harbor mounts /logs/verifier/ at runtime. The script should exit 0 on success and non-zero on failure.
#!/bin/bash
pytest /tests/test_*.py
if [ $? -eq 0 ]; then
  echo 1 > /logs/verifier/reward.txt
else
  echo 0 > /logs/verifier/reward.txt
fi
instruction.md: Write agent-actionable instructions, not raw benchmark descriptions. The agent reads this file to understand what to do. Include the goal, constraints, expected output location, and any files the agent should modify. Do not include test answers or oracle solutions.
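The tests/test.sh example above writes a binary reward; benchmarks that award partial credit can honor the same /logs/verifier/reward.txt contract with a fractional value. A minimal Python helper in that spirit (a sketch; the helper name is illustrative, and test.sh would invoke it after running the tests):

```python
from pathlib import Path

def write_reward(passed: int, total: int,
                 reward_file: Path = Path("/logs/verifier/reward.txt")) -> float:
    """Write the fraction of passing tests (clamped to [0, 1]) as the reward.
    Harbor reads this file after tests/test.sh exits."""
    reward = 0.0 if total == 0 else max(0.0, min(1.0, passed / total))
    reward_file.parent.mkdir(parents=True, exist_ok=True)
    reward_file.write_text(f"{reward}\n")
    return reward
```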
environment/Dockerfile: Set up the container the agent will work in. Install system and Python dependencies, copy any benchmark-specific data files, and set the working directory. The agent and verifier both run inside this container. No terminal-bench or harbor canary string should occur here.
FROM python:3.13-slim
WORKDIR /workspace
RUN apt-get update && apt-get install -y \
git \
&& rm -rf /var/lib/apt/lists/*
# Install benchmark-specific dependencies
# RUN pip install --no-cache-dir <packages>
# Copy task-specific files
# COPY . /workspace/
GPU tasks
If your benchmark includes tasks that require GPU (e.g., CUDA, Triton kernels), add a docker-compose.yaml in the task's environment/ directory with nvidia device reservations for Docker runs. For cloud/Modal runs, also set gpus in task.toml. See the featurebench adapter for a comprehensive example; it handles 44 GPU tasks across multiple repos with separate CPU/GPU/Modal config files.
Writing run_{adapter-name}.yaml
This config file serves as the single entry point for all experiments: oracle verification, parity runs, and general benchmarking. Keep the oracle agent as the default (uncommented) and include other agents as commented-out alternatives so anyone can quickly switch.
datasets:
  - path: datasets/<adapter-name>

# Default: oracle agent for verification
agents:
  - name: oracle

# Uncomment to run with other agents:
# agents:
#   - name: codex
#     model_name: openai/gpt-5-mini
#
# agents:
#   - name: claude-code
#     model_name: claude-sonnet-4-5-20250929

environment:
  type: docker
  delete: true

orchestrator:
  type: local
  n_concurrent_trials: 4
You can also create additional config files for different scenarios (e.g., parity subsets, CPU-only vs GPU, Modal). For example, featurebench provides featurebench_docker_cpu.yaml, featurebench_docker_gpu.yaml, featurebench_modal.yaml, and featurebench_parity.yaml.
Usage:
# Oracle verification (default)
harbor run -c adapters/<adapter-name>/run_<adapter-name>.yaml
# Switch agent by uncommenting the desired agent block
Rules
- Prompt modifications (e.g., "write files in place without asking") are acceptable if applied to both the original benchmark and Harbor adapter.
- Adapting a subset of tasks is acceptable (e.g., only SWE-Bench-Verified). Document all exclusions in the README.
Step complete when: main.py produces a valid task directory for each task containing task.toml, instruction.md, environment/Dockerfile, solution/solve.sh, and tests/test.sh.
Step 3. Verify Oracle Solutions
Run your adapter with the oracle agent and confirm 100% reward on all tasks. Validating oracle solutions is a straightforward way to check:
- Adaptation correctness: a wrong adaptation usually causes oracle failures.
- Oracle solution bugs: cross-validate by running oracle on the original benchmark side as well to determine whether a failure is due to the solution itself or the Harbor adaptation.
- Environment issues: Docker build failures or broken verification tests make tasks impossible for agents to solve, so catching these early is critical.
Run commands
| Method | Command | When to use |
|---|---|---|
| Single task | harbor trial start -p datasets/<adapter-name>/<task-id> -a <agent> -m <model> | Testing individual tasks |
| Entire dataset | harbor run -p datasets/<adapter-name> -a <agent> -m <model> | Full oracle verification |
| Config file | harbor run -c adapters/<adapter-name>/<config>.yaml -a <agent> -m <model> | Reproducible runs (see example configs) |
| Registry: single task | harbor run -t <organization-name>/<task-id> -a <agent> -m <model> | Post-publish single task |
| Registry: full dataset | harbor run -d <organization-name>/<adapter-name> -a <agent> -m <model> | Post-publish full dataset (after Step 8) |
Write a reference config YAML for your adapter to ensure reproducibility.
README ordering note: In the final adapter README, list the registry method (Option 5) first; it is the primary user-facing run method. Adapter code and local-path methods are for development/reproduction.
After oracle passes
- Create a WIP PR titled [WIP] Adapter: {adapter_name}.
- Include a screenshot of the terminal showing 100% oracle pass results.
Broken oracles in the original benchmark
If a fix is straightforward, propose it on the original fork and upstream repo as a GitHub Issue or PR, then document the fix clearly in the adapter README. This is more robust and transparent than patching on the Harbor side. For tasks that cannot be reliably fixed or verified, document them and exclude them from the dataset.
Benchmarks without oracle solutions
Usually we require adapter contributors to build the oracle solutions themselves with the help of AI. Building oracle solutions is a separate process from running parity experiments (Step 5), so they can progress in parallel. However, before running parity, you need to validate that the tasks are theoretically solvable by agents with no environment or test issues.
A recommended approach is to use a cheap agent and model to do a pass over all the tasks. You can also reuse successful agent solutions from the parity experiments as oracle solutions, then complete the remaining ones with more powerful AI plus human supervision. If things become more complicated than this, reach out to the team on Discord to discuss case by case.
Step complete when: All oracle solutions pass with 100% reward, and a WIP PR titled [WIP] Adapter: {adapter_name} is created with a screenshot of the passing results.
Step 4. Discuss Parity Plans and Implement Agents
Contact the team (e.g., Lin Shi on Discord) before running parity experiments. They determine agents, models, number of runs, and API key provisioning.
Agent implementation scenarios
Scenario 1: Compatible agents exist. The original benchmark already supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, Gemini-CLI). No extra work needed; run parity with identical settings on both sides. Example: ADEBench, where the original benchmark already supports Claude Code.
Scenario 2: LLM-based, no compatible agents. Fork the original benchmark, implement a Harbor-compatible agent there, and document it in the fork's README. Example: EvoEval, which forked the repo to add codex agent support for parity.
Scenario 3: Custom agents. The original benchmark uses custom agents unavailable in Harbor. Implement the custom agent under adapters/{name}/ and run parity experiments with it. Also run experiments with standard agents (Codex, Claude-Code) to show the adapter generalizes. There are two sub-cases here:
- Some benchmarks require registering a separate dataset for CLI-agent compatibility. Example: BixBench, FinanceAgent (also demonstrates LLM-as-a-Judge verification).
- Others do not need a separate dataset. Example: MedAgentBench supports a custom HTTPAgent matching the original GET/POST/FINISH interaction semantics.
- For multi-agent workflows where multiple agents coordinate in parallel (e.g., via Redis messaging and Docker sidecars), see CooperBench. Note that multi-agent benchmarks may not be compatible with standard single-agent CLI agents.
Keep links to any forked repositories and document the approach in the README.
Large or expensive benchmarks
If running the full benchmark is too expensive, run parity on a representative subset. Requirements:
- Document in README how the subset was selected and that parity ran on a subset.
- Support --split parity in main.py to generate only the parity subset.
- Ask the team to publish the parity subset under the parity tag so users can run -d <adapter_name>@parity. See Versioning below.
uv run python -m <adapter_name>.main --split parity --output-dir /path/to/output # parity subset
uv run python -m <adapter_name>.main --output-dir /path/to/output # full dataset
Step complete when: Parity plan is agreed with the team (agents, models, number of runs), and any required agent implementations are working on both the original benchmark and Harbor sides.
Step 5. Run Parity Experiments
Definition
Parity is the property that the Harbor adaptation and the original benchmark measure the same quantity. Under identical conditions — same agent, model, prompts, and configuration — scores on both sides should be statistically indistinguishable. An adapter without demonstrated parity cannot be accepted, because its results would not be comparable to results reported against the upstream benchmark.
Matching criterion
The two sides match if and only if their run-score ranges overlap:
max(side_A_runs) >= min(side_B_runs) AND max(side_B_runs) >= min(side_A_runs)
Equivalently, neither side's lowest run may exceed the other side's highest run. Means that are close but whose run-score ranges do not overlap do not constitute a match. Non-overlap should be treated as a systematic adaptation error until evidence indicates otherwise.
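The criterion above is a direct transcription into code (a sketch for self-checking; not a Harbor API):

```python
def ranges_overlap(side_a_runs: list[float], side_b_runs: list[float]) -> bool:
    """Matching criterion: the two sides match iff their run-score
    ranges overlap."""
    return (max(side_a_runs) >= min(side_b_runs)
            and max(side_b_runs) >= min(side_a_runs))
```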
Reporting format: mean ± sample SEM
All scalar-uncertainty fields in parity_experiment.json and the README parity table (original, harbor, and any Score ± … column) MUST be reported as mean ± sample standard error of the mean (sample SEM) — not sample standard deviation. Parity is a claim about how well each side estimates the true benchmark score; SEM is the uncertainty of that estimate and shrinks as runs are added, while sample std describes per-run spread and does not. Reporting sample std inflates apparent uncertainty and can mask a genuinely diverging adapter.
For n ≥ 2 runs with per-run scores x₁, x₂, …, xₙ and mean x̄:
sample SEM = sqrt( Σ (xᵢ - x̄)² / ( n (n - 1) ) )
Notes: SEM is undefined for n < 2 (require ≥ 2 runs per side; 3+ preferred). original_runs / harbor_runs are the source of truth — reviewers recompute from them to verify the reported string. Keep units consistent with the raw runs (don't mix 45.2 and 45.2%).
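The formula above, as code (a sketch mirroring the definition; it raises for n < 2 because SEM is undefined there):

```python
import math

def mean_and_sem(runs: list[float]) -> tuple[float, float]:
    """Return (mean, sample SEM) for a list of per-run scores."""
    n = len(runs)
    if n < 2:
        raise ValueError("sample SEM requires at least 2 runs")
    mean = sum(runs) / n
    sem = math.sqrt(sum((x - mean) ** 2 for x in runs) / (n * (n - 1)))
    return mean, sem
```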
Checklist BEFORE any parity run
Required execution order
Full runs must be completed symmetrically. The required order is: sanity check on 5-10 tasks (both sides) → one full run (both sides) → scale to three runs (both sides). Completing three runs on one side before the other is not permitted, because an adaptation bug discovered later forces the work to be re-run and incurs avoidable cost.
Before starting a parity run, confirm each item below on both the original-benchmark side and the Harbor side, and record the values in the notes field of parity_experiment.json (or the parity notes section of the README) for reproducibility.
| # | Check | Requirement | Verification |
|---|---|---|---|
| 1 | Harbor is on the latest upstream | main is not behind origin/main | Run git fetch origin && git status; pull if behind. Stale checkouts miss recent adapter and agent updates and are a common source of spurious parity failures. |
| 2 | Agent version | Pinned and identical on both sides | For example, codex@0.1.0. Avoid latest and unpinned installs. |
| 3 | Model ID | Exact dated identifier, identical on both sides | For example, claude-sonnet-4-5-20250929. Avoid floating aliases. |
| 4 | Installation script | Identical on both sides | Diff the install commands (package-manager or curl lines). |
| 5 | Execution / entry command | Identical flags, environment variables, and working directory | Diff the launch command used on each side. |
| 6 | Tools, arguments, and parameters | Identical on both sides | Includes temperature, max tokens, tool list, system prompt, turn limit, and timeouts. |
| 7 | Prompt modifications | Applied symmetrically | Any prompt change on one side must be applied on the other. |
| 8 | Ramp-up order | sanity (5-10 tasks, both sides) → 1 full run (both sides) → 3 runs (both sides) | See the execution-order callout above. |
Run command
harbor run -p datasets/<adapter-name> -a <agent> -m <model>
Debug playbook (when parity does not match)
If scores do not overlap per the matching criterion, work through the steps below in order. Each step narrows the hypothesis space; re-running without diagnosis is not an acceptable substitute.
| # | Action | When to run | Signal to look for | Decision |
|---|---|---|---|---|
| 1 | Resolve errors | Always first | Crashes, timeouts, Docker build failures, or verifier errors in the logs | Fix infrastructure and re-run before interpreting scores. A failed run should not be treated as a low score. |
| 2 | Inspect agent trajectories | After the logs are clean | On successes, whether the success is legitimate. On failures, whether the failure reflects a genuine capability gap or a harness issue (e.g., incorrect path, misread test output). | If a harness issue is found, fix the adapter. Otherwise proceed. |
| 3 | Overlap analysis | After trajectories look reasonable | Per-task resolution on each side; list tasks resolved on A-only and on B-only. | A roughly symmetric split suggests sampling noise (proceed to step 4). A lopsided split suggests systematic adaptation error (proceed to step 7 or return to the adapter). |
| 4 | Distinguish randomness from systematic error | Disagreements persist across runs | Tasks that fail on Harbor in every run (or succeed in every run) while flipping on the original side indicate a systematic issue. Disagreements that shuffle across runs indicate noise. | Systematic issues typically originate in the environment, instruction, test, or solution. Otherwise proceed to step 5. |
| 5 | Characterize high-variance tasks individually | A small number of tasks show large variance but appear correctly adapted | Run those tasks 5-10 times in isolation to estimate their variance | Use the resulting distribution to decide whether variance alone explains the gap. Avoid scaling the full dataset to investigate a few noisy tasks. |
| 6 | Lock configuration for all repeats | Any time you re-run | Same Harbor commit, agent version, Docker base image, and environment variables | Look up and record these values before running. Mixing configurations across repeats invalidates the comparison. |
| 7 | Diff the agent implementation | Tasks appear correctly adapted but scores still diverge | Differences in wrappers, tool definitions, prompt construction, retry logic, turn limits, or tool-output handling | Any such difference is a candidate root cause and should be resolved before attributing the gap to the adapter. |
| 8 | Scale up only after fixes | After any corrective change | 5-10 task sanity check → 1 full run (both sides) → 3 runs (both sides) | Do not complete three runs on one side before the other. See the execution-order callout above. |
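Step 3's per-task overlap analysis can be sketched as a set comparison over per-task resolution maps. The dict-of-bools input shape is an assumption for illustration, not a Harbor result format:

```python
def split_disagreements(resolved_a: dict[str, bool],
                        resolved_b: dict[str, bool]) -> tuple[set, set]:
    """Return (a_only, b_only): tasks resolved on exactly one side.
    A roughly symmetric split suggests sampling noise; a lopsided one
    suggests a systematic adaptation error."""
    a_only = {t for t, ok in resolved_a.items() if ok and not resolved_b.get(t, False)}
    b_only = {t for t, ok in resolved_b.items() if ok and not resolved_a.get(t, False)}
    return a_only, b_only
```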
Common sources of adaptation error
Any asymmetry between the two sides that is not attributable to sampling noise constitutes adaptation error. Check the following in order:
- Environment: Dockerfile, installed packages, working directory, file permissions, mounted paths.
- Instruction: wording changes, missing context, leaked solutions, additional constraints.
- Tests: differing assertions, tolerances, reward mappings, or timeouts.
- Solution: oracle passes on one side but not the other.
- Agent: installation, entry command, tools, arguments, system prompt, retry and turn logic.
- Harbor version: stale checkout missing a recent adapter or feature fix (revisit checklist item 1).
Step complete when: multiple runs on both sides satisfy the overlap criterion, and any remaining gaps are documented with evidence that they arise from sampling noise rather than adaptation error.
Step 6. Record Parity Results
Create parity_experiment.json in your adapter directory. The file is a JSON array; each entry is one agent+model parity experiment.
parity_experiment.json field reference
| Field | Type | Required | Description |
|---|---|---|---|
| adapter_name | string | Yes | Adapter name (e.g., "swe-bench") |
| agent | string | Yes | Agent with version (e.g., "codex@1.0") |
| model | string | Yes | Full model identifier (e.g., "gpt-5-2025-06-01") |
| date | string | Yes | Experiment date (e.g., "2025-06-15") |
| adapted_benchmark_size | integer | Yes | Total tasks converted by the adapter (full set) |
| parity_benchmark_size | integer | Yes | Tasks used for parity. Equals adapted_benchmark_size if the full set was used |
| number_of_runs | integer | Yes | Runs per side. Should be identical for original and Harbor |
| notes | string | No | Additional explanations |
| original_parity_repo | string | Yes | Fork URL for reproducing parity on the original benchmark |
| adapter_pr | string[] | Yes | All adapter PR links in the harbor repo |
| dataset_pr | string[] | Yes | All PR links in the harbor-datasets repo |
| parity_pr | string[] | Yes | All PR links to the HuggingFace parity dataset |
| metrics | object[] | Yes | Metric comparison objects (see below) |
metrics entry fields
| Field | Type | Required | Description |
|---|---|---|---|
| benchmark_name | string | Yes | Original benchmark name |
| metric | string | Yes | Metric name (e.g., "pass@1", "resolve_rate") |
| original | string | Yes | Mean ± sample SEM on the original (e.g., "45.2 ± 1.3"). See Reporting format. |
| harbor | string | Yes | Mean ± sample SEM on Harbor (e.g., "44.8 ± 1.1"). See Reporting format. |
| original_runs | number[] | Yes | Individual scores per run on the original |
| harbor_runs | number[] | Yes | Individual scores per run on Harbor |
Example
[
{
"adapter_name": "my-benchmark",
"agent": "codex@1.0",
"model": "gpt-5-2025-06-01",
"date": "2025-06-15",
"adapted_benchmark_size": 500,
"parity_benchmark_size": 500,
"number_of_runs": 3,
"notes": "None",
"original_parity_repo": "https://github.com/user/my-benchmark-fork",
"adapter_pr": ["https://github.com/harbor-framework/harbor/pull/123"],
"dataset_pr": ["https://github.com/laude-institute/harbor-datasets/pull/45"],
"parity_pr": ["https://huggingface.co/datasets/harborframework/parity-experiments/discussions/12"],
"metrics": [
{
"benchmark_name": "my-benchmark",
"metric": "pass@1",
"original": "45.2 ± 1.3",
"harbor": "44.8 ± 1.1",
"original_runs": [44.0, 45.5, 46.1],
"harbor_runs": [43.8, 45.0, 45.6]
}
]
}
]
README parity table
Include this table in the adapter README. Scores are mean ± sample SEM as defined in Reporting format:
| Agent | Model | Metric | Runs | Dataset Size | Original (mean ± SEM) | Harbor (mean ± SEM) |
|-------|-------|--------|------|--------------|-----------------------|---------------------|
| codex@1.0 | gpt-5 | pass@1 | 5 | 2000 (100%) | 45.2 ± 1.3 | 44.8 ± 1.1 |
Also include links to: original benchmark repo, forked repo (if applicable), dataset PR, HuggingFace parity PR, adapter PR.
Step complete when: parity_experiment.json is valid JSON, all required fields are populated, original and harbor are reported as mean ± sample SEM and are consistent with the raw original_runs / harbor_runs arrays, the run-score ranges overlap per the matching criterion, and the README includes the parity summary table with links. If scores diverge significantly, investigate before proceeding.
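Reviewers recompute the reported strings from the raw run arrays; a self-check in that spirit can be scripted before submission. This is a sketch that assumes the "45.2 ± 1.3" string format shown above:

```python
import math

def check_metric_entry(entry: dict, tol: float = 0.05) -> bool:
    """Verify that the reported 'mean ± SEM' strings in a metrics entry
    are consistent with the raw original_runs / harbor_runs arrays."""
    def mean_sem(runs):
        n = len(runs)
        mean = sum(runs) / n
        sem = math.sqrt(sum((x - mean) ** 2 for x in runs) / (n * (n - 1)))
        return mean, sem
    for key, runs_key in (("original", "original_runs"), ("harbor", "harbor_runs")):
        reported_mean, reported_sem = (float(p) for p in entry[key].split("±"))
        mean, sem = mean_sem(entry[runs_key])
        if abs(mean - reported_mean) > tol or abs(sem - reported_sem) > tol:
            return False
    return True
```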
Step 7. Upload Parity Results
Upload parity and oracle results to harborframework/parity-experiments on HuggingFace.
Recommended: Use the parity upload skill to automate this. It handles sparse checkouts, LFS tracking, and HF-specific PR refs.
Required directory structure
adapters/
└── {adapter_name}/
    ├── README.md
    ├── config.yaml
    ├── original_parity/
    ├── harbor_parity/
    ├── oracle/
    └── results_collection/
        ├── result_{original/harbor}_trial1.json
        ├── result_{original/harbor}_trial2.json
        └── result_{original/harbor}_trial{N}.json
Step complete when: A PR to the HuggingFace parity-experiments dataset is submitted with all result files in the expected directory structure.
Step 8. Register the Dataset
A dataset is a collection of tasks with a many-to-many relationship: the same task can appear in multiple datasets, and a dataset can aggregate tasks from multiple adapters. Both datasets and tasks are namespaced as {organization}/{name}, for example {organization}/{dataset} for a dataset and {organization}/{task-id} for a task.
Step 1. Generate the dataset directory with your adapter. Store it in the harbor-datasets GitHub repo, or the HuggingFace mirror if it's too large for GitHub.
git clone https://github.com/{your-github-username}/harbor-datasets.git
cd harbor/adapters/<adapter-name>
uv run python -m <adapter_name>.main --output-dir /path/to/harbor-datasets/datasets/<adapter-name>
Step 2. Create dataset.toml at the root of the dataset directory (e.g., harbor-datasets/datasets/<adapter-name>/dataset.toml).
cd /path/to/harbor-datasets/datasets/<adapter-name>
harbor init # select "dataset"; enter the dataset name as <org>/<dataset>
Step 3. Edit dataset.toml to fill in the description: parity results summary, adapter author credits (first authors only if the list is long), and acknowledgments.
Step 4. Verify the dataset runs locally before submitting, using the -p (path) parameter:
harbor run -p /path/to/your/dataset
Note: Registry testing (-d) is only available after publishing. Use -p for all pre-publish testing.
Step 5. Open a PR to harbor-datasets with the tasks directory and dataset.toml. Request @Slimshilin for review. The Harbor team publishes the dataset to the registry after approval.
Step 6. After publishing, verify the dataset loads and runs from the registry:
harbor run -d <organization-name>/<adapter-name>
Naming rules
| Rule | Requirement |
|---|---|
| Dataset ID | <org>/<dataset_name>, e.g., openai/mmmlu. Entered interactively during harbor init. |
| Task ID | <org>/<task_name>. Every generated task.toml must contain a name field. Tasks without a name cannot be added to a dataset. |
| Choosing <org> | Prefer the benchmark's owning organization (e.g., openai/mmmlu). If there is no clear single owner or there are multiple, use the benchmark name itself as the organization (e.g., terminal-bench/terminal-bench). |
| Name stability | Task names must be unique within the dataset and stable across adapter runs. Unstable names churn registry digests on republish. |
| Fallback scheme | If the upstream benchmark lacks stable task identifiers, mint a deterministic scheme in adapter code (e.g., {dataset}-1, {dataset}-2, ...) derived from a reproducible sort of upstream tasks. |
| Sanitization | Sanitize upstream identifiers before using them as names: lowercase, replace spaces/slashes/special characters with hyphens, avoid leading/trailing separators. |
Agent instruction: before writing dataset.toml, verify every generated task.toml contains a name field. If any are missing, fix main.py and regenerate. Do not hand-edit generated task directories. Treat main.py as the source of truth for task names.
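The fallback naming scheme from the table above can be sketched like this. It is illustrative: the source_path sort key is an assumed upstream field, and any reproducible sort of upstream tasks works as long as it is stable across adapter runs:

```python
def mint_task_names(dataset: str, upstream_tasks: list[dict]) -> dict[str, str]:
    """Deterministically assign {dataset}-1, {dataset}-2, ... based on a
    reproducible sort of upstream tasks, so names stay stable across runs."""
    ordered = sorted(upstream_tasks, key=lambda t: t["source_path"])  # assumed field
    return {t["source_path"]: f"{dataset}-{i}" for i, t in enumerate(ordered, start=1)}
```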
Versioning
Dataset versions are publish-time tags, not a field in dataset.toml. The Harbor team applies tags when publishing to the registry. Users resolve a specific version with -d <org>/<adapter_name>@<tag>. Every publish also receives the latest tag automatically, so -d <org>/<adapter_name> (no @) always points at the newest release.
| Tag | When to use |
|---|---|
| v1.0 | Default for the first release |
| v1.1, v2.0, ... | Subsequent releases; previous tags stay pinned to their snapshots |
| verified, lite, ... | Mirror upstream naming when the original benchmark has named splits |
| parity | Parity subset (generated via --split parity) |
To request a version, state the desired tag(s) in your adapter PR description. To cut a new version later (e.g., a bug fix), open a follow-up PR and request the new tag.
Agent instruction: do not add a version key to dataset.toml to control the published version; that does nothing. Do not change version = "1.0" in task.toml; that's the task-config schema version and must stay "1.0". The only way to select a version is to request a tag in the PR description.
Step complete when: Dataset is published to the registry, harbor run -d <organization-name>/<adapter-name> passes oracle tests, and the PR to harbor-datasets is merged.
Step 9. Document and Submit
README requirements
The README is parsed by automation
Downstream tooling parses this README to generate parity summaries and registry metadata, so the template produced by harbor adapter init must be followed as written.
Required:
- Fill in every section the template defines.
- Place any additional context — caveats, deviations, or commentary — in the Notes section, or in the notes field of parity_experiment.json or adapter_metadata.json.
Not permitted:
- Adding new top-level sections.
- Renaming, reordering, or removing template sections.
Deviations from the template prevent automated parsing; when unsure where to place content, use the Notes section.
The required content must appear in the following template-defined locations:
| Content | Location |
|---|---|
| Benchmark bugs discovered and how they were handled | Notes |
| Special treatments (prompt modifications, environment adjustments) | Notes |
| Deviations from the original benchmark and the rationale | Notes |
| Agent implementation details (if custom agents were added) | Agents section |
| Known limitations | Notes |
| Reproduction scripts for parity experiments (both sides) | Parity result section |
Update the original benchmark fork's README
If you forked the original benchmark repository for parity (Scenario 2 or 3), update the fork's README to include reproduction scripts for running Harbor parity experiments. This makes it easy for others to reproduce results on the original benchmark side. Also include the reproduction steps in your adapter's README for completeness.
adapter_metadata.json schema
Create harbor/adapters/{adapter_name}/adapter_metadata.json.
Top-level fields:
| Field | Type | Required | Description |
|---|---|---|---|
| adapter_name | string | Yes | Adapter name |
| adapter_builders | string[] | Yes | Builder names with email, e.g., ["Jane Doe (jane@example.com)"] |
| original_benchmark | object[] | Yes | Original benchmark split descriptors |
| harbor_adapter | object[] | Yes | Harbor adapter split descriptors |
original_benchmark entry fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `split` | string | Yes | Split name (use `"full"` if none) |
| `size` | integer | Yes | Number of tasks in Harbor context |
| `harness` | string | Yes | `"agent"`, `"llm"`, or `"None"` |
| `supported_agents` | string[] | Yes | Use `agent@version` format. `["None"]` if none |
| `adaptable` | boolean | Yes | Whether this split can be converted |
| `notes` | string | No | Additional clarification. `"None"` if N/A |
harbor_adapter entry fields:
| Field | Type | Required | Description |
|---|---|---|---|
| `split` | string | Yes | Corresponding split. `"full"` if collective |
| `adapted_benchmark_size` | integer | Yes | Tasks convertible by the adapter |
| `parity_benchmark_size` | integer | Yes | Tasks used for parity |
| `parity_sampling_rate` | number | Yes | `parity_benchmark_size / adapted_benchmark_size` |
| `registry_benchmark_size` | integer | Yes | Exact task count in the registry |
| `added_agents` | string[] | Yes | Custom agents added. `["None"]` if none |
| `parity_matching_agents` | string[] | Yes | Agents with comparable scores (`agent@version+model`) |
| `parity_unmatching_agents` | string[] | Yes | Agents without comparable scores. `["None"]` if all matched |
| `parity_costs` | string | Yes | Total USD (e.g., `"$150"`) |
| `notes` | string | No | `"None"` if N/A |
If parity ran across three systems (Harbor ↔ Terminal-Bench ↔ Original), include a "tb_adapter" key with the same structure.
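A lightweight local validation of these entries can catch schema drift before review. A hedged sketch for one `harbor_adapter` entry (field names come from the tables above; the floating-point tolerance is an arbitrary assumption):

```python
# Sketch of a local consistency check for one harbor_adapter entry of
# adapter_metadata.json. Field names mirror the schema tables above;
# the 1e-9 tolerance is an assumption, not part of the schema.
REQUIRED_FIELDS = {
    "split", "adapted_benchmark_size", "parity_benchmark_size",
    "parity_sampling_rate", "registry_benchmark_size", "added_agents",
    "parity_matching_agents", "parity_unmatching_agents", "parity_costs",
}

def check_entry(entry: dict) -> list[str]:
    # Report missing required fields first; only check the ratio when complete.
    errors = sorted(f"missing field: {f}" for f in REQUIRED_FIELDS - entry.keys())
    if not errors:
        expected = entry["parity_benchmark_size"] / entry["adapted_benchmark_size"]
        if abs(entry["parity_sampling_rate"] - expected) > 1e-9:
            errors.append("parity_sampling_rate does not equal parity/adapted ratio")
    return errors
```

An empty return list means the entry at least satisfies the required-field and sampling-rate contract; it does not replace the registry's own validation.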
Example
[
{
"adapter_name": "my-benchmark",
"adapter_builders": ["Jane Doe (jane@example.com)"],
"original_benchmark": [
{
"split": "full",
"size": 500,
"harness": "agent",
"supported_agents": ["codex@0.1.0"],
"adaptable": true,
"notes": "None"
}
],
"harbor_adapter": [
{
"split": "full",
"adapted_benchmark_size": 500,
"parity_benchmark_size": 100,
"parity_sampling_rate": 0.2,
"registry_benchmark_size": 500,
"added_agents": ["None"],
"parity_matching_agents": ["codex@0.1.0+gpt-5-2025-06-01"],
"parity_unmatching_agents": ["None"],
"parity_costs": "$150",
"notes": "Parity on 100-task subset. Averaged over 3 trials."
}
]
}
]
Submit
- Change the PR title from `[WIP] Adapter: {adapter_name}` to `[Ready for Review] Adapter: {adapter_name}`.
- Request review from @Slimshilin.
Step complete when: PR title is [Ready for Review] Adapter: {adapter_name}, README covers all required sections, adapter_metadata.json passes schema validation, and review is requested from @Slimshilin.
Reference: Terminal-Bench Migration
Important: The Harbor adapter must be isolated from the Terminal-Bench repo. Do not write a mechanical translation script. Write fresh adapter code following the Harbor process.
| Aspect | Terminal-Bench | Harbor |
|---|---|---|
| Config | task.yaml | task.toml |
| Instruction | In task.yaml | Separate instruction.md |
| Dockerfile | Root level | environment/Dockerfile |
| Solution | solution.sh | solution/solve.sh |
| Tests | run-tests.sh + tests/test_outputs.py | tests/test.sh |
| Docker Compose | docker-compose.yaml in task root | environment/docker-compose.yaml for GPU tasks; not needed otherwise |
| Verification | Exit code (pytest) | Reward file: /logs/verifier/reward.txt |
| Output dir | tasks/ | datasets/ |
| Registry | Dataset-level dataset_path | Task-level via dataset.toml + harbor init |
| CLI | tb run --dataset | harbor run -d / -t / -p |
| Metrics | Binary pass/fail | Float rewards, multiple metrics |
Important: If Terminal-Bench used a tweaked metric, re-implement the original benchmark's metric instead.
Migration steps
- Convert `task.yaml` to `task.toml` + `instruction.md`
- Move files: `Dockerfile` → `environment/`, `solution.sh` → `solution/solve.sh`, `run-tests.sh` → `tests/test.sh`
- For GPU tasks, move `docker-compose.yaml` into `environment/`; for non-GPU tasks, remove it
- Update test scripts to write rewards to `/logs/verifier/reward.txt` (Harbor mounts `/logs/verifier` at runtime)
- Update adapter code: change the output dir from `tasks/` to `datasets/`, create subdirectories (`environment/`, `solution/`, `tests/`), split the instruction into `instruction.md`, convert YAML generation to TOML
- Use `harbor init` + `dataset.toml` for the registry (replaces the old `registry.json`)
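The first conversion step can be sketched as a single helper. A minimal, hedged version that takes the already-parsed `task.yaml` dict and returns the two new files' contents (real adapter code should also sanitize the task name per §Step 8 and escape TOML strings properly):

```python
# Hedged sketch of the task.yaml -> task.toml + instruction.md split.
# `data` is assumed to be the parsed YAML; string escaping is naive and
# illustrative only — production code should use a TOML writer library.
def split_task_yaml(data: dict, task_name: str) -> tuple[str, str]:
    """Return (task_toml, instruction_md) for one Terminal-Bench task."""
    tags = ", ".join(f'"{t}"' for t in data.get("tags", []))
    task_toml = (
        'version = "1.0"\n\n'
        f'[task]\nname = "{task_name}"\n\n'
        "[metadata]\n"
        f'author_email = "{data["author_email"]}"\n'
        f'author_name = "{data["author_name"]}"\n'
        f'difficulty = "{data["difficulty"]}"\n'
        f'category = "{data["category"]}"\n'
        f"tags = [{tags}]\n\n"
        # Terminal-Bench's two timeout fields map onto [agent] and [verifier].
        f'[agent]\ntimeout_sec = {data["max_agent_timeout_sec"]}\n\n'
        f'[verifier]\ntimeout_sec = {data["max_test_timeout_sec"]}\n'
    )
    return task_toml, data["instruction"]
```

The instruction string is written verbatim to `instruction.md`, which matches the before/after conversion example below.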
Registry format conversion
Before (Terminal-Bench registry.json):
{
"name": "my-adapter",
"version": "head",
"description": "...",
"github_url": "https://github.com/laude-institute/terminal-bench-datasets.git",
"dataset_path": "datasets/my-adapter",
"task_id_subset": null
}
After (Harbor):
harbor init # select "dataset", creates dataset.toml
# Edit dataset.toml with descriptions, authors, credits
# Then submit to Harbor team for publishing
See Step 8 for the full publishing workflow.
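For orientation, an edited `dataset.toml` might look roughly like the following. This is an illustrative sketch only: the actual file and its field names come from what `harbor init` generates, and the values below are placeholders.

```toml
# Illustrative only — harbor init generates the real file; field names
# here are assumptions based on the edit instructions above.
name = "my-adapter"
version = "head"
description = "Harbor adaptation of My Benchmark"
authors = ["Jane Doe (jane@example.com)"]
credits = "Original benchmark by the My Benchmark authors"
```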
task.yaml → task.toml conversion example
Before:
instruction: |
Your task instruction here...
author_email: original-bench-author@email.com
author_name: Original Bench Author Name
difficulty: hard
category: programming
tags:
- debugging
- python
parser_name: swebench
max_agent_timeout_sec: 3000.0
max_test_timeout_sec: 3000.0
After (task.toml):
version = "1.0"
[task]
name = "my-benchmark/task-001"
[metadata]
author_email = "original-bench-author@email.com"
author_name = "Original Bench Author Name"
difficulty = "hard"
category = "programming"
tags = ["debugging", "python"]
[agent]
timeout_sec = 3000.0
[verifier]
timeout_sec = 3000.0
After (instruction.md):
Your task instruction here...
test.sh conversion example
Before (Terminal-Bench):
#!/bin/bash
pytest tests/ > test_results.txt
if [ $? -eq 0 ]; then echo "PASSED" > /tmp/test_marker.txt; else echo "FAILED" > /tmp/test_marker.txt; fiAfter (Harbor):
#!/bin/bash
pytest /tests/test_*.py
if [ $? -eq 0 ]; then echo 1 > /logs/verifier/reward.txt; else echo 0 > /logs/verifier/reward.txt; fi
Key differences:
- Harbor mounts `/logs/verifier` for test outputs at runtime.
- Write a numeric reward (a float is allowed) to `/logs/verifier/reward.txt`.
- You can still use pytest, but the final output must be the reward file.
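Because rewards are floats, a converted test.sh can report partial credit rather than binary pass/fail. A hedged sketch that writes the fraction of passing pytest tests (the summary-line parsing and paths are assumptions layered on the example above):

```shell
#!/bin/bash
# Sketch: report passed/total as a float reward instead of binary 0/1.
# Assumes pytest-style tests under /tests/ as in the example above.
mkdir -p /logs/verifier

# passed/total as a decimal; prints 0 when there are no tests at all.
fraction() {
  awk -v p="$1" -v t="$2" 'BEGIN { if (t > 0) print p / t; else print 0 }'
}

# pytest exits nonzero when any test fails; "|| true" keeps the script going
# so the reward file is still written.
pytest /tests/test_*.py --tb=no -q > /tmp/pytest_out.txt 2>&1 || true
passed=$(grep -oE '[0-9]+ passed' /tmp/pytest_out.txt | grep -oE '[0-9]+' | head -n 1)
failed=$(grep -oE '[0-9]+ failed' /tmp/pytest_out.txt | grep -oE '[0-9]+' | head -n 1)
total=$(( ${passed:-0} + ${failed:-0} ))
fraction "${passed:-0}" "$total" > /logs/verifier/reward.txt
```

Whether fractional credit is appropriate depends on the original benchmark's metric; if the original scored tasks as all-or-nothing, keep the binary form for parity.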
Resources
- Harbor docs: running tasks and jobs
- Harbor repo: examples and configs
- Agent tutorial: creating custom agents
- Discord: `#adapters-spam` for questions