Registry

Browse the datasets available in the Harbor registry.

uvx harbor datasets list
terminal-bench
v2.0
Version 2.0 of Terminal-Bench, a benchmark for testing agents in terminal environments. It has more tasks than 1.0, and they are harder and of higher quality.
uvx harbor run -d terminal-bench@2.0

89 tasks

swebench-verified
v1.0
A human-validated subset of 500 SWE-bench tasks
uvx harbor run -d swebench-verified@1.0

500 tasks

simpleqa
v1.0
SimpleQA: 4,326 short, fact-seeking questions from OpenAI for evaluating language model factuality. Uses LLM-as-a-judge grading. Source: https://openai.com/index/introducing-simpleqa/
uvx harbor run -d simpleqa@1.0

4326 tasks

termigen-environments
v1.0
3,500+ verified Docker environments for training and evaluating terminal agents, spanning 11 task categories across infrastructure, data/algorithm applications, and specialized domains including software build, system administration, security, data processing, ML/MLOps, algorithms, scientific computing, and more.
uvx harbor run -d termigen-environments@1.0

6132 tasks

openthoughts-tblite
v2.0
OpenThoughts-TBLite: A difficulty-calibrated benchmark of 100 tasks for building terminal agents. By the OpenThoughts Agent team, Snorkel AI, and Bespoke Labs.
uvx harbor run -d openthoughts-tblite@2.0

100 tasks

dabstep
v1.0
DABstep: Data Agent Benchmark for Multi-step Reasoning. 450 tasks where agents analyze payment transaction data with Python/pandas to answer business questions.
uvx harbor run -d dabstep@1.0

450 tasks

code-contests
v1.0
A competitive programming benchmark from DeepMind that evaluates AI agents' ability to solve algorithmic problems, covering algorithms, data structures, and competitive programming challenges.
uvx harbor run -d code-contests@1.0

44220 tasks

binary-audit
v1.0
An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries.
uvx harbor run -d binary-audit@1.0

46 tasks

otel-bench
v1.0
OpenTelemetry Benchmark - evaluates AI agents' ability to instrument applications with OpenTelemetry tracing across multiple languages.
uvx harbor run -d otel-bench@1.0

26 tasks

seta-env
v1.0
CAMEL SETA Environment for RL training
uvx harbor run -d seta-env@1.0

4008 tasks

vmax-tasks
v1.0
A collection of 1,043 validated real-world bug-fixing tasks from popular open-source JavaScript projects including Vue.js, Docusaurus, Redux, and Chalk. Each task presents an authentic bug report with reproduction steps and expected behavior.
uvx harbor run -d vmax-tasks@1.0

1387 tasks

mmmlu
vparity
MMMLU (Multilingual MMLU) parity validation subset with 10 tasks per language across 15 languages (150 tasks total). Evaluates language models' subject knowledge and reasoning across multiple languages using multiple-choice questions covering 57 academic subjects.
uvx harbor run -d mmmlu@parity

150 tasks

swe-gen-js
v1.0
SWE-gen-JS: 1000 JavaScript/TypeScript bug fix tasks from 30 open-source GitHub repos, generated using SWE-gen.
uvx harbor run -d swe-gen-js@1.0

1000 tasks

reasoning-gym-hard
vparity
Reasoning Gym benchmark (hard difficulty). Original benchmark: https://github.com/open-thought/reasoning-gym
uvx harbor run -d reasoning-gym-hard@parity

288 tasks

reasoning-gym-easy
vparity
Reasoning Gym benchmark (easy difficulty). Original benchmark: https://github.com/open-thought/reasoning-gym
uvx harbor run -d reasoning-gym-easy@parity

288 tasks

swe-lancer-diamond
vall
Adapter for SWE-Lancer (https://github.com/openai/preparedness/blob/main/project/swelancer/README.md). Both manager and individual contributor tasks.
uvx harbor run -d swe-lancer-diamond@all

463 tasks

swe-lancer-diamond
vmanager
Adapter for SWE-Lancer (https://github.com/openai/preparedness/blob/main/project/swelancer/README.md). Only the manager tasks.
uvx harbor run -d swe-lancer-diamond@manager

265 tasks

swe-lancer-diamond
vic
Adapter for SWE-Lancer (https://github.com/openai/preparedness/blob/main/project/swelancer/README.md). Only the individual contributor SWE tasks.
uvx harbor run -d swe-lancer-diamond@ic

198 tasks

terminal-bench-sample
v2.0
A sample of tasks from Terminal-Bench 2.0.
uvx harbor run -d terminal-bench-sample@2.0

10 tasks

lawbench
v1.0
LawBench: Benchmarking Legal Knowledge of Large Language Models
uvx harbor run -d lawbench@1.0

1000 tasks

crustbench
v1.0
CRUST-bench: 100 C-to-safe-Rust transpilation tasks from real-world C repositories.
uvx harbor run -d crustbench@1.0

100 tasks

bixbench-cli
v1.5
bixbench-cli - A benchmark for evaluating AI agents on bioinformatics and computational biology tasks. (Adapted for CLI execution)
uvx harbor run -d bixbench-cli@1.5

205 tasks

spider2-dbt
v1.0
Spider 2.0-DBT is a comprehensive code generation agent task that includes 68 examples. Solving these tasks requires models to understand project code, navigate complex SQL environments, and handle long contexts, going beyond traditional text-to-SQL challenges.
uvx harbor run -d spider2-dbt@1.0

64 tasks

algotune
v1.0
AlgoTune: 154 algorithm optimization tasks focusing on speedup-based scoring from the AlgoTune benchmark.
uvx harbor run -d algotune@1.0

154 tasks

ineqmath
v1.0
This adapter brings IneqMath, the dev set of the first inequality-proof Q&A benchmark for LLMs, into Harbor, enabling standardized evaluation of models on mathematical reasoning and proof construction.
uvx harbor run -d ineqmath@1.0

100 tasks

ds-1000
vhead
DS-1000 is a code generation benchmark with 1000 realistic data science problems across seven popular Python libraries.
uvx harbor run -d ds-1000@head

1000 tasks

hello-world
v1.0
A simple example task to create a hello.txt file with 'Hello, world!' as content.
uvx harbor run -d hello-world@1.0

1 task

bixbench
v1.5
BixBench - A benchmark for evaluating AI agents on bioinformatics and computational biology tasks.
uvx harbor run -d bixbench@1.5

205 tasks

strongreject
vparity
StrongReject benchmark for evaluating LLM safety and jailbreak resistance. Parity subset with 150 tasks (50 prompts * 3 jailbreaks).
uvx harbor run -d strongreject@parity

150 tasks

arc_agi_2
v1.0
ARC-AGI-2: A benchmark measuring abstract reasoning through visual grid puzzles requiring rule inference and generalization.
uvx harbor run -d arc_agi_2@1.0

167 tasks

humanevalfix
v1.0
HumanEvalFix: 164 Python code repair tasks from HumanEvalPack.
uvx harbor run -d humanevalfix@1.0

164 tasks

mmau
v1.0
MMAU: 1000 carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music.
uvx harbor run -d mmau@1.0

1000 tasks

mlgym-bench
v1.0
Evaluates agents on ML tasks across computer vision, RL, tabular ML, and game theory.
uvx harbor run -d mlgym-bench@1.0

12 tasks

replicationbench
v1.0
ReplicationBench - A benchmark for evaluating AI agents on reproducing computational results from astrophysics research papers. Adapted from Christine8888/replicationbench-release.
uvx harbor run -d replicationbench@1.0

90 tasks

gpqa-diamond
v1.0
GPQA Diamond subset: 198 graduate-level multiple-choice questions in biology, physics, and chemistry for evaluating scientific reasoning.
uvx harbor run -d gpqa-diamond@1.0

198 tasks

swtbench-verified
v1.0
SWTBench Verified - Software Testing Benchmark for code generation
uvx harbor run -d swtbench-verified@1.0

433 tasks

evoeval
v1.0
EvoEval_difficult: 100 challenging Python programming tasks evolved from HumanEval.
uvx harbor run -d evoeval@1.0

100 tasks

livecodebench
v6.0
A sample of 100 tasks from the release_v6 version of LiveCodeBench.
uvx harbor run -d livecodebench@6.0

100 tasks

aider-polyglot
v1.0
A polyglot coding benchmark that evaluates AI agents' ability to perform code editing and generation tasks across multiple programming languages.
uvx harbor run -d aider-polyglot@1.0

225 tasks

terminal-bench-pro
v1.0
Terminal-Bench Pro (Public Set) is an extended benchmark dataset for testing AI agents in real terminal environments. From compiling code to training models and setting up servers, Terminal-Bench Pro evaluates how well agents can handle real-world, end-to-end tasks autonomously.
uvx harbor run -d terminal-bench-pro@1.0

200 tasks

swesmith
v1.0
SWE-smith is a synthetically generated dataset of software engineering tasks derived from GitHub issues for training and evaluating code generation models.
uvx harbor run -d swesmith@1.0

100 tasks

swebenchpro
v1.0
SWE-bench Pro: A multi-language software engineering benchmark with 731 instances covering Python, JavaScript/TypeScript, and Go. Evaluates AI systems' ability to resolve real-world bugs and implement features across diverse production codebases. Original benchmark: https://github.com/scaleapi/SWE-bench_Pro-os. Adapter details: https://github.com/laude-institute/harbor/tree/main/adapters/swebenchpro
uvx harbor run -d swebenchpro@1.0

731 tasks

sldbench
v1.0
SLDBench: A benchmark for scaling law discovery with symbolic regression tasks.
uvx harbor run -d sldbench@1.0

8 tasks

compilebench
v1.0
Version 1.0 of CompileBench, a benchmark that tests agents on real open-source projects involving dependency hell, legacy toolchains, and complex build systems.
uvx harbor run -d compilebench@1.0

15 tasks

autocodebench
vlite200
Adapter for AutoCodeBench (https://github.com/Tencent-Hunyuan/AutoCodeBenchmark).
uvx harbor run -d autocodebench@lite200

200 tasks

usaco
v2.0
USACO: 304 Python programming problems from the USACO competition.
uvx harbor run -d usaco@2.0

304 tasks

aime
v1.0
American Invitational Mathematics Examination (AIME) benchmark for evaluating mathematical reasoning and problem-solving capabilities. Contains 60 competition-level mathematics problems from AIME 2024, 2025-I, and 2025-II competitions.
uvx harbor run -d aime@1.0

60 tasks

codepde
v1.0
CodePDE evaluates code generation capabilities on scientific computing tasks, specifically focusing on Partial Differential Equation (PDE) solving.
uvx harbor run -d codepde@1.0

5 tasks
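
Every entry in this listing passes its dataset to `-d` in the same `name@version` form, where the version may be numeric (`2.0`) or a label (`parity`, `ic`, `lite200`). The following is a minimal sketch, not part of Harbor itself, of splitting such a spec and rebuilding the run command shown in the listing. The helper names `parse_spec` and `harbor_run_command` are hypothetical and exist only for illustration; the sketch only builds the command and does not execute it.

```python
import shlex


def parse_spec(spec: str) -> tuple[str, str]:
    """Split a 'name@version' spec into (name, version).

    Hypothetical helper: splits on the last '@' and rejects
    specs where either part is missing.
    """
    name, sep, version = spec.rpartition("@")
    if not sep or not name or not version:
        raise ValueError(f"expected 'name@version', got {spec!r}")
    return name, version


def harbor_run_command(name: str, version: str) -> list[str]:
    """Build the argv for the run command shown in the listing.

    Hypothetical helper: mirrors `uvx harbor run -d name@version`
    exactly as it appears in the registry entries.
    """
    if not name or not version:
        raise ValueError("name and version must be non-empty")
    return ["uvx", "harbor", "run", "-d", f"{name}@{version}"]


if __name__ == "__main__":
    # A few specs copied from the listing above.
    for spec in ["terminal-bench@2.0", "mmmlu@parity", "swe-lancer-diamond@ic"]:
        name, version = parse_spec(spec)
        # shlex.join renders the argv as a copy-pasteable shell line.
        print(shlex.join(harbor_run_command(name, version)))
```

Returning the command as a list rather than a single string keeps it safe to hand to `subprocess.run` without shell quoting concerns, should you choose to execute it.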