Registry
Browse the datasets available in the Harbor registry.
uvx harbor datasets list

terminal-bench
Version 2.0 of Terminal-Bench, a benchmark for testing agents in terminal environments. It offers more, harder, and higher-quality tasks than 1.0.
uvx harbor run -d terminal-bench@2.0
89 tasks
swebench-verified
A human-validated subset of 500 SWE-bench tasks.
uvx harbor run -d swebench-verified@1.0
500 tasks
ade-bench
Analytics Data Engineer Bench: tasks evaluating AI agents on dbt/SQL data analytics engineering bugs. Original benchmark: https://github.com/dbt-labs/ade-bench.
uvx harbor run -d ade-bench@1.0
48 tasks
medagentbench
MedAgentBench: 300 patient-specific clinically-derived tasks across 10 categories in a FHIR-compliant interactive healthcare environment.
uvx harbor run -d medagentbench@1.0
300 tasks
quixbugs
QuixBugs is a multi-lingual program repair benchmark with 40 Python and 40 Java programs, each containing a single-line defect. Tasks cover algorithms and data structures including sorting, graph, dynamic programming, math, and string/array operations.
uvx harbor run -d quixbugs@1.0
80 tasks
qcircuitbench
QCircuitBench evaluates agents on quantum algorithm design using quantum programming languages. Original benchmark: https://github.com/EstelYang/QCircuitBench.
uvx harbor run -d qcircuitbench@1.0
28 tasks
bfcl_parity
BFCL parity subset: 123 stratified-sampled tasks for validating Harbor adapter equivalence with the original BFCL benchmark.
uvx harbor run -d bfcl_parity@1.0
123 tasks
bfcl
Berkeley Function-Calling Leaderboard: 3,641 function calling tasks for evaluating LLM tool use capabilities across simple, multiple, parallel, and irrelevance categories.
uvx harbor run -d bfcl@1.0
3641 tasks
labbench
LAB-Bench FigQA: 181 scientific figure reasoning tasks in biology from Future-House LAB-Bench.
uvx harbor run -d labbench@1.0
181 tasks
satbench
SATBench is a benchmark for evaluating the logical reasoning capabilities of LLMs through logical puzzles derived from Boolean satisfiability (SAT) problems.
uvx harbor run -d satbench@1.0
2100 tasks
financeagent
Finance Agent is a tool for financial research and analysis that leverages large language models and specialized financial tools to answer complex queries about companies, financial statements, and SEC filings. Original benchmark: https://github.com/vals-ai/finance-agent
uvx harbor run -d financeagent@public
50 tasks
bigcodebench-hard-complete
BigCodeBench-Hard complete benchmark adapter for Harbor: challenging Python programming tasks with reward-based verification.
uvx harbor run -d bigcodebench-hard-complete@1.0.0
145 tasks
deveval
DevEval benchmark: comprehensive evaluation of LLMs across software development lifecycle (implementation, unit testing, acceptance testing) for 21 real-world repositories across Python, C++, Java, and JavaScript
uvx harbor run -d deveval@1.0
63 tasks
bird-bench
BIRD SQL parity subset (150 tasks, seed 42). Original benchmark: https://huggingface.co/datasets/birdsql/bird_sql_dev_20251106. Adapter: https://github.com/laude-institute/harbor/tree/main/adapters/bird-bench.
uvx harbor run -d bird-bench@parity
150 tasks
kumo
KUMO full dataset (5300 tasks; 50 instances per scenario).
uvx harbor run -d kumo@1.0
5300 tasks
kumo
KUMO (hard) split (250 tasks; 50 instances per scenario).
uvx harbor run -d kumo@hard
250 tasks
kumo
KUMO (easy) split (5050 tasks; 50 instances per scenario).
uvx harbor run -d kumo@easy
5050 tasks
kumo
KUMO parity subset (seeds 0/1; 212 tasks).
uvx harbor run -d kumo@parity
212 tasks
gaia
GAIA (General AI Assistants): 165 validation tasks for multi-step reasoning, tool use, and multimodal question answering.
uvx harbor run -d gaia@1.0
165 tasks
simpleqa
SimpleQA: 4,326 short, fact-seeking questions from OpenAI for evaluating language model factuality. Uses LLM-as-a-judge grading. Source: https://openai.com/index/introducing-simpleqa/
uvx harbor run -d simpleqa@1.0
4326 tasks
termigen-environments
3,500+ verified Docker environments for training and evaluating terminal agents, spanning 11 task categories across infrastructure, data/algorithm applications, and specialized domains including software build, system administration, security, data processing, ML/MLOps, algorithms, scientific computing, and more.
uvx harbor run -d termigen-environments@1.0
3566 tasks
openthoughts-tblite
OpenThoughts-TBLite: A difficulty-calibrated benchmark of 100 tasks for building terminal agents. By OpenThoughts Agent team, Snorkel AI, Bespoke Labs.
uvx harbor run -d openthoughts-tblite@2.0
100 tasks
dabstep
DABstep: Data Agent Benchmark for Multi-step Reasoning. 450 tasks where agents analyze payment transaction data with Python/pandas to answer business questions.
uvx harbor run -d dabstep@1.0
450 tasks
code-contests
A competitive programming benchmark from DeepMind that evaluates AI agents' ability to solve algorithmic problems, covering algorithms, data structures, and competitive programming challenges.
uvx harbor run -d code-contests@1.0
9644 tasks
binary-audit
An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries.
uvx harbor run -d binary-audit@1.0
46 tasks
otel-bench
OpenTelemetry Benchmark - evaluates AI agents' ability to instrument applications with OpenTelemetry tracing across multiple languages.
uvx harbor run -d otel-bench@1.0
26 tasks
seta-env
CAMEL SETA Environment for RL training.
uvx harbor run -d seta-env@1.0
1376 tasks
vmax-tasks
A collection of 1,043 validated real-world bug-fixing tasks from popular open-source JavaScript projects including Vue.js, Docusaurus, Redux, and Chalk. Each task presents an authentic bug report with reproduction steps and expected behavior.
uvx harbor run -d vmax-tasks@1.0
1043 tasks
mmmlu
MMMLU (Multilingual MMLU) parity validation subset with 10 tasks per language across 15 languages (150 tasks total). Evaluates language models' subject knowledge and reasoning across multiple languages using multiple-choice questions covering 57 academic subjects.
uvx harbor run -d mmmlu@parity
150 tasks
swe-gen-js
SWE-gen-JS: 1000 JavaScript/TypeScript bug fix tasks from 30 open-source GitHub repos, generated using SWE-gen.
uvx harbor run -d swe-gen-js@1.0
1000 tasks
reasoning-gym-hard
Reasoning Gym benchmark (hard difficulty). Original benchmark: https://github.com/open-thought/reasoning-gym
uvx harbor run -d reasoning-gym-hard@parity
288 tasks
reasoning-gym-easy
Reasoning Gym benchmark (easy difficulty). Original benchmark: https://github.com/open-thought/reasoning-gym
uvx harbor run -d reasoning-gym-easy@parity
288 tasks
swe-lancer-diamond
Adapter for SWE-Lancer (https://github.com/openai/preparedness/blob/main/project/swelancer/README.md). Both manager and individual contributor tasks.
uvx harbor run -d swe-lancer-diamond@all
463 tasks
terminal-bench-sample
A sample of tasks from Terminal-Bench 2.0.
uvx harbor run -d terminal-bench-sample@2.0
10 tasks
swe-lancer-diamond
Adapter for SWE-Lancer (https://github.com/openai/preparedness/blob/main/project/swelancer/README.md). Only the individual contributor SWE tasks.
uvx harbor run -d swe-lancer-diamond@ic
198 tasks
swe-lancer-diamond
Adapter for SWE-Lancer (https://github.com/openai/preparedness/blob/main/project/swelancer/README.md). Only the manager tasks.
uvx harbor run -d swe-lancer-diamond@manager
265 tasks
lawbench
LawBench: Benchmarking Legal Knowledge of Large Language Models
uvx harbor run -d lawbench@1.0
1000 tasks
crustbench
CRUST-bench: 100 C-to-safe-Rust transpilation tasks from real-world C repositories.
uvx harbor run -d crustbench@1.0
100 tasks
bixbench-cli
BixBench-CLI: a benchmark for evaluating AI agents on bioinformatics and computational biology tasks, adapted for CLI execution.
uvx harbor run -d bixbench-cli@1.5
205 tasks
spider2-dbt
Spider 2.0-DBT is a comprehensive code-generation agent task suite with 68 examples. Solving these tasks requires models to understand project code, navigate complex SQL environments, and handle long contexts, going beyond traditional text-to-SQL challenges.
uvx harbor run -d spider2-dbt@1.0
64 tasks
algotune
AlgoTune: 154 algorithm optimization tasks focusing on speedup-based scoring from the AlgoTune benchmark.
uvx harbor run -d algotune@1.0
154 tasks
ineqmath
This adapter brings IneqMath, the dev set of the first inequality-proof Q&A benchmark for LLMs, into Harbor, enabling standardized evaluation of models on mathematical reasoning and proof construction.
uvx harbor run -d ineqmath@1.0
100 tasks
ds-1000
DS-1000 is a code generation benchmark with 1000 realistic data science problems across seven popular Python libraries.
uvx harbor run -d ds-1000@head
1000 tasks
strongreject
StrongReject benchmark for evaluating LLM safety and jailbreak resistance. Parity subset with 150 tasks (50 prompts * 3 jailbreaks).
uvx harbor run -d strongreject@parity
150 tasks
bixbench
BixBench - A benchmark for evaluating AI agents on bioinformatics and computational biology tasks.
uvx harbor run -d bixbench@1.5
205 tasks
hello-world
A simple example task to create a hello.txt file with 'Hello, world!' as content.
uvx harbor run -d hello-world@1.0
1 task
arc_agi_2
ARC-AGI-2: A benchmark measuring abstract reasoning through visual grid puzzles requiring rule inference and generalization.
uvx harbor run -d arc_agi_2@1.0
167 tasks
humanevalfix
HumanEvalFix: 164 Python code repair tasks from HumanEvalPack.
uvx harbor run -d humanevalfix@1.0
164 tasks
mmau
MMAU: 1000 carefully curated audio clips paired with human-annotated natural language questions and answers spanning speech, environmental sounds, and music.
uvx harbor run -d mmau@1.0
1000 tasks
swtbench-verified
SWT-Bench Verified: a software testing benchmark for code generation.
uvx harbor run -d swtbench-verified@1.0
433 tasks
mlgym-bench
Evaluates agents on ML tasks across computer vision, RL, tabular ML, and game theory.
uvx harbor run -d mlgym-bench@1.0
12 tasks
gpqa-diamond
GPQA Diamond subset: 198 graduate-level multiple-choice questions in biology, physics, and chemistry for evaluating scientific reasoning.
uvx harbor run -d gpqa-diamond@1.0
198 tasks
replicationbench
ReplicationBench - A benchmark for evaluating AI agents on reproducing computational results from astrophysics research papers. Adapted from Christine8888/replicationbench-release.
uvx harbor run -d replicationbench@1.0
90 tasks
terminal-bench-pro
Terminal-Bench Pro (Public Set) is an extended benchmark dataset for testing AI agents in real terminal environments. From compiling code to training models and setting up servers, Terminal-Bench Pro evaluates how well agents can handle real-world, end-to-end tasks autonomously.
uvx harbor run -d terminal-bench-pro@1.0
200 tasks
swesmith
SWE-smith is a synthetically generated dataset of software engineering tasks derived from GitHub issues for training and evaluating code generation models.
uvx harbor run -d swesmith@1.0
100 tasks
swebenchpro
SWE-bench Pro: A multi-language software engineering benchmark with 731 instances covering Python, JavaScript/TypeScript, and Go. Evaluates AI systems' ability to resolve real-world bugs and implement features across diverse production codebases. Original benchmark: https://github.com/scaleapi/SWE-bench_Pro-os. Adapter details: https://github.com/laude-institute/harbor/tree/main/adapters/swebenchpro
uvx harbor run -d swebenchpro@1.0
731 tasks
sldbench
SLDBench: A benchmark for scaling law discovery with symbolic regression tasks.
uvx harbor run -d sldbench@1.0
8 tasks
compilebench
Version 1.0 of CompileBench, a benchmark that pits agents against dependency hell, legacy toolchains, and complex build systems in real open-source projects.
uvx harbor run -d compilebench@1.0
15 tasks
autocodebench
Adapter for AutoCodeBench (https://github.com/Tencent-Hunyuan/AutoCodeBenchmark).
uvx harbor run -d autocodebench@lite200
200 tasks
usaco
USACO: 304 Python programming problems from USACO competitions.
uvx harbor run -d usaco@2.0
304 tasks
aime
American Invitational Mathematics Examination (AIME) benchmark for evaluating mathematical reasoning and problem-solving capabilities. Contains 60 competition-level mathematics problems from AIME 2024, 2025-I, and 2025-II competitions.
uvx harbor run -d aime@1.0
60 tasks
codepde
CodePDE evaluates code generation capabilities on scientific computing tasks, specifically focusing on Partial Differential Equation (PDE) solving.
uvx harbor run -d codepde@1.0
5 tasks
evoeval
EvoEval_difficult: 100 challenging Python programming tasks evolved from HumanEval.
uvx harbor run -d evoeval@1.0
100 tasks
livecodebench
A subset of 100 tasks sampled from the release_v6 version of LiveCodeBench.
uvx harbor run -d livecodebench@6.0
100 tasks
aider-polyglot
A polyglot coding benchmark that evaluates AI agents' ability to perform code editing and generation tasks across multiple programming languages.
uvx harbor run -d aider-polyglot@1.0
225 tasks