Publishing a dataset
Publish datasets on the Harbor registry to share with others
If you are new to Harbor, consider reading about the composition of a Harbor dataset first.
Benchmarks and datasets developed using Harbor can be easily shared with others inside and outside of your organization using the Harbor registry.
You can publish your dataset publicly or privately.
- Public datasets are visible and usable by everyone. This is a great way to increase the adoption of your dataset or benchmark.
- Private datasets are visible only to members of the publishing org.
You can run a published dataset with --dataset or -d:
harbor run -d "<org>/<dataset>" -a "<agent>" -m "<model>"
This guide will walk you through the process of publishing a dataset to the Harbor registry.
Prerequisites
Log in to the Harbor registry
Before publishing, run:
harbor auth login
This opens a GitHub sign-in flow and creates your Harbor account. New accounts get an org using your GitHub username by default. You can publish to orgs you are an owner of. If you publish to an org that does not exist, it will be created for you.
You must be signed in to publish.
You can verify you are signed in by running:
harbor auth status
Update old tasks
Task identifiers are specified in the [task] section of the task.toml file as <org>/<name>. This enables you to create globally unique identifiers that can be published to the registry.
If your tasks are missing the [task] section (any task created before Mar 30, 2026), you'll need to add it before publishing them or including them in a dataset. You can do this by running:
harbor task update "<path/to/task>" --org "<org>"
This will add the [task] section to the task with the name <org>/<folder-name>. For org name, we recommend using your company or benchmark name.
You can update an entire directory of tasks by including the --scan flag:
harbor task update "<path/to/tasks>" --org "<org>" --scan
1) Initialize a dataset manifest
A dataset is a collection of versioned tasks defined in a dataset.toml manifest.
Task pointers look like this:
[[tasks]]
name = "<org>/<name>"
digest = "sha256:<hash>"
The digest is the SHA256 hash of the task archive.
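As an illustration of the digest format, the sketch below hashes a file and prints the result in the sha256:<hash> form. It only shows how such a string is built. How Harbor packs a task into an archive is not described here, so a digest computed this way is not guaranteed to match the one the registry records. The function name and the archive file name are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the file's SHA256 hash formatted as 'sha256:<hex>'."""
    hasher = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return f"sha256:{hasher.hexdigest()}"


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical archive name and contents, for illustration only.
    archive = Path(tmp) / "task-archive.bin"
    archive.write_bytes(b"abc")
    digest = sha256_digest(archive)

print(digest)
```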
To initialize the dataset manifest run:
harbor dataset init "<org>/<dataset>" \
--description "A short description" \
--author "Your Name <your@email.com>"
Datasets can optionally include a metric script to define how rewards are aggregated across tasks.
If you need a custom metric script include the --with-metric flag:
harbor dataset init "<org>/<dataset>" --with-metric
This creates:
<dataset>/
├── dataset.toml
└── metric.py # only if --with-metric
If you initialize a dataset in a directory with pre-existing tasks, the tasks will be auto-added to the dataset manifest. If you initialize a task in the same directory as dataset.toml, the task will be auto-added to the dataset manifest.
For example:
my-dataset/
├── dataset.toml
├── task-a/
├── task-b/
└── metric.py # only if --with-metric
Initializing a dataset in this directory creates a dataset.toml with the tasks task-a and task-b auto-added.
2) Add tasks and files
Use harbor add to explicitly add local or published tasks.
cd "<path/to/dataset>"
Add local tasks
harbor add "<path/to/task-a>" "<path/to/task-b>"
You can also add tasks from a local dataset.toml file by running:
harbor add "<path/to/dataset.toml>"
Add registered tasks or datasets
harbor add org/name-of-task
By default, harbor add will add the latest version of a task. You can also add a specific version using "<org>/<name>@<tag>", "<org>/<name>@<revision>", or "<org>/<name>@sha256:<hash>".
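The reference forms above share one shape: an <org>/<name> identifier with an optional @ suffix. The sketch below splits such a string into its parts. It is an illustration only and not Harbor's own parser. The class and function names are hypothetical, and because the format that separates a tag from a revision is not specified here, the sketch only distinguishes digest references from everything else.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PackageRef:
    org: str
    name: str
    ref: Optional[str]  # tag, revision, or "sha256:<hash>"; None means latest

    @property
    def is_digest(self) -> bool:
        return self.ref is not None and self.ref.startswith("sha256:")


def parse_ref(text: str) -> PackageRef:
    """Split '<org>/<name>[@<ref>]' into its components."""
    identifier, at, ref = text.partition("@")
    org, slash, name = identifier.partition("/")
    if not org or not slash or not name:
        raise ValueError(f"expected '<org>/<name>', got {identifier!r}")
    if at and not ref:
        raise ValueError("empty version after '@'")
    return PackageRef(org, name, ref if at else None)


print(parse_ref("my-org/my-task"))
print(parse_ref("my-org/my-task@v1.0"))
print(parse_ref("my-org/my-task@sha256:abc123").is_digest)
```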
You can also add all of the tasks from a published dataset by running:
harbor add org/name-of-dataset
Add all tasks from a local folder
Include the --scan flag to add all tasks from a local folder.
harbor add "<path/to/candidate-tasks>" --scan
Optional metric file
If dataset.toml does not yet include metric.py, add it:
harbor add metric.py
metric.py must be added by filename and must be in the same directory as dataset.toml.
Removing tasks
You can remove tasks from a dataset by running:
harbor remove "<org>/<name-of-task>"
harbor remove accepts the same arguments as harbor add.
3) Sync task and file digests (when needed)
A dataset.toml references tasks and metrics by their digest. When you are developing a dataset, tasks and metrics are subject to change.
When you publish a dataset, it automatically refreshes the digests of the tasks and metrics in the same directory as dataset.toml.
If you need to refresh before publishing, you can use:
harbor sync
To also upgrade remote tasks to their latest published version, run:
harbor sync --upgrade
harbor add updates tasks if they are already present and can be used to refresh the digest for local tasks stored outside of the dataset folder.
4) Publish the dataset
Use harbor publish to publish a dataset:
harbor publish "<path/to/dataset>"
By default, harbor publish will also publish any tasks in the dataset directory.
Publish options
- -t / --tag: add one or more tags (repeatable). latest is always included.
- -c / --concurrency: control upload concurrency.
- --no-tasks: don't publish the tasks in the dataset directory.
- --public: make the dataset public. Private is the default.
By default, datasets are published with the latest tag.
harbor publish refreshes the digests of local tasks in the dataset directory during upload. Use harbor sync when you need to refresh remote tasks before publishing. Re-add local tasks outside of the dataset directory to refresh their digests.
Tagging and visibility
harbor publish "<path/to/dataset>" -t v1.0 --public
harbor publish "<path/to/dataset>" --no-tasks
5) Run your published dataset
After publishing, evaluate with:
harbor run -d "my-org/my-dataset@v1.0" -a "<agent>" -m "<model>"
Sharing
If you published it publicly, anyone can run it. If you published it privately, only members of the publishing org can run it. You can toggle visibility at any time using harbor dataset visibility or through the UI on the registry website.
Publishing visibility and access is documented in Sharing.