SFT
Generating SFT datasets from Harbor trials
Harbor includes utilities for turning trials (agent task completion attempts) into conversational traces that can be fed into supervised fine-tuning pipelines for agentic LLMs. Export helpers live under harbor.utils.traces_utils and power several CLI entry points.
Trace export currently only works for Terminus 2
Today the exporter only understands the terminus-2 agent output structure. Runs produced by other agents will raise a NotImplementedError. We welcome PRs to add trace generation support for other agents.
- Each exported row represents one `agent/episode-*` directory and captures the input `debug.json` messages plus the final agent reply from `response.json` or `response.txt` (see the illustrative row after this list).
- Rows include metadata such as `agent`, `model`, `model_provider`, `task`, `trial_name`, `episode`, and `run_id`, letting you merge runs from multiple jobs.
- `--sharegpt` adds a ShareGPT-style column to support instruction-tuning datasets expecting the `{"from": "...", "value": "..."}` schema.
- Success filtering (`--filter success|failure`) inspects `result.json` and lets you keep only passing or failing attempts for curriculum-style datasets.
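For orientation, here is a rough sketch of what one exported row can look like. Every field value is made up, and the column names for the message payload, final reply, and ShareGPT column are assumptions; only the metadata keys come from the list above.

```python
# Illustrative only: values are invented, and the "messages"/"response"/
# "conversations" column names are assumptions, not Harbor's documented schema.
row = {
    "messages": [  # parsed from debug.json
        {"role": "user", "content": "Fix the failing unit test in the repo."},
        {"role": "assistant", "content": "pytest -x  # reproduce the failure first"},
    ],
    "response": "Updated the assertion; all tests pass now.",  # from response.json / response.txt
    "agent": "terminus-2",
    "model": "anthropic/claude-3-sonnet-20240229",
    "model_provider": "anthropic",
    "task": "fix-failing-test",
    "trial_name": "fix-failing-test.trial-0",
    "episode": "episode-0",
    "run_id": "2025-01-01__12-00-00",
    # Present only when --sharegpt is passed:
    "conversations": [
        {"from": "human", "value": "Fix the failing unit test in the repo."},
        {"from": "gpt", "value": "Updated the assertion; all tests pass now."},
    ],
}
```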
Run harbor traces export on a trial directory (or a parent directory) to build a datasets.Dataset. The command prints the number of rows produced and, when --push is set, uploads directly to the Hugging Face Hub.
```bash
harbor traces export \
  --path trials \
  --recursive \
  --episodes last \
  --filter success \
  --sharegpt \
  --push \
  --repo my-org/harbor-terminus2-sft
```

Key options
| Option | Description |
| --- | --- |
| `--path` | Trial directory (or parent directory) to scan |
| `--recursive` | Also search nested directories for trials |
| `--episodes` | Which episodes to export from each trial (e.g. `last`) |
| `--filter` | Keep only `success` or `failure` attempts, based on `result.json` |
| `--sharegpt` | Add a ShareGPT-style conversation column |
| `--push` | Upload the resulting dataset to the Hugging Face Hub |
| `--repo` | Hub repository to push to |
If you want to persist the dataset locally (e.g., to Parquet), call the Python helper directly:
```python
from harbor.utils.traces_utils import export_traces

dataset = export_traces("trials", episodes="last", success_filter="success")
dataset.to_parquet("harbor-terminus2-success.parquet")
```

The `datasets` library is an optional dependency; install it if you plan to export traces.
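If you prefer to stay in Python end to end, the returned `Dataset` can also be uploaded with the standard `datasets` API instead of the CLI's `--push` flag. The repo name below is a placeholder.

```python
from harbor.utils.traces_utils import export_traces

# Same call as above; episodes/success_filter mirror the CLI's --episodes/--filter flags.
dataset = export_traces("trials", episodes="last", success_filter="success")

# Requires Hugging Face Hub credentials (e.g. via `huggingface-cli login`).
dataset.push_to_hub("my-org/harbor-terminus2-sft")
```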
harbor run can export traces automatically once a job completes. Pass trace flags alongside your job invocation:
```bash
harbor run \
  --config examples/configs/job.yaml \
  --agent terminus-2 \
  --model anthropic/claude-3-sonnet-20240229 \
  --export-traces \
  --export-sharegpt \
  --export-episodes last \
  --export-push \
  --export-repo my-org/harbor-job-run
```

When `--export-traces` is set, Harbor exports from the produced job directory using the same machinery as `harbor traces export`. The `--export-*` options mirror the standalone CLI flags and default to in-memory exports unless `--export-push` is provided. Errors during export are surfaced at the end of the job run without interrupting evaluation.
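The same isolate-the-export pattern is easy to reproduce if you drive jobs from Python: wrap the export in a try/except so a trace failure never discards evaluation results. This is a sketch of the pattern, not Harbor's internal implementation, and the job-directory path and output file are hypothetical.

```python
import logging

from harbor.utils.traces_utils import export_traces

logger = logging.getLogger(__name__)


def export_job_traces(job_dir: str) -> None:
    """Best-effort trace export once a job directory is complete."""
    try:
        dataset = export_traces(job_dir, episodes="last")
        dataset.to_parquet(f"{job_dir}/traces.parquet")
    except Exception:
        # Mirror the CLI behaviour: report the failure at the end of the run
        # instead of letting it interrupt evaluation.
        logger.exception("Trace export failed for %s", job_dir)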
harbor sweeps run can emit split datasets that separate successful and failed trajectories. Supply --push together with one of the repo arguments:
```bash
# Push a DatasetDict with "success" and "failure" splits
harbor sweeps run \
  --config examples/configs/job.yaml \
  --max-sweeps 3 \
  --trials-per-task 2 \
  --push \
  --export-repo my-org/harbor-sweeps
```

You can also push successes and failures to independent repos by combining `--push` with `--export-separate` (alias `--no-export-splits`) plus `--export-repo-success` and `--export-repo-failure`. These exports reuse the same trace discovery logic and default to the last episode from each trial.
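For reference, the success/failure `DatasetDict` that a sweep pushes can also be assembled by hand from two filtered exports. The sweep directory and repo name below are placeholders, and this is a sketch of the resulting shape rather than the sweep exporter's own code.

```python
from datasets import DatasetDict

from harbor.utils.traces_utils import export_traces

# One filtered export per split, over the same sweep output directory.
splits = DatasetDict(
    {
        "success": export_traces("jobs/sweep-run", episodes="last", success_filter="success"),
        "failure": export_traces("jobs/sweep-run", episodes="last", success_filter="failure"),
    }
)

# Pushing a DatasetDict uploads both splits to a single Hub repo.
splits.push_to_hub("my-org/harbor-sweeps")
```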