Terminus-2
Harbor's high-performance reference agent implementation
Overview
Terminus-2 is Harbor's reference agent implementation, designed as a research-preview agent for evaluating language models in terminal environments. It operates fully autonomously within sandboxed environments and serves as a high-performance, neutral test bed for understanding language model agent capabilities.
Key Features
Mono-tool Design
Terminus-2 uses a single-tool approach: its only tool is an interactive tmux session, which allows it to:
- Send keystrokes and navigate environments flexibly
- Scroll through output and use arrow keys to navigate menus
- Launch additional shells within the environment
- Interact with any terminal-based application naturally
This design philosophy enables the agent to work with virtually any command-line interface without requiring specialized tools for each interaction pattern.
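To make this concrete, the sketch below drives a tmux session from Python using only standard tmux subcommands (new-session, send-keys, capture-pane); the session name is illustrative and this is not Terminus-2's actual implementation:
import subprocess

SESSION = "terminus-demo"  # illustrative session name

def tmux(*args: str) -> str:
    """Run a tmux subcommand and return its stdout."""
    return subprocess.run(
        ["tmux", *args], capture_output=True, text=True, check=True
    ).stdout

# Start a detached session for the agent to type into.
tmux("new-session", "-d", "-s", SESSION)

# Send keystrokes exactly as a human would, including Enter.
tmux("send-keys", "-t", SESSION, "ls -la", "Enter")

# Arrow keys and other special keys use the same mechanism,
# which is what lets an agent scroll output and navigate menus.
tmux("send-keys", "-t", SESSION, "Up")

# Read the visible pane contents back as the agent's observation.
print(tmux("capture-pane", "-t", SESSION, "-p"))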
Independent Execution
The agent's logic runs in a separate Python process from the Docker container, enabling:
- Remote connection to arbitrary computer environments
- Dockerized execution environments for safety and isolation
- Flexible deployment across different infrastructure setups
- Clean separation between agent logic and task environment
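As a rough sketch of that separation, an agent process on the host can drive a task container it does not run inside; here the container name is a placeholder for whatever environment Harbor provisions:
import subprocess

CONTAINER = "harbor-task-env"  # placeholder container name

def run_in_env(command: str) -> str:
    """Execute a shell command inside the task container from the host-side agent process."""
    result = subprocess.run(
        ["docker", "exec", CONTAINER, "sh", "-c", command],
        capture_output=True,
        text=True,
    )
    return result.stdout + result.stderr

print(run_in_env("uname -a"))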
Autonomy-First Approach
Terminus-2 is designed to operate without human intervention:
- Will never ask for user input during task execution
- Independently attempts to complete tasks end-to-end
- Currently recommended only for sandboxed environments due to full autonomy
- Makes decisions and recovers from errors without guidance
Using Terminus-2 with Harbor
Basic Usage
Run Terminus-2 on a task using the --agent terminus-2 flag:
harbor run \
--agent terminus-2 \
--model openai/gpt-5 \
--path examples/tasks/ \
--task-name hello-world
Configuration Options
Terminus-2 supports various configuration options through the agent config:
from harbor.models.trial.config import AgentConfig
from harbor.models.agent_name import AgentName
agent_config = AgentConfig(
name=AgentName.TERMINUS_2,
model_name="openai/gpt-5",
kwargs={
# Parser configuration
"parser_name": "json", # "json" or "xml" (default: "json")
# API configuration
"api_base": "https://your-vllm-server.com", # Custom API endpoint
"temperature": 0.7, # Sampling temperature (default: 0.7)
# Episode/turn limits
"max_turns": 100, # Maximum number of episodes (default: 1000000)
# Summarization configuration
"enable_summarize": True, # Enable context summarization (default: True)
"proactive_summarization_threshold": 8000, # Free tokens threshold for summarization (default: 8000)
# RL training configuration (default: False)
# If enabled, token ids and logprobs are collected in result and persisted in trajectories
"collect_rollout_details": False,
# Advanced model configuration
"reasoning_effort": "medium", # "none", "minimal", "low", "medium", "high", or "default" (default: None)
"max_thinking_tokens": 2048, # For Anthropic extended thinking mode (minimum: 1024, default: None)
# Optional: Register custom model info with LiteLLM
# LiteLLM doesn't recognize uncommon or custom models. For metrics
# tracking and context summarization to work properly, provide model_info following
# https://docs.litellm.ai/docs/completion/token_usage#9-register_model
"model_info": {
"max_input_tokens": 128000,
"max_output_tokens": 4096,
"input_cost_per_token": 0.000003,
"output_cost_per_token": 0.000015,
},
# Session tracking (included in the LLM request body unless the provider doesn't support it)
"session_id": "custom-session-id", # Custom session ID (default: auto-generated UUID)
}
)
Conversation History Management
Terminus-2 implements intelligent conversation history management to handle long-running tasks efficiently while staying within context window limits.
Standard Summarization Process
Both proactive and passive summarization use a 3-step subagent process to generate high-quality summaries:
┌─────────────────────────────────────────────────────────────────┐
│ Standard Summarization Flow │
└─────────────────────────────────────────────────────────────────┘
Previous History
│
▼
┌─────────────────────┐
│ 1. Summary Subagent │
│ Input: Previous │
│ Output: Summary │
└─────────────────────┘
│
▼
┌─────────────────────┐
│ 2. Question Subagent│
│ Input: Summary │
│ Output: Questions │
└─────────────────────┘
│
▼
┌─────────────────────┐
│ 3. Answer Subagent │
│ Input: Previous + │
│ Summary + Qs │
│ Output: Answers │
└─────────────────────┘
│
▼
┌─────────────────────┐
│ Main Agent │
│ Context: │
│ • System prompt │
│ • Task │
│ • Summary │
│ • Questions │
│ • Answers │
└─────────────────────┘
Step 1 - Summary Subagent: Receives the full previous conversation history and generates an initial summary.
Step 2 - Question Subagent: Receives only the summary (not the full history) and generates clarifying questions about any missing critical information.
Step 3 - Answer Subagent: Receives the previous history, summary, and questions, then answers the questions to fill in the gaps.
The main agent then continues with a compressed context containing: system prompt, task description, summary, questions, and answers.
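In outline, the three subagent calls compose as in the sketch below, where llm is a hypothetical completion function standing in for the configured model and the prompts are paraphrased:
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": ..., "content": ...}

def summarize_history(llm: Callable[[List[Message]], str],
                      history: List[Message]) -> List[Message]:
    """Sketch of the 3-step subagent summarization process."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)

    # Step 1: the summary subagent sees the full previous history.
    summary = llm([{"role": "user",
                    "content": f"Summarize this session:\n{transcript}"}])

    # Step 2: the question subagent sees only the summary and asks about gaps.
    questions = llm([{"role": "user",
                      "content": f"Ask clarifying questions about missing "
                                 f"critical information:\n{summary}"}])

    # Step 3: the answer subagent sees history, summary, and questions.
    answers = llm([{"role": "user",
                    "content": f"History:\n{transcript}\n\nSummary:\n{summary}"
                               f"\n\nAnswer these questions:\n{questions}"}])

    # The main agent resumes with these in place of the compressed history.
    return [{"role": "user", "content": summary},
            {"role": "assistant", "content": questions},
            {"role": "user", "content": answers}]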
Proactive Summarization
When free tokens (max input tokens - current context length) drop below the proactive_summarization_threshold (default: 8000), Terminus-2:
- Pauses execution
- Runs the standard 3-step summarization process on the conversation history
- Replaces the middle portion of the conversation history with the summary + Q&A
- Keeps the system prompt and task description intact
- Resumes execution with the compressed history
The threshold can be configured via proactive_summarization_threshold in agent config.
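The trigger amounts to a token-budget check like the following sketch, where the token counts come from the model metadata LiteLLM tracks (for example via the model_info registration shown earlier):
PROACTIVE_SUMMARIZATION_THRESHOLD = 8000  # default

def should_summarize(max_input_tokens: int, current_context_tokens: int) -> bool:
    """Proactive trigger: summarize once free tokens fall below the threshold."""
    free_tokens = max_input_tokens - current_context_tokens
    return free_tokens < PROACTIVE_SUMMARIZATION_THRESHOLD

# A 128,000-token context window with 121,000 tokens already used leaves
# 7,000 free tokens, so summarization would trigger here.
assert should_summarize(128_000, 121_000)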
Passive Summarization
When a ContextLengthExceededError occurs, Terminus-2 first unwinds recent messages and then works through a three-tier fallback strategy to recover and continue execution:
┌─────────────────────────────────────────────────────────────────┐
│ Passive Summarization Fallback Flow │
└─────────────────────────────────────────────────────────────────┘
ContextLengthExceededError
│
▼
┌──────────────────────────────┐
│ 1. Unwind to Free Tokens │
│ Remove recent messages │
│ from end until enough │
│ space (keeps first msg) │
└──────────────────────────────┘
│
▼
┌──────────────────────────────┐
│ 2. Standard Summarization │
│ (3-step subagent process) │
└──────────────────────────────┘
│
┌────────┴────────┐
│ │
Success Failure
│ │
│ ▼
│ ┌──────────────────────────┐
│ │ 3. Fallback Summary │
│ │ Only: System prompt + │
│ │ Task + Current state │
│ └──────────────────────────┘
│ │
│ ┌────────┴────────┐
│ │ │
│ Success Failure
│ │ │
│ │ ▼
│ │ ┌──────────────────────┐
│ │ │ 4. Ultimate Fallback │
│ │ │ System prompt + │
│ │ │ Task + State only │
│ │ │ (Continue without │
│ │ │ summarization) │
│ │ └──────────────────────┘
│ │ │
└────────┴─────────────────┘
│
▼
Continue execution with
compressed/recovered context
Step 1 - Unwind: Remove recent messages from the end of the conversation (in pairs of user + assistant) until there are enough free tokens for summarization, always keeping at least the first message.
Step 2 - Standard Summarization: Run the 3-step subagent process. If successful, replace the unwound messages with the summary + Q&A and continue execution.
Step 3 - Fallback: If standard summarization fails, attempt a simpler summary using only system prompt, task description, and current state. If successful, continue with this compressed context.
Step 4 - Ultimate Fallback: If fallback also fails, continue execution with only system prompt, task description, and current state (no summary).
This recovery mechanism allows Terminus-2 to continue executing even when context limits are exceeded. Enable with enable_summarize=True in agent config.
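The control flow can be sketched as an unwind step followed by a chain of increasingly drastic recovery attempts; the callables here are hypothetical stand-ins for the real summarization routines:
from typing import Callable, Dict, List

Message = Dict[str, str]

def unwind(history: List[Message],
           fits: Callable[[List[Message]], bool]) -> List[Message]:
    """Step 1: drop user + assistant pairs from the end until the context
    fits, always keeping at least the first message."""
    while not fits(history) and len(history) >= 3:
        history = history[:-2]
    return history

def recover(history: List[Message],
            fits: Callable[[List[Message]], bool],
            standard_summarization: Callable[[List[Message]], List[Message]],
            fallback_summary: Callable[[List[Message]], List[Message]],
            minimal_context: Callable[[List[Message]], List[Message]]) -> List[Message]:
    """Steps 2-4: try standard summarization, then the simpler fallback
    summary, then continue with system prompt + task + state only."""
    history = unwind(history, fits)
    for attempt in (standard_summarization, fallback_summary):
        try:
            return attempt(history)
        except Exception:
            continue  # fall through to the next, more drastic option
    return minimal_context(history)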
Reinforcement Learning Support
Terminus-2 is designed with RL training in mind and collects detailed rollout information for use in RL pipelines.
Rollout Details Collection
During execution, Terminus-2 can collect and export:
Token Information
- Prompt Token IDs: List of token ID sequences, one per turn. Each sequence contains the full prompt including chat history.
- Completion Token IDs: List of token ID sequences, one per turn. Each sequence contains the response tokens for that turn.
- Logprobs: List of log probability sequences corresponding to each completion.
These are stored as a list of RolloutDetail objects in the agent result metadata:
# First RolloutDetail contains main agent conversation
rollout_detail = trial_result.agent_result.metadata["rollout_details"][0]
# Access turn-by-turn data
prompt_token_ids = rollout_detail["prompt_token_ids"] # List[List[int]]
completion_token_ids = rollout_detail["completion_token_ids"] # List[List[int]]
logprobs = rollout_detail["logprobs"] # List[List[float]]
Rewards
Terminus-2 integrates with Harbor's verifier system to collect rewards:
# Access rewards from trial results
reward = trial_result.verifier_result.rewards.get("reward", 0)
Trajectory Format
Terminus-2 automatically generates trajectories in the Agent Trajectory Interchange Format (ATIF), Harbor's standardized trajectory format. This enables:
- SFT dataset generation: Convert successful trajectories to supervised fine-tuning data
- RL training: Use complete action sequences and rewards for policy optimization
- Debugging: Inspect detailed step-by-step execution logs
- Visualization: Replay agent actions in Harbor's trajectory viewer
See the Agent Trajectory Format documentation for details on the ATIF specification.
Trajectory Configuration
Terminus-2 supports a TrajectoryConfig that controls how trajectories are recorded and formatted. This is particularly important when generating SFT datasets or when context summarization occurs.
Configuration Options
raw_content (default: False)
Controls whether to save raw LLM responses or parsed structured data in the trajectory.
- raw_content=False (default): Saves parsed, structured data with separate message and tool_calls fields. Best for trajectory analysis and debugging.
- raw_content=True: Saves the exact raw LLM response in the message field without parsing. Essential for SFT data export where you need the exact model outputs.
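As a rough illustration of the difference (the exact ATIF field layout is specified in the trajectory format documentation; these records are only schematic), the same step might be recorded as:
# raw_content=False: parsed fields, schematic layout
parsed_step = {
    "message": "I'll create the file now.",
    "tool_calls": [{"command": "touch hello.txt"}],
}

# raw_content=True: the exact raw LLM response, unparsed
raw_step = {
    "message": "{'analysis': '...', 'plan': '...', 'commands': [{'command': 'touch hello.txt'}]}",
}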
linear_history (default: False)
Controls how trajectories are split when context summarization occurs. The key difference is whether you can recover the true LLM conversation history from the trajectory files.
During agent execution, the actual conversation history sent to the LLM looks like this:
Turn 1:
{"user": "You are an AI assistant tasked with solving command-line tasks...
Task Description: Create a file called hello.txt
Current terminal state: ..."}
{"assistant": "{'analysis': '...', 'plan': '...', 'commands': [...]}"}
Turn 2:
{"user": "You are an AI assistant tasked with solving command-line tasks...
Task Description: Create a file called hello.txt
Current terminal state: ..."}
{"assistant": "{'analysis': '...', 'plan': '...', 'commands': [...]}"}
{"user": "New Terminal Output:\nroot@container:/app# ..."}
{"assistant": "{'analysis': '...', 'plan': '...', 'commands': [...]}"}
// Context summarization happens here
Turn 3:
{"user": "You are an AI assistant tasked with solving command-line tasks...
Task Description: Create a file called hello.txt
Current terminal state: ..."}
{"user": "You are picking up work from a previous AI agent on this task:
**Original Task:** Create a file called hello.txt
**Summary from Previous Agent:** ...
**Current Terminal Screen:** ...
Please begin by asking several questions..."}
{"assistant": "1. What files have been created so far?\n2. ..."}
{"user": "Here are the answers the other agent provided.\n\n[answers]\n\nContinue working on this task..."}
{"assistant": "{'analysis': '...', 'plan': '...', 'commands': [...]}"}
Notice how in Turn 3 the conversation history was reset and compressed: the system prompt is followed by the question prompt (which includes the task, summary, and terminal screen), then the model's questions, then the handoff prompt with answers, skipping all the intermediate conversation steps.
When linear_history=False (default):
All main agent steps are stored in a single trajectory.json file in a human-readable format, while summarization subagents are stored in separate files. However, you cannot recover the true conversation history from the main trajectory file: the handoff prompt appears to be a continuation of the previous conversation, when in reality the LLM context was reset.
File structure:
trajectory.json # All main agent steps
trajectory.summarization-1-summary.json # First summarization: summary subagent
trajectory.summarization-1-questions.json # First summarization: questions subagent
trajectory.summarization-1-answers.json # First summarization: answers subagent
If multiple summarizations occur, you'll see:
trajectory.json
trajectory.summarization-1-*.json # First summarization
trajectory.summarization-2-*.json # Second summarization
...
When linear_history=True:
The trajectory is split into separate files when summarization occurs. Each file represents a continuous, unambiguous linear history that was actually sent to the LLM.
trajectory.json:
[
{"user": "You are an AI assistant tasked with solving command-line tasks...\nTask Description: Create a file called hello.txt\nCurrent terminal state: ..."},
{"assistant": "{'analysis': '...', 'plan': '...', 'commands': [...]}"},
{"user": "New Terminal Output:\nroot@container:/app# ..."},
{"assistant": "{'analysis': '...', 'plan': '...', 'commands': [...]}"}
]
trajectory.cont-1.json:
[
{"user": "You are an AI assistant tasked with solving command-line tasks...\nTask Description: Create a file called hello.txt\nCurrent terminal state: ..."},
{"user": "You are picking up work from a previous AI agent on this task:\n**Original Task:** ...\n**Summary from Previous Agent:** ...\n**Current Terminal Screen:** ...\nPlease begin by asking several questions..."},
{"assistant": "1. What files have been created so far?\n2. ..."},
{"user": "Here are the answers the other agent provided.\n\n[answers]\n\nContinue working on this task..."}, // No ambiguity!
{"assistant": "{'analysis': '...', 'plan': '...', 'commands': [...]}"}
]
File structure:
trajectory.json # Before first summarization
trajectory.cont-1.json # After first summarization
trajectory.cont-2.json # After second summarization (if any)
trajectory.summarization-1-*.json # First summarization subagents
trajectory.summarization-2-*.json # Second summarization subagents (if any)
Use cases:
- linear_history=False: Simpler structure, easier to see the full agent execution in one file. Good for debugging and human analysis.
- linear_history=True: Each file represents the exact LLM context. Essential for SFT training where you need unambiguous input/output sequences.
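For example, converting linear-history trajectory files into chat-style SFT records could look like the sketch below; the file layout matches the structure above, while the role mapping and output schema are illustrative:
import json
from pathlib import Path

def trajectory_files(logs_dir: Path):
    """Yield each linear segment: trajectory.json, then trajectory.cont-N.json in order."""
    yield logs_dir / "trajectory.json"
    n = 1
    while (path := logs_dir / f"trajectory.cont-{n}.json").exists():
        yield path
        n += 1

def to_sft_records(logs_dir: Path):
    """Each file is a complete LLM context, so each becomes one training example."""
    for path in trajectory_files(logs_dir):
        steps = json.loads(path.read_text())
        messages = [{"role": role, "content": content}
                    for step in steps
                    for role, content in step.items()]
        yield {"messages": messages}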
Example Configuration
from harbor.agents.terminus_2 import Terminus2
from harbor.models.agent.trajectory_config import TrajectoryConfig
# For SFT dataset generation
trajectory_config = TrajectoryConfig(
raw_content=True, # Preserve exact LLM responses
linear_history=True # Split on summarization for clean sequences
)
agent = Terminus2(
logs_dir=Path("logs"),
model_name="anthropic/claude-3-5-sonnet-20241022",
trajectory_config=trajectory_config
)
Common configurations:
For debugging and analysis:
TrajectoryConfig(
raw_content=False, # Structured, parsed data
linear_history=False # Single file, easier to navigate
)
For SFT data export:
TrajectoryConfig(
raw_content=True, # Raw model outputs
linear_history=True # Clean input/output sequences
)
For RL training:
TrajectoryConfig(
raw_content=False, # Either works (token IDs are always exact), but structured helps debugging
linear_history=False # Full episode in one file
)
Related Documentation
- Agents Overview - General agent integration guide
- Agent Trajectory Format - ATIF specification and usage
- RL Training - Using Terminus-2 for reinforcement learning
- SFT Datasets - Generating supervised fine-tuning data