Metadata-Version: 2.4
Name: hud-python
Version: 0.4.18
Summary: SDK for the HUD platform.
Project-URL: Homepage, https://github.com/hud-evals/hud-python
Project-URL: Bug Tracker, https://github.com/hud-evals/hud-python/issues
Project-URL: Documentation, https://docs.hud.so
Author-email: HUD SDK <founders@hud.so>
License: MIT License
        
        Copyright (c) 2025 Human Union Data, Inc
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: <3.14,>=3.11
Requires-Dist: httpx<1,>=0.23.0
Requires-Dist: hud-fastmcp-python-sdk>=0.1.2
Requires-Dist: hud-mcp-python-sdk>=3.13.2
Requires-Dist: opentelemetry-api>=1.34.1
Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.34.1
Requires-Dist: opentelemetry-instrumentation-mcp>=0.44.1
Requires-Dist: opentelemetry-sdk>=1.34.1
Requires-Dist: pathspec>=0.12.1
Requires-Dist: prompt-toolkit==3.0.51
Requires-Dist: pydantic-settings<3,>=2
Requires-Dist: pydantic<3,>=2
Requires-Dist: questionary==2.1.0
Requires-Dist: rich>=13.0.0
Requires-Dist: toml>=0.10.2
Requires-Dist: typer>=0.9.0
Requires-Dist: watchfiles>=0.21.0
Requires-Dist: wrapt>=1.14.0
Provides-Extra: agent
Requires-Dist: anthropic; extra == 'agent'
Requires-Dist: datasets>=2.14.0; extra == 'agent'
Requires-Dist: dotenv>=0.9.9; extra == 'agent'
Requires-Dist: hud-mcp-use-python-sdk>=2.3.13; extra == 'agent'
Requires-Dist: ipykernel; extra == 'agent'
Requires-Dist: ipython<9; extra == 'agent'
Requires-Dist: jupyter-client; extra == 'agent'
Requires-Dist: jupyter-core; extra == 'agent'
Requires-Dist: langchain; extra == 'agent'
Requires-Dist: langchain-anthropic; extra == 'agent'
Requires-Dist: langchain-openai; extra == 'agent'
Requires-Dist: numpy>=1.24.0; extra == 'agent'
Requires-Dist: openai; extra == 'agent'
Provides-Extra: agents
Requires-Dist: anthropic; extra == 'agents'
Requires-Dist: datasets>=2.14.0; extra == 'agents'
Requires-Dist: dotenv>=0.9.9; extra == 'agents'
Requires-Dist: hud-mcp-use-python-sdk>=2.3.13; extra == 'agents'
Requires-Dist: ipykernel; extra == 'agents'
Requires-Dist: ipython<9; extra == 'agents'
Requires-Dist: jupyter-client; extra == 'agents'
Requires-Dist: jupyter-core; extra == 'agents'
Requires-Dist: langchain; extra == 'agents'
Requires-Dist: langchain-anthropic; extra == 'agents'
Requires-Dist: langchain-openai; extra == 'agents'
Requires-Dist: numpy>=1.24.0; extra == 'agents'
Requires-Dist: openai; extra == 'agents'
Provides-Extra: dev
Requires-Dist: aiodocker>=0.24.0; extra == 'dev'
Requires-Dist: anthropic; extra == 'dev'
Requires-Dist: datasets>=2.14.0; extra == 'dev'
Requires-Dist: dotenv>=0.9.9; extra == 'dev'
Requires-Dist: hud-mcp-use-python-sdk>=2.3.13; extra == 'dev'
Requires-Dist: inspect-ai>=0.3.80; extra == 'dev'
Requires-Dist: ipykernel; extra == 'dev'
Requires-Dist: ipython<9; extra == 'dev'
Requires-Dist: jupyter-client; extra == 'dev'
Requires-Dist: jupyter-core; extra == 'dev'
Requires-Dist: langchain; extra == 'dev'
Requires-Dist: langchain-anthropic; extra == 'dev'
Requires-Dist: langchain-openai; extra == 'dev'
Requires-Dist: numpy>=1.24.0; extra == 'dev'
Requires-Dist: openai; extra == 'dev'
Requires-Dist: pillow>=11.1.0; extra == 'dev'
Requires-Dist: playwright; extra == 'dev'
Requires-Dist: pyautogui>=0.9.54; extra == 'dev'
Requires-Dist: pyright==1.1.401; extra == 'dev'
Requires-Dist: pytest-asyncio; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest-mock; extra == 'dev'
Requires-Dist: pytest<9,>=8.1.1; extra == 'dev'
Requires-Dist: ruff>=0.11.8; extra == 'dev'
Requires-Dist: setuptools; extra == 'dev'
Requires-Dist: textdistance<5,>=4.5.0; extra == 'dev'
Description-Content-Type: text/markdown

<div align="left">
  <picture>
    <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/logo/hud_logo_dark.svg">
    <source media="(prefers-color-scheme: light)" srcset="https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/logo/hud_logo.svg">
    <img src="https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/logo/hud_logo.svg" alt="HUD" width="150" style="margin-bottom: 24px;"/>
  </picture>
</div>

OSS RL environment + evals toolkit. Wrap software as environments, run benchmarks, and train with RL – locally or at scale.

[![PyPI version](https://img.shields.io/pypi/v/hud-python?style=flat-square)](https://pypi.org/project/hud-python/)
[![License](https://img.shields.io/badge/license-MIT-green?style=flat-square)](LICENSE)
[![Add docs to Cursor](https://img.shields.io/badge/Add%20docs%20to-Cursor-black?style=flat-square)](https://cursor.com/en/install-mcp?name=docs-hud-python&config=eyJ1cmwiOiJodHRwczovL2RvY3MuaHVkLnNvL21jcCJ9)
[![Discord](https://img.shields.io/discord/1327447144772407390?label=Discord&logo=discord&style=flat-square)](https://discord.gg/wkjtmHYYjm)
[![X Follow](https://img.shields.io/twitter/follow/hud_evals?style=social)](https://x.com/intent/user?screen_name=hud_evals)
[![Shop](https://img.shields.io/badge/_-white.svg?label=shop&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAJCAYAAAAywQxIAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAACxMAAAsTAQCanBgAAAF6SURBVChTlZA9ixNhFIWf8yaTpFHRRMXCKpAZhCAYFvwoLHZhwUKw9A9YCJb+Bq0sxGbBQrTxX1j41dvIRAjGZbdwRUUGIzPMeyw2swS3WZ/ynHvP5VylafoAWAd+5Xm+wX+SpukmcMf29RDCZrD9BViz3f53+CjYngKZpD5A2/Y7SQBMJpOkKIprdV1vdzqdHzHGblmW9Ww2+5pl2TmAxWKxmM/nP8fj8cmqqtZijJ9sb0u6ABBWjh0riuIt8CqE8LGu66e2d5MkeQ8QY3xme7fb7T4ZjUbrZVl+jjFuSXoEXGxCDgIl9WzfAO5LSmzvNB771R6vzG4Bx0MIt/M8vwV8aLyDQNt70+n0G1AspaTxVln+aghQluVsKbvxVysflT9NQK/XO7R/SGiQ9Nt2aftElmWXJd1kv0kbeANQVdWl4XB4XtJouXaqNRgMHkrqS+r0+/3XwD1JXdungRfAVWBi+6WkK8D3EMJz22cl3W21WgNgx3YAzvwFd0Chdq03gKUAAAAASUVORK5CYII=&style=social)](https://shop.hud.so)


### Are you a startup building agents?

[📅 Hop on a call](https://cal.com/jay-ram-z6st6w/demo) or [📧 founders@hud.so](mailto:founders@hud.so)

## Highlights

- 🚀 **[MCP environment skeleton](https://docs.hud.so/core-concepts/mcp-protocol)** – any agent can call any environment.
- ⚡️ **[Live telemetry](https://app.hud.so)** – inspect every tool call, observation, and reward in real time.
- 🗂️ **[Public benchmarks](https://app.hud.so/leaderboards)** – OSWorld-Verified, SheetBench-50, and more.
- 🌱 **[Reinforcement learning built-in](rl/)** – Verifiers gym pipelines for GRPO on any environment.
- 🌐 **[Cloud browsers](environments/remote_browser/)** – AnchorBrowser, Steel, BrowserBase integrations for browser automation.
- 🛠️ **[Hot-reload dev loop](environments/README.md#phase-5-hot-reload-development-with-cursor-agent)** – `hud dev` for iterating on environments without rebuilds.

> We welcome contributors and feature requests – open an issue or hop on a call to discuss improvements!

## Installation

```bash
# Core installation - MCP servers, telemetry, basic tools for environment design
pip install hud-python

# Agent installation - Adds AI providers, datasets
pip install "hud-python[agent]"

# CLI utilities
uv tool install hud-python
# uv tool update-shell

# From source (latest)
git clone https://github.com/hud-evals/hud-python
pip install -e "hud-python[dev]"
```

> See [docs.hud.so](https://docs.hud.so), or add docs to any MCP client:
> `claude mcp add --transport http docs-hud https://docs.hud.so/mcp`

## Quickstart

For a tutorial that explains the agent and evaluation design, run ([see quickstart docs](https://docs.hud.so/quickstart)):

```python
uvx hud-python quickstart
```

Or just write your own agent loop (more [examples here](examples/)).

```python
import asyncio, hud, os
from hud.settings import settings
from hud.clients import MCPClient
from hud.agents import ClaudeAgent
from hud.datasets import Task  # See docs: https://docs.hud.so/reference/tasks

async def main() -> None:
    with hud.trace("Quick Start 2048"): # All telemetry works for any MCP-based agent (see https://app.hud.so)
        task = {
            "prompt": "Reach 64 in 2048.",
            "mcp_config": {
                "hud": {
                    "url": "https://mcp.hud.so/v3/mcp",  # HUD's cloud MCP server (see https://docs.hud.so/core-concepts/architecture)
                    "headers": {
                        "Authorization": f"Bearer {settings.api_key}",  # Get your key at https://app.hud.so
                        "Mcp-Image": "hudpython/hud-text-2048:v1.2"  # Docker image from https://hub.docker.com/u/hudpython
                    }
                }
            },
            "evaluate_tool": {"name": "evaluate", "arguments": {"name": "max_number", "arguments": {"target": 64}}},
        }
        task = Task(**task)

        # 1. Define the client explicitly:
        client = MCPClient(mcp_config=task.mcp_config)
        agent = ClaudeAgent(
            mcp_client=client,
            model="claude-sonnet-4-20250514",  # requires ANTHROPIC_API_KEY
        )

        result = await agent.run(task)

        # 2. Or just:
        # result = await ClaudeAgent().run(task)

        print(f"Reward: {result.reward}")
        await client.shutdown()

asyncio.run(main())
```

The above example let's the agent play 2048 ([See replay](https://app.hud.so/trace/6feed7bd-5f67-4d66-b77f-eb1e3164604f))

![Agent playing 2048](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/2048_1.gif)

## Reinforcement Learning with GRPO

This is a Qwen-2.5-3B agent training a policy on the [`text-2048`](environments/text_2048/) environment (see above) using [Verifiers](rl/):

![RL curve](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/rl_2.png)

To start training, check out the [`rl/README.md`](rl/README.md) folder:

```bash
git clone https://github.com/hud-evals/hud-python
cd hud-python/rl
python train_2048.py
```

Any hud MCP environment and evaluation works with our RL pipeline. Even our remote configurations!

> The [`rl/README.md`](rl/README.md) walks you through several examples of RL training and takes less than 15 minutes to set up for your custom agent!

## Benchmarking Agents

This is Claude Computer Use running on our proprietary financial analyst benchmark [SheetBench-50](https://huggingface.co/datasets/hud-evals/SheetBench-50):

![Trace screenshot](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/trace_sheet.gif)

> [See this trace on _app.hud.so_](https://app.hud.so/trace/9e212e9e-3627-4f1f-9eb5-c6d03c59070a)

This example runs the full dataset (only takes ~20 minutes) using [run_evaluation.py](examples/run_evaluation.py):

```bash
python examples/run_evaluation.py hud-evals/SheetBench-50 --full --agent claude
```

Or in code:

```python
import asyncio
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent

results = await run_dataset(
    name="My SheetBench-50 Evaluation",
    dataset="hud-evals/SheetBench-50",      # <-- HuggingFace dataset
    agent_class=ClaudeAgent,                # <-- Your custom agent can replace this (see https://docs.hud.so/evaluate-agents/create-agents)
    agent_config={"model": "claude-sonnet-4-20250514"},
    max_concurrent=50,
    max_steps=30,
)
print(f"Average reward: {sum(r.reward for r in results) / len(results):.2f}")
```

> Running a dataset creates a job and streams results to the [app.hud.so](https://app.hud.so) platform for analysis and [leaderboard submission](https://docs.hud.so/evaluate-agents/leaderboards).

## Building Environments (MCP)

This is how you can make any environment into an interactable one in 5 steps:

1. Define MCP server layer using [`MCPServer`](https://docs.hud.so/reference/environments)

```python
from hud.server import MCPServer
from hud.tools import HudComputerTool

mcp = MCPServer("My Environment")

# Add hud tools (see all tools: https://docs.hud.so/reference/tools)
mcp.add_tool(HudComputerTool())

# Or custom tools (see https://docs.hud.so/build-environments/adapting-software)
@mcp.tool("launch_app"):
def launch_app(name: str = "Gmail")
...

if __name__ == "__main__":
    mcp.run()
```

2. Write a simple Dockerfile that installs packages and runs:

```python
CMD ["python", "-m", "hud_controller.server"]
```

And build the image:

```bash
hud build # runs docker build under the hood
```

Or run it in interactible mode

```bash
hud dev
```

3. Debug it with the CLI to see if it launches:

```console
$ hud debug my-name/my-environment:latest

✓ Phase 1: Docker image exists
✓ Phase 2: MCP server responds to initialize 
✓ Phase 3: Tools are discoverable
✓ Phase 4: Basic tool execution works
✓ Phase 5: Parallel performance is good

Progress: [█████████████████████] 5/5 phases (100%)
✅ All phases completed successfully!
```

Analyze it to see if all tools appear:

```console
$ hud analyze hudpython/hud-remote-browser:latest
⠏ ✓ Analysis complete
...
Tools
├── Regular Tools
│   ├── computer
│   │   └── Control computer with mouse, keyboard, and screenshots
...
└── Hub Tools
    ├── setup
    │   ├── navigate_to_url
    │   ├── set_cookies
    │   ├── ...
    └── evaluate
        ├── url_match
        ├── page_contains
        ├── cookie_exists
        ├── ...

📡 Telemetry Data
 Live URL  https://live.anchorbrowser.io?sessionId=abc123def456
```

4. When the tests pass, push it up to the docker registry:

```bash
hud push # needs docker login, hud api key
```

5. Now you can use `mcp.hud.so` to launch 100s of instances of this environment in parallel with any agent, and see everything live on [app.hud.so](https://app.hud.so):

```python
from hud.agents import ClaudeAgent

result = await ClaudeAgent().run({  # See all agents: https://docs.hud.so/reference/agents
    "prompt": "Please explore this environment",
    "mcp_config": {
        "my-environment": {
            "url": "https://mcp.hud.so/v3/mcp",
            "headers": {
                "Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
                "Mcp-Image": "my-name/my-environment:latest"
            }
        }
        # "my-environment": { # or use hud run which wraps local and remote running
        #     "cmd": "hud",
        #     "args": [
        #         "run",
        #         "my-name/my-environment:latest",
        #     ]
        # }
    }
})

```

> See the full environment design guide and common pitfalls in [`environments/README.md`](environments/README.md)

## Leaderboards & benchmarks

All leaderboards are publicly available on [app.hud.so/leaderboards](https://app.hud.so/leaderboards) (see [docs](https://docs.hud.so/evaluate-agents/leaderboards))

![Leaderboard](https://raw.githubusercontent.com/hud-evals/hud-python/main/docs/src/images/leaderboards_3.png)

We highly suggest running 3-5 evaluations per dataset for the most consistent results across multiple jobs.

Using the [`run_dataset`](https://docs.hud.so/reference/tasks#run_dataset) function with a HuggingFace dataset automatically assigns your job to that leaderboard page, and allows you to create a scorecard out of it:

## Architecture

```mermaid
%%{init: {"theme": "neutral", "themeVariables": {"fontSize": "14px"}} }%%
graph LR
    subgraph "Platform"
        Dashboard["📊 app.hud.so"]
        API["🔌 mcp.hud.so"]
    end
  
    subgraph "hud"
        Agent["🤖 Agent"]
        Task["📋 Task"]
        SDK["📦 SDK"]
    end
  
    subgraph "Environments"
        LocalEnv["🖥️ Local Docker<br/>(Development)"]
        RemoteEnv["☁️ Remote Docker<br/>(100s Parallel)"]
    end
  
    subgraph "otel"
        Trace["📡 Traces & Metrics"]
    end
  
    Dataset["📚 Dataset<br/>(HuggingFace)"]
  
    AnyMCP["🔗 Any MCP Client<br/>(Cursor, Claude, Custom)"]
  
    Agent <--> SDK
    Task --> SDK
    Dataset <-.-> Task
    SDK <-->|"MCP"| LocalEnv
    SDK <-->|"MCP"| API
    API  <-->|"MCP"| RemoteEnv
    SDK  --> Trace
    Trace --> Dashboard
    AnyMCP -->|"MCP"| API
  
```

## CLI reference

| Command                 | Purpose                                    | Docs |
| ----------------------- | ------------------------------------------ | ---- |
| [`hud init`](https://docs.hud.so/reference/cli/init)            | Create new environment with boilerplate.  | [📖](https://docs.hud.so/reference/cli/init) |
| [`hud dev`](https://docs.hud.so/reference/cli/dev)              | Hot-reload development with Docker.        | [📖](https://docs.hud.so/reference/cli/dev) |
| [`hud build`](https://docs.hud.so/reference/cli/build)          | Build image and generate lock file.       | [📖](https://docs.hud.so/reference/cli/build) |
| [`hud push`](https://docs.hud.so/reference/cli/push)            | Share environment to registry.            | [📖](https://docs.hud.so/reference/cli/push) |
| [`hud pull <target>`](https://docs.hud.so/reference/cli/pull)   | Get environment from registry.            | [📖](https://docs.hud.so/reference/cli/pull) |
| [`hud analyze <image>`](https://docs.hud.so/reference/cli/analyze) | Discover tools, resources, and metadata.   | [📖](https://docs.hud.so/reference/cli/analyze) |
| [`hud debug <image>`](https://docs.hud.so/reference/cli/debug)   | Five-phase health check of an environment. | [📖](https://docs.hud.so/reference/cli/debug) |
| [`hud run <image>`](https://docs.hud.so/reference/cli/run)       | Run MCP server locally or remotely.       | [📖](https://docs.hud.so/reference/cli/run) |

## Roadmap

- Merging our forks in to the main `mcp`, `mcp_use`, `verifiers` repositories
- Helpers for building new environments (see [current guide](environments/README.md))
- Integrations with every major agent framework
- Evaluation environment registry
- Native RL training to hud environments (see [current RL support](rl/))
- MCP opentelemetry standard

## Contributing

We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

Key areas:
- [Environment examples](environments/) - Add new MCP environments
- [Agent implementations](hud/agents/) - Add support for new LLM providers
- [Tool library](hud/tools/) - Extend the built-in tool collection
- [RL training](rl/) - Improve reinforcement learning pipelines

Thanks to all our contributors!

<a href="https://github.com/hud-evals/hud-python/graphs/contributors">
  <img src="https://contrib.rocks/image?repo=hud-evals/hud-python&max=50" />
</a>

## Citation

```bibtex
@software{hud2025agentevalplatform,
  author = {HUD and Jay Ram and Lorenss Martinsons and Parth Patel and Oskars Putans and Govind Pimpale and Mayank Singamreddy and Nguyen Nhat Minh},
  title  = {HUD: An Evaluation Platform for Agents},
  date   = {2025-04},
  url    = {https://github.com/hud-evals/hud-python},
  langid = {en}
}
```

> **License**: HUD is released under the MIT License – see the [LICENSE](LICENSE) file for details.
