# MLOps Python SDK

The [MLOps](https://xcloud-service.com) Python SDK for the XCloud Service API. Use it to submit, monitor, cancel, and delete tasks on your clusters.

## Installation

Install the SDK from PyPI:

```bash
pip install mlops-python-sdk
```

## Quick Start

### 1. Set Up Authentication

You can authenticate using either an API Key or an Access Token.

#### Option 1: API Key (Recommended for programmatic access)

1. Sign up at [MLOps](https://xcloud-service.com)
2. Create an API key from [API Keys](https://xcloud-service.com/home/api-keys)
3. Set environment variables:

```bash
export MLOPS_API_KEY=xck_******
export MLOPS_DOMAIN=localhost:8090  # optional, default is localhost:8090
```

#### Option 2: Access Token (For user authentication)

```bash
export MLOPS_ACCESS_TOKEN=your_access_token
export MLOPS_DOMAIN=localhost:8090  # optional
```

### 2. Basic Usage

```python
from mlops import Task, ConnectionConfig
from mlops.api.client.models.task_status import TaskStatus

# Initialize Task client (uses environment variables by default)
task = Task()

# Or initialize with explicit configuration
config = ConnectionConfig(
    api_key="xck_******",
    domain="localhost:8090",
    debug=False
)
task = Task(config=config)

# Submit a task with a script
result = task.submit(
    name="my-training-task",
    cluster_id=1,
    script="#!/bin/bash\necho 'Hello World'",
    resources={"cpu": 4, "memory": "8GB", "gres": "gpu:1"}
)

# Or submit with a command
result = task.submit(
    name="my-task",
    cluster_id=1,
    command="python train.py",
    resources={"cpu": 4, "memory": "8GB"}
)

# Get task details
task_info = task.get(task_id=result.job_id, cluster_id=1)

# List tasks with filters
running_tasks = task.list(
    status=TaskStatus.RUNNING,
    cluster_id=1,
    page=1,
    page_size=20
)

# Cancel a task
task.cancel(task_id=result.job_id, cluster_id=1)

# Delete a task
task.delete(task_id=result.job_id, cluster_id=1)
```

## API Reference

### Task Class

The `Task` class provides a high-level interface for managing tasks.

#### Initialization

```python
from mlops import Task, ConnectionConfig

# Using environment variables
task = Task()

# With explicit configuration
config = ConnectionConfig(
    api_key="xck_******",           # API key for authentication
    access_token="token_******",     # Access token (alternative to API key)
    domain="localhost:8090",         # API domain
    debug=False,                      # Enable debug mode
    request_timeout=30.0              # Request timeout in seconds
)
task = Task(config=config)

# Or pass parameters directly
task = Task(
    api_key="xck_******",
    domain="localhost:8090"
)
```

#### Methods

##### `submit()`

Submit a new task to the cluster.

```python
result = task.submit(
    name: str,                    # Task name (required)
    cluster_id: int,              # Cluster ID (required)
    script: Optional[str] = None, # Script content (script or command required)
    command: Optional[str] = None,# Command to execute (script or command required)
    resources: Optional[dict] = None, # Resource requirements
    team_id: Optional[int] = None # Team ID (optional)
) -> TaskSubmitResponse
```

**Resources dictionary** can contain:
- `cpu` or `cpus_per_task`: Number of CPUs
- `memory`: Memory requirement (e.g., "8GB", "4096M")
- `nodes`: Number of nodes
- `gres`: GPU resources (e.g., "gpu:1")
- `time`: Time limit (e.g., "1-00:00:00" for 1 day)
- `partition`: Partition name
- `tres`: TRES specification

**Example:**

```python
result = task.submit(
    name="ml-training",
    cluster_id=1,
    script="#!/bin/bash\npython train.py --epochs 100",
    resources={
        "cpu": 8,
        "memory": "16GB",
        "gres": "gpu:1",
        "time": "2-00:00:00",  # 2 days
        "partition": "gpu"
    }
)
print(f"Task submitted: Job ID = {result.job_id}")
```
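
The `time` values above use a `D-HH:MM:SS` layout. A minimal sketch of building such strings from a `datetime.timedelta` follows; `format_time_limit` is a hypothetical helper, not part of the SDK, and it assumes the API accepts exactly the layout shown in the examples.

```python
from datetime import timedelta


def format_time_limit(limit: timedelta) -> str:
    """Format a timedelta as 'D-HH:MM:SS' (hypothetical helper, not part of the SDK)."""
    total_seconds = int(limit.total_seconds())
    if total_seconds <= 0:
        raise ValueError("time limit must be positive")
    days, remainder = divmod(total_seconds, 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Build a resources dictionary with a computed time limit
resources = {
    "cpu": 8,
    "memory": "16GB",
    "time": format_time_limit(timedelta(days=2)),
}
```

The resulting dictionary can be passed as the `resources` argument of `task.submit()`.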

##### `get()`

Get task details by task ID.

```python
task_info = task.get(
    task_id: int,    # Task ID (Slurm job ID)
    cluster_id: int  # Cluster ID (required)
) -> Task
```

**Example:**

```python
task_info = task.get(task_id=12345, cluster_id=1)
print(f"Task status: {task_info.status}")
print(f"Task name: {task_info.name}")
```

##### `list()`

List tasks with optional filters and pagination.

```python
tasks = task.list(
    page: int = 1,                           # Page number
    page_size: int = 20,                     # Items per page
    status: Optional[TaskStatus] = None,     # Filter by status
    cluster_id: Optional[int] = None,         # Filter by cluster ID
    team_id: Optional[int] = None,           # Filter by team ID
    user_id: Optional[int] = None            # Filter by user ID
) -> TaskListResponse
```

**Example:**

```python
from mlops.api.client.models.task_status import TaskStatus

# List all running tasks
running_tasks = task.list(status=TaskStatus.RUNNING)

# List tasks in a specific cluster
cluster_tasks = task.list(cluster_id=1, page=1, page_size=10)

# List completed tasks with pagination
completed = task.list(
    status=TaskStatus.COMPLETED,
    cluster_id=1,
    page=1,
    page_size=50
)
```

##### `cancel()`

Cancel a running task.

```python
task.cancel(
    task_id: int,    # Task ID (Slurm job ID)
    cluster_id: int  # Cluster ID (required)
)
```

**Example:**

```python
task.cancel(task_id=12345, cluster_id=1)
```

### TaskStatus Enum

Task status values for filtering:

```python
from mlops.api.client.models.task_status import TaskStatus

TaskStatus.PENDING      # Task is pending
TaskStatus.QUEUED       # Task is queued
TaskStatus.RUNNING      # Task is running
TaskStatus.COMPLETED    # Task completed successfully
TaskStatus.SUCCEEDED    # Task succeeded
TaskStatus.FAILED       # Task failed
TaskStatus.CANCELLED    # Task was cancelled
TaskStatus.CREATED      # Task was created
```

## Configuration

### Environment Variables

The SDK reads configuration from environment variables:

- `MLOPS_API_KEY`: API key for authentication
- `MLOPS_ACCESS_TOKEN`: Access token for authentication (alternative to API key)
- `MLOPS_DOMAIN`: API domain (default: `localhost:8090`)
- `MLOPS_DEBUG`: Enable debug mode (`true`/`false`, default: `false`)
- `MLOPS_API_PATH`: API path prefix (default: `/api/v1`)
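
To check which settings a process would pick up, the variables and defaults listed above can be collected in one place. This is only a sketch: `read_mlops_env` is a hypothetical helper, and the SDK's own parsing (for example, which strings count as true for `MLOPS_DEBUG`) may differ.

```python
import os
from typing import Mapping, Optional


def read_mlops_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the MLOPS_* settings with the documented defaults.

    Hypothetical helper for illustration; not part of the SDK.
    """
    env = os.environ if env is None else env
    return {
        "api_key": env.get("MLOPS_API_KEY"),
        "access_token": env.get("MLOPS_ACCESS_TOKEN"),
        "domain": env.get("MLOPS_DOMAIN", "localhost:8090"),
        # Assumption: only the string "true" (any case) enables debug mode.
        "debug": env.get("MLOPS_DEBUG", "false").strip().lower() == "true",
        "api_path": env.get("MLOPS_API_PATH", "/api/v1"),
    }


# Pass a plain dict to inspect a configuration without touching os.environ
settings = read_mlops_env({"MLOPS_API_KEY": "xck_******", "MLOPS_DEBUG": "true"})
if not (settings["api_key"] or settings["access_token"]):
    raise RuntimeError("Set MLOPS_API_KEY or MLOPS_ACCESS_TOKEN")
```

Calling `read_mlops_env()` with no argument reads the real process environment.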

### ConnectionConfig

You can also configure the connection programmatically:

```python
from mlops import ConnectionConfig

config = ConnectionConfig(
    domain="api.example.com",
    api_key="xck_******",
    debug=True,
    request_timeout=60.0,
    api_path="/api/v1"
)
```

## Error Handling

The SDK provides specific exception types:

```python
from mlops.exceptions import (
    APIException,           # General API errors
    AuthenticationException, # Authentication failures
    NotFoundException,       # Resource not found
    RateLimitException,     # Rate limit exceeded
    TimeoutException,       # Request timeout
    InvalidArgumentException # Invalid arguments
)

try:
    result = task.submit(name="test", cluster_id=1, command="echo hello")
except AuthenticationException as e:
    print(f"Authentication failed: {e}")
except NotFoundException as e:
    print(f"Resource not found: {e}")
except APIException as e:
    print(f"API error: {e}")
```
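
Transient failures such as rate limits or timeouts are often worth retrying. The sketch below shows one possible approach with exponential backoff; `call_with_retry` is a hypothetical helper, not part of the SDK, and the exception types to retry are passed in by the caller.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exception types with exponential backoff.

    Hypothetical helper for illustration; not part of the SDK.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

With the SDK, a call might look like `call_with_retry(lambda: task.get(task_id=12345, cluster_id=1), retry_on=(RateLimitException, TimeoutException))`. Authentication and not-found errors are left out of `retry_on` because retrying them is unlikely to help.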

## Examples

### Submit a Machine Learning Training Job

```python
from mlops import Task

task = Task()

result = task.submit(
    name="pytorch-training",
    cluster_id=1,
    script="""#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4GB

python train.py --config config.yaml
""",
    resources={
        "cpus_per_task": 2,
        "memory": "4GB",
        "gres": "gpu:1",
        "time": "1-00:00:00",  # 1 day
        "partition": "gpu"
    }
)

print(f"Training job submitted: {result.job_id}")
```

### Monitor Task Status

```python
from mlops import Task
from mlops.api.client.models.task_status import TaskStatus
import time

task = Task()
job_id = 12345
cluster_id = 1

while True:
    task_info = task.get(task_id=job_id, cluster_id=cluster_id)
    print(f"Status: {task_info.status}")

    # Stop once the task reaches a terminal state
    if task_info.status in [
        TaskStatus.COMPLETED,
        TaskStatus.SUCCEEDED,
        TaskStatus.FAILED,
        TaskStatus.CANCELLED,
    ]:
        break

    time.sleep(10)  # Check every 10 seconds
```
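
The loop above runs until the task finishes, however long that takes. A variant with an overall timeout could be sketched as follows; `wait_for_status` is a hypothetical helper, not part of the SDK, and it takes the status lookup as a callable so it does not depend on any particular client.

```python
import time
from typing import Callable, Collection, TypeVar

S = TypeVar("S")


def wait_for_status(
    fetch_status: Callable[[], S],
    terminal: Collection[S],
    timeout: float = 3600.0,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> S:
    """Poll fetch_status until it returns a terminal value or the timeout expires.

    Hypothetical helper for illustration; not part of the SDK.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status in terminal:
            return status
        # Give up if the next poll would land after the deadline
        if clock() + interval > deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(interval)
```

With the SDK, `fetch_status` might be `lambda: task.get(task_id=job_id, cluster_id=cluster_id).status`, and `terminal` a tuple of the `TaskStatus` values that should end the wait.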

### List and Filter Tasks

```python
from mlops import Task
from mlops.api.client.models.task_status import TaskStatus

task = Task()

# Get all running tasks in cluster 1
running = task.list(
    status=TaskStatus.RUNNING,
    cluster_id=1
)

for t in running.tasks:
    print(f"{t.name}: {t.status} (Job ID: {t.job_id})")

# Get failed tasks
failed = task.list(status=TaskStatus.FAILED)

print(f"Total failed tasks: {failed.total}")
```
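
Each `list()` call returns a single page. A sketch of walking through every page is shown below; `iter_all_tasks` is a hypothetical helper, not part of the SDK. It assumes pages are numbered from 1 and that `total` counts all matching tasks across pages, which the examples above suggest but do not state.

```python
from typing import Any, Callable, Iterator


def iter_all_tasks(
    list_page: Callable[..., Any], page_size: int = 50, **filters: Any
) -> Iterator[Any]:
    """Yield every task across pages (hypothetical helper, not part of the SDK).

    list_page is expected to behave like task.list: it accepts page and
    page_size, and returns an object with .tasks and .total attributes.
    """
    page = 1
    seen = 0
    while True:
        response = list_page(page=page, page_size=page_size, **filters)
        if not response.tasks:
            return  # empty page: nothing more to fetch
        for item in response.tasks:
            yield item
        seen += len(response.tasks)
        if seen >= response.total:
            return  # all tasks reported by the server have been yielded
        page += 1
```

With the SDK, this might be used as `for t in iter_all_tasks(task.list, cluster_id=1): ...`.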

## Documentation

- [MLOps Documentation](https://xcloud-service.com/docs)
- [API Reference](https://xcloud-service.com/docs/api)

## License

MIT

## Support

- [GitHub Issues](https://github.com/xcloud-service/xservice/issues)
- [Documentation](https://xcloud-service.com/docs)
