# TrainLoop Evaluation Workspace

This folder is generated by `trainloop init` and is **the place where all collected data, metrics, and test suites live**.  Commit everything **except** the `data/` folder (it is already git‑ignored).

## 📂 Directory layout

```text
trainloop/
├── data/                 # ← raw & derived artefacts (ignored by git)
│   ├── events/           #   append‑only JSONL shards, one per 10‑min window
│   ├── results/          #   evaluation verdicts (one line per test)
│   ├── benchmarks/       #   benchmark results comparing providers
│   ├── judge_traces/     #   consolidated LLM judge trace logs (one JSONL per run)
│   └── _registry.json    #   call‑site → tag counters
└── eval/                 #   Your evaluation code
    ├── metrics/          #   User‑defined primitives (e.g., is_polite(sample) -> 1/0)
    └── suites/           #   User‑defined test collections (e.g., core_capabilities.py)
```

## 🚀 Collecting data

### JavaScript / TypeScript projects

1. **Install the SDK** (already done by `trainloop init` if it detected `package.json`):

```bash
npm install trainloop-llm-logging --save
```
2. **Start your app with one flag (no code edits required):**

```bash
# This reads the data folder path from the trainloop config
NODE_OPTIONS="--require=trainloop-llm-logging" \
npx next dev               # or node server.js / vite / etc.
```

3. **Tag individual requests** so you can write targeted tests later:

```ts
import { trainloopTag } from "trainloop-llm-logging";

await openai.chat.completions.create(
  { model: "gpt-4o", messages },
  { headers: { ...trainloopTag("bubble-sort") } }
);
```

### Python projects

Install the Python SDK and initialise once at startup:

```bash
pip install trainloop-llm-logging  # tiny wrapper around the Node collector
```

```python
from trainloop_llm_logging import collect, flush, trainloop_tag

collect("some/path/to/trainloop.config.yaml")  # must be called exactly once, e.g. in your app entrypoint

# `client` is your existing OpenAI client instance
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_headers=trainloop_tag("bubble-sort")  # tag requests for evaluation
)
```

**Buffering Control:**
By default, the Python SDK flushes each LLM call immediately. To buffer calls instead, pass `flush_immediately=False` and call `flush()` yourself:

```python
# Flush immediately after each LLM call (default)
collect("some/path/to/trainloop.config.yaml")

# Or enable buffering and flush manually when needed
collect("some/path/to/trainloop.config.yaml", flush_immediately=False)
# ... your LLM calls ...
flush()  # Manually flush buffered calls
```
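
With buffering enabled, calls are written only when `flush()` runs, so an exception raised before the final `flush()` would skip it unless the SDK also flushes on exit (not stated here). A minimal sketch of one way to guard against that: `flush_on_exit` is a hypothetical helper, not part of the SDK, and a stand-in callable replaces the real `flush` so the example runs on its own.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def flush_on_exit(flush_fn: Callable[[], None]) -> Iterator[None]:
    """Run flush_fn when the block exits, even if the block raised."""
    try:
        yield
    finally:
        flush_fn()


# Stand-in for trainloop_llm_logging.flush (assumption: flush takes no arguments,
# as in the example above).
flushed = []

with flush_on_exit(lambda: flushed.append("flushed")):
    pass  # ... your LLM calls ...

print(flushed)  # ['flushed']
```

In an application you would pass the SDK's `flush` instead of the lambda: `with flush_on_exit(flush): ...`.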

## 📊 Running evaluations

Once you've collected data, you can run evaluations:

```bash
trainloop eval                    # run all metrics and suites
trainloop eval --tag bubble-sort  # run only on bubble-sort samples
trainloop eval --suite perf       # run a specific suite
```

Results are saved to `data/results/` and can be viewed with:

```bash
trainloop studio  # opens the web UI
```
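
If you prefer a quick summary in a script, the verdict files can be read directly. The sketch below assumes the results are JSONL files whose records carry `metric` and `passed` fields; that schema is inferred from the `Result` fields used elsewhere in this document and is not a documented format. `pass_rates` is a hypothetical helper.

```python
import json
from collections import defaultdict
from pathlib import Path


def pass_rates(results_dir: str) -> dict:
    """Return the fraction of passing verdicts per metric."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for path in sorted(Path(results_dir).rglob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                record = json.loads(line)
                metric = record["metric"]  # assumed field name
                total[metric] += 1
                if record["passed"]:  # assumed field name (1/0 or true/false)
                    passed[metric] += 1
    return {metric: passed[metric] / total[metric] for metric in total}
```

For example, `pass_rates("trainloop/data/results")` would return a mapping such as metric name to pass fraction, provided the files match the assumed schema.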

## 🧪 Writing metrics & suites

Create metrics in **`eval/metrics/`**:

```python
# eval/metrics/includes_hello.py
from trainloop_cli.eval_core.types import Sample

def includes_hello(sample: Sample) -> int:  # return 1 (pass) or 0 (fail)
    if "hello" in sample.output["content"].lower():
        return 1
    return 0
```
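
Metrics are plain functions, so they are easy to try out before wiring them into a suite. The sketch below shows a second, hypothetical metric (`is_valid_json`) and checks it against stand-in objects that only mimic the `sample.output["content"]` access used above; in a real metric file you would annotate the argument with `Sample` as in the previous example.

```python
import json
from types import SimpleNamespace


def is_valid_json(sample) -> int:
    """Return 1 if the model output parses as JSON, else 0."""
    try:
        json.loads(sample.output["content"])
    except (json.JSONDecodeError, TypeError, KeyError):
        return 0
    return 1


# Stand-ins for Sample objects (assumption: only `.output["content"]` is needed).
good = SimpleNamespace(output={"content": '{"sorted": [1, 2, 3]}'})
bad = SimpleNamespace(output={"content": "Sure! Here is the list: 1, 2, 3"})

print(is_valid_json(good), is_valid_json(bad))  # 1 0
```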

Create suites in **`eval/suites/`**:

A suite should read like English. For example, to test whether the generated bubble sort code compiles, you might write:

```python
# eval/suites/parsers.py
from trainloop_cli.eval_core.helpers import tag
from ..metrics import does_compile  # assuming does_compile is defined in eval/metrics/

results = tag("bubble-sort").check(does_compile)
```

Here, `tag("bubble-sort")` finds all samples tagged `"bubble-sort"`, and `.check(does_compile)` runs the `does_compile` metric on each of them.

You can also pass several metrics to a single `check` call:

```python
results = tag("bubble-sort").check(does_compile, fcn_called_bubble_sort)
```

This runs both `does_compile` and `fcn_called_bubble_sort` on each sample.
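
The two metrics named above are user-defined and not shown in this document. A minimal sketch of how they might look, assuming the model output is Python source (optionally inside a fenced code block) and that "compiles" means "parses without a syntax error"; the helper `_extract_code` is hypothetical:

```python
import ast
import re
from types import SimpleNamespace

FENCE = "`" * 3  # a markdown code fence, built at runtime
_CODE_BLOCK = re.compile(FENCE + r"(?:python|py)?\s*\n(.*?)" + FENCE, re.DOTALL)


def _extract_code(text: str) -> str:
    """Return the first fenced code block, or the whole text if there is none."""
    match = _CODE_BLOCK.search(text)
    return match.group(1) if match else text


def does_compile(sample) -> int:
    """1 if the output parses as Python source, else 0 (syntax check only)."""
    try:
        ast.parse(_extract_code(sample.output["content"]))
    except (SyntaxError, ValueError):
        return 0
    return 1


def fcn_called_bubble_sort(sample) -> int:
    """1 if the output defines a function named bubble_sort, else 0."""
    try:
        tree = ast.parse(_extract_code(sample.output["content"]))
    except (SyntaxError, ValueError):
        return 0
    names = {node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)}
    return 1 if "bubble_sort" in names else 0


# Stand-in for a Sample (assumption: only `.output["content"]` is needed).
sample = SimpleNamespace(output={"content": "def bubble_sort(xs):\n    return sorted(xs)\n"})
print(does_compile(sample), fcn_called_bubble_sort(sample))  # 1 1
```

Parsing with `ast` avoids executing model-generated code, which is usually the safer choice inside a metric.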

If you don't like the declarative `tag("bubble-sort").check(does_compile)` syntax, you can also use the lower-level API:

```python
from trainloop_cli.eval_core.types import Result, Sample # Ensure imports for lower-level API
from trainloop_cli.eval_core.helpers import tag
from ..metrics import does_compile, fcn_called_bubble_sort  # example user metrics in eval/metrics/

samples = tag("bubble-sort", raw=True)
results = [] # REQUIRED: we look for this variable name
for sample in samples:
    compile_success = does_compile(sample)
    compile_result = Result(metric="does_compile", sample=sample, passed=compile_success)
    called_bubble_sort = fcn_called_bubble_sort(sample)
    called_bubble_sort_result = Result(metric="fcn_called_bubble_sort", sample=sample, passed=called_bubble_sort)
    results.extend([compile_result, called_bubble_sort_result])
```

The only requirement is that the suite defines a variable named `results` holding a list of `Result` objects.

## 🧑‍⚖️ Using the LLM Judge

TrainLoop includes a built-in LLM judge for evaluating claims about your model outputs. This is useful for metrics that require subjective evaluation or complex reasoning.

### Basic Usage

```python
from trainloop_cli.eval_core.judge import assert_true

def my_metric(sample):
    response = sample.output["content"]  # model output, as in the metrics example
    
    # Define your claims
    yes_claim = f"The response '{response}' is polite and professional."
    no_claim = f"The response '{response}' is NOT polite and professional."
    
    # Let the judge evaluate (returns 1 for pass, 0 for fail)
    return assert_true(yes_claim, no_claim)
```

### Judge Trace Logging

The LLM Judge automatically logs all trace events (LLM calls, intermediate verdicts, final judgments, etc.) for an evaluation run into a single timestamped JSONL file. This file is located in `trainloop/data/judge_traces/` and is named with the format `YYYY-MM-DD_HH-MM-SS.jsonl`. It's created when the judge engine initializes for an evaluation session (e.g., when `trainloop eval` starts) and accumulates all judge-related events for that entire session.
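
Because the file name encodes the session start time, the most recent trace can be located by parsing the names. The sketch below relies only on the naming format described above; `latest_trace` and `read_events` are hypothetical helpers, and the structure of individual events is not assumed beyond one JSON object per line.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Optional

TRACE_NAME_FORMAT = "%Y-%m-%d_%H-%M-%S"  # matches YYYY-MM-DD_HH-MM-SS.jsonl


def latest_trace(trace_dir: str) -> Optional[Path]:
    """Return the most recent trace file by its filename timestamp, or None."""
    dated = []
    for path in Path(trace_dir).glob("*.jsonl"):
        try:
            stamp = datetime.strptime(path.stem, TRACE_NAME_FORMAT)
        except ValueError:
            continue  # skip files that do not follow the naming scheme
        dated.append((stamp, path))
    return max(dated)[1] if dated else None


def read_events(path: Path) -> list:
    """Parse one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

A typical call would be `latest_trace("trainloop/data/judge_traces")`, followed by `read_events(...)` on the returned path.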

### Configuration

The judge can be configured in `trainloop.config.yaml`:

```yaml
trainloop:
  # ... other trainloop config ...
  judge:
    models:
      - openai/gpt-4o
      - anthropic/claude-3-sonnet-20240229
    calls_per_model_per_claim: 3  # Number of calls per model for consistency
    temperature: 0.7
```

Or override per-call:

```python
verdict = assert_true(
    yes_claim,
    no_claim,
    cfg={
        "models": ["openai/gpt-4o-mini"],
        "calls_per_model_per_claim": 3,
        "temperature": 0.5,
        "template": "Custom prompt template with {claim}"
    }
)
```

### Features

- **Multi-model ensemble**: Use multiple models for more reliable judgments
- **Self-consistency**: Each model is called `calls_per_model_per_claim` times per claim
- **XOR sanity check**: Discards samples that answer both claims the same way
- **Custom templates**: Define your own prompt format for specific evaluation needs
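- **LLM call count**: Each `assert_true` call makes `len(configured_models) * calls_per_model_per_claim * 2` LLM calls (the factor 2 covers the yes claim and the no claim). A metric that calls it once per sample therefore makes that many calls times `len(samples)` per run.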
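
To make the call count and the XOR check concrete, here is a small sketch. `judge_call_count` implements the formula above. `xor_majority` is only an illustration of the idea: the majority vote over the surviving pairs is an assumption of this sketch, not a description of the judge's actual aggregation.

```python
from typing import Iterable, Optional, Tuple


def judge_call_count(n_models: int, calls_per_model_per_claim: int, n_samples: int = 1) -> int:
    """LLM calls made by the judge: models * calls per claim * 2 claims * samples."""
    return n_models * calls_per_model_per_claim * 2 * n_samples


def xor_majority(pairs: Iterable[Tuple[bool, bool]]) -> Optional[int]:
    """Aggregate (yes_claim_verdict, no_claim_verdict) pairs.

    Pairs where both claims received the same answer are discarded (XOR check).
    Returns 1 or 0 by strict majority of the remaining pairs (ties count as 0),
    or None if no pair survives. The majority rule is an assumption.
    """
    valid = [yes for yes, no in pairs if yes != no]
    if not valid:
        return None
    return 1 if sum(valid) * 2 > len(valid) else 0


# Two models, three calls per model per claim, as in the YAML example above.
print(judge_call_count(2, 3))      # 12 calls for one assert_true
print(judge_call_count(2, 3, 50))  # 600 calls for 50 samples

# The second pair agrees with both claims, so it is discarded.
print(xor_majority([(True, False), (True, True), (False, True), (True, False)]))  # 1
```

The count grows linearly in every factor, so lowering `calls_per_model_per_claim` or using a single model is the quickest way to cut judge cost on large tag sets.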

## 🏎️ Benchmarking LLM Providers

TrainLoop supports benchmarking multiple LLM providers to compare performance, quality, and cost. Configure providers in `trainloop.config.yaml`:

```yaml
trainloop:
  # ... other config ...
  benchmark:
    env_path: "../.env.benchmark"  # Optional: separate env file for benchmark API keys
    providers:
      - name: openai
        models:
          - gpt-4o
          - gpt-4o-mini
      - name: anthropic
        models:
          - claude-3-5-sonnet-20241022
          - claude-3-5-haiku-20241022
    temperature: 0.7
    max_tokens: 1000
    timeout: 60  # seconds
    parallel_requests: 5
```

The benchmark configuration is optional.
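
Assuming each listed model is benchmarked separately, the `providers` section expands into one target per (provider, model) pair. The sketch below mirrors that section as a plain Python dict (so no YAML parser is needed) and uses a hypothetical helper, `benchmark_targets`, to list the pairs.

```python
# Mirrors the `benchmark.providers` section of the YAML above.
benchmark = {
    "providers": [
        {"name": "openai", "models": ["gpt-4o", "gpt-4o-mini"]},
        {
            "name": "anthropic",
            "models": ["claude-3-5-sonnet-20241022", "claude-3-5-haiku-20241022"],
        },
    ],
}


def benchmark_targets(cfg: dict) -> list:
    """Expand the provider list into (provider, model) pairs."""
    return [
        (provider["name"], model)
        for provider in cfg.get("providers", [])
        for model in provider.get("models", [])
    ]


targets = benchmark_targets(benchmark)
print(len(targets))  # 4
print(targets[0])    # ('openai', 'gpt-4o')
```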

## 🏃‍♂️ Running evaluations & studio

```bash
trainloop eval      # run all suites and append verdicts under data/results/
trainloop eval --suite bubble_sort # run a specific suite
trainloop studio    # launch the interactive viewer on http://localhost:3000
```

Use the `--help` flag on any command for detailed options.

## ⚙️ Environment variables

| Variable                   | Purpose                                               | Default                            |
| -------------------------- | ----------------------------------------------------- | ---------------------------------- |
| `TRAINLOOP_DATA_FOLDER`    | **Required.** Where `_registry.json` & `events/` live | —                                  |
| `TRAINLOOP_HOST_ALLOWLIST` | CSV of host substrings to instrument                  | `api.openai.com,api.anthropic.com` |
| `TRAINLOOP_LOG_LEVEL`      | `error`, `warn`, `info`, `debug`                      | `info`                             |
| `TRAINLOOP_CONFIG_PATH`    | Path to the trainloop config file                     | `trainloop/trainloop.config.yaml`  |
