# TrainLoop Evaluation Workspace

This folder is generated by `trainloop init` and is **the place where all collected data, metrics, and test suites live**.  Commit everything **except** the `data/` folder (it is already git‑ignored).

## 📂 Directory layout

```text
trainloop/
├── data/                 # ← raw & derived artefacts (ignored by git)
│   ├── events/           #   append‑only JSONL shards, one per 10‑min window
│   ├── results/          #   evaluation verdicts (one line per test)
│   └── _registry.json    #   call‑site → tag counters
└── eval/
    ├── helpers.py        #   tiny DSL (tag, etc.)
    ├── types.py          #   Sample / Result dataclasses
    ├── runner.py         #   CLI engine (hidden)
    ├── metrics/          #   user‑defined primitives
    └── suites/           #   user‑defined test collections
```
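
Because the event shards are plain JSONL, they can be inspected with a few lines of standard-library Python. The sketch below is not part of the generated workspace. It assumes the shards sit directly under `data/events/` with a `.jsonl` extension, and the helper name `iter_events` is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator, Union


def iter_events(events_dir: Union[str, Path]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line across all shards."""
    # Assumption: shards end in ".jsonl"; sorting by file name gives a stable order.
    for shard in sorted(Path(events_dir).glob("*.jsonl")):
        with shard.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:  # tolerate blank lines between records
                    yield json.loads(line)
```

For example, `sum(1 for _ in iter_events("trainloop/data/events"))` would count the collected events without loading them all into memory.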

## 🚀 Collecting data

### JavaScript / TypeScript projects

1. **Install the SDK** (already done by `trainloop init` if it detected `package.json`):

```bash
npm install trainloop-evals-sdk --save
```

2. **Start your app with one flag (no code edits required):**

```bash
# This reads the data folder path from the trainloop config
NODE_OPTIONS="--require=trainloop-evals-sdk" \
npx next dev               # or node server.js / vite / etc.
```

3. **Tag individual requests** so you can write targeted tests later:

```ts
import { trainloopTag } from "trainloop-evals-sdk";

await openai.chat.completions.create(
  { model: "gpt-4o", messages },
  { headers: { ...trainloopTag("bubble-sort") } }
);
```

### Python projects

Install the Python SDK and initialise once at startup:

```bash
pip install trainloop-evals-sdk  # tiny wrapper around the Node collector
```

```python
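from openai import OpenAI
# Assumption: trainloop_tag is exported by the same module as collect.
from trainloop_evals import collect, trainloop_tag

# Must be called exactly once, e.g. in your app entrypoint.
collect("some/path/to/trainloop.config.yaml")

client = OpenAI()  # created after collect(); reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[...],
    extra_headers=trainloop_tag("bubble-sort"),  # tag requests for evaluation
)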
```

## 🧪 Writing metrics & suites

Create metrics in **`eval/metrics/`**:

```python
# eval/metrics/includes_hello.py
from ..types import Sample

def includes_hello(sample: Sample) -> int:  # return 1 (pass) or 0 (fail)
    if "hello" in sample.output["content"].lower():
        return 1
    return 0
```
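
The suite examples later in this README use a `does_compile` metric that is not defined here. One possible shape for it is sketched below. It assumes `sample.output["content"]` holds raw Python source (no Markdown fences), and the `Sample` class is a stand-in so the snippet runs on its own. In a real metric file you would import `Sample` from `..types` instead.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Stand-in for the Sample in eval/types.py; only `output` is modelled."""
    output: dict = field(default_factory=dict)


def does_compile(sample: Sample) -> int:
    """Return 1 if the model output parses as Python source, else 0."""
    source = sample.output["content"]
    try:
        # compile() parses and byte-compiles the source; it does not execute it.
        compile(source, "<model-output>", "exec")
    except (SyntaxError, ValueError):
        return 0
    return 1
```

Using the built-in `compile()` keeps the check cheap and avoids running untrusted model output.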

Create suites in **`eval/suites/`**:

A suite should read like English. For example, to test whether generated bubble sort code compiles, you might write:

```python
# eval/suites/bubble_sort.py
from ..helpers import tag
from ..metrics import does_compile

results = tag("bubble-sort").check(does_compile)
```

Here, `tag("bubble-sort")` finds all samples tagged `"bubble-sort"`, and `.check(does_compile)` runs the `does_compile` metric on each of them.

You can also pass several metrics to a single `check` call:

```python
results = tag("bubble-sort").check(does_compile, fcn_called_bubble_sort)
```

Every metric passed to `check` is run on each sample.
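
This README does not define `fcn_called_bubble_sort`. The sketch below takes the name to mean "the output defines a function named `bubble_sort`", which is an assumption, and again assumes the output is raw Python source. The `Sample` class is a stand-in for the one in `eval/types.py`.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class Sample:
    """Stand-in for the Sample in eval/types.py; only `output` is modelled."""
    output: dict = field(default_factory=dict)


def fcn_called_bubble_sort(sample: Sample) -> int:
    """Return 1 if the output defines a function named bubble_sort, else 0."""
    try:
        tree = ast.parse(sample.output["content"])
    except SyntaxError:
        return 0  # unparseable output cannot define the function
    for node in ast.walk(tree):
        is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_function and node.name == "bubble_sort":
            return 1
    return 0
```

Walking the syntax tree avoids false positives that a plain substring search would give, such as `bubble_sort` appearing only in a comment.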

If you don't like the declarative `tag("bubble-sort").check(does_compile)` syntax, you can also use the lower-level API:

```python
from ..helpers import tag
from ..metrics import does_compile, fcn_called_bubble_sort
from ..types import Result

samples = tag("bubble-sort", raw=True)
results = []  # REQUIRED: the runner looks for this variable name
for sample in samples:
    # Run each metric and wrap its verdict in a Result.
    compile_success = does_compile(sample)
    compile_result = Result(metric="does_compile", sample=sample, passed=compile_success)
    called_bubble_sort = fcn_called_bubble_sort(sample)
    called_bubble_sort_result = Result(metric="called_bubble_sort", sample=sample, passed=called_bubble_sort)
    results.extend([compile_result, called_bubble_sort_result])

The only requirement is that the suite defines a variable named `results` that holds a list of `Result` objects.
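
Since `results` is an ordinary list, it can be summarised with plain Python while you develop a suite. The sketch below computes a pass rate per metric. The `Result` class is a stand-in that mirrors only the three fields used in this README (`metric`, `sample`, `passed`), and `pass_rates` is a hypothetical helper, not part of the generated workspace.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Result:
    """Stand-in for the Result in eval/types.py (fields assumed from this README)."""
    metric: str
    sample: Any
    passed: int  # 1 (pass) or 0 (fail), as returned by a metric


def pass_rates(results: List[Result]) -> Dict[str, float]:
    """Return the fraction of passing results for each metric name."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for result in results:
        totals[result.metric] += 1
        if result.passed:
            passes[result.metric] += 1
    return {metric: passes[metric] / totals[metric] for metric in totals}
```

Calling `pass_rates(results)` at the end of a suite during development gives a quick per-metric summary before you open the studio.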

## 🏃‍♂️ Running evaluations & studio

```bash
trainloop eval                      # run all suites and append verdicts under data/results/
trainloop eval --suite bubble_sort  # run a specific suite
trainloop studio                    # launch the interactive viewer on http://localhost:3000
```

Use the `--help` flag on any command for detailed options.

## ⚙️ Environment variables

| Variable                   | Purpose                                               | Default                            |
| -------------------------- | ----------------------------------------------------- | ---------------------------------- |
| `TRAINLOOP_DATA_FOLDER`    | **Required.** Where `_registry.json` & `events/` live | —                                  |
| `TRAINLOOP_HOST_ALLOWLIST` | CSV of host substrings to instrument                  | `api.openai.com,api.anthropic.com` |
| `TRAINLOOP_LOG_LEVEL`      | `error`, `warn`, `info`, `debug`                      | `info`                             |
| `TRAINLOOP_CONFIG_PATH`    | Path to the trainloop config file                     | `trainloop/trainloop.config.yaml`  |
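
To make the table concrete, here is a sketch of how these variables and their defaults could be resolved in Python. It is an illustration only, not the SDK's actual code: the function names `load_settings` and `host_is_instrumented` are hypothetical, and the substring matching follows the "host substrings" wording in the table.

```python
import os
from typing import Any, Dict, List, Mapping, Optional

DEFAULT_HOST_ALLOWLIST = "api.openai.com,api.anthropic.com"
DEFAULT_LOG_LEVEL = "info"
DEFAULT_CONFIG_PATH = "trainloop/trainloop.config.yaml"
VALID_LOG_LEVELS = {"error", "warn", "info", "debug"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Resolve the TRAINLOOP_* variables, applying the documented defaults."""
    env = os.environ if env is None else env

    data_folder = env.get("TRAINLOOP_DATA_FOLDER")
    if not data_folder:
        raise RuntimeError("TRAINLOOP_DATA_FOLDER is required")

    # The allowlist is a comma-separated string; drop empty fragments.
    raw_hosts = env.get("TRAINLOOP_HOST_ALLOWLIST", DEFAULT_HOST_ALLOWLIST)
    hosts = [h.strip() for h in raw_hosts.split(",") if h.strip()]

    log_level = env.get("TRAINLOOP_LOG_LEVEL", DEFAULT_LOG_LEVEL).lower()
    if log_level not in VALID_LOG_LEVELS:
        raise ValueError(f"unsupported log level: {log_level!r}")

    return {
        "data_folder": data_folder,
        "host_allowlist": hosts,
        "log_level": log_level,
        "config_path": env.get("TRAINLOOP_CONFIG_PATH", DEFAULT_CONFIG_PATH),
    }


def host_is_instrumented(host: str, allowlist: List[str]) -> bool:
    """True if any allowlisted substring occurs in the host name."""
    return any(fragment in host for fragment in allowlist)
```

Passing a plain dict instead of `os.environ` makes the resolution logic easy to exercise in tests.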

