# TOON Python Library Specification

## Overview

Token-Oriented Object Notation (TOON) is a compact, human-readable format for passing structured data to Large Language Models using fewer tokens than the equivalent JSON. This specification defines the requirements for porting the existing TypeScript implementation to Python.

## Project Goals

1. **Token Efficiency**: Achieve 30-60% fewer tokens than JSON for structured data
2. **LLM-Friendly**: Explicit lengths and field lists help models validate output
3. **Minimal Syntax**: Remove redundant punctuation (braces, brackets, most quotes)
4. **Pythonic API**: Follow Python conventions and best practices
5. **Feature Parity**: Match all functionality of the TypeScript implementation

## Core Requirements

### 1. API Design

#### Primary Function
```python
def encode(data: Any, options: Optional[EncodeOptions] = None) -> str:
    """Convert any JSON-serializable value to TOON format."""
```

#### Options Class
```python
@dataclass
class EncodeOptions:
    """Configuration options for TOON encoding."""
    indent: int = 2
    delimiter: Delimiter = Delimiter.COMMA
    length_marker: Optional[str] = None  # '#' or None
```

#### Delimiter Enum
```python
class Delimiter(Enum):
    COMMA = ','
    TAB = '\t'
    PIPE = '|'
```
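
As a rough illustration of how these three pieces might fit together, the sketch below repeats the definitions in dependency order and adds a `__post_init__` check. The validation rules (non-negative `indent`, `length_marker` restricted to `'#'` or `None`) are assumptions inferred from the comments above, not confirmed behaviour of the TypeScript implementation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Delimiter(Enum):
    COMMA = ","
    TAB = "\t"
    PIPE = "|"


@dataclass
class EncodeOptions:
    indent: int = 2
    delimiter: Delimiter = Delimiter.COMMA
    length_marker: Optional[str] = None  # '#' or None

    def __post_init__(self) -> None:
        # Hypothetical validation; the spec only documents the allowed values.
        if self.indent < 0:
            raise ValueError("indent must be non-negative")
        if self.length_marker not in (None, "#"):
            raise ValueError("length_marker must be '#' or None")


# Example: pipe-delimited output with '#' length markers.
options = EncodeOptions(delimiter=Delimiter.PIPE, length_marker="#")
```

Failing early in the options object keeps the encoder itself free of argument checks, though the same checks could equally live in `encode`.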

### 2. Type System

#### JSON-Compatible Types
- `JsonPrimitive`: `str | int | float | bool | None`
- `JsonObject`: `Dict[str, JsonValue]`
- `JsonArray`: `List[JsonValue]`
- `JsonValue`: `JsonPrimitive | JsonObject | JsonArray`

#### Type Normalization Rules
| Input Type | Output | Notes |
|------------|--------|-------|
| `str` | `str` | Preserved as-is |
| `int` | `int` | Preserved, including large integers |
| `float` | `float` or `None` | `-0.0` → `0`; `NaN` and `±inf` → `None` (encoded as `null`) |
| `bool` | `bool` | Preserved |
| `None` | `None` | Preserved |
| `datetime` | `str` | ISO format string |
| `date` | `str` | ISO format string |
| `Decimal` | `str` | String representation |
| `UUID` | `str` | String representation |
| `list/tuple` | `list` | Recursively normalized |
| `dict` | `dict` | Recursively normalized |
| `set` | `list` | Converted to list |
| `frozenset` | `list` | Converted to list |
| `bytes` | `str` | Base64 encoded |
| Other | `None` | Functions, objects, etc. |
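
The table above could be realised with a single recursive function. The following is a minimal sketch, not the reference behaviour: the function name `normalize` is illustrative, and coercing non-string dict keys with `str()` is an assumption the table does not cover. Note that `bool` has to be tested before `int`, because `bool` is a subclass of `int` in Python.

```python
import base64
import math
from datetime import date, datetime
from decimal import Decimal
from typing import Any
from uuid import UUID


def normalize(value: Any) -> Any:
    """Map a Python value onto JSON-compatible types per the table above."""
    # bool is a subclass of int, so it is handled before the number branches.
    if value is None or isinstance(value, (bool, str)):
        return value
    if isinstance(value, int):
        return value  # arbitrary-precision ints are preserved
    if isinstance(value, float):
        if math.isnan(value) or math.isinf(value):
            return None
        return 0 if value == 0 else value  # folds -0.0 (and 0.0) into 0
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, (Decimal, UUID)):
        return str(value)
    if isinstance(value, bytes):
        return base64.b64encode(value).decode("ascii")
    if isinstance(value, dict):
        # Assumption: non-string keys are coerced with str().
        return {str(k): normalize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set, frozenset)):
        return [normalize(v) for v in value]
    return None  # functions, arbitrary objects, etc.
```

Because set iteration order is arbitrary, an implementation that needs reproducible output may want to sort set elements before emitting them; the table does not say either way.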

### 3. Encoding Rules

#### 3.1 Primitive Encoding
- **Strings**: Quote only when necessary (see quoting rules)
- **Numbers**: Decimal form, no scientific notation
- **Booleans**: `true`/`false` literals
- **None**: `null` literal
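
Python's `repr()` switches to scientific notation for very large and very small floats, so the "no scientific notation" rule needs explicit handling. One possible approach, sketched below, routes the value through `decimal.Decimal` and its fixed-point format. The helper names `format_number` and `encode_literal` are illustrative, and the sketch assumes `NaN` and infinities have already been normalized to `None`.

```python
from decimal import Decimal
from typing import Union


def format_number(value: Union[int, float]) -> str:
    """Render a finite number in plain decimal form."""
    if isinstance(value, int):
        return str(value)
    if value == 0:
        return "0"  # also covers -0.0
    # repr() gives the shortest round-tripping digits; Decimal's "f"
    # format then expands any exponent into positional notation.
    text = format(Decimal(repr(value)), "f")
    if "." in text:
        text = text.rstrip("0").rstrip(".")
    return text


def encode_literal(value: Union[bool, int, float, None]) -> str:
    """Encode the non-string primitives; strings need the quoting rules."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # checked before numbers: bool is an int
        return "true" if value else "false"
    return format_number(value)
```

For example, `format_number(1e-7)` yields `0.0000001` where `str(1e-7)` would yield `1e-07`.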

#### 3.2 Quoting Rules
Strings are quoted when ANY of the following is true:

| Condition | Examples |
|-----------|----------|
| Empty string | `""` |
| Contains active delimiter, colon, quotes, backslash, or control chars | `"a,b"`, `"a\tb"`, `"a:b"`, `"say \"hi\""` |
| Leading or trailing spaces | `" padded "` |
| Looks like boolean/number/null | `"true"`, `"false"`, `"null"`, `"42"`, `"-3.14"` |
| Starts with `"- "` (list-like) | `"- item"` |
| Looks like structural token | `"[5]"`, `"{key}"`, `"[3]: x,y"` |
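
A minimal sketch of how the conditions in this table might translate into code follows. The function names mirror the `PrimitiveEncoder` methods in this specification, but the bodies are assumptions: in particular, the structural-token test is a conservative stand-in (any string starting with `[` or `{`), and only the common control characters get short escapes.

```python
import re

# Integers, decimals, and exponent forms that a parser could read as numbers.
_NUMERIC = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")
_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def needs_quotes(value: str, delimiter: str = ",") -> bool:
    """Return True if the string must be quoted under the rules above."""
    if value == "":
        return True
    if value[0] == " " or value[-1] == " ":
        return True
    if value in ("true", "false", "null") or _NUMERIC.fullmatch(value):
        return True
    if value.startswith("- "):
        return True
    if value[0] in "[{":  # assumption: conservative structural-token check
        return True
    if any(ch in value for ch in (delimiter, ":", '"', "\\")):
        return True
    return any(ord(ch) < 0x20 for ch in value)  # control characters


def escape_string(value: str) -> str:
    """Escape backslashes, double quotes, and common control characters."""
    return "".join(_ESCAPES.get(ch, ch) for ch in value)


def encode_string(value: str, delimiter: str = ",") -> str:
    if needs_quotes(value, delimiter):
        return f'"{escape_string(value)}"'
    return value
```

Only the active delimiter triggers quoting, so `encode_string("a,b", "|")` stays unquoted while `encode_string("a|b", "|")` does not.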

#### 3.3 Key Quoting Rules
Keys are quoted when ANY of the following is true:

| Condition | Examples |
|-----------|----------|
| Contains spaces, commas, colons, quotes, control chars | `"full name"`, `"a,b"`, `"order:id"` |
| Contains brackets or braces | `"[index]"`, `"{key}"` |
| Leading hyphen | `"-lead"` |
| Numeric-only key | `"123"` |
| Empty key | `""` |
| Doesn't match `^[A-Za-z_][\w.]*$` pattern | `"123key"`, `"key-with-dash"` |
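
Most rows of this table are special cases of the final pattern check, so a single regular expression may be enough. The sketch below assumes the identifier pattern is meant to accept both upper- and lower-case letters (the examples in this document leave keys such as `id` and `name` unquoted). The name `encode_key` is illustrative. `re.ASCII` and `fullmatch` are used because Python's default `\w` also matches non-ASCII letters and `$` tolerates a trailing newline; whether that matches the TypeScript behaviour exactly is an assumption.

```python
import re

# re.ASCII restricts \w to [A-Za-z0-9_].
_SAFE_KEY = re.compile(r"[A-Za-z_][\w.]*", re.ASCII)


def encode_key(key: str) -> str:
    """Return the key unchanged if it is identifier-like, else quoted."""
    if _SAFE_KEY.fullmatch(key):
        return key
    # Minimal escaping for the sketch: backslashes and double quotes only.
    escaped = key.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'
```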

#### 3.4 Object Encoding
```
# Simple object
{ id: 123, name: 'Ada' } → id: 123
                           name: Ada

# Nested object
{ user: { id: 1 } } → user:
                        id: 1

# Empty object
{} → (empty output)
{ config: {} } → config:
```
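
The object rules above can be sketched as a short recursive function. This is an illustration only: `encode_object` and `_scalar` are hypothetical names, arrays are not handled, and key and string quoting are deliberately left out to keep the recursion visible.

```python
from typing import Any, Dict, List


def _scalar(value: Any) -> str:
    # Simplified primitive rendering; string quoting is omitted here.
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def encode_object(obj: Dict[str, Any], depth: int = 0, indent: int = 2) -> List[str]:
    """Return the output lines for a dict of primitives and nested dicts."""
    pad = " " * (indent * depth)
    lines: List[str] = []
    for key, value in obj.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")  # no trailing space after the colon
            lines.extend(encode_object(value, depth + 1, indent))
        else:
            lines.append(f"{pad}{key}: {_scalar(value)}")
    return lines


print("\n".join(encode_object({"user": {"id": 1, "name": "Ada"}, "active": True})))
```

An empty dict produces no lines at the root and a bare `key:` line when nested, which matches the empty-object examples above.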

#### 3.5 Array Encoding

**Primitive Arrays (Inline)**
```
{ tags: ['admin', 'ops', 'dev'] } → tags[3]: admin,ops,dev
```

**Tabular Arrays (Uniform Objects)**
```
{ items: [
  { sku: 'A1', qty: 2, price: 9.99 },
  { sku: 'B2', qty: 1, price: 14.5 }
] } → items[2]{sku,qty,price}:
        A1,2,9.99
        B2,1,14.5
```

**Tabular Requirements:**
- All elements are objects
- Identical key sets (no missing/extra keys)
- Primitive values only (no nested arrays/objects)
- Header order from first object

**Mixed/Non-Uniform Arrays (List Format)**
```
{ items: [1, { a: 1 }, 'x'] } → items[3]:
                                  - 1
                                  - a: 1
                                  - x
```

**Arrays of Arrays**
```
{ pairs: [[1, 2], [3, 4]] } → pairs[2]:
                                - [2]: 1,2
                                - [2]: 3,4
```
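
To make the difference between the inline and list forms concrete, here is a small sketch covering primitive arrays and arrays of primitive arrays. The names `encode_array`, `_scalar`, and `_is_primitive` are illustrative. Object items in list form are intentionally left out, string quoting is omitted, and rendering an empty array as `key[0]:` is an assumption.

```python
from typing import Any, List


def _scalar(value: Any) -> str:
    # Simplified primitive rendering; string quoting is omitted here.
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def _is_primitive(value: Any) -> bool:
    return not isinstance(value, (dict, list))


def encode_array(key: str, items: List[Any], indent: int = 2) -> List[str]:
    """Encode arrays of primitives and arrays of primitive arrays."""
    head = f"{key}[{len(items)}]:"
    if all(_is_primitive(v) for v in items):
        if not items:
            return [head]
        # Inline form: header and values share one line.
        return [head + " " + ",".join(_scalar(v) for v in items)]
    lines = [head]
    pad = " " * indent
    for item in items:
        if _is_primitive(item):
            lines.append(f"{pad}- {_scalar(item)}")
        elif isinstance(item, list) and all(_is_primitive(v) for v in item):
            inner = ",".join(_scalar(v) for v in item)
            lines.append(f"{pad}- [{len(item)}]: {inner}".rstrip())
        else:
            raise NotImplementedError("object items are left out of this sketch")
    return lines
```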

### 4. Formatting Rules

#### 4.1 Indentation
- 2 spaces per nesting level (configurable)
- No trailing spaces
- No trailing newline at end of output
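
These whitespace rules are easy to check mechanically, which is useful in the test suite. The helper below is a hypothetical test utility, not part of the proposed public API. It assumes `indent` is greater than zero and that every line's leading spaces are a whole number of indent levels.

```python
def check_whitespace_invariants(text: str, indent: int = 2) -> None:
    """Raise AssertionError if output breaks the indentation rules above."""
    assert not text.endswith("\n"), "output must not end with a newline"
    for number, line in enumerate(text.split("\n"), start=1):
        assert not line.endswith(" "), f"line {number} has trailing spaces"
        leading = len(line) - len(line.lstrip(" "))
        assert leading % indent == 0, (
            f"line {number} is not indented by a multiple of {indent}"
        )


# Passes silently for well-formed output.
check_whitespace_invariants("user:\n  id: 1\n  name: Ada")
```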

#### 4.2 Delimiters
- **Comma (default)**: Implicit in headers `[N]{f1,f2}`
- **Tab**: Explicit in headers `[N\t]{f1\tf2}` (where `\t` stands for a literal tab character)
- **Pipe**: Explicit in headers `[N|]{f1|f2}`

#### 4.3 Length Markers
- Optional `#` prefix: `[#N]` instead of `[N]`
- Emphasizes that `N` is an element count, not an index
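
Delimiter and length-marker handling both come together in the array header, so a single helper can cover sections 4.2 and 4.3. The function below is a sketch under that reading; `format_header` and its parameters are hypothetical names rather than part of the specified API.

```python
from typing import Optional, Sequence


def format_header(
    key: Optional[str],
    length: int,
    fields: Optional[Sequence[str]] = None,
    delimiter: str = ",",
    length_marker: Optional[str] = None,
) -> str:
    """Build an array header such as `items[2]{sku,qty}:`."""
    # The comma is implicit; tab and pipe are written after the length.
    suffix = "" if delimiter == "," else delimiter
    header = f"{key or ''}[{length_marker or ''}{length}{suffix}]"
    if fields is not None:
        header += "{" + delimiter.join(fields) + "}"
    return header + ":"


print(format_header("items", 2, ["sku", "qty"], delimiter="|", length_marker="#"))
```

Passing `key=None` yields the keyless form used for nested arrays inside list items, for example `[2]:`.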

#### 4.4 Line Formats
- `key: value` for primitives (single space after colon)
- `key:` for nested/empty objects (no trailing space)
- `key[N]: v1,v2` for primitive arrays
- `key[N]{f1,f2}:` for tabular arrays
- `  - item` for list items (one indent level, then hyphen and space)

### 5. Implementation Architecture

#### 5.1 Module Structure
```
toon/
├── __init__.py          # Public API
├── encoder.py           # Main encoding logic
├── normalizer.py        # Type normalization
├── primitives.py        # Primitive encoding/quoting
├── formatter.py         # Output formatting
├── constants.py         # Constants and enums
├── types.py            # Type definitions
└── writer.py           # Line writing utilities
```

#### 5.2 Core Classes

**Encoder**
```python
class ToonEncoder:
    def encode(self, data: Any, options: Optional[EncodeOptions] = None) -> str:
        """Main encoding entry point."""

    def _encode_value(self, value: JsonValue, depth: int) -> None:
        """Encode a value with current indentation depth."""

    def _encode_object(self, obj: JsonObject, depth: int) -> None:
        """Encode an object."""

    def _encode_array(self, arr: JsonArray, key: Optional[str], depth: int) -> None:
        """Encode an array."""
```

**Normalizer**
```python
class ValueNormalizer:
    def normalize(self, value: Any) -> JsonValue:
        """Convert Python types to JSON-compatible types."""

    def _normalize_number(self, value: Union[int, float]) -> JsonPrimitive:
        """Handle special number cases."""

    def _normalize_collection(self, value: Any) -> JsonValue:
        """Handle lists, tuples, sets, dicts."""
```

**PrimitiveEncoder**
```python
class PrimitiveEncoder:
    def encode_primitive(self, value: JsonPrimitive, delimiter: str) -> str:
        """Encode a primitive value."""

    def encode_string(self, value: str, delimiter: str) -> str:
        """Encode a string with proper quoting."""

    def escape_string(self, value: str) -> str:
        """Escape special characters in strings."""

    def needs_quotes(self, value: str, delimiter: str) -> bool:
        """Check if string needs quoting."""
```

**LineWriter**
```python
class LineWriter:
    def __init__(self, indent_size: int = 2):
        self.indent_size = indent_size
        self.lines: List[str] = []

    def push(self, depth: int, line: str) -> None:
        """Add a line with proper indentation."""

    def to_string(self) -> str:
        """Get the final output string."""
```
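
The `LineWriter` interface is small enough that a full body fits in a few lines. The following is one possible implementation, offered as a sketch; joining with `"\n"` rather than appending a newline per line is what keeps the output free of a trailing newline.

```python
from typing import List


class LineWriter:
    def __init__(self, indent_size: int = 2) -> None:
        self.indent_size = indent_size
        self.lines: List[str] = []

    def push(self, depth: int, line: str) -> None:
        """Add a line indented by `depth` levels."""
        self.lines.append(" " * (self.indent_size * depth) + line)

    def to_string(self) -> str:
        """Join lines with newlines; no trailing newline is added."""
        return "\n".join(self.lines)


writer = LineWriter()
writer.push(0, "user:")
writer.push(1, "id: 1")
print(writer.to_string())
```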

### 6. Testing Strategy

#### 6.1 Test Categories
1. **Primitive Encoding**: Strings, numbers, booleans, null
2. **Object Encoding**: Simple, nested, empty objects
3. **Array Encoding**: Primitive, tabular, mixed, nested
4. **Quoting Rules**: Keys and values in various contexts
5. **Delimiter Options**: Comma, tab, pipe delimiters
6. **Length Markers**: With and without markers
7. **Formatting**: Indentation, whitespace invariants
8. **Type Normalization**: Python types to JSON
9. **Edge Cases**: Complex nested structures
10. **Error Handling**: Invalid inputs

#### 6.2 Test Data
- Port all existing TypeScript test cases
- Add Python-specific test cases (datetime, Decimal, UUID, etc.)
- Property-based testing for complex scenarios

#### 6.3 Benchmark Tests
- Token efficiency comparison with JSON
- Performance benchmarks for large datasets
- Memory usage profiling

### 7. Dependencies

#### 7.1 Runtime Dependencies
- **None** (pure Python implementation)

#### 7.2 Development Dependencies
- `pytest` - Testing framework
- `pytest-cov` - Coverage reporting
- `hypothesis` - Property-based testing
- `black` - Code formatting
- `ruff` - Linting
- `mypy` - Type checking
- `pre-commit` - Git hooks

#### 7.3 Optional Dependencies
- `orjson` - Fast JSON serialization (for benchmarks)
- `tiktoken` - Token counting (for benchmarks)

### 8. Package Configuration

#### 8.1 Setup
```toml
# pyproject.toml
[build-system]
requires = ["setuptools>=61.0", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "toon"
version = "0.1.0"
description = "Token-Oriented Object Notation – a token-efficient JSON alternative for LLM prompts"
readme = "README.md"
license = {text = "MIT"}
authors = [{name = "...", email = "..."}]
classifiers = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Programming Language :: Python :: 3.13",
    "Programming Language :: Python :: 3.14",
]
requires-python = ">=3.10"

[project.urls]
Homepage = "https://github.com/..."
Repository = "https://github.com/..."
Documentation = "https://toon.readthedocs.io/"
```

#### 8.2 CI/CD
- GitHub Actions for testing on Python 3.10-3.14
- Automated releases to PyPI
- Code quality checks (linting, type checking, coverage)

### 9. Documentation

#### 9.1 User Documentation
- README with quick start guide
- API reference documentation
- Usage examples and best practices
- Performance benchmarks

#### 9.2 Developer Documentation
- Architecture overview
- Contributing guidelines
- Design decisions and rationale

### 10. Migration Path

#### 10.1 Phase 1: Core Implementation
- Basic encoding functionality
- Primitive and object encoding
- Simple array encoding
- Comprehensive test suite

#### 10.2 Phase 2: Advanced Features
- Tabular array optimization
- Delimiter options
- Length markers
- Type normalization

#### 10.3 Phase 3: Polish & Optimization
- Performance optimization
- Documentation
- Package publishing
- Community feedback integration

### 11. Success Criteria

1. **Functional Parity**: All TypeScript tests pass in Python
2. **Performance**: Comparable or better performance than TypeScript
3. **Token Efficiency**: Same token savings as original implementation
4. **Pythonic Design**: Follows Python conventions and best practices
5. **Documentation**: Complete user and developer documentation
6. **Package Quality**: Meets Python packaging standards
7. **Test Coverage**: >95% test coverage
8. **Type Safety**: Full type annotations with mypy compliance

### 12. Risks and Mitigations

#### 12.1 Technical Risks
- **String Handling Differences**: Python vs JavaScript string behavior
  - *Mitigation*: Comprehensive string encoding tests
- **Type System Differences**: Dynamic vs static typing
  - *Mitigation*: Strict type annotations and runtime checks
- **Performance**: Python may be slower than TypeScript
  - *Mitigation*: Profile and optimize critical paths

#### 12.2 Project Risks
- **Scope Creep**: Adding too many features beyond parity
  - *Mitigation*: Strict adherence to TypeScript feature set
- **Maintenance Burden**: Long-term maintenance commitment
  - *Mitigation*: Automated testing and CI/CD pipeline

This specification provides a comprehensive roadmap for porting TOON to Python while maintaining the core benefits and functionality of the original TypeScript implementation.
