# Data Model: TOON Python Library

## Core Type System

### JSON-Compatible Types

```python
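from typing import Dict, List, Union

# Primitive types
JsonPrimitive = Union[str, int, float, bool, None]

# Complex types (string forward references, because the aliases are recursive)
JsonObject = Dict[str, "JsonValue"]
JsonArray = List["JsonValue"]

# Union type
JsonValue = Union[JsonPrimitive, JsonObject, JsonArray]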
```

### Configuration Types

```python
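from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Delimiter(Enum):
    COMMA = ','
    TAB = '\t'
    PIPE = '|'


# Delimiter is defined first: the default below is evaluated at class-definition time.
@dataclass
class EncodeOptions:
    """Configuration options for TOON encoding."""
    indent: int = 2
    delimiter: Delimiter = Delimiter.COMMA
    length_marker: Optional[str] = None  # '#' or None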
```

## Entity Definitions

### ToonEncoder
**Purpose**: Main orchestration component for TOON encoding
**Key Attributes**:
- `options`: EncodeOptions configuration
- `normalizer`: ValueNormalizer instance
- `primitive_encoder`: PrimitiveEncoder instance
- `formatter`: OutputFormatter instance
- `writer`: LineWriter instance

**Key Methods**:
- `encode(data: Any, options: Optional[EncodeOptions] = None) -> str`
- `_encode_value(value: JsonValue, depth: int) -> None`
- `_encode_object(obj: JsonObject, depth: int) -> None`
- `_encode_array(arr: JsonArray, key: Optional[str], depth: int) -> None`

### ValueNormalizer
**Purpose**: Converts Python types to JSON-compatible representations
**Key Attributes**:
- `type_handlers`: Dict[type, Callable] for custom type conversions

**Key Methods**:
- `normalize(value: Any) -> JsonValue`
- `_normalize_number(value: Union[int, float]) -> JsonPrimitive`
- `_normalize_collection(value: Any) -> JsonValue`

**Type Conversion Rules**:
| Input Type | Output | Conversion Method |
|------------|--------|------------------|
| `str` | `str` | Preserved as-is |
| `int` | `int` | Preserved, including large integers |
| `float` | `float` or `None` | `-0.0` → `0.0`; `NaN`/`±inf` → `None` (`null` in output) |
| `bool` | `bool` | Preserved |
| `None` | `None` | Preserved |
| `datetime` | `str` | ISO format string |
| `date` | `str` | ISO format string |
| `Decimal` | `str` | String representation |
| `UUID` | `str` | String representation |
| `list/tuple` | `list` | Recursively normalized |
| `dict` | `dict` | Recursively normalized |
| `set` | `list` | Converted to list |
| `frozenset` | `list` | Converted to list |
| `bytes` | `str` | Base64 encoded |
| Other | `None` | Unsupported values (functions, arbitrary objects, etc.) become `None` |
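
The conversion table can be read as a single dispatch function. The following is a minimal sketch, not the library's implementation: the function name `normalize_sketch` is hypothetical, and the handling of non-string dictionary keys and of set ordering are assumptions, since the table does not specify them.

```python
import base64
import math
from datetime import date, datetime
from decimal import Decimal
from uuid import UUID


def normalize_sketch(value):
    """Convert a Python value to a JSON-compatible one, following the table."""
    # str, bool, and None pass through unchanged.
    # bool is tested before int because bool is a subclass of int.
    if value is None or isinstance(value, (str, bool)):
        return value
    if isinstance(value, int):
        return value  # Python ints are arbitrary precision, so nothing is lost
    if isinstance(value, float):
        if math.isnan(value) or math.isinf(value):
            return None  # NaN and ±inf have no JSON representation
        return 0.0 if value == 0 else value  # collapses -0.0 to 0.0
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    if isinstance(value, (Decimal, UUID)):
        return str(value)
    if isinstance(value, bytes):
        return base64.b64encode(value).decode("ascii")
    if isinstance(value, dict):
        # Assumption: non-string keys are stringified.
        return {str(k): normalize_sketch(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set, frozenset)):
        # Assumption: sets are emitted in iteration order (not sorted).
        return [normalize_sketch(v) for v in value]
    return None  # functions, arbitrary objects, etc.
```

Because `datetime` is a subclass of `date`, one `isinstance` check covers both rows of the table.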

### PrimitiveEncoder
**Purpose**: Encodes primitive values with proper quoting
**Key Attributes**:
- `delimiter`: Current delimiter character
- `escape_chars`: Set of characters requiring escaping

**Key Methods**:
- `encode_primitive(value: JsonPrimitive, delimiter: str) -> str`
- `encode_string(value: str, delimiter: str) -> str`
- `escape_string(value: str) -> str`
- `needs_quotes(value: str, delimiter: str) -> bool`

**Quoting Rules**:
**String Quoting Conditions** (quote when ANY is true):
- Empty string
- Contains active delimiter, colon, quotes, backslash, or control chars
- Leading or trailing spaces
- Looks like boolean/number/null (`true`, `false`, `null`, `42`, `-3.14`)
- Starts with `"- "` (list-like)
- Looks like structural token (`[5]`, `{key}`, `[3]: x,y`)
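
The string conditions above could be checked roughly as follows. This is a sketch under stated assumptions: the function name `string_needs_quotes` is hypothetical, and the regular expressions for "looks like a number" and "looks like a structural token" are illustrative guesses, not patterns taken from the library.

```python
import re

# Assumed patterns; the library's exact expressions are not given in this document.
_NUMERIC = re.compile(r"^-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?$")
_STRUCTURAL = re.compile(r"^\[\d+\]|^\{[^{}]*\}$")
_CONTROL = re.compile(r"[\x00-\x1f]")


def string_needs_quotes(value: str, delimiter: str = ",") -> bool:
    """Return True when any of the string quoting conditions holds."""
    if value == "":
        return True
    if value[0] == " " or value[-1] == " ":
        return True  # leading or trailing spaces
    if value in ("true", "false", "null") or _NUMERIC.match(value):
        return True  # would otherwise be read back as a non-string
    if delimiter in value or any(ch in value for ch in ':"\\'):
        return True
    if _CONTROL.search(value):
        return True
    if value.startswith("- "):
        return True  # list-like
    if _STRUCTURAL.search(value):
        return True  # e.g. [5] or {key}
    return False
```

Only the active delimiter triggers quoting, so `a,b` needs quotes under the comma delimiter but not under the pipe delimiter.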

**Key Quoting Conditions** (quote when ANY is true):
- Contains spaces, commas, colons, quotes, control chars
- Contains brackets or braces
- Leading hyphen
- Numeric-only key
- Empty key
- Doesn't match the `^[A-Za-z_][\w.]*$` pattern (the leading letter may be upper- or lowercase)
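
A minimal sketch of the key check, assuming the pattern is meant to accept both upper- and lowercase ASCII letters at the start (so that keys such as `tags` stay unquoted). The function name `key_needs_quotes` is hypothetical, and restricting `\w` to ASCII via `re.ASCII` is also an assumption.

```python
import re

# Assumption: ASCII-only word characters; a leading letter or underscore,
# followed by letters, digits, underscores, or dots.
_UNQUOTED_KEY = re.compile(r"[A-Za-z_][\w.]*", re.ASCII)


def key_needs_quotes(key: str) -> bool:
    """Return True when the key cannot be written bare."""
    # fullmatch avoids the quirk where `$` also matches before a trailing newline.
    return _UNQUOTED_KEY.fullmatch(key) is None
```

Under this reading, the pattern subsumes the other listed conditions: empty keys, numeric-only keys, leading hyphens, and keys containing spaces, punctuation, brackets, or braces all fail to match.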

### ArrayAnalyzer
**Purpose**: Determines optimal encoding strategy for arrays
**Key Attributes**:
- `uniform_threshold`: Minimum size for tabular optimization

**Key Methods**:
- `analyze_array(arr: JsonArray) -> ArrayType`
- `is_uniform_objects(arr: JsonArray) -> bool`
- `get_tabular_headers(arr: JsonArray) -> List[str]`

**Array Types**:
```python
class ArrayType(Enum):
    INLINE = "inline"      # Primitive arrays: tags[3]: admin,ops,dev
    TABULAR = "tabular"    # Uniform objects: items[2]{sku,qty,price}:
    LIST = "list"          # Mixed/nested: items[3]: - item1 - item2
```

**Tabular Requirements**:
- All elements are objects
- Identical key sets (no missing/extra keys)
- Primitive values only (no nested arrays/objects)
- Header order from first object
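
The tabular requirements translate into a short check. The sketch below is illustrative only: the name `tabular_headers_or_none` is hypothetical, and rejecting empty arrays and arrays of empty objects is an assumption the list above does not state.

```python
from typing import Any, List, Optional


def tabular_headers_or_none(arr: List[Any]) -> Optional[List[str]]:
    """Return the header list if `arr` qualifies for tabular encoding, else None."""
    if not arr or not all(isinstance(item, dict) for item in arr):
        return None
    first_keys = arr[0].keys()
    if not first_keys:
        return None  # assumption: empty objects have nothing to tabulate
    for item in arr:
        # dict key views compare like sets, so key order may differ between rows.
        if item.keys() != first_keys:
            return None
        if any(isinstance(v, (dict, list)) for v in item.values()):
            return None  # nested values are not allowed in a table row
    return list(first_keys)  # header order comes from the first object
```

For example, `[{"sku": "A", "qty": 1}, {"qty": 2, "sku": "B"}]` qualifies, and the headers follow the first object's order: `["sku", "qty"]`.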

### OutputFormatter
**Purpose**: Handles indentation, delimiters, and final output formatting
**Key Attributes**:
- `indent_size`: Spaces per nesting level
- `delimiter`: Current delimiter character
- `length_marker`: Optional prefix for array counts

**Key Methods**:
- `format_object_line(key: str, depth: int) -> str`
- `format_primitive_line(key: str, value: str, depth: int) -> str`
- `format_array_header(key: str, count: int, headers: Optional[List[str]]) -> str`
- `format_list_item(item: str, depth: int) -> str`

**Line Formats**:
- `key: value` for primitives (single space after colon)
- `key:` for nested/empty objects (no trailing space)
- `key[N]: v1,v2` for primitive arrays
- `key[N]{f1,f2}:` for tabular arrays
- `  - item` for list items (2 spaces, hyphen, space)
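
The line formats can be sketched as small string helpers. The signatures below loosely mirror the method list but are written as free functions with extra keyword parameters, which is an assumption. The placement of the length marker directly before the count (`[#3]`) is likewise an assumption based on its description as a prefix for array counts.

```python
from typing import List, Optional


def pad(depth: int, indent_size: int = 2) -> str:
    """Indentation for a given nesting depth."""
    return " " * (indent_size * depth)


def format_object_line(key: str, depth: int, indent_size: int = 2) -> str:
    return f"{pad(depth, indent_size)}{key}:"  # no trailing space


def format_primitive_line(key: str, value: str, depth: int, indent_size: int = 2) -> str:
    return f"{pad(depth, indent_size)}{key}: {value}"  # single space after colon


def format_array_header(
    key: str,
    count: int,
    headers: Optional[List[str]] = None,
    delimiter: str = ",",
    length_marker: Optional[str] = None,
) -> str:
    header = f"{key}[{length_marker or ''}{count}]"
    if headers:
        header += "{" + delimiter.join(headers) + "}"
    return header + ":"


def format_list_item(item: str, depth: int, indent_size: int = 2) -> str:
    return f"{pad(depth, indent_size)}- {item}"
```

A primitive array line is then the header, one space, and the joined values, for example `format_array_header("tags", 3) + " " + ",".join(["admin", "ops", "dev"])`.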

### LineWriter
**Purpose**: Collects indented output lines and joins them once at the end, avoiding repeated string concatenation
**Key Attributes**:
- `lines`: List[str] for accumulated output
- `indent_size`: Spaces per nesting level

**Key Methods**:
- `push(depth: int, line: str) -> None`
- `to_string() -> str`

**Formatting Rules**:
- 2 spaces per nesting level (configurable)
- No trailing spaces
- No trailing newline at end of output
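
A minimal sketch of a writer with the attributes and methods listed above. The class name `LineWriterSketch` is hypothetical, and the sketch assumes callers never pass lines that end in spaces.

```python
from typing import List


class LineWriterSketch:
    """Accumulates indented lines and joins them into the final output."""

    def __init__(self, indent_size: int = 2) -> None:
        self.indent_size = indent_size
        self.lines: List[str] = []

    def push(self, depth: int, line: str) -> None:
        # Prefix the line with indent_size spaces per nesting level.
        self.lines.append(" " * (self.indent_size * depth) + line)

    def to_string(self) -> str:
        # Joining with "\n" puts newlines only between lines,
        # so the output has no trailing newline.
        return "\n".join(self.lines)
```

Usage:

```python
writer = LineWriterSketch()
writer.push(0, "user:")
writer.push(1, "id: 1")
print(writer.to_string())
```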

## Data Flow

1. **Input**: Raw Python data structure
2. **Normalization**: ValueNormalizer converts to JsonValue types
3. **Analysis**: ArrayAnalyzer determines optimal encoding strategies
4. **Encoding**: ToonEncoder orchestrates multi-pass encoding
5. **Formatting**: OutputFormatter applies indentation and delimiters
6. **Output**: LineWriter streams final TOON format
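
To make the flow concrete, here is a deliberately simplified sketch that walks already-normalized data and emits lines. It is not the library's encoder. The name `encode_sketch` is hypothetical, it performs no quoting, and it assumes every array contains only primitives (no tabular or list forms).

```python
from typing import Any, Dict, List


def encode_sketch(data: Dict[str, Any], indent: int = 2) -> str:
    """Encode nested dicts, primitives, and primitive arrays as TOON-style lines."""
    lines: List[str] = []

    def fmt(value: Any) -> str:
        if value is None:
            return "null"
        if isinstance(value, bool):
            return "true" if value else "false"
        return str(value)  # no quoting in this sketch

    def walk(obj: Dict[str, Any], depth: int) -> None:
        pad = " " * (indent * depth)
        for key, value in obj.items():
            if isinstance(value, dict):
                lines.append(f"{pad}{key}:")
                walk(value, depth + 1)
            elif isinstance(value, list):
                header = f"{pad}{key}[{len(value)}]:"
                if value:
                    header += " " + ",".join(fmt(v) for v in value)
                lines.append(header)
            else:
                lines.append(f"{pad}{key}: {fmt(value)}")

    walk(data, 0)
    return "\n".join(lines)
```

For the input `{"user": {"id": 1, "active": True}, "tags": ["a", "b"]}`, this produces four lines: `user:`, `  id: 1`, `  active: true`, and `tags[2]: a,b`.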

## Validation Rules

### Input Validation
- Must be JSON-serializable after normalization
- Circular references raise an explicit exception
- Dataset size must be under 10 MB (clarified constraint)
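
One way to detect circular references is to track the containers on the current traversal path. The sketch below is an assumption about how such a check might look: the function name `check_no_cycles` and the choice of `ValueError` are hypothetical, since this document does not name the exception type.

```python
from typing import Any, Optional, Set


def check_no_cycles(value: Any, _path: Optional[Set[int]] = None) -> None:
    """Raise ValueError if `value` contains a reference back to one of its ancestors."""
    path = set() if _path is None else _path
    if isinstance(value, (dict, list, tuple, set, frozenset)):
        if id(value) in path:
            raise ValueError("circular reference detected")
        path.add(id(value))
        children = value.values() if isinstance(value, dict) else value
        for child in children:
            check_no_cycles(child, path)
        # Remove on the way out so that shared, non-cyclic references are allowed.
        path.remove(id(value))
```

Tracking only the current path, instead of every container seen so far, means the same list may appear under two different keys without being reported as a cycle.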

### Output Validation
- No trailing spaces or newlines
- Proper indentation maintained
- Token count reduced by 30-60% relative to the equivalent JSON
- All quoting rules applied correctly
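
The whitespace and indentation rules lend themselves to a mechanical check. The following sketch is illustrative: the name `find_output_problems` is hypothetical, and it assumes that "proper indentation" means every line's leading spaces are a multiple of the indent size. Token reduction and quoting correctness are outside its scope.

```python
from typing import List


def find_output_problems(text: str, indent: int = 2) -> List[str]:
    """Return a list of whitespace and indentation problems found in encoded output."""
    problems: List[str] = []
    if text.endswith("\n"):
        problems.append("trailing newline")
    for number, line in enumerate(text.split("\n"), start=1):
        if line != line.rstrip(" "):
            problems.append(f"line {number}: trailing space")
        leading = len(line) - len(line.lstrip(" "))
        if leading % indent != 0:
            problems.append(f"line {number}: indentation is not a multiple of {indent}")
    return problems
```

An empty result means the output passed these particular checks, for example `find_output_problems("user:\n  id: 1")` returns `[]`.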