# Feature Specification: TOON Python Library - Core Encoding Implementation

**Feature Branch**: `001-toon-encoding`  
**Created**: 2025-01-27  
**Status**: Draft  
**Input**: User description: "Based on my reading of SPEC.md, here's my feature specification: [detailed feature description]"

## User Scenarios & Testing *(mandatory)*

### User Story 1 - Basic TOON Encoding (Priority: P1)

AI/LLM developers need to convert Python data structures to TOON format to reduce token costs when sending structured data to language models. They want a simple way to encode common data types with minimal configuration.

**Why this priority**: This is the core functionality that delivers the primary value proposition of token reduction. Without basic encoding, the library provides no value to users.

**Independent Test**: Can be fully tested by encoding various Python data structures and verifying token reduction compared to JSON, with output matching TOON specification format.

**Acceptance Scenarios**:

1. **Given** a Python dictionary with primitive values, **When** encoded to TOON, **Then** output uses minimal tokens with proper formatting
2. **Given** a list of strings, **When** encoded to TOON, **Then** output uses inline array format with correct delimiters
3. **Given** nested data structures, **When** encoded to TOON, **Then** output maintains proper indentation and hierarchy

---

### User Story 2 - Advanced Type Support (Priority: P2)

Data scientists working with real-world data need to encode Python-specific types like datetime, Decimal, UUID, and bytes objects that are common in data processing workflows.

**Why this priority**: Essential for practical usage in data science and backend applications where these types are frequently used. Without this support, users would need manual preprocessing.

**Independent Test**: Can be fully tested by encoding objects with Python-specific types and verifying they convert to appropriate JSON-compatible representations in TOON format.

**Acceptance Scenarios**:

1. **Given** datetime objects, **When** encoded to TOON, **Then** output contains ISO format strings
2. **Given** Decimal objects, **When** encoded to TOON, **Then** output contains string representations
3. **Given** UUID objects, **When** encoded to TOON, **Then** output contains string representations
4. **Given** bytes objects, **When** encoded to TOON, **Then** output contains Base64 encoded strings
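
A minimal sketch of the normalization step these scenarios describe, using only the standard library. The function name `normalize_value` is hypothetical; the mapping itself (ISO 8601 for datetime, `str()` for Decimal and UUID, Base64 for bytes) follows the scenarios above.

```python
import base64
from datetime import datetime
from decimal import Decimal
from uuid import UUID


def normalize_value(value):
    """Hypothetical helper: map Python-specific types to JSON-compatible strings."""
    if isinstance(value, datetime):
        return value.isoformat()
    if isinstance(value, (Decimal, UUID)):
        # str() preserves Decimal precision (e.g. trailing zeros) and the canonical UUID form.
        return str(value)
    if isinstance(value, bytes):
        return base64.b64encode(value).decode("ascii")
    return value  # already JSON-compatible, or left for later validation


print(normalize_value(datetime(2025, 1, 27, 12, 30)))
print(normalize_value(Decimal("1.10")))
print(normalize_value(b"hi"))
```

Returning strings for Decimal avoids the precision loss a float conversion could introduce.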

---

### User Story 3 - Flexible Formatting Options (Priority: P2)

Backend developers building LLM-powered applications need control over output formatting to match different use cases, including configuration files, API responses, and prompt templates.

**Why this priority**: Different applications require different formatting conventions. Without flexibility, the library would be limited to a single use case.

**Independent Test**: Can be fully tested by encoding the same data with different formatting options and verifying output matches expected patterns.

**Acceptance Scenarios**:

1. **Given** configurable indentation, **When** encoding nested objects, **Then** output uses specified indentation size
2. **Given** different delimiter options, **When** encoding arrays, **Then** output uses correct delimiter (comma, tab, or pipe)
3. **Given** length marker option, **When** encoding arrays, **Then** output includes optional `#` prefix when enabled
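
One way the options in these scenarios could be modeled is a small immutable configuration object. `EncodeOptions` and `encode_inline_array` are hypothetical names, and the header layout (non-comma delimiters echoed inside the brackets, `#` placed before the length) is an assumption about the TOON format that should be checked against SPEC.md.

```python
from dataclasses import dataclass

ALLOWED_DELIMITERS = (",", "\t", "|")


@dataclass(frozen=True)
class EncodeOptions:
    """Hypothetical option bundle; field names are illustrative."""
    indent: int = 2              # spaces per nesting level (used by nested encoding, not shown)
    delimiter: str = ","         # one of comma, tab, or pipe
    length_marker: bool = False  # when True, prefix array lengths with '#'


def encode_inline_array(key, items, options=EncodeOptions()):
    """Encode a primitive array on one line using the configured delimiter."""
    if options.delimiter not in ALLOWED_DELIMITERS:
        raise ValueError(f"unsupported delimiter: {options.delimiter!r}")
    marker = "#" if options.length_marker else ""
    # Assumption: a non-default delimiter is repeated in the header so a reader can detect it.
    hint = "" if options.delimiter == "," else options.delimiter
    header = f"{key}[{marker}{len(items)}{hint}]:"
    if not items:
        return header
    return header + " " + options.delimiter.join(str(item) for item in items)


print(encode_inline_array("tags", ["a", "b", "c"]))
print(encode_inline_array("tags", ["a", "b", "c"],
                          EncodeOptions(delimiter="|", length_marker=True)))
```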

---

### User Story 4 - Array Optimization Strategies (Priority: P3)

Power users working with large datasets need optimal array encoding to maximize token reduction, especially for tabular data and mixed content arrays.

**Why this priority**: Advanced optimization provides maximum token savings for power users and differentiates the library from basic JSON alternatives.

**Independent Test**: Can be fully tested by encoding different array types and verifying the correct optimization strategy is applied.

**Acceptance Scenarios**:

1. **Given** uniform object arrays, **When** encoded to TOON, **Then** output uses tabular format with headers
2. **Given** primitive arrays, **When** encoded to TOON, **Then** output uses inline format with delimiters
3. **Given** mixed/nested arrays, **When** encoded to TOON, **Then** output uses list format with hyphens
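
The strategy selection in these scenarios can be sketched as a classifier over the array contents. The names `choose_array_strategy` and `encode_tabular` are hypothetical, and the uniformity test used here (every element is a dict with the same key set and only primitive values) is an assumption about what "uniform" means.

```python
PRIMITIVES = (str, int, float, bool, type(None))


def choose_array_strategy(items):
    """Return 'inline', 'tabular', or 'list' for a Python list."""
    if all(isinstance(item, PRIMITIVES) for item in items):
        return "inline"  # also covers the empty list
    if all(isinstance(item, dict) for item in items):
        first_keys = set(items[0])
        same_keys = all(set(item) == first_keys for item in items)
        flat = all(isinstance(v, PRIMITIVES) for item in items for v in item.values())
        if first_keys and same_keys and flat:
            return "tabular"
    return "list"  # mixed or nested content falls back to hyphenated list items


def encode_tabular(key, rows, indent=2):
    """Render uniform rows as a header naming the fields, then one delimited line per row."""
    fields = list(rows[0])
    field_list = ",".join(fields)
    header = f"{key}[{len(rows)}]{{{field_list}}}:"
    body = [" " * indent + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *body])


users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Bob"}]
print(choose_array_strategy(users))
print(encode_tabular("users", users))
```

The tabular form states each field name once in the header instead of once per row, which is where most of the token savings for uniform data would come from.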

---

### Edge Cases

- **Large datasets**: System fails fast with a clear error when the dataset exceeds the 10MB limit
- **Circular references**: System raises an explicit exception that includes the reference path
- **Non-serializable objects**: System fails fast with a clear exception that names the object's type
- **Numeric edge cases**: System handles -0, NaN, and infinity according to the TOON specification with strict validation
- **Empty values**: System properly encodes empty strings, arrays, and objects according to TOON format rules
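
A sketch of how the circular-reference and non-serializable cases might be detected before encoding begins. `ToonEncodeError` and `check_serializable` are hypothetical names, and the `$`-rooted path notation is an illustrative choice rather than something the specification prescribes.

```python
class ToonEncodeError(ValueError):
    """Hypothetical exception type for fail-fast encoding errors."""


def check_serializable(value, path="$", _active=None):
    """Walk the structure; raise with a path on cycles or unsupported types."""
    active = _active if _active is not None else set()
    if isinstance(value, (dict, list, tuple)):
        if id(value) in active:
            raise ToonEncodeError(f"circular reference at {path}")
        active.add(id(value))
        if isinstance(value, dict):
            children = ((f"{path}.{key}", child) for key, child in value.items())
        else:
            children = ((f"{path}[{i}]", child) for i, child in enumerate(value))
        for child_path, child in children:
            check_serializable(child, child_path, active)
        # Only containers on the current branch count, so shared (non-circular)
        # references to the same object are still accepted.
        active.discard(id(value))
    elif not isinstance(value, (str, int, float, bool, type(None))):
        raise ToonEncodeError(f"cannot encode {type(value).__name__} at {path}")


node = {"name": "root"}
node["self"] = node
try:
    check_serializable(node)
except ToonEncodeError as exc:
    print(exc)
```

In a full implementation the type check would run after type normalization, so that datetime, Decimal, UUID, and bytes values are not rejected here.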

## Requirements *(mandatory)*

### Functional Requirements

- **FR-001**: System MUST convert Python data structures to TOON format with 30-60% fewer tokens than equivalent JSON
- **FR-002**: System MUST support all JSON-compatible primitive types (strings, numbers, booleans, null)
- **FR-003**: System MUST normalize Python-specific types (datetime, Decimal, UUID, bytes, sets) to JSON-compatible representations
- **FR-004**: System MUST apply smart quoting rules that minimize quote usage while maintaining parsability
- **FR-005**: System MUST support three array encoding strategies: inline, tabular, and list format
- **FR-006**: System MUST provide configurable formatting options (indentation, delimiters, length markers)
- **FR-007**: System MUST handle numeric edge cases (-0, NaN, infinity) according to specification
- **FR-008**: System MUST fail fast with clear exception when encountering non-serializable objects
- **FR-009**: System MUST produce output with no trailing spaces or newlines
- **FR-010**: System MUST emit a single space after the colon when an inline value follows, and no space after the colon when a nested object follows on subsequent lines
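
For FR-007, a minimal sketch of numeric normalization. It assumes the TOON specification maps non-finite numbers to `null` and negative zero to `0`; that mapping should be confirmed against SPEC.md, and `normalize_number` is a hypothetical name.

```python
import math


def normalize_number(value):
    """Assumed mapping: NaN and +/-infinity become None (null); -0.0 becomes 0."""
    if isinstance(value, float):
        if math.isnan(value) or math.isinf(value):
            return None
        if value == 0:
            return 0  # collapses both 0.0 and -0.0 to the integer zero
    return value


print(normalize_number(float("nan")))
print(normalize_number(-0.0))
print(normalize_number(1.5))
```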

### Key Entities *(include if feature involves data)*

- **TOON Encoder**: Main component that orchestrates the encoding process and manages configuration
- **Type Normalizer**: Component that converts Python types to JSON-compatible representations
- **Array Analyzer**: Component that determines optimal encoding strategy for different array types
- **Quoting Engine**: Component that applies smart quoting rules based on content analysis
- **Output Formatter**: Component that handles indentation, delimiters, and final output formatting

## Clarifications

### Session 2025-01-27

- Q: Data Volume & Scale Constraints → A: Focus on small datasets (<10MB) with simple, fast implementation
- Q: Error Handling Strategy → A: Strict mode: fail fast on any encoding errors with clear exceptions
- Q: Performance Optimization Priority → A: Prioritize maximum token reduction over speed

## Success Criteria *(mandatory)*

### Measurable Outcomes

- **SC-001**: TOON encoding reduces token count by 30-60% compared to equivalent JSON for structured data
- **SC-002**: Users can encode datasets up to 10MB with maximum token reduction optimization
- **SC-003**: 95% of Python data structures encode successfully without errors
- **SC-004**: Memory usage scales linearly with data size (no exponential growth for nested structures)
- **SC-005**: Users can achieve token cost reduction of at least 25% on their first encoding attempt
- **SC-006**: All TypeScript test cases pass with identical output when ported to the Python implementation
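
The percentage targets in SC-001 and SC-005 can be checked with simple arithmetic once token counts are available. The sketch below assumes the counts come from whichever tokenizer the target model uses; `token_reduction_percent` is a hypothetical helper and the example counts are made up for illustration.

```python
def token_reduction_percent(json_tokens, toon_tokens):
    """Percentage of tokens saved by TOON relative to the JSON baseline."""
    if json_tokens <= 0:
        raise ValueError("json_tokens must be positive")
    return 100 * (json_tokens - toon_tokens) / json_tokens


# Illustrative counts only, not measured results.
reduction = token_reduction_percent(json_tokens=200, toon_tokens=120)
print(f"{reduction:.1f}% fewer tokens")
```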