# Design Document

## Overview

The Credit Card Statement Processor is a Python-based system that extracts transaction data from PDF credit card statements and writes it to CSV. It uses Docling's DocumentExtractor with Pydantic models for structured data extraction, holds the extracted transactions in a pandas DataFrame in memory, and exports the raw transaction data to `full_transactions.csv`.

This spec focuses on the extraction pipeline only. Analytics (vendor aggregation, recurrence detection) are handled by separate specs.

## Architecture

```mermaid
flowchart TD
    A[Input Directory] --> B[PDF Scanner]
    B --> C[Docling DocumentExtractor]
    C --> D[Pydantic Models]
    D --> E[Data Validator]
    E --> F[pandas DataFrame]
    F --> G[CSV Exporter]
    G --> H[full_transactions.csv]
```

### Component Flow

1. **PDF Scanner**: Discovers all PDF files in the input directory
2. **DocumentExtractor**: Uses Docling's DocumentExtractor API with Pydantic model templates
3. **Pydantic Models**: Define extraction templates with Field definitions and examples
4. **Data Validator**: Validates extracted data using Pydantic model validation
5. **DataFrame Storage**: Holds all transactions in memory
6. **CSV Exporter**: Saves raw transactions to full_transactions.csv

## Components and Interfaces

### PDFScanner

```python
from pathlib import Path
from typing import List


class PDFScanner:
    def __init__(self, input_directory: str):
        """Initialize scanner with input directory path."""
        
    def scan(self) -> List[Path]:
        """Return list of PDF file paths found in directory."""
        
    def validate_directory(self) -> bool:
        """Check if directory exists and is readable."""
```
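
The interface above could be filled in with `pathlib` alone. The following is a minimal sketch, not the project's implementation: it assumes a non-recursive scan, a case-insensitive `.pdf` match, and name-sorted results, none of which are stated in this document.

```python
from pathlib import Path
from typing import List


class PDFScanner:
    """Sketch of the PDFScanner interface using only pathlib."""

    def __init__(self, input_directory: str):
        self.input_directory = Path(input_directory)

    def validate_directory(self) -> bool:
        # Assumption: "readable" means the directory exists and can be listed.
        if not self.input_directory.is_dir():
            return False
        try:
            next(self.input_directory.iterdir(), None)
        except OSError:
            return False
        return True

    def scan(self) -> List[Path]:
        # Assumptions: non-recursive, case-insensitive ".pdf" suffix,
        # sorted by name so the processing order is deterministic.
        if not self.validate_directory():
            return []
        return sorted(
            p
            for p in self.input_directory.iterdir()
            if p.is_file() and p.suffix.lower() == ".pdf"
        )
```

Returning an empty list for an invalid directory is one possible choice; raising an exception instead would also be consistent with the interface.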

### TextExtractor

```python
from pathlib import Path
from typing import Dict, List


class TextExtractor:
    def __init__(self):
        """Initialize Docling document converter."""
        
    def extract(self, pdf_path: Path) -> str:
        """Extract raw text from PDF file using Docling."""
        
    def extract_batch(self, pdf_paths: List[Path]) -> Dict[Path, str]:
        """Extract text from multiple PDFs, returning path-to-text mapping."""
```

### StructuredExtractor

```python
class StructuredExtractor:
    """Extracts structured data from PDFs using Docling's DocumentExtractor with Pydantic models."""
    
    def __init__(self):
        """Initialize Docling DocumentExtractor with PDF format support."""
        
    def extract_statement(self, pdf_path: Path) -> StatementExtraction:
        """Extract structured statement data using Pydantic model template."""
        
    def extract_transactions(self, pdf_path: Path) -> List[TransactionExtraction]:
        """Extract transactions from all pages of a PDF statement."""
        
    def validate_extraction(self, extraction: StatementExtraction) -> ValidationResult:
        """Validate extracted data using Pydantic model validation."""
```

### TransactionParser

```python
class TransactionParser:
    def __init__(self, date_format: str = "%m/%d/%Y"):
        """Initialize parser with expected date format."""
        
    def parse(self, raw_text: str, source_file: str) -> List[Transaction]:
        """Parse raw text into list of Transaction objects."""
        
    def parse_from_extraction(self, extraction: StatementExtraction, source_file: str) -> List[Transaction]:
        """Convert structured extraction to Transaction objects."""
        
    def normalize_date(self, date_str: str) -> str:
        """Convert date string to ISO format (YYYY-MM-DD)."""
        
    def parse_amount(self, amount_str: str) -> float:
        """Parse amount string to float, handling currency symbols and signs."""
```
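
The two helper methods can be sketched as standalone functions using only the standard library. This is an illustrative sketch under assumptions: the accepted amount shapes (leading or trailing minus, surrounding parentheses) are inferred from the examples and properties in this document, and the parsed sign is the sign as printed. How that sign maps onto the `Transaction` convention (negative for charges) is not specified here, so no flipping is applied. Dates without a year, such as `01/15`, are rejected by the default format because year inference is also unspecified.

```python
import re
from datetime import datetime


def normalize_date(date_str: str, date_format: str = "%m/%d/%Y") -> str:
    """Convert a date string in `date_format` to ISO format (YYYY-MM-DD)."""
    # datetime.strptime raises ValueError when the input does not match.
    parsed = datetime.strptime(date_str.strip(), date_format)
    return parsed.strftime("%Y-%m-%d")


def parse_amount(amount_str: str) -> float:
    """Parse strings such as "$1,234.56", "$50.00-", "-12" or "(12.00)"."""
    text = amount_str.strip()
    negative = False
    # Assumption: parentheses, a leading minus, or a trailing minus
    # all mark the printed value as negative.
    if text.startswith("(") and text.endswith(")"):
        negative, text = True, text[1:-1]
    elif text.startswith("-"):
        negative, text = True, text[1:]
    elif text.endswith("-"):
        negative, text = True, text[:-1]
    number = text.replace("$", "").replace(",", "").strip()
    if not re.fullmatch(r"\d+(\.\d+)?", number):
        raise ValueError(f"invalid amount: {amount_str!r}")
    value = float(number)
    return -value if negative else value
```

Both functions raise `ValueError` on malformed input, which lets a caller log a warning and skip the transaction.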

### CSVExporter

```python
import pandas as pd


class CSVExporter:
    def __init__(self, output_directory: str):
        """Initialize with output directory path."""
        
    def export_all(self, df: pd.DataFrame, filename: str = "full_transactions.csv"):
        """Export all transactions to CSV."""
```

### StatementProcessor (Orchestrator)

```python
class StatementProcessor:
    def __init__(self, input_dirs: List[str], output_dir: str):
        """Initialize processor with input/output directories."""
        
    def process(self) -> ProcessingResult:
        """Run extraction pipeline and return results."""
        
    def process_directory(self, directory: str) -> pd.DataFrame:
        """Process all PDFs in a single directory."""
        
    def combine_datasets(self, dataframes: List[pd.DataFrame]) -> pd.DataFrame:
        """Combine multiple DataFrames from different directories."""
```

## Data Models

### Pydantic Extraction Models

```python
from typing import List

from pydantic import BaseModel, Field


class TransactionExtraction(BaseModel):
    """Pydantic model for extracting individual transactions from PDF."""
    transaction_date: str = Field(examples=["01/15", "02/28", "12/01"])
    post_date: str = Field(examples=["01/15", "02/28", "12/01"])
    description: str = Field(examples=["WALMART #1234 ATLANTA GA"])
    amount: str = Field(examples=["$123.45", "$50.00-"])


class StatementExtraction(BaseModel):
    """Complete Pydantic model for extracting all statement data."""
    # StatementMetadataExtraction is not defined in this document.
    metadata: StatementMetadataExtraction = Field(default_factory=StatementMetadataExtraction)
    transactions: List[TransactionExtraction] = Field(default_factory=list)
```

### Transaction

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    date: str              # ISO format: YYYY-MM-DD
    description: str       # Merchant/transaction description
    amount: float          # Negative for charges, positive for credits
    source_file: str       # Original PDF filename
```

### ProcessingResult

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProcessingResult:
    total_files: int
    successful_files: int
    failed_files: int
    total_transactions: int
    errors: List[str]
```

### DataFrame Schema

| Column | Type | Description |
|--------|------|-------------|
| date | string | Transaction date in YYYY-MM-DD format |
| description | string | Merchant or transaction description |
| amount | float | Transaction amount (negative=charge, positive=credit) |
| source_file | string | Original PDF filename |

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system.*

### Property 1: PDF Scanner Returns Only PDF Files

*For any* directory containing a mix of file types, the PDF scanner SHALL return only files with a `.pdf` extension.

**Validates: Requirements 1.1**

### Property 2: Transaction Parsing Round-Trip

*For any* valid Transaction object, formatting it to text and then parsing that text back SHALL produce an equivalent Transaction.

**Validates: Requirements 2.1**
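
As an illustration of what this round-trip check involves, the sketch below uses a hypothetical tab-separated line format. This document does not define the text format, so `format_transaction`, `parse_transaction`, and `check_round_trip` are assumed names for illustration only. The sketch also assumes descriptions contain no tab characters and amounts are finite.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    date: str
    description: str
    amount: float
    source_file: str


def format_transaction(tx: Transaction) -> str:
    # Hypothetical line format: tab-separated fields. repr() emits enough
    # digits for float() to recover exactly the same value.
    return "\t".join([tx.date, tx.description, repr(tx.amount)])


def parse_transaction(line: str, source_file: str) -> Transaction:
    date, description, amount = line.split("\t")
    return Transaction(date, description, float(amount), source_file)


def check_round_trip(tx: Transaction) -> bool:
    """True when formatting and re-parsing yields an equal Transaction."""
    return parse_transaction(format_transaction(tx), tx.source_file) == tx
```

A property-based test would call `check_round_trip` on generated `Transaction` values rather than on hand-picked ones.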

### Property 3: Date Normalization Consistency

*For any* valid date string in supported formats, normalizing to ISO format SHALL produce a string matching YYYY-MM-DD.

**Validates: Requirements 2.2**

### Property 4: Amount Parsing Preserves Value

*For any* amount string containing currency symbols, commas, or parentheses, parsing SHALL extract the correct numeric value with appropriate sign.

**Validates: Requirements 2.3**

### Property 5: CSV Round-Trip Preserves Data

*For any* DataFrame of transactions, exporting to CSV and importing back SHALL preserve all values including UTF-8 characters.

**Validates: Requirements 3.1, 3.3**

### Property 6: Transaction Order Preservation

*For any* ordered list of transactions, saving to CSV and reading back SHALL preserve the original order.

**Validates: Requirements 3.4**
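
Properties 5 and 6 can be checked together with a small pandas round trip. The sketch below is one way to do it, under assumptions: the function name `export_and_reload` and the sample data are illustrative, and the string columns are read back with an explicit `dtype` so that dates stay as text.

```python
import pandas as pd

COLUMNS = ["date", "description", "amount", "source_file"]


def export_and_reload(df: pd.DataFrame, path: str) -> pd.DataFrame:
    """Write df to a UTF-8 CSV file and read it back."""
    # index=False keeps only the schema columns in the file.
    df.to_csv(path, index=False, encoding="utf-8")
    return pd.read_csv(
        path,
        encoding="utf-8",
        # Keep text columns as strings instead of letting pandas infer types.
        dtype={"date": str, "description": str, "source_file": str},
        # Parse floats so that written values are recovered exactly.
        float_precision="round_trip",
    )


# Illustrative sample: deliberately not sorted by date, with non-ASCII
# text and an embedded comma in one description.
sample = pd.DataFrame(
    {
        "date": ["2024-01-15", "2024-01-03"],
        "description": ["CAFÉ München, DE", "WALMART #1234 ATLANTA GA"],
        "amount": [-4.5, -123.45],
        "source_file": ["jan.pdf", "jan.pdf"],
    },
    columns=COLUMNS,
)
```

Comparing `restored.to_dict("records")` with `sample.to_dict("records")` checks values and row order at once. Empty description strings would be read back as missing values by default, so a real test would need to decide how to treat them.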

### Property 7: Multi-Directory Combination

*For any* list of DataFrames from different directories, combining them SHALL produce a DataFrame where row count equals the sum of input row counts.

**Validates: Requirements 4.1**
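
A minimal sketch of a combination step that satisfies this property, assuming plain concatenation with no de-duplication (removing duplicates would break the row-count equality). The empty-input behavior is also an assumption.

```python
from typing import List

import pandas as pd

COLUMNS = ["date", "description", "amount", "source_file"]


def combine_datasets(dataframes: List[pd.DataFrame]) -> pd.DataFrame:
    """Stack per-directory DataFrames into one, keeping every row."""
    if not dataframes:
        # pd.concat raises ValueError on an empty list, so return an
        # empty frame with the expected columns instead.
        return pd.DataFrame(columns=COLUMNS)
    # ignore_index=True renumbers rows 0..n-1 across all inputs.
    return pd.concat(dataframes, ignore_index=True)
```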

### Property 8: Multi-Page Extraction Completeness

*For any* PDF statement with transactions across multiple pages, the extractor SHALL return all transactions with none lost.

**Validates: Requirements 5.1, 5.4**

### Property 9: Pydantic Validation Round-Trip

*For any* valid StatementExtraction model, serializing to JSON and deserializing back SHALL produce an equivalent model.

**Validates: Requirements 6.3**

## Error Handling

| Error Type | Handling Strategy |
|------------|-------------------|
| File not found | Log error, skip file, continue |
| Corrupted PDF | Log error, skip file, continue |
| Invalid date format | Log warning, skip transaction |
| Invalid amount format | Log warning, skip transaction |
| Output directory not writable | Raise exception |
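
The "log error, skip file, continue" strategy for file-level failures can be sketched as a loop that records each failure and moves on. This is an illustration under assumptions: `process_files` and its `extract` callback are hypothetical names, the default field values on `ProcessingResult` are added for convenience, and catching `Exception` broadly stands in for whatever specific errors the real extractor raises.

```python
import logging
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable, Iterable, List

logger = logging.getLogger("statement_processor")


@dataclass
class ProcessingResult:
    total_files: int = 0
    successful_files: int = 0
    failed_files: int = 0
    total_transactions: int = 0
    errors: List[str] = field(default_factory=list)


def process_files(
    pdf_paths: Iterable[Path],
    extract: Callable[[Path], list],
) -> ProcessingResult:
    """Apply `extract` to each path; log and skip files that fail."""
    result = ProcessingResult()
    for path in pdf_paths:
        result.total_files += 1
        try:
            transactions = extract(path)
        except Exception as exc:  # sketch: treat any failure as a bad file
            logger.error("Skipping %s: %s", path, exc)
            result.failed_files += 1
            result.errors.append(f"{path}: {exc}")
            continue
        result.successful_files += 1
        result.total_transactions += len(transactions)
    return result
```

An unwritable output directory is deliberately not handled here, since the table above calls for that error to propagate as an exception.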

## Testing Strategy

### Property-Based Testing

Use **Hypothesis**, configured to run at least 100 examples per property test (its `max_examples` setting at 100 or higher).

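### Test Annotations

Annotate each property-based test with a comment in the following format, which links the test to the feature and the property it verifies:
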
```python
# **Feature: credit-card-statement-processor, Property {number}: {property_text}**
```
