# Design Document

## Overview

The Vendor Analysis feature clusters transactions by vendor using the transaction-pattern-analysis framework. It instantiates a StrategyRunner configured with vendor-matching strategies to group transactions by merchant. The baseline strategy uses exact string matching; additional strategies can be registered to handle variations in vendor names that exact matching misses.

This feature depends on: `transaction-pattern-analysis`

## Architecture

```mermaid
flowchart TD
    subgraph "transaction-pattern-analysis framework"
        SR[StrategyRunner]
        SR --> S1[ExactMatchStrategy]
        SR --> S2[Future: FuzzyMatchStrategy]
        SR --> S3[Future: EmbeddingStrategy]
    end
    
    A[Transaction DataFrame] --> VC[VendorClusterer]
    VC --> SR
    SR --> CR[CombinedResult per transaction pair]
    CR --> CL[Cluster Builder]
    CL --> VS[Vendor Summary DataFrame]
    VS --> CSV[vendor_amounts.csv]
```

### Component Flow

1. **Input**: pandas DataFrame with columns: date, description, amount
2. **VendorClusterer**: Orchestrates clustering using StrategyRunner
3. **StrategyRunner**: Applies registered vendor-matching strategies (from transaction-pattern-analysis)
4. **Cluster Builder**: Groups transactions based on strategy results
5. **Vendor Summary**: Calculates aggregates per cluster
6. **Export**: Saves to vendor_amounts.csv
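
The flow above can be sketched end to end with plain pandas. This is a minimal illustration, not the framework's implementation: it assumes only exact matching is registered, so a `groupby` on `description` stands in for VendorClusterer, StrategyRunner, and the Cluster Builder. The sample rows are made up.

```python
import pandas as pd

# Toy input following the documented schema: date, description, amount.
transactions = pd.DataFrame(
    {
        "date": ["2024-01-05", "2024-01-12", "2024-02-01"],
        "description": ["ACME COFFEE", "ACME COFFEE", "CITY PARKING"],
        "amount": [4.50, 5.25, 12.00],
    }
)

# Stand-in for steps 2-4: with exact matching only, a cluster is the set
# of rows that share one description.
grouped = transactions.groupby("description", sort=True)

# Step 5: per-cluster aggregates. ISO dates (YYYY-MM-DD) order correctly
# as plain strings, so min/max work without parsing.
summary = (
    grouped.agg(
        transaction_count=("amount", "size"),
        total_amount=("amount", "sum"),
        earliest_date=("date", "min"),
        latest_date=("date", "max"),
    )
    .reset_index()
    .rename(columns={"description": "vendor_name"})
)

print(summary)
```

Once fuzzier strategies are registered, the `groupby` shortcut no longer applies, because cluster membership then depends on pairwise strategy results rather than on a single column value.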

## Components and Interfaces

### ExactMatchStrategy

```python
class ExactMatchStrategy(DetectionStrategy):
    """Baseline strategy that groups transactions with identical descriptions.
    
    This is the default strategy registered with VendorClusterer.
    Returns probability 1.0 for exact matches, None otherwise.
    """
    
    @property
    def name(self) -> str:
        return "exact_match"
        
    def detect(self, transactions: pd.DataFrame) -> Optional[StrategyResult]:
        """Group transactions by exact description match.
        
        Args:
            transactions: DataFrame with description column
            
        Returns:
            StrategyResult with probability 1.0 and matching_indices for
            transactions with the same description as the first row,
            or None if no other transaction shares that description.
        """
```
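
A possible body for `detect` is sketched below as a free function. The `StrategyResult` dataclass here is a hypothetical stand-in, since the framework's real type is not shown in this document, and the sketch assumes that a first row with no duplicate counts as "no match".

```python
from dataclasses import dataclass
from typing import List, Optional

import pandas as pd


@dataclass
class StrategyResult:
    """Hypothetical stand-in for the framework's result type."""

    probability: float
    matching_indices: List[int]


def detect_exact_match(transactions: pd.DataFrame) -> Optional[StrategyResult]:
    """Match rows whose description equals the first row's description."""
    if transactions.empty:
        return None
    first = transactions["description"].iloc[0]
    mask = (transactions["description"] == first).to_numpy()
    indices = transactions.index[mask].tolist()
    # Assumption: a lone row matches only itself, which is reported as None.
    if len(indices) < 2:
        return None
    return StrategyResult(probability=1.0, matching_indices=indices)
```

The returned indices are DataFrame index labels, not positions, so they stay valid if the caller passes a filtered or re-indexed frame.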

### VendorClusterer

```python
class VendorClusterer:
    """Clusters transactions by vendor using the StrategyRunner framework."""
    
    def __init__(self, threshold: float = 0.7):
        """Initialize with clustering threshold.
        
        Args:
            threshold: Probability threshold for grouping (default 0.7)
        """
        self._runner = StrategyRunner()
        self._runner.register_strategy(ExactMatchStrategy(), weight=1.0)
        self._threshold = threshold
        
    def register_strategy(self, strategy: DetectionStrategy, weight: float = 1.0) -> None:
        """Register an additional vendor-matching strategy.
        
        Args:
            strategy: Strategy conforming to DetectionStrategy interface
            weight: Weight for this strategy in combination
        """
        
    def cluster(self, df: pd.DataFrame) -> List[VendorCluster]:
        """Cluster transactions by vendor.
        
        Args:
            df: DataFrame with columns: date, description, amount
            
        Returns:
            List of VendorCluster objects
        """
        
    def summarize(self, df: pd.DataFrame) -> pd.DataFrame:
        """Cluster transactions and return summary DataFrame.
        
        Args:
            df: DataFrame with columns: date, description, amount
            
        Returns:
            DataFrame with columns: vendor_name, transaction_count, 
            total_amount, earliest_date, latest_date
        """
        
    def export(self, summary_df: pd.DataFrame, output_path: Path) -> None:
        """Export vendor summary to CSV file."""
```
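
This document does not specify how the threshold turns pairwise results into clusters. One plausible approach, shown here purely as an assumption, is to merge any two transactions whose combined probability is at least the threshold and to treat merging as transitive, using a union-find structure. The function name `build_clusters` and the pair-to-probability mapping are hypothetical.

```python
from typing import Dict, List, Tuple


def build_clusters(
    n: int,
    pair_probabilities: Dict[Tuple[int, int], float],
    threshold: float = 0.7,
) -> List[List[int]]:
    """Merge items whose pairwise probability meets the threshold."""
    parent = list(range(n))

    def find(i: int) -> int:
        # Walk to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), probability in pair_probabilities.items():
        if probability >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the smaller index as the root for stable output.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters: Dict[int, List[int]] = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


pairs = {(0, 1): 1.0, (1, 2): 0.4, (2, 3): 0.9}
print(build_clusters(4, pairs))  # prints [[0, 1], [2, 3]]
```

Transitive merging means A and C can share a cluster without ever matching directly, which may or may not be the intended behavior once fuzzy strategies are added.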

### VendorCluster

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VendorCluster:
    """A group of transactions identified as belonging to the same vendor."""
    vendor_name: str  # Representative name for the cluster
    transaction_indices: List[int]  # DataFrame indices of transactions in cluster
    transaction_count: int
    total_amount: float
    earliest_date: str
    latest_date: str
```
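
As an illustration of how the aggregate fields relate to `transaction_indices`, the sketch below derives a `VendorCluster` from a DataFrame and a list of index labels. The helper `make_cluster` is hypothetical, and choosing the first member's description as `vendor_name` is an assumption; the document does not say how the representative name is picked.

```python
from dataclasses import dataclass
from typing import List

import pandas as pd


@dataclass
class VendorCluster:
    vendor_name: str
    transaction_indices: List[int]
    transaction_count: int
    total_amount: float
    earliest_date: str
    latest_date: str


def make_cluster(df: pd.DataFrame, indices: List[int]) -> VendorCluster:
    """Derive the aggregate fields from the rows named by `indices`."""
    rows = df.loc[indices]
    return VendorCluster(
        # Assumption: the first member's description names the cluster.
        vendor_name=str(rows["description"].iloc[0]),
        transaction_indices=list(indices),
        transaction_count=len(rows),
        total_amount=float(rows["amount"].sum()),
        # ISO date strings compare correctly without parsing.
        earliest_date=str(rows["date"].min()),
        latest_date=str(rows["date"].max()),
    )
```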

## Data Models

### Input DataFrame Schema

| Column | Type | Description |
|--------|------|-------------|
| date | string | Transaction date in YYYY-MM-DD format |
| description | string | Merchant or transaction description |
| amount | float | Transaction amount |

### Output DataFrame Schema (Vendor Summary)

| Column | Type | Description |
|--------|------|-------------|
| vendor_name | string | Representative name for the vendor cluster |
| transaction_count | int | Number of transactions in cluster |
| total_amount | float | Sum of all transaction amounts |
| earliest_date | string | First transaction date (YYYY-MM-DD) |
| latest_date | string | Most recent transaction date (YYYY-MM-DD) |

## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system.*

### Property 1: Cluster Summary Correctness

*For any* vendor cluster, the summary statistics SHALL be correct:
- transaction_count equals the count of transactions in the cluster
- total_amount equals the sum of amounts for transactions in the cluster
- earliest_date equals the minimum date for transactions in the cluster
- latest_date equals the maximum date for transactions in the cluster

**Validates: Requirements 2.1, 2.2, 2.3**

### Property 2: ExactMatchStrategy Correctness

*For any* set of transactions, the ExactMatchStrategy SHALL group transactions with identical description fields together, returning matching_indices for all transactions sharing the same description.

**Validates: Requirements 4.2**

### Property 3: CSV Round-Trip Preserves Data

*For any* vendor summary DataFrame, including one whose vendor names contain non-ASCII characters, exporting to a UTF-8 CSV file and importing it back SHALL produce an equivalent DataFrame with all values preserved.

**Validates: Requirements 3.3**
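
A single hand-picked instance of this round trip might look like the sketch below. The sample rows are invented, and the `read_csv` options are choices made for this sketch rather than settings taken from the feature: `keep_default_na=False` stops a vendor literally named "NA" from being read back as missing, and `float_precision="round_trip"` asks the parser to restore floats exactly.

```python
import tempfile
from pathlib import Path

import pandas as pd

summary = pd.DataFrame(
    {
        "vendor_name": ["Café Müller", "東京ストア", 'Quote "Shop", Inc.'],
        "transaction_count": [2, 1, 3],
        "total_amount": [12.5, 7.25, 100.0],
        "earliest_date": ["2024-01-05", "2024-02-01", "2024-03-10"],
        "latest_date": ["2024-01-12", "2024-02-01", "2024-03-30"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "vendor_amounts.csv"
    # index=False keeps the row index out of the file.
    summary.to_csv(path, index=False, encoding="utf-8")
    restored = pd.read_csv(
        path,
        encoding="utf-8",
        keep_default_na=False,
        float_precision="round_trip",
    )

# Raises AssertionError if any value, dtype, or column differs.
pd.testing.assert_frame_equal(summary, restored)
```

The third vendor name exercises CSV quoting (embedded comma and double quotes) in addition to non-ASCII text.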

## Future Strategies

The following strategies can be added to improve vendor clustering:

- **FuzzyMatchStrategy**: Use Levenshtein distance to match similar vendor names
- **EmbeddingStrategy**: Use vector embeddings to cluster semantically similar merchants
- **VendorDatabaseStrategy**: Cross-reference against known vendor name variations
- **LLMStrategy**: Use LLM to evaluate if two descriptions refer to the same vendor
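
To give a feel for what a fuzzy strategy would compute, the sketch below scores two descriptions with the standard library's `difflib.SequenceMatcher`. Note that this is a swapped-in technique: its ratio is based on matching subsequences, not Levenshtein distance, and the `name_similarity` helper and the normalization step are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] after case and whitespace normalization."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


# Variants of one merchant should score higher than unrelated merchants.
print(name_similarity("AMAZON MKTPLACE", "Amazon Marketplace"))
print(name_similarity("AMAZON MKTPLACE", "CITY PARKING"))
```

A strategy built on such a score could return it as the match probability, leaving the grouping decision to the clusterer's threshold.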

## Error Handling

| Error Type | Handling Strategy |
|------------|-------------------|
| Empty DataFrame input | Return an empty summary DataFrame that still has the Vendor Summary columns |
| Missing required columns | Raise ValueError with clear message |
| Strategy exception | Log error, exclude from combination (handled by framework) |
| Output directory not writable | Raise exception with clear message |
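
The first two rows of the table could be handled along these lines. The helper names `validate_input` and `empty_summary` and the exact error wording are hypothetical; only the column names come from the schemas in this document.

```python
import pandas as pd

REQUIRED_COLUMNS = ["date", "description", "amount"]
SUMMARY_COLUMNS = [
    "vendor_name",
    "transaction_count",
    "total_amount",
    "earliest_date",
    "latest_date",
]


def validate_input(df: pd.DataFrame) -> None:
    """Raise ValueError naming every required column that is absent."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {', '.join(missing)}")


def empty_summary() -> pd.DataFrame:
    """Return a zero-row summary that still carries the output columns."""
    return pd.DataFrame(columns=SUMMARY_COLUMNS)
```

Listing all missing columns in one message saves the caller from fixing them one failed run at a time.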

## Testing Strategy

### Property-Based Testing

Use **Hypothesis** with a minimum of 100 generated examples per property test (`max_examples=100` or higher in the test's `settings`).

### Test Annotations

```python
# **Feature: vendor-analysis, Property {number}: {property_text}**
```
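
An annotated test for Property 1 might look like the following sketch. `summarize_cluster` is a hypothetical pure-Python stand-in for the real aggregation so the example stays self-contained, and the generated date and amount ranges are arbitrary choices.

```python
import math
from datetime import date

from hypothesis import given, settings, strategies as st


def summarize_cluster(rows):
    """Hypothetical aggregate for one cluster of (date, amount) rows."""
    dates = [d for d, _ in rows]
    return {
        "transaction_count": len(rows),
        "total_amount": sum(a for _, a in rows),
        "earliest_date": min(dates),
        "latest_date": max(dates),
    }


# One generated row: an ISO date string and a finite amount.
row = st.tuples(
    st.dates(min_value=date(2000, 1, 1), max_value=date(2099, 12, 31)).map(
        date.isoformat
    ),
    st.floats(min_value=-1e6, max_value=1e6, allow_nan=False),
)


# **Feature: vendor-analysis, Property 1: Cluster Summary Correctness**
@settings(max_examples=100)
@given(st.lists(row, min_size=1, max_size=30))
def test_cluster_summary_correctness(rows):
    summary = summarize_cluster(rows)
    dates = [d for d, _ in rows]
    assert summary["transaction_count"] == len(rows)
    # fsum gives an accurately rounded reference total for comparison.
    expected_total = math.fsum(a for _, a in rows)
    assert math.isclose(summary["total_amount"], expected_total, abs_tol=1e-6)
    assert summary["earliest_date"] in dates
    assert summary["latest_date"] in dates
    assert all(summary["earliest_date"] <= d <= summary["latest_date"] for d in dates)
```

Calling `test_cluster_summary_correctness()` directly, or collecting it with pytest, runs the property against the generated examples.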
