# Design Document

## Overview

The Clustering Framework provides a unified approach to transaction analysis where all problems are modeled as clustering. Vendor grouping, recurrence detection, and anomaly detection are all implemented as clustering strategies that group transactions based on different features.

This spec supersedes transaction-pattern-analysis with a simpler, more unified model.

## Architecture

```mermaid
flowchart TD
    subgraph Setup
        R1[register_strategy] --> REG[Strategy Registry]
        R2[register_strategy] --> REG
        R3[register_strategy] --> REG
    end
    
    subgraph Execution
        A[Transaction DataFrame] --> B[ClusterRunner.run]
        B --> REG
        REG --> D1[Strategy 1.cluster]
        REG --> D2[Strategy 2.cluster]
        REG --> D3[Strategy N.cluster]
        D1 --> E1[List of Clusters]
        D2 --> E2[List of Clusters]
        D3 --> E3[List of Clusters]
    end
    
    subgraph Combination
        E1 --> F[Merge Overlapping Clusters]
        E2 --> F
        E3 --> F
        F --> G[Final Cluster List]
    end
```

### Component Flow

1. **Setup**: Register strategies with weights via `register_strategy(strategy, weight)`
2. **Execution**: Call `run(transactions)` which iterates through all registered strategies
3. **Strategy Execution**: Each strategy's `cluster()` method returns a list of TransactionClusters
4. **Combination**: Overlapping clusters are merged, and their membership scores are combined using the strategy weights
5. **Output**: Final list of TransactionClusters

## Components and Interfaces

### TransactionCluster

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class TransactionCluster:
    """A group of transactions identified as belonging together.
    
    Each transaction has a membership score indicating confidence it belongs
    to this cluster. This allows fuzzy membership where some transactions
    are strong matches (1.0) and others are weaker matches (0.7).
    
    Attributes:
        memberships: Map of transaction index → confidence (0.0 to 1.0).
        label: Type of cluster (e.g., "vendor", "recurring_monthly").
        metadata: Strategy-specific details about the cluster.
    """
    memberships: Dict[int, float]  # {transaction_idx: confidence}
    label: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    
    def __post_init__(self) -> None:
        """Validate all membership scores are within bounds."""
        for idx, conf in self.memberships.items():
            if not 0.0 <= conf <= 1.0:
                raise ValueError(f"membership score must be between 0.0 and 1.0, got {conf} for index {idx}")
    
    @property
    def indices(self) -> List[int]:
        """Return all transaction indices in this cluster."""
        return list(self.memberships.keys())
    
    def indices_above_threshold(self, threshold: float) -> List[int]:
        """Return indices with membership >= threshold."""
        return [idx for idx, conf in self.memberships.items() if conf >= threshold]
```

### ClusteringStrategy (Abstract Base)

```python
from abc import ABC, abstractmethod
from typing import List

import pandas as pd

class ClusteringStrategy(ABC):
    """Abstract base class for transaction clustering strategies."""
    
    @property
    @abstractmethod
    def name(self) -> str:
        """Return the strategy name."""
        
    @abstractmethod
    def cluster(self, transactions: pd.DataFrame) -> List[TransactionCluster]:
        """Group transactions into clusters.
        
        Args:
            transactions: DataFrame with columns: date, description, amount
            
        Returns:
            List of TransactionCluster objects. Empty list if no clusters found.
        """
```

### ClusterRunner

```python
from typing import List, Tuple

import pandas as pd

class ClusterRunner:
    """Orchestrates clustering strategies and combines results."""
    
    def __init__(self):
        """Initialize the cluster runner."""
        self._strategies: List[Tuple[ClusteringStrategy, float]] = []
        
    def register_strategy(self, strategy: ClusteringStrategy, weight: float = 1.0) -> None:
        """Register a clustering strategy with optional weight.
        
        Args:
            strategy: Strategy instance conforming to ClusteringStrategy interface
            weight: Weight for this strategy when combining overlapping clusters (default 1.0)
            
        Raises:
            ValueError: If weight is negative
        """
        
    def run(self, transactions: pd.DataFrame, min_confidence: float = 0.0) -> List[TransactionCluster]:
        """Apply all strategies and return combined clusters.
        
        Args:
            transactions: DataFrame of transactions
            min_confidence: Minimum confidence threshold for returned clusters
            
        Returns:
            List of TransactionCluster objects, filtered by min_confidence
        """
        
    def run_by_label(self, transactions: pd.DataFrame, label: str) -> List[TransactionCluster]:
        """Return only clusters matching the specified label.
        
        Args:
            transactions: DataFrame of transactions
            label: Cluster label to filter by
            
        Returns:
            List of TransactionCluster objects with matching label
        """
```
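
The registration contract above (default weight of 1.0, `ValueError` on a negative weight) can be sketched as follows. This is a minimal illustration, not the actual implementation: `RegistrySketch` is a hypothetical stand-in for `ClusterRunner`, and plain strings stand in for strategy instances to keep the snippet dependency-free.

```python
from typing import Any, List, Tuple


class RegistrySketch:
    """Hypothetical stand-in for ClusterRunner's registration logic."""

    def __init__(self) -> None:
        # Each entry pairs a strategy with the weight used when merging.
        self._strategies: List[Tuple[Any, float]] = []

    def register_strategy(self, strategy: Any, weight: float = 1.0) -> None:
        # Reject negative weights at registration time, as the docstring requires.
        if weight < 0:
            raise ValueError(f"weight must be non-negative, got {weight}")
        self._strategies.append((strategy, weight))


runner = RegistrySketch()
runner.register_strategy("exact_match")              # placeholder object, default weight
runner.register_strategy("fuzzy_match", weight=0.5)  # placeholder object, explicit weight
print(runner._strategies)  # [('exact_match', 1.0), ('fuzzy_match', 0.5)]

try:
    runner.register_strategy("bad", weight=-1.0)
except ValueError as exc:
    print(exc)  # weight must be non-negative, got -1.0
```

Validating at registration rather than at run time means a misconfigured weight fails before any strategy executes.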

## Data Models

### Input DataFrame Schema

| Column | Type | Description |
|--------|------|-------------|
| date | string | Transaction date in YYYY-MM-DD format |
| description | string | Merchant or transaction description |
| amount | float | Transaction amount |

### TransactionCluster

| Field | Type | Description |
|-------|------|-------------|
| memberships | Dict[int, float] | Map of transaction index → membership confidence (0.0 to 1.0) |
| label | str | Cluster type identifier |
| metadata | Dict | Strategy-specific details |

Example:
```python
# Walmart cluster with fuzzy matching
TransactionCluster(
    memberships={
        0: 1.0,   # "Walmart.com" - exact match
        1: 1.0,   # "Walmart.com" - exact match  
        2: 0.85,  # "Walmart ATL GA" - fuzzy match on "Walmart"
        5: 0.12,  # "Target" - low similarity, probably doesn't belong
    },
    label="vendor",
    metadata={"vendor_name": "Walmart", "match_strategy": "fuzzy"}
)
```
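
As a brief sketch of how a consumer might separate strong members from weak ones in the cluster above, the snippet below applies `indices_above_threshold`. The class is a condensed copy of the earlier definition (validation omitted) so the snippet runs on its own, and the 0.5 cutoff is an assumption chosen for illustration, not a value fixed by this spec.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TransactionCluster:
    """Condensed copy of the class defined earlier (validation omitted)."""
    memberships: Dict[int, float]
    label: str
    metadata: Dict[str, Any] = field(default_factory=dict)

    def indices_above_threshold(self, threshold: float) -> List[int]:
        return [idx for idx, conf in self.memberships.items() if conf >= threshold]


walmart = TransactionCluster(
    memberships={0: 1.0, 1: 1.0, 2: 0.85, 5: 0.12},
    label="vendor",
    metadata={"vendor_name": "Walmart", "match_strategy": "fuzzy"},
)

# Illustrative cutoff only: 0.5 is an assumption, not part of the spec.
strong = walmart.indices_above_threshold(0.5)
weak = [idx for idx in walmart.memberships if idx not in strong]
print(strong)  # [0, 1, 2]
print(weak)    # [5]
```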

### Common Labels

| Label | Description | Example Strategies |
|-------|-------------|-------------------|
| vendor | Transactions from same merchant | ExactMatchStrategy, FuzzyMatchStrategy |
| recurring_monthly | Monthly recurring pattern | SameAmountPeriodicallyStrategy |
| recurring_yearly | Yearly recurring pattern | SameAmountPeriodicallyStrategy |
| similar_day | Same day-of-month pattern | SimilarDayOfMonthStrategy |



## Correctness Properties

*A property is a characteristic or behavior that should hold true across all valid executions of a system: essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.*

### Property 1: All Strategies Applied

*For any* set of registered strategies and input transactions, the ClusterRunner SHALL call the cluster method on every registered strategy.

**Validates: Requirements 1.1**

### Property 2: Cluster Structure Validity

*For any* TransactionCluster returned by the framework:
- memberships SHALL be a non-empty dict mapping valid DataFrame indices to confidence scores
- all membership confidence scores SHALL be between 0.0 and 1.0 inclusive
- label SHALL be a non-empty string
- metadata SHALL be a dictionary (may be empty)

**Validates: Requirements 1.4, 2.1, 3.1, 3.2**

### Property 3: Weighted Membership Combination

*For any* transaction that appears in overlapping clusters from different strategies, the combined membership score SHALL equal the weighted average: sum(membership_i * weight_i) / sum(weight_i).

**Validates: Requirements 2.2**
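
The weighted-average formula in Property 3 can be written as a small helper. This is a sketch under stated assumptions: `combine_memberships` is a hypothetical name, and returning 0.0 when all weights are zero is an assumption, since the spec permits zero weights but does not define that case.

```python
from typing import Iterable, Tuple


def combine_memberships(scored: Iterable[Tuple[float, float]]) -> float:
    """Weighted average of (membership, weight) pairs."""
    pairs = list(scored)
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        # Assumption: zero total weight is treated as "no evidence".
        return 0.0
    return sum(membership * weight for membership, weight in pairs) / total_weight


# Transaction seen by a weight-1.0 strategy (score 1.0) and a weight-0.5 strategy (score 0.8).
combined = combine_memberships([(1.0, 1.0), (0.8, 0.5)])
print(round(combined, 2))  # 0.93
```

Because weights are non-negative and each membership lies in [0.0, 1.0], the combined score stays in the same range, so merged clusters still pass `TransactionCluster` validation.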

### Property 4: Membership Threshold Filtering

*For any* cluster and membership threshold, `indices_above_threshold(threshold)` SHALL return only indices with membership >= threshold.

**Validates: Requirements 2.3**

### Property 5: Label Filtering

*For any* set of clusters and label filter, the filtered result SHALL contain only clusters with matching label.

**Validates: Requirements 3.3**

### Property 6: Strategy Error Isolation

*For any* strategy that throws an exception during clustering, the ClusterRunner SHALL catch the error, log it, and continue processing other strategies without crashing.

**Validates: Requirements 4.4**
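
One way the isolation described in Property 6 might look is a `try`/`except` around each strategy call. The sketch below is illustrative only: `run_isolated`, the logger name, and the use of plain functions in place of `ClusteringStrategy` instances are all assumptions.

```python
import logging
from typing import Any, Callable, List, Tuple

logger = logging.getLogger("cluster_runner")  # hypothetical logger name


def run_isolated(
    strategies: List[Tuple[str, Callable[[Any], list]]],
    transactions: Any,
) -> List[Tuple[str, list]]:
    """Call each strategy; log and skip the ones that raise."""
    results = []
    for name, cluster_fn in strategies:
        try:
            results.append((name, cluster_fn(transactions)))
        except Exception:
            # logger.exception logs at ERROR level and includes the traceback.
            logger.exception("Strategy %r failed; skipping", name)
    return results


def broken(_transactions: Any) -> list:
    raise RuntimeError("boom")


def working(_transactions: Any) -> list:
    return ["cluster-from-working"]


outcome = run_isolated([("broken", broken), ("working", working)], transactions=[])
print(outcome)  # [('working', ['cluster-from-working'])]
```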

### Property 7: Default Weight

*For any* strategy registered without an explicit weight, the ClusterRunner SHALL use a default weight of 1.0.

**Validates: Requirements 5.1**

## Strategy Examples

### ExactMatchStrategy (Vendor Clustering)

```python
class ExactMatchStrategy(ClusteringStrategy):
    """Groups transactions with identical descriptions."""
    
    @property
    def name(self) -> str:
        return "exact_match"
        
    def cluster(self, transactions: pd.DataFrame) -> List[TransactionCluster]:
        clusters = []
        for description, group in transactions.groupby("description"):
            # All exact matches get 1.0 membership
            memberships = {idx: 1.0 for idx in group.index}
            clusters.append(TransactionCluster(
                memberships=memberships,
                label="vendor",
                metadata={"vendor_name": description}
            ))
        return clusters
```

### SameAmountPeriodicallyStrategy (Recurrence Clustering)

```python
class SameAmountPeriodicallyStrategy(ClusteringStrategy):
    """Groups transactions with same amount at regular intervals."""
    
    @property
    def name(self) -> str:
        return "same_amount_periodically"
        
    def cluster(self, transactions: pd.DataFrame) -> List[TransactionCluster]:
        # Group by vendor first, then check for periodic same-amount patterns
        # Returns clusters labeled "recurring_monthly" or "recurring_yearly"
        raise NotImplementedError("periodic-pattern detection is left to the implementation")
```
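
The periodicity check itself is not specified here. As one possible approach, the sketch below classifies the gaps between a vendor's same-amount transaction dates. Everything in it is an assumption made for illustration: the helper name `classify_period`, the nominal periods of 30 and 365 days, the 3-day tolerance, and the minimum of three occurrences.

```python
from datetime import date
from typing import List, Optional


def classify_period(dates: List[date], tolerance_days: int = 3) -> Optional[str]:
    """Return a recurrence label if the gaps between dates are regular, else None."""
    ordered = sorted(dates)
    if len(ordered) < 3:
        return None  # Assumption: require at least three occurrences.
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    # Assumed nominal periods in days; real calendars vary slightly around them.
    for label, expected in (("recurring_monthly", 30), ("recurring_yearly", 365)):
        if all(abs(gap - expected) <= tolerance_days for gap in gaps):
            return label
    return None


monthly = [date(2024, 1, 15), date(2024, 2, 15), date(2024, 3, 15), date(2024, 4, 15)]
yearly = [date(2022, 6, 1), date(2023, 6, 1), date(2024, 6, 1)]
irregular = [date(2024, 1, 1), date(2024, 1, 5), date(2024, 3, 20)]

print(classify_period(monthly))    # recurring_monthly
print(classify_period(yearly))     # recurring_yearly
print(classify_period(irregular))  # None
```

A strategy built on such a helper would call it once per (vendor, amount) group and emit a `TransactionCluster` with the returned label whenever it is not `None`.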

## Cluster Merging Strategy

When combining clusters returned by multiple strategies, the runner handles three cases:

1. **Same label, overlapping memberships**: Merge into single cluster, combine membership scores via weighted average
2. **Different labels**: Keep as separate clusters (vendor cluster vs recurrence cluster are different)
3. **No overlap**: Clusters remain independent

```python
def _merge_clusters_by_label(
    self, 
    all_clusters: List[Tuple[TransactionCluster, float]]  # (cluster, weight)
) -> List[TransactionCluster]:
    """Merge clusters with same label and overlapping memberships."""
    # Group clusters by label
    # For each label group, merge overlapping memberships
    # Combined membership = weighted average of individual memberships
    # Preserve metadata from highest-weight strategy
```

Example merging:
```python
# Strategy A (weight=1.0) returns:
#   Cluster(memberships={0: 1.0, 1: 1.0}, label="vendor")
# Strategy B (weight=0.5) returns:
#   Cluster(memberships={0: 0.8, 2: 0.9}, label="vendor")

# Merged result:
#   Cluster(memberships={
#       0: (1.0*1.0 + 0.8*0.5) / (1.0 + 0.5) = 0.93,  # weighted avg
#       1: 1.0,  # only in A
#       2: 0.9,  # only in B
#   }, label="vendor")
```
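
A minimal sketch of the merge step, reproducing the example above. It makes several assumptions not fixed by this spec: clusters "overlap" when they share at least one transaction index (applied transitively), an index whose contributing weights sum to zero gets a score of 0.0, and the function is written standalone rather than as a `ClusterRunner` method. `TransactionCluster` is a condensed copy of the earlier definition.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class TransactionCluster:
    """Condensed copy of the class defined earlier (validation omitted)."""
    memberships: Dict[int, float]
    label: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def merge_clusters_by_label(
    all_clusters: List[Tuple[TransactionCluster, float]],
) -> List[TransactionCluster]:
    """Merge same-label clusters that share at least one transaction index."""
    # Each group collects the (cluster, weight) pairs that become one merged cluster.
    groups: List[List[Tuple[TransactionCluster, float]]] = []
    for cluster, weight in all_clusters:
        indices = set(cluster.memberships)
        touching = [
            g for g in groups
            if g[0][0].label == cluster.label
            and any(indices & set(c.memberships) for c, _ in g)
        ]
        merged_group = [(cluster, weight)]
        for g in touching:
            merged_group.extend(g)
        groups = [g for g in groups if not any(g is t for t in touching)]
        groups.append(merged_group)

    merged: List[TransactionCluster] = []
    for group in groups:
        weighted_sum: Dict[int, float] = {}
        weight_total: Dict[int, float] = {}
        for cluster, weight in group:
            for idx, score in cluster.memberships.items():
                weighted_sum[idx] = weighted_sum.get(idx, 0.0) + score * weight
                weight_total[idx] = weight_total.get(idx, 0.0) + weight
        memberships = {
            # Assumption: zero total weight yields 0.0 rather than an error.
            idx: weighted_sum[idx] / weight_total[idx] if weight_total[idx] else 0.0
            for idx in sorted(weighted_sum)
        }
        # Metadata is preserved from the highest-weight contributor.
        top_cluster, _ = max(group, key=lambda pair: pair[1])
        merged.append(
            TransactionCluster(memberships, top_cluster.label, dict(top_cluster.metadata))
        )
    return merged


a = TransactionCluster({0: 1.0, 1: 1.0}, "vendor", {"source": "A"})
b = TransactionCluster({0: 0.8, 2: 0.9}, "vendor", {"source": "B"})
c = TransactionCluster({0: 1.0}, "recurring_monthly")  # different label: kept separate

merged = merge_clusters_by_label([(a, 1.0), (b, 0.5), (c, 1.0)])
vendor = next(cl for cl in merged if cl.label == "vendor")
print({idx: round(score, 2) for idx, score in vendor.memberships.items()})
# {0: 0.93, 1: 1.0, 2: 0.9}
```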

## Performance Considerations

For large transaction sets (1,000+ transactions), individual strategies may need more efficient algorithms:

- **NumPy/SciPy**: Vectorized similarity computations, sparse matrices for pairwise distances
- **scikit-learn**: DBSCAN, hierarchical clustering, or other clustering algorithms
- **Approximate methods**: Locality-sensitive hashing for fuzzy string matching at scale

The framework itself (`ClusterRunner`) stays simple: it only orchestrates strategies and merges results. Heavy computation happens inside individual strategy implementations.

For the initial implementation, we'll use simple Python loops. Strategies can be optimized later as needed.

## Error Handling

| Error Type | Handling Strategy |
|------------|-------------------|
| Strategy throws exception | Log error, skip strategy, continue with others |
| Empty transactions DataFrame | Return empty list |
| No strategies registered | Return empty list |
| Negative weight | Raise ValueError on registration |
| Invalid confidence in cluster | Raise ValueError in `TransactionCluster.__post_init__` |

## Testing Strategy

### Property-Based Testing

Use **Hypothesis** with a minimum of 100 iterations per property test.

### Test Approach

- Create mock strategies with deterministic outputs for testing
- Test weighted combination formula with various weight configurations
- Test cluster merging with overlapping and non-overlapping clusters
- Test filtering by confidence and label
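
A mock strategy with deterministic output might look like the sketch below, used here to check Property 1 (every registered strategy is called). `MockStrategy` and `run_all` are hypothetical names; `run_all` is a simplified stand-in for `ClusterRunner.run` without merging, and `name` is a plain attribute rather than the abstract property, for brevity.

```python
from typing import Any, List


class MockStrategy:
    """Test double: returns preset clusters and counts how often it is called."""

    def __init__(self, name: str, clusters: List[Any]) -> None:
        self.name = name
        self._clusters = clusters
        self.calls = 0

    def cluster(self, transactions: Any) -> List[Any]:
        self.calls += 1
        return list(self._clusters)  # copy so callers cannot mutate the preset


def run_all(strategies: List[MockStrategy], transactions: Any) -> List[Any]:
    """Simplified stand-in for ClusterRunner.run (no merging, no filtering)."""
    results: List[Any] = []
    for strategy in strategies:
        results.extend(strategy.cluster(transactions))
    return results


mocks = [MockStrategy("a", ["c1"]), MockStrategy("b", []), MockStrategy("c", ["c2", "c3"])]
results = run_all(mocks, transactions=[])

# Property 1: each registered strategy was invoked exactly once.
assert all(mock.calls == 1 for mock in mocks)
print(results)  # ['c1', 'c2', 'c3']
```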

### Test Annotations

```python
# **Feature: clustering-framework, Property {number}: {property_text}**
```

