Metadata-Version: 2.4
Name: statement-processor
Version: 0.1.0
Summary: Extensible PDF financial statement processing framework
Project-URL: Homepage, https://www.bogdanvarlamov.com/projects/statement-processor
Project-URL: Repository, https://github.com/bogdanvarlamov/statement-processor
Project-URL: Issues, https://github.com/bogdanvarlamov/statement-processor/issues
Author-email: Bogdan Varlamov <1427744+bogdanvarlamov@users.noreply.github.com>
License: MIT
Keywords: finance,parser,pdf,statement,transaction
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Office/Business :: Financial
Requires-Python: >=3.13
Requires-Dist: docling>=2.64.0
Requires-Dist: huggingface-hub[hf-xet]>=0.20.0
Requires-Dist: pandas>=2.3.3
Requires-Dist: pydantic>=2.12.5
Requires-Dist: pyyaml>=6.0
Requires-Dist: qwen-vl-utils>=0.0.8
Requires-Dist: twine>=6.2.0
Provides-Extra: dev
Requires-Dist: hypothesis>=6.148.7; extra == 'dev'
Requires-Dist: pytest>=9.0.1; extra == 'dev'
Description-Content-Type: text/markdown

# Statement Processor

An extensible Python framework for extracting text from PDF financial statements, parsing it into structured transaction data, and clustering transactions by vendor.

## Overview

Statement Processor provides a plugin-based architecture where:
- **Core package** handles PDF extraction, clustering framework, and CLI
- **Parser plugins** add support for specific statement formats (banks, credit cards, etc.)
- **Strategy plugins** add custom clustering/analysis algorithms
- **Pattern files** define vendor matching rules via YAML

```
┌─────────────────────────────────────────────────────────────────┐
│                  statement-processor (core)                     │
├─────────────────────────────────────────────────────────────────┤
│  • BaseStatementParser, BaseTransaction (interfaces)            │
│  • ClusteringStrategy, ClusterRunner (analytics framework)      │
│  • Plugin registry (auto-discovers installed add-ons)           │
│  • CLI entry point                                              │
│  • Built-in: PNC CashBuilder parser                             │
└─────────────────────────────────────────────────────────────────┘
          ▲                    ▲                    ▲
          │                    │                    │
    ┌─────┴─────┐        ┌─────┴─────┐        ┌─────┴─────┐
    │  chase    │        │   amex    │        │  your     │
    │  add-on   │        │  add-on   │        │  add-on   │
    └───────────┘        └───────────┘        └───────────┘
```

## Installation

```bash
# Install core package
pip install statement-processor

# Install add-ons for your bank(s)
pip install statement-processor-chase
pip install statement-processor-amex
```

**Development Setup:**

```bash
# Prerequisites: Python 3.13+, uv package manager
git clone <repository-url>
cd statement-processor
uv sync
```

## Quick Start

```bash
# Process statements (auto-detects parser based on content)
statement-processor ./statements --output ./results

# Multiple directories
statement-processor ./2024_statements ./2025_statements -o ./results

# Enable debug output
statement-processor ./statements --debug ./debug_output

# Specify a parser explicitly
statement-processor ./statements --parser markdown_table

# Continue on validation errors
statement-processor ./statements --no-strict
```

## Architecture

### Core Interfaces

The framework defines abstract base classes that plugins implement:

```python
from abc import ABC

from statement_processor.core import BaseStatementParser, BaseTransaction

# Simplified view of the interfaces:
class BaseTransaction(ABC):
    """Minimal transaction interface."""
    @property
    def date(self) -> str: ...
    @property
    def description(self) -> str: ...
    @property
    def amount(self) -> float: ...

class BaseStatementParser(ABC):
    """Parser for a specific statement format."""
    @property
    def name(self) -> str: ...
    
    def can_parse(self, raw_text: str) -> bool:
        """Return True if this parser handles this format."""
    
    def parse(self, raw_text: str, source_file: str) -> list[BaseTransaction]:
        """Parse text into transactions."""
```

### Plugin Discovery

Plugins register via Python entry points. When installed, they're automatically discovered:

```python
from statement_processor.core import discover_parsers, auto_detect_parser

# List all available parsers
parsers = discover_parsers()

# Auto-detect parser for a document
parser = auto_detect_parser(raw_text)
if parser:
    transactions = parser.parse(raw_text, "statement.pdf")
```

### Clustering Framework

All transaction analysis uses a unified clustering interface:

```python
from abc import ABC

import pandas as pd

from statement_processor.analytics import (
    ClusteringStrategy,
    ClusterRunner,
    TransactionCluster,
)

# Simplified view of the interface:
class ClusteringStrategy(ABC):
    @property
    def name(self) -> str: ...
    
    def cluster(self, transactions: pd.DataFrame) -> list[TransactionCluster]: ...

# Run multiple strategies
runner = ClusterRunner(cascade=True)
runner.register_strategy(RegexVendorStrategy(), weight=1.0)
runner.register_strategy(ExactMatchStrategy(), weight=0.5)
clusters = runner.run(df)
```

## Creating a Parser Plugin

To add support for a new statement format, create a package with:

**1. Parser implementation:**

```python
# my_bank_parser/parser.py
from statement_processor.core import BaseStatementParser, BaseTransaction
from pydantic import BaseModel

class MyBankTransaction(BaseModel, BaseTransaction):
    date: str
    description: str
    amount: float
    # Add bank-specific fields
    category: str = ""

class MyBankParser(BaseStatementParser):
    @property
    def name(self) -> str:
        return "my_bank"
    
    def can_parse(self, raw_text: str) -> bool:
        return "My Bank" in raw_text and "Statement" in raw_text
    
    def parse(self, raw_text: str, source_file: str) -> list[BaseTransaction]:
        transactions = []
        # Your parsing logic here
        return transactions
```

**2. Entry point registration:**

```toml
# pyproject.toml
[project]
name = "statement-processor-mybank"
dependencies = ["statement-processor>=0.1"]

[project.entry-points."statement_processor.parsers"]
my_bank = "my_bank_parser:MyBankParser"
```

**3. Install and use:**

```bash
pip install statement-processor-mybank
statement-processor ./my-bank-statements/
```

## Creating a Strategy Plugin

Custom clustering strategies follow the same pattern:

```python
# my_strategies/rewards.py
import pandas as pd

from statement_processor.analytics import ClusteringStrategy, TransactionCluster

class RewardsCategoryStrategy(ClusteringStrategy):
    @property
    def name(self) -> str:
        return "rewards_category"
    
    def cluster(self, transactions: pd.DataFrame) -> list[TransactionCluster]:
        clusters = []
        # Group by reward category, etc.
        return clusters
```

```toml
# pyproject.toml
[project.entry-points."statement_processor.strategies"]
rewards = "my_strategies:RewardsCategoryStrategy"
```

## Vendor Patterns

Vendor patterns are YAML files that map transaction descriptions to canonical vendor names:

```yaml
# patterns/retail.yaml
patterns:
  - pattern: 'WALMART.*'
    vendor: Walmart
  - pattern: 'COSTCO\s*(WHSE|WHOLESALE)?.*'
    vendor: Costco
  - pattern: 'AMZN\s*MKTP.*|AMAZON\.COM.*'
    vendor: Amazon
```

**Loading custom patterns:**

```python
from pathlib import Path

from statement_processor.analytics import load_patterns

# Load from additional directories
patterns = load_patterns(extra_dirs=[Path("./my_patterns")])
```

Plugins can also bundle patterns and register them via entry points.

## Built-in Components

### Parsers
- `markdown_table` - Parses transactions from markdown tables extracted by Docling

> **Note:** The built-in parser works with text-based PDFs only. It does not perform OCR or use vision/language models. If your statements are scanned images or require OCR, you'll need to add a parser plugin with those capabilities.

### Clustering Strategies
- `RegexVendorStrategy` - Match descriptions against YAML patterns
- `ExactMatchStrategy` - Group identical descriptions
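
As a rough illustration of the exact-match idea (not the library's actual implementation, which returns `TransactionCluster` objects), grouping identical descriptions in pandas might look like:

```python
import pandas as pd

# Toy transaction frame; real input comes from the parsed statements.
df = pd.DataFrame({
    "description": ["NETFLIX.COM", "NETFLIX.COM", "SHELL OIL 123"],
    "amount": [-15.49, -15.49, -42.10],
})

# Group identical descriptions and aggregate count and total spend.
groups = (
    df.groupby("description", sort=False)
      .agg(count=("amount", "size"), total=("amount", "sum"))
      .reset_index()
)
```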

## Project Structure

```
statement_processor/
├── core/
│   ├── base_transaction.py    # BaseTransaction interface
│   ├── base_parser.py         # BaseStatementParser interface
│   └── registry.py            # Plugin discovery
├── parsers/
│   └── markdown_table.py      # Markdown table parser
├── analytics/
│   ├── clustering.py          # ClusteringStrategy interface
│   ├── cluster_runner.py      # Strategy orchestration
│   ├── strategies/            # Built-in strategies
│   └── vendor_patterns/       # YAML pattern files
├── extraction/
│   ├── pdf_scanner.py         # PDF file discovery
│   └── pdf_markdown_extractor.py  # Docling PDF to markdown
└── cli.py                     # Command-line interface
```

## Configuration

### Cascade vs Merge Mode

The `ClusterRunner` supports two modes:
- **Cascade** (default): transactions claimed by one strategy are excluded from later strategies
- **Merge**: all strategies see all transactions; overlapping results are combined with weighted averaging

### Debug Output

With `--debug`, the following files are saved for each PDF:
- `.json` - Docling JSON export
- `.html` - HTML with embedded images
- `.md` - Markdown
- `.txt` - Plain text
- `_metadata.txt` - Extraction summary

## Python API

```python
from pathlib import Path

import pandas as pd

from statement_processor import StatementProcessor
from statement_processor.analytics import VendorClusterer
from statement_processor.core import auto_detect_parser
from statement_processor.extraction import TextExtractor

# High-level API
processor = StatementProcessor(
    input_dirs=["./statements"],
    output_dir="./results",
)
result = processor.process()

# Low-level API
pdf_path = Path("./statements/example.pdf")

extractor = TextExtractor()
text = extractor.extract(pdf_path)

parser = auto_detect_parser(text)  # returns None if no parser matches
transactions = parser.parse(text, pdf_path.name)

df = pd.DataFrame([t.to_dict() for t in transactions])
clusterer = VendorClusterer()
vendor_df = clusterer.summarize(df)
```

## Running Tests

```bash
uv run pytest        # Run all tests
uv run pytest -v     # Verbose output
```

## Future Development

Planned features (see `future/` folder for prototype code):

- **Recurrence Detection** - Classify transactions as monthly, yearly, or one-time based on amount clustering and calendar patterns
- **Description Parsing** - Extract structured fields (vendor name, phone, city, state) from raw transaction descriptions
- **OCR Support** - Handle scanned paper statements and photographed documents where text extraction requires optical character recognition
- **VLM Extraction** - Use vision-language models with Pydantic schema templates to extract transactions from non-standard statement formats where traditional table/text parsing fails

## License

MIT
