# Requirements Document

## Introduction

This feature provides a proof-of-concept system for processing credit card statement PDFs using Docling. The system extracts raw text from PDF statements, parses transaction data into a structured format, and saves the raw transaction data to CSV files. This spec covers the extraction pipeline only; analytics and categorization are handled by a separate vendor-analysis spec.

## Glossary

- **Statement_Processor**: The main system that orchestrates PDF processing and data extraction
- **Docling**: A document processing library used to extract text content from PDF files
- **Transaction**: A single credit card charge or credit entry containing date, description, and amount
- **Input_Directory**: The file system directory containing PDF statement files to process

## Requirements

### Requirement 1

**User Story:** As a user, I want to extract raw text from PDF credit card statements, so that I can access the transaction data in a processable format.

#### Acceptance Criteria

1. WHEN the Statement_Processor receives an input directory path THEN the Statement_Processor SHALL scan the directory and identify all PDF files
2. WHEN a PDF file is identified THEN the Statement_Processor SHALL use Docling to extract the raw text content from the PDF
3. WHEN text extraction completes THEN the Statement_Processor SHALL preserve the extracted text for subsequent parsing
4. IF a PDF file cannot be read or is corrupted THEN the Statement_Processor SHALL log an error message and continue processing remaining files
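
The scan-and-continue behaviour in these criteria can be sketched as follows. This is a minimal illustration, not the required implementation: `extract_all` and the `extract_text` callable are hypothetical names, and the extension match is assumed to be case-insensitive. In a real run `extract_text` might wrap a Docling call such as `DocumentConverter().convert(path).document.export_to_markdown()`; here it is injected so the control flow can be shown without the library.

```python
import logging
from pathlib import Path
from typing import Callable, Dict

logger = logging.getLogger("statement_processor")


def extract_all(input_dir: str, extract_text: Callable[[Path], str]) -> Dict[str, str]:
    """Scan input_dir for PDF files and return {file name: extracted text}.

    extract_text is a hypothetical callable that wraps the actual extraction library.
    """
    texts: Dict[str, str] = {}
    # Sort for a deterministic processing order; match ".pdf" case-insensitively (assumption).
    pdfs = sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    for pdf in pdfs:
        try:
            # Keep the extracted text so a later parsing step can use it.
            texts[pdf.name] = extract_text(pdf)
        except Exception as exc:  # unreadable or corrupted file
            logger.error("Could not extract text from %s: %s", pdf.name, exc)
            continue  # keep processing the remaining files
    return texts
```

Catching a broad `Exception` is a deliberate choice for a proof of concept, since the failure modes of a corrupted PDF are hard to enumerate; a production version would likely narrow it.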

### Requirement 2

**User Story:** As a user, I want transaction data parsed into a structured format, so that I can analyze my spending programmatically.

#### Acceptance Criteria

1. WHEN raw text is extracted from a statement THEN the Statement_Processor SHALL parse individual transactions containing date, description, and amount fields
2. WHEN parsing transactions THEN the Statement_Processor SHALL normalize date formats to a consistent ISO format (YYYY-MM-DD)
3. WHEN parsing transactions THEN the Statement_Processor SHALL extract the transaction amount as a numeric value with sign indicating charge (negative) or credit (positive)
4. WHEN parsing completes THEN the Statement_Processor SHALL store transactions in a pandas DataFrame for in-memory analysis
5. IF a transaction line cannot be parsed THEN the Statement_Processor SHALL log a warning and skip that line
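
A minimal sketch of the line-level parsing described above. The line layout is an assumption (`MM/DD/YYYY DESCRIPTION AMOUNT`, with credits marked by a trailing `CR` or a leading minus sign); real statements vary by issuer. The names `LINE_RE`, `parse_line`, and `parse_lines` are illustrative only.

```python
import logging
import re
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Iterable, List, Optional, Tuple

logger = logging.getLogger("statement_processor")

# Assumed layout: "MM/DD/YYYY  DESCRIPTION  1,234.56" with an optional "CR" suffix for credits.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?\$?[\d,]+\.\d{2})(\s*CR)?$")

Transaction = Tuple[str, str, Decimal]  # (ISO date, description, signed amount)


def parse_line(line: str) -> Optional[Transaction]:
    """Parse one statement line; return None (after logging a warning) if it does not parse."""
    match = LINE_RE.match(line.strip())
    if match is None:
        logger.warning("Skipping unparseable line: %r", line)
        return None
    raw_date, description, raw_amount, credit_marker = match.groups()
    try:
        # Normalize the date to ISO format (YYYY-MM-DD).
        iso_date = datetime.strptime(raw_date, "%m/%d/%Y").date().isoformat()
        magnitude = Decimal(raw_amount.replace("$", "").replace(",", "")).copy_abs()
    except (ValueError, InvalidOperation):
        logger.warning("Skipping line with invalid date or amount: %r", line)
        return None
    # Sign convention from the requirement: charges negative, credits positive.
    is_credit = credit_marker is not None or raw_amount.startswith("-")
    amount = magnitude if is_credit else -magnitude
    return iso_date, description, amount


def parse_lines(lines: Iterable[str]) -> List[Transaction]:
    """Parse every line, dropping the ones that fail, in the original order."""
    parsed = (parse_line(line) for line in lines)
    return [row for row in parsed if row is not None]
```

The resulting list of tuples can be handed to `pandas.DataFrame(rows, columns=["date", "description", "amount"])` to satisfy the DataFrame criterion. `Decimal` is used here to avoid floating-point rounding on currency values; whether the DataFrame column should stay `Decimal` or become `float` is left open.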

### Requirement 3

**User Story:** As a user, I want extracted transactions saved to CSV, so that I have a raw data backup and can use the data in other tools.

#### Acceptance Criteria

1. WHEN transaction parsing completes THEN the Statement_Processor SHALL save all transactions to a full_transactions.csv file containing date, description, amount, and source_file columns
2. WHEN saving the transactions CSV THEN the Statement_Processor SHALL include column headers describing each field
3. WHEN saving the transactions CSV THEN the Statement_Processor SHALL use UTF-8 encoding to preserve special characters in merchant descriptions
4. WHEN saving the transactions CSV THEN the Statement_Processor SHALL preserve the original transaction order from the source statements
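
The CSV criteria above can be illustrated with the standard library alone. This is a sketch under the assumption that each transaction is a dictionary keyed by the four column names; `save_transactions` and `COLUMNS` are hypothetical names.

```python
import csv
from typing import Dict, Iterable

COLUMNS = ["date", "description", "amount", "source_file"]


def save_transactions(rows: Iterable[Dict[str, str]], path: str = "full_transactions.csv") -> None:
    """Write rows to a UTF-8 CSV with a header line, in the order received."""
    # newline="" lets the csv module control line endings;
    # UTF-8 keeps accented merchant names intact.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        for row in rows:  # no sorting, so the source order is preserved
            writer.writerow(row)
```

If the transactions are already held in a pandas DataFrame, `df.to_csv(path, index=False, encoding="utf-8")` gives an equivalent result, provided the columns are in the order listed above.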

### Requirement 4

**User Story:** As a user, I want to process statements from multiple input directories, so that I can combine data from different time periods.

#### Acceptance Criteria

1. WHEN processing multiple input directories THEN the Statement_Processor SHALL combine transactions from all sources into a unified dataset
2. WHEN combining transactions THEN the Statement_Processor SHALL preserve the source_file field to track origin
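
One way to picture the combination step, assuming each directory has already been processed into a mapping of file name to parsed rows. The function name `combine_sources` and the input shape are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Parsed = Tuple[str, str, str]  # (date, description, amount)


def combine_sources(per_directory: Iterable[Dict[str, List[Parsed]]]) -> List[Dict[str, str]]:
    """Flatten per-directory results into one list, stamping each row with its source file."""
    combined: List[Dict[str, str]] = []
    for files in per_directory:  # directories in the order supplied
        for source_file, rows in files.items():  # files in insertion order
            for date, description, amount in rows:
                combined.append({
                    "date": date,
                    "description": description,
                    "amount": amount,
                    "source_file": source_file,  # kept so every row can be traced to its origin
                })
    return combined
```

Note that two directories may contain files with the same name; this sketch records only the bare file name, so a real implementation may want to decide whether `source_file` should include the directory.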

### Requirement 5

**User Story:** As a user, I want transactions that span multiple pages in a PDF statement to be fully extracted, so that I do not lose any transaction data.

#### Acceptance Criteria

1. WHEN a PDF statement contains transactions spanning multiple pages THEN the Statement_Processor SHALL extract transactions from all pages
2. WHEN extracting transaction data THEN the Statement_Processor SHALL use Docling's DocumentExtractor with Pydantic models for structured data extraction
3. WHEN combining transactions from multiple pages THEN the Statement_Processor SHALL preserve the original order of transactions
4. WHEN a multi-page statement is parsed THEN the Statement_Processor SHALL return a transaction count equal to the total transactions across all pages
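
The ordering and count criteria can be checked with a small merge step. The sketch assumes the per-page results are available as a mapping from page number to that page's transactions, however the extractor exposes them; `merge_pages` is a hypothetical helper.

```python
from typing import Dict, List, TypeVar

T = TypeVar("T")


def merge_pages(pages: Dict[int, List[T]]) -> List[T]:
    """Concatenate per-page transaction lists in ascending page order."""
    merged: List[T] = []
    for page_no in sorted(pages):  # page order, regardless of how the mapping was built
        merged.extend(pages[page_no])  # within-page order is left untouched
    return merged
```

A test for criterion 4 would then compare `len(merge_pages(pages))` with `sum(len(rows) for rows in pages.values())`.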

### Requirement 6

**User Story:** As a user, I want the system to use structured data extraction with Pydantic models, so that transaction parsing is more reliable and maintainable.

#### Acceptance Criteria

1. WHEN extracting data from a PDF THEN the Statement_Processor SHALL use Docling's DocumentExtractor API with Pydantic model templates
2. WHEN defining extraction templates THEN the Statement_Processor SHALL use Pydantic models with Field definitions including examples for better extraction accuracy
3. WHEN extraction completes THEN the Statement_Processor SHALL validate the extracted data using Pydantic model validation
4. IF extraction produces invalid data THEN the Statement_Processor SHALL log the validation errors and handle the failure gracefully (for example, by skipping the invalid record and continuing with the remaining data)
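
A sketch of what a Pydantic model with `Field` examples and a validation pass could look like. The model name `TransactionModel`, its field constraints, and the `validate_rows` helper are assumptions; whether this exact model is suitable as a Docling extraction template is not confirmed here, and the code only demonstrates the validation and logging side.

```python
import datetime
import logging
from decimal import Decimal
from typing import Any, Dict, Iterable, List

from pydantic import BaseModel, Field, ValidationError

logger = logging.getLogger("statement_processor")


class TransactionModel(BaseModel):
    """Hypothetical shape of one extracted transaction."""

    date: datetime.date = Field(description="Transaction date", examples=["2024-01-15"])
    description: str = Field(min_length=1, description="Merchant description",
                             examples=["COFFEE SHOP SEATTLE WA"])
    amount: Decimal = Field(description="Signed amount: charges negative, credits positive",
                            examples=["-4.50", "100.00"])


def validate_rows(raw_rows: Iterable[Dict[str, Any]]) -> List[TransactionModel]:
    """Validate raw extracted dictionaries, logging and skipping the invalid ones."""
    valid: List[TransactionModel] = []
    for index, raw in enumerate(raw_rows):
        try:
            valid.append(TransactionModel(**raw))
        except ValidationError as exc:
            # exc.errors() lists each failing field with the reason.
            logger.warning("Row %d failed validation: %s", index, exc.errors())
    return valid
```

Skipping invalid rows mirrors the log-and-continue behaviour in Requirements 1 and 2; it is one reading of "handle gracefully", not the only one.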

