# Streaming SQL Join Engine

A lightweight Python library for executing SQL queries with joins in a streaming, row-by-row fashion without loading full tables into memory.

## Features

- **Streaming execution**: Processes queries row-by-row using Python generators
- **INNER and LEFT JOIN support**: Efficient join algorithms (lookup-based and merge joins)
- **WHERE clause filtering**: Supports comparisons, boolean logic, and NULL checks
- **Column projection**: SELECT with column selection and aliasing
- **Memory efficient**: Only materializes lookup-side tables for joins, not full result sets

## Installation

> **Note:** For detailed installation instructions, see [INSTALLATION.md](INSTALLATION.md).

### Option 1: Install from Local Directory (Development)

If you have the library source code locally:

```bash
# Navigate to the library directory
cd sql_engine

# Install in editable mode (recommended for development)
pip install -e .

# Or install normally
pip install .
```

### Option 2: Install Dependencies Only

If you want to use the library without installing it as a package:

```bash
pip install -r requirements.txt
```

Then add the library directory to your Python path or import directly:

```python
import sys
sys.path.insert(0, '/path/to/sql_engine')
from streaming_sql_engine import Engine
```

### Option 3: Install from Git Repository

If the library is hosted on GitHub or another Git repository:

```bash
pip install git+https://github.com/yourusername/streaming-sql-engine.git
```

### Option 4: Install as a Package in Your Project

Add to your project's `requirements.txt`:

```
streaming-sql-engine @ file:///path/to/sql_engine
```

Or, if your project uses Poetry, in `pyproject.toml`:

```toml
[tool.poetry.dependencies]
streaming-sql-engine = {path = "../sql_engine", develop = true}
```

## PostgreSQL Integration

The library includes utilities for connecting to PostgreSQL databases:

```python
from streaming_sql_engine import Engine, create_pool_from_env, create_table_source
from dotenv import load_dotenv

load_dotenv()

# Create connection pool from environment variables
pool = create_pool_from_env()

# Create engine
engine = Engine()

# Register tables from PostgreSQL
engine.register(
    "users",
    create_table_source(pool, "users", where_clause="active = true", order_by="id"),
    ordered_by="id"  # Rows arrive sorted by id (see order_by above), which enables merge joins
)

# Execute queries
for row in engine.query("SELECT users.name FROM users WHERE users.id > 1"):
    print(row)
```

Required environment variables (in a `.env` file):

- `db_host`
- `db_port`
- `db_user`
- `db_password`
- `db_name`
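
A missing variable is easier to diagnose before a connection pool is created. The sketch below checks for the five names listed above; `REQUIRED_DB_SETTINGS` and `missing_db_settings` are hypothetical helpers for illustration, not part of the library.

```python
import os
from typing import List, Mapping

# The variable names listed above, spelled exactly as the README gives them.
REQUIRED_DB_SETTINGS = ("db_host", "db_port", "db_user", "db_password", "db_name")


def missing_db_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_DB_SETTINGS if not env.get(name)]
```

Calling `missing_db_settings()` after `load_dotenv()` and raising an error when the returned list is non-empty would surface configuration problems early.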

## Quick Start

```python
from streaming_sql_engine import Engine

# Create engine instance
engine = Engine()

# Register table sources (functions that return iterators of dictionaries)
def users_source():
    return iter([
        {"id": 1, "name": "Alice", "dept_id": 10},
        {"id": 2, "name": "Bob", "dept_id": 20},
        {"id": 3, "name": "Charlie", "dept_id": 10},
    ])

def departments_source():
    return iter([
        {"id": 10, "name": "Engineering"},
        {"id": 20, "name": "Sales"},
    ])

engine.register("users", users_source)
engine.register("departments", departments_source)

# Execute a query
query = """
    SELECT users.name, departments.name AS dept_name
    FROM users
    JOIN departments ON users.dept_id = departments.id
    WHERE users.id > 1
"""

for row in engine.query(query):
    print(row)
# Output:
# {'name': 'Bob', 'dept_name': 'Sales'}
# {'name': 'Charlie', 'dept_name': 'Engineering'}
```
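
A table source does not have to be an in-memory list. As a minimal sketch, assuming a source is simply a zero-argument callable that returns an iterator of dictionaries (as in the example above), a CSV file could be streamed like this. `csv_source`, `users_from_csv`, and the path `users.csv` are hypothetical names for illustration.

```python
import csv
from functools import partial
from typing import Dict, Iterator


def csv_source(path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per CSV row without reading the whole file into memory."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            yield dict(row)


# A zero-argument callable that produces a fresh iterator on every call.
# "users.csv" is a placeholder path.
users_from_csv = partial(csv_source, "users.csv")
```

Such a callable could then be passed to `engine.register("users", users_from_csv)`. Note that `csv.DictReader` yields every value as a string, so a numeric comparison such as `users.id > 1` would need the values converted first.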

## Supported SQL Features

### SELECT

- Column selection: `SELECT col1, col2`
- Aliasing: `SELECT col1 AS alias1`
- Table-qualified columns: `SELECT users.name`

### FROM

- Single table: `FROM table_name`
- Table aliases: `FROM users AS u`

### JOIN

- INNER JOIN: `JOIN table ON left.key = right.key`
- LEFT JOIN: `LEFT JOIN table ON left.key = right.key`
- Multiple joins in sequence
- Only equality joins are supported

### WHERE

- Comparisons: `=`, `!=`, `<`, `>`, `<=`, `>=`
- Boolean operators: `AND`, `OR`, `NOT`
- NULL checks: `IS NULL`, `IS NOT NULL`
- IN clauses: `column IN (value1, value2, ...)`
- Column references: `alias.column`
- Constants: strings, numbers

## Not Supported

- GROUP BY and aggregations
- ORDER BY
- HAVING
- UNION
- Subqueries
- Non-equality joins
- Arithmetic expressions
- Function calls (expressions are limited to column references and literal constants)

## Advanced: Merge Joins

When both tables are sorted by the join key, register each source with `ordered_by` so the engine can use a merge join:

```python
# Register a source that's sorted by id
engine.register(
    "users",
    users_source,
    ordered_by="id"  # Enables merge join optimization
)
```
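
To show why sorted inputs help, here is a generic sketch of an inner merge join over two sorted iterables. It is not the library's `MergeJoinIterator`; `merge_join` and the sample data are illustrative, and the sketch assumes join keys are non-NULL and mutually comparable.

```python
from itertools import groupby
from operator import itemgetter
from typing import Any, Dict, Iterable, Iterator

Row = Dict[str, Any]


def merge_join(left: Iterable[Row], right: Iterable[Row],
               left_key: str, right_key: str) -> Iterator[Row]:
    """Inner-join two inputs that are already sorted ascending by their join keys."""
    left_groups = groupby(left, key=itemgetter(left_key))
    right_groups = groupby(right, key=itemgetter(right_key))
    l_item = next(left_groups, None)
    r_item = next(right_groups, None)
    while l_item is not None and r_item is not None:
        (l_val, l_rows), (r_val, r_rows) = l_item, r_item
        if l_val < r_val:
            l_item = next(left_groups, None)   # left is behind: advance it
        elif l_val > r_val:
            r_item = next(right_groups, None)  # right is behind: advance it
        else:
            # Only the current equal-key run of the right side is buffered.
            r_run = list(r_rows)
            for l_row in l_rows:
                for r_row in r_run:
                    # Right-side columns overwrite same-named left-side columns here.
                    yield {**l_row, **r_row}
            l_item = next(left_groups, None)
            r_item = next(right_groups, None)


users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}, {"id": 4, "name": "Dana"}]
orders = [{"user_id": 1, "total": 30}, {"user_id": 1, "total": 12},
          {"user_id": 3, "total": 5}, {"user_id": 4, "total": 8}]
joined = list(merge_join(users, orders, "id", "user_id"))
```

Each input is read once, front to back, and only one run of equal keys is held in memory at a time.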

## Performance Optimizations

### Memory-Efficient Joins

For file-based sources, pass the `filename` parameter to enable memory-mapped joins (90-99% memory reduction):

```python
engine.register(
    "images",
    lambda: load_jsonl_file("images.jsonl"),
    filename="images.jsonl"  # Enables mmap-based joins
)
```
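
The snippet above does not define `load_jsonl_file`. A minimal sketch of such a helper, assuming one JSON object per line, might look like this; the actual helper used with the library may differ.

```python
import json
from typing import Any, Dict, Iterator


def load_jsonl_file(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because it is a generator, the file is read one line at a time as rows are requested.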

### Database Filter Pushdown

For database sources, use `is_database_source=True` to enable filter pushdown:

```python
engine.register(
    "products",
    create_table_source(pool, "products"),
    is_database_source=True  # Enables WHERE clause pushdown
)
```
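
Filter pushdown means the database evaluates the predicate, so non-matching rows never leave it. The sketch below illustrates the idea with the standard-library `sqlite3` module as a stand-in for PostgreSQL. It does not show how the engine itself rewrites queries, and the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [(1, 5.0), (2, 50.0), (3, 500.0)])

# Without pushdown: every row crosses the database boundary, then Python filters.
all_rows = conn.execute("SELECT id, price FROM products ORDER BY id")
filtered_in_python = [row for row in all_rows if row[1] > 10]

# With pushdown: the predicate travels to the database, so only matches are sent.
filtered_in_db = conn.execute(
    "SELECT id, price FROM products WHERE price > ? ORDER BY id", (10,)
).fetchall()
```

Both approaches return the same rows; the second transfers fewer of them.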

### Polars Acceleration

Polars optimizations are enabled by default. Disable with:

```python
engine = Engine(use_polars=False)  # Disable Polars optimizations
```

## Architecture

The engine follows a pipeline architecture:

1. **Parser**: Uses `sqlglot` to parse SQL into an AST
2. **Planner**: Converts AST into a logical execution plan
3. **Executor**: Builds a chain of iterator operators:
   - `ScanIterator`: Reads from source
   - `FilterIterator`: Applies WHERE clause
   - `LookupJoinIterator`: Joins using a hash lookup (indexes the smaller table)
   - `MergeJoinIterator`: Joins using a merge algorithm (when both sides are sorted)
   - `ProjectIterator`: Applies SELECT projection
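
The operator chain above can be approximated with plain generators. This is a simplified sketch rather than the library's implementation: `scan`, `filter_rows`, `lookup_join`, and `project` are illustrative stand-ins for the iterator classes, and the data is taken from the Quick Start example.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def scan(source: Callable[[], Iterable[Row]], alias: str) -> Iterator[Row]:
    """Read rows from a source and qualify each column as alias.column."""
    for row in source():
        yield {f"{alias}.{col}": value for col, value in row.items()}


def filter_rows(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Pass through only the rows for which the predicate is true."""
    return (row for row in rows if predicate(row))


def lookup_join(left: Iterable[Row], right: Iterable[Row], left_key: str,
                right_key: str, keep_unmatched: bool = False) -> Iterator[Row]:
    """Hash join: index the right side once, then stream the left side past it."""
    index: Dict[Any, List[Row]] = {}
    right_columns: set = set()
    for row in right:
        right_columns.update(row)
        if row[right_key] is not None:  # NULL never equals anything in SQL
            index.setdefault(row[right_key], []).append(row)
    for l_row in left:
        matches = index.get(l_row[left_key], [])
        for r_row in matches:
            yield {**l_row, **r_row}
        if not matches and keep_unmatched:  # LEFT JOIN: pad right columns with None
            yield {**l_row, **dict.fromkeys(right_columns)}


def project(rows: Iterable[Row], columns: Dict[str, str]) -> Iterator[Row]:
    """Keep the selected columns, renaming each to its output name."""
    for row in rows:
        yield {out: row[src] for out, src in columns.items()}


users = lambda: [{"id": 1, "name": "Alice", "dept_id": 10},
                 {"id": 2, "name": "Bob", "dept_id": 20},
                 {"id": 3, "name": "Charlie", "dept_id": 10}]
departments = lambda: [{"id": 10, "name": "Engineering"}, {"id": 20, "name": "Sales"}]

plan = project(
    lookup_join(
        filter_rows(scan(users, "users"), lambda r: r["users.id"] > 1),
        scan(departments, "departments"),
        "users.dept_id", "departments.id"),
    {"name": "users.name", "dept_name": "departments.name"})
result = list(plan)
# result == [{'name': 'Bob', 'dept_name': 'Sales'},
#            {'name': 'Charlie', 'dept_name': 'Engineering'}]
```

Nothing runs until `plan` is consumed. Only the right-hand (lookup) side of the join is held in memory, while left-side rows flow through one at a time.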

## Error Handling

The engine raises clear exceptions for:

- Unsupported SQL constructs
- Missing table registrations
- Ambiguous column references
- Invalid join conditions

## Performance Considerations

- **Lookup joins**: Memory usage proportional to size of lookup-side table
- **Merge joins**: Memory usage proportional to size of equal-key runs (usually very small)
- **Streaming**: Results are yielded as soon as they're produced
- **No buffering**: Intermediate results are not fully materialized
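
The streaming behaviour described above is the same laziness that Python generators provide. As a small illustration, independent of the library, a filter over an endless source can still return its first few results because rows are produced only on demand. `endless_source` is a made-up example source.

```python
from itertools import count, islice


def endless_source():
    """A source that never ends; it can only be consumed lazily."""
    for i in count(1):
        yield {"id": i, "even": i % 2 == 0}


# The filter is a generator expression, so no rows are read yet.
even_rows = (row for row in endless_source() if row["even"])

# Pulling three results reads exactly six source rows and then stops.
first_three = list(islice(even_rows, 3))
# first_three == [{'id': 2, 'even': True}, {'id': 4, 'even': True}, {'id': 6, 'even': True}]
```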

## Documentation

- **[TECHNICAL_DOCUMENTATION.md](TECHNICAL_DOCUMENTATION.md)**: Complete technical documentation explaining architecture, algorithms, and design decisions
- **[DEVELOPER_GUIDE.md](DEVELOPER_GUIDE.md)**: Guide for developers who want to understand, modify, or extend the codebase
- **[PERFORMANCE.md](PERFORMANCE.md)**: Performance comparison with database joins and optimization tips
- **[MYSQL_USAGE.md](MYSQL_USAGE.md)**: MySQL-specific usage guide
- **[QUICK_START.md](QUICK_START.md)**: Quick start guide for new users
- **[INSTALLATION.md](INSTALLATION.md)**: Detailed installation instructions

## License

MIT
