# Streaming SQL Join Engine

A lightweight Python library for executing SQL queries with joins in a streaming, row-by-row fashion without loading full tables into memory.

## What This Library Does

**Streaming SQL Join Engine** is a Python library that lets you write SQL queries to join data from **any source** - CSV files, REST APIs, database tables, JSON files, or custom Python functions - **without copying or loading the data into a database first**.

### The Core Problem It Solves

Traditionally, to join data from different sources (like multiple APIs, CSV files, or databases), you need to:

1. Extract data from each source
2. Load it into a database or data warehouse
3. Write SQL queries against that database
4. Export the results

This library eliminates steps 2 and 4. You can write SQL queries directly against your original data sources, and the engine processes them row-by-row in memory, streaming results as they're computed.

### How It Works

1. **Register Data Sources**: You register any Python function that yields dictionaries (rows) as a "table":

   ```python
   import requests
   from streaming_sql_engine import Engine

   engine = Engine()

   def my_api_source():
       response = requests.get("https://api.example.com/data")
       for item in response.json():
           yield {"id": item["id"], "name": item["name"]}

   engine.register("my_table", my_api_source)
   ```

2. **Write SQL Queries**: Use standard SQL syntax to join and filter:

   ```python
   query = """
       SELECT t1.name, t2.value
       FROM table1 t1
       JOIN table2 t2 ON t1.id = t2.id
       WHERE t1.status = 'active'
   """
   ```

3. **Stream Results**: The engine processes your query row-by-row, yielding results as they're computed:

   ```python
   for row in engine.query(query):
       print(row)  # Handle each row as it arrives; no need to wait for the full result set
   ```

### Key Characteristics

- **No Data Copying**: Data stays in its original source (API, file, database). Only the data needed for joins is temporarily cached.
- **Memory Efficient**: Uses iterator pipelines - processes one row at a time instead of loading entire tables.
- **Source Agnostic**: Works with any data source you can write a Python function for (APIs, CSVs, databases, JSON files, etc.).
- **SQL Interface**: Familiar SQL syntax for joins and filtering, even when joining data from completely different systems.
- **Streaming Execution**: Results are yielded immediately as they're computed, enabling real-time processing of large datasets.
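
The iterator-pipeline idea behind these points can be illustrated with plain Python generators. This is a minimal sketch that does not use the library; `endless_source` and `only_active` are hypothetical names chosen for the illustration. Because each stage pulls one row at a time, the pipeline works even on a source that never ends:

```python
from itertools import count, islice

def endless_source():
    """Hypothetical source: yields rows forever, so it can never be fully loaded."""
    for i in count(1):
        yield {"id": i, "status": "active" if i % 2 == 0 else "inactive"}

def only_active(rows):
    """Filter stage: passes matching rows through one at a time."""
    for row in rows:
        if row["status"] == "active":
            yield row

# Each stage pulls a single row from the previous one on demand.
pipeline = only_active(endless_source())
first_three = list(islice(pipeline, 3))
print(first_three)
```

Only as many source rows are read as are needed to produce the three results; nothing is buffered between stages.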

### Use Cases

- **API Data Integration**: Join data from multiple REST APIs without setting up a database
- **File Processing**: Join CSV files, JSON files, or other file formats using SQL
- **Cross-System Queries**: Query data from PostgreSQL, MySQL, APIs, and files in a single SQL query
- **Real-Time Analytics**: Process streaming data sources with SQL joins
- **ETL Simplification**: Replace complex ETL pipelines with simple SQL queries
- **Prototyping**: Quickly explore relationships between different data sources
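
For the file-processing case, a source can be a small generator around the standard `csv` module. The sketch below is one possible shape, not part of the library; `csv_source` is a hypothetical helper name and the file path is whatever you pass in:

```python
import csv

def csv_source(path):
    """Return a zero-argument source function that streams rows from a CSV file."""
    def source():
        # newline="" is what the csv module recommends for file objects.
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                # DictReader yields every value as a string; cast here if the
                # query needs numeric comparisons.
                yield dict(row)
    return source
```

A source built this way could then be registered like any other, for example `engine.register("orders", csv_source("orders.csv"))`, where `orders.csv` is a placeholder file name.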

### What Makes It Different

Unlike traditional databases or query engines:

- **No setup required**: No database server, no schema creation, no data loading
- **Works with live data**: Queries execute against current data from APIs or files
- **Minimal memory footprint**: Only caches what's needed for joins, not entire tables
- **Python-native**: Integrates seamlessly with Python code and data sources

## Features

- **Streaming execution**: Processes queries row-by-row using Python generators
- **INNER and LEFT JOIN support**: Efficient join algorithms (lookup-based and merge joins)
- **WHERE clause filtering**: Supports comparisons, boolean logic, and NULL checks
- **Column projection**: SELECT with column selection and aliasing
- **Memory efficient**: Only materializes lookup-side tables for joins, not full result sets
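
To make the lookup-based join concrete, here is a simplified sketch of the general technique in plain Python. It is an illustration under assumptions, not the library's implementation: `lookup_join` and its parameters are hypothetical, column-name collisions are resolved naively, and SQL NULL semantics for join keys are ignored.

```python
from collections import defaultdict

def lookup_join(left_rows, right_rows, left_key, right_key, how="inner"):
    """Hash-lookup join sketch: index the right side, then stream the left side."""
    # Materialize only the lookup (right) side, grouped by join key.
    index = defaultdict(list)
    right_columns = set()
    for row in right_rows:
        index[row[right_key]].append(row)
        right_columns.update(row)

    for left in left_rows:
        matches = index.get(left[left_key], [])
        for right in matches:
            yield {**left, **right}
        if not matches and how == "left":
            # LEFT JOIN keeps unmatched left rows, padding right columns with None.
            yield {**{col: None for col in right_columns}, **left}

users = [{"user_id": 1, "dept_id": 10}, {"user_id": 2, "dept_id": 99}]
depts = [{"dept_id": 10, "dept_name": "Engineering"}]

inner_rows = list(lookup_join(users, depts, "dept_id", "dept_id"))
left_rows = list(lookup_join(users, depts, "dept_id", "dept_id", how="left"))
```

With `how="inner"` the user with `dept_id` 99 is dropped; with `how="left"` it is kept with `dept_name` set to `None`.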

## Installation

> **Note:** For detailed installation instructions, see [INSTALLATION.md](INSTALLATION.md).

### Option 1: Install from Local Directory (Development)

If you have the library source code locally:

```bash
# Navigate to the library directory
cd sql_engine

# Install in editable mode (recommended for development)
pip install -e .

# Or install normally
pip install .
```

### Option 2: Install Dependencies Only

If you want to use the library without installing it as a package:

```bash
pip install -r requirements.txt
```

Then add the library directory to your Python path or import directly:

```python
import sys
sys.path.insert(0, '/path/to/sql_engine')
from streaming_sql_engine import Engine
```

### Option 3: Install from Git Repository

If the library is hosted on GitHub or another Git repository:

```bash
pip install git+https://github.com/Ierofantis/streaming-sql-engine.git
```

### Option 4: Install as a Package in Your Project

Add to your project's `requirements.txt`:

```
streaming-sql-engine @ file:///path/to/sql_engine
```

Or, if your project uses Poetry, add it to `pyproject.toml`:

```toml
[tool.poetry.dependencies]
streaming-sql-engine = {path = "../sql_engine", develop = true}
```

## PostgreSQL Integration

The library includes utilities for connecting to PostgreSQL databases:

```python
from streaming_sql_engine import Engine, create_pool_from_env, create_table_source
from dotenv import load_dotenv
import os

load_dotenv()

# Create connection pool from environment variables
pool = create_pool_from_env()

# Create engine
engine = Engine()

# Register tables from PostgreSQL
engine.register(
    "users",
    create_table_source(pool, "users", where_clause="active = true", order_by="id"),
    ordered_by="id"  # Enable merge joins if sorted
)

# Execute queries
for row in engine.query("SELECT users.name FROM users WHERE users.id > 1"):
    print(row)
```

Required environment variables (in `.env` file):

- `db_host`
- `db_port`
- `db_user`
- `db_password`
- `db_name`
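
Before creating the pool, it can help to confirm that all five variables are present. The check below is a small sketch using only the standard library; `missing_db_vars` is a hypothetical helper, not something the library provides:

```python
import os

REQUIRED_VARS = ("db_host", "db_port", "db_user", "db_password", "db_name")

def missing_db_vars(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]

missing = missing_db_vars()
if missing:
    print("Missing database settings:", ", ".join(missing))
```

Run this after `load_dotenv()` so that values from the `.env` file are already in the environment.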

## Quick Start

```python
from streaming_sql_engine import Engine

# Create engine instance
engine = Engine()

# Register table sources (functions that return iterators of dictionaries)
def users_source():
    return iter([
        {"id": 1, "name": "Alice", "dept_id": 10},
        {"id": 2, "name": "Bob", "dept_id": 20},
        {"id": 3, "name": "Charlie", "dept_id": 10},
    ])

def departments_source():
    return iter([
        {"id": 10, "name": "Engineering"},
        {"id": 20, "name": "Sales"},
    ])

engine.register("users", users_source)
engine.register("departments", departments_source)

# Execute a query
query = """
    SELECT users.name, departments.name AS dept_name
    FROM users
    JOIN departments ON users.dept_id = departments.id
    WHERE users.id > 1
"""

for row in engine.query(query):
    print(row)
# Output:
# {'name': 'Bob', 'dept_name': 'Sales'}
# {'name': 'Charlie', 'dept_name': 'Engineering'}
```

## Real-World Example: Joining Multiple APIs

Join data from multiple free public APIs using SQL:

```python
# Requires: streaming-sql-engine (see Installation) and requests
import requests
from streaming_sql_engine import Engine

# Define API sources
def jsonplaceholder_users_api():
    """Fetch users from JSONPlaceholder API."""
    response = requests.get("https://jsonplaceholder.typicode.com/users")
    for user in response.json():
        yield {
            "user_id": user["id"],
            "user_name": user["name"],
            "user_email": user["email"],
            "user_city": user["address"]["city"],
        }

def jsonplaceholder_posts_api():
    """Fetch posts from JSONPlaceholder API."""
    response = requests.get("https://jsonplaceholder.typicode.com/posts")
    for post in response.json():
        yield {
            "post_id": post["id"],
            "user_id": post["userId"],
            "post_title": post["title"],
            "post_body": post["body"][:100],
        }

def rest_countries_api():
    """Fetch countries from REST Countries API."""
    response = requests.get("https://restcountries.com/v3.1/all")
    for country in response.json():
        yield {
            "country_code": country.get("cca2", ""),
            "country_name": country["name"].get("common", ""),
            "capital": (country.get("capital") or ["N/A"])[0],
            "population": country.get("population", 0),
        }

# Create engine and register sources
engine = Engine()
engine.register("users", jsonplaceholder_users_api)
engine.register("posts", jsonplaceholder_posts_api)
engine.register("countries", rest_countries_api)

# Join users and posts (two separate JSONPlaceholder endpoints)
query = """
    SELECT
        users.user_name,
        users.user_email,
        posts.post_title,
        posts.post_body
    FROM users
    JOIN posts ON users.user_id = posts.user_id
    WHERE users.user_id <= 3
"""

for row in engine.query(query):
    print(f"{row['user_name']} ({row['user_email']}): {row['post_title']}")
```

This example joins data from two REST endpoints using standard SQL syntax, with no ETL step and no intermediate database. The `countries` source is registered the same way and can be queried like any other table.

## Supported SQL Features

### SELECT

- Column selection: `SELECT col1, col2`
- Aliasing: `SELECT col1 AS alias1`
- Table-qualified columns: `SELECT users.name`

### FROM

- Single table: `FROM table_name`
- Table aliases: `FROM users AS u`

### JOIN

- INNER JOIN: `JOIN table ON left.key = right.key`
- LEFT JOIN: `LEFT JOIN table ON left.key = right.key`
- Multiple joins in sequence
- Only equality joins are supported

### WHERE

- Comparisons: `=`, `!=`, `<`, `>`, `<=`, `>=`
- Boolean operators: `AND`, `OR`, `NOT`
- NULL checks: `IS NULL`, `IS NOT NULL`
- IN clauses: `column IN (value1, value2, ...)`
- Column references: `alias.column`
- Constants: strings, numbers

## Not Supported

- GROUP BY and aggregations
- ORDER BY
- HAVING
- UNION
- Subqueries
- Non-equality joins
- Arithmetic expressions
- SQL functions (expressions are limited to column references and literal constants)
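
Since query results are ordinary Python dictionaries, grouping and ordering can be done downstream in plain Python. The sketch below uses a hand-written list as a stand-in for the rows a query would yield; the values are made up for illustration:

```python
from collections import Counter

# Stand-in for rows produced by `engine.query(...)`; the values are made up.
rows = iter([
    {"name": "Bob", "dept_name": "Sales"},
    {"name": "Charlie", "dept_name": "Engineering"},
    {"name": "Alice", "dept_name": "Engineering"},
])

# Rough equivalent of: SELECT dept_name, COUNT(*) ... GROUP BY dept_name
counts = Counter()
for row in rows:
    counts[row["dept_name"]] += 1  # one row at a time, so the input still streams

# Rough equivalent of ORDER BY dept_name (sorting needs all groups in memory)
for dept, n in sorted(counts.items()):
    print(dept, n)
```

Only the per-group counters are held in memory, not the joined rows themselves.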

## Advanced: Merge Joins

When both tables are sorted by the join key, register each source with `ordered_by` so the engine can use a merge join instead of building a lookup table:

```python
# Register a source that's sorted by id
engine.register(
    "users",
    users_source,
    ordered_by="id"  # Enables merge join optimization
)
```
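
For readers unfamiliar with the technique, the following is a textbook-style inner merge join written in plain Python. It is a sketch of the general algorithm under the assumption that both inputs are sorted ascending by their join key; it is not the library's `MergeJoinIterator`, and `merge_join` is a hypothetical name:

```python
def merge_join(left_rows, right_rows, left_key, right_key):
    """Inner merge join sketch; both inputs must be sorted ascending by their key."""
    left_iter, right_iter = iter(left_rows), iter(right_rows)
    left = next(left_iter, None)
    right = next(right_iter, None)
    while left is not None and right is not None:
        if left[left_key] < right[right_key]:
            left = next(left_iter, None)
        elif left[left_key] > right[right_key]:
            right = next(right_iter, None)
        else:
            key = left[left_key]
            # Buffer only the run of right rows sharing this key.
            run = []
            while right is not None and right[right_key] == key:
                run.append(right)
                right = next(right_iter, None)
            # Emit every left row with this key against the buffered run.
            while left is not None and left[left_key] == key:
                for match in run:
                    yield {**left, **match}
                left = next(left_iter, None)

users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}, {"id": 4, "name": "Dana"}]
orders = [{"user_id": 1, "total": 5}, {"user_id": 1, "total": 7},
          {"user_id": 3, "total": 1}, {"user_id": 4, "total": 9}]

joined = list(merge_join(users, orders, "id", "user_id"))
```

The only state kept is the current run of equal-key rows, which is why memory use stays small when keys repeat rarely.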

## Architecture

The engine follows a pipeline architecture:

1. **Parser**: Uses `sqlglot` to parse SQL into an AST
2. **Planner**: Converts AST into a logical execution plan
3. **Executor**: Builds a chain of iterator operators:
   - `ScanIterator`: Reads from source
   - `FilterIterator`: Applies WHERE clause
   - `LookupJoinIterator`: Joins using hash lookup (indexes smaller table)
   - `MergeJoinIterator`: Joins using merge algorithm (when both sides sorted)
   - `ProjectIterator`: Applies SELECT projection
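
The operator chain can be pictured as nested iterables, each wrapping its child. The classes below are simplified stand-ins written for this illustration (`Scan`, `Filter`, and `Project` are hypothetical names, and the join operators are omitted for brevity); they are not the library's actual classes:

```python
class Scan:
    """Reads rows from a source function."""
    def __init__(self, source):
        self.source = source

    def __iter__(self):
        yield from self.source()

class Filter:
    """Passes through rows for which the predicate is true (WHERE)."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def __iter__(self):
        return (row for row in self.child if self.predicate(row))

class Project:
    """Keeps selected columns, optionally renaming them (SELECT ... AS ...)."""
    def __init__(self, child, columns):
        self.child, self.columns = child, columns  # {output_name: input_name}

    def __iter__(self):
        for row in self.child:
            yield {out: row[src] for out, src in self.columns.items()}

def users_source():
    return iter([{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}])

# Roughly: SELECT name AS user_name FROM users WHERE id > 1
plan = Project(Filter(Scan(users_source), lambda r: r["id"] > 1), {"user_name": "name"})
print(list(plan))
```

Iterating the outermost operator pulls rows through the whole chain one at a time, which is what keeps intermediate results from being materialized.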

## Error Handling

The engine raises clear exceptions for:

- Unsupported SQL constructs
- Missing table registrations
- Ambiguous column references
- Invalid join conditions

## Performance Considerations

- **Lookup joins**: Memory usage proportional to size of lookup-side table
- **Merge joins**: Memory usage proportional to size of equal-key runs (usually very small)
- **Streaming**: Results are yielded as soon as they're produced
- **No buffering**: Intermediate results are not fully materialized

## Installation from PyPI

Once published, you can install the package directly from PyPI:

```bash
pip install streaming-sql-engine
```

## License

MIT License
