Metadata-Version: 2.4
Name: dataforge-cli
Version: 0.1.1
Summary: A CLI for reproducible data cleaning pipelines.
Author-email: Aditya <bit2swaz@gmail.com>
Project-URL: Homepage, https://github.com/bit2swaz/dataforge-python
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: mysql-connector-python
Requires-Dist: python-dotenv
Requires-Dist: click
Requires-Dist: scikit-learn

# DataForge CLI

[![Python Version](https://img.shields.io/badge/python-3.10%2B-blue.svg)](#) [![PyPI Version](https://img.shields.io/pypi/v/dataforge-cli.svg)](#) [![Build Status](https://img.shields.io/github/actions/workflow/status/bit2swaz/dataforge-python/ci.yml)](#)

DataForge is a local-first, Python-based CLI for building and running reproducible data-cleaning pipelines.

<!-- TODO: Add product walkthrough GIF -->

## The Problem

Cleaning data inside sprawling Jupyter notebooks or ad-hoc Excel sheets is repetitive, fragile, and hard to reproduce. Every new dataset means copy-pasting logic, making manual tweaks, and staying constantly vigilant against regressions.

## The Solution

DataForge captures every cleaning step as an ordered, versionable pipeline stored in MySQL. Once your recipe is defined, you can rerun it on fresh data instantly—no notebooks to maintain, no spreadsheets to reconcile.
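Conceptually, a pipeline is just an ordered list of step records that get replayed against a DataFrame. A minimal pandas sketch of that rerun idea (the step layout and op implementations here are illustrative, not DataForge's internals):

```python
import pandas as pd

# Hypothetical step records, as a pipeline might store them (sorted by order).
steps = [
    {"op": "regex_replace",
     "params": {"column": "product_name", "pattern": "SKU-", "replace": ""}},
    {"op": "drop_nulls", "params": {"column": "email"}},
]

def apply_step(df: pd.DataFrame, op: str, params: dict) -> pd.DataFrame:
    # Two illustrative ops; the real implementations live in the package.
    if op == "regex_replace":
        df = df.copy()
        df[params["column"]] = df[params["column"]].str.replace(
            params["pattern"], params["replace"], regex=True
        )
        return df
    if op == "drop_nulls":
        return df.dropna(subset=[params["column"]])
    raise ValueError(f"unknown op: {op}")

def run_pipeline(df: pd.DataFrame, steps: list[dict]) -> pd.DataFrame:
    # Steps are applied strictly in their stored order.
    for step in steps:
        df = apply_step(df, step["op"], step["params"])
    return df

messy = pd.DataFrame({
    "product_name": ["SKU-Widget", "SKU-Gadget"],
    "email": ["a@example.com", None],
})
clean = run_pipeline(messy, steps)
```

Because the recipe lives outside the notebook, replaying it on next month's export is a single call rather than a copy-paste session.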

## Core Features

- **`dataforge profile`**: Instantly profile a CSV to inspect null counts, column types, medians, and more.
- **`dataforge pipeline`**: Create pipelines, append ordered steps, inspect them, rename, or delete outdated flows.
- **`dataforge run`**: Execute an entire pipeline in batch mode to transform messy CSVs into clean outputs.
- **`dataforge run --interactive`**: Step through each operation with Undo support for safe experimentation.

## Tech Stack

Python · Click · Pandas · NumPy · MySQL · scikit-learn

## Installation & Setup

1. **Clone the repository**
   ```bash
   git clone https://github.com/bit2swaz/dataforge-python.git
   cd dataforge-python
   ```
2. **Create a virtual environment**
   ```bash
   python -m venv venv
   source venv/bin/activate
   ```
3. **Install dependencies**
   ```bash
   pip install -r requirements.txt
   ```
4. **Configure environment variables**
   ```bash
   cp .env.example .env
   # Update credentials if necessary
   ```
5. **Prepare MySQL**
   ```sql
   CREATE DATABASE IF NOT EXISTS dataforge;
   CREATE USER 'dataforge_user'@'localhost' IDENTIFIED BY 'your_password';
   GRANT ALL PRIVILEGES ON dataforge.* TO 'dataforge_user'@'localhost';
   FLUSH PRIVILEGES;
   ```
6. **Create the schema**
   ```bash
   python db_init.py
   ```
7. **Install DataForge in editable mode**
   ```bash
   pip install -e .
   ```
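The `.env` file from step 4 presumably carries the MySQL connection settings used by `db_init.py` and the CLI. The variable names below are illustrative only; check `.env.example` for the real ones:

```ini
DB_HOST=localhost
DB_PORT=3306
DB_USER=dataforge_user
DB_PASSWORD=your_password
DB_NAME=dataforge
```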

## Quick Start (Usage)

```bash
# 1. Profile your data
dataforge profile messy_data.csv

# 2. Create a pipeline
dataforge pipeline create "my-first-cleaner"

# 3. Add cleaning steps
dataforge pipeline add "my-first-cleaner" --op "regex_replace" --params '{"column": "product_name", "pattern": "SKU-", "replace": ""}' --order 1
dataforge pipeline add "my-first-cleaner" --op "drop_nulls" --params '{"column": "email"}' --order 2

# 4. Run the pipeline!
dataforge run "my-first-cleaner" -i "messy_data.csv" -o "clean_output.csv"
```

## Available Transformers

- `op_regex_replace`
- `op_drop_nulls`
- `op_fill_nulls`
- `op_rename_column`
- `op_drop_column`
- `op_change_type`
- `op_scale_minmax`
- `op_one_hot_encode`
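Note that the Quick Start passes op names without the `op_` prefix, so the prefix is presumably internal. Each transformer can be thought of as a function that takes a DataFrame plus a params dict and returns a new DataFrame; a rough pandas sketch of two of the simpler ops (signatures and params keys are illustrative, not the package's actual API):

```python
import pandas as pd

def op_fill_nulls(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    # e.g. params = {"column": "age", "value": 0}
    df = df.copy()
    df[params["column"]] = df[params["column"]].fillna(params["value"])
    return df

def op_scale_minmax(df: pd.DataFrame, params: dict) -> pd.DataFrame:
    # Scales a numeric column into [0, 1]; a constant column would need a guard
    # against division by zero.
    df = df.copy()
    col = df[params["column"]]
    df[params["column"]] = (col - col.min()) / (col.max() - col.min())
    return df

df = pd.DataFrame({"age": [10.0, None, 30.0]})
df = op_fill_nulls(df, {"column": "age", "value": 20.0})
df = op_scale_minmax(df, {"column": "age"})
```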

## Running Tests

```bash
pytest
```
