Metadata-Version: 2.1
Name: data-diff
Version: 0.8.2
Summary: Command-line tool and Python library to efficiently diff rows across two different databases.
Home-page: https://github.com/datafold/data-diff
License: MIT
Author: Datafold
Author-email: data-diff@datafold.com
Requires-Python: >=3.7.2,<4.0.0
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Topic :: Database :: Database Engines/Servers
Classifier: Typing :: Typed
Provides-Extra: clickhouse
Provides-Extra: duckdb
Provides-Extra: mysql
Provides-Extra: oracle
Provides-Extra: postgresql
Provides-Extra: preql
Provides-Extra: presto
Provides-Extra: redshift
Provides-Extra: snowflake
Provides-Extra: trino
Provides-Extra: vertica
Requires-Dist: click (>=8.1,<9.0)
Requires-Dist: clickhouse-driver ; extra == "clickhouse"
Requires-Dist: cryptography ; extra == "snowflake"
Requires-Dist: cx_Oracle ; extra == "oracle"
Requires-Dist: dbt-artifacts-parser (>=0.4.0,<0.5.0)
Requires-Dist: dbt-core (>=1.0.0,<2.0.0)
Requires-Dist: dsnparse (<0.2.0)
Requires-Dist: duckdb ; extra == "duckdb"
Requires-Dist: keyring
Requires-Dist: mysql-connector-python (==8.0.29) ; extra == "mysql"
Requires-Dist: preql (>=0.2.19,<0.3.0) ; extra == "preql"
Requires-Dist: presto-python-client ; extra == "presto"
Requires-Dist: psycopg2 ; extra == "postgresql" or extra == "redshift"
Requires-Dist: rich
Requires-Dist: runtype (>=0.2.6,<0.3.0)
Requires-Dist: snowflake-connector-python (>=3.0.2,<4.0.0) ; extra == "snowflake"
Requires-Dist: tabulate (>=0.9.0,<0.10.0)
Requires-Dist: toml (>=0.10.2,<0.11.0)
Requires-Dist: trino (>=0.314.0,<0.315.0) ; extra == "trino"
Requires-Dist: urllib3 (<2)
Requires-Dist: vertica-python ; extra == "vertica"
Project-URL: Repository, https://github.com/datafold/data-diff
Description-Content-Type: text/markdown

<p align="left">
    <a href="https://datafold.com/"><img alt="Datafold" src="https://user-images.githubusercontent.com/1799931/196497110-d3de1113-a97f-4322-b531-026d859b867a.png" width="30%" /></a>
</p>

<h1 align="left">
data-diff: compare datasets fast, within or across SQL databases
</h1>

<br>

# How it works

When comparing the data, `data-diff` utilizes the resources of the underlying databases as much as possible. It has two primary modes of comparison:

## joindiff
- Recommended for comparing data within the same database
- Uses the outer join operation to diff the rows as efficiently as possible within the same database
- Fully relies on the underlying database engine for computation
- Requires both datasets to be queryable with a single SQL query
- Time complexity approximates JOIN operation and is largely independent of the number of differences in the dataset
  
## hashdiff
- Recommended for comparing datasets across different databases
- Can also be helpful in diffing very large tables with few expected differences within the same database
- Employs a divide-and-conquer algorithm based on hashing and binary search
- Can diff data across distinct database engines, e.g., PostgreSQL <> Snowflake
- Time complexity approximates COUNT(*) operation when there are few differences
- Performance degrades when datasets have a large number of differences

More information about the algorithm and performance considerations can be found [here](https://github.com/datafold/data-diff/blob/master/docs/technical-explanation.md)

# Get started

Install `data-diff` with specific database adapters, e.g.:

```
pip install data-diff 'data-diff[postgresql,snowflake]' -U
```

Run `data-diff` with connection URIs. In the following example, we compare tables between PostgreSQL and Snowflake using hashdiff algorithm:
```
data-diff \
  postgresql://<username>:'<password>'@localhost:5432/<database> \
  <table> \
  "snowflake://<username>:<password>@<password>/<DATABASE>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<ROLE>" \
  <TABLE> \
  -k <primary key column> \
  -c <columns to compare> \
  -w <filter condition>
```

Check out [documentation](https://docs.datafold.com/reference/open_source/cli) for the full command reference.


# Use cases

## Data Migration & Replication Testing
Compare source to target and check for discrepancies when moving data between systems:
- Migrating to a new data warehouse (e.g., Oracle > Snowflake)
- Converting SQL to a new transformation framework (e.g., stored procedures > dbt)
- Continuously replicating data from an OLTP DB to OLAP DWH (e.g., MySQL > Redshift)


## Data Development Testing
Test SQL code and preview changes by comparing development/staging environment data to production:
1. Make a change to some SQL code
2. Run the SQL code to create a new dataset
3. Compare the dataset with its production version or another iteration

  <p align="left">
  <img alt="dbt" src="https://seeklogo.com/images/D/dbt-logo-E4B0ED72A2-seeklogo.com.png" width="10%" />
  </p>
  
`data-diff` integrates with dbt Core and dbt Cloud to seamlessly compare local development to production datasets. 

:eyes: **Watch [4-min demo video](https://www.loom.com/share/ad3df969ba6b4298939efb2fbcc14cde)**

**[Get started with data-diff & dbt](https://docs.datafold.com/development_testing/open_source)**

Also available in a [VS Code Extension](https://marketplace.visualstudio.com/items?itemName=Datafold.datafold-vscode)

Reach out on the dbt Slack in [#tools-datafold](https://getdbt.slack.com/archives/C03D25A92UU) for advice and support

# Supported databases


| Database      | Status | Connection string |
|---------------|-------------------------------------------------------------------------------------------------------------------------------------|--------|
| PostgreSQL >=10 |  💚    | `postgresql://<user>:<password>@<host>:5432/<database>`                                                                        |
| MySQL         |  💚    | `mysql://<user>:<password>@<hostname>:5432/<database>`                                                                              |
| Snowflake     |  💚    | `"snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"` |
| BigQuery      |  💚    | `bigquery://<project>/<dataset>`                                                                                                    |
| Redshift      |  💚    | `redshift://<username>:<password>@<hostname>:5439/<database>`                                                                       |
| Oracle        |  💛    | `oracle://<username>:<password>@<hostname>/database`                                                                                |
| Presto        |  💛    | `presto://<username>:<password>@<hostname>:8080/<database>`                                                                         |
| Databricks    |  💛    | `databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>`                                                      |
| Trino         |  💛    | `trino://<username>:<password>@<hostname>:8080/<database>`                                                                          |
| Clickhouse    |  💛    | `clickhouse://<username>:<password>@<hostname>:9000/<database>`                                                                     |
| Vertica       |  💛    | `vertica://<username>:<password>@<hostname>:5433/<database>`                                                                        |
| DuckDB        |  💛    |                                                                                                                                     |
| ElasticSearch |  📝    |                                                                                                                                     |
| Planetscale   |  📝    |                                                                                                                                     |
| Pinot         |  📝    |                                                                                                                                     |
| Druid         |  📝    |                                                                                                                                     |
| Kafka         |  📝    |                                                                                                                                     |
| SQLite        |  📝    |                                                                                                                                     |

* 💚: Implemented and thoroughly tested.
* 💛: Implemented, but not thoroughly tested yet.
* ⏳: Implementation in progress.
* 📝: Implementation planned. Contributions welcome.

Your database not listed here?

- Contribute a [new database adapter](https://github.com/datafold/data-diff/blob/master/docs/new-database-driver-guide.rst) – we accept pull requests!
- [Get in touch](https://www.datafold.com/demo) about enterprise support and adding new adapters and features


<br>

## Contributors

We thank everyone who contributed so far!

<a href="https://github.com/datafold/data-diff/graphs/contributors">
  <img src="https://contributors-img.web.app/image?repo=datafold/data-diff" />
</a>

<br>

## Analytics

* [Usage Analytics & Data Privacy](https://github.com/datafold/data-diff/blob/master/docs/usage_analytics.md)

<br>

## License

This project is licensed under the terms of the [MIT License](https://github.com/datafold/data-diff/blob/master/LICENSE).

