# -*- coding: utf-8 -*-
from setuptools import setup

packages = \
['data_diff', 'data_diff.databases']

package_data = \
{'': ['*']}

install_requires = \
['click>=8.1,<9.0',
 'dsnparse',
 'rich',
 'runtype>=0.2.6,<0.3.0',
 'toml>=0.10.2,<0.11.0']

extras_require = \
{'clickhouse': ['clickhouse-driver'],
 'mysql': ['mysql-connector-python'],
 'postgresql': ['psycopg2'],
 'presto': ['presto-python-client'],
 'snowflake': ['snowflake-connector-python>=2.7.2,<3.0.0', 'cryptography'],
 'trino': ['trino>=0.314.0,<0.315.0']}

entry_points = \
{'console_scripts': ['data-diff = data_diff.__main__:main']}

setup_kwargs = {
    'name': 'data-diff',
    'version': '0.2.6',
    'description': 'Command-line tool and Python library to efficiently diff rows across two different databases.',
    'long_description': '# **data-diff**\n\n**data-diff is in shape to be run in production, but also under development. If\nyou run into issues or bugs, please [open an issue](https://github.com/datafold/data-diff/issues/new/choose) and we\'ll help you out ASAP! You can\nalso find us in `#tools-data-diff` in the [Locally Optimistic Slack][slack].**\n\n**We\'d love to hear about your experience using data-diff, and learn more about your use cases. [Reach out to the product team to share any product feedback or feature requests!](https://calendly.com/jp-toor/customer-interview-oss)**\n\n\n\n**data-diff** is a command-line tool and Python library to efficiently diff\nrows across two different databases.\n\n* ⇄  Verifies across [many different databases][dbs] (e.g. PostgreSQL -> Snowflake)\n* 🔍 Outputs [diff of rows](#example-command-and-output) in detail\n* 🚨 Simple CLI/API to create monitoring and alerts\n* 🔁 Bridges column types of different formats and levels of precision (e.g. Double ⇆ Float ⇆ Decimal)\n* 🔥 Verifies 25M+ rows in <10s, and 1B+ rows in ~5min.\n* ♾️  Works for tables with 10s of billions of rows\n\n**data-diff** splits the table into smaller segments, then checksums each\nsegment in both databases. When the checksums for a segment aren\'t equal, it\nwill further divide that segment into yet smaller segments, checksumming those\nuntil it gets to the differing row(s). See [Technical Explanation][tech-explain] for more\ndetails.\n\nThis approach has performance within an order of magnitude of `count(*)` when\nthere are few/no changes, but is able to output each differing row! By pushing\nthe compute into the databases, it\'s _much_ faster than querying for and\ncomparing every row.\n\n![Performance for 100M rows](https://user-images.githubusercontent.com/97400/175182987-a3900d4e-c097-4732-a4e9-19a40fac8cdc.png)\n\n**†:** The implementation for downloading all rows that `data-diff` and\n`count(*)` is compared to is not optimal. 
It is a single multi-threaded Python\nprocess. The performance is fairly driver-specific, e.g. PostgreSQL\'s driver performs 10x\nbetter than MySQL\'s.\n\n## Table of Contents\n\n- [**data-diff**](#data-diff)\n  - [Table of Contents](#table-of-contents)\n  - [Common use-cases](#common-use-cases)\n  - [Example Command and Output](#example-command-and-output)\n  - [Supported Databases](#supported-databases)\n- [How to install](#how-to-install)\n  - [Install drivers](#install-drivers)\n- [How to use](#how-to-use)\n  - [How to use from the command-line](#how-to-use-from-the-command-line)\n  - [How to use from Python](#how-to-use-from-python)\n- [Technical Explanation](#technical-explanation)\n  - [Performance Considerations](#performance-considerations)\n- [Usage Analytics](#usage-analytics)\n- [Development Setup](#development-setup)\n- [License](#license)\n\n## Common use-cases\n\n* **Verify data migrations.** Verify that all data was copied when doing a\n  critical data migration. For example, migrating from Heroku PostgreSQL to Amazon RDS.\n* **Verify data pipelines.** Moving data from a relational database to a\n  warehouse/data lake with Fivetran, Airbyte, Debezium, or some other pipeline.\n* **Alerting and maintaining data integrity SLOs.** You can create and monitor\n  your SLO of e.g. 99.999% data integrity, and alert your team when data is\n  missing.\n* **Debugging complex data pipelines.** When data gets lost in pipelines that\n  may span a half-dozen systems, without verifying each intermediate datastore\n  it\'s extremely difficult to track down where a row got lost.\n* **Detecting hard deletes for an `updated_at`-based pipeline**. 
If you\'re\n  copying data to your warehouse based on an `updated_at`-style column, then\n  you\'ll miss hard-deletes that **data-diff** can find for you.\n* **Make your replication self-healing.** You can use **data-diff** to\n  self-heal by using the diff output to write/update rows in the target\n  database.\n\n## Example Command and Output\n\nBelow we run a comparison with the CLI for 25M rows in PostgreSQL where the\nright-hand table is missing a single row with `id=12500048`:\n\n```\n$ data-diff \\\n    postgresql://user:password@localhost/database rating \\\n    postgresql://user:password@localhost/database rating_del1 \\\n    --bisection-threshold 100000 \\ # for readability, try default first\n    --bisection-factor 6 \\ # for readability, try default first\n    --update-column timestamp \\\n    --verbose\n\n    # Consider running with --interactive the first time.\n    # Runs `EXPLAIN` for you to verify the queries are using indexes.\n    # --interactive\n[10:15:00] INFO - Diffing tables | segments: 6, bisection threshold: 100000.\n[10:15:00] INFO - . Diffing segment 1/6, key-range: 1..4166683, size: 4166682\n[10:15:03] INFO - . Diffing segment 2/6, key-range: 4166683..8333365, size: 4166682\n[10:15:06] INFO - . Diffing segment 3/6, key-range: 8333365..12500047, size: 4166682\n[10:15:09] INFO - . Diffing segment 4/6, key-range: 12500047..16666729, size: 4166682\n[10:15:12] INFO - . . Diffing segment 1/6, key-range: 12500047..13194494, size: 694447\n[10:15:13] INFO - . . . Diffing segment 1/6, key-range: 12500047..12615788, size: 115741\n[10:15:13] INFO - . . . . Diffing segment 1/6, key-range: 12500047..12519337, size: 19290\n[10:15:13] INFO - . . . . Diff found 1 different rows.\n[10:15:13] INFO - . . . . Diffing segment 2/6, key-range: 12519337..12538627, size: 19290\n[10:15:13] INFO - . . . . Diffing segment 3/6, key-range: 12538627..12557917, size: 19290\n[10:15:13] INFO - . . . . 
Diffing segment 4/6, key-range: 12557917..12577207, size: 19290\n[10:15:13] INFO - . . . . Diffing segment 5/6, key-range: 12577207..12596497, size: 19290\n[10:15:13] INFO - . . . . Diffing segment 6/6, key-range: 12596497..12615788, size: 19291\n[10:15:13] INFO - . . . Diffing segment 2/6, key-range: 12615788..12731529, size: 115741\n[10:15:13] INFO - . . . Diffing segment 3/6, key-range: 12731529..12847270, size: 115741\n[10:15:13] INFO - . . . Diffing segment 4/6, key-range: 12847270..12963011, size: 115741\n[10:15:14] INFO - . . . Diffing segment 5/6, key-range: 12963011..13078752, size: 115741\n[10:15:14] INFO - . . . Diffing segment 6/6, key-range: 13078752..13194494, size: 115742\n[10:15:14] INFO - . . Diffing segment 2/6, key-range: 13194494..13888941, size: 694447\n[10:15:14] INFO - . . Diffing segment 3/6, key-range: 13888941..14583388, size: 694447\n[10:15:15] INFO - . . Diffing segment 4/6, key-range: 14583388..15277835, size: 694447\n[10:15:15] INFO - . . Diffing segment 5/6, key-range: 15277835..15972282, size: 694447\n[10:15:15] INFO - . . Diffing segment 6/6, key-range: 15972282..16666729, size: 694447\n+ (12500048, 1268104625)\n[10:15:16] INFO - . Diffing segment 5/6, key-range: 16666729..20833411, size: 4166682\n[10:15:19] INFO - . 
Diffing segment 6/6, key-range: 20833411..25000096, size: 4166685\n```\n\n## Supported Databases\n\n| Database      | Connection string                                                                                                                   | Status |\n|---------------|-------------------------------------------------------------------------------------------------------------------------------------|--------|\n| PostgreSQL >=10    | `postgresql://<user>:<password>@<host>:5432/<database>`                                                                             |  💚    |\n| MySQL         | `mysql://<user>:<password>@<hostname>:3306/<database>`                                                                              |  💚    |\n| Snowflake     | `"snowflake://<user>[:<password>]@<account>/<database>/<SCHEMA>?warehouse=<WAREHOUSE>&role=<role>[&authenticator=externalbrowser]"` |  💚    |\n| Oracle        | `oracle://<username>:<password>@<hostname>/database`                                                                                |  💛    |\n| BigQuery      | `bigquery://<project>/<dataset>`                                                                                                    |  💛    |\n| Redshift      | `redshift://<username>:<password>@<hostname>:5439/<database>`                                                                       |  💛    |\n| Presto        | `presto://<username>:<password>@<hostname>:8080/<database>`                                                                         |  💛    |\n| Databricks    | `databricks://<http_path>:<access_token>@<server_hostname>/<catalog>/<schema>`                                                      |  💛    |\n| Trino         | `trino://<username>:<password>@<hostname>:8080/<database>`                                                                          |  💛    |\n| Clickhouse    | `clickhouse://<username>:<password>@<hostname>:9000/<database>`                                         
                            |  💛    |\n| Vertica       | `vertica://<username>:<password>@<hostname>:5433/<database>`                                                                        |  💛    |\n| ElasticSearch |                                                                                                                                     |  📝    |\n| Planetscale   |                                                                                                                                     |  📝    |\n| Pinot         |                                                                                                                                     |  📝    |\n| Druid         |                                                                                                                                     |  📝    |\n| Kafka         |                                                                                                                                     |  📝    |\n\n* 💚: Implemented and thoroughly tested.\n* 💛: Implemented, but not thoroughly tested yet.\n* ⏳: Implementation in progress.\n* 📝: Implementation planned. Contributions welcome.\n\nIf a database is not on the list, we\'d still love to support it. Open an issue\nto discuss it.\n\nNote: Because URLs allow many special characters, and may collide with the syntax of your command-line,\nit\'s recommended to surround them with quotes. 
Alternatively, you may provide them in a TOML file via the `--conf` option.\n\n\n# How to install\n\nRequires Python 3.7+ with pip.\n\n```pip install data-diff```\n\n## Install drivers\n\nTo connect to a database, we need to have its driver installed, in the form of a Python library.\n\nWhile you may install them manually, we offer an easy way to install them along with data-diff<sup>*</sup>:\n\n- `pip install \'data-diff[mysql]\'`\n\n- `pip install \'data-diff[postgresql]\'`\n\n- `pip install \'data-diff[snowflake]\'`\n\n- `pip install \'data-diff[presto]\'`\n\n- `pip install \'data-diff[oracle]\'`\n\n- `pip install \'data-diff[trino]\'`\n\n- `pip install \'data-diff[clickhouse]\'`\n\n- `pip install \'data-diff[vertica]\'`\n\n- For BigQuery, see: https://pypi.org/project/google-cloud-bigquery/\n\n\nUsers can also install several drivers at once:\n\n```pip install \'data-diff[mysql,postgresql,snowflake]\'```\n\n_<sup>*</sup> Some drivers have dependencies that cannot be installed using `pip` and still need to be installed manually._\n\n\n### Install Psycopg2\n\nTo connect to PostgreSQL, you\'ll need `psycopg2`. This Python package requires some additional dependencies described in their [documentation](https://www.psycopg.org/docs/install.html#build-prerequisites).\nAn easy solution is to install [psycopg2-binary](https://www.psycopg.org/docs/install.html#quick-install) by running:\n\n```pip install psycopg2-binary```\n\nThis package comes with a pre-compiled binary and does not require additional prerequisites. However, note that for production use it is advised to use `psycopg2`.\n\n\n# How to use\n\n## How to use from the command-line\n\nUsage: `data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]`\n\nSee the [example command](#example-command-and-output) and the [sample\nconnection strings](#supported-databases).\n\nNote that for some databases, the arguments that you enter in the command line\nmay be case-sensitive. 
This is the case for the Snowflake schema and table names.\n\nOptions:\n\n  - `--help` - Show help message and exit.\n  - `-k` or `--key-column` - Name of the primary key column\n  - `-t` or `--update-column` - Name of updated_at/last_updated column\n  - `-c` or `--columns` - Names of extra columns to compare.  Can be used more than once in the same command.\n                          Accepts a name or a pattern like in SQL.\n                          Example: `-c col% -c another_col -c %foorb.r%`\n  - `-l` or `--limit` - Maximum number of differences to find (limits maximum bandwidth and runtime)\n  - `-s` or `--stats` - Print stats instead of a detailed diff\n  - `-d` or `--debug` - Print debug info\n  - `-v` or `--verbose` - Print extra info\n  - `-i` or `--interactive` - Confirm queries, implies `--debug`\n  - `--json` - Print JSONL output for machine readability\n  - `--min-age` - Considers only rows older than specified. Useful for specifying replication lag.\n                  Example: `--min-age=5min` ignores rows from the last 5 minutes.\n                  Valid units: `d, days, h, hours, min, minutes, mon, months, s, seconds, w, weeks, y, years`\n  - `--max-age` - Considers only rows younger than specified. See `--min-age`.\n  - `--bisection-factor` - Segments per iteration. When set to 2, it performs binary search.\n  - `--bisection-threshold` - Minimal bisection threshold. i.e. maximum size of pages to diff locally.\n  - `-j` or `--threads` - Number of worker threads to use per database. Default=1.\n  - `-w`, `--where` - An additional \'where\' expression to restrict the search space.\n  - `--conf`, `--run` - Specify the run and configuration from a TOML file. (see below)\n  - `--no-tracking` - data-diff sends home anonymous usage data. 
Use this to disable it.\n\n\n### How to use with a configuration file\n\nData-diff lets you load the configuration for a run from a TOML file.\n\nReasons to use a configuration file:\n\n- Convenience - Set up the parameters for diffs that need to run often\n\n- Easier and more readable - you can define the database connection settings as config values, instead of in a URI.\n\n- Gives you fine-grained control over the settings switches, without requiring any Python code.\n\nUse `--conf` to specify the path to the configuration file. data-diff will load the settings from `run.default`, if it\'s defined.\n\nThen you can, optionally, use `--run` to choose to load the settings of a specific run, and override the settings of `run.default`. (All runs extend `run.default`, like inheritance.)\n\nFinally, CLI switches have the final say, and will override the settings defined by the configuration file, and the current run.\n\nExample TOML file:\n\n```toml\n# Specify the connection params to the test database.\n[database.test_postgresql]\ndriver = "postgresql"\nuser = "postgres"\npassword = "Password1"\n\n# Specify the default run params\n[run.default]\nupdate_column = "timestamp"\nverbose = true\n\n# Specify params for a run \'test_diff\'.\n[run.test_diff]\nverbose = false\n# Source 1 ("left")\n1.database = "test_postgresql"                      # Use options from database.test_postgresql\n1.table = "rating"\n# Source 2 ("right")\n2.database = "postgresql://postgres:Password1@/"    # Use URI like in the CLI\n2.table = "rating_del1"\n```\n\nIn this example, running `data-diff --conf myconfig.toml --run test_diff` will compare between `rating` and `rating_del1`.\nIt will use the `timestamp` column as the update column, as specified in `run.default`. 
However, it won\'t be verbose, since that\nflag is overridden to `false`.\n\nRunning it with `data-diff --conf myconfig.toml --run test_diff -v` will set verbose back to `true`.\n\n\n## How to use from Python\n\nAPI reference: [https://data-diff.readthedocs.io/en/latest/](https://data-diff.readthedocs.io/en/latest/)\n\nExample:\n\n```python\n# Optional: Set logging to display the progress of the diff\nimport logging\nlogging.basicConfig(level=logging.INFO)\n\nfrom data_diff import connect_to_table, diff_tables\n\ntable1 = connect_to_table("postgresql:///", "table_name", "id")\ntable2 = connect_to_table("mysql:///", "table_name", "id")\n\nfor different_row in diff_tables(table1, table2):\n    plus_or_minus, columns = different_row\n    print(plus_or_minus, columns)\n```\n\nRun `help(diff_tables)` or [read the docs](https://data-diff.readthedocs.io/en/latest/) to learn about the different options.\n\n# Technical Explanation\n\nIn this section we\'ll be doing a walk-through of exactly how **data-diff**\nworks, and how to tune `--bisection-factor` and `--bisection-threshold`.\n\nLet\'s consider a scenario with an `orders` table with 1M rows. Fivetran is\nreplicating it continuously from PostgreSQL to Snowflake:\n\n```\n┌─────────────┐                        ┌─────────────┐\n│ PostgreSQL  │                        │  Snowflake  │\n├─────────────┤                        ├─────────────┤\n│             │                        │             │\n│             │                        │             │\n│             │  ┌─────────────┐       │ table with  │\n│ table with  ├──┤ replication ├──────▶│ ?maybe? all │\n│lots of rows!│  └─────────────┘       │  the same   │\n│             │                        │    rows.    
│\n│             │                        │             │\n│             │                        │             │\n│             │                        │             │\n└─────────────┘                        └─────────────┘\n```\n\nIn order to check whether the two tables are the same, **data-diff** splits\nthe table into `--bisection-factor=10` segments.\n\nWe also have to choose which columns we want to checksum. In our case, we care\nabout the primary key, `--key-column=id` and the update column\n`--update-column=updated_at`. `updated_at` is updated every time the row is, and\nwe have an index on it.\n\n**data-diff** starts by querying both databases for the `min(id)` and `max(id)`\nof the table. Then it splits the table into `--bisection-factor=10` segments of\n`1M/10 = 100K` keys each:\n\n```\n┌──────────────────────┐              ┌──────────────────────┐\n│     PostgreSQL       │              │      Snowflake       │\n├──────────────────────┤              ├──────────────────────┤\n│      id=1..100k      │              │      id=1..100k      │\n├──────────────────────┤              ├──────────────────────┤\n│    id=100k..200k     │              │    id=100k..200k     │\n├──────────────────────┤              ├──────────────────────┤\n│    id=200k..300k     ├─────────────▶│    id=200k..300k     │\n├──────────────────────┤              ├──────────────────────┤\n│    id=300k..400k     │              │    id=300k..400k     │\n├──────────────────────┤              ├──────────────────────┤\n│         ...          │              │         ...          
│\n├──────────────────────┤              ├──────────────────────┤\n│     id=900k..1M      │              │     id=900k..1M      │\n└───────────────────▲──┘              └▲─────────────────────┘\n                    ┃                  ┃\n                    ┃                  ┃\n                    ┃ checksum queries ┃\n                    ┃                  ┃\n                  ┌─┻──────────────────┻────┐\n                  │        data-diff        │\n                  └─────────────────────────┘\n```\n\nNow **data-diff** will start running `--threads=1` queries in parallel that\nchecksum each segment. The queries for checksumming each segment will look\nsomething like this, depending on the database:\n\n```sql\nSELECT count(*),\n    sum(cast(conv(substring(md5(concat(cast(id as char), cast(timestamp as char))), 18), 16, 10) as unsigned))\nFROM `rating_del1`\nWHERE (id >= 1) AND (id < 100000)\n```\n\nThis keeps the amount of data that has to be transferred between the databases\nto a minimum, making it very performant! Additionally, if you have an index on\n`updated_at` (highly recommended) then the query will be fast as the database\nonly has to do a partial index scan between `id=1..100k`.\n\nIf you are not sure whether the queries are using an index, you can run it with\n`--interactive`. This puts **data-diff** in interactive mode where it shows an\n`EXPLAIN` before executing each query, requiring confirmation to proceed.\n\nAfter running the checksum queries on both sides, we see that all segments\nare the same except `id=100k..200k`:\n\n```\n┌──────────────────────┐              ┌──────────────────────┐\n│     PostgreSQL       │              │      Snowflake       │\n├──────────────────────┤              ├──────────────────────┤\n│    checksum=0102     │              │    checksum=0102     │\n├──────────────────────┤   mismatch!  
├──────────────────────┤\n│    checksum=ffff     ◀──────────────▶    checksum=aaab     │\n├──────────────────────┤              ├──────────────────────┤\n│    checksum=abab     │              │    checksum=abab     │\n├──────────────────────┤              ├──────────────────────┤\n│    checksum=f0f0     │              │    checksum=f0f0     │\n├──────────────────────┤              ├──────────────────────┤\n│         ...          │              │         ...          │\n├──────────────────────┤              ├──────────────────────┤\n│    checksum=9494     │              │    checksum=9494     │\n└──────────────────────┘              └──────────────────────┘\n```\n\nNow **data-diff** will do exactly as it just did for the _whole table_ for only\nthis segment: Split it into `--bisection-factor` segments.\n\nHowever, this time, because each segment has `100k/10=10k` entries, which is\nless than the `--bisection-threshold` it will pull down every row in the segment\nand compare them in memory in **data-diff**.\n\n```\n┌──────────────────────┐              ┌──────────────────────┐\n│     PostgreSQL       │              │      Snowflake       │\n├──────────────────────┤              ├──────────────────────┤\n│    id=100k..110k     │              │    id=100k..110k     │\n├──────────────────────┤              ├──────────────────────┤\n│    id=110k..120k     │              │    id=110k..120k     │\n├──────────────────────┤              ├──────────────────────┤\n│    id=120k..130k     │              │    id=120k..130k     │\n├──────────────────────┤              ├──────────────────────┤\n│    id=130k..140k     │              │    id=130k..140k     │\n├──────────────────────┤              ├──────────────────────┤\n│         ...          │              │         ...          
│\n├──────────────────────┤              ├──────────────────────┤\n│      190k..200k      │              │      190k..200k      │\n└──────────────────────┘              └──────────────────────┘\n```\n\nFinally **data-diff** will output the `(id, updated_at)` for each row that was different:\n\n```\n(122001, 1653672821)\n```\n\nIf you pass `--stats` you\'ll see e.g. what % of rows were different.\n\n## Performance Considerations\n\n* Ensure that you have indexes on the columns you are comparing. Preferably a\n  compound index. You can run with `--interactive` to see an `EXPLAIN` for the\n  queries.\n* Consider increasing the number of simultaneous threads executing\n  queries per database with `--threads`. For databases that limit concurrency\n  per query, e.g. PostgreSQL/MySQL, this can improve performance dramatically.\n* If you are only interested in _whether_ something changed, pass `--limit 1`.\n  This can be useful if changes are very rare. This is often faster than doing a\n  `count(*)`, for the reason mentioned above.\n* If the table is _very_ large, consider a larger `--bisection-factor`. Explained in\n  the [technical explanation][tech-explain]. Otherwise you may run into timeouts.\n* If there are a lot of changes, consider a larger `--bisection-threshold`.\n  Explained in the [technical explanation][tech-explain].\n* If there are very large gaps in your key column, e.g. 10s of millions of\n  continuous rows missing, then **data-diff** may perform poorly doing lots of\n  queries for ranges of rows that do not exist (see [technical\n  explanation][tech-explain]). We have ideas on how to tackle this issue, which we have\n  yet to implement. If you\'re experiencing this effect, please open an issue and we\n  will prioritize it.\n* The fewer columns you verify (passed with `--columns`), the faster\n  **data-diff** will be. On one extreme you can verify every column, on the\n  other you can verify _only_ `updated_at`, if you trust it enough. 
You can also\n  _only_ verify `id` if you\'re interested in only presence, e.g. to detect\n  missing hard deletes. You can also do a hybrid where you verify\n  `updated_at` and the most critical value, e.g. a money value in `amount` but\n  not verify a large serialized column like `json_settings`.\n* We have ideas for making **data-diff** even faster that\n  we haven\'t implemented yet: faster checksums by reducing type-casts\n  and using a faster hash than MD5, dynamic adaptation of\n  `bisection_factor`/`threads`/`bisection_threshold` (especially with large key\n  gaps), and improvements to bypass Python/driver performance limitations when\n  comparing huge amounts of rows locally (i.e. for very high `bisection_threshold` values).\n\n# Usage Analytics\n\ndata-diff collects anonymous usage data to help our team improve the tool and to apply development efforts to where our users need them most.\n\nWe capture two events, one when the data-diff run starts and one when it is finished. No user data or potentially sensitive information is or ever will be collected. 
The captured data is limited to:\n\n- Operating System and Python version\n\n- Types of databases used (postgresql, mysql, etc.)\n\n- Sizes of tables diffed, run time, and diff row count (numbers only)\n\n- Error message, if any, truncated to the first 20 characters.\n\n- A persistent UUID to identify the session, stored in `~/.datadiff.toml`\n\nIf you do not wish to participate, the tracking can be easily disabled with one of the following methods:\n\n* In the CLI, use the `--no-tracking` flag.\n\n* In the config file, set `no_tracking = true` (for example, under `[run.default]`)\n\n* If you\'re using the Python API:\n\n```python\nimport data_diff\ndata_diff.disable_tracking()    # Call this first, before making any API calls\n\n# Connect and diff your tables without any tracking\n```\n\n\n# Development Setup\n\nThe development setup centers around using `docker-compose` to boot up various\ndatabases, and then inserting data into them.\n\nOn Mac, for better Docker performance, we suggest enabling the following in the UI:\n\n* Use new Virtualization Framework\n* Enable VirtioFS accelerated directory sharing\n\n**1. Install Data Diff**\n\nWhen developing/debugging, it\'s recommended to install dependencies and run it\ndirectly with `poetry` rather than go through the package.\n\n```\n$ brew install mysql postgresql # MacOS dependencies for C bindings\n$ apt-get install libpq-dev libmysqlclient-dev # Debian dependencies\n\n$ pip install poetry # Python dependency isolation tool\n$ poetry install # Install dependencies\n```\n**2. Start Databases**\n\n[Install **docker-compose**][docker-compose] if you haven\'t already.\n\n```shell-session\n$ docker-compose up -d mysql postgres # run mysql and postgres dbs in background\n```\n\n[docker-compose]: https://docs.docker.com/compose/install/\n\n**3. 
Run Unit Tests**\n\nThere are more than 1000 tests for all the different type and database\ncombinations, so we recommend using `unittest-parallel` that\'s installed as a\ndevelopment dependency.\n\n```shell-session\n$ poetry run unittest-parallel -j 16 #  run all tests\n$ poetry run python -m unittest -k <test> #  run individual test\n```\n\n**4. Seed the Database(s) (optional)**\n\nFirst, download the CSVs of seeding data:\n\n```shell-session\n$ curl https://datafold-public.s3.us-west-2.amazonaws.com/1m.csv -o dev/ratings.csv\n\n# For a larger data-set (but takes 25x longer to import):\n# - curl https://datafold-public.s3.us-west-2.amazonaws.com/25m.csv -o dev/ratings.csv\n```\n\nNow you can insert it into the testing database(s):\n\n```shell-session\n# It\'s optional to seed more than one to run data-diff(1) against.\n$ poetry run preql -f dev/prepare_db.pql mysql://mysql:Password1@127.0.0.1:3306/mysql\n$ poetry run preql -f dev/prepare_db.pql postgresql://postgres:Password1@127.0.0.1:5432/postgres\n\n# Cloud databases\n$ poetry run preql -f dev/prepare_db.pql snowflake://<uri>\n$ poetry run preql -f dev/prepare_db.pql mssql://<uri>\n$ poetry run preql -f dev/prepare_db.pql bigquery:///<project>\n```\n\n**5. Run **data-diff** against seeded database (optional)**\n\n```bash\npoetry run python3 -m data_diff postgresql://postgres:Password1@localhost/postgres rating postgresql://postgres:Password1@localhost/postgres rating_del1 --verbose\n```\n\n**6. 
Run benchmarks (optional)**\n\n```shell-session\n$ dev/benchmark.sh #  runs benchmarks and puts results in benchmark_<sha>.csv\n$ poetry run python3 dev/graph.py #  create graphs from benchmark_*.csv files\n```\n\nYou can adjust how many rows we benchmark with by passing `N_SAMPLES` to `dev/benchmark.sh`:\n\n```shell-session\n$ N_SAMPLES=100000000 dev/benchmark.sh #  100m which is our canonical target\n```\n\n\n# License\n\n[MIT License](https://github.com/datafold/data-diff/blob/master/LICENSE)\n\n[dbs]: #supported-databases\n[tech-explain]: #technical-explanation\n[perf]: #performance-considerations\n[slack]: https://locallyoptimistic.com/community/\n',
    'author': 'Datafold',
    'author_email': 'data-diff@datafold.com',
    'maintainer': 'None',
    'maintainer_email': 'None',
    'url': 'https://github.com/datafold/data-diff',
    'packages': packages,
    'package_data': package_data,
    'install_requires': install_requires,
    'extras_require': extras_require,
    'entry_points': entry_points,
    'python_requires': '>=3.7,<4.0',
}


setup(**setup_kwargs)
