Metadata-Version: 2.1
Name: python-dlt
Version: 0.2.0a9
Summary: DLT is an open-source python-native scalable data loading framework that does not require any devops efforts to run.
Home-page: https://github.com/dlt-hub
License: Apache-2.0
Keywords: etl
Author: dltHub Inc.
Author-email: services@dlthub.com
Maintainer: Marcin Rudolf
Maintainer-email: marcin@dlthub.com
Requires-Python: >=3.8,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Software Development :: Libraries
Provides-Extra: bigquery
Provides-Extra: dbt
Provides-Extra: gcp
Provides-Extra: postgres
Provides-Extra: redshift
Requires-Dist: PyYAML (>=5.4.1,<6.0.0)
Requires-Dist: SQLAlchemy (>=1.3.5,<2.0.0)
Requires-Dist: astunparse (>=1.6.3,<2.0.0)
Requires-Dist: asyncstdlib (>=3.10.5,<4.0.0)
Requires-Dist: cachetools (>=5.2.0,<6.0.0)
Requires-Dist: click (>=8.1.3,<9.0.0)
Requires-Dist: cron-descriptor (>=1.2.32,<2.0.0)
Requires-Dist: dbt-bigquery (>=1.0.0,<1.2.0); (python_version < "3.11") and (extra == "dbt")
Requires-Dist: dbt-core (>=1.1.0,<1.2.0); extra == "dbt"
Requires-Dist: dbt-redshift (>=1.0.0,<1.2.0); extra == "dbt"
Requires-Dist: gitpython (>=3.1.29,<4.0.0)
Requires-Dist: google-cloud-bigquery (>=2.26.0,<3.0.0); (python_version < "3.11") and (extra == "gcp" or extra == "bigquery")
Requires-Dist: google-cloud-bigquery-storage (>=2.13.0,<3.0.0); (python_version < "3.11") and (extra == "gcp" or extra == "bigquery")
Requires-Dist: grpcio (>=1.50.0,<2.0.0); (python_version < "3.11") and (extra == "gcp" or extra == "bigquery")
Requires-Dist: hexbytes (>=0.2.2,<0.3.0)
Requires-Dist: humanize (>=4.4.0,<5.0.0)
Requires-Dist: json-logging (==1.4.1rc0)
Requires-Dist: jsonlines (>=2.0.0,<3.0.0)
Requires-Dist: makefun (>=1.15.0,<2.0.0)
Requires-Dist: pathvalidate (>=2.5.2,<3.0.0)
Requires-Dist: pendulum (>=2.1.2,<3.0.0)
Requires-Dist: pipdeptree (>=2.3.3,<3.0.0)
Requires-Dist: prometheus-client (>=0.11.0,<0.12.0)
Requires-Dist: psycopg2-binary (>=2.9.1,<3.0.0); extra == "postgres" or extra == "redshift"
Requires-Dist: psycopg2cffi (>=2.9.0,<3.0.0); (platform_python_implementation == "PyPy") and (extra == "postgres" or extra == "redshift")
Requires-Dist: pyarrow (>=8.0.0,<9.0.0); extra == "gcp" or extra == "bigquery"
Requires-Dist: pytz (>=2022.6,<2023.0)
Requires-Dist: requests (>=2.26.0,<3.0.0)
Requires-Dist: requirements-parser (>=0.5.0,<0.6.0)
Requires-Dist: semver (>=2.13.0,<3.0.0)
Requires-Dist: sentry-sdk (>=1.4.3,<2.0.0)
Requires-Dist: setuptools (>=65.6.0,<66.0.0)
Requires-Dist: simplejson (>=3.17.5,<4.0.0)
Requires-Dist: tomlkit (>=0.11.3,<0.12.0)
Requires-Dist: typing-extensions (>=4.0.0,<5.0.0)
Requires-Dist: tzdata (>=2022.1,<2023.0)
Project-URL: Repository, https://github.com/dlt-hub/dlt
Description-Content-Type: text/markdown

# Quickstart Guide: Data Load Tool (DLT)

## **TL;DR: This guide shows you how to load a JSON document into Google BigQuery using DLT.**

![](docs/DLT-Pacman-Big.gif)

*Please open a pull request [here](https://github.com/scale-vector/dlt/edit/master/QUICKSTART.md) if there is something you can improve about this quickstart.*

## Grab the demo

Clone the example repository:
```
git clone https://github.com/scale-vector/dlt-quickstart-example.git
```

Enter the directory:
```
cd dlt-quickstart-example
```

Open the files in your favorite IDE / text editor:
- `data.json` (the JSON document you will load)
- `credentials.json` (the credentials for our demo Google BigQuery warehouse)
- `quickstart.py` (the script that uses DLT)

## Set up a virtual environment

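Ensure you are using Python 3.8, 3.9, or 3.10 (the package itself supports 3.8 and later, but the BigQuery dependencies used in this guide require a version below 3.11):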
```
python3 --version
```
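
The same check can be done from Python. This is a minimal sketch, assuming the bounds declared in the package metadata (`Requires-Python: >=3.8,<4.0`, with the BigQuery dependencies marked `python_version < "3.11"`). The helper name `supports_bigquery_extras` is hypothetical and not part of DLT.

```python
import sys


def supports_bigquery_extras(version=None):
    """Return True if the (major, minor) version can install the BigQuery extras.

    The bounds are taken from the package metadata; this helper is illustrative.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 8) <= major_minor < (3, 11)


if __name__ == "__main__":
    print("BigQuery extras supported:", supports_bigquery_extras())
```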

Create a new virtual environment:
```
python3 -m venv ./env
```

Activate the virtual environment:
```
source ./env/bin/activate
```

## Install DLT and support for the target data warehouse

Install DLT using pip:
```
pip3 install -U python-dlt
```

Install support for Google BigQuery:
```
pip3 install -U "python-dlt[gcp]"
```

## Understanding the code

1. Configure DLT

2. Create a DLT pipeline

3. Load the data from the JSON document

4. Pass the data to the DLT pipeline

5. Use DLT to load the data
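
Step 3 needs only the standard library. The sketch below assumes `data.json` holds a list of records shaped like the query results shown later in this guide; the sample values are reconstructed from those results and may not match the real file exactly. The DLT-specific steps (1, 2, 4 and 5) live in `quickstart.py` and are not reproduced here.

```python
import json
import tempfile
from pathlib import Path

# Assumed shape of data.json: parent records with a nested "children" list.
sample = [
    {"name": "Ana", "age": 30, "id": 456,
     "children": [{"name": "Bill", "id": 625}, {"name": "Elli", "id": 591}]},
    {"name": "Bob", "age": 30, "id": 455,
     "children": [{"name": "Bill", "id": 625}, {"name": "Dave", "id": 621}]},
]


def load_json_document(path):
    """Read a JSON document from disk and return the parsed Python object."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Write the sample to a temporary file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    data = load_json_document(path)

print(len(data), "records loaded")
```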

## Running the code

Run the quickstart script from the `dlt-quickstart-example` directory:

```
python3 quickstart.py
```

Inspect the schema that the script printed, or open the generated `schema.yml` file:
```
vim schema.yml
```

See results of querying the Google BigQuery table:

`json_doc` table

```
SELECT * FROM `{schema_prefix}_example.json_doc`
```
```
{  "name": "Ana",  "age": "30",  "id": "456",  "_dlt_load_id": "1654787700.406905",  "_dlt_id": "5b018c1ba3364279a0ca1a231fbd8d90"}
{  "name": "Bob",  "age": "30",  "id": "455",  "_dlt_load_id": "1654787700.406905",  "_dlt_id": "afc8506472a14a529bf3e6ebba3e0a9e"}
```

`json_doc__children` table

```
SELECT * FROM `{schema_prefix}_example.json_doc__children` LIMIT 1000
```
```
{"name": "Bill", "id": "625", "_dlt_parent_id": "5b018c1ba3364279a0ca1a231fbd8d90", "_dlt_list_idx": "0", "_dlt_root_id": "5b018c1ba3364279a0ca1a231fbd8d90", "_dlt_id": "7993452627a98814cc7091f2c51faf5c"}
{"name": "Bill", "id": "625", "_dlt_parent_id": "afc8506472a14a529bf3e6ebba3e0a9e", "_dlt_list_idx": "0", "_dlt_root_id": "afc8506472a14a529bf3e6ebba3e0a9e", "_dlt_id": "9a2fd144227e70e3aa09467e2358f934"}
{"name": "Dave", "id": "621", "_dlt_parent_id": "afc8506472a14a529bf3e6ebba3e0a9e", "_dlt_list_idx": "1", "_dlt_root_id": "afc8506472a14a529bf3e6ebba3e0a9e", "_dlt_id": "28002ed6792470ea8caf2d6b6393b4f9"}
{"name": "Elli", "id": "591", "_dlt_parent_id": "5b018c1ba3364279a0ca1a231fbd8d90", "_dlt_list_idx": "1", "_dlt_root_id": "5b018c1ba3364279a0ca1a231fbd8d90", "_dlt_id": "d18172353fba1a492c739a7789a786cf"}
```
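
The `_dlt_*` columns in these results link each child row to its parent. The sketch below shows one way a nested list could be split into parent and child rows carrying such keys. It is an illustration only, not DLT's implementation: the function name `flatten`, the use of random UUIDs for `_dlt_id`, and the omission of `_dlt_load_id` are all assumptions.

```python
import uuid


def flatten(records, list_field="children"):
    """Split records with one nested list into parent rows and child rows."""
    parents, children = [], []
    for record in records:
        parent_id = uuid.uuid4().hex  # stand-in for the generated _dlt_id
        parent = {k: v for k, v in record.items() if k != list_field}
        parent["_dlt_id"] = parent_id
        parents.append(parent)
        for idx, child in enumerate(record.get(list_field, [])):
            children.append({
                **child,
                "_dlt_parent_id": parent_id,  # points at the parent row
                "_dlt_root_id": parent_id,    # same as the parent at depth one
                "_dlt_list_idx": idx,         # position in the original list
                "_dlt_id": uuid.uuid4().hex,
            })
    return parents, children


records = [
    {"name": "Ana", "age": 30, "id": 456,
     "children": [{"name": "Bill", "id": 625}, {"name": "Elli", "id": 591}]},
    {"name": "Bob", "age": 30, "id": 455,
     "children": [{"name": "Bill", "id": 625}, {"name": "Dave", "id": 621}]},
]
parents, children = flatten(records)
```

Each child keeps its list position in `_dlt_list_idx`, so the original order can be restored after a join.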

Joining the two tables above on the autogenerated keys (i.e. `p._dlt_id = c._dlt_parent_id`):

```
SELECT p.name, p.age, p.id AS parent_id,
       c.name AS child_name, c.id AS child_id, c._dlt_list_idx AS child_order_in_list
FROM `{schema_prefix}_example.json_doc` AS p
LEFT JOIN `{schema_prefix}_example.json_doc__children` AS c
    ON p._dlt_id = c._dlt_parent_id
```
```
{  "name": "Ana",  "age": "30",  "parent_id": "456",  "child_name": "Bill",  "child_id": "625",  "child_order_in_list": "0"}
{  "name": "Ana",  "age": "30",  "parent_id": "456",  "child_name": "Elli",  "child_id": "591",  "child_order_in_list": "1"}
{  "name": "Bob",  "age": "30",  "parent_id": "455",  "child_name": "Bill",  "child_id": "625",  "child_order_in_list": "0"}
{  "name": "Bob",  "age": "30",  "parent_id": "455",  "child_name": "Dave",  "child_id": "621",  "child_order_in_list": "1"}
```
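
To try the join without a BigQuery account, the same query can be run against an in-memory SQLite database. This is a sketch under assumptions: SQLite stands in for BigQuery, the dataset prefix is dropped from the table names, the column types are guesses, and only the columns the query uses are created. The key values are copied from the results above.

```python
import sqlite3

ANA = "5b018c1ba3364279a0ca1a231fbd8d90"
BOB = "afc8506472a14a529bf3e6ebba3e0a9e"

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE json_doc (name TEXT, age INTEGER, id INTEGER, _dlt_id TEXT)"
)
conn.execute(
    "CREATE TABLE json_doc__children "
    "(name TEXT, id INTEGER, _dlt_parent_id TEXT, _dlt_list_idx INTEGER)"
)
conn.executemany(
    "INSERT INTO json_doc VALUES (?, ?, ?, ?)",
    [("Ana", 30, 456, ANA), ("Bob", 30, 455, BOB)],
)
conn.executemany(
    "INSERT INTO json_doc__children VALUES (?, ?, ?, ?)",
    [("Bill", 625, ANA, 0), ("Elli", 591, ANA, 1),
     ("Bill", 625, BOB, 0), ("Dave", 621, BOB, 1)],
)

# Same join as above; ORDER BY is added only to make the output deterministic.
rows = conn.execute(
    """
    SELECT p.name, p.age, p.id AS parent_id,
           c.name AS child_name, c.id AS child_id,
           c._dlt_list_idx AS child_order_in_list
    FROM json_doc AS p
    LEFT JOIN json_doc__children AS c
        ON p._dlt_id = c._dlt_parent_id
    ORDER BY p.name, c._dlt_list_idx
    """
).fetchall()

for row in rows:
    print(row)
```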

## Next steps

1. Replace `data.json` with data you want to explore

2. Check that the inferred types are correct in `schema.yml`

3. Set up your own Google BigQuery warehouse (and replace the credentials)

4. Use this new clean staging layer as the starting point for a semantic layer / analytical model (e.g. using dbt)
