Metadata-Version: 2.4
Name: apache-airflow-providers-fastetl
Version: 0.2.14
Summary: FastETL custom package Apache Airflow provider.
Home-page: https://github.com/gestaogovbr/FastETL
Author: Time de Dados CGINF
Author-email: seges.cginf@economia.gov.br
License: Apache License 2.0
Classifier: Framework :: Apache Airflow
Classifier: Framework :: Apache Airflow :: Provider
Requires-Python: ~=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: apache-airflow>=2.3
Requires-Dist: apache-airflow-providers-microsoft-mssql
Requires-Dist: apache-airflow-providers-mysql
Requires-Dist: apache-airflow-providers-postgres
Requires-Dist: apache-airflow-providers-common-sql
Requires-Dist: alembic>=1.8.1
Requires-Dist: beautifulsoup4>=4.1.11
Requires-Dist: ckanapi>=4.6
Requires-Dist: frictionless>=5.11.1
Requires-Dist: markdown>=3.4.1
Requires-Dist: odfpy>=1.4.1
Requires-Dist: pandas<2,>=1.5.2
Requires-Dist: psycopg2>=2.9.5
Requires-Dist: pygsheets>=2.0.5
Requires-Dist: pyodbc>=4.0.35
Requires-Dist: pysmb>=1.2.6
Requires-Dist: python-slugify>=7.0.0
Requires-Dist: pytz>=2022.6
Requires-Dist: requests>=2.28.1
Requires-Dist: SQLAlchemy>=1.4.44
Requires-Dist: PyYAML==6.0
Requires-Dist: openmetadata-ingestion==1.5.2.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

![FastETL's logo. It's a Swiss army knife with some open tools](docs/images/logo.svg)

<p align="center">
    <em>FastETL framework, modern, versatile, does almost everything.</em>
</p>

This text is also available in Portuguese: 🇧🇷[LEIAME.md](LEIAME.md).

---

[![CI Tests](https://github.com/gestaogovbr/FastETL/actions/workflows/ci-tests.yml/badge.svg)](https://github.com/gestaogovbr/FastETL/actions/workflows/ci-tests.yml)

**FastETL** is a package of Apache Airflow plugins for building data
pipelines for a number of common scenarios.

Main features:
* Full or incremental **replication** of tables in SQL Server, Postgres
  and MySQL databases
* Loading data from **GSheets** and from spreadsheets on **Samba/Windows**
  networks
* Extracting **CSV** files from SQL databases
* Cleaning data using custom data patching tasks (e.g. for messy
  geographical coordinates, mapping canonical values for columns, etc.)
* Using an [Open Source Routing Machine](https://project-osrm.org/)
  service to calculate route distances
* Using [CKAN](https://docs.ckan.org/en/2.10/api/index.html) or
  dados.gov.br's API to update dataset metadata
* Using Frictionless
  [Tabular Data Packages](https://specs.frictionlessdata.io/tabular-data-package/)
  to write data dictionaries in OpenDocument Text format

<!-- Contar a história da origem do FastETL -->
This framework is maintained by a network of developers from many teams
at the Ministry of Management and Innovation in Public Services and is
the cumulative result of experience using
[Apache Airflow](https://airflow.apache.org/), a free and open source
tool, since 2019.

**For government:** FastETL is widely used to replicate data queried
via Quartzo (DaaS) from Serpro.

# Installation in Airflow

FastETL implements the Apache Airflow provider standard. To install it,
simply add the `apache-airflow-providers-fastetl` package to your
Airflow environment's Python dependencies.

Alternatively, install it with `pip`:

```bash
pip install apache-airflow-providers-fastetl
```

To see an example of an Apache Airflow container that uses FastETL,
check out the
[airflow2-docker](https://github.com/gestaogovbr/airflow2-docker)
repository.

For SQL Server connectivity to work properly, make sure the
`msodbcsql17` and `unixodbc-dev` libraries are installed on your Apache
Airflow workers.

# Tests

The test suite uses Docker containers to simulate a complete working
environment, including Airflow and the databases. To run the tests, you
therefore first need to install Docker and docker-compose.

For instructions on how to do this, see the
[official Docker documentation](https://docs.docker.com/get-docker/).


To build the containers:

```bash
make setup
```

To run the tests, use:

```bash
make setup && make tests
```

To shutdown the environment, use:

```bash
make down
```

# Usage examples

The main FastETL feature is the `DbToDbOperator` operator, which copies
data between `postgres` and `mssql` databases. MySQL is also supported
as a source.

Here is an example:

```python
from datetime import datetime

from airflow import DAG
from fastetl.operators.db_to_db_operator import DbToDbOperator

# Airflow connection ids, schemas and table name
# (placeholders; adjust to your environment)
airflow_source_conn_id = "source_db"
airflow_dest_conn_id = "dest_db"
source_schema = "public"
dest_schema = "public"
table_name = "my_table"

default_args = {
    "start_date": datetime(2023, 4, 1),
}

dag = DAG(
    "copy_db_to_db_example",
    default_args=default_args,
    schedule_interval=None,
)


t0 = DbToDbOperator(
    task_id="copy_data",
    source={
        "conn_id": airflow_source_conn_id,
        "schema": source_schema,
        "table": table_name,
    },
    destination={
        "conn_id": airflow_dest_conn_id,
        "schema": dest_schema,
        "table": table_name,
    },
    destination_truncate=True,  # empty the destination table first
    copy_table_comments=True,  # also copy table and column comments
    chunksize=10000,  # number of rows copied per batch
    dag=dag,
)
```
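The `chunksize` parameter bounds memory usage by moving rows in batches
rather than loading the whole table at once. This is not FastETL's
actual implementation, but the idea behind a chunked copy can be
sketched with pandas (using in-memory SQLite databases purely for
illustration):

```python
import sqlite3

import pandas as pd

# Hypothetical source and destination databases (SQLite in-memory here;
# FastETL targets SQL Server, Postgres and MySQL).
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")

# Seed the source table with some rows.
pd.DataFrame({"id": range(10), "value": list("abcdefghij")}).to_sql(
    "t", src, index=False
)

# Chunked copy loop: read `chunksize` rows at a time and append each
# batch to the destination, keeping memory usage bounded.
for chunk in pd.read_sql("SELECT * FROM t", src, chunksize=4):
    chunk.to_sql("t", dst, if_exists="append", index=False)

copied = pd.read_sql("SELECT COUNT(*) AS n FROM t", dst)["n"][0]
print(copied)  # 10
```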

More detail about the parameters and the workings of `DbToDbOperator`
can be seen in the following files:

* [fast_etl.py](fastetl/custom_functions/fast_etl.py)
* [db_to_db_operator.py](fastetl/operators/db_to_db_operator.py)

# How to contribute

To be written in the `CONTRIBUTING.md` document (issue
[#4](https://github.com/gestaogovbr/FastETL/issues/4)).
