Metadata-Version: 2.4
Name: neo4j-etl-lib
Version: 0.3.6
Summary: Building blocks for ETL pipelines.
Keywords: etl,graph,database
Author-email: Bert Radke <bert.radke@pm.me>
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Database
Classifier: Development Status :: 4 - Beta
License-File: LICENSE
Requires-Dist: pydantic>=2.10.5; python_version >= '3.10'
Requires-Dist: neo4j-rust-ext>=6.0.0
Requires-Dist: python-dotenv>=1.0.1; python_version >= '3.10'
Requires-Dist: tabulate>=0.9.0; python_version >= '3.10'
Requires-Dist: click>=8.1.8; python_version >= '3.10'
Requires-Dist: pydantic[email-validator]
Requires-Dist: pytest>=8.3.0 ; extra == "dev" and ( python_version >= '3.8')
Requires-Dist: testcontainers[neo4j] ; extra == "dev" and ( python_version >= '3.9' and python_version < '4.0')
Requires-Dist: pytest-cov ; extra == "dev"
Requires-Dist: bumpver ; extra == "dev"
Requires-Dist: isort ; extra == "dev"
Requires-Dist: pip-tools ; extra == "dev"
Requires-Dist: sphinx ; extra == "dev"
Requires-Dist: sphinx-rtd-theme ; extra == "dev"
Requires-Dist: pydata-sphinx-theme ; extra == "dev"
Requires-Dist: sphinx-autodoc-typehints ; extra == "dev"
Requires-Dist: sphinxcontrib-napoleon ; extra == "dev"
Requires-Dist: sphinx-autoapi ; extra == "dev"
Requires-Dist: sqlalchemy ; extra == "dev"
Requires-Dist: psycopg2-binary ; extra == "dev"
Requires-Dist: graphdatascience>=1.13 ; extra == "gds" and ( python_version >= '3.9')
Requires-Dist: nox>=2024.0.0 ; extra == "nox"
Requires-Dist: pyarrow>=14.0.0 ; extra == "parquet"
Requires-Dist: sqlalchemy ; extra == "sql"
Project-URL: Documentation, https://neo-technology-field.github.io/python-etl-lib/index.html
Project-URL: Home, https://github.com/neo-technology-field/python-etl-lib
Provides-Extra: dev
Provides-Extra: gds
Provides-Extra: nox
Provides-Extra: parquet
Provides-Extra: sql

# Neo4j ETL Toolbox

A robust Python library of building blocks to assemble efficient, scalable ETL pipelines for Neo4j.

It simplifies the process of moving data from SQL, CSV, and Parquet sources into Neo4j by handling common concerns like batching, parallelism, logging, and error handling.

## Key Features

*   **Task-Based Architecture**: Compose pipelines from reusable units of work.
*   **Parallel Loading**: Optimized strategies for high-performance loading while minimizing lock contention and deadlocks.
*   **Data Validation**: Integrated Pydantic support for ensuring data quality before loading.
*   **Detailed Reporting**: Built-in tracking of execution time and row counts.
*   **Flexible Sources**: Support for SQL (via SQLAlchemy), CSV, Neo4j and Parquet (via PyArrow).

## Parallel Loading Example

The library provides specialized tasks for parallel data loading. By using a "mix-and-batch" strategy, it can load relationships in parallel while minimizing deadlocks.
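
The idea behind mix-and-batch: route each row into a cell of an N×N table by hashing its start- and end-node ids, then load cells from the same "diagonal" concurrently, so no two parallel workers touch the same node partitions. A plain-Python sketch of the idea (the names here are illustrative, not the library's API):

```python
from collections import defaultdict

def bucket_rows(rows, table_size, start_key, end_key):
    """Route each row into an (i, j) cell of a table_size x table_size
    grid, based on hashes of its start and end node ids."""
    table = defaultdict(list)
    for row in rows:
        i = hash(row[start_key]) % table_size
        j = hash(row[end_key]) % table_size
        table[(i, j)].append(row)
    return dict(table)

def parallel_rounds(table, table_size):
    """Yield groups of cells that are safe to load concurrently.

    In round r the cells (0, r), (1, r+1), ..., (N-1, (N-1+r) mod N)
    are selected: no two of them share a start partition or an end
    partition, so parallel workers never lock the same nodes."""
    for r in range(table_size):
        group = [table.get((i, (i + r) % table_size), [])
                 for i in range(table_size)]
        yield [cell for cell in group if cell]
```

Each round is fully parallelizable internally; rounds themselves run one after another.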

Here is an example of defining a parallel CSV loader task (taken from the `examples/nyc-taxi` project):

```python
from pathlib import Path
from etl_lib.core.ETLContext import ETLContext
from etl_lib.core.SplittingBatchProcessor import dict_id_extractor
from etl_lib.task.data_loading.ParallelCSVLoad2Neo4jTask import ParallelCSVLoad2Neo4jTask
from model.trip import Trip # Your Pydantic model

class LoadTripsParallelTask(ParallelCSVLoad2Neo4jTask):
    def __init__(self, context: ETLContext, csv_path: Path):
        super().__init__(
            context,
            file=csv_path,
            model=Trip,
            error_file=Path('errors_parallel.json'),
            batch_size=5000,
            max_workers=10
        )

    def _query(self):
        return """
            UNWIND $batch AS row
            MATCH (pu:Location {id: row.pu_location})
            MATCH (do:Location {id: row.do_location})
            CREATE (t:Trip {
              id: randomUUID(),
              pickup_datetime: row.pickup_datetime,
              dropoff_datetime: row.dropoff_datetime
              // ... remaining trip properties elided ...
            })
            CREATE (t)-[:STARTED_AT]->(pu)
            CREATE (t)-[:ENDED_AT]->(do)
        """

    def _id_extractor(self):
        # Defines how to route rows to avoid locking on start/end nodes
        return dict_id_extractor(table_size=10, start_key='pu_location', end_key='do_location')
```
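
The `Trip` model imported above is not shown here. With Pydantic v2 (which the library depends on), it might look roughly like this hypothetical sketch, with fields matching the columns the query uses; the real model in `examples/nyc-taxi` may differ:

```python
from datetime import datetime
from pydantic import BaseModel

class Trip(BaseModel):
    # Hypothetical sketch -- the actual model will have more
    # fields and its own validation rules.
    pu_location: int
    do_location: int
    pickup_datetime: datetime
    dropoff_datetime: datetime
```

Rows that fail validation against the model are diverted to the configured `error_file` instead of being loaded.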

## Documentation & Examples

Complete documentation is available at https://neo-technology-field.github.io/python-etl-lib/index.html

See the examples directory for complete projects:
*   [GTFS Example](https://github.com/neo-technology-field/python-etl-lib/tree/main/examples/gtfs)
*   [MusicBrainz Example](https://github.com/neo-technology-field/python-etl-lib/tree/main/examples/musicbrainz)

## Installation

The library can be installed via:

```bash
pip install neo4j-etl-lib
```
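
The package also declares optional extras (`sql`, `parquet`, `gds`, `nox`, `dev`) for the corresponding sources and tooling. Install them as needed:

```bash
# SQL (SQLAlchemy) and Parquet (PyArrow) source support
pip install "neo4j-etl-lib[sql,parquet]"

# Neo4j Graph Data Science client support
pip install "neo4j-etl-lib[gds]"
```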

## System Dependencies

Some components or documentation tools require additional system-level packages.

### Graphviz
If you are building the documentation locally and want to generate diagrams (e.g., using `make docs`), you need Graphviz installed.

**Debian/Ubuntu:**
```bash
sudo apt install graphviz
```

**Fedora/RHEL/CentOS:**
```bash
sudo dnf install graphviz
```

**Arch Linux / CachyOS:**
```bash
sudo pacman -S graphviz
```

### Podman + Testcontainers (Linux)
In short: don't. I could not get Testcontainers running against Podman without a brittle setup. Locally, I run the tests against a running database instance configured via `.env`; on CI, I use Docker, and it just works.
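
If you follow the same approach, a minimal `.env` (loaded via `python-dotenv`) could look like the fragment below. The variable names are hypothetical and depend on how your test setup reads them:

```bash
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=secret
NEO4J_DATABASE=neo4j
```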



