Metadata-Version: 2.4
Name: Flowfile
Version: 0.9.4
Summary: Project combining flowfile_core (backend), flowfile_worker (compute offloader), and flowfile_frame (API)
License-File: LICENSE
Author: Edward van Eechoud
Author-email: evaneechoud@gmail.com
Requires-Python: >=3.10,<3.14
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: XlsxWriter (>=3.2.0,<3.3.0)
Requires-Dist: aiofiles (>=24.1.0,<25.0.0)
Requires-Dist: alembic (>=1.13.0,<2.0.0)
Requires-Dist: azure-storage-blob (>=12.19.0)
Requires-Dist: bcrypt (>=4.3.0,<5.0.0)
Requires-Dist: boto3 (>=1.38.40,<1.38.47)
Requires-Dist: confluent-kafka (>=2.6,<3.0)
Requires-Dist: connectorx (>=0.4.2,<0.5.0)
Requires-Dist: cryptography (>=45.0.5,<47.0.0)
Requires-Dist: databases (>=0.9.0,<0.10.0)
Requires-Dist: deltalake (>=1.1.0,<2.0.0)
Requires-Dist: docker (>=7.0.0)
Requires-Dist: faker (>=23.1.0,<23.2.0)
Requires-Dist: fastapi (>=0.115.2,<0.116.0)
Requires-Dist: fastexcel (>=0.12.0,<0.13.0)
Requires-Dist: gcsfs (>=2024.2.0)
Requires-Dist: google-analytics-data (>=0.18.0)
Requires-Dist: google-cloud-storage (>=2.14.0)
Requires-Dist: httpx (>=0.28.1,<0.29.0)
Requires-Dist: loky (>=3.4.1,<3.5.0)
Requires-Dist: methodtools (>=0.4.7,<0.5.0)
Requires-Dist: openpyxl (>=3.1.2,<3.2.0)
Requires-Dist: passlib (>=1.7.4,<1.8.0)
Requires-Dist: pendulum (==2.1.2) ; python_version < "3.12"
Requires-Dist: pl-fuzzy-frame-match (>=0.6.0)
Requires-Dist: polars (>=1.8.2,<1.40)
Requires-Dist: polars-distance (>=0.4.3)
Requires-Dist: polars-ds (>=0.11.0)
Requires-Dist: polars-expr-transformer (>=0.5.3)
Requires-Dist: polars-grouper (>=0.3.0,<0.4.0)
Requires-Dist: polars-gw (==0.1.2)
Requires-Dist: polars_simed (>=0.4.0)
Requires-Dist: pyarrow (>=18.0.0,<19.0.0)
Requires-Dist: pydantic (>=2.9.2,<2.10.0)
Requires-Dist: pyiceberg[hadoop] (>=0.9.1,<0.10.0)
Requires-Dist: python-jose (>=3.4.0,<4.0.0)
Requires-Dist: python-multipart (>=0.0.12,<0.1.0)
Requires-Dist: pyyaml (>=6.0.3,<7.0.0)
Requires-Dist: s3fs (>=2025.7.0,<2026.0.0)
Requires-Dist: tqdm (>=4.67.1,<5.0.0)
Requires-Dist: uvicorn (>=0.32.0,<0.33.0)
Requires-Dist: websockets (>=12.0)
Requires-Dist: xlsx2csv (>=0.8.6,<0.9.0)
Description-Content-Type: text/markdown

<h1 align="center">
  <img src="https://raw.githubusercontent.com/Edwardvaneechoud/Flowfile/main/.github/images/logo.png" alt="Flowfile Logo" width="100">
  <br>
  Flowfile
</h1>

<p align="center">
  <b>Main Repository</b>: <a href="https://github.com/Edwardvaneechoud/Flowfile">Edwardvaneechoud/Flowfile</a><br>
  <b>Documentation</b>:
  <a href="https://edwardvaneechoud.github.io/Flowfile/">Website</a> -
  <a href="https://github.com/Edwardvaneechoud/Flowfile/blob/main/flowfile_core/README.md">Core</a> -
  <a href="https://github.com/Edwardvaneechoud/Flowfile/blob/main/flowfile_worker/README.md">Worker</a> -
  <a href="https://github.com/Edwardvaneechoud/Flowfile/blob/main/flowfile_frontend/README.md">Frontend</a> -
  <a href="https://dev.to/edwardvaneechoud/building-flowfile-architecting-a-visual-etl-tool-with-polars-576c">Technical Architecture</a>
</p>

<p>
Flowfile is an open-source data platform that combines a visual pipeline builder, a data catalog with Delta Lake storage, scheduling, Kafka ingestion, sandboxed Python execution, and a Polars-compatible Python API — all in a single <code>pip install</code>.
</p>

## Quick Start

```bash
pip install Flowfile
flowfile run ui
```

This starts the backend services and opens the visual ETL interface in your browser.

## What You Get

- **Visual pipeline builder** with 30+ nodes for joins, filters, aggregations, fuzzy matching, pivots, and more
- **Data catalog** with Delta Lake storage, version history, and lineage tracking
- **Scheduling** — interval-based or triggered by catalog table updates
- **Kafka/Redpanda ingestion** as a canvas node with automatic schema inference
- **Sandboxed Python execution** in isolated Docker containers
- **Code generation** — export visual flows as standalone Python/Polars scripts
- **Flow parameters** — `${variable}` substitution, configurable via UI or CLI
- **Cloud storage** — S3, Azure Data Lake Storage, Google Cloud Storage
- **Database connectivity** — PostgreSQL, MySQL, SQL Server, Oracle, DuckDB, and more
- **Python API** with Polars-like syntax and visual flow graph generation
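
The `${variable}` placeholder form listed under flow parameters is the same form that Python's standard `string.Template` understands. The sketch below only illustrates the idea of substitution; it is not Flowfile's implementation, and the setting string and parameter names are made up for the example.

```python
from string import Template

# Hypothetical node setting containing ${variable} placeholders.
raw_setting = "s3://${bucket}/exports/${run_date}/sales.parquet"

# Hypothetical parameter values, as they might be supplied via UI or CLI.
params = {"bucket": "analytics", "run_date": "2024-01-31"}

# substitute() raises KeyError when a placeholder has no value,
# so a missing parameter is reported instead of silently left in place.
resolved = Template(raw_setting).substitute(params)
print(resolved)
```

Using `substitute()` rather than `safe_substitute()` is a deliberate choice here: a flow that runs with an unresolved placeholder is usually harder to debug than one that fails early.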

## Python API

```python
import flowfile as ff
from flowfile import col, open_graph_in_editor

df = ff.from_dict({
    "id": [1, 2, 3, 4, 5],
    "category": ["A", "B", "A", "C", "B"],
    "value": [100, 200, 150, 300, 250]
})

result = df.filter(col("value") > 150).with_columns([
    (col("value") * 2).alias("double_value")
])

# Open the pipeline on the visual canvas
open_graph_in_editor(result.flow_graph)
```

## Common Operations

```python
import flowfile as ff
from flowfile import col, when, lit

# Read from various sources
df = ff.read_csv("data.csv")
df_pq = ff.read_parquet("data.parquet")

# Transform
filtered = df.filter(col("value") > 150)
with_status = df.with_columns([
    when(col("value") > 200).then(lit("High")).otherwise(lit("Low")).alias("status")
])

# Aggregate
by_category = df.group_by("category").agg([
    col("value").sum().alias("total"),
    col("value").mean().alias("average")
])

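# Join (other_df is a second frame with a "product_id" column; placeholder data)
other_df = ff.from_dict({"product_id": [1, 2, 3], "name": ["a", "b", "c"]})
joined = df.join(other_df, left_on="id", right_on="product_id")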

# Visualize any pipeline
ff.open_graph_in_editor(joined.flow_graph)
```

## Code Generation

Export visual flows as standalone Python/Polars scripts:

![Code Generation](https://raw.githubusercontent.com/Edwardvaneechoud/Flowfile/refs/heads/main/.github/images/generated_code.png)

## Package Components

- **Core Service** (`flowfile_core`) — ETL engine, catalog, scheduler, auth
- **Worker Service** (`flowfile_worker`) — CPU-intensive data processing
- **Web UI** — Browser-based visual pipeline builder
- **FlowFrame API** (`flowfile_frame`) — Polars-compatible Python library
- **Scheduler** (`flowfile_scheduler`) — Interval and table-trigger scheduling

## CLI

```bash
flowfile run ui                              # Start web UI
flowfile run core --host 0.0.0.0             # Start core service
flowfile run worker --host 0.0.0.0           # Start worker service
flowfile run flow pipeline.json              # Run a flow
flowfile run flow pipeline.json --param key=value  # Run with parameters
```
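
When flows are launched from another Python program, the `flowfile run flow` command shown above can be assembled as an argument list. This is a minimal sketch: `build_flow_command` is a hypothetical helper, and repeating `--param` once per key is an assumption, since only the single-parameter form appears above.

```python
import shlex


def build_flow_command(flow_path, params=None):
    """Return an argv list for `flowfile run flow` (hypothetical helper)."""
    cmd = ["flowfile", "run", "flow", flow_path]
    for key, value in (params or {}).items():
        # Assumption: --param may be repeated for several parameters.
        cmd += ["--param", f"{key}={value}"]
    return cmd


cmd = build_flow_command("pipeline.json", {"key": "value"})
# Show the command as a shell-safe string without executing it.
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would run the flow, avoiding shell quoting issues with parameter values that contain spaces.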

## More Options

- **Desktop App**: Download from [GitHub Releases](https://github.com/Edwardvaneechoud/Flowfile#-getting-started)
- **Docker**: `docker compose up -d` for self-hosted deployments
- **Browser Demo**: [demo.flowfile.org](https://demo.flowfile.org) (WASM, no server)

## Resources

- **[Documentation](https://edwardvaneechoud.github.io/Flowfile/)**: Comprehensive guides
- **[Main Repository](https://github.com/Edwardvaneechoud/Flowfile)**: Latest code and examples
- **[Technical Architecture](https://dev.to/edwardvaneechoud/building-flowfile-architecting-a-visual-etl-tool-with-polars-576c)**: Design overview

