Metadata-Version: 2.4
Name: datablade
Version: 0.0.7
Summary: datablade is a suite of functions to provide standard syntax across data engineering projects.
Author-email: Brent Carpenetti <brentwc.git@pm.me>
License-Expression: MIT
Project-URL: Homepage, https://github.com/brentwc/datablade
Project-URL: Repository, https://github.com/brentwc/datablade
Project-URL: Issues, https://github.com/brentwc/datablade/issues
Project-URL: Changelog, https://github.com/brentwc/datablade/blob/main/CHANGELOG.md
Project-URL: Documentation, https://github.com/brentwc/datablade#readme
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: pyarrow
Requires-Dist: numpy
Requires-Dist: openpyxl
Requires-Dist: requests
Provides-Extra: performance
Requires-Dist: polars; extra == "performance"
Requires-Dist: psutil; extra == "performance"
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: pytest-mock>=3.10.0; extra == "test"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10.0; extra == "dev"
Requires-Dist: polars; extra == "dev"
Requires-Dist: psutil; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: isort; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Provides-Extra: all
Requires-Dist: polars; extra == "all"
Requires-Dist: psutil; extra == "all"
Requires-Dist: pytest>=7.0.0; extra == "all"
Requires-Dist: pytest-cov>=4.0.0; extra == "all"
Requires-Dist: pytest-mock>=3.10.0; extra == "all"
Requires-Dist: black; extra == "all"
Requires-Dist: flake8; extra == "all"
Requires-Dist: mypy; extra == "all"
Requires-Dist: isort; extra == "all"
Requires-Dist: pre-commit; extra == "all"

# datablade

[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**datablade** is a small, single-machine Python toolkit for data engineers who need reliable “file → DataFrame/Parquet → SQL DDL” workflows.

It focuses on:

- Reading common file formats with memory-aware heuristics
- Streaming large files in chunks (without concatenating)
- Normalizing DataFrame columns for downstream systems
- Generating `CREATE TABLE` DDL across a small set of SQL dialects
- Producing bulk-load commands (and executing BCP for SQL Server)

## What datablade Does

datablade helps data engineers:

- **Load data efficiently** from common file formats with automatic memory heuristics
- **Standardize data cleaning** with consistent column naming and type inference
- **Apply first-class schema rules** with full or partial column overrides before Parquet output
- **Generate database schemas** for multiple SQL dialects from DataFrames or Parquet schemas
- **Handle datasets that don't fit in memory** using chunked iteration and optional Polars acceleration
- **Work across databases** with cross-dialect DDL and bulk-load command generation
- **Maintain data quality** with built-in validation and logging

## When to Use datablade

datablade is ideal for:

✅ **ETL/ELT Pipelines** - Building reproducible data ingestion workflows across multiple source formats

✅ **Multi-Database Projects** - Deploying the same schema to SQL Server, PostgreSQL, MySQL, or DuckDB

✅ **Large File Processing** - Streaming CSV/TSV/TXT/Parquet without concatenating

✅ **Data Lake to Warehouse** - Converting raw files to Parquet with optimized schemas

✅ **Ad-hoc Data Analysis** - Quickly exploring and preparing datasets with consistent patterns

✅ **Legacy System Integration** - Standardizing messy column names and data types from external sources

## When datablade is not the right tool

- Real-time streaming ingestion (Kafka, Spark Structured Streaming)
- Distributed compute / cluster execution (Spark, Dask)
- Warehouse-native transformations and modeling (dbt)
- A full-featured schema migration tool (Alembic, Flyway)
- Direct database connectivity/transactions (datablade generates SQL; it does not manage connections)

## Installation

```bash
pip install datablade
```

**Optional dependencies:**

```bash
# For high-performance file reading with Polars
pip install "datablade[performance]"

# For testing
pip install "datablade[test]"

# For development (includes testing + lint/format tooling)
pip install "datablade[dev]"

# All optional dependencies
pip install "datablade[all]"
```

## Features

datablade provides four main modules:

### 📊 `datablade.dataframes`

DataFrame operations and transformations:

- Clean and normalize DataFrame columns
- Auto-detect and convert data types
- Apply `SchemaConfig` rules to cleaned column names before Parquet conversion
- Generate optimized Parquet schemas
- Convert pandas DataFrames to PyArrow tables
- Generate multi-dialect SQL DDL statements
- **Memory-aware file reading** with automatic chunking
- **Polars integration** for high-performance large file processing
- Partitioned Parquet writing for datasets that don't fit in memory
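
For example, a minimal sketch that cleans a small frame and writes it out with PyArrow (the sample DataFrame and output path are placeholders, and `pq.write_table` is plain PyArrow rather than a datablade function):

```python
import pandas as pd
import pyarrow.parquet as pq

from datablade.dataframes import clean_dataframe_columns, pandas_to_parquet_table

# Placeholder data with messy column names
df = pd.DataFrame({"Customer ID": [1, 2], "Order Amount": ["3.50", "4.25"]})
df = clean_dataframe_columns(df)

# convert=True applies type inference before building the PyArrow table
table = pandas_to_parquet_table(df, convert=True)
pq.write_table(table, "customers.parquet")
```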

### 🌐 `datablade.io`

Input/output operations for external data:

- Fetch JSON data from URLs
- Download and extract ZIP files
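
A minimal sketch of the JSON helper (the URL is a placeholder, and the endpoint is assumed to return a JSON array of flat records):

```python
import pandas as pd

from datablade.io import get_json

# Assumes the endpoint returns a list of flat JSON records
records = get_json("https://api.example.com/data.json")
df = pd.DataFrame(records)
```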

### 🛠️ `datablade.utils`

General utility functions:

- SQL name quoting
- Path standardization
- List flattening
- **Configurable logging** with the Python `logging` module
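
For example (assuming `sql_quotename` takes a bare identifier and returns its quoted form):

```python
import logging

from datablade import configure_logging
from datablade.utils import sql_quotename

# Quote an identifier for use in generated SQL
# (assumed call shape: bare name in, quoted name out)
quoted = sql_quotename("order details")

# Route datablade's log output through the standard logging module
configure_logging(level=logging.INFO, log_file="datablade.log")
```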

### 🗄️ `datablade.sql`

Multi-dialect SQL utilities:

- **Multi-dialect support**: SQL Server, PostgreSQL, MySQL, DuckDB
- Dialect-aware identifier quoting
- CREATE TABLE generation for all dialects (from pandas DataFrames)
- CREATE TABLE generation from Parquet schemas (schema-only, via PyArrow)
- Optional `schema_spec` overrides for column types, nullability, and string sizing
- Bulk loading helpers:
    - SQL Server: executes `bcp` via subprocess
    - PostgreSQL/MySQL/DuckDB: returns command strings you can run in your environment
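
For example, generating the same table definition for every supported dialect (`Dialect.SQLSERVER` and `Dialect.POSTGRES` appear in the Quick Start below; the `MYSQL` and `DUCKDB` member names are assumed here):

```python
import pandas as pd

from datablade.sql import Dialect, generate_create_table

df = pd.DataFrame({"customer_id": [1], "amount": [3.50]})

# Dialect.MYSQL and Dialect.DUCKDB are assumed member names
for dialect in (Dialect.SQLSERVER, Dialect.POSTGRES, Dialect.MYSQL, Dialect.DUCKDB):
    print(generate_create_table(df, table="my_table", dialect=dialect))
```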

## Quick Start

```python
from datablade import SchemaConfig, configure_logging, read_file_smart
from datablade.dataframes import clean_dataframe_columns, pandas_to_parquet_table
from datablade.io import get_json
from datablade.utils import sql_quotename
from datablade.sql import Dialect, generate_create_table, generate_create_table_from_parquet

# Configure logging
import logging
configure_logging(level=logging.INFO, log_file="datablade.log")

# Read a file into a single DataFrame (materializes)
# Use dtype="string" when you want a truly lossless ingest path.
df = read_file_smart("large_dataset.csv", verbose=True, dtype="string")

# Clean DataFrame
df = clean_dataframe_columns(df, verbose=True)

# Apply full or partial schema rules before Parquet conversion
schema_config = SchemaConfig(
    numeric_policy="float64",
    columns={
        "customer_id": "Int64",
        "amount": "Float64",
        "event_ts": "datetime64[ns, UTC]",
    },
)

# Convert to Parquet
table = pandas_to_parquet_table(df, convert=True, schema_config=schema_config)

# Generate SQL DDL for multiple dialects
sql_sqlserver = generate_create_table(df, table='my_table', dialect=Dialect.SQLSERVER)
sql_postgres = generate_create_table(df, table='my_table', dialect=Dialect.POSTGRES)

# Generate SQL DDL directly from an existing Parquet schema (no data materialization)
# Note: nested Parquet types (struct/list/map/union) are dropped with a warning.
ddl_from_parquet = generate_create_table_from_parquet(
    "events.parquet",
    table="events",
    dialect=Dialect.POSTGRES,
)

# Fetch JSON data
data = get_json('https://api.example.com/data.json')
```

`SchemaConfig` rules match cleaned column names. Keep `numeric_policy="infer"` for
the existing behavior, use `"float64"` to promote inferred numeric columns to
nullable `Float64`, or use `"string"` to skip numeric inference for columns without
explicit rules. For truly lossless CSV ingestion, load text at read time with
`dtype="string"` as shown above.
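
A minimal sketch of the three policies (the per-column override is a placeholder):

```python
from datablade import SchemaConfig

# Keep the existing inference behavior
infer_cfg = SchemaConfig(numeric_policy="infer")

# Promote inferred numeric columns to nullable Float64
float_cfg = SchemaConfig(numeric_policy="float64")

# Skip numeric inference for columns without explicit rules,
# while still applying any per-column overrides
string_cfg = SchemaConfig(numeric_policy="string", columns={"amount": "Float64"})
```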

Most file path parameters accept `str` or `pathlib.Path`. To treat case mismatches
as errors on case-insensitive filesystems, use `configure_paths(path_strict=True)`.
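
For example (assuming `configure_paths` is exported from the package root, like `configure_logging`):

```python
from datablade import configure_paths

# Treat case mismatches as errors instead of resolving them silently
configure_paths(path_strict=True)
```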

### Memory-Aware File Reading

See the file format support matrix in the bundled USAGE doc:

```bash
python -m datablade.docs --show USAGE
```

```python
from datablade.dataframes import (
    excel_to_parquets,
    read_file_chunked,
    read_file_iter,
    read_file_to_parquets,
    stream_to_parquets,
)

# Read large files in chunks
for chunk in read_file_chunked('huge_file.csv', memory_fraction=0.5):
    process(chunk)

# Stream without ever concatenating/materializing
for chunk in read_file_iter('huge_file.csv', memory_fraction=0.3, verbose=True):
    process(chunk)

# Parquet is also supported for streaming (single .parquet files)
for chunk in read_file_iter('huge_file.parquet', memory_fraction=0.3, verbose=True):
    process(chunk)

# Excel streaming is available with openpyxl installed (read-only mode)
for chunk in read_file_iter('large.xlsx', chunksize=25_000, verbose=True):
    process(chunk)

# Partition large files to multiple Parquets
files = read_file_to_parquets(
    'large_file.csv',
    output_dir='partitioned/',
    convert_types=True,
    verbose=True
)

# Stream to Parquet partitions without materializing
files = stream_to_parquets(
    'large_file.csv',
    output_dir='partitioned_streamed/',
    rows_per_file=200_000,
    convert_types=True,
    verbose=True,
)

# Excel streaming to Parquet partitions
files = excel_to_parquets(
    'large.xlsx',
    output_dir='partitioned_excel/',
    rows_per_file=200_000,
    convert_types=True,
    verbose=True,
)
```
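
A common follow-on pattern is to derive DDL from the first streamed chunk before loading the rest. A sketch using only the functions shown above (`process` is a placeholder, as in the earlier examples):

```python
from datablade.dataframes import clean_dataframe_columns, read_file_iter
from datablade.sql import Dialect, generate_create_table

chunks = read_file_iter("huge_file.csv", memory_fraction=0.3)

# Use the first chunk to derive a CREATE TABLE statement
first = clean_dataframe_columns(next(chunks))
ddl = generate_create_table(first, table="huge_table", dialect=Dialect.POSTGRES)

process(first)
for chunk in chunks:
    process(clean_dataframe_columns(chunk))
```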

## Blade (Optional Facade)

The canonical API is module-level functions (for example, `datablade.dataframes.read_file_iter`).

If you prefer an object-style entrypoint with shared defaults, you can use the optional `Blade` facade:

```python
from datablade import Blade
from datablade.sql import Dialect

blade = Blade(memory_fraction=0.3, verbose=True, convert_types=True)

for chunk in blade.iter("huge.csv"):
    process(chunk)

files = blade.stream_to_parquets("huge.csv", output_dir="partitioned/")

# Generate DDL (CREATE TABLE)
ddl = blade.create_table_sql(
    df,
    table="my_table",
    dialect=Dialect.POSTGRES,
)

# Generate DDL from an existing Parquet file (schema-only)
ddl2 = blade.create_table_sql_from_parquet(
    "events.parquet",
    table="events",
    dialect=Dialect.POSTGRES,
)
```

## Documentation

Docs are bundled with the installed package:

```bash
python -m datablade.docs --list
python -m datablade.docs --show USAGE
python -m datablade.docs --write-dir ./datablade-docs
```

After writing docs to disk, open the markdown files locally:

- README (docs landing page)
- USAGE (file reading, streaming, SQL, IO, logging)
- TESTING (how to run tests locally)
- ARCHITECTURE (pipeline overview)
- OBJECT_REGISTRY (registry reference)

## Testing

Run the test suite:

```bash
# Install with test dependencies
pip install -e ".[test]"

# Run all tests
pytest

# Run with coverage report
pytest --cov=datablade --cov-report=html
```

For detailed testing documentation, use the bundled TESTING doc:

```bash
python -m datablade.docs --show TESTING
```

## Backward Compatibility

All functions are available through the legacy `datablade.core` module for backward compatibility:

```python
# Legacy imports (still supported)
from datablade.core.frames import clean_dataframe_columns
from datablade.core.json import get
```

## Requirements

**Core dependencies:**

- pandas
- pyarrow
- numpy
- openpyxl
- requests

**Optional dependencies:**

- polars (for high-performance file reading)
- psutil (for memory-aware operations)
- pytest (for testing)

## Design choices and limitations

- **Single-machine focus**: datablade is designed for laptop/VM/server execution, not clusters.
- **Streaming vs materializing**:
    - Use `read_file_iter()` to process arbitrarily large files chunk-by-chunk.
    - `read_file_smart()` returns a single DataFrame and may still be memory-intensive.
- **Chunk concatenation**: the large-file pandas fallback in `read_file_smart()` can
  temporarily spike memory usage during concat. Use `read_file_iter()` or
  `return_type="iterator"` to avoid concatenation.
- **Polars materialization**: when returning a pandas DataFrame, Polars still
  collects into memory; use `return_type="polars"` or `"polars_lazy"` to keep
  Polars frames.
- **Parquet support**:
    - Streaming reads support single `.parquet` files.
    - Parquet “dataset directories” (Hive partitions / directory-of-parquets) are not a primary target of the API.
- **Parquet → SQL DDL**:
    - Uses the Parquet schema (PyArrow) without scanning data.
    - Complex/nested columns (struct/list/map/union) are dropped and logged as warnings.
- **DDL scope**: `CREATE TABLE` generation is column/type oriented (no indexes/constraints).
- **SQL Server bulk load**: the SQL Server helpers use the `bcp` CLI and require it
  to be installed and available on PATH. When using `-U`/`-P`, credentials are
  passed via process args (logs are redacted); prefer `-T` or `-G` where possible.
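
As noted in the streaming and Polars bullets above, `read_file_smart()` can avoid concatenation or keep Polars frames via `return_type`; a minimal sketch (the file path and `process` are placeholders, and the Polars variants require the `performance` extra):

```python
from datablade.dataframes import read_file_smart

# Avoid the concat spike: iterate chunks instead of materializing one DataFrame
for chunk in read_file_smart("large_file.csv", return_type="iterator"):
    process(chunk)

# Keep Polars frames instead of converting to pandas
pl_df = read_file_smart("large_file.csv", return_type="polars")
lazy_df = read_file_smart("large_file.csv", return_type="polars_lazy")
```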

## License

MIT

## Links

- Documentation: https://github.com/brentwc/datablade/tree/main/src/datablade/docs
- Changelog: https://github.com/brentwc/datablade/blob/main/CHANGELOG.md
- Issues: https://github.com/brentwc/datablade/issues

