datablade Usage Guide

Installation

```bash
pip install git+https://github.com/brentwc/data-prep.git
```

Optional extras:

```bash
pip install "git+https://github.com/brentwc/data-prep.git#egg=datablade[performance]"
```

File reading

Read into a DataFrame (convenience)

Use this when you want a single in-memory DataFrame.

```python
from datablade.dataframes import read_file_smart

df = read_file_smart("data.csv", verbose=True)
```

read_file_smart() may read in chunks internally, but it returns a single DataFrame.

Stream without materializing (recommended for very large files)

Use this when you want to process arbitrarily large files without ever concatenating the full dataset.

```python
from datablade.dataframes import read_file_iter

for chunk in read_file_iter("huge.csv", memory_fraction=0.3, verbose=True):
    process(chunk)  # your per-chunk handler
```

Supported streaming formats:

  • .csv, .tsv, .txt
  • .parquet
  • .json only for JSON Lines (lines=True)
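
A minimal sketch of the JSON Lines case from the list above; whether read_file_iter accepts lines=True directly is an assumption here, so check the signature in your version:

```python
from datablade.dataframes import read_file_iter

# Assumption: read_file_iter forwards lines=True for JSON Lines input.
for chunk in read_file_iter("events.json", lines=True):
    process(chunk)  # your per-chunk handler
```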

Stream to Parquet partitions

If your goal is “partition to Parquet without ever materializing”, use stream_to_parquets().

```python
from datablade.dataframes import stream_to_parquets

files = stream_to_parquets(
    "huge.csv",
    output_dir="partitioned/",
    rows_per_file=200_000,
    convert_types=True,
    verbose=True,
)
print(len(files))
```

If you prefer the older helper that may choose chunk sizes automatically, you can also use read_file_to_parquets().
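
A hedged sketch of that helper; the parameters shown are assumptions mirroring stream_to_parquets, so verify them against the actual signature:

```python
from datablade.dataframes import read_file_to_parquets

# Assumption: mirrors stream_to_parquets, with chunk sizes chosen automatically.
files = read_file_to_parquets("huge.csv", output_dir="partitioned/", verbose=True)
```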

DataFrame helpers

```python
from datablade.dataframes import clean_dataframe_columns, try_cast_string_columns_to_numeric

# clean_dataframe_columns:
# - flattens MultiIndex columns
# - coerces names to strings
# - drops duplicate columns (keeps first)

# try_cast_string_columns_to_numeric:
# - converts object columns containing numeric strings
```
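
A minimal usage sketch, assuming both helpers take a DataFrame and return the transformed copy rather than mutating in place:

```python
import pandas as pd
from datablade.dataframes import clean_dataframe_columns, try_cast_string_columns_to_numeric

# MultiIndex columns plus numeric strings to exercise both helpers.
df = pd.DataFrame({("a", "id"): [1, 2], ("a", "qty"): ["10", "20"]})

df = clean_dataframe_columns(df)             # flatten MultiIndex, stringify names
df = try_cast_string_columns_to_numeric(df)  # "10"/"20" -> numeric dtype
print(df.dtypes)
```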

SQL helpers

```python
import pandas as pd
from datablade.sql import Dialect, generate_create_table, generate_create_table_from_parquet

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
print(generate_create_table(df, table="t", dialect=Dialect.POSTGRES))

# Schema-only Parquet -> SQL (does not read/scan the data)
ddl = generate_create_table_from_parquet(
    "events.parquet",
    table="events",
    dialect=Dialect.POSTGRES,
)
print(ddl)
```

Notes:

  • Parquet DDL generation reads only the file schema via PyArrow (no DataFrame materialization).
  • If a Parquet column has no clean mapping for the selected dialect (for example structs/lists/maps), it is dropped and a warning is logged under logger name datablade.
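
To surface those warnings, enable logging before generating DDL (see the Logging section below):

```python
import logging
from datablade.utils import configure_logging

# Dropped-column warnings are emitted on the "datablade" logger.
configure_logging(level=logging.WARNING)
```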

IO helpers

```python
from datablade.io import get_json, get_zip

data = get_json("https://api.example.com/data.json")
get_zip("https://example.com/data.zip", path="./data")
```

Logging

datablade uses the standard Python logging system under the logger name datablade.

```python
import logging
from datablade.utils import configure_logging

configure_logging(level=logging.INFO)
```

Write logs to a file:

```python
import logging
from datablade.utils import configure_logging

configure_logging(level=logging.INFO, log_file="pipeline.log")
```

For rotation, attach a custom handler such as logging.handlers.RotatingFileHandler to the datablade logger.
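
Because datablade logs through the standard library, rotation needs no datablade-specific API; this sketch uses only stdlib calls:

```python
import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler("pipeline.log", maxBytes=10_000_000, backupCount=5)
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))

# Attach the rotating handler to datablade's logger.
logger = logging.getLogger("datablade")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
```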

Backward compatibility

Legacy imports under datablade.core.* remain available. New code should prefer datablade.dataframes, datablade.sql, datablade.io, and datablade.utils.
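
For example (the legacy module path below is hypothetical; use whatever path your existing code already imports from):

```python
# Legacy style (hypothetical path under datablade.core):
# from datablade.core.dataframes import read_file_smart

# Preferred in new code:
from datablade.dataframes import read_file_smart
```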

Optional facade (class-style)

The primary API is module-level functions, but datablade also provides an optional convenience facade for users who prefer an object-style entrypoint with shared defaults.

```python
import pandas as pd
from datablade import Blade
from datablade.sql import Dialect

blade = Blade(memory_fraction=0.3, verbose=True, convert_types=True)

for chunk in blade.iter("huge.csv"):
    ...

files = blade.stream_to_parquets("huge.csv", output_dir="partitioned/")

# Generate DDL (CREATE TABLE) for a dialect
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
ddl = blade.create_table_sql(
    df,
    table="my_table",
    dialect=Dialect.POSTGRES,
)
```