datablade Usage Guide
Installation
pip install git+https://github.com/brentwc/data-prep.git
Optional extras:
pip install "git+https://github.com/brentwc/data-prep.git#egg=datablade[performance]"
File reading
Read into a DataFrame (convenience)
Use this when you want a single in-memory DataFrame.
from datablade.dataframes import read_file_smart
df = read_file_smart("data.csv", verbose=True)
read_file_smart() may read in chunks internally, but it returns a single DataFrame.
Stream without materializing (recommended for very large files)
Use this when you want to process arbitrarily large files without ever concatenating the full dataset.
from datablade.dataframes import read_file_iter
for chunk in read_file_iter("huge.csv", memory_fraction=0.3, verbose=True):
    process(chunk)
Supported streaming formats:
- .csv, .tsv, .txt
- .parquet
- .json (JSON Lines only, i.e. lines=True; see the sketch below)
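A minimal sketch for JSON Lines input, using read_file_iter imported above. Passing lines=True explicitly is an assumption; read_file_iter may instead infer it from the file extension.
for chunk in read_file_iter("events.json", lines=True, verbose=True):
    process(chunk)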
Stream to Parquet partitions
If your goal is “partition to Parquet without ever materializing”, use stream_to_parquets().
from datablade.dataframes import stream_to_parquets
files = stream_to_parquets(
    "huge.csv",
    output_dir="partitioned/",
    rows_per_file=200_000,
    convert_types=True,
    verbose=True,
)
print(len(files))
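The return value is used as a sized collection above; assuming it holds the written file paths, a partition can be read back with pandas:
import pandas as pd
sample = pd.read_parquet(files[0])  # inspect the first partition
print(sample.shape)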
If you prefer the older helper that may choose chunk sizes automatically, you can also use read_file_to_parquets().
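A minimal sketch; the exact keyword arguments of read_file_to_parquets are not shown here, so mirroring stream_to_parquets' file and output_dir arguments is an assumption.
from datablade.dataframes import read_file_to_parquets
# Assumed to accept the same input path and output_dir as stream_to_parquets,
# with chunk sizes chosen automatically.
files = read_file_to_parquets("huge.csv", output_dir="partitioned/", verbose=True)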
DataFrame helpers
from datablade.dataframes import clean_dataframe_columns, try_cast_string_columns_to_numeric
# clean_dataframe_columns:
# - flattens MultiIndex columns
# - coerces names to strings
# - drops duplicate columns (keeps first)
# try_cast_string_columns_to_numeric:
# - converts object columns containing numeric strings
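A minimal usage sketch, assuming both helpers take a DataFrame and return the transformed result:
df = clean_dataframe_columns(df)             # normalize column labels
df = try_cast_string_columns_to_numeric(df)  # numeric strings become numeric dtypes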
SQL helpers
import pandas as pd
from datablade.sql import Dialect, generate_create_table, generate_create_table_from_parquet
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
print(generate_create_table(df, table="t", dialect=Dialect.POSTGRES))
# Schema-only Parquet -> SQL (does not read/scan the data)
ddl = generate_create_table_from_parquet(
    "events.parquet",
    table="events",
    dialect=Dialect.POSTGRES,
)
print(ddl)
Notes:
- Parquet DDL generation reads only the file schema via PyArrow (no DataFrame materialization).
- If a Parquet column has no clean mapping for the selected dialect (for example structs/lists/maps), it is dropped and a warning is logged under the logger name datablade.
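To surface those warnings, enable logging output for the datablade logger (see Logging below):
import logging
logging.basicConfig(level=logging.WARNING)  # routes datablade warnings to stderr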
IO helpers
from datablade.io import get_json, get_zip
data = get_json("https://api.example.com/data.json")  # fetch and parse JSON from a URL
get_zip("https://example.com/data.zip", path="./data")  # download the archive to path
Logging
datablade uses the standard Python logging system under the logger name datablade.
import logging
from datablade.utils import configure_logging
configure_logging(level=logging.INFO)
Write logs to a file:
import logging
from datablade.utils import configure_logging
configure_logging(level=logging.INFO, log_file="pipeline.log")
For rotation, pass a custom handler (e.g., logging.handlers.RotatingFileHandler).
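A minimal sketch using only the standard library; whether configure_logging accepts a custom handler directly is not shown here, so the handler is attached to the datablade logger itself:
import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler("pipeline.log", maxBytes=10_000_000, backupCount=3)
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logging.getLogger("datablade").addHandler(handler)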
Backward compatibility
Legacy imports under datablade.core.* remain available.
New code should prefer datablade.dataframes, datablade.sql, datablade.io, and datablade.utils.
Optional facade (class-style)
The primary API is module-level functions, but datablade also provides an optional convenience facade for users who prefer an object-style entrypoint with shared defaults.
from datablade import Blade
from datablade.sql import Dialect
blade = Blade(memory_fraction=0.3, verbose=True, convert_types=True)
for chunk in blade.iter("huge.csv"):
    ...
files = blade.stream_to_parquets("huge.csv", output_dir="partitioned/")
# Generate DDL (CREATE TABLE) for a dialect
ddl = blade.create_table_sql(
    df,
    table="my_table",
    dialect=Dialect.POSTGRES,
)