Metadata-Version: 2.4
Name: dataforge-ml
Version: 1.0.1
Summary: An automated feature engineering and pipeline design library
License: MIT
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: polars>=1.0.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: scipy>=1.10.0
Requires-Dist: numpy>=2.0.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: chardet>=5.0.0
Requires-Dist: iterative-stratification>=0.1.9
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Dynamic: license-file

# DataForgeML

[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/DEVunderdog/DataForgeML)

Automated data profiling and splitting pipeline for ML datasets.

DataForgeML inspects your dataset, detects each column's semantic type (numeric, categorical, boolean, text, datetime, or identifier), computes per-column statistics and missingness, and produces a structured result ready for downstream feature engineering, with no manual schema wrangling required.
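
To give a feel for what semantic type detection involves, here is a minimal standalone sketch using only the standard library. It is not DataForgeML's implementation: the function name `infer_semantic_type`, the `max_categories` threshold, and the order of the checks are all assumptions made for illustration.

```python
from datetime import datetime


def _all_parse(values, parser):
    """Return True if `parser` accepts every value without raising."""
    try:
        for value in values:
            parser(value)
        return True
    except (TypeError, ValueError):
        return False


def infer_semantic_type(values, max_categories=20):
    """Guess a column's semantic type from raw values (hypothetical heuristic)."""
    # Treat None and blank strings as missing.
    present = [str(v).strip() for v in values if v is not None and str(v).strip() != ""]
    if not present:
        return "unknown"
    distinct = set(present)
    boolean_tokens = {"true", "false", "yes", "no", "0", "1"}
    if len(distinct) <= 2 and {v.lower() for v in distinct} <= boolean_tokens:
        return "boolean"
    if _all_parse(present, int) and len(distinct) == len(present):
        return "identifier"  # unique integers look like row keys
    if _all_parse(present, float):
        return "numeric"
    if _all_parse(present, datetime.fromisoformat):
        return "datetime"
    if len(distinct) <= max_categories and len(distinct) < len(present):
        return "categorical"  # few distinct values, some repeated
    return "text"


for column in (["1", "2", "3", "4"], ["22", "38.5", "", "26"], ["S", "C", "S", "Q"]):
    print(column, "->", infer_semantic_type(column))
```

A heuristic like this will misclassify some columns (for example, integer-coded classes), which is why explicit overrides are useful.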

## Installation

```bash
pip install dataforge-ml
```

## Quick Start

```python
from dataforge_ml import DataLoader, PipelineConfig, StructuralProfiler

df = DataLoader().load("titanic.csv")

config = PipelineConfig()
result = StructuralProfiler(config).profile(df)

print(result.columns["Age"].semantic_type)  # SemanticType.Numeric
print(result.dataset.row_count)             # total rows
```

`DataLoader` auto-detects encoding and delimiter. Supported formats: CSV, TSV, Parquet, JSON, NDJSON, JSONL, XLSX, XLS, Arrow, Feather.
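
For intuition, encoding and delimiter detection on delimited text can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not the logic `DataLoader` uses: `sniff_text_format` is a hypothetical helper, the candidate encodings are arbitrary, and although `chardet` is listed as a dependency it is deliberately not used here so the example runs without extra packages.

```python
import csv


def sniff_text_format(raw, encodings=("utf-8", "cp1252"), delimiters=",;\t|"):
    """Return (encoding, delimiter) guessed from a sample of raw bytes."""
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        # latin-1 maps every byte to a character, so decoding never fails.
        encoding, text = "latin-1", raw.decode("latin-1")
    # csv.Sniffer picks the candidate delimiter that occurs consistently per line.
    dialect = csv.Sniffer().sniff(text, delimiters=delimiters)
    return encoding, dialect.delimiter


sample = "name;age\nAnna;31\nBjörn;45\n".encode("cp1252")
print(sniff_text_format(sample))
```

Trying a fixed list of encodings is cruder than statistical detection, but it shows the fallback structure: attempt strict decodings first and keep a lossless catch-all for last.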

## Column Type Overrides

Override the auto-detected type for any column before profiling:

```python
config = PipelineConfig()
config.set_column_type("PassengerId", "identifier")           # skip stats entirely
config.set_columns_type(["Survived", "Pclass"], "categorical")

result = StructuralProfiler(config).profile(df)
```

To drop columns from processing altogether, use `exclude_columns`:

```python
config = PipelineConfig(exclude_columns=["PassengerId", "Name"])
```

## Splitting

```python
from dataforge_ml import DataLoader, DataSplitter

df = DataLoader().load("titanic.csv")
splitter = DataSplitter(df, target="Survived", random_seed=42)

# Random train/test split (stratified by default when target is set)
split = splitter.random_split(test_size=0.2)
print(split.train.shape, split.test.shape)

# Chronological split (no temporal leakage).
# Requires a datetime column; replace "date" with the one in your dataset.
split = splitter.time_split(time_column="date", test_size=0.2)

# K-fold cross-validation
for fold in splitter.kfold(k=5):
    print(f"Fold {fold.fold_index}: train={fold.train_size}, val={fold.val_size}")
```
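
The ideas behind the stratified and chronological strategies can be sketched in plain Python. This is an illustration under stated assumptions, not `DataSplitter`'s implementation: `stratified_split` and `chronological_split` are hypothetical helpers that operate on lists of dicts, and the rounding of the test-set size is an arbitrary choice.

```python
import random
from collections import defaultdict
from datetime import date, timedelta


def stratified_split(rows, label_key, test_size=0.2, seed=42):
    """Split rows so each label keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_key(row)].append(row)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)  # shuffle within each label group
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


def chronological_split(rows, time_key, test_size=0.2):
    """Split rows so every test row is at least as recent as every training row."""
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1")
    ordered = sorted(rows, key=time_key)
    n_test = max(1, round(len(ordered) * test_size))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Toy data: 20 rows, 5 of them positive, one row per day.
rows = [
    {"day": date(2024, 1, 1) + timedelta(days=i), "label": int(i % 4 == 0)}
    for i in range(20)
]
train, test = stratified_split(rows, label_key=lambda r: r["label"])
print(len(train), len(test), sum(r["label"] for r in test))

train, test = chronological_split(rows, time_key=lambda r: r["day"])
print(train[-1]["day"], "<=", test[0]["day"])
```

Stratifying per label keeps rare classes represented in the test set, while the chronological variant never shuffles, so the model is always evaluated on data later than what it was trained on.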

## License

MIT
