Metadata-Version: 2.4
Name: autofepg
Version: 0.3.0
Summary: AutoFE - Playground: Automatic Feature Engineering & Selection for Kaggle Playground Competitions
Home-page: https://github.com/thomastschinkel/autofepg
Author: AutoFE-PG Contributors
License: MIT
Project-URL: Homepage, https://github.com/thomastschinkel/autofepg
Project-URL: Repository, https://github.com/thomastschinkel/autofepg
Project-URL: Issues, https://github.com/thomastschinkel/autofepg/issues
Keywords: feature-engineering,machine-learning,kaggle,playground,xgboost,automated-ml
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: xgboost>=1.7.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: flake8>=5.0; extra == "dev"
Requires-Dist: black>=22.0; extra == "dev"
Requires-Dist: isort>=5.0; extra == "dev"
Provides-Extra: gp
Requires-Dist: gplearn>=0.4.2; extra == "gp"
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# 🧪 AutoFE-PG

**Automatic Feature Engineering & Selection for Kaggle Playground Competitions**

![Python 3.8+](https://img.shields.io/badge/python-3.8%2B-blue)
![License: MIT](https://img.shields.io/badge/license-MIT-green)
![Version](https://img.shields.io/badge/version-0.3.0-orange)

AutoFE-PG automatically generates, evaluates, and selects engineered features for tabular ML models. Every target-dependent feature is computed out-of-fold (OOF) to avoid target leakage.

Version 0.3.0 is a complete refactoring focused on **general-purpose strategies** intended to work across tabular competitions rather than one specific dataset. It adds advanced binning, digit-based features, cyclical encoding, Weight of Evidence (WoE), and Genetic Programming interactions.

---

## ✨ Key Features

| Feature | Description |
|---|---|
| **Genetic Programming** | Generates complex non-linear interactions using `gplearn` |
| **Digit-Based Logic** | Extracts integer and decimal positions; creates digit-cross-category interactions |
| **Target Representation** | OOF Target Aggregation (mean, std, skew), WoE, and Entropy features |
| **Cyclical Encoding** | Sine/Cosine transformations for periodic numerical features |
| **Advanced Binning** | Both Quantile (qcut) and Equal-width (cut) discretization |
| **External Signal Injection** | Inject historical Priors, WoE, and Entropy from original datasets |
| **Zero Target Leakage** | All target-dependent features use strict out-of-fold (OOF) strategies |
| **Greedy Selection** | Forward selection keeps only features that improve CV score |
| **GPU Acceleration** | Built-in support for XGBoost GPU engines |
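
The greedy selection row can be pictured with a minimal sketch. It is not the library's internal code: `greedy_forward_select` and `score_fn` are hypothetical names, and a single pass over the candidates is an assumption. Only the `improvement_threshold` name and its default come from the configuration table in this README.

```python
def greedy_forward_select(base_features, candidates, score_fn,
                          improvement_threshold=1e-7):
    """Keep a candidate only if it raises the CV score by more than the threshold.

    score_fn is a hypothetical callable: it takes a list of feature names and
    returns a cross-validated score where higher is better.
    """
    selected = list(base_features)
    base_score = score_fn(selected)
    best_score = base_score
    added = []
    for name in candidates:
        trial_score = score_fn(selected + [name])
        if trial_score - best_score > improvement_threshold:
            # The candidate helped: keep it and raise the bar for the next one.
            selected.append(name)
            added.append(name)
            best_score = trial_score
        # Otherwise the candidate is discarded and the feature set is unchanged.
    return added, base_score, best_score
```

In practice `score_fn` would wrap a cross-validated model fit, which is why each accepted feature is expensive and a time budget is useful.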

---

## 🚀 Quick Start

### Installation

```bash
pip install autofepg
# Optional: Genetic Programming features (the "gp" extra installs gplearn)
pip install "autofepg[gp]"
```

### Basic Usage

```python
import pandas as pd
from autofepg import select_features

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

X_train = train.drop(columns=["id", "target"])
y_train = train["target"]
X_test = test.drop(columns=["id"])

result = select_features(
    X_train, y_train, X_test,
    task="classification",
    time_budget=3600  # 1 hour limit
)

X_train_new = result["X_train"]
X_test_new = result["X_test"]

print(f"Features added: {len(result['selected_features'])}")
print(f"CV Improvement: {result['base_score']:.6f} -> {result['best_score']:.6f}")
```

### Injecting Historical Signals (Original Data)

Kaggle Playground datasets are often synthetic versions of an original real-world dataset. If you have that original data, pass it in so that priors, WoE, and entropy computed from it are added as candidate features:

```python
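# Assumes original_df (the original dataset) and original_target (its target)
# have already been loaded.
result = select_features(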
    X_train, y_train, X_test,
    original_df=original_df,
    original_target=original_target,
    task="classification"
)
```

---

## 📖 Feature Strategies (v0.3.0)

### 1. Digits & Discretization
- **Digit Extraction**: Integer positions (units, tens, etc.) and decimal positions.
- **Digit Interactions**: Column-wise and cross-column interactions between digits.
- **Binning**: Discretize continuous variables via Quantile (qcut) or Equal-width (cut) bins.
- **Rounding**: Rounding to various decimal places or magnitudes to find structural splits.
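
The transforms in this group can be sketched with plain pandas and NumPy. This is an illustration under assumptions, not the library's implementation: `digit_at`, `add_digit_and_bin_features`, the output column names, and the `n_bins` parameter are all hypothetical.

```python
import numpy as np
import pandas as pd


def digit_at(values: pd.Series, position: int) -> pd.Series:
    """Return one digit of each value.

    position >= 0 selects integer digits (0 = units, 1 = tens, ...);
    position < 0 selects decimal digits (-1 = tenths, -2 = hundredths, ...).
    """
    # Shift the wanted digit into the units place. Rounding to 6 decimals
    # first guards against float artifacts such as 0.29 * 100 = 28.999...
    shifted = np.floor(np.round(values.abs() * 10.0 ** (-position), 6))
    return (shifted % 10).astype("Int64")  # nullable ints keep NaN as <NA>


def add_digit_and_bin_features(df: pd.DataFrame, col: str, n_bins: int = 4) -> pd.DataFrame:
    """Append digit, binning, and rounding features for one numeric column."""
    out = df.copy()
    x = out[col]
    out[f"{col}_units"] = digit_at(x, 0)
    out[f"{col}_tens"] = digit_at(x, 1)
    out[f"{col}_dec1"] = digit_at(x, -1)
    # Quantile bins hold roughly equal row counts; equal-width bins split the range.
    out[f"{col}_qbin"] = pd.qcut(x, q=n_bins, labels=False, duplicates="drop")
    out[f"{col}_wbin"] = pd.cut(x, bins=n_bins, labels=False)
    # Rounding to a decimal place and to a magnitude (nearest ten).
    out[f"{col}_r1"] = x.round(1)
    out[f"{col}_r10"] = np.round(x / 10) * 10
    return out
```

Digits are returned as nullable integers so they can be treated as small categorical features downstream.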

### 2. Specialized Encoding
- **Cyclical Encoding**: Sin/Cos transforms for periodic data.
- **Target Encoding (OOF)**: Out-of-fold mean target per category.
- **Weight of Evidence (WoE)**: OOF WoE scores for binary classification.
- **Entropy**: OOF target entropy per value group.
- **OOF Aggregation**: Mean, Std, and Skew of the target grouped by feature values.
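
To make the encodings above concrete, here is a small sketch of cyclical encoding and out-of-fold mean target encoding. The function names and the explicit `fold_id` argument are assumptions made for illustration; how the library assigns folds or handles unseen categories is not documented here.

```python
import numpy as np
import pandas as pd


def cyclical_encode(values: pd.Series, period: float):
    """Map a periodic value (e.g. hour of day, period=24) onto the unit circle."""
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)


def oof_target_encode(cat: pd.Series, y: pd.Series, fold_id) -> pd.Series:
    """Mean target per category, computed only from rows outside each row's fold."""
    fold_id = np.asarray(fold_id)
    encoded = pd.Series(np.nan, index=cat.index, dtype=float)
    for k in np.unique(fold_id):
        holdout = fold_id == k
        # Statistics come from the other folds only, so a row never sees its own target.
        fit_means = y[~holdout].groupby(cat[~holdout]).mean()
        fallback = y[~holdout].mean()  # used for categories unseen in the fit folds
        encoded.loc[holdout] = cat[holdout].map(fit_means).fillna(fallback)
    return encoded
```

A typical call would build `fold_id` from a shuffled K-fold split of the training rows; test rows would then be encoded with statistics from the full training set.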

### 3. Non-Linear Interactions
- **Genetic Programming**: Evolves mathematical expressions using the base features (requires `gplearn`).
- **Pair Interactions**: Label-encoding of categorical column pairs.
- **Numerical Products**: NaN-safe products of numerical column pairs.
- **Digit × Category**: Target encoding on the interaction of a column's digit and another category.
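
The two pairwise strategies can be sketched as follows. The helper names are hypothetical, and the reading of "NaN-safe" used here (non-numeric entries are coerced to NaN and the product is missing wherever either input is missing) is an assumption rather than documented behaviour.

```python
import numpy as np
import pandas as pd


def pair_label_encode(a: pd.Series, b: pd.Series) -> pd.Series:
    """Label-encode the combination of two categorical columns."""
    # astype(str) turns missing values into the string "nan",
    # so missing entries form their own category instead of being dropped.
    combined = a.astype(str) + "|" + b.astype(str)
    codes, _ = pd.factorize(combined)  # codes follow order of first appearance
    return pd.Series(codes, index=a.index)


def numeric_product(a: pd.Series, b: pd.Series) -> pd.Series:
    """Product of two numeric columns that tolerates missing or malformed entries."""
    left = pd.to_numeric(a, errors="coerce")
    right = pd.to_numeric(b, errors="coerce")
    return left * right  # NaN wherever either input is NaN
```

Tree models such as XGBoost handle the resulting missing values natively, so no imputation is applied in this sketch.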

### 4. External Data Signals
- **Bayesian Priors**: Historical `P(target|value)` from the original dataset.
- **External WoE**: WoE scores pre-computed from the original dataset.
- **External Entropy**: Group purity/impurity derived from the original dataset.
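
As an illustration of these external signals for a binary target, the sketch below builds a lookup table from the original dataset and maps it onto another column. The function names, the additive smoothing constant `eps`, and the WoE sign convention (log of the event share over the non-event share) are assumptions, not the library's documented behaviour.

```python
import numpy as np
import pandas as pd


def external_signal_table(orig_values: pd.Series, orig_target: pd.Series,
                          eps: float = 0.5) -> pd.DataFrame:
    """Per-value prior, WoE, and entropy computed from the original dataset."""
    grouped = orig_target.groupby(orig_values)
    pos = grouped.sum()            # positives per value (target is 0/1)
    neg = grouped.count() - pos    # negatives per value
    prior = pos / (pos + neg)      # P(target = 1 | value)
    # Smoothed Weight of Evidence; eps avoids log(0) for pure groups.
    woe = np.log(((pos + eps) / (pos.sum() + eps)) /
                 ((neg + eps) / (neg.sum() + eps)))
    # Binary entropy in bits; clipping avoids log2(0) for pure groups.
    p = prior.clip(1e-12, 1 - 1e-12)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    return pd.DataFrame({"prior": prior, "woe": woe, "entropy": entropy})


def apply_external_signals(values: pd.Series, table: pd.DataFrame) -> pd.DataFrame:
    """Look up each value in the table; values absent from the original data become NaN."""
    return pd.DataFrame({name: values.map(table[name]) for name in table.columns})
```

Because the table is built only from the original dataset, the competition's training targets are never used, which is why these features need no out-of-fold handling.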

---

## ⚙️ Configuration

| Parameter | Default | Description |
|---|---|---|
| `task` | `"auto"` | `"classification"`, `"regression"`, or `"auto"` |
| `n_folds` | `5` | Number of CV folds for evaluation |
| `time_budget` | `None` | Max wall-clock seconds for the search |
| `improvement_threshold` | `1e-7` | Min score delta to keep a feature |
| `sample` | `None` | Rows to sample for evaluation (speeds up search) |
| `gp_generations` | `5` | Evolution steps for Genetic Programming |
| `gp_n_components` | `5` | Max GP features to potentially keep |
| `original_df` | `None` | External dataset for Priors/WoE/Entropy |

---

## 📝 Changelog

### v0.3.0 (Current)
- **Refactoring**: Removed competition-specific features (Domain Alignment, Dataset Frequency, Rarity).
- **New Features**: Cyclical Features, OOF/External WoE, OOF/External Entropy, Genetic Programming (gplearn).
- **Enhanced Digits**: Added Decimal Digit extraction.
- **Enhanced Aggregation**: Added Skewness support to OOF Target Aggregation.
- **Simplified API**: Decoupled from specific dataset patterns; focused on universal engineering.

### v0.2.0
- Added original dataset support (Domain Alignment, Bayesian Priors).
- Introduced Cross-Dataset Frequency and Rarity features.

---

## 📄 License

MIT License — Copyright (c) 2026 Thomas Tschinkel.
