Metadata-Version: 2.4
Name: featurely
Version: 0.1.1
Summary: Reusable feature engineering utilities
Author: George Perdrizet
License: MIT
Project-URL: Repository, https://github.com/gperdrizet/featurely
Keywords: feature-engineering,machine-learning,pandas,scikit-learn,data-science
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Education
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: matplotlib>=3.7
Requires-Dist: scipy>=1.10
Requires-Dist: statsmodels>=0.14
Requires-Dist: scikit-learn>=1.3
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: ruff>=0.6; extra == "dev"
Requires-Dist: tomli>=2.0; python_version < "3.11" and extra == "dev"
Requires-Dist: trove-classifiers>=2025.5.9; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.5; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.25; extra == "docs"
Dynamic: license-file

# featurely

Reusable feature engineering utilities for tabular machine learning with pandas and scikit-learn.

featurely provides function-based helpers for the screen-then-commit feature engineering loop: build candidate features, test whether they explain variance your current model misses, and keep only the winners.

- **Pipeline evaluation**: cross-validated stage-over-stage comparison with persisted, rerun-safe results and progressive box plots.
- **Candidate screening**: residual correlation scans with Benjamini-Hochberg false discovery rate correction, for individual features and grouped feature sets.
- **Feature builders**: outlier handling, monotonic transforms, geographic encodings (haversine distances, geohash cells, rotated coordinates), quantile bin aggregates, k-means cluster memberships, Gaussian kernel spatial smoothing, and polynomial expansion with PCA component selection.
- **Diagnostics and EDA**: distribution plots, pairwise correlation analysis, and variance inflation factors.
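
The Benjamini-Hochberg correction mentioned under candidate screening can be sketched in plain NumPy. This is an illustration of the general procedure, not featurely's implementation; the function name `benjamini_hochberg` and the `alpha` parameter are assumptions made for this example.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Return (reject, adjusted) for a 1-D collection of p-values.

    Hypothetical helper for illustration only; not part of featurely.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size

    # Sort ascending and scale each p-value by m / rank.
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)

    # Enforce monotonicity: each adjusted value is the minimum of itself
    # and every adjusted value at a higher rank.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted_sorted = np.clip(adjusted_sorted, 0.0, 1.0)

    # Put the adjusted values back in the original order.
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted <= alpha, adjusted


reject, adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20], alpha=0.05)
```

Candidates whose adjusted p-value falls at or below `alpha` are the ones a screen of this kind would flag as significant.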

All helpers accept a pandas DataFrame, take explicit column names, and return transformed copies without mutating the input.
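
To make that convention concrete, here is a minimal sketch of a haversine distance builder written in the same style. `add_haversine_km`, its parameters, and the `dist_km` column name are hypothetical names chosen for this example; featurely's own geographic helpers may differ in name and signature.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def add_haversine_km(df, lat_col, lon_col, ref_lat, ref_lon, out_col="dist_km"):
    """Return a copy of df with the great-circle distance to a reference point.

    Hypothetical helper for illustration only; not part of featurely.
    """
    lat1 = np.radians(df[lat_col].to_numpy(dtype=float))
    lon1 = np.radians(df[lon_col].to_numpy(dtype=float))
    lat2, lon2 = np.radians(ref_lat), np.radians(ref_lon)

    # Haversine formula
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )

    out = df.copy()  # leave the caller's DataFrame untouched
    out[out_col] = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))
    return out


points = pd.DataFrame({"latitude": [0.0, 0.0], "longitude": [0.0, 90.0]})
with_dist = add_haversine_km(points, "latitude", "longitude", ref_lat=0.0, ref_lon=0.0)
```

The original `points` frame keeps its two columns; only the returned copy carries the new distance column.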

## Install

```bash
pip install featurely
```

Requires Python 3.10 or newer.

## Quick start

```python
import pandas as pd
import featurely as fl

df = pd.read_csv("my_data.csv")
target = "price"
features = [c for c in df.columns if c != target]

# Establish a baseline
results = fl.add_pipeline_step(None, "raw", df[features], df[target])

# Clean outliers and measure the effect
df_clean = fl.clip_outliers(df, features, threshold=2.25)

results = fl.add_pipeline_step(
    results, "+ cleaned", df_clean[features], df_clean[target]
)

fl.plot_pipeline_steps(results, title="Effect of outlier clipping")

# Build candidates and screen them against baseline residuals
candidates = fl.compute_bin_aggregates(df_clean, "latitude", ["income"], n_bins=10)
scan = fl.run_candidate_scan(df_clean, candidates, target=target)
significant = fl.plot_candidate_scan(scan, title="Candidate scan")

keep = [name for name, is_sig in significant.items() if is_sig]
df_clean = pd.concat([df_clean, candidates[keep]], axis=1)
```
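
The idea behind screening candidates against baseline residuals can be shown without featurely at all. The sketch below uses synthetic data and an ordinary least-squares baseline, both of which are assumptions made for illustration; `residual_correlation` is a hypothetical helper, and featurely's scan may use a different model and test statistic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

x = rng.normal(size=n)              # feature the baseline already uses
hidden = rng.normal(size=n)         # candidate that carries real signal
noise_feature = rng.normal(size=n)  # candidate that carries none
y = 2.0 * x + 1.5 * hidden + rng.normal(scale=0.1, size=n)


def residual_correlation(X, y, candidate):
    """Correlate a candidate with the residuals of a least-squares baseline."""
    X1 = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residuals = y - X1 @ coef
    return np.corrcoef(residuals, candidate)[0, 1]


r_hidden = residual_correlation(x, y, hidden)
r_noise = residual_correlation(x, y, noise_feature)
```

A candidate that explains variance the baseline misses correlates strongly with the residuals (`r_hidden`), while an uninformative one stays near zero (`r_noise`). Turning those correlations into p-values and correcting them for multiple testing is what separates winners from chance hits.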

## Documentation

The full API reference, a getting-started guide, and a complete worked example on the California housing dataset are available at:

- Documentation: [gperdrizet.github.io/featurely](https://gperdrizet.github.io/featurely/)
- Source and example notebooks: [github.com/gperdrizet/featurely](https://github.com/gperdrizet/featurely)

## License

MIT
