Metadata-Version: 2.4
Name: boruta-quant
Version: 0.1.0
Summary: Temporal-aware Boruta feature selection for quantitative finance. OOS-only importance with purged cross-validation.
Project-URL: Homepage, https://github.com/BlackArbsCEO/boruta-quant
Project-URL: Documentation, https://github.com/BlackArbsCEO/boruta-quant#readme
Project-URL: Repository, https://github.com/BlackArbsCEO/boruta-quant
Project-URL: Issues, https://github.com/BlackArbsCEO/boruta-quant/issues
Project-URL: Changelog, https://github.com/BlackArbsCEO/boruta-quant/blob/main/CHANGELOG.md
Author-email: BlackArbsCEO <bcr@blackarbs.com>
Maintainer-email: BlackArbsCEO <bcr@blackarbs.com>
License: MIT
License-File: LICENSE
Keywords: boruta,cross-validation,feature-selection,machine-learning,out-of-sample,permutation-importance,quantitative-finance,temporal-cv
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Office/Business :: Financial :: Investment
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: beartype>=0.18.0
Requires-Dist: numpy>=2.0.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: scipy>=1.11.0
Provides-Extra: all
Requires-Dist: lightgbm>=4.0.0; extra == 'all'
Requires-Dist: matplotlib>=3.7.0; extra == 'all'
Requires-Dist: seaborn>=0.13.0; extra == 'all'
Requires-Dist: shap>=0.48.0; extra == 'all'
Requires-Dist: xgboost>=2.0.0; extra == 'all'
Provides-Extra: lightgbm
Requires-Dist: lightgbm>=4.0.0; extra == 'lightgbm'
Provides-Extra: shap
Requires-Dist: shap>=0.48.0; extra == 'shap'
Provides-Extra: viz
Requires-Dist: matplotlib>=3.7.0; extra == 'viz'
Requires-Dist: seaborn>=0.13.0; extra == 'viz'
Provides-Extra: xgboost
Requires-Dist: xgboost>=2.0.0; extra == 'xgboost'
Description-Content-Type: text/markdown

# boruta-quant

[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)

**Temporal-aware Boruta feature selection for quantitative finance.**

`boruta-quant` computes feature importance on validation data only, using purged cross-validation to prevent lookahead bias. Built for financial time series where temporal integrity matters.

## Why boruta-quant?

Standard feature selection (SHAP, sklearn permutation importance) is typically run on the same data the model was trained on. In financial time series, in-sample importance combined with naive cross-validation splits can leak future information into feature rankings. `boruta-quant` addresses this in three ways:

- **OOS-Only Importance**: Importance computed exclusively on validation folds
- **Purged Cross-Validation**: Train/test gap with purge and embargo windows
- **Shadow Features**: Boruta's all-relevant selection via shadow comparison
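
The shadow comparison in the last bullet can be sketched with NumPy alone. This is a simplified illustration of one Boruta trial, not the library's implementation: `make_shadows` and `abs_corr_importance` are hypothetical helpers, and the toy importance (absolute correlation, computed in-sample) stands in for the validation-fold importance that `boruta-quant` uses.

```python
import numpy as np

def make_shadows(X: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a copy of X with each column independently permuted."""
    shadows = X.copy()
    for j in range(shadows.shape[1]):
        shadows[:, j] = rng.permutation(shadows[:, j])
    return shadows

def abs_corr_importance(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Toy importance: absolute Pearson correlation of each column with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=500)  # only column 0 is informative

shadows = make_shadows(X, rng)
real_imp = abs_corr_importance(X, y)
shadow_imp = abs_corr_importance(shadows, y)

# A real feature scores a "hit" when it beats the chosen percentile
# of the shadow importances (100 means the maximum shadow importance).
threshold = np.percentile(shadow_imp, 100)
hits = real_imp > threshold
```

Repeating this over many trials and testing the hit counts for significance is what separates accepted, rejected, and tentative features in the Boruta procedure.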

## Installation

```bash
# Basic (permutation importance only)
pip install boruta-quant

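# With LightGBM (quotes stop shells such as zsh from globbing the brackets)
pip install "boruta-quant[lightgbm]"

# With XGBoost
pip install "boruta-quant[xgboost]"

# With SHAP support
pip install "boruta-quant[shap]"

# With plotting (matplotlib, seaborn)
pip install "boruta-quant[viz]"

# Everything
pip install "boruta-quant[all]"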
```

### Development

```bash
git clone https://github.com/BlackArbsCEO/boruta-quant.git
cd boruta-quant
uv sync --all-extras --dev
```

## Quick Start

```python
from boruta_quant import BorutaSelector, BorutaSelectorConfig
from boruta_quant.oracle import PermutationImportanceOracle
from boruta_quant.temporal import PurgedTemporalCV, PurgedCVConfig
from boruta_quant.metrics import rank_ic_scorer
from lightgbm import LGBMRegressor  # requires the lightgbm extra

# 1. Configure purged temporal CV
cv = PurgedTemporalCV(PurgedCVConfig(
    n_splits=5,
    purge_window_days=5,      # gap before validation fold
    embargo_window_days=5,    # gap after validation fold
    min_train_size=100,
    test_size_ratio=0.2,
))

# 2. Configure importance oracle (OOS-only)
oracle = PermutationImportanceOracle(
    scoring=rank_ic_scorer,   # Spearman rank correlation
    n_repeats=10,
    random_state=42,
)

# 3. Configure Boruta selector
selector = BorutaSelector(
    config=BorutaSelectorConfig(
        n_trials=20,          # Boruta iterations
        percentile=100,       # shadow threshold percentile
        alpha=0.05,           # significance level
        two_step=True,        # resolve tentative features
        random_state=42,
    ),
    oracle=oracle,
    cv=cv,
)

# 4. Fit: the model is passed to fit(), not to the constructor.
#    X, y, and timestamps are your own data and are not defined in this snippet.
result = selector.fit(
    X, y,
    timestamps=timestamps,   # must be timezone-aware
    model=LGBMRegressor(n_estimators=100, random_state=42),
)

# 5. Results
print(result.accepted_features)    # confirmed important
print(result.rejected_features)    # confirmed unimportant
print(result.tentative_features)   # borderline (resolved if two_step=True)
```
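
The Quick Start assumes `X`, `y`, and `timestamps` already exist. For experimentation, synthetic inputs could be built roughly as follows. This is a sketch under an assumption the README does not confirm: that `fit` accepts a pandas `DataFrame`, a `Series`, and a timezone-aware `DatetimeIndex`. The column names and the data-generating process are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_samples = 500

# Sorted, timezone-aware business-day timestamps
timestamps = pd.date_range("2020-01-01", periods=n_samples, freq="B", tz="UTC")

# Five candidate features (hypothetical names)
X = pd.DataFrame(
    rng.normal(size=(n_samples, 5)),
    index=timestamps,
    columns=[f"feature_{i}" for i in range(5)],
)

# The target depends on feature_0 only; the other columns are pure noise
noise = rng.normal(scale=1.0, size=n_samples)
y = (0.5 * X["feature_0"] + noise).rename("target")
```

With data like this, a working selector would be expected to favor `feature_0` over the noise columns, although the outcome depends on the model and configuration.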

## Importance Oracles

All oracles fit the model on training data but measure importance on validation data only.

| Oracle | How it works | When to use |
|--------|-------------|-------------|
| `PermutationImportanceOracle` | Shuffles one feature in the validation set and measures the drop in score | Default: reliable, no refit needed |
| `DropColumnImportanceOracle` | Removes the feature, refits the model, and measures the drop in validation score | When refit cost is acceptable |
| `BlockPermutationImportanceOracle` | Block-shuffles the feature in the validation set (preserves autocorrelation structure) | Autocorrelated time series |
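
To make the "fit on training data, measure on validation data" idea concrete, here is a minimal NumPy sketch of validation-only permutation importance. It is not the code behind `PermutationImportanceOracle`: the function `oos_permutation_importance`, the least-squares model, and the negative-MSE score are all assumptions chosen to keep the example dependency-free.

```python
import numpy as np

def oos_permutation_importance(predict, score, X_val, y_val, n_repeats, rng):
    """Mean drop in validation score when each column is shuffled in turn."""
    baseline = score(y_val, predict(X_val))
    importances = np.zeros(X_val.shape[1])
    for j in range(X_val.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - score(y_val, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances

def neg_mse(y_true, y_pred):
    """Higher is better, so a drop in this score means worse predictions."""
    return -np.mean((y_true - y_pred) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=600)

# Chronological split: fit on the first 400 rows, score on the last 200
X_train, y_train = X[:400], y[:400]
X_val, y_val = X[400:], y[400:]

# The model never sees the validation rows during fitting
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
importances = oos_permutation_importance(
    lambda A: A @ coef, neg_mse, X_val, y_val, n_repeats=10, rng=rng
)
```

Because the model is not refit, the cost is one prediction pass per feature per repeat, which is the reason the table lists permutation as the cheap default.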

## Temporal Cross-Validation

```
     Training         Purge   Validation   Embargo
  |--------------| |-------| |----------| |-------|
  ^                                                ^
  train_start                               embargo_end
```

- **Purge**: removes observations that could leak into validation
- **Embargo**: prevents information from validation bleeding forward
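
The purge and embargo windows can be expressed as index arithmetic. The sketch below is an illustration only and differs from `PurgedTemporalCV` in ways that are labeled assumptions: it counts windows in rows rather than calendar days, and it follows the general scheme in the López de Prado reference by also keeping rows after the embargo for training, which the library may or may not do. `purged_split` is a hypothetical helper.

```python
import numpy as np

def purged_split(n_samples, val_start, val_end, purge, embargo):
    """Return (train_idx, val_idx) for one fold, using integer row positions.

    Rows in [val_start - purge, val_start) and [val_end, val_end + embargo)
    are excluded from training.
    """
    idx = np.arange(n_samples)
    val_idx = idx[val_start:val_end]
    before = idx[: max(val_start - purge, 0)]          # training rows before the purge window
    after = idx[min(val_end + embargo, n_samples):]    # training rows after the embargo window
    return np.concatenate([before, after]), val_idx

train_idx, val_idx = purged_split(
    n_samples=100, val_start=60, val_end=80, purge=5, embargo=5
)
```

In this example rows 55 to 59 are purged and rows 80 to 84 are embargoed, so neither range appears in `train_idx`.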

## Shadow Shuffle Modes

Shadow features are shuffled copies of real features. The shuffle mode controls how temporal structure is handled:

| Mode | Description | Use case |
|------|-------------|----------|
| `ShuffleMode.RANDOM` | Standard i.i.d. permutation | Default — i.i.d. data |
| `ShuffleMode.BLOCK` | Block-preserving shuffle | Autocorrelated features |
| `ShuffleMode.ERA` | Shuffle within eras only | Regime-aware selection |
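
The difference between the block and era modes is easiest to see on a small array. The functions below are hypothetical stand-ins written from the one-line descriptions in the table; the library's actual block sizing and era handling are not documented here and may differ.

```python
import numpy as np

def block_shuffle(values, block_size, rng):
    """Permute contiguous blocks while keeping the order inside each block."""
    n_blocks = int(np.ceil(len(values) / block_size))
    blocks = [values[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
    order = rng.permutation(n_blocks)
    return np.concatenate([blocks[i] for i in order])

def era_shuffle(values, eras, rng):
    """Permute values independently within each era label."""
    out = values.copy()
    for era in np.unique(eras):
        mask = eras == era
        out[mask] = rng.permutation(values[mask])
    return out

rng = np.random.default_rng(0)
values = np.arange(12.0)
eras = np.repeat([0, 1, 2], 4)   # three eras of four rows each

blocked = block_shuffle(values, block_size=4, rng=rng)
within_era = era_shuffle(values, eras, rng)
```

Block shuffling keeps short-range autocorrelation intact inside each block, while era shuffling keeps every value inside its original regime.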

## Metrics

| Function | Description |
|----------|-------------|
| `rank_ic` | Spearman correlation between predictions and actuals |
| `rank_ic_scorer` | sklearn-compatible scorer wrapping `rank_ic` |
| `directional_accuracy` | Fraction of correct sign predictions (up vs down) |
| `directional_accuracy_scorer` | sklearn-compatible scorer wrapping `directional_accuracy` |
| `auc_score` | Area under ROC curve |
| `auc_scorer` | sklearn-compatible scorer wrapping `auc_score` |
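
For reference, the first two metrics can be written in a few lines of NumPy. These are sketches of the standard definitions, not the library's functions: the `_sketch` names and `average_ranks` are hypothetical, and the treatment of zeros in directional accuracy (a zero only matches another zero) is an assumption.

```python
import numpy as np

def average_ranks(a):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = np.argsort(a, kind="stable")
    ranks = np.empty(len(a), dtype=float)
    ranks[order] = np.arange(1, len(a) + 1)
    _, inverse, counts = np.unique(a, return_inverse=True, return_counts=True)
    sums = np.bincount(inverse, weights=ranks)
    return sums[inverse] / counts[inverse]

def rank_ic_sketch(predictions, actuals):
    """Spearman correlation: Pearson correlation of the rank-transformed inputs."""
    rp = average_ranks(np.asarray(predictions, dtype=float))
    ra = average_ranks(np.asarray(actuals, dtype=float))
    return float(np.corrcoef(rp, ra)[0, 1])

def directional_accuracy_sketch(predictions, actuals):
    """Fraction of observations where prediction and actual share a sign."""
    p = np.asarray(predictions, dtype=float)
    a = np.asarray(actuals, dtype=float)
    return float(np.mean(np.sign(p) == np.sign(a)))

# Any monotone relationship gives a rank IC of 1, even if it is nonlinear
ic_monotone = rank_ic_sketch([0.1, 0.2, 0.3, 0.4], [1.0, 4.0, 9.0, 16.0])

# Two of the four signs agree
hit_rate = directional_accuracy_sketch([0.5, -0.2, 0.1, -0.3], [0.1, 0.3, -0.2, -0.4])
```

Rank IC rewards getting the ordering right regardless of scale, which is why it is a common choice for cross-sectional return forecasts.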

## Design Principles

1. **OOS-Only**: Importance never computed on training data
2. **Fail-Fast**: Invalid temporal data (naive timestamps, unsorted) raises immediately
3. **Type-Safe**: Runtime enforcement with beartype, static checking with strict Pyright
4. **Explicit**: All config parameters required — no hidden defaults
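
The fail-fast principle amounts to checking temporal inputs before any fitting starts. A minimal sketch with pandas follows; `validate_timestamps`, the error messages, and the use of `ValueError` are assumptions for illustration, since this README does not say which exception the library raises.

```python
import pandas as pd

def validate_timestamps(timestamps: pd.DatetimeIndex) -> None:
    """Raise ValueError on timezone-naive or unsorted timestamps."""
    if timestamps.tz is None:
        raise ValueError("timestamps must be timezone-aware")
    if not timestamps.is_monotonic_increasing:
        raise ValueError("timestamps must be sorted in ascending order")

good = pd.date_range("2024-01-01", periods=5, freq="D", tz="UTC")
validate_timestamps(good)  # passes silently

naive = pd.date_range("2024-01-01", periods=5, freq="D")
try:
    validate_timestamps(naive)
except ValueError as exc:
    error_message = str(exc)
```

Raising at the boundary keeps a silently misaligned index from producing plausible-looking but leaked importance scores later in the pipeline.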

## References

- [Boruta Algorithm](https://www.jstatsoft.org/article/view/v036i11) — Kursa & Rudnicki (2010)
- [Advances in Financial Machine Learning](https://www.wiley.com/en-us/Advances+in+Financial+Machine+Learning-p-9781119482086) — Lopez de Prado (2018), Ch. 7 (purged CV)

## License

MIT — see [LICENSE](LICENSE).
