Metadata-Version: 2.4
Name: uefds
Version: 1.0.1
Summary: Universal Evolutionary Feature Discovery and Selection Framework
Author: Panhapich Uk
License: MIT
Project-URL: Repository, https://github.com/Pich09/uefds---GA
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.24
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.10
Requires-Dist: xgboost>=2.0
Requires-Dist: shap>=0.44
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Provides-Extra: viz
Requires-Dist: matplotlib>=3.7; extra == "viz"

# UEFDS: Universal Evolutionary Feature Discovery and Selection Framework

UEFDS is a feature-evolution pipeline for automated feature engineering:

1. Stage 1 creates many new feature formulas from your raw features.
2. Compression keeps only distinct survivors (removes near-duplicates).
3. Stage 2 chooses the final survivor subset for model training and usage.

The goal is to produce a large, useful feature library for users to reuse, then a compact final subset for deployment, with lineage tracking so every surviving feature can be traced back to its parents.

## 1) What Problem This Solves

Many datasets need feature engineering before models perform well. Manually crafting features is slow and fragile. UEFDS automates this process by:

- Growing a large pool of derived feature formulas from raw inputs.
- Evolving and scoring formulas generation by generation.
- Keeping stronger, less redundant feature survivors.
- Returning both an expandable candidate library and a final selected subset.

## 2) End-to-End Pipeline

```mermaid
flowchart TD
    A[Raw X, y] --> B[Stage 1: Feature Discovery]
    B --> C[All Generated Features]
    C --> D[Hall of Fame Survivors]
    D --> E[Compression: Keep Distinct Survivors]
    E --> F[Candidate Feature Library for User]
    F --> G[Stage 2: Final Survivor Subset]
    G --> H[Best Subset + Final XGBoost Model]
```

### Feature Evolution View (What Survives)

```mermaid
flowchart LR
    R[Raw Features] --> G0[Gen 0: Random Features]
    G0 --> G1[Gen 1: New Derived Features]
    G1 --> G2[Gen 2: More Derived Features]
    G2 --> HF[Hall of Fame Survivors]
    HF --> CP[Compressed Survivors]
    CP --> S2[Stage 2 Subset Survivors]
```

Think of UEFDS as survival-of-the-most-useful-features, not survival-of-operations.

## 3) Installation

### Core + Dev + Visualization

```bash
pip install -e ".[dev,viz]"
```

Core dependencies include:

- numpy
- scikit-learn
- scipy
- xgboost
- shap

Optional extras:

- dev: pytest
- viz: matplotlib

## 4) Quick Start

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

from uefds import (
    Stage1Config,
    run_stage1,
    build_candidate_library,
    Stage2Config,
    run_stage2,
)

# Load data
bundle = load_breast_cancer()
X, y = bundle.data, bundle.target
feature_names = list(bundle.feature_names)

# Stage 1: discover candidate formulas
stage1_result = run_stage1(
    X,
    y,
    feature_names,
    Stage1Config(
        population_size=40,
        n_generations=15,
        task="classification",
        protected_features=[0],
        presence_quota_k=2,
        seed=42,
    ),
)

print(stage1_result["hall_of_fame"].summary(n=5))

# Compress Hall of Fame into candidate library
candidates = build_candidate_library(
    stage1_result["hall_of_fame"],
    stage1_result["trees_by_id"],
    X,
    feature_names,
    tau=0.9,
    top_n=30,
)

X_candidates = np.column_stack([c.values for c in candidates])

# Stage 2: evolve best subset
stage2_result = run_stage2(
    candidates,
    X_candidates,
    y,
    Stage2Config(
        population_size=20,
        n_generations=10,
        task="classification",
        min_features=2,
        max_features=min(10, len(candidates)),
        cv_folds=3,
        seed=42,
    ),
)

print("Best formulas:")
for f in stage2_result["best_formulas"]:
    print(" ", f)

print("Final metrics:", stage2_result["final_metrics"])
```

## 5) Stage 1: Feature Discovery

### What evolves in Stage 1

Each chromosome is one candidate feature formula (represented internally as an expression tree):

- Leaves reference raw feature indices.
- Internal structure defines how a new derived feature is computed.
- The output of each chromosome is a feature vector over all rows.

### Fitness function

Stage 1 fitness is a weighted sum:

$$
F_i = w_Q Q_i + w_D D_i + w_S S_i
$$

Where:

- $Q_i$: quality score (mutual information with target)
- $D_i$: diversity score ($1 - \max |corr|$ against already-kept features in evaluation pass)
- $S_i$: simplicity score from tree size

Default Stage 1 weights:

- quality: 0.5
- diversity: 0.3
- simplicity: 0.2

### How feature survivors are chosen each generation

- Every generated feature gets a fitness score (quality + diversity + simplicity).
- Top features survive directly via elitism.
- Additional features are created from survivors and re-scored.
- Over generations, weak/redundant features are naturally filtered out.

Internal breeding operators exist, but from a user perspective the key output is the survivor feature pool, not the operator details.

### Protection mechanism

You can protect important raw features by index using:

- protected_features
- presence_quota_k

UEFDS enforces a minimum population presence for protected features each generation.

### Stage 1 output keys

run_stage1 returns:

- hall_of_fame: top formulas archive
- operator_learner: operator success tracker
- evolution_history: generation and lineage records
- final_population: last generation trees
- trees_by_id: dictionary of created trees by tree_id
- best_tree_id: best overall tree_id from Hall of Fame

If your goal is to create more features for downstream usage, focus on:

- hall_of_fame (best discovered formulas)
- trees_by_id (full feature objects)
- build_candidate_library(...) output (clean, reusable candidate features)

## 6) Compression Layer

Compression keeps only distinct feature survivors before Stage 2.

### Method

1. Evaluate selected Hall-of-Fame trees on X.
2. Compute correlation matrix among candidate feature vectors.
3. Convert to distance:

$$
\text{distance}(f_i, f_j) = 1 - |corr(f_i, f_j)|
$$

4. Perform hierarchical clustering.
5. Keep one representative survivor per cluster (highest fitness).

Each compressed candidate stores:

- tree_id
- formula
- fitness
- values
- complexity (tree node count)

## 7) Stage 2: Feature Subset Selection

Stage 2 chromosome = a subset choice over compressed feature survivors.

### Stage 2 fitness

Current Stage 2 fitness combines:

$$
F = w_P P + w_{St} St + w_{Sh} Sh + w_{Si} Si - w_C C
$$

Where:

- $P$: cross-validated model performance (AUC for classification, $R^2$ for regression)
- $St$: fold stability term
- $Sh$: SHAP efficiency score
- $Si$: simplicity term derived from candidate complexity
- $C$: feature count penalty

Default Stage 2 weights:

- performance: 0.50
- stability: 0.15
- shap: 0.15
- simplicity: 0.10
- count_penalty: 0.10

### Why this matters

- Performance ensures predictive power.
- Stability avoids fragile subsets.
- SHAP efficiency prefers non-redundant explanatory features.
- Simplicity favors less complex formulas.
- Count penalty discourages oversized subsets.

### Stage 2 output keys

run_stage2 returns:

- selection_history: subset lineage records
- best_subset_id
- best_feature_indices
- best_formulas
- final_model (trained XGBoost model)
- final_metrics

## 8) Lineage and Interpretability

UEFDS tracks ancestry in both stages.

- Stage 1: trace feature formulas back to random-init ancestors.
- Stage 2: trace winning subset ancestry by parent subset IDs.

Examples:

```python
stage1_result["evolution_history"].print_lineage(stage1_result["best_tree_id"])
stage2_result["selection_history"].print_lineage(stage2_result["best_subset_id"])
```

## 9) Visualization

Run the visualization demo:

```bash
python examples/demo_visualization.py
```

It generates figures in examples/viz_out for:

- best tree
- generation gallery
- fitness evolution
- feature presence curves
- lineage graph

## 10) Example Scripts

- examples/demo_stage1.py
  - Stage 1 only
  - prints Hall of Fame and feature presence summary

- examples/demo_full_pipeline.py
  - Stage 1 + Compression + Stage 2
  - prints final subset and final metrics

- examples/demo_visualization.py
  - Stage 1 with plotting pipeline

## 11) Running Tests

```bash
pytest -q
```

## 12) Repository Layout

- src/uefds/operators.py: operator library and operator learner
- src/uefds/tree.py: expression-tree chromosome and genetic ops
- src/uefds/fitness.py: Stage 1 scoring
- src/uefds/hall_of_fame.py: top-discovery archive
- src/uefds/evolution_history.py: lineage and generation logs
- src/uefds/stage1_discovery.py: Stage 1 loop
- src/uefds/compression.py: correlation clustering compression
- src/uefds/model_engine.py: XGBoost training and CV scoring
- src/uefds/shap_layer.py: SHAP importance and redundancy logic
- src/uefds/stage2_selection.py: Stage 2 loop
- src/uefds/visualization.py: plotting utilities
- src/uefds/__init__.py: package API exports

## 13) Practical Notes

- Seed all runs for reproducibility using Stage1Config.seed and Stage2Config.seed.
- Candidate count and Stage 2 population size strongly affect runtime.
- Start with small generation counts while iterating.
- Increase top_n and Stage 2 generations for stronger final search when ready.

## 14) Minimal Troubleshooting

- Import errors for xgboost/shap: install with pip install -e ".[dev,viz]".
- Empty candidate library: increase top_n, reduce compression strictness (lower tau), or increase Stage 1 generations.
- Very slow Stage 2: reduce population_size, n_generations, or cv_folds.

## 15) License

MIT
