Metadata-Version: 2.4
Name: coevopy
Version: 0.4
Summary: CoEvoPy - A Co-Evolutionary Genetic Algorithm Library for Feature Selection in High-Dimensional Data
Author: Chanaka Sandaruwan
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.21.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: joblib>=1.1.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: threadpoolctl>=2.0.0
Dynamic: author
Dynamic: description
Dynamic: description-content-type
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# CoEvoPy — Co-Evolutionary Feature Selection

**CoEvoPy** is a Python library for automated feature selection using a **Co-Evolutionary Genetic Algorithm (Co-EGA)**. Instead of evolving a single population over all features, it partitions features into sub-populations that evolve independently while cooperating during fitness evaluation. Each sub-population searches a much smaller space, which keeps the approach practical on high-dimensional datasets.

---

## Table of Contents

- [How It Works](#how-it-works)
- [Installation](#installation)
- [Project Structure](#project-structure)
- [Quick Start](#quick-start)
- [Step-by-Step Usage Guide](#step-by-step-usage-guide)
  - [1. Load Your Data](#1-load-your-data)
  - [2. Preprocess](#2-preprocess)
  - [3. Create Sub-Populations](#3-create-sub-populations)
  - [4. Run the Optimizer](#4-run-the-optimizer)
  - [5. Apply the Feature Mask](#5-apply-the-feature-mask)
- [Full End-to-End Example](#full-end-to-end-example)
- [Integration with scikit-learn](#integration-with-scikit-learn)
- [Saving and Reusing a Feature Mask](#saving-and-reusing-a-feature-mask)
- [Parameter Reference](#parameter-reference)
- [Tips and Tuning](#tips-and-tuning)

---

## How It Works

CoEvoPy splits the feature space into overlapping or non-overlapping **sub-populations**. Each sub-population evolves its own binary chromosome (1 = keep feature, 0 = drop it). Fitness is evaluated by combining the current sub-population's candidate solution with the best-known solutions from every other sub-population, then scoring with cross-validated accuracy. This cooperation mechanism lets sub-populations specialise while remaining globally aware.

```
Features ──► make_subpops ──► [SubPop A] [SubPop B] [SubPop C]
                                    │           │           │
                               Co-evolve with shared best solutions
                                    │           │           │
                               final_mask (boolean, length = n_features)
```
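
The cooperative evaluation described above can be sketched as follows. This is an illustration only, not CoEvoPy's internal code: the helper `cooperative_fitness`, the choice of `LogisticRegression` as the scoring model, and the synthetic dataset are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def cooperative_fitness(candidate, subpop_id, subpops, best_solutions, X, y, cv=2):
    """Score one candidate in the context of the other sub-populations' best solutions.

    Hypothetical helper: the real library may use a different model or scoring rule.
    """
    full_mask = np.zeros(X.shape[1], dtype=bool)
    for i, idx in enumerate(subpops):
        # Use the candidate for the sub-population under evaluation and the
        # best-known chromosome for every other sub-population.
        genes = candidate if i == subpop_id else best_solutions[i]
        full_mask[idx] = genes.astype(bool)
    if not full_mask.any():
        return 0.0  # an empty feature set cannot be scored
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, full_mask], y, cv=cv, scoring="accuracy").mean()


X, y = make_classification(n_samples=200, n_features=12, n_informative=4, random_state=0)
subpops = [np.arange(0, 6), np.arange(6, 12)]  # two groups of feature indices
rng = np.random.default_rng(0)
best_solutions = [rng.integers(0, 2, size=len(idx)) for idx in subpops]
candidate = rng.integers(0, 2, size=len(subpops[0]))

score = cooperative_fitness(candidate, 0, subpops, best_solutions, X, y)
print(f"Cooperative fitness of candidate: {score:.3f}")
```

The key design point is that a chromosome is never scored in isolation: its fitness depends on what the other sub-populations currently consider their best solution.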

---

## Installation

```bash
# From source
git clone https://github.com/your-org/coevopy.git
cd coevopy
pip install -e .

# Dependencies (pulled in automatically by the install above; listed for manual setups)
pip install numpy scikit-learn joblib pandas threadpoolctl
```

---

## Project Structure

```
coevopy/
├── __init__.py
├── core.py          # GA primitives: init, selection, crossover, mutation, fitness
├── data.py          # DataLoader: load_tabular, preprocess_tabular
└── optimizer.py     # CoEvoFeatureSelector: the main class
```

---

## Quick Start

```python
from coevopy.data import DataLoader
from coevopy.core import make_subpops
from coevopy.optimizer import CoEvoFeatureSelector

loader = DataLoader()
raw_X, y, feature_names = loader.load_tabular("data.csv")
X, feature_names = loader.preprocess_tabular(raw_X)

subpops = make_subpops(X, target_size=50)

optimizer = CoEvoFeatureSelector(pop_size=25, generations=6, n_jobs=-1)
final_mask = optimizer.fit(X, y, subpops)

X_selected = X[:, final_mask]
print(f"Selected {final_mask.sum()} of {X.shape[1]} features")
```

---

## Step-by-Step Usage Guide

### 1. Load Your Data

`DataLoader.load_tabular` reads a CSV (or any tabular file) and returns the raw feature matrix, labels, and column names.

```python
from coevopy.data import DataLoader

loader = DataLoader()
raw_X, y, feature_names = loader.load_tabular("finance_data.csv")

print(f"Shape  : {raw_X.shape}")          # e.g. (5000, 120)
print(f"Classes: {set(y)}")               # e.g. {0, 1}
print(f"Features (first 5): {feature_names[:5]}")
```

**Returns**

| Variable | Type | Description |
|---|---|---|
| `raw_X` | `np.ndarray` or `pd.DataFrame` | Raw feature matrix, may contain categoricals or NaNs |
| `y` | `np.ndarray` | Target labels (classification or regression) |
| `feature_names` | `list[str]` | Column names in the same order as `raw_X` |

---

### 2. Preprocess

`preprocess_tabular` handles missing values, encodes categoricals, and scales numeric columns. It returns a clean numeric array ready for the optimizer.

```python
X, feature_names = loader.preprocess_tabular(raw_X)

print(f"Preprocessed shape: {X.shape}")
print(f"Feature names after preprocessing: {feature_names[:5]}")
```

> **Note:** Preprocessing may drop some columns (e.g. all-NaN, zero-variance). The returned `feature_names` always stays in sync with the columns in `X`.
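
As a rough picture of what such a step involves, the sketch below imputes missing values, one-hot encodes categoricals, drops all-NaN and zero-variance columns, and standardises the rest using plain pandas. The function name `simple_preprocess` and every choice in it (median imputation, one-hot encoding, z-scoring) are assumptions for illustration; `preprocess_tabular` may behave differently.

```python
import numpy as np
import pandas as pd


def simple_preprocess(df: pd.DataFrame):
    """Hypothetical stand-in for a tabular preprocessing step."""
    df = df.dropna(axis=1, how="all").copy()           # drop all-NaN columns
    num_cols = df.select_dtypes(include="number").columns
    cat_cols = df.columns.difference(num_cols)

    # Median imputation for numeric columns, a sentinel value for categoricals
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    for col in cat_cols:
        df[col] = df[col].fillna("missing")
    if len(cat_cols) > 0:
        df = pd.get_dummies(df, columns=list(cat_cols), dtype=float)

    df = df.loc[:, df.std(ddof=0) > 0]                 # drop zero-variance columns
    X = ((df - df.mean()) / df.std(ddof=0)).to_numpy() # z-score scaling
    return X, list(df.columns)


raw = pd.DataFrame({
    "age":   [25.0, 32.0, np.nan, 41.0],
    "city":  ["NY", "LA", None, "NY"],
    "const": [1.0, 1.0, 1.0, 1.0],   # zero variance, will be dropped
    "empty": [np.nan] * 4,           # all NaN, will be dropped
})
X_demo, names_demo = simple_preprocess(raw)
print(X_demo.shape, names_demo)
```

Returning the surviving column names together with the array is what keeps names and columns aligned after columns are dropped or expanded.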

---

### 3. Create Sub-Populations

`make_subpops` partitions the feature indices into groups of roughly `target_size` features each. Each group will be evolved by its own sub-population.

```python
from coevopy.core import make_subpops

subpops = make_subpops(X, target_size=50)

print(f"Number of sub-populations: {len(subpops)}")
for i, idx in enumerate(subpops):
    print(f"  Sub-pop {i}: {len(idx)} features, indices {idx[:4]}...")
```

**Choosing `target_size`**

| Number of features | Recommended `target_size` |
|---|---|
| < 50 features | Use a single sub-pop (set to total feature count) |
| 50–200 features | 30–50 |
| 200–1000 features | 50–100 |
| > 1000 features | 100–200 |

Smaller groups converge faster but may miss cross-group interactions. Larger groups capture interactions better but take longer per generation.
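
To make the effect of `target_size` concrete, here is a minimal sketch of one way to partition feature indices into contiguous, near-equal groups. `partition_features` is a hypothetical helper written for this example; the actual `make_subpops` may group features differently (for instance, with overlap).

```python
import math

import numpy as np


def partition_features(n_features: int, target_size: int = 50):
    """Split indices 0..n_features-1 into groups of roughly target_size each."""
    n_groups = max(1, math.ceil(n_features / target_size))
    return np.array_split(np.arange(n_features), n_groups)


groups = partition_features(120, target_size=50)
print([len(g) for g in groups])   # → [40, 40, 40]
```

With 120 features and `target_size=50`, three groups are needed, so each ends up with 40 features rather than two groups of 50 and a leftover group of 20.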

---

### 4. Run the Optimizer

Instantiate `CoEvoFeatureSelector` and call `.fit()`. This runs the co-evolutionary loop and returns a boolean mask.

```python
from coevopy.optimizer import CoEvoFeatureSelector

optimizer = CoEvoFeatureSelector(
    pop_size=25,       # individuals per sub-population
    generations=6,     # number of GA generations
    cv_folds=2,        # cross-validation folds for fitness scoring
    n_jobs=-1,         # parallel workers (-1 = all cores)
    random_state=42    # reproducibility seed
)

final_mask = optimizer.fit(X, y, subpops)
```

**What `.fit()` returns**

`final_mask` is a `np.ndarray` of dtype `bool` with length equal to `X.shape[1]`.

```python
print(final_mask)
# array([True, False, True, True, False, ...])

print(f"Selected {final_mask.sum()} of {X.shape[1]} features")
# Selected 37 of 120 features
```
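
Before using the mask downstream, a few quick sanity checks can catch shape or dtype mismatches early. The helper `check_mask` below is a hypothetical utility, not part of CoEvoPy, and the arrays are synthetic stand-ins. The dtype check matters because a 0/1 integer array used as `X[:, mask]` is treated by NumPy as a list of column indices, not as a boolean mask.

```python
import numpy as np


def check_mask(mask, n_features: int) -> np.ndarray:
    """Validate a feature mask before indexing with it (illustrative helper)."""
    mask = np.asarray(mask)
    if mask.dtype != bool:
        raise TypeError(f"expected a boolean mask, got dtype {mask.dtype}")
    if mask.shape != (n_features,):
        raise ValueError(f"mask shape {mask.shape} does not match {n_features} features")
    if not mask.any():
        raise ValueError("mask selects no features")
    return mask


X_demo = np.arange(12).reshape(3, 4)
mask_demo = np.array([True, False, True, True])
print(X_demo[:, check_mask(mask_demo, X_demo.shape[1])].shape)   # → (3, 3)
```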

---

### 5. Apply the Feature Mask

Use the mask to slice your arrays or retrieve the names of selected features.

```python
# Apply to training data
X_selected = X[:, final_mask]

# Get the names of selected features
selected_names = [name for name, keep in zip(feature_names, final_mask) if keep]
print("Selected features:", selected_names)

# Apply to a new dataset. X_new is assumed to be an array produced by the
# same preprocessing, with the same columns in the same order as X.
X_new_selected = X_new[:, final_mask]
```
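
If you keep your data in a pandas DataFrame, the same boolean mask can be applied with `.loc`, which preserves column names. The frame and mask below are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=["f0", "f1", "f2", "f3"])
mask = np.array([True, False, True, True])

df_selected = df.loc[:, mask]            # boolean column selection
print(list(df_selected.columns))         # → ['f0', 'f2', 'f3']

# np.flatnonzero converts the mask to integer indices when an API needs them
selected_idx = np.flatnonzero(mask)
print(selected_idx.tolist())             # → [0, 2, 3]
```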

---

## Full End-to-End Example

The example below mirrors `main.py` but includes model training and evaluation after feature selection.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from coevopy.data import DataLoader
from coevopy.core import make_subpops
from coevopy.optimizer import CoEvoFeatureSelector


def run_pipeline(dataset_path: str):
    # ── 1. Load ──────────────────────────────────────────────────
    loader = DataLoader()
    raw_X, y, feature_names = loader.load_tabular(dataset_path)
    print(f"Loaded  : {raw_X.shape[0]} rows, {raw_X.shape[1]} columns")

    # ── 2. Preprocess ────────────────────────────────────────────
    X, feature_names = loader.preprocess_tabular(raw_X)
    print(f"After preprocessing: {X.shape[1]} features")

    # ── 3. Train/test split (split BEFORE feature selection) ─────
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # ── 4. Create sub-populations on training data only ──────────
    subpops = make_subpops(X_train, target_size=50)

    # ── 5. Run CoEvoGA ───────────────────────────────────────────
    optimizer = CoEvoFeatureSelector(
        pop_size=25,
        generations=6,
        cv_folds=2,
        n_jobs=-1,
        random_state=42
    )
    final_mask = optimizer.fit(X_train, y_train, subpops)

    print(f"\nFeature selection: {X.shape[1]} → {int(final_mask.sum())} features")
    selected_names = [n for n, m in zip(feature_names, final_mask) if m]
    print("Selected features:", selected_names)

    # ── 6. Apply mask and train a final model ────────────────────
    X_train_sel = X_train[:, final_mask]
    X_test_sel  = X_test[:, final_mask]

    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X_train_sel, y_train)

    y_pred = clf.predict(X_test_sel)
    print("\nClassification report (selected features):")
    print(classification_report(y_test, y_pred))

    return final_mask, clf


if __name__ == "__main__":
    mask, model = run_pipeline("finance_data.csv")
```

---

## Integration with scikit-learn

You can wrap the optimizer into a scikit-learn-compatible transformer for use in pipelines.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier

from coevopy.core import make_subpops
from coevopy.optimizer import CoEvoFeatureSelector


class CoEvoSelector(BaseEstimator, TransformerMixin):
    """Scikit-learn compatible wrapper for CoEvoFeatureSelector."""

    def __init__(self, pop_size=25, generations=6, cv_folds=2,
                 n_jobs=-1, target_size=50, random_state=42):
        self.pop_size = pop_size
        self.generations = generations
        self.cv_folds = cv_folds
        self.n_jobs = n_jobs
        self.target_size = target_size
        self.random_state = random_state

    def fit(self, X, y=None):
        subpops = make_subpops(X, target_size=self.target_size)
        opt = CoEvoFeatureSelector(
            pop_size=self.pop_size,
            generations=self.generations,
            cv_folds=self.cv_folds,
            n_jobs=self.n_jobs,
            random_state=self.random_state
        )
        self.mask_ = opt.fit(X, y, subpops)
        return self

    def transform(self, X):
        return X[:, self.mask_]

    def get_support(self):
        return self.mask_


# Use inside a Pipeline
pipe = Pipeline([
    ("scaler",   StandardScaler()),
    ("selector", CoEvoSelector(pop_size=25, generations=6, target_size=50)),
    ("clf",      GradientBoostingClassifier(n_estimators=100, random_state=42))
])

# X_train, X_test, y_train, y_test and feature_names are assumed to come from the earlier steps
pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))

# Inspect which features were kept
mask = pipe.named_steps["selector"].get_support()
print("Selected features:", [n for n, m in zip(feature_names, mask) if m])
```

---

## Saving and Reusing a Feature Mask

Once you have a mask you're happy with, save it so you don't need to re-run the optimizer on new data.

```python
import numpy as np
from coevopy.data import DataLoader

# Save
np.save("feature_mask.npy", final_mask)

# Reload later
final_mask = np.load("feature_mask.npy")

# Apply to new data (must go through the same preprocessing first)
loader = DataLoader()
raw_new, _, _ = loader.load_tabular("new_data.csv")
X_new, _ = loader.preprocess_tabular(raw_new)

X_new_selected = X_new[:, final_mask]
```

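> **Important:** The column order in `new_data.csv` must exactly match the original training data. Run the same `DataLoader` preprocessing so column ordering and encoding are consistent. Because preprocessing may drop columns, also check that `X_new.shape[1]` equals `len(final_mask)` before indexing.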
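
One way to guard against column drift is to store the feature names alongside the mask and compare them on reload. The sketch below uses `np.savez`; the file name, the `save_mask` and `load_mask` helpers, and the sample names are all illustrative assumptions, not CoEvoPy functionality.

```python
import os
import tempfile

import numpy as np


def save_mask(path, mask, feature_names):
    """Store the mask and the feature names it refers to in one .npz file."""
    np.savez(path,
             mask=np.asarray(mask, dtype=bool),
             feature_names=np.asarray(feature_names, dtype=str))


def load_mask(path, current_feature_names):
    """Reload the mask and refuse to return it if the columns have changed."""
    with np.load(path) as data:
        mask = data["mask"]
        saved_names = data["feature_names"].tolist()
    if saved_names != list(current_feature_names):
        raise ValueError("feature names or order differ from the saved mask")
    return mask


names = ["age", "income", "balance"]
path = os.path.join(tempfile.mkdtemp(), "feature_mask.npz")
save_mask(path, [True, False, True], names)

mask = load_mask(path, names)
print(mask.tolist())   # → [True, False, True]
```

Comparing names on load turns a silent misalignment into an immediate error.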

---

## Parameter Reference

### `CoEvoFeatureSelector`

| Parameter | Type | Default | Description |
|---|---|---|---|
| `pop_size` | `int` | `30` | Number of individuals (binary chromosomes) per sub-population |
| `generations` | `int` | `6` | Number of GA generations to run |
| `cv_folds` | `int` | `2` | Cross-validation folds used in fitness evaluation. Higher = more accurate but slower |
| `n_jobs` | `int` | `-1` | Parallel workers for fitness evaluation. `-1` uses all CPU cores |
| `random_state` | `int` | `42` | Random seed for reproducibility |

### `make_subpops`

| Parameter | Type | Default | Description |
|---|---|---|---|
| `X` | `np.ndarray` | — | Preprocessed feature matrix |
| `target_size` | `int` | `50` | Target number of features per sub-population |

---

## Tips and Tuning

**Speed vs accuracy trade-off**

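Increasing `cv_folds` from `2` to `5` gives more reliable fitness estimates but needs 2.5× as many model fits per evaluation, and each fit trains on a larger share of the data, so expect runtime to grow by roughly 2.5× or more. For quick exploration, keep `cv_folds=2` and `generations=4`.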
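
A back-of-the-envelope count of model fits helps when choosing these settings. The formula below assumes every individual is scored once per generation with one model fit per fold. That is a simplification and may not match CoEvoPy's actual evaluation schedule (for example, if fitness values are cached).

```python
def estimated_model_fits(n_subpops: int, pop_size: int, generations: int, cv_folds: int) -> int:
    """Rough upper bound on model fits under the stated assumptions."""
    return n_subpops * pop_size * generations * cv_folds


quick = estimated_model_fits(n_subpops=3, pop_size=25, generations=4, cv_folds=2)
thorough = estimated_model_fits(n_subpops=3, pop_size=25, generations=6, cv_folds=5)
print(quick, thorough)   # → 600 2250
```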

**Reproducibility**

Set `random_state` on both `CoEvoFeatureSelector` and any downstream model. The optimizer's internal operations (population initialisation, crossover, mutation) all respect this seed.
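
The effect of a fixed seed can be illustrated with simple seeded GA operators. The functions below (random binary initialisation, one-point crossover, bit-flip mutation) are generic textbook operators written for this example. They are not taken from `coevopy.core`, whose operators may differ.

```python
import numpy as np


def init_population(rng, pop_size, n_genes):
    """Random binary chromosomes: 1 = keep feature, 0 = drop it."""
    return rng.integers(0, 2, size=(pop_size, n_genes))


def one_point_crossover(rng, parent_a, parent_b):
    point = rng.integers(1, len(parent_a))   # cut point in 1..n_genes-1
    return np.concatenate([parent_a[:point], parent_b[point:]])


def bit_flip_mutation(rng, chromosome, rate=0.05):
    flips = rng.random(len(chromosome)) < rate
    return np.where(flips, 1 - chromosome, chromosome)


def one_generation(seed):
    # A single generator drives every random decision, so one seed fixes them all
    rng = np.random.default_rng(seed)
    pop = init_population(rng, pop_size=6, n_genes=10)
    child = one_point_crossover(rng, pop[0], pop[1])
    return bit_flip_mutation(rng, child)


print(np.array_equal(one_generation(42), one_generation(42)))   # → True
```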

**Always split before selecting**

Run `make_subpops` and `optimizer.fit` on **training data only**. Fitting on the full dataset causes data leakage: the cross-validated fitness scores that shaped the mask were computed on rows that later end up in the test set, so the test score is no longer an unbiased estimate.

```python
# Correct
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
subpops = make_subpops(X_train, target_size=50)
mask = optimizer.fit(X_train, y_train, subpops)

# Wrong — leaks test data into selection
subpops = make_subpops(X, target_size=50)          # ❌ uses all data
mask = optimizer.fit(X, y, subpops)                 # ❌
```

**Very wide data (> 500 features)**

Increase `target_size` to 100–150 so each sub-population captures enough inter-feature interactions. Also consider increasing `pop_size` to 40–50 so the gene pool is large enough to explore the larger chromosome space.

**Parallel efficiency**

Each individual in a sub-population is evaluated in parallel via `joblib`. If you have many small sub-populations, the overhead of spawning workers can dominate. In that case, try `n_jobs=4` instead of `-1`.
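
The pattern described above looks roughly like this with `joblib`. The `evaluate` function is a cheap placeholder for a cross-validated fitness call and the population is random. Only `Parallel` and `delayed` are real `joblib` APIs; everything else is assumed for the example.

```python
import numpy as np
from joblib import Parallel, delayed


def evaluate(chromosome):
    # Placeholder fitness: fraction of features kept
    return float(chromosome.mean())


rng = np.random.default_rng(0)
population = rng.integers(0, 2, size=(8, 20))

# n_jobs caps the number of workers; a small fixed value limits start-up overhead
scores = Parallel(n_jobs=2)(delayed(evaluate)(ind) for ind in population)
print(len(scores))   # → 8
```

When each evaluation is as cheap as this placeholder, worker start-up and data transfer cost more than the work itself, which is the situation in which lowering `n_jobs` tends to help.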
