Metadata-Version: 2.4
Name: overlapindex
Version: 0.1.3a1
Summary: OverlapIndex (OI), an Incremental Cluster Validity index for identifying the degree of overlap of data classes.
License: MIT
License-File: LICENSE
Keywords: incremental cluster validity,cluster validity,ART,machine learning,transfer learning,clustering
Author: Niklas M. Melton
Author-email: niklasmelton@gmail.com
Requires-Python: >=3.9,<3.15
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Provides-Extra: art
Requires-Dist: artlib (>=0.1.10,<0.2.0) ; (python_version >= "3.9" and python_version < "3.15") and (extra == "art")
Requires-Dist: matplotlib (>=3.10.8,<4.0.0) ; (python_version >= "3.10" and python_version < "3.15") and (extra == "art")
Requires-Dist: matplotlib (>=3.9.4,<3.10.0) ; (python_version == "3.9") and (extra == "art")
Requires-Dist: numpy (>=2.0.0,<3.0.0) ; python_version >= "3.9" and python_version < "3.13"
Requires-Dist: numpy (>=2.4.1,<3.0.0) ; python_version >= "3.13" and python_version < "3.15"
Requires-Dist: scikit-learn (>=1.0,<2.0.0) ; python_version >= "3.9" and python_version < "3.14"
Requires-Dist: scikit-learn (>=1.7.2,<2.0.0) ; python_version == "3.14"
Requires-Dist: scipy (>=1.16.1,<2.0.0) ; python_version >= "3.13" and python_version < "3.15"
Requires-Dist: scipy (>=1.6.0,<2.0.0) ; python_version >= "3.9" and python_version < "3.13"
Project-URL: Documentation, https://github.com/NiklasMelton/OverlapIndex
Project-URL: Homepage, https://github.com/NiklasMelton/OverlapIndex
Project-URL: Repository, https://github.com/NiklasMelton/OverlapIndex
Description-Content-Type: text/markdown

# OverlapIndex (OI)

This package provides an implementation of the **Overlap Index (OI)**, a cluster-validity measure designed to quantify the degree of overlap between data classes or clusters. The OI can be updated online with ARTMAP-based backends, or computed in batch with offline clustering backends, making it useful for streaming, continual learning, large-scale representation analysis, and embedding-space diagnostics.

The implementation supports multiple swappable clustering backends:

- **Fuzzy ARTMAP** and **Hypersphere ARTMAP** for incremental / online updates.
- **KMeans** and **MiniBatchKMeans** for offline centroid-based analysis.
- **BallCover** for offline greedy landmark-ball covers, useful when the goal is to preserve class-support geometry for downstream shape or topology analysis.

---

## Installation

To install OverlapIndex, simply use pip:

```bash
pip install overlapindex
```

This installs the default batch-oriented dependencies. To enable the incremental
ART backends as well, install the optional `art` extra:

```bash
pip install "overlapindex[art]"
```

The core package and optional `art` extra support Python 3.9 through 3.14.

To install the latest development version directly from the `develop` branch:

```bash
pip install git+https://github.com/NiklasMelton/OverlapIndex.git@develop
```

---

## Overview

The Overlap Index is bounded in the interval **[0, 1]** and has the following interpretation:

- **OI = 1.0**  
  Indicates perfect class separation (no overlap).

- **OI = 0.5**  
  Indicates complete overlap between classes.

- **OI < 0.5**  
  Indicates a degenerate or pathological case in the data distribution.

The index is computed by tracking shared cluster activations between pairs of classes and aggregating class-wise overlap into a global measure. With ARTMAP backends, this tracking is updated incrementally as new samples arrive.

---

## Key Properties

- **Incremental and Offline Modes**  
  ARTMAP backends support streaming updates via `add_sample` and mini-batch updates via `add_batch`.
  Offline backends such as `KMeans`, `MiniBatchKMeans`, and `BallCover` support batch computation through `add_batch`.

- **Label-Aware**  
  Can be applied both to labeled raw data and to intermediate representations (e.g., neural network activations).

- **Geometry-Agnostic**  
  No geometric constraints on the data are assumed, so the index applies to arbitrarily shaped class structures.

---

## Typical Use Cases

The Overlap Index can be used in several settings:

- **Unsupervised clustering evaluation**  
  As an incremental cluster validity index (iCVI), OI provides insight into the quality of a clustering partition as it evolves over time.

- **Class separability analysis**  
  Measures the degree of overlap in labeled datasets without requiring a classifier.

- **Representation monitoring in deep learning**  
  Tracks how class separation changes across layers or training epochs.

- **Backbone evaluation for transfer learning**  
  Compares feature extractors, where higher OI values indicate better class 
  separation in the backbone embeddings.

---

## Implementation Notes

- ART-based clustering is performed using `artlib`’s `FuzzyARTMAP` or `HypersphereARTMAP`.
- `artlib` is an optional dependency and is only required when using the
  `"Fuzzy"` or `"Hypersphere"` backends.
- Offline centroid backends fit one clustering model per class and concatenate the resulting class-owned prototypes into global cluster ids.
- The `BallCover` backend fits one greedy ball cover per class and treats ball centers as class-owned prototypes.
- Normalize input features before fitting. Examples in this repository use `MinMaxScaler` for convenience.
- ART backends complement-code inputs internally and therefore require features in the `[0, 1]` interval.
- Offline backends (`KMeans`, `MiniBatchKMeans`, and `BallCover`) consume normalized features directly and do not apply complement coding.
- Overlap is estimated by monitoring shared best-matching units (BMUs) or top prototype activations between class pairs.
- The global OI is computed as the mean of per-class minimum pairwise overlap scores.
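
The aggregation in the last bullet can be sketched in plain Python. This is only an illustration of "mean of per-class minima", not the package's internal code: the function name `aggregate_overlap`, the pairwise values, and the assumption that each unordered class pair appears exactly once are all hypothetical.

```python
def aggregate_overlap(pairwise):
    """Reduce pairwise scores to per-class minima and their mean.

    Assumes each unordered class pair appears once as a (a, b) key.
    """
    classes = sorted({c for pair in pairwise for c in pair})
    # Per-class score: the lowest pairwise score involving that class.
    singleton = {
        c: min(score for pair, score in pairwise.items() if c in pair)
        for c in classes
    }
    # Global score: the mean of the per-class minima.
    global_score = sum(singleton.values()) / len(singleton)
    return singleton, global_score


# Made-up pairwise scores for three classes.
pairwise = {(0, 1): 1.0, (0, 2): 0.98, (1, 2): 0.9}
singleton, global_oi = aggregate_overlap(pairwise)
```

Because each class contributes its worst (lowest) pairwise score, a single badly overlapping pair pulls down the score of both classes involved.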

---

## Basic Usage

```python
from sklearn.preprocessing import MinMaxScaler
from overlapindex import OverlapIndex

# Normalize features before fitting.
X = MinMaxScaler().fit_transform(X)

# MiniBatchKMeans is the default backend and is recommended for most offline use cases.
oi = OverlapIndex(
    kmeans_k=10,
    kmeans_kwargs={"random_state": 0},
)

# sklearn-style API
oi.fit(X, y)
score = oi.index
```


The fitted value is available through `oi.index`. For users who prefer update methods that return the current score directly, `add_batch(X, y)` is also supported.

### Online ARTMAP Usage

```python
from overlapindex import OverlapIndex

# For ARTMAP backends, batches should already be scaled into [0, 1].

oi = OverlapIndex(
    model_type="Hypersphere",
    rho=0.9,
    match_tracking="MT+",
)

for X_batch, y_batch in stream:
    oi.partial_fit(X_batch, y_batch)
    score = oi.index
```

For single-sample streams, ARTMAP backends also support `add_sample(x, y)`, which updates the model and returns the current score directly. Labeled mini-batches can also be passed to `add_batch(X, y)`.
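
In a stream, the full dataset is usually not available for fitting a scaler up front. One possible approach, sketched below in dependency-free Python, is to scale each batch with fixed per-feature bounds that are known in advance and to clip values that fall outside them. The helper `scale_with_bounds` and the example bounds are illustrative assumptions, not part of the package.

```python
def scale_with_bounds(batch, lower, upper):
    """Min-max scale rows into [0, 1] with fixed per-feature bounds.

    Values outside the bounds are clipped; a feature whose bounds
    coincide is mapped to 0.0 to avoid division by zero.
    """
    scaled = []
    for row in batch:
        scaled.append([
            min(1.0, max(0.0, (x - lo) / (hi - lo))) if hi > lo else 0.0
            for x, lo, hi in zip(row, lower, upper)
        ])
    return scaled


# Hypothetical bounds for two features.
lower = [0.0, 10.0]
upper = [4.0, 20.0]

# The second row exceeds the upper bound of feature 0 and is clipped.
X_batch = scale_with_bounds([[1.0, 15.0], [5.0, 10.0]], lower, upper)
```

Using the same bounds for every batch keeps the scaling consistent across the stream, which per-batch min-max scaling would not.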

### API Styles

`OverlapIndex` supports both sklearn-style methods and direct score-returning update methods:

| Method | Returns | Typical use |
| --- | --- | --- |
| `fit(X, y)` | `self` | Full offline fitting on a labeled dataset. |
| `partial_fit(X, y)` | `self` | Incremental batch updates for ARTMAP backends; offline backends refit on the provided batch. |
| `score()` / `score(X, y)` | `float` | Read the current index, or refit on labeled data and return the new score. |
| `predict(X)` | `np.ndarray` | Return the highest-scoring global prototype id for each sample. |
| `fit_predict(X, y)` | `np.ndarray` | Fit and return per-sample prototype ids. |
| `add_batch(X, y)` | `float` | Batch update when the current OI score is needed immediately. |
| `add_sample(x, y)` | `float` | Single-sample online update for ARTMAP backends. |

After `fit` or `partial_fit`, read the current score from `oi.index` or call `score()`.

For `model_type="KMeans"`, `model_type="MiniBatchKMeans"`, and
`model_type="BallCover"`, `partial_fit(X, y)` is a convenience wrapper around
recomputing the index on the provided labeled batch. Only the ARTMAP backends
perform true incremental updates across calls.

If a batch is empty or contains only one unique class, `OverlapIndex` emits a
`RuntimeWarning` and leaves the score at its default value of `1.0`.
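
Since a degenerate batch leaves the score at `1.0`, that value can be mistaken for perfect separation. A simple pre-check on the labels, sketched here with a hypothetical helper `has_two_classes`, makes it easy to tell the two situations apart before reading the score.

```python
def has_two_classes(labels):
    """Return True when the batch holds at least two distinct labels."""
    return len(set(labels)) >= 2


# A single-class batch, an empty batch, and a usable batch.
batches = [[0, 0, 0], [], [0, 1, 1]]
scorable = [has_two_classes(y) for y in batches]
```

A score of `1.0` obtained from a batch for which this check is false reflects the default value rather than measured separation.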

### Clustering Backends

`OverlapIndex` uses `model_type="MiniBatchKMeans"` by default and supports several backend families through the `model_type` parameter:

| `model_type` | Update mode | Description |
| --- | --- | --- |
| `"Fuzzy"` | Online / batch | Incremental Fuzzy ARTMAP backend. Requires the optional `art` extra. |
| `"Hypersphere"` | Online / batch | Incremental Hypersphere ARTMAP backend. Requires the optional `art` extra. |
| `"KMeans"` | Offline batch only | Fits one scikit-learn `KMeans` model per class. |
| `"MiniBatchKMeans"` | Offline batch only | Default backend. Fits one scikit-learn `MiniBatchKMeans` model per class; recommended for larger datasets. |
| `"BallCover"` | Offline batch only | Fits one greedy landmark-ball cover per class. Useful when preserving class-support geometry is important. |

Offline backends should be used with `fit` or `add_batch`. They do not support `add_sample` because their prototypes are fit from a complete labeled batch.

#### KMeans backend

```python
from overlapindex import OverlapIndex

OI = OverlapIndex(
    model_type="KMeans",
    kmeans_k=10,
    kmeans_kwargs={"random_state": 0},
)

OI.fit(X, y)
score = OI.index
```

#### MiniBatchKMeans backend

```python
from overlapindex import OverlapIndex

OI = OverlapIndex(
    model_type="MiniBatchKMeans",
    kmeans_k=10,
    kmeans_kwargs={
        "random_state": 0,
        "batch_size": 8192,
        "n_init": 1,
    },
)

OI.fit(X, y)
score = OI.index
```

#### BallCover backend

```python
from overlapindex import OverlapIndex

OI = OverlapIndex(
    model_type="BallCover",
    ballcover_k="auto",
    ballcover_radius=0.25,
    ballcover_kwargs={
        "metric": "auto",
        "cover_fraction": 1.0,
    },
)

OI.fit(X, y)
score = OI.index
```

The BallCover backend supports one automatic cover parameter at a time:

- `ballcover_k="auto"` with a fixed `ballcover_radius` greedily adds balls until the requested cover fraction is reached.
- `ballcover_k=<int>` with `ballcover_radius="auto"` selects a fixed number of landmarks and infers the radius needed to cover the requested fraction of samples.

`metric="auto"` uses Euclidean distance for lower-dimensional inputs and cosine distance for high-dimensional inputs such as embedding vectors. To override this choice, pass `metric="euclidean"` or `metric="cosine"`.
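
To give a feel for the fixed-radius mode (`ballcover_k="auto"`), here is a minimal greedy cover in plain Python. It is a sketch under stated assumptions, not the package's implementation: the function `greedy_ball_cover`, the rule of picking the uncovered point whose ball captures the most uncovered points, and the Euclidean-only distance are all choices made for illustration.

```python
import math


def greedy_ball_cover(points, radius, cover_fraction=1.0):
    """Pick ball centers until `cover_fraction` of points are covered.

    Assumed selection rule: choose the uncovered point whose ball of
    the given radius contains the most still-uncovered points.
    """
    uncovered = set(range(len(points)))
    target = math.ceil(cover_fraction * len(points))
    centers = []
    while len(points) - len(uncovered) < target:
        best = max(
            sorted(uncovered),  # sorted so ties resolve to the lowest index
            key=lambda i: sum(
                math.dist(points[i], points[j]) <= radius for j in uncovered
            ),
        )
        centers.append(points[best])
        # Remove every point that falls inside the newly added ball.
        uncovered -= {
            j for j in uncovered
            if math.dist(points[best], points[j]) <= radius
        }
    return centers


# Three nearby points and one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.0, 1.0)]
centers = greedy_ball_cover(points, radius=0.25)
```

With a full cover requested, the distant point needs its own ball; lowering `cover_fraction` to `0.75` in this sketch would stop after the first ball.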

### Iris Dataset Example

```python
from sklearn.datasets import load_iris
import numpy as np
from overlapindex import OverlapIndex

# Load dataset
iris = load_iris()

# Feature matrix (shape: [150, 4])
X = iris.data.astype(np.float64)

# Target vector (shape: [150,])
y = iris.target.astype(np.int64)

# Normalize the data (required)
x_max = X.max(axis=0)
x_min = X.min(axis=0)
X = (X - x_min) / (x_max - x_min)

# Instantiate the OI object
OI = OverlapIndex()

# Calculate the Overlap Index
OI.fit(X, y)
print(OI.index)

# Output:
# 0.9266666666666666
```

Additional runnable examples are available in the `examples/` directory.

---

## Release Verification

For release testing, start from a fresh Poetry environment so the package under
test matches `pyproject.toml` and `poetry.lock`:

```bash
poetry env remove --all
poetry sync --with dev
poetry run python -c "from overlapindex import OverlapIndex; OverlapIndex(model_type='MiniBatchKMeans')"
poetry run python -m pytest -q tests/test_overlap_index_regression.py

poetry sync --with dev --extras art
poetry run python -c "from overlapindex import OverlapIndex; OverlapIndex(model_type='Hypersphere')"
poetry run python -m pytest -q tests/test_overlap_index_regression.py

poetry check
python -m build
twine check dist/*
```

The first sync verifies that offline backends work without the optional
`artlib` dependency. The second sync verifies the `art` extra and the ARTMAP
backends.

---

## Parameters

- `rho` *(float)*  
  Vigilance parameter controlling cluster granularity for ARTMAP backends.

- `r_hat` *(float, Hypersphere ARTMAP only)*  
  Maximum cluster radius for the Hypersphere backend.

- `model_type` *("Fuzzy" | "Hypersphere" | "KMeans" | "MiniBatchKMeans" | "BallCover")*  
  Clustering backend used to create class-owned prototypes. Defaults to `"MiniBatchKMeans"`.

- `match_tracking` *(str)*  
  Match-tracking strategy used during ARTMAP learning.

- `kmeans_k` *(int or dict)*  
  Number of clusters per class for `KMeans` and `MiniBatchKMeans` backends.

- `kmeans_kwargs` *(dict, optional)*  
  Keyword arguments forwarded to the selected scikit-learn KMeans backend.

- `ballcover_k` *(int, dict, or "auto")*  
  Number of balls per class, class-specific ball counts, or `"auto"` for greedy fixed-radius covering.

- `ballcover_radius` *(float, dict, or "auto")*  
  Ball radius, class-specific radii, or `"auto"` when using a fixed number of balls.

- `ballcover_kwargs` *(dict, optional)*  
  Additional BallCover options such as `metric`, `cover_fraction`, `chunk_size`, `max_balls`, and `random_state`.

---

The default parameters are intended for offline batch use with `MiniBatchKMeans`. For online or continual-learning workflows, explicitly choose `model_type="Fuzzy"` or `model_type="Hypersphere"`. For very large ART-based runs, smaller `rho` values (0.5 to 0.7) may reduce run time, since lower vigilance generally yields fewer, coarser clusters.

---

## Output

- **`index`**  
  Global Overlap Index across all observed classes.

- **`singleton_index[y]`**  
  Minimum pairwise overlap score for class `y`.

- **`pairwise_index[(y, b)]`**  
  Pairwise overlap score between classes `y` and `b`.
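
Because lower scores mean more overlap, the pairwise scores can be ranked to find the class pairs that are hardest to separate. The snippet below uses a plain dictionary with made-up values as a stand-in for `pairwise_index`; it assumes the attribute behaves like a mapping keyed by class-pair tuples.

```python
# Hypothetical stand-in for `oi.pairwise_index`; values are made up.
pairwise = {(0, 1): 1.0, (0, 2): 0.98, (1, 2): 0.9}

# The pair with the lowest score overlaps the most.
worst_pair = min(pairwise, key=pairwise.get)

# All pairs, ordered from most to least overlapping.
ranked = sorted(pairwise.items(), key=lambda item: item[1])
```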

---

## Intended Audience

This package is intended for researchers and practitioners working on:

- incremental and continual learning,
- clustering validation,
- representation learning,
- transfer learning.

