Metadata-Version: 2.4
Name: shapley_behaviors
Version: 0.1.2
Summary: Shapley value transformations for behavioral data analysis
Author: Amanda S Barnard
License: MIT
Project-URL: Homepage, https://github.com/amaxiom/shapley_behaviors
Project-URL: Repository, https://github.com/amaxiom/shapley_behaviors
Project-URL: Documentation, https://github.com/amaxiom/shapley_behaviors
Keywords: shapley,machine-learning,explainability,data mining,XAI
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.20
Requires-Dist: pandas>=1.3
Requires-Dist: joblib>=1.0
Requires-Dist: scikit-learn>=1.0
Requires-Dist: scipy>=1.7
Requires-Dist: matplotlib>=3.4
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: ruff; extra == "dev"

# shapley_behaviors

[![PyPI](https://img.shields.io/pypi/v/shapley_behaviors)](https://pypi.org/project/shapley-behaviors/)
[![Python](https://img.shields.io/pypi/pyversions/shapley_behaviors)](https://pypi.org/project/shapley-behaviors/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**Shapley value transformations for explainable behavioral data analysis.**

Traditional clustering asks *which samples are similar?* but not *why do they cluster together?*
Shapley behavioral transformations answer the "why" by decomposing statistical properties — variance, skewness, kurtosis, entropy — into individual sample contributions.
Samples that cluster together in a behavioral space play the same statistical role in the dataset, so a cluster can be interpreted by the role its members share (for example, stretching or stabilizing the distribution).

Implementation of the methodology from:

> Liu, T., and Barnard, A. S. (2025). Understanding interpretable patterns of Shapley behaviours in materials data. *Machine Learning: Engineering*, 1, 015004. https://doi.org/10.1088/3049-4761/adaaf6
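
To make the decomposition concrete, here is a minimal sketch of a permutation-sampling Shapley estimate for the variance of a single feature. It is an illustration under stated assumptions, not the package's implementation: the function name `shapley_variance_1d`, the use of population variance with an empty coalition valued at 0, and plain (non-antithetic) permutations are all choices made for this example.

```python
import numpy as np

def shapley_variance_1d(x, n_permutations=200, seed=0):
    """Monte Carlo Shapley estimate of each sample's contribution to var(x).

    Hypothetical helper for illustration; not part of shapley_behaviors.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    sizes = np.arange(1, n + 1)
    phi = np.zeros(n)
    for _ in range(n_permutations):
        order = rng.permutation(n)
        xs = x[order]
        # Population variance of every growing prefix of this permutation.
        prefix_mean = np.cumsum(xs) / sizes
        prefix_var = np.cumsum(xs ** 2) / sizes - prefix_mean ** 2
        # Marginal contribution of the k-th sample: v(prefix_k) - v(prefix_{k-1}),
        # with the empty coalition valued at 0 (assumption).
        phi[order] += np.diff(prefix_var, prepend=0.0)
    return phi / n_permutations

x = np.array([0.0, 0.1, -0.1, 0.05, 5.0])  # one extreme sample at index 4
phi = shapley_variance_1d(x)
print(phi.round(3), phi.sum().round(3), x.var().round(3))
```

Within each permutation the marginal contributions telescope to the full-sample variance, so in this sketch the estimates sum to `x.var()` regardless of `n_permutations`. The extreme sample at index 4 receives by far the largest contribution.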

---

## Features

- Decompose datasets into **variance**, **skewness**, **kurtosis**, and **entropy** behavioral spaces
- **Parallel computation** via joblib for large datasets
- **Outlier detection** directly in behavioral space
- Bundled **interactive explorer scripts** for Jupyter-based analysis with PCA plots, clustering statistics, and region-of-interest annotation
- **Antithetic sampling** for variance reduction in Monte Carlo estimation
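
As a rough illustration of the antithetic sampling mentioned in the list above, one common scheme for permutation-based Shapley estimation pairs every random permutation with its reverse. Whether the package pairs permutations in exactly this way is an assumption; `antithetic_permutations` is a hypothetical helper written for this sketch.

```python
import numpy as np

def antithetic_permutations(n_samples, n_pairs, seed=0):
    """Yield permutations in mirrored pairs: each order followed by its reverse."""
    rng = np.random.default_rng(seed)
    for _ in range(n_pairs):
        order = rng.permutation(n_samples)
        yield order
        yield order[::-1]

perms = list(antithetic_permutations(n_samples=6, n_pairs=3))

# argsort of a permutation gives, for each sample, its position in that order.
pos_forward = np.argsort(perms[0])
pos_mirror = np.argsort(perms[1])
print(pos_forward + pos_mirror)  # every entry equals n_samples - 1
```

A sample that joins the coalition early in one permutation joins late in its mirror, so the two marginal contributions tend to be negatively correlated. Averaging them reduces the variance of the estimate for the same number of permutations.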

---

## Installation

```bash
pip install shapley_behaviors
```

---

## Quick Start

```python
import numpy as np
from shapley_behaviors import ShapleyBehaviors

X = np.random.randn(500, 20)  # (n_samples, n_features)

sb = ShapleyBehaviors(n_permutations=100, n_jobs=-1, random_state=42)

# Transform to a single behavioral space
Phi_variance = sb.transform(X, value_function='variance')

# Or compute all four spaces at once
behavioral_spaces = sb.transform_multiple(X)
# keys: 'variance', 'skewness', 'kurtosis', 'entropy'
```

---

## Outlier Detection

```python
from shapley_behaviors import identify_outliers

outlier_indices, outlier_scores = identify_outliers(Phi_variance, threshold=2.5)
print(f"Detected {len(outlier_indices)} outliers")
```
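
For intuition about what a z-score rule does on a behavioral matrix, the sketch below flags rows whose overall magnitude is unusually large. It is an assumption about how such a rule could work, not a description of what `identify_outliers` computes internally. The helper name `zscore_outliers_sketch` and the choice of the row norm as the score are hypothetical.

```python
import numpy as np

def zscore_outliers_sketch(Phi, threshold=3.0):
    """Hypothetical helper: flag rows whose Euclidean norm has a z-score above threshold."""
    norms = np.linalg.norm(Phi, axis=1)
    scores = (norms - norms.mean()) / norms.std()
    return np.flatnonzero(scores > threshold), scores

rng = np.random.default_rng(0)
Phi_demo = rng.normal(size=(200, 5))  # synthetic stand-in for a behavioral space
Phi_demo[17] += 10.0                  # plant one obvious outlier
idx, scores = zscore_outliers_sketch(Phi_demo)
print(idx)
```

Lowering `threshold` flags more rows. The quick-start value of 2.5 is more permissive than 3.0.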

---

## Understanding Behavioral Spaces

Each space answers a different question about the role of each sample in the dataset:

| Space | Positive values | Negative values | Use case |
|---|---|---|---|
| **Variance** | Stretchers — widen the distribution | Stabilizers — typical samples near the mean | Quality control, process instability |
| **Skewness** | Pull distribution above the mean | Pull distribution below the mean | Biased synthesis, directional drift |
| **Kurtosis** | Tail samples — rare extreme events | Core samples — predictable, well-behaved | Anomaly detection, reliability analysis |
| **Entropy** | High-information — rare, unique combinations | Low-information — common, redundant | Dataset curation, diversity quantification |
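
The four statistics in the table can each be written as a value function that maps a set of samples to a single number, which is the quantity a Shapley decomposition distributes across samples. The sketch below uses NumPy and SciPy. The histogram-based entropy estimate (10 bins, natural log) and the names `hist_entropy` and `VALUE_FUNCTIONS` are assumptions for illustration, and the package may define these statistics differently.

```python
import numpy as np
from scipy import stats

def hist_entropy(x, bins=10):
    """Shannon entropy of a histogram of x (assumed estimator, natural log)."""
    counts, _ = np.histogram(x, bins=bins)
    return stats.entropy(counts)  # normalizes counts to probabilities

VALUE_FUNCTIONS = {
    "variance": np.var,          # population variance
    "skewness": stats.skew,      # third standardized moment
    "kurtosis": stats.kurtosis,  # excess kurtosis (0 for a normal distribution)
    "entropy": hist_entropy,
}

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
values = {name: float(f(x)) for name, f in VALUE_FUNCTIONS.items()}
print(values)
```

For this symmetric, evenly spaced sample the variance is 2.0, the skewness is 0, the excess kurtosis is -1.3, and the entropy is ln 5 because each value falls in its own bin.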

### Hopkins Statistic

The Hopkins statistic *H* measures clustering tendency in behavioral space:

| H value | Interpretation |
|---|---|
| > 0.7 | Strong clustering — samples group by behavior |
| ≈ 0.5 | Random distribution — no natural grouping |
| < 0.3 | Regular/uniform distribution |
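
A minimal sketch of one common form of the Hopkins statistic, using scikit-learn for nearest-neighbor queries, is shown here to make the table concrete. The function name `hopkins_sketch`, the probe count, the bounding-box sampling, and the use of unpowered distances are assumptions. The explorer script may compute *H* with a different variant.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_sketch(X, n_probe=50, seed=0):
    """Hopkins statistic: near 1 for clustered data, near 0.5 for random data."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = min(n_probe, n - 1)
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # u: distance from uniform probes in the bounding box to the nearest sample.
    probes = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(probes, n_neighbors=1)[0][:, 0]

    # w: distance from sampled data points to their nearest *other* sample
    # (column 0 is the point itself at distance 0).
    picked = rng.choice(n, size=m, replace=False)
    w = nn.kneighbors(X[picked], n_neighbors=2)[0][:, 1]

    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(0.0, 0.1, size=(100, 2)),
                   rng.normal(10.0, 0.1, size=(100, 2))])
uniform = rng.uniform(size=(200, 2))
H_blobs = hopkins_sketch(blobs)
H_uniform = hopkins_sketch(uniform)
print(round(H_blobs, 2), round(H_uniform, 2))
```

On this synthetic data the two tight blobs should score well above 0.7, while the uniform cloud should land near 0.5, matching the first two rows of the table.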

---

## Convenience Functions

```python
from shapley_behaviors import (
    compute_shapley_variance,
    compute_shapley_skewness,
    compute_shapley_kurtosis,
    compute_shapley_entropy,
)

Phi = compute_shapley_variance(X, n_permutations=100, n_jobs=-1, random_state=42)
```

---

## Explorer Scripts

The package bundles two standalone Jupyter-compatible scripts for comprehensive analysis.
Copy them to your working directory:

```python
from shapley_behaviors import copy_scripts

copy_scripts(".")  # all scripts
copy_scripts("./analysis", scripts=["behavioral_space_explorer"])  # one script
```

### Behavioral Space Explorer

Full dataset exploration — PCA plots, Hopkins statistics, outlier detection:

```python
SEED = 42
N_PERMUTATIONS = 1000      # 100 for quick tests, 1000 for publication
N_JOBS = -1

DATASET_NAME = "mydata"
DATA_FILE = "mydata.csv"
ID_COLUMN = "sample_id"
DROP_COLUMNS = ["col_a", "col_b"]
LABEL_COLUMNS = ["target1", "target2", "category"]
OUTPUT_DIR = "behavioral_exploration"
SELECTED_FEATURES = ["feature1", "feature2"]  # optional highlight

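# Jupyter/IPython only: -i runs the script in the notebook namespace,
# so it reads the variables defined above
%run -i behavioral_space_explorer.py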
```

Outputs:

| File | Contents |
|---|---|
| `{name}_behavioral_spaces.npy` | All four behavioral transformations |
| `{name}_behave_{space}_{label}.png` | PCA plots colored by each label |
| `{name}_hopkins_statistics.csv` | Clustering tendency metrics |
| `{name}_clustering_statistics.csv` | Variance explained, pairwise distances |
| `{name}_outliers_{space}.csv` | Outlier samples per space |

### Behavioral Region Explorer

Targeted analysis of specific PCA regions:

```python
BEHAVIORAL_SPACES_FILE = "behavioral_exploration/mydata_behavioral_spaces.npy"
PLOT_MODE = "combined"  # or "separate"

USER_REGIONS = {
    "high_variance_cluster": {
        "space": "variance",
        "pc1_range": (0.3, 0.6),
        "pc2_range": (-0.2, 0.2),
        "description": "High variance contributors",
        "color": "red",
    },
}

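# Jupyter/IPython only: -i runs the script in the notebook namespace,
# so it reads the variables defined above
%run -i behavioral_region_explorer.py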
```

---

## Parameters

| Parameter | Values | Notes |
|---|---|---|
| `n_permutations` | 50–100 (explore), 200–500 (standard), 1000+ (publication) | Higher = more accurate, slower |
| `n_jobs` | `-1` (all cores), `1` (debug), `N` (N cores) | Parallelizes over features |
| `random_state` | any int | Set for reproducible results; the Monte Carlo estimator uses antithetic sampling |

---

## Runtime Estimates

| Dataset size | `n_permutations` | Estimated time |
|---|---|---|
| 500 samples | 100 | 2–5 min |
| 500 samples | 1000 | 15–30 min |
| 4000 samples | 100 | 20–30 min |
| 4000 samples | 1000 | 2–3 hours |

---

## API Reference

```python
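from shapley_behaviors import ShapleyBehaviors, identify_outliers

# Main class (X is an array of shape (n_samples, n_features))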
sb = ShapleyBehaviors(n_permutations=100, n_jobs=-1, random_state=42)
Phi = sb.transform(X, value_function='variance', verbose=True)
spaces = sb.transform_multiple(X, value_functions=['variance', 'skewness', 'kurtosis', 'entropy'])

# Outlier detection
outlier_indices, outlier_scores = identify_outliers(Phi, threshold=3.0, method='zscore')
```

---

## Troubleshooting

| Problem | Solution |
|---|---|
| `ImportError` | Install the package: `pip install shapley_behaviors` |
| Long runtime | Reduce `n_permutations` to 100 for testing |
| Memory error | Reduce `n_jobs` or process data in batches |
| High additivity error warning | Increase `n_permutations` |
| H ≈ 0.5 (no clustering) | Data may lack natural behavioral groupings |
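
The additivity warning in the table refers to the Shapley efficiency property: the contributions should sum to the statistic they decompose. How the package measures this gap is not stated here. The sketch below assumes a simple relative error, and `additivity_error` is a hypothetical helper. The contributions used are stand-ins (scaled squared deviations, which sum exactly to the variance), not Shapley values.

```python
import numpy as np

def additivity_error(phi, target):
    """Relative gap between summed contributions and the decomposed statistic."""
    return abs(np.sum(phi) - target) / max(abs(target), np.finfo(float).eps)

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Stand-in contributions that sum exactly to the population variance of x.
exact = (x - x.mean()) ** 2 / x.size
# The same contributions with estimation noise added (illustrative values).
noisy = exact + np.array([0.05, -0.02, 0.0, 0.03, 0.0, 0.01, 0.04, 0.02])

err_exact = additivity_error(exact, x.var())
err_noisy = additivity_error(noisy, x.var())
print(err_exact, err_noisy)
```

Here the exact contributions give an error of essentially zero, while the noisy ones miss the variance of 4.0 by 0.13, a relative error of 0.0325.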

---

## Citation

```bibtex
@article{liu2025shapley,
  author  = {Liu, Tommy and Barnard, Amanda S.},
  title   = {Understanding interpretable patterns of {Shapley} behaviours in materials data},
  journal = {Machine Learning: Engineering},
  volume  = {1},
  pages   = {015004},
  year    = {2025},
  doi     = {10.1088/3049-4761/adaaf6}
}
```

---

## Links

- [PyPI](https://pypi.org/project/shapley-behaviors/)
- [Repository](https://github.com/amaxiom/shapley_behaviors)
- [Issues](https://github.com/amaxiom/shapley_behaviors/issues)
- [Paper](https://doi.org/10.1088/3049-4761/adaaf6)

---

MIT License — Copyright © 2024 Amanda S. Barnard and Tommy Liu
