Metadata-Version: 2.4
Name: shapley_behaviors
Version: 0.1.1
Summary: Shapley value transformations for behavioral data analysis
Author: Amanda Barnard
License: MIT
Project-URL: Homepage, https://github.com/amaxiom/shapley_behaviors
Project-URL: Repository, https://github.com/amaxiom/shapley_behaviors
Project-URL: Documentation, https://github.com/amaxiom/shapley_behaviors
Keywords: shapley,machine-learning,explainability,data mining,XAI
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.20
Requires-Dist: pandas>=1.3
Requires-Dist: joblib>=1.0
Requires-Dist: scikit-learn>=1.0
Requires-Dist: scipy>=1.7
Requires-Dist: matplotlib>=3.4
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: ruff; extra == "dev"

# shapley_behaviors

Shapley value transformations for explainable behavioral data analysis.

## Overview

Traditional clustering asks "which samples are similar?" but not "why do they cluster together?"

**Shapley behavioral transformations** answer the "why" by decomposing statistical properties (variance, skewness, kurtosis, entropy) into individual sample contributions. Samples that cluster in behavioral space share the same statistical role in the dataset, providing mechanistic and actionable insights.
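To make the decomposition concrete, here is a minimal one-dimensional sketch (illustrative only — the package's value functions and sampling scheme may differ): the variance of a set of points is split into per-point Shapley contributions by averaging marginal contributions over random orderings.

```python
import numpy as np

def shapley_variance_1d(x, n_permutations=200, rng=None):
    """Monte Carlo Shapley decomposition of the variance of a 1-D sample.

    Each point's value is its average marginal contribution to the
    (population) variance over random orderings. By construction the
    contributions sum exactly to the full-set variance (additivity).
    """
    rng = np.random.default_rng(rng)
    n = len(x)
    phi = np.zeros(n)
    for _ in range(n_permutations):
        perm = rng.permutation(n)
        prev = 0.0  # value of the empty coalition
        for k in range(n):
            v = x[perm[: k + 1]].var()  # value of the growing coalition
            phi[perm[k]] += v - prev
            prev = v
    return phi / n_permutations

x = np.random.default_rng(0).normal(size=50)
phi = shapley_variance_1d(x, n_permutations=100, rng=1)
print(np.isclose(phi.sum(), x.var()))  # additivity holds
```

Points far from the mean receive large positive contributions (they stretch the variance); points near the mean receive small or negative ones.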

This package implements the methodology from:

Liu, T., and Barnard, A. S. (2025). Understanding interpretable patterns of Shapley behaviours in materials data. *Machine Learning: Engineering*, 1, 015004. https://doi.org/10.1088/3049-4761/adaaf6

## Installation

```bash
pip install shapley_behaviors
```

## Quick Start

```python
import numpy as np
from shapley_behaviors import ShapleyBehaviors

# Load your data (n_samples, n_features)
X = np.random.randn(500, 20)

# Transform to behavioral spaces
sb = ShapleyBehaviors(n_permutations=100, n_jobs=-1, random_state=42)

Phi_variance = sb.transform(X, value_function='variance')
Phi_skewness = sb.transform(X, value_function='skewness')
Phi_kurtosis = sb.transform(X, value_function='kurtosis')
Phi_entropy = sb.transform(X, value_function='entropy')

# Or compute all at once
behavioral_spaces = sb.transform_multiple(X)
```

## Outlier Detection

```python
from shapley_behaviors import identify_outliers

outlier_indices, outlier_scores = identify_outliers(Phi_kurtosis, threshold=2.5)
print(f"Detected {len(outlier_indices)} outliers")
```

## Getting the Explorer Scripts

The package includes standalone explorer scripts for comprehensive analysis with visualizations, statistics, and outlier detection. Copy them to your working directory:

```python
from shapley_behaviors import copy_scripts

# Copy all scripts to current directory
copy_scripts(".")

# Or copy to a specific directory
copy_scripts("./analysis")

# Or copy only one script
copy_scripts(".", scripts=["behavioral_space_explorer"])
```

## Behavioral Space Explorer

Configure and run in Jupyter:

```python
# Configuration
SEED = 42
N_PERMUTATIONS = 1000      # 100 for quick tests, 1000 for publication
N_JOBS = -1                # -1 uses all CPU cores

DATASET_NAME = "mydata"
DATA_FILE = "mydata.csv"
ID_COLUMN = "sample_id"
DROP_COLUMNS = ["col_a", "col_b"]
LABEL_COLUMNS = ["target1", "target2", "category"]
OUTPUT_DIR = "behavioral_exploration"

# Optional: select specific features to highlight
SELECTED_FEATURES = ["feature1", "feature2", "feature3"]

# Run the explorer
%run -i behavioral_space_explorer.py
```

The explorer generates:

- `{name}_behavioral_spaces.npy` - All behavioral transformations
- `{name}_behave_{space}_{label}.png` - PCA plots colored by each label
- `{name}_hopkins_statistics.csv` - Clustering tendency metrics
- `{name}_clustering_statistics.csv` - Variance explained, pairwise distances
- `{name}_outliers_{space}.csv` - Outlier samples for each space
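The saved `.npy` file can be reloaded in later sessions. A minimal sketch, assuming it stores a pickled dict of arrays keyed by space name (this layout is an assumption — check the explorer script's save call; a stand-in dict is used here instead of real explorer output):

```python
import numpy as np

# Assumption: the explorer saves a dict of behavioral arrays keyed by
# space name. Demonstrated with a stand-in dict, not real output.
demo = {"variance": np.zeros((10, 3)), "kurtosis": np.ones((10, 3))}
np.save("demo_behavioral_spaces.npy", demo)

spaces = np.load("demo_behavioral_spaces.npy", allow_pickle=True).item()
Phi_kurtosis = spaces["kurtosis"]
print(sorted(spaces), Phi_kurtosis.shape)
```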

## Behavioral Region Explorer

For targeted analysis of specific regions in behavioral space:

```python
# Basic configuration
DATASET_NAME = "mydata"
DATA_FILE = "mydata.csv"
ID_COLUMN = "sample_id"
DROP_COLUMNS = ["col_a", "col_b"]
LABEL_COLUMNS = ["target1", "target2", "category"]
OUTPUT_DIR = "behavioral_exploration"
BEHAVIORAL_SPACES_FILE = "behavioral_exploration/mydata_behavioral_spaces.npy"
PLOT_MODE = "combined"  # or "separate"

# Define regions of interest in PCA space
USER_REGIONS = {
    "high_variance_cluster": {
        "space": "variance",
        "pc1_range": (0.3, 0.6),
        "pc2_range": (-0.2, 0.2),
        "description": "High variance contributors",
        "color": "red"
    },
    "entropy_outliers": {
        "space": "entropy",
        "pc1_range": (-0.5, -0.2),
        "pc2_range": (0.1, 0.4),
        "description": "Low entropy samples",
        "color": "blue"
    }
}

# Run the region explorer
%run -i behavioral_region_explorer.py
```

## Understanding Behavioral Spaces

**Variance Space:** Decomposes how each sample contributes to feature spread. Negative values indicate stabilizers (typical samples near the mean). Positive values indicate stretchers (extreme samples that widen the distribution). Use case: quality control, identifying process instability.

**Skewness Space:** Decomposes how each sample contributes to distributional asymmetry. Negative values pull the distribution below the mean. Positive values pull it above the mean. Near-zero values maintain symmetry. Use case: detecting biased synthesis, directional process drift.

**Kurtosis Space:** Decomposes how each sample contributes to tail heaviness. Negative values indicate core samples (predictable, well-behaved). Positive values indicate tail samples (rare extreme events). Use case: risk assessment, anomaly detection, reliability analysis.

**Entropy Space:** Decomposes how each sample contributes to information content. Positive values indicate high-information samples (rare, unique feature combinations). Negative values indicate low-information samples (common, redundant). Use case: dataset curation, experimental design, diversity quantification.
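One simple way to read these spaces is by the sign of each sample's mean per-feature contribution. A hypothetical sketch for variance space (a random stand-in array replaces `sb.transform(X, 'variance')`; the role names follow the description above):

```python
import numpy as np

# Stand-in for Phi_variance = sb.transform(X, value_function='variance')
rng = np.random.default_rng(0)
Phi_variance = rng.normal(size=(500, 20))

# Sign of the mean per-feature contribution separates the two roles:
# positive -> stretcher, non-positive -> stabilizer.
mean_contrib = Phi_variance.mean(axis=1)
roles = np.where(mean_contrib > 0, "stretcher", "stabilizer")
counts = dict(zip(*np.unique(roles, return_counts=True)))
print(counts)
```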

## Hopkins Statistic

The Hopkins statistic H measures clustering tendency:

- H > 0.7: Strong clustering (samples group by behavior)
- H approximately 0.5: Random distribution (no natural grouping)
- H < 0.3: Regular/uniform distribution
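For reference, a compact sketch of how a Hopkins statistic with this convention (H near 1 for clustered data) can be computed — not necessarily the explorer's implementation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=None, rng=None):
    """Hopkins statistic H: near 1 for clustered data, ~0.5 for random.

    Compares nearest-neighbour distances of m uniform probe points
    (drawn in the data's bounding box) against those of m real samples.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    m = m or max(1, n // 10)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # u: probe-to-nearest-data distances
    probes = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, X.shape[1]))
    u = nn.kneighbors(probes, n_neighbors=1)[0].ravel()
    # w: data-to-nearest-other-data distances (column 1 skips self)
    idx = rng.choice(n, m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]
    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(0)
clustered = np.vstack([rng.normal(0, 0.05, (100, 2)),
                       rng.normal(5, 0.05, (100, 2))])
print(round(hopkins(clustered, rng=1), 2))  # close to 1 for tight clusters
```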

## Parameter Selection

**n_permutations:** Controls Monte Carlo estimation accuracy.

- 50-100: Quick exploration, debugging
- 200-500: Standard analysis
- 1000+: Publication, final results

**n_jobs:** Parallel processing for feature columns.

- -1: Use all available CPU cores
- 1: Single-threaded (for debugging)
- N: Use N cores
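As background on what per-feature parallelism looks like, a generic joblib sketch (illustrative only — not the package's internal code):

```python
import numpy as np
from joblib import Parallel, delayed

# Each feature column is processed independently, so columns can be
# dispatched to worker processes; n_jobs maps directly to joblib.
X = np.random.default_rng(0).normal(size=(100, 8))
col_vars = Parallel(n_jobs=2)(
    delayed(np.var)(X[:, j]) for j in range(X.shape[1])
)
print(len(col_vars))
```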

**random_state:** Set for reproducibility. The implementation uses antithetic sampling for variance reduction.
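As background on antithetic sampling, a generic sketch of the idea (not the package's internals): each random ordering is paired with its reverse, which negatively correlates the two Monte Carlo estimates and reduces variance at no extra sampling cost.

```python
import numpy as np

def antithetic_permutations(n, n_pairs, rng=None):
    """Yield permutations in antithetic pairs: each random ordering
    is immediately followed by its reverse."""
    rng = np.random.default_rng(rng)
    for _ in range(n_pairs):
        perm = rng.permutation(n)
        yield perm
        yield perm[::-1]

perms = list(antithetic_permutations(5, 2, rng=0))
print(len(perms))  # 2 pairs -> 4 permutations
```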

## API Reference

**ShapleyBehaviors class:**

```python
sb = ShapleyBehaviors(n_permutations=100, n_jobs=-1, random_state=42)
Phi = sb.transform(X, value_function='variance', verbose=True)
spaces = sb.transform_multiple(X, value_functions=['variance', 'skewness', 'kurtosis', 'entropy'])
```

**identify_outliers function:**

```python
outlier_indices, outlier_scores = identify_outliers(Phi, threshold=3.0, method='zscore')
```

**Convenience functions:**

```python
from shapley_behaviors import (
    compute_shapley_variance,
    compute_shapley_skewness,
    compute_shapley_kurtosis,
    compute_shapley_entropy
)

Phi = compute_shapley_variance(X, n_permutations=100, n_jobs=-1, random_state=42)
```

## Runtime Estimates

- 500 samples, 100 permutations: 2-5 minutes
- 500 samples, 1000 permutations: 15-30 minutes
- 4000 samples, 100 permutations: 20-30 minutes
- 4000 samples, 1000 permutations: 2-3 hours

## Troubleshooting

**ImportError:** Ensure the package is installed with `pip install shapley_behaviors`

**Long runtime:** Reduce N_PERMUTATIONS to 100 for testing

**Memory error:** Reduce N_JOBS or process data in batches

**High additivity error warning:** Increase n_permutations

**No clustering detected (H approximately 0.5):** Data may lack natural behavioral groupings

## Citation

If you use this package, please cite:

```
Liu, T., and Barnard, A. S. (2025). Understanding interpretable patterns of
Shapley behaviours in materials data. Machine Learning: Engineering,
1, 015004. https://doi.org/10.1088/3049-4761/adaaf6
```

## License

MIT License. See LICENSE file for details.

## Links

- Repository: https://github.com/amaxiom/shapley_behaviors
- Issues: https://github.com/amaxiom/shapley_behaviors/issues
- Paper: https://doi.org/10.1088/3049-4761/adaaf6
