Metadata-Version: 2.4
Name: datatypical
Version: 0.7.1
Summary: Explainable instance significance discovery for scientific datasets
Home-page: https://github.com/amaxiom/DataTypical
Author: Amanda S. Barnard
Author-email: "Amanda S. Barnard" <amanda.s.barnard@anu.edu.au>
License: MIT
Project-URL: Homepage, https://github.com/amaxiom/DataTypical
Project-URL: Documentation, https://github.com/amaxiom/DataTypical/tree/main/docs
Project-URL: Repository, https://github.com/amaxiom/DataTypical
Keywords: machine-learning,explainable-ai,shapley-values,data-science
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# DataTypical

**Explainable Instance Significance Discovery for Scientific Datasets**

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

DataTypical analyzes datasets through three complementary lenses: archetypal (extreme), prototypical (representative), and stereotypical (target-like). Shapley value explanations reveal why instances matter and which ones create your dataset's structure.

---

## Key Features

- **Three Significance Types**: Archetypal, prototypical, stereotypical (all computed simultaneously)
- **Shapley Explanations**: Feature-level attributions for why samples are significant
- **Formative Discovery**: Distinguish samples that ARE significant from those that CREATE structure
- **Publication Visualizations**: Dual-perspective scatter plots, heatmaps, and profile plots
- **Multi-Modal Support**: Tabular data, text, and graph networks through unified API
- **Performance Optimized**: Fast exploration mode and efficient Shapley computation

---

## Installation
```bash
pip install datatypical
```

---

## Quick Start
```python
from datatypical import DataTypical
from datatypical_viz import significance_plot, heatmap, profile_plot
import pandas as pd

# Load your data
data = pd.read_csv('your_data.csv')

# Analyze with explanations
dt = DataTypical(shapley_mode=True)
results = dt.fit_transform(data)

# Three significance perspectives (0-1 normalized ranks)
print(results[['archetypal_rank', 'prototypical_rank', 'stereotypical_rank']])

# Visualize: which samples are critical vs replaceable?
significance_plot(results, significance='archetypal')

# Understand: which features drive significance?
heatmap(dt, results, significance='archetypal', order='actual', top_n=20)

# Explain: why is this sample significant?
top_idx = results['archetypal_rank'].idxmax()
profile_plot(dt, top_idx, significance='archetypal', order='local')
```

---

## What DataTypical Does

### Three Complementary Lenses

| Lens | Finds | Use Cases |
|------|-------|-----------|
| **Archetypal** | Extreme, boundary samples | Edge case discovery, outlier detection, range understanding |
| **Prototypical** | Representative, central samples | Dataset summarization, cluster centers, data coverage |
| **Stereotypical** | Target-similar samples | Optimization, goal-oriented selection, phenotype matching |

**The Power**: All three are computed simultaneously, and each perspective reveals different insights.
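
To see how the lenses disagree, it can help to list the top-ranked samples under each one. The following is a minimal sketch using only pandas on a results table; the helper name `top_per_lens` and the toy values are hypothetical, and only the `*_rank` column names come from the Quick Start above.

```python
import pandas as pd

def top_per_lens(results, k=2):
    """Return the index labels of the k highest-ranked samples for each lens.

    Hypothetical helper, not part of the DataTypical API.
    """
    lenses = ["archetypal_rank", "prototypical_rank", "stereotypical_rank"]
    return {
        col: results[col].nlargest(k).index.tolist()
        for col in lenses
        if col in results.columns  # skip lenses that were not computed
    }

# Toy stand-in for the DataFrame returned by fit_transform (values are made up)
demo_results = pd.DataFrame({
    "archetypal_rank":    [0.9, 0.1, 0.5, 1.0, 0.3],
    "prototypical_rank":  [0.2, 1.0, 0.8, 0.1, 0.5],
    "stereotypical_rank": [0.4, 0.6, 1.0, 0.2, 0.8],
})

tops = top_per_lens(demo_results, k=2)
print(tops)
```

In this toy table each lens surfaces a different pair of samples, which is the kind of disagreement the three-lens design is meant to expose.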

### Dual Perspective (with Shapley)

When `shapley_mode=True`, DataTypical reveals two views:

- **Actual Significance** (`*_rank`): Samples that ARE significant
- **Formative Significance** (`*_shapley_rank`): Samples that CREATE the structure

This distinction between what IS significant and what CREATES structure is unique to DataTypical.
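
One way to work with the two views is to label each sample by where it falls on both ranks. The sketch below assumes a results table with the `*_rank` and `*_shapley_rank` columns described above; the function name, the label names, and the 0.8 / 0.3 thresholds are illustrative assumptions, not library defaults.

```python
import pandas as pd

def classify_dual_perspective(results, significance="archetypal", high=0.8, low=0.3):
    """Label samples by actual vs formative rank.

    Hypothetical helper; thresholds and label names are illustrative only.
    """
    actual = results[f"{significance}_rank"]
    formative = results[f"{significance}_shapley_rank"]

    labels = pd.Series("other", index=results.index, name="dual_label")
    # High on both views: significant and structure-creating
    labels.loc[(actual > high) & (formative > high)] = "critical"
    # High actual, low formative: significant but redundant
    labels.loc[(actual > high) & (formative < low)] = "replaceable"
    # Low actual, high formative: shapes the structure without ranking highly
    labels.loc[(actual < low) & (formative > high)] = "formative_only"
    return labels

# Toy stand-in for the output of fit_transform with shapley_mode=True
demo = pd.DataFrame(
    {
        "archetypal_rank":         [0.95, 0.90, 0.10, 0.50],
        "archetypal_shapley_rank": [0.92, 0.15, 0.88, 0.50],
    },
    index=["a", "b", "c", "d"],
)

labels = classify_dual_perspective(demo)
print(labels.to_dict())
# {'a': 'critical', 'b': 'replaceable', 'c': 'formative_only', 'd': 'other'}
```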

---

## Example: Drug Discovery
```python
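from datatypical import DataTypical
from datatypical_viz import profile_plot

# Analyze compound library (`compounds` is a DataFrame with an 'activity' column)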
dt = DataTypical(
    shapley_mode=True,
    stereotype_column='activity',  # Target property
    fast_mode=False
)
results = dt.fit_transform(compounds)

# Find critical compounds (high actual + high formative)
critical = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] > 0.8)
]
print(f"Found {len(critical)} critical compounds")

# Find redundant compounds (high actual + low formative)
redundant = results[
    (results['stereotypical_rank'] > 0.8) &
    (results['stereotypical_shapley_rank'] < 0.3)
]
print(f"Found {len(redundant)} replaceable compounds")

# Understand alternative mechanisms
for idx in critical.index:
    profile_plot(dt, idx, significance='stereotypical')
    # Each shows different feature pattern → different mechanism
```

**Discovery**: Multiple structural pathways to high activity.

---

## Key Parameters
```python
DataTypical(
    shapley_mode=False,           # True for explanations
    fast_mode=True,               # False for publication quality
    n_archetypes=8,               # Number of extreme corners
    n_prototypes=8,               # Number of representatives
    stereotype_column=None,       # Target column for stereotypical
    shapley_top_n=500,            # Limit explanations to top N
    shapley_n_permutations=100,   # Number of permutations
    random_state=None,            # Set for reproducible results
    max_memory_mb=8000            # Memory limit
)
```

---

## Visualization Functions
```python
from datatypical_viz import significance_plot, heatmap, profile_plot

# 1. Dual-perspective scatter plot
significance_plot(results, significance='archetypal')

# 2. Feature attribution heatmap
heatmap(dt, results, significance='archetypal', order='actual', top_n=20)

# 3. Individual sample profile (sample_idx is an index label from results)
sample_idx = results['archetypal_rank'].idxmax()
profile_plot(dt, sample_idx, significance='archetypal', order='local')
```

---

## Multi-Modal Support

### Tabular Data
```python
df = pd.DataFrame(...)
dt = DataTypical()
results = dt.fit_transform(df)
```

### Text Data
```python
texts = ["document 1", "document 2", ...]
dt = DataTypical()
results = dt.fit_transform(texts)
```

### Graph Networks
```python
node_features = pd.DataFrame(...)
edges = [(0, 1), (1, 2), ...]
dt = DataTypical()
results = dt.fit_transform(node_features, edges=edges)
```

---

## Performance

| Dataset Size | Without Shapley | With Shapley |
|--------------|-----------------|--------------|
| 1,000 samples | ~5 seconds | ~5 minutes |
| 10,000 samples | ~30 seconds | ~60 minutes |

**Optimization Strategy**:
1. Fast exploration (`fast_mode=True`, no Shapley)
2. Identify interesting samples
3. Detailed analysis (`shapley_mode=True`, subset)
4. Generate publication figures
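
Step 2 of the strategy above can be sketched with plain pandas: after a fast run, keep only the samples that rank highly under at least one lens and pass that subset to the detailed run. The helper name `select_candidates` and the 0.9 threshold are assumptions for illustration. Note that re-fitting on a subset presumably recomputes ranks relative to that subset, so the `shapley_top_n` parameter may be the better option when ranks must stay relative to the full dataset.

```python
import pandas as pd

def select_candidates(results,
                      rank_columns=("archetypal_rank", "prototypical_rank"),
                      threshold=0.9):
    """Return index labels of samples whose rank exceeds `threshold`
    in at least one of the given columns.

    Hypothetical helper, not part of the DataTypical API.
    """
    present = [c for c in rank_columns if c in results.columns]
    mask = (results[present] > threshold).any(axis=1)
    return results.index[mask]

# Toy stand-in for the output of a fast run (fast_mode=True, no Shapley)
fast_results = pd.DataFrame({
    "archetypal_rank":   [0.95, 0.20, 0.50, 0.10],
    "prototypical_rank": [0.10, 0.30, 0.97, 0.40],
})

subset_idx = select_candidates(fast_results)
print(subset_idx.tolist())  # [0, 2]

# Step 3 would then look something like this (not run here):
# dt = DataTypical(shapley_mode=True, fast_mode=False)
# detailed = dt.fit_transform(data.loc[subset_idx])
```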

---

## Use Cases

**Scientific Discovery**: Alternative mechanisms and pathways, boundary definition and edge cases, quality control and validation, coverage analysis and gap identification

**Dataset Curation**: Size reduction while preserving diversity, representative selection, redundancy detection, gap identification for future sampling

**Model Understanding**: Feature importance (global and local), individual sample explanations, pattern recognition across pathways, interpretable explanations

---

## What Makes DataTypical Different

**From outlier detection**: Finds extremes AND explains why

**From clustering**: Finds representatives maximizing coverage AND explains why

**From feature selection**: Explains which features matter for which samples

**From PCA/t-SNE**: Maintains interpretability in original feature space

**The Novel Contribution**: Formative instances separate samples that ARE significant from samples that CREATE structure. This enables redundancy detection, identifies structurally important samples, and shows which samples are irreplaceable and which are interchangeable.

---

## Documentation

Complete documentation, examples, and guides are available at:  
**https://github.com/amaxiom/DataTypical**

Includes:
- Getting started tutorials
- Comprehensive examples across scientific domains
- Visualization interpretation guides
- Advanced usage and computation details
- Test suite and benchmarks

---

## Support

- **GitHub Repository**: https://github.com/amaxiom/DataTypical
- **Report Issues**: https://github.com/amaxiom/DataTypical/issues
- **Questions & Discussions**: https://github.com/amaxiom/DataTypical/discussions

---

## Requirements

- Python ≥ 3.8
- NumPy ≥ 1.20
- Pandas ≥ 1.3
- SciPy ≥ 1.7
- scikit-learn ≥ 1.0
- Matplotlib ≥ 3.3
- Seaborn ≥ 0.11
- Numba ≥ 0.55

---

## Citation

If you use DataTypical in your research, please cite:
```bibtex
@software{datatypical2026,
  author = {Barnard, Amanda S.},
  title = {DataTypical: Scientific Data Significance Rankings with Shapley Explanations},
  year = {2026},
  url = {https://github.com/amaxiom/DataTypical},
  version = {0.7}
}
```

---

## License

MIT License - Copyright (c) 2026 Amanda S. Barnard

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.

---

## Acknowledgments

DataTypical builds on foundational work in archetypal analysis (Cutler & Breiman, 1994), facility location optimization (Nemhauser et al., 1978), Shapley value theory (Shapley, 1953), and PCHA optimization (Mørup & Hansen, 2012).
