Metadata-Version: 2.4
Name: pyg-hyper-bench
Version: 0.1.0
Summary: Comprehensive benchmarking framework for hypergraph learning with PyTorch Lightning integration
Author-email: nishide-dev <nishide.dev@gmail.com>
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pyg-hyper-data>=0.1.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: scipy>=1.10.0
Requires-Dist: tqdm>=4.65.0
Provides-Extra: all
Requires-Dist: hydra-core>=1.3.0; extra == 'all'
Requires-Dist: lightning>=2.0.0; extra == 'all'
Requires-Dist: matplotlib>=3.7.0; extra == 'all'
Requires-Dist: omegaconf>=2.3.0; extra == 'all'
Requires-Dist: plotly>=5.0.0; extra == 'all'
Requires-Dist: pyg-hyper-nn>=0.1.0; extra == 'all'
Requires-Dist: seaborn>=0.12.0; extra == 'all'
Requires-Dist: sqlalchemy>=2.0.0; extra == 'all'
Requires-Dist: torch-geometric>=2.4.0; extra == 'all'
Requires-Dist: torch>=2.0.0; extra == 'all'
Provides-Extra: config
Requires-Dist: hydra-core>=1.3.0; extra == 'config'
Requires-Dist: omegaconf>=2.3.0; extra == 'config'
Provides-Extra: database
Requires-Dist: sqlalchemy>=2.0.0; extra == 'database'
Provides-Extra: lightning
Requires-Dist: lightning>=2.0.0; extra == 'lightning'
Provides-Extra: torch
Requires-Dist: pyg-hyper-nn>=0.1.0; extra == 'torch'
Requires-Dist: torch-geometric>=2.4.0; extra == 'torch'
Requires-Dist: torch>=2.0.0; extra == 'torch'
Provides-Extra: viz
Requires-Dist: matplotlib>=3.7.0; extra == 'viz'
Requires-Dist: plotly>=5.0.0; extra == 'viz'
Requires-Dist: seaborn>=0.12.0; extra == 'viz'
Description-Content-Type: text/markdown

# pyg-hyper-bench

[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch 2.0+](https://img.shields.io/badge/pytorch-2.0+-ee4c2c.svg)](https://pytorch.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Comprehensive benchmarking framework for hypergraph learning with statistical evaluation and multi-run support.

## Overview

**pyg-hyper-bench** provides a standardized framework for evaluating hypergraph neural networks with:

- 📊 **Statistical Evaluation**: Multi-run evaluation with mean, std, and 95% confidence intervals
- 🎯 **Standardized Protocols**: Node classification, link prediction, clustering, SSL evaluation
- 🔄 **Reproducibility**: Seed management for consistent results
- 🧪 **Comprehensive Testing**: 55+ tests with real datasets
- 🚀 **Easy to Use**: Simple API for single-run and multi-run evaluation

## Architecture

pyg-hyper-bench follows a clean separation of concerns:

```
pyg-hyper-bench/
├── protocols/              # Evaluation protocols
│   ├── base.py            # BenchmarkProtocol (abstract base)
│   ├── node_classification.py
│   ├── link_prediction.py
│   ├── clustering.py
│   └── ssl_linear_evaluation.py  # SSL linear evaluation
└── evaluators/            # Evaluation engines
    ├── single_run.py      # Single-run evaluator
    └── multi_run.py       # Multi-run with statistics
```

**Design Principles**:
- **pyg-hyper-data**: Datasets + Split utilities (data layer)
- **pyg-hyper-bench**: Evaluation protocols + Evaluators (evaluation layer)

## Installation

### Requirements

- Python 3.12+
- [uv](https://docs.astral.sh/uv/) (recommended) or pip
- PyTorch 2.0+ (optional, for GPU acceleration; CUDA 12.6 builds supported)

### Install from source

```bash
# Clone the repository
git clone https://github.com/nishide-dev/pyg-hyper-bench.git
cd pyg-hyper-bench

# Create virtual environment and install dependencies
uv venv
uv sync

# Activate the virtual environment
source .venv/bin/activate  # Linux/macOS
# or
.venv\Scripts\activate  # Windows
```

## Quick Start

### Single-Run Evaluation

```python
from pyg_hyper_bench import SingleRunEvaluator, NodeClassificationProtocol
from pyg_hyper_data.datasets import CoraCocitation
import torch

# Load dataset
dataset = CoraCocitation()

# Create protocol
protocol = NodeClassificationProtocol(
    split_type="transductive",
    stratified=True,
    seed=42
)

# Create evaluator
evaluator = SingleRunEvaluator(dataset, protocol, device="cpu")

# Get data splits
split = evaluator.get_split()
train_mask = split["train_mask"]
val_mask = split["val_mask"]
test_mask = split["test_mask"]
data = split["data"]

# Train your model
model = YourHypergraphModel(...)
# ... training code ...

# Evaluate
model.eval()
with torch.no_grad():
    predictions = model(data.x, data.hyperedge_index)
    test_metrics = evaluator.evaluate(
        predictions[test_mask],
        data.y[test_mask],
        split="test"
    )

print(f"Test Accuracy: {test_metrics['accuracy']:.4f}")
print(f"Test F1-macro: {test_metrics['f1_macro']:.4f}")
```
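The example above leaves `YourHypergraphModel` undefined. As a stand-in, here is a minimal hypothetical model — a plain two-layer MLP that accepts but ignores the hyperedge structure — just enough to exercise the evaluation pipeline end to end (the class name and its `(x, hyperedge_index)` signature are illustrative assumptions, not part of this package):

```python
import torch
import torch.nn as nn

class PlaceholderHypergraphModel(nn.Module):
    """Hypothetical stand-in: a plain MLP on node features.

    A real hypergraph model would also aggregate over
    hyperedge_index; this sketch ignores it, which is enough
    to produce per-node logits for the evaluator.
    """

    def __init__(self, in_channels: int, hidden_channels: int, out_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_channels, hidden_channels),
            nn.ReLU(),
            nn.Linear(hidden_channels, out_channels),
        )

    def forward(self, x: torch.Tensor, hyperedge_index: torch.Tensor) -> torch.Tensor:
        return self.net(x)  # hyperedge_index unused in this sketch

# Smoke test on random data shaped like a small hypergraph
model = PlaceholderHypergraphModel(in_channels=16, hidden_channels=32, out_channels=7)
x = torch.randn(100, 16)
hyperedge_index = torch.randint(0, 100, (2, 300))
logits = model(x, hyperedge_index)
print(logits.shape)  # torch.Size([100, 7])
```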

### Multi-Run Evaluation with Statistics

```python
from pyg_hyper_bench import MultiRunEvaluator, NodeClassificationProtocol
from pyg_hyper_data.datasets import CoraCocitation
import torch
import torch.nn as nn

# Load dataset
dataset = CoraCocitation()

# Create protocol
protocol = NodeClassificationProtocol(
    split_type="transductive",
    stratified=True,
    seed=42
)

# Create multi-run evaluator
evaluator = MultiRunEvaluator(
    dataset=dataset,
    protocol=protocol,
    n_runs=10,  # Run 10 times with different seeds
    device="cpu",
    verbose=True
)

# Model factory (creates fresh model for each run)
def model_fn(seed):
    torch.manual_seed(seed)
    return YourHypergraphModel(
        in_channels=dataset.num_node_features,
        out_channels=dataset.num_classes
    )

# Training function
def train_fn(model, data, train_mask, val_mask):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(100):
        model.train()
        optimizer.zero_grad()
        out = model(data.x, data.hyperedge_index)
        loss = criterion(out[train_mask], data.y[train_mask])
        loss.backward()
        optimizer.step()

    return model

# Run evaluation
results = evaluator.run_evaluation(
    model_fn=model_fn,
    train_fn=train_fn,
    splits=["val", "test"]
)

# Print results with statistics
print(results["test"].summary_table(split="test"))
```

**Output**:
```
## Test Results (n=10 runs)

| Metric | Mean ± Std | 95% CI |
|--------|------------|--------|
| accuracy | 0.8550 ± 0.0081 | [0.8305, 0.8796] |
| f1_macro | 0.8434 ± 0.0076 | [0.8204, 0.8665] |
| f1_micro | 0.8550 ± 0.0081 | [0.8305, 0.8796] |
```

## Features

### Evaluation Protocols

#### NodeClassificationProtocol

Standard protocol for node classification tasks:

```python
protocol = NodeClassificationProtocol(
    train_ratio=0.6,      # 60% training
    val_ratio=0.2,        # 20% validation
    test_ratio=0.2,       # 20% testing
    split_type="transductive",  # or "inductive"
    stratified=True,      # Maintain class balance
    seed=42
)
```

**Features**:
- Transductive or inductive splits
- Stratified or random sampling
- Metrics: accuracy, F1-macro, F1-micro
- Based on standard benchmarking setups (HyperGCN, AllSet, ED-HNN)

#### LinkPredictionProtocol

Hyperedge link prediction protocol:

```python
protocol = LinkPredictionProtocol(
    train_ratio=0.7,      # 70% training
    val_ratio=0.1,        # 10% validation
    test_ratio=0.2,       # 20% testing
    negative_sampling_ratio=1.0,  # 1:1 positive:negative ratio
    seed=42
)

# Split data
split = protocol.split_data(data)
train_data = split["train_data"]  # Graph for training
val_pos = split["val_pos_edge"]   # Positive validation samples
val_neg = split["val_neg_edge"]   # Negative validation samples

# Train model and get scores
pos_scores = model.predict(val_pos)
neg_scores = model.predict(val_neg)

# Evaluate
metrics = protocol.evaluate(pos_scores, neg_scores)
print(f"AUC: {metrics['auc']:.4f}")
print(f"MRR: {metrics['mrr']:.4f}")
print(f"Hits@10: {metrics['hits@10']:.4f}")
```

**Features**:
- Binary classification: real hyperedge vs random node set
- Metrics: AUC, AP, MRR, Hits@10/50/100
- Configurable negative sampling ratio
- Based on HyperGCN link prediction task
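For intuition, the ranking metrics listed above can be sketched from raw scores with scikit-learn and NumPy. This is a simplified illustration of what `protocol.evaluate` plausibly computes (the package's actual implementation may differ, e.g. in how ties are broken):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def link_metrics(pos_scores, neg_scores, k=10):
    """AUC/AP via sklearn; MRR and Hits@k by ranking each
    positive against the shared pool of negative samples."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    y_true = np.concatenate([np.ones_like(pos), np.zeros_like(neg)])
    y_score = np.concatenate([pos, neg])

    # Rank of each positive among (itself + all negatives), 1-based
    ranks = 1 + (neg[None, :] >= pos[:, None]).sum(axis=1)
    return {
        "auc": roc_auc_score(y_true, y_score),
        "ap": average_precision_score(y_true, y_score),
        "mrr": float((1.0 / ranks).mean()),
        f"hits@{k}": float((ranks <= k).mean()),
    }

# Toy scores: positives mostly higher than negatives
m = link_metrics([0.9, 0.8, 0.4], [0.3, 0.2, 0.5, 0.1])
print({key: round(v, 3) for key, v in m.items()})
# → {'auc': 0.917, 'ap': 0.917, 'mrr': 0.833, 'hits@10': 1.0}
```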

#### ClusteringProtocol

Unsupervised clustering evaluation:

```python
protocol = ClusteringProtocol(
    seed=42,
    n_clusters=7  # or None for auto-detection from labels
)

# Train model to learn embeddings (unsupervised)
embeddings = model.encode(data)  # [num_nodes, embedding_dim]

# Evaluate clustering quality
metrics = protocol.evaluate(embeddings, data.y)
print(f"NMI: {metrics['nmi']:.4f}")
print(f"ARI: {metrics['ari']:.4f}")
print(f"AMI: {metrics['ami']:.4f}")
```

**Features**:
- Unsupervised evaluation (labels only for evaluation, not training)
- K-Means clustering on learned embeddings
- Metrics: NMI (Normalized Mutual Information), ARI (Adjusted Rand Index), AMI (Adjusted Mutual Information)
- Auto-detection of number of clusters from labels
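The clustering evaluation above can be approximated directly with scikit-learn — a plausible reading of what the protocol does under the hood (a sketch for intuition, not the package's actual code):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    normalized_mutual_info_score,
    adjusted_rand_score,
    adjusted_mutual_info_score,
)

def clustering_metrics(embeddings, labels, n_clusters=None, seed=42):
    """K-Means on embeddings, scored against ground-truth labels."""
    labels = np.asarray(labels)
    if n_clusters is None:  # auto-detect from labels
        n_clusters = len(np.unique(labels))
    pred = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(embeddings)
    return {
        "nmi": normalized_mutual_info_score(labels, pred),
        "ari": adjusted_rand_score(labels, pred),
        "ami": adjusted_mutual_info_score(labels, pred),
    }

# Two well-separated blobs should be recovered almost perfectly
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (50, 8)), rng.normal(5, 0.1, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
scores = clustering_metrics(emb, y)
print(scores)  # all three scores ≈ 1.0 for well-separated blobs
```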

#### SSLLinearEvaluationProtocol

Linear evaluation protocol for self-supervised learning (SSL) methods:

```python
from pyg_hyper_bench import SSLLinearEvaluationProtocol

# Create protocol for node classification task
protocol = SSLLinearEvaluationProtocol(
    task="node_classification",  # or "hyperedge_prediction"
    classifier_type="logistic_regression",  # or "mlp"
    classifier_epochs=200,
    seed=42
)

# Split data (SSL pre-training does NOT use labels)
split = protocol.split_data(data)

# Get frozen embeddings from SSL model (trained separately)
model.eval()
with torch.no_grad():
    embeddings = model.get_embeddings(data)

# Linear evaluation (train linear classifier on frozen embeddings)
metrics = protocol.evaluate(
    embeddings=embeddings,
    labels=data.y,
    train_mask=split["train_mask"],
    val_mask=split["val_mask"],
    test_mask=split["test_mask"],
)

print(f"Test Accuracy: {metrics['test_accuracy']:.4f}")
```

**Features**:
- **Two evaluation tasks**:
  - `task="node_classification"`: Multi-class node classification
  - `task="hyperedge_prediction"`: Binary node-hyperedge membership prediction
- **Selectable classifiers**: Logistic Regression (sklearn) or MLP (PyTorch)
- **Frozen embeddings**: Evaluates representation quality without fine-tuning
- **Metrics**:
  - Node classification: accuracy, F1-macro, F1-micro
  - Hyperedge prediction: AUC, AP (Average Precision)
- Based on TriCL (AAAI'23) and HypeBoy (KDD'23) evaluation protocols

**Important**: SSL pre-training is done separately (e.g., in pyg-hyper-ssl). This protocol only evaluates the learned representations.
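The "frozen embeddings + linear classifier" idea can be sketched with scikit-learn alone. This illustrates the logistic-regression variant on a toy problem; the protocol's real implementation may differ in details such as regularization or multi-class handling:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

def linear_evaluation(embeddings, labels, train_mask, test_mask, seed=42):
    """Fit a linear classifier on frozen train embeddings,
    report accuracy/F1 on the held-out test split."""
    clf = LogisticRegression(max_iter=200, random_state=seed)
    clf.fit(embeddings[train_mask], labels[train_mask])
    pred = clf.predict(embeddings[test_mask])
    return {
        "test_accuracy": accuracy_score(labels[test_mask], pred),
        "test_f1_macro": f1_score(labels[test_mask], pred, average="macro"),
    }

# Toy "frozen" embeddings: class signal baked into the features
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=300)
emb = np.eye(3)[y] + rng.normal(0, 0.2, (300, 3))
mask = rng.random(300) < 0.6  # ~60% train, rest test
metrics = linear_evaluation(emb, y, train_mask=mask, test_mask=~mask)
print(metrics)  # near-perfect accuracy on this easy toy problem
```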

### Evaluators

#### SingleRunEvaluator

For single-run evaluation:

```python
evaluator = SingleRunEvaluator(dataset, protocol, device="cpu")
split = evaluator.get_split()
metrics = evaluator.evaluate(predictions, targets, split="test")
```

#### MultiRunEvaluator

For multi-run evaluation with statistical aggregation:

```python
evaluator = MultiRunEvaluator(
    dataset=dataset,
    protocol=protocol,
    n_runs=10,
    seeds=[42, 43, 44, ...],  # Optional custom seeds
    device="cpu",
    verbose=True
)

results = evaluator.run_evaluation(model_fn, train_fn, splits=["val", "test"])
```

**Statistical Measures**:
- Mean across all runs
- Standard deviation
- 95% confidence interval (using t-distribution)
- Raw results from each run
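The aggregation above — in particular the t-distribution confidence interval — can be reproduced with SciPy. A sketch of the statistics, assuming a two-sided 95% interval over n runs (the metric values below are made up for illustration):

```python
import numpy as np
from scipy import stats

def aggregate_runs(values, confidence=0.95):
    """Mean, sample std, and t-based confidence interval
    for a list of per-run metric values."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    mean = values.mean()
    std = values.std(ddof=1)  # sample standard deviation
    # Half-width of the two-sided t interval for the mean
    half = stats.t.ppf((1 + confidence) / 2, df=n - 1) * std / np.sqrt(n)
    return {"mean": mean, "std": std, "ci": (mean - half, mean + half)}

# Example: 10 accuracy values from hypothetical runs
accs = [0.85, 0.86, 0.84, 0.87, 0.85, 0.86, 0.85, 0.84, 0.86, 0.85]
agg = aggregate_runs(accs)
print(f"{agg['mean']:.4f} ± {agg['std']:.4f}, "
      f"95% CI [{agg['ci'][0]:.4f}, {agg['ci'][1]:.4f}]")
```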

### Reproducibility

All evaluators support seed management for reproducibility:

```python
# Same seed = same results
protocol1 = NodeClassificationProtocol(seed=42)
protocol2 = NodeClassificationProtocol(seed=42)
split1 = protocol1.split_data(data)
split2 = protocol2.split_data(data)
# split1 == split2 ✓

# Multi-run with custom seeds
evaluator = MultiRunEvaluator(
    dataset=dataset,
    protocol=protocol,
    n_runs=5,
    seeds=[0, 10, 20, 30, 40]  # Custom seeds
)
```

## Verified Performance

Integration tests with real datasets and models:

**Dataset**: CoraCocitation (1,434 nodes, 1,579 hyperedges, 7 classes)

**Model**: Simple Hypergraph GNN (64 hidden dimensions, 10 epochs)

**Results**:
- Single run: **83.22%** test accuracy
- Multi-run (n=3): **85.50% ± 0.81%** test accuracy (95% CI: [83.05%, 87.96%])

See `tests/test_integration.py` for full integration tests.

## Development

### Pre-commit hooks

This project uses pre-commit hooks to ensure code quality:

```bash
# Install pre-commit hooks
uv run pre-commit install
uv run pre-commit install --hook-type commit-msg

# Run hooks manually on all files
uv run pre-commit run --all-files

# Run hooks on staged files (happens automatically on git commit)
git commit -m "Your message"
```

**Hooks**:
- `ruff check --fix`: Auto-fix linting issues
- `ruff format`: Format code
- `ty check`: Type checking (runs on entire project)

### Running tests

```bash
# Run all tests
uv run pytest

# Run with verbose output
uv run pytest -v

# Run specific test file
uv run pytest tests/test_integration.py

# Run with coverage
uv run pytest --cov=src --cov-report=html
```

**Test Coverage**:
- Unit tests: 10 tests for MultiRunEvaluator, 12 tests for LinkPrediction, 12 tests for Clustering, 13 tests for SSL
- Integration tests: 6 tests with real datasets and models
- Total: 55 tests (53 passed, 2 skipped)

### Code quality

```bash
# Format code
uv run ruff format .

# Lint code
uv run ruff check .

# Type checking
uv run ty check
```

### Adding dependencies

```bash
# Add runtime dependency
uv add <package-name>

# Add development dependency
uv add --dev <package-name>

# Update all dependencies
uv lock --upgrade
```

## Project Structure

```
pyg-hyper-bench/
├── src/pyg_hyper_bench/
│   ├── __init__.py
│   ├── protocols/              # Evaluation protocols
│   │   ├── __init__.py
│   │   ├── base.py            # BenchmarkProtocol (abstract)
│   │   ├── node_classification.py
│   │   ├── link_prediction.py
│   │   ├── clustering.py
│   │   └── ssl_linear_evaluation.py  # SSL linear evaluation
│   └── evaluators/             # Evaluation engines
│       ├── __init__.py
│       ├── single_run.py      # Single-run evaluator
│       └── multi_run.py       # Multi-run evaluator
├── tests/
│   ├── test_multi_run_evaluator.py  # Unit tests
│   ├── test_link_prediction.py      # Link prediction tests
│   ├── test_clustering.py           # Clustering tests
│   ├── test_ssl_linear_evaluation.py # SSL evaluation tests
│   └── test_integration.py          # Integration tests
├── docs/
│   └── DESIGN.md              # Detailed design document
├── pyproject.toml             # Project configuration
└── README.md                  # This file
```

## Dependencies

**Core**:
- pyg-hyper-data: Dataset and data utilities
- PyTorch: Deep learning framework
- torch-scatter: Scatter operations for hypergraph aggregation
- NumPy: Numerical computing
- SciPy: Statistical functions
- pandas: Data manipulation
- scikit-learn: Machine learning utilities (classifiers, metrics)
- tqdm: Progress bars

**Development**:
- pytest: Testing framework
- ruff: Linter and formatter
- ty: Type checker

## Related Projects

- [pyg-hyper-data](https://github.com/nishide-dev/pyg-hyper-data): Hypergraph datasets
- [pyg-hyper-nn](https://github.com/nishide-dev/pyg-hyper-nn): Hypergraph neural network layers
- [pyg-hyper-ssl](https://github.com/nishide-dev/pyg-hyper-ssl): Self-supervised learning for hypergraphs

## Citation

If you use this package in your research, please cite:

```bibtex
@software{pyg_hyper_bench,
  title = {pyg-hyper-bench: Benchmarking Framework for Hypergraph Learning},
  author = {Nishide},
  year = {2025},
  url = {https://github.com/nishide-dev/pyg-hyper-bench}
}
```

## License

MIT License - see [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## Acknowledgments

This project follows best practices from:
- **TriCL (AAAI'23)**: Multi-seed evaluation with mean ± std reporting, LogisticRegression for linear evaluation
- **HyperGCL**: Logger-based multi-run tracking
- **HypeBoy (KDD'23)**: 20-split evaluation with statistical aggregation, MLP for linear evaluation, hyperedge prediction task

Built with:
- [uv](https://github.com/astral-sh/uv) - Fast Python package manager
- [PyTorch](https://pytorch.org/) - Deep learning framework
- [PyTorch Geometric](https://pytorch-geometric.readthedocs.io/) - Graph neural networks

---

Generated with ❤️ for reproducible hypergraph learning research
