Metadata-Version: 2.4
Name: pyg-hyper-data
Version: 0.1.1
Summary: PyTorch Geometric-based multimodal hypergraph datasets
Author-email: Ryusei Nishide <nishide.dev@gmail.com>
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Requires-Dist: datasets>=4.5.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: tqdm>=4.65.0
Provides-Extra: all
Requires-Dist: huggingface-hub>=0.20.0; extra == 'all'
Requires-Dist: polars>=0.20.0; extra == 'all'
Requires-Dist: sentence-transformers>=2.0.0; extra == 'all'
Requires-Dist: torch-cluster>=1.6.0; extra == 'all'
Requires-Dist: torch-geometric>=2.4.0; extra == 'all'
Requires-Dist: torch-scatter>=2.1.0; extra == 'all'
Requires-Dist: torch-sparse>=0.6.0; extra == 'all'
Requires-Dist: torch-spline-conv>=1.2.0; extra == 'all'
Requires-Dist: torch>=2.0.0; extra == 'all'
Requires-Dist: transformers>=4.30.0; extra == 'all'
Provides-Extra: large-scale
Requires-Dist: polars>=0.20.0; extra == 'large-scale'
Provides-Extra: multimodal
Requires-Dist: huggingface-hub>=0.20.0; extra == 'multimodal'
Requires-Dist: sentence-transformers>=2.0.0; extra == 'multimodal'
Requires-Dist: transformers>=4.30.0; extra == 'multimodal'
Provides-Extra: torch
Requires-Dist: torch-cluster>=1.6.0; extra == 'torch'
Requires-Dist: torch-geometric>=2.4.0; extra == 'torch'
Requires-Dist: torch-scatter>=2.1.0; extra == 'torch'
Requires-Dist: torch-sparse>=0.6.0; extra == 'torch'
Requires-Dist: torch-spline-conv>=1.2.0; extra == 'torch'
Requires-Dist: torch>=2.0.0; extra == 'torch'
Description-Content-Type: text/markdown

# PyG-Hyper-Data

[![PyPI version](https://badge.fury.io/py/pyg-hyper-data.svg)](https://pypi.org/project/pyg-hyper-data/)
[![Python 3.12+](https://img.shields.io/badge/python-3.12%2B-blue.svg)](https://www.python.org/downloads/)
[![PyTorch Geometric](https://img.shields.io/badge/PyG-2.6-ee4c2c.svg)](https://pytorch-geometric.readthedocs.io/)
[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)
[![Type checked: ty](https://img.shields.io/badge/type%20checked-ty-blue.svg)](https://github.com/astral-sh/ty)

**PyTorch Geometric-based multimodal hypergraph dataset library with advanced features for research reproducibility and production deployment.**

PyG-Hyper-Data is a comprehensive library for hypergraph learning built on PyTorch Geometric. It provides standardized datasets, structural leak prevention, efficient sampling strategies, and benchmark protocols to accelerate hypergraph research and applications.

## 🚀 Key Features

### 📊 **10 Production-Ready Datasets**
- **Citation Networks**: Cora, Citeseer, Pubmed (co-citation) - *DHG-Bench*
- **Authorship Networks**: Cora, DBLP (coauthorship) - *DHG-Bench*
- **3D Shape Recognition**: NTU2012, ModelNet40 - *TriCL*
- **UCI Datasets**: Mushroom, Zoo - *TriCL*
- **Text Classification**: 20 Newsgroups - *TriCL*
- **Flexible Isolated Node Handling**: Preserve or remove isolated nodes for multimodal feature collection

### 🔄 **Dual Data Format Support**
- **HyperData**: Compact hyperedge_index representation
- **HyperHeteroData**: Universal Bipartite Object (UBO) format with typed hyperedges
- **Seamless Conversion**: `data.to_hetero()` and `data.to_hyperedge_index()`
- **Multimodal Attributes**: Text, images, and mixed features
- **AttributedHyperData**: Attribute nodes as independent entities with entity-attribute hyperedges

### 🛡️ **Structural Leak Prevention**
- **Inductive Split**: Complete graph separation for train/val/test
- **Transductive Split**: Stratified node-level splitting
- **Hyperedge Prediction Split**: Link prediction with negative sampling
- **Conservative Assignment**: Zero information leakage guarantee

### ⚡ **Efficient Large-Scale Sampling**
- **HypergraphNeighborLoader**: Prevents "Monster Hyperedge" explosion
- **Hierarchical Sampling**: Independent control for V→E and E→V
- **Cardinality-Aware**: Handles power-law degree distributions
- **Memory-Efficient**: Bounded computation graphs

## 📦 Installation

### Using uv (Recommended)
```bash
# Clone the repository
git clone https://github.com/nishide-dev/pyg-hyper-data.git
cd pyg-hyper-data

# Install core package with basic dependencies
uv sync

# Install with optional features
uv sync --extra multimodal      # Add multimodal support (transformers, etc.)
uv sync --extra large-scale     # Add large-scale data processing (polars)
uv sync --all-extras            # Install all optional features

# Install with development tools (for contributors)
uv sync --all-extras --group dev
```

### Using pip
```bash
# Install core package
pip install -e .

# Or with optional features
pip install -e ".[multimodal]"    # Multimodal support
pip install -e ".[large-scale]"   # Large-scale data processing
pip install -e ".[all]"           # All optional features
```

### Requirements
- Python ≥ 3.12
- PyTorch ≥ 2.0, PyTorch Geometric ≥ 2.4 (installed via the `torch` or `all` extras)
- NumPy, Pandas, tqdm, datasets (installed with the core package)

## 🎯 Quick Start

### Load a Dataset
```python
from pyg_hyper_data.datasets import CoraCocitation

# Load dataset (default: isolated nodes removed)
dataset = CoraCocitation()
data = dataset[0]

print(f"Nodes: {data.num_nodes}")  # 1,434 (connected nodes only)
print(f"Hyperedges: {data.num_hyperedges}")  # 1,579
print(f"Features: {data.x.shape[1]}")  # 1,433
print(f"Classes: {int(data.y.max()) + 1}")  # 7
```

### Flexible Isolated Node Handling
```python
from pyg_hyper_data.datasets import CoraCocitation

# Default: isolated nodes removed (TriCL-compatible)
dataset = CoraCocitation(remove_isolated_nodes=True)
data = dataset[0]
print(f"Connected nodes: {data.num_nodes}")  # 1,434

# Preserve all nodes (for multimodal feature collection)
dataset_full = CoraCocitation(remove_isolated_nodes=False)
data_full = dataset_full[0]
print(f"Total nodes: {data_full.num_nodes}")  # 2,708
print(f"Isolated nodes: {data_full.num_nodes - data.num_nodes}")  # 1,274

# Node ID mapping preserved for multimodal features
print(f"Original node IDs preserved: {len(dataset_full.node_id_map)} nodes")
```

### Data Format Conversion
```python
# Convert to HeteroData (UBO format)
hetero_data = data.to_hetero()

print(hetero_data)
# HyperHeteroData(
#   node={ x=[1434, 1433], y=[1434] },
#   hyperedge={ num_nodes=1579 },
#   (node, member_of, hyperedge)={ edge_index=[2, 4786] },
#   (hyperedge, has_member, node)={ edge_index=[2, 4786] }
# )

# Convert back to HyperData
hyper_data = hetero_data.to_hyperedge_index()
```

### Hierarchical Neighbor Sampling
```python
from pyg_hyper_data.loader import HypergraphNeighborLoader
import torch

# Create loader with explosion prevention
loader = HypergraphNeighborLoader(
    data,
    num_neighbors_nodes=10,   # Sample 10 hyperedges per node
    num_neighbors_edges=20,   # Sample 20 nodes per hyperedge (CRITICAL!)
    input_nodes=torch.arange(100),
    batch_size=32,
    num_hops=2,
)

# Iterate over batches
for batch in loader:
    # batch is a HyperHeteroData with sampled subgraph
    print(f"Batch nodes: {batch['node'].num_nodes}")
    # Your model training code here...
```

### Inductive Split (Zero Leakage)
```python
from pyg_hyper_data.transforms import inductive_hypergraph_split

# Create completely independent train/val/test graphs
train_data, val_data, test_data = inductive_hypergraph_split(
    data,
    train_ratio=0.6,
    val_ratio=0.2,
    test_ratio=0.2,
    seed=42
)

print(f"Train: {train_data.num_nodes} nodes, {train_data.num_hyperedges} edges")
print(f"Val:   {val_data.num_nodes} nodes, {val_data.num_hyperedges} edges")
print(f"Test:  {test_data.num_nodes} nodes, {test_data.num_hyperedges} edges")

# Test nodes/edges are completely hidden from training
# No structural information leaks
```

### Hyperedge Prediction (Link Prediction)
```python
from pyg_hyper_data.transforms import hyperedge_prediction_split

# Split for link prediction task
split = hyperedge_prediction_split(
    data,
    train_ratio=0.7,
    val_ratio=0.1,
    test_ratio=0.2,
    seed=42,
    negative_sampling_ratio=1.0,  # 1:1 positive to negative
)

# Access splits
train_data = split['train_data']  # Graph with train edges only
val_pos = split['val_pos_edge']   # Positive val samples
val_neg = split['val_neg_edge']   # Negative val samples
test_pos = split['test_pos_edge']
test_neg = split['test_neg_edge']

print(f"Train graph: {train_data.num_hyperedges} hyperedges")
print(f"Val: {val_pos.shape[1]} positive, {val_neg.shape[1]} negative")
```

### Benchmark Evaluation
```python
from pyg_hyper_data.benchmark import (
    NodeClassificationProtocol,
    HypergraphEvaluator
)

# Create protocol
protocol = NodeClassificationProtocol(
    split_type='transductive',  # or 'inductive'
    stratified=True,            # Maintain class balance
    seed=42
)

# Create evaluator
evaluator = HypergraphEvaluator(dataset, protocol)

# Get split
split_data = evaluator.get_split()
train_mask = split_data['train_mask']
val_mask = split_data['val_mask']
test_mask = split_data['test_mask']

# ... train your model ...
# predictions = model(data)

# Evaluate
test_metrics = evaluator.evaluate(
    predictions[test_mask],
    data.y[test_mask],
    split='test'
)

print(f"Test Accuracy: {test_metrics['accuracy']:.4f}")
print(f"Test F1 (macro): {test_metrics['f1_macro']:.4f}")
print(f"Test F1 (micro): {test_metrics['f1_micro']:.4f}")
```

## 🎭 Attributed Heterogeneous Hypergraphs

Transform hypergraphs with attributes (e.g., title, abstract) into **heterogeneous hypergraphs** where attributes become independent nodes connected to entities via **entity-attribute hyperedges**. This enables advanced multimodal learning with heterogeneous GNNs.

### Basic Usage

```python
from pyg_hyper_data.datasets import CoraCocitation
from pyg_hyper_data.transforms import AddEntityAttributeHyperedges

# Create transform
transform = AddEntityAttributeHyperedges(
    attributes=["title", "abstract"],
    one_to_one=True,
    edge_type="combined"  # or "separate"
)

# Apply to dataset
dataset = CoraCocitation(
    remove_isolated_nodes=False,
    load_text_features=True,
    pre_transform=transform
)

data = dataset[0]  # AttributedHyperData
print(f"Total nodes: {data.num_nodes}")  # Entities + Titles + Abstracts
print(f"Total hyperedges: {data.num_hyperedges}")  # Structural + Entity-Attribute
```

### Convert to Heterogeneous Format

```python
# Convert to HyperHeteroData with typed hyperedges
hetero_data = data.to_hetero()

print(hetero_data.node_types)
# ['entity', 'title', 'abstract', 'structural_hyperedge', 'entity_attribute_hyperedge']

print(hetero_data.edge_types)
# [('entity', 'member_of', 'structural_hyperedge'),
#  ('entity', 'member_of', 'entity_attribute_hyperedge'),
#  ('title', 'member_of', 'entity_attribute_hyperedge'),
#  ('abstract', 'member_of', 'entity_attribute_hyperedge'), ...]
```

### Key Features

- **Typed Hyperedges**: Structural and entity-attribute hyperedges are separated
- **Two Edge Strategies** (see the sketch after this list):
  - `combined`: One hyperedge per entity connecting all its attributes
  - `separate`: One hyperedge per entity-attribute pair
- **Automatic Conversion**: Seamless integration with HyperHeteroData format
- **Heterogeneous GNN Ready**: Compatible with PyG's heterogeneous message passing
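
To make the two strategies concrete, here is a minimal sketch (plain Python, not library code) of the hyperedges produced for a single entity with two attributes; the node IDs are made up:

```python
# Illustrative only: the sets below stand in for the entity-attribute
# hyperedges the transform would create for one entity.
entity, title, abstract = 0, 1, 2

# edge_type="combined": one hyperedge per entity, spanning all its attributes
combined = [{entity, title, abstract}]                 # 1 hyperedge

# edge_type="separate": one hyperedge per entity-attribute pair
separate = [{entity, title}, {entity, abstract}]       # 2 hyperedges
```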

### Example: Pubmed Dataset

```python
from pyg_hyper_data.datasets import PubmedCocitation
from pyg_hyper_data.transforms import AddEntityAttributeHyperedges

# Load Pubmed with text features (lazy loading for large datasets)
dataset = PubmedCocitation(
    remove_isolated_nodes=True,
    load_text_features="lazy"  # Lazy loading for 19,717 nodes
)

# Create transform
transform = AddEntityAttributeHyperedges(
    attributes=["title", "abstract"],
    one_to_one=True,
    edge_type="combined"
)

# Apply transform
data = dataset[0]
attributed_data = transform(data)

print(f"Entities: {attributed_data.original_num_nodes}")  # 3,840
print(f"Total nodes: {attributed_data.num_nodes}")  # 3,840 × 3 = 11,520
print(f"Structural hyperedges: {attributed_data.original_num_edges}")  # 7,963
print(f"Entity-attribute hyperedges: {attributed_data.num_hyperedges - attributed_data.original_num_edges}")  # 3,840
```

## 🎨 Multimodal Features: Text Attributes

Cora and Pubmed datasets support text features (paper titles + abstracts) from the [HuggingFace TAG dataset](https://huggingface.co/datasets/Graph-COM/Text-Attributed-Graphs). Title and abstract are provided as **separate attributes** for maximum flexibility in multimodal hypergraph learning. Text features are encoded with [sentence-transformers](https://www.sbert.net/).

### Installation

```bash
# Install with multimodal support
uv sync --extra multimodal

# Or with pip
pip install -e ".[multimodal]"
```

### Basic Usage (Eager Loading)

```python
from pyg_hyper_data.datasets import CoraCocitation

# Load with text features (encoded with all-mpnet-base-v2)
dataset = CoraCocitation(
    remove_isolated_nodes=False,  # Recommended for perfect TAG alignment
    load_text_features=True
)

data = dataset[0]
print(f"Nodes: {data.num_nodes}")  # 2,708

# Access separate title and abstract
print(f"Titles: {len(data.title)}")  # 2,708
print(f"Abstracts: {len(data.abstract)}")  # 2,708

# View sample text
print(data.title[0][:100])
# "Incremental Learning of Context-Free Grammars..."
print(data.abstract[0][:100])
# "We describe a representation and algorithms for learning..."

# Access sentence-transformer embeddings (encoded separately)
print(data.title_embeddings.shape)  # [2708, 768]
print(data.abstract_embeddings.shape)  # [2708, 768]
```

### Lazy Loading (For Large Datasets)

For large datasets like Pubmed (19,717 nodes), use lazy loading to avoid loading all text data into memory:

```python
from pyg_hyper_data.datasets import PubmedCocitation

# Enable lazy loading
dataset = PubmedCocitation(
    remove_isolated_nodes=False,
    load_text_features="lazy"  # Cached but not loaded
)

data = dataset[0]
print(f"Nodes: {data.num_nodes}")  # 19,717

# Load text on-demand when needed
titles = data.load_title()
abstracts = data.load_abstract()
print(f"Loaded {len(titles)} titles")
print(f"Loaded {len(abstracts)} abstracts")

# Load embeddings on-demand
title_embeddings = data.load_title_embeddings()
abstract_embeddings = data.load_abstract_embeddings()
print(title_embeddings.shape)  # [19717, 768]
print(abstract_embeddings.shape)  # [19717, 768]
```

### Custom Encoder Model

Use different sentence-transformer models for different embedding dimensions:

```python
# Use smaller model (384-dim instead of 768-dim)
dataset = CoraCocitation(
    remove_isolated_nodes=False,
    load_text_features=True,
    encoder_model="all-MiniLM-L6-v2"  # 384-dim embeddings
)

data = dataset[0]
print(data.title_embeddings.shape)  # [2708, 384]
print(data.abstract_embeddings.shape)  # [2708, 384]
```

### Supported Datasets

| Dataset | Text Support | Nodes (Full) | Alignment | Notes |
|---------|--------------|--------------|-----------|-------|
| **CoraCocitation** | ✅ Yes | 2,708 | Perfect | Recommended: `remove_isolated_nodes=False` |
| **PubmedCocitation** | ✅ Yes | 19,717 | Validated | Recommended: `load_text_features="lazy"` |
| **CiteseerCocitation** | ❌ No | — | Incompatible | Node count mismatch (DHG: 3,312, TAG: 3,186) |

**Note**: Text features require the full dataset (`remove_isolated_nodes=False`) for perfect alignment with the TAG dataset. When isolated nodes are removed, text features are automatically filtered to match the connected nodes.

### Caching Behavior

Text features are cached locally to avoid repeated downloads and encoding:

```
~/.pyg-hyper/data/cocitation-cora/
└── multimodal/
    ├── tag_raw_texts.pt              # Cached from HuggingFace
    └── text_embeddings/
        ├── all_mpnet_base_v2_title.pt      # Title embeddings (768-dim)
        ├── all_mpnet_base_v2_abstract.pt   # Abstract embeddings (768-dim)
        ├── all_MiniLM_L6_v2_title.pt       # Custom encoder (384-dim)
        └── all_MiniLM_L6_v2_abstract.pt    # Custom encoder (384-dim)
```

First load downloads and encodes (30-60s for Cora, 5-10min for Pubmed). Subsequent loads are instant. Title and abstract are encoded and cached separately for maximum flexibility.

### Force Re-download/Re-encode

```python
dataset = CoraCocitation(
    load_text_features=True,
    force_reload=True  # Re-download TAG data and re-encode
)
```

## 📚 Available Datasets

### Citation Networks (Co-citation)
Hyperedges connect papers that are cited together. **Source: DHG-Bench**

| Dataset | Nodes (default) | Nodes (full) | Hyperedges | Features | Classes | Description |
|---------|----------------|--------------|------------|----------|---------|-------------|
| CoraCocitation | 1,434 | 2,708 | 1,579 | 1,433 | 7 | Machine learning papers |
| CiteseerCocitation | 1,458 | 3,312 | 1,079 | 3,703 | 6 | Computer science papers |
| PubmedCocitation | 3,840 | 19,717 | 7,963 | 500 | 3 | Biomedical papers |

**Note**: Default mode (`remove_isolated_nodes=True`) shows connected nodes only (TriCL-compatible). Full mode (`remove_isolated_nodes=False`) preserves all nodes including isolated ones for multimodal feature collection.

### Authorship Networks (Coauthorship)
Hyperedges connect papers sharing common authors. **Source: DHG-Bench**

| Dataset | Nodes (default) | Nodes (full) | Hyperedges | Features | Classes |
|---------|----------------|--------------|------------|----------|---------|
| CoraCoauthorship | 2,388 | 2,708 | 1,072 | 1,433 | 7 |
| DBLPCoauthorship | 41,302 | 41,302* | 22,363 | 1,425 | 6 |

*DBLP source data has isolated nodes already removed.

### 3D Shape Recognition
Hyperedges capture spatial relationships between shape vertices. **Source: TriCL**

| Dataset | Nodes | Hyperedges | Features | Classes |
|---------|-------|------------|----------|---------|
| NTU2012 | 2,012 | 2,012 | 100 | 67 |
| ModelNet40 | 12,311 | 12,311 | 100 | 40 |

### UCI Datasets
Standard machine learning benchmarks as hypergraphs. **Source: TriCL**

| Dataset | Nodes | Hyperedges | Features | Classes |
|---------|-------|------------|----------|---------|
| Mushroom | 8,124 | 298 | 22 | 2 |
| Zoo | 101 | 43 | 16 | 7 |

### Text Classification
Document networks with word co-occurrence hyperedges. **Source: TriCL**

| Dataset | Nodes | Hyperedges | Features | Classes |
|---------|-------|------------|----------|---------|
| News20 | 16,242 | 100 | 100 | 4 |

## 🏗️ Architecture

### Data Structures

**HyperData** - Compact hyperedge_index format:
```python
HyperData(
    x=[N, F],                    # Node features
    hyperedge_index=[2, E],      # COO format: [node_ids, edge_ids]
    y=[N],                       # Node labels
    num_nodes=N,
    num_hyperedges=M
)
```
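
For example, a toy hypergraph with four nodes and two hyperedges looks like this in the compact format. This is a sketch: the constructor call assumes `HyperData` accepts these keyword tensors (consistent with the printout above), and the import path is inferred from the project structure below.

```python
import torch

from pyg_hyper_data.data import HyperData  # assumed import path

# Toy hypergraph: 4 nodes, 2 hyperedges
#   e0 = {v0, v1, v2},   e1 = {v2, v3}
hyperedge_index = torch.tensor([
    [0, 1, 2, 2, 3],   # row 0: node ids
    [0, 0, 0, 1, 1],   # row 1: hyperedge ids
])

data = HyperData(
    x=torch.randn(4, 16),             # node features [N, F]
    hyperedge_index=hyperedge_index,  # [2, E] incidences
    y=torch.tensor([0, 1, 0, 1]),     # node labels [N]
    num_nodes=4,
    num_hyperedges=2,
)
```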

**HyperHeteroData** - Universal Bipartite Object (UBO):
```python
HyperHeteroData(
    node={ x=[N, F], y=[N] },
    hyperedge={ num_nodes=M },
    (node, member_of, hyperedge)={ edge_index=[2, E] },
    (hyperedge, has_member, node)={ edge_index=[2, E] }
)
```
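
Continuing from the conversion example in the Quick Start, the bipartite incidence can be read back like any typed edge store, assuming `HyperHeteroData` follows PyG's standard `HeteroData` indexing conventions:

```python
# Hedged sketch: standard HeteroData-style access by edge type.
edge_index = hetero_data['node', 'member_of', 'hyperedge'].edge_index  # [2, E]
node_ids, hyperedge_ids = edge_index
```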

### Splitting Strategies

| Strategy | Use Case | Leakage Prevention | Output |
|----------|----------|-------------------|--------|
| `random_hypergraph_split` | Transductive node classification | Masks only | Masks |
| `stratified_hypergraph_split` | Imbalanced classes | Stratified masks | Masks |
| `inductive_hypergraph_split` | Inductive learning | Complete separation | 3 subgraphs |
| `hyperedge_prediction_split` | Link prediction | Hidden test edges | Positive + Negative samples |
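
The two mask-based strategies are not shown in the Quick Start; a hedged sketch, assuming they mirror the `inductive_hypergraph_split` signature and return boolean node masks as the table indicates:

```python
from pyg_hyper_data.transforms import stratified_hypergraph_split

# Assumed signature, mirroring inductive_hypergraph_split above.
train_mask, val_mask, test_mask = stratified_hypergraph_split(
    data, train_ratio=0.6, val_ratio=0.2, test_ratio=0.2, seed=42
)

# Transductive training: one graph, masked loss and metrics
# loss = criterion(model(data)[train_mask], data.y[train_mask])
```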

### Sampling Strategies

**Problem**: Naive sampling causes "Monster Hyperedge" explosion
- One popular hyperedge → connects 10,000 nodes
- Standard NeighborLoader → loads entire hyperedge
- Memory overflow or extreme slowdown

**Solution**: HypergraphNeighborLoader
```python
# Separate control for explosion prevention
HypergraphNeighborLoader(
    data,
    num_neighbors_nodes=10,   # V→E: How many hyperedges to sample
    num_neighbors_edges=20,   # E→V: How many nodes per hyperedge (CRITICAL!)
    ...
)
```

This ensures bounded computation: O(num_neighbors_nodes × num_neighbors_edges) per hop.
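
As a back-of-envelope check of that bound for the configuration above:

```python
# Worst case per seed node: each hop multiplies the frontier by at most
# num_neighbors_nodes * num_neighbors_edges, regardless of hyperedge size.
num_neighbors_nodes, num_neighbors_edges, num_hops = 10, 20, 2
per_seed = (num_neighbors_nodes * num_neighbors_edges) ** num_hops
print(per_seed)  # 40,000 nodes per seed after 2 hops, even with "monster" hyperedges
```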

## 🔬 Research Background

This library implements techniques from cutting-edge hypergraph learning research:

### Structural Leak Prevention
**Problem**: Test set information leaking through graph structure during training.

**Solutions Implemented**:
- **Inductive Split**: Complete graph separation (GraphSAINT, DataSAIL)
- **Conservative Assignment**: Test nodes → test hyperedges only
- **Normalization Masking**: Degree matrices computed on train set only

**References**:
- Zeng et al. "GraphSAINT: Graph Sampling Based Inductive Learning" (ICLR 2020)
- Rupprecht et al. "DataSAIL: Multi-faceted strategies for information leakage prevention" (2024)
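
To make the normalization-masking idea concrete, a minimal sketch (illustrative, not library code) of counting node degrees over training hyperedges only:

```python
import torch

def train_only_degrees(
    hyperedge_index: torch.Tensor,  # [2, E] incidences: [node_ids, edge_ids]
    train_edge_mask: torch.Tensor,  # [M] bool, True for training hyperedges
    num_nodes: int,
) -> torch.Tensor:
    """Node degrees counted over training hyperedges only (illustrative)."""
    node_ids, edge_ids = hyperedge_index
    keep = train_edge_mask[edge_ids]  # drop incidences of val/test hyperedges
    return torch.bincount(node_ids[keep], minlength=num_nodes).float()
```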

### Hierarchical Sampling
**Problem**: Power-law hyperedge size distribution causes neighborhood explosion.

**Solutions Implemented**:
- **Layer-wise Sampling**: Fixed budget per layer (FastHGNN)
- **Bipartite Sampling**: Independent V→E and E→V control
- **Cardinality Capping**: Maximum nodes per hyperedge

**References**:
- Dong et al. "FastHGNN: Fast Hypergraph Neural Networks" (2020)
- Hamilton et al. "Inductive Representation Learning on Large Graphs" (NeurIPS 2017)

### Universal Bipartite Object (UBO)
**Problem**: Standard hyperedge_index format lacks first-class hyperedge support.

**Solutions Implemented**:
- **HyperHeteroData**: Hyperedges as explicit nodes
- **Dual Edge Types**: Bidirectional message passing
- **Multimodal Attributes**: Separate features per entity type

**References**:
- Chien et al. "You are AllSet: A Multiset Function Framework for Hypergraph Neural Networks" (ICLR 2022)
- Research Report 2: "PyG limitations and UBO architecture"

### Dataset Integration: DHG-Bench
**Problem**: Original node IDs needed for multimodal feature collection (e.g., collecting paper abstracts from APIs).

**Solutions Implemented**:
- **DHG-Bench Integration**: Full datasets with isolated nodes preserved
- **Flexible Node Handling**: `remove_isolated_nodes` parameter for backward compatibility
- **Node ID Mapping**: Preserved mapping for multimodal feature linkage
- **Dual Mode Support**: TriCL-compatible (connected only) or full DHG-Bench (all nodes)

**References**:
- DHG-Bench: "A Comprehensive Benchmark for Deep Hypergraph Learning" (ICLR 2026)
- GitHub: https://github.com/Coco-Hut/DHG-Bench

## 🧪 Development

### Setup Development Environment
```bash
# Clone repository
git clone https://github.com/nishide-dev/pyg-hyper-data.git
cd pyg-hyper-data

# Install with all development dependencies and optional features
uv sync --all-extras --group dev
```

### Run Tests
```bash
# Run all tests
uv run pytest tests/ -v

# Run with coverage
uv run pytest tests/ --cov=pyg_hyper_data --cov-report=html

# Run specific test file
uv run pytest tests/test_datasets/test_cocitation.py -v
```

### Code Quality

**Automatic Quality Checks (Pre-commit)**

This project uses pre-commit hooks to automatically run code quality checks before each commit:

```bash
# Install pre-commit hooks (one-time setup)
uv run pre-commit install

# Run manually on all files
uv run pre-commit run --all-files

# Run on staged files only (happens automatically on commit)
git commit -m "your message"
```

The pre-commit hooks will automatically:
- ✨ Format code with `ruff format`
- 🔍 Lint code with `ruff check --fix`
- 🔒 Type check with `ty check`

**Manual Quality Checks**

```bash
# Format code
uv run ruff format .

# Lint code
uv run ruff check --fix

# Type check
uv run ty check

# Run all checks at once
uv run ruff format . && uv run ruff check && uv run ty check
```

### Project Structure
```
pyg-hyper-data/
├── src/pyg_hyper_data/
│   ├── data/                 # Data structures
│   │   ├── hyper_data.py
│   │   ├── hyper_hetero_data.py
│   │   └── conversion.py
│   ├── datasets/             # Dataset implementations
│   │   ├── base.py
│   │   ├── cocitation.py
│   │   ├── coauthorship.py
│   │   ├── shape.py
│   │   ├── uci.py
│   │   └── text.py
│   ├── transforms/           # Data transforms
│   │   ├── split.py
│   │   └── attributed.py
│   ├── loader/               # Sampling strategies
│   │   └── neighbor.py
│   └── utils/                # Utilities
│       ├── stats.py
│       ├── io.py
│       └── download.py
├── tests/                    # Test suite (140 tests)
├── docs/                     # Documentation
│   └── DESIGN.md
└── examples/                 # Usage examples
```

## 📖 Documentation

- **[Design Document](docs/DESIGN.md)**: Architecture decisions and roadmap
- **[Research Reports](_tmp/deepresearch/)**: Detailed technical analysis

## 🤝 Contributing

We welcome contributions! Please see our contributing guidelines:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Run tests and quality checks (`uv run pytest && uv run ruff check`)
5. Commit your changes (`git commit -m 'Add amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request

### Code Standards
- **Type hints**: All functions must have type annotations
- **Docstrings**: Google-style docstrings for all public APIs
- **Tests**: Maintain >90% code coverage
- **Formatting**: Use `ruff format`
- **Linting**: Pass `ruff check --fix`

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

---

**Built with ❤️ for the hypergraph learning community**
