Metadata-Version: 2.2
Name: submine
Version: 0.1.3
Summary: Modular subgraph mining library with unified API
Keywords: graph-mining,subgraph-mining,gspan,frequent-subgraph-mining
Author: Ridwan Amure
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Project-URL: Homepage, https://github.com/instabaines/submine
Project-URL: Repository, https://github.com/instabaines/submine
Project-URL: Issues, https://github.com/instabaines/submine/issues
Requires-Python: >=3.9
Requires-Dist: networkx>=2.8
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-cov>=4; extra == "dev"
Requires-Dist: build>=1; extra == "dev"
Requires-Dist: twine>=5; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Description-Content-Type: text/markdown

# submine

**submine** is a research‑grade Python library for frequent subgraph mining that provides a unified, safe, and extensible interface over heterogeneous mining algorithms implemented in Python, C++, and Java.

The goal of _submine_ is to let users focus on **what** to mine rather than **how** each algorithm expects its input. Users select an algorithm and parameters; _submine_ automatically validates inputs, converts graph formats, and executes the backend in a controlled and reproducible manner.

---

## Key Features

- **Algorithm‑centric API**
  You specify the mining algorithm and parameters; _submine_ handles format adaptation and execution.

- **Direct format transcoding (no redundant rewrites)**
  Input graphs are converted directly into the native format required by the selected algorithm.

- **Multi‑format graph support**
  Edge lists, gSpan datasets, single‑graph `.lg` files, and GEXF are supported out of the box.

- **Safe and reproducible execution**
  Parameter validation, deterministic format detection, and hardened subprocess execution are enforced by default.

- **Extensible design**
  New algorithms can be added via a clean backend interface without modifying core logic.
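
As an illustration of what such a backend interface could look like, here is a minimal sketch of a registry-based plug-in pattern. Every name in it (`MiningBackend`, `register`, `get_backend`, `EchoBackend`) is hypothetical and chosen for this example; it is not taken from _submine_'s actual code.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Any

# Hypothetical registry: algorithm name -> backend class.
_BACKENDS: dict = {}


class MiningBackend(ABC):
    """Hypothetical base class; NOT submine's real interface."""

    name: str = ""           # key used to select the backend
    native_format: str = ""  # format the backend expects on disk

    @abstractmethod
    def run(self, native_input: Path, **params: Any) -> list:
        """Execute the algorithm on an input already in native format."""


def register(cls):
    """Class decorator that adds a backend to the registry."""
    _BACKENDS[cls.name] = cls
    return cls


@register
class EchoBackend(MiningBackend):
    """Toy backend that only echoes its input, to show the wiring."""

    name = "echo"
    native_format = "edgelist"

    def run(self, native_input, **params):
        return [(str(native_input), params)]


def get_backend(name: str) -> MiningBackend:
    """Look up and instantiate a backend, failing loudly on unknown names."""
    try:
        return _BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown algorithm: {name!r}") from None


result = get_backend("echo").run(Path("g.txt"), min_support=2)
```

With this kind of design, adding an algorithm means defining one new decorated class; the dispatch code in `get_backend` stays unchanged.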

---

## Supported Algorithms

### gSpan (Frequent Subgraph Mining)

- **Graph type:** Multiple graphs (transactional dataset)
- **Typical use case:** Discovering frequent substructures across many graphs
- **Backend:** C++

The gSpan backend in _submine_ is a C++ implementation adapted and extended from the widely used **gBoost / gSpan reference implementations**, with additional input validation, format handling, and Python bindings for safe integration.

### SoPaGraMi (Single‑Graph Pattern Mining)

- **Graph type:** Single large graph
- **Typical use case:** Social, biological, or information networks
- **Backend:** C++

SoPaGraMi is used for scalable subgraph mining on a single graph, where a pattern's frequency is measured by its occurrences within that one graph rather than by the number of graphs (transactions) that contain it.
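
The difference between the two notions of frequency can be seen on a toy example. The sketch below counts support for a single labeled edge pattern in two ways: transactionally (how many graphs contain it) and with minimum-image-based (MNI) support, a measure commonly used for single-graph mining. That the SoPaGraMi backend uses exactly this measure is an assumption here, and the helper names are made up for the example.

```python
def edge_pattern_images(labels, edges, la, lb):
    """Distinct vertices that can play each end of an la-lb edge."""
    a_img, b_img = set(), set()
    for u, v in edges:
        for x, y in ((u, v), (v, u)):  # undirected: try both orientations
            if labels[x] == la and labels[y] == lb:
                a_img.add(x)
                b_img.add(y)
    return a_img, b_img


def transactional_support(dataset, la, lb):
    """Number of graphs in the dataset containing the pattern at least once."""
    return sum(
        1
        for labels, edges in dataset
        if edge_pattern_images(labels, edges, la, lb)[0]
    )


def mni_support(labels, edges, la, lb):
    """Smallest number of distinct images over the pattern's two vertices."""
    a_img, b_img = edge_pattern_images(labels, edges, la, lb)
    return min(len(a_img), len(b_img))


# A star: one "A" centre joined to three "B" leaves.
star_labels = {0: "A", 1: "B", 2: "B", 3: "B"}
star_edges = [(0, 1), (0, 2), (0, 3)]

# A second graph with no A-B edge at all.
other_labels = {0: "B", 1: "B"}
other_edges = [(0, 1)]

dataset = [(star_labels, star_edges), (other_labels, other_edges)]

print(transactional_support(dataset, "A", "B"))   # 1: only the star contains it
print(mni_support(star_labels, star_edges, "A", "B"))  # 1: a single "A" vertex
```

The star contains three occurrences of the A-B edge, yet its MNI support is 1 because all of them share the same "A" vertex. This is why a `min_support` value does not carry over directly between the two settings.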

---

## Supported Input Formats

_submine_ automatically detects the input format and converts it to the format required by the chosen algorithm:

- **Edge lists**: `.txt`, `.edgelist`
- **gSpan datasets**: `.data`, `.data.x`, `.data.N`
- **SoPaGraMi graphs**: `.lg`
- **GEXF**: `.gexf`

Format detection is deterministic and does not rely on user‑supplied flags.
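
One way to make detection deterministic is to derive the format purely from the file name. The sketch below does this for the extensions listed above. It is an illustration, not _submine_'s implementation: the function name `detect_format`, the returned format labels, and the reading of `.data.N` as "`.data.` followed by an integer" are all assumptions.

```python
import re
from pathlib import Path

# Assumed mapping from final suffix to a format label.
_SUFFIX_FORMATS = {
    ".txt": "edgelist",
    ".edgelist": "edgelist",
    ".lg": "lg",
    ".gexf": "gexf",
    ".data": "gspan",
}

# Matches names such as "graphs.data.x" or "graphs.data.3".
_GSPAN_VARIANT = re.compile(r"\.data\.(x|\d+)$", re.IGNORECASE)


def detect_format(path):
    """Return a format label based only on the file name."""
    p = Path(path)
    if _GSPAN_VARIANT.search(p.name):
        return "gspan"
    fmt = _SUFFIX_FORMATS.get(p.suffix.lower())
    if fmt is None:
        # Refuse to guess: an unknown extension is an error, not a default.
        raise ValueError(f"unrecognised graph file extension: {p.name!r}")
    return fmt
```

Because the result depends only on the name, the same path always maps to the same format, and unknown extensions fail early instead of being silently treated as some default.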

---

## Installation

### Standard installation

```bash
pip install submine
```

### Development installation

From a local clone of the repository, install in editable mode with the `dev` extra:

```bash
pip install -e ".[dev]"
```

---

## Basic Usage

### gSpan example

```python
from submine.api import mine_subgraphs

results = mine_subgraphs(
    data="graphs.data",
    algorithm="gspan",
    min_support=5
)
```

**Parameters**

- `data` (str or path): Path to the input graph dataset
- `algorithm` (str): Mining algorithm (`"gspan"`, `"sopagrami"`, …)
- `min_support` (int): Minimum support threshold (algorithm‑specific semantics)
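
To try the call above without an existing dataset, a small transactional file can be generated by hand. The sketch below writes graphs in the `t # id` / `v id label` / `e from to label` text layout commonly used by gSpan implementations. Whether _submine_'s reader accepts exactly this layout (including integer labels and no trailing terminator line) is an assumption, and `to_gspan_text` is a helper invented for this example.

```python
def to_gspan_text(graphs):
    """Serialise (vertex_labels, edges) pairs into gSpan-style text.

    vertex_labels: list of integer labels, indexed by vertex id.
    edges: list of (u, v, edge_label) tuples.
    """
    lines = []
    for gid, (vertex_labels, edges) in enumerate(graphs):
        lines.append(f"t # {gid}")              # start of graph gid
        for vid, label in enumerate(vertex_labels):
            lines.append(f"v {vid} {label}")    # vertex id and label
        for u, v, label in edges:
            lines.append(f"e {u} {v} {label}")  # edge endpoints and label
    return "\n".join(lines) + "\n"


graphs = [
    ([0, 1, 1], [(0, 1, 5), (1, 2, 5)]),  # a 3-vertex path
    ([0, 1], [(0, 1, 5)]),                # a single edge
]
text = to_gspan_text(graphs)
print(text)
```

Writing `text` to a file such as `graphs.data` gives an input in the shape of the example above; with only two graphs, `min_support` would need to be lowered accordingly.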

---

### SoPaGraMi example

```python
from submine.api import mine_subgraphs

results = mine_subgraphs(
    data="citeseer.lg",
    algorithm="sopagrami",
    min_support=100,
    sorted_seeds=4,
    dump_images_csv=True,
    dump_sample_embeddings=False,
    out_dir="."
)
```

**SoPaGraMi‑specific parameters**

- `min_support` (int): Minimum frequency threshold
- `sorted_seeds` (int): Seed sorting strategy (implementation‑specific)
- `dump_images_csv` (bool): Whether to dump pattern images as CSV metadata
- `dump_sample_embeddings` (bool): Whether to dump sample embeddings (experimental)
- `out_dir` (str or path): Output directory for results (default: `./sopagrami_result`)

---

## Design Philosophy

- **No algorithm‑specific I/O burden on the user**
  Users never manually convert graph formats.

- **Minimal assumptions about graph structure**
  Directed/undirected and labeled/unlabeled graphs are handled at the backend level.

- **Research‑grade transparency**
  Backends are explicitly documented and citable.

---

## Citation

If you use **gSpan**, please cite:

```bibtex
@inproceedings{yan2002gspan,
  title={{gSpan}: Graph-based substructure pattern mining},
  author={Yan, Xifeng and Han, Jiawei},
  booktitle={Proceedings of the IEEE International Conference on Data Mining},
  pages={721--724},
  year={2002}
}
```

If you use **SoPaGraMi**, please cite:

```bibtex
@article{nguyen2020fast,
  title={Fast and scalable algorithms for mining subgraphs in a single large graph},
  author={Nguyen, Lam BQ and Vo, Bay and Le, Ngoc-Thao and Snasel, Vaclav and Zelinka, Ivan},
  journal={Engineering Applications of Artificial Intelligence},
  volume={90},
  pages={103539},
  year={2020}
}
```

To cite this library:

```bibtex
@misc{amure_submine,
  title  = {submine: A Unified Subgraph Mining Library},
  author = {Amure, Ridwan},
  year   = {2025},
  url    = {https://github.com/instabaines/submine}
}
```
