Metadata-Version: 2.4
Name: hdbscan-rs
Version: 0.3.0
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Dist: numpy>=1.20
Summary: High-performance HDBSCAN clustering, compatible with scikit-learn
Keywords: clustering,hdbscan,machine-learning,density-based
License-Expression: MIT OR Apache-2.0
Requires-Python: >=3.12
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Repository, https://github.com/JasonLovesDoggo/hdbscan-rs

# hdbscan-rs

High-performance HDBSCAN clustering for Python, powered by a Rust core. Drop-in compatible with scikit-learn's HDBSCAN API, and significantly faster on small and large datasets alike (see Performance).

## Installation

```sh
pip install hdbscan-rs
```

Requires Python >= 3.12 and NumPy >= 1.20. Pre-built wheels are available for Linux, macOS, and Windows.

## Quick start

```python
import numpy as np
from hdbscan_rs import HDBSCAN

data = np.random.randn(10000, 2)

clusterer = HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(data)

print(f"Found {labels.max() + 1} clusters, {(labels == -1).sum()} noise points")
```

## API

```python
HDBSCAN(
    min_cluster_size=5,       # Smallest group that counts as a cluster
    min_samples=None,         # Controls density estimate (default: min_cluster_size)
    metric="euclidean",       # "euclidean", "manhattan", "cosine", "minkowski", "precomputed"
    p=None,                   # Minkowski p parameter
    alpha=1.0,                # Mutual reachability scaling factor
    cluster_selection_epsilon=0.0,  # Merge clusters below this distance
    cluster_selection_method="eom", # "eom" (Excess of Mass) or "leaf"
    allow_single_cluster=False,
)
```
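
To make `min_samples` and `alpha` more concrete, here is a minimal NumPy sketch of the mutual reachability distance as it is usually defined for HDBSCAN. It is not taken from the hdbscan-rs source: the helper name `mutual_reachability`, the convention that a point counts as its own first neighbour, and dividing distances by `alpha` are assumptions borrowed from common reference implementations.

```python
import numpy as np


def mutual_reachability(dist, min_samples=5, alpha=1.0):
    """Return (core_distances, mutual_reachability_matrix) for a distance matrix.

    Hypothetical helper for illustration only; hdbscan-rs may differ in details.
    """
    dist = np.asarray(dist, dtype=np.float64)
    if dist.ndim != 2 or dist.shape[0] != dist.shape[1]:
        raise ValueError("dist must be a square 2D matrix")
    if min_samples < 1:
        raise ValueError("min_samples must be at least 1")

    n = dist.shape[0]
    # Assumption: the point itself is neighbour number 1, so the
    # min_samples-th neighbour sits at index min_samples - 1 of each sorted row.
    k = min(min_samples, n) - 1
    core = np.partition(dist, k, axis=1)[:, k]

    # Assumption: alpha rescales the raw distances, not the core distances.
    scaled = dist / alpha

    # d_mreach(a, b) = max(core(a), core(b), d(a, b) / alpha).
    # Diagonal entries end up equal to each point's core distance.
    mreach = np.maximum(scaled, np.maximum(core[:, None], core[None, :]))
    return core, mreach


example = np.array([[0, 1, 5], [1, 0, 3], [5, 3, 0]], dtype=np.float64)
core, mreach = mutual_reachability(example, min_samples=2)
```

Under these assumptions, a larger `min_samples` raises core distances and so pushes sparse points further from everything else, while `alpha` greater than 1 shrinks raw distances relative to core distances.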

### Methods

- **`fit_predict(X)`** -- Fit and return cluster labels (numpy array, -1 = noise)
- **`fit(X)`** -- Fit the model without returning labels
- **`approximate_predict(X)`** -- Predict labels for new points (returns labels, probabilities)

### Properties (after fitting)

- **`labels_`** -- Cluster labels (-1 = noise)
- **`probabilities_`** -- Cluster membership strength for each point, in [0, 1]
- **`outlier_scores_`** -- GLOSH outlier score for each point, in [0, 1] (higher means more outlier-like)
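
As an illustration of how these arrays fit together, the sketch below summarizes a fitted result using only NumPy. The helper `summarize_clusters` and the `min_probability` threshold of 0.5 are hypothetical and not part of hdbscan-rs; the function only assumes that `labels` uses -1 for noise and that `probabilities` is aligned with it.

```python
import numpy as np


def summarize_clusters(labels, probabilities, min_probability=0.5):
    """Summarize cluster sizes and membership strength.

    Hypothetical helper: takes arrays shaped like `labels_` and
    `probabilities_`, returns (per_cluster_summary, noise_count).
    """
    labels = np.asarray(labels)
    probabilities = np.asarray(probabilities, dtype=np.float64)
    if labels.shape != probabilities.shape:
        raise ValueError("labels and probabilities must have the same shape")

    summary = {}
    # Labels >= 0 are clusters; -1 marks noise and is counted separately.
    for cluster_id in np.unique(labels[labels >= 0]):
        members = labels == cluster_id
        confident = members & (probabilities >= min_probability)
        summary[int(cluster_id)] = {
            "size": int(members.sum()),
            "confident": int(confident.sum()),
            "mean_probability": float(probabilities[members].mean()),
        }

    noise_count = int((labels == -1).sum())
    return summary, noise_count


# Hand-made arrays standing in for clusterer.labels_ and clusterer.probabilities_.
demo_labels = np.array([0, 0, 1, -1, 1, 1])
demo_probabilities = np.array([1.0, 0.4, 0.9, 0.0, 0.6, 0.3])
demo_summary, demo_noise = summarize_clusters(demo_labels, demo_probabilities)
```

With a fitted model, the same call would be `summarize_clusters(clusterer.labels_, clusterer.probabilities_)`.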

## Performance

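Best-of-3 wall time on a 4-core AMD EPYC. Data is generated with `make_blobs` (5 centers) and clustered with `min_cluster_size=10`. Each configuration is written as samples x dimensions, so `1Kx2D` is 1,000 points in 2 dimensions. Numbers are from the native Rust core; the Python binding adds less than 5 ms of overhead for data conversion.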

| Config | sklearn HDBSCAN | hdbscan (C) | hdbscan-rs | vs sklearn | vs C |
|--------|----------------:|------------:|-----------:|-----------:|-----:|
| 1Kx2D | 8.9 ms | 12.7 ms | **2.6 ms** | 3.4x | 4.9x |
| 5Kx2D | 128 ms | 80.2 ms | **10.6 ms** | 12.1x | 7.6x |
| 10Kx2D | 455 ms | 189 ms | **18.4 ms** | 24.7x | 10.3x |
| 50Kx2D | 12,812 ms | 1,024 ms | **124 ms** | 103x | 8.2x |
| 5Kx10D | 241 ms | 136 ms | **62 ms** | 3.9x | 2.2x |
| 1Kx256D | 246 ms | 230 ms | **19 ms** | 12.6x | 11.8x |
| 500x1536D | 424 ms | 444 ms | **28 ms** | 14.9x | 15.7x |

Memory usage is 5-60x lower than that of Python-based implementations.
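
For readers who want to time their own workloads in the same best-of-N style, a minimal sketch follows. It is not the harness used to produce the table above; `best_of_ms` is a hypothetical helper built on the standard library's `time.perf_counter`.

```python
import time


def best_of_ms(fn, repeats=3):
    """Call fn() `repeats` times and return the fastest wall time in milliseconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best * 1000.0


# Stand-in workload; replace the lambda with e.g.
# lambda: HDBSCAN(min_cluster_size=10).fit_predict(data)
calls = []
elapsed_ms = best_of_ms(lambda: calls.append(sum(range(10_000))), repeats=3)
```

Taking the minimum rather than the mean reduces the influence of one-off scheduling noise, which is why best-of-N is a common choice for short benchmarks.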

## Migrating from sklearn

```python
# Before
from sklearn.cluster import HDBSCAN
clusterer = HDBSCAN(min_cluster_size=15)

# After
from hdbscan_rs import HDBSCAN
clusterer = HDBSCAN(min_cluster_size=15)
```

The API matches sklearn's interface. Input should be a 2D NumPy array of float64. Results are sklearn-compatible: the adjusted Rand index (ARI) against sklearn's labels is above 0.99 across the test suite.
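
To check agreement on your own data while migrating, you can compute the ARI between the two label arrays. The sketch below is a plain NumPy version of the standard adjusted Rand index formula; `adjusted_rand_index` is a hypothetical helper, and treating noise (-1) as an ordinary label is an assumption that may not match how the project's test suite handles noise.

```python
import numpy as np


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two 1D label arrays (1.0 = identical partitions)."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    if a.ndim != 1 or a.shape != b.shape:
        raise ValueError("both label arrays must be 1D and the same length")
    if a.size < 2:
        return 1.0  # fewer than two points: no pairs to disagree on

    # Build the contingency table of co-occurring labels.
    # Assumption: -1 (noise) is treated as just another label.
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (a_idx, b_idx), 1)

    def pairs(x):
        # Number of unordered pairs that can be drawn from x items.
        return x * (x - 1) // 2

    sum_cells = int(pairs(table).sum())
    sum_rows = int(pairs(table.sum(axis=1)).sum())
    sum_cols = int(pairs(table.sum(axis=0)).sum())
    total_pairs = pairs(a.size)

    expected = sum_rows * sum_cols / total_pairs
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (maximum - expected)


# Same partition with cluster ids swapped: ARI ignores the label names.
same = adjusted_rand_index([0, 0, 1, 1, -1], [1, 1, 0, 0, -1])
# Partitions that disagree on every pair.
different = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

If scikit-learn is installed, `sklearn.metrics.adjusted_rand_score` computes the same quantity.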

## Precomputed distances

```python
from hdbscan_rs import HDBSCAN
import numpy as np

# Symmetric pairwise distance matrix for 3 points, with a zero diagonal
dist_matrix = np.array([[0, 1, 5], [1, 0, 3], [5, 3, 0]], dtype=np.float64)

clusterer = HDBSCAN(min_cluster_size=2, metric="precomputed")
labels = clusterer.fit_predict(dist_matrix)
```
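
In practice the matrix is usually computed from data rather than typed by hand. The following sketch builds a Euclidean distance matrix with NumPy and runs some basic sanity checks before it would be passed to `metric="precomputed"`. Both helpers are hypothetical, and the checks (square, float64, symmetric, zero diagonal, non-negative) are general properties of a distance matrix rather than documented hdbscan-rs requirements.

```python
import numpy as np


def pairwise_euclidean(points):
    """Dense Euclidean distance matrix for an (n, d) array.

    Uses broadcasting, so memory grows as n * n * d; fine for small n only.
    """
    pts = np.ascontiguousarray(points, dtype=np.float64)
    if pts.ndim != 2:
        raise ValueError("points must be a 2D array of shape (n, d)")
    diff = pts[:, None, :] - pts[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def check_distance_matrix(dist, tol=1e-12):
    """Raise ValueError if dist does not look like a valid distance matrix."""
    dist = np.asarray(dist)
    if dist.ndim != 2 or dist.shape[0] != dist.shape[1]:
        raise ValueError("distance matrix must be square")
    if dist.dtype != np.float64:
        raise ValueError("distance matrix must be float64")
    if np.any(dist < 0):
        raise ValueError("distances must be non-negative")
    if np.any(np.abs(np.diag(dist)) > tol):
        raise ValueError("diagonal must be zero")
    if not np.allclose(dist, dist.T, rtol=0.0, atol=tol):
        raise ValueError("distance matrix must be symmetric")
    return dist


points = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
dist_from_points = check_distance_matrix(pairwise_euclidean(points))
# dist_from_points can then be used as in the example above:
# HDBSCAN(min_cluster_size=2, metric="precomputed").fit_predict(dist_from_points)
```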

## License

Licensed under either of Apache License, Version 2.0 or MIT License, at your option.

