Metadata-Version: 2.4
Name: blazematch
Version: 0.1.1
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Rust
Classifier: License :: OSI Approved :: MIT License
Requires-Dist: pandas>=1.5
Requires-Dist: pyarrow>=12.0
Requires-Dist: duckdb>=0.9
Requires-Dist: scikit-learn>=1.2
Requires-Dist: numpy>=1.23
Requires-Dist: matplotlib>=3.5
Requires-Dist: scipy>=1.9
Requires-Dist: faiss-cpu>=1.7 ; extra == 'ann'
Requires-Dist: pytest>=7.0 ; extra == 'dev'
Requires-Dist: pytest-cov ; extra == 'dev'
Requires-Dist: maturin>=1.7 ; extra == 'dev'
Requires-Dist: ruff ; extra == 'dev'
Requires-Dist: lightgbm>=3.3 ; extra == 'lightgbm'
Requires-Dist: altair>=5.0 ; extra == 'viz'
Requires-Dist: xgboost>=1.7 ; extra == 'xgboost'
Provides-Extra: ann
Provides-Extra: dev
Provides-Extra: lightgbm
Provides-Extra: viz
Provides-Extra: xgboost
License-File: LICENSE
Summary: Rust-accelerated record linkage at scale
Author-email: Joseph <59439026+JosephKBS@users.noreply.github.com>
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/JosephKBS/blazematch
Project-URL: Repository, https://github.com/JosephKBS/blazematch

# Blazematch

**Rust-accelerated record linkage and deduplication for Python.**

[![Python 3.9+](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)
[![Rust](https://img.shields.io/badge/rust-%23000000.svg?logo=rust&logoColor=white)](https://www.rust-lang.org/)
[![Tests](https://img.shields.io/badge/tests-153%20Python%20%2B%2022%20Rust-brightgreen.svg)](#development)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

Blazematch pairs a high-level Python API with a Rust core to make entity matching fast without sacrificing usability. Define blocking rules, pick similarity metrics, and fit a model — all in a few lines of pandas-friendly code. The heavy lifting (similarity computation, feature matrix construction, model inference) runs in Rust with Rayon parallelism and GIL release, so Python never becomes the bottleneck.

---

## Highlights

- **11 similarity metrics** computed in parallel via Rayon
- **SQL blocking** powered by DuckDB — write rules like `"l.city = r.city"`
- **ANN blocking** with TF-IDF + FAISS for fuzzy candidate generation
- **10 models** — 6 supervised + 4 unsupervised (no labels needed)
- **Deduplication mode** with connected-component clustering
- **8 visualizations** with Altair (interactive) and matplotlib backends
- **Streaming support** for datasets larger than RAM
- **RAM estimation** before you run the pipeline

## Installation

```bash
pip install blazematch
```

Build from source (requires [Rust](https://rustup.rs/) and [maturin](https://github.com/PyO3/maturin)):

```bash
git clone https://github.com/JosephKBS/blazematch.git
cd blazematch
pip install maturin
maturin develop --release
```

### Optional extras

```bash
pip install "blazematch[ann]"       # FAISS-based embedding blocking
pip install "blazematch[viz]"       # Altair interactive charts
pip install "blazematch[xgboost]"   # XGBoost model
pip install "blazematch[lightgbm]"  # LightGBM model
```

---

## Quick Start

### Record Linkage

```python
import pandas as pd
from blazematch import Linker, LinkConfig, BlockingRule, FieldComparison, Metric

df_left = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones", "Carol White"],
    "city": ["New York", "Boston", "New York"],
    "dob":  ["1990-01-15", "1985-06-20", "1992-03-10"],
})

df_right = pd.DataFrame({
    "name": ["A. Smith", "Robert Jones", "Carol Whyte"],
    "city": ["New York", "Boston", "New York"],
    "dob":  ["1990-01-15", "1985-06-20", "1992-03-10"],
})

config = LinkConfig(
    blocking_rules=[BlockingRule("l.city = r.city")],
    comparisons=[
        FieldComparison("name", "name", Metric.JARO_WINKLER),
        FieldComparison("dob",  "dob",  Metric.DATE_DISTANCE, date_format="%Y-%m-%d"),
    ],
)

linker = Linker(df_left, df_right, config)
linker.block().compute_features().estimate()
results = linker.predict(threshold=0.8)
print(results)
#    left_idx  right_idx     score
# 0         0          0  0.923456
# 1         2          2  0.891234
```

### Deduplication

```python
from blazematch import Deduplicator

# df: a single DataFrame whose rows may contain duplicates
dedup = Deduplicator(df, config)
dedup.block().compute_features().estimate()
clusters = dedup.cluster(threshold=0.8)
# clusters: DataFrame with [record_idx, cluster_id]
```
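The clustering step above works on connected components: pairs scoring over the threshold become graph edges, and each connected component becomes one cluster. A minimal union-find sketch of that idea (an illustration, not blazematch's implementation):

```python
def cluster_pairs(pairs, n_records):
    """Group records into clusters via union-find over matched pairs."""
    parent = list(range(n_records))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in pairs:
        union(a, b)

    # Map each component root to a dense cluster id.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n_records)]

# Records 0-1 and 1-2 matched, so {0, 1, 2} is one cluster; 3 stands alone.
print(cluster_pairs([(0, 1), (1, 2)], 4))  # -> [0, 0, 0, 1]
```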

### Supervised Training

When you have labelled pairs, swap `.estimate()` for `.fit()`:

```python
labels = pd.DataFrame({
    "left_idx":  [0, 1, 2],
    "right_idx": [0, 1, 2],
    "label":     [1, 0, 1],
})

linker.block().compute_features().fit(labels, model="random_forest")
results = linker.predict(threshold=0.5)
```

Available models: `"logistic"`, `"random_forest"`, `"gradient_boosting"`, `"svm"`, `"xgboost"`, `"lightgbm"`.

### Preprocessing

Apply text cleaning before comparison:

```python
FieldComparison(
    "name", "name", Metric.JARO_WINKLER,
    preprocess=["lowercase", "strip_whitespace", "remove_punctuation"],
)
```

Available steps: `lowercase`, `strip_whitespace`, `remove_punctuation`, `normalize_unicode`.
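The steps above can be sketched in plain Python. This is an illustrative implementation, not blazematch's own (inside the library these run over whole pandas Series), assuming the steps apply left to right:

```python
import string
import unicodedata

STEPS = {
    "lowercase": str.lower,
    "strip_whitespace": str.strip,
    "remove_punctuation": lambda s: s.translate(
        str.maketrans("", "", string.punctuation)
    ),
    "normalize_unicode": lambda s: unicodedata.normalize("NFKD", s),
}

def apply_chain(value, steps):
    """Apply named preprocessing steps in order."""
    for step in steps:
        value = STEPS[step](value)
    return value

print(apply_chain("  O'Brien, JR.  ",
                  ["lowercase", "strip_whitespace", "remove_punctuation"]))
# -> "obrien jr"
```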

### Embedding Blocking (ANN)

For fuzzy blocking on free-text fields without exact-match rules:

```python
from blazematch import EmbeddingBlockingConfig

config = LinkConfig(
    blocking_rules=[BlockingRule("l.city = r.city")],
    comparisons=[...],
    embedding_blocking=EmbeddingBlockingConfig(
        fields=["name"],
        top_k=10,
        min_sim=0.3,
    ),
)
```

Requires `pip install "blazematch[ann]"`.

### Save and Load Models

```python
linker.model.save("my_model.pkl")

# Later...
from blazematch import MatchModel
model = MatchModel.load("my_model.pkl")
```

### RAM Estimation

Estimate memory requirements before running the pipeline:

```python
linker.block()
estimate = linker.estimate_ram(model="random_forest")
print(estimate.summary())
# Peak memory: 1.2 GB | System RAM: 16.0 GB | OK to proceed
```
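As a back-of-envelope model (an illustration, not the estimator's actual formula), the dominant term is usually the dense `f64` feature matrix, at 8 bytes per candidate pair per comparison:

```python
def feature_matrix_bytes(n_pairs: int, n_features: int) -> int:
    """Rough size of a dense f64 feature matrix: 8 bytes per cell."""
    return n_pairs * n_features * 8

# 250K candidate pairs x 3 comparisons -> ~6 MB for the matrix alone.
print(feature_matrix_bytes(250_000, 3) / 1e6)  # -> 6.0 (MB)
```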

### Visualization

```python
from blazematch import plot_score_distribution, plot_waterfall, plot_roc

plot_score_distribution(results, threshold=0.8)
plot_waterfall(linker, pair_idx=0)
plot_roc(results, labels)
```

All plot functions accept `backend="altair"` (default) or `backend="matplotlib"`.

| Function | Purpose |
|----------|---------|
| `plot_score_distribution` | Histogram of match scores with optional threshold line |
| `plot_waterfall` | Per-feature contribution waterfall for a single pair |
| `plot_comparison_heatmap` | Mean feature values for matches vs. non-matches |
| `plot_precision_recall` | Precision-recall curve with average precision |
| `plot_roc` | ROC curve with AUC |
| `plot_threshold_analysis` | Match count across threshold values |
| `plot_match_weights` | Per-feature model weights or importances |
| `plot_comparison_viewer` | Sampled pair comparisons with field-level detail |

---

## Similarity Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `JARO_WINKLER` | String | Edit-distance variant favoring common prefixes |
| `LEVENSHTEIN` | String | Normalized edit distance (0-1) |
| `DAMERAU_LEVENSHTEIN` | String | Levenshtein + transpositions |
| `EXACT` | String | Binary 0/1 equality |
| `SOUNDEX` | String | Phonetic encoding match |
| `JACCARD` | String | Token-set intersection over union |
| `COSINE` | String | Token-level cosine similarity |
| `TOKEN_SORT_RATIO` | String | Order-invariant fuzzy match (sorted tokens + Levenshtein) |
| `NUMERIC` | Numeric | Absolute distance |
| `NUMERIC_SIMILARITY` | Numeric | Scaled proximity (0-1) |
| `DATE_DISTANCE` | Numeric | Absolute days between dates |

All string metrics are computed in parallel via Rayon. Numeric metrics operate on `f64` slices passed zero-copy from numpy.
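For orientation, the token-set metrics reduce to simple set operations. A pure-Python sketch of the `JACCARD` definition from the table, assuming whitespace tokenization (the Rust core computes the same quantity, just in parallel):

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard: |A ∩ B| / |A ∪ B| over whitespace tokens."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0  # both empty: treat as identical
    return len(ta & tb) / len(ta | tb)

print(jaccard("alice m smith", "alice smith"))  # 2 shared tokens of 3 -> 0.666...
```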

---

## Models

### Supervised (require labelled pairs)

| Model | Key | Training | Inference |
|-------|-----|----------|-----------|
| Logistic Regression | `"logistic"` | scikit-learn | Rust (ndarray GEMV + sigmoid) |
| Random Forest | `"random_forest"` | scikit-learn | Rust (parallel tree traversal) |
| Gradient Boosting | `"gradient_boosting"` | scikit-learn | Rust (parallel tree traversal + sigmoid) |
| SVM (RBF) | `"svm"` | scikit-learn | Rust (parallel RBF kernel + Platt scaling) |
| XGBoost | `"xgboost"` | xgboost | Native C++ |
| LightGBM | `"lightgbm"` | lightgbm | Native C++ |

### Unsupervised (no labels needed)

| Model | Key | Algorithm |
|-------|-----|-----------|
| Fellegi-Sunter | `"fellegi_sunter"` | EM-based probabilistic matching with m/u weights |
| K-Means | `"kmeans"` | k=2 clustering with optional silhouette auto-tuning |
| GMM | `"gmm"` | Gaussian mixture with BIC-based component selection |
| DBSCAN | `"dbscan"` | Density-based clustering with auto eps estimation |

---

## Benchmarks

Pipeline timings at various scales (Apple Silicon macOS; single postcode blocking rule, 3 comparisons):

| Records per side | Candidate pairs | Blocking | Features (Rust) | Inference | Full pipeline |
|------------------|-----------------|----------|-----------------|-----------|---------------|
| 1,000 | ~5K | 15 ms | 3 ms | <1 ms | 40 ms |
| 5,000 | ~25K | 30 ms | 15 ms | <1 ms | 120 ms |
| 10,000 | ~50K | 50 ms | 35 ms | <1 ms | 250 ms |
| 50,000 | ~250K | 149 ms | 162 ms | 3 ms | 820 ms |

```bash
python benchmarks/bench_pipeline.py
python benchmarks/bench_pipeline.py --scales 10000 100000
```

---

## Architecture

```
Python API                          Rust Core (PyO3 + Rayon)
-----------------------------------+--------------------------------------
Linker / Deduplicator               similarity.rs    11 parallel metrics
  |-- RuleBlocker (DuckDB SQL)      features.rs      single-call feature matrix
  |-- EmbeddingBlocker (FAISS)      inference.rs     logistic regression (GEMV)
  |-- FeatureComputer ------------> tree_inference.rs RF / GBDT (flat numpy arrays)
  |-- MatchModel / RF / GBDT -----> svm_inference.rs  SVM RBF + Platt scaling
  |-- SVMModel ----------------->
  |-- XGBoostModel (native C++)     All batch functions release the GIL
  |-- LightGBMModel (native C++)    and use Rayon for CPU parallelism.
  |-- FellegiSunterModel            Numeric data passed as zero-copy numpy.
  |-- KMeans / GMM / DBSCAN         Tree models use flat numpy arrays with
  +-- Visualize (Altair / mpl)      offsets to avoid per-call serialization.
```

**Design principles:**

- **Single Rust call** for all feature computation — one Python-to-Rust crossing per batch, not per-metric
- **Zero-copy numeric data** — numpy arrays passed directly to Rust via PyO3, no `.tolist()` conversion
- **Flat tree serialization** — sklearn tree structures are flattened into contiguous numpy arrays with offset indices, eliminating repeated list-to-Vec conversion on every predict call
- **GIL-free parallelism** — every Rust batch function calls `py.allow_threads()` before Rayon work
- **Streaming support** — `block_iter()`, `compute_chunked()`, and `predict_iter()` for datasets that don't fit in RAM
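The flat-tree layout can be sketched with plain numpy (an illustrative sketch of the layout, not blazematch's actual serializer): per-tree node arrays are concatenated once, and an offsets array records where each tree starts, so prediction hands Rust a few contiguous buffers instead of nested Python objects:

```python
import numpy as np

def flatten_trees(trees):
    """Concatenate per-tree node arrays; tree i occupies
    feature[offsets[i]:offsets[i + 1]] in the flat buffers."""
    offsets = np.cumsum([0] + [len(t["feature"]) for t in trees])
    feature = np.concatenate([t["feature"] for t in trees])
    threshold = np.concatenate([t["threshold"] for t in trees])
    return offsets, feature, threshold

# Two toy trees with 3 and 5 nodes (-1 marks a leaf).
trees = [
    {"feature": np.array([0, -1, -1]), "threshold": np.array([0.5, 0.0, 0.0])},
    {"feature": np.array([1, 0, -1, -1, -1]), "threshold": np.zeros(5)},
]
offsets, feature, threshold = flatten_trees(trees)
# offsets -> [0, 3, 8]: one Python-to-Rust crossing, three flat arrays.
```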

---

## API Reference

### Core

| Class | Description |
|-------|-------------|
| `Linker(df_left, df_right, config)` | Main record linkage pipeline |
| `Deduplicator(df, config)` | Self-join deduplication with clustering |
| `LinkConfig(...)` | Pipeline configuration (blocking rules, comparisons, options) |
| `FieldComparison(left, right, metric, ...)` | A single field comparison definition |
| `BlockingRule(sql)` | SQL join condition for candidate generation |
| `Metric` | Enum of available similarity metrics |

### Blocking

| Class | Description |
|-------|-------------|
| `RuleBlocker` | DuckDB-powered SQL blocking |
| `EmbeddingBlocker` | TF-IDF + FAISS approximate nearest neighbor blocking |
| `EmbeddingBlockingConfig(fields, top_k, min_sim, ...)` | ANN blocking configuration |

### Models

| Class | Description |
|-------|-------------|
| `MatchModel` | Logistic regression (Rust inference) |
| `RandomForestModel` | Random forest (Rust inference) |
| `GradientBoostingModel` | Gradient boosting (Rust inference) |
| `SVMModel` | SVM with RBF kernel (Rust inference) |
| `XGBoostModel` | XGBoost wrapper (native C++ inference) |
| `LightGBMModel` | LightGBM wrapper (native C++ inference) |
| `FellegiSunterModel` | Unsupervised EM probabilistic matching |
| `KMeansModel` | K-Means clustering with optional auto-tuning |
| `GMMModel` | Gaussian mixture with BIC auto-tuning |
| `DBSCANModel` | DBSCAN with auto eps estimation |

### Utilities

| Function / Class | Description |
|------------------|-------------|
| `estimate_ram(...)` | Estimate memory requirements for a pipeline configuration |
| `RAMEstimate` | Result object with `summary()`, `can_proceed`, `peak_bytes` |
| `apply_preprocess(series, steps)` | Apply preprocessing chain to a pandas Series |

---

## Development

```bash
# Setup
git clone https://github.com/JosephKBS/blazematch.git
cd blazematch
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
maturin develop --release

# Tests
pytest tests/ -v             # 153 Python tests
cargo test                   # 22 Rust unit tests

# Benchmarks
cargo bench                  # Rust micro-benchmarks (Criterion)
python benchmarks/bench_pipeline.py

# Lint
ruff check python/
```

### Project Structure

```
blazematch/
  src/                       # Rust source (PyO3 extension module)
    lib.rs                   # Module registration
    similarity.rs            # 11 parallel similarity metrics
    features.rs              # Single-call feature matrix computation
    inference.rs             # Logistic regression batch prediction
    tree_inference.rs        # Random forest / GBDT batch prediction
    svm_inference.rs         # SVM RBF kernel batch prediction
    utils.rs                 # Shared utilities (sigmoid)
  python/blazematch/         # Python source
    linker.py                # Main pipeline orchestrator
    dedup.py                 # Deduplication wrapper
    blocking.py              # Rule-based and ANN blocking
    features.py              # Feature computation bridge to Rust
    model.py                 # Supervised model classes
    fellegi_sunter.py        # Unsupervised EM model
    clustering.py            # K-Means, GMM, DBSCAN models
    config.py                # Configuration dataclasses
    preprocess.py            # Text preprocessing pipeline
    visualize.py             # 8 chart functions
    estimate_ram.py          # Memory estimation
  tests/                     # pytest suite
  benchmarks/                # Python pipeline benchmarks (bench_pipeline.py)
  benches/                   # Rust micro-benchmarks (Criterion)
```

## Requirements

- Python >= 3.9
- Rust toolchain (for building from source)
- Core: pandas, numpy, pyarrow, duckdb, scikit-learn, scipy, matplotlib

## License

MIT. See [LICENSE](LICENSE) for details.

