# GoldenMatch

> Zero-config entity resolution — deduplicate and match records with fuzzy, exact, probabilistic (Fellegi-Sunter), and LLM scoring. Scales from a laptop CSV to 100M+ rows on Ray; the zero-tuning probabilistic path beats hand-rolled, expert-tuned Splink head-to-head.

## Interfaces
- MCP Server: `goldenmatch mcp-serve` (16 agent tools + 24 data tools + 7 memory tools + 7 identity tools = 54 total)
- Remote MCP: https://goldenmatch-mcp-production.up.railway.app/mcp/ (54 tools, Smithery: https://smithery.ai/servers/benzsevern/goldenmatch)
- A2A Server: `goldenmatch agent-serve --port 8200` (31 skills advertised in the agent card)
- CLI: `goldenmatch dedupe`, `goldenmatch autoconfig`, `goldenmatch match`, `goldenmatch memory ...`, `goldenmatch identity ...`, + more
- Python API: `import goldenmatch` -- `dedupe_df()`, `match_df()`, `score_strings()`, `evaluate()`, `AgentSession.autoconfigure()`, `add_correction()`, `learn()`, `memory_stats()`, `get_memory()`, ~106 exports
- TypeScript / Edge: `npm install goldenmatch` -- same API in browsers, Cloudflare Workers, Vercel Edge, Deno; optional WASM via `await enableWasm()` swaps in the Rust score-core kernel (pure-TS stays the default + byte-identical fallback)
- REST API: `goldenmatch serve` on port 8000 (incl. `POST /autoconfig`, `GET /controller/telemetry`)
- SQL: Postgres extension + DuckDB UDFs at `packages/rust/extensions/` (`goldenmatch_autoconfig`, `goldenmatch_dedupe_full`, `gm_telemetry`)

## AutoConfigController telemetry (v1.7-v1.12, cross-surface)

Every interface above returns the same JSON shape from `goldenmatch.web.controller_telemetry.serialize_telemetry`: `{stop_reason, health, scoring, blocking, cluster, column_priors, decisions, committed_matchkeys, negative_evidence}`. Write one parser and reuse it across web, TUI, CLI, SQL, MCP, A2A, and REST.
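
A minimal sketch of such a shared parser, assuming only the nine top-level keys listed above. The `Telemetry` wrapper and `parse_telemetry` function are hypothetical names, not part of the GoldenMatch API, and the value types are left as `Any` because the shape of each field is not specified here.

```python
from dataclasses import dataclass
from typing import Any

# The nine top-level keys of the shared telemetry shape.
TELEMETRY_KEYS = (
    "stop_reason", "health", "scoring", "blocking", "cluster",
    "column_priors", "decisions", "committed_matchkeys", "negative_evidence",
)


@dataclass(frozen=True)
class Telemetry:
    """Hypothetical wrapper; field types are unspecified, hence Any."""
    stop_reason: Any
    health: Any
    scoring: Any
    blocking: Any
    cluster: Any
    column_priors: Any
    decisions: Any
    committed_matchkeys: Any
    negative_evidence: Any


def parse_telemetry(payload: dict) -> Telemetry:
    """Check that all expected keys are present, then wrap the payload."""
    missing = [key for key in TELEMETRY_KEYS if key not in payload]
    if missing:
        raise ValueError(f"telemetry payload missing keys: {missing}")
    # Unknown extra keys are ignored so newer payloads still parse.
    return Telemetry(**{key: payload[key] for key in TELEMETRY_KEYS})
```

Because every surface emits the same shape, the same function could be fed JSON decoded from the REST endpoint, an MCP tool result, or a SQL call.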

## Install
- `pip install goldenmatch` (native acceleration ships by default on common platforms)
- TypeScript: `npm install goldenmatch`
- Quality scanning: `pip install goldenmatch[quality]`
- Data transforms: `pip install goldenmatch[transform]`
- Embeddings: `pip install goldenmatch[embeddings]`
- Distributed (50M+): `pip install goldenmatch[ray]`

## Accuracy
- DBLP-ACM: 96.4% F1 out of the box (zero-config weighted controller path)
- Beats hand-rolled, expert-tuned Splink head-to-head: the zero-tuning probabilistic (Fellegi-Sunter) auto-config wins on every dataset Splink scores under one shared evaluator -- historical_50k F1 0.778 vs 0.757 (cluster B³ 0.844 vs 0.789), febrl3 0.991 vs 0.965, synthetic_person 0.998 vs 0.996. Bake-off: `docs/benchmarks/2026-06-09-splink-bakeoff.md`
- DQbench composite: 91.04
- PPRL: 92.4% F1 on FEBRL4

## Quick Examples

### Deduplicate a CSV (zero-config)
```python
import goldenmatch as gm
result = gm.dedupe("customers.csv")
result.golden.write_csv("deduped.csv")
print(f"{result.total_clusters} clusters, {result.match_rate:.1%} match rate")
```

### Deduplicate with explicit config
```python
import goldenmatch as gm
result = gm.dedupe("customers.csv",
    exact=["email"],
    fuzzy={"name": 0.85, "address": 0.80},
    blocking=["zip"],
)
```

### Match across two files
```python
import goldenmatch as gm
result = gm.match("file_a.csv", "file_b.csv", fuzzy={"name": 0.85})
```

### Privacy-preserving linkage (no raw data shared)
```python
import goldenmatch as gm
result = gm.pprl_link("hospital_a.csv", "hospital_b.csv",
    fields=["first_name", "last_name", "dob", "zip"])
```

### Evaluate accuracy
```python
import goldenmatch as gm
metrics = gm.evaluate("data.csv", config="config.yaml", ground_truth="gt.csv")
print(f"F1: {metrics['f1']:.1%}, Precision: {metrics['precision']:.1%}")
```
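
For orientation, pairwise precision, recall, and F1 over record pairs are commonly computed as in the sketch below. This is the generic textbook metric, not GoldenMatch's evaluator; the function name `pairwise_f1` and the pair representation are assumptions for illustration.

```python
def pairwise_f1(predicted, truth):
    """Score predicted duplicate pairs against ground-truth pairs."""
    # Normalize so (a, b) and (b, a) count as the same pair.
    pred = {frozenset(pair) for pair in predicted}
    gold = {frozenset(pair) for pair in truth}

    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy record IDs, made up for the example.
metrics = pairwise_f1(
    predicted=[(1, 2), (3, 4), (5, 6)],
    truth=[(2, 1), (3, 4), (7, 8), (9, 10)],
)
print(f"F1: {metrics['f1']:.1%}, Precision: {metrics['precision']:.1%}")
```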

## Config Template (YAML)

```yaml
matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]

  - name: fuzzy_name
    type: weighted
    threshold: 0.85
    fields:
      - field: first_name
        scorer: jaro_winkler
        weight: 0.5
        transforms: [lowercase, strip]
      - field: last_name
        scorer: jaro_winkler
        weight: 0.3
      - field: zip
        scorer: exact
        weight: 0.2

blocking:
  strategy: adaptive
  keys:
    - fields: [zip]

golden_rules:
  default_strategy: most_complete
```
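
To make the `fuzzy_name` matchkey concrete, here is a worked sketch of how a weighted score might be compared against the threshold. It assumes the `weighted` type combines per-field similarities as a weight-normalized sum, which the template does not state. `weighted_score` and the per-field similarity values are hypothetical.

```python
def weighted_score(similarities: dict, weights: dict) -> float:
    """Weight-normalized sum of per-field similarities (assumed formula)."""
    total_weight = sum(weights.values())
    return sum(weights[f] * similarities[f] for f in weights) / total_weight


# Weights and threshold copied from the fuzzy_name matchkey in the template.
weights = {"first_name": 0.5, "last_name": 0.3, "zip": 0.2}
threshold = 0.85

# Made-up per-field similarities for one candidate pair.
same_zip = {"first_name": 0.90, "last_name": 1.00, "zip": 1.0}
other_zip = {"first_name": 0.90, "last_name": 1.00, "zip": 0.0}

for label, sims in [("same zip", same_zip), ("different zip", other_zip)]:
    score = weighted_score(sims, weights)
    print(f"{label}: score={score:.2f}, match={score >= threshold}")
```

Under this assumption the first pair scores 0.95 and clears the 0.85 threshold, while a zip mismatch drops the same names to 0.75, below it.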

## Key Types

- `DedupeResult` — `.golden` (DataFrame), `.dupes`, `.unique`, `.clusters` (dict), `.scored_pairs` (list), `.stats`, `.total_clusters`, `.match_rate`
- `MatchResult` — same shape as DedupeResult for cross-file matching
- `GoldenMatchConfig` — Pydantic model, loadable from YAML via `gm.load_config("config.yaml")`

## Performance & Scale
- Backend tiers by row count: Polars in-memory (<500K), DuckDB out-of-core (500K-50M), Ray distributed (>=50M)
- Verified at 100M rows: full dedupe in 9.2 min on a 5-node Ray cluster (80 CPUs), 20,000,000 clusters recovered exactly, driver peak 0.36 GB RSS. The run is recall-complete (correct across any partitioning) and needs no driver-side collect end to end. A faster per-partition path (~213 s on a 4-worker run) is available via `GOLDENMATCH_DISTRIBUTED_BLOCK_SHUFFLE=0` for inputs where duplicates co-locate within partitions.
- 1M-row exact dedupe: ~7.8 s. 100K-row fuzzy: ~12.8 s
- LLM scorer: ~$0.04 per dataset (budget-capped, opt-in)
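
The backend tiers above amount to a simple row-count rule. The sketch below only illustrates those boundaries; `pick_backend` is a hypothetical helper, and nothing here describes how GoldenMatch actually selects a backend internally.

```python
def pick_backend(n_rows: int) -> str:
    """Map a row count to the tier names listed above (illustrative only)."""
    if n_rows < 500_000:
        return "polars"   # in-memory
    if n_rows < 50_000_000:
        return "duckdb"   # out-of-core
    return "ray"          # distributed


for rows in (10_000, 500_000, 100_000_000):
    print(rows, pick_backend(rows))
```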

## Scorers
exact, jaro_winkler, levenshtein, token_sort, ensemble, dice, jaccard, soundex_match, embedding, record_embedding, name_freq_weighted_jw, given_name_aliased_jw

## Transforms
lowercase, uppercase, strip, soundex, metaphone, digits_only, alpha_only, normalize_whitespace, token_sort, first_token, last_token, substring:start:end, legal_form_strip, address_normalize, naics_normalize

## Bundled Reference Data
Five OSS packs ship with the wheel; auto-config swaps the matching scorer/transform in when the column name pattern matches AND the profiled `col_type` agrees:
- Surnames (US Census 2010, top 10K) → `name_freq_weighted_jw` on last_name/surname columns. Lifts F1 0.667→0.915 on the common-name FP fixture.
- Given-name aliases (~140 pairs: William↔Bill, Katherine↔Kate, ...) → `given_name_aliased_jw` on first_name/given_name columns.
- Business legal forms (Inc, LLC, Ltd, GmbH, S.A., ...) → prepends `legal_form_strip` on company/business/org/firm/legal_name columns.
- USPS Pub. 28 addresses → prepends `address_normalize` on address/street/addr_line/mailing_address columns. Handles `#5`→`apt 5`, `P.O. Box`→`PO Box`.
- NAICS 2022 industries (2,125 codes, all 5 hierarchy levels) → prepends `naics_normalize` on naics/sic/industry_code/business_type columns.

The `col_type` gate (PR #224) skips the refinement when column-name regex matches but profiled shape disagrees — a `last_name` column holding numeric IDs keeps its caller-specified scorer. See `docs/reference-data`.
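
The two-condition gate can be sketched as follows for the surnames pack. The regex, the `"name"` type label, and the `refine_scorer` function are all hypothetical; only the rule itself (name pattern matches AND profiled type agrees, otherwise keep the caller's scorer) comes from the description above.

```python
import re

# Hypothetical pattern built from the column names in the surnames bullet.
SURNAME_PATTERN = re.compile(r"(last_name|surname)", re.IGNORECASE)


def refine_scorer(column: str, col_type: str, current_scorer: str) -> str:
    """Swap in the surname scorer only when both conditions hold."""
    name_matches = bool(SURNAME_PATTERN.search(column))
    type_agrees = col_type == "name"  # hypothetical col_type label
    if name_matches and type_agrees:
        return "name_freq_weighted_jw"
    # Gate closed: the caller-specified scorer is kept unchanged.
    return current_scorer


print(refine_scorer("last_name", "name", "jaro_winkler"))
print(refine_scorer("last_name", "numeric", "exact"))
```

The second call mirrors the `last_name`-holding-numeric-IDs case: the name matches, the type does not, so the original scorer survives.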

## Learning Memory (v1.6.0)
Persistent corrections + threshold learning. Off by default; enable with `memory.enabled = true`.
- Store: SQLite (default) or Postgres. Path: `.goldenmatch/memory.db`.
- Collection points: review queue, boost tab, unmerge_record/cluster, LLM scorer, MCP `agent_approve_reject`, REST `/reviews/decide`, Python `add_correction()`.
- Corrections re-anchor to records via `record_hash`; ambiguous rehydrations report `stale_ambiguous`. The postflight summary reports `Memory: N applied, M stale, K stale-ambiguous, J unanchorable`.
- CLI: `goldenmatch memory stats|learn|export|import|show`.
- Python: `goldenmatch.add_correction(...)`, `learn()`, `memory_stats()`, `get_memory()`. Result objects expose `result.memory_stats`.
- MCP tools: `list_corrections`, `add_correction`, `learn_thresholds`, `memory_stats`, `memory_export`.
- The learner runs per matchkey once it has `learning.threshold_min_corrections` corrections (default 10), using a trust-weighted grid search.
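
As a rough illustration of what a trust-weighted grid search over thresholds can look like, the sketch below picks the threshold whose match/non-match decisions agree best with the stored corrections, weighting each correction by its trust. The `(score, is_match, trust)` tuple layout, the 0.50-0.99 grid, the agreement objective, and `learn_threshold` itself are assumptions for illustration, not GoldenMatch's learner.

```python
from typing import Optional


def learn_threshold(corrections, min_corrections: int = 10,
                    grid=None) -> Optional[float]:
    """Return the grid threshold with the highest trust-weighted agreement.

    corrections: iterable of (score, is_match, trust) tuples (assumed layout).
    Returns None when there are too few corrections to learn from.
    """
    corrections = list(corrections)
    if len(corrections) < min_corrections:
        return None
    if grid is None:
        # Hypothetical grid: 0.50, 0.51, ..., 0.99
        grid = [round(0.50 + 0.01 * i, 2) for i in range(50)]

    def agreement(threshold: float) -> float:
        # Sum the trust of every correction the threshold classifies correctly.
        return sum(trust for score, is_match, trust in corrections
                   if (score >= threshold) == is_match)

    # On ties, max() keeps the first (lowest) threshold in the grid.
    return max(grid, key=agreement)


# Made-up corrections: five confirmed matches, five rejected pairs.
corrections = [(0.90, True, 1.0)] * 5 + [(0.70, False, 1.0)] * 5
print(learn_threshold(corrections))
```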

## Golden Suite
GoldenMatch is the headline package of a 6-package suite that composes into one pipeline: GoldenCheck (profile + validate) → GoldenFlow (standardize) → GoldenMatch (dedupe) → GoldenAnalysis (cross-cutting reporting), orchestrated by GoldenPipe, with InferMap for schema mapping. All ship on both PyPI and npm; the TypeScript ports are edge-safe with an optional WASM backend.

## Docs
- [Learning Memory](https://benseverndev-oss.github.io/goldenmatch/learning-memory)
- [Full docs](https://benseverndev-oss.github.io/goldenmatch/): 23 guides
- [Full API reference](https://benseverndev-oss.github.io/goldenmatch/python-api): 101 exports
- [PyPI](https://pypi.org/project/goldenmatch/)
- [GitHub](https://github.com/benseverndev-oss/goldenmatch)
