# GoldenMatch

> Entity resolution toolkit — deduplicate records and match across datasets using fuzzy, probabilistic, and LLM-powered scoring.

## Interfaces
- MCP Server: `goldenmatch mcp-serve` (13 agent tools + 17 data tools = 30 total)
- Remote MCP: https://goldenmatch-mcp-production.up.railway.app/mcp/ (30 tools, Smithery: https://smithery.ai/servers/benzsevern/goldenmatch)
- A2A Server: `goldenmatch agent-serve --port 8200` (10 skills)
- CLI: `goldenmatch dedupe`, `goldenmatch match`, + 18 more commands
- Python API: `import goldenmatch` — `dedupe_df()`, `match_df()`, `score_strings()`, `evaluate()`, ~101 exports
- REST API: `goldenmatch serve` on port 8000

## Install
- `pip install goldenmatch`
- Quality scanning: `pip install goldenmatch[quality]`
- Data transforms: `pip install goldenmatch[transform]`
- Embeddings: `pip install goldenmatch[embeddings]`

## Quick Examples

### Deduplicate a CSV (zero-config)
```python
import goldenmatch as gm
result = gm.dedupe("customers.csv")
result.golden.write_csv("deduped.csv")
print(f"{result.total_clusters} clusters, {result.match_rate:.1%} match rate")
```

### Deduplicate with explicit config
```python
result = gm.dedupe("customers.csv",
    exact=["email"],
    fuzzy={"name": 0.85, "address": 0.80},
    blocking=["zip"],
)
```
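Blocking is what keeps fuzzy matching tractable: records are compared only within groups that share a blocking key, instead of across all n² pairs. As an illustration of what `blocking=["zip"]` buys (not GoldenMatch's internal implementation), a minimal pure-Python sketch:

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(records, block_field):
    """Group records by a blocking key and yield only within-block
    index pairs, shrinking the comparison space from O(n^2)."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[rec[block_field]].append(i)
    for ids in blocks.values():
        yield from combinations(ids, 2)

records = [
    {"name": "Ann Lee",  "zip": "10001"},
    {"name": "Anne Lee", "zip": "10001"},
    {"name": "Bob Roy",  "zip": "94107"},
]
pairs = list(candidate_pairs(records, "zip"))
# Only the two 10001 records form a candidate pair; 94107 is alone.
```

Only candidate pairs are then scored by the fuzzy matchkeys, which is why a selective blocking key matters so much for runtime.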

### Match across two files
```python
result = gm.match("file_a.csv", "file_b.csv", fuzzy={"name": 0.85})
```

### Privacy-preserving linkage (no raw data shared)
```python
result = gm.pprl_link("hospital_a.csv", "hospital_b.csv",
    fields=["first_name", "last_name", "dob", "zip"])
```
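`pprl_link` compares records without exchanging raw field values. As a rough illustration of the underlying idea, here is the classic bigram Bloom-filter encoding used in the PPRL literature; GoldenMatch's actual encoding, filter size, and hardening are assumptions here and will differ:

```python
import hashlib

def bloom_encode(value, size=64, num_hashes=4):
    """Encode a string's character bigrams into a Bloom-filter bit set,
    so similarity can be estimated without revealing the raw value."""
    value = value.lower().strip()
    bigrams = {value[i:i + 2] for i in range(len(value) - 1)}
    bits = set()
    for gram in bigrams:
        digest = hashlib.sha256(gram.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for k in range(num_hashes):
            bits.add((h1 + k * h2) % size)  # double hashing -> k positions
    return bits

def dice(a, b):
    """Dice coefficient between two bit sets (1.0 = identical)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

score = dice(bloom_encode("catherine"), bloom_encode("katherine"))
# High score: the two names share 7 of 8 bigrams.
```

Each party encodes locally and only the bit sets are compared, which is what "no raw data shared" means in practice.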

### Evaluate accuracy
```python
metrics = gm.evaluate("data.csv", config="config.yaml", ground_truth="gt.csv")
print(f"F1: {metrics['f1']:.1%}, Precision: {metrics['precision']:.1%}")
```
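The metrics follow the standard pairwise definitions. A self-contained sketch of how precision, recall, and F1 fall out of predicted vs. ground-truth match pairs (illustrative only; `gm.evaluate` is the real entry point):

```python
def pair_metrics(predicted, truth):
    """Precision, recall, and F1 over sets of matched record-ID pairs."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)                       # true-positive pairs
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

m = pair_metrics(predicted={(1, 2), (3, 4), (5, 6)},
                 truth={(1, 2), (3, 4), (7, 8)})
# 2 of 3 predictions correct, 2 of 3 true pairs found -> P = R = F1 = 2/3
```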

## Config Template (YAML)

```yaml
matchkeys:
  - name: exact_email
    type: exact
    fields:
      - field: email
        transforms: [lowercase, strip]

  - name: fuzzy_name
    type: weighted
    threshold: 0.85
    fields:
      - field: first_name
        scorer: jaro_winkler
        weight: 0.5
        transforms: [lowercase, strip]
      - field: last_name
        scorer: jaro_winkler
        weight: 0.3
      - field: zip
        scorer: exact
        weight: 0.2

blocking:
  strategy: adaptive
  keys:
    - fields: [zip]

golden_rules:
  default_strategy: most_complete
```
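A `weighted` matchkey like `fuzzy_name` above scores each field, combines the scores by weight, and declares a match when the sum clears `threshold`. A minimal pure-Python sketch of that semantics, using `difflib` as a stand-in for the `jaro_winkler` scorer (the real scorer and any internal details are GoldenMatch's, not shown here):

```python
from difflib import SequenceMatcher

def weighted_matchkey(rec_a, rec_b, fields, threshold):
    """Weighted-sum matchkey: per-field scores combined by weight,
    matched when the total reaches the threshold."""
    total = sum(weight * scorer(rec_a[field], rec_b[field])
                for field, scorer, weight in fields)
    return total >= threshold, round(total, 3)

# Stand-in scorers for the sketch.
sim = lambda a, b: SequenceMatcher(None, a.lower(), b.lower()).ratio()
exact = lambda a, b: 1.0 if a == b else 0.0

fields = [("first_name", sim, 0.5), ("last_name", sim, 0.3), ("zip", exact, 0.2)]
a = {"first_name": "Jon", "last_name": "Smith", "zip": "10001"}
b = {"first_name": "John", "last_name": "Smith", "zip": "10001"}
matched, score = weighted_matchkey(a, b, fields, threshold=0.85)
# 0.5 * 0.857 + 0.3 * 1.0 + 0.2 * 1.0 = 0.929 >= 0.85 -> match
```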

## Key Types

- `DedupeResult` — `.golden` (DataFrame), `.dupes`, `.unique`, `.clusters` (dict), `.scored_pairs` (list), `.stats`, `.total_clusters`, `.match_rate`
- `MatchResult` — same shape as DedupeResult for cross-file matching
- `GoldenMatchConfig` — Pydantic model, loadable from YAML via `gm.load_config("config.yaml")`

## Performance Limits
- In-memory: up to ~500K records; use the DuckDB backend or chunked mode for larger datasets
- Throughput: ~7.8 s for a 1M-record exact dedupe; ~12.8 s for a 100K-record fuzzy dedupe
- LLM scorer: ~$0.04 per dataset (budget-capped, opt-in)
- PPRL auto-config: 92.4% F1 on FEBRL4

## Scorers
exact, jaro_winkler, levenshtein, token_sort, ensemble, dice, jaccard, soundex_match, embedding
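
As an example of what the phonetic scorer works with, here is the classic Soundex code (`soundex_match` would then compare the resulting codes for equality); GoldenMatch's exact variant may differ in edge cases, so treat this as a sketch of the standard algorithm:

```python
def soundex(name):
    """Classic Soundex: first letter plus up to three digit codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:   # skip repeats of the same code
            result += code
        if ch not in "hw":          # h/w do not reset the previous code
            prev = code
    return (result + "000")[:4]     # pad or truncate to 4 characters

soundex("Robert")   # "R163"
soundex("Rupert")   # "R163" -- phonetically equivalent
```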

## Transforms
lowercase, uppercase, strip, soundex, metaphone, digits_only, alpha_only, normalize_whitespace, token_sort, first_token, last_token, substring:start:end
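
Transforms are applied left to right to each field value before scoring. A hypothetical sketch of a few of the listed transforms as plain functions (names match the list above; the implementations are illustrative, not the library's):

```python
import re

# Illustrative stand-ins for some of the named transforms.
TRANSFORMS = {
    "lowercase": str.lower,
    "strip": str.strip,
    "normalize_whitespace": lambda s: " ".join(s.split()),
    "digits_only": lambda s: re.sub(r"\D", "", s),
    "first_token": lambda s: s.split()[0] if s.split() else "",
}

def apply_transforms(value, names):
    """Apply named transforms left to right, mirroring a config entry
    like `transforms: [lowercase, strip]`."""
    for name in names:
        value = TRANSFORMS[name](value)
    return value

apply_transforms("  (555) 123-4567 ", ["strip", "digits_only"])
# -> "5551234567"
```

Ordering matters: `[first_token, lowercase]` and `[lowercase, first_token]` agree, but `[digits_only, strip]` and `[strip, digits_only]` can differ on padded input in other pipelines, so keep the config order deliberate.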

## Docs
- [Full docs](https://benzsevern.github.io/goldenmatch/): 22 guides
- [Full API reference](https://benzsevern.github.io/goldenmatch/python-api): 101 exports
- [PyPI](https://pypi.org/project/goldenmatch/)
- [GitHub](https://github.com/benzsevern/goldenmatch)
