Metadata-Version: 2.4
Name: name-variants
Version: 0.1.3
Summary: Multilingual name romanization lookup tables: Chinese, Japanese, Korean, Arabic, Vietnamese, Indian, Persian, Hebrew, Thai, Greek, Turkish, Russian, Indonesian/Malay
Project-URL: Homepage, https://github.com/SecurityRonin/name-variants
Project-URL: Repository, https://github.com/SecurityRonin/name-variants
Project-URL: Issues, https://github.com/SecurityRonin/name-variants/issues
Author: SecurityRonin
License-Expression: MIT
License-File: LICENSE
Keywords: arabic,cjk,multilingual,names,ner,nlp,pseudonymization,romanization,transliteration
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Requires-Dist: click>=8.0
Provides-Extra: dev
Requires-Dist: bandit[toml]>=1.7; extra == 'dev'
Requires-Dist: build>=1.0; extra == 'dev'
Requires-Dist: mcp<2.0,>=1.0; extra == 'dev'
Requires-Dist: pip-audit>=2.7; extra == 'dev'
Requires-Dist: pre-commit>=3.7; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.9; extra == 'dev'
Requires-Dist: twine>=5.0; extra == 'dev'
Provides-Extra: mcp
Requires-Dist: mcp<2.0,>=1.0; extra == 'mcp'
Provides-Extra: pandas
Requires-Dist: pandas>=1.3; extra == 'pandas'
Description-Content-Type: text/markdown

[![PyPI](https://img.shields.io/pypi/v/name-variants?style=flat-square)](https://pypi.org/project/name-variants/)
[![Tests](https://img.shields.io/github/actions/workflow/status/SecurityRonin/name-variants/ci.yml?style=flat-square&label=tests)](https://github.com/SecurityRonin/name-variants/actions)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue?style=flat-square)](LICENSE)
[![Sponsor](https://img.shields.io/badge/sponsor-h4x0r-ea4aaa?logo=github-sponsors&style=flat-square)](https://github.com/sponsors/h4x0r)
[![name-variants MCP server](https://glama.ai/mcp/servers/SecurityRonin/name-variants/badges/score.svg)](https://glama.ai/mcp/servers/SecurityRonin/name-variants)

# name-variants

**`"Chan"` is simultaneously 陳, 陈, 찬, and จัน: `lookup()` returns all of them.**

Every romanization system produces a member of an equivalence class: no canonical form, no ordering dependency, no silent data loss. `share_cluster("Hsu", "Xu")` is `True`. `lookup("Chan")` returns a Chinese surname cluster *and* a Korean given-name cluster, sorted by bearer count.

Available as a Python library, CLI, pandas accessor, and [Model Context Protocol (MCP)](https://modelcontextprotocol.io) server.

```bash
pip install name-variants
```

---

## The core idea

A `NameCluster` is a frozenset of co-equal representations. `陳`, `陈`, `chen`, `chan`, `tan`, `chern` are all members of the same Chinese surname cluster — none is more "real" than another. `lookup()` returns every cluster that contains your query, sorted by frequency:

```python
from name_variants import lookup, share_cluster

clusters = lookup("Chan")
# [NameCluster(language='chinese', 8 forms),
#  NameCluster(language='korean_given', 3 forms)]

# Both Chinese scripts are in the same cluster — co-equal
assert "陳" in clusters[0]   # Traditional
assert "陈" in clusters[0]   # Simplified

# Membership is case-insensitive
assert "CHAN" in clusters[0]

# Ambiguity is surfaced, not suppressed
assert len(clusters) == 2    # Chinese AND Korean, not one-or-the-other
```

---

## API

### lookup() — all matching clusters

```python
from name_variants import lookup

lookup("Chan")
# [NameCluster(language='chinese', 8 forms),
#  NameCluster(language='korean_given', 3 forms)]

lookup("Nguyen")
# [NameCluster(language='vietnamese', 4 forms)]

lookup("Smith")
# []
```

Results are sorted by `frequency` in descending order, so the statistically most likely interpretation comes first.
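
One way to picture how a single query resolves to several clusters is an inverted index from each case-folded form to every cluster that contains it. The sketch below is a toy model under stated assumptions, not the library's implementation: `ToyCluster`, `CLUSTERS`, `INDEX`, and `toy_lookup` are hypothetical names, the data is copied from the examples in this README, and sorting clusters with an unknown `frequency` last is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyCluster:  # hypothetical stand-in for NameCluster
    forms: frozenset
    language: str
    frequency: Optional[int]


# Toy data taken from this README's examples, not the real tables.
CLUSTERS = [
    ToyCluster(frozenset({"찬", "chan", "chahn"}), "korean_given", None),
    ToyCluster(
        frozenset({"陳", "陈", "chen", "chan", "tan", "chern"}),
        "chinese",
        90_000_000,
    ),
]

# Inverted index: case-folded form -> every cluster that contains it.
INDEX = {}
for cluster in CLUSTERS:
    for form in cluster.forms:
        INDEX.setdefault(form.casefold(), []).append(cluster)


def toy_lookup(text: str) -> list:
    """Return all clusters containing `text`, most frequent first."""
    hits = INDEX.get(text.strip().casefold(), [])
    # Assumption: an unknown frequency (None) sorts after any known count.
    return sorted(hits, key=lambda c: (c.frequency is None, -(c.frequency or 0)))


print([c.language for c in toy_lookup("Chan")])   # ['chinese', 'korean_given']
print(toy_lookup("Smith"))                        # []
```

Because the index maps a form to a list rather than a single value, adding a new table can only add interpretations; it can never overwrite one.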

### share_cluster() — equivalence check

```python
from name_variants import share_cluster

share_cluster("Chan", "Chen")        # True  — same Chinese cluster
share_cluster("Chou", "Zhou")        # True  — Wade-Giles = Pinyin
share_cluster("Chiang", "Jiang")     # True  — Chiang Kai-shek / 蒋介石
share_cluster("Hsu", "Xu")           # True  — Taiwan diaspora romanization
share_cluster("Tsao", "Cao")         # True  — Ts'ao Ts'ao / 曹操
share_cluster("Chan", "Kim")         # False — different names
share_cluster("", "Chan")            # False — empty input
```
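
Conceptually, two names are equivalent when at least one cluster contains both. The following is a minimal sketch of that check over toy data, assuming case-insensitive comparison as described earlier; `toy_share_cluster` and the two frozensets are hypothetical and only reuse forms shown in this README.

```python
# Toy clusters built from forms shown in this README; not the real tables.
CHINESE_CHEN = frozenset({"陳", "陈", "chen", "chan", "tan", "chern"})
KOREAN_CHAN = frozenset({"찬", "chan", "chahn"})
CLUSTERS = [CHINESE_CHEN, KOREAN_CHAN]


def toy_share_cluster(a: str, b: str) -> bool:
    """True if at least one cluster contains both names."""
    a, b = a.strip().casefold(), b.strip().casefold()
    if not a or not b:
        return False  # empty input never matches
    return any(a in forms and b in forms for forms in CLUSTERS)


print(toy_share_cluster("Chan", "Chen"))   # True  (both in the Chinese cluster)
print(toy_share_cluster("Chan", "찬"))     # True  (both in the Korean cluster)
print(toy_share_cluster("Chen", "찬"))     # False (no single cluster holds both)
```

In this toy model the relation is not transitive: `"Chan"` matches both `"Chen"` and `"찬"`, yet those two do not match each other. That is a direct consequence of letting one romanization belong to more than one cluster.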

### dialect() — Chinese romanization system tag

```python
from name_variants import dialect

dialect("chen")   # "mandarin_pinyin"
dialect("chan")   # "cantonese"
dialect("tan")    # "hokkien"
dialect("陳")     # "traditional"

dialect("chou")   # "wade_giles"
dialect("hsu")    # "wade_giles"
dialect("Smith")  # None
```

### normalize() — text preprocessing

```python
from name_variants import normalize

normalize("  NGUYỄN  ")                    # "nguyễn"
normalize("Nguyễn", strip_diacritics=True) # "nguyen"
normalize("chan\u200b")                   # zero-width space (U+200B) is stripped
```
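
The behaviour shown above can be approximated with the standard library alone. This is a sketch under stated assumptions rather than the library's code: `normalize_sketch` is a hypothetical name, the set of removed zero-width characters is a guess, and diacritics are stripped by decomposing to NFD and dropping combining marks.

```python
import unicodedata

# Assumption: only U+200B (zero-width space) and U+FEFF (BOM) are removed.
# U+200C and U+200D are kept because they can be meaningful in Persian and Indic text.
_ZERO_WIDTH = {"\u200b", "\ufeff"}


def normalize_sketch(text: str, strip_diacritics: bool = False) -> str:
    """Trim, lowercase, and optionally remove combining marks."""
    # Remove zero-width characters first: str.strip() does not treat them as whitespace.
    text = "".join(ch for ch in text if ch not in _ZERO_WIDTH)
    text = text.strip().lower()
    if strip_diacritics:
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so that equal-looking strings compare equal.
    return unicodedata.normalize("NFC", text)


print(normalize_sketch("  NGUYỄN  "))                     # nguyễn
print(normalize_sketch("Nguyễn", strip_diacritics=True))  # nguyen
print(normalize_sketch("chan\u200b"))                     # chan
```

Dropping every combining mark is only safe for Latin-script input: the same NFD step would also turn が into か by removing the dakuten, so a production version would probably need to restrict stripping by script.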

---

## CLI

```bash
nv lookup Chan
# [chinese] (~90M bearers)
#   陳  陈  chan  chen  tan  ...
# [korean_given]
#   찬  chan  chahn

nv match Chan Chen          # true
nv match Chan Kim           # false
nv match --exit-code Chan Chen && echo same   # shell-scripting friendly

nv canonicalize-csv names.csv --col name --out out.csv
# adds {name}_canonical column

nv dedupe names.csv --col name --out out.csv
# adds cluster_id column grouping romanization variants
```

---

## Pandas

```bash
pip install "name-variants[pandas]"
```

```python
import pandas as pd
import name_variants  # registers .nv accessor

s = pd.Series(["Chan", "Chen", "Smith", "Park"])

s.nv.lookup()
# 0    [NameCluster(chinese, ...), NameCluster(korean_given, ...)]
# 1    [NameCluster(chinese, ...)]
# 2    []
# 3    [NameCluster(korean, ...)]

s.nv.cluster_id()
# 0    a3f2b1c4d5e6   ← same as row 1 (Chan and Chen share chinese cluster)
# 1    a3f2b1c4d5e6
# 2                   ← empty string for unknown
# 3    9b8c7d6e5f4a

a = pd.Series(["Chan", "Park"])
b = pd.Series(["Chen", "Bak"])
a.nv.share_cluster_with(b)   # [True, True]
```
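
How `cluster_id()` derives its identifiers is not documented here. One plausible scheme, shown purely as an illustration, hashes the sorted forms of a cluster so that the same cluster always yields the same short hex string. `toy_cluster_id`, the choice of SHA-256, the separator, and the 12-character truncation are all assumptions.

```python
import hashlib


def toy_cluster_id(forms: frozenset) -> str:
    """Derive a stable 12-hex-character id from a cluster's forms."""
    if not forms:
        return ""  # mirrors the empty string shown above for unknown names
    # Sort first: set iteration order is not guaranteed to be stable across runs.
    canonical = "\x1f".join(sorted(forms))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


a = toy_cluster_id(frozenset({"陳", "陈", "chen", "chan"}))
b = toy_cluster_id(frozenset({"chan", "chen", "陈", "陳"}))
print(a == b)   # True: same forms, same id, regardless of construction order
```

The built-in `hash()` would not work for this purpose, because string hashes are randomized per process and the resulting ids would change between runs.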

---

## MCP server (Model Context Protocol)

`name-variants` ships a built-in [Model Context Protocol](https://modelcontextprotocol.io) server, exposing name lookup as MCP tools that any MCP-compatible AI client (Claude Desktop, Claude Code, Cursor, etc.) can call directly.

**Claude Code:**
```bash
claude mcp add name-variants -- uvx --from "name-variants[mcp]" nv-mcp
```

**Claude Desktop** — add to `claude_desktop_config.json`:
```json
{
  "mcpServers": {
    "name-variants": {
      "command": "uvx",
      "args": ["--from", "name-variants[mcp]", "nv-mcp"]
    }
  }
}
```

Three MCP tools are exposed:

| Tool | Arguments | Returns |
|---|---|---|
| `lookup` | `text: str` | list of `{language, forms[], frequency}` clusters |
| `share_cluster` | `a: str, b: str` | `true` / `false` |
| `dialect` | `text: str` | romanization system string or `null` |
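
For orientation, the `lookup` return shape in the table can be produced from cluster data as follows. The field names come from the table above; the helper `clusters_to_payload`, the tuple input format, and the alphabetical ordering of `forms` are assumptions, and the actual server may serialize differently.

```python
import json


def clusters_to_payload(clusters):
    """Convert (language, forms, frequency) tuples to JSON-ready dicts."""
    return [
        # A frozenset is not JSON-serializable, so emit a sorted list instead.
        {"language": language, "forms": sorted(forms), "frequency": frequency}
        for language, forms, frequency in clusters
    ]


# Toy input using forms shown in this README.
toy = [("korean_given", frozenset({"찬", "chan", "chahn"}), None)]
print(json.dumps(clusters_to_payload(toy), ensure_ascii=False))
# [{"language": "korean_given", "forms": ["chahn", "chan", "찬"], "frequency": null}]
```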

---

## Language tables

| Language | Entries | Coverage |
|---|---|---|
| `chinese` | 140 | Pinyin + Wade-Giles + Cantonese + Hokkien + Hakka + Teochew + Traditional |
| `japanese` | 143 | Hepburn + macron variants |
| `korean` | 100 | Revised Romanization + McCune-Reischauer |
| `arabic` | 92 | Multiple transliteration systems |
| `vietnamese` | 84 | Diacritics + stripped forms |
| `russian` | 79 | Multiple transliteration systems |
| `indonesian_malay` | 77 | — |
| `persian` | 80 | — |
| `indian_hindi` | 80 | — |
| `hebrew` | 75 | — |
| `turkish` | 74 | Dotted-İ variants |
| `greek` | 60 | — |
| `thai` | 68 | — |
| `indian_bengali` | 56 | — |
| `indian_tamil` | 53 | — |
| `chinese_given` | 120 | Common given-name characters with Pinyin |
| `korean_given` | 70 | Common given-name syllables |
| `japanese_given` | 107 | Common given-name kanji |

```python
from name_variants import ALL_TABLES
list(ALL_TABLES.keys())   # all 18 table names
```

## Chinese writing systems

| System | Examples |
|---|---|
| Mandarin Pinyin | Zhou, Zhang, Wang, Xu |
| Wade-Giles | Chou, Chang, Wang, Hsu, Tsao, Kuo, Hsieh |
| Cantonese (Hong Kong Government romanization) | Chan, Wong, Ng, Lam, Tsui |
| Hokkien/Min Nan | Tan, Ng, Lim, Goh |
| Hakka | Fong, Thong |
| Teochew | Teo, Ng |
| Postal romanization | Peking, Nanking, Chungking |
| Traditional characters | 陳, 劉, 張, 楊, 趙 |
| Simplified characters | 陈, 刘, 张, 杨, 赵 |

## NameCluster reference

```python
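from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class NameCluster:
    forms: frozenset[str]    # all representations, co-equal
    language: str            # "chinese", "korean", "vietnamese", etc.
    frequency: int | None    # approximate global bearer count

    # Interface summary: method bodies are elided with `...`
    def __contains__(self, text: str) -> bool: ...   # case-insensitive
    def __iter__(self) -> Iterator[str]: ...         # iterate all forms
    def __len__(self) -> int: ...                    # number of forms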
```

---

## Why equivalence classes instead of a canonical key?

A canonical-key model forces a false choice: `"Chan"` must map to either the Chinese surname (`陳`/`陈`) *or* the Korean given name (`찬`), not both. Table ordering becomes load-bearing, because whichever table is consulted last wins. In that model, romanizations must also be stripped from given-name tables to prevent collisions.

The `NameCluster` model eliminates this: every romanization system's output is just another member of a frozenset. `lookup()` returns all matching clusters. Ambiguity is surfaced, not suppressed. The most likely interpretation comes first by frequency.
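
The ordering problem is easy to reproduce with two plain dictionaries. The tables below are toy data built from forms shown in this README, and the variable names are hypothetical; the point is only to contrast "last table wins" with "keep every reading".

```python
from collections import defaultdict

# Toy tables built from forms shown in this README; not the library's data.
chinese_surnames = {"chan": "陈", "chen": "陈"}
korean_given = {"chan": "찬", "chahn": "찬"}

# Canonical-key model: merge everything into one dict.
# The table merged last silently overwrites earlier entries.
canonical = {**chinese_surnames, **korean_given}
assert canonical["chan"] == "찬"           # the Chinese reading is gone

# Reverse the merge order and the answer flips.
canonical_reversed = {**korean_given, **chinese_surnames}
assert canonical_reversed["chan"] == "陈"

# Equivalence-class model: collect every reading, drop nothing.
readings = defaultdict(set)
for table in (chinese_surnames, korean_given):
    for romanization, key in table.items():
        readings[romanization].add(key)
assert readings["chan"] == {"陈", "찬"}    # independent of table order
```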

---

## Contributing

```bash
git clone https://github.com/SecurityRonin/name-variants
cd name-variants
pip install -e ".[dev]"
pytest
```

Data files are in `name_variants/*_names.py` and `name_variants/*_surnames.py`. Each entry is a plain Python dict — easy to read and edit:

```python
"陈": {
    "forms": ["陳", "chen", "chan", "tan", ...],
    "frequency": 90_000_000,
    "dialects": {
        "chen": "mandarin_pinyin",
        "chan": "cantonese",
        "tan":  "hokkien",
        "陳":   "traditional",
    },
},
```

Adding a new variant is one edit to one entry — forms, frequency, and dialect tag colocated.
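
Because everything lives in one dict, an edit can be sanity-checked with a few lines before running the test suite. `validate_entry` is a hypothetical helper, not part of the package, and its rules are assumptions drawn from the entry shown above: every dialect tag should refer to a listed form (or to the entry key), forms should not repeat, and `frequency` should be a positive integer or `None`.

```python
def validate_entry(key: str, entry: dict) -> list:
    """Return a list of problems found in one table entry (empty means OK)."""
    problems = []
    forms = entry.get("forms", [])
    if len(forms) != len(set(forms)):
        problems.append("duplicate forms")
    for form in entry.get("dialects", {}):
        # Assumption: the entry key counts as a form of its own cluster.
        if form != key and form not in forms:
            problems.append(f"dialect tag for unknown form: {form!r}")
    freq = entry.get("frequency")
    if freq is not None and (not isinstance(freq, int) or freq <= 0):
        problems.append("frequency must be a positive int or None")
    return problems


# The entry from above, with the elided forms left out.
entry = {
    "forms": ["陳", "chen", "chan", "tan"],
    "frequency": 90_000_000,
    "dialects": {"chen": "mandarin_pinyin", "chan": "cantonese",
                 "tan": "hokkien", "陳": "traditional"},
}
print(validate_entry("陈", entry))   # []
```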

---

[Privacy Policy](https://securityronin.com/privacy/) · [Terms of Service](https://securityronin.com/terms/) · © 2026 Security Ronin Ltd
