Metadata-Version: 2.4
Name: zh-dict-mcp
Version: 0.1.0
Summary: MCP server for Chinese figurative language lookup, backed by CC-CEDICT
Author: outsiderrr
License: MIT
Project-URL: Homepage, https://github.com/outsiderrr/zh-dict-mcp
Project-URL: Repository, https://github.com/outsiderrr/zh-dict-mcp
Project-URL: Issues, https://github.com/outsiderrr/zh-dict-mcp/issues
Keywords: mcp,model-context-protocol,chinese,dictionary,cedict,cc-cedict,nlp,figurative-language,ai-writing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Natural Language :: Chinese (Simplified)
Classifier: Natural Language :: Chinese (Traditional)
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: LICENSE-CC-CEDICT
Requires-Dist: mcp>=1.0
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Dynamic: license-file

# zh-dict-mcp

MCP server for Chinese figurative language lookup, backed by CC-CEDICT.

**What it does**: given a Chinese word or phrase, tells you whether its figurative usage has been lexicalized (recorded in the dictionary as an independent sense) or is a one-off creative expression.

**Why it exists**: LLMs writing Chinese dialogue, fiction, or roleplay tend to invent purple-prose figurative expressions that no real person would say, e.g., "他把心锁进铁盒里" ("he locked his heart in an iron box") or "墙比夜更厚" ("the wall is thicker than the night"). This tool gives you an objective, dictionary-backed check.

---

## Install + use in one line

In Claude Code (or any MCP-aware client):

```bash
claude mcp add zh-dict-mcp -- uvx zh-dict-mcp
```

Or paste into your `.mcp.json`:

```json
{
  "mcpServers": {
    "zh-dict-mcp": {
      "command": "uvx",
      "args": ["zh-dict-mcp"]
    }
  }
}
```

That's it. `uvx` pulls the package from PyPI on first run, caches it, and launches the stdio MCP server. No `pip install` needed.

The `lookup_dictionary` tool is now available in your Claude Code sessions.

---

## What you get

A single MCP tool:

```
lookup_dictionary(word: string) → JSON
```

**Example**: `lookup_dictionary("看见")` returns:

```json
{
  "word": "看见",
  "found_in_cedict": true,
  "simplified": "看见",
  "traditional": "看見",
  "pinyin": "kan4 jian4",
  "definitions": ["to see", "to catch sight of"],
  "tags": {
    "has_figurative": false,
    "is_neologism": false,
    "is_slang": false,
    "has_idiom_marker": false
  }
}
```

**Example**: `lookup_dictionary("内卷")` returns:

```json
{
  "word": "内卷",
  "found_in_cedict": true,
  "definitions": [
    "(embryology) to involute; involution",
    "(neologism, attested by 2017) (of a society) to become more and more involuted..."
  ],
  "tags": { "is_neologism": true, ... }
}
```

**Example**: `lookup_dictionary("锁进铁盒里")` (a creative one-off) returns:

```json
{
  "word": "锁进铁盒里",
  "found_in_cedict": false,
  "found_in_whitelist": false,
  "definitions": []
}
```
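
On the caller side, the three payloads above can be reduced to a coarse review label. The following is a minimal sketch, not part of the package: `classify_result` and its label strings are hypothetical names, and it reads keys with defaults because the example payloads do not all carry the same fields.

```python
import json


def classify_result(raw_json: str) -> str:
    """Map a lookup_dictionary JSON payload to a coarse review label.

    The label strings are illustrative; they are not defined by zh-dict-mcp.
    """
    result = json.loads(raw_json)

    # Use .get() with defaults: the example payloads omit some keys.
    found = result.get("found_in_cedict", False) or result.get("found_in_whitelist", False)
    if not found:
        return "not_lexicalized"

    tags = result.get("tags", {})
    if tags.get("is_neologism"):
        return "neologism"
    if tags.get("has_figurative"):
        return "lexicalized_figurative"
    return "lexicalized"


if __name__ == "__main__":
    payload = '{"word": "锁进铁盒里", "found_in_cedict": false, "found_in_whitelist": false, "definitions": []}'
    print(classify_result(payload))
```

A label such as `not_lexicalized` only says the expression is absent from the dictionary and whitelist; whether it should be rewritten is still a judgment for the reviewer.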

---

## Use cases

- **AI-generated dialogue review**: catch live metaphors that an LLM invents but no real speaker would use
- **AI writing lint**: pipeline filter for game NPC dialogue / interactive fiction / chatbot scripts
- **Lexicalization research**: check whether a figurative expression has been recorded in standard dictionaries
- **New word verification**: confirm neologisms / slang with `(neologism, attested by YEAR)` attribution
- **Idiom / literary allusion (典故) lookup**: get the figurative sense for idioms like "滑铁卢" → "(fig.) a defeat"

---

## Data source

[CC-CEDICT](https://www.mdbg.net/chinese/dictionary?page=cc-cedict): an open Chinese-English dictionary with roughly 125,000 entries, community-maintained and updated weekly.

License: CC BY-SA 4.0. Bundled in the package. See `LICENSE-CC-CEDICT`.

Why CC-CEDICT rather than 现代汉语词典 (XDHYCD) or other sources:

| Source | Coverage on AI-writing test set | Notes |
|---|---|---|
| chinese-xinhua (GitHub data) | 46% | Heavy bias toward Classical Chinese (古汉语) |
| 现代汉语词典 第7版 (XDHYCD7th) | 56% | Doesn't list literal compound words (e.g., 放下, 抓住) |
| **CC-CEDICT** | **~95%** | Modern usage + neologisms + `(fig.)` / `(slang)` / `(neologism)` markers |

CC-CEDICT explicitly tags figurative senses, neologisms with attestation years, slang, and idioms — exactly the structure needed for figurative-language analysis.

---

## Optional: project whitelist

For project-specific overrides (e.g., words CC-CEDICT happens to miss):

```yaml
# my_whitelist.yaml
allowed:
  - word: 凛然
    note: Standard literary usage, CC-CEDICT misses it
  - word: 头疼
    note: Override to include "annoyance" figurative sense
```

Pass via CLI:

```json
{
  "mcpServers": {
    "zh-dict-mcp": {
      "command": "uvx",
      "args": ["zh-dict-mcp", "--whitelist", "/abs/path/to/my_whitelist.yaml"]
    }
  }
}
```

Or via env var `ZH_DICT_WHITELIST=/path/to/file.yaml`.

When a word is in the whitelist, the result includes `"found_in_whitelist": true` and the note.
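
If you want to sanity-check a whitelist file before handing it to the server, the flat shape shown above can be read with the standard library alone. This is a minimal sketch that handles only that exact shape (an `allowed:` list of `word` / `note` pairs); `parse_whitelist` is a hypothetical helper, and the package's own loader may accept more YAML than this does.

```python
def parse_whitelist(text: str) -> dict[str, str]:
    """Return {word: note} for the flat 'allowed:' list shape only.

    Not a general YAML parser: no nesting, quoting rules, or multi-line values.
    """
    entries: dict[str, str] = {}
    current: str | None = None

    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line == "allowed:":
            continue
        if line.startswith("- "):
            line = line[2:].strip()

        # Split on the first colon only, so notes may contain colons.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()

        if key == "word":
            current = value
            entries[current] = ""
        elif key == "note" and current is not None:
            entries[current] = value

    return entries


if __name__ == "__main__":
    sample = """\
# my_whitelist.yaml
allowed:
  - word: 凛然
    note: Standard literary usage, CC-CEDICT misses it
  - word: 头疼
    note: Override to include "annoyance" figurative sense
"""
    for word, note in parse_whitelist(sample).items():
        print(word, "->", note)
```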

---

## Python API (no MCP needed)

Use the lookup library directly without launching a server:

```python
from zh_dict_mcp import DictionaryLookup

lookup = DictionaryLookup()  # bundled CC-CEDICT loads in ~200ms
result = lookup.lookup("滑铁卢")

print(result.found)              # True
print(result.definitions)        # ['Waterloo (Belgium)', 'Battle of Waterloo (1815)', '(fig.) a defeat']
print(result.tags.has_figurative)  # True
print(result.pinyin)             # 'Hua2 tie3 lu2'
```

With custom whitelist:

```python
from pathlib import Path
lookup = DictionaryLookup(whitelist_path=Path("my_whitelist.yaml"))
```

`lookup.py` has zero external dependencies (stdlib only). The `mcp` dependency is only needed for the MCP server.

---

## Install standalone (no MCP, just Python library)

```bash
pip install zh-dict-mcp
```

Or with uv:

```bash
uv add zh-dict-mcp
```

---

## Limitations

- **English-language definitions** (CC-CEDICT is a Chinese-English dictionary). Works well with LLMs that handle cross-lingual judgment (Claude, GPT-4+, Gemini). For monolingual Chinese consumers you'd need a translation layer.
- **Sense matching is on the caller** — this tool returns all senses; deciding whether the speaker's intended sense matches a returned sense is left to the LLM or human reviewer.
- **Single-word / single-phrase lookup** — doesn't parse full sentences. Wrap with your own NLP layer for sentence-level work.
- **9.4 MB data bundle** — CC-CEDICT data is included in the wheel for offline use.

---

## How it fits with broader writing-quality pipelines

This tool is one piece of a larger "AI-generated text quality" framework. Typical usage flow:

```
LLM generates Chinese dialogue
   ↓
Scan for figurative expressions (metaphor / metonymy / euphemism / ...)
   ↓
For each: lookup_dictionary(expression)
   ↓
  ├── found + sense matches intent → pass
  └── not found or sense mismatch → flag for rewrite
```
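
The flow above can be sketched as a small filter. This is an illustration under assumptions rather than the Forgewright implementation: `review_expressions` is a hypothetical name, `lookup` stands in for whatever calls `lookup_dictionary` and returns its JSON payload as a dict, and `sense_matches` stands in for the LLM or human judgment that this tool deliberately leaves to the caller.

```python
from typing import Callable, Iterable


def review_expressions(
    expressions: Iterable[str],
    lookup: Callable[[str], dict],
    sense_matches: Callable[[str, list[str]], bool],
) -> list[str]:
    """Return the expressions that should be flagged for rewrite.

    An expression passes only if it is found (dictionary or whitelist)
    and the caller-supplied judge accepts one of the returned senses.
    """
    flagged: list[str] = []
    for expression in expressions:
        result = lookup(expression)
        found = result.get("found_in_cedict", False) or result.get("found_in_whitelist", False)
        if found and sense_matches(expression, result.get("definitions", [])):
            continue  # found + sense matches intent -> pass
        flagged.append(expression)  # not found or sense mismatch -> flag
    return flagged


if __name__ == "__main__":
    # Stand-in lookup built from the example payloads in this README.
    fake_db = {
        "看见": {"found_in_cedict": True, "definitions": ["to see", "to catch sight of"]},
        "锁进铁盒里": {"found_in_cedict": False, "found_in_whitelist": False, "definitions": []},
    }

    def fake_lookup(word: str) -> dict:
        return fake_db.get(word, {"found_in_cedict": False, "definitions": []})

    def accept_any_sense(expression: str, definitions: list[str]) -> bool:
        return bool(definitions)

    print(review_expressions(["看见", "锁进铁盒里"], fake_lookup, accept_any_sense))
```

Keeping `lookup` and `sense_matches` as injected callables means the same loop works whether the lookup goes through an MCP client or the Python API, and the sense judgment can be swapped between a model call and a manual review step.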

A reference review prompt for this flow is documented in [Forgewright](https://github.com/outsiderrr/Forgewright) (the project that spawned this tool).

---

## Project status

v0.1.0 — initial release. Validated on a 39-case test set covering 6 categories (dead metaphors / live metaphors / literal words / boundary cases / idioms / neologisms) with 100% accuracy.

Bug reports and PRs welcome.

## License

- **Code**: MIT (see `LICENSE`)
- **CC-CEDICT data**: CC BY-SA 4.0 (see `LICENSE-CC-CEDICT`)
