Metadata-Version: 2.4
Name: mojihen
Version: 0.1.0
Summary: LLM-generated CJK corruption linter — catches valid-but-wrong kanji/hanzi that grep and tests miss
License: MIT
Keywords: cjk,japanese,linter,llm,unicode,corruption,pre-commit,ci
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: tomli>=1.1.0; python_version < "3.11"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Dynamic: license-file

# mojihen

**LLM-generated CJK corruption linter.** Catches *valid-but-wrong* kanji, hanzi,
and hangul that language models emit silently — the class of bug that grep, unit
tests, and every existing Unicode safety tool passes as **false-green**.

```
demo/sample.py:20:1  MH001 HIGH  '闾'  -> likely: 閾
  '闾' is a known LLM corruption (likely intended: 閾)  [rare_drift]
demo/sample.py:23:1  MH001 HIGH  '耒'  -> likely: 耐
  '耒' is a known LLM corruption (likely intended: 耐)  [decomposition]
```

---

## The problem

When an LLM writes Japanese, Chinese, or Korean copy, it does not corrupt bytes —
it substitutes a **real, valid character** that looks or sounds close to the
intended one. The wrong glyph is itself a legitimate Unicode codepoint.

### Six observed cases (LLM-generated Japanese)

| Intended | LLM emitted | Class | Why it hid |
|---|---|---|---|
| 閾 (threshold) | 闾 (village gate, U+95FE) | rare drift | 閾 (U+95BE) is uncommon; LLM drifted to a similar-looking gate-radical character |
| 耐 (endure) | 耒 (plow radical, U+8012) | decomposition | 耐→耒耗 radical fragment; 耒 alone near-absent in modern JA |
| 滞 (stagnation) | 滹 (river name, U+6EF9) | radical | Radical visual confusion |
| 事 (matter) | 亊 (rare variant, U+4E8A) | rare variant | U+4E8B vs U+4E8A: adjacent codepoints, visually near-identical |
| 愛 (love) | 感 (feeling) | visual/semantic | Both common; low-confidence in corpus (see below) |
| 敏 (nimble) | 敢 (bold) | shape | Stroke near-miss; low-confidence |
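
The rows above can be inspected with the standard library to confirm that the emitted character is an ordinary letter-class codepoint rather than damaged bytes. This is a minimal sketch using only `unicodedata`; the `PAIRS` list simply restates three rows of the table, and `describe` is a hypothetical helper, not part of mojihen.

```python
import unicodedata

# (intended, emitted) pairs restated from the table above.
PAIRS = [("閾", "闾"), ("耐", "耒"), ("滞", "滹")]

def describe(ch: str) -> str:
    """Return 'U+XXXX <category> <name>' for a single character."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}"

for intended, emitted in PAIRS:
    # Both sides are category 'Lo' (Letter, other): nothing is malformed,
    # so byte-level or encoding-level checks have nothing to report.
    print(f"{intended} {describe(intended)}  ->  {emitted} {describe(emitted)}")
```

Because both characters in each pair are well-formed ideographs, only knowledge of which one was intended can separate them.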

### Why existing tools miss it

- **grep / ripgrep**: searches for the *intended* string; the wrong glyph simply
  does not match. Silent.
- **Unit tests**: assertions were written against the already-corrupted value.
  They pass. This actually happened.
- **Unicode safety linters** (`bidichk`, `anti-trojan-source`, `unicode-safety-check`):
  target *adversarial* unicode (invisible chars, bidi overrides, homoglyphs). These
  substitutions are visible, in-script, non-adversarial. Out of scope for those tools.
- **Chinese Spell Check (CSC) research**: models that correct *human* typos;
  not packaged as a dev linter / CI gate / agent hook.
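
The first two bullets can be reproduced in a few lines. This is an illustrative sketch: `LABEL` is a made-up UI string, and `grep` is a plain substring search standing in for the real tool.

```python
# Hypothetical UI string as an LLM might have written it: the first
# character should be 閾 (threshold) but 闾 was emitted instead.
LABEL = "闾値を超えました"

def grep(needle: str, haystack: str) -> bool:
    """Stand-in for grep: a plain substring search."""
    return needle in haystack

# Searching for the intended word finds nothing, so the bug stays hidden.
print(grep("閾値", LABEL))  # False

def test_label() -> None:
    # A test written by copying the current value passes despite the bug.
    assert LABEL == "闾値を超えました"

test_label()
```

Neither check fails, because neither one knows what the string was supposed to say.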

mojihen is **first-in-category** for this failure mode.

---

## Install

```bash
pip install mojihen
```

Python 3.9+ required. On Python 3.11+ there are no runtime dependencies beyond
the standard library (`tomllib` parses the config file). On Python 3.9 and 3.10
the package declares `tomli` for the same purpose; if `tomli` is not installed,
config file parsing gracefully degrades to defaults.

---

## CLI usage

```bash
# Scan a file or directory
mojihen src/

# Scan with explicit options
mojihen src/ --format tty --fail-on high

# Output machine-readable JSON
mojihen src/ --format json > findings.json

# Output SARIF (for GitHub code scanning)
mojihen src/ --format sarif > mojihen.sarif

# Scan all text (bypass type-aware extraction)
mojihen src/ --all-text

# Use a custom config
mojihen src/ --config path/to/mojihen.toml
```

### Exit codes

| Code | Meaning |
|------|---------|
| 0 | No findings at or above the fail threshold |
| 1 | One or more findings at or above the fail threshold |
| 2 | Usage error, or agent hook blocked a write |
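
A wrapper script can branch on these codes. The sketch below assumes `mojihen` is on `PATH` and uses only the `--fail-on` flag shown above; `explain` and `run_mojihen` are hypothetical helper names.

```python
import subprocess

# Meanings restated from the exit-code table above.
EXIT_MEANINGS = {
    0: "clean: no findings at or above the fail threshold",
    1: "findings at or above the fail threshold",
    2: "usage error, or agent hook blocked a write",
}

def explain(code: int) -> str:
    """Map an exit code to a human-readable meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")

def run_mojihen(paths: list[str], fail_on: str = "high") -> int:
    """Run the CLI on the given paths and return its exit code."""
    proc = subprocess.run(["mojihen", *paths, "--fail-on", fail_on])
    return proc.returncode
```

For example, `print(explain(run_mojihen(["src/"])))` would report the outcome of a scan in words.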

---

## pre-commit

Add to `.pre-commit-config.yaml`:

```yaml
repos:
  - repo: https://github.com/hryoma1217/mojihen
    rev: v0.1.0
    hooks:
      - id: mojihen
```

This uses the bundled `.pre-commit-hooks.yaml`, which runs
`mojihen --fail-on high` on every staged file.

---

## GitHub Action

```yaml
# .github/workflows/mojihen.yml
name: CJK corruption check
on: [push, pull_request]

jobs:
  mojihen:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hryoma1217/mojihen@v0.1.0
        with:
          paths: src/
          fail-on: high
          format: sarif
          sarif-output: mojihen.sarif
```

Findings appear in the GitHub Security tab (code scanning).
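
The SARIF file can also be read locally. The sketch below relies only on the standard SARIF 2.1.0 layout (`runs[].results[].locations[]`); `SAMPLE` is hand-written for illustration and is not real mojihen output, so the exact fields mojihen populates are an assumption.

```python
import json
from collections import Counter

def iter_findings(sarif: dict):
    """Yield (rule_id, uri, line) for each located result in a SARIF log."""
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            for loc in result.get("locations", []):
                phys = loc.get("physicalLocation", {})
                yield (result.get("ruleId"),
                       phys.get("artifactLocation", {}).get("uri"),
                       phys.get("region", {}).get("startLine"))

# Hand-written sample in standard SARIF shape (illustrative only).
SAMPLE = json.loads("""
{"version": "2.1.0", "runs": [{
  "tool": {"driver": {"name": "mojihen"}},
  "results": [
    {"ruleId": "MH001", "level": "error", "message": {"text": "corpus hit"},
     "locations": [{"physicalLocation": {
       "artifactLocation": {"uri": "src/strings.py"},
       "region": {"startLine": 3, "startColumn": 18}}}]},
    {"ruleId": "MH001", "level": "error", "message": {"text": "corpus hit"},
     "locations": [{"physicalLocation": {
       "artifactLocation": {"uri": "src/strings.py"},
       "region": {"startLine": 5, "startColumn": 12}}}]}
  ]}]}
""")

per_rule = Counter(rule for rule, _, _ in iter_findings(SAMPLE))
print(per_rule)
```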

---

## Agent hook (Claude Code / Codex)

The killer use-case: scan **just-written text** as soon as the agent writes it,
and bounce corrupt output back to the model immediately.

### Claude Code (PostToolUse)

In `.claude/settings.json`:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          { "type": "command", "command": "mojihen hook --stdin" }
        ]
      }
    ]
  }
}
```

### Codex

In `.codex/config.toml`:

```toml
[hooks]
post_write = "mojihen hook --stdin"
```

### What happens on corruption

```
mojihen: BLOCKED - LLM CJK corruption detected

  src/strings.py:3:18  MH001 HIGH  '闾'  -> likely: 閾
  src/strings.py:5:12  MH001 HIGH  '耒'  -> likely: 耐

  Verify the intended CJK text and rewrite before proceeding.
```

The hook exits 2; the agent sees the block reason and retries with corrected text.

See `hooks/claude-code.md` and `hooks/codex.md` for full setup instructions.
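
The block-and-retry flow can be sketched as follows. This is not mojihen's implementation: it assumes the hook payload is JSON on stdin with the written text under `tool_input.content` (Write) or `tool_input.new_string` (Edit), and `KNOWN_WRONG` is a two-entry stand-in for the real corpus.

```python
import json
import sys

# Stand-in corpus: wrong char -> likely intended char (from the examples above).
KNOWN_WRONG = {"闾": "閾", "耒": "耐"}

def check_payload(raw: str) -> tuple[int, str]:
    """Return (exit_code, message) for one hook payload.

    Assumption: the text to scan sits under tool_input.content or
    tool_input.new_string in the JSON payload.
    """
    tool_input = json.loads(raw).get("tool_input", {})
    text = tool_input.get("content") or tool_input.get("new_string") or ""
    hits = [f"  '{ch}'  -> likely: {KNOWN_WRONG[ch]}"
            for ch in text if ch in KNOWN_WRONG]
    if not hits:
        return 0, ""
    header = "mojihen: BLOCKED - LLM CJK corruption detected"
    return 2, header + "\n" + "\n".join(hits)

def main() -> int:
    code, message = check_payload(sys.stdin.read())
    if message:
        print(message, file=sys.stderr)  # the block reason goes to stderr
    return code
```

A script entry point would call `sys.exit(main())`, so a corrupt payload ends the process with exit code 2.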

---

## Configuration

Create `mojihen.toml` in your project root, or add a `[tool.mojihen]` table to `pyproject.toml`:

```toml
# mojihen.toml
fail_on = "high"              # "high" | "medium"
langs = ["ja", "zh", "ko"]
extract = "auto"              # "auto" (type-aware) | "all-text"
allow = []                    # literal strings/chars to never flag
corpus = []                   # extra corpus JSON paths
```

### Inline suppression

Suppress findings on a specific line:

```python
# Intentional use of the archaic character (corpus fixture)
FIXTURE = "闾"  # mojihen: ignore

# Suppress only a specific rule
FIXTURE = "闾"  # mojihen: ignore[MH001]
```
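
A suppression comment of this shape can be recognized with a single regular expression. The sketch below is illustrative only; support for a comma-separated rule list inside the brackets is an assumption, and `is_suppressed` is a hypothetical name.

```python
import re

# Matches "# mojihen: ignore" with an optional "[RULE, RULE]" list.
_IGNORE = re.compile(r"#\s*mojihen:\s*ignore(?:\[([A-Z0-9,\s]+)\])?")

def is_suppressed(line: str, rule_id: str) -> bool:
    """True if the line carries an ignore comment that covers rule_id."""
    match = _IGNORE.search(line)
    if match is None:
        return False
    if match.group(1) is None:
        return True  # bare ignore: every rule is suppressed
    rules = {rule.strip() for rule in match.group(1).split(",")}
    return rule_id in rules
```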

---

## How the corpus works

`src/mojihen/data/seed.json` is a versioned, schema-validated list of known-wrong chars:

```json
{
  "version": 1,
  "entries": [
    {
      "wrong": "闾",
      "intended": ["閾"],
      "lang": "ja",
      "class": "rare_drift",
      "evidence": "observed in LLM Japanese output",
      "confidence": "high"
    }
  ]
}
```
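
A corpus in this shape lends itself to a per-character lookup. The following sketch indexes the sample entry above and scans text line by line; `scan` is a hypothetical function, and counting columns in characters (1-based) is an assumption about how mojihen reports positions.

```python
import json

SEED = json.loads("""
{"version": 1, "entries": [
  {"wrong": "闾", "intended": ["閾"], "lang": "ja", "class": "rare_drift",
   "evidence": "observed in LLM Japanese output", "confidence": "high"}
]}
""")

# Index entries by the wrong character for constant-time lookup.
INDEX = {entry["wrong"]: entry for entry in SEED["entries"]}

def scan(text: str, path: str = "<text>"):
    """Yield one report line per corpus hit, with 1-based line and column."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            entry = INDEX.get(ch)
            if entry is None:
                continue
            level = entry["confidence"].upper()
            likely = "/".join(entry["intended"])
            yield f"{path}:{lineno}:{col}  MH001 {level}  '{ch}'  -> likely: {likely}"

for report in scan('x = 1\ny = "闾値"\n', "demo.py"):
    print(report)
```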

### Confidence tiers

| Tier | Meaning | CLI behaviour |
|------|---------|---------------|
| `high` | Rare char; near-zero false positives | Fails CI by default |
| `medium` | Somewhat common; context-dependent | Warns; optionally fails |
| `low` | Common char; production evidence but ambiguous | Info only |

High-confidence entries are chars like `闾` (U+95FE), a simplified-Chinese form
that is essentially absent from modern Japanese text, so in Japanese copy it
almost certainly signals LLM drift. Common kanji like `感` are kept at `low` to
avoid flooding legitimate text with false positives.
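
The interaction between tiers and the fail threshold can be sketched as a simple rank comparison. `RANK` and `exit_code` are hypothetical names used for illustration.

```python
# Tier order from the table above, lowest to highest.
RANK = {"low": 0, "medium": 1, "high": 2}

def exit_code(finding_tiers, fail_on: str = "high") -> int:
    """Return 1 if any finding is at or above the fail_on tier, else 0."""
    threshold = RANK[fail_on]
    return 1 if any(RANK[tier] >= threshold for tier in finding_tiers) else 0

print(exit_code(["low", "medium"]))            # 0
print(exit_code(["low", "medium"], "medium"))  # 1
```

With the default threshold, `low` and `medium` findings are reported but never change the exit status.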

### Contributing a new entry

1. Confirm the wrong char is a known-bad substitution with evidence (build log,
   diff, screenshot).
2. Confirm `wrong != intended` and both contain valid CJK.
3. Choose `"confidence": "high"` only if the wrong char is rare in normal text.
4. Add to `src/mojihen/data/seed.json` and run: `python -m unittest discover -s tests`
5. The precision gate (`tests/test_precision.py`) must still pass with zero MH001 high
   findings on the clean fixture sentences.
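
Steps 1 to 3 can be checked mechanically before opening a pull request. The validator below is a sketch, not the project's schema check: `is_cjk` uses a rough set of Unicode ranges, and `validate_entry` is a hypothetical helper.

```python
def is_cjk(ch: str) -> bool:
    """Rough check: CJK Unified Ideographs (+ Ext. A), kana, Hangul syllables."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF
            or 0x3040 <= cp <= 0x30FF or 0xAC00 <= cp <= 0xD7A3)

def validate_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry looks fine."""
    problems = []
    wrong = entry.get("wrong", "")
    intended = entry.get("intended", [])
    if not wrong or not any(is_cjk(c) for c in wrong):
        problems.append("'wrong' must contain CJK")
    if not intended or not all(any(is_cjk(c) for c in s) for s in intended):
        problems.append("'intended' must be a non-empty list of CJK strings")
    if wrong in intended:
        problems.append("'wrong' must differ from every 'intended' value")
    if entry.get("confidence") not in {"high", "medium", "low"}:
        problems.append("'confidence' must be high, medium, or low")
    if not entry.get("evidence"):
        problems.append("'evidence' is required")
    return problems
```

Whether a character is rare enough for `high` remains a judgment call that the precision gate in step 5 backs up.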

---

## Detectors

| ID | Name | Confidence |
|----|------|------------|
| MH001 | Corpus hit | high/medium/low (per entry) |
| MH002 | Mixed-script token (Han + Latin/Cyrillic in one identifier) | medium |
| MH003 | Isolated CJK in ASCII identifier / key / URL | medium |
| MH004 | Rare/archaic codepoint (needs Unihan freq table) | deferred |
| MH005 | Decomposition garble (needs radical table) | deferred |

MH004 and MH005 are deferred in this release (0.1.0). The known MH005 cases
(耒耗, etc.) are already covered by individual MH001 corpus entries.
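
As an illustration of the MH002 idea, the sketch below flags word-like tokens that combine Han with Latin or Cyrillic characters. It is an approximation: scripts are inferred from Unicode character names, every `\w+` token is treated as an identifier, and both function names are hypothetical.

```python
import re
import unicodedata

def script_of(ch: str) -> str:
    """Coarse script label derived from the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK"):
        return "Han"
    for script in ("LATIN", "CYRILLIC"):
        if name.startswith(script):
            return script.title()
    return "Other"

def mixed_script_tokens(text: str) -> list:
    """Return tokens that combine Han with Latin or Cyrillic characters."""
    flagged = []
    for token in re.findall(r"\w+", text):
        scripts = {script_of(c) for c in token}
        if "Han" in scripts and scripts & {"Latin", "Cyrillic"}:
            flagged.append(token)
    return flagged

print(mixed_script_tokens("user_閾 = 1; 閾値 = 2"))  # ['user_閾']
```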

---

## Escape decoding

mojihen decodes the escape forms listed below **before** inspecting text,
because LLMs frequently emit corrupted characters as `\uXXXX` escapes:

| Form | Example | Decoded |
|------|---------|---------|
| `\uXXXX` | `\u95FE` | 闾 |
| `\u{XXXXXX}` | `\u{95FE}` | 闾 |
| Surrogate pair | `\uD83D\uDE00` | 😀 |
| `\xXX` | `\x41` | A |
| HTML decimal | `&#38398;` | 闾 |
| HTML hex | `&#x95FE;` | 闾 |
| Named entity | `&amp;` | & |
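
A decoder covering the forms in the table can be assembled from `re` and `html.unescape`. This is a sketch rather than mojihen's decoder: it applies the passes sequentially, so a value that decodes into another escape would be decoded again, and `decode_escapes` is a hypothetical name.

```python
import html
import re

_PAIR = re.compile(
    r"\\u([dD][89abAB][0-9a-fA-F]{2})\\u([dD][c-fC-F][0-9a-fA-F]{2})")
_BRACED = re.compile(r"\\u\{([0-9a-fA-F]{1,6})\}")
_U4 = re.compile(r"\\u([0-9a-fA-F]{4})")
_X2 = re.compile(r"\\x([0-9a-fA-F]{2})")

def _join_pair(m):
    """Combine a high/low surrogate pair into one supplementary codepoint."""
    hi, lo = int(m.group(1), 16), int(m.group(2), 16)
    return chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))

def _to_char(m):
    cp = int(m.group(1), 16)
    # Leave out-of-range values and lone surrogates untouched.
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return m.group(0)
    return chr(cp)

def decode_escapes(text: str) -> str:
    text = _PAIR.sub(_join_pair, text)  # pairs first, before single \uXXXX
    text = _BRACED.sub(_to_char, text)
    text = _U4.sub(_to_char, text)
    text = _X2.sub(_to_char, text)
    return html.unescape(text)          # decimal, hex, and named entities

print(decode_escapes(r"label = '\u95FE'"))  # label = '闾'
```

Decoding first means a corpus lookup sees the same character whether it was written literally or as an escape.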

---

## Limitations and false-positive controls

- **Common kanji**: Characters like `感` (feeling), `末` (end), `士` (scholar, warrior)
  appear in thousands of legitimate Japanese words. They are only added to the
  corpus at `low` confidence. Use `--fail-on high` (the default) to avoid noise.
- **Context-free**: mojihen does not understand grammar or intent — it pattern-
  matches against a corpus. False positives in unusual text can be suppressed
  with `allow = [...]` in config or inline `# mojihen: ignore`.
- **MH002/MH003** are medium-confidence: by default they only warn, and they
  fail CI only with `--fail-on medium`.
- The clean-corpus precision gate (`tests/test_precision.py`) must stay green;
  this is the automated false-positive guard.

---

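## Overview (Japanese section)

`mojihen` (文字変, "character change") is a linter that detects kanji which are
valid Unicode codepoints but differ from the intended character, in Japanese,
Chinese, and Korean text generated by LLMs.

grep and unit tests cannot detect this kind of corruption, because the wrong
character is also legitimate Unicode and the tests were written against the
already-corrupted value.

`mojihen` combines a corpus of known substitution patterns
(`src/mojihen/data/seed.json`) with decoding of escape forms (`\uXXXX`,
`&#NNNN;`, and so on), and runs in CI, as a pre-commit hook, and as an AI agent
hook (PostToolUse).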

---

## Development

```bash
git clone https://github.com/hryoma1217/mojihen
cd mojihen
pip install -e ".[dev]"
python -m unittest discover -s tests -v
```

---

## License

MIT. Copyright 2026 hryoma1217.
