Metadata-Version: 2.4
Name: moine
Version: 0.1.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Rust
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Dist: pytest>=8,<9 ; extra == 'dev'
Requires-Dist: ruff>=0.14,<0.15 ; extra == 'dev'
Requires-Dist: ty>=0.0.38,<0.1 ; extra == 'dev'
Requires-Dist: zensical>=0.0.32 ; extra == 'docs'
Requires-Dist: pytest>=8,<9 ; extra == 'test'
Provides-Extra: dev
Provides-Extra: docs
Provides-Extra: test
License-File: LICENSE-APACHE
License-File: LICENSE-MIT
Summary: Lattice Path Edit Distance and romanization-aware string comparison
Home-Page: https://github.com/tagucci/moine
License-Expression: MIT OR Apache-2.0
Requires-Python: >=3.11
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Documentation, https://tagucci.github.io/moine/
Project-URL: Homepage, https://github.com/tagucci/moine
Project-URL: Issues, https://github.com/tagucci/moine/issues
Project-URL: Repository, https://github.com/tagucci/moine

# mòine

`mòine` is a Python and Rust library for romanization-aware string comparison.

It implements [Lattice Path Edit Distance (Kaji, 2023)](https://aclanthology.org/2023.emnlp-industry.24/),
a distance metric that compares strings through possible reading paths rather
than only through visible surface characters.

This is useful when romanized input and written Japanese or Chinese look far
apart as strings, but stay close in reading space.

```python
>>> import moine

>>> moine.distance("moine", "モイニャ", lang="ja")
2

>>> moine.distance("もいにゃ", "モイニャ", lang="ja")
0

>>> moine.distance("weishiji", "威士忌", lang="zh")
0

>>> moine.distance("布納哈奔", "布納哈本", lang="zh")
0
```

The project is inspired by Nobuhiro Kaji's EMNLP 2023 Industry Track paper,
"Lattice Path Edit Distance: A Romanization-aware Edit Distance for Extracting
Misspelling-Correction Pairs from Japanese Search Query Logs".

## Name

The project name comes from `Moine`, a peated malt from Bunnahabhain and one of
the developer's favorite Scotch whiskies. In Japanese, the name has several
plausible katakana renderings, such as `モイニャ`, `モーイン`, and `モアンヌ`, which
makes it a fitting name for a project about readings, spelling variation, and
ambiguity in input sequences.

## Features

- Japanese comparison with
  [UniDic-CWJ](https://clrd.ninjal.ac.jp/unidic/download.html)-derived reading
  artifacts.
- Chinese comparison with
  [CC-CEDICT](https://cc-cedict.org/wiki/)-derived no-tone pinyin artifacts.
- Plain string Levenshtein-compatible distance helpers.
- Lattice-aware Damerau-Levenshtein distance for adjacent transpositions.
- Normalized similarity / `ratio` helpers in `0.0..=1.0`.
- [RapidFuzz](https://github.com/rapidfuzz/RapidFuzz)-inspired APIs such as
  `cdist` and partial matching helpers.

## When To Use

mòine is best used after another system has produced candidates: lexical
retrieval, n-gram search, BM25, embeddings, a product catalog, or an entity
list. Use mòine to rescore those candidates in reading space.

| Good fit | Poor fit |
| --- | --- |
| Romanized, kana, kanji, or pinyin input mixed together | Same-script typo matching only |
| Query correction, search suggest, and candidate reranking | Replacing a full search engine |
| Japanese and Mandarin pinyin Chinese entity matching | Cantonese/Jyutping or arbitrary languages |
| Pipelines that can download dictionary artifacts explicitly | Install-only workflows with no data step |
| Hundreds or thousands of candidates after retrieval | Brute-force scoring over a whole corpus |

## Installation

Install the Python package:

```bash
pip install moine
uv pip install moine
```

Install the Rust command-line tool:

```bash
cargo install moine
```

The packages do not bundle dictionary data. Download the language artifacts you
need explicitly:

```bash
uv run python -m moine download ja
uv run python -m moine download zh

moine download ja
moine download zh
```

## Quick Start

Use the top-level Python API when you want mòine to load the default dictionary
for a language:

```python
import moine

print(moine.distance("もいにゃ", "モイニャ", lang="ja"))  # 0
print(moine.ratio("ピィート", "ピート", lang="ja"))  # 0.7142857142857143
print(moine.partial_ratio("ウイスキー", "ういすきーをのんでいます", lang="ja"))  # 1.0
print(moine.distance("weishiji", "威士忌", lang="zh"))  # 0
```

Load a dictionary explicitly when you want to control startup cost or artifact
location:

```python
import moine

dictionary = moine.load_dict(lang="ja")
moine.set_default_dictionary(dictionary)

print(moine.distance("もいにゃ", "モイニャ", lang="ja"))  # 0
```

Use `cdist` for query-by-choice matrices:

```python
import moine

scores = moine.cdist(
    ["もいにゃ", "ぴーと", "ピィート"],
    ["モイニャ", "ピート", "ピーと", "ピィート"],
    lang="ja",
    metric="damerau_distance",
    score_cutoff=1,
)
```

For search or entity matching, generate candidates with your existing system
and use mòine as a reading-aware reranker:

```python
import moine

query = "moine"
candidates = ["モイニャ", "モーイン", "モアンヌ", "ストイーシャ"]

scores = moine.cdist(
    [query],
    candidates,
    lang="ja",
    metric="distance",
)[0]

ranked = sorted(zip(candidates, scores), key=lambda item: item[1])
print(ranked)
# [('モイニャ', 2), ('モーイン', 2), ('モアンヌ', 3), ('ストイーシャ', 7)]
```

Score interpretation is intentionally simple: `distance=0` means the best
reading paths are identical, distance metrics are smaller-is-better, `ratio`
and `normalized_similarity` are in `0.0..=1.0` and larger-is-better, and
`score_cutoff` filters in the RapidFuzz style.

## Command Line

Most users only need the public runtime commands:

```bash
moine download ja
moine download zh
moine list
moine where
moine compare --left "もいにゃ" --right "モイニャ" \
  --artifact-metadata /path/to/moine-unidic-cwj-202512/metadata.yaml
moine chinese-compare --left weishiji --right 威士忌 \
  --artifact-metadata /path/to/moine-cedict-20260520/metadata.yaml
```

The artifact bundle, verification, archive, and diagnostic commands are
maintainer-facing tools for producing and checking release assets. They are
documented in [docs/development.md](docs/development.md) and
[docs/release_process.md](docs/release_process.md).

## Documentation

- [Project documentation](https://tagucci.github.io/moine/)
- [Installation](https://tagucci.github.io/moine/installation/)
- [Python usage](https://tagucci.github.io/moine/usage/)
- [API reference](https://tagucci.github.io/moine/api/)
- [Rust usage](https://tagucci.github.io/moine/rust/)
- [Dictionary artifacts](https://tagucci.github.io/moine/artifacts/)
- [Browser demo](https://tagucci.github.io/moine/demo/)
- [Rust docs](https://docs.rs/moine)

Developer and maintainer notes live under [docs/](docs/), starting with
[docs/development.md](docs/development.md) and
[docs/release_process.md](docs/release_process.md).
See [CONTRIBUTING.md](CONTRIBUTING.md) before opening pull requests.

## How It Differs From RapidFuzz

RapidFuzz is the better fit when both inputs should be compared directly as
surface strings and you need a broad set of highly optimized fuzzy-matching
scorers. mòine focuses on a narrower problem: comparing strings through
possible reading paths before edit distance is computed.

## Limitations

- mòine does not reproduce the original paper's private search-query-log
  evaluation.
- Dictionary-backed comparison requires separately distributed dictionary
  artifacts.
- UniDic matching intentionally does not use MeCab/Viterbi costs.
- Chinese support is Mandarin pinyin only; it does not model Cantonese/Jyutping
  or non-Mandarin readings.
- `processor`, `score_hint`, NumPy dtype options, and worker parallelism are not
  part of the initial `cdist` API.

## Reference

> [!CAUTION]
> This project is not the official implementation by the paper author.

```bibtex
@inproceedings{kaji-2023-lattice,
    title = "Lattice Path Edit Distance: A {R}omanization-aware Edit Distance for Extracting Misspelling-Correction Pairs from {J}apanese Search Query Logs",
    author = "Kaji, Nobuhiro",
    editor = "Wang, Mingxuan  and
      Zitouni, Imed",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-industry.24/",
    doi = "10.18653/v1/2023.emnlp-industry.24",
    pages = "233--242",
    abstract = "Edit distance has been successfully used to extract training data, i.e., misspelling-correction pairs, of spelling correction models from search query logs in languages including English. However, the success does not readily apply to Japanese, where misspellings are often dissimilar to correct spellings due to the romanization-based input methods. To address this problem, we introduce lattice path edit distance, which utilizes romanization lattices to efficiently consider all possible romanized forms of input strings. Empirical experiments using Japanese search query logs demonstrated that the lattice path edit distance outperformed baseline methods including the standard edit distance combined with an existing transliterator and morphological analyzer. A training data collection pipeline that uses the lattice path edit distance has been deployed in production at our search engine for over a year."
}
```

## License

mòine source code is licensed under either MIT or Apache-2.0. See
[LICENSE-MIT](LICENSE-MIT) and [LICENSE-APACHE](LICENSE-APACHE).

Dictionary data is separate. UniDic-derived and CC-CEDICT-derived artifacts
carry their own license and attribution metadata, and should keep dictionary
license information separate from the mòine source-code license. See
[THIRD_PARTY_NOTICES.md](THIRD_PARTY_NOTICES.md).

