Metadata-Version: 2.4
Name: nuficlean
Version: 0.1.0
Summary: Nufi language text normalisation: Bana → Komako standard orthography, tone stripping, mojibake repair
License-Expression: MIT
Project-URL: Repository, https://github.com/your-org/nuficlean
Keywords: nufi,cameroonian,nlp,text-normalisation,bana,komako
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown

# nuficlean

Python library for normalising **Nufi** (Fe'fe' / Nʉ Fì) text: converts Bana orthography to the Komako standard, strips low-tone diacritics, and repairs common encoding issues.

## Install

```bash
pip install nuficlean
```

Or from source:

```bash
pip install -e /path/to/nuficlean
```

## Usage

```python
from nuficlean import clean

clean("kòlə̀'")        # → "kwele'"
clean("tōh mēndɑ̀'")  # → "tōh mēndɑ'"
clean("mbɑ̀ɑ̀")        # → "mbɑɑ"
```

Batch cleaning:

```python
from nuficlean import clean_lines

clean_lines(["kòlə̀'", "mbɑ̀ɑ̀"])  # → ["kwele'", "mbɑɑ"]
```

CLI:

```bash
nuficlean "kòlə̀'"
echo "mbɑ̀ɑ̀" | nuficlean
```

## Pipeline

1. **Mojibake repair**: fixes UTF-8 text that was mis-decoded as Latin-1 (e.g. `Ã©` → `é`)
2. **Apostrophe / quote unification**: maps quote-like characters (`'`, `` ` ``, `ʼ`, `"`, `«`, `»`, etc.) to plain ASCII quotes
3. **Bana → Komako rewrite**: longest-match substitution table (e.g. `kòlə̀'` → `kwèlè'`, `ɛ̀` → `a`)
4. **Low-tone stripping**: removes grave-accent tone marks (`à`→`a`, `ɑ̀`→`ɑ`, …) and leaves other tone marks untouched
5. **NFC recomposition**: recomposes the result into Unicode Normalization Form C
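
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the library's actual implementation: `clean_sketch`, `repair_mojibake`, `QUOTE_MAP` and `REWRITES` are hypothetical names, the two rewrite rules are just the examples listed in step 3, and the choice of target for each quote character (including the curly quotes) is an assumption.

```python
import re
import unicodedata

GRAVE = "\u0300"  # combining grave accent, used here as the low-tone mark

# Assumed quote mapping; the real table may differ.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u02bc": "'", "`": "'",
    "\u201c": '"', "\u201d": '"', "\u00ab": '"', "\u00bb": '"',
})

# Hypothetical two-entry table in decomposed (NFD) form, taken from step 3.
REWRITES = {
    "ko\u0300l\u0259\u0300'": "kwe\u0300le\u0300'",  # kòlə̀' -> kwèlè'
    "\u025b\u0300": "a",                              # ɛ̀ -> a
}

# Longest keys first, so the regex alternation prefers the longest match.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(REWRITES, key=len, reverse=True))
)


def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were decoded as Latin-1; otherwise return text as-is."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def clean_sketch(text: str) -> str:
    text = repair_mojibake(text)                   # 1. mojibake repair
    text = text.translate(QUOTE_MAP)               # 2. quote unification
    text = unicodedata.normalize("NFD", text)      # decompose so marks are separate
    text = _PATTERN.sub(lambda m: REWRITES[m.group(0)], text)  # 3. rewrite
    text = text.replace(GRAVE, "")                 # 4. low-tone stripping
    return unicodedata.normalize("NFC", text)      # 5. NFC recomposition
```

Decomposing to NFD before steps 3 and 4 means precomposed (`à`) and combining (`a` + U+0300) spellings are treated alike, and only the grave accent is dropped, so macrons such as `ō` survive and are recomposed in step 5.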

## License

MIT
