Metadata-Version: 2.4
Name: nuficlean
Version: 0.3.2
Summary: Python library for Nufi (Fe'éfě'e) text: Clafrica keyboard mapping, Bana→Komako normalisation, low-tone stripping, and encoding repair
License-Expression: MIT
Project-URL: Repository, https://github.com/tchamna/nuficlean
Keywords: nufi,cameroonian,nlp,text-normalisation,bana,komako
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: regex

# nuficlean

Python library for **Nufi** (Fe'éfě'e) text utilities:

- **Bana → Komako normalisation** — converts Bana orthography to the standard Komako form, strips low-tone diacritics, and repairs encoding issues
- **Clafrica keyboard mapping** — converts ASCII shortcut sequences (Clafrica input method) into the corresponding Nufi Unicode characters

## Install

```bash
pip install nuficlean
```

---

## Bana normalisation

### `clean(text)`

Applies the full normalisation pipeline to a string.

```python
from nuficlean import clean

clean("kòlə̀'")        # → "kwele'"
clean("nàh")           # → "lah"
clean("mɛ̀ɛ̀")         # → "maa"
clean("tōh mēndɑ̀'")  # → "tōh mēndɑ'"
```

### `clean_lines(lines)`

Cleans a list of strings.

```python
from nuficlean import clean_lines

clean_lines(["kòlə̀'", "nàh", "mɛ̀ɛ̀"])
# → ["kwele'", "lah", "maa"]
```

### Pipeline

1. **Mojibake repair** — fixes UTF-8 text that was mis-decoded as Latin-1
2. **Apostrophe / quote unification** — maps `'`, `` ` ``, `ʼ`, `"`, `«`, `»` → ASCII
3. **Bana → Komako rewrite** — longest-match substitution (`kòlə̀'` → `kwele'`, `ɛ̀` → `a`, …)
4. **Low-tone stripping** — removes grave-accent tone marks (`à`→`a`, `ɑ̀`→`ɑ`, …)
5. **NFC recomposition**
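
The steps above can be approximated with the standard library alone. The sketch below is an illustration, not `nuficlean`'s actual implementation: the names `repair_mojibake`, `rewrite`, `strip_low_tone`, and `sketch_clean` are hypothetical, the rewrite table holds only the two rules quoted in step 3, and quote unification (step 2) is left out.

```python
import re
import unicodedata

GRAVE = "\u0300"  # combining grave accent, the low-tone mark removed in step 4

# Only the two rules quoted in the pipeline list; the real table is assumed to be larger.
_RULES = {
    "k\u00f2l\u0259\u0300'": "kwele'",  # kòlə̀' -> kwele'
    "\u025b\u0300": "a",                # ɛ̀ -> a
}
REWRITES = {unicodedata.normalize("NFC", k): v for k, v in _RULES.items()}
# Longest keys first, so the regex alternation prefers the longest match.
PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(REWRITES, key=len, reverse=True))
)


def repair_mojibake(text: str) -> str:
    """Step 1: undo UTF-8 bytes that were decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake; leave unchanged


def rewrite(text: str) -> str:
    """Step 3: longest-match substitution on NFC-normalised text."""
    text = unicodedata.normalize("NFC", text)
    return PATTERN.sub(lambda m: REWRITES[m.group(0)], text)


def strip_low_tone(text: str) -> str:
    """Steps 4 and 5: drop combining graves, then recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(GRAVE, ""))


def sketch_clean(text: str) -> str:
    return strip_low_tone(rewrite(repair_mojibake(text)))
```

Decomposing to NFD before stripping means a precomposed `à` and a base letter followed by a combining grave are handled the same way, and the final NFC pass restores precomposed forms for the marks that remain.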

### CLI

```bash
nuficlean "kòlə̀'"
echo "mɛ̀ɛ̀" | nuficlean
```

---

## Clafrica keyboard mapping

The Clafrica input method uses ASCII shortcut sequences to type Nufi characters.
`nuficlean` ships the canonical mapping table and exposes it through two functions
and a class.

### `apply_clafrica(text)`

Converts all Clafrica shortcuts in *text* to Unicode, preserving whitespace.

```python
from nuficlean import apply_clafrica

apply_clafrica("af1 e2 n*")   # → "ɑ̀ é ŋ"
apply_clafrica("eu3 af5")     # → "ə̄ ɑ̂"
apply_clafrica("uu1 o*2")     # → "ʉ̀ ɔ́"
apply_clafrica("N* O*")       # → "Ŋ Ɔ"
```

**Live-typing mode** — pass `preserve_ambiguous_trailing=True` to leave the last
token untouched while the user may still extend it:

```python
apply_clafrica("af", preserve_ambiguous_trailing=True)  # → "af"  (could become af1, af2…)
apply_clafrica("af1")                                   # → "ɑ̀"
```
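
One way to think about "ambiguous" here is that a token is still a proper prefix of some longer shortcut. The sketch below illustrates that idea on a four-entry excerpt of the table. It is a simplification under stated assumptions, not the library's code: `is_ambiguous` and `convert_tokens` are hypothetical names, and tokens are split on single spaces only.

```python
# Hypothetical excerpt of a shortcut table, built from examples in this README.
TABLE = {
    "af": "\u0251",         # ɑ
    "af1": "\u0251\u0300",  # ɑ + combining grave
    "e2": "\u00e9",         # é
    "n*": "\u014b",         # ŋ
}


def is_ambiguous(token: str, table: dict = TABLE) -> bool:
    """True if `token` is a proper prefix of some longer shortcut."""
    return any(key != token and key.startswith(token) for key in table)


def convert_tokens(text: str, preserve_trailing: bool = False) -> str:
    """Convert space-separated shortcuts; optionally keep an ambiguous last token."""
    tokens = text.split(" ")
    out = []
    for i, tok in enumerate(tokens):
        is_last = i == len(tokens) - 1
        if preserve_trailing and is_last and is_ambiguous(tok):
            out.append(tok)  # the user may still type a tone digit
        else:
            out.append(TABLE.get(tok, tok))  # unknown tokens pass through
    return " ".join(out)
```

With this excerpt, `"af"` is kept as typed in live mode because `"af1"` extends it, while `"af1"` has no longer extension and is converted immediately.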

### `finalize_clafrica(text)`

Like `apply_clafrica` but also resolves any trailing ambiguous shortcut —
use this when the user confirms input (e.g. presses Enter or Space).

```python
from nuficlean import finalize_clafrica

finalize_clafrica("eu3")   # → "ə̄"
finalize_clafrica("af1")   # → "ɑ̀"
finalize_clafrica("n*")    # → "ŋ"
```

### `ClafricaEngine` — advanced use

Instantiate the engine directly when you need a custom mapping or extra entries.

```python
from nuficlean import ClafricaEngine

# Add project-specific shortcuts on top of the default table
engine = ClafricaEngine(extra={"nkap": "ŋkɑ̄p"})
engine.apply_mapping("nkap e2")   # → "ŋkɑ̄p é"
engine.finalize_input("eu3")      # → "ə̄"
engine.lookup("af1")              # → "ɑ̀"
engine.lookup("xyz")              # → None

# Fully custom table (replaces the default)
engine = ClafricaEngine(mapping={"a1": "à", "e1": "è"})
```
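
To make the difference between `extra` and `mapping` concrete, here is a minimal stand-in class. It is a sketch only: `SketchEngine`, its `apply` method, and the four-entry `DEFAULT` table are hypothetical, and the greedy longest-match scan is an assumption about how such an engine could work, not a description of `ClafricaEngine` internals.

```python
# Hypothetical default table, limited to shortcuts shown in this README.
DEFAULT = {
    "af": "\u0251",         # ɑ
    "af1": "\u0251\u0300",  # ɑ + combining grave
    "e2": "\u00e9",         # é
    "n*": "\u014b",         # ŋ
}


class SketchEngine:
    def __init__(self, mapping=None, extra=None):
        # `mapping` replaces the default table; `extra` is layered on top.
        table = dict(DEFAULT if mapping is None else mapping)
        table.update(extra or {})
        self.table = table
        self.max_len = max(map(len, table), default=0)

    def lookup(self, shortcut):
        """Return the output for an exact shortcut, or None."""
        return self.table.get(shortcut)

    def apply(self, text):
        """Greedy scan: at each position, try the longest shortcut first."""
        out, i = [], 0
        while i < len(text):
            for size in range(min(self.max_len, len(text) - i), 0, -1):
                hit = self.table.get(text[i:i + size])
                if hit is not None:
                    out.append(hit)
                    i += size
                    break
            else:
                out.append(text[i])  # no shortcut starts here; copy the character
                i += 1
        return "".join(out)
```

Trying the longest candidate first is what lets `af1` win over its prefix `af`, and copying unmatched characters through is what preserves whitespace.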

### Shortcut reference

| Shortcut | Output | Notes |
|----------|--------|-------|
| `af` | `ɑ` | open-a |
| `eu` | `ə` | schwa |
| `ai` | `ɛ` | open-e (epsilon) |
| `o*` | `ɔ` | open-o |
| `uu` | `ʉ` | u-bar |
| `n*` | `ŋ` | eng |
| `N*` | `Ŋ` | Eng (uppercase) |
| `a1` `a2` `a3` | `à` `á` `ā` | low / high / mid tone |
| `af1` `af2` `af3` | `ɑ̀` `ɑ́` `ɑ̄` | open-a tones |
| `eu1` `eu2` `eu3` | `ə̀` `ə́` `ə̄` | schwa tones |
| `o*1` `o*2` `o*3` | `ɔ̀` `ɔ́` `ɔ̄` | open-o tones |

Tone digits: `1` = low, `2` = high, `3` = mid, `5` = falling, `7` = rising.
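
Each tone digit can be read as "append one combining diacritic to the base vowel". The sketch below shows that reading with `unicodedata`. The helper `add_tone` is hypothetical, and the mark for digit `7` is an assumption (caron), since no example with `7` appears in this README; the other four marks match the outputs shown above.

```python
import unicodedata

# Tone digit -> combining diacritic.
TONE_MARKS = {
    "1": "\u0300",  # combining grave      (a1 -> à)
    "2": "\u0301",  # combining acute      (e2 -> é)
    "3": "\u0304",  # combining macron     (eu3 -> ə̄)
    "5": "\u0302",  # combining circumflex (af5 -> ɑ̂)
    "7": "\u030c",  # combining caron      (assumed, see lead-in)
}


def add_tone(base: str, digit: str) -> str:
    """Attach the diacritic for `digit` to `base` and recompose to NFC."""
    return unicodedata.normalize("NFC", base + TONE_MARKS[digit])
```

NFC yields a single code point where Unicode has a precomposed letter (for example `à`), and leaves base plus combining mark otherwise (for example `ə̄`), so string comparisons should normalise both sides first.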

> **Tip:** The `clafrica` package on PyPI provides the same keyboard mapping
> as a standalone library if you don't need the Bana normalisation.
> `pip install clafrica`

---

## License

MIT
