Metadata-Version: 2.4
Name: text2tashkeel
Version: 0.1.0a2
Summary: Lightweight Arabic diacritization (tashkeel) — a model picker over bundled ONNX models, no PyTorch, no network
Author-email: JarbasAi <jarbasai@mailfence.com>
Maintainer: TigreGotico
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/TigreGotico/text2tashkeel
Project-URL: Repository, https://github.com/TigreGotico/text2tashkeel
Project-URL: Issues, https://github.com/TigreGotico/text2tashkeel/issues
Keywords: arabic,diacritization,tashkeel,harakat,nlp,onnx,tts,g2p
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Natural Language :: Arabic
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: onnxruntime>=1.16
Provides-Extra: hf
Requires-Dist: huggingface_hub>=0.20; extra == "hf"
Provides-Extra: test
Requires-Dist: pytest>=7.0; extra == "test"
Requires-Dist: huggingface_hub>=0.20; extra == "test"
Provides-Extra: bench
Requires-Dist: datasets>=2.0; extra == "bench"
Requires-Dist: matplotlib>=3.5; extra == "bench"
Requires-Dist: huggingface_hub>=0.20; extra == "bench"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: datasets>=2.0; extra == "dev"
Requires-Dist: matplotlib>=3.5; extra == "dev"
Requires-Dist: huggingface_hub>=0.20; extra == "dev"
Dynamic: license-file

# text2tashkeel

**Lightweight Arabic diacritization** (tashkeel): text2tashkeel puts the
missing vowel marks back into Arabic text. It is not one model but a **model picker**: a
single tiny API over interchangeable diacritization models, all running on
`onnxruntime`, with **no PyTorch, no API keys, and offline by default.** Pick the model
that fits your accuracy/speed/size budget; the only runtime
dependencies are `numpy` and `onnxruntime`.

```python
from text2tashkeel import Diacritizer
Diacritizer().diacritize("بسم الله الرحمن الرحيم")              # default model - 2.04% DER
Diacritizer("rawi-v2-int8").diacritize("بسم الله الرحمن الرحيم")  # lean single model
```

> **More than vowels.** Most diacritizers only add the short-vowel marks to text
> that is *already spelled correctly*. The default rawi models also **restore the
> hamza (ء) and the silent dagger-alef**, so they fix real, inconsistently spelled
> input (e.g. a bare `ا` typed for `أ`), not just clean text. This is rare among
> diacritizers; [here's exactly why and how](docs/11-what-makes-rawi-different.md#111-a-wider-task-rawi-restores-hamza-and-the-dagger-alef-not-just-vowels).

## Install

```bash
pip install text2tashkeel
```

The wheel is small (~10 MB) and bundles our best models, which work fully offline (no downloads, no torch).
The full-precision (fp32) variants are fetched from Hugging Face **on first use** if you opt in:

```bash
pip install text2tashkeel        # int8 + flagship, offline
pip install "text2tashkeel[hf]"  # + auto-download fp32 models on demand (quoted so shells like zsh don't expand the brackets)
```

Without `[hf]`, asking for a non-bundled model raises an error with a clear message
that includes the model's Hugging Face link. You can also point at **your own model**
(e.g. one trained on a different corpus) with `register_model(...)`; see below.
For development, run `pip install -e ".[test]"` and then `pytest`.

## Models

Two models cover almost every use; both ship in the wheel and run offline. DER is the diacritic error rate (lower is better):

| Use case | Model | DER ↓ | latency | size |
|----------|-------|------:|--------:|-----:|
| **best accuracy (default)** ⭐ | `rawi-ensemble` | **2.04%** | ~2 ms | 4.9 MB |
| **fastest & smallest** | `rawi-v2-int8` | 2.30% | **~1 ms** | **2.5 MB** |

**22 model configurations** are available — the rawi family (V1/V2/V3 + INT8), two
independent diacritizers (`bilstm` and `libtashkeel`), and gated ensembles of them —
for comparison, research, or special cases:

```python
from text2tashkeel import available_models, Diacritizer
available_models()                   # all models
available_models(bundled_only=True)  # the models that ship in the wheel (offline)
Diacritizer("rawi-v2-int8").diacritize("بسم الله الرحمن الرحيم")
```

**Bundled vs fetched.** `available_models(bundled_only=True)` lists the models that
ship in the wheel. Everything else is downloaded from Hugging Face on first use when the `[hf]` extra is installed;
each model's weights live in its own repo
([`rawi`](https://huggingface.co/TigreGotico/rawi),
[`rawi-v2`](https://huggingface.co/TigreGotico/rawi-v2),
[`rawi-v3`](https://huggingface.co/TigreGotico/rawi-v3),
[`rawi-ensemble`](https://huggingface.co/TigreGotico/rawi-ensemble),
[`bilstm`](https://huggingface.co/TigreGotico/bilstm-diacritizer),
[`libtashkeel`](https://huggingface.co/TigreGotico/libtashkeel-diacritizer)), all
grouped in the [**Arabic Diacritizers** collection](https://huggingface.co/collections/TigreGotico/arabic-diacritizers-tashkeel-6a247318559bcc49e128aa5f).
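
When a deployment has to stay offline, one option is to check the requested name against the bundled set and fall back to a bundled model before constructing a `Diacritizer`. The helper below is a hypothetical sketch, not part of text2tashkeel: `choose_offline_model` is an illustrative name, and it assumes that iterating over the result of `available_models(bundled_only=True)` yields model-name strings.

```python
from typing import Iterable


def choose_offline_model(
    preferred: str,
    bundled: Iterable[str],
    fallback: str = "rawi-v2-int8",
) -> str:
    """Return `preferred` if it is bundled, otherwise a bundled `fallback`.

    `bundled` is assumed to be an iterable of model names (hypothetical:
    the README does not state what `available_models` returns).
    """
    bundled_names = set(bundled)
    if preferred in bundled_names:
        return preferred
    if fallback not in bundled_names:
        # Neither choice can run offline, so fail loudly instead of guessing.
        raise ValueError(f"neither {preferred!r} nor {fallback!r} is bundled")
    return fallback
```

With the package installed, this might be used as `Diacritizer(choose_offline_model(name, available_models(bundled_only=True)))`, where `name` is whichever model the caller asked for.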

**Bring your own model.** Trained a diacritizer on a different corpus? Point at it:

```python
from text2tashkeel import register_model, Diacritizer
register_model("my-rawi", "my_model.onnx", "my_vocab.json", arch="rawi")  # or arch="rawi-v3"
Diacritizer("my-rawi").diacritize("نص عربي")
```

`Diacritizer` is callable (`d("...")`) and lazily builds **one** onnxruntime
session it reuses — construct once, call many times. Full credits and licenses for
every model: [`docs/07-credits-and-license.md`](docs/07-credits-and-license.md).
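
The "construct once, call many times" advice can be sketched as a small batch helper. `diacritize_lines` is a hypothetical name, not part of the package; it accepts any callable that maps a string to a string, which a `Diacritizer` instance satisfies because it is callable.

```python
from typing import Callable, Iterable, Iterator


def diacritize_lines(
    lines: Iterable[str],
    diacritize: Callable[[str], str],
) -> Iterator[str]:
    """Yield each line diacritized; blank lines pass through unchanged."""
    for line in lines:
        text = line.rstrip("\n")
        # Skip the model call for empty or whitespace-only lines.
        yield diacritize(text) if text.strip() else text


# Usage (requires text2tashkeel to be installed):
#   from text2tashkeel import Diacritizer
#   d = Diacritizer()  # one instance, so one onnxruntime session is reused
#   with open("input.txt", encoding="utf-8") as f:
#       for out in diacritize_lines(f, d):
#           print(out)
```

Passing the same instance for every line avoids rebuilding the session per call.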

## CLI

```bash
text2tashkeel "الحمد لله رب العالمين"          # flagship default
echo "محمد رسول الله" | text2tashkeel
text2tashkeel -m rawi-v2-int8 < input.txt > output.txt
```

## Benchmarks

Measured DER and word error rate (WER) for every model, across the corpus's train/test/val splits, are in [`benchmarks/`](benchmarks/README.md).
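
For intuition about the DER figures, here is a simplified sketch of a diacritic error rate: the fraction of non-space characters whose attached diacritics differ between a reference and a hypothesis. It is an assumption-laden illustration, not the benchmark's implementation. The function names are hypothetical, the scripts in `benchmarks/` may count differently, and this version requires identical base letters, so it does not handle hamza restoration.

```python
# Arabic combining marks U+064B..U+0652 (tanwin, short vowels, shadda, sukun)
# plus the dagger alef U+0670.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)} | {"\u0670"}


def split_marks(text):
    """Return [(base_char, frozenset_of_marks), ...], one entry per base character."""
    units = []
    for ch in text:
        if ch in DIACRITICS:
            if units:  # a mark with no preceding base character is ignored
                base, marks = units[-1]
                units[-1] = (base, marks | {ch})
        else:
            units.append((ch, frozenset()))
    return units


def der(reference, hypothesis):
    """Fraction of non-space characters whose diacritic sets differ."""
    ref = split_marks(reference)
    hyp = split_marks(hypothesis)
    if [b for b, _ in ref] != [b for b, _ in hyp]:
        raise ValueError("base letters differ; align the texts first")
    pairs = [(r, h) for (b, r), (_, h) in zip(ref, hyp) if not b.isspace()]
    if not pairs:
        return 0.0
    return sum(1 for r, h in pairs if r != h) / len(pairs)
```

Comparing mark sets rather than strings means the order in which a shadda and a vowel are typed does not count as an error.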
