Metadata-Version: 2.4
Name: script-fidelity
Version: 0.1.1
Summary: Reference-free script fidelity metric for multilingual ASR.
Author: Anonymous
License-Expression: MIT
Keywords: asr,speech-recognition,evaluation,unicode,script-fidelity,fleurs
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: evaluate
Requires-Dist: evaluate<1.0,>=0.4.0; extra == "evaluate"
Provides-Extra: dev
Requires-Dist: evaluate<1.0,>=0.4.0; extra == "dev"
Requires-Dist: pytest>=8.0; extra == "dev"

# script-fidelity

`script-fidelity` is a small Python package for Script Fidelity Rate (SFR), a
reference-free metric for multilingual ASR. SFR measures the fraction of
countable hypothesis characters that belong to the expected Unicode script for a
target language.

Quick signals:

- Install with `uv add script-fidelity`
- Load with HF Evaluate via `themechanism/script_fidelity_rate`
- Supports 102 FLEURS language configs, excluding `all`
- PyPI: <https://pypi.org/project/script-fidelity/>

Use SFR with WER and CER. SFR checks script validity; WER and CER measure
transcription error against references.

## install

For package development in this repo:

```bash
uv sync --extra dev
```

For a downstream project:

```bash
uv add script-fidelity
```

Run the CLI without adding it to a project:

```bash
uvx --from script-fidelity sfr score --language ps_af --text "کابل کې ښه هوا ده"
```

## python use

```python
from script_fidelity import compute_sfr, compute_sfr_batch

score = compute_sfr("کابل کې ښه هوا ده", language="ps_af")
scores = compute_sfr_batch(
    ["کابل کې ښه هوا ده", "this is romanized output"],
    language="pashto",
)
```

Digits count by default, matching the paper. Treat digits as neutral with
`digit_policy="ignore"`.

```python
compute_sfr("کابل 2026", language="ps_af", digit_policy="ignore")
```

## HF Evaluate use

Local metric:

```python
import evaluate

sfr = evaluate.load("./metrics/script_fidelity_rate", module_type="metric")
sfr.compute(predictions=["کابل کې ښه هوا ده"], language="ps_af")
```

Hub metric after publishing:

```python
import evaluate

sfr = evaluate.load("themechanism/script_fidelity_rate", module_type="metric")
sfr.compute(predictions=["کابل کې ښه هوا ده"], language="ps_af")
```

## CLI

```bash
sfr score --language ps_af --text "کابل کې ښه هوا ده"
sfr audit predictions.jsonl --language ps_af --text-column prediction
sfr audit predictions.csv --language bn_in --text-column transcript --format csv
```

## ASR batch example

```python
from script_fidelity import compute_corpus_sfr

predictions = [
    item["text"]
    for item in whisper_outputs
]

summary = compute_corpus_sfr(predictions, language="bn_in")
print(summary["sfr_percent"])
print(summary["dominant_script_counts"])
```

## pandas dataframe example

```python
import pandas as pd
from script_fidelity import compute_sfr

df = pd.read_json("predictions.jsonl", lines=True)
df["sfr"] = df["prediction"].map(lambda text: compute_sfr(text, language="ps_af"))
```

## Transformers compute_metrics example

```python
import evaluate

wer = evaluate.load("wer")
sfr = evaluate.load("themechanism/script_fidelity_rate", module_type="metric")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    pred_text = processor.batch_decode(predictions, skip_special_tokens=True)
    label_text = processor.batch_decode(labels, skip_special_tokens=True)
    return {
        "wer": wer.compute(predictions=pred_text, references=label_text),
        "sfr": sfr.compute(predictions=pred_text, language="ps_af")["sfr"],
    }
```

## CI gate example

```python
from script_fidelity import compute_corpus_sfr

summary = compute_corpus_sfr(predictions, language="ml_in")
if summary["sfr"] < 0.90:
    raise SystemExit("SFR regression: Malayalam output is below 90% target script")
```

## shared-script caveats

SFR is a script check, not a language identifier. Pashto, Urdu, Persian, Arabic,
Central Kurdish, and Sindhi share Arabic-script Unicode blocks. Latin-script
languages mostly detect romanization or non-Latin substitution, not language
identity. Pair SFR with language ID or lexical checks when shared-script
confusions matter.

Use `dominant_script()` and `script_distribution()` to inspect failures:

```python
from script_fidelity import dominant_script, script_distribution

dominant_script("this is romanized output")
script_distribution("বাংলা भाषा")
```

## FLEURS codes

The registry covers the 102 FLEURS language configs listed by `sfr languages`.
These paper languages have short aliases:

| FLEURS code | Alias | Script |
|---|---|---|
| `ps_af` | `pashto` | Arabic |
| `ur_pk` | `urdu` | Arabic |
| `ar_eg` | `arabic` | Arabic |
| `fa_ir` | `persian`, `farsi` | Arabic |
| `hi_in` | `hindi` | Devanagari |
| `bn_in` | `bengali`, `bangla` | Bengali |
| `ml_in` | `malayalam` | Malayalam |
| `ta_in` | `tamil` | Tamil |
| `so_so` | `somali` | Latin |
| `ka_ge` | `georgian` | Georgian |

For the full reviewed registry, see
`script_fidelity/data/fleurs_registry.json`.

Full code table:

| Code | Language | Script |
|---|---|---|
| `af_za` | Afrikaans | Latin |
| `am_et` | Amharic | Ethiopic |
| `ar_eg` | Arabic | Arabic |
| `as_in` | Assamese | Bengali |
| `ast_es` | Asturian | Latin |
| `az_az` | Azerbaijani | Latin |
| `be_by` | Belarusian | Cyrillic |
| `bg_bg` | Bulgarian | Cyrillic |
| `bn_in` | Bengali | Bengali |
| `bs_ba` | Bosnian | Latin |
| `ca_es` | Catalan | Latin |
| `ceb_ph` | Cebuano | Latin |
| `ckb_iq` | Central Kurdish | Arabic |
| `cmn_hans_cn` | Mandarin Chinese | Han |
| `cs_cz` | Czech | Latin |
| `cy_gb` | Welsh | Latin |
| `da_dk` | Danish | Latin |
| `de_de` | German | Latin |
| `el_gr` | Greek | Greek |
| `en_us` | English | Latin |
| `es_419` | Spanish | Latin |
| `et_ee` | Estonian | Latin |
| `fa_ir` | Persian | Arabic |
| `ff_sn` | Fulah | Latin |
| `fi_fi` | Finnish | Latin |
| `fil_ph` | Filipino | Latin |
| `fr_fr` | French | Latin |
| `ga_ie` | Irish | Latin |
| `gl_es` | Galician | Latin |
| `gu_in` | Gujarati | Gujarati |
| `ha_ng` | Hausa | Latin |
| `he_il` | Hebrew | Hebrew |
| `hi_in` | Hindi | Devanagari |
| `hr_hr` | Croatian | Latin |
| `hu_hu` | Hungarian | Latin |
| `hy_am` | Armenian | Armenian |
| `id_id` | Indonesian | Latin |
| `ig_ng` | Igbo | Latin |
| `is_is` | Icelandic | Latin |
| `it_it` | Italian | Latin |
| `ja_jp` | Japanese | Han, Hiragana, Katakana |
| `jv_id` | Javanese | Latin |
| `ka_ge` | Georgian | Georgian |
| `kam_ke` | Kamba | Latin |
| `kea_cv` | Kabuverdianu | Latin |
| `kk_kz` | Kazakh | Cyrillic |
| `km_kh` | Khmer | Khmer |
| `kn_in` | Kannada | Kannada |
| `ko_kr` | Korean | Hangul |
| `ky_kg` | Kyrgyz | Cyrillic |
| `lb_lu` | Luxembourgish | Latin |
| `lg_ug` | Ganda | Latin |
| `ln_cd` | Lingala | Latin |
| `lo_la` | Lao | Lao |
| `lt_lt` | Lithuanian | Latin |
| `luo_ke` | Luo | Latin |
| `lv_lv` | Latvian | Latin |
| `mi_nz` | Maori | Latin |
| `mk_mk` | Macedonian | Cyrillic |
| `ml_in` | Malayalam | Malayalam |
| `mn_mn` | Mongolian | Cyrillic |
| `mr_in` | Marathi | Devanagari |
| `ms_my` | Malay | Latin |
| `mt_mt` | Maltese | Latin |
| `my_mm` | Burmese | Myanmar |
| `nb_no` | Norwegian Bokmal | Latin |
| `ne_np` | Nepali | Devanagari |
| `nl_nl` | Dutch | Latin |
| `nso_za` | Northern Sotho | Latin |
| `ny_mw` | Chichewa | Latin |
| `oc_fr` | Occitan | Latin |
| `om_et` | Oromo | Latin |
| `or_in` | Odia | Odia |
| `pa_in` | Punjabi | Gurmukhi |
| `pl_pl` | Polish | Latin |
| `ps_af` | Pashto | Arabic |
| `pt_br` | Portuguese | Latin |
| `ro_ro` | Romanian | Latin |
| `ru_ru` | Russian | Cyrillic |
| `sd_in` | Sindhi | Arabic |
| `sk_sk` | Slovak | Latin |
| `sl_si` | Slovenian | Latin |
| `sn_zw` | Shona | Latin |
| `so_so` | Somali | Latin |
| `sr_rs` | Serbian | Cyrillic |
| `sv_se` | Swedish | Latin |
| `sw_ke` | Swahili | Latin |
| `ta_in` | Tamil | Tamil |
| `te_in` | Telugu | Telugu |
| `tg_tj` | Tajik | Cyrillic |
| `th_th` | Thai | Thai |
| `tr_tr` | Turkish | Latin |
| `uk_ua` | Ukrainian | Cyrillic |
| `umb_ao` | Umbundu | Latin |
| `ur_pk` | Urdu | Arabic |
| `uz_uz` | Uzbek | Latin |
| `vi_vn` | Vietnamese | Latin |
| `wo_sn` | Wolof | Latin |
| `xh_za` | Xhosa | Latin |
| `yo_ng` | Yoruba | Latin |
| `yue_hant_hk` | Cantonese | Han |
| `zu_za` | Zulu | Latin |
