Metadata-Version: 2.4
Name: gherbal
Version: 1.0.1
Summary: FastText-based multilingual language identification with HuggingFace integration
Project-URL: Homepage, https://github.com/omneity-labs/gherbal
Project-URL: Repository, https://github.com/omneity-labs/gherbal
Project-URL: Issues, https://github.com/omneity-labs/gherbal/issues
Project-URL: Changelog, https://github.com/omneity-labs/gherbal/blob/main/CHANGELOG.md
Author-email: Omar Kamali <omar@omneitylabs.com>
License-Expression: MIT
Keywords: fasttext,language-identification,lid,multilingual,nlp
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Requires-Dist: fasttext>=0.9.2
Requires-Dist: huggingface-hub>=0.14.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: tqdm>=4.60.0
Requires-Dist: transformers>=4.20.0
Requires-Dist: unscript>=0.1.0
Provides-Extra: all
Requires-Dist: datasets>=2.0.0; extra == 'all'
Requires-Dist: matplotlib>=3.5.0; extra == 'all'
Requires-Dist: seaborn>=0.12.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: datasets>=2.0.0; extra == 'dev'
Requires-Dist: matplotlib>=3.5.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: seaborn>=0.12.0; extra == 'dev'
Provides-Extra: eval
Requires-Dist: matplotlib>=3.5.0; extra == 'eval'
Requires-Dist: seaborn>=0.12.0; extra == 'eval'
Provides-Extra: train
Requires-Dist: datasets>=2.0.0; extra == 'train'
Description-Content-Type: text/markdown

# Gherbal

FastText-based multilingual language identification with HuggingFace Hub integration.

Gherbal identifies 200+ languages and distinguishes Arabic dialects at a fine-grained level.

## Installation

```bash
pip install gherbal
```

Requires Python 3.9 or later. Optional extras are available: `train` (adds `datasets`), `eval` (adds `matplotlib` and `seaborn`), `dev` (adds `datasets`, `matplotlib`, `seaborn`, and `pytest`), and `all` (adds `datasets`, `matplotlib`, and `seaborn`). Install one with, for example, `pip install "gherbal[train]"`.

## Quick Start

```python
from gherbal import Gherbal

# Load from HuggingFace Hub
model = Gherbal.from_pretrained("omarkamali/gherbal")

# Predict language
model.predict("Hello, how are you?")
# => [('eng_Latn', 0.99)]

model.predict("مرحبا كيف حالك")
# => [('arb_Arab', 0.95)]
```
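
The outputs shown above suggest that `predict` returns a list of `(label, score)` tuples, with labels such as `eng_Latn` combining a language code and a script code. Assuming that shape holds, a small post-processing helper might look like the sketch below. `top_language` and its `min_score` threshold are hypothetical names introduced for illustration and are not part of Gherbal.

```python
from typing import List, Optional, Tuple

# Assumed shape of one prediction, based on the Quick Start output: (label, score)
Prediction = Tuple[str, float]


def top_language(
    predictions: List[Prediction], min_score: float = 0.5
) -> Optional[Tuple[str, str, float]]:
    """Return (language, script, score) for the best prediction.

    Returns None when there are no predictions or the best score is
    below min_score. This helper is hypothetical, not a Gherbal API.
    """
    if not predictions:
        return None
    label, score = max(predictions, key=lambda p: p[1])
    if score < min_score:
        return None
    # Split "eng_Latn" into "eng" and "Latn" at the first underscore
    language, _, script = label.partition("_")
    return language, script, score


# Sample values copied from the Quick Start output
print(top_language([("eng_Latn", 0.99)]))                   # ('eng', 'Latn', 0.99)
print(top_language([("arb_Arab", 0.95)], min_score=0.97))   # None
```

Rejecting low-confidence predictions this way is a common choice for short or mixed-language inputs. A suitable threshold depends on the data and should be tuned rather than taken from this sketch.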

## Loading a Local Model

`from_pretrained` also accepts a path to a local model directory:

```python
from gherbal import Gherbal
model = Gherbal.from_pretrained("./path/to/model")
```

## Training

```python
import pandas as pd
from gherbal import Gherbal

df = pd.DataFrame({"text": [...], "label": [...]})
model = Gherbal.train(df, save_path="./my_model")
```
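
The training example expects a DataFrame with `text` and `label` columns. As a minimal sketch, assuming the raw data starts out as `(text, label)` pairs and that labels follow the same `eng_Latn` style as the predictions, the columns could be assembled as follows. `build_columns` is a hypothetical helper, not part of Gherbal, and the cleaning rules (dropping blank texts and exact duplicates) are illustrative choices.

```python
from typing import Dict, Iterable, List, Tuple


def build_columns(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build {"text": [...], "label": [...]} from (text, label) pairs.

    Blank texts and exact duplicate pairs are dropped; input order is kept.
    Hypothetical helper for illustration only.
    """
    seen = set()
    columns: Dict[str, List[str]] = {"text": [], "label": []}
    for text, label in pairs:
        text = text.strip()
        if not text or (text, label) in seen:
            continue
        seen.add((text, label))
        columns["text"].append(text)
        columns["label"].append(label)
    return columns


samples = [
    ("Hello, how are you?", "eng_Latn"),
    ("   ", "eng_Latn"),                   # blank after stripping, dropped
    ("Hello, how are you?", "eng_Latn"),   # exact duplicate, dropped
    ("مرحبا كيف حالك", "arb_Arab"),
]
columns = build_columns(samples)
print(columns["label"])  # ['eng_Latn', 'arb_Arab']
```

The resulting dictionary can be passed straight to `pd.DataFrame(columns)` to obtain the two-column frame used in the training example.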

## Preprocessing

```python
from gherbal import preprocess_text, create_clean_script_function

text = preprocess_text("Hello @user https://example.com 🎉")
# => "hello"

# Script-aware cleaning
clean = create_clean_script_function()
clean("Latn", "Hello World 123")
# => "Hello World"
```

## Pushing to HuggingFace Hub

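Uploading requires being authenticated with the HuggingFace Hub using a token that has write access (for example via `huggingface-cli login`).

```python
model.push_to_hub("username/my-gherbal-model")
```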

## License

MIT
