Metadata-Version: 2.4
Name: deeplatent-nlp
Version: 0.2.4
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Dist: transformers>=4.0.0
Requires-Dist: huggingface-hub>=0.14.0
Requires-Dist: pytest>=7.0.0 ; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0 ; extra == 'dev'
Requires-Dist: black>=23.0.0 ; extra == 'dev'
Requires-Dist: isort>=5.0.0 ; extra == 'dev'
Requires-Dist: mypy>=1.0.0 ; extra == 'dev'
Requires-Dist: sphinx>=6.0.0 ; extra == 'docs'
Requires-Dist: sphinx-rtd-theme>=1.0.0 ; extra == 'docs'
Provides-Extra: dev
Provides-Extra: docs
Summary: DeepLatent - Morphology-aware tokenizer for Arabic/English bilingual text with native core
Keywords: tokenizer,arabic,nlp,morphology,sarf,deeplatent,bpe,transformers,huggingface,bilingual
Author-email: Mohammed Almaghrabi <almaghrabima@gmail.com>
Maintainer-email: Mohammed Almaghrabi <almaghrabima@gmail.com>
License: CC-BY-NC-4.0
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/almaghrabima/deeplatent
Project-URL: Documentation, https://huggingface.co/almaghrabima/deeplatent-tokenizer
Project-URL: Repository, https://github.com/almaghrabima/deeplatent
Project-URL: Bug Tracker, https://github.com/almaghrabima/deeplatent/issues
Project-URL: HuggingFace, https://huggingface.co/almaghrabima/deeplatent-tokenizer

# DeepLatent

**DeepLatent** is a SARF tokenizer for Arabic/English bilingual text with a native Rust core.

This package provides the SARF (Sarf-Aware Representation Framework) tokenizer, which achieves near-equal Arabic/English parity (1.09) by applying morpheme-level preprocessing before BPE tokenization.

## Installation

```bash
pip install deeplatent-nlp
```

### Building from Source

Installing from source requires a Rust toolchain:

```bash
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install from source
pip install .
```

## Quick Start

```python
from deeplatent import SARFTokenizer

# Load tokenizer from HuggingFace
tokenizer = SARFTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer")

# Encode text (SARF preprocessing is applied automatically for Arabic)
arabic_text = "مرحبا بكم في هذا الاختبار"
tokens = tokenizer.encode(arabic_text)
print(f"Token count: {len(tokens)}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

# Works with English too
english_text = "Hello world, this is a test"
tokens = tokenizer.encode(english_text)
print(f"English token count: {len(tokens)}")
```

## Performance

| Metric | With SARF Preprocessing | Without Preprocessing |
|--------|------------------------|----------------------|
| Arabic Fertility | 2.29 | 5.65 |
| English Fertility | 2.10 | 2.91 |
| Parity (Ar/En) | **1.09** | 1.94 |
| Interpretation | **EXCELLENT** | Moderate |

*Fertility = average tokens per word (lower is better). Parity = ratio of Arabic to English fertility; values closer to 1.0 mean the two languages are tokenized with similar efficiency.*
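To make the metrics above concrete, here is a minimal sketch of how fertility and parity can be computed for any tokenizer. The `toy_tokenize` function is a placeholder stand-in (it just splits words into 3-character chunks), not part of this package; substitute any real tokenizer's `encode`.

```python
# Fertility = average tokens produced per whitespace-delimited word.
# Parity = Arabic fertility divided by English fertility.

def fertility(texts, tokenize):
    """Average number of tokens per word across a list of texts."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

def toy_tokenize(text):
    # Stand-in subword tokenizer: split each word into 3-char chunks.
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

ar_fert = fertility(["مرحبا بكم في هذا الاختبار"], toy_tokenize)
en_fert = fertility(["Hello world, this is a test"], toy_tokenize)
parity = ar_fert / en_fert
```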

### Supported Platforms

Pre-built wheels are available for:
- Linux (manylinux2014, x86_64)
- macOS (x86_64, arm64)
- Windows (x86_64)

For other platforms, the package will build from source (requires Rust).

## What is SARF?

**SARF (صَرْف)** is the Arabic term for **morphology**. In Arabic linguistics, *ṣarf* refers to the system that governs:

- Word formation
- Roots and patterns (جذر / وزن)
- Prefixes, suffixes, infixes
- Tense, gender, number, and derivation

Most tokenizers treat Arabic as bytes or characters. **SARF treats Arabic as a language.**
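As a rough illustration of what morpheme-aware preprocessing means (a simplified sketch, not the package's actual algorithm), a preprocessor can separate common Arabic proclitics, such as و "and", ب "with", and the definite article ال, before the BPE stage, so the subword model sees stems rather than fused word forms. All names below (`PROCLITICS`, `split_proclitic`, `preprocess`) are hypothetical:

```python
# Simplified illustration of morpheme-level preprocessing (NOT the real
# SARF implementation): detach a few common Arabic proclitics so that
# BPE operates on stems instead of fused word forms.

# Ordered longest-first so compound clitics match before single letters.
PROCLITICS = ["وال", "بال", "كال", "فال", "ال", "و", "ب", "ك", "ف", "ل"]

def split_proclitic(word):
    """Return (proclitic, stem); proclitic is '' if none matched."""
    for p in PROCLITICS:
        # Require a stem of at least two letters after stripping.
        if word.startswith(p) and len(word) - len(p) >= 2:
            return p, word[len(p):]
    return "", word

def preprocess(text):
    """Mark morpheme boundaries with '+' before subword tokenization."""
    pieces = []
    for w in text.split():
        clitic, stem = split_proclitic(w)
        if clitic:
            pieces.append(clitic + "+")
        pieces.append(stem)
    return " ".join(pieces)
```

For example, والكتاب ("and the book") would be split into وال+ and the stem كتاب, so BPE can reuse the same stem tokens it learned from other forms of the word.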

## API Reference

### SARFTokenizer

```python
from deeplatent import SARFTokenizer

# Load from HuggingFace
tokenizer = SARFTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer")

# Load from local directory
tokenizer = SARFTokenizer.from_directory("./my_tokenizer")

# Disable preprocessing (not recommended for Arabic)
tokenizer = SARFTokenizer.from_pretrained(
    "almaghrabima/deeplatent-tokenizer",
    use_preprocessing=False
)
```

### Encoding

```python
# Simple encoding
tokens = tokenizer.encode("مرحبا بكم")

# With options
result = tokenizer.encode(
    "مرحبا بكم",
    add_special_tokens=True,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"  # or "tf" for TensorFlow
)

# Batch encoding
texts = ["مرحبا", "Hello", "مرحبا بكم في العالم"]
batch_tokens = tokenizer.encode_batch(texts)
```

### Decoding

```python
# Simple decoding
text = tokenizer.decode([1234, 5678, 9012])

# Batch decoding
texts = tokenizer.decode_batch([[1234, 5678], [9012, 3456]])

# Keep special tokens
text = tokenizer.decode(tokens, skip_special_tokens=False)
```

## License

This tokenizer is released under **CC-BY-NC-4.0** (Creative Commons Attribution-NonCommercial 4.0 International).

For commercial licensing, please contact: almaghrabima@gmail.com

## Author

- **Mohammed Almaghrabi**
- Email: almaghrabima@gmail.com

## Links

- [HuggingFace Model](https://huggingface.co/almaghrabima/deeplatent-tokenizer)
- [Evaluation Dataset](https://huggingface.co/datasets/almaghrabima/eval-test-data)

