Metadata-Version: 2.4
Name: deeplatent-nlp
Version: 0.1.1
Requires-Dist: datasets>=4.0.0
Requires-Dist: fasttext-wheel>=0.9.2
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: hydra-core>=1.3.2
Requires-Dist: hydra-optuna-sweeper>=1.2.0
Requires-Dist: hydra-submitit-launcher>=1.2.0
Requires-Dist: imageio>=2.37.2
Requires-Dist: ipython>=8.37.0
Requires-Dist: matplotlib>=3.10.7
Requires-Dist: morfessor
Requires-Dist: numpy>=2.2.6
Requires-Dist: omegaconf>=2.3.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: psutil>=7.1.0
Requires-Dist: pyarrow>=15.0.0
Requires-Dist: pytest>=9.0.1
Requires-Dist: ray[tune]>=2.0.0
Requires-Dist: regex>=2025.9.1
Requires-Dist: scikit-learn>=1.7.2
Requires-Dist: seaborn>=0.13.2
Requires-Dist: setuptools>=80.9.0
Requires-Dist: tiktoken>=0.11.0
Requires-Dist: tokenizers>=0.22.0
Requires-Dist: torch>=2.8.0
Requires-Dist: torchnet>=0.0.4
Requires-Dist: torchvision>=0.24.0
Requires-Dist: transformers>=4.0.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: umap-learn>=0.5.9.post2
Requires-Dist: wandb>=0.21.3
Requires-Dist: torch>=2.8.0 ; extra == 'cpu'
Requires-Dist: torch>=2.8.0 ; extra == 'gpu'
Requires-Dist: sentencepiece>=0.2.0 ; extra == 'tokenizer-extras'
Requires-Dist: camel-tools>=1.5.0 ; extra == 'tokenizer-extras'
Provides-Extra: cpu
Provides-Extra: gpu
Provides-Extra: tokenizer-extras
Summary: SARF tokenizer for Arabic/English bilingual text with a native Rust core
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# DeepLatent

DeepLatent packages the SARF tokenizer for Arabic/English bilingual text, backed by a native Rust core.

SARF (Sarf-Aware Representation Framework) achieves excellent Arabic/English parity (1.09) by applying morpheme-level preprocessing before BPE tokenization.
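
At a high level, SARF segments each Arabic word into morphemes and only then hands the pieces to BPE. The sketch below is purely conceptual; `segment_morphemes` and `bpe` are hypothetical placeholders, not part of this package's API:

```python
# Conceptual sketch only: `segment_morphemes` and `bpe` are hypothetical
# placeholders illustrating the two-stage pipeline, not real deeplatent APIs.
def sarf_encode(text, segment_morphemes, bpe):
    # 1. Split words into morphemes (prefixes, stem, suffixes) so that BPE
    #    operates on linguistically meaningful units instead of raw bytes.
    pieces = segment_morphemes(text)
    # 2. Apply ordinary BPE to each morpheme.
    return [tok for piece in pieces for tok in bpe.encode(piece)]
```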

## Installation

```bash
pip install deeplatent-nlp
```

Or with uv:

```bash
uv add deeplatent-nlp
```

### Building from Source

If installing from source, you'll need Rust installed:

```bash
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install from source
pip install .
```

## Quick Start

```python
from deeplatent import SARFTokenizer

# Load tokenizer from HuggingFace
tokenizer = SARFTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer")

# Encode text (SARF preprocessing is applied automatically for Arabic)
arabic_text = "مرحبا بكم في هذا الاختبار"
tokens = tokenizer.encode(arabic_text)
print(f"Token count: {len(tokens)}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

# Works with English too
english_text = "Hello world, this is a test"
tokens = tokenizer.encode(english_text)
print(f"English token count: {len(tokens)}")
```

## Performance

| Metric | With SARF Preprocessing | Without Preprocessing |
|--------|------------------------|----------------------|
| Arabic Fertility | 2.29 | 5.65 |
| English Fertility | 2.10 | 2.91 |
| Parity (Ar/En) | 1.09 | 1.94 |
| Interpretation | Excellent | Moderate |

Fertility is the average number of tokens per word (lower is better). Parity is the ratio of Arabic fertility to English fertility; the closer it is to 1.0, the more equally the two languages are treated.
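
For reference, fertility and parity can be reproduced with nothing but the `encode` API from the Quick Start. The snippet below is a minimal sketch; the toy sentences stand in for a real evaluation corpus, so the numbers will not match the table:

```python
from deeplatent import SARFTokenizer

tokenizer = SARFTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer")

def fertility(texts):
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = sum(len(tokenizer.encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Toy samples only; the table above was measured on a full evaluation set.
ar = fertility(["مرحبا بكم في هذا الاختبار"])
en = fertility(["Hello world, this is a test"])

print(f"Arabic fertility:  {ar:.2f}")
print(f"English fertility: {en:.2f}")
print(f"Parity (Ar/En):    {ar / en:.2f}")  # closer to 1.0 is better
```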

## Supported Platforms

Pre-built wheels are available for:

- Linux (manylinux2014, x86_64)
- macOS (x86_64, arm64)
- Windows (x86_64)

For other platforms, the package will build from source (requires Rust).

## What is SARF?

SARF (صَرْف) is the Arabic term for morphology. In Arabic linguistics, ṣarf refers to the system that governs:

- Word formation
- Roots and patterns (جذر / وزن)
- Prefixes, suffixes, infixes
- Tense, gender, number, and derivation

Most tokenizers treat Arabic as bytes or characters. SARF treats Arabic as a language.
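
To see the effect yourself, compare token counts with preprocessing enabled and disabled. This sketch reuses the `use_preprocessing` option shown in the API reference below:

```python
from deeplatent import SARFTokenizer

text = "مرحبا بكم في هذا الاختبار"

# Default: morpheme-aware preprocessing is applied to Arabic before BPE.
sarf = SARFTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer")

# Same vocabulary, but raw text goes straight to BPE.
raw = SARFTokenizer.from_pretrained(
    "almaghrabima/deeplatent-tokenizer",
    use_preprocessing=False,
)

# With preprocessing, Arabic should need noticeably fewer tokens
# (2.29 vs 5.65 fertility on the evaluation corpus above).
print("with SARF:   ", len(sarf.encode(text)))
print("without SARF:", len(raw.encode(text)))
```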

## API Reference

### SARFTokenizer

```python
from deeplatent import SARFTokenizer

# Load from HuggingFace
tokenizer = SARFTokenizer.from_pretrained("almaghrabima/deeplatent-tokenizer")

# Load from local directory
tokenizer = SARFTokenizer.from_directory("./my_tokenizer")

# Disable preprocessing (not recommended for Arabic)
tokenizer = SARFTokenizer.from_pretrained(
    "almaghrabima/deeplatent-tokenizer",
    use_preprocessing=False
)
```

### Encoding

```python
# Simple encoding
tokens = tokenizer.encode("مرحبا بكم")

# With options
result = tokenizer.encode(
    "مرحبا بكم",
    add_special_tokens=True,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="pt"  # or "tf" for TensorFlow
)

# Batch encoding
texts = ["مرحبا", "Hello", "مرحبا بكم في العالم"]
batch_tokens = tokenizer.encode_batch(texts)
```

### Decoding

```python
# Simple decoding
text = tokenizer.decode([1234, 5678, 9012])

# Batch decoding
texts = tokenizer.decode_batch([[1234, 5678], [9012, 3456]])

# Keep special tokens
text = tokenizer.decode(tokens, skip_special_tokens=False)
```

## License

This tokenizer is released under CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0 International).

For commercial licensing, please contact: almaghrabima@gmail.com

## Author

Mohammed Almaghrabi (<almaghrabima@gmail.com>)

## Links

- [HuggingFace Model](https://huggingface.co/almaghrabima/deeplatent-tokenizer)
- [Evaluation Dataset](https://huggingface.co/datasets/almaghrabima/deeplatent-eval)

