Metadata-Version: 2.4
Name: deeplatent-nlp
Version: 0.3.8
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Dist: numpy>=1.21
Requires-Dist: pytest>=7.0 ; extra == 'dev'
Requires-Dist: pytest-benchmark>=4.0 ; extra == 'dev'
Requires-Dist: maturin>=1.4 ; extra == 'dev'
Provides-Extra: dev
Summary: High-performance Arabic-first tokenizer with morphology awareness
Keywords: tokenizer,arabic,nlp,morphology,bpe
Author-email: Suhail <contact@almaghrabima.com>
License: Proprietary
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Documentation, https://suhail.almaghrabima.com/docs
Project-URL: Homepage, https://github.com/almaghrabima/suhail-pkg
Project-URL: Repository, https://github.com/almaghrabima/suhail-pkg

# Suhail

High-performance Arabic tokenizer with morphology awareness. Built with Rust for speed, with Python bindings for ease of use.

## Features

- **Arabic-Optimized**: Designed specifically for Arabic and morphologically-rich languages
- **Fast**: Rust core with Python bindings (up to ~30,000 operations/sec on mixed Arabic/English text, ~5,000 to ~6,000/sec on pure Arabic)
- **Accurate**: 100% roundtrip accuracy on 300,000 test samples
- **Edge Case Handling**: Proper handling of diacritics (tashkeel), prefixes, suffixes, and special characters
- **Unicode Support**: Full support for Arabic diacritics, PUA characters, and mixed scripts
- **IP Protection**: AES-256-GCM encrypted morpheme maps (no license key required)

## Installation

```bash
pip install deeplatent-nlp
```

## Quick Start

### Using SarfCodec (Recommended)

The `SarfCodec` class provides encode/decode functionality using a morpheme map:

```python
from suhail import SarfCodec

# Create codec from morpheme map dictionary
morf_map = {
    'ال': '\uE000',      # definite article -> PUA
    'كتاب': '\uE001',    # kitab -> PUA
    'و': '\uE002',       # wa (and) -> PUA
    'ب': '\uE003',       # bi (with) -> PUA
}
codec = SarfCodec(morf_map)

# Encode Arabic text (morphemes -> PUA characters)
text = "الكتاب"
encoded = codec.encode(text)
print(f"Encoded: {repr(encoded)}")  # '\ue000\ue001'

# Decode back to Arabic (PUA -> morphemes)
decoded = codec.decode(encoded)
print(f"Decoded: {decoded}")  # 'الكتاب'

# Verify roundtrip
normalized, decoded, is_ok = codec.roundtrip(text)
print(f"Roundtrip OK: {is_ok}")  # True
```
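
How `SarfCodec` matches morphemes internally is not documented here. The following is a minimal pure-Python sketch of the idea, assuming greedy longest-match replacement and one PUA code point per morpheme. `sketch_encode` and `sketch_decode` are hypothetical names used for illustration only; they are not part of the package.

```python
# Illustrative sketch only: NOT the library's implementation.
# ASSUMPTIONS: greedy longest-match, each PUA code is a single character.

def sketch_encode(text, morf_map):
    """Replace known morphemes with their PUA code, longest match first."""
    max_len = max(map(len, morf_map), default=0)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, then shorter ones
        for size in range(min(max_len, len(text) - i), 0, -1):
            code = morf_map.get(text[i:i + size])
            if code is not None:
                out.append(code)
                i += size
                break
        else:
            out.append(text[i])  # no morpheme matched: keep the character
            i += 1
    return "".join(out)


def sketch_decode(encoded, morf_map):
    """Map each PUA code back to its morpheme; pass other characters through."""
    reverse = {code: morpheme for morpheme, code in morf_map.items()}
    return "".join(reverse.get(ch, ch) for ch in encoded)


morf_map = {'ال': '\uE000', 'كتاب': '\uE001', 'و': '\uE002', 'ب': '\uE003'}
encoded = sketch_encode("الكتاب", morf_map)
decoded = sketch_decode(encoded, morf_map)
print(repr(encoded), decoded == "الكتاب")
```

Longest-match-first matters in this sketch because short morphemes such as `ب` would otherwise consume a character that belongs to a longer entry such as `كتاب`.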

### Loading from Encrypted File

Morpheme maps are distributed as encrypted `.enc` files for IP protection:

```python
from suhail import SarfCodec

# Load from encrypted file (no license key needed)
codec = SarfCodec.from_encrypted("morf_map.enc")

# Use as normal
encoded = codec.encode("بسم الله الرحمن الرحيم")
decoded = codec.decode(encoded)
```

### Creating Encrypted Morf Map Files

To create an encrypted file from a JSON morpheme map or from a dict:

```python
from suhail import SarfCodec, encrypt_morf_map

# Option 1: Encrypt JSON file directly
encrypt_morf_map("morf_map.json", "morf_map.enc")

# Option 2: Encrypt from dict
morf_map = {'ال': '\uE000', 'كتاب': '\uE001'}
codec = SarfCodec(morf_map)
codec.encrypt_to_file("morf_map.enc")
```

**Encryption details:**
- AES-256-GCM encryption
- Key embedded in compiled Rust binary
- Cannot be decrypted without the deeplatent-nlp library
- Checksum verification for tamper detection

### Standalone Functions

For quick one-off operations without creating a codec:

```python
from suhail import encode, decode, normalize

morf_map = {'ال': '\uE000', 'كتاب': '\uE001'}

# Encode text
encoded = encode("الكتاب", morf_map)

# Decode text
decoded = decode(encoded, morf_map)

# Normalize Arabic text (without encoding)
normalized = normalize("الكِتَابُ", level="medium")
```

## Normalization Levels

The codec supports three normalization levels:

```python
from suhail import SarfCodec

# Example morpheme map (see Quick Start)
morf_map = {'ال': '\uE000', 'كتاب': '\uE001'}

# Light normalization (minimal changes)
codec = SarfCodec(morf_map, normalization="light")

# Medium normalization (default - recommended)
codec = SarfCodec(morf_map, normalization="medium")

# Aggressive normalization (maximum normalization)
codec = SarfCodec(morf_map, normalization="aggressive")
```

| Level | Alef Variants | Taa Marbuta | Diacritics | Tatweel |
|-------|--------------|-------------|------------|---------|
| light | Preserved | Preserved | Preserved | Removed |
| medium | Normalized | Preserved | Stripped | Removed |
| aggressive | Normalized | Normalized | Stripped | Removed |
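
To make the table concrete, here is a minimal pure-Python sketch of the three levels. It is not the library's code: `sketch_normalize` is a hypothetical name, the diacritic range covers only the common tashkeel marks U+064B to U+0652, and the taa marbuta target (ة to ه) is an assumption, since the table does not say what it is normalized to.

```python
import re

# Illustrative sketch only: NOT the library's implementation.
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")  # آ أ إ
DIACRITICS = re.compile("[\u064B-\u0652]")          # fathatan through sukun
TATWEEL = "\u0640"


def sketch_normalize(text, level="medium"):
    if level not in ("light", "medium", "aggressive"):
        raise ValueError(f"unknown normalization level: {level!r}")
    text = text.replace(TATWEEL, "")            # removed at every level
    if level == "light":
        return text
    text = ALEF_VARIANTS.sub("\u0627", text)    # normalize to bare alef
    text = DIACRITICS.sub("", text)             # strip tashkeel
    if level == "aggressive":
        # ASSUMPTION: taa marbuta is folded to haa
        text = text.replace("\u0629", "\u0647")
    return text


print(sketch_normalize("الكِتَابُ"))
```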

## Handling Diacritics (Tashkeel)

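With the default `medium` normalization, diacritics are stripped before encoding, so the decoded text is the normalized form rather than the original diacritized input: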

```python
from suhail import SarfCodec

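# Example morpheme map (see Quick Start); normalization defaults to "medium"
morf_map = {'ال': '\uE000', 'كتاب': '\uE001'}
codec = SarfCodec(morf_map)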

# Text with full tashkeel
text = "بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ"
encoded = codec.encode(text)
decoded = codec.decode(encoded)

# Medium normalization strips the tashkeel, so this prints the
# normalized text, not the original diacritized input
print(decoded)
```

## Utility Functions

```python
from suhail import is_arabic, is_pua, normalize, version

# Check if character is Arabic
is_arabic('ب')  # True
is_arabic('a')  # False

# Check if character is in Private Use Area
is_pua('\uE000')  # True
is_pua('ب')       # False

# Normalize Arabic text
normalize("الكِتَابُ")  # 'الكتاب'

# Get version
version()  # e.g. '0.3.8'
```
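
For readers unfamiliar with the Private Use Area, the checks above can be approximated with plain code point ranges. This is a sketch, not the library's code: `sketch_is_pua` and `sketch_is_arabic` are hypothetical names, and restricting "Arabic" to the main Arabic block (U+0600 to U+06FF) is an assumption; the library may also cover the supplement and presentation-form blocks.

```python
# Illustrative sketch only: NOT the library's implementation.

def sketch_is_pua(ch):
    """True if ch lies in one of the three Unicode Private Use Areas."""
    cp = ord(ch)
    return (
        0xE000 <= cp <= 0xF8FF          # BMP Private Use Area
        or 0xF0000 <= cp <= 0xFFFFD     # Supplementary Private Use Area-A
        or 0x100000 <= cp <= 0x10FFFD   # Supplementary Private Use Area-B
    )


def sketch_is_arabic(ch):
    """True if ch is in the main Arabic block (ASSUMPTION: this block only)."""
    return 0x0600 <= ord(ch) <= 0x06FF


print(sketch_is_pua('\uE000'), sketch_is_pua('ب'))
print(sketch_is_arabic('ب'), sketch_is_arabic('a'))
```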

## Codec Statistics

```python
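from suhail import SarfCodec

# morf_map: a morpheme map dict (the outputs below are for a 114-entry map)
codec = SarfCodec(morf_map)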

# Get number of morphemes
print(codec.num_morphemes)  # 114

# Get detailed statistics
stats = codec.stats()
print(stats)
# {'total_morphemes': 114, 'basic_pua_codes': 114, 'supplementary_pua_codes': 0}
```
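
The statistics keys are not explained further, so the following sketch shows one plausible reading: "basic" codes fall in the BMP Private Use Area (U+E000 to U+F8FF, 6,400 code points) and "supplementary" codes fall in the two supplementary planes. `sketch_stats` is a hypothetical helper, and it assumes every code is a single character.

```python
# Illustrative sketch only: an ASSUMED interpretation of the stats keys.

def sketch_stats(morf_map):
    """Count PUA codes by range; assumes each code is one character."""
    codes = [ord(code) for code in morf_map.values()]
    basic = sum(1 for cp in codes if 0xE000 <= cp <= 0xF8FF)
    supplementary = sum(
        1 for cp in codes
        if 0xF0000 <= cp <= 0xFFFFD or 0x100000 <= cp <= 0x10FFFD
    )
    return {
        "total_morphemes": len(morf_map),
        "basic_pua_codes": basic,
        "supplementary_pua_codes": supplementary,
    }


example_map = {'ال': '\uE000', 'كتاب': '\uE001', 'و': '\U000F0000'}
example_stats = sketch_stats(example_map)
print(example_stats)
```

Under this reading, a map with more than 6,400 morphemes would have to spill into the supplementary areas.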

## Performance

Roundtrip-tested on 300,000 samples with a 100% success rate:

| Test | Samples | Success Rate | Speed |
|------|---------|--------------|-------|
| Random Arabic/English | 100,000 | 100% | ~30,000/sec |
| Diacritized Arabic (tashkeel) | 100,000 | 100% | ~5,000/sec |
| Plain Arabic | 100,000 | 100% | ~6,000/sec |
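
The benchmark script behind these numbers is not shown here. To measure throughput on your own data, a small harness along these lines may help; `measure_throughput` is a hypothetical helper, and `str.casefold` is only a stand-in workload so that the sketch runs without the package installed.

```python
import time

# Illustrative sketch only: NOT the project's benchmark script.

def measure_throughput(fn, samples, repeats=3):
    """Return the best observed calls per second of fn over samples."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for sample in samples:
            fn(sample)
        best = min(best, time.perf_counter() - start)
    return len(samples) / best if best > 0 else float("inf")


samples = ["الكتاب", "Hello مرحبا"] * 5_000
rate = measure_throughput(str.casefold, samples)  # stand-in workload
print(f"~{rate:,.0f} operations/sec")
```

To benchmark the codec itself, pass `codec.roundtrip` (or `codec.encode`) in place of `str.casefold`. Results depend on hardware, text length, and the size of the morpheme map.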

## Edge Cases Handled

| Case | Example | Handling |
|------|---------|----------|
| Diacritics | بِسْمِ | Preserved (light) or stripped (medium, aggressive) |
| Arabic-Indic digits | ٠١٢٣٤٥ | Preserved |
| Alef variants | أ إ آ ا | Normalized to ا (medium, aggressive) |
| Taa marbuta | ة | Normalized at the aggressive level only |
| Tatweel (kashida) | كـتـاب | Removed |
| French guillemets | « » | Preserved |
| Mixed Arabic/English | Hello مرحبا | Both handled |
| URLs and emails | email@test.com | Preserved |

## Building from Source

```bash
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone and build
git clone https://github.com/almaghrabima/suhail-pkg
cd suhail-pkg
pip install maturin
maturin develop --release

# Run tests
python test_comprehensive.py
python test_large_scale.py
```

## Requirements

- Python 3.9+
- Rust 1.70+ (for building from source)

## License

Proprietary. Contact the author for licensing options.

## Support

- GitHub: https://github.com/almaghrabima/suhail-pkg
- Email: contact@almaghrabima.com

