Metadata-Version: 2.4
Name: EthioBBPE
Version: 1.0.2
Summary: Advanced Byte Pair Encoding Tokenizer for Ethiopian Languages (Amharic, Tigrinya, Ge'ez)
Author-email: Nexus Research <nexuss0781@gmail.com>
Maintainer-email: Nexus Research <nexuss0781@gmail.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/nexuss0781/Ethio_BBPE
Project-URL: Documentation, https://github.com/nexuss0781/Ethio_BBPE#readme
Project-URL: Repository, https://github.com/nexuss0781/Ethio_BBPE.git
Project-URL: Issues, https://github.com/nexuss0781/Ethio_BBPE/issues
Project-URL: HuggingFace, https://huggingface.co/nexuss0781/Ethio-BBPE
Keywords: tokenizer,amharic,tigrinya,geez,ethiopian,nlp,bpe,byte-pair-encoding,huggingface,african-languages
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Description-Content-Type: text/markdown
Requires-Dist: tokenizers>=0.15.0
Requires-Dist: huggingface-hub>=0.20.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Provides-Extra: training
Requires-Dist: datasets>=2.14.0; extra == "training"
Requires-Dist: pandas>=2.0.0; extra == "training"

# EthioBBPE

**Advanced Byte Pair Encoding Tokenizer for Ethiopian Languages**

[![PyPI version](https://badge.fury.io/py/EthioBBPE.svg)](https://pypi.org/project/EthioBBPE/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Hugging Face](https://img.shields.io/badge/🤗-HuggingFace-yellow.svg)](https://huggingface.co/nexuss0781/Ethio-BBPE)

EthioBBPE is a production-ready, high-performance tokenizer optimized for **Amharic**, **Tigrinya**, and **Ge'ez**, all written in the Ethiopic script. It ships with checkpointing, multi-format compression, and model quantization, and delivers efficient text processing for Ethiopian languages.

## ✨ Features

### 🚀 Production-Ready
- **Automatic Model Download**: Seamlessly downloads pretrained models from Hugging Face Hub on first use
- **Embedded Models**: Includes pretrained weights in the package for offline usage
- **Zero Configuration**: Works out of the box with sensible defaults

### 🔧 Advanced Capabilities
- **Checkpointing**: SHA256 integrity verification for fault-tolerant training (see the sketch after this list)
- **Multi-Format Compression**: Support for gzip, bz2, and lzma/xz (up to 90% size reduction)
- **Model Quantization**: 8-bit and 4-bit quantization for efficient deployment
- **Batch Processing**: Efficient encoding/decoding of text batches
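
For a sense of how checkpoint verification works in principle, here is a standard-library sketch. This is illustrative, not the package's internal implementation; the `validate_checkpoint` helper (see Utility Functions below) exposes a check of this kind:

```python
import hashlib

def sha256_of_file(path: str) -> str:
    """Stream a file through SHA256 and return its hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()

# A checkpoint passes validation when its current digest matches the one
# recorded at save time ("..." is a placeholder for that stored value).
stored_digest = "..."
is_valid = sha256_of_file("checkpoint_ckpt_1.json") == stored_digest
```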

### 📊 Performance
- **Vocabulary Size**: 16,000 tokens optimized for Ethiopic scripts
- **Compression Ratio**: ~90% size reduction with gzip level 9 (illustrated after this list)
- **Perfect Reconstruction**: 100% accuracy on Amharic biblical texts
- **Fast Inference**: Optimized for production workloads
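
The ~90% figure is measured on the shipped 1.34 MB vocabulary (see Model Details below). As a standard-library sketch of a gzip-level-9 write of a vocabulary file (illustrative only, not the package's code):

```python
import gzip
import json

# Toy vocabulary; the shipped model has 16,000 entries (~1.34 MB raw)
vocab = {"ሰላም": 0, "እግዚአብሔር": 1}
raw = json.dumps(vocab, ensure_ascii=False).encode("utf-8")

# Write vocab.json.gz with gzip level 9, matching the package's format
with gzip.open("vocab.json.gz", "wb", compresslevel=9) as f:
    f.write(raw)
```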

## 📦 Installation

### Basic Installation
```bash
pip install EthioBBPE
```

### With Training Dependencies
```bash
pip install "EthioBBPE[training]"
```

### Development Installation
```bash
pip install "EthioBBPE[dev]"
```

## 🎯 Quick Start

### Simple Usage
```python
from ethiobbpe import EthioBBPE

# Initialize tokenizer (auto-downloads model on first use)
tokenizer = EthioBBPE()

# Encode text
text = "ሰላም ለኢዮብ ዘኢነበበ"
encoded = tokenizer.encode(text)
print(encoded['ids'])      # Token IDs
print(encoded['tokens'])   # Token strings

# Decode back to text
decoded = tokenizer.decode(encoded['ids'])
print(decoded)  # Perfect reconstruction: "ሰላም ለኢዮብ ዘኢነበበ"
```

### Batch Processing
```python
texts = [
    "በመዠመሪያ፡እግዚአብሔር፡ሰማይንና፡ምድርን፡ፈጠረ።",
    "ሰላም ለኢዮብ ዘኢነበበ ከንቶ ።"
]

# Encode batch
encoded_batch = tokenizer.encode_batch(texts)

# Decode batch
decoded_batch = tokenizer.decode_batch([e['ids'] for e in encoded_batch])
```

### Load from Hugging Face Hub
```python
from ethiobbpe import EthioBBPE

# Load specific model from Hugging Face
tokenizer = EthioBBPE.from_pretrained("nexuss0781/Ethio-BBPE")
```

## 🔬 Advanced Usage

### Custom Configuration
```python
from ethiobbpe import Config, EthioBBPE

# Create custom configuration
config = Config(
    model_name="MyTokenizer",
    vocab_size=32000,
    compression_format="bz2",
    compression_level=9,
    enable_quantization=True,
    quantization_bits=8
)

# Use the configuration (the tokenizer itself only needs the model name)
tokenizer = EthioBBPE(model_name=config.model_name)
```
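
The remaining `Config` fields mirror `BBPETrainer`'s parameters (see Training below), so a configuration can also drive a training run. A sketch assuming that one-to-one mapping holds, reusing the `config` object above:

```python
from ethiobbpe.trainer import BBPETrainer

trainer = BBPETrainer(
    vocab_size=config.vocab_size,
    compression_format=config.compression_format,
    compression_level=config.compression_level,
    enable_quantization=config.enable_quantization,
    quantization_bits=config.quantization_bits,
)
```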

### Utility Functions
```python
from ethiobbpe import (
    load_compressed_vocab,
    validate_checkpoint,
    list_checkpoints,
    get_model_info
)

# Load compressed vocabulary
vocab = load_compressed_vocab("vocab.json.gz")

# Validate checkpoint
is_valid = validate_checkpoint("checkpoint_ckpt_1.json")

# List available checkpoints
checkpoints = list_checkpoints("./checkpoints")

# Get model information
info = get_model_info("./models/EthioBBPE_AmharicBible")
print(f"Total size: {info['total_size_mb']} MB")
```

### Training Your Own Tokenizer
```python
from ethiobbpe.trainer import BBPETrainer

# Initialize trainer with advanced features
trainer = BBPETrainer(
    vocab_size=16000,
    min_frequency=2,
    compression_format="gzip",
    compression_level=9,
    enable_quantization=True,
    quantization_bits=8,
    max_checkpoints=5
)

# Train on your data
texts = ["Your Amharic text 1", "Your Amharic text 2", ...]
metrics = trainer.train(
    texts=texts,
    output_dir="./my_tokenizer",
    model_name="MyAmharicTokenizer"
)

print(f"Training completed in {metrics['training_duration_seconds']:.2f}s")
print(f"Final vocab size: {metrics['final_vocab_size']}")
print(f"Compression ratio: {metrics['compression_ratio']:.2%}")
```

## 📈 Model Details

### Training Data
- **Synaxarium Dataset**: 366 Ethiopian Orthodox Church texts
- **Canon Biblical Dataset**: 61,403 Amharic-English parallel biblical texts
- **Total Corpus**: 27.5 MB, 61,769 lines

### Performance Metrics
| Metric | Value |
|--------|-------|
| Vocabulary Size | 16,000 tokens |
| Training Time | ~17 seconds |
| Original Size | 1.34 MB |
| Compressed Size | 136 KB (gzip) |
| Compression Ratio | 89.8% |
| Reconstruction Accuracy | 100% |

### Supported Scripts
- ✅ Amharic (አማርኛ)
- ✅ Tigrinya (ትግርኛ)
- ✅ Ge'ez (ግዕዝ)
- ✅ Mixed-language texts
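
Because tokenization is byte-level, mixed-script input needs no special handling. A quick sketch with a hypothetical Amharic/English string, relying on the same round-trip behaviour shown in Quick Start:

```python
from ethiobbpe import EthioBBPE

tokenizer = EthioBBPE()

# Hypothetical mixed Amharic/English input
mixed = "ሰላም world! Byte-level BPE handles any UTF-8 text."
encoded = tokenizer.encode(mixed)
print(encoded["tokens"])

# Round-trips exactly, as in the Quick Start example
assert tokenizer.decode(encoded["ids"]) == mixed
```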

## 🛠️ API Reference

### EthioBBPE Class

#### `__init__(model_name: str = "EthioBBPE_AmharicBible", model_dir: Optional[str] = None)`
Initialize the tokenizer with optional custom model name or directory.

#### `encode(text: str, add_special_tokens: bool = True, ...) -> Dict[str, Any]`
Encode text into token IDs, tokens, and offsets.

#### `decode(ids: List[int], skip_special_tokens: bool = True) -> str`
Decode token IDs back to text.

#### `encode_batch(texts: List[str], ...) -> List[Dict[str, Any]]`
Encode multiple texts efficiently.

#### `decode_batch(batch_ids: List[List[int]], ...) -> List[str]`
Decode multiple sequences efficiently.

#### `get_vocab_size() -> int`
Get the vocabulary size.

#### `get_vocab() -> Dict[str, int]`
Get the full vocabulary mapping.

#### `save(path: str) -> None`
Save the tokenizer to a file.

#### `from_pretrained(model_name: str = "nexuss0781/Ethio-BBPE") -> EthioBBPE`
Class method to load a pretrained tokenizer from Hugging Face Hub.
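
A short tour through these methods (the path and printed values are illustrative):

```python
from ethiobbpe import EthioBBPE

tokenizer = EthioBBPE()

print(tokenizer.get_vocab_size())      # 16000 for the default model
vocab = tokenizer.get_vocab()          # token string -> id mapping
print(len(vocab))                      # same number as above

tokenizer.save("./my_tokenizer.json")  # example path

# Reload the published model from the Hub
tokenizer = EthioBBPE.from_pretrained("nexuss0781/Ethio-BBPE")
```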

## 📁 Package Structure

```
ethiobbpe/
├── __init__.py          # Main exports
├── tokenizer.py         # Core EthioBBPE class
├── config.py            # Configuration management
├── utils.py             # Utility functions
├── trainer.py           # Advanced BBPE trainer
└── models/              # Pretrained models
    ├── tokenizer.json
    ├── vocab.json.gz
    ├── config.json
    └── training_metrics.json
```

## 🤝 Integration

### Hugging Face Transformers
```python
from transformers import AutoTokenizer

# Use with Hugging Face Transformers
tokenizer = AutoTokenizer.from_pretrained("nexuss0781/Ethio-BBPE")
```
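
This relies on the Hub repository shipping a `tokenizers`-format `tokenizer.json` (consistent with the package layout above), which `AutoTokenizer` loads as a fast tokenizer. Continuing from the snippet:

```python
enc = tokenizer("ሰላም ለኢዮብ ዘኢነበበ")
print(enc["input_ids"])  # token IDs
print(tokenizer.decode(enc["input_ids"], skip_special_tokens=True))
```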

### LangChain
```python
from langchain.text_splitter import CharacterTextSplitter

from ethiobbpe import EthioBBPE

# Measure chunk sizes in EthioBBPE tokens rather than raw characters
tokenizer = EthioBBPE()
text_splitter = CharacterTextSplitter(
    separator="",
    chunk_size=512,
    chunk_overlap=50,
    length_function=lambda text: len(tokenizer.encode(text)["ids"]),
)
```

## 📄 License

Apache License 2.0 - See [LICENSE](LICENSE) for details.

## 🙏 Acknowledgments

- Training data from [Synaxarium Dataset](https://huggingface.co/datasets/Nexuss0781/synaxarium)
- Biblical texts from [Canon Biblical Dataset](https://huggingface.co/datasets/Nexuss0781/conon-biblical-am-en)
- Built with [Hugging Face Tokenizers](https://github.com/huggingface/tokenizers)

## 📬 Contact

- **Author**: Nexus Research
- **Email**: nexuss0781@gmail.com
- **GitHub**: [nexuss0781/Ethio_BBPE](https://github.com/nexuss0781/Ethio_BBPE)
- **Hugging Face**: [nexuss0781/Ethio-BBPE](https://huggingface.co/nexuss0781/Ethio-BBPE)

## 🗺️ Roadmap

- [ ] Support for additional Ethiopian languages (Oromo, Somali, Sidama)
- [ ] Pre-trained language models using EthioBBPE
- [ ] WebAssembly build for browser-based inference
- [ ] ONNX export for optimized deployment
- [ ] Streaming tokenization for large documents

---

Made with ❤️ for the Ethiopian NLP community
