Metadata-Version: 2.4
Name: wikilangs
Version: 0.1.0
Summary: A Python package for consuming Wikipedia language models, including tokenizers, n-gram models, Markov chains, and vocabularies.
Author-email: Omar Kamali <wikilangs@omarkama.li>
License: MIT
Project-URL: Homepage, https://github.com/omarkamali/wikilangs
Project-URL: Repository, https://github.com/omarkamali/wikilangs.git
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown

# Wikilangs

[![PyPI version](https://badge.fury.io/py/wikilangs.svg)](https://badge.fury.io/py/wikilangs)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/release/python-380/)

A Python package for consuming Wikipedia language models, including tokenizers, n-gram models, Markov chains, and vocabularies.

## Features

- **BPE Tokenizers**: Pre-trained tokenizers for 100+ languages
- **N-gram Models**: Language models for text scoring and next-token prediction
- **Markov Chains**: Text generation models with configurable depth
- **Vocabularies**: Comprehensive word dictionaries with frequency information
- **Multi-language Support**: Models available for 100+ Wikipedia languages
- **Easy API**: Simple, intuitive interface for loading and using models

## Installation

```bash
pip install wikilangs
```

## Quick Start

```python
from wikilangs import tokenizer, ngram, markov, vocabulary

# Create a tokenizer
tok = tokenizer(date='20251201', lang='en', vocab_size=16000)

# Tokenize text
tokens = tok.tokenize("Hello, world!")
token_ids = tok.encode("Hello, world!")
print(tokens)  # ['Hello', ',', '▁world', '!']
print(token_ids)  # [1234, 5, 5678, 9]

# Create an n-gram model
ng = ngram(date='20251201', lang='en', gram_size=3)

# Score text
score = ng.score("This is a sample sentence.")
print(score)  # -12.345

# Predict next token
predictions = ng.predict_next("This is a", top_k=5)
print(predictions)  # [('sample', 0.85), ('test', 0.05), ...]

# Create a Markov chain
mc = markov(date='20251201', lang='en', depth=2)

# Generate text
text = mc.generate(length=50)
print(text)  # "Generated text using the Markov chain model..."

# Create a vocabulary
vocab = vocabulary(date='20251201', lang='en')

# Look up a word
word_info = vocab.lookup("example")
print(word_info)  # {'frequency': 12345, 'definition': '...'}

# Get word frequency
freq = vocab.get_frequency("example")
print(freq)  # 12345
```

## API Reference

### tokenizer(date, lang, vocab_size=16000, local_dir=None)

Create a BPE tokenizer instance.

**Parameters:**
- `date` (str): Date of the model (format: YYYYMMDD)
- `lang` (str): Language code (e.g., 'en', 'fr', 'ary')
- `vocab_size` (int): Vocabulary size (8000, 16000, 32000, 64000)
- `local_dir` (str, optional): Local directory to check for models before downloading

**Returns:**
- `BPETokenizer`: Initialized tokenizer instance
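
For example, a non-default vocabulary size can be combined with a local cache. This is a minimal sketch reusing only the calls shown in the Quick Start; it assumes `local_dir` is checked before any download, as the parameter description suggests, and the date and cache path are illustrative:

```python
from wikilangs import tokenizer

# Load a larger French tokenizer, checking a local directory for the
# model files first (path is illustrative, not a required location).
tok = tokenizer(date='20251201', lang='fr', vocab_size=32000,
                local_dir='~/.cache/wikilangs')

tokens = tok.tokenize("Bonjour le monde !")   # list of subword strings
token_ids = tok.encode("Bonjour le monde !")  # list of integer token ids
```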

### ngram(date, lang, gram_size=3, local_dir=None)

Create an n-gram model instance.

**Parameters:**
- `date` (str): Date of the model (format: YYYYMMDD)
- `lang` (str): Language code (e.g., 'en', 'fr', 'ary')
- `gram_size` (int): Size of n-grams (2, 3, 4, 5)
- `local_dir` (str, optional): Local directory to check for models before downloading

**Returns:**
- `NGramModel`: Initialized n-gram model instance
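
A short reranking sketch. It assumes, as the Quick Start output suggests, that `score` returns a log-probability-style value where higher (less negative) means more fluent:

```python
from wikilangs import ngram

# 5-gram model for comparing candidate sentences.
ng = ngram(date='20251201', lang='en', gram_size=5)

candidates = ["She went to the store.", "She went the to store."]
# Pick the candidate the model considers most likely.
best = max(candidates, key=ng.score)
print(best)
```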

### markov(date, lang, depth=2, local_dir=None)

Create a Markov chain model instance.

**Parameters:**
- `date` (str): Date of the model (format: YYYYMMDD)
- `lang` (str): Language code (e.g., 'en', 'fr', 'ary')
- `depth` (int): Depth of the Markov chain (1, 2, 3, 4, 5)
- `local_dir` (str, optional): Local directory to check for models before downloading

**Returns:**
- `MarkovChain`: Initialized Markov chain model instance
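
Depth controls how many preceding tokens the chain conditions on: deeper chains tend to produce more locally coherent but less varied text. A minimal sketch, reusing only `generate(length=...)` from the Quick Start with an illustrative date:

```python
from wikilangs import markov

# Compare samples from a shallow and a deeper chain.
for depth in (1, 3):
    mc = markov(date='20251201', lang='en', depth=depth)
    print(f"depth={depth}:", mc.generate(length=30))
```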

### vocabulary(date, lang, local_dir=None)

Create a vocabulary instance.

**Parameters:**
- `date` (str): Date of the model (format: YYYYMMDD)
- `lang` (str): Language code (e.g., 'en', 'fr', 'ary')
- `local_dir` (str, optional): Local directory to check for models before downloading

**Returns:**
- `WikilangsVocabulary`: Initialized vocabulary instance
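
A small sketch comparing relative word frequencies, using only the `get_frequency` call from the Quick Start (date illustrative):

```python
from wikilangs import vocabulary

vocab = vocabulary(date='20251201', lang='en')

# Higher counts indicate more common words in the Wikipedia corpus.
for word in ("the", "example", "sesquipedalian"):
    print(word, vocab.get_frequency(word))
```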

## Available Languages

Models are available for 100+ Wikipedia languages including:

- English (`en`)
- French (`fr`)
- Spanish (`es`)
- German (`de`)
- Arabic (`ar`)
- Chinese (`zh`)
- Japanese (`ja`)
- Korean (`ko`)
- And many more...

## Available Dates

Models are updated regularly. Check the [Hugging Face dataset](https://huggingface.co/datasets/wikilangs) for the latest available dates.

## Examples

Check out the [examples](examples/) directory for Jupyter notebooks demonstrating various use cases.

## Development

### Install dependencies

```bash
pip install -r requirements.txt
```

### Run tests

```bash
pytest tests/
```

### Build package

```bash
python -m build
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Acknowledgments

- Models trained on Wikipedia data
- Uses the [vocabulous](https://github.com/omarkamali/vocabulous) library for vocabulary functionality
- Hosted on Hugging Face Datasets
