Metadata-Version: 2.4
Name: kabyle-corpus-toolkit
Version: 2.0.0
Summary: Tools for downloading, processing, and normalizing Kabyle and Occitan language corpora from Tatoeba
Author-email: Athmane MOKRAOUI <butterflyoffire+pypi@protonmail.com>
License: MIT
Project-URL: Homepage, https://codeberg.org/butterflyoffire/kabyle-corpus-toolkit
Project-URL: Documentation, https://codeberg.org/butterflyoffire/kabyle-corpus-toolkit#readme
Project-URL: Repository, https://codeberg.org/butterflyoffire/kabyle-corpus-toolkit
Project-URL: Issues, https://codeberg.org/butterflyoffire/kabyle-corpus-toolkit/issues
Project-URL: Changelog, https://codeberg.org/butterflyoffire/kabyle-corpus-toolkit/blob/main/CHANGELOG.md
Keywords: kabyle,occitan,corpus,nlp,tatoeba,parallel-corpus,language-processing,berber,tamazight
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Utilities
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.25.0
Provides-Extra: validation
Requires-Dist: fasttext>=0.9.2; extra == "validation"
Requires-Dist: huggingface-hub>=0.16.0; extra == "validation"
Provides-Extra: interactive
Requires-Dist: yaspin>=2.0.0; extra == "interactive"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=5.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: yaspin>=2.0.0; extra == "dev"
Provides-Extra: all
Requires-Dist: fasttext>=0.9.2; extra == "all"
Requires-Dist: huggingface-hub>=0.16.0; extra == "all"
Requires-Dist: yaspin>=2.0.0; extra == "all"
Requires-Dist: pytest>=7.0.0; extra == "all"

# Kabyle Corpus Toolkit

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Tools for downloading, processing, and normalizing **Kabyle** (kab) and **Occitan** (oci) language corpora from Tatoeba and other sources.

## Features

- **Download Tatoeba Data**: Automated download of sentences and links from Tatoeba.org
- **Parallel Corpus Creation**: Build aligned English-Kabyle and English-Occitan sentence pairs
- **French Chain Translation**: Expand coverage by pivoting through French, chaining Kabyle→French→English translations
- **Character Normalization**: Fix encoding issues and normalize extended Latin characters
- **Language Validation**: Check that corpus sentences are in the expected language using GlotLID fastText language-identification models
- **Stopword Generation**: Generate language-specific stopword lists from corpus statistics
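
To make the pairing and French-chaining features above concrete, here is a minimal sketch of the underlying idea. It is not the toolkit's actual API: the function name `pivot_pairs`, its parameters, and the in-memory data shapes are all assumptions made for illustration. It assumes sentences are held as an `id -> (language, text)` mapping and links as `(id, translation_id)` tuples.

```python
from collections import defaultdict


def pivot_pairs(sentences, links, src="kab", pivot="fra", tgt="eng"):
    """Collect (source, target) text pairs, directly or via a pivot language.

    Hypothetical helper for illustration only, not part of the toolkit.
    sentences: dict mapping sentence id -> (language code, text)
    links: iterable of (sentence id, translation id) tuples
    """
    # Build an undirected adjacency map, ignoring links to unknown sentences.
    neighbors = defaultdict(set)
    for a, b in links:
        if a in sentences and b in sentences:
            neighbors[a].add(b)
            neighbors[b].add(a)

    pairs = set()
    for sid, (lang, text) in sentences.items():
        if lang != src:
            continue
        for nid in neighbors.get(sid, ()):
            nlang, ntext = sentences[nid]
            if nlang == tgt:
                # Direct source -> target translation.
                pairs.add((text, ntext))
            elif nlang == pivot:
                # Chain source -> pivot -> target.
                for pid in neighbors.get(nid, ()):
                    plang, ptext = sentences[pid]
                    if plang == tgt:
                        pairs.add((text, ptext))
    # Sort for deterministic output.
    return sorted(pairs)


# Tiny made-up sample: one pair reachable only through French, one direct pair.
sample_sentences = {
    1: ("kab", "Azul."),
    2: ("fra", "Salut."),
    3: ("eng", "Hello."),
    4: ("kab", "Tanemmirt."),
    5: ("eng", "Thank you."),
}
sample_links = [(1, 2), (2, 3), (4, 5)]

sample_pairs = pivot_pairs(sample_sentences, sample_links)
for kab_text, eng_text in sample_pairs:
    print(f"{kab_text}\t{eng_text}")
```

In this sketch the pair for sentence 1 is found only through the French sentence, which is how chaining can add pairs that direct links miss. Pairs obtained through a pivot are generally noisier than direct translations, so a real pipeline may want to keep track of which route produced each pair.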

## Installation

### Basic Installation

```bash
pip install kabyle-corpus-toolkit
