Metadata-Version: 2.4
Name: urdu-nlp
Version: 0.1.0
Summary: A lightweight, pure-Python NLP library for Urdu language processing
Author: urdu-nlp contributors
Author-email: urdu-nlp contributors <imabd645@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/imabd645/urdu-nlp
Project-URL: Repository, https://github.com/imabd645/urdu-nlp
Project-URL: Issues, https://github.com/imabd645/urdu-nlp/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Natural Language :: Urdu
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: regex>=2023.0
Dynamic: author
Dynamic: license-file
Dynamic: requires-python

# urdu-nlp

A lightweight, pure-Python NLP library for **Urdu** language processing. No deep learning models are required: install the package and call its functions directly.

Urdu is spoken by more than 230 million people, yet few ready-to-use NLP packages for it are available on PyPI. **urdu-nlp** aims to fill that gap with tokenization, stop word removal, normalization, stemming, Roman Urdu transliteration, and sentence boundary detection.

## Installation

```bash
pip install urdu-nlp
```

Or install from source:

```bash
git clone https://github.com/imabd645/urdu-nlp.git
cd urdu-nlp
pip install -e .
```

## Quick Start

### 1. Tokenization

```python
from urdu_nlp import tokenize

tokenize("میں اسکول جاتا ہوں")
# → ["میں", "اسکول", "جاتا", "ہوں"]
```

### 2. Stop Word Removal

```python
from urdu_nlp import remove_stopwords

remove_stopwords(["میں", "اسکول", "جاتا", "ہوں"])
# → ["اسکول", "جاتا"]
```
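
For intuition, stop word removal of this kind can be sketched as a set-membership filter. This is a minimal illustration, not the library's implementation: `STOP_WORDS` and `sketch_remove_stopwords` are hypothetical names, and the word list is a small sample chosen for the example.

```python
# Hypothetical stop list for illustration only; the library ships its own list.
STOP_WORDS = frozenset({"میں", "ہوں", "ہے", "کا", "کی", "کے", "اور"})


def sketch_remove_stopwords(tokens):
    """Return the tokens that are not in STOP_WORDS, preserving order."""
    return [token for token in tokens if token not in STOP_WORDS]


# Tokens from the sentence above; the two stop words are filtered out.
print(sketch_remove_stopwords(["میں", "اسکول", "جاتا", "ہوں"]))
```

A `frozenset` keeps each lookup constant-time, which matters once the stop list grows to a few hundred entries.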

### 3. Normalization

```python
from urdu_nlp import normalize

normalize("ﻛﺮﻧﺎ")   # Arabic form
# → "کرنا"          # Urdu form
```
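
To show what this kind of normalization involves, here is a minimal standard-library sketch. It assumes two steps that may differ from what `normalize` actually does: Unicode NFKC folding of Arabic presentation forms, followed by a small, hypothetical mapping of Arabic letters to their Urdu counterparts. The names `ARABIC_TO_URDU` and `sketch_normalize` are illustrative.

```python
import re
import unicodedata

# Hypothetical mapping for illustration: Arabic letters with a distinct Urdu form.
ARABIC_TO_URDU = str.maketrans({
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
})


def sketch_normalize(text: str) -> str:
    # NFKC replaces presentation-form code points with their base letters.
    text = unicodedata.normalize("NFKC", text)
    # Swap Arabic letter variants for the Urdu ones.
    text = text.translate(ARABIC_TO_URDU)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


# Presentation forms of kaf, reh, noon, alef become U+06A9 U+0631 U+0646 U+0627.
result = sketch_normalize("\uFEDB\uFEAE\uFEE7\uFE8E")
print(result == "\u06A9\u0631\u0646\u0627")  # True
```

Normalizing before tokenization or stop word lookup means visually identical words compare equal regardless of which code points the source text used.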

### 4. Stemming

```python
from urdu_nlp import stem

stem("کتابوں")
# → "کتاب"
```
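
Suffix stripping of this sort can be sketched in a few lines. The suffix list, the minimum stem length, and the name `sketch_stem` below are assumptions made for illustration; the library's actual rules are likely more extensive.

```python
# Hypothetical plural suffixes for illustration; not the library's rule set.
SUFFIXES = ("وں", "یں")
MIN_STEM_LEN = 2  # Assumed guard so very short words are left untouched.


def sketch_stem(word: str) -> str:
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


print(sketch_stem("کتابوں"))  # plural oblique form of "book"
print(sketch_stem("ہوں"))     # too short to strip, returned unchanged
```

The length guard is a common safeguard in rule-based stemmers: without it, short function words that merely end in a suffix-like sequence would be truncated.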

### 5. Roman Urdu → Urdu Script

```python
from urdu_nlp import roman_to_urdu

roman_to_urdu("mein school jata hoon")
# → "میں اسکول جاتا ہوں"
```

### 6. Sentence Boundary Detection

```python
from urdu_nlp import sent_tokenize

sent_tokenize("یہ پہلا جملہ ہے۔ یہ دوسرا ہے۔")
# → ["یہ پہلا جملہ ہے۔", "یہ دوسرا ہے۔"]
```
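
As a rough picture of how sentence boundaries can be found, the sketch below splits on the Urdu full stop (U+06D4), the Arabic question mark (U+061F), and `!`, using only the standard library. It is a simplified illustration under those assumptions, with the hypothetical name `sketch_sent_tokenize`; it does not handle abbreviations or other edge cases that a real splitter may need to cover.

```python
import re

# A sentence is a run of non-terminators followed by any trailing terminators.
# Terminators assumed here: U+06D4 (Urdu full stop), U+061F (question mark), "!".
_SENTENCE = re.compile(r"[^\u06D4\u061F!]+[\u06D4\u061F!]*")


def sketch_sent_tokenize(text: str) -> list:
    """Split text into sentences, keeping the closing punctuation."""
    sentences = (match.group().strip() for match in _SENTENCE.finditer(text))
    # Drop empty strings left over from trailing whitespace.
    return [sentence for sentence in sentences if sentence]


for sentence in sketch_sent_tokenize("یہ پہلا جملہ ہے\u06D4 یہ دوسرا ہے\u06D4"):
    print(sentence)
```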

## API Reference

| Function | Description |
|---|---|
| `tokenize(text)` | Split Urdu text into word tokens |
| `sent_tokenize(text)` | Split text into sentences |
| `remove_stopwords(tokens)` | Remove common Urdu stop words from a token list |
| `normalize(text)` | Normalize Arabic/Urdu character variants and whitespace |
| `stem(word)` | Strip common Urdu suffixes to get the root form of a word |
| `roman_to_urdu(text)` | Transliterate Roman Urdu to Urdu script |

## Dependencies

- `regex` (>=2023.0): Unicode-aware pattern matching. This is the only runtime dependency.

## License

MIT. See the `LICENSE` file for the full text.
