Metadata-Version: 2.4
Name: npltk
Version: 0.3.2
Summary: Nepali Language Processing Toolkit
Author: Anurag Sharma, Anita Budha Magar, Apeksha Parajuli, Apeksha Katwal
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: sentencepiece<0.2.0,>=0.1.90
Requires-Dist: torch>=2.0
Requires-Dist: pytorch-crf>=0.7.2
Dynamic: author
Dynamic: description
Dynamic: description-content-type
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# NPLTK

Nepali Language Processing Toolkit (NPLTK) is a lightweight and modular NLP library designed specifically for the Nepali language. It provides tools for tokenization, normalization, lemmatization, stop-word removal, POS tagging, and Named Entity Recognition (NER).

---

## Why NPLTK?

Most NLP libraries are designed primarily for English and handle Nepali script, morphology, and rich suffixation poorly.

NPLTK is built specifically for Nepali and provides:

* Hybrid tokenizer combining rule-based logic and SentencePiece
* Hybrid lemmatization using dictionary + rules
* Lightweight POS and NER models
* Fully self-contained package with bundled resources
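
The rule layer of the hybrid tokenizer can be illustrated with a minimal standalone pass. The sketch below is not NPLTK's implementation — just a regex over Devanagari word runs and punctuation showing what rule-based tokenization conceptually does:

```python
import re

# Devanagari letters/signs, excluding the danda marks (। U+0964, ॥ U+0965)
WORD = r"[\u0900-\u0963\u0966-\u097F]+"

def rule_tokenize(text: str, keep_punct: bool = True) -> list[str]:
    """Naive rule-based tokenizer: Devanagari word runs, danda, or any other symbol."""
    tokens = re.findall(WORD + r"|[।॥]|[^\s\u0900-\u097F]", text)
    if not keep_punct:
        tokens = [t for t in tokens if re.fullmatch(WORD, t)]
    return tokens

print(rule_tokenize("नेपाल सुन्दर देश हो।"))
```

A real tokenizer also has to handle abbreviations, numerals, and mixed-script text, which is where the SentencePiece layer comes in.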

---

## Installation

```bash
pip install npltk
```

For testing from TestPyPI:

```bash
pip install -i https://test.pypi.org/simple/ npltk
```

---

## Minimal Example

```python
from npltk import create_tokenizer

tokens = create_tokenizer().tokenize("नेपाल सुन्दर देश हो।")
print([t.text for t in tokens])
```

---

## Tokenizer

NPLTK provides a tokenizer factory through `create_tokenizer(...)`.

```python
create_tokenizer(
    mode="hybrid",
    split_into_sentences=True,
    keep_punct=True,
    model_path=None,
    subword=True,
    preprocess=None,
    fallback_to_rule=True,
)
```

### Main arguments

* `mode`: `"hybrid"` or `"rule"`

  * `"hybrid"` uses rule-based tokenization together with SentencePiece
  * `"rule"` uses only rule-based tokenization

* `split_into_sentences`: whether sentence splitting is enabled internally

* `keep_punct`: whether punctuation tokens are kept in output

* `model_path`: optional custom SentencePiece model path

* `subword`: enables SentencePiece-based subword support in hybrid mode

* `preprocess`: optional preprocessing function applied before tokenization

* `fallback_to_rule`: if hybrid loading fails, automatically use rule mode
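
The `preprocess` hook accepts any callable that takes and returns a string. A minimal example of such a callable (the name `clean_text` is ours, not part of the package):

```python
import unicodedata

def clean_text(text: str) -> str:
    """Example preprocess callable: NFC-normalize and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())

# Passed to the factory (requires npltk to be installed):
# tokenizer = create_tokenizer(mode="hybrid", preprocess=clean_text)
print(clean_text("  नेपाल   सुन्दर  देश हो।  "))
```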

### Tokenizer Example

```python
from npltk import create_tokenizer

tokenizer = create_tokenizer(
    mode="hybrid",
    keep_punct=True,
    fallback_to_rule=True,
)

tokens = tokenizer.tokenize("नेपाल एक सुन्दर देश हो।")
print([t.text for t in tokens])
```

### Sentence Tokenization Example

```python
from npltk import create_tokenizer

tokenizer = create_tokenizer(mode="hybrid")
sentences = tokenizer.tokenize_sentences("नेपाल सुन्दर देश हो। यहाँ हिमाल छन्।")

for sent in sentences:
    print([t.text for t in sent.tokens])
```
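
Sentence splitting in Nepali hinges on the danda (।) rather than the Latin full stop. A naive standalone sketch of that idea (not NPLTK's actual splitter):

```python
import re

def split_sentences(text: str) -> list[str]:
    """Split after danda (।) or double danda (॥), trimming trailing whitespace."""
    parts = re.split(r"(?<=[।॥])\s*", text.strip())
    return [p for p in parts if p]

print(split_sentences("नेपाल सुन्दर देश हो। यहाँ हिमाल छन्।"))
```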

### Detokenization Example

```python
from npltk import create_tokenizer

tokenizer = create_tokenizer(mode="hybrid")
tokens = tokenizer.tokenize("नेपाल सुन्दर देश हो।")
text = tokenizer.detokenize(tokens)

print(text)
```
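
Detokenization is roughly the inverse join. A naive illustration of the idea on plain strings (not the package's algorithm): join on spaces, then re-attach punctuation to the preceding token:

```python
def naive_detokenize(tokens: list[str]) -> str:
    """Join tokens with spaces, then remove the space before closing punctuation."""
    text = " ".join(tokens)
    for mark in ("।", "॥", ",", "?", "!"):
        text = text.replace(" " + mark, mark)
    return text

print(naive_detokenize(["नेपाल", "सुन्दर", "देश", "हो", "।"]))
```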

---

## Examples for Each Component

### 1. Normalizer

```python
from npltk.normalizer import build_normalizer

result = build_normalizer().normalize("  नेपाल।।  ")
print(result.text)
```

### 2. Tokenizer

```python
from npltk import create_tokenizer

tokenizer = create_tokenizer(mode="hybrid")
tokens = tokenizer.tokenize("नेपालको प्रधानमन्त्री काठमाडौं गए।")
print([t.text for t in tokens])
```

### 3. Lemmatizer

```python
from npltk import Lemmatizer

lemmatizer = Lemmatizer()
print(lemmatizer.lemmatize("गयो"))
print(lemmatizer.lemmatize("घरहरूमा"))
```
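
The hybrid dictionary-plus-rules approach can be sketched in a few lines. Everything below (the tiny lexicon and suffix list) is illustrative, not NPLTK's bundled data:

```python
LEXICON = {"गयो": "जानु"}  # irregular forms resolved by direct lookup
SUFFIXES = ("हरूमा", "हरूको", "हरू", "मा", "को", "ले", "लाई")  # common Nepali suffixes

def hybrid_lemmatize(word: str) -> str:
    """Dictionary lookup first; otherwise strip the longest matching suffix."""
    if word in LEXICON:
        return LEXICON[word]
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

print(hybrid_lemmatize("गयो"))      # lookup hit
print(hybrid_lemmatize("घरहरूमा"))  # suffix stripping
```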

### 4. Stop Word Removal

```python
from npltk import create_tokenizer
from npltk.stop_word.remover import StopWordRemover

tokens = create_tokenizer().tokenize("नेपाल सुन्दर देश हो र यहाँ हिमाल छन् ।")
filtered, info = StopWordRemover().remove(tokens)

print([t.text for t in filtered])
print(info)
```

### 5. POS Tagger

```python
from npltk import create_tokenizer, POSTagger

tokens = [t.text for t in create_tokenizer().tokenize("नेपालको प्रधानमन्त्री काठमाडौं गए।")]
tagger = POSTagger()

print(tagger.tag_with_tokens(tokens))
```

### 6. NER Tagger

```python
from npltk import NERTagger

tagger = NERTagger(tokenizer_mode="hybrid")
print(tagger.extract("शेरबहादुर देउवा काठमाडौं पुगे।"))
```

---

## Full Workflow Pipeline Example

```python
from pprint import pprint

from npltk import create_tokenizer, Lemmatizer, POSTagger, NERTagger
from npltk.normalizer import build_normalizer
from npltk.stop_word.remover import StopWordRemover

text = "  शेरबहादुर देउवा काठमाडौं पुगे र नेपालको बारेमा बोले।  "

# 1. Normalize
normalizer = build_normalizer()
norm_result = normalizer.normalize(text)
normalized_text = norm_result.text
print("Normalized:", normalized_text)

# 2. Tokenize
tokenizer = create_tokenizer(mode="hybrid", fallback_to_rule=True)
tokens = tokenizer.tokenize(normalized_text)
token_texts = [t.text for t in tokens]
print("Tokens:", token_texts)

# 3. Remove stop words
filtered_tokens, info = StopWordRemover().remove(tokens)
filtered_texts = [t.text for t in filtered_tokens]
print("Filtered Tokens:", filtered_texts)
print("Stopword Info:", info)

# 4. Lemmatize
lemmatizer = Lemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in filtered_texts]
print("Lemmas:", lemmas)

# 5. POS tagging
pos_tagger = POSTagger()
pos_pairs = pos_tagger.tag_with_tokens(token_texts)
print("POS Tags:", pos_pairs)

# 6. NER
ner_tagger = NERTagger(tokenizer_mode="hybrid")
ner_result = ner_tagger.predict(normalized_text)

print("NER Token-Tag Pairs:")
for token, tag in zip(ner_result["tokens"], ner_result["tags"]):
    print(f"{token:12} {tag}")

print("Entities:")
pprint(ner_result["entities"], width=100)
```

---

## Features

* Nepali normalizer
* Hybrid tokenizer (rule-based + SentencePiece)
* Lemmatizer
* Stop-word removal
* POS tagging
* Named Entity Recognition (NER)

---

## Models

NPLTK includes bundled trained models for:

* POS Tagger
* NER Tagger

These work out of the box after installation.

---

## Suggested Workflow

1. Normalize text
2. Tokenize text
3. Optionally remove stop words
4. Lemmatize tokens
5. Run POS tagging
6. Run NER extraction

---

## Contributors

* Anurag Sharma
* Anita Budha Magar
* Apeksha Parajuli
* Apeksha Katwal

Supervisor:

* Pukar Karki

Institute of Engineering, Purwanchal Campus

---

## License

MIT License
