Metadata-Version: 2.4
Name: nahiarhdNLP
Version: 1.0.1
Summary: Advanced Indonesian Natural Language Processing Library
Author-email: Raihan Hidayatullah Djunaedi <raihanhd.dev@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/nahiarhdNLP/nahiarhdNLP
Project-URL: Documentation, https://nahiarhdNLP.readthedocs.io
Project-URL: Repository, https://github.com/nahiarhdNLP/nahiarhdNLP
Project-URL: Issues, https://github.com/nahiarhdNLP/nahiarhdNLP/issues
Keywords: nlp,indonesian,natural-language-processing,text-processing,bahasa-indonesia
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3.0
Requires-Dist: fsspec>=2021.10.1
Requires-Dist: huggingface_hub>=0.10.0
Requires-Dist: sastrawi>=1.0.1
Requires-Dist: datasets>=2.0.0
Requires-Dist: rich>=12.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: isort>=5.10.0; extra == "dev"
Requires-Dist: flake8>=4.0.0; extra == "dev"
Requires-Dist: mypy>=0.991; extra == "dev"
Requires-Dist: pre-commit>=2.17.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=4.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.0.0; extra == "docs"
Requires-Dist: myst-parser>=0.17.0; extra == "docs"
Dynamic: license-file

# nahiarhdNLP

[![PyPI version](https://badge.fury.io/py/nahiarhdNLP.svg)](https://badge.fury.io/py/nahiarhdNLP)
[![Python Version](https://img.shields.io/pypi/pyversions/nahiarhdNLP.svg)](https://pypi.org/project/nahiarhdNLP/)
[![GitHub](https://img.shields.io/github/stars/nahiarhdNLP/nahiarhdNLP?style=social)](https://github.com/nahiarhdNLP/nahiarhdNLP)

**nahiarhdNLP** is an advanced Python library for Indonesian Natural Language Processing (NLP), providing easy-to-use tools for text preprocessing, normalization, tokenization, stemming, spell correction, and customizable pipelines.

---

## Installation

```bash
pip install nahiarhdNLP
```

---

## Features

- **Preprocessing**: Clean text of HTML, URLs, stopwords, slang, emoji, mentions, hashtags, numbers, punctuation, extra spaces, and special characters.
- **Tokenization**: Split sentences into tokens/words.
- **Stemming**: Convert words to their root form (using Sastrawi).
- **Spell Correction**: Automatic spelling correction.
- **Pipeline**: Chain multiple preprocessing functions easily.
- **Normalization**: Replace slang, emoji, and informal words with formal equivalents.
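
To illustrate what dictionary-based normalization can look like, here is a minimal standalone sketch. It does not use nahiarhdNLP itself: the `SLANG` table and the `replace_slang_sketch` function are hypothetical names invented for this example, and the library's actual slang data and implementation may differ.

```python
import re

# Illustrative slang table (an assumption for this sketch);
# the library loads its own slang dataset.
SLANG = {"emg": "memang", "yg": "yang"}


def replace_slang_sketch(text, table=SLANG):
    """Replace whole-word slang tokens using a lookup table."""

    def swap(match):
        word = match.group(0)
        # Fall back to the original word when it is not in the table.
        return table.get(word.lower(), word)

    # \w+ matches one word at a time, so punctuation is left untouched.
    return re.sub(r"\w+", swap, text)


print(replace_slang_sketch("emg siapa yg nanya?"))  # memang siapa yang nanya?
```

Matching whole words avoids rewriting a slang token that happens to appear inside a longer word.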

---

## Quick Usage Example

### Basic Preprocessing

```python
from nahiarhdNLP import preprocessing

text = "Halooo emg siapa yg nanya? 😀 <a href='#'>link</a> @user #trending 123"
cleaned = preprocessing.cleaning.text_cleaner.clean_text(text)
print(cleaned)
```
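
For readers who want a feel for the kind of work a text cleaner does, the following is a rough regex-based sketch using only the standard library. The function name `clean_text_sketch` and the patterns are assumptions made for illustration; they are not the library's implementation, and they do not handle emoji, slang, or stopwords.

```python
import re


def clean_text_sketch(text):
    """Strip a few common kinds of noise from social-media style text."""
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    text = re.sub(r"\d+", " ", text)                    # numbers
    text = re.sub(r"\s+", " ", text).strip()            # collapse whitespace
    return text.lower()


sample = "Cek https://example.com <b>SEKARANG</b> @user #trending 123"
print(clean_text_sketch(sample))  # cek sekarang
```

Each step replaces the match with a space rather than an empty string, so neighbouring words are not glued together; the final step collapses the leftover whitespace.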

### Custom Preprocessing Pipeline

```python
from nahiarhdNLP.preprocessing import (
    pipeline, remove_html, remove_url, remove_mentions, remove_hashtags,
    remove_numbers, replace_word_elongation, emoji_to_words, replace_slang,
    remove_stopwords, remove_punctuation, remove_extra_spaces, to_lowercase
)

custom_pipe = pipeline([
    remove_html, remove_url, remove_mentions, remove_hashtags, remove_numbers,
    replace_word_elongation, emoji_to_words, replace_slang, remove_stopwords,
    remove_punctuation, remove_extra_spaces, to_lowercase
])

result = custom_pipe("Halooo emg siapa yg nanya? 😀 <a href='#'>link</a> @user #trending 123")
print(result)
```
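
Conceptually, a pipeline of this kind is function composition: each step receives the output of the previous one. The sketch below shows one way that idea could be written in plain Python. `pipeline_sketch` is a hypothetical helper for illustration only and is not claimed to match how the library's `pipeline` is implemented.

```python
from functools import reduce


def pipeline_sketch(steps):
    """Return a function that applies each step to the text, in order."""

    def run(text):
        # Feed the result of each step into the next one.
        return reduce(lambda acc, step: step(acc), steps, text)

    return run


strip_and_lower = pipeline_sketch([str.strip, str.lower])
print(strip_and_lower("  Halo DUNIA  "))  # halo dunia
```

Because the steps run in list order, the order matters: for example, removing punctuation before stripping URLs would break URL detection.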

### Spell Correction

```python
from nahiarhdNLP.preprocessing import correct_spelling
print(correct_spelling("sya suka mkn nasi"))  # "saya suka makan nasi"
```
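
As a rough illustration of how vocabulary-based spell correction can work, the sketch below uses `difflib.get_close_matches` from the standard library against a tiny hand-written word list. The `VOCAB` list, the `correct_spelling_sketch` name, and the `cutoff` value are all assumptions for this example; the library's own algorithm and dictionary may be entirely different.

```python
from difflib import get_close_matches

# Tiny illustrative vocabulary (an assumption for this sketch).
VOCAB = ["saya", "suka", "makan", "nasi"]


def correct_spelling_sketch(text, vocab=VOCAB, cutoff=0.6):
    """Replace each unknown token with its closest vocabulary word, if any."""
    fixed = []
    for token in text.split():
        if token in vocab:
            fixed.append(token)
            continue
        # n=1 keeps only the best candidate at or above the similarity cutoff.
        candidates = get_close_matches(token, vocab, n=1, cutoff=cutoff)
        fixed.append(candidates[0] if candidates else token)
    return " ".join(fixed)


print(correct_spelling_sketch("sya suka mkn nasi"))
```

Tokens with no sufficiently similar vocabulary word are left unchanged, which keeps names and unknown terms from being rewritten.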

### Stemming

```python
from nahiarhdNLP.preprocessing import stem_text
print(stem_text("bermain-main dengan senang"))  # "main dengan senang"
```

---

## Requirements

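- Python 3.8+
- pandas, fsspec, huggingface_hub, sastrawi, datasets, rich (installed automatically by pip)
- Optional extras: `dev` (testing, formatting, and linting tools) and `docs` (Sphinx documentation tools)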

---

## Testing

```bash
pytest tests/
```

---

## Directory Structure

```
nahiarhdNLP/
├── main.py
├── requirements.txt
├── README.md
├── src/
│   ├── preprocessing/
│   └── mydatasets/
└── tests/
```

---

## Contribution

Contributions are welcome! Please fork the repository, create a new branch, and submit a pull request.

---

## License

MIT License

---

## Acknowledgments

- Stopwords dataset from HuggingFace
- Emoji dataset from HuggingFace
- Slang dataset from HuggingFace
- Sastrawi for Indonesian stemming
