Metadata-Version: 2.1
Name: advancedtextcleaner
Version: 0.1
Summary: A modular, fully-configurable NLP text cleaning function with 15+ toggleable steps.
Author: Archisman Das (a.k.a. CYBER ARCHIS OP)
Keywords: nlp text-cleaning preprocessing stemming lemmatization stopwords tokenization
Description-Content-Type: text/markdown
Requires-Dist: nltk
Requires-Dist: contractions

# AdvancedTextCleaner

A single, fully-configurable `AdvancedTextCleaner()` function for all your NLP preprocessing needs.  
Toggle any combination of 15+ cleaning steps — no pipeline boilerplate, no class inheritance, just one function call.

---

## Installation

```bash
pip install AdvancedTextCleaner
```

### NLTK data (first run only)

The package downloads required NLTK corpora automatically on first import. If you prefer to do it manually:

```python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
```

---

## Quick Start

```python
from AdvancedTextCleaner import AdvancedTextCleaner

text = "<p>Hello! Visit https://example.com 😊 Don't you love NLP? #AI @user</p>"

# Defaults — basic cleaning
AdvancedTextCleaner(text)
# → "hello visit don't you love nlp ai user"

# Full normalisation
AdvancedTextCleaner(text,
    expand_contractions=True,
    remove_stopwords=True,
    remove_emojis=True,
    remove_hashtags=True,
    remove_mentions=True,
    lemmatize=True
)
# → "hello visit love nlp"
```

---

## Features

- **15+ cleaning steps** — all individually toggled via keyword arguments
- **Three morphological reducers** — POS-aware lemmatization, Porter stemming, and Snowball stemming
- **Stopword control** — NLTK defaults + custom additions + a `keep_words` whitelist
- **Social media ready** — handles `@mentions`, `#hashtags`, emojis, and URLs
- **Sentiment-safe mode** — preserves `!`, `?`, `'` even when stripping all other punctuation
- **Stateless & pipeline-safe** — safe to use with `.apply()`, multiprocessing, and inference pipelines
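
The sentiment-safe mode can be pictured as a single character-class regex that spares `!`, `?`, and `'`. This is an illustrative sketch, not the package's actual pattern:

```python
import re

def strip_punct_keep_sentiment(text: str) -> str:
    # Remove punctuation but preserve !, ? and ' (sentiment markers).
    # Illustrative sketch only -- not the library's internal regex.
    return re.sub(r"[^\w\s!?']", "", text)

print(strip_punct_keep_sentiment("Wow!! Isn't this great? (yes...)"))
# Wow!! Isn't this great? yes
```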

---

## API Reference

```python
AdvancedTextCleaner(
    text: str,

    # Normalisation
    to_lowercase          = True,
    remove_accents        = False,
    expand_contractions   = False,

    # Noise removal
    remove_html           = True,
    remove_urls           = True,
    remove_emails         = True,
    remove_mentions       = False,
    remove_hashtags       = False,
    remove_numbers        = False,
    remove_punctuation    = True,
    remove_extra_spaces   = True,

    # Special characters
    keep_sentiment_markers = False,
    remove_emojis          = False,
    remove_special_chars   = False,

    # Stopwords
    remove_stopwords      = False,
    custom_stopwords      = None,   # set
    keep_words            = None,   # set

    # Morphological reduction
    lemmatize             = False,
    stem_porter           = False,
    stem_snowball         = False,
) -> str
```

### Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `to_lowercase` | `bool` | `True` | `"Hello"` → `"hello"` |
| `remove_accents` | `bool` | `False` | `"café"` → `"cafe"` |
| `expand_contractions` | `bool` | `False` | `"don't"` → `"do not"` |
| `remove_html` | `bool` | `True` | `<b>hi</b>` → `"hi"` |
| `remove_urls` | `bool` | `True` | Strips `http://`, `https://`, `www.` |
| `remove_emails` | `bool` | `True` | Strips `user@mail.com` |
| `remove_mentions` | `bool` | `False` | Strips `@username` |
| `remove_hashtags` | `bool` | `False` | Strips `#hashtag` |
| `remove_numbers` | `bool` | `False` | Strips digit sequences |
| `remove_punctuation` | `bool` | `True` | Strips `.,!?` etc. |
| `remove_extra_spaces` | `bool` | `True` | Collapses multiple spaces into one |
| `keep_sentiment_markers` | `bool` | `False` | Preserves `!`, `?`, `'` even when removing punctuation |
| `remove_emojis` | `bool` | `False` | Strips 😊🔥 etc. |
| `remove_special_chars` | `bool` | `False` | Keeps only `[a-zA-Z0-9 ]` — strictest mode |
| `remove_stopwords` | `bool` | `False` | Strips NLTK English stopwords |
| `custom_stopwords` | `set` | `None` | Extra domain-specific words to strip |
| `keep_words` | `set` | `None` | Whitelist — these words are **never** removed |
| `lemmatize` | `bool` | `False` | POS-aware WordNet lemmatization |
| `stem_porter` | `bool` | `False` | Porter Stemmer |
| `stem_snowball` | `bool` | `False` | Snowball Stemmer |

---

## Processing Order

Steps always execute in this fixed order:

```
1.  Expand contractions
2.  Lowercase
3.  Remove accents
4.  Strip HTML
5.  Remove URLs
6.  Remove emails
7.  Remove @mentions
8.  Remove #hashtags
9.  Remove emojis
10. Remove numbers
11. Remove punctuation / special chars
12. Normalise whitespace
13. Remove stopwords        ← token-level
14. Lemmatize / Stem        ← token-level
15. Final whitespace pass
```
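
A fixed-order pipeline like this amounts to a list of (flag, function) pairs applied in sequence. A minimal stdlib-only sketch with three of the steps above (regexes simplified; not the package's actual patterns):

```python
import re

def mini_clean(text, to_lowercase=True, remove_urls=True, remove_punctuation=True):
    # Steps run in the same fixed order regardless of which flags are set,
    # mirroring the pipeline above (reduced to three steps for illustration).
    steps = [
        (to_lowercase,       str.lower),
        (remove_urls,        lambda t: re.sub(r"https?://\S+|www\.\S+", "", t)),
        (remove_punctuation, lambda t: re.sub(r"[^\w\s]", "", t)),
        (True,               lambda t: re.sub(r"\s+", " ", t).strip()),  # final whitespace pass
    ]
    for enabled, fn in steps:
        if enabled:
            text = fn(text)
    return text

print(mini_clean("Visit https://example.com NOW!!"))
# visit now
```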

---

## Preset Recipes

### Sentiment Analysis

```python
AdvancedTextCleaner(text,
    expand_contractions=True,
    keep_sentiment_markers=True,   # preserve ! ?
    remove_stopwords=False,        # keep "not", "never"
    remove_emojis=False,           # emojis carry sentiment
)
```

### Topic Modelling / TF-IDF

```python
AdvancedTextCleaner(text,
    remove_stopwords=True,
    lemmatize=True,
    remove_numbers=True,
    remove_emojis=True,
)
```

### Bag-of-Words / Classical ML

```python
AdvancedTextCleaner(text,
    remove_accents=True,
    expand_contractions=True,
    remove_numbers=True,
    remove_punctuation=True,
    remove_stopwords=True,
    remove_emojis=True,
    stem_porter=True,
)
```

### Social Media Text

```python
AdvancedTextCleaner(text,
    remove_mentions=True,
    remove_hashtags=True,
    remove_emojis=True,
    expand_contractions=True,
    keep_sentiment_markers=True,
)
```

### pandas DataFrame

```python
import pandas as pd
from AdvancedTextCleaner import AdvancedTextCleaner

df['clean'] = df['text'].apply(lambda x: AdvancedTextCleaner(x,
    remove_stopwords=True,
    lemmatize=True
))
```
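
Because the function is stateless, the same pattern extends to concurrent execution. A sketch using a placeholder cleaner (stand-in for an `AdvancedTextCleaner` call, so the snippet runs without the package installed):

```python
from concurrent.futures import ThreadPoolExecutor

def clean(text: str) -> str:
    # Placeholder for AdvancedTextCleaner(text, remove_stopwords=True, ...).
    # Stateless: concurrent calls cannot interfere with each other.
    return text.lower().strip()

texts = ["  Hello World  ", "NLP IS FUN"]
with ThreadPoolExecutor() as pool:
    cleaned = list(pool.map(clean, texts))  # map preserves input order
print(cleaned)
# ['hello world', 'nlp is fun']
```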

---

## Priority Rules

| Conflicting flags | Winner |
|---|---|
| `lemmatize=True` + `stem_porter=True` | `lemmatize` |
| `lemmatize=True` + `stem_snowball=True` | `lemmatize` |
| `stem_porter=True` + `stem_snowball=True` | `stem_porter` |
| `remove_special_chars=True` + `remove_punctuation=True` | `remove_special_chars` (stricter) |
| `keep_sentiment_markers=True` + `remove_punctuation=True` | `!`, `?`, `'` are kept |
| `keep_words={'not'}` + `remove_stopwords=True` | `"not"` is never removed |
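
The reducer rows of the table boil down to a simple resolution order. A hypothetical sketch of that logic (function name is illustrative, not part of the API):

```python
def pick_reducer(lemmatize=False, stem_porter=False, stem_snowball=False):
    # Resolution order from the priority table:
    # lemmatize beats both stemmers, Porter beats Snowball.
    if lemmatize:
        return "lemmatize"
    if stem_porter:
        return "porter"
    if stem_snowball:
        return "snowball"
    return None

print(pick_reducer(lemmatize=True, stem_porter=True))      # 'lemmatize'
print(pick_reducer(stem_porter=True, stem_snowball=True))  # 'porter'
```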

---

## Tips

**Protect negations in sentiment tasks** — `"not"`, `"no"`, `"never"` flip sentiment entirely. Whitelist them:

```python
AdvancedTextCleaner(text,
    remove_stopwords=True,
    keep_words={'not', 'no', 'never', "n't"}
)
```

**Expand contractions when removing stopwords** — the pipeline always runs contraction expansion first, but only if `expand_contractions=True`; without it, `"don't"` may slip past the stopword list.

**Lemmatize vs Stem** — `lemmatize` produces real dictionary words (`running` → `run`); stemmers are faster but rougher and can emit non-words (`studies` → `studi`). Use lemmatization for interpretable output, stemming for speed.
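
The roughness of stemming is easy to see with a toy suffix-stripper. This is a crude stand-in, not NLTK's Porter algorithm (which is far more careful), but it shares the defining trait of emitting non-words:

```python
def toy_stem(word: str) -> str:
    # Crude suffix stripping -- illustrates why stemmers can emit non-words.
    for suffix in ("ies", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(toy_stem("studies"))  # 'stud' -- not a dictionary word
print(toy_stem("running"))  # 'runn'
```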

---

## Dependencies

| Package | Purpose |
|---|---|
| `nltk` | Tokenization, stopwords, lemmatization, stemming |
| `contractions` | Expanding English contractions |

---
