Metadata-Version: 2.4
Name: farsflow
Version: 0.1.0
Summary: Fast, modern, and modular Persian text preprocessing library.
Author-email: Mahdi Hosseini <mhossza@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/mhhoss/farsflow
Project-URL: Documentation, https://github.com/mhhoss/farsflow
Project-URL: Source, https://github.com/mhhoss/farsflow
Project-URL: Issues, https://github.com/mhhoss/farsflow/issues
Keywords: farsi,Persian,preprocessing,nlp,text-cleaning,normalization,tokenization,machine-learning,llm
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest>=8.3.5; extra == "dev"
Dynamic: license-file

# farsflow

farsflow is a small library for **Persian text preprocessing**, designed for practical NLP and LLM workflows.

---

## 🚀 Features

- Deterministic pipeline (same input -> same output)
- Safe character normalization
- Joiner (ZWNJ) fixes
- Whitespace & punctuation cleanup
- Modular processors (use only the components you need)
- Tests built from real-world-style sample text
- Zero dependencies

---

## 📦 Installation

```bash
pip install farsflow
```

## ✨ Quick Start

```python
import farsflow as ff

text = "سلام  دنیا!  این یك   تست است  که می نویسم  ۴۵۶"
cleaned = ff.clean(text)
print(cleaned)
```

Expected output: `سلام دنیا! این یک تست است که می‌نویسم 456`

---

## 🧩 Pipeline Components

farsflow ships with a set of modular, composable components:

- **Normalizer** — safe character normalization
- **JoinerFixer** — fixes ZWNJ usage without over-correction
- **SpaceCleaner** — trims redundant whitespace and punctuation spacing
- **Pipeline** — orchestrates components in a deterministic order
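
To make the roles of these components more concrete, here is a rough standard-library sketch of the kind of work each one does. It is not farsflow's implementation: the function names, the character map, the punctuation set, and the `می` prefix rule are all assumptions chosen for illustration.

```python
import re

ZWNJ = "\u200c"      # zero-width non-joiner
MI = "\u0645\u06cc"  # the verbal prefix "می" (assumed rule, for illustration)

# Assumed mapping: Arabic code points to their Persian counterparts.
CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
})

# Assumed punctuation set: ! . : plus the Persian question mark, comma, semicolon.
PUNCT = "!.:\u061f\u060c\u061b"

JOINER_RE = re.compile(r"(?<!\S)(" + MI + r")[ \t]+(?=\S)")
SPACE_BEFORE_PUNCT_RE = re.compile(" +([" + re.escape(PUNCT) + "])")


def normalize_chars(text: str) -> str:
    """Map look-alike Arabic letters to Persian ones."""
    return text.translate(CHAR_MAP)


def fix_joiners(text: str) -> str:
    """Replace the space after a standalone "می" token with a ZWNJ."""
    return JOINER_RE.sub(lambda m: m.group(1) + ZWNJ, text)


def clean_spaces(text: str) -> str:
    """Collapse whitespace runs and drop spaces before punctuation."""
    text = re.sub(r"\s+", " ", text)  # ZWNJ is not whitespace, so it survives
    text = SPACE_BEFORE_PUNCT_RE.sub(r"\1", text)
    return text.strip()


def clean_text(text: str) -> str:
    """Apply the three steps in a fixed order."""
    return clean_spaces(fix_joiners(normalize_chars(text)))


# Arabic kaf, extra spaces, a detached prefix, and a space before "!".
print(clean_text("\u0643\u062a\u0627\u0628   \u0645\u06cc \u062e\u0648\u0627\u0646\u0645 !"))
```

A single-prefix rule like this one would over-correct on real text (for example, when `می` is a noun), which is the kind of case a dedicated joiner component has to handle more carefully.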

You can customize the pipeline:

```python
from farsflow import Pipeline, Normalizer, JoinerFixer

pipeline = Pipeline(
    Normalizer(),
    JoinerFixer(),
)

text = "متن   تستی"
print(pipeline(text))
```
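
For readers curious how this style of composition works in general, the sketch below shows one minimal way to chain text processors. `MiniPipeline` and `collapse_spaces` are hypothetical names invented for this example; farsflow's own `Pipeline` may be built differently.

```python
from typing import Callable

Step = Callable[[str], str]


class MiniPipeline:
    """Hypothetical stand-in: applies each step to the text, in order."""

    def __init__(self, *steps: Step) -> None:
        self.steps = steps

    def __call__(self, text: str) -> str:
        for step in self.steps:
            text = step(text)
        return text


def collapse_spaces(text: str) -> str:
    """Replace every run of whitespace with a single space."""
    return " ".join(text.split())


# Any callable that takes a str and returns a str can be a step.
pipeline = MiniPipeline(str.strip, collapse_spaces)
print(pipeline("  a   b  "))
```

Because each step is a pure function of its input and the order is fixed at construction time, the same input always produces the same output.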

---

## 🧪 Testing

```bash
pytest
# or:
pytest path/to/test_file.py
```
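
A test for the "same input -> same output" guarantee might look like the sketch below. The `clean` function here is a local stand-in so the snippet runs on its own; in a real test you would presumably call `ff.clean` instead. The test names and samples are illustrative, and the idempotence check is a common companion property rather than something this README promises.

```python
def clean(text: str) -> str:
    # Stand-in for ff.clean (assumption): collapse runs of whitespace.
    return " ".join(text.split())


SAMPLES = ["a  b", "  x   y  z ", ""]


def test_same_input_same_output():
    # Calling the cleaner twice on the same input must give equal results.
    for sample in SAMPLES:
        assert clean(sample) == clean(sample)


def test_idempotent():
    # Cleaning already-clean text should leave it unchanged.
    for sample in SAMPLES:
        once = clean(sample)
        assert clean(once) == once
```

pytest collects functions whose names start with `test_`, so saving this as `test_determinism.py` (a hypothetical filename) is enough for it to be picked up.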

---

## 🗺 Roadmap (v0.2.0)

- [ ] Formalize the behavior of `ff.clean` as a safe, deterministic baseline
- [ ] Add optional normalization controls (e.g. digit normalization)
- [ ] Add optional noise-cleaning processors for emoji and URLs
- [ ] Introduce simple profiles (`ff.llm.clean`, `ff.embedding.query`, `ff.embedding.index`) built on the same core
- [ ] Expand real-world test cases to ensure stable behavior across informal, mixed, and noisy Persian text

---

## 📄 License

MIT License — see [LICENSE](LICENSE).

---

## 🤝 Contributing

Contributions are welcome.  
Please open an issue or submit a pull request on GitHub.

---

## 📝 Changelog

See [CHANGELOG](CHANGELOG) for version history.
