Metadata-Version: 2.4
Name: pydatamax
Version: 0.2.1
Summary: Advanced Data Crawling and Processing Framework
Home-page: https://github.com/Hi-Dolphin/datamax
Author: ccy
Author-email: DataMax Team <cy.kron@foxmail.com>
Maintainer: DataMax Team
Maintainer-email: DataMax Team <cy.kron@foxmail.com>
License: MIT
Project-URL: Homepage, https://github.com/Hi-Dolphin/datamax
Project-URL: Documentation, https://github.com/Hi-Dolphin/datamax/docs
Project-URL: Repository, https://github.com/Hi-Dolphin/datamax
Project-URL: Bug Reports, https://github.com/Hi-Dolphin/datamax/issues
Project-URL: Source, https://github.com/Hi-Dolphin/datamax
Keywords: crawler,scraping,data-processing,arxiv,web-scraping,data-extraction,parsing,async,cli,framework,academic-papers,research,automation,data-collection,file-conversion,document-processing
Platform: any
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Information Technology
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Utilities
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Operating System :: OS Independent
Classifier: Operating System :: POSIX
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: MacOS
Classifier: Environment :: Console
Classifier: Environment :: Web Environment
Classifier: Framework :: AsyncIO
Classifier: Natural Language :: English
Classifier: Natural Language :: Chinese (Simplified)
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: oss2<3.0.0,>=2.19.1
Requires-Dist: aliyun-python-sdk-core<3.0.0,>=2.16.0
Requires-Dist: aliyun-python-sdk-kms<3.0.0,>=2.16.5
Requires-Dist: crcmod<2.0.0,>=1.7
Requires-Dist: langdetect<2.0.0,>=1.0.9
Requires-Dist: loguru<1.0.0,>=0.7.3
Requires-Dist: python-docx<2.0.0,>=1.1.2
Requires-Dist: python-dotenv<2.0.0,>=1.1.0
Requires-Dist: pymupdf<2.0.0,>=1.24.14
Requires-Dist: pypdf<6.0.0,>=5.5.0
Requires-Dist: openpyxl<4.0.0,>=3.1.5
Requires-Dist: pandas<3.0.0,>=2.2.3
Requires-Dist: numpy<3.0.0,>=2.2.6
Requires-Dist: requests<3.0.0,>=2.32.3
Requires-Dist: defusedxml<1.0.0,>=0.7.1
Requires-Dist: tqdm<5.0.0,>=4.67.1
Requires-Dist: pydantic<3.0.0,>=2.10.6
Requires-Dist: pydantic-settings<3.0.0,>=2.9.1
Requires-Dist: python-magic<1.0.0,>=0.4.27
Requires-Dist: PyYAML<7.0.0,>=6.0.2
Requires-Dist: Pillow<12.0.0,>=11.2.1
Requires-Dist: packaging<25.0,>=24.2
Requires-Dist: beautifulsoup4<5.0.0,>=4.13.4
Requires-Dist: minio<8.0.0,>=7.2.15
Requires-Dist: openai<2.0.0,>=1.82.0
Requires-Dist: jionlp<2.0.0,>=1.5.23
Requires-Dist: chardet<6.0.0,>=5.2.0
Requires-Dist: python-pptx<2.0.0,>=1.0.2
Requires-Dist: tiktoken<1.0.0,>=0.9.0
Requires-Dist: markitdown<1.0.0,>=0.1.1
Requires-Dist: xlrd<3.0.0,>=2.0.1
Requires-Dist: tabulate<1.0.0,>=0.9.0
Requires-Dist: unstructured<1.0.0,>=0.17.2
Requires-Dist: markdown<4.0.0,>=3.8
Requires-Dist: langchain<1.0.0,>=0.3.0
Requires-Dist: langchain-community<1.0.0,>=0.3.0
Requires-Dist: langchain-text-splitters<1.0.0,>=0.3.0
Requires-Dist: ebooklib==0.19
Requires-Dist: setuptools
Requires-Dist: aiohttp>=3.8.0
Requires-Dist: click>=8.0.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: python-dateutil>=2.8.0
Requires-Dist: typing-extensions>=4.0.0
Requires-Dist: pytest>=7.0.0
Requires-Dist: pytest-asyncio>=0.21.0
Requires-Dist: pytest-cov>=4.0.0
Requires-Dist: pytest-mock>=3.10.0
Requires-Dist: pytest-timeout>=2.1.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10.0; extra == "dev"
Requires-Dist: pytest-timeout>=2.1.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: isort>=5.10.0; extra == "dev"
Requires-Dist: flake8>=5.0.0; extra == "dev"
Requires-Dist: pre-commit>=2.20.0; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: pytest-mock>=3.10.0; extra == "test"
Requires-Dist: pytest-timeout>=2.1.0; extra == "test"
Requires-Dist: aioresponses>=0.7.0; extra == "test"
Provides-Extra: docs
Requires-Dist: sphinx>=5.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.2.0; extra == "docs"
Requires-Dist: myst-parser>=0.18.0; extra == "docs"
Requires-Dist: sphinx-autodoc-typehints>=1.19.0; extra == "docs"
Provides-Extra: all
Requires-Dist: pytest>=7.0.0; extra == "all"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "all"
Requires-Dist: pytest-cov>=4.0.0; extra == "all"
Requires-Dist: pytest-mock>=3.10.0; extra == "all"
Requires-Dist: pytest-timeout>=2.1.0; extra == "all"
Requires-Dist: black>=22.0.0; extra == "all"
Requires-Dist: isort>=5.10.0; extra == "all"
Requires-Dist: flake8>=5.0.0; extra == "all"
Requires-Dist: pre-commit>=2.20.0; extra == "all"
Requires-Dist: aioresponses>=0.7.0; extra == "all"
Requires-Dist: sphinx>=5.0.0; extra == "all"
Requires-Dist: sphinx-rtd-theme>=1.2.0; extra == "all"
Requires-Dist: myst-parser>=0.18.0; extra == "all"
Requires-Dist: sphinx-autodoc-typehints>=1.19.0; extra == "all"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: maintainer
Dynamic: platform
Dynamic: requires-python

# DataMax

<div align="center">

[中文](README_zh.md) | **English**

[![PyPI version](https://badge.fury.io/py/pydatamax.svg)](https://badge.fury.io/py/pydatamax) [![Python](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

</div>

**Documentation Portal:** https://hi-dolphin.github.io/datamax

A powerful multi-format file parsing, data cleaning, and AI annotation toolkit built for modern Python applications.

## ✨ Key Features

- 🔄 **Multi-format Support**: PDF, DOCX/DOC, PPT/PPTX, XLS/XLSX, HTML, EPUB, TXT, images, and more
- 🧹 **Intelligent Cleaning**: Advanced data cleaning with anomaly detection, privacy protection, and text filtering
- 🤖 **AI Annotation**: LLM-powered automatic annotation and QA generation
- ⚡ **High Performance**: Efficient batch processing with caching and parallel execution
- 🎯 **Developer Friendly**: Modern SDK design with type hints, configuration management, and comprehensive error handling
- ☁️ **Cloud Ready**: Built-in support for OSS, MinIO, and other cloud storage providers

## 🚀 Quick Start

### Install

```bash
pip install pydatamax
```

### Examples

```python
from datamax import DataMax

# Input files and LLM settings (replace the placeholders with your own values)
FILE_PATHS = ["/your/file/path/1.md", "/your/file/path/2.doc", "/your/file/path/3.xlsx"]
LABEL_LLM_API_KEY = "YOUR_API_KEY"
LABEL_LLM_BASE_URL = "YOUR_BASE_URL"
LABEL_LLM_MODEL_NAME = "YOUR_MODEL_NAME"
LLM_TRAIN_OUTPUT_FILE_NAME = "train"

# Initialize the client
client = DataMax(file_path=FILE_PATHS)

# Parse the files
data = client.get_data()

# Extract the parsed content
content = data.get("content")

# Generate pre-labeled data; returns a list of trainable QA pairs
qa_list = client.get_pre_label(
    content=content,
    api_key=LABEL_LLM_API_KEY,
    base_url=LABEL_LLM_BASE_URL,
    model_name=LABEL_LLM_MODEL_NAME,
    question_number=50,  # number of questions generated per chunk
    max_qps=100.0,
    debug=False,
    structured_data=True,  # enable structured output
    auto_self_review_mode=True,  # auto-review QA pairs: keep scores 4 and 5, drop scores 1 to 3
    review_max_qps=100.0,
)

# Save the labeled data
client.save_label_data(qa_list, LLM_TRAIN_OUTPUT_FILE_NAME)
```
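
The `auto_self_review_mode` comment in the example describes a score-based filter: QA pairs scored 4 or 5 are kept, and those scored 1 to 3 are dropped. The following is a minimal sketch of that rule only, assuming each reviewed QA pair is a dict with a numeric `score` field. The helper `filter_reviewed_qa`, the `score` key, and the sample data are hypothetical and are not part of the DataMax API; the library's internal review logic may differ.

```python
from typing import Any

# Assumption: scores of 4 and 5 pass, as stated in the example's comment.
PASS_SCORE = 4


def filter_reviewed_qa(
    qa_pairs: list[dict[str, Any]], min_score: int = PASS_SCORE
) -> list[dict[str, Any]]:
    """Keep only QA pairs whose review score is at least min_score.

    Pairs without a "score" key (hypothetical field name) are dropped.
    """
    return [pair for pair in qa_pairs if pair.get("score", 0) >= min_score]


# Hypothetical sample data, for illustration only.
sample = [
    {"question": "What is DataMax?", "answer": "A parsing toolkit.", "score": 5},
    {"question": "Unclear?", "answer": "Unclear.", "score": 2},
    {"question": "Which formats?", "answer": "PDF, DOCX, and more.", "score": 4},
]

kept = filter_reviewed_qa(sample)
print(len(kept))
```

A stricter filter could pass `min_score=5` to keep only the top-rated pairs.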

## 📚 Documentation

- See docs: `docs/index.md`
- Sections: Getting Started, Parsing, Cleaning, Labeling, Crawling, Evaluation, CLI, API, Extending, FAQ
- For the complete text-modal QA generation pipeline, see [examples/scripts/generate_qa.py](examples/scripts/generate_qa.py)

## 🤝 Contributing

Issues and Pull Requests are welcome!

## 📄 License

This project is licensed under the [MIT License](LICENSE).

## 📞 Contact Us

- 📧 Email: cy.kron@foxmail.com, wang.xiangyuxy@outlook.com
- 🐛 Issues: [GitHub Issues](https://github.com/Hi-Dolphin/datamax/issues)
- 📚 Documentation: [Project Homepage](https://github.com/Hi-Dolphin/datamax)
- 💬 WeChat Group: <br><img src='wechat.jpg' width=300>

---

⭐ If this project helps you, please give us a star!
