Metadata-Version: 2.4
Name: pydatamax
Version: 0.1.24
Summary: A library for parsing and converting various file formats.
Home-page: https://github.com/Hi-Dolphin/datamax
Author: ccy
Author-email: cy.kron@foxmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: oss2<3.0.0,>=2.19.1
Requires-Dist: aliyun-python-sdk-core<3.0.0,>=2.16.0
Requires-Dist: aliyun-python-sdk-kms<3.0.0,>=2.16.5
Requires-Dist: crcmod<2.0.0,>=1.7
Requires-Dist: langdetect<2.0.0,>=1.0.9
Requires-Dist: loguru<1.0.0,>=0.7.3
Requires-Dist: python-docx<2.0.0,>=1.1.2
Requires-Dist: python-dotenv<2.0.0,>=1.1.0
Requires-Dist: pymupdf<2.0.0,>=1.24.14
Requires-Dist: pypdf<6.0.0,>=5.5.0
Requires-Dist: openpyxl<4.0.0,>=3.1.5
Requires-Dist: pandas<3.0.0,>=2.2.3
Requires-Dist: numpy<3.0.0,>=2.2.6
Requires-Dist: requests<3.0.0,>=2.32.3
Requires-Dist: tqdm<5.0.0,>=4.67.1
Requires-Dist: pydantic<3.0.0,>=2.10.6
Requires-Dist: pydantic-settings<3.0.0,>=2.9.1
Requires-Dist: python-magic<1.0.0,>=0.4.27
Requires-Dist: PyYAML<7.0.0,>=6.0.2
Requires-Dist: Pillow<12.0.0,>=11.2.1
Requires-Dist: packaging<25.0,>=24.2
Requires-Dist: beautifulsoup4<5.0.0,>=4.13.4
Requires-Dist: minio<8.0.0,>=7.2.15
Requires-Dist: openai<2.0.0,>=1.82.0
Requires-Dist: jionlp<2.0.0,>=1.5.23
Requires-Dist: chardet<6.0.0,>=5.2.0
Requires-Dist: python-pptx<2.0.0,>=1.0.2
Requires-Dist: tiktoken<1.0.0,>=0.9.0
Requires-Dist: markitdown<1.0.0,>=0.1.1
Requires-Dist: xlrd<3.0.0,>=2.0.1
Requires-Dist: tabulate<1.0.0,>=0.9.0
Requires-Dist: unstructured<1.0.0,>=0.17.2
Requires-Dist: markdown<4.0.0,>=3.8
Requires-Dist: langchain<1.0.0,>=0.3.0
Requires-Dist: langchain-community<1.0.0,>=0.3.0
Requires-Dist: langchain-text-splitters<1.0.0,>=0.3.0
Requires-Dist: ebooklib==0.19
Requires-Dist: setuptools
Requires-Dist: transformers==4.53.1
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# DataMax

<div align="center">

[中文](README_zh.md) | **English**

[![PyPI version](https://badge.fury.io/py/pydatamax.svg)](https://badge.fury.io/py/pydatamax) [![Python](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

</div>

DataMax is a Python toolkit for multi-format file parsing, data cleaning, and AI annotation.

## ✨ Key Features

- 🔄 **Multi-format Support**: PDF, DOCX/DOC, PPT/PPTX, XLS/XLSX, HTML, EPUB, TXT, images, and more
- 🧹 **Intelligent Cleaning**: Advanced data cleaning with anomaly detection, privacy protection, and text filtering
- 🤖 **AI Annotation**: LLM-powered automatic annotation and QA generation
- ⚡ **High Performance**: Efficient batch processing with caching and parallel execution
- 🎯 **Developer Friendly**: Modern SDK design with type hints, configuration management, and comprehensive error handling
- ☁️ **Cloud Ready**: Built-in support for Alibaba Cloud OSS, MinIO, and other cloud storage providers

## 🚀 Quick Start

### Install

```bash
pip install pydatamax
```
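
The package metadata requires Python 3.10 or newer. As a minimal sketch, the check below uses only the standard library to confirm the interpreter version and report which `pydatamax` version, if any, is installed. The helper name `check_environment` is hypothetical and not part of DataMax.

```python
import sys
from importlib import metadata


def check_environment(dist_name: str = "pydatamax", min_python: tuple = (3, 10)) -> dict:
    """Report interpreter compatibility and the installed version of dist_name.

    Hypothetical helper for illustration; not part of the DataMax API.
    """
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        installed = None
    return {
        "python_ok": sys.version_info >= min_python,
        "installed_version": installed,
    }


if __name__ == "__main__":
    print(check_environment())
```

If `installed_version` is `None`, the install step above has not been run in the active environment.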

### Examples

```python
from datamax import DataMax

# Input files and LLM settings (replace the placeholders with your own values)
FILE_PATHS = ["/your/file/path/1.md", "/your/file/path/2.doc", "/your/file/path/3.xlsx"]
LABEL_LLM_API_KEY = "YOUR_API_KEY"
LABEL_LLM_BASE_URL = "YOUR_BASE_URL"
LABEL_LLM_MODEL_NAME = "YOUR_MODEL_NAME"
LLM_TRAIN_OUTPUT_FILE_NAME = "train"

# Initialize the client with the files to parse
client = DataMax(file_path=FILE_PATHS)

# Generate pre-labeled data; returns a list of QA pairs ready for training
qa_list = client.get_pre_label(
    api_key=LABEL_LLM_API_KEY,
    base_url=LABEL_LLM_BASE_URL,
    model_name=LABEL_LLM_MODEL_NAME,
    question_number=10,
    max_workers=5)

# Save the labeled data under the given output name
client.save_label_data(qa_list, LLM_TRAIN_OUTPUT_FILE_NAME)
```
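
The example above hard-codes the API key as a string. One possible variation, sketched here under assumptions, reads the same three settings from environment variables so that credentials stay out of source files. The variable names and the helpers `load_label_config` and `run_labeling` are hypothetical. Only `DataMax(file_path=...)`, `get_pre_label(...)`, and `save_label_data(...)` come from the example above.

```python
import os


def load_label_config(env=None) -> dict:
    """Collect LLM settings from environment variables.

    The variable names are assumptions chosen to mirror the quick-start
    constants; DataMax itself does not define them.
    """
    env = os.environ if env is None else env
    required = {
        "api_key": "LABEL_LLM_API_KEY",
        "base_url": "LABEL_LLM_BASE_URL",
        "model_name": "LABEL_LLM_MODEL_NAME",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[var] for key, var in required.items()}


def run_labeling(file_paths, output_name="train"):
    """Run the same calls as the quick-start example, with settings from the environment."""
    # Imported lazily so load_label_config can be used without pydatamax installed.
    from datamax import DataMax

    client = DataMax(file_path=file_paths)
    qa_list = client.get_pre_label(
        question_number=10,
        max_workers=5,
        **load_label_config(),
    )
    client.save_label_data(qa_list, output_name)
    return qa_list
```

Since `python-dotenv` is already a dependency, calling `load_dotenv()` from the `dotenv` module before `run_labeling(...)` would let these variables live in a local `.env` file.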


## 🤝 Contributing

Issues and Pull Requests are welcome!

## 📄 License

This project is licensed under the [MIT License](LICENSE).

## 📞 Contact Us

- 📧 Email: cy.kron@foxmail.com
- 🐛 Issues: [GitHub Issues](https://github.com/Hi-Dolphin/datamax/issues)
- 📚 Documentation: [Project Homepage](https://github.com/Hi-Dolphin/datamax)
- 💬 WeChat Group: <br><img src='wechat.jpg' width=300>

---

⭐ If this project helps you, please give us a star!
