Metadata-Version: 2.1
Name: porosdata-processor
Version: 0.2.4
Summary: Academic document intelligent cleaning pipeline for AI for Science, ensuring MinerU parsed data meets LLM input standards
Author-email: Kivent YE <72405514@cityu-dg.edu.cn>
Maintainer-email: Kivent YE <72405514@cityu-dg.edu.cn>
License: MIT
Project-URL: Documentation, https://porosdata-doc.readthedocs.io/en/latest/
Keywords: text,cleaning,latex,greek,preprocessing,nlp,llm,token-optimization
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Text Processing
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: charset-normalizer>=3.0
Requires-Dist: transformers>=4.21.0
Requires-Dist: ijson>=3.0.0
Requires-Dist: tiktoken>=0.5.0
Requires-Dist: psutil>=5.9.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: flake8>=6.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Provides-Extra: encoding
Requires-Dist: charset-normalizer>=3.0; extra == "encoding"

# PorosData-Processor

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) 

**Academic Literature Data Engineering Toolkit** - Cleans MinerU output for scientific literature and evaluates token efficiency for LLMs, in support of end-to-end "AI for Science" workflows.

## 🎯 Project Positioning

In AI for Science, high-quality preprocessing of academic data is the foundation for models that understand scientific literature. PorosData-Processor is a data engineering tool for the "last mile" between MinerU document parsing and LLM input: it reduces the token count of academic documents while preserving their content.

## 🌟 Core Features

- **🔬 Professional Academic Cleaning**: Handles LaTeX formulas, control characters, citation formats, and other issues specific to academic documents
- **⚡ Multi-process Parallel Processing**: Cross-platform concurrent processing built on `pathlib` and the `spawn` start method, supporting Windows and Linux
- **📊 Real-time Token Evaluation**: Integrated GPT-2 tokenizer for calculating the token compression rate
- **🛡️ Intelligent Protection Mechanism**: A 20% compression-rate threshold protects the semantic integrity of the text
- **🔄 Self-healing Quality Assurance**: Automatically detects and repairs data corruption during processing
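
The parallel-processing feature can be pictured with a minimal sketch. It is not the project's actual code: `find_json_files`, `count_items`, and `run_parallel` are hypothetical names, and the per-file worker only counts items instead of cleaning them. What it does show is the combination named above, `pathlib` for file discovery and an explicitly requested `spawn` start method so that Linux behaves like Windows.

```python
import json
import multiprocessing as mp
from pathlib import Path


def find_json_files(input_dir):
    """Return the JSON files directly under input_dir, sorted for stable ordering."""
    root = Path(input_dir)
    return sorted(root.glob("*.json")) if root.is_dir() else []


def count_items(path):
    """Hypothetical per-file worker: load one JSON file and report its item count."""
    path = Path(path)
    with path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    n_items = len(data) if isinstance(data, list) else 1
    return path.name, n_items


def run_parallel(input_dir, workers=2):
    """Fan the files out to a pool of worker processes started with 'spawn'."""
    files = find_json_files(input_dir)
    if not files:
        return []
    # 'spawn' is the only start method available on Windows; requesting it
    # explicitly gives the same behavior on Linux.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=workers) as pool:
        return pool.map(count_items, files)


if __name__ == "__main__":
    # The guard matters with 'spawn': child processes re-import this module.
    for name, n_items in run_parallel("data/mineru_output_raw_data"):
        print(f"{name}: {n_items} item(s)")
```

Because `spawn` starts each worker in a fresh interpreter, the worker function has to live at module level so it can be pickled and re-imported.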

## 🚀 Quick Start

### Environment Requirements

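```bash
# Requires Python 3.8 or later.
# Runtime dependencies (charset-normalizer, transformers, ijson, tiktoken, psutil)
# are declared in the package metadata, so pip installs them automatically.
pip install porosdata-processor
```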

### Process MinerU Data

```bash
# Process all JSON files in the data/mineru_output_raw_data directory.
# The HF_ENDPOINT mirror is configured automatically to speed up downloads for users in China.
python run_processor.py --enable-evaluation
```

### Output Data Format

Processed JSON files contain standardized fields:

- **`text`**: Cleaned academic text, suitable for LLM input
- **`original_text`**: Original input text, for quality comparison
- **`healed_count`**: Self-healing repair count, reflecting data quality
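
As a rough illustration of how these fields might be consumed, the sketch below computes a per-record token compression rate from `text` and `original_text` and totals `healed_count`. Several things here are assumptions rather than documented behavior: that an output file is a JSON array of records, that the rate is cleaned tokens divided by original tokens (which is consistent with a rate of 0.098 corresponding to 90.2% savings), and the helper names themselves. A whitespace tokenizer stands in for the GPT-2 tokenizer so the example has no external dependencies, and the sample records are invented.

```python
import json
from pathlib import Path


def load_records(path):
    """Load processed records; assumes the file holds a JSON array of objects."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def whitespace_token_count(text):
    """Stand-in tokenizer that counts whitespace-separated pieces."""
    return len(text.split())


def compression_rate(record, count_tokens=whitespace_token_count):
    """Cleaned token count divided by original token count (lower means more savings)."""
    original = count_tokens(record["original_text"])
    cleaned = count_tokens(record["text"])
    return cleaned / original if original else 1.0


def summarize(records, count_tokens=whitespace_token_count):
    """Aggregate item count, total repairs, and mean compression rate."""
    rates = [compression_rate(r, count_tokens) for r in records]
    return {
        "items": len(records),
        "healed_total": sum(r.get("healed_count", 0) for r in records),
        "avg_compression_rate": sum(rates) / len(rates) if rates else 0.0,
    }


# Invented sample records, for illustration only.
sample = [
    {
        "text": "The band gap is $E_{g}$ = 1.1 eV",
        "original_text": "The band gap is $ E _ { g } $ = 1.1 eV",
        "healed_count": 1,
    },
    {
        "text": "Samples were annealed at 450 K",
        "original_text": "Samples were annealed at 450 K",
        "healed_count": 0,
    },
]
print(summarize(sample))
```

To count real GPT-2 tokens, a callable such as `lambda s: len(tiktoken.get_encoding("gpt2").encode(s))` could be passed as `count_tokens`, since `tiktoken` is already a declared dependency.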

## 📈 Performance Showcase

### Materials Science Data Pilot Test Results

| Test Metric                    | Value     | Description                                                    |
| ------------------------------ | --------- | -------------------------------------------------------------- |
| **Files Processed**            | 3 files   | MinerU-parsed academic paper JSON files                        |
| **Items Processed**            | 127 items | Including text, formulas, tables, and other structured content |
| **Avg Token Compression Rate** | 0.098     | **90.2% token savings**                                        |
| **Processing Time**            | 0.456 s   | Multi-process parallel processing efficiency                   |
| **Memory Peak**                | 204.2 MB  | Streaming processing ensures memory efficiency                 |

**Key Insights**:

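- The average token compression rate of 0.098 corresponds to a 90.2% reduction in tokens, which directly lowers LLM inference costs
- Throughput in this pilot was 278.5 items/second (127 items in 0.456 s), which suggests the pipeline is suitable for large-scale academic data processing
- Peak memory usage was 204.2 MB; streaming processing is intended to keep memory bounded for much larger (TB-level) datasets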
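
The derived figures follow from the table values by simple arithmetic. The short check below reproduces them, assuming the savings percentage is defined as one minus the compression rate:

```python
# Values taken from the pilot test table above.
items = 127
seconds = 0.456
compression_rate = 0.098

throughput = items / seconds                 # items processed per second
savings_pct = (1 - compression_rate) * 100   # share of tokens removed, in percent

print(f"{throughput:.1f} items/second")      # prints "278.5 items/second"
print(f"{savings_pct:.1f}% token savings")   # prints "90.2% token savings"
```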

## 🛠️ Academic Tools Suite

The project includes a complete academic data processing toolchain:

### Core Processing Scripts

- **Batch Processing**: `academic_tools/standalone/batch_process.py`
- **Single File Processing**: `academic_tools/standalone/process_single_json.py`
- **Compatibility Processing**: `academic_tools/standalone/process_with_cleanlit.py`

### Advanced Configuration Options

```bash
# Custom input/output directories
python run_processor.py --input-dir ./data/input --output-dir ./data/output

# Basic cleaning only (no Token evaluation)
python run_processor.py

# Force reprocessing of all files
python run_processor.py --enable-evaluation --force-reprocess
```

## 📚 Technical Documentation

- **[Architecture Design](docs/architecture.md)** - Core components and implementation principles
- **[Usage Guide](docs/usage_guide.md)** - Detailed API and configuration instructions
- **[Testing Guide](docs/guides/TESTING_GUIDE.md)** - Development and testing environment setup

## 🔬 Technical Specifications

### Supported Data Formats

- **Input**: MinerU-parsed JSON format academic documents
- **Output**: Standardized JSON, compatible with LLM training and inference
- **Encoding**: UTF-8 cross-platform support, automatic control character handling

### Quality Assurance Mechanisms

- **Compression Rate Protection**: ≤20% threshold ensures text semantic integrity
- **Boundary Self-healing**: Automatic detection and repair of Shield protection anomalies
- **Integrity Auditing**: Multi-layer verification ensures data quality

## 📄 Open Source License

MIT License - see the [LICENSE](LICENSE) file for details.

## 📖 Citation

If you use PorosData-Processor in your research, please cite:

```bibtex
@software{porosdata_processor,
  title = {PorosData-Processor: Academic Document Intelligent Cleaning Pipeline},
  author = {YE, Kivent},
  year = {2025},
  url = {https://github.com/KiventYip/PorosData-doc},
  version = {0.2.4}
}
```

## 🤝 Contributions and Feedback

Issues and Pull Requests are welcome! The project adopts a data-driven development philosophy, and any performance optimization suggestions will be seriously considered.
