Metadata-Version: 2.1
Name: porosdata-processor
Version: 0.4.0
Summary: Academic document intelligent cleaning pipeline for AI for Science, ensuring MinerU parsed data meets LLM input standards
Author-email: Kivent YE <72405514@cityu-dg.edu.cn>
Maintainer-email: Kivent YE <72405514@cityu-dg.edu.cn>
License: CC BY 4.0
Project-URL: Documentation, https://porosdata-doc.readthedocs.io/en/latest/
Keywords: text,cleaning,latex,greek,preprocessing,nlp,llm,token-optimization
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: psutil>=5.9.0
Provides-Extra: batch
Requires-Dist: ijson>=3.0.0; extra == "batch"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: flake8>=6.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: transformers>=4.21.0; extra == "dev"
Requires-Dist: ijson>=3.0.0; extra == "dev"
Provides-Extra: eval
Requires-Dist: transformers>=4.21.0; extra == "eval"

# PorosData-Processor

  
[Python Version](https://www.python.org/downloads/)
[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)

PorosData-Processor cleans MinerU-derived scientific document text for downstream LLM and data-mining workflows. It normalizes text without breaking LaTeX formulas, repairs OCR errors, cleans up citations and numbering, and optionally evaluates token efficiency.

## Installation

```bash
pip install porosdata-processor
```

Optional extras:

```bash
# Token evaluation support
pip install "porosdata-processor[eval]"

# Streaming JSON parsing for large files
pip install "porosdata-processor[batch]"
```

Requires Python 3.11 or later.

## What The Package Does

- Cleans scientific text while preserving LaTeX formulas via a placeholder-based protect/restore mechanism (Shield).
- Normalizes whitespace, Unicode, citations, numbering, and repairs OCR errors.
- Provides a Python API for single-text processing.
- Provides a CLI for batch processing MinerU `*_content_list.json` files.
- Optionally computes token-efficiency statistics when `transformers` is installed.
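
The placeholder-based protect/restore idea behind Shield can be illustrated with a minimal sketch. This is not the package's implementation: the regex, the placeholder format, and the function names `protect`, `restore`, and `clean_with_shield` are hypothetical, and only `$...$` and `$$...$$` math is recognized here.

```python
import re

# Hypothetical pattern: display math ($$...$$) first, then inline math ($...$).
_MATH = re.compile(r"\$\$.+?\$\$|\$[^$\n]+?\$", re.DOTALL)


def protect(text: str) -> tuple[str, list[str]]:
    """Replace each formula with an opaque placeholder and return the originals."""
    formulas: list[str] = []

    def _stash(match: re.Match[str]) -> str:
        formulas.append(match.group(0))
        # NUL-delimited index, unlikely to collide with real document text.
        return f"\x00MATH{len(formulas) - 1}\x00"

    return _MATH.sub(_stash, text), formulas


def restore(text: str, formulas: list[str]) -> str:
    """Put the stashed formulas back in place of their placeholders."""
    return re.sub(r"\x00MATH(\d+)\x00", lambda m: formulas[int(m.group(1))], text)


def clean_with_shield(text: str) -> str:
    shielded, formulas = protect(text)
    # Stand-in cleaning rule: collapse runs of spaces outside formulas.
    shielded = re.sub(r" {2,}", " ", shielded)
    return restore(shielded, formulas)


print(clean_with_shield("Energy   is  $E  =  mc^2$  here"))
# Energy is $E  =  mc^2$ here
```

Because the cleaning rule only ever sees placeholders, the spacing inside the formula survives untouched while the surrounding prose is normalized.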

## Python API

```python
from processor import TextCleaner

cleaner = TextCleaner()
result = cleaner.clean("Recent studies 【1】 show α-phase stability.")
print(result)
```

Example output:

```text
Recent studies ref[1] show \alpha-phase stability.
```

Custom pipeline example:

```python
from processor import TextCleaner

cleaner = TextCleaner(
    pipeline=[
        "unicode_normalization",
        "patterns_cleaning",
        "normalize_whitespace",
    ]
)

result = cleaner.clean("Text   with   extra spaces")
print(result)
```
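
To make the idea of a named-step pipeline concrete, here is a small stand-alone sketch of how step names could be dispatched in order. The registry `STEPS`, the function `run_pipeline`, and the step bodies are assumptions for illustration only; the step names are borrowed from the example above, but the package's real steps may behave differently.

```python
import re
import unicodedata
from typing import Callable

# Hypothetical step registry: names mirror the README example, bodies are stand-ins.
STEPS: dict[str, Callable[[str], str]] = {
    "unicode_normalization": lambda s: unicodedata.normalize("NFKC", s),
    "normalize_whitespace": lambda s: re.sub(r"\s+", " ", s).strip(),
}


def run_pipeline(text: str, pipeline: list[str]) -> str:
    """Apply the named steps in order, failing fast on an unknown name."""
    for name in pipeline:
        if name not in STEPS:
            raise KeyError(f"unknown pipeline step: {name}")
        text = STEPS[name](text)
    return text


print(run_pipeline("Text   with   extra spaces",
                   ["unicode_normalization", "normalize_whitespace"]))
# Text with extra spaces
```

Steps run in list order, so a step that depends on normalized input should come after the step that produces it.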

Optional evaluation mode:

```python
from processor import TextCleaner

cleaner = TextCleaner.with_tokenizer_evaluation("gpt2")
result = cleaner.clean("Recent studies 【1】 show α-phase stability.", eval_mode=True)
print(result["processed_text"])
print(result["evaluation"]["overall"]["compression_rate"])
```
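
How `compression_rate` is computed is not stated here. As a rough sketch, one common convention is the fraction of tokens removed by cleaning. The function below and its whitespace tokenizer default are assumptions for illustration, not the package's definition.

```python
from typing import Callable


def compression_rate(
    original: str,
    processed: str,
    tokenize: Callable[[str], list[str]] = str.split,
) -> float:
    """Fraction of tokens removed by cleaning (assumed definition: 1 - after / before)."""
    before = len(tokenize(original))
    after = len(tokenize(processed))
    if before == 0:
        # Nothing to compress; avoid dividing by zero.
        return 0.0
    return 1.0 - after / before


print(compression_rate("a b c d", "a b c"))
# 0.25
```

With `transformers` installed, a subword tokenizer's `tokenize` method, for example from `AutoTokenizer.from_pretrained("gpt2")`, could be passed in place of `str.split` to get counts closer to what an LLM would see.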

## CLI Batch Processing

The CLI recursively scans the input directory for MinerU `*_content_list.json` files and writes cleaned JSON outputs plus `processing_report.json` to the output directory.

```bash
porosdata-processor run \
    --input-dir "data/Raw Database" \
    --output-dir "data/Processed Database" \
    --max-workers 4
```

Enable token-efficiency evaluation, which needs the `eval` extra:

```bash
porosdata-processor run \
    --input-dir "data/Raw Database" \
    --output-dir "data/Processed Database" \
    --enable-evaluation \
    --max-workers 4
```

## Script Entrypoints

For repository-level operations, use these two scripts:

- `scripts/run_processeddataset.sh`: runs the full cleaning pipeline from `data/Raw Database` to `data/Processed Database`
- `scripts/run_rulesupplement.sh`: runs the rule-supplement governance flow and writes iteration artifacts to `data/Rule Supplement Database/<RUN_ID>`

Common flags:

- `--input-dir`: input directory containing MinerU outputs
- `--output-dir`: output directory for cleaned files
- `--enable-evaluation`: enable token-efficiency evaluation
- `--max-workers`: set the number of worker processes
- `--force-reprocess`: ignore existing outputs and re-run processing
- `--memory-limit`: memory limit in MB
- `--log-level`: `DEBUG`, `INFO`, `WARNING`, or `ERROR`
- `--heartbeat-seconds`: emit runtime heartbeat logs every N seconds

## Scope And Input Format

- Batch processing is designed for MinerU-generated JSON content lists, not for generic `JSONL`, `Parquet`, or `HDF5` datasets.
- The primary public API is `TextCleaner` for string-based cleaning and the `porosdata-processor` CLI for directory-based processing.
- Commands such as `audit`, `sample-validate`, and `delivery-gate` are also available from the CLI, but they are intended for internal data-governance workflows.
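
For a single file outside the CLI, the string-based API can be applied to a content list by hand. The sketch below assumes the file is a JSON array of objects in which text-bearing items carry a string `text` field; that layout and the helper `clean_content_list` are assumptions. The cleaner is passed in as a plain callable, so `TextCleaner().clean` or any other `str -> str` function can be used.

```python
import json
from typing import Callable


def clean_content_list(raw_json: str, clean: Callable[[str], str]) -> str:
    """Clean the `text` field of each item, leaving all other fields untouched."""
    items = json.loads(raw_json)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of content items")
    for item in items:
        # Assumption: text-bearing items are dicts with a string "text" field.
        if isinstance(item, dict) and isinstance(item.get("text"), str):
            item["text"] = clean(item["text"])
    # ensure_ascii=False keeps Greek letters and other non-ASCII text readable.
    return json.dumps(items, ensure_ascii=False, indent=2)


raw = '[{"type": "text", "text": "  hi  "}, {"type": "image", "img_path": "x.png"}]'
print(clean_content_list(raw, str.strip))
```

Items without a `text` field, such as the image entry in the example, pass through unchanged.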

## License

PorosData-Processor is released under the CC BY 4.0 license (Creative Commons Attribution 4.0 International).
