Metadata-Version: 2.4
Name: PorosData-Designer
Version: 0.1.0
Summary: Structured delivery toolkit for converting MinerU parses into full_text, datamining, and multimodal outputs.
Author-email: Kivent <72405514@cityu-dg.edu.cn>
License: MIT
Keywords: mineru,document-processing,scientific-data,multimodal,structured-output
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typing-extensions>=4.0; python_version < "3.10"
Requires-Dist: dataclasses>=0.6; python_version < "3.7"
Requires-Dist: pydantic>=2.0
Requires-Dist: loguru>=0.7.0
Requires-Dist: pathlib2>=2.3.0; python_version < "3.4"
Requires-Dist: dataclasses-json>=0.5.0
Requires-Dist: tqdm>=4.64.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Provides-Extra: optional
Requires-Dist: cleanlit>=0.2.0; extra == "optional"
Dynamic: license-file

# PorosData-Designer

`PorosData-Designer` converts MinerU-generated document parses into three stable deliverables: `full_text`, `datamining`, and `multimodal`.

It targets scientific data processing and structure-aware training-data preparation, and it treats paragraphs, formulas, chemical expressions, and figure assets as the atomic units of a document.

## What It Does

- Builds a structure-aware full-text view from `*_content_list.json`.
- Maps document sections, formulas, chemical expressions, and asset references into a datamining view.
- Extracts image-caption-mention relationships into a multimodal view with copied assets and Markdown outputs.

## Install

```bash
pip install porosdata-designer
```

## Quick Start

`--input_dir` should point to a directory tree that contains MinerU outputs. In practice, Designer expects:

- one or more `*_content_list.json` files
- image assets that remain resolvable relative to those input files
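
Before launching a run, it can help to confirm that the input tree actually contains the expected files. The following is a minimal sketch using only the standard library; `find_content_lists` is a hypothetical helper, not part of the package, and it assumes that a recursive search is an acceptable way to locate the `*_content_list.json` files.

```python
from pathlib import Path
from typing import List


def find_content_lists(input_dir: str) -> List[Path]:
    """Return every *_content_list.json under input_dir, sorted for stable output.

    Hypothetical pre-flight helper; it does not call porosdata_designer.
    """
    root = Path(input_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"not a directory: {root}")
    # rglob searches the whole tree, matching the "directory tree" wording above.
    return sorted(root.rglob("*_content_list.json"))
```

An empty result suggests that `--input_dir` points at the wrong level of the MinerU output tree.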

When running the installed package, pass explicit output and log directories.

Run the full pipeline:

```bash
porosdata-designer run all --input_dir "path/to/input_dir" --output_dir "path/to/output_dir" --log_dir "path/to/log_dir"
```

Equivalent module mode:

```bash
python -m porosdata_designer run all --input_dir "path/to/input_dir" --output_dir "path/to/output_dir" --log_dir "path/to/log_dir"
```

## Stage Commands

Run text structuring only:

```bash
porosdata-designer run text --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"
```

Run multimodal extraction only:

```bash
porosdata-designer run multimodal --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"
```

## Outputs

Designer produces three delivery views by default:

- `full_text`: for structure-aware training and human review.
- `datamining`: for retrieval, extraction, indexing, and downstream storage.
- `multimodal`: for image-text linking, multimodal indexing, and asset delivery.

Typical output layout:

```text
path/to/output_dir/
├── full_text/{doc_id}/
│   ├── {doc_id}_structured.json
│   └── {doc_id}_structured.txt
├── datamining/{doc_id}/
│   └── {doc_id}_datamining.json
└── multimodal/{doc_id}/
    ├── {doc_id}_index.json
    ├── fig_n.md
    └── assets/
```
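
One way to check that a run produced all three views for every document is to walk this layout directly. The sketch below assumes the directory structure shown above and nothing else; `list_doc_ids` and `missing_views` are hypothetical helper names, not package APIs.

```python
from pathlib import Path
from typing import Dict, List

# View directory names taken from the layout above.
VIEWS = ("full_text", "datamining", "multimodal")


def list_doc_ids(output_dir: str) -> Dict[str, List[str]]:
    """Map each delivery view to the doc_id subdirectories found under it."""
    root = Path(output_dir)
    result: Dict[str, List[str]] = {}
    for view in VIEWS:
        view_dir = root / view
        if view_dir.is_dir():
            result[view] = sorted(p.name for p in view_dir.iterdir() if p.is_dir())
        else:
            result[view] = []
    return result


def missing_views(output_dir: str) -> Dict[str, List[str]]:
    """For each doc_id seen in any view, list the views that have no directory for it."""
    found = list_doc_ids(output_dir)
    all_ids = set().union(*found.values())
    report: Dict[str, List[str]] = {}
    for doc_id in sorted(all_ids):
        absent = [view for view in VIEWS if doc_id not in found[view]]
        if absent:
            report[doc_id] = absent
    return report
```

An empty dictionary from `missing_views` means every document has a directory in each view; it does not check the files inside those directories.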

## Validation

Audit structured outputs:

```bash
porosdata-designer audit structured --root_dir "path/to/output_dir"
```

Validate text outputs:

```bash
porosdata-designer validate structured --output_dir "path/to/output_dir/full_text"
```

Validate multimodal outputs:

```bash
porosdata-designer validate multimodal --output_dir "path/to/output_dir/multimodal"
```

Run final acceptance validation:

```bash
porosdata-designer validate acceptance --output_dir "path/to/output_dir/multimodal"
```
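
The four commands above can be chained from a script. The following is a sketch only: `validation_commands` and `run_validation` are hypothetical names, and the stop-on-first-failure logic assumes the CLI exits with a non-zero status when a check fails, which this README does not state.

```python
import subprocess
from typing import List


def validation_commands(output_dir: str) -> List[List[str]]:
    """Build the audit and validate invocations shown above as argv lists."""
    root = output_dir.rstrip("/")
    return [
        ["porosdata-designer", "audit", "structured", "--root_dir", root],
        ["porosdata-designer", "validate", "structured", "--output_dir", f"{root}/full_text"],
        ["porosdata-designer", "validate", "multimodal", "--output_dir", f"{root}/multimodal"],
        ["porosdata-designer", "validate", "acceptance", "--output_dir", f"{root}/multimodal"],
    ]


def run_validation(output_dir: str) -> bool:
    """Run each command in order; stop at the first non-zero exit status."""
    for argv in validation_commands(output_dir):
        completed = subprocess.run(argv)
        # Assumption: a failed check is reported through the exit status.
        if completed.returncode != 0:
            print("failed:", " ".join(argv))
            return False
    return True
```

Passing argv lists to `subprocess.run` avoids shell quoting problems when the output path contains spaces.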

## Python Usage

You can also use the package directly in Python:

```python
from porosdata_designer import DataMiningMapper, MultimodalInterleaver, TextAggregator

aggregator = TextAggregator()
mapper = DataMiningMapper()
interleaver = MultimodalInterleaver()
```

A more complete text-side example:

```python
from porosdata_designer import DataMiningMapper, TextAggregator

content_list = [
    {"type": "text", "text_level": 1, "text": "Abstract", "page_idx": 0},
    {"type": "text", "text": "This work studies a Cu-Zr metallic glass system.", "page_idx": 0},
    {"type": "text", "text_level": 1, "text": "Results and Discussion", "page_idx": 1},
    {"type": "text", "text": "Figure 1 shows the microstructure evolution at 700 K.", "page_idx": 1},
]

aggregator = TextAggregator()
structured_text = aggregator.aggregate(content_list)

mapper = DataMiningMapper()
datamining_view = mapper.map(structured_text, {"doc_id": "demo-0001"})

print(structured_text)
print(datamining_view.pure_text_stream)
print(datamining_view.structured_json["sections"])
```

Expected outcome:

- `structured_text` contains Poros tags such as `<poros_doc>`, `<poros_section_*>`, and `<poros_paragraph>`.
- `pure_text_stream` removes the structure tags while keeping readable text.
- `structured_json` exposes mined fields such as `sections`, `formulas`, `chemical_formulas`, and `asset_refs`.
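
To make the relationship between `structured_text` and `pure_text_stream` concrete, the sketch below strips Poros-style tags with a regular expression. It illustrates the idea only and is not the library's implementation: the sample string, the `</poros_...>` closing-tag form, and the `strip_poros_tags` helper are all assumptions made for this example.

```python
import re

# Matches opening or closing tags whose name starts with "poros_" (assumed tag syntax).
POROS_TAG = re.compile(r"</?poros_[A-Za-z0-9_]+[^>]*>")


def strip_poros_tags(text: str) -> str:
    """Drop Poros-style tags and return the remaining text, one block per line."""
    without_tags = POROS_TAG.sub("\n", text)
    lines = [line.strip() for line in without_tags.splitlines()]
    return "\n".join(line for line in lines if line)


# Hypothetical tagged input; real aggregator output may be formatted differently.
sample = (
    "<poros_doc>"
    "<poros_section_1>Abstract</poros_section_1>"
    "<poros_paragraph>This work studies a Cu-Zr metallic glass system.</poros_paragraph>"
    "</poros_doc>"
)

print(strip_poros_tags(sample))
```

In real use, read `datamining_view.pure_text_stream` rather than re-deriving the text this way.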

