Metadata-Version: 2.4
Name: porosdata-designer
Version: 0.1.3
Summary: Structured delivery toolkit: MinerU content_list to per-doc layout (content/structure/assets under one doc_id directory).
Author-email: Kivent <72405514@cityu-dg.edu.cn>
License-Expression: CC-BY-4.0
Keywords: mineru,document-processing,scientific-data,multimodal,structured-output
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: loguru>=0.7.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Dynamic: license-file

# PorosData-Designer

`PorosData-Designer` converts MinerU-generated document parses into a **unified per-document layout** under `{output_root}/{doc_id}/`: structure-aware training text (`*.content.json` / `*.content.txt`), a datamining view (`*.structure.json`), and multimodal delivery (`*.assets.index.json` and `images/`).

It is designed for scientific data processing and structure-aware preparation centered on paragraphs, formulas, chemical expressions, and figure assets.

## What it does

- Builds a structure-aware full-text view from `*_content_list.json`.
- Maps document sections, formulas, chemical expressions, and asset references into `*.structure.json`.
- Extracts image/caption/mention relationships and ships copied assets plus Markdown cards under `images/`.

## Install

```bash
pip install porosdata-designer
```

After installation, the package is imported as `designer` in Python and exposes a `designer` CLI command. `porosdata-designer` is only the PyPI distribution name.

Python requirement: `>=3.8`.

## Quick start (CLI)

The `run` commands take two main path arguments:

1. `--input_dir`: a directory tree containing MinerU outputs. It is searched recursively for `*_content_list.json` files.
2. `--output_dir`: where the structured delivery is written. When omitted, the default comes from the package config.

Run the full pipeline:

```bash
designer run all --input_dir "path/to/input_dir" --output_dir "path/to/output_dir" --log_dir "path/to/log_dir"
```

You can also run only one stage:

```bash
designer run text --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"
designer run multimodal --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"
```
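To preview which files a run would pick up, the recursive `*_content_list.json` search can be approximated with the standard library. This is a minimal sketch that assumes discovery is a plain recursive glob; `find_content_lists` is a hypothetical helper, not part of the package, and the CLI's actual discovery logic may differ.

```python
from pathlib import Path
from typing import List


def find_content_lists(input_dir: str) -> List[Path]:
    """Return every *_content_list.json under input_dir, sorted for stable output.

    Hypothetical helper: assumes the CLI's search is a plain recursive glob.
    """
    root = Path(input_dir)
    # rglob walks the whole tree, matching the "recursive" wording of --input_dir.
    return sorted(p for p in root.rglob("*_content_list.json") if p.is_file())


if __name__ == "__main__":
    for path in find_content_lists("path/to/input_dir"):
        print(path)
```

An empty result usually means `--input_dir` points at the wrong level of the MinerU output tree.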

## Outputs

Each document is written to `{output_root}/{doc_id}/`:

```text
path/to/output_root/
└── {doc_id}/
    ├── {doc_id}.content.json
    ├── {doc_id}.content.txt
    ├── {doc_id}.structure.json
    ├── {doc_id}.assets.index.json
    └── images/
        ├── fig_1.jpg
        └── fig_1.md
```
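The layout above can be spot-checked with a few lines of Python. The sketch below is a hypothetical helper (`missing_outputs` is not part of the package) and assumes that a full `run all` produces every file in the tree; a single-stage run may legitimately omit some of them. It is not a substitute for the `designer validate` commands.

```python
from pathlib import Path
from typing import List

# File suffixes taken from the layout shown above.
EXPECTED_SUFFIXES = (
    ".content.json",
    ".content.txt",
    ".structure.json",
    ".assets.index.json",
)


def missing_outputs(output_root: str, doc_id: str) -> List[str]:
    """List expected entries that are absent from {output_root}/{doc_id}/."""
    doc_dir = Path(output_root) / doc_id
    missing = [
        f"{doc_id}{suffix}"
        for suffix in EXPECTED_SUFFIXES
        if not (doc_dir / f"{doc_id}{suffix}").is_file()
    ]
    # The images/ directory holds copied assets and their Markdown cards.
    if not (doc_dir / "images").is_dir():
        missing.append("images/")
    return missing
```

An empty list means the directory matches the expected shape; it says nothing about whether the file contents are valid.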

## Validation / audit (CLI)

Examples (replace the placeholder paths with paths on your local filesystem):

```bash
# Audit structured outputs (content/structure/*.json under each doc_id)
designer audit structured --root_dir "path/to/output_root"

# Validate structured outputs
designer validate structured --output_dir "path/to/output_root" --log_dir "path/to/log_dir"

# Validate multimodal index files
designer validate multimodal --output_dir "path/to/output_root" --log_dir "path/to/log_dir"

# Final acceptance validation
designer validate acceptance --output_dir "path/to/output_root" --log_dir "path/to/log_dir"

# Validate against the delivery standard
designer validate delivery --root_dir "path/to/output_root" --log_dir "path/to/log_dir"
```
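When the checks run from a script rather than a shell, the same commands can be assembled in Python. This is a sketch under stated assumptions: it uses only the subcommands and flags shown above, the helper names are hypothetical, and it assumes the `designer` executable is on `PATH`. Whether a failed check produces a non-zero exit code is not documented here, so treat the returned codes as indicative only.

```python
import subprocess
from typing import Dict, List


def build_validation_commands(output_root: str, log_dir: str) -> Dict[str, List[str]]:
    """Map each validate subcommand to its argv, using the flags shown above."""
    commands = {
        name: ["designer", "validate", name, "--output_dir", output_root, "--log_dir", log_dir]
        for name in ("structured", "multimodal", "acceptance")
    }
    # The delivery check takes --root_dir rather than --output_dir in the examples above.
    commands["delivery"] = [
        "designer", "validate", "delivery", "--root_dir", output_root, "--log_dir", log_dir
    ]
    return commands


def run_validations(output_root: str, log_dir: str) -> Dict[str, int]:
    """Run every check in order and return its exit code (needs `designer` on PATH)."""
    return {
        name: subprocess.run(argv, check=False).returncode
        for name, argv in build_validation_commands(output_root, log_dir).items()
    }
```

Passing each command as an argument list rather than a shell string avoids quoting problems with paths that contain spaces.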

## Python usage

You can also use the package directly in Python:

```python
from designer import DataMiningMapper, MultimodalInterleaver, TextAggregator

aggregator = TextAggregator()
mapper = DataMiningMapper()
interleaver = MultimodalInterleaver()
```

Text-side example:

```python
from designer import DataMiningMapper, TextAggregator

content_list = [
    {"type": "text", "text_level": 1, "text": "Abstract", "page_idx": 0},
    {"type": "text", "text": "This work studies a Cu-Zr metallic glass system.", "page_idx": 0},
    {"type": "text", "text_level": 1, "text": "Results and Discussion", "page_idx": 1},
    {"type": "text", "text": "Figure 1 shows the microstructure evolution at 700 K.", "page_idx": 1},
]

aggregator = TextAggregator()
structured_text = aggregator.aggregate(content_list)

mapper = DataMiningMapper()
datamining_view = mapper.map(structured_text, {"doc_id": "demo-0001"})

print(structured_text)
print(datamining_view.pure_text_stream)
print(datamining_view.structured_json["sections"])
```

Expected outcome:

- `structured_text` contains Poros tags such as `<poros_doc>`, `<poros_section_*>`, and `<poros_paragraph>`.
- `pure_text_stream` removes the structure tags while keeping readable text.
- `structured_json` exposes mined fields such as `sections`, `formulas`, `chemical_formulas`, and `asset_refs`.
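
To illustrate the relationship between the tagged text and `pure_text_stream`, the sketch below strips Poros-style tags with a regular expression. It is not the package's implementation: it assumes the tags are attribute-free and come in `<poros_name>` / `</poros_name>` pairs, and `strip_poros_tags` is a hypothetical helper.

```python
import re

# Assumption: tags look like <poros_name> or </poros_name>, with no attributes.
_POROS_TAG = re.compile(r"</?poros_[A-Za-z0-9_]+>")


def strip_poros_tags(tagged: str) -> str:
    """Remove Poros-style tags and drop the blank lines they leave behind."""
    without_tags = _POROS_TAG.sub("", tagged)
    lines = [line.strip() for line in without_tags.splitlines()]
    return "\n".join(line for line in lines if line)


sample = "<poros_doc>\n<poros_paragraph>Hello world.</poros_paragraph>\n</poros_doc>"
print(strip_poros_tags(sample))  # prints: Hello world.
```

In real use, prefer `datamining_view.pure_text_stream`, which reflects the package's own tag handling.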
