Metadata-Version: 2.4
Name: PorosData-Designer
Version: 0.1.1
Summary: Structured delivery toolkit: MinerU content_list to per-doc layout (content/structure/assets under one doc_id directory).
Author-email: Kivent <72405514@cityu-dg.edu.cn>
License-Expression: CC-BY-4.0
Keywords: mineru,document-processing,scientific-data,multimodal,structured-output
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: loguru>=0.7.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Dynamic: license-file

# PorosData-Designer

`PorosData-Designer` converts MinerU-generated document parses into a **unified per-document layout** under `{output_root}/{doc_id}/`: structure-aware training text (`*.content.json` / `*.content.txt`), a datamining view (`*.structure.json`), and multimodal delivery (`*.assets.index.json` and `images/`).

It is designed for scientific data processing, structure-aware training preparation, and atomic document design centered on paragraphs, formulas, chemical expressions, and figure assets.

## What It Does

- Builds a structure-aware full-text view from `*_content_list.json`.
- Maps document sections, formulas, chemical expressions, and asset references into `*.structure.json`.
- Extracts image-caption-mention relationships, copying assets and writing Markdown cards under `images/`.

## Install

Recommended Python version: `3.12.6` (validated on Windows `win32 10.0.26200`).
The minimum supported version is `3.8`; Python `3.9`-`3.11` are syntax-compatible but not fully regression-tested in this repository.

```bash
pip install porosdata-designer
```

A runtime install pulls in a single direct dependency: `loguru>=0.7.0`.

For development in this repository:

```bash
pip install -e ".[dev]"
```

## Quick Start

`--input_dir` must point to a directory tree that contains MinerU outputs; the tree is searched recursively for `*_content_list.json` files. In this repository the conventional input root is `data/Processed Database`.
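
To preview which files a run would pick up, the recursive `*_content_list.json` lookup can be approximated with the standard library. This is a minimal sketch, not the package's own discovery code; the helper name `find_content_lists` is hypothetical.

```python
from pathlib import Path
from typing import List


def find_content_lists(input_dir: str) -> List[Path]:
    """Return every *_content_list.json file under input_dir (hypothetical helper)."""
    root = Path(input_dir)
    # rglob walks the whole tree; sorting gives a stable, reproducible order.
    return sorted(p for p in root.rglob("*_content_list.json") if p.is_file())


if __name__ == "__main__":
    for path in find_content_lists("data/Processed Database"):
        print(path)
```

An empty result usually means `--input_dir` points at the wrong level of the tree.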

**Recommended in this repo (full pipeline):**

```bash
./scripts/run_designeddataset.sh
# or explicit paths:
./scripts/run_designeddataset.sh "data/Processed Database" "data/Designed Database" logs
```

The script exports `PYTHONPATH` and runs `python -m src.porosdata_designer.cli run all` with the same defaults.

**After editable install:**

```bash
porosdata-designer run all \
  --input_dir "data/Processed Database" \
  --output_dir "data/Designed Database" \
  --log_dir logs
```

**Module mode (same as installed package):**

```bash
python -m porosdata_designer run all \
  --input_dir "data/Processed Database" \
  --output_dir "data/Designed Database" \
  --log_dir logs
```

**Unpacked source without install** (equivalent to the shell script):

```bash
export PYTHONPATH="${PWD}:${PWD}/src${PYTHONPATH:+:${PYTHONPATH}}"
python -m src.porosdata_designer.cli run all \
  --input_dir "data/Processed Database" \
  --output_dir "data/Designed Database" \
  --log_dir logs
```

If you omit `--output_dir`, the default is `data/Designed Database` under the project root (see `DEFAULT_DESIGNED_OUTPUT_DIR_NAME` in `src/porosdata_designer/runtime/config.py`).

## Stage Commands

Run text structuring only:

```bash
porosdata-designer run text --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"
```

Run multimodal extraction only (in this repo it reads the same input tree as the text stage):

```bash
porosdata-designer run multimodal --input_dir "path/to/input_dir" --output_dir "path/to/output_dir"
```

## Outputs

Each document is written to `{output_root}/{doc_id}/`:

```text
path/to/output_root/
└── {doc_id}/
    ├── {doc_id}.content.json
    ├── {doc_id}.content.txt
    ├── {doc_id}.structure.json
    ├── {doc_id}.assets.index.json
    └── images/
        ├── fig_1.jpg
        └── fig_1.md
```
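
For a quick look at a single document directory, the layout above can be checked by hand. The sketch below assumes a full `run all` has produced every entry shown in the tree; `missing_outputs` and `EXPECTED_SUFFIXES` are hypothetical names and do not replace the built-in `audit` and `validate` commands.

```python
from pathlib import Path
from typing import List

# File suffixes taken from the layout tree above (assumes a full `run all`).
EXPECTED_SUFFIXES = (
    ".content.json",
    ".content.txt",
    ".structure.json",
    ".assets.index.json",
)


def missing_outputs(output_root: str, doc_id: str) -> List[str]:
    """List expected entries that are absent from {output_root}/{doc_id}/."""
    doc_dir = Path(output_root) / doc_id
    missing = [
        f"{doc_id}{suffix}"
        for suffix in EXPECTED_SUFFIXES
        if not (doc_dir / f"{doc_id}{suffix}").is_file()
    ]
    # The images/ directory holds copied assets and their Markdown cards.
    if not (doc_dir / "images").is_dir():
        missing.append("images/")
    return missing
```

An empty list means every entry from the tree is present; it says nothing about whether the file contents are valid.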

## Validation

Audit outputs (default root is `data/Designed Database` when `--root_dir` is omitted):

```bash
porosdata-designer audit structured --root_dir "path/to/output_root"
```

Validate `*.content.json` files:

```bash
porosdata-designer validate structured --output_dir "path/to/output_root"
```

Validate multimodal indexes:

```bash
porosdata-designer validate multimodal --output_dir "path/to/output_root"
```

Run final acceptance validation:

```bash
porosdata-designer validate acceptance --output_dir "path/to/output_root"
```
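
The three `validate` commands above can be chained from Python when they need to run as one step. This is a rough sketch using only the subcommands and the `--output_dir` flag shown above; it assumes the CLI is on `PATH` and that a failed validation exits with a non-zero code, which this README does not confirm. The function names are hypothetical.

```python
import subprocess
from typing import List

# Stage names as they appear in the commands above.
VALIDATE_STAGES = ("structured", "multimodal", "acceptance")


def build_validate_commands(output_root: str) -> List[List[str]]:
    """Build one argv list per validate stage for the given output root."""
    return [
        ["porosdata-designer", "validate", stage, "--output_dir", output_root]
        for stage in VALIDATE_STAGES
    ]


def run_validations(output_root: str) -> bool:
    """Run each stage in order; stop at the first non-zero exit code.

    Assumption: the CLI signals failure through its exit code.
    """
    for cmd in build_validate_commands(output_root):
        if subprocess.run(cmd).returncode != 0:
            return False
    return True
```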

## Python Usage

You can also use the package directly in Python:

```python
from porosdata_designer import DataMiningMapper, MultimodalInterleaver, TextAggregator

aggregator = TextAggregator()
mapper = DataMiningMapper()
interleaver = MultimodalInterleaver()
```

A more complete text-side example:

```python
from porosdata_designer import DataMiningMapper, TextAggregator

content_list = [
    {"type": "text", "text_level": 1, "text": "Abstract", "page_idx": 0},
    {"type": "text", "text": "This work studies a Cu-Zr metallic glass system.", "page_idx": 0},
    {"type": "text", "text_level": 1, "text": "Results and Discussion", "page_idx": 1},
    {"type": "text", "text": "Figure 1 shows the microstructure evolution at 700 K.", "page_idx": 1},
]

aggregator = TextAggregator()
structured_text = aggregator.aggregate(content_list)

mapper = DataMiningMapper()
datamining_view = mapper.map(structured_text, {"doc_id": "demo-0001"})

print(structured_text)
print(datamining_view.pure_text_stream)
print(datamining_view.structured_json["sections"])
```

Expected outcome:

- `structured_text` contains Poros tags such as `<poros_doc>`, `<poros_section_*>`, and `<poros_paragraph>`.
- `pure_text_stream` removes the structure tags while keeping readable text.
- `structured_json` exposes mined fields such as `sections`, `formulas`, `chemical_formulas`, and `asset_refs`.
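
To illustrate the relationship between `structured_text` and `pure_text_stream`, the sketch below strips Poros-style tags with a regular expression. It is not the package's implementation: it assumes tags are written as `<poros_name>` with matching `</poros_name>` closers, and the sample string is made up for illustration.

```python
import re

# Assumption: every structure tag starts with "poros_" and may have a closing form.
POROS_TAG = re.compile(r"</?poros_\w+[^>]*>")


def strip_poros_tags(tagged: str) -> str:
    """Remove Poros-style tags and drop the blank lines they leave behind."""
    text = POROS_TAG.sub("", tagged)
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Hypothetical tagged input, shaped after the tag names listed above.
sample = (
    "<poros_doc>\n"
    "<poros_section_1>Abstract</poros_section_1>\n"
    "<poros_paragraph>This work studies a Cu-Zr metallic glass system.</poros_paragraph>\n"
    "</poros_doc>"
)
print(strip_poros_tags(sample))
```

In real use, prefer `datamining_view.pure_text_stream`, since the actual tag grammar may differ from this assumption.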
