Metadata-Version: 2.4
Name: hwpx2md
Version: 0.1.0
Summary: Build LLM-friendly Markdown artifact bundles from HWPX and HWP documents.
Keywords: hwp,hwpx,korean,llm,markdown
Requires-Python: >=3.10
Requires-Dist: hwp-hwpx-parser>=1.0.0
Description-Content-Type: text/markdown

# hwpx2md

Convert Korean `.hwpx` or `.hwp` documents into an LLM-readable Markdown artifact bundle.

Use this when a model or script needs the document body, memos/comments, tables, and extracted assets in plain files instead of a binary office document.

## Quick start

Run directly from PyPI with `uvx`:

```powershell
uvx hwpx2md "input.hwpx" -o "input_bundle" --overwrite
uvx hwpx2md "input.hwp" -o "input_bundle" --overwrite
```

If `-o` is omitted, `hwpx2md` writes the bundle next to the source file as `<input_stem>_llm_bundle`.
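A script that needs to locate the default bundle can apply the same naming rule. The following is a minimal sketch of the rule as described above; `default_bundle_dir` is a hypothetical helper written for illustration, not part of the `hwpx2md` API.

```python
from pathlib import Path


def default_bundle_dir(source: Path) -> Path:
    """Return the sibling directory `<input_stem>_llm_bundle` for a source file.

    Hypothetical helper: it mirrors the documented naming rule and is not
    taken from hwpx2md's own code.
    """
    # with_name() keeps the parent directory and swaps only the final component.
    return source.with_name(f"{source.stem}_llm_bundle")
```

For example, `default_bundle_dir(Path("docs/input.hwpx"))` gives `docs/input_llm_bundle`.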

Use quotes around Windows paths, Korean filenames, and paths containing spaces.

## What you get

After conversion, read the files in this order:

1. `document.md`: main body text. Memo, table, and asset references are linked inline.
2. `memos.md`: all memo/comment text. Each memo includes the document anchor, paragraph text, and nearby context.
3. `chunks.jsonl`: retrieval-friendly chunks. Each JSON line contains text plus related memo, table, and asset IDs.
4. `manifest.json`: machine-readable inventory with backend, counts, warnings, memo metadata, table metadata, and asset metadata.
5. `tables/`: one set of artifacts per table: full Markdown, CSV, compact Markdown, and cell JSON.
6. `assets/`: copied embedded or preview assets when the backend exposes them.
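
Because `chunks.jsonl` holds one JSON object per line, it can be streamed without loading the whole file. This is a minimal sketch, assuming the file is UTF-8 encoded; `iter_chunks` is a hypothetical helper, and since the exact field names are not listed here, the code yields each chunk as a plain dict for you to inspect.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_chunks(bundle_dir: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of chunks.jsonl in a bundle directory.

    Hypothetical helper; UTF-8 encoding is an assumption.
    """
    chunks_path = bundle_dir / "chunks.jsonl"
    with chunks_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```

Calling `sorted(chunk.keys())` on the first yielded chunk is a quick way to see which keys your bundle actually uses.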

The CLI prints a JSON summary with the output directory, backend, counts, files to read first, and warning codes:

```json
{
  "output": "C:\\work\\input_bundle",
  "format": "hwpx",
  "backend": "native-hwpx",
  "counts": {
    "sections": 1,
    "paragraphs": 96,
    "tables": 5,
    "memos": 2,
    "assets": 1,
    "asset_refs": 0,
    "chunks": 2,
    "warnings": 1
  },
  "read_first": [
    "C:\\work\\input_bundle\\document.md",
    "C:\\work\\input_bundle\\memos.md",
    "C:\\work\\input_bundle\\chunks.jsonl",
    "C:\\work\\input_bundle\\manifest.json"
  ],
  "warnings": ["nested_tables_folded"]
}
```
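
A calling script can read this summary to decide what to open next. The sketch below assumes the CLI writes only the JSON object to standard output and uses the keys shown in the example; `parse_summary` is a hypothetical helper, not part of `hwpx2md`.

```python
import json


def parse_summary(stdout_text: str) -> dict:
    """Parse the CLI's JSON summary and pull out the fields shown above.

    Hypothetical helper; assumes stdout holds nothing but the JSON object.
    """
    summary = json.loads(stdout_text)
    return {
        "output": summary["output"],
        # Fall back to empty values so a missing key does not raise.
        "read_first": summary.get("read_first", []),
        "warnings": summary.get("warnings", []),
        "table_count": summary.get("counts", {}).get("tables", 0),
    }
```

The input text could come from `subprocess.run([...], capture_output=True, text=True).stdout` after invoking the command shown in the quick start.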

## Backend fidelity

`.hwpx` uses the native XML backend. This is the highest-fidelity path. It preserves memo IDs, authors, timestamps (when present), document anchors, table cell spans, and exposed assets.

`.hwp` uses `hwp-hwpx-parser`, which requires neither Hancom Office nor Windows COM automation. This path preserves body text, tables, memo text, and `[MEMO:N]` anchors, but the backend may not expose authors, timestamps, exact memo IDs, image references, or table span metadata. Check the warnings in `manifest.json` for the exact limitations seen in a converted file.

For the best memo and table metadata, prefer `.hwpx` when you can. Use `.hwp` when the original binary file is all you have and text/memo extraction is more important than perfect layout metadata.
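
Checking a converted bundle for backend limitations can be scripted. This is a minimal sketch, assuming `manifest.json` is UTF-8 encoded and stores its warnings under a top-level `"warnings"` key; both points are assumptions, and `load_manifest_warnings` is a hypothetical helper.

```python
import json
from pathlib import Path


def load_manifest_warnings(bundle_dir: Path) -> list:
    """Return the warnings recorded in a bundle's manifest.json.

    Assumption: warnings live under a top-level "warnings" key. An absent
    key is treated as "no warnings".
    """
    manifest_path = bundle_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    return manifest.get("warnings", [])
```

An empty result suggests the backend reported no limitations for that file, provided the key name matches your manifest.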

## Commands

Convert HWPX:

```powershell
uvx hwpx2md "proposal.hwpx" -o "proposal_bundle" --overwrite
```

Convert HWP:

```powershell
uvx hwpx2md "proposal.hwp" -o "proposal_bundle" --overwrite
```

Show the self-contained CLI guide:

```powershell
uvx hwpx2md --help
```

Use the package from a local checkout:

```powershell
cd hwpx2md
uv run hwpx2md "..\email\sample.hwpx" -o "..\.backup\sample_bundle" --overwrite
```

## Python API

```python
from pathlib import Path
from hwpx2md import bundle_document

manifest = bundle_document(
    Path("input.hwpx"),
    output_dir=Path("input_bundle"),
    overwrite=True,
)

print(manifest["counts"])
```

Use `bundle_hwpx(...)` if you want to accept only `.hwpx`, or `bundle_hwp(...)` if you want to accept only `.hwp`.
