Metadata-Version: 2.4
Name: docx2md-cli
Version: 0.2.0
Summary: High-fidelity Word (.docx) to Markdown converter. Preserves tables (vMerge), footnotes, field codes, bibliography, bold/italic/underline, and numbered lists.
Project-URL: Homepage, https://github.com/gonzalopezgil/docx2md-cli
Project-URL: Repository, https://github.com/gonzalopezgil/docx2md-cli
Project-URL: Issues, https://github.com/gonzalopezgil/docx2md-cli/issues
Project-URL: Changelog, https://github.com/gonzalopezgil/docx2md-cli/blob/main/CHANGELOG.md
Author-email: Gonzalo López Gil <gonzalo.lopez.gil@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: ai-agents,bibliography,citations,cli,converter,docx,footnotes,frontmatter,images,markdown,tables,vmerge,word
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Office/Business
Classifier: Topic :: Text Processing :: Markup :: Markdown
Classifier: Typing :: Typed
Requires-Python: >=3.9
Requires-Dist: lxml>=4.9.0
Requires-Dist: python-docx>=1.0.0
Provides-Extra: frontmatter
Requires-Dist: pyyaml>=6.0; extra == 'frontmatter'
Provides-Extra: test
Requires-Dist: pytest-cov>=5.0; extra == 'test'
Requires-Dist: pytest>=8.0; extra == 'test'
Description-Content-Type: text/markdown

# docx2md-cli

[![Python](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://pypi.org/project/docx2md-cli/)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![PyPI](https://img.shields.io/pypi/v/docx2md-cli)](https://pypi.org/project/docx2md-cli/)

High-fidelity Word (`.docx`) to Markdown for documents where citations, tables, footnotes, and structure need to survive conversion.

## Why This Exists

Most DOCX-to-Markdown tools do fine on simple prose, then fall over on the details that matter in real reports and papers. `docx2md-cli` exists to preserve Word-specific structure (field-code references, bibliography content controls, vertically merged tables, inline footnotes, and list numbering) so that the output needs minimal cleanup after conversion.

## Feature Comparison

| Feature | docx2md-cli | Pandoc | MarkItDown | mammoth |
|---|---|---|---|---|
| Bold / Italic / Underline | ✅ | ✅ | ❌ | ✅ |
| Footnotes (inline position) | ✅ | ✅ | ❌ | ✅ |
| Field codes (`[N]` refs) | ✅ | Partial | ❌ | ❌ |
| Bibliography (SDT) | ✅ | ❌ | ❌ | ❌ |
| Vertical merge (vMerge) | ✅ | ❌ | ❌ | ❌ |
| Split table detection | ✅ | ❌ | ❌ | ❌ |
| Numbered list distinction | ✅ | ✅ | ❌ | ❌ |
| Nested list levels | ✅ | ✅ | ❌ | ❌ |
| Image extraction + rename | ✅ | ✅ | ❌ | ❌ |
| YAML frontmatter | ✅ | ❌ | ❌ | ❌ |

## Quick Start

```bash
pip install docx2md-cli
docx2md input.docx
docx2md input.docx -o output.md
```

Optional frontmatter support:

```bash
pip install "docx2md-cli[frontmatter]"
```

## CLI Reference

Basic usage:

```bash
docx2md input.docx
docx2md input.docx -o output.md --extract-images images
docx2md input.docx --skip-before-heading --no-frontmatter
```

All flags:

| Flag | Description | Example |
|---|---|---|
| `-o`, `--output PATH` | Write Markdown to `PATH`. Use `-` for stdout. | `docx2md input.docx -o output.md` |
| `--extract-images DIR` | Extract embedded images and link them in Markdown. | `docx2md input.docx --extract-images images` |
| `--skip-before-heading` | Ignore content before the first real Word heading. | `docx2md input.docx --skip-before-heading` |
| `--frontmatter FILE` | Prepend custom YAML frontmatter from a file. | `docx2md input.docx --frontmatter meta.yaml` |
| `--no-frontmatter` | Disable both auto and custom frontmatter. | `docx2md input.docx --no-frontmatter` |
| `-q`, `--quiet` | Suppress stats output. | `docx2md input.docx -q` |
| `--json-stats` | Emit machine-readable stats JSON. | `docx2md input.docx --json-stats` |
| `-v`, `--version` | Print the installed version. | `docx2md --version` |

Streaming examples:

```bash
cat input.docx | docx2md - -o -
docx2md input.docx --json-stats
docx2md input.docx -o - --no-frontmatter
```

## Python API

```python
from docx2md_cli import convert

result = convert(
    "input.docx",
    output_path="output.md",
    images_dir="images",
    skip_before_heading=False,
    frontmatter_path=None,
    frontmatter_dict=None,
    no_frontmatter=False,
    print_stats=True,
    json_stats=False,
)
```

Parameters:

| Parameter | Type | Description |
|---|---|---|
| `input_path` | `str \| bytes \| BinaryIO` | DOCX path, DOCX bytes, or a binary file-like object. |
| `output_path` | `str \| None` | Output Markdown path. Use `"-"` for stdout. |
| `images_dir` | `str \| None` | Directory for extracted embedded images. |
| `skip_before_heading` | `bool` | Skip cover pages or prefatory content before `Heading N`. |
| `frontmatter_path` | `str \| None` | YAML file to prepend as frontmatter. |
| `frontmatter_dict` | `dict \| None` | Frontmatter mapping passed directly from Python. |
| `no_frontmatter` | `bool` | Disable frontmatter generation. |
| `print_stats` | `bool` | Emit conversion stats when writing output. |
| `json_stats` | `bool` | Emit stats as JSON instead of human-readable text. |
| `stats_stream` | `TextIO \| None` | Stream used for stats output. |

Return value:

```python
print(result.lines[:3])
print(result.stats["table_rows"])
print(result.as_json())
```

`convert()` returns a `ConvertResult`, which is list-like for backward compatibility and also exposes `.lines`, `.stats`, and `.as_json()`.

## For AI Agents

Use stdout-friendly and machine-readable modes when chaining tools:

```bash
docx2md input.docx --json-stats
docx2md input.docx -q -o output.md
cat input.docx | docx2md - -o -
```

```python
from docx2md_cli import convert

result = convert("input.docx", print_stats=False, no_frontmatter=True)
stats = result.stats
payload = result.as_json()
```

`--quiet` avoids human-oriented console output. `--json-stats` gives structured stats for automation. `-o -` writes Markdown to stdout. `ConvertResult` lets agents inspect lines and counters without reparsing terminal output.
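
When an agent shells out to the CLI rather than importing the package, a small wrapper can keep the flag handling in one place. The sketch below is illustrative only: `build_docx2md_argv` and `docx_to_markdown` are hypothetical helper names, not part of `docx2md-cli`. It uses only flags from the CLI Reference table and assumes that `-o -` together with `-q` leaves nothing but Markdown on stdout.

```python
import subprocess


def build_docx2md_argv(input_path, output_path="-", quiet=True,
                       no_frontmatter=False, images_dir=None):
    """Build an argv list for the docx2md CLI (hypothetical helper).

    Only flags documented in the CLI Reference table are used.
    """
    argv = ["docx2md", str(input_path), "-o", str(output_path)]
    if quiet:
        # Suppress human-oriented stats so stdout carries only Markdown.
        argv.append("-q")
    if no_frontmatter:
        argv.append("--no-frontmatter")
    if images_dir is not None:
        argv += ["--extract-images", str(images_dir)]
    return argv


def docx_to_markdown(input_path, **options):
    """Run docx2md and return the Markdown it writes to stdout.

    Assumes the `docx2md` executable is on PATH; raises
    subprocess.CalledProcessError if the command exits non-zero.
    """
    completed = subprocess.run(
        build_docx2md_argv(input_path, **options),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return completed.stdout
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces, and `check=True` turns a failed conversion into an exception instead of an empty string.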

## Supported Languages

Caption matching currently recognizes:

- Spanish: `Figura`, `Tabla`
- English: `Figure`, `Table`
- French: `Tableau`
- German: `Abbildung`, `Tabelle`
- Portuguese: `Tabela`
- Italian: `Tabella`

Word heading detection intentionally follows the standard `Heading N` style names.
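
As a rough illustration of how the caption keywords listed above could be matched, the sketch below uses one anchored regular expression. It is an assumption about the general approach, not the converter's actual code: `CAPTION_KEYWORDS`, `parse_caption`, and `image_stem` are hypothetical names, and the `keyword-number` filename scheme is invented for the example.

```python
import re

# Keywords taken from the list above; everything else here is illustrative.
CAPTION_KEYWORDS = (
    "Figura", "Tabla", "Figure", "Table", "Tableau",
    "Abbildung", "Tabelle", "Tabela", "Tabella",
)

# Longest keywords first so "Tableau" is tried before "Table".
_alternation = "|".join(
    sorted((re.escape(k) for k in CAPTION_KEYWORDS), key=len, reverse=True)
)
CAPTION_RE = re.compile(
    rf"^\s*({_alternation})\s+(\d+)\b[\s.:\-]*(.*)$", re.IGNORECASE
)


def parse_caption(text):
    """Return (keyword, number, title) for a caption line, or None."""
    match = CAPTION_RE.match(text)
    if match is None:
        return None
    keyword, number, title = match.groups()
    return keyword, int(number), title.strip()


def image_stem(text, fallback="image"):
    """Derive a filename stem such as 'figure-12' from a caption line."""
    parsed = parse_caption(text)
    if parsed is None:
        return fallback
    keyword, number, _ = parsed
    return f"{keyword.lower()}-{number}"
```

Requiring a number after the keyword keeps ordinary sentences such as "Table of contents" from being treated as captions.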

## How It Works

The converter walks the Word document body in order instead of flattening everything to plain text. A field-code state machine preserves citation references, `numbering.xml` is read directly to distinguish ordered vs unordered lists and nested levels, and the table walker handles `vMerge` and split-table cases before emitting Markdown. Footnotes are collected from `footnotes.xml`, bibliography SDTs are extracted, and image filenames can be derived from nearby captions.
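
To make the `vMerge` step concrete, here is a minimal sketch of one way to flatten vertically merged cells with the standard library. It is not the project's implementation: `table_to_rows` and `SAMPLE` are hypothetical, the choice to repeat the merged value in each continuation row is an assumption (Markdown tables have no row spans), and horizontal merges (`gridSpan`) and nested tables are ignored.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def table_to_rows(tbl):
    """Return rows of cell text, repeating text into vMerge continuations."""
    rows = []
    carried = {}  # column index -> text of the cell that started the merge
    for tr in tbl.findall("w:tr", NS):
        row = []
        for col, tc in enumerate(tr.findall("w:tc", NS)):
            text = "".join(t.text or "" for t in tc.iter(f"{{{W}}}t"))
            vmerge = tc.find("w:tcPr/w:vMerge", NS)
            if vmerge is None:
                # Ordinary cell: any merge in this column has ended.
                carried.pop(col, None)
            elif vmerge.get(f"{{{W}}}val") == "restart":
                # First cell of a vertical merge: remember its text.
                carried[col] = text
            else:
                # <w:vMerge/> with no val (or val="continue") continues it.
                text = carried.get(col, "")
            row.append(text)
        rows.append(row)
    return rows


SAMPLE = f"""
<w:tbl xmlns:w="{W}">
  <w:tr>
    <w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr>
      <w:p><w:r><w:t>Group A</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>1</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>
    <w:tc><w:p><w:r><w:t>2</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>
""".strip()

rows = table_to_rows(ET.fromstring(SAMPLE))
for row in rows:
    print("| " + " | ".join(row) + " |")
```

In the sample, the second row's first cell is an empty continuation cell, so it picks up "Group A" from the row above.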

## Contributing

Issues and PRs are welcome. Run `pytest` before submitting a PR.

## License

MIT
