Metadata-Version: 2.4
Name: raw_docx
Version: 0.14.0
Summary: A package for processing and analyzing raw document formats
Home-page: https://github.com/daveih/raw_docx
Author: Dave Iberson-Hurst
Author-email: 
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: python-docx>=1.1.2
Requires-Dist: simple_error_log>=0.6.0
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Raw DOCX

A Python library that extends [python-docx](https://python-docx.readthedocs.io/) to convert Word documents into structured, traversable Python objects with export to dictionary, HTML, and plain text formats.

## Installation

```bash
pip install raw_docx
```

## Features

- **Document hierarchy** - Automatic section numbering with multi-level headings (1-6)
- **Rich text** - Colors, highlighting, bold, italic, superscript, and subscript
- **Tables** - Full support for merged cells (row and column spans)
- **Nested lists** - Arbitrary nesting depth with level tracking and numId boundary detection
- **Indentation hierarchy** - Infers nesting from indentation when all items share the same level
- **Bookmarks and cross-references** - Bookmark anchors and field-based references
- **Image extraction** - Extracts embedded images and embeds them in HTML output as base64
- **Multiple export formats** - Dictionary, HTML, and plain text
- **Search** - Find text across sections, tables, and the full document
- **Error tracking** - Integrated logging via [simple_error_log](https://pypi.org/project/simple-error-log/)
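
To make the indentation hierarchy feature more concrete, here is a minimal sketch of how nesting could be inferred when every list item reports the same level. It is an illustration only, not the library's implementation: `Item`, `infer_levels`, and the indent units are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    level: int   # list level declared in the document
    indent: int  # left indentation in arbitrary units (hypothetical)


def infer_levels(items: list[Item]) -> list[int]:
    """Return one nesting level per item, using indentation as a fallback."""
    if not items:
        return []
    if len({item.level for item in items}) > 1:
        # Declared levels already differ, so trust them as they are.
        return [item.level for item in items]
    # All items share one level: rank the distinct indents, smallest first.
    distinct = sorted({item.indent for item in items})
    rank = {indent: position for position, indent in enumerate(distinct)}
    return [rank[item.indent] for item in items]


items = [
    Item("Fruit", 0, 360),
    Item("Apple", 0, 720),
    Item("Pear", 0, 720),
    Item("Vegetables", 0, 360),
]
print(infer_levels(items))  # [0, 1, 1, 0]
```

Ranking the distinct indent values keeps the sketch independent of the actual units; a real document may need a tolerance for near-equal indents.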

## Quick Start

```python
from raw_docx import RawDocx

# Load and process a document
docx = RawDocx("path/to/document.docx", work_dir="/tmp/output")

# Disable indentation-based hierarchy inference if needed
# docx = RawDocx("path/to/document.docx", infer_indent_hierarchy=False)

# Access the structured document
document = docx.target_document

# Export to dictionary
data = docx.to_dict()

# Work with sections
section = document.section_by_title("Introduction")
paragraphs = section.paragraphs()
tables = section.tables()
lists = section.lists()

# Search for content
results = section.find("keyword")

# Generate HTML
html = section.to_html()
```
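
The section calls above can be wrapped in a small helper that summarizes one section. This is a sketch under two assumptions that this README does not confirm: that `section_by_title` returns `None` when no section matches, and that `paragraphs()`, `tables()`, and `lists()` return sized collections. `section_summary` is a hypothetical name, not part of the package.

```python
from __future__ import annotations


def section_summary(document, title: str) -> dict | None:
    """Count the content blocks of one section, or return None if absent."""
    section = document.section_by_title(title)
    if section is None:  # assumption: a missing title yields None
        return None
    return {
        "title": title,
        # assumption: each accessor returns a collection that supports len()
        "paragraphs": len(section.paragraphs()),
        "tables": len(section.tables()),
        "lists": len(section.lists()),
    }
```

With the objects from the Quick Start, it would be called as `section_summary(docx.target_document, "Introduction")`.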

## Key Classes

| Class | Description |
|-------|-------------|
| `RawDocx` | Main entry point; loads and processes a .docx file |
| `RawDocument` | Top-level container managing sections and hierarchy |
| `RawSection` | A document section/heading with its content |
| `RawParagraph` | A paragraph containing runs and bookmarks |
| `RawRun` | A text run with formatting attributes |
| `RawTable` / `RawTableRow` / `RawTableCell` | Table structure with merged cell support |
| `RawList` / `RawListItem` | Nested list structure |
| `RawImage` | Embedded image handling |
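
For readers unfamiliar with base64 image embedding in HTML, the following sketch shows the general technique using only the standard library. It does not reproduce how `RawImage` works internally; `image_to_html` and its parameters are hypothetical.

```python
import base64
import html


def image_to_html(data: bytes, mime_type: str = "image/png", alt: str = "") -> str:
    """Return an <img> tag whose source is a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    # Escape the alt text so quotes or angle brackets cannot break the tag.
    safe_alt = html.escape(alt, quote=True)
    return f'<img src="data:{mime_type};base64,{encoded}" alt="{safe_alt}"/>'


print(image_to_html(b"abc"))
# <img src="data:image/png;base64,YWJj" alt=""/>
```

Embedding the bytes in a data URI keeps the HTML self-contained, at the cost of roughly a one-third size increase over the raw image.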

## Requirements

- Python >= 3.12
- [python-docx](https://pypi.org/project/python-docx/) >= 1.1.2
- [simple_error_log](https://pypi.org/project/simple-error-log/) >= 0.6.0

## License

MIT - see [LICENSE](LICENSE) for details.

## Build and Release

```bash
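# Run the test suite
pytest
# Format and lint the code
ruff format
ruff check
# Build the source distribution and wheel, then upload them to PyPI
python3 -m build --sdist --wheel
twine upload dist/*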
```
