Metadata-Version: 2.4
Name: structured-docx-loader
Version: 0.1.0
Summary: LangChain document loader for structured Microsoft Word (.docx) files using python-docx.
Project-URL: Repository, https://github.com/Harshitn24/structured-docx-loader
Project-URL: Issues, https://github.com/Harshitn24/structured-docx-loader/issues
Author-email: Harshit <harshitnavadiya24@gmail.com>
License: MIT
License-File: LICENSE
Keywords: document-loader,docx,langchain,llm,word
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: <4.0,>=3.10
Requires-Dist: langchain-core<2.0,>=0.3
Requires-Dist: python-docx<2.0,>=1.0.0
Requires-Dist: requests<3.0,>=2.31.0
Provides-Extra: lint
Requires-Dist: ruff<1.0,>=0.5.0; extra == 'lint'
Provides-Extra: test
Requires-Dist: pytest-mock<4.0,>=3.10.0; extra == 'test'
Requires-Dist: pytest<9.0,>=8.0.0; extra == 'test'
Provides-Extra: typing
Requires-Dist: mypy<2.0,>=1.10.0; extra == 'typing'
Requires-Dist: types-requests>=2.31.0; extra == 'typing'
Description-Content-Type: text/markdown

# structured-docx-loader

A [LangChain](https://github.com/langchain-ai/langchain) `BaseLoader` for Microsoft Word (`.docx`) files that preserves document structure instead of flattening it into one undifferentiated blob of text.

`langchain-community`'s existing Word loaders either dump raw text (`Docx2txtLoader`) or depend on the heavyweight `unstructured` library (`UnstructuredWordDocumentLoader`). `DocxLoader` uses [`python-docx`](https://python-docx.readthedocs.io/) directly to walk the document in its native order and:

- Renders heading styles (`Heading 1` through `Heading 9`) as Markdown headings, preserving hierarchy.
- Converts tables to Markdown (default), HTML, or a key-value row format suitable for retrieval.
- Supports three loading granularities: a single document, one document per heading section, or one document per paragraph/table element.
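
As a rough illustration of the heading and section behaviour listed above, the sketch below maps Word heading style names to Markdown and groups paragraphs so that each heading starts a new section. It works on plain `(style_name, text)` pairs rather than on `python-docx` objects, and the helper names `heading_level`, `render_paragraph`, and `split_into_sections` are hypothetical; this is not the loader's actual implementation.

```python
import re
from typing import List, Optional, Tuple


def heading_level(style_name: str) -> Optional[int]:
    """Return 1-9 for style names 'Heading 1'..'Heading 9', else None."""
    match = re.fullmatch(r"Heading ([1-9])", style_name)
    return int(match.group(1)) if match else None


def render_paragraph(style_name: str, text: str) -> str:
    """Render a heading paragraph as a Markdown heading; pass body text through."""
    level = heading_level(style_name)
    return text if level is None else "#" * level + " " + text


def split_into_sections(paragraphs: List[Tuple[str, str]]) -> List[str]:
    """Group paragraphs so that every heading opens a new section."""
    sections: List[str] = []
    current: List[str] = []
    for style_name, text in paragraphs:
        # A heading closes the section collected so far (if any).
        if heading_level(style_name) is not None and current:
            sections.append("\n\n".join(current))
            current = []
        current.append(render_paragraph(style_name, text))
    if current:
        sections.append("\n\n".join(current))
    return sections


# Illustrative input: (style name, paragraph text) pairs in document order.
paragraphs = [
    ("Heading 1", "Overview"),
    ("Normal", "Intro text."),
    ("Heading 2", "Details"),
    ("Normal", "More text."),
]
sections = split_into_sections(paragraphs)
```

Joining every element of `sections` would correspond loosely to a single-document load, while keeping them separate mirrors the one-document-per-section idea.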

## Install

```bash
pip install structured-docx-loader
```

## Usage

```python
from structured_docx_loader import DocxLoader

# Load the entire document as a single Document
loader = DocxLoader("example.docx")
docs = loader.load()

# Split by heading sections, with HTML tables
loader = DocxLoader("example.docx", mode="sections", table_format="html")
docs = loader.load()

# One Document per paragraph and per table row, with rows rendered as key-value pairs
loader = DocxLoader(
    "example.docx",
    mode="elements",
    table_format="key_value",
    table_extraction_strategy="row",
)
docs = loader.load()
```

The first argument, `file_path`, also accepts an HTTP(S) URL. In that case the file is downloaded to a temporary location before parsing.
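
The URL handling described above can be pictured roughly as follows. This is only a sketch under assumptions: the helper names `is_http_url`, `fetch_to_temp_file`, and `resolve_path` are hypothetical, and it uses the standard library's `urllib` to stay self-contained, whereas the package itself declares a dependency on `requests`.

```python
import shutil
import tempfile
from urllib.parse import urlparse
from urllib.request import urlopen


def is_http_url(file_path: str) -> bool:
    """Return True when file_path looks like an HTTP(S) URL, not a local path."""
    parsed = urlparse(file_path)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def fetch_to_temp_file(url: str) -> str:
    """Download url into a temporary .docx file and return its local path."""
    with urlopen(url) as response:
        # delete=False keeps the file on disk after the handle is closed,
        # so a parser can open it by path afterwards.
        with tempfile.NamedTemporaryFile(suffix=".docx", delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
            return tmp.name


def resolve_path(file_path: str) -> str:
    """Return a local path, downloading first if file_path is a URL."""
    if is_http_url(file_path):
        return fetch_to_temp_file(file_path)
    return file_path
```

In this sketch the caller is responsible for deleting the temporary file once parsing is done.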

### Options

| Argument | Values | Description |
| --- | --- | --- |
| `mode` | `"single"` (default), `"sections"`, `"elements"` | Granularity of the returned `Document` objects. |
| `table_format` | `"markdown"` (default), `"html"`, `"key_value"` | How tables are rendered into text. |
| `table_extraction_strategy` | `"table"` (default), `"row"` | Whether a table becomes one block or one block per row. |
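
To make the table options more concrete, here is a minimal sketch of how a table could be turned into a single Markdown block or into one key-value block per row. It assumes the first row holds the column headers and operates on plain lists of strings; the function names are hypothetical, and the exact text the loader emits may differ.

```python
from typing import List


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render the whole table as one Markdown block (first row = header)."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def rows_to_key_value(rows: List[List[str]]) -> List[str]:
    """Render each data row as its own 'header: value' block."""
    header, *body = rows
    return [
        "\n".join(f"{key}: {value}" for key, value in zip(header, row))
        for row in body
    ]


# Illustrative table content.
rows = [["Name", "Role"], ["Ada", "Engineer"], ["Linus", "Maintainer"]]
markdown = table_to_markdown(rows)
key_value_rows = rows_to_key_value(rows)
```

The per-row form repeats the column names in every block, which is what makes each row self-describing when it is retrieved on its own.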

## Development

```bash
pip install -e ".[test,lint,typing]"
pytest
ruff check .
mypy structured_docx_loader
```

## License

MIT
