Metadata-Version: 2.4
Name: srx-lib-docs
Version: 0.1.7
Summary: Document utilities for SRX: extract text from PDF, DOCX, PPTX, XLSX, and audio files (MP3, M4A, WAV)
Author-email: SRX <dev@srx.id>
Requires-Python: >=3.12
Requires-Dist: aiofiles>=23.2.1
Requires-Dist: httpx>=0.28.1
Requires-Dist: markitdown[all]>=0.1.3
Requires-Dist: openpyxl>=3.1.2
Requires-Dist: pillow>=10.0.0
Requires-Dist: pypdf2>=3.0.1
Requires-Dist: python-docx>=0.8.11
Requires-Dist: python-pptx>=0.6.21
Description-Content-Type: text/markdown

# srx-lib-docs

Small helpers to extract plain text from common office document formats used by SRX services.

What it includes:
- `extract_text(path_or_bytes, mime_type=None)`: extracts plain text from PDF, DOCX, PPTX, and XLSX input
- `DocumentMarkdownConverter`: downloads a PDF, DOCX, PPTX, or XLSX document and converts it to Markdown

## Install

PyPI (public):
- `pip install srx-lib-docs`

uv (pyproject):
```
[project]
dependencies = ["srx-lib-docs>=0.1.0"]
```

## Usage

```
from srx_lib_docs import extract_text
text = extract_text("/path/to/file.pdf")
```
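
When passing raw bytes there is no file extension to inspect, so the caller likely needs to supply `mime_type`. The following is a minimal sketch of deriving it from a filename. `guess_office_mime` and `OFFICE_MIME_TYPES` are hypothetical names, not part of srx-lib-docs, and whether `extract_text` expects exactly these MIME strings is an assumption.

```python
from pathlib import Path
from typing import Optional

# Standard media types for the four formats listed above.
# Hypothetical lookup table; not taken from srx-lib-docs itself.
OFFICE_MIME_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}


def guess_office_mime(filename: str) -> Optional[str]:
    """Return the MIME type for a supported extension, or None if unknown."""
    return OFFICE_MIME_TYPES.get(Path(filename).suffix.lower())
```

The result could then be passed alongside the bytes, for example `extract_text(data, mime_type=guess_office_mime(name))`.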

Markdown conversion with download:
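```
import asyncio

from srx_lib_docs.markdown import DocumentMarkdownConverter

async def main(url: str) -> None:
    conv = DocumentMarkdownConverter()
    # process_document is a coroutine, so it must be awaited inside an async function
    result = await conv.process_document(url, mimetype="application/pdf")
    print(result["markdown_content"])  # plus file_type, file_size, success

asyncio.run(main("https://example.com/file.pdf"))  # placeholder URL
```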

## Notes

- For XLSX, only the first 20 rows of each sheet are read to keep extraction lightweight; change the limit in the source code if you need more.
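
The row cap can be pictured with a short sketch. This is an illustration of the idea only, not the library's actual code; `head_rows` and `MAX_ROWS` are hypothetical names, and the tab-separated output format is an assumption.

```python
from itertools import islice
from typing import Iterable, Iterator, Sequence

MAX_ROWS = 20  # mirrors the 20-row cap noted above; hypothetical constant name


def head_rows(rows: Iterable[Sequence[object]], limit: int = MAX_ROWS) -> Iterator[str]:
    """Yield at most `limit` rows as tab-separated text, rendering empty cells as blanks."""
    for row in islice(rows, limit):
        yield "\t".join("" if cell is None else str(cell) for cell in row)
```

With openpyxl, the `rows` argument could be `worksheet.iter_rows(values_only=True)`, which yields one tuple of cell values per row. Because `islice` stops after `limit` items, rows beyond the cap are never read.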

## License

Proprietary © SRX
