Metadata-Version: 2.4
Name: gendoruwo
Version: 0.0.2
Summary: Generate Document Running Workflow (GENDORUWO): a document extraction framework
Home-page: https://gitlab.com/alkode.id/gendoruwo
Author: Singgih
Author-email: singgih@alkode.id
License: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Click>=7.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-mock; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: tox; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Requires-Dist: flake8-import-order; extra == "dev"
Requires-Dist: flake8-print; extra == "dev"
Requires-Dist: flake8-builtins; extra == "dev"
Requires-Dist: pep8-naming; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: rope; extra == "dev"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

<p align="center">
  <img src="https://img.freepik.com/vektor-premium/desain-vektor-klipart-monyet-yang-lucu-dan-menawan_1275990-8521.jpg" alt="GENDORUWO Logo" width="128"/>
</p>

# GENDORUWO 👻
**Generate Document Running Workflow**

GENDORUWO is a lightweight Python framework for extracting text from documents such as PDF, DOCX, and XLSX files.

## Features

- 📄 **PDF Text Extraction**: Extract text content from PDF documents
- 📝 **DOCX Text Extraction**: Extract text from Word documents
- 📊 **XLSX Text Extraction**: Extract text from Excel spreadsheets
- 📎 **Multi-format Support**: Extensible architecture for adding more document formats
- ⚡ **Workflow-based**: Define extraction workflows via YAML configuration
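
One common way to make format support extensible is a registry that maps file suffixes to extractor functions. The sketch below illustrates that idea only. It is not GENDORUWO's internal API: the names `EXTRACTORS`, `register`, `extract_txt`, and `extract` are hypothetical, and plain `.txt` is used as the example format so the code runs with the standard library alone.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry mapping a file suffix to an extractor function.
# This is NOT GENDORUWO's internal API, only an illustration of the idea.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {}


def register(suffix: str):
    """Return a decorator that registers an extractor for a suffix such as '.txt'."""
    def decorator(func: Callable[[Path], str]) -> Callable[[Path], str]:
        EXTRACTORS[suffix.lower()] = func
        return func
    return decorator


@register(".txt")
def extract_txt(path: Path) -> str:
    # Plain text needs no parsing; read the file as UTF-8.
    return path.read_text(encoding="utf-8")


def extract(path) -> str:
    """Dispatch to the extractor registered for the file's suffix."""
    path = Path(path)
    try:
        extractor = EXTRACTORS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"Unsupported format: {path.suffix!r}") from None
    return extractor(path)
```

With this layout, supporting another format means writing one function and decorating it with `@register(".ext")`; the dispatch code stays unchanged.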

## Installation

```bash
pip install gendoruwo
```

## Quick Start

### CLI Usage

```bash
# Extract text from a single document
gendoruwo extract document.pdf

# Run a workflow
gendoruwo run workflow.yaml

# Validate a workflow file
gendoruwo validate workflow.yaml

# Initialize a sample workflow
gendoruwo init
```

### Python API

```python
from gendoruwo import Gendoruwo

gd = Gendoruwo()

# Extract text from a document
text = gd.extract("document.pdf")
print(text)
```
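
To process several files, the single-document call above can be wrapped in a small loop. The helper below is a minimal sketch, not part of GENDORUWO: `extract_many` is a hypothetical name, and it takes the extraction function as an argument so it can be tried without the package installed. It assumes the function accepts a path and returns a string, and it catches any exception because the errors that `extract` may raise are not documented here.

```python
from typing import Callable, Dict, Iterable, Tuple


def extract_many(
    extract: Callable[[str], str],
    paths: Iterable[str],
) -> Tuple[Dict[str, str], Dict[str, Exception]]:
    """Run `extract` on each path, collecting texts and failures separately."""
    texts: Dict[str, str] = {}
    errors: Dict[str, Exception] = {}
    for path in paths:
        try:
            texts[path] = extract(path)
        except Exception as exc:  # assumption: error types are unknown, so catch broadly
            # Record the failure and continue with the remaining files.
            errors[path] = exc
    return texts, errors
```

With the package installed, a call might look like `texts, errors = extract_many(Gendoruwo().extract, ["document.pdf", "report.docx"])`.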

## Workflow YAML Format

```yaml
name: "Extract Contracts"
input:
  paths:
    - "docs/contract1.pdf"
    - "docs/contract2.docx"
  recursive: true
output:
  directory: "extracted/"
  format: "text"  # text | markdown | json
options:
  encoding: "utf-8"
```
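
Before running a workflow, it can help to check its structure in your own code. The sketch below is an assumption-laden illustration, not the logic behind `gendoruwo validate`: `check_workflow` is a hypothetical helper, it operates on a workflow already loaded into a `dict`, and it treats `input.paths` and `output.directory` as required and `output.format` as defaulting to `"text"`, none of which this README confirms. Only the three format values come from the example above.

```python
from typing import List

# Format values taken from the comment in the example workflow.
ALLOWED_FORMATS = {"text", "markdown", "json"}


def check_workflow(workflow: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems: List[str] = []

    # Assumption: input.paths is required and must be a non-empty list.
    paths = (workflow.get("input") or {}).get("paths")
    if not isinstance(paths, list) or not paths:
        problems.append("input.paths must be a non-empty list")

    # Assumption: output.directory is required.
    output = workflow.get("output") or {}
    if not output.get("directory"):
        problems.append("output.directory is required")

    # Assumption: output.format defaults to "text" when omitted.
    fmt = output.get("format", "text")
    if not isinstance(fmt, str) or fmt not in ALLOWED_FORMATS:
        problems.append(
            f"output.format must be one of {sorted(ALLOWED_FORMATS)}, got {fmt!r}"
        )
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.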

## Development

```bash
# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
flake8 src/ tests/

# Run tox
tox
```

## License

MIT License - see [LICENSE](LICENSE) for details.
