Metadata-Version: 2.4
Name: airun-hwp
Version: 0.2.9
Summary: AI-powered HWP/HWPX document processing library for Hamonize
Author-email: Kevin Kim <chaeya@gmail.com>
License: MIT License
        
        Copyright (c) 2024 Hamonize Team
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Project-URL: Homepage, https://github.com/chaeya/airun-hwp
Project-URL: Repository, https://github.com/chaeya/airun-hwp.git
Project-URL: Issues, https://github.com/chaeya/airun-hwp/issues
Keywords: hwp,hwpx,document,parser,hancom,hamonize
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pypandoc-hwpx>=0.1.0
Requires-Dist: PyYAML>=6.0
Requires-Dist: Pillow>=10.0.0
Requires-Dist: weasyprint>=60.0
Requires-Dist: markdown>=3.5.0
Requires-Dist: click>=8.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10; extra == "dev"
Requires-Dist: pytest-xdist>=3.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: types-PyYAML>=6.0; extra == "dev"
Dynamic: license-file

# airun-hwp

AI-powered HWP/HWPX document processing library for Hamonize

[![Python Version](https://img.shields.io/badge/python-3.8%2B-blue.svg)](https://python.org)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![PyPI Version](https://img.shields.io/pypi/v/airun-hwp.svg)](https://pypi.org/project/airun-hwp/)

## Features

- **HWPX Parsing**: Parse HWPX files with full document structure preservation
- **HWP Text Extraction**: Extract plain text from HWP files (structure not preserved)
- **Ordered Content Extraction**: Maintain original document flow with mixed content types (HWPX only)
- **Image Extraction**: Extract and save all images from documents
- **Table Processing**: Extract tables with proper formatting (HWPX only)
- **Markdown Conversion**: Convert documents to well-structured Markdown
- **PDF Export**: Generate PDF files with embedded images (included by default)
- **CLI Tool**: Easy-to-use command-line interface

## Installation

```bash
pip install airun-hwp
```

Note: PDF export needs no extra install step. `weasyprint` and `markdown` are installed as regular dependencies.

### Development Installation

```bash
git clone https://github.com/chaeya/airun-hwp.git
cd airun-hwp
pip install -e ".[dev]"
```

## Quick Start

### Command Line Interface

The CLI provides a simple and intuitive interface for processing HWPX documents:

```bash
# Convert to both Markdown and PDF (default)
airun-hwp document.hwpx

# Convert to specific format
airun-hwp document.hwpx --format pdf
airun-hwp document.hwpx -f markdown

# Specify output directory
airun-hwp document.hwpx --format pdf --output ./results
airun-hwp document.hwpx -o ./output_folder

# Get help
airun-hwp --help
```

#### Legacy Commands (Deprecated)

The old subcommand structure is still supported but deprecated:

```bash
# These still work but show deprecation warnings
airun-hwp convert document.hwpx --format pdf
airun-hwp process document.hwpx
```

## Shell Auto-completion

The CLI supports tab completion for bash, zsh, and fish shells. This makes it easier to use the command-line interface without remembering all options.

### Automatic Installation (Recommended)

Run the completion installer after installing the package:

```bash
# Install completion automatically (detects your shell)
airun-hwp-completion

# Or manually run:
python -c "from airun_hwp.cli_click import completion_install; completion_install()"
```

The installer will:
- Detect your current shell (bash or zsh)
- Add completion script to your shell configuration file
- Show you how to activate it

### Manual Setup

#### Bash

Add this line to your `~/.bashrc`:

```bash
eval "$(_AIRUN_HWP_COMPLETE=bash_source airun-hwp)"
```

Then reload your shell:

```bash
source ~/.bashrc
```

#### Zsh

Add this line to your `~/.zshrc`:

```bash
eval "$(_AIRUN_HWP_COMPLETE=zsh_source airun-hwp)"
```

Then reload your shell:

```bash
source ~/.zshrc
```

#### Fish

Create a completion file:

```bash
mkdir -p ~/.config/fish/completions
_AIRUN_HWP_COMPLETE=fish_source airun-hwp > ~/.config/fish/completions/airun-hwp.fish
```
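The manual setup lines above all follow one pattern: an environment variable derived from the program name, set to `<shell>_source`. The sketch below builds that line for a given shell. The helper names `completion_command` and `detect_shell` are hypothetical and not part of airun-hwp, and the `fish_source` value is assumed from Click's naming convention rather than confirmed by this project.

```python
import os

# Shells named in this README; fish_source is assumed from Click's convention.
SUPPORTED_SHELLS = ("bash", "zsh", "fish")


def completion_command(shell, prog="airun-hwp"):
    """Return the shell snippet that prints the completion script."""
    if shell not in SUPPORTED_SHELLS:
        raise ValueError(f"Unsupported shell: {shell!r}")
    # airun-hwp -> _AIRUN_HWP_COMPLETE
    var = "_" + prog.replace("-", "_").upper() + "_COMPLETE"
    return f"{var}={shell}_source {prog}"


def detect_shell(environ=None):
    """Guess the login shell from $SHELL (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    return os.path.basename(environ.get("SHELL", ""))


if __name__ == "__main__":
    shell = detect_shell()
    if shell in SUPPORTED_SHELLS:
        print(completion_command(shell))
    else:
        print(f"No completion recipe for shell: {shell or '(unknown)'}")
```

For bash and zsh the printed snippet is the one wrapped in `eval "$(...)"` above. For fish its output is redirected into the completions directory instead.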

### Using Completion

Once enabled, press Tab to complete commands, options, and option values. The examples below use the deprecated `convert` subcommand:

```bash
# Tab completion for commands
airun-hwp <TAB>
# convert  process

# Tab completion for options
airun-hwp convert <TAB>
# document.hwpx  --format  --help  --output

# Tab completion for option values
airun-hwp convert --format <TAB>
# markdown  md  pdf
```

## Python API

```python
from airun_hwp.reader.hwpx_reader_ordered import HWPXReaderOrdered
from airun_hwp.reader.hwpx_to_markdown import extract_text_from_file

# Parse HWPX file (full structure preserved)
reader = HWPXReaderOrdered()
document = reader.parse("document.hwpx")

# Extract text
text = document.get_all_text()
print(f"Total text length: {len(text)} characters")

# Extract images
images = document.extract_images("./output/images")
print(f"Extracted {len(images)} images")

# Convert to Markdown with tables
markdown_content = document.to_markdown_ordered(
    include_metadata=True,
    images_dir="./output/images"
)

# Save Markdown
with open("document.md", "w", encoding="utf-8") as f:
    f.write(markdown_content)

# For HWP files (plain text only)
hwp_text = extract_text_from_file("document.hwp")
print(f"HWP text (tables not preserved): {len(hwp_text)} characters")
```

## Advanced Usage

### PDF Generation with Custom Styling

```python
import markdown
import weasyprint
from airun_hwp.reader.hwpx_reader_ordered import HWPXReaderOrdered

# Parse document
reader = HWPXReaderOrdered()
document = reader.parse("document.hwpx")

# Extract images
document.extract_images("./output/images")

# Get Markdown content
md_content = document.to_markdown_ordered(
    include_metadata=True,
    images_dir="./output/images"
)

# Convert to HTML
html = markdown.markdown(md_content, extensions=['tables', 'fenced_code'])

# Add custom CSS
css = """
<style>
    body { font-family: 'Malgun Gothic', Arial, sans-serif; }
    img { max-width: 100%; height: auto; }
    table { border-collapse: collapse; width: 100%; }
    th, td { border: 1px solid #333; padding: 8px; }
</style>
"""

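# Generate PDF. write_pdf() returns None when given a target path, so nothing is assigned.
# base_url lets WeasyPrint resolve relative image paths against the current directory.
weasyprint.HTML(string=css + html, base_url=".").write_pdf("document.pdf")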
```

## Document Structure

The library processes HWPX documents using a token-stream approach that preserves the original document order:

- **Text Runs**: Consecutive text segments
- **Images**: Embedded images with proper positioning
- **Tables**: Structured table data
- **Paragraph Breaks**: Logical document divisions
- **Page Breaks**: Document pagination
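To make the token-stream idea concrete, here is a minimal, self-contained sketch of how an ordered list of tokens could be rendered to Markdown. It is an illustration only: the class names (`TextRun`, `Image`, `Table`, `ParagraphBreak`) and the `render_markdown` function are hypothetical and do not reflect the library's internal types. Page breaks are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextRun:
    text: str


@dataclass
class Image:
    path: str
    alt: str = "image"


@dataclass
class Table:
    rows: List[List[str]]  # the first row is treated as the header


@dataclass
class ParagraphBreak:
    pass


Token = Union[TextRun, Image, Table, ParagraphBreak]


def render_table(table: Table) -> str:
    """Render a table as a Markdown pipe table."""
    header, *body = table.rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def render_markdown(tokens: List[Token]) -> str:
    """Walk the tokens in document order and emit Markdown blocks."""
    blocks: List[str] = []   # finished blocks, in original order
    pending: List[str] = []  # text runs of the paragraph being built

    def flush() -> None:
        if pending:
            blocks.append("".join(pending))
            pending.clear()

    for token in tokens:
        if isinstance(token, TextRun):
            pending.append(token.text)
        elif isinstance(token, ParagraphBreak):
            flush()
        elif isinstance(token, Image):
            flush()
            blocks.append(f"![{token.alt}]({token.path})")
        elif isinstance(token, Table):
            flush()
            blocks.append(render_table(token))
    flush()
    return "\n\n".join(blocks)
```

Because every token is handled in the order it appears, an image or table that sits between two paragraphs in the source stays between them in the output.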

## CLI Usage

### Basic Usage

The `airun-hwp` command takes the path to an HWPX file as its only required argument:

```bash
airun-hwp <input_file> [options]

Arguments:
  input_file                  Path to the HWPX file to process

Options:
  --format, -f {markdown,md,pdf,all}
                              Output format (default: all)
  --output, -o PATH          Output directory (default: ./output)
  --help                     Show help message
  --version                  Show version
```

### Examples

```bash
# Process to both formats (default behavior)
airun-hwp document.hwpx

# Create only Markdown
airun-hwp document.hwpx --format markdown

# Create only PDF
airun-hwp document.hwpx -f pdf

# Custom output location
airun-hwp document.hwpx --output ./my_results
```
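To process a whole folder, the CLI can be driven from a short script. The sketch below only uses the `--format` and `--output` options documented above. The helper names `build_command` and `convert_folder` are hypothetical and not part of the package.

```python
import subprocess
from pathlib import Path

# Format values listed in the CLI options above.
VALID_FORMATS = {"markdown", "md", "pdf", "all"}


def build_command(input_file, fmt="all", output_dir="./output"):
    """Build the argument list for one airun-hwp invocation."""
    if fmt not in VALID_FORMATS:
        raise ValueError(f"Unknown format: {fmt!r}")
    return [
        "airun-hwp",
        str(input_file),
        "--format", fmt,
        "--output", str(output_dir),
    ]


def convert_folder(folder, fmt="all", output_dir="./output"):
    """Run airun-hwp once per .hwpx file in the folder, in sorted order."""
    for path in sorted(Path(folder).glob("*.hwpx")):
        # check=True raises CalledProcessError if a conversion fails
        subprocess.run(build_command(path, fmt, output_dir), check=True)


if __name__ == "__main__":
    convert_folder("./documents", fmt="pdf", output_dir="./results")
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.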

## HWP vs HWPX: Important Differences

This library handles HWP and HWPX files differently due to their fundamental format differences:

### HWPX Files (Recommended)
- **Format**: XML-based, open standard
- **Structure**: Preserves full document structure
- **Tables**: ✅ Extracted with proper formatting
- **Images**: ✅ Extracted with positioning
- **Layout**: Maintains original document flow

### HWP Files (Limited Support)
- **Format**: Binary, proprietary format
- **Structure**: Only plain text extraction available
- **Tables**: ❌ Not preserved (extracted as plain text only)
- **Images**: ❌ Cannot preserve original position/sequence
- **Layout**: Original structure and order lost

### Recommendation
For best results, use HWPX files. If you only have HWP files, either:
1. Convert them to HWPX in Hanword (한글) before processing, or
2. Process them as-is when plain text extraction is all you need
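A script that handles mixed input can make this choice from the file extension. The sketch below is a hypothetical helper, not part of airun-hwp, and the mode names `"structured"` and `"plain_text"` are labels invented for this example.

```python
from pathlib import Path


def processing_mode(path):
    """Pick a processing mode from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".hwpx":
        return "structured"   # tables, images, and order are preserved
    if suffix == ".hwp":
        return "plain_text"   # text only; structure is lost
    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
```

A caller could route `"structured"` files to `HWPXReaderOrdered` and `"plain_text"` files to `extract_text_from_file`, as shown in the Python API example.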

## Output Structure

When processing a document named `document.hwpx`:

```
output/
└── document/
    ├── images/
    │   ├── image1.png
    │   ├── image2.png
    │   └── ...
    ├── document.md
    └── document.pdf
```
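When post-processing results in a script, the paths can be derived from the input file name. The sketch below assumes the layout shown in the tree above and the CLI's default `./output` directory. The helper `expected_outputs` is hypothetical and not part of the package.

```python
from pathlib import Path


def expected_outputs(input_file, output_dir="./output"):
    """Return the paths where results for input_file are expected."""
    stem = Path(input_file).stem          # "document" for "document.hwpx"
    doc_dir = Path(output_dir) / stem     # one subfolder per document
    return {
        "images": doc_dir / "images",
        "markdown": doc_dir / f"{stem}.md",
        "pdf": doc_dir / f"{stem}.pdf",
    }


if __name__ == "__main__":
    for kind, path in expected_outputs("document.hwpx").items():
        print(f"{kind}: {path}")
```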

## Dependencies

- `pypandoc-hwpx>=0.1.0`: HWPX file format support
- `PyYAML>=6.0`: YAML configuration parsing
- `Pillow>=10.0.0`: Image processing
- `weasyprint>=60.0`: HTML to PDF conversion (included)
- `markdown>=3.5.0`: Markdown processing (included)
- `click>=8.0.0`: Command-line interface with auto-completion support

## Development

### Running Tests

```bash
pytest
```

### Code Coverage

```bash
pytest --cov=airun_hwp
```

### Code Formatting

```bash
black airun_hwp/
ruff check airun_hwp/
```

### Type Checking

```bash
mypy airun_hwp/
```

## Building for Distribution

```bash
# Build source and wheel distributions
python -m build

# Check the built distributions with twine
twine check dist/*
```

## Publishing to PyPI

```bash
# Upload to Test PyPI
twine upload --repository testpypi dist/*

# Upload to PyPI
twine upload dist/*
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## Support

- 📧 Email: chaeya@gmail.com (Kevin Kim)
- 🐛 Issues: [GitHub Issues](https://github.com/chaeya/airun-hwp/issues)
- 📖 Documentation: [GitHub Wiki](https://github.com/hamonize/airun-hwp/wiki)

## Changelog

### Version 0.2.9
- Simplified CLI interface: removed subcommands for direct usage
- Now use `airun-hwp document.hwpx` instead of `airun-hwp convert document.hwpx`
- Default behavior creates both Markdown and PDF outputs
- Added `--format all` option (default) for creating both formats
- Maintained backward compatibility with deprecated subcommands
- Cleaner and more intuitive command-line experience

### Version 0.2.8
- Added shell auto-completion support for bash, zsh, and fish
- Migrated CLI from argparse to Click for better user experience
- Added automatic completion installer (`airun-hwp-completion`)
- Enhanced CLI with tab completion for commands and options
- Improved error messages with Click's formatting

### Version 0.2.7
- Fixed PyPI publishing workflow with Trusted Publishing
- Fixed license format for Python 3.8 compatibility
- Updated build configuration

### Version 0.2.5
- Fixed `get_all_text()` method to properly extract text from token stream
- Improved text extraction to handle both tokens and paragraphs
- Added deduplication to prevent duplicate text extraction
- Updated documentation to clarify HWP vs HWPX limitations

### Version 0.2.0
- HWPX parsing support
- Markdown conversion
- PDF export functionality
- CLI tool
- Image extraction
- Table processing

### Version 0.1.0
- Initial release

---

**Made with ❤️ for the Hamonize project**
