Metadata-Version: 2.4
Name: docforge
Version: 0.1.0
Summary: Forge perfect documents from any format with precision, power, and simplicity
Author-email: Oscar Song <oscar2song@gmail.com>
Maintainer-email: Oscar Song <oscar2song@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/oscar2song/docforge
Project-URL: Documentation, https://oscar2song.github.io/docforge
Project-URL: Repository, https://github.com/oscar2song/docforge.git
Project-URL: Bug Tracker, https://github.com/oscar2song/docforge/issues
Project-URL: Changelog, https://github.com/oscar2song/docforge/blob/main/CHANGELOG.md
Keywords: pdf,ocr,document,processing,optimization
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Multimedia :: Graphics :: Graphics Conversion
Classifier: Topic :: Office/Business :: Office Suites
Classifier: Topic :: Text Processing :: Markup
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.0.0
Requires-Dist: Pillow>=9.0.0
Requires-Dist: PyPDF2>=3.0.0
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: pdf2image>=1.16.0
Requires-Dist: tqdm>=4.64.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: colorama>=0.4.4
Requires-Dist: rich>=12.0.0
Requires-Dist: reportlab>=3.6.0
Requires-Dist: PyMuPDF>=1.20.0
Requires-Dist: python-docx>=0.8.11
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-mock>=3.8.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=5.0.0; extra == "dev"
Requires-Dist: mypy>=0.991; extra == "dev"
Requires-Dist: pre-commit>=2.20.0; extra == "dev"
Requires-Dist: sphinx>=5.0.0; extra == "dev"
Requires-Dist: sphinx-rtd-theme>=1.0.0; extra == "dev"
Provides-Extra: web
Requires-Dist: fastapi>=0.85.0; extra == "web"
Requires-Dist: uvicorn>=0.18.0; extra == "web"
Requires-Dist: jinja2>=3.0.0; extra == "web"
Requires-Dist: python-multipart>=0.0.5; extra == "web"
Provides-Extra: all
Requires-Dist: docforge[dev,web]; extra == "all"
Dynamic: license-file

# DocForge 🔨

**Forge perfect documents from any format with precision, power, and simplicity.**

DocForge is a document processing toolkit for PDFs: OCR for scanned files, size optimization, splitting, and PDF-to-Word conversion. It is built on proven implementations and organized as a modular codebase with both a command-line interface and a Python API.

## ✨ Features

- 🔍 **OCR Processing**: Convert scanned PDFs to searchable documents with precision
- 🗜️ **Smart Optimization**: Reduce file sizes without compromising quality  
- ⚙️ **Batch Processing**: Handle hundreds of documents efficiently
- 🔧 **Document Analysis**: Extract insights and metadata
- 🎯 **Modular Design**: Use only what you need, extend easily

## 🚀 Why DocForge?

- **Battle-tested OCR algorithms** with Windows compatibility
- **Advanced optimization techniques** from real-world usage
- **Memory-efficient batch processing** for large-scale operations
- **Clean, modular codebase** that's easy to understand and extend
- **Comprehensive error handling** and logging
- **Both programmatic API and command-line interface**

## 📦 Installation

### Option 1: Install from PyPI (when available)
```bash
pip install docforge
```

### Option 2: Install from source
```bash
git clone https://github.com/oscar2song/docforge.git
cd docforge
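pip install -e .
# Optional: add an extra declared in the package metadata (dev, web, or all)
pip install -e ".[dev]"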
```
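
To confirm that the install succeeded, one option is to query the installed distribution's version from Python. This is a minimal sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is illustrative and not part of DocForge.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str = "docforge") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("docforge is not installed in this environment")
    else:
        print(f"docforge {found} is installed")
```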

### System Dependencies

**Ubuntu/Debian:**
```bash
sudo apt-get install tesseract-ocr poppler-utils
```

**macOS:**
```bash
brew install tesseract poppler
```

**Windows:**
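Download Tesseract from https://github.com/tesseract-ocr/tesseract and add it to your `PATH`. `pdf2image` also requires Poppler, so install a Windows build of Poppler and add its `bin` directory to your `PATH` as well.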
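
Because `pytesseract` and `pdf2image` shell out to external programs, a quick way to check the system dependencies is to look for the executables on `PATH`. The sketch below uses only the standard library. It checks for `tesseract` and for Poppler's `pdftoppm`; the helper name `find_system_tools` is hypothetical and not part of DocForge.

```python
import shutil
from typing import Dict, Optional, Sequence

# Executables the Python dependencies call:
# pytesseract runs `tesseract`; pdf2image runs Poppler's `pdftoppm`.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_system_tools(tools: Sequence[str] = REQUIRED_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_system_tools().items():
        status = path if path else "NOT FOUND on PATH"
        print(f"{tool}: {status}")
```

If Tesseract is installed outside `PATH` (common on Windows), `pytesseract` can be pointed at it explicitly via `pytesseract.pytesseract.tesseract_cmd`.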

## 🎯 Quick Start

### Command Line Interface

After installation, use the `docforge` command:

```bash
# Get help
docforge --help

# OCR a scanned PDF
docforge enhanced-ocr -i scanned_document.pdf -o searchable_document.pdf

# Batch OCR processing
docforge enhanced-batch-ocr -i scanned_folder/ -o searchable_folder/

# Standard OCR processing
docforge ocr -i document.pdf -o output.pdf --language eng

# Standard batch OCR processing
docforge batch-ocr -i input_folder/ -o output_folder/

# Test the Rich CLI interface
docforge test-rich

# Run performance benchmarks
docforge benchmark --test-files document.pdf
```
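
The CLI can also be driven from a Python script. The following is a minimal sketch that uses only the `ocr` flags shown above (`-i`, `-o`, `--language`); the helper names are hypothetical, and it assumes the `docforge` executable is on `PATH`.

```python
import subprocess
from typing import List


def build_ocr_command(input_pdf: str, output_pdf: str, language: str = "eng") -> List[str]:
    """Build the argument list for `docforge ocr` from the flags shown above."""
    return ["docforge", "ocr", "-i", input_pdf, "-o", output_pdf, "--language", language]


def run_ocr(input_pdf: str, output_pdf: str, language: str = "eng") -> int:
    """Run the command and return its exit code (assumes `docforge` is on PATH)."""
    completed = subprocess.run(build_ocr_command(input_pdf, output_pdf, language))
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces. A call such as `run_ocr("document.pdf", "output.pdf")` returns whatever exit code the CLI reports.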

### Programmatic API

```python
from docforge import DocumentProcessor

# Initialize the processor
processor = DocumentProcessor(verbose=True)

# OCR a scanned PDF
result = processor.ocr_pdf(
    "scanned_document.pdf",
    "searchable_document.pdf",
    language="eng"
)

# Optimize PDF size
result = processor.optimize_pdf(
    "large_document.pdf",
    "optimized_document.pdf",
    optimization_type="aggressive"
)

# Batch processing
result = processor.batch_ocr_pdfs(
    "scanned_folder/",
    "searchable_folder/"
)
```
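
When you need per-file control, for example to record which files failed, one option is to loop over `ocr_pdf` yourself rather than calling `batch_ocr_pdfs`. The sketch below is an illustration under assumptions: the return values and exceptions of `ocr_pdf` are not documented here, so it assumes failures surface as exceptions, and the helper name `ocr_folder` is hypothetical.

```python
from pathlib import Path
from typing import Any, Dict, List, Tuple


def ocr_folder(processor: Any, input_dir: str, output_dir: str,
               language: str = "eng") -> Tuple[List[Path], Dict[Path, str]]:
    """OCR every PDF in input_dir, continuing past individual failures.

    `processor` is any object exposing ocr_pdf(input, output, language=...),
    as DocumentProcessor does in the example above.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    done: List[Path] = []
    failed: Dict[Path, str] = {}
    for src in sorted(Path(input_dir).glob("*.pdf")):
        try:
            processor.ocr_pdf(str(src), str(out / src.name), language=language)
            done.append(src)
        except Exception as exc:  # assumption: failures are raised as exceptions
            failed[src] = str(exc)
    return done, failed
```

With a real processor this would be called as `done, failed = ocr_folder(DocumentProcessor(), "scanned_folder/", "searchable_folder/")`.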

## 🏗️ Architecture

DocForge is built with a clean, modular architecture:

```
docforge/
├── core/           # Core processing engine
├── pdf/            # PDF operations (proven implementations)  
├── cli/            # Command-line interface
├── utils/          # Shared utilities
└── main.py         # CLI entry point
```

## 📋 Available Commands

| Command | Description |
|---------|-------------|
| `enhanced-ocr` | OCR with advanced performance optimization |
| `enhanced-batch-ocr` | Batch OCR with intelligent performance optimization |
| `ocr` | Standard OCR processing |
| `batch-ocr` | Standard batch OCR processing |
| `optimize` | PDF optimization |
| `pdf-to-word` | PDF to Word conversion |
| `split-pdf` | Split PDF documents |
| `benchmark` | Run performance benchmarks |
| `perf-stats` | Display performance statistics |
| `test-rich` | Test Rich CLI interface |
| `test-errors` | Test error handling |
| `test-validation` | Test validation system |

## 🧪 Examples

Run the examples to see DocForge in action:

```bash
# Basic usage examples (if you have example files)
python examples/basic_usage.py

# Test the CLI interface
docforge test-rich

# Test error handling
docforge test-errors

# Test validation system  
docforge test-validation
```

## 🤝 Contributing

We welcome contributions! The modular architecture makes it easy to add new features.

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## 🗺️ Roadmap

- ✅ Core PDF processing with proven implementations
- ✅ OCR and optimization capabilities  
- ✅ Command-line interface
- ✅ Comprehensive documentation
- 📄 Word document processing (Word ↔ PDF conversion)
- 🎨 Modern GUI interface
- 🚀 Performance optimizations
- 📊 Excel and PowerPoint support
- 🤖 AI-powered document analysis
- 🌐 Web interface

## 📄 License

This project is licensed under the MIT License. See the `LICENSE` file for details.

## 🏆 Acknowledgments

Built with proven implementations and enhanced with modern architecture for the open source community.

---

⭐ **If DocForge helped you, please give it a star!** ⭐

*Built by craftsmen, for craftsmen.* 🔨
