Metadata-Version: 2.4
Name: quanta-pdf
Version: 1.0.5
Summary: Advanced PDF layout analysis engine for extracting figures, tables, and structured content
Author-email: Sovit Nayak <sovitnayak1258@gmail.com>, Manush Murali <manush@example.com>
Maintainer-email: Sovit Nayak <sovitnayak1258@gmail.com>, Manush Murali <manush@example.com>
License: MIT
Project-URL: Homepage, https://github.com/Magnet-AI/Quanta
Project-URL: Documentation, https://quanta.readthedocs.io
Project-URL: Repository, https://github.com/Magnet-AI/Quanta.git
Project-URL: Bug Tracker, https://github.com/Magnet-AI/Quanta/issues
Project-URL: Changelog, https://github.com/Magnet-AI/Quanta/blob/main/CHANGELOG.md
Keywords: pdf,layout-analysis,computer-vision,document-processing,figure-extraction,table-detection,ocr,machine-learning
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Image Processing
Classifier: Topic :: Text Processing :: Markup
Classifier: Topic :: Multimedia :: Graphics :: Graphics Conversion
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: opencv-python>=4.8.0
Requires-Dist: PyMuPDF>=1.23.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: Pillow>=9.0.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: matplotlib>=3.5.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=5.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=2.20.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=5.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.0.0; extra == "docs"
Requires-Dist: myst-parser>=0.18.0; extra == "docs"
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: pytest-xdist>=3.0.0; extra == "test"
Dynamic: license-file

<div align="center">

<img src="assets/quanta.png" alt="Quanta Logo" width="120" height="120"/>

# Quanta

**Advanced PDF Layout Analysis Engine**

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Stars](https://img.shields.io/github/stars/Magnet-AI/Quanta)](https://github.com/Magnet-AI/Quanta)

[![Issues](https://img.shields.io/badge/Issues-GitHub%20Issues-red)](https://github.com/Magnet-AI/Quanta/issues)
[![Discussions](https://img.shields.io/badge/Discussions-GitHub%20Discussions-purple)](https://github.com/Magnet-AI/Quanta/discussions)

</div>

---

Quanta is a PDF layout analysis engine that automatically extracts figures, tables, and structured content from PDF documents. Figure detection uses custom computer vision algorithms; table and text recognition use Mistral Document OCR.

**🎯 Problem Solved**: Complex engineering documents often lose critical visual information (figures, diagrams, technical drawings) when parsed by traditional PDF tools. Quanta targets the accurate detection and extraction of visual elements from technical and engineering PDFs with intricate layouts, multi-column designs, and embedded graphics.

<div align="center">
  <img src="docs/images/debug_overlay_example.png" alt="Layout Analysis Debug Overlay" width="600"/>
  <p><em>Debug overlay showing detected layout elements: columns (blue), text blocks (green), figures (red), and tables (yellow)</em></p>
</div>

## ✨ Features

- 🔍 **Multi-column Layout Detection** - Automatically identifies and processes complex multi-column layouts
- 📊 **Intelligent Table Recognition (Mistral OCR)** - Extracts tables and text with high accuracy via Mistral Document OCR
- 🖼️ **Figure Extraction (Custom)** - Identifies and extracts figures, diagrams, and images using custom algorithms
- 📝 **Text Block Analysis (Mistral + Heuristics)** - Uses Mistral OCR output and in-house grouping for reading order
- 🏷️ **Caption Linking** - Automatically links captions to their corresponding figures and tables
- 🎯 **Accuracy** - Currently about 80% for figures and tables, with ongoing work toward 90%+
- ⚡ **Processing Speed** - About 2-5 seconds per page
- 🛠️ **Easy Integration** - Simple API for integration into existing workflows
- 🔧 **Debug Mode** - Visualize layout analysis with overlay images

## 🚀 Quick Start

### Install via PyPI

```bash
pip install quanta-pdf
```

### Basic Usage (Python)

```python
from quanta import extract_document

result = extract_document("document.pdf", "output/")
print(f"Pages: {len(result['pages'])}")
```

### Command Line Interface

```bash
quanta --input document.pdf --output output/
```

If you want Mistral OCR tables/text, set `MISTRAL_API_KEY` first (see below).

### Environment configuration (.env)

To enable Mistral OCR for tables and text blocks, set your API key. You can either export it or place it in a `.env` file at your project root.

```bash
# Option A: environment variable
export MISTRAL_API_KEY="your-mistral-api-key"

# Option B: .env file (same directory where you run the code)
echo "MISTRAL_API_KEY=your-mistral-api-key" > .env
```
The library loads `.env` automatically; the CLI also picks it up when run from that directory.
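
Because the library reads `.env` on its own, no extra code is required. The sketch below is only a pre-flight check you could run before a long batch to confirm the key is visible. `read_env_file` and `mistral_key_available` are hypothetical helper names for illustration, not part of Quanta, and the parser handles plain `KEY=VALUE` lines only.

```python
import os
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict.

    Hypothetical helper for illustration; Quanta loads .env by itself.
    """
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def mistral_key_available(env_path=".env"):
    """Return True if MISTRAL_API_KEY is set in the environment or in .env."""
    if os.environ.get("MISTRAL_API_KEY"):
        return True
    return bool(read_env_file(env_path).get("MISTRAL_API_KEY"))
```

Calling `mistral_key_available()` from the directory where you run Quanta tells you whether the Mistral-backed table and text stages have a key to work with.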

## 📖 Documentation

### Core Concepts

#### Layout Analysis Pipeline
The engine runs a seven-stage pipeline:

1. **PDF Rendering** - Converts PDF pages to high-resolution images
2. **Column Detection** - Identifies multi-column layouts using whitespace analysis
3. **Text Extraction** - Extracts and groups text blocks
4. **Figure Detection** - Identifies figures using vector clustering and image analysis
5. **Table & Text Recognition (Mistral OCR)** - Leverages Mistral Document OCR to extract tables (CSV) and text blocks
6. **Caption Linking** - Links captions to their corresponding figures/tables
7. **Reading Order** - Determines proper reading sequence
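
To make the caption-linking stage concrete, here is a minimal sketch of one plausible rule: attach each caption to the closest figure or table that it overlaps horizontally. This is an assumption about how such linking can work, not Quanta's actual implementation; the function names, the `(x0, y0, x1, y1)` box format, and the `max_gap` default are all hypothetical.

```python
def horizontal_overlap(a, b):
    """Width of the shared x-range of two (x0, y0, x1, y1) boxes."""
    return max(0.0, min(a[2], b[2]) - max(a[0], b[0]))


def vertical_gap(a, b):
    """Distance between two boxes along y; 0 if they overlap vertically."""
    return max(0.0, max(a[1], b[1]) - min(a[3], b[3]))


def link_captions(captions, regions, max_gap=50.0):
    """Map each caption index to the index of its nearest region, or None.

    A region qualifies only if it shares some horizontal extent with the
    caption and sits closer than max_gap (in page units) above or below it.
    """
    links = {}
    for ci, caption in enumerate(captions):
        best, best_gap = None, max_gap
        for ri, region in enumerate(regions):
            if horizontal_overlap(caption, region) <= 0:
                continue  # different column: not a candidate
            gap = vertical_gap(caption, region)
            if gap < best_gap:
                best, best_gap = ri, gap
        links[ci] = best
    return links
```

Requiring horizontal overlap keeps a caption in one column from being attached to a figure in the neighbouring column, which matters for the multi-column layouts this engine targets.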

#### Mathematical Foundations

**Column Detection Algorithm:**
- Uses whitespace valley analysis to identify column boundaries
- Applies Gaussian smoothing to detect consistent vertical gaps
- Implements adaptive thresholding for varying document layouts
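
The three steps above can be sketched in plain Python. This is an illustrative version under stated assumptions, not the engine's code: it builds a horizontal coverage profile from text-block extents, smooths it with a Gaussian kernel, and treats interior runs below a threshold relative to the profile's peak as column gaps. The function names and the `sigma`, `rel_threshold`, and `min_gap` defaults are hypothetical.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian kernel truncated at three standard deviations."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, sigma):
    """Convolve the profile with a Gaussian, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(profile)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * profile[j]
        out.append(acc)
    return out


def find_column_gaps(blocks, page_width, sigma=2.0, rel_threshold=0.1, min_gap=10):
    """Return the x-centres of whitespace valleys between columns.

    blocks: list of (x0, x1) horizontal extents of text blocks, in pixels.
    """
    # Coverage profile: how many blocks cover each x position.
    profile = [0.0] * page_width
    for x0, x1 in blocks:
        for x in range(max(0, int(x0)), min(page_width, int(x1))):
            profile[x] += 1.0
    smoothed = smooth(profile, sigma)
    peak = max(smoothed) if smoothed else 0.0
    if peak == 0:
        return []
    threshold = rel_threshold * peak  # adapts to how dense the page is
    gaps, start = [], None
    for x, value in enumerate(smoothed + [peak]):  # sentinel closes the last run
        if value < threshold:
            if start is None:
                start = x
        elif start is not None:
            # Keep interior valleys only; runs touching a page edge are margins.
            if start > 0 and x < page_width and x - start >= min_gap:
                gaps.append((start + x - 1) // 2)
            start = None
    return gaps
```

For a 600-pixel-wide page with text blocks spanning `(50, 280)` and `(320, 550)`, the function reports a single gap near the middle of the page; a single full-width column yields no gaps.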

**Table/Text Extraction:**
- Uses Mistral Document OCR to obtain markdown-like structured output
- Parses tables into CSV files and groups text into blocks
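
As a rough illustration of the second step, the sketch below turns a pipe-delimited markdown table into CSV text using only the standard library. It assumes the OCR output contains ordinary markdown tables; the function names are hypothetical, and escaped pipes inside cells are not handled.

```python
import csv
import io
import re


def markdown_table_to_rows(markdown):
    """Collect the rows of pipe-delimited table lines found in the text."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not part of a table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header separator row, e.g. |------|----:|
        if all(re.fullmatch(r":?-+:?", cell) for cell in cells):
            continue
        rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text, quoting cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Going through the `csv` module rather than joining cells with commas matters for cells such as `Bolt, M4`, which must be quoted to stay in one column.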

**Figure Detection:**
- Vector clustering using DBSCAN algorithm
- Aspect ratio analysis to distinguish figures from tables
- Image XObject extraction for embedded graphics
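
To show what DBSCAN-based vector clustering can look like, here is a small self-contained version that groups the centre points of vector primitives and returns a bounding box per cluster. It is a sketch under assumptions: the engine presumably relies on scikit-learn (a declared dependency) rather than hand-written code, and the choice of input points, `eps`, and `min_samples` here is hypothetical.

```python
import math


def region_query(points, index, eps):
    """Indices of all points within eps of points[index] (itself included)."""
    px, py = points[index]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


def cluster_bboxes(points, labels):
    """Bounding box (x0, y0, x1, y1) for every non-noise cluster."""
    boxes = {}
    for (x, y), label in zip(points, labels):
        if label == -1:
            continue
        x0, y0, x1, y1 = boxes.get(label, (x, y, x, y))
        boxes[label] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return boxes
```

Each resulting box is a candidate figure region. Its width-to-height ratio is the kind of input the aspect ratio check above would consume, and isolated strokes labelled as noise are dropped.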

### API Reference (package)

#### `extract_document(input_pdf: str | Path, output_dir: str | Path) -> dict`

Process a PDF document and extract structured content.

**Parameters:**
- `input_pdf`: Path to the input PDF file
- `output_dir`: Directory to save extracted content

**Returns:**
- `dict`: Processing results containing figures, tables, and metadata. The examples in this README read the `pages` and `summary_path` keys.

**Example:**
```python
from quanta import extract_document
result = extract_document("research_paper.pdf", "output/")
print(result["summary_path"])  # JSON summary path
```

## 🎯 Use Cases

### Engineering & Technical Documents
- **Technical Drawings**: Extract engineering diagrams and CAD drawings
- **Specification Sheets**: Parse technical specifications and data tables
- **Engineering Reports**: Process complex multi-column technical reports
- **Manufacturing Docs**: Extract assembly instructions and part diagrams

### Academic Research
- Extract figures and tables from research papers
- Analyze document structure and layout
- Process large collections of academic PDFs

### Document Digitization
- Convert PDF documents to structured data
- Extract content for database storage
- Prepare documents for text analysis

### Content Management
- Automatically categorize document content
- Extract metadata and captions
- Generate document summaries

### Data Analysis
- Extract tabular data from reports
- Process financial documents
- Analyze technical specifications

## 🔧 Advanced Configuration

### Custom Parameters

```python
from pdf_layout_engine import process_pdf

# Custom processing parameters
config = {
    'min_figure_area': 1000,
    'table_detection_threshold': 0.7,
    'column_detection_sensitivity': 0.8
}

result = process_pdf("document.pdf", "output/", config=config)
```

### Debug Mode

Enable debug mode to visualize the layout analysis process:

```bash
python main.py --debug
```

This generates overlay images showing:
- 🟦 Blue rectangles: Column boundaries
- 🟢 Green rectangles: Text blocks
- 🟥 Red rectangles: Figures
- 🟡 Yellow rectangles: Tables

### Output Structure

Results are organized per page under the PDF name inside `output/`.

Example:

```
output/<pdf_name>/
├── page_01/
│   ├── figures/
│   │   └── figure_01.png
│   ├── tables/
│   │   └── table_01.csv          # tables saved as CSV only (no table PNGs)
│   ├── text/
│   │   └── text_blocks.txt       # text blocks from Mistral OCR
│   └── page_01.png               # full page image
├── page_02/
│   └── ...
├── page_XX_debug_overlay.png     # debug overlay for each processed page (at root)
└── summary.json                  # high-level summary (counts, filenames)
```

Key points:
- Tables are saved as CSV files only (no table images).
- Figures are cropped from the page using custom detection and saved as PNGs.
- Text blocks (from Mistral OCR) are written to `text/text_blocks.txt` per page.
- A full-page PNG is saved in each `page_XX/` directory.
- Debug overlays (`page_XX_debug_overlay.png`) are saved at the PDF root inside `output/<pdf_name>/`.
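
Given the layout above, downstream code can collect the extracted files with `pathlib`. The sketch below assumes exactly the directory and file naming shown in the tree; `collect_outputs` is a hypothetical helper, not part of the Quanta API.

```python
from pathlib import Path


def collect_outputs(pdf_output_dir):
    """Return {page_name: {"figures": [...], "tables": [...]}} for one PDF.

    pdf_output_dir is the output/<pdf_name>/ directory. Debug overlays at the
    root also match "page_*" but are files, so the is_dir() check skips them.
    """
    root = Path(pdf_output_dir)
    pages = {}
    for page_dir in sorted(p for p in root.glob("page_*") if p.is_dir()):
        pages[page_dir.name] = {
            "figures": sorted(f.name for f in (page_dir / "figures").glob("*.png")),
            "tables": sorted(t.name for t in (page_dir / "tables").glob("*.csv")),
        }
    return pages
```

Pages without a `figures/` or `tables/` subdirectory simply get empty lists, so the result can be iterated without extra existence checks.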

## 📊 Performance

### Current Benchmarks
- **Processing Speed**: ~2-5 seconds per page
- **Current Accuracy**: ~80% for figures and tables
- **Memory Usage**: ~200MB for typical documents
- **Supported Formats**: PDF 1.4 - PDF 2.0

### 🚧 **Active Development**
We're currently fine-tuning our base models to improve accuracy. The engine is in active development with regular updates to enhance detection performance. We're working towards achieving 90%+ accuracy through:
- Model fine-tuning on engineering document datasets
- Improved preprocessing pipelines
- Enhanced feature extraction algorithms
- Community feedback integration

### Optimization Tips
- Use high-resolution rendering for better accuracy
- Adjust parameters based on document type
- Process pages in parallel for batch operations
- Use debug mode to tune detection parameters
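
For batch operations, one way to apply the parallelism tip is to run several documents concurrently. The sketch below is an assumption-laden illustration: it parallelizes across documents rather than pages, `process_batch` is a hypothetical helper, and whether threads or processes suit Quanta better depends on where the time goes (network OCR calls versus CPU-bound image work), which this README does not specify.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_batch(pdf_paths, output_dir, worker, max_workers=4):
    """Run worker(pdf_path, output_dir) for each PDF.

    Returns (results, errors): two dicts keyed by PDF path, so one failing
    document does not abort the rest of the batch.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path, output_dir): path
                   for path in pdf_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[path] = exc
    return results, errors
```

With Quanta installed, the worker would be `extract_document`, for example `process_batch(["a.pdf", "b.pdf"], "output/", extract_document)`.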

## 🖼️ Examples

### Debug Overlay Analysis
<div align="center">
  <img src="docs/images/debug_overlay_example.png" alt="Layout Analysis Debug Overlay" width="600"/>
  <p><em>Debug overlay showing detected layout elements: columns (blue), text blocks (green), figures (red), and tables (yellow)</em></p>
</div>

### Engineering Document Processing
<div align="center">
  <img src="docs/images/page_11_debug_overlay.png" alt="Engineering Document Analysis" width="500"/>
  <p><em>Complex engineering document with multi-column layout and technical drawings</em></p>
</div>

### Extracted Figure
<div align="center">
  <img src="docs/images/extracted_figure_example.png" alt="Extracted Figure" width="400"/>
  <p><em>Automatically extracted figure from PDF document</em></p>
</div>

### Extracted Table
<div align="center">
  <img src="docs/images/extracted_table_example.png" alt="Extracted Table" width="400"/>
  <p><em>Automatically extracted table with preserved formatting</em></p>
</div>

### Multi-Page Analysis
<div align="center">
  <img src="docs/images/page_08_debug_overlay.png" alt="Multi-page Analysis" width="500"/>
  <p><em>Consistent layout analysis across multiple pages of technical documents</em></p>
</div>

## 👥 Contributors

**Developers & Maintainers:**
- [@soovittt](https://github.com/soovittt) - Core Developer
- [@Manushpm8](https://github.com/Manushpm8) - Core Developer
- [@Magnet-AI](https://github.com/Magnet-AI) - Organization

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

### Development Setup

```bash
# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run with coverage
pytest --cov=src
```

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Built with [PyMuPDF](https://pymupdf.readthedocs.io/) for PDF processing
- Uses [OpenCV](https://opencv.org/) for computer vision operations
- Inspired by research in document layout analysis

## 📞 Support

- 📧 Email: sovitnayak1258@gmail.com
- 🐛 Issues: [GitHub Issues](https://github.com/Magnet-AI/Quanta/issues)
- 💬 Discussions: [GitHub Discussions](https://github.com/Magnet-AI/Quanta/discussions)

---

**Made with ❤️ for the open source community**
