Metadata-Version: 2.4
Name: neuradoc
Version: 0.1.2
Summary: A Python package for parsing and transforming various document formats into LLM-ready data with element classification capabilities
Home-page: https://github.com/neuradoc/neuradoc
Author: NeuraDoc Team
Author-email: NeuraDoc Team <neuradoc@example.com>
Project-URL: Bug Tracker, https://github.com/neuradoc/neuradoc/issues
Project-URL: Documentation, https://github.com/neuradoc/neuradoc
Project-URL: Source Code, https://github.com/neuradoc/neuradoc
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: General
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PyPDF2>=2.0.0
Requires-Dist: python-docx>=0.8.11
Requires-Dist: openpyxl>=3.0.10
Requires-Dist: beautifulsoup4>=4.10.0
Requires-Dist: lxml>=4.6.5
Requires-Dist: pillow>=9.0.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: python-pptx>=0.6.21
Requires-Dist: requests>=2.27.0
Requires-Dist: pyarmor>=9.1.2
Requires-Dist: flask>=3.0.3
Requires-Dist: setuptools>=75.3.2
Requires-Dist: werkzeug>=3.0.6
Requires-Dist: build>=1.2.2.post1
Requires-Dist: twine>=6.1.0
Requires-Dist: trafilatura>=1.6.1
Requires-Dist: gunicorn>=23.0.0
Provides-Extra: ocr
Requires-Dist: pytesseract>=0.3.8; extra == "ocr"
Provides-Extra: tables
Requires-Dist: camelot-py>=0.10.1; extra == "tables"
Requires-Dist: tabula-py>=2.3.0; extra == "tables"
Provides-Extra: nlp
Requires-Dist: spacy>=3.2.0; extra == "nlp"
Provides-Extra: transformers
Requires-Dist: transformers>=4.16.0; extra == "transformers"
Requires-Dist: torch>=1.10.0; extra == "transformers"
Provides-Extra: web
Requires-Dist: flask>=2.0.0; extra == "web"
Requires-Dist: gunicorn>=20.0.0; extra == "web"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# NeuraDoc

[![PyPI version](https://img.shields.io/pypi/v/neuradoc.svg)](https://pypi.org/project/neuradoc/)
[![Python Version](https://img.shields.io/pypi/pyversions/neuradoc.svg)](https://pypi.org/project/neuradoc/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

NeuraDoc is a Python package that parses documents in a range of formats, classifies the elements it extracts (text, tables, images, and so on), and converts the result into LLM-ready data for AI/ML workflows.

## Features

- **Multi-format Support**: Parse 11 document formats (PDF, Word, plain text, and more; see Supported Document Formats)
- **Element Extraction**: Extract text, images, tables, and diagrams from documents
- **Classification**: Classify document elements by type
- **Smart Positioning**: Position and organize extracted elements intelligently
- **LLM Integration**: Convert extracted data to LLM-ready formats (tokenized structures)
- **Memory Efficiency**: Optimized for processing large documents
- **Configurable Parsing**: Control extraction behavior with custom configurations
- **Parsing Profiles**: Use predefined profiles for different extraction needs (fast, detailed, etc.)
- **Batch Processing**: Process multiple documents with consistent settings
- **Performance Metrics**: Get detailed processing statistics and timing information

## Supported Document Formats

NeuraDoc supports the following document formats:

- PDF (`.pdf`)
- Microsoft Word (`.docx`, `.doc`)
- Plain Text (`.txt`)
- Microsoft Excel (`.xlsx`, `.xls`)
- HTML (`.html`, `.htm`)
- XML (`.xml`)
- Images (`.jpg`, `.jpeg`, `.png`, `.gif`)
- Microsoft PowerPoint (`.pptx`, `.ppt`)
- CSV (`.csv`)
- JSON (`.json`)
- Markdown (`.md`)
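
When collecting input files, it can help to skip unsupported ones before handing anything to the parser. The following is a minimal sketch that assumes the extension list above is complete; `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical helpers written for illustration and are not part of the NeuraDoc API, whose own format detection may differ.

```python
from pathlib import Path

# Extensions copied from the list above (assumption: the library's own
# registry of formats may contain more or fewer entries).
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".txt", ".xlsx", ".xls", ".html", ".htm",
    ".xml", ".jpg", ".jpeg", ".png", ".gif", ".pptx", ".ppt", ".csv",
    ".json", ".md",
}


def is_supported(path):
    """Return True if the file extension is in the list above (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

A call such as `is_supported("report.PDF")` checks only the file name; it does not inspect the file contents.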

## Installation

### Basic Installation

```bash
pip install neuradoc
```

### Installation with Optional Dependencies

```bash
# Quote the requirement so shells such as zsh do not expand the brackets

# Install with OCR support (pytesseract also needs the Tesseract binary on the system)
pip install "neuradoc[ocr]"

# Install with advanced table extraction
pip install "neuradoc[tables]"

# Install with NLP capabilities
pip install "neuradoc[nlp]"

# Install with transformer model support
pip install "neuradoc[transformers]"

# Install with web interface
pip install "neuradoc[web]"

# Install with all optional dependencies
pip install "neuradoc[ocr,tables,nlp,transformers,web]"
```

## Quick Start

### Basic Usage

```python
import neuradoc

# Load and parse a document
doc = neuradoc.load_document("path/to/your/document.pdf")

# Get all text content
text = doc.get_text_content()

# Get tables
tables = doc.get_tables()

# Get images
images = doc.get_images()

# Save extracted content in different formats
doc.save("output.json", format="json")
doc.save("output.md", format="markdown")
doc.save("output.txt", format="text")
```
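
The feature list mentions batch processing and performance metrics, but this README does not show their API. As a rough stand-in, the sketch below loops over several files using only the calls from the example above and records how long each one takes. `batch_extract` is a hypothetical helper, not a NeuraDoc function; the loader is passed in as an argument so the sketch makes no assumptions about the library beyond `load_document` and `get_text_content`.

```python
import time


def batch_extract(paths, loader):
    """Load each path with `loader` and collect its text and elapsed seconds.

    `loader` is any callable returning an object with get_text_content(),
    for example neuradoc.load_document (hypothetical wiring, see lead-in).
    """
    results = {}
    for path in paths:
        start = time.perf_counter()
        try:
            doc = loader(path)
            results[path] = {"text": doc.get_text_content(), "error": None}
        except Exception as exc:  # record the failure and keep going
            results[path] = {"text": None, "error": str(exc)}
        results[path]["seconds"] = time.perf_counter() - start
    return results
```

With NeuraDoc installed, this could be called as `batch_extract(["a.pdf", "b.docx"], neuradoc.load_document)`. Catching the exception per file means one unreadable document does not abort the whole run.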

### Advanced Usage

```python
import neuradoc
from neuradoc.models.element import ElementType
from neuradoc.transformers.llm_transformer import chunk_document

# Load document
doc = neuradoc.load_document("document.docx")

# Filter elements by type
headings = doc.get_elements_by_type(ElementType.HEADING)
code_blocks = doc.get_elements_by_type(ElementType.CODE)

# Transform document into chunks for LLM processing
chunks = chunk_document(doc, max_chunk_size=1000, overlap=100)

# Hand each chunk to your own LLM pipeline
for chunk in chunks:
    print(f"Chunk: {len(chunk)} characters")
```
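
To make the `max_chunk_size` and `overlap` parameters concrete, here is a minimal character-based sliding-window chunker. It is an illustration only: `chunk_text` is a hypothetical function, and NeuraDoc's `chunk_document` may split on tokens or element boundaries rather than raw characters.

```python
def chunk_text(text, max_chunk_size=1000, overlap=100):
    """Split text into windows of at most max_chunk_size characters.

    Consecutive windows share `overlap` characters, so context at a chunk
    boundary appears in both neighbours.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    if not 0 <= overlap < max_chunk_size:
        raise ValueError("overlap must be in [0, max_chunk_size)")

    step = max_chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chunk_size])
        if start + max_chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

A larger overlap preserves more context across boundaries at the cost of more chunks, and therefore more LLM calls, for the same document.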

## Web Interface

NeuraDoc includes a web interface for document processing:

```bash
# Install web dependencies
pip install "neuradoc[web]"

# Run the web server
python -m neuradoc.web.app
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
