Metadata-Version: 2.4
Name: longparser
Version: 0.1.0
Summary: Privacy-first document intelligence engine — converts PDFs, DOCX, PPTX, XLSX, and CSV into AI-ready Markdown + structured JSON for RAG pipelines.
Author-email: ENDEVSOLS Team <dev@endevsols.com>
License: MIT
Project-URL: Homepage, https://github.com/ENDEVSOLS/LongParser
Project-URL: Repository, https://github.com/ENDEVSOLS/LongParser
Project-URL: Issues, https://github.com/ENDEVSOLS/LongParser/issues
Project-URL: Documentation, https://endevsols.github.io/LongParser/
Project-URL: Changelog, https://github.com/ENDEVSOLS/LongParser/blob/main/CHANGELOG.md
Keywords: pdf,document,parsing,ocr,rag,ai,docling,chunking,extraction,retrieval-augmented-generation,longparser
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: General
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pydantic<3,>=2.0
Requires-Dist: docling>=2.14
Requires-Dist: docling-core>=2.13
Provides-Extra: pptx
Requires-Dist: python-pptx>=1.0; extra == "pptx"
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.2; extra == "langchain"
Provides-Extra: llamaindex
Requires-Dist: llama-index-core>=0.10; extra == "llamaindex"
Provides-Extra: server
Requires-Dist: fastapi>=0.115; extra == "server"
Requires-Dist: uvicorn[standard]>=0.34; extra == "server"
Requires-Dist: python-multipart>=0.0.9; extra == "server"
Requires-Dist: motor>=3.6; extra == "server"
Requires-Dist: arq>=0.26; extra == "server"
Requires-Dist: python-magic>=0.4.27; extra == "server"
Requires-Dist: python-dotenv>=1.0; extra == "server"
Requires-Dist: langchain>=0.3; extra == "server"
Requires-Dist: langchain-openai>=0.3; extra == "server"
Requires-Dist: langchain-google-genai>=2.0; extra == "server"
Requires-Dist: langchain-groq>=0.3; extra == "server"
Requires-Dist: langchain-mongodb>=0.3; extra == "server"
Requires-Dist: langchain-huggingface>=0.1; extra == "server"
Requires-Dist: langchain-chroma>=0.2; extra == "server"
Requires-Dist: langgraph>=0.2; extra == "server"
Requires-Dist: langgraph-checkpoint>=2.0; extra == "server"
Requires-Dist: tiktoken>=0.7; extra == "server"
Requires-Dist: redis>=5.0; extra == "server"
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=3.0; extra == "embeddings"
Provides-Extra: chroma
Requires-Dist: chromadb>=0.5; extra == "chroma"
Provides-Extra: faiss
Requires-Dist: faiss-cpu>=1.8; extra == "faiss"
Provides-Extra: qdrant
Requires-Dist: qdrant-client>=1.12; extra == "qdrant"
Provides-Extra: latex-ocr
Requires-Dist: pix2tex>=0.1.4; extra == "latex-ocr"
Provides-Extra: docx-equations
Requires-Dist: docxlatex>=0.3.0; extra == "docx-equations"
Requires-Dist: defusedxml>=0.7.0; extra == "docx-equations"
Provides-Extra: mfd
Requires-Dist: pix2text<1.2,>=1.1.1; extra == "mfd"
Provides-Extra: all
Requires-Dist: longparser[pptx]; extra == "all"
Requires-Dist: longparser[langchain]; extra == "all"
Requires-Dist: longparser[llamaindex]; extra == "all"
Requires-Dist: longparser[server]; extra == "all"
Requires-Dist: longparser[embeddings]; extra == "all"
Requires-Dist: longparser[chroma]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Requires-Dist: httpx>=0.27; extra == "dev"
Requires-Dist: anyio>=4.0; extra == "dev"

<p align="center">
  <h1 align="center">LongParser</h1>
  <p align="center"><strong>Privacy-first document intelligence engine for production RAG pipelines.</strong></p>
  <p align="center">
    Parse PDFs, DOCX, PPTX, XLSX &amp; CSV → validated, AI-ready chunks with HITL review.
  </p>
  <p align="center">
    <a href="https://github.com/ENDEVSOLS/LongParser/actions/workflows/ci.yml">
      <img src="https://github.com/ENDEVSOLS/LongParser/actions/workflows/ci.yml/badge.svg" alt="CI">
    </a>
    <a href="https://pypi.org/project/longparser/">
      <img src="https://img.shields.io/pypi/v/longparser.svg?label=pypi&color=0078d4" alt="PyPI">
    </a>
    <a href="https://pypi.org/project/longparser/">
      <img src="https://img.shields.io/pypi/dm/longparser.svg?label=downloads&color=28a745" alt="Downloads">
    </a>
    <a href="https://www.python.org/">
      <img src="https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13-blue.svg" alt="Python">
    </a>
    <a href="LICENSE">
      <img src="https://img.shields.io/badge/License-MIT-brightgreen.svg" alt="MIT License">
    </a>
    <a href="https://endevsols.github.io/LongParser/">
      <img src="https://img.shields.io/badge/docs-online-indigo.svg" alt="Docs">
    </a>
  </p>
</p>

---


## Features

| Feature | Detail |
|---------|--------|
| **Multi-format extraction** | PDF, DOCX, PPTX, XLSX, CSV via Docling |
| **Hybrid chunking** | Token-aware, heading-hierarchy-aware, table-aware |
| **HITL review** | Human-in-the-Loop block & chunk editing before embedding |
| **LangGraph HITL** | `approve / edit / reject` workflow with LangGraph `interrupt()` |
| **3-layer memory** | Short-term turns + rolling summary + long-term facts |
| **Multi-provider LLM** | OpenAI, Gemini, Groq, OpenRouter |
| **Multi-backend vectors** | Chroma, FAISS, Qdrant |
| **Async-first API** | FastAPI + Motor (MongoDB) + ARQ (Redis) |
| **LangChain & LlamaIndex adapters** | Drop-in LangChain `BaseRetriever` and LlamaIndex `QueryEngine` |
| **Privacy-first** | Parsing and chunking run locally; document text is sent to a third party only when a hosted LLM or embedding provider such as OpenAI is configured |

---

## Installation

### Core (SDK only — no API server)

```bash
pip install longparser
```

### With REST API server (FastAPI + MongoDB + LLM)

```bash
pip install "longparser[server]"
```

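### All extras

```bash
pip install "longparser[all]"
```

The `all` extra bundles `pptx`, `langchain`, `llamaindex`, `server`, `embeddings`, and `chroma`. The `faiss`, `qdrant`, `latex-ocr`, `docx-equations`, and `mfd` extras are not part of `all` and must be requested by name, for example `pip install "longparser[faiss,qdrant]"`.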
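
Because optional features live in extras, it can help to check which optional dependencies are importable before relying on them. The sketch below uses only the standard library. The helper name `available_extras` is hypothetical, and the mapping from extra name to import name is an assumption based on the usual module names of those distributions (for example, `faiss-cpu` imports as `faiss`).

```python
from importlib.util import find_spec

# Assumed mapping: extra name -> top-level module that extra is expected to provide.
EXTRA_MODULES = {
    "pptx": "pptx",
    "langchain": "langchain_core",
    "embeddings": "sentence_transformers",
    "chroma": "chromadb",
    "faiss": "faiss",
    "qdrant": "qdrant_client",
}


def available_extras(modules=EXTRA_MODULES):
    """Return {extra_name: bool}, True when the extra's module can be imported."""
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"{extra:<12} {'installed' if ok else 'missing'}")
```

`find_spec` only locates the module without importing it, so the check stays fast even for heavy packages.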

---

## Quick Start

### Python SDK

```python
from longparser import PipelineOrchestrator, ProcessingConfig

# Build a pipeline and parse a single file
pipeline = PipelineOrchestrator()
result = pipeline.process_file("document.pdf")

print(f"Pages: {result.document.metadata.total_pages}")
print(f"Chunks: {len(result.chunks)}")
if result.chunks:  # guard against a document that produced no chunks
    print(result.chunks[0].text)
```
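
Once chunks are available, a common next step is to persist them for inspection or downstream indexing. The following is a minimal sketch, not part of LongParser: `chunks_to_jsonl` is a hypothetical helper, and it assumes only that each chunk exposes a `.text` attribute, as in the snippet above.

```python
import json
from pathlib import Path


def chunks_to_jsonl(chunks, path):
    """Write one JSON object per chunk; only the `.text` attribute is assumed."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for index, chunk in enumerate(chunks):
            record = {"index": index, "text": chunk.text}
            # ensure_ascii=False keeps non-Latin text readable in the output file
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

With the result from the example above, this would be called as `chunks_to_jsonl(result.chunks, "chunks.jsonl")`.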

### REST API

```bash
# 1. Copy and edit configuration
cp .env.example .env

# 2. Start services (MongoDB + Redis)
docker-compose up -d mongo redis

# 3. Start the API
uv run uvicorn longparser.server.app:app --reload --port 8000

# 4. Upload a document
curl -X POST http://localhost:8000/jobs \
  -H "X-API-Key: your-key" \
  -F "file=@document.pdf"

# 5. Check job status (replace {job_id} with your job's ID here and below)
curl http://localhost:8000/jobs/{job_id} -H "X-API-Key: your-key"

# 6. Finalize and embed
curl -X POST http://localhost:8000/jobs/{job_id}/finalize \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"finalize_policy": "approve_all_pending"}'

curl -X POST http://localhost:8000/jobs/{job_id}/embed \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"provider": "huggingface", "model": "BAAI/bge-base-en-v1.5", "vector_db": "chroma"}'

# 7. Chat with the document
curl -X POST http://localhost:8000/chat/sessions \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"job_id": "your-job-id"}'

curl -X POST http://localhost:8000/chat \
  -H "X-API-Key: your-key" \
  -H "Content-Type: application/json" \
  -d '{"session_id": "...", "job_id": "...", "question": "What is the refund policy?"}'
```
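
The same calls can be scripted from Python. The sketch below uses only the standard library and reuses the paths, header, and payloads from the curl commands above. The helper names `build_json_request` and `send` are hypothetical, and `send` assumes the API answers with a JSON body, which the commands above do not confirm.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # same host and port as the curl examples


def build_json_request(path, api_key, payload=None, base_url=BASE_URL):
    """Build (but do not send) a request carrying the X-API-Key header.

    With a payload the request is a JSON POST; without one it is a GET.
    """
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(base_url + path, data=data)
    request.add_header("X-API-Key", api_key)
    if data is not None:
        request.add_header("Content-Type", "application/json")
    return request


def send(request):
    """Send a prepared request and decode the response body as JSON (assumed format)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example: the finalize call from step 6 ("your-job-id" is a placeholder).
finalize = build_json_request(
    "/jobs/your-job-id/finalize",
    api_key="your-key",
    payload={"finalize_policy": "approve_all_pending"},
)
print(finalize.get_method(), finalize.full_url)
```

Separating request construction from sending keeps the builder easy to test without a running server. The file upload in step 4 needs a multipart body and is left out of this sketch.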

---

## Architecture

```
Document → Extract → Validate → HITL Review → Chunk → Embed → Index
                                                              ↓
                                             Chat → RAG → LLM → Answer
```

### Pipeline Stages

1. **Extract** — Docling converts PDF/DOCX/etc. into structured `Block` objects
2. **Validate** — Per-page confidence scoring and RTL detection
3. **HITL Review** — Human approves/edits/rejects blocks and chunks via the API
4. **Chunk** — `HybridChunker` builds token-aware RAG chunks with section hierarchy
5. **Embed** — The embedding engine (HuggingFace / OpenAI) generates vectors, which are stored in Chroma, FAISS, or Qdrant
6. **Chat** — LCEL chain with 3-layer memory and citation validation
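
To make the Chunk stage concrete, here is a minimal sketch of token-aware, heading-aware chunking. It is not the `HybridChunker` implementation: the `SketchChunk` type, the `(kind, text)` block format, and the whitespace-based token count are all assumptions chosen to keep the example self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class SketchChunk:
    """Hypothetical chunk type for this sketch, not LongParser's `Chunk` model."""
    section: str
    texts: list = field(default_factory=list)

    @property
    def text(self):
        return "\n".join(self.texts)


def count_tokens(text):
    # A whitespace word count stands in for a real tokenizer.
    return len(text.split())


def chunk_blocks(blocks, max_tokens=50):
    """Group (kind, text) blocks into chunks that never cross a heading
    and stay within max_tokens where possible."""
    chunks, current, used = [], None, 0
    section = ""
    for kind, text in blocks:
        if kind == "heading":
            section = text   # later paragraphs belong to this section
            current = None   # force a new chunk at every heading
            continue
        size = count_tokens(text)
        if current is None or used + size > max_tokens:
            current = SketchChunk(section=section)
            chunks.append(current)
            used = 0
        current.texts.append(text)
        used += size
    return chunks
```

A single paragraph longer than `max_tokens` ends up as its own oversized chunk in this sketch; a production chunker would typically split it further.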

---

## Project Structure

```
src/longparser/
├── schemas.py           ← core Pydantic models (Document, Block, Chunk, …)
├── extractors/          ← Docling, LaTeX OCR backends
├── chunkers/            ← HybridChunker
├── pipeline/            ← PipelineOrchestrator
├── integrations/        ← LangChain loader & LlamaIndex reader
├── utils/               ← shared helpers (RTL detection, …)
└── server/              ← REST API layer
    ├── app.py           ← FastAPI application (all routes)
    ├── db.py            ← Motor async MongoDB
    ├── queue.py         ← ARQ/Redis job queue
    ├── worker.py        ← ARQ background worker
    ├── embeddings.py    ← HuggingFace / OpenAI embedding engine
    ├── vectorstores.py  ← Chroma / FAISS / Qdrant adapters
    └── chat/            ← RAG chat engine
        ├── engine.py    ← ChatEngine (LCEL + 3-layer memory)
        ├── graph.py     ← LangGraph HITL workflow
        ├── schemas.py   ← chat Pydantic models
        ├── retriever.py ← LangChain BaseRetriever adapter
        ├── llm_chain.py ← multi-provider LLM factory
        └── callbacks.py ← observability callbacks
```
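
The tree above lists RTL detection among the shared helpers in `utils/`. How LongParser implements it is not shown here, so the following is only a sketch of one common approach: counting characters by Unicode bidirectional class. The function names `rtl_ratio` and `is_rtl` and the 0.5 threshold are assumptions.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # Unicode bidi classes for right-to-left letters


def rtl_ratio(text):
    """Fraction of strongly directional characters that are right-to-left."""
    rtl = ltr = 0
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            rtl += 1
        elif bidi == "L":
            ltr += 1
    total = rtl + ltr
    # Digits, spaces, and punctuation carry no strong direction and are ignored.
    return rtl / total if total else 0.0


def is_rtl(text, threshold=0.5):
    """Treat text as RTL when more than `threshold` of its letters are RTL."""
    return rtl_ratio(text) > threshold
```

Working with a ratio rather than the first strong character makes the check more tolerant of mixed content, such as Arabic text that quotes a Latin product name.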

---

## LangChain Integration

```python
from longparser.integrations.langchain import LongParserLoader

loader = LongParserLoader("report.pdf")
docs = loader.load()  # list[langchain_core.documents.Document]
```

## LlamaIndex Integration

```python
from longparser.integrations.llamaindex import LongParserReader

reader = LongParserReader()
docs = reader.load_data("report.pdf")
```

---

## Configuration

Copy `.env.example` to `.env` and set:

| Variable | Default | Description |
|----------|---------|-------------|
| `LONGPARSER_MONGO_URL` | `mongodb://localhost:27017` | MongoDB connection |
| `LONGPARSER_REDIS_URL` | `redis://localhost:6379` | Redis for job queue |
| `LONGPARSER_LLM_PROVIDER` | `openai` | LLM provider |
| `LONGPARSER_LLM_MODEL` | `gpt-4o` | Model name |
| `LONGPARSER_EMBED_PROVIDER` | `huggingface` | Embedding provider |
| `LONGPARSER_VECTOR_DB` | `chroma` | Vector store backend |

---

## Running with Docker

```bash
docker-compose up
```

API available at `http://localhost:8000` · Docs at `http://localhost:8000/docs`

---

## Testing

```bash
# Install dev dependencies
uv sync --extra dev

# Run unit tests
uv run pytest tests/unit/ -v

# Run with coverage
uv run pytest tests/ --cov=src/longparser --cov-report=term-missing
```

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and PR guidelines.

## Security

See [SECURITY.md](SECURITY.md) for vulnerability reporting.

## License

[MIT](LICENSE) — Copyright © 2026 ENDEVSOLS
