Metadata-Version: 2.3
Name: wizit-open-rag
Version: 0.0.1
Summary: AI-powered document transcription and semantic chunking for RAG pipelines
Keywords: rag,retrieval-augmented-generation,llm,chunking,transcription,weaviate,bedrock,langchain,langgraph,pdf,semantic-search
Author: Restebance
Author-email: Restebance <restebance@gmail.com>
License: Apache-2.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Dist: boto3>=1.40.23
Requires-Dist: langchain>=1.2.10
Requires-Dist: langchain-aws>=1.3.0
Requires-Dist: langchain-classic>=1.0.7
Requires-Dist: langchain-community>=0.4.1
Requires-Dist: langchain-core>=1.2.16
Requires-Dist: langchain-experimental>=0.4.1
Requires-Dist: langchain-text-splitters>=1.1.1
Requires-Dist: langgraph>=1.0.9
Requires-Dist: pillow>=11.3.0
Requires-Dist: pymupdf>=1.27.1
Requires-Dist: anthropic>=0.84.0
Requires-Dist: psycopg2-binary>=2.9.11
Requires-Dist: sqlalchemy[asyncio]>=2.0.43
Requires-Dist: langchain-postgres>=0.0.17
Requires-Dist: weaviate-client>=4.0.0
Requires-Dist: langchain-weaviate>=0.0.3
Requires-Python: >=3.12
Project-URL: Bug Tracker, https://github.com/Restebance/open_rag/issues
Project-URL: Repository, https://github.com/Restebance/open_rag
Description-Content-Type: text/markdown

# open_rag

A Python library for AI-powered document transcription and semantic chunking in RAG (Retrieval-Augmented Generation) pipelines. It transcribes PDFs to Markdown with an LLM (Claude via **AWS Bedrock**), chunks the resulting Markdown semantically, enriches each chunk with surrounding context, and returns `Document` objects ready to index into PostgreSQL pgvector.

**Version**: 0.0.1 | **Python**: >=3.12 | **Build**: uv

---

## Features

- PDF-to-Markdown transcription powered by Claude via AWS Bedrock
- LangGraph-based transcription workflow with configurable retry logic and accuracy thresholds
- Semantic chunking with 85th-percentile breakpoints (plus recursive and Markdown-header strategies)
- Per-chunk context enrichment via a dedicated LangGraph workflow — each chunk is wrapped with `<context>` and `<content>` tags
- Pluggable storage backends: local filesystem or AWS S3
- Vector indexing into PostgreSQL pgvector via LangChain `PGVectorStore`
- LangSmith tracing support

---

## Prerequisites

- Python 3.12 or higher
- [uv](https://docs.astral.sh/uv/) for dependency management
- AWS credentials configured (standard boto3 credential chain — env vars, `~/.aws/credentials`, or instance profile)
- PostgreSQL database with the [pgvector](https://github.com/pgvector/pgvector) extension enabled

---

## Installation

Install from PyPI. The distribution is published as `wizit-open-rag`; the import name is `open_rag`:

```bash
pip install wizit-open-rag
```

For development (clone + install with dev tools):

```bash
git clone https://github.com/Restebance/open_rag.git
cd open_rag
uv sync --group dev
cp example.env .env
```

Fill in `.env` with your credentials (see [Environment Variables](#environment-variables) below).

---

## Usage

### Document Transcription

`OpenRagTranscriber` accepts the raw bytes of a single PDF page and returns a `ParsedDocPage` containing the Markdown transcription.

```python
import asyncio
from open_rag import OpenRagTranscriber

transcriber = OpenRagTranscriber(
    langsmith_project_name="my-project",   # required
    langsmith_api_key="lsv2_...",          # required
    llm_model_id="global.anthropic.claude-sonnet-4-6",
    target_language="es-CO",
    transcription_accuracy_threshold=0.90,
    max_transcription_retries=2,
)

with open("page.pdf", "rb") as f:
    page_bytes = f.read()

result = asyncio.run(transcriber.transcribe_document(page_bytes))
print(result.page_text)  # Markdown string
```
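
The `transcription_accuracy_threshold` and `max_transcription_retries` arguments suggest a score-then-retry loop. The library implements this as a LangGraph workflow; the plain-Python sketch below only illustrates the idea. `Attempt`, `transcribe_with_retries`, and `make_fake_attempts` are hypothetical names, and the assumption that `max_retries` counts attempts *after* the first one is not confirmed by the library.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Attempt:
    text: str
    accuracy: float  # hypothetical quality score in [0, 1]


async def transcribe_with_retries(
    attempt_once: Callable[[], Awaitable[Attempt]],
    accuracy_threshold: float = 0.90,
    max_retries: int = 2,
) -> Attempt:
    """Retry until the score meets the threshold; return the best attempt seen."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    best: Attempt | None = None
    # Assumption: one initial attempt plus up to max_retries further attempts.
    for _ in range(max_retries + 1):
        current = await attempt_once()
        if best is None or current.accuracy > best.accuracy:
            best = current
        if best.accuracy >= accuracy_threshold:
            break
    return best


def make_fake_attempts(scores: list[float]) -> Callable[[], Awaitable[Attempt]]:
    """Stand-in for a real LLM call: yields one pre-set score per attempt."""
    remaining = iter(scores)

    async def attempt_once() -> Attempt:
        score = next(remaining)
        return Attempt(text=f"draft scored {score}", accuracy=score)

    return attempt_once


result = asyncio.run(transcribe_with_retries(make_fake_attempts([0.70, 0.95, 0.99])))
print(result.accuracy)  # prints 0.95: the second attempt clears the 0.90 threshold
```

Returning the best attempt rather than the last one is a design choice of this sketch: if every attempt falls short, the caller still gets the highest-scoring draft.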

### Semantic Chunking with Context

`ChunksManager` takes a pre-loaded Markdown string and returns a list of LangChain `Document` objects, each enriched with a contextual summary.

```python
import asyncio
from open_rag import ChunksManager

manager = ChunksManager(
    langsmith_project_name="my-project",   # required
    langsmith_api_key="lsv2_...",          # required
)

with open("document.md") as f:
    markdown_content = f.read()

docs = asyncio.run(manager.gen_context_chunks(
    file_key="document.md",
    file_markdown_content=markdown_content,
    file_tags={"category": "hr", "department": "onboarding"},
))

# docs is a List[Document]; index to pgvector as needed
for doc in docs:
    print(doc.page_content)
```

> **Note:** `gen_context_chunks` does not load files from storage — the caller must pass the content as a string. Indexing to pgvector is the caller's responsibility.
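
Because each chunk's `page_content` carries both a `<context>` and a `<content>` section, downstream code may want to separate them again, for example to show only the original text to end users. The sketch below assumes a layout with one `<context>...</context>` block followed by one `<content>...</content>` block; the exact whitespace and ordering the library emits are an assumption, and `wrap_chunk` and `split_chunk` are hypothetical helpers, not part of `open_rag`.

```python
import re


def wrap_chunk(context: str, content: str) -> str:
    """Build a chunk in the assumed <context>/<content> layout."""
    return (
        f"<context>\n{context.strip()}\n</context>\n"
        f"<content>\n{content.strip()}\n</content>"
    )


# DOTALL lets '.' span newlines, since chunks are usually multi-line Markdown.
_CHUNK_RE = re.compile(
    r"<context>\s*(?P<context>.*?)\s*</context>\s*"
    r"<content>\s*(?P<content>.*?)\s*</content>",
    re.DOTALL,
)


def split_chunk(page_content: str) -> tuple[str, str]:
    """Return (context, content); fall back to ('', text) if the tags are absent."""
    match = _CHUNK_RE.search(page_content)
    if match is None:
        return "", page_content.strip()
    return match.group("context"), match.group("content")


wrapped = wrap_chunk(
    "Section 2 of the onboarding guide.",
    "New hires receive a laptop on day one.",
)
context, content = split_chunk(wrapped)
print(context)
print(content)
```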

---

## Architecture

The codebase follows a clean layered architecture. Dependency direction: `transcription.py / chunks.py → application → domain ↔ infra ← workflows`.

```
open_rag/                     # installable package (src/open_rag/)
├── transcription.py          # Public API — OpenRagTranscriber
├── chunks.py                 # Public API — ChunksManager
├── domain/                   # Core data models (PageToTranscribe, ParsedDocPage, ParsedDoc)
├── application/              # Orchestration + abstract interfaces (ABCs)
├── data/                     # Shared enums and prompt strings
├── infra/
│   ├── llms/                 # AWS Bedrock chat (ChatBedrockConverse)
│   ├── embeddings/           # AWS Bedrock embeddings (BedrockEmbeddings)
│   ├── persistence/          # Local filesystem, AWS S3, PostgreSQL managers
│   ├── rag/                  # SemanticChunks, RecursiveChunks, MarkdownHeadersChunks, PGVectorStore, WeaviateEmbeddingsManager
│   └── secrets/              # AWS Secrets Manager helper
├── utils/                    # validate_file_name_format
└── workflows/                # LangGraph state machines (transcription + context)
tests/                        # pytest suite
data/                         # Sample / test documents
example.env
pyproject.toml
```

### Key Data Flow

```
PDF bytes
  → ParseDocModelService  (PyMuPDF → base64 pages)
  → TranscriptionWorkflow (LangGraph → Claude via AWS Bedrock → Markdown)

Markdown string + tags
  → SemanticChunks        (AWS Bedrock embeddings, 85th-percentile breakpoints)
  → ContextWorkflow       (LangGraph → Claude adds surrounding context per chunk)
  → List[Document]        (each chunk wrapped in <context> / <content> tags)
  → Caller indexes to pgvector
```
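
To make the "85th-percentile breakpoints" step concrete, here is a minimal, dependency-free sketch of percentile-based splitting: measure the cosine distance between consecutive sentence embeddings and start a new chunk wherever the distance exceeds the 85th percentile of all distances. This is an illustration of the general technique, not the library's code; the real `SemanticChunks` delegates to LangChain's `SemanticChunker` with Bedrock embeddings and may differ in detail. The toy two-dimensional vectors and all function names are made up for the example.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def split_at_breakpoints(
    sentences: list[str], embeddings: list[list[float]], pct: float = 85.0
) -> list[str]:
    """Start a new chunk wherever the jump between neighbours exceeds the percentile."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks: list[str] = []
    current = [sentences[0]]
    # distances[i] is the gap between sentence i and sentence i + 1.
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["Cats purr.", "Cats nap.", "Taxes are due.", "File by April."]
toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
print(split_at_breakpoints(sentences, toy_embeddings))
```

With these toy vectors the only large jump is between the second and third sentence, so the text splits into two chunks at the topic change.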

---

## Environment Variables

Copy `example.env` to `.env` and fill in the values:

| Variable | Purpose |
|---|---|
| `VECTOR_STORE_CONNECTION` | PostgreSQL connection string (pgvector) |
| `VECTOR_STORE_TABLE` | pgvector table name |
| `LANGSMITH_API_KEY` | LangSmith API key for tracing |
| `LANGCHAIN_PROJECT` | LangSmith project name |
| `LANGSMITH_TRACING` | Enable LangSmith tracing (`true` / `false`) |
| `SUPABASE_KEY` / `SUPABASE_URL` | Supabase credentials (optional) |

AWS credentials are read from the standard **boto3 credential chain** and are not set in `.env`.
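
A small startup check can catch missing configuration before any AWS or database call is made. The sketch below is an illustration only: which variables are strictly required is an assumption (the table marks only the Supabase pair as optional), and `missing_env_vars` is a hypothetical helper, not part of `open_rag`.

```python
import os
from typing import Mapping

# Assumption: these four are needed for indexing and tracing.
REQUIRED_VARS = (
    "VECTOR_STORE_CONNECTION",
    "VECTOR_STORE_TABLE",
    "LANGSMITH_API_KEY",
    "LANGCHAIN_PROJECT",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
example_env = {
    "VECTOR_STORE_CONNECTION": "postgresql://user:password@localhost:5432/rag",
    "VECTOR_STORE_TABLE": "documents",
    "LANGSMITH_API_KEY": "   ",  # blank values count as missing
}
print(missing_env_vars(example_env))
```

Accepting any mapping keeps the check easy to test; call `missing_env_vars()` with no argument to inspect the real process environment.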

---

## Development

### Running tests

```bash
# Unit tests (mocked — no AWS credentials required)
uv run pytest

# Transcription integration test (requires live AWS credentials)
uv run python src/open_rag/transcription.py

# Chunking integration test (requires live AWS credentials)
uv run python src/open_rag/chunks.py
```

### Profiling

```bash
# CPU profiling
uv run pyinstrument test.py transcribe <file.pdf> <source_dir> <target_dir>

# Memory profiling
uv run python -m memray run test.py transcribe <file.pdf> <source_dir> <target_dir>
```

### Building the package

```bash
uv build
```

---

## Gotchas

- `SemanticChunks` calls AWS Bedrock **at construction time** (via `SemanticChunker`) — not just at index time. Make sure credentials are available before instantiating `ChunksManager`.
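- Both `transcribe_document` and `gen_context_chunks` are `async`. From synchronous code, wrap them in `asyncio.run(...)`; inside an already-running event loop (for example a notebook or an async web handler), `await` them directly, because `asyncio.run` raises `RuntimeError` there.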
- `OpenRagTranscriber` and `ChunksManager` require `langsmith_project_name` and `langsmith_api_key` as **constructor arguments** — they are not read from environment variables.
- `ParseDocModelService.parse_document_to_base64_pages` iterates `range(0, page_count)` — pages are zero-indexed (`page_number=0` is the first page).
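- AWS Bedrock cross-region inference profile IDs carry a routing prefix. The examples here use the global profile, which has the `global.` prefix (e.g. `global.anthropic.claude-sonnet-4-6`); geography-scoped profiles use prefixes such as `us.` or `eu.` instead.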
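
Since the transcriber works on one page at a time and its methods are coroutines, multi-page documents can be processed concurrently. The sketch below shows one way to do that with a bounded number of in-flight calls while keeping the zero-based page order. It uses a stub coroutine in place of the real transcriber; `transcribe_pages`, `fake_transcribe`, and the `max_concurrency` default are hypothetical and chosen only for illustration.

```python
import asyncio
from typing import Awaitable, Callable


async def transcribe_pages(
    pages: list[bytes],
    transcribe_one: Callable[[bytes], Awaitable[str]],
    max_concurrency: int = 4,
) -> list[str]:
    """Transcribe pages concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(page_bytes: bytes) -> str:
        async with semaphore:
            return await transcribe_one(page_bytes)

    # gather preserves input order, so results[0] belongs to page_number=0.
    return list(await asyncio.gather(*(worker(page) for page in pages)))


async def fake_transcribe(page_bytes: bytes) -> str:
    """Stub standing in for a real per-page transcription call."""
    await asyncio.sleep(0)
    return page_bytes.decode().upper()


results = asyncio.run(transcribe_pages([b"first", b"second"], fake_transcribe))
for page_number, text in enumerate(results):
    print(page_number, text)
```

The semaphore is there to keep the number of simultaneous model calls under whatever rate limit applies to your account; tune `max_concurrency` accordingly.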

---

## License

Licensed under the [Apache License 2.0](LICENSE.md).
