Metadata-Version: 2.3
Name: wizit-open-rag
Version: 0.0.9
Summary: AI-powered document transcription and semantic chunking for RAG pipelines
Keywords: rag,retrieval-augmented-generation,llm,chunking,transcription,weaviate,bedrock,langchain,langgraph,pdf,semantic-search
Author: Restebance
Author-email: Restebance <restebance@gmail.com>
License: Apache-2.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Dist: boto3>=1.40.23
Requires-Dist: langchain>=1.2.10
Requires-Dist: langchain-anthropic==1.4.4
Requires-Dist: langchain-aws>=1.3.0
Requires-Dist: langchain-classic>=1.0.7
Requires-Dist: langchain-community>=0.4.1
Requires-Dist: langchain-core>=1.2.16
Requires-Dist: langchain-experimental>=0.4.1
Requires-Dist: langchain-text-splitters>=1.1.1
Requires-Dist: langgraph>=1.0.9
Requires-Dist: pillow>=11.3.0
Requires-Dist: mistralai>=1.0
Requires-Dist: pdfplumber>=0.11
Requires-Dist: pymupdf>=1.27.1
Requires-Dist: anthropic==0.109.0
Requires-Dist: psycopg2-binary>=2.9.11
Requires-Dist: sqlalchemy[asyncio]>=2.0.43
Requires-Dist: langchain-postgres>=0.0.17
Requires-Dist: weaviate-client>=4.0.0
Requires-Dist: langchain-weaviate==0.0.6
Requires-Dist: langchain-voyageai>=0.1
Requires-Dist: botocore[crt]>=1.43.7
Requires-Python: >=3.12
Project-URL: Bug Tracker, https://github.com/Restebance/open_rag/issues
Project-URL: Repository, https://github.com/Restebance/open_rag
Description-Content-Type: text/markdown

# wizit_open_rag

A Python library for AI-powered document transcription and semantic chunking for RAG (Retrieval-Augmented Generation) pipelines. It processes PDFs and images through a cost-aware tiered pipeline — plain-text extraction first, OCR second, LLM last — then chunks the resulting Markdown semantically, enriches each chunk with surrounding context, and returns ready-to-index `Document` objects for PostgreSQL pgvector or Weaviate.

**Version**: 0.0.9 | **Python**: >=3.12

---

## Features

- **Cost-aware tiered transcription**: pdfplumber (free) → OCR (AWS Textract or Mistral Document AI) → Claude Haiku (LLM fallback). Each page escalates only when the previous tier scores below the quality threshold.
- **Image transcription**: PNG and JPG/JPEG files bypass the tiered pipeline entirely and go straight to the LLM (Claude vision via AWS Bedrock). Pass `file_name="scan.png"` to `transcribe_document` — no other change needed.
- LangGraph-based transcription workflow with configurable retry logic and accuracy thresholds.
- Per-chunk context enrichment — each chunk is wrapped with `<context>` and `<content>` tags for higher retrieval precision.
- Markdown-header-based chunking strategy, ready to extend to semantic or recursive splitting.
- Pluggable vector store backends: PostgreSQL pgvector (`PgEmbeddingsManager`) or Weaviate (`WeaviateEmbeddingsManager`).
- LangSmith tracing built in.

---

## Prerequisites

- Python 3.12 or higher
- AWS credentials configured (standard boto3 credential chain — env vars, `~/.aws/credentials`, or instance profile). Required for AWS Bedrock (LLM + embeddings) and optionally for AWS Textract and S3.
- For pgvector: PostgreSQL with the [pgvector](https://github.com/pgvector/pgvector) extension enabled.
- For Weaviate: a running Weaviate instance (local or cloud).
- For Mistral OCR: a `MISTRAL_API_KEY` environment variable or the key passed directly.
- For Voyage AI embeddings: a `VOYAGE_API_KEY` environment variable or the key passed directly.
- For Anthropic direct API: an `ANTHROPIC_API_KEY` environment variable or the key passed directly to `ClaudeModels`.

---

## Installation

```bash
pip install wizit_open_rag
```

---

## Quickstart

### 1. Transcribe a PDF page

`OpenRagTranscriber` accepts raw bytes for a single PDF page and returns a `ParsedDocPage` containing the Markdown transcription. By default it uses AWS Bedrock; pass `ai_service=ClaudeModels(...)` to use the Anthropic direct API instead.

```python
import asyncio
import fitz  # PyMuPDF — pip install pymupdf
from wizit_open_rag import OpenRagTranscriber

transcriber = OpenRagTranscriber(
    langsmith_project_name="my-project",  # required
    langsmith_api_key="lsv2_...",         # required
    target_language="en",
)

# Split a multi-page PDF into single-page byte blobs
with fitz.open("document.pdf") as doc:
    single = fitz.open()
    single.insert_pdf(doc, from_page=0, to_page=0)
    page_bytes = single.tobytes()

result = asyncio.run(transcriber.transcribe_document(page_number=1, page_content=page_bytes))
print(result.page_text)  # Markdown string
```

**Using the Anthropic direct API instead of Bedrock:**

```python
import asyncio
import fitz
from wizit_open_rag import OpenRagTranscriber
from wizit_open_rag.infra.llms.claude_model import ClaudeModels

transcriber = OpenRagTranscriber(
    langsmith_project_name="my-project",
    langsmith_api_key="lsv2_...",
    ai_service=ClaudeModels("claude-sonnet-4-6"),  # reads ANTHROPIC_API_KEY from env
    # ai_service=ClaudeModels("claude-sonnet-4-6", api_key="sk-ant-..."),  # or pass directly
    target_language="en",
)

with fitz.open("document.pdf") as doc:
    single = fitz.open()
    single.insert_pdf(doc, from_page=0, to_page=0)
    page_bytes = single.tobytes()

result = asyncio.run(transcriber.transcribe_document(page_number=1, page_content=page_bytes))
print(result.page_text)
```

### 1b. Transcribe a standalone image (PNG / JPG)

Pass `file_name` with the image extension to signal that the input is an image rather than a PDF. Images bypass the tiered pipeline and go directly to the LLM, regardless of whether `use_tiered_transcription` is set.

```python
import asyncio
from wizit_open_rag import OpenRagTranscriber

transcriber = OpenRagTranscriber(
    langsmith_project_name="my-project",
    langsmith_api_key="lsv2_...",
)

with open("scan.png", "rb") as f:
    image_bytes = f.read()

result = asyncio.run(
    transcriber.transcribe_document(
        page_number=1,
        page_content=image_bytes,
        file_name="scan.png",   # .png / .jpg / .jpeg — triggers image path
    )
)
print(result.page_text)
```

Supported image extensions: `.png`, `.jpg`, `.jpeg`. Any other extension raises `ValueError`. When `file_name` is `None` or omitted, PDF is assumed (backwards-compatible default).

### 2. Chunk Markdown and generate context

`ChunksManager` takes a pre-loaded Markdown string and returns a list of LangChain `Document` objects, each enriched with a contextual summary. By default it uses AWS Bedrock; pass `ai_service=ClaudeModels(...)` to use the Anthropic direct API.

```python
import asyncio
from wizit_open_rag import ChunksManager

manager = ChunksManager(
    langsmith_project_name="my-project",  # required
    langsmith_api_key="lsv2_...",         # required
)

with open("document.md") as f:
    markdown = f.read()

docs = asyncio.run(manager.gen_context_chunks(
    file_key="document.md",
    file_markdown_content=markdown,
    file_tags={"category": "hr", "department": "onboarding"},
))

for doc in docs:
    print(doc.page_content)   # "<context>...</context><content>...</content>"
    print(doc.metadata)       # {"source": "document.md", "category": "hr", ...}
```
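
Downstream code sometimes needs the generated context and the original chunk text separately, for example to display only the content to users. The sketch below assumes the `page_content` layout shown in the comment above (`<context>...</context><content>...</content>`); `split_enriched_chunk` is a hypothetical helper, not part of the library.

```python
import re

# Matches the "<context>...</context><content>...</content>" layout.
# DOTALL lets both parts span multiple lines.
_CHUNK_RE = re.compile(
    r"<context>(?P<context>.*?)</context>\s*<content>(?P<content>.*?)</content>",
    re.DOTALL,
)


def split_enriched_chunk(page_content: str) -> tuple[str, str]:
    """Return (context, content) extracted from an enriched chunk."""
    match = _CHUNK_RE.search(page_content)
    if match is None:
        raise ValueError("chunk is not in <context>/<content> form")
    return match["context"].strip(), match["content"].strip()
```

For instance, `context, content = split_enriched_chunk(doc.page_content)` would recover both parts of one of the `docs` above.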

**Using the Anthropic direct API:**

```python
import asyncio
from wizit_open_rag import ChunksManager
from wizit_open_rag.infra.llms.claude_model import ClaudeModels

manager = ChunksManager(
    langsmith_project_name="my-project",
    langsmith_api_key="lsv2_...",
    ai_service=ClaudeModels("claude-sonnet-4-6"),  # reads ANTHROPIC_API_KEY from env
)

with open("document.md") as f:
    markdown = f.read()

docs = asyncio.run(manager.gen_context_chunks(
    file_key="document.md",
    file_markdown_content=markdown,
    file_tags={"category": "hr"},
))
```

### 3. Full pipeline — transcribe, chunk, and index

```python
import asyncio
import fitz
from wizit_open_rag import OpenRagTranscriber, ChunksManager
from wizit_open_rag.infra.embeddings.aws_embeddings import AWSEmbeddingsModels
from wizit_open_rag.infra.rag.weaviate_embeddings import WeaviateEmbeddingsManager

# ── Transcription ──────────────────────────────────────────────────────────────
transcriber = OpenRagTranscriber(
    langsmith_project_name="my-project",
    langsmith_api_key="lsv2_...",
    use_tiered_transcription=True,  # cost-aware: pdfplumber → Textract → Haiku
    tier2_ocr="textract",
    target_language="en",
)

pages_text = []
with fitz.open("document.pdf") as doc:
    for i in range(len(doc)):
        single = fitz.open()
        single.insert_pdf(doc, from_page=i, to_page=i)
        result = asyncio.run(transcriber.transcribe_document(
            page_number=i + 1,
            page_content=single.tobytes(),
        ))
        pages_text.append(result.page_text or "")

markdown = "\n\n".join(pages_text)

# ── Chunking + indexing ────────────────────────────────────────────────────────
embeddings = AWSEmbeddingsModels("amazon.titan-embed-text-v1").load_embeddings_model()

kdb = WeaviateEmbeddingsManager(
    embeddings_model=embeddings,
    weaviate_url="http://localhost:8080",
    collection_name="Documents",
    records_manager_db_url="postgresql://user:password@localhost:5432/vectordb",
)

manager = ChunksManager(
    langsmith_project_name="my-project",
    langsmith_api_key="lsv2_...",
    kdb=kdb,
)

result = asyncio.run(manager.gen_and_index_context_chunks(
    file_key="document.md",
    file_markdown_content=markdown,
    file_tags={"source_doc": "document.pdf"},
))
print(result)  # IndexingResult(num_added=12, num_updated=0, num_deleted=0)
```
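
The loop above starts a fresh event loop for every page via `asyncio.run`. An alternative is to drive all pages from a single event loop. The sketch below is one way to do that, under stated assumptions: `transcribe_pages` and `max_concurrency` are hypothetical names, and whether `OpenRagTranscriber` is safe to call concurrently is not confirmed here, so set `max_concurrency=1` to keep the calls strictly sequential.

```python
import asyncio


async def transcribe_pages(transcriber, pages: list[bytes], max_concurrency: int = 4) -> list[str]:
    """Transcribe single-page PDF blobs in one event loop, preserving page order."""
    semaphore = asyncio.Semaphore(max_concurrency)  # caps in-flight requests

    async def one(index: int, content: bytes) -> str:
        async with semaphore:
            result = await transcriber.transcribe_document(
                page_number=index + 1,
                page_content=content,
            )
        return result.page_text or ""

    # gather returns results in the order the coroutines were passed in.
    return await asyncio.gather(*(one(i, p) for i, p in enumerate(pages)))
```

With the single-page blobs collected into a list, usage would look like `pages_text = asyncio.run(transcribe_pages(transcriber, page_blobs))`.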

---

## Transcription Reference

### `OpenRagTranscriber`

```python
from wizit_open_rag import OpenRagTranscriber
```

**Constructor parameters**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `langsmith_project_name` | `str` | required | LangSmith project name for tracing |
| `langsmith_api_key` | `str` | required | LangSmith API key |
| `llm_model_id` | `str` | `"global.anthropic.claude-sonnet-4-6"` | Bedrock model ID used when `use_tiered_transcription=False` |
| `target_language` | `str` | `"es"` | BCP-47 language tag for the output (e.g. `"en"`, `"es-CO"`) |
| `transcription_additional_instructions` | `str` | `""` | Extra instructions appended to the system prompt |
| `transcription_accuracy_threshold` | `float` | `0.80` | Minimum quality score `[0.0, 0.95]` to accept a tier's output |
| `max_transcription_retries` | `int` | `2` | LLM retry attempts `[1, 3]` within the LangGraph loop |
| `use_tiered_transcription` | `bool` | `False` | Enable cost-aware tiered pipeline |
| `tier2_ocr` | `"textract" \| "mistral"` | `"textract"` | Tier 2 OCR backend |
| `tier3_model_id` | `str` | `"us.anthropic.claude-haiku-4-5-20251001-v1:0"` | Bedrock model for the LLM fallback tier |
| `mistral_api_key` | `str \| None` | `None` | Mistral API key; falls back to `MISTRAL_API_KEY` env var |
| `ai_service` | `AiApplicationService \| None` | `None` | LLM backend override for all standard and image transcription. Pass `ClaudeModels(...)` to use the Anthropic direct API instead of Bedrock. Ignores `llm_model_id` when set. |
| `tier3_ai_service` | `AiApplicationService \| None` | `None` | LLM backend override for the Tier 3 fallback (only relevant when `use_tiered_transcription=True`). Defaults to `AWSModels(tier3_model_id)` when not set. |

**Method**

```python
async def transcribe_document(
    page_number: int,
    page_content: str | bytes,
    file_name: str | None = None,
) -> ParsedDocPage
```

- `page_content` — raw bytes of the input. For PDFs, use PyMuPDF to extract a single page. For images, read the file directly.
- `file_name` — optional filename used to detect the input format from its extension. When `None` or omitted, PDF is assumed. Supported extensions: `.pdf`, `.png`, `.jpg`, `.jpeg`. Unsupported extensions raise `ValueError`.

**Image routing**: when `file_name` has an image extension, the tiered pipeline is skipped and the page goes directly to `llm_model_id` (not `tier3_model_id`), regardless of the `use_tiered_transcription` setting.

**Input model**

```python
@dataclass
class PageToTranscribe:
    page_number: int
    page_content: str | bytes
    media_type: str = "application/pdf"  # set automatically from file_name extension
```

**Return type**

```python
@dataclass
class ParsedDocPage:
    page_number: int
    page_content: str | bytes  # original input
    page_text: str | None      # Markdown transcription
```

### Tiered pipeline

When `use_tiered_transcription=True`, each **PDF** page flows through tiers in order. A tier's output is accepted when its score meets `transcription_accuracy_threshold`; otherwise the next tier runs.

```
Tier 1 — pdfplumber    (free, no network, digital text + tables)
    ↓ score < threshold
Tier 2 — AWS Textract  (OCR API, tables + forms)
       OR Mistral OCR  (swap via tier2_ocr="mistral")
    ↓ score < threshold
Tier 3 — Claude Haiku  (LLM fallback, always produces a result)
```

**Images always bypass the tiered pipeline.** When `file_name` has a `.png`, `.jpg`, or `.jpeg` extension, the page goes directly to the primary `llm_model_id` (Sonnet by default), not through Tier 1→2→3. This applies even when `use_tiered_transcription=True`.

Instantiate `OpenRagTranscriber` once and reuse it across all pages — both LangGraph workflows (Sonnet for the standard path, Haiku for Tier 3) are compiled at construction time.
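
The escalation rule can be summarised in a few lines. This is a simplified model of the behaviour described in this section, not the library's implementation: `TierResult`, `Tier`, and `run_tiers` are hypothetical names, and how each tier computes its score is not covered here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class TierResult:
    tier: str
    text: str
    score: float


# Hypothetical tier signature: page bytes in, (text, quality score) out.
Tier = Callable[[bytes], tuple[str, float]]


def run_tiers(
    page: bytes,
    tiers: Sequence[tuple[str, Tier]],
    threshold: float = 0.80,
) -> TierResult:
    """Run tiers cheapest-first and stop at the first score that meets threshold."""
    if not tiers:
        raise ValueError("at least one tier is required")
    result = None
    for name, tier in tiers:
        text, score = tier(page)
        result = TierResult(name, text, score)
        if score >= threshold:
            break  # good enough: do not pay for a costlier tier
    # The last tier's output is returned even when it is below threshold.
    return result
```

Ordering the tiers by cost means a clean digital PDF never leaves Tier 1, while a poor scan only reaches the LLM after both cheaper options have scored below the threshold.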

---

## Chunking Reference

### `ChunksManager`

```python
from wizit_open_rag import ChunksManager
```

**Constructor parameters**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `langsmith_project_name` | `str` | required | LangSmith project name for tracing |
| `langsmith_api_key` | `str` | required | LangSmith API key |
| `llm_model_id` | `str` | `"global.anthropic.claude-sonnet-4-6"` | Bedrock model for context generation |
| `embeddings_model_id` | `str` | `"amazon.titan-embed-text-v1"` | Bedrock embeddings model |
| `target_language` | `str` | `"es-CO"` | Output language for generated context |
| `kdb` | `EmbeddingsManager \| None` | `None` | Vector store backend; required only for `gen_and_index_context_chunks` |
| `ai_service` | `AiApplicationService \| None` | `None` | LLM backend override for context generation. Pass `ClaudeModels(...)` to use the Anthropic direct API instead of Bedrock. Ignores `llm_model_id` when set. |

**Methods**

```python
# Generate enriched chunks — caller handles indexing
async def gen_context_chunks(
    file_key: str,
    file_markdown_content: str,
    file_tags: dict,
) -> list[Document]

# Generate + index in one call — requires kdb= at construction time
async def gen_and_index_context_chunks(
    file_key: str,
    file_markdown_content: str,
    file_tags: dict,
    cleanup: Literal["incremental", "full", "scoped_full"] | None = "incremental",
    source_id_key: str = "source",
) -> IndexingResult
```

- `file_key`: Filename used as the `source` metadata key (e.g. `"report.md"`). Must end with `.md`.
- `file_markdown_content`: Pre-loaded Markdown string. This method does not read files from disk or S3.
- `file_tags`: Arbitrary key/value metadata propagated to every chunk.
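- `cleanup`: LangChain indexing cleanup mode. `"incremental"` (default) skips unchanged chunks and removes outdated versions of chunks whose source was re-indexed; `"full"` also deletes every previously indexed chunk that is absent from the current run; `"scoped_full"` behaves like `"full"` but only for sources seen in the current run; `None` performs no cleanup.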
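
To make "skips unchanged chunks" concrete, here is a toy model of incremental indexing based on content hashes. It is a deliberately simplified illustration, not LangChain's record manager: `chunk_hash`, `incremental_sync`, and the `recorded` dictionary are hypothetical, and the real indexing API tracks more state than this.

```python
import hashlib
import json


def chunk_hash(page_content: str, metadata: dict) -> str:
    """Stable fingerprint of a chunk's text and metadata."""
    payload = json.dumps({"content": page_content, "metadata": metadata}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def incremental_sync(
    recorded: dict[str, set[str]],
    source: str,
    chunks: list[tuple[str, dict]],
) -> dict[str, int]:
    """Update recorded hashes for one source in place and report what changed."""
    previous = recorded.get(source, set())
    current = {chunk_hash(text, meta) for text, meta in chunks}
    recorded[source] = current
    return {
        "num_added": len(current - previous),    # new or modified chunks
        "num_skipped": len(current & previous),  # unchanged, not re-embedded
        "num_deleted": len(previous - current),  # stale versions removed
    }
```

Re-indexing the same Markdown twice under this model adds nothing the second time, which is why the incremental mode is cheap for documents that rarely change.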

---

## Vector Store Backends

### PostgreSQL pgvector — `PgEmbeddingsManager`

```python
from wizit_open_rag import PgEmbeddingsManager
from wizit_open_rag.infra.embeddings.aws_embeddings import AWSEmbeddingsModels

embeddings = AWSEmbeddingsModels("amazon.titan-embed-text-v1").load_embeddings_model()

kdb = PgEmbeddingsManager(
    embeddings_model=embeddings,
    pg_connection="postgresql://user:password@localhost:5432/vectordb",
    embeddings_vectors_table_name="documents",
    records_manager_table_name="documents_records",
    # optional
    vector_size=1536,                       # must match the embeddings model output (1536 for amazon.titan-embed-text-v1)
    metadata_columns=["source", "category"],
)

# First-time setup: create the table and record-manager schema
kdb.configure_vector_store()

# Create an HNSW index for fast ANN search (requires vector_size <= 2000)
kdb.create_index()

# Index documents
from langchain_core.documents import Document
docs = [Document(page_content="...", metadata={"source": "report.md"})]
result = kdb.index_documents(docs)

# Similarity search (returns top-5 by default)
matches = kdb.search_records("What is the refund policy?")

# Delete a document and all its chunks
ids = kdb.retrieve_documents_by_file_name("report.md")
kdb.delete_documents_by_ids(ids)
```

### Weaviate — `WeaviateEmbeddingsManager`

```python
from wizit_open_rag import WeaviateEmbeddingsManager
from wizit_open_rag.infra.embeddings.aws_embeddings import AWSEmbeddingsModels

embeddings = AWSEmbeddingsModels("amazon.titan-embed-text-v1").load_embeddings_model()

kdb = WeaviateEmbeddingsManager(
    embeddings_model=embeddings,
    weaviate_url="http://localhost:8080",
    collection_name="Documents",
    records_manager_db_url="postgresql://user:password@localhost:5432/vectordb",
    # optional
    records_manager_table_name="weaviate_records_manager",
    weaviate_api_key=None,    # set for Weaviate Cloud
    text_key="text",
)

# First-time setup: initialise record-manager schema
# (Weaviate creates the collection automatically on first write)
kdb.configure_vector_store()

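# Index documents
from langchain_core.documents import Document
docs = [Document(page_content="...", metadata={"source": "report.md"})]
result = kdb.index_documents(docs)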

# Similarity search
matches = kdb.search_records("What is the refund policy?", k=5)

# Delete
ids = kdb.retrieve_documents_by_file_name("report.md")
kdb.delete_documents_by_ids(ids)
```

Both backends implement the same `EmbeddingsManager` interface and are interchangeable when passed as `kdb=` to `ChunksManager`.

---

## Embeddings Models

### AWS Bedrock — `AWSEmbeddingsModels`

```python
from wizit_open_rag.infra.embeddings.aws_embeddings import AWSEmbeddingsModels

# Returns a LangChain-compatible Embeddings instance backed by AWS Bedrock
embeddings = AWSEmbeddingsModels(
    embeddings_model_id="amazon.titan-embed-text-v1",
    region_name="us-east-1",  # default
).load_embeddings_model()
```

Credentials are read from the standard boto3 credential chain — no explicit key is needed.

### Voyage AI — `VoyageEmbeddingsModels`

A drop-in alternative to AWS Bedrock embeddings. Voyage AI models tend to score higher on retrieval benchmarks and support multilingual content out of the box.

```python
from wizit_open_rag.infra.embeddings.voyage_embeddings import VoyageEmbeddingsModels

embeddings = VoyageEmbeddingsModels(
    embeddings_model_id="voyage-3",      # default
    # api_key="voy-...",                 # or set VOYAGE_API_KEY env var
    batch_size=72,                       # default; Voyage's hard limit is 128
).load_embeddings_model()
```

**Available models** (pass as `embeddings_model_id`):

| Model | Dimensions | Notes |
|---|---|---|
| `voyage-3` | 1024 | General-purpose, highest quality (default) |
| `voyage-3-lite` | 512 | Lower latency, lower cost |
| `voyage-multilingual-2` | 1024 | Optimised for multilingual retrieval |

The returned object is a standard LangChain `Embeddings` instance. Pass it as `embeddings_model=` to `PgEmbeddingsManager` or `WeaviateEmbeddingsManager`, exactly like the AWS variant:

```python
from wizit_open_rag import PgEmbeddingsManager
from wizit_open_rag.infra.embeddings.voyage_embeddings import VoyageEmbeddingsModels

embeddings = VoyageEmbeddingsModels("voyage-3").load_embeddings_model()

kdb = PgEmbeddingsManager(
    embeddings_model=embeddings,
    pg_connection="postgresql://user:password@localhost:5432/vectordb",
    embeddings_vectors_table_name="documents",
    records_manager_table_name="documents_records",
    vector_size=1024,  # must match the model's output dimension
)
```
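
Because `vector_size` has to match the model's output dimension, it can help to verify the pairing before creating tables. The sketch below relies only on the standard LangChain `Embeddings.embed_query` method; `check_vector_size` and the `SupportsEmbedQuery` protocol are hypothetical names. Note that it issues one real embedding request.

```python
from typing import Protocol


class SupportsEmbedQuery(Protocol):
    """Minimal slice of the LangChain Embeddings interface used here."""

    def embed_query(self, text: str) -> list[float]: ...


def check_vector_size(embeddings: SupportsEmbedQuery, vector_size: int) -> int:
    """Embed a probe string and fail loudly if the dimension differs."""
    actual = len(embeddings.embed_query("dimension probe"))
    if actual != vector_size:
        raise ValueError(
            f"vector_size={vector_size} but the model returns {actual} dimensions"
        )
    return actual
```

Calling `check_vector_size(embeddings, 1024)` before constructing the manager would surface a mismatch as a clear error instead of a failed insert later.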

---

## Environment Variables

Variables read at runtime (not at import time):

| Variable | Purpose |
|---|---|
| `LANGSMITH_API_KEY` | LangSmith API key (can also be passed as constructor arg) |
| `LANGCHAIN_PROJECT` | LangSmith project name |
| `LANGSMITH_TRACING` | Enable LangSmith tracing (`true` / `false`) |
| `MISTRAL_API_KEY` | Mistral OCR API key (only needed for `tier2_ocr="mistral"`) |
| `VECTOR_STORE_CONNECTION` | PostgreSQL connection string for pgvector |
| `VECTOR_STORE_TABLE` | pgvector table name |
| `WEAVIATE_URL` | Weaviate cluster URL |
| `WEAVIATE_API_KEY` | Weaviate Cloud API key (optional for local) |
| `WEAVIATE_COLLECTION` | Weaviate collection name |
| `VOYAGE_API_KEY` | Voyage AI API key (only needed when using `VoyageEmbeddingsModels`) |
| `ANTHROPIC_API_KEY` | Anthropic API key (only needed when using `ClaudeModels`; can also be passed directly as `api_key`) |

AWS credentials (Bedrock, Textract, S3) are configured via the standard boto3 chain and are not managed by this library.
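
Since `OpenRagTranscriber` and `ChunksManager` take the LangSmith values as constructor arguments, one option is to read them from the environment in your own code and pass them through. This is a minimal sketch: `langsmith_kwargs` is a hypothetical helper, and failing fast on missing variables is a design choice rather than library behaviour.

```python
import os
from collections.abc import Mapping


def langsmith_kwargs(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build the LangSmith constructor arguments from environment variables."""
    required = ("LANGCHAIN_PROJECT", "LANGSMITH_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "langsmith_project_name": env["LANGCHAIN_PROJECT"],
        "langsmith_api_key": env["LANGSMITH_API_KEY"],
    }
```

Usage would then be along the lines of `OpenRagTranscriber(**langsmith_kwargs(), target_language="en")`, which keeps keys out of source code.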

---

## Architecture

```
wizit_open_rag/
├── transcription.py       ← OpenRagTranscriber (public API)
├── chunks.py              ← ChunksManager (public API)
├── domain/                ← PageToTranscribe, ParsedDocPage, ParsedDoc
├── application/
│   ├── interfaces.py      ← ABCs: EmbeddingsManager, PageTranscriptionTier, …
│   ├── transcription_app.py         ← LangGraph transcription workflow
│   ├── tiered_transcription_app.py  ← Cost-aware tier sequencer
│   └── context_chunk_app.py         ← Per-chunk context enrichment
├── infra/
│   ├── llms/              ← AWSModels (ChatBedrockConverse), ClaudeModels (ChatAnthropic)
│   ├── embeddings/        ← AWSEmbeddingsModels (BedrockEmbeddings), VoyageEmbeddingsModels
│   ├── transcription/
│   │   ├── pdfplumber_tier.py   ← Tier 1
│   │   ├── textract_tier.py     ← Tier 2a
│   │   ├── mistral_ocr_tier.py  ← Tier 2b
│   │   └── llm_tier.py          ← Tier 3
│   ├── rag/
│   │   ├── pg_embeddings.py        ← PgEmbeddingsManager
│   │   ├── weaviate_embeddings.py  ← WeaviateEmbeddingsManager
│   │   ├── markdown_chunks.py      ← MarkdownHeadersChunks
│   │   ├── semantic_chunks.py      ← SemanticChunks (85th-pct breakpoints)
│   │   └── recursive_chunks.py    ← RecursiveChunks
│   └── persistence/       ← LocalStorageService, S3StorageService, PgConnectionManager
└── workflows/             ← LangGraph state machines (transcription + context)
```

---

## Gotchas

- `transcribe_document` takes a **single-page** PDF as bytes. Use PyMuPDF (`fitz`) to split pages before calling it.
- For images, pass the raw file bytes directly — no page-splitting needed. Include `file_name="scan.png"` so the library detects the format.
- Both `transcribe_document` and `gen_context_chunks` are `async`. Use `asyncio.run(...)` from synchronous code, or `await` them inside an async function.
- `OpenRagTranscriber` and `ChunksManager` require `langsmith_project_name` and `langsmith_api_key` as constructor arguments — they are not read from environment variables.
- AWS Bedrock inference profile IDs carry a routing prefix: `global.` for global cross-region inference (e.g. `global.anthropic.claude-sonnet-4-6`) and a geography prefix such as `us.` for cross-region inference within that geography (e.g. `us.anthropic.claude-haiku-4-5-20251001-v1:0`).
- `ClaudeModels` uses the Anthropic direct API, so model IDs are plain Anthropic IDs (e.g. `"claude-sonnet-4-6"`), **not** the Bedrock-prefixed forms (`global.` / `us.`). When `ai_service` is provided, `llm_model_id` is ignored; `tier3_model_id` is ignored only when `tier3_ai_service` is also provided.
- `gen_context_chunks` does not load files from disk or S3 — pass the Markdown content as a string.
- `gen_and_index_context_chunks` raises `ValueError` if no `kdb=` backend was provided at construction time.
- `WeaviateEmbeddingsManager` opens a new Weaviate client connection per operation. Avoid calling it in a tight loop; prefer batching via `gen_and_index_context_chunks`.
- `PgEmbeddingsManager.create_index()` raises `NotImplementedError` when `vector_size > 2000`.
- When `use_tiered_transcription=True`, the `OpenRagTranscriber` compiles two LangGraph workflows at construction time. Instantiate once and reuse across all pages.
- Images passed with an unsupported extension (e.g. `.tiff`, `.bmp`, `.webp`) raise `ValueError` immediately — they are not silently treated as PDFs.
- When `file_name` is `None` (default), the library assumes `application/pdf`. Pass `file_name` explicitly when the bytes are an image.

---

## License

Licensed under the [Apache License 2.0](LICENSE.md).
