Metadata-Version: 2.4
Name: pyxrag
Version: 0.5.3
Summary: RAG ingest + retrieval engine — section_table chunker, hybrid retrieval, optional rerank/enrichment
Project-URL: Repository, https://github.com/henryle97/xrag
Author: Henry Le
License: MIT
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: boto3>=1.34.0
Requires-Dist: langchain-core<0.4,>=0.2.43
Requires-Dist: langchain-openai<0.4,>=0.1.22
Requires-Dist: langchain<0.4,>=0.2.17
Requires-Dist: openai>=1.0.0
Requires-Dist: pydantic-settings>=2.4.0
Requires-Dist: pydantic>=2.7.0
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: qdrant-client>=1.12.0
Requires-Dist: rank-bm25>=0.2.2
Requires-Dist: rich>=13.0.0
Requires-Dist: unstructured-client>=0.34.0
Requires-Dist: unstructured[docx]>=0.22.0
Provides-Extra: all
Requires-Dist: chromadb>=0.6.0; extra == 'all'
Requires-Dist: cohere>=5.11.0; extra == 'all'
Requires-Dist: typer>=0.12.3; extra == 'all'
Requires-Dist: unstructured[pdf]>=0.22.0; extra == 'all'
Provides-Extra: chroma
Requires-Dist: chromadb>=0.6.0; extra == 'chroma'
Provides-Extra: cli
Requires-Dist: typer>=0.12.3; extra == 'cli'
Provides-Extra: parser-unstructured-local-pdf
Requires-Dist: unstructured[pdf]>=0.22.0; extra == 'parser-unstructured-local-pdf'
Provides-Extra: rerank-cohere
Requires-Dist: cohere>=5.11.0; extra == 'rerank-cohere'
Description-Content-Type: text/markdown

# xrag

RAG ingest + retrieval engine for documents with text and tables. Section/table-aware chunking, hybrid retrieval, optional Cohere reranking, optional LLM-based enrichment, and end-to-end Q&A generation. Works as both a CLI and an installable Python library.

`xrag` is designed for teams that want one Python library for:

- ingesting real documents into a retrieval system
- indexing pre-chunked offline data
- retrieving typed chunks for downstream LLM apps
- running a complete retrieve-and-answer flow with one client

## Install

```bash
pip install pyxrag
# or with optional extras:
pip install "pyxrag[rerank-cohere,cli]"
```

The distribution name on PyPI is `pyxrag`; the import name is `xrag`:

```python
import xrag
```

For local development from source:

```bash
git clone https://github.com/henryle97/xrag.git
cd xrag
uv sync --group dev
```

## 30-Second Quickstart

If you want the main library path, start with `Xrag`:

```python
import asyncio

from xrag import Xrag


async def main() -> None:
    client = Xrag(
        qdrant_path="data/dev/xrag_quickstart",
        generation="openai:gpt-4.1-nano",
    )

    # Parse, chunk, embed, and index the document, then ask a question about it
    doc = await client.documents.ingest("annual-report.docx", collection="ir_docs")
    result = await client.rag.ask(
        query="What is this document about?",
        collection="ir_docs",
    )

    print(doc.id)
    print(result.answer)


asyncio.run(main())
```

See the runnable example in [`examples/01_quickstart.py`](examples/01_quickstart.py).
For the full Python guide, see [`docs/library.md`](docs/library.md).

## Choose Your Path

`xrag` has three main entrypoints:

- `Xrag`: the higher-level async client for application code
- `run_ingest` / `run_rag`: the lower-level functional API for callers that want direct control over `AppConfig` and pipeline wiring
- CLI: local workflows, debugging, artifact inspection, and evaluation runs

Use `Xrag` when you want the application-facing library surface.
Use the functional API when you already work in terms of `AppConfig`.
Use the CLI when you are operating the pipeline directly from the shell.

## Using `Xrag`

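Create one client and reuse it across operations. The snippets in this section use `await`, so run them inside an `async` function (for example, one driven by `asyncio.run`):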

```python
from xrag import Xrag

client = Xrag(
    qdrant_url="http://localhost:6333",
    embedding="openai:text-embedding-3-small",
    generation="openai:gpt-4.1-nano",
)
```

### Ingest A Document

Parse, chunk, embed, and index a source document:

```python
doc = await client.documents.ingest(
    "annual-report.docx",
    collection="ir_docs",
)

print(doc.id)
```

Runnable example: [`examples/01_quickstart.py`](examples/01_quickstart.py)

### Index From Offline Chunks

If you already have chunks prepared offline, skip parsing and index them directly:

```python
doc = await client.documents.index_chunks(
    "data/chunks.json",
    collection="ir_docs",
    name="f1-offline-chunks",
)

print(doc.id)
```

Runnable examples:

- [`examples/05_index_chunks.py`](examples/05_index_chunks.py)
- [`examples/05b_index_chunks_from_json.py`](examples/05b_index_chunks_from_json.py)

### Retrieve Chunks

If you only want retrieval and already have your own LLM stack:

```python
result = await client.retrievals.search(
    query="What was revenue?",
    collection="ir_docs",
    top_k=5,
)

for chunk in result.chunks:
    print(chunk.document_id, chunk.score, chunk.text[:120])
```

Runnable example: [`examples/03_retrieve_only.py`](examples/03_retrieve_only.py)

### Ask A Question

Retrieve context and generate an answer:

```python
result = await client.rag.ask(
    query="What was revenue?",
    collection="ir_docs",
)

print(result.answer)
```

Runnable example: [`examples/01_quickstart.py`](examples/01_quickstart.py)

## Example Gallery

- [`examples/01_quickstart.py`](examples/01_quickstart.py)
  Ingest one document and ask one question.
- [`examples/02_from_env.py`](examples/02_from_env.py)
  Build the client from environment variables.
- [`examples/03_retrieve_only.py`](examples/03_retrieve_only.py)
  Use xrag for retrieval only.
- [`examples/04_multi_tenant.py`](examples/04_multi_tenant.py)
  Scope operations by tenant with `client.for_tenant(...)`.
- [`examples/05_index_chunks.py`](examples/05_index_chunks.py)
  Index pre-built chunks from Python objects.
- [`examples/05b_index_chunks_from_json.py`](examples/05b_index_chunks_from_json.py)
  Re-index a chunks JSON file from an offline pipeline run.
- [`examples/06_export_and_validate.py`](examples/06_export_and_validate.py)
  Export, validate, and round-trip chunks.
- [`examples/06b_export_chunks.py`](examples/06b_export_chunks.py)
  Export chunks from an indexed collection.
- [`examples/07_error_handling.py`](examples/07_error_handling.py)
  Handle typed xrag exceptions.
- [`examples/08_per_call_overrides.py`](examples/08_per_call_overrides.py)
  Override ingest and retrieval behavior per call.
- [`examples/09_parse_docx_api_vs_local.py`](examples/09_parse_docx_api_vs_local.py)
  Compare hosted and local DOCX parsing on the same file.

## Configuring `Xrag`

`Xrag(...)` is the main application-facing configuration surface. In
practice, most users set storage, embedding, and optionally generation,
then reuse the same client for ingest and query operations.

### Full Example

This shows the full constructor surface:

```python
from xrag import Xrag
from xrag.config.models import RetrievalConfig

client = Xrag(
    qdrant_url="http://localhost:6333",
    qdrant_path=None,
    embedding="openai:text-embedding-3-small",
    generation="openai:gpt-4.1-nano",
    enrichment=("auto_keywords", "auto_questions"),
    reranker=None,
    parser="unstructured",
    chunker="section_table",
    retrieval_defaults=RetrievalConfig(
        provider="hybrid",
        options={
            "top_k": 10,
            "bm25_candidates": 50,
            "vector_candidates": 50,
            "rrf_k": 60,
            "dedup_family": True,
        },
    ),
    upload_dir="/tmp/xrag_uploads",
    artifacts_dir=None,
    timeout_s=30.0,
    ingest_timeout_s=600.0,
    tracing=None,
)
```

### Parameter Reference

#### Storage

- `qdrant_url: str | None = None`
  URL of a running Qdrant server.
  Use this for normal server-backed deployments.
- `qdrant_path: str | Path | None = None`
  Local filesystem path for Qdrant local mode.
  Use this for local development or single-machine experiments.

You must set exactly one of `qdrant_url` or `qdrant_path`.
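
The exactly-one rule can be checked before a client is constructed. The sketch below is only an illustration: `resolve_storage` is a hypothetical helper, not part of xrag, and the error xrag itself raises for a bad combination is not documented here.

```python
from pathlib import Path


def resolve_storage(qdrant_url=None, qdrant_path=None):
    """Hypothetical helper (not part of xrag): enforce the exactly-one rule."""
    # Both unset or both set are equally invalid
    if (qdrant_url is None) == (qdrant_path is None):
        raise ValueError("set exactly one of qdrant_url or qdrant_path")
    if qdrant_url is not None:
        return {"qdrant_url": qdrant_url}
    return {"qdrant_path": Path(qdrant_path)}


print(resolve_storage(qdrant_url="http://localhost:6333"))
```

The returned dict could then be unpacked into the constructor, as in `Xrag(**resolve_storage(...))`.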

#### Core Model Configuration

- `embedding: str = "openai:text-embedding-3-small"`
  Embedding provider string.
  Required for indexing and retrieval.
  Example values:
  `openai:text-embedding-3-small`
- `generation: str | None = None`
  Generation provider string for `client.rag.ask(...)`.
  Omit this for retrieve-only usage.
  Example values:
  `openai:gpt-4.1-nano`
- `reranker: str | None = None`
  Optional reranker configuration.
  The client only uses it on calls where reranking is enabled.
  Example values:
  `cohere:rerank-v3.5`
- `tracing: str | None = None`
  Optional tracing backend.
  Example values:
  `langsmith:my-project`
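
The example values above share a `provider:name` shape. As a rough sketch of how such a string can be split, assuming the first colon separates the provider from the model or project name (xrag's own parsing is not documented here, and `split_provider` is a hypothetical helper):

```python
def split_provider(spec, default_option=None):
    """Hypothetical helper (not part of xrag): split 'provider:option' once."""
    provider, sep, option = spec.partition(":")
    if not provider:
        raise ValueError(f"missing provider name in {spec!r}")
    # A bare provider such as "unstructured" has no option part
    return provider, (option if sep else default_option)


for spec in ("openai:text-embedding-3-small", "cohere:rerank-v3.5", "unstructured"):
    print(split_provider(spec))
```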

#### Ingest Configuration

- `parser: str = "unstructured"`
  Parser configuration used by `client.documents.ingest(...)`.
  Default is the hosted Unstructured API parser.
  Use `unstructured_local:fast` for lightweight local parsing.
  Add `pyxrag[parser-unstructured-local-pdf]` only when you want local PDF/OCR
  parsing.
- `chunker: str = "section_table"`
  Chunking strategy used during ingest.
  Default is the section/table-aware chunker.
- `enrichment: tuple[str, ...] | None = ("auto_keywords", "auto_questions")`
  Ingest-time enrichment stages.
  Set `None` or `()` to disable enrichment.
  Available stage names currently include:
  `auto_keywords`, `auto_questions`, `table_context`, `table_summary`
- `upload_dir: str | Path = "/tmp/xrag_uploads"`
  Local spool directory for byte inputs and file-like inputs passed to
  `documents.ingest(...)`.
- `artifacts_dir: str | Path | None = None`
  Base directory for ingest artifacts.
  When unset, xrag creates an ephemeral artifacts directory per ingest
  call.

#### Retrieval Configuration

- `retrieval_defaults: RetrievalConfig | None = None`
  Default retrieval settings used by `client.retrievals.search(...)` and
  `client.rag.ask(...)`.
  If unset, xrag uses:

```python
RetrievalConfig(
    provider="hybrid",
    options={
        "top_k": 10,
        "bm25_candidates": 50,
        "vector_candidates": 50,
        "rrf_k": 60,
        "dedup_family": True,
    },
)
```

Reranking stays off at call time, even with a `reranker` configured, unless
you explicitly pass `rerank=True`.
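
The `hybrid` options suggest that BM25 and vector candidates are merged by rank. A minimal sketch of that idea, assuming `rrf_k` is the smoothing constant of standard Reciprocal Rank Fusion and that `bm25_candidates` / `vector_candidates` size the two input lists. This is not xrag's implementation; `rrf_fuse` and the chunk IDs are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, rrf_k=60, top_k=10):
    """Fuse ranked ID lists with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (rrf_k + rank) for the IDs it contains
            scores[chunk_id] += 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_k]]


bm25_hits = ["c3", "c1", "c7"]    # hypothetical BM25 ranking
vector_hits = ["c1", "c9", "c3"]  # hypothetical vector ranking
print(rrf_fuse([bm25_hits, vector_hits], top_k=3))  # → ['c1', 'c3', 'c9']
```

Chunks that appear in both lists accumulate two contributions, which is why `c1` and `c3` outrank the single-list hits.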

#### Timeouts

- `timeout_s: float = 30.0`
  General request timeout for non-ingest operations.
- `ingest_timeout_s: float = 600.0`
  Longer timeout budget for ingest operations.
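
To illustrate what separate timeout budgets mean for async calls, here is a sketch built on `asyncio.wait_for`. How xrag enforces `timeout_s` and `ingest_timeout_s` internally is not documented here; `with_budget` and `slow_operation` are hypothetical stand-ins.

```python
import asyncio


async def with_budget(coro, timeout_s):
    """Hypothetical helper: await `coro`, cancelling it after `timeout_s` seconds."""
    return await asyncio.wait_for(coro, timeout=timeout_s)


async def slow_operation(delay_s):
    # Stand-in for a network-bound call such as a search or an ingest
    await asyncio.sleep(delay_s)
    return "done"


async def main():
    # Finishes well inside its budget
    print(await with_budget(slow_operation(0.01), timeout_s=1.0))
    try:
        # Exceeds its budget, so the coroutine is cancelled
        await with_budget(slow_operation(1.0), timeout_s=0.05)
    except asyncio.TimeoutError:
        print("timed out")


asyncio.run(main())
```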

### What Most Users Actually Change

For most applications, the parameters that matter most are:

- `qdrant_url` or `qdrant_path`
- `embedding`
- `generation` if you use `rag.ask(...)`
- `reranker` if you want reranked retrieval
- `enrichment` if you want to disable or change ingest-time enrichment
- `artifacts_dir` if you want stable ingest artifacts on disk

### Environment-Based Setup

If you prefer environment variables:

```python
from xrag import Xrag

client = Xrag.from_env()
```

Useful environment variables:

- `XRAG_QDRANT_URL`
- `XRAG_QDRANT_PATH`
- `XRAG_DEFAULT_EMBEDDING`
- `XRAG_DEFAULT_GENERATION`
- `XRAG_DEFAULT_RERANKER`
- `XRAG_UPLOAD_DIR`
- `XRAG_ARTIFACTS_DIR`

## Docs Map

Detailed docs live under [`docs/`](docs/README.md).

Recommended path:

1. Read [`docs/README.md`](docs/README.md) for library-vs-CLI doc navigation.
2. Read [`docs/library.md`](docs/library.md) for the dedicated Python library guide.
3. Use [`docs/command.md`](docs/command.md) for exact CLI commands.
4. Use [`docs/chunker.md`](docs/chunker.md) for chunking strategies and output schema.
5. Use [`docs/enrichment.md`](docs/enrichment.md) for indexing-time text enrichment.
6. Use [`docs/rag-baseline.md`](docs/rag-baseline.md) for the simple baseline RAG flow.
7. Use [`docs/sub-plans/xrag-public-api.md`](docs/sub-plans/xrag-public-api.md) for the high-level Python client design.
8. Use [`docs/plans.md`](docs/plans.md) and [`docs/sub-plans/`](docs/sub-plans/) for roadmap and implementation details.

## Quickstart — library (v0.1, functional API)

```python
from pathlib import Path

from xrag import (
    AppConfig, IngestRequest,
    load_config_yaml, run_ingest, run_rag,
)

cfg: AppConfig = load_config_yaml("configs/rag-ir-mvp3-all-docs.yml")

# Ingest
ingest_result = run_ingest(IngestRequest(config=cfg, output_dir=Path("./data/dev/ir_document")))

# Query
result = run_rag("What was Cash as of December 31, 2024?", cfg)
print(result.answer)
```

This functional API sits alongside the high-level `Xrag` async client (`client.documents.ingest(...)`, `client.retrievals.search(...)`, `client.rag.ask(...)`) described above. For the client's design, see [`docs/sub-plans/xrag-public-api.md`](docs/sub-plans/xrag-public-api.md).

## Quickstart — CLI

End-to-end pipeline:

```text
source document
  -> convert
  -> parser
  -> parser preprocess
  -> chunk prepare
  -> indexing enrichment (optional)
  -> rag baseline OR indexing + query
  -> eval create-dataset / eval run
```

Common commands:

```bash
uv run python -m xrag.cli convert --help
uv run python -m xrag.cli parser --help
uv run python -m xrag.cli parser preprocess --help
uv run python -m xrag.cli chunk prepare --help
uv run python -m xrag.cli rag baseline --help
uv run python -m xrag.cli --help
```

Smoke flow:

```bash
# Parse
uv run python -m xrag.cli parser \
  --config configs/baseline.yml \
  --output-dir data/dev/ir_document \
  --pretty

# Preprocess
uv run python -m xrag.cli parser preprocess \
  --input data/dev/ir_document/parsed/unstructured_elements.json \
  --output-dir data/dev/ir_document/normalized

# Chunk
uv run python -m xrag.cli chunk prepare \
  --input data/dev/ir_document/normalized/elements.json \
  --config configs/chunker.yml \
  --output-dir data/dev/ir_document/chunked

# Baseline RAG
uv run python -m xrag.cli rag baseline \
  --input data/dev/ir_document/chunked/chunks.json \
  --question "What was Cash as of December 31, 2024?" \
  --config configs/rag-simple-baseline.yml \
  --pretty

# Preview enrichment without embedding/vector-store writes
uv run python scripts/enrichment/preview.py \
  --input data/dev/ir_document/chunked/chunks.json \
  --config configs/enrichment/table-context.yml
```

## Setup

Install dependencies:

```bash
uv sync
```

Environment for the Unstructured parser:

```env
UNSTRUCTURED_API_KEY=...
UNSTRUCTURED_API_URL=...   # optional
```

For DOCX-to-PDF conversion, install LibreOffice so that `soffice` or `libreoffice` is on `PATH`.

## Phasing

| Version | Surface |
|---|---|
| **v0.1** | Functional API: `run_ingest`, `run_rag`, Pydantic models, configs, loaders. |
| **v0.2** | High-level `Xrag` async client with resource-namespaced ops (`documents.*`, `retrievals.*`, `rag.*`). Tenant binding via `for_tenant`. Typed error hierarchy. |
| **v0.3** | `XragSync` mirror. |
| **v0.4** | Per-call `config_overrides`, batch ingest, persistent registry contract. |
| **v0.5+ (backlog)** | Streaming surfaces (`documents.ingest_stream`, `rag.ask_stream`). |

Full design: [`docs/sub-plans/xrag-public-api.md`](docs/sub-plans/xrag-public-api.md).

## Optional dependency extras

- `pyxrag[parser-unstructured-local-pdf]`: local PDF/image/OCR parsing with Unstructured
- `pyxrag[rerank-cohere]`: Cohere v3.5 reranker
- `pyxrag[cli]`: Typer-based CLI (`rag-cli` entry point)
- `pyxrag[chroma]`: Chroma backend (alternative to Qdrant)
- `pyxrag[all]`: everything

The base `pyxrag` install includes the hosted Unstructured API path, parser
preprocess support, and lightweight local parsing such as DOCX/text/HTML.
Add `pyxrag[parser-unstructured-local-pdf]` only if you want local PDF/image/OCR
parsing on the same machine.
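
To check which optional integrations are importable in the current environment, something like the sketch below can be used. It assumes each extra is detectable through the top-level module of the package it pulls in (`cohere`, `typer`, `chromadb`); `available_extras` is a hypothetical helper, not part of xrag.

```python
from importlib.util import find_spec

# Extra name -> top-level module that the extra's dependency provides
EXTRA_MODULES = {
    "rerank-cohere": "cohere",
    "cli": "typer",
    "chroma": "chromadb",
}


def available_extras(modules=None):
    """Report which optional modules can be imported in this environment."""
    modules = EXTRA_MODULES if modules is None else modules
    return {extra: find_spec(module) is not None for extra, module in modules.items()}


for extra, ok in available_extras().items():
    print(f"{extra}: {'installed' if ok else 'missing'}")
```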

### Dev-only tooling (`tools/`)

Eval scoring (RAGAS), dataset generation, and HTML viewers live under `tools/` at the repo root. They are **not** packaged into the wheel — wheel consumers don't need them. Running them requires a source clone plus the dev dependency group:

```bash
uv sync --group dev --extra cli --extra chroma --extra rerank-cohere
```

Add the heavy local PDF parser stack only when you need it:

```bash
uv sync --group dev --extra cli --extra chroma --extra rerank-cohere --extra parser-unstructured-local-pdf
```

`make eval`, `make unit-test`, `python -m xrag.cli eval ...` and similar dev commands rely on `tools/` being on `sys.path` (pytest is configured for this).

## Key defaults

- Parser provider: `unstructured`.
- Preprocess config: [`configs/preprocess.yml`](configs/preprocess.yml).
- Default chunker: `section_table` in [`configs/chunker.yml`](configs/chunker.yml).
- Alternative chunker: `ragflow` in [`configs/chunker-ragflow.yml`](configs/chunker-ragflow.yml).
- Simple one-shot RAG config: [`configs/rag-simple-baseline.yml`](configs/rag-simple-baseline.yml).
- Persistent index/query config: [`configs/rag-baseline.yml`](configs/rag-baseline.yml).
- Enrichment scenario configs: [`configs/enrichment/`](configs/enrichment).

## Project layout

- [`xrag/cli.py`](xrag/cli.py): Typer CLI entrypoint (gated behind `[cli]` extra).
- [`xrag/config/`](xrag/config/): YAML config and settings.
- [`xrag/core/`](xrag/core/): parser, chunker, enrichment, embedding, retrieval, reranker, generation, and tracing subsystems.
- [`xrag/pipelines/`](xrag/pipelines/): library pipelines — `parser`, `parser_preprocess`, `chunk_prepare`, `indexing`, `ingest`, `rag` (single-query, batch, and eval-shaped retrieve+generate). Eval scoring/dataset/HTML-viewer pipelines live under [`tools/`](tools/).
- [`tools/`](tools/): **dev-only** — eval scoring, dataset generation, HTML viewers, eval datasets. Excluded from the wheel.
- [`configs/`](configs/): runtime YAML configs.
- [`docs/`](docs/README.md): user docs, plans, surveys, and terminology.
- [`data/`](data/): local inputs and generated artifacts.

## Development checks

```bash
make lint
make check
make unit-test
```

After changing behavior, run a live CLI command that exercises the changed path.

## CI/CD

GitHub Actions verifies the same core paths contributors should run locally:

- `lint`: Ruff lint and format checks on a locked dev environment
- `unit-test`: full-extras unit tests plus a real `python -m xrag.cli --help` smoke check
- `package`: wheel + sdist build plus `twine check`
- `contract`: installed-wheel compatibility against Python `3.11` and `3.12`, with both the pinned LangChain `0.2.17` floor and the `0.3` line

Tag pushes matching `v*` reuse the verified package artifacts and publish them to the GitHub Release instead of rebuilding a second time during release.

## License

MIT — see [LICENSE](LICENSE).
