Metadata-Version: 2.4
Name: local-pageindex
Version: 0.1.1
Summary: Local-only Python SDK mirroring PageIndex Cloud API — vectorless RAG, no cloud required.
Author: local-pageindex contributors
License-Expression: MIT
Project-URL: Homepage, https://github.com/VectifyAI/PageIndex
Project-URL: Repository, https://github.com/VectifyAI/PageIndex
Project-URL: Documentation, https://docs.pageindex.ai
Project-URL: Bug Tracker, https://github.com/VectifyAI/PageIndex/issues
Keywords: rag,pageindex,document-processing,retrieval,llm,vectorless,local,openai
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: General
Classifier: Typing :: Typed
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: litellm>=1.83.0
Requires-Dist: PyPDF2>=3.0.1
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: pdf
Requires-Dist: pymupdf>=1.24.0; extra == "pdf"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-mock>=3.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=4.0; extra == "dev"
Provides-Extra: all
Requires-Dist: local-pageindex[pdf]; extra == "all"
Dynamic: license-file

# local-pageindex

[![PyPI version](https://img.shields.io/pypi/v/local-pageindex.svg)](https://pypi.org/project/local-pageindex/)
[![Python versions](https://img.shields.io/pypi/pyversions/local-pageindex.svg)](https://pypi.org/project/local-pageindex/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Tests](https://github.com/VectifyAI/PageIndex/actions/workflows/test.yml/badge.svg)](https://github.com/VectifyAI/PageIndex/actions/workflows/test.yml)

A local-only Python SDK that mirrors the [PageIndex Cloud API](https://docs.pageindex.ai/) as importable Python methods, with **no PageIndex Cloud account, no PageIndex API key, and no hosted PageIndex service**.

All document processing, retrieval, and chat happen on your machine. The only external calls go to your configured OpenAI-compatible LLM endpoint (for indexing, retrieval reasoning, and chat).

---

## Why this exists

[PageIndex](https://github.com/VectifyAI/PageIndex) is an open-source vectorless RAG system that builds hierarchical tree indexes from documents and uses LLM reasoning to navigate them. The PageIndex project reports 98.7% accuracy on FinanceBench without a vector database.

This SDK wraps that library with:

- **Local storage** — all indexes, trees, and retrieval results saved under `storage_path`
- **Full API parity** — local equivalents for every PageIndex Cloud REST endpoint
- **Streaming chat** — Python generator interface
- **Tenant/workspace/workflow isolation** — multi-tenant scoping and boundary enforcement
- **Batch ingestion** — folder and file-list ingestion with success/failure summary
- **Citation support** — every retrieval and chat response includes source references

---

## Comparison with PageIndex Cloud

| Feature | PageIndex Cloud | local-pageindex |
|---|---|---|
| Document tree building | Cloud-hosted | Local, via open-source `pageindex` library |
| Storage | PageIndex servers | Your `storage_path` directory |
| Authentication | `PAGEINDEX_API_KEY` | Your OpenAI-compatible `api_key` |
| PDF text extraction | Cloud OCR | PyMuPDF / PyPDF2 |
| Retrieval | Cloud-hosted vectorless RAG | Local LLM tree navigation |
| Chat | Cloud API | Local LLM with retrieved context |
| Streaming | SSE over HTTPS | Python generator |
| Multi-tenant isolation | Managed | Metadata-based filtering |
| Cost | PageIndex pricing | LLM API calls only |

---

## Installation

```bash
pip install local-pageindex
```

For better PDF extraction accuracy, also install PyMuPDF:

```bash
pip install "local-pageindex[pdf]"
```

**Important:** PDF and Markdown tree building requires the open-source `pageindex` library, which is not on PyPI. Install it separately:

```bash
pip install git+https://github.com/VectifyAI/PageIndex.git
```

Without `pageindex`, text file ingestion and all retrieval/chat methods work normally. PDF and Markdown ingestion will raise `LLMProviderError` at runtime.
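
To find out up front which ingestion paths will work, one option is to probe for the optional packages before creating a client. This is a minimal sketch: `available_features` is a hypothetical helper (not part of the SDK), and it assumes the libraries are importable as `pageindex` and `fitz` (PyMuPDF's classic import name).

```python
import importlib.util


def available_features() -> dict:
    """Report which optional backends can be imported in this environment."""

    def has(module_name: str) -> bool:
        # find_spec returns None when a top-level module is not installed.
        return importlib.util.find_spec(module_name) is not None

    return {
        # Assumed import name of the open-source PageIndex library.
        "pdf_and_markdown_tree_building": has("pageindex"),
        # PyMuPDF has traditionally been imported as "fitz".
        "pymupdf_extraction": has("fitz"),
    }


if __name__ == "__main__":
    for feature, ok in available_features().items():
        print(f"{feature}: {'available' if ok else 'missing'}")
```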

---

## Quick start

```python
import os
from local_pageindex import LocalPageIndexClient

client = LocalPageIndexClient(
    storage_path="./my_index",
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_BASE_URL"),  # optional
    model="gpt-4.1",
    reasoning_model="gpt-5.1",
)

# Ingest a document
result = client.ingest_document("report.pdf")
doc_id = result["doc_id"]

# Ask a question
answer = client.ask("What are the key findings?", doc_id=doc_id)
print(answer)
```

---

## Document ingestion

### PDF

```python
result = client.ingest_document(
    "annual_report.pdf",
    document_id="report-2024",         # optional; UUID generated if omitted
    metadata={
        "tenant_id": "acme",
        "workspace_id": "ws-finance",
        "source_type": "workspace",
    },
)
# {"doc_id": "report-2024", "status": "completed", "retrieval_ready": True}
```

### Plain text

```python
result = client.ingest_document("notes.txt", document_id="notes-001")
```

### Markdown

```python
result = client.ingest_markdown(
    "README.md",
    options={
        "add_node_summary": True,
        "add_doc_description": True,
        # PageIndex-style strings also accepted: "if_add_node_summary": "yes"
    },
)
```

### Folder ingestion

```python
summary = client.ingest_folder(
    "./documents/",
    metadata_defaults={"tenant_id": "acme", "workspace_id": "ws-1"},
    recursive=True,
)
# {"succeeded": [...], "failed": [...], "skipped": [...], "success_count": N}
```

Supported types: `.pdf`, `.md`, `.markdown`, `.txt`, `.text`.
Unsupported types appear in `skipped` — no exception is raised.
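
Because skipped files do not raise, the caller has to inspect the summary. The sketch below relies only on the four keys shown above; the shape of the individual entries is not documented here, so they are treated as opaque. Both helper names are hypothetical.

```python
def describe_ingest_summary(summary: dict) -> str:
    """Build a one-line report from an ingest_folder()-style summary dict."""
    succeeded = summary.get("succeeded", [])
    failed = summary.get("failed", [])
    skipped = summary.get("skipped", [])
    # Prefer the reported count; fall back to the list length if it is absent.
    ok = summary.get("success_count", len(succeeded))
    return f"{ok} ingested, {len(failed)} failed, {len(skipped)} skipped"


def needs_attention(summary: dict) -> bool:
    """True when at least one file failed and may need a retry."""
    return bool(summary.get("failed"))
```

For example, `print(describe_ingest_summary(summary))` after the `ingest_folder()` call above gives a quick health check for a batch run.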

---

## Get document data

```python
# Hierarchical tree (sections / subsections)
tree = client.get_document_tree("report-2024")
tree_with_summaries = client.get_document_tree("report-2024", summary=True)

# Extracted text — page format  [{"page_index": 1, "text": "..."}]
client.get_document_ocr("report-2024", format="page")

# Extracted text — node format  [{"node_id": "0001", "title": "...", "text": "..."}]
client.get_document_ocr("report-2024", format="node")

# Extracted text — raw string
client.get_document_ocr("report-2024", format="raw")

# Metadata  {"id", "name", "description", "status", "createdAt", "pageNum"}
client.get_document_metadata("report-2024")

# List  {"documents": [...], "total": N, "limit": 20, "offset": 0}
client.list_documents(limit=20, offset=0)

# Delete
client.delete_document("report-2024")
```
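
To walk the whole collection rather than a single page, the `limit` / `offset` pair can be wrapped in a generator. This is a sketch under the assumption that each response carries the `documents` and `total` keys shown above; `iter_all_documents` is a hypothetical helper that accepts any callable with that contract.

```python
from typing import Callable, Iterator


def iter_all_documents(
    list_page: Callable[..., dict], page_size: int = 50
) -> Iterator[dict]:
    """Yield every document by paging through a list_documents-style callable."""
    offset = 0
    while True:
        page = list_page(limit=page_size, offset=offset)
        documents = page.get("documents", [])
        yield from documents
        offset += len(documents)
        # Stop on an empty page or once the reported total has been reached.
        if not documents or offset >= page.get("total", 0):
            break
```

With a client in hand, this would be used as `for doc in iter_all_documents(client.list_documents): ...`.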

---

## Retrieval

```python
result = client.retrieve(
    document_id="report-2024",
    query="What are the key risk factors?",
    thinking=False,           # True: use reasoning_model for deeper analysis
    max_results=5,
    max_context_tokens=4000,
)
# {
#   "retrieval_id": "...",
#   "doc_id": "report-2024",
#   "status": "completed",
#   "query": "...",
#   "retrieved_nodes": [
#     {
#       "title": "Risk Factors",
#       "node_id": "0005",
#       "relevant_contents": [
#         {"page_index": 12, "relevant_content": "The primary risk..."}
#       ]
#     }
#   ]
# }
```
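
The nested result is often easier to consume as a flat list of passages. A minimal sketch, assuming the response shape shown in the comment above; `flatten_retrieved_nodes` is a hypothetical helper, not an SDK method.

```python
def flatten_retrieved_nodes(result: dict) -> list:
    """Turn retrieved_nodes into one record per relevant passage."""
    passages = []
    for node in result.get("retrieved_nodes", []):
        for item in node.get("relevant_contents", []):
            passages.append(
                {
                    "node_id": node.get("node_id"),
                    "title": node.get("title"),
                    "page_index": item.get("page_index"),
                    "text": item.get("relevant_content", ""),
                }
            )
    return passages
```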

Task-style retrieval (PageIndex Legacy API compatibility):

```python
task = client.create_retrieval_task("report-2024", "What are the risks?")
result = client.get_retrieval_result(task["retrieval_id"])
```

---

## Chat

```python
result = client.chat_completion(
    messages=[{"role": "user", "content": "Summarise the key findings."}],
    doc_id="report-2024",
    enable_citations=True,
)
# {
#   "id": "chatcmpl-...",
#   "choices": [{"message": {"role": "assistant", "content": "..."}, "finish_reason": "end_turn"}],
#   "usage": {"prompt_tokens": N, "completion_tokens": N, "total_tokens": N},
#   "citations": [{"document_id": "...", "section_title": "...", "page_number": N, ...}]
# }

# Multi-document
result = client.chat_completion(
    messages=[{"role": "user", "content": "Compare the two reports."}],
    document_ids=["report-2024", "report-2023"],
)

# Convenience method — returns plain string
answer = client.ask("What is the revenue?", doc_id="report-2024")
```
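
To display an answer together with its sources, the response can be unpacked with two small helpers. This sketch assumes the response shape shown above and uses only the `document_id`, `section_title`, and `page_number` citation fields; the helper names and the output format are illustrative choices.

```python
def answer_text(response: dict) -> str:
    """Return the assistant message from a chat_completion-style response."""
    return response["choices"][0]["message"]["content"]


def format_citations(response: dict) -> list:
    """Render each citation as a numbered, human-readable line."""
    lines = []
    for number, citation in enumerate(response.get("citations", []), start=1):
        title = citation.get("section_title") or "untitled section"
        page = citation.get("page_number")
        where = f", p. {page}" if page is not None else ""
        lines.append(f"[{number}] {citation.get('document_id', '?')}: {title}{where}")
    return lines
```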

---

## Streaming chat

```python
for chunk in client.stream_chat(
    messages=[{"role": "user", "content": "Summarise the findings."}],
    doc_id="report-2024",
):
    if chunk["type"] == "content":
        print(chunk["choices"][0]["delta"]["content"], end="", flush=True)
    elif chunk["type"] == "done":
        print()
        print("Citations:", chunk.get("citations", []))
```

Chunk types emitted:

| Type | Description |
|---|---|
| `text_block_start` | Stream begins |
| `content` | Text delta in `choices[0].delta.content` |
| `text_stop` | Text stream complete |
| `done` | Final chunk; includes `citations` list |
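
When the full text is needed after streaming, the chunks can be accumulated instead of printed. A minimal sketch, assuming the chunk shapes described in the table above and that `text_block_start` and `text_stop` carry no text; `collect_stream` is a hypothetical helper that works on any iterable of chunk dicts.

```python
from typing import Iterable, Tuple


def collect_stream(chunks: Iterable[dict]) -> Tuple[str, list]:
    """Accumulate streamed text deltas and return (full_text, citations)."""
    parts = []
    citations = []
    for chunk in chunks:
        kind = chunk.get("type")
        if kind == "content":
            parts.append(chunk["choices"][0]["delta"]["content"])
        elif kind == "done":
            citations = chunk.get("citations", [])
        # Other chunk types are treated as markers and skipped (an assumption).
    return "".join(parts), citations
```

For example, `text, citations = collect_stream(client.stream_chat(messages=..., doc_id=...))`.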

---

## Metadata filtering

Every document stores arbitrary metadata for later filtering:

```python
client.ingest_document(
    "contract.pdf",
    metadata={
        "tenant_id": "acme",
        "workspace_id": "ws-legal",
        "workflow_id": "wf-onboarding",
        "user_id": "user-42",
        "source_type": "workflow_upload",   # "workspace" | "workflow_upload"
    },
)

docs = client.list_documents(filters={"workspace_id": "ws-legal"})
```

---

## Workspace and workflow isolation

`search()` and `scoped_chat()` enforce strict isolation boundaries:

```python
results = client.search(
    query="What are the obligations?",
    tenant_id="acme",
    workspace_id="ws-legal",
    include_workspace_context=True,
    include_uploaded_documents=True,
)

response = client.scoped_chat(
    messages=[{"role": "user", "content": "Analyse this contract."}],
    tenant_id="acme",
    workspace_id="ws-legal",
    workflow_id="wf-onboarding",
    include_workspace_context=True,
    include_uploaded_documents=True,
)
```

**Boundary guarantees:**

- Never searches across tenants
- Never searches across workspaces
- Workflow-uploaded documents are only visible within that workflow
- `include_workspace_context=False` → workspace docs excluded
- `include_uploaded_documents=False` → workflow-uploaded docs excluded
- Both `False` → empty context, no LLM call
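
The guarantees above can be read as a single predicate over a document's metadata. The sketch below is an illustration of the rules as stated, not the SDK's actual implementation; `is_visible` is a hypothetical function, and treating every document that is not a `workflow_upload` as a workspace document is an assumption.

```python
from typing import Optional


def is_visible(
    metadata: dict,
    tenant_id: str,
    workspace_id: str,
    workflow_id: Optional[str] = None,
    include_workspace_context: bool = True,
    include_uploaded_documents: bool = True,
) -> bool:
    """Decide whether a document falls inside the requested scope."""
    # Never cross tenant or workspace boundaries.
    if metadata.get("tenant_id") != tenant_id:
        return False
    if metadata.get("workspace_id") != workspace_id:
        return False
    if metadata.get("source_type") == "workflow_upload":
        # Uploads are visible only inside the workflow they belong to.
        return (
            include_uploaded_documents
            and workflow_id is not None
            and metadata.get("workflow_id") == workflow_id
        )
    # Anything else is treated as a workspace document (an assumption).
    return include_workspace_context
```

With both flags set to `False` the predicate rejects every document, which matches the "empty context, no LLM call" case.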

---

## Data locality

All data is stored under your `storage_path`:

```
storage_path/
  documents/
    {document_id}/
      metadata.json
      tree.json
      extracted_text.json
      pages.json
      nodes.json
      retrievals/
      chats/
  manifests/
    manifest.json
  tasks/
```

**No data is sent to PageIndex Cloud.** The only external calls are to your configured LLM endpoint for:

- Tree building and node summarisation (during ingestion)
- Tree navigation and answer generation (during retrieval)
- Chat responses
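
Since everything lives on disk, the layout above can be inspected with the standard library alone. This is a sketch that assumes `metadata.json` is a UTF-8 JSON file, as its name suggests; both helpers are hypothetical and read-only.

```python
import json
from pathlib import Path


def stored_document_ids(storage_path: str) -> list:
    """Return the IDs of documents that have a metadata.json on disk."""
    documents_dir = Path(storage_path) / "documents"
    if not documents_dir.is_dir():
        return []
    return sorted(
        entry.name
        for entry in documents_dir.iterdir()
        if (entry / "metadata.json").is_file()
    )


def read_metadata(storage_path: str, document_id: str) -> dict:
    """Load one document's metadata.json (assumed to contain a JSON object)."""
    path = Path(storage_path) / "documents" / document_id / "metadata.json"
    return json.loads(path.read_text(encoding="utf-8"))
```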

---

## API parity table

| Cloud Endpoint | Local SDK Method | Supported |
|---|---|---|
| `POST /doc/` | `ingest_document()` / `ingest_pdf()` | Yes |
| `GET /doc/{id}/?type=tree` | `get_document_tree()` | Yes |
| `GET /doc/{id}/?type=tree&summary=true` | `get_document_tree(summary=True)` | Yes |
| `GET /doc/{id}/?type=ocr&format=page` | `get_document_ocr(format="page")` | Yes |
| `GET /doc/{id}/?type=ocr&format=node` | `get_document_ocr(format="node")` | Yes |
| `GET /doc/{id}/?type=ocr&format=raw` | `get_document_ocr(format="raw")` | Yes |
| `GET /doc/{id}/metadata` | `get_document_metadata()` | Yes |
| `GET /docs?limit=&offset=` | `list_documents(limit, offset)` | Yes |
| `DELETE /doc/{id}/` | `delete_document()` | Yes |
| `POST /markdown/` | `ingest_markdown()` / `convert_markdown_to_tree()` | Yes |
| `POST /chat/completions` | `chat_completion()` / `chat()` / `ask()` | Yes |
| `POST /chat/completions` (stream) | `stream_chat()` | Yes (Python generator) |
| `POST /retrieval/` | `retrieve()` / `create_retrieval_task()` | Yes |
| `GET /retrieval/{id}/` | `get_retrieval_result()` | Yes |
| Folder ingestion | `ingest_folder()` | Local-only |
| Batch ingestion | `ingest_documents()` / `batch_ingest()` | Local-only |
| Workspace isolation | `search()` / `scoped_chat()` | Local-only |

### Known limitations

| Cloud Feature | Local Approximation |
|---|---|
| Enhanced cloud OCR | PyMuPDF / PyPDF2 text extraction (may be less accurate for scanned PDFs) |
| Hosted MCP tooling | Not implemented — local SDK uses direct LLM calls |
| MCP streaming events | Approximated as `text_block_start` / `content` / `text_stop` / `done` chunks |
| Async processing queue | Synchronous — `create_retrieval_task()` runs inline and stores the result |
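
Because calls run synchronously, an application built on `asyncio` may want to keep them off the event loop. One possible approach is to push the blocking call to a worker thread. This is a sketch only: `run_blocking` is a hypothetical helper, and whether a single client instance is safe to use from several threads at once is not documented here, so treat that as an open assumption.

```python
import asyncio
from typing import Any, Callable


async def run_blocking(call: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run a synchronous call in a worker thread and await its result."""
    # asyncio.to_thread is available from Python 3.9 onwards.
    return await asyncio.to_thread(call, *args, **kwargs)
```

Inside a coroutine this could look like `result = await run_blocking(client.retrieve, document_id="report-2024", query="...")`.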

---

## Attribution

This package wraps the open-source [PageIndex](https://github.com/VectifyAI/PageIndex) library
by [VectifyAI](https://github.com/VectifyAI) (MIT License). No PageIndex source code is
incorporated — `pageindex` is used as a library dependency. See [ATTRIBUTION.md](ATTRIBUTION.md)
for details.

`local-pageindex` is not affiliated with VectifyAI or PageIndex Cloud.

---

## License

[MIT](LICENSE)
