Metadata-Version: 2.4
Name: rnsr
Version: 0.1.5
Summary: Recursive Neural-Symbolic Retriever - Hierarchical document retrieval with font-based structure analysis
Author: RNSR Contributors
License-Expression: MIT
Project-URL: Homepage, https://github.com/theeufj/RNSR
Project-URL: Documentation, https://github.com/theeufj/RNSR#readme
Project-URL: Repository, https://github.com/theeufj/RNSR.git
Project-URL: Issues, https://github.com/theeufj/RNSR/issues
Keywords: rag,retrieval,document-processing,llm,hierarchical-indexing,pdf-parsing,neural-symbolic
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: pdfplumber>=0.10.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: scipy>=1.10.0
Requires-Dist: structlog>=23.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: llama-index>=0.10.0
Requires-Dist: langgraph>=0.2.0
Requires-Dist: langchain-core>=0.2.0
Requires-Dist: transformers>=4.35.0
Requires-Dist: torch>=2.0.0
Requires-Dist: torchvision>=0.15.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: pdf2image>=1.16.0
Requires-Dist: scikit-learn>=1.3.0
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == "openai"
Requires-Dist: llama-index-llms-openai>=0.1.0; extra == "openai"
Requires-Dist: llama-index-embeddings-openai>=0.1.0; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.20.0; extra == "anthropic"
Requires-Dist: llama-index-llms-anthropic>=0.1.0; extra == "anthropic"
Provides-Extra: gemini
Requires-Dist: google-genai>=1.0.0; extra == "gemini"
Provides-Extra: all
Requires-Dist: rnsr[anthropic,gemini,openai]; extra == "all"
Provides-Extra: benchmarks
Requires-Dist: datasets>=2.0.0; extra == "benchmarks"
Requires-Dist: ragas>=0.1.0; extra == "benchmarks"
Provides-Extra: demo
Requires-Dist: gradio>=4.0.0; extra == "demo"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Dynamic: license-file

# RNSR - Recursive Neural-Symbolic Retriever

A state-of-the-art document retrieval system that preserves hierarchical structure for superior RAG performance. Combines PageIndex, Recursive Language Models (RLM), Knowledge Graphs, and Tree of Thoughts navigation.

## Overview

RNSR combines neural and symbolic approaches to achieve accurate document understanding:

- **Font Histogram Algorithm** - Automatically detects document hierarchy from font sizes (no training required)
- **Skeleton Index Pattern** - Lightweight summaries with KV store for efficient retrieval
- **Tree-of-Thoughts Navigation** - LLM reasons about document structure to find answers
- **RLM Unified Extraction** - LLM writes extraction code, grounded in actual text
- **Knowledge Graph** - Entity and relationship storage for cross-document linking
- **Self-Reflection Loop** - Iterative answer improvement through self-critique
- **Adaptive Learning** - System learns from your document workload over time

## Key Features

| Feature | Description |
|---------|-------------|
| **Hierarchical Extraction** | Preserves document structure (sections, subsections, paragraphs) |
| **RLM Unified Extractor** | LLM writes extraction code + ToT validation (grounded, no hallucination) |
| **Provenance System** | Every answer traces back to exact document citations |
| **LLM Response Cache** | Semantic-aware caching for 10x cost/speed improvement |
| **Self-Reflection** | Iterative self-correction improves answer quality |
| **Reasoning Memory** | Learns successful query patterns for faster future queries |
| **Query Clarification** | Detects ambiguous queries and asks clarifying questions |
| **Table/Chart Parsing** | SQL-like queries over tables, chart trend analysis |
| **Adaptive Learning** | 6 registries that learn from usage and persist to disk |
| **Multi-Document Detection** | Automatically splits bundled PDFs |
| **Vision Mode** | OCR-free analysis for scanned documents and charts |

## Installation

```bash
# Clone the repository
git clone https://github.com/theeufj/RNSR.git
cd RNSR

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install with all LLM providers
pip install -e ".[all]"

# Or install with specific provider
pip install -e ".[openai]"      # OpenAI only
pip install -e ".[anthropic]"   # Anthropic only
pip install -e ".[gemini]"      # Google Gemini only
```

## Quick Start

### 1. Set up API keys

Create a `.env` file:

```bash
cp .env.example .env
# Edit .env with your API keys
```

```env
# Choose your preferred LLM provider
OPENAI_API_KEY=sk-...
# or
ANTHROPIC_API_KEY=sk-ant-...
# or
GOOGLE_API_KEY=AI...

# Optional: Override default models
LLM_PROVIDER=anthropic
SUMMARY_MODEL=claude-sonnet-4-5
```

### 2. Use the Python API

```python
from rnsr import RNSRClient

# Simple one-line Q&A
client = RNSRClient()
answer = client.ask("contract.pdf", "What are the payment terms?")
print(answer)

# Advanced navigation with verification and self-reflection
result = client.ask_advanced(
    "complex_report.pdf",
    "Compare liability clauses in sections 5 and 8",
    enable_verification=True,
    enable_self_reflection=True,
    max_recursion_depth=3,
)
```

### 3. Run the Demo UI

```bash
python demo.py
# Open http://localhost:7860 in your browser
```

## New Features

### Provenance System

Every answer includes traceable citations:

```python
from rnsr.agent import ProvenanceTracker, format_citations_for_display

tracker = ProvenanceTracker(kv_store=kv_store, skeleton=skeleton)
record = tracker.create_provenance_record(
    answer="The payment terms are net 30.",
    question="What are the payment terms?",
    variables=navigation_variables,
)

print(f"Confidence: {record.aggregate_confidence:.0%}")
print(format_citations_for_display(record.citations))
# Output:
# **Sources:**
# 1. [contract.pdf] Section: Payment Terms, Page 5: "Payment shall be due within 30 days..."
```

### LLM Response Caching

Automatic caching reduces costs and latency:

```python
from rnsr.agent import wrap_llm_with_cache, get_global_cache

# Wrap any LLM function with caching
cached_llm = wrap_llm_with_cache(llm.complete, ttl_seconds=3600)

# Use cached LLM - repeated prompts hit cache
response = cached_llm("What is 2+2?")  # Calls LLM
response = cached_llm("What is 2+2?")  # Returns cached (instant)

# Check cache stats
print(get_global_cache().get_stats())
# {'entries': 150, 'hits': 89, 'hit_rate': 0.59}
```

### Self-Reflection Loop

Answers are automatically critiqued and improved:

```python
from rnsr.agent import SelfReflectionEngine, reflect_on_answer

# Quick one-liner
result = reflect_on_answer(
    answer="The contract expires in 2024.",
    question="When does the contract expire?",
    evidence="Contract dated 2023, 2-year term...",
)

print(f"Improved: {result.improved}")
print(f"Final answer: {result.final_answer}")
print(f"Iterations: {result.total_iterations}")
```

### Reasoning Chain Memory

The system learns from successful queries:

```python
from rnsr.agent import get_reasoning_memory, find_similar_chains

# Find similar past queries
matches = find_similar_chains("What is the liability cap?")
for match in matches:
    print(f"Similar query: {match.chain.query}")
    print(f"Similarity: {match.similarity:.0%}")
    print(f"Past answer: {match.chain.answer}")
```

### Table Parsing

Extract and query tables from documents:

```python
from rnsr.ingestion import TableParser, TableQueryEngine

parser = TableParser()
tables = parser.parse_from_text(document_text)

# SQL-like queries
engine = TableQueryEngine(tables[0])
results = engine.select(
    columns=["Name", "Amount"],
    where={"Status": "Active"},
    order_by="Amount",
)

# Aggregations
total = engine.aggregate("Amount", "sum")
```

### Query Clarification

Handle ambiguous queries gracefully:

```python
from rnsr.agent import QueryClarifier, needs_clarification

# Check if query needs clarification
is_ambiguous, analysis = needs_clarification(
    "What does it say about the clause?"
)

if is_ambiguous:
    print(f"Ambiguity: {analysis.ambiguity_type}")
    print(f"Clarifying question: {analysis.suggested_clarification}")
    # "What does 'it' refer to in your question?"
```

## Adaptive Learning

RNSR learns from your document workload. All learned data persists in `~/.rnsr/`:

```
~/.rnsr/
├── learned_entity_types.json       # New entity types discovered
├── learned_relationship_types.json # New relationship types
├── learned_normalization.json      # Title/suffix patterns
├── learned_stop_words.json         # Domain-specific stop words
├── learned_header_thresholds.json  # Document-type font thresholds
├── learned_query_patterns.json     # Successful query patterns
├── reasoning_chains.json           # Successful reasoning chains
└── llm_cache.db                    # LLM response cache
```

The more you use RNSR, the better it gets at understanding your domain.
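
Since the learned state is plain JSON on disk, you can peek at what the system has picked up so far. The sketch below reads the registry files directly, which is an assumption for illustration (the file names come from the layout above; check the `rnsr` API for a supported way to inspect learned state):

```python
import json
from pathlib import Path

def learned_item_counts(rnsr_home=Path.home() / ".rnsr"):
    """Count entries in each learned-registry file, tolerating absent files.

    Reading these JSON files directly is an illustrative assumption; the
    library may expose a proper accessor for its learned registries.
    """
    registries = [
        "learned_entity_types.json",
        "learned_relationship_types.json",
        "learned_query_patterns.json",
    ]
    counts = {}
    for name in registries:
        path = Path(rnsr_home) / name
        counts[name] = len(json.loads(path.read_text())) if path.exists() else 0
    return counts

print(learned_item_counts())
```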

## How It Works

### Document Ingestion Pipeline

```
PDF → Font Analysis → Header Classification → Tree Building → Skeleton Index
         ↓                    ↓                    ↓              ↓
   Detect font sizes   Classify H1/H2/H3    Build hierarchy   Create summaries
                                                  ↓
                                        Multi-doc detection
                                        (page number resets)
```
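
The font-histogram step above can be sketched in a few lines: the most frequent font size is assumed to be body text, and any larger sizes map to header levels in descending order. This is a toy illustration of the idea, not RNSR's actual `font_histogram.py` implementation:

```python
from collections import Counter

def classify_headers(spans):
    """Toy font-histogram classifier (illustration only, not RNSR's code).

    spans: list of (text, font_size) tuples from a PDF parser.
    The most frequent font size is taken as body text; any larger sizes
    become H1/H2/H3 in descending size order.
    """
    size_counts = Counter(size for _, size in spans)
    body_size = size_counts.most_common(1)[0][0]
    header_sizes = sorted({s for _, s in spans if s > body_size}, reverse=True)
    level_of = {s: f"H{rank}" for rank, s in enumerate(header_sizes[:3], start=1)}
    return [(text, level_of[size]) for text, size in spans if size in level_of]

spans = [
    ("ANNUAL REPORT", 18.0),
    ("1. Overview", 14.0),
    ("Revenue grew modestly.", 10.0),
    ("2. Risk Factors", 14.0),
    ("Litigation remains a concern.", 10.0),
    ("No material changes occurred.", 10.0),
]
print(classify_headers(spans))
# [('ANNUAL REPORT', 'H1'), ('1. Overview', 'H2'), ('2. Risk Factors', 'H2')]
```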

### Query Processing

```
Question → Clarify → Pre-Filter → Tree Navigation → Answer → Self-Reflect → Verify
              ↓           ↓              ↓             ↓           ↓           ↓
        Ask if ambig  Keyword scan  ToT reasoning  Synthesize  Critique   Fact-check
                                         ↓                        ↓
                                  Sub-LLM recursion        Improve answer
                                  (complex queries)        (if issues)
```
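
The Pre-Filter stage above can be illustrated as a simple keyword scan over candidate sections before any LLM reasoning runs. This sketch is hypothetical; RNSR's actual pre-filter may stem words, weight headers, or use learned query patterns:

```python
import re

def prefilter_sections(question, sections):
    """Rank sections by keyword overlap with the question (sketch only).

    sections: dict of {node_id: text}. Returns node ids sorted by how
    many distinct question keywords (4+ letters) appear in each section.
    """
    keywords = set(re.findall(r"[a-z]{4,}", question.lower()))
    scored = [
        (len(keywords & set(re.findall(r"[a-z]{4,}", text.lower()))), node_id)
        for node_id, text in sections.items()
    ]
    return [nid for score, nid in sorted(scored, reverse=True) if score > 0]

sections = {
    "sec_payment": "Payment shall be due within 30 days of invoice.",
    "sec_liability": "Liability is capped at the fees paid.",
}
print(prefilter_sections("What are the payment terms?", sections))
# ['sec_payment']
```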

### Entity Extraction (RLM Unified)

```
Document → LLM writes code → Execute on DOC_VAR → ToT validation → Cross-validate
              ↓                     ↓                   ↓               ↓
     Generates regex/Python   Grounded results   Probability scores  Entity↔Relationship
                                    ↓
                            All tied to exact text spans
```
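
The "tied to exact text spans" guarantee can be shown with a minimal grounded extractor: every extracted value carries character offsets that can be checked against the source text. This is a hypothetical sketch, not the `RLMUnifiedExtractor` output format:

```python
import re

def extract_grounded(text, pattern, label):
    """Extract matches with exact character spans (illustrative sketch;
    the real RLMUnifiedExtractor result schema may differ)."""
    return [
        {"label": label, "value": m.group(0), "span": m.span()}
        for m in re.finditer(pattern, text)
    ]

text = "Payment is due within 30 days. Late fees accrue after 45 days."
hits = extract_grounded(text, r"\d+ days", "DURATION")
for h in hits:
    start, end = h["span"]
    # Grounding check: the span slices back to exactly the extracted value.
    assert text[start:end] == h["value"]
print(hits)
```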

### RLM Navigation Architecture (ToT + REPL Integration)

RNSR uses a unique combination of Tree of Thoughts (ToT) reasoning and a REPL (Read-Eval-Print Loop) environment for document navigation. This is what sets RNSR apart from naive RAG approaches:

**The Problem with Naive RAG:**
Traditional RAG splits documents into chunks, embeds them, and retrieves by embedding similarity. This discards hierarchical structure and often retrieves irrelevant chunks for complex queries.

**RNSR's RLM Navigation Solution:**

```
Query → NavigationREPL → LLM Generates Code → Execute → Findings → ToT Validation → Answer
           ↓                    ↓                ↓          ↓            ↓
    Expose document       search_tree()      Find relevant  Store     Verify with
    as environment        navigate_to()      nodes          findings  probabilities
                          get_content()
```

**How it works:**

1. **Document as Environment**: The document tree is exposed as a programmable environment through `NavigationREPL`. The LLM can write Python code to search, navigate, and extract information.

2. **Code Generation Navigation**: Instead of keyword matching, the LLM writes code like:
   ```python
   # LLM-generated code to find CEO salary
   results = search_tree(r"CEO|chief executive|compensation|salary")
   for match in results[:3]:
       navigate_to(match.node_id)
       content = get_node_content(match.node_id)
       if "salary" in content.lower():
           store_finding("ceo_salary", content, match.node_id)
   ready_to_synthesize()
   ```

3. **Iterative Search**: The LLM can execute multiple rounds of code, drilling deeper into promising sections, much as a human would browse a document.

4. **ToT Validation**: Findings are validated using Tree of Thoughts - each potential answer gets a probability score based on how well it matches the query and document evidence.

5. **Grounded Answers**: All answers are tied to specific document sections. If the LLM can't find reliable information, it honestly reports "Unable to find reliable information" rather than hallucinating.
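
Steps 4 and 5 can be condensed into a small sketch: candidate findings carry probability scores, and if none clears a threshold the system declines to answer rather than hallucinate. The scores here are supplied directly for illustration; in RNSR they come from LLM-based ToT evaluation, and the threshold value is an assumption:

```python
def tot_select(candidates, threshold=0.6):
    """Pick the highest-probability candidate, or decline (sketch only;
    RNSR derives these probabilities from LLM-based ToT evaluation)."""
    viable = [c for c in candidates if c["probability"] >= threshold]
    if not viable:
        return {"answer": "Unable to find reliable information", "probability": 0.0}
    return max(viable, key=lambda c: c["probability"])

candidates = [
    {"answer": "Net 30, per section 4.2", "probability": 0.85},
    {"answer": "Net 60, from an unrelated exhibit", "probability": 0.30},
]
print(tot_select(candidates))
# {'answer': 'Net 30, per section 4.2', 'probability': 0.85}
print(tot_select([]))
# {'answer': 'Unable to find reliable information', 'probability': 0.0}
```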

**Available NavigationREPL Functions:**

| Function | Description |
|----------|-------------|
| `search_content(pattern)` | Regex search within current node |
| `search_children(pattern)` | Search direct children |
| `search_tree(pattern)` | Search entire subtree with relevance scoring |
| `navigate_to(node_id)` | Move to a specific section |
| `go_back()` | Return to previous section |
| `go_to_root()` | Return to document root |
| `get_node_content(node_id)` | Get full text of a section |
| `store_finding(key, content, node_id)` | Save relevant information |
| `ready_to_synthesize()` | Signal that enough info has been gathered |

**Why This Outperforms Naive RAG:**

- **Hierarchical Understanding**: RNSR understands that "Section 42" might contain the CEO salary even if the query doesn't mention "Section 42"
- **Multi-hop Reasoning**: Can navigate from a table of contents to a specific subsection to find buried information
- **Document Length Agnostic**: Works equally well on 10-page and 1000-page documents - the LLM navigates to relevant sections rather than trying to fit everything in context
- **No Hallucination**: If information isn't found through code execution, the system admits it rather than making up answers

## Architecture

```
rnsr/
├── agent/                   # Query processing
│   ├── rlm_navigator.py     # Main navigation agent (RLM + ToT)
│   ├── nav_repl.py          # NavigationREPL for code-based navigation (NEW)
│   ├── repl_env.py          # Base REPL environment
│   ├── provenance.py        # Citation tracking
│   ├── llm_cache.py         # Response caching
│   ├── self_reflection.py   # Answer improvement
│   ├── reasoning_memory.py  # Chain memory
│   ├── query_clarifier.py   # Ambiguity handling
│   ├── graph.py             # LangGraph workflow
│   └── variable_store.py    # Context management
├── extraction/              # Entity/relationship extraction
│   ├── rlm_unified_extractor.py  # Best extractor (NEW)
│   ├── learned_types.py     # Adaptive type learning
│   ├── entity_linker.py     # Cross-document linking
│   └── models.py            # Entity/Relationship models
├── indexing/                # Index construction
│   ├── skeleton_index.py    # Summary generation
│   ├── knowledge_graph.py   # Entity/relationship storage
│   ├── kv_store.py          # SQLite/in-memory storage
│   └── semantic_search.py   # Optional vector search
├── ingestion/               # Document processing
│   ├── pipeline.py          # Main ingestion orchestrator
│   ├── font_histogram.py    # Font-based structure detection
│   ├── header_classifier.py # H1/H2/H3 classification
│   ├── table_parser.py      # Table extraction (NEW)
│   ├── chart_parser.py      # Chart interpretation (NEW)
│   └── tree_builder.py      # Hierarchical tree construction
├── llm.py                   # Multi-provider LLM abstraction
├── client.py                # High-level API
└── models.py                # Data structures
```

## API Reference

### High-Level API

```python
from rnsr import RNSRClient

client = RNSRClient(
    llm_provider="anthropic",  # or "openai", "gemini"
    llm_model="claude-sonnet-4-5"
)

# Simple query
answer = client.ask("document.pdf", "What is the main topic?")

# Vision mode (for scanned docs)
answer = client.ask_vision("scanned.pdf", "What does the chart show?")
```

### Low-Level API

```python
from rnsr import (
    ingest_document,
    build_skeleton_index,
    run_rlm_navigator,
    SQLiteKVStore
)
from rnsr.extraction import RLMUnifiedExtractor
from rnsr.agent import ProvenanceTracker, SelfReflectionEngine

# Step 1: Ingest document
result = ingest_document("document.pdf")
print(f"Extracted {result.tree.total_nodes} nodes")

# Step 2: Build index
kv_store = SQLiteKVStore("./data/index.db")
skeleton = build_skeleton_index(result.tree, kv_store)

# Step 3: Extract entities (grounded, no hallucination)
extractor = RLMUnifiedExtractor()
extraction = extractor.extract(
    node_id="section_1",
    doc_id="document",
    header="Introduction",
    content="..."
)

# Step 4: Query with provenance
answer = run_rlm_navigator(
    question="What are the key findings?",
    skeleton=skeleton,
    kv_store=kv_store
)

# Step 5: Get citations
tracker = ProvenanceTracker(kv_store=kv_store)
record = tracker.create_provenance_record(answer, question, variables)
```

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `LLM_PROVIDER` | Primary LLM provider | `auto` (detect from keys) |
| `SUMMARY_MODEL` | Model for summarization | Provider default |
| `AGENT_MODEL` | Model for navigation | Provider default |
| `EMBEDDING_MODEL` | Embedding model | `text-embedding-3-small` |
| `KV_STORE_PATH` | SQLite database path | `./data/kv_store.db` |
| `LOG_LEVEL` | Logging verbosity | `INFO` |
| `RNSR_LLM_CACHE_PATH` | Custom cache location | `~/.rnsr/llm_cache.db` |
| `RNSR_REASONING_MEMORY_PATH` | Custom memory location | `~/.rnsr/reasoning_chains.json` |

### Supported Models

| Provider | Models |
|----------|--------|
| **OpenAI** | `gpt-5.2`, `gpt-5-mini`, `gpt-5-nano`, `gpt-4.1`, `gpt-4o-mini` |
| **Anthropic** | `claude-opus-4-5`, `claude-sonnet-4-5`, `claude-haiku-4-5` |
| **Gemini** | `gemini-3-pro-preview`, `gemini-3-flash-preview`, `gemini-2.5-pro`, `gemini-2.5-flash` |

## Benchmarks

RNSR is designed for complex document understanding tasks:

- **Multi-document PDFs** - Automatically detects and separates bundled documents
- **Hierarchical queries** - "Compare section 3.2 with section 5.1"
- **Cross-reference questions** - "What does the appendix say about the claim in section 2?"
- **Entity extraction** - Grounded extraction with ToT validation (no hallucination)
- **Table queries** - "What is the total for Q4 2024?"

## Sample Documents

RNSR includes sample documents for testing and demonstration:

### Synthetic Documents (`samples/`)

| File | Type | Features Demonstrated |
|------|------|----------------------|
| `sample_contract.md` | Legal Contract | Entities (people, orgs), relationships, payment tables, legal terms |
| `sample_financial_report.md` | Financial Report | Financial tables, metrics, executive names, quarterly data |
| `sample_research_paper.md` | Academic Paper | Citations, hierarchical sections, technical content, tables |

### Real Test Documents (`rnsr/test-documents/`)

Legal documents from the Djokovic visa case (public court records) for testing with actual PDFs:
- Affidavits and court applications
- Legal submissions and orders
- Interview transcripts

### Using Sample Documents

```python
from pathlib import Path
from rnsr.ingestion import TableParser
from rnsr.extraction import CandidateExtractor

# Parse a sample document
sample = Path("samples/sample_contract.md").read_text()

# Extract tables
parser = TableParser()
tables = parser.parse_from_text(sample)
print(f"Found {len(tables)} tables")

# Extract entities
extractor = CandidateExtractor()
candidates = extractor.extract_candidates(sample)
print(f"Found {len(candidates)} entity candidates")
```

## Testing

### Test Suite Overview

RNSR has comprehensive test coverage with **281+ tests**:

```bash
# Run all tests
pytest tests/ -v

# Run specific feature tests
pytest tests/test_provenance.py tests/test_llm_cache.py -v

# Run end-to-end workflow tests
pytest tests/test_e2e_workflow.py -v

# Run with coverage
pytest tests/ --cov=rnsr --cov-report=html
```

### Test Categories

| Test File | Tests | Coverage |
|-----------|-------|----------|
| `test_e2e_workflow.py` | 18 | Full pipeline: ingestion → extraction → KG → query → provenance |
| `test_provenance.py` | 17 | Citations, contradictions, provenance records |
| `test_llm_cache.py` | 17 | Cache get/set, TTL, persistence |
| `test_self_reflection.py` | 13 | Critique, refinement, iteration limits |
| `test_reasoning_memory.py` | 15 | Chain storage, similarity matching |
| `test_query_clarifier.py` | 19 | Ambiguity detection, clarification |
| `test_table_parser.py` | 26 | Markdown/ASCII tables, SQL-like queries |
| `test_chart_parser.py` | 16 | Chart detection, trend analysis |
| `test_rlm_unified.py` | 13 | REPL execution, code cleaning |
| `test_learned_types.py` | 13 | Adaptive learning registries |

### End-to-End Workflow Tests

The `test_e2e_workflow.py` suite demonstrates the complete pipeline:

```python
# Tests cover:
# 1. Document Ingestion - Parse structure and tables
# 2. Entity Extraction - Pattern-based grounded extraction  
# 3. Knowledge Graph - Store entities and relationships
# 4. Query Processing - Ambiguity detection, table queries
# 5. Provenance - Citations and evidence tracking
# 6. Self-Reflection - Answer improvement loop
# 7. Reasoning Memory - Learn from successful queries
# 8. LLM Cache - Response caching
# 9. Adaptive Learning - Type discovery
# 10. Full Workflow - Contract and financial analysis
```

## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run linting
ruff check .

# Type checking
mypy rnsr/
```

## Requirements

- Python 3.10+
- At least one LLM API key (OpenAI, Anthropic, or Gemini)

## License

MIT License - see [LICENSE](LICENSE) for details.

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## Research

RNSR is inspired by:
- [Hybrid Document Retrieval System Design](Research/Hybrid%20Document%20Retrieval%20System%20Design.pdf) - Core architecture and design principles
- [PageIndex (VectifyAI)](https://github.com/VectifyAI/PageIndex) - Vectorless reasoning-based tree search
- [Recursive Language Models](https://arxiv.org/html/2512.24601v1) - REPL environment with recursive sub-LLM calls
- Tree of Thoughts - LLM-based decision making with probabilities
- Self-Refine / Reflexion - Iterative self-correction patterns
