Metadata-Version: 2.4
Name: doclens
Version: 1.1.3
Summary: Structure-aware document retrieval. FTS5/BM25 keyword matching over document trees.
Author-email: Lianghao <zhlhao@163.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/zhlhaohao/doclens
Project-URL: Repository, https://github.com/zhlhaohao/doclens
Project-URL: Issues, https://github.com/zhlhaohao/doclens/issues
Keywords: doclens,rag,retrieval,tree-search,bm25,fts5,document-indexing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: anthropic>=0.25.0
Requires-Dist: fastapi>=0.110.0
Requires-Dist: httpx>=0.27.1
Requires-Dist: jieba>=0.42
Requires-Dist: pydantic<3.0.0,>=2.11.0
Requires-Dist: pydantic-settings>=2.5.2
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: python-multipart>=0.0.9
Requires-Dist: sse-starlette>=2.0.0
Requires-Dist: textual>=0.47.0
Requires-Dist: tqdm>=4.42
Requires-Dist: ulid-py>=1.1
Requires-Dist: uvicorn[standard]>=0.27.0
Requires-Dist: ripgrep-bin>=15.1.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: markitdown>=0.1.0
Requires-Dist: openpyxl>=3.1.0
Requires-Dist: pathspec>=0.11
Requires-Dist: pdfplumber>=0.10.0
Requires-Dist: python-docx>=1.0.0
Requires-Dist: python-pptx>=1.0.0
Requires-Dist: tree-sitter-languages>=1.10
Requires-Dist: watchdog>=3.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=24.1.1; extra == "dev"
Requires-Dist: flake8>=7.0.0; extra == "dev"
Requires-Dist: mypy>=1.8.0; extra == "dev"
Requires-Dist: playwright>=1.40.0; extra == "dev"
Dynamic: license-file

# doclens

> Structure-aware document retrieval — FTS5/BM25 keyword search over document trees, with an interactive TUI and a PWA Web UI.

[![PyPI version](https://badge.fury.io/py/doclens.svg)](https://badge.fury.io/py/doclens)
[![Python](https://img.shields.io/badge/Python-3.10%2B-blue.svg)](pyproject.toml)
[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](LICENSE)

**doclens** parses documents into tree structures (headings, classes, functions…) and searches them with FTS5/BM25 keyword matching — no embeddings, no chunking, no vector DB required. Works entirely offline.
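
To make the idea concrete, here is a minimal sketch of heading-anchored FTS5/BM25 search using only Python's standard `sqlite3` module. It is not doclens's implementation: the `split_by_heading` and `search` helpers and the `sections` table are hypothetical names chosen for illustration, and the sketch assumes the local SQLite build has FTS5 enabled.

```python
import re
import sqlite3

# Matches ATX-style Markdown headings such as "## Authentication".
# Simplification: "#" lines inside fenced code blocks are not special-cased.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) pairs, one pair per heading."""
    sections: list[tuple[str, list[str]]] = [("", [])]
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append((match.group(2).strip(), []))
        else:
            sections[-1][1].append(line)
    joined = [(title, "\n".join(body).strip()) for title, body in sections]
    # Drop the leading placeholder when nothing precedes the first heading.
    return [(title, body) for title, body in joined if title or body]


def search(markdown: str, query: str) -> list[str]:
    """Return the headings of matching sections, best BM25 match first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE sections USING fts5(heading, body)")
    conn.executemany("INSERT INTO sections VALUES (?, ?)", split_by_heading(markdown))
    rows = conn.execute(
        # bm25() returns lower values for better matches, so ascending order ranks best first.
        "SELECT heading FROM sections WHERE sections MATCH ? ORDER BY bm25(sections)",
        (query,),
    ).fetchall()
    conn.close()
    return [heading for (heading,) in rows]


DOC = """# Guide
Intro text.
## Authentication
Tokens are issued after login.
## Storage
Data lives in SQLite.
"""

if __name__ == "__main__":
    print(search(DOC, "tokens"))
```

Each hit is reported by the heading that contains it, which is the sense in which results are anchored to document structure rather than to arbitrary chunks.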

---

## Features

| Feature | Description |
|---|---|
| **Structure-aware search** | Returns results anchored to document headings, code classes, or function definitions — not orphaned line fragments |
| **Multi-format** | Markdown, PDF, DOCX, PPTX, Excel, HTML, JSON, CSV, code (Python AST + tree-sitter) |
| **Two UIs** | Textual TUI (terminal) and Lit + Shoelace PWA (browser) |
| **LLM-augmented QA** | Send search results to Anthropic Claude for natural-language answers |
| **Background watching** | Auto-reindexes changed files via `watchdog` |
| **Web search** | Fetch + extract public web pages as markdown before searching |
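
As an illustration of what anchoring results to classes and functions can look like for Python files, the sketch below uses the standard `ast` module to list each definition with its line range. The `outline` function and its tuple layout are assumptions made for this example, not doclens's parser API.

```python
import ast


def outline(source: str) -> list[tuple[str, str, int, int]]:
    """Return (kind, qualified name, first line, last line) for each class and function."""
    entries: list[tuple[str, str, int, int]] = []

    def visit(node: ast.AST, prefix: str) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "class" if isinstance(child, ast.ClassDef) else "function"
                name = f"{prefix}{child.name}"
                entries.append((kind, name, child.lineno, child.end_lineno))
                # Nested definitions get a dotted name, e.g. "Auth.login".
                visit(child, f"{name}.")
            else:
                visit(child, prefix)

    visit(ast.parse(source), "")
    return entries


SAMPLE = '''class Auth:
    def login(self):
        return True

def helper():
    pass
'''

if __name__ == "__main__":
    for entry in outline(SAMPLE):
        print(entry)
```

A search hit on line 3 of `SAMPLE` could then be reported as belonging to `Auth.login` (lines 2 to 3) instead of as a bare line number.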

---

## Installation

```bash
pip install doclens
```

Requires Python ≥ 3.10.

**Quick setup:**

```bash
# Index your documents
doclens index --force

# Search from CLI
doclens search "authentication"

# Or launch the Web UI (opens browser automatically)
doclens gui
```

---

## CLI Reference

```
doclens [--workdir DIR] <command>
```

| Command | Description |
|---------|-------------|
| `doclens search <query…>` | Keyword search across indexed documents |
| `doclens search_v2 '<json>'` | Structured search: AND / OR / NOT / PHRASE operators |
| `doclens ai <message…>` | Send a message to the Claude agent |
| `doclens index [--force]` | Build or update the document index |
| `doclens status` | Show index statistics and system status |
| `doclens gui [--port PORT]` | Launch the Web UI (PWA) |
| `doclens read_document --path <path>` | Read a document with structure info |
| `doclens web <query…>` | Search the live web |
| `doclens webfetch <url>` | Extract a web page as markdown |
| `doclens grep <pattern>` | Ripgrep-style regex search |

---

## Quick Start

### 1. Index your documents

```bash
# Index the current directory
doclens index --force

# Or specify a working directory
doclens --workdir /path/to/project index
```

doclens automatically discovers supported files (`.md`, `.py`, `.pdf`, `.docx`, `.xlsx`, …) and skips common ignore patterns (`.git`, `node_modules`, `__pycache__`, `.venv`).
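
The discovery step described above can be sketched with `os.walk`. This is an illustrative approximation only: the `discover` function is hypothetical, and both sets below list just the extensions and directory names mentioned in this README, not doclens's full configuration.

```python
import os
from pathlib import Path

# Only the entries named in the README; the real lists are longer.
SUPPORTED = {".md", ".py", ".pdf", ".docx", ".xlsx"}
IGNORED_DIRS = {".git", "node_modules", "__pycache__", ".venv"}


def discover(root: str) -> list[Path]:
    """Return supported files under root, skipping ignored directories."""
    found: list[Path] = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into ignored directories.
        dirnames[:] = sorted(d for d in dirnames if d not in IGNORED_DIRS)
        for name in sorted(filenames):
            path = Path(dirpath) / name
            if path.suffix.lower() in SUPPORTED:
                found.append(path)
    return found


if __name__ == "__main__":
    for path in discover("."):
        print(path)
```

Pruning `dirnames` in place matters for speed: large trees such as `node_modules` are never traversed at all.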

### 2. Search

```bash
doclens search "authentication flow"
doclens search "量子 计算"          # Chinese supported via jieba

# Structured query
doclens search_v2 '{"type": "and", "terms": ["auth", "token"]}'
```
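
When calling `search_v2` from a script, hand-quoting JSON for the shell is error-prone. The sketch below builds the same `and` query shown above and quotes it safely. The `search_v2_command` helper is hypothetical, and it covers only the `and` shape documented here; the JSON shapes for OR, NOT, and PHRASE are not shown in this README, so they are not guessed at.

```python
import json
import shlex


def search_v2_command(terms: list[str]) -> str:
    """Build a shell-safe `doclens search_v2` command for an AND query."""
    query = {"type": "and", "terms": terms}
    # ensure_ascii=False keeps non-ASCII terms (e.g. Chinese) readable.
    payload = json.dumps(query, ensure_ascii=False)
    return shlex.join(["doclens", "search_v2", payload])


if __name__ == "__main__":
    print(search_v2_command(["auth", "token"]))
```

To run the query without a shell, the same argument list can be passed directly to `subprocess.run`, which avoids quoting altogether.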

### 3. Interactive TUI

```bash
doclens
```

Running `doclens` with no subcommand opens the full terminal UI, with live preview, command history, and keyboard navigation.

### 4. Web UI

```bash
doclens gui
# INFO: Uvicorn running on http://127.0.0.1:7860
```

The browser opens automatically. If port 7860 is already in use, the port may differ; check the startup log for the actual URL.
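
For readers curious how a port fallback of this kind can work, here is a minimal sketch that probes upward from 7860 until a bind succeeds. It is an assumption about the general technique, not a description of how doclens selects its port; `first_free_port` and its parameters are hypothetical.

```python
import socket


def first_free_port(start: int = 7860, attempts: int = 20, host: str = "127.0.0.1") -> int:
    """Return the first port in [start, start + attempts) that can be bound on host."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # Port is taken; try the next one.
            return port
    raise RuntimeError(f"no free port in {start}..{start + attempts - 1}")


if __name__ == "__main__":
    print(first_free_port())
```

The probe socket is closed before the port is returned, so another process could still claim the port in between. A real server would typically bind once and keep that socket.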

### 5. Ask the AI

```bash
doclens ai "How does the authentication system work?"
```

doclens first retrieves relevant document sections, then sends them to Anthropic Claude as context for a grounded answer.
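
The retrieve-then-ask flow can be sketched as a prompt-assembly step. Everything below is illustrative: the `Section` dataclass, the `build_prompt` function, the character budget, and the prompt wording are assumptions, and the actual call to Claude is omitted.

```python
from dataclasses import dataclass


@dataclass
class Section:
    path: str      # file the section came from
    heading: str   # heading or symbol the hit is anchored to
    text: str      # section body


def build_prompt(question: str, sections: list[Section], max_chars: int = 4000) -> str:
    """Pack retrieved sections, in ranked order, into a grounded-answer prompt."""
    parts: list[str] = []
    used = 0
    for index, section in enumerate(sections, start=1):
        block = f"[{index}] {section.path} > {section.heading}\n{section.text}"
        if used + len(block) > max_chars:
            break  # Stop at the first section that would exceed the budget.
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Answer using only the numbered context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    hits = [Section("docs/auth.md", "Login", "Tokens are issued after login.")]
    print(build_prompt("How does the authentication system work?", hits))
```

Labelling each block with its path and heading lets the model cite where an answer came from, which is what makes the answer checkable against the source documents.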

---

## Configuration

doclens reads a `.env` file in the project root. Copy the example file and customize it:

```bash
cp doclens/.env.example .env
```

Key variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `CORTEX_SEARCH_PATH` | `.` | Root directory to index and search |
| `CORTEX_DB_PATH` | `.cortex/sessions.db` | SQLite database path |
| `ANTHROPIC_API_KEY` | — | Required for `ai` and `web` commands |
| `ANTHROPIC_BASE_URL` | — | Custom API endpoint (optional) |

---

## Architecture

```
┌─────────────────────────────────────────────┐
│                  TUI (Textual)              │
│  ┌───────────────────────────────────────┐  │
│  │  HeaderBar │ ContentArea │ InputBox   │  │
│  └───────────────────────────────────────┘  │
└────────────────────┬────────────────────────┘
                     │
┌────────────────────▼────────────────────────┐
│           Web UI (Lit + Shoelace PWA)       │
│         FastAPI + SSE streaming             │
└────────────────────┬────────────────────────┘
                     │
┌────────────────────▼────────────────────────┐
│         IndexManager + Scoring              │
│    TreeSearch (FTS5 + BM25)                 │
└────────────────────┬────────────────────────┘
                     │
┌────────────────────▼────────────────────────┐
│    treesearch/  —  parsers, indexer, FTS5   │
│    planify/     —  AI agent runner          │
└─────────────────────────────────────────────┘
```

- **treesearch**: Powers the indexing and retrieval engine (FTS5/BM25 over document trees)
- **planify**: Drives the AI agent, session management, and tool execution
- **doclens**: Ties them together — CLI, TUI, Web UI, event bus, and file watcher

---

## License

Apache License 2.0 — see [LICENSE](LICENSE).
