Metadata-Version: 2.4
Name: research-hub-pipeline
Version: 0.45.0
Summary: CLI + MCP server for Zotero + Obsidian + NotebookLM research pipelines. Run `research-hub init` after install.
Project-URL: Homepage, https://github.com/WenyuChiou/research-hub
Project-URL: Repository, https://github.com/WenyuChiou/research-hub
Project-URL: Issues, https://github.com/WenyuChiou/research-hub/issues
License: MIT
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.10
Requires-Dist: networkx>=3.0
Requires-Dist: platformdirs>=4.0
Requires-Dist: pyyaml>=6
Requires-Dist: pyzotero>=1.5.18
Requires-Dist: rapidfuzz>=3.0
Requires-Dist: requests>=2.28
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.10; extra == 'dev'
Requires-Dist: pytest>=7; extra == 'dev'
Requires-Dist: responses>=0.23; extra == 'dev'
Provides-Extra: import
Requires-Dist: pdfplumber>=0.11; extra == 'import'
Requires-Dist: python-docx>=1.1; extra == 'import'
Requires-Dist: readability-lxml>=0.8; extra == 'import'
Requires-Dist: requests>=2.28; extra == 'import'
Provides-Extra: mcp
Requires-Dist: fastmcp>=2.0; extra == 'mcp'
Provides-Extra: playwright
Requires-Dist: patchright>=1.55; extra == 'playwright'
Provides-Extra: secrets
Requires-Dist: cryptography>=42; extra == 'secrets'
Description-Content-Type: text/markdown

# research-hub

> **Build your research cluster once. Ask AI about it thousands of times.**
> Zotero + Obsidian + NotebookLM, wired together for AI agents.

[![PyPI](https://img.shields.io/pypi/v/research-hub-pipeline.svg)](https://pypi.org/project/research-hub-pipeline/)
[![Tests](https://img.shields.io/badge/tests-1423%20passing-brightgreen.svg)](docs/audit_v0.41.md)
[![Python](https://img.shields.io/badge/python-3.10%2B-blue.svg)](pyproject.toml)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![CI: Linux · macOS · Windows](https://img.shields.io/badge/CI-Linux%20%C2%B7%20macOS%20%C2%B7%20Windows-blue)](.github/workflows/ci.yml)

Traditional Chinese README → [README.zh-TW.md](README.zh-TW.md)

![Dashboard Overview](docs/images/dashboard-overview-researcher.png)

---

## The 10-minute story

Say you want to get up to speed on **"harness engineering for LLM agents"**, a subfield that barely existed 6 months ago. The traditional workflow is to search arXiv, skim abstracts, manually note key claims, fight Obsidian, and occasionally wish you had a RAG pipeline. That takes about 2 hours.

research-hub workflow:

```bash
research-hub clusters new --query "LLM evaluation harness" --slug llm-evaluation-harness
research-hub search "language model evaluation harness" --to-papers-input \
    --cluster llm-evaluation-harness > papers.json
research-hub ingest --cluster llm-evaluation-harness --no-verify
```

3 minutes later your vault has 6 key papers with structured notes. Now push them to NotebookLM and pull back an auto-generated briefing:

```bash
research-hub notebooklm bundle   --cluster llm-evaluation-harness
research-hub notebooklm upload   --cluster llm-evaluation-harness   # Chrome CDP-attach, no API key
research-hub notebooklm generate --cluster llm-evaluation-harness --type brief
research-hub notebooklm download --cluster llm-evaluation-harness --type brief
```

Or click the same actions in the dashboard: run `research-hub serve --dashboard` and open Manage on that cluster's card. In live mode the buttons run the identical CLI flow, so there is no extra terminal step.

Your vault now has `.research_hub/artifacts/llm-evaluation-harness/brief-*.txt`, a ~300-character synthesis covering all 6 papers that NotebookLM generated from the uploaded sources. No prompt engineering or headless-browser hacks are needed, because the upload attaches to your existing Chrome session.

2 more minutes: generate the AI summary layer (**crystals**) and the structured entity/claim registry (**memory**):

```bash
research-hub crystal emit --cluster llm-evaluation-harness > prompt.md
# (paste prompt to Claude/GPT, save response as crystals.json)
research-hub crystal apply --cluster llm-evaluation-harness --scored crystals.json

research-hub memory emit  --cluster llm-evaluation-harness > mem-prompt.md
# (paste prompt to Claude/GPT, save response as memory.json)
research-hub memory apply --cluster llm-evaluation-harness --scored memory.json
```

Now open Claude Desktop and ask:

> **You:** "What's the current SOTA in LLM evaluation harnesses?"
> **Claude (via MCP):** calls `read_crystal("llm-evaluation-harness", "sota-and-open-problems")` → gets a pre-written 180-word answer with paper citations. **~1 KB read, 0 abstracts fetched at query time.**

That pre-written answer is a **crystal**. You paid the reasoning cost once; every subsequent question is ~1 KB of cached analysis. See [`hub/llm-evaluation-harness/crystals/`](hub/llm-evaluation-harness/crystals/) in your vault for the 10 canonical Q&As generated above.

---

## What makes it different

### 1. Crystals — pre-computed answers, not lazy retrieval (v0.28)

Every RAG system, including Karpathy's "LLM wiki", still assembles context at query time. research-hub's answer: **store the AI's reasoning, not the inputs**.

For each cluster you generate ~10 canonical Q→A crystals once, using any LLM you like. When an AI agent later asks "what's the SOTA in X?", it reads a pre-written paragraph instead of 20 paper abstracts. **Context cost per query: ~1 KB (crystal read) vs ~30 KB (cluster digest), roughly a 30× reduction.**

Because the answer was computed ahead of time, its quality does not degrade at query time. See the [harness-engineering example crystals](hub/llm-evaluation-harness/crystals/): one folder, 10 Q&As answering "what is this field?", "what are the main threads?", "where do experts disagree?", "what's SOTA?", etc.

[→ Why this is not RAG](docs/anti-rag.md)
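
The store-once, read-many idea behind crystals can be sketched in a few lines of Python. This is a minimal illustration, not research-hub's implementation: the one-JSON-file-per-slug layout and the helper names `save_crystal`, `load_crystal`, and `size_ratio` are assumptions made for the example, and the answer text is a placeholder.

```python
import json
import tempfile
from pathlib import Path


def save_crystal(root: Path, cluster: str, slug: str, answer: str, papers: list[str]) -> Path:
    """Write a pre-computed answer once, at generation time (hypothetical layout)."""
    folder = root / cluster / "crystals"
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{slug}.json"
    payload = {"slug": slug, "answer": answer, "papers": papers}
    path.write_text(json.dumps(payload), encoding="utf-8")
    return path


def load_crystal(root: Path, cluster: str, slug: str) -> dict:
    """Serve a query by reading the stored answer: no retrieval, no LLM call."""
    path = root / cluster / "crystals" / f"{slug}.json"
    return json.loads(path.read_text(encoding="utf-8"))


def size_ratio(digest_bytes: int, crystal_bytes: int) -> float:
    """How many times smaller the crystal read is than the full cluster digest."""
    return digest_bytes / crystal_bytes


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    save_crystal(root, "llm-evaluation-harness", "sota-and-open-problems",
                 "Placeholder answer text.", ["paper-a", "paper-b"])
    crystal = load_crystal(root, "llm-evaluation-harness", "sota-and-open-problems")

print(crystal["slug"], crystal["papers"])
print(size_ratio(30 * 1024, 1 * 1024))  # 30.0
```

The last line reproduces the arithmetic behind the 30× figure: a ~30 KB digest divided by a ~1 KB crystal.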

### 2. Structured memory layer — entities, claims, methods (v0.36)

Crystals store prose. **Memory** stores the underlying structure: named entities (benchmarks, models, concepts), typed claims with confidence + supporting papers, and method taxonomies. For the harness cluster:

```
hub/llm-evaluation-harness/memory.json
├── 14 entities  (vla-eval, SafeHarness, M*, LIBERO, SEC-bench, ...)
├── 12 claims    ("Harness is locus of progress", "Specialized beats generic +22%", ...)
└── 7 methods    (reflective code evolution, lifecycle-integrated defense, ...)
```

AI agents query entities via `list_entities`, claims via `list_claims(min_confidence="high")`, methods via `list_methods`. No RAG over prose — structured lookup over structured data.
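
To illustrate what "structured lookup over structured data" means, here is a minimal Python sketch of a confidence-threshold filter over claims. The `Claim` fields, the three confidence levels, and the function `filter_claims` are assumptions made for this example; they are not the actual `memory.json` schema or the `list_claims` implementation. The claim texts come from the tree above, while the confidence values and paper ids are placeholders.

```python
from dataclasses import dataclass

# Weakest to strongest. The level names are an assumption for this sketch.
CONFIDENCE_ORDER = ["low", "medium", "high"]


@dataclass(frozen=True)
class Claim:
    text: str
    confidence: str
    papers: tuple[str, ...]


def filter_claims(claims: list[Claim], min_confidence: str = "low") -> list[Claim]:
    """Keep claims whose confidence is at or above the requested level."""
    threshold = CONFIDENCE_ORDER.index(min_confidence)
    return [c for c in claims if CONFIDENCE_ORDER.index(c.confidence) >= threshold]


# Placeholder confidence values and paper ids, for illustration only.
claims = [
    Claim("Harness is locus of progress", "high", ("paper-a",)),
    Claim("Specialized beats generic +22%", "medium", ("paper-b", "paper-c")),
]

for claim in filter_claims(claims, min_confidence="high"):
    print(claim.text, claim.papers)
```

Because the filter is a plain comparison on a typed field, the result is deterministic and does not depend on embedding similarity or prompt wording.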

### 3. 4 personas, 1 codebase, dashboard adapts (v0.38)

Same vault, 4 rendered dashboards:

| Persona | Install | Dashboard vocabulary | Hidden tabs |
|---|---|---|---|
| **Researcher** (PhD STEM, Zotero) | `pip install "research-hub-pipeline[playwright,secrets]"` | Cluster / Crystal / Paper / Citation graph | (none) |
| **Humanities** (Zotero, quote-heavy) | `pip install "research-hub-pipeline[playwright,secrets]"` | Theme / Synthesis / Source | (none) |
| **Analyst** (industry, no Zotero) | `pip install "research-hub-pipeline[import,secrets]"` | Topic / AI Brief / Document | Diagnostics, Bind-Zotero |
| **Internal KM** (lab / company) | `pip install "research-hub-pipeline[import,secrets]"` | Project area / AI Brief / Document | Diagnostics, Bind-Zotero |

Side-by-side screenshots: [`docs/personas.md`](docs/personas.md). [Your first 10 minutes guide →](docs/first-10-minutes.md)

### 4. Live dashboard with direct execution (v0.27, expanded v0.42/v0.43/v0.44)

```bash
research-hub serve --dashboard
```

Localhost HTTP dashboard at `http://127.0.0.1:8765/`. Every Manage-tab button **directly executes** the CLI — no copy-paste.

**5 tabs**:
- **Overview** — treemap + storage map + recent additions
- **Library** — cluster cards with paper rows
- **Briefings** — NotebookLM brief preview + artifact links
- **Diagnostics** — health badges + drift alerts
- **Manage** — per-cluster actions (rename / merge / split / NLM upload / NLM ask / polish-markdown / bases emit)

[→ Full dashboard walkthrough](docs/dashboard-walkthrough.md)

### 5. Cluster integrity + 100% orphan coverage (v0.37 + v0.39)

Papers drift, and rebind v2 catches it. On the maintainer's vault of 1063 orphans, coverage went from 33% to **100%** via an 8-heuristic chain plus auto-create-from-folder proposals.

```bash
research-hub doctor                       # catches 12+ classes of drift
research-hub clusters rebind --emit       # proposes 80%+ assignments
research-hub clusters rebind --apply report.md --auto-create-new
```
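
As a rough illustration of how a heuristic chain can raise orphan coverage, the Python sketch below tries each heuristic in order and keeps the first match. The three heuristics shown (`by_frontmatter`, `by_folder`, `by_tag`), the note fields, and the cluster set are hypothetical; the 8 heuristics that rebind actually uses are not reproduced here.

```python
from typing import Callable, Optional

# Hypothetical set of existing cluster slugs.
KNOWN_CLUSTERS = {"llm-evaluation-harness", "llm-se"}

Heuristic = Callable[[dict], Optional[str]]


def by_frontmatter(note: dict) -> Optional[str]:
    """Trust an explicit cluster field if it names a known cluster."""
    slug = note.get("cluster")
    return slug if slug in KNOWN_CLUSTERS else None


def by_folder(note: dict) -> Optional[str]:
    """Fall back to the containing folder name."""
    folder = note.get("folder")
    return folder if folder in KNOWN_CLUSTERS else None


def by_tag(note: dict) -> Optional[str]:
    """Last resort: the first tag that matches a known cluster."""
    for tag in note.get("tags", []):
        if tag in KNOWN_CLUSTERS:
            return tag
    return None


CHAIN: list[Heuristic] = [by_frontmatter, by_folder, by_tag]


def propose(note: dict, chain: list[Heuristic] = CHAIN) -> Optional[tuple[str, str]]:
    """Return (cluster, heuristic name) from the first heuristic that matches."""
    for heuristic in chain:
        slug = heuristic(note)
        if slug is not None:
            return slug, heuristic.__name__
    return None


def coverage(notes: list[dict], chain: list[Heuristic] = CHAIN) -> float:
    """Fraction of orphan notes that receive a proposal."""
    if not notes:
        return 0.0
    return sum(propose(n, chain) is not None for n in notes) / len(notes)


orphans = [
    {"title": "Paper A", "cluster": "llm-evaluation-harness"},
    {"title": "Paper B", "folder": "llm-se"},
    {"title": "Paper C", "tags": ["misc"]},
]
print(propose(orphans[1]))
print(coverage(orphans))
```

Ordering the chain from most to least reliable signal means a weak heuristic only decides when every stronger one has nothing to say, and recording the heuristic name keeps each proposal auditable.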

[→ 6 failure modes × 4 personas mitigation matrix](docs/cluster-integrity.md)

---

## Install

```bash
# Researcher / Humanities (use Zotero + NotebookLM)
# Quote the extras so shells such as zsh do not treat [...] as a glob pattern
pip install "research-hub-pipeline[playwright,secrets]"

# Analyst / Internal KM (no Zotero, import local files)
pip install "research-hub-pipeline[import,secrets]"

research-hub init              # 4-option interactive persona prompt
research-hub serve --dashboard # opens browser
```

Python 3.10+. **No OpenAI/Anthropic API key required**: research-hub is provider-agnostic. All AI generation uses an emit/apply pattern, so you feed the emitted prompts to your own AI and apply its response.
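
The emit/apply split can be pictured as two pure functions with a human (or any LLM client) in between. The following Python sketch is an assumption-laden illustration, not the tool's code: the prompt wording, the `slug`/`answer` response schema, and the names `emit_prompt` and `apply_response` are all hypothetical.

```python
import json

# Hypothetical response schema for this sketch.
REQUIRED_KEYS = ("slug", "answer")


def emit_prompt(cluster: str, questions: dict[str, str]) -> str:
    """Build a provider-neutral prompt; no network call is made here."""
    lines = [
        f"You are summarizing the research cluster '{cluster}'.",
        "Answer each question below and reply with a JSON list of",
        'objects shaped like {"slug": "...", "answer": "..."}.',
        "",
    ]
    lines += [f"- {slug}: {question}" for slug, question in questions.items()]
    return "\n".join(lines)


def apply_response(raw: str, questions: dict[str, str]) -> dict[str, str]:
    """Validate the pasted-back JSON before anything is stored."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list")
    accepted: dict[str, str] = {}
    for item in items:
        if not isinstance(item, dict):
            raise ValueError("each entry must be a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in item]
        if missing:
            raise ValueError(f"entry is missing keys: {missing}")
        if item["slug"] not in questions:
            raise ValueError(f"unknown slug: {item['slug']}")
        accepted[item["slug"]] = item["answer"]
    return accepted


questions = {"what-is-this-field": "What is this field?"}
prompt = emit_prompt("llm-evaluation-harness", questions)
# In practice this string would be whatever your own AI returned.
response = '[{"slug": "what-is-this-field", "answer": "Placeholder answer."}]'
answers = apply_response(response, questions)
print(answers)
```

Keeping the model call outside both functions is what makes the pattern provider-agnostic: the only contract is the text that goes out and the JSON that comes back.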

## For Claude Code / Claude Desktop users

Add to `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "research-hub": {
      "command": "research-hub",
      "args": ["serve"]
    }
  }
}
```
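
If `claude_desktop_config.json` already lists other MCP servers, the entry above needs to be merged in rather than pasted over the file. A minimal Python sketch of such a merge follows; the helper `add_mcp_server` is hypothetical, the demo runs against a temporary file, and the real location of the config file depends on your OS and Claude Desktop install.

```python
import json
import tempfile
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, command: str, args: list[str]) -> dict:
    """Add or update one entry under "mcpServers", keeping all existing entries."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    config.setdefault("mcpServers", {})[name] = {"command": command, "args": list(args)}
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the real config file, pre-populated with another server.
    path = Path(tmp) / "claude_desktop_config.json"
    existing = {"mcpServers": {"other-tool": {"command": "other", "args": []}}}
    path.write_text(json.dumps(existing), encoding="utf-8")

    merged = add_mcp_server(path, "research-hub", "research-hub", ["serve"])
    on_disk = json.loads(path.read_text(encoding="utf-8"))

print(sorted(merged["mcpServers"]))
```

Back up the real file before running anything like this against it, and restart Claude Desktop afterwards so it re-reads the configuration.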

60 MCP tools cover: paper ingest, cluster CRUD, labels, quotes, draft composition, citation graph, NotebookLM, crystal generation, fit-check, autofill, cluster memory, cluster rebind workflows.

Then talk to Claude:

> "Claude, what's in my llm-evaluation-harness cluster?" → `read_crystal("llm-evaluation-harness", "what-is-this-field")` → 180-word answer
>
> "Claude, which claims have high confidence?" → `list_claims(cluster="llm-evaluation-harness", min_confidence="high")` → 10 structured claims with paper refs
>
> "Claude, add arxiv 2310.06770 to the LLM-SE cluster" → `add_paper(...)` → Zotero + Obsidian + NotebookLM entries

---

## Status

- **Latest**: v0.41.0 (2026-04-19)
- **Tests**: 1423 passing, 15 skipped, 3 xfailed (CI: Linux + Windows + macOS × Python 3.10/3.11/3.12)
- **Platforms**: Windows, macOS, Linux
- **Python**: 3.10+
- **Dependencies**: `pyzotero`, `pyyaml`, `requests`, `rapidfuzz`, `networkx`, `platformdirs` (all pure-Python)
- **Optional extras**: `playwright` (NotebookLM browser automation), `import` (local file ingest), `mcp`, `secrets`, `dev`

## Architecture docs

- [Your first 10 minutes](docs/first-10-minutes.md) — guided tour for each of the 4 personas
- [User personas](docs/personas.md) — 4 persona profiles with per-persona feature matrix
- [Cluster integrity](docs/cluster-integrity.md) — 6 failure modes + mitigation matrix across all 4 personas
- [MCP tools reference](docs/mcp-tools.md) — all 60 tools categorized + signatures
- [Example Claude Desktop flow](docs/example-claude-mcp-flow.md) — worked example: ingest → crystallize → query
- [Import folder](docs/import-folder.md) — local file ingest for analyst persona (PDF/DOCX/MD/TXT/URL)
- [Anti-RAG crystals](docs/anti-rag.md) — why pre-computed Q→A beats retrieval
- [Upgrade guide](UPGRADE.md) — migrating from older versions
- [Task-level workflows](docs/task-workflows.md) — v0.33+ 5 MCP wrappers (ask/brief/sync/compose/collect)
- [Screenshot workflow](docs/screenshot-workflow.md) — re-render any dashboard tab
- [Audit reports](docs/) — `audit_v0.26.md` … `audit_v0.41.md`
- [NotebookLM setup](docs/notebooklm.md) — CDP attach flow + troubleshooting
- [Papers input schema](docs/papers_input_schema.md) — ingestion pipeline reference

## Workflow reference

| Stage | Command | What it does |
|---|---|---|
| **Init** | `init` / `doctor` | First-time config + health check (doctor has 12+ checks, `--autofix` for mechanical backfills) |
| **Find** | `search` / `verify` / `discover new` | Multi-backend paper search + DOI resolution + AI-scored discovery |
| **Ingest** | `add` / `ingest` / `import-folder` | One-shot or bulk paper ingest into Zotero + Obsidian |
| **Organize** | `clusters new/list/show/bind/merge/split/rename/delete/rebind/scaffold-missing` | Cluster CRUD + 8-heuristic rebind + hub scaffolding |
| **Topic** | `topic scaffold/propose/assign/build` | Sub-topic notes from `subtopics:` frontmatter |
| **Label** | `label` / `find --label` / `paper prune` / `paper lookup-doi` | Canonical label vocabulary + Crossref DOI backfill |
| **Crystal** | `crystal emit/apply/list/read/check` | Pre-computed canonical Q→A answers |
| **Memory** | `memory emit/apply/list/read` | Structured entities/claims/methods registry |
| **Analyze** | `clusters analyze --split-suggestion` | Citation-graph community detection for big clusters |
| **Sync** | `sync status` / `pipeline repair` | Detect + repair Zotero ↔ Obsidian drift |
| **Dashboard** | `dashboard` / `serve --dashboard` / `vault graph-colors` | Static HTML or live HTTP server + auto-refresh Obsidian graph |
| **NotebookLM** | `notebooklm bundle/upload/generate/download` | Browser-automated NLM flows (CDP attach) |
| **Write** | `quote` / `compose-draft` / `cite` | Quote capture, markdown draft assembly, BibTeX export |

## For developers

```bash
git clone https://github.com/WenyuChiou/research-hub.git
cd research-hub
pip install -e '.[dev,playwright]'
python -m pytest -q  # 1423 passing
```

Contributing: see [CONTRIBUTING.md](CONTRIBUTING.md). Reporting security issues: see [SECURITY.md](.github/SECURITY.md).

- Package name on PyPI: **research-hub-pipeline**
- CLI entry point: **research-hub**

## License

MIT. See [LICENSE](LICENSE).
