Metadata-Version: 2.4
Name: paper-distiller
Version: 0.5.1
Summary: Distill arxiv research papers into an Obsidian-compatible markdown wiki.
Project-URL: Homepage, https://github.com/jesson-hh/paper-distiller
Project-URL: Issues, https://github.com/jesson-hh/paper-distiller/issues
Author-email: jesson-hh <xuyi1030@proton.me>
License: MIT
License-File: LICENSE
Keywords: arxiv,distillation,knowledge-base,llm,obsidian,research
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Requires-Dist: arxiv>=2.1
Requires-Dist: httpx>=0.27
Requires-Dist: pymupdf>=1.24
Requires-Dist: python-dotenv>=1.0
Requires-Dist: tomli>=2.0; python_version < '3.11'
Provides-Extra: dev
Requires-Dist: pytest-mock>=3.12; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Description-Content-Type: text/markdown

# paper-distiller

> Distill arxiv research papers into an Obsidian-compatible markdown wiki.

[![CI](https://github.com/jesson-hh/paper-distiller/actions/workflows/ci.yml/badge.svg)](https://github.com/jesson-hh/paper-distiller/actions/workflows/ci.yml)
[![PyPI version](https://img.shields.io/pypi/v/paper-distiller.svg)](https://pypi.org/project/paper-distiller/)
[![Python versions](https://img.shields.io/pypi/pyversions/paper-distiller.svg)](https://pypi.org/project/paper-distiller/)
![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)

## What it does

paper-distiller has two modes, both writing into the same Obsidian-compatible vault:

**`paper-distiller`** — single-pass mode. Give it a topic or author and it will:
1. Search arxiv + Semantic Scholar for relevant papers
2. Use an LLM to rank the top N most relevant
3. Download each PDF and extract the text
4. Distill each paper into a structured Chinese markdown note (一句话 / 问题动因 / 方法 / 关键结果 / 我的 take)
5. Cross-link each note to existing entries in your vault via `[[wikilinks]]`
6. Compose a session survey tying the new notes together

**`paper-distiller-qa`** *(new in v0.5)* — question-driven multi-round research loop. Give it a *research question* and it will:
1. Have the LLM judge what's known vs. missing each round, and propose the next search query
2. Run the single-pass distillation pipeline on the top N results
3. Repeat until the LLM is confident, OR a budget cap fires (rounds / articles / cost / `Ctrl+C`)
4. Synthesize a final cited answer document with an audit trail of all rounds

The output drops directly into a vault directory that opens in Obsidian — graph view, Dataview, tag pane, full-text search all work out of the box.

## Install

**From PyPI**:

    pip install paper-distiller

**From source** (for development or the latest unreleased changes):

    git clone https://github.com/jesson-hh/paper-distiller
    cd paper-distiller
    python -m venv .venv
    .venv\Scripts\activate          # Windows
    # source .venv/bin/activate     # Linux/macOS
    pip install -e .

## Quick start

1. Copy the config template and fill in your LLM API key:

       cp examples/example.env .env
       # then edit .env, replace PD_API_KEY=sk-your-key-here with your real key

2. Point it at your Obsidian vault. Pick a mode:

   **Single-pass** — distill N papers on a topic:

       paper-distiller --vault /path/to/your/vault --topic "diffusion models for finance" --n 5

   **Question-driven** — let the agent plan multiple rounds and answer a question:

       paper-distiller-qa --vault /path/to/your/vault \
                          --question "What are the latest advances in diffusion models for long-horizon time-series forecasting?" \
                          --max-rounds 3 --per-round 2 --max-cost-cny 5

3. Open your vault in Obsidian. New articles appear under `articles/`, a session survey under `surveys/`. QA sessions also write `<vault>/.paper_distiller/qa-sessions/<sid>/state.json` for `--resume` after a pause or crash.

## How it works

**Single-pass** (`paper-distiller`):

```
search arxiv + SS  →  LLM filter  →  fetch PDFs       →  distill each  →  save articles  →  compose survey
   (~30 hits)         (→ top N)      (with fallback)     (LLM call)       (md+frontmatter)   (LLM call)
```

**Question-driven** (`paper-distiller-qa`) wraps the above in a state-machine loop:

```
                ┌──────────────────────────────────────────────────────────┐
                │   LLM reflect  →  search  →  rank  →  distill (N papers) │  ← one round
                │       ↑                                          │       │
                │       └──────────────────────────────────────────┘       │
                │                                                          │
                │   Stops when: LLM done / budget hit / no new candidates  │
                └──────────────────────────────────────────────────────────┘
                                       ↓
                          synthesize cited answer  →  surveys/qa-….md
```

Per-paper cost on `qwen-plus` / `qwen3.5-plus` (Aliyun Bailian): roughly $0.02 per paper. A 5-paper single-pass run is around $0.10 USD (~¥0.70). A typical 3-round QA session with 2 papers/round costs ~¥1.5-3.

For module structure, data flow internals, the 7 stop reasons, and how state persistence works, see **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)**.

## Vault layout

paper-distiller writes into a vault with these category subdirectories (created on first run):

| Category | What goes there |
|---|---|
| `articles/` | Paper notes — one entry per paper |
| `surveys/` | Cluster mini-surveys composed by paper-distiller, linking multiple articles |
| `techniques/`, `directions/`, `open-problems/`, `authors/` | Reserved for human-curated content. paper-distiller doesn't write here in v0.1. |

Frontmatter and `[[wikilinks]]` follow Obsidian conventions — no custom format.

## Configuration

| Env var | Purpose | Default |
|---|---|---|
| `PD_API_KEY` | LLM API key (Aliyun Bailian, DeepSeek, OpenRouter — any OpenAI-compatible) | required |
| `PD_BASE_URL` | Endpoint base URL | required |
| `PD_MODEL` | Model identifier | required |
| `PD_PROVIDER_NAME` | Logging tag only | `unspecified` |
| `PD_PDF_TIMEOUT` | PDF download timeout (seconds) | `60` |
| `PD_MIN_SURVEY` | Min articles before composing a survey | `2` |

CLI flags override env vars where applicable (`--model`, `--provider`).

## CLI reference

    paper-distiller --vault <path> {--topic <str> | --author <str>}
                    [--n 5] [--pool 30] [--source {arxiv,ss,both}]
                    [--force] [--dry-run] [--verbose] [--model <name>] [--provider <name>]

    paper-distiller-qa --vault <path> --question <str>
                       [--max-rounds 5] [--max-articles 15] [--max-cost-cny 20.0]
                       [--confidence-threshold 8] [--per-round 2]
                       [--source {arxiv,ss,both}] [--interactive] [--resume <session-id>]
                       [--dry-run] [--verbose] [--model <name>] [--provider <name>]

`--dry-run` skips all LLM calls and vault writes — useful for verifying config before spending API budget.

`paper-distiller-qa` flags worth knowing:

| Flag | What it does |
|---|---|
| `--max-rounds N` | Hard upper bound on loop iterations (default 5). The loop also exits early on `llm_done`, `llm_brake`, `no_candidates`, or budget caps. |
| `--max-articles N` | Stop after distilling N total articles across rounds (default 15) |
| `--max-cost-cny F` | Cost circuit breaker, CNY (default 20.0). Uses qwen-plus pricing. |
| `--confidence-threshold N` | LLM `is_done` confidence required to stop early (0-10, default 8) |
| `--interactive` | Pause after each round and prompt Y/n/q |
| `--resume <sid>` | Pick up a paused or errored session from its state.json |

## Customizing prompts

All 5 LLM prompts live as plain markdown — edit them directly to change tone, structure, or output language. No Python changes needed.

- `src/paper_distiller/prompts/{filter,article,survey}.md` — single-pass mode
- `src/paper_distiller/qa/prompts/{reflect,answer}.md` — question-driven mode

## Optional companion: semantic search via vault-mcp

paper-distiller does NOT ship its own semantic-search engine for your vault. To search the vault by meaning (not keywords) from Claude Code, pair it with [**vault-mcp**](https://github.com/robbiemu/vault-mcp) — a standalone MCP server purpose-built for markdown vaults, with live sync and multi-provider embedding support.

See [docs/vault-mcp-recommendation.md](docs/vault-mcp-recommendation.md) for setup and rationale.

## LLM provider examples

| Provider | `PD_BASE_URL` | `PD_MODEL` |
|---|---|---|
| **Aliyun Bailian (recommended, cheapest)** | `https://dashscope.aliyuncs.com/compatible-mode/v1` | `qwen-plus` |
| Aliyun Bailian (coding plan) | `https://coding.dashscope.aliyuncs.com/v1` | `qwen3.5-plus` |
| DeepSeek | `https://api.deepseek.com/v1` | `deepseek-chat` |
| OpenRouter | `https://openrouter.ai/api/v1` | `qwen/qwen3.5-plus` |
| Local Ollama | `http://localhost:11434/v1` | `qwen2.5` |

## Why "math research" specifically?

The default category schema (articles / techniques / directions / open-problems / authors / surveys) was designed for mathematical/scientific paper research. The tool works fine for other domains today; configurable schema is on the v0.2 roadmap.

## Status

**v0.5.0 — alpha.** Single-pass (`paper-distiller`) and question-driven multi-round (`paper-distiller-qa`) modes both work end-to-end. 78 tests passing on Python 3.10/3.11/3.12.

### Shipped

- **v0.1.0** — L2 single-pass search-and-distill against arxiv; LLM filter + ranker; PyMuPDF-based extraction; Obsidian-compatible markdown output.
- **v0.2.0** — arxiv-id-based dedup (prevents sibling articles for the same paper under different slugs); restored 500-char full-pdf threshold.
- **v0.3.0** — Semantic Scholar as second paper source (`--source {arxiv,ss,both}`, default both); PDF fallback chain (when arxiv's PDF download fails, try SS's `openAccessPdf`); DOI-based dedup.
- **v0.5.0** — `paper-distiller-qa` question-driven multi-round research loop. State-machine with 7 stop reasons, `--interactive` and `--resume` modes, audit-trail-equipped final answer survey. (See [docs/ARCHITECTURE.md](docs/ARCHITECTURE.md) for internals.)

### Future roadmap

- **v0.4** — *deferred*. We explored shipping our own semantic-search MCP server; concluded the right answer is to recommend [vault-mcp](https://github.com/robbiemu/vault-mcp) instead (see [docs/vault-mcp-recommendation.md](docs/vault-mcp-recommendation.md)). No v0.4 tag exists.
- **v0.6** — citation-graph traversal: given a seed article, follow its references / cited-by edges and rank them for inclusion.
- **v0.7** — broaden sources beyond arxiv + Semantic Scholar. Likely candidate: integrate [OpenCLI](https://github.com/jackwener/OpenCLI) to pull from logged-in browser sessions (ACM Digital Library, IEEE Xplore, lab homepages, Chinese platforms like 知乎/B站). Useful for venue-only papers and discussion context around papers.
- **Later / on-demand** — per-vault `paper-distiller.toml` for custom category schemas; LEANN-backed in-pipeline crosslink retrieval (useful only when vault grows past ~500 entries).

### Known limitations

- arxiv.org occasionally returns 503 / 429; paper-distiller retries 3× then exits with a friendly error (use `--verbose` for the traceback).
- The "full-pdf vs abstract-only" threshold (500 chars) is conservative; PyMuPDF rarely returns less, but scanned-only PDFs do correctly fall back to abstract-only mode.

## Contributing

Issues and PRs welcome. Run tests before submitting:

    pip install -e ".[dev]"
    pytest -v

## License

MIT
