Metadata-Version: 2.4
Name: paperhub-cli
Version: 0.1.0
Summary: Academic paper search CLI with multi-provider discovery, reading, downloading, and optional LLM planning
Keywords: academic,papers,research,cli,arxiv,acl,llm
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: aiohttp>=3.9.0
Requires-Dist: click>=8.1.0
Requires-Dist: diskcache>=5.6.0
Requires-Dist: pypdf>=4.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
Requires-Dist: twine; extra == "dev"

# paperhub-cli

`paperhub-cli` is a Python CLI for searching, reading, and downloading academic papers across multiple providers.

It supports:

- direct multi-provider search with `aiohttp`
- optional LLM-guided query planning and decomposition
- normalized paper records with stable ids like `arxiv:...`, `acl:...`, `doi:...`, and `openalex:W...`
- provider capability metadata for search / download / read

## Features

- Search many providers from one CLI entrypoint.
- Read metadata and, where available, extract PDF text.
- Download PDFs when an open or direct PDF link exists.
- Fan out across providers and merge/dedupe results.
- Keep planner tool hints aligned with the same provider registry used by direct search.
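
The fan-out and merge/dedupe behaviour listed above can be pictured with a minimal sketch. It assumes each provider is an async callable returning plain dict records keyed by a stable `id`; the function names and the "first occurrence wins" rule are illustrative assumptions, not the project's actual orchestrator.

```python
import asyncio

# Hypothetical stand-ins for real provider calls; titles are placeholders.
async def search_arxiv(query: str) -> list:
    return [{"id": "arxiv:2005.11401", "title": "Example paper A", "source": "arxiv"}]

async def search_openalex(query: str) -> list:
    return [
        {"id": "arxiv:2005.11401", "title": "Example paper A", "source": "openalex"},
        {"id": "openalex:W2741809807", "title": "Example paper B", "source": "openalex"},
    ]

async def fan_out(query: str, providers) -> list:
    # Query every provider concurrently; one failing provider must not sink the rest.
    batches = await asyncio.gather(
        *(provider(query) for provider in providers), return_exceptions=True
    )
    merged = {}
    for batch in batches:
        if isinstance(batch, Exception):
            continue  # skip providers that raised
        for record in batch:
            # Dedupe on the stable id; the first provider to return a record wins.
            merged.setdefault(record["id"], record)
    return list(merged.values())

results = asyncio.run(fan_out("rag", [search_arxiv, search_openalex]))
```

Here the duplicate `arxiv:2005.11401` record from the second provider is dropped, so `results` holds two records.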

---

## Supported Providers

Current provider ids include:

- `arxiv`
- `acl`
- `crossref`
- `openalex`
- `dblp`
- `openaire`
- `pubmed`
- `europepmc`
- `pmc`
- `biorxiv`
- `medrxiv`
- `zenodo`
- `hal`
- `semantic_scholar`
- `core`
- `doaj`
- `unpaywall`
- `iacr`
- `citeseerx`
- `base`
- `ssrn`
- `google_scholar`
- `scihub`
- `ieee`
- `acm`

Run this to see the exact capability levels in your install:

```bash
paperhub-cli providers
```

Capability levels are defined in code as values such as `full`, `info_only`, `oa_only`, `best_effort`, `unsupported`, and `skeleton`.


| Platform           | Search           | Download             | Read                 | Notes                                                                                                        |
| ------------------ | ---------------- | -------------------- | -------------------- | ------------------------------------------------------------------------------------------------------------ |
| arXiv              | ✅                | ✅                    | ✅                    | Open API; reliable                                                                                           |
| PubMed             | ✅                | ❌                    | ⚠️ info-only         | Open API; reliable                                                                                           |
| bioRxiv            | ✅                | ✅                    | ✅                    | Open API; reliable                                                                                           |
| medRxiv            | ✅                | ✅                    | ✅                    | Open API; reliable                                                                                           |
| Google Scholar     | ⚠️               | ❌                    | ❌                    | Bot-detection active; optional `PAPERHUB_GOOGLE_SCHOLAR_PROXY_URL`                                           |
| IACR               | ✅                | ✅                    | ✅                    | Open API; reliable                                                                                           |
| Semantic Scholar   | ✅                | ✅ (OA)               | ✅ (OA)               | Works without key (rate-limited); key improves limits; key rejection (403) retried automatically without key |
| Crossref           | ✅                | ❌                    | ⚠️ info-only         | Open API; reliable                                                                                           |
| OpenAlex           | ✅                | ❌                    | ⚠️ info-only         | Open API; reliable                                                                                           |
| PMC                | ✅                | ✅ (OA only)          | ✅ (OA only)          | OA PDFs only; direct download may be blocked by some proxy environments                                      |
| CORE               | ✅                | ✅ (record-dependent) | ✅ (record-dependent) | Free key recommended; connector retries with backoff and falls back to key-less on 401/403                   |
| Europe PMC         | ✅                | ✅ (OA)               | ✅ (OA)               | OA PDFs only; direct download may be blocked by some proxy environments                                      |
| dblp               | ✅                | ❌                    | ⚠️ info-only         | Open API; reliable                                                                                           |
| OpenAIRE           | ✅                | ❌                    | ❌                    | Open API; retries 3× with escalating request profiles on transient 403                                       |
| CiteSeerX          | ⚠️               | ✅ (record-dependent) | ⚠️                   | API endpoint intermittently unavailable / redirects to web archive                                           |
| DOAJ               | ✅                | ⚠️ (URL-dependent)   | ⚠️ (URL-dependent)   | PDF availability varies by article; free key raises rate limits                                              |
| BASE               | ⚠️               | ✅ (record-dependent) | ✅ (record-dependent) | OAI-PMH endpoint requires institutional IP registration; returns empty gracefully otherwise                  |
| Zenodo             | ✅                | ✅ (record-dependent) | ✅ (record-dependent) | Open API; reliable                                                                                           |
| HAL                | ✅                | ✅ (record-dependent) | ✅ (record-dependent) | Open API; reliable                                                                                           |
| SSRN               | ⚠️               | ⚠️ best-effort       | ⚠️ best-effort       | 403 bot-detection active; public PDF only                                                                    |
| Unpaywall          | ✅ (DOI lookup)   | ❌                    | ❌                    | **Requires** `PAPERHUB_UNPAYWALL_EMAIL`                                                                      |
| Sci-Hub (optional) | ⚠️ fallback-only | ✅                    | ❌                    | Optional; unstable mirrors; user responsibility                                                              |
| **IEEE Xplore** 🔑 | 🚧 skeleton      | 🚧 skeleton          | 🚧 skeleton          | Requires `PAPERHUB_IEEE_API_KEY` to activate                                                                 |
| **ACM DL** 🔑      | 🚧 skeleton      | 🚧 skeleton          | 🚧 skeleton          | Requires `PAPERHUB_ACM_API_KEY` to activate                                                                  |


> ✅ = reliable in live tests.  ⚠️ = works but subject to upstream instability or access restrictions.  ❌ = not supported.  🔑 = key required.  🚧 = skeleton only.
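
One way to picture how capability levels could drive behaviour is the sketch below. The level names come from the list above; the `Capability` enum, the `PROVIDER_CAPS` rows, the mapping of ✅ to `full`, and the `supports` helper are assumptions made for illustration and may differ from the real registry.

```python
from enum import Enum

class Capability(str, Enum):
    FULL = "full"
    INFO_ONLY = "info_only"
    OA_ONLY = "oa_only"
    BEST_EFFORT = "best_effort"
    UNSUPPORTED = "unsupported"
    SKELETON = "skeleton"

# Hypothetical rows mirroring three table entries (✅ assumed to mean "full").
PROVIDER_CAPS = {
    "arxiv": {"search": Capability.FULL, "download": Capability.FULL, "read": Capability.FULL},
    "pubmed": {"search": Capability.FULL, "download": Capability.UNSUPPORTED, "read": Capability.INFO_ONLY},
    "ieee": {"search": Capability.SKELETON, "download": Capability.SKELETON, "read": Capability.SKELETON},
}

def supports(provider: str, action: str) -> bool:
    # Treat anything other than "unsupported" or "skeleton" as usable, possibly with limits.
    level = PROVIDER_CAPS[provider][action]
    return level not in (Capability.UNSUPPORTED, Capability.SKELETON)
```

A caller could use `supports("pubmed", "download")` to skip a download attempt before making any network request.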

---

## Installation

### Install from PyPI

Once published, end users can install it with:

```bash
pip install paperhub-cli
```

### Local editable install

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

### Install with dev dependencies

```bash
pip install -e ".[dev]"
```

### Build and publish to PyPI

From the repository root:

```bash
python -m pip install --upgrade build twine
python -m build
python -m twine check dist/*
python -m twine upload dist/*
```

After upload succeeds, users can install it anywhere with:

```bash
pip install paperhub-cli
```

---

## CLI Usage

### Search

LLM-assisted planning:

```bash
paperhub-cli search "retrieval augmented generation for scientific QA"
```

Direct search without planner:

```bash
paperhub-cli search "vision transformers" --no-plan
```

Search specific providers:

```bash
paperhub-cli search "long context language models" \
  --no-plan \
  --sources arxiv,openalex,semantic_scholar
```

Restrict by year:

```bash
paperhub-cli search "biomedical relation extraction" \
  --no-plan \
  --sources pubmed,europepmc \
  --year-from 2022
```

Use recent-year mode:

```bash
paperhub-cli search "multimodal agents" --recent-years 3 --no-plan
```

Resume a previous planned run:

```bash
paperhub-cli search --resume <research_id>
```

### Read

Read normalized metadata by stable id:

```bash
paperhub-cli read --id arxiv:2005.11401
paperhub-cli read --id acl:2023.acl-long.1
paperhub-cli read --id doi:10.1145/nnnnnnn.nnnnnnn
paperhub-cli read --id openalex:W2741809807
```
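
A stable id is a provider prefix followed by the provider's native identifier. The sketch below shows one plausible way to split such ids; `parse_stable_id` and `KNOWN_PREFIXES` are hypothetical names, and the prefix set covers only the four shown above.

```python
KNOWN_PREFIXES = ("arxiv", "acl", "doi", "openalex")

def parse_stable_id(stable_id: str) -> tuple:
    # Split on the first colon only, so native ids that contain colons stay intact.
    prefix, sep, native = stable_id.strip().partition(":")
    prefix = prefix.lower()
    if not sep or not native or prefix not in KNOWN_PREFIXES:
        raise ValueError(f"unrecognized stable id: {stable_id!r}")
    return prefix, native
```

For example, `parse_stable_id("doi:10.1000/182")` yields the pair `("doi", "10.1000/182")`, while a bare `2005.11401` raises `ValueError`.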

Try full-text PDF extraction when possible:

```bash
paperhub-cli read --id arxiv:2005.11401 --full
```

### Download

Download a paper PDF to the current directory:

```bash
paperhub-cli download --id arxiv:2005.11401
```

Export as plain text or Markdown instead of PDF:

```bash
paperhub-cli download --id arxiv:2005.11401 --format txt
paperhub-cli download --id arxiv:2005.11401 --format md
```

Choose a destination:

```bash
paperhub-cli download --id doi:10.1000/182 --dest papers/
```

---

## Important Flags

- `--sources`: comma-separated provider ids for direct backend selection
- `--source`: legacy `arxiv` / `acl` / `both` selector used when `--sources` is not set
- `--no-plan`: bypass LLM planning and run one direct search
- `--depth`: planner depth for decomposed research runs
- `--top-k`: maximum number of papers per query or subtopic
- `--recent-years`: convenience filter for recent work
- `--verbose`: print LLM diagnostics and INFO logs
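
The interaction between `--sources` and the legacy `--source` selector can be sketched as follows. The `resolve_sources` helper is hypothetical, and the default of `both` when neither flag is given is an assumption rather than documented behaviour.

```python
from typing import Optional

def resolve_sources(sources: Optional[str], source: Optional[str] = None) -> list:
    if sources:
        # --sources wins: split on commas, normalize, and drop duplicates in order.
        resolved = []
        for item in sources.split(","):
            provider_id = item.strip().lower()
            if provider_id and provider_id not in resolved:
                resolved.append(provider_id)
        return resolved
    # Fall back to the legacy selector (assumed default: "both").
    legacy = (source or "both").lower()
    if legacy == "both":
        return ["arxiv", "acl"]
    if legacy in ("arxiv", "acl"):
        return [legacy]
    raise ValueError(f"unknown legacy source: {source!r}")
```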

---

## Environment Variables

The project uses `PAPERHUB_*` environment variables for general and provider-specific settings.

### General

- `PAPERHUB_HTTP_USER_AGENT`: override the default HTTP user agent
- `PAPERHUB_LL_DIAG`: enable LLM diagnostics output

### Provider-specific

- `PAPERHUB_CROSSREF_MAILTO`: contact email sent in Crossref requests
- `PAPERHUB_OPENALEX_EMAIL`: email used for polite OpenAlex identification
- `PAPERHUB_UNPAYWALL_EMAIL`: required for Unpaywall DOI lookups
- `PAPERHUB_SEMANTIC_SCHOLAR_API_KEY`: optional Semantic Scholar API key
- `PAPERHUB_CORE_API_KEY`: optional CORE API key
- `PAPERHUB_DOAJ_API_KEY`: optional DOAJ API key
- `PAPERHUB_ZENODO_ACCESS_TOKEN`: optional Zenodo token
- `PAPERHUB_GOOGLE_SCHOLAR_PROXY_URL`: optional proxy endpoint for fragile Scholar access
- `PAPERHUB_SCIHUB_ENABLED=1`: explicit opt-in gate for the Sci-Hub stub
- `PAPERHUB_IEEE_API_KEY`: future IEEE integration gate
- `PAPERHUB_ACM_API_KEY`: future ACM integration gate

Example:

```bash
export PAPERHUB_UNPAYWALL_EMAIL="you@example.com"
export PAPERHUB_SEMANTIC_SCHOLAR_API_KEY="..."
paperhub-cli search "agentic retrieval" --no-plan --sources semantic_scholar,openalex
```
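
Several of these variables act as gates: Unpaywall needs an email, Sci-Hub needs an explicit opt-in, and the IEEE and ACM skeletons need a key. A minimal sketch of such gating is shown below; `provider_env_status` is a hypothetical helper, and the exact checks in the codebase may differ.

```python
import os

def provider_env_status(env=None) -> dict:
    # Accept an explicit mapping so the check is easy to test; default to the process env.
    env = os.environ if env is None else env
    return {
        "unpaywall": bool(env.get("PAPERHUB_UNPAYWALL_EMAIL", "").strip()),
        "scihub": env.get("PAPERHUB_SCIHUB_ENABLED", "") == "1",
        "ieee": bool(env.get("PAPERHUB_IEEE_API_KEY", "").strip()),
        "acm": bool(env.get("PAPERHUB_ACM_API_KEY", "").strip()),
    }

status = provider_env_status({"PAPERHUB_UNPAYWALL_EMAIL": "you@example.com"})
```

With only the Unpaywall email set, `status["unpaywall"]` is true and the other three gates stay closed.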

---

## LLM Provider Configuration

Planning mode can run with:

- direct OpenAI-compatible APIs
- direct Gemini API
- a LiteLLM proxy that routes to many providers (OpenAI, Anthropic Claude, Vertex Gemini, Bedrock, and others)

### Option A: OpenAI-compatible direct

```bash
export LLM_API_KEY="sk-..."
export LLM_MODEL="gpt-4o-mini"
# optional (defaults to https://api.openai.com/v1)
export LLM_HOST="https://api.openai.com/v1"
```

You can also use `OPENAI_API_KEY` instead of `LLM_API_KEY`.

### Option B: Gemini direct

```bash
export LLM_PROVIDER="gemini"
export GEMINI_API_KEY="..."
export LLM_MODEL="gemini-2.0-flash"
```
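
The two options above could be resolved from the environment roughly as in this sketch. The `LLMSettings` dataclass and `resolve_llm_settings` function are hypothetical; the default provider name `openai` and the precedence of `LLM_API_KEY` over `OPENAI_API_KEY` are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Optional

@dataclass
class LLMSettings:
    provider: str
    api_key: str
    model: str
    host: Optional[str]

def resolve_llm_settings(env=None) -> LLMSettings:
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "openai").lower()
    model = env.get("LLM_MODEL", "")
    if provider == "gemini":
        # Option B: Gemini direct, no OpenAI-style host.
        return LLMSettings("gemini", env.get("GEMINI_API_KEY", ""), model, None)
    # Option A: OpenAI-compatible; OPENAI_API_KEY is accepted as a fallback.
    api_key = env.get("LLM_API_KEY") or env.get("OPENAI_API_KEY", "")
    host = env.get("LLM_HOST", "https://api.openai.com/v1")
    return LLMSettings("openai", api_key, model, host)
```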

---

## Architecture

The codebase centers around a normalized `Paper` record and an async provider layer.

- `paperhub_cli/models.py`: `Paper`, `SearchFilters`, and known `Source` values
- `paperhub_cli/providers/`: provider implementations, capability metadata, id parsing, merging, and registry
- `paperhub_cli/search/orchestrator.py`: direct multi-provider search orchestration
- `paperhub_cli/tools/__init__.py`: planner-facing tool registry built from the same provider layer
- `paperhub_cli/planner/agents/`: LLM rephrase + decomposition workflow
- `paperhub_cli/reader/fetcher.py`: provider-aware metadata fetch and PDF text extraction
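
To make the shape of this layering concrete, here is a small sketch of a normalized record plus an async provider interface and registry. The field names on `Paper`, the `Provider` protocol, and `FakeArxiv` are illustrative assumptions and do not reproduce the definitions in `paperhub_cli/models.py` or `paperhub_cli/providers/`.

```python
import asyncio
from dataclasses import asdict, dataclass
from typing import Optional, Protocol

@dataclass
class Paper:
    id: str                      # stable id such as "arxiv:2005.11401"
    title: str
    source: str
    year: Optional[int] = None
    pdf_url: Optional[str] = None

    def to_dict(self) -> dict:
        # Plain dict of the fields, suitable for JSON serialization.
        return asdict(self)

class Provider(Protocol):
    name: str
    async def search(self, query: str, limit: int) -> list: ...

class FakeArxiv:
    # Hypothetical provider that returns a canned record instead of calling an API.
    name = "arxiv"

    async def search(self, query: str, limit: int) -> list:
        return [Paper(id="arxiv:2005.11401", title="Example title", source=self.name)][:limit]

REGISTRY = {"arxiv": FakeArxiv()}

papers = asyncio.run(REGISTRY["arxiv"].search("rag", limit=5))
```

Because both direct search and the planner tools would look providers up in the same registry, adding a provider in one place would make it visible to both paths.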

---

## Planner Tool Hints

Planner execution uses grouped tool hints backed by the same provider registry. Examples include:

- `multi_default`
- `open_metadata`
- `biomedical`
- `preprints_wide`
- `broad_scholarly`

This keeps direct CLI search and planner-guided search aligned.

---

## Testing

Run the test suite with:

```bash
PYTHONPATH=. pytest -q
```

The test suite includes:

- unit tests for provider resolution and reader id normalization
- fixture-based parsing tests for provider payload normalization
- planner/tool registry coverage

---

## MCP Server Notes

This repository currently provides the provider layer and tool abstractions needed for an MCP server, but it does not yet ship a dedicated MCP server module.

If you want to expose it over MCP, the intended pattern is:

1. create a thin MCP adapter around the existing provider registry and reader entrypoints
2. expose stable tools such as `search_papers`, `read_paper`, `download_paper`, and `list_providers`
3. return JSON-serializable `Paper.to_dict()` payloads
4. keep MCP tool definitions generic and pass provider ids as parameters instead of creating one MCP tool per provider

Example MCP tools to expose:

- `search_papers`: search across one or more providers
- `read_paper`: fetch normalized metadata by stable id
- `download_paper`: download or export a paper as `pdf`, `txt`, or `md`
- `list_providers`: return provider capability metadata
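
The "generic tools, provider ids as parameters" pattern can be sketched without any MCP library, as below. Everything here is hypothetical: the stub handlers, the `TOOLS` table, and `call_tool` only illustrate the dispatch shape an adapter might use, with JSON-serializable return values.

```python
import json

def search_papers(query: str, providers: list, limit: int = 10) -> list:
    # Placeholder: a real adapter would call the provider registry here.
    stubs = [
        {"id": f"{provider}:example", "title": f"stub result for {query}", "source": provider}
        for provider in providers
    ]
    return stubs[:limit]

def list_providers() -> list:
    # Placeholder capability metadata for a single provider.
    return [{"id": "arxiv", "search": "full"}]

# One generic tool per action; providers are arguments, not separate tools.
TOOLS = {"search_papers": search_papers, "list_providers": list_providers}

def call_tool(name: str, arguments: dict) -> str:
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return json.dumps(TOOLS[name](**arguments))

payload = json.loads(
    call_tool("search_papers", {"query": "rag", "providers": ["arxiv", "openalex"]})
)
```

Keeping the tool list this small means a new provider needs no new tool definition, only a new accepted value for the `providers` argument.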

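Example Claude Desktop configuration, assuming you have added such an adapter at the hypothetical module path `paperhub_cli.mcp.server` (not shipped yet):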

```json
{
  "mcpServers": {
    "paperhub": {
      "command": "python",
      "args": ["-m", "paperhub_cli.mcp.server"],
      "env": {
        "PAPERHUB_UNPAYWALL_EMAIL": "you@example.com",
        "PAPERHUB_SEMANTIC_SCHOLAR_API_KEY": "your-api-key"
      }
    }
  }
}
```

If you prefer a CLI-style entrypoint, the intended UX would be similar to:

```json
{
  "mcpServers": {
    "paperhub": {
      "command": "paperhub-cli",
      "args": ["mcp"]
    }
  }
}
```

## Caveats

- Some providers are best-effort or metadata-only.
- Read/download support depends on OA links or provider capabilities.
- Fragile sources such as Scholar-style scraping and Sci-Hub-style flows should remain opt-in.
- Upstream APIs and HTML layouts may change and require parser maintenance.

---

## License / Attribution

This project references provider behavior and capability ideas similar to `paper-search-mcp`, but implements native async provider support directly in this repository rather than depending on that package.
