Metadata-Version: 2.1
Name: deepxiv-sdk
Version: 0.2.4
Summary: A Python package for arXiv paper access with CLI and MCP server support
Home-page: https://github.com/qhjqhj00/deepxiv-sdk
Author: Hongjin Qian
License: MIT
Project-URL: Homepage, https://1stauthor.com/
Project-URL: Documentation, https://github.com/qhjqhj00/deepxiv-sdk#readme
Project-URL: Repository, https://github.com/qhjqhj00/deepxiv-sdk
Project-URL: Bug Tracker, https://github.com/qhjqhj00/deepxiv-sdk/issues
Project-URL: Demo, https://1stauthor.com/
Project-URL: API Documentation, https://data.rag.ac.cn/api/docs
Keywords: arxiv,research,papers,agent,llm,react,mcp,cli
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31.0
Requires-Dist: click>=8.0.0
Requires-Dist: python-dotenv>=0.19.0
Provides-Extra: agent
Requires-Dist: openai>=1.0.0; extra == "agent"
Requires-Dist: langgraph>=0.0.20; extra == "agent"
Requires-Dist: langchain-core>=0.1.0; extra == "agent"
Provides-Extra: all
Requires-Dist: requests>=2.31.0; extra == "all"
Requires-Dist: click>=8.0.0; extra == "all"
Requires-Dist: python-dotenv>=0.19.0; extra == "all"
Requires-Dist: mcp[cli]>=1.2.0; extra == "all"
Requires-Dist: openai>=1.0.0; extra == "all"
Requires-Dist: langgraph>=0.0.20; extra == "all"
Requires-Dist: langchain-core>=0.1.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=5.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: isort>=5.0.0; extra == "dev"
Provides-Extra: mcp
Requires-Dist: mcp[cli]>=1.2.0; extra == "mcp"

# deepxiv-sdk

**DeepXiv is an agent-first paper search and progressive reading tool.**

Install it with `pip`, start using it immediately, and let the CLI auto-register a token on first use. No extra setup is required before your first query.

- **📚 API Documentation**: [https://data.rag.ac.cn/api/docs](https://data.rag.ac.cn/api/docs)
- **🎥 Demo Video**: [![Watch Demo](https://img.shields.io/badge/YouTube-Watch%20Demo-red)](https://youtu.be/atr71CbQybM)
- **📄 Technical Report**: [![arxiv](https://img.shields.io/badge/arXiv-2603.00084-b31b1b)](https://arxiv.org/abs/2603.00084)
- **📖 中文文档**: [README.zh.md](README.zh.md)

## What DeepXiv Does

DeepXiv is built around two core workflows that matter for agents:

1. **Search + Progressive Content Access**
2. **Trending + Popularity signals**

Instead of blindly loading full papers, DeepXiv lets agents read in layers, based on token budget and task value.

## Quick Start

```bash
pip install deepxiv-sdk
```

On first use, deepxiv automatically registers a free token and saves it to `~/.env`:

```bash
deepxiv search "agentic memory" --limit 5
```

If you want the full stack including MCP and the built-in research agent:

```bash
pip install "deepxiv-sdk[all]"
```

## CLI-First Workflow

The CLI is the primary interface. DeepXiv is designed so agents can work like researchers: search first, judge quickly, then read only the most valuable parts.

```bash
deepxiv search "agentic memory" --limit 5
deepxiv paper 2603.21489 --brief
deepxiv paper 2603.21489 --head
deepxiv paper 2603.21489 --section Analysis
```

Three commands matter most for progressive reading:

- `--brief`: decide whether a paper is worth deeper reading
- `--head`: inspect structure, sections, and token distribution
- `--section`: read only the most valuable parts such as `Introduction`, `Method`, or `Experiments`

This is the core DeepXiv idea: agents should not load full papers unless they truly need them.

## CLI Features

### 1. Paper Search and Reading

```bash
deepxiv search "transformer" --limit 10
deepxiv paper 2409.05591 --brief
deepxiv paper 2409.05591 --head
deepxiv paper 2409.05591 --section Introduction
deepxiv paper 2409.05591
```

### 2. Trending and Popularity

Research is not only about what exists, but what is worth reading now.

```bash
deepxiv trending --days 7 --limit 30
deepxiv paper 2409.05591 --popularity
```

- `trending` finds the hottest recent papers from social signals
- `--popularity` gives paper-level propagation metrics such as views, tweets, likes, and replies

### 3. Web Search

```bash
deepxiv wsearch "karpathy"
deepxiv wsearch "karpathy" --json
```

Notes:
- `deepxiv wsearch` calls the DeepXiv web search endpoint
- each `wsearch` request costs **20 limit**
- a registered token gets **10,000 limit per day**, so this is roughly **500 web searches per day**

### 4. Semantic Scholar Metadata by ID

```bash
deepxiv sc 258001
deepxiv sc 258001 --json
```

`deepxiv sc` fetches metadata using a Semantic Scholar paper ID.

Notes:
- this is useful when your workflow already has Semantic Scholar IDs
- DeepXiv will **soon provide a Semantic Scholar search service** that returns Semantic Scholar IDs directly

### 5. Biomedical Papers

```bash
deepxiv pmc PMC544940 --head
deepxiv pmc PMC544940
```

## Example Agent Workflows

### Workflow 1: Review recent hot papers

```bash
deepxiv trending --days 7 --limit 30 --json
```

Then an agent can:

1. run `--brief` for each paper
2. run `--head` for the most promising ones
3. read only key sections
4. produce a report without manually opening dozens of papers

### Workflow 2: Enter a new research topic

```bash
deepxiv search "agentic memory" --date-from 2026-03-01 --limit 100 --format json
```

Then an agent can:

1. batch-brief the results
2. prioritize papers with GitHub links
3. inspect experiments via `--head`
4. read `Experiments` / `Results`
5. turn datasets, metrics, and scores into a baseline table

## Built-in Deep Research Agent

If you do not want to compose the workflow manually, the CLI already includes a built-in research agent.

```bash
pip install "deepxiv-sdk[all]"
deepxiv agent config
deepxiv agent query "What are the latest papers about agent memory?" --verbose
```

If you already have your own agent stack, you can also just plug in the DeepXiv CLI skill and keep your own orchestration.

## Agent Integration

DeepXiv is designed to work well inside Codex, Claude Code, OpenClaw, and similar agent runtimes.

### MCP Server

Add to Claude Desktop MCP config file:

**macOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`

**Windows**: `%APPDATA%\Claude\claude_desktop_config.json`

**Linux**: `~/.config/Claude/claude_desktop_config.json`

```json
{
  "mcpServers": {
    "deepxiv": {
      "command": "deepxiv",
      "args": ["serve"],
      "env": {
        "DEEPXIV_TOKEN": "your_token_here"
      }
    }
  }
}
```

### CLI Skill

```bash
mkdir -p $CODEX_HOME/skills
ln -s "$(pwd)/skills/deepxiv-cli" $CODEX_HOME/skills/deepxiv-cli
```

For frameworks without native skill support, load [skills/deepxiv-cli/SKILL.md](skills/deepxiv-cli/SKILL.md) as operating instructions.

## Python Usage

```python
from deepxiv_sdk import Reader

reader = Reader()

results = reader.search("agent memory", size=5)
brief = reader.brief("2409.05591")
head = reader.head("2409.05591")
intro = reader.section("2409.05591", "Introduction")

web = reader.websearch("karpathy")
sc_meta = reader.semantic_scholar("258001")
```

## Roadmap

DeepXiv is moving toward an **academic paper data interface at 100M+ scale**.

The roadmap is:

1. **Full arXiv coverage with T+1 automatic updates**
2. **anyXiv coverage**, including bioRxiv, medRxiv, and similar repositories
3. **Full open-access literature coverage**

The metadata backbone will increasingly rely on **Semantic Scholar metadata as the base layer**, while continuously expanding coverage and enrichment quality.

## Current Coverage

- ✅ **arXiv** - current primary source
- ✅ **PubMed Central (PMC)** - biomedical and life sciences
- 🔄 **Semantic Scholar metadata integration** - expanding as the metadata foundation

> DeepXiv focuses on open-access literature so agents can work on unrestricted paper data instead of getting blocked by subscription barriers.

## Complete API Reference

### Search and Query

```python
reader.search(query, size=10, search_mode="hybrid", categories=None, min_citation=None)
reader.websearch(query)            # Web search (20 limit per request)
reader.semantic_scholar(sc_id)     # Metadata lookup by Semantic Scholar ID
reader.head(arxiv_id)              # Paper metadata and sections overview
reader.brief(arxiv_id)             # Quick summary (title, TLDR, keywords, citations, GitHub URL)
reader.section(arxiv_id, section)  # Read specific section
reader.raw(arxiv_id)               # Full paper
reader.preview(arxiv_id)           # Paper preview (~10k characters)
reader.json(arxiv_id)              # Complete structured JSON
```

### PMC (Biomedical Papers)

```python
reader.pmc_head(pmc_id)            # PMC paper metadata
reader.pmc_full(pmc_id)            # Complete PMC paper JSON
```

### Agent (Optional)

```python
from deepxiv_sdk import Agent

agent = Agent(api_key="your_openai_key", model="gpt-4")
answer = agent.query("What are the latest papers about agent memory?")
print(answer)
```

## Token Management

deepxiv supports 4 ways to configure tokens:

**1. Auto-registration (Recommended)** - Automatically creates and saves on first use
```bash
deepxiv search "agent"
```

**2. Using config command**
```bash
deepxiv config --token YOUR_TOKEN
```

**3. Environment variable**
```bash
export DEEPXIV_TOKEN="your_token"
```

**4. Command-line option**
```bash
deepxiv paper 2409.05591 --token YOUR_TOKEN
```

**Increase daily limit**: Default is 10,000 requests/day. For higher limits, email your name, email, and phone to `tommy@chien.io`.

### Free Test Papers

These papers can be accessed without a token:

**arXiv**: `2409.05591`, `2504.21776`
**PMC**: `PMC544940`, `PMC514704`

## MCP Tools

Available tools when using MCP Server:

| Tool | Description |
|------|-------------|
| `search_papers` | Search arXiv papers |
| `get_paper_brief` | Quick summary |
| `get_paper_metadata` | Full metadata |
| `get_paper_section` | Read specific section |
| `get_full_paper` | Complete paper |
| `get_paper_preview` | Paper preview |
| `get_pmc_metadata` | PMC paper metadata |
| `get_pmc_full` | Complete PMC paper |

## Agent Usage (Optional)

The built-in ReAct agent can automatically search papers, read content, and perform multi-turn reasoning:

```python
from deepxiv_sdk import Agent

agent = Agent(
    api_key="your_deepseek_key",
    base_url="https://api.deepseek.com/v1",
    model="deepseek-chat"
)

answer = agent.query("Compare key ideas in transformers and attention mechanisms")
print(answer)
```

Or via CLI:

```bash
deepxiv agent config  # Configure LLM API
deepxiv agent query "What are the latest papers about agent memory?" --verbose
```

## Error Handling

deepxiv provides specific exception types:

```python
from deepxiv_sdk import (
    Reader,
    AuthenticationError,  # 401 - Invalid or expired token
    RateLimitError,       # 429 - Daily limit reached
    NotFoundError,        # 404 - Paper not found
    ServerError,          # 5xx - Server error
    APIError              # Other API errors
)

try:
    paper = reader.brief("2409.05591")
except AuthenticationError:
    print("Please update your token")
except RateLimitError:
    print("Daily limit reached")
except NotFoundError:
    print("Paper not found")
except APIError as e:
    print(f"API error: {e}")
```

## Troubleshooting

**Q: Do I need a token to use?**
A: No. Some papers are free to access. Search and some content require a token, but it's auto-created on first use.

**Q: What's the maximum search results?**
A: 100 per request. Use `offset` parameter for pagination.

**Q: How to handle timeouts?**
A: Reader automatically retries (max 3 times) with exponential backoff. You can customize:
```python
reader = Reader(timeout=120, max_retries=5)
```

**Q: Can I cache paper content?**
A: Yes. After getting content with reader, cache locally to database or file system.

**Q: Which LLMs does the agent support?**
A: Any OpenAI-compatible API (OpenAI, DeepSeek, OpenRouter, local Ollama, etc.).

## Examples

See [examples/](examples/) directory:

- `quickstart.py` - 5-minute quick start
- `example_reader.py` - Basic Reader usage
- `example_agent.py` - Agent usage
- `example_advanced.py` - Advanced patterns
- `example_error_handling.py` - Error handling examples

## License

MIT License - see [LICENSE](LICENSE) file

## Support

- 🐛 **GitHub Issues**: [https://github.com/qhjqhj00/deepxiv_sdk/issues](https://github.com/qhjqhj00/deepxiv_sdk/issues)
- 📚 **API Documentation**: [https://data.rag.ac.cn/api/docs](https://data.rag.ac.cn/api/docs)
- 📧 **Higher Limits**: Email with your name, email, and phone to `tommy@chien.io`
