Metadata-Version: 2.4
Name: academic-search
Version: 0.7.2
Summary: MCP Server for multi-provider academic search (Semantic Scholar, Crossref, OpenAlex, PubMed) with regex filtering and statistics
Author-email: Yohann <yohann@example.com>
License: MIT
Requires-Python: >=3.10
Requires-Dist: mcp[cli]>=1.6.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: requests>=2.31.0
Description-Content-Type: text/markdown

# Academic Search MCP Server

An MCP (Model Context Protocol) server for searching, filtering, and exploring academic papers across multiple databases, with citation graph walking.

**Providers**: Semantic Scholar, Crossref, OpenAlex, PubMed

## Features

- **Multi-provider search** — unified interface across 4 academic databases
- **Universal regex filtering** — regex post-filtering on all providers (not just Semantic Scholar)
- **Rich post-hoc filters** — year range, citation count, journal, publication type, open access, author
- **Citation graph walk** — random walk on the Semantic Scholar citation graph with backtracking, cross-edge detection, and topic-aware pruning
- **Text similarity** — pure-Python TF cosine similarity for topic drift detection and most-similar candidate selection (no heavy dependencies)
- **Stats** — per-query metadata on authors, years, and field coverage
- **Normalised output** — all providers return data in the same schema
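
The text-similarity feature above can be illustrated with a minimal sketch of term-frequency cosine similarity using only the standard library. The function names `tf_vector` and `tf_cosine` and the tokenisation rule are assumptions for illustration, not the server's actual code.

```python
import math
import re
from collections import Counter


def tf_vector(text: str) -> Counter:
    """Lower-case the text, split on non-alphanumerics, and count terms."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def tf_cosine(a: str, b: str) -> float:
    """Cosine similarity between raw term-frequency vectors, in [0.0, 1.0]."""
    va, vb = tf_vector(a), tf_vector(b)
    if not va or not vb:
        return 0.0  # an empty text has no direction to compare
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * vb[term] for term, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b)
```

Texts with no shared terms score 0.0 and identical texts score 1.0, which is the same range as the `stop_similarity` threshold.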

## Tools

### `search_papers`
Search papers by keyword with optional regex post-filtering on title and abstract.

Parameters: `query`, `search_type`, `limit`, `regex_filter`, `regex_search_fields`, `match_mode`, `year_min`, `year_max`, `min_citation_count`, `max_citation_count`, `open_access_only`, `has_pdf`, `journal`, `exclude_journals`, `publication_types`, `exclude_publication_types`, `author`, `has_abstract`, `provider`
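
As a rough illustration of regex post-filtering, the sketch below applies a pattern to selected fields of already-fetched papers. The function name `regex_post_filter`, the dict-shaped papers, and the `"any"`/`"all"` interpretation of `match_mode` are assumptions, not confirmed behaviour of the server.

```python
import re


def regex_post_filter(papers, pattern, fields=("title", "abstract"), match_mode="any"):
    """Keep papers whose chosen fields match `pattern` (case-insensitive).

    Assumed semantics: "any" keeps a paper if at least one field matches,
    "all" requires every listed field to match.
    """
    if match_mode not in ("any", "all"):
        raise ValueError(f"unsupported match_mode: {match_mode!r}")
    rx = re.compile(pattern, re.IGNORECASE)
    combine = any if match_mode == "any" else all
    kept = []
    for paper in papers:
        # Missing or None fields are treated as empty strings.
        hits = [bool(rx.search(paper.get(field) or "")) for field in fields]
        if combine(hits):
            kept.append(paper)
    return kept


papers = [
    {"title": "Graph neural networks for molecules", "abstract": None},
    {"title": "A survey of transformers", "abstract": "Graph methods are out of scope."},
]
matching = regex_post_filter(papers, r"graph\s+neural")
```

Here `matching` contains only the first paper, since the second mentions "graph" but never "graph neural".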

### `search_by_author`
Find papers by author name, with the same post-hoc filters as `search_papers`.

### `explore_citations`
Walk the Semantic Scholar citation graph step by step, starting from a seed paper.

| Parameter | Default | Description |
|---|---|---|
| `seed_paper_id` | (required) | Semantic Scholar (S2) paper ID |
| `num_steps` | 30 | Max walk length |
| `max_depth` | None | Max hops from seed before forced backtrack |
| `direction_choice` | `"random"` | `"forward"`, `"backward"`, `"alternating"`, or `"random"` |
| `bias` | `"random"` | `"top_cited"`, `"bottom_cited"`, `"most_similar"`, or `"random"` |
| `candidates_per_step` | 100 | Candidates per API call |
| `stop_similarity` | None | 0.0–1.0 topic drift threshold |

Includes the same post-hoc filters as `search_papers`.
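
To make the walk parameters more concrete, here is a simplified sketch of a random walk with backtracking, a depth limit, and a per-walk cache. It is not the server's implementation: `citation_walk` and the `fetch_neighbours(paper_id, direction)` callable are hypothetical, the direction is always chosen at random, and bias, filtering, and cross-edge recording are omitted.

```python
import random


def citation_walk(seed_paper_id, fetch_neighbours, num_steps=30, max_depth=None, rng=None):
    """Return paper IDs in the order they were first reached.

    `fetch_neighbours(paper_id, direction)` is a hypothetical callable that
    returns a list of neighbouring paper IDs for "forward" or "backward".
    """
    rng = rng or random.Random()
    cache = {}                # (paper_id, direction) -> neighbours, per walk
    path = [seed_paper_id]    # stack of hops from the seed to the current paper
    visited = {seed_paper_id}
    order = [seed_paper_id]

    for _ in range(num_steps):
        current = path[-1]
        at_depth_limit = max_depth is not None and len(path) - 1 >= max_depth
        fresh = []
        if not at_depth_limit:
            direction = rng.choice(["forward", "backward"])
            key = (current, direction)
            if key not in cache:  # fetch each (paper, direction) at most once
                cache[key] = fetch_neighbours(current, direction)
            fresh = [p for p in cache[key] if p not in visited]
        if fresh:
            nxt = rng.choice(fresh)
            visited.add(nxt)
            path.append(nxt)
            order.append(nxt)
        elif len(path) > 1:
            path.pop()            # dead end or depth limit: backtrack one hop
    return order
```

Because the cache is keyed on `(paper_id, direction)`, returning to a paper after backtracking reuses the earlier response instead of calling the API again.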

### `get_paper_stats`
Fetch papers and compute statistics: field availability, author counts, publication year distributions.
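
The kind of summary described above could be computed roughly as follows. This is a standard-library sketch over dict-shaped papers; the function name `paper_stats`, the field names, and the output keys are assumptions rather than the tool's actual schema.

```python
from collections import Counter


def paper_stats(papers, fields=("title", "abstract", "year", "authors")):
    """Summarise field availability, authors per paper, and year distribution."""
    total = len(papers)
    # Fraction of papers in which each field is present and non-empty.
    availability = {
        field: (sum(1 for p in papers if p.get(field)) / total if total else 0.0)
        for field in fields
    }
    # How many papers have 0, 1, 2, ... authors.
    authors_per_paper = Counter(len(p.get("authors") or []) for p in papers)
    years = Counter(p["year"] for p in papers if p.get("year"))
    return {
        "total": total,
        "field_availability": availability,
        "authors_per_paper": dict(authors_per_paper),
        "year_distribution": dict(sorted(years.items())),
    }
```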

### `build_extended_query`
Preview how a regex pattern gets transformed for a given provider's API.

## Usage

### Via `uvx` (recommended)

```bash
uvx academic-search
```

### Via `uv run`

```bash
uv run --with academic-search academic-search
```

### From source

```bash
git clone ...
cd academic-search
uv run academic-search
```

### With Claude Desktop

```json
{
  "mcpServers": {
    "academic-search": {
      "command": "uvx",
      "args": ["academic-search"]
    }
  }
}
```

## Development

```bash
uv sync
uv run academic-search
```

## Providers

| Provider | Citation graph | Notes |
|---|---|---|
| Semantic Scholar | Yes | Full citation/reference graph endpoints, citation counts |
| Crossref | No | Good DOI coverage, no OA metadata |
| OpenAlex | No | Best OA metadata, abstracts from inverted index |
| PubMed | No | Biomedical and life-sciences literature (NIH/NLM) |
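
OpenAlex returns abstracts as an `abstract_inverted_index`, a mapping from each word to the positions where it occurs. A minimal sketch of rebuilding plain text from that mapping is shown below; the helper name `abstract_from_inverted_index` is hypothetical, and the server's own reconstruction may differ.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from a {word: [positions]} mapping."""
    if not inverted_index:
        return None  # works without an abstract have no index
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    # Sorting by position restores the original word order.
    return " ".join(word for _, word in sorted(positioned))
```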

## Design Notes

- **`query` and `regex_filter` are separate** — the API query is plain text; regex filtering is a post-hoc step on all providers
- **Citation walk is a true graph walk** — one step at a time, not bucket collection. Uses backtracking, cross-edge recording, and per-step filtering
- **Caching** — API responses for `(paper_id, direction)` are cached per walk to avoid redundant fetches during backtracking
- **Text similarity** — pure-Python TF cosine similarity (no sentence-transformers); used for `most_similar` bias and `stop_similarity` pruning
- **No API key required** — all providers have free tiers, but Semantic Scholar rate limits are ~1 req/5s without a key. Set `S2_API_KEY` for up to 100 req/s
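
For the API-key note above, the sketch below shows one way a client could attach the key and pace unauthenticated requests. The `x-api-key` header is the one used by the Semantic Scholar API; the helper `s2_headers`, the class `MinIntervalThrottle`, and the way they would be wired into this server are assumptions.

```python
import os
import time


def s2_headers(env=os.environ):
    """Return request headers, adding the API key only if one is configured."""
    key = env.get("S2_API_KEY")
    return {"x-api-key": key} if key else {}


class MinIntervalThrottle:
    """Sleep as needed so consecutive calls are at least `interval` seconds apart."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `MinIntervalThrottle(5.0).wait()` before each request would approximate the keyless limit of about one request every 5 seconds.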
