Metadata-Version: 2.3
Name: pydantic_researchers
Version: 0.1.0
Summary: Standalone, pluggable port of gpt-researcher to pydantic-ai — decoupled from pydantic-deep.
Requires-Dist: pydantic-ai-slim[openai,anthropic,google,cohere,voyageai,bedrock,mcp]>=2.0.0b6
Requires-Dist: pydantic-ai-backend[console]>=0.2.10
Requires-Dist: pydantic-settings
Requires-Dist: python-dotenv
Requires-Dist: tiktoken
Requires-Dist: httpx
Requires-Dist: numpy
Requires-Dist: beautifulsoup4
Requires-Dist: lxml
Requires-Dist: logfire
Requires-Dist: pymupdf4llm>=1.27.2.3
Requires-Dist: aiofiles>=24.1.0
Requires-Dist: aiosqlite>=0.22.1
Requires-Dist: filelock
Requires-Dist: genai-prices
Requires-Dist: summarization-pydantic-ai
Requires-Dist: ddgs>=9.14.4
Requires-Dist: starlette>=1.3.1
Requires-Dist: cryptography>=48.0.1
Requires-Dist: aiohttp>=3.14.1
Requires-Dist: pydantic-deep ; extra == 'deep'
Requires-Dist: pypdf ; extra == 'documents'
Requires-Dist: python-docx ; extra == 'documents'
Requires-Dist: openpyxl ; extra == 'documents'
Requires-Dist: python-pptx ; extra == 'documents'
Requires-Dist: readability-lxml ; extra == 'documents'
Requires-Dist: pymupdf ; extra == 'documents'
Requires-Dist: arxiv ; extra == 'documents'
Requires-Dist: langchain-community ; extra == 'embeddings'
Requires-Dist: langchain-huggingface ; extra == 'embeddings'
Requires-Dist: sentence-transformers ; extra == 'embeddings'
Requires-Dist: hnswlib ; extra == 'embeddings'
Requires-Dist: ddgs ; extra == 'retrievers'
Requires-Dist: pytest>=8 ; extra == 'test'
Requires-Dist: pytest-asyncio>=0.23 ; extra == 'test'
Requires-Dist: respx>=0.21 ; extra == 'test'
Requires-Dist: pytest-cov>=5 ; extra == 'test'
Requires-Python: >=3.14
Provides-Extra: deep
Provides-Extra: documents
Provides-Extra: embeddings
Provides-Extra: retrievers
Provides-Extra: test
Description-Content-Type: text/markdown

# pydantic-researchers

[![CI](https://github.com/yourusername/pydantic-researchers/actions/workflows/ci.yml/badge.svg)](https://github.com/yourusername/pydantic-researchers/actions/workflows/ci.yml)
[![Python Version](https://img.shields.io/badge/python-3.14+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)

A standalone, pluggable port of [gpt-researcher](https://github.com/assafelovic/gpt-researcher)
to [pydantic-ai](https://ai.pydantic.dev). **No hard dependency on `pydantic-deep`** — the
research pipeline drives raw `pydantic_ai.Agent` instances directly. (`pydantic-deep` is an
*optional* enhancement — see [Enhanced fallback](#enhanced-fallback-optional).)

## Installation

```bash
pip install pydantic-researchers
```

For optional features:

```bash
pip install "pydantic-researchers[embeddings,retrievers,documents]"
```

## Quick Start

```python
import asyncio
from pydantic_researchers import GPTResearcher

async def main() -> None:
    gr = GPTResearcher(query="Why is Nvidia stock going up?", report_type="research_report")
    context = await gr.conduct_research()
    report = await gr.write_report()
    print(report)

asyncio.run(main())
```

## CLI Usage

```bash
# Run deep research
pydantic-researchers deep-research "Why is Nvidia stock going up?" --depth 2 --breadth 4

# Inspect memory
pydantic-researchers memory list
pydantic-researchers memory recall "nvidia stock"

# Check version
pydantic-researchers version
```

## Features

### Report Types
- `research_report` — intro → body → conclusion → references (gpt-researcher pipeline)
- `detailed_report` — subtopic decomposition with dedup, TOC, references
- `deep_research` — breadth × depth recursive tree → synthesized report

### Retrievers
Configured via `Config.retrievers` (e.g. `["tavily", "duckduckgo"]`); multiple
retrievers are fused with reciprocal-rank fusion.
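
Reciprocal-rank fusion scores each result by summing `1 / (k + rank)` over every retriever that returned it. The sketch below illustrates the idea only: the function name, the plain-URL result shape, and the conventional constant `k = 60` are assumptions, not the package's actual implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked URL lists into one (hypothetical helper)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, url in enumerate(ranking, start=1):
            # Each retriever contributes 1 / (k + rank) for every URL it returned.
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by URL for a deterministic order.
    return sorted(scores, key=lambda url: (-scores[url], url))


tavily = ["https://a.example", "https://b.example", "https://c.example"]
duckduckgo = ["https://b.example", "https://d.example", "https://a.example"]
fused = reciprocal_rank_fusion([tavily, duckduckgo])
print(fused)
```

URLs returned by both retrievers rise to the top, because they collect a contribution from each list.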

### Scrapers
Multiple scraper backends:
- `bs` — BeautifulSoup (default, no API key required)
- `browser` — Browser automation
- `firecrawl` — Firecrawl API
- `mcp` — MCP server integration

### Embeddings / Memory (Optional)
Install with `pip install "pydantic-researchers[embeddings]"`. To run with no
API key, use an offline provider
(`huggingface:sentence-transformers/all-MiniLM-L6-v2` or `ollama:…`).

### MCP Integration
Built-in support for MCP servers with preset configurations for common providers
(Tavily, Exa, Firecrawl, Playwright, etc.). **No MCP servers are enabled by
default** — see [MCP presets](#mcp-presets).

## Configuration

All settings are configurable via environment variables or a `Config` object:

```python
from pydantic_researchers import Config, GPTResearcher

config = Config(
    fast_llm="openai:gpt-4o-mini",
    smart_llm="anthropic:claude-opus-4-6",
    retrievers=["duckduckgo", "tavily"],
    cost_budget_usd=5.0,
)

gr = GPTResearcher(query="...", config=config)
```

Every field below can be set either as a `Config(...)` keyword argument or as an
`UPPER_SNAKE_CASE` environment variable (e.g. `fast_llm` ↔ `FAST_LLM`).

---

## LLM Models

The pipeline uses **three model slots**, each a pydantic-ai native
`"<provider>:<model>"` string:

| Slot | Field | Default | Role |
|------|-------|---------|------|
| **fast** | `fast_llm` | `openai:gpt-4o-mini` | Researcher — high-volume calls (summarization, extraction) |
| **smart** | `smart_llm` | `anthropic:claude-opus-4-6` | Writer — final report synthesis |
| **strategic** | `strategic_llm` | `anthropic:claude-opus-4-6` | Planner — outline generation, decomposition |

```bash
# Via env vars
FAST_LLM="openai:gpt-4o-mini"
SMART_LLM="anthropic:claude-opus-4-6"
STRATEGIC_LLM="anthropic:claude-opus-4-6"
```

```python
# Via Config
config = Config(
    fast_llm="openai:gpt-4o-mini",
    smart_llm="anthropic:claude-opus-4-6",
    strategic_llm="openai:gpt-5",
)
```

### API keys for the built-in providers

pydantic-ai reads each provider's key from its conventional env var. Set the
ones you use:

| Provider | Env var | Example model string |
|----------|---------|----------------------|
| OpenAI | `OPENAI_API_KEY` | `openai:gpt-4o`, `openai:gpt-4o-mini`, `openai:gpt-5` |
| Anthropic | `ANTHROPIC_API_KEY` | `anthropic:claude-opus-4-6`, `anthropic:claude-sonnet-4-20250514` |
| Google | `GOOGLE_API_KEY` | `google:gemini-2.5-pro`, `google:gemini-2.5-flash` |
| Groq | `GROQ_API_KEY` | `groq:llama-3.3-70b-versatile` |
| Mistral | `MISTRAL_API_KEY` | `mistral:mistral-large-latest` |
| Cohere | `COHERE_API_KEY` | `cohere:command-r-plus` |

### Custom OpenAI-compatible endpoint

Point **all three slots** at one OpenAI-compatible endpoint (MiniMax,
OpenRouter, a local server, NVIDIA NIM, etc.) without touching code:

| Field | Env var | Purpose |
|-------|---------|---------|
| `llm_base_url` | `LLM_BASE_URL` | Base URL for the endpoint (e.g. `https://api.minimax.io/v1`) |
| `llm_api_key_env` | `LLM_API_KEY_ENV` | Name of the env var holding the key (e.g. `MINIMAX_API_KEY`) |
| `llm_wire_api` | `LLM_WIRE_API` | Wire protocol: `chat` (default), `responses`, or `anthropic` |
| `llm_max_tokens` | `LLM_MAX_TOKENS` | Cap output tokens per call (some providers, e.g. NVIDIA NIM, require this) |

```bash
# Example: route everything through OpenRouter
LLM_BASE_URL="https://openrouter.ai/api/v1"
LLM_API_KEY_ENV="OPENROUTER_API_KEY"
LLM_WIRE_API="chat"
FAST_LLM="openai:gpt-4o-mini"
SMART_LLM="anthropic:claude-opus-4-6"
```

When `llm_base_url` is set, every model string is built against that base URL
(regardless of the `provider:` prefix), and the key is read from the env var
named by `llm_api_key_env`. Leave `llm_base_url` unset to use the real provider
named in each model string.

---

## Vendor registry (multi-provider fallback)

For vendors that are **not** first-class pydantic-ai providers, a built-in
registry (`pydantic_researchers.llm.providers.PROVIDERS`) knows their endpoints,
key env vars, and a `best` / `small` model each:

| Key | Endpoint | API key env var | `best` model | `small` model | Wire |
|-----|----------|-----------------|--------------|---------------|------|
| `zai` | `https://api.z.ai/api/coding/paas/v4` | `ZAI_CODING_PLAN_API_KEY` | `glm-5.2` (reasoning) | `glm-4.5` | openai |
| `minimax` | `https://api.minimax.io/anthropic` | `MINIMAX_API_KEY` | `MiniMax-M3` | `MiniMax-M3` | anthropic |
| `deepseek` | `https://api.deepseek.com/v1` | `DEEPSEEK_API_KEY` | `deepseek-v4-pro` | `deepseek-v4-flash` | openai |
| `nvidia` | `https://integrate.api.nvidia.com/v1` | `NVIDIA_API_KEY` | `nvidia/nemotron-3-ultra-550b-a55b` | `nvidia/nemotron-3-super-120b-a12b` | openai |
| `hf` | `https://router.huggingface.co/v1` | `HUGGINFACE_TOKEN` | `MiniMaxAI/MiniMax-M3:together` | `MiniMaxAI/MiniMax-M3:together` | openai |

Set the env var(s) for the vendor(s) you want to use:

```bash
# Z.AI (GLM models)
export ZAI_CODING_PLAN_API_KEY="sk-..."

# MiniMax
export MINIMAX_API_KEY="..."

# DeepSeek
export DEEPSEEK_API_KEY="..."

# NVIDIA NIM
export NVIDIA_API_KEY="nvapi-..."

# Hugging Face router
export HUGGINFACE_TOKEN="hf_..."
```

---

## Fallback chains

If a primary model fails with a transient error (429 / 5xx), the pipeline can
fall through to a list of fallback models, configured **per role**. A missing
API key for any model in a chain fails loudly at construction time (no silent
degradation).

| Field | Env var | Default |
|-------|---------|---------|
| `planner_fallback_models` | `PLANNER_FALLBACK_MODELS` | `[]` (no fallback) |
| `researcher_fallback_models` | `RESEARCHER_FALLBACK_MODELS` | `[]` (no fallback) |
| `writer_fallback_models` | `WRITER_FALLBACK_MODELS` | `[]` (no fallback) |

Each list is a comma-separated set of entries. An entry may be:

- A **vendor key** — resolved from the [registry](#vendor-registry-multi-provider-fallback)
  above (e.g. `minimax`, `zai`). Uses the vendor's `best` model.
- A **vendor key with slot** — e.g. `zai:small` (uses the vendor's `small` model).
- A **raw `provider:model` string** — e.g. `openai:gpt-4o`, resolved by
  pydantic-ai directly (requires that provider's key).

Mix freely; there is no limit on chain length.

```bash
# Planner: strategic_llm (Claude) → fall back to GLM, then GPT-4o
PLANNER_FALLBACK_MODELS="zai,openai:gpt-4o"

# Researcher: fast_llm (gpt-4o-mini) → fall back to DeepSeek small, then Z.AI small
RESEARCHER_FALLBACK_MODELS="deepseek:small,zai:small"

# Writer: smart_llm (Claude) → fall back to MiniMax, then GPT-4o
WRITER_FALLBACK_MODELS="minimax,openai:gpt-4o"
```
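
The three entry forms listed above can be told apart by checking whether the text before the first `:` is a registry key. This is a minimal sketch of that rule, not the package's parser: `REGISTRY`, `parse_fallback_entry`, and `parse_fallback_list` are hypothetical names, the registry is an excerpt of the vendor table, and the assumption is that a registry key takes precedence over a provider prefix.

```python
# Hypothetical excerpt of the vendor registry; model names come from the vendor table.
REGISTRY = {
    "zai": {"best": "glm-5.2", "small": "glm-4.5"},
    "minimax": {"best": "MiniMax-M3", "small": "MiniMax-M3"},
    "deepseek": {"best": "deepseek-v4-pro", "small": "deepseek-v4-flash"},
}


def parse_fallback_entry(entry: str) -> tuple[str, str, str]:
    """Classify one entry as ("vendor", key, model) or ("raw", provider, model)."""
    head, sep, tail = entry.strip().partition(":")
    if head in REGISTRY:
        slot = tail if sep else "best"  # a bare vendor key means the `best` model
        if slot not in REGISTRY[head]:
            raise ValueError(f"unknown slot {slot!r} for vendor {head!r}")
        return ("vendor", head, REGISTRY[head][slot])
    if not sep or not tail:
        raise ValueError(f"not a vendor key or provider:model string: {entry!r}")
    return ("raw", head, tail)


def parse_fallback_list(value: str) -> list[tuple[str, str, str]]:
    """Split a comma-separated env var value and classify each entry."""
    return [parse_fallback_entry(e) for e in value.split(",") if e.strip()]


chain = parse_fallback_list("zai,deepseek:small,openai:gpt-4o")
print(chain)
```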

**Single model = no overhead.** When a fallback list is empty (the default),
the primary model is used as-is — pydantic-ai handles a single model natively,
with **no `FallbackModel` wrapper**. The wrapper is only added when a list is
non-empty.

### Enhanced fallback (optional)

Install `pydantic-deep` to get an enhanced fallback wrapper with **auth-error
filtering** (a 401/403 stops the chain instead of burning every model with the
same bad key) and **hop-counter reset** telemetry:

```bash
pip install pydantic-deep
```

Controlled by `use_deep_fallback` (default `True`): when `pydantic_deep` is
importable **and** a backend is available, the enhanced wrapper is used; it
silently degrades to a plain `FallbackModel` otherwise. No code changes needed.
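
To illustrate what auth-error filtering means in a fallback chain, here is a self-contained sketch. It does not use `pydantic-deep` or pydantic-ai: `FakeHTTPError` and `call_with_fallback` are hypothetical stand-ins, and each "model" is just a callable.

```python
from collections.abc import Callable


class FakeHTTPError(Exception):
    """Stand-in error carrying an HTTP status code (hypothetical)."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_transient(status_code: int) -> bool:
    """429 and 5xx are treated as transient; everything else is final."""
    return status_code == 429 or 500 <= status_code < 600


def call_with_fallback(models: list[Callable[[], str]]) -> str:
    """Try each model in order, falling through only on transient errors."""
    last_error: FakeHTTPError | None = None
    for call in models:
        try:
            return call()
        except FakeHTTPError as exc:
            if not is_transient(exc.status_code):
                # Includes 401/403: a bad key stops the chain immediately.
                raise
            last_error = exc  # transient: try the next model
    if last_error is None:
        raise ValueError("empty model chain")
    raise last_error


def rate_limited() -> str:
    raise FakeHTTPError(429)


def healthy() -> str:
    return "report"


result = call_with_fallback([rate_limited, healthy])
print(result)
```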

---

## Context windows & max tokens

Each role has a max-token budget that drives compression and research-depth
budgets:

| Field | Env var | Default |
|-------|---------|---------|
| `planner_max_tokens` | `PLANNER_MAX_TOKENS` | `None` (auto) |
| `researcher_max_tokens` | `RESEARCHER_MAX_TOKENS` | `None` (auto) |
| `writer_max_tokens` | `WRITER_MAX_TOKENS` | `None` (auto) |

When `None`, the budget is **auto-computed from the smallest context window in
the whole chain** (primary + fallbacks), so a mixed chain (e.g. 200K Claude +
64K DeepSeek) safely uses the 64K floor. Context windows are looked up from
`genai-prices` (`ModelInfo.context_window`); unknown models fall back to a
conservative 128K default. The auto-computed value is 80% of that minimum,
leaving 20% headroom. Set a field explicitly to override the auto-computation.

```bash
# Explicit overrides (otherwise auto-computed)
WRITER_MAX_TOKENS=8000
```
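
The auto-computation described above amounts to "take the smallest window in the chain, then keep 80% of it". A small sketch of that arithmetic follows. The function name, the lookup table, and reading "128K" and "64K" as 128,000 and 64,000 tokens are assumptions; the real values come from `genai-prices`.

```python
DEFAULT_CONTEXT_WINDOW = 128_000  # assumed reading of the "128K" default


def auto_max_tokens(chain: list[str], known_windows: dict[str, int]) -> int:
    """Return 80% of the smallest context window in the chain (hypothetical helper)."""
    windows = [known_windows.get(model, DEFAULT_CONTEXT_WINDOW) for model in chain]
    # Integer arithmetic keeps the result exact.
    return min(windows) * 80 // 100


# Illustrative window sizes only.
windows = {"anthropic:claude-opus-4-6": 200_000, "deepseek:small": 64_000}
budget = auto_max_tokens(["anthropic:claude-opus-4-6", "deepseek:small"], windows)
print(budget)  # 51200
```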

---

## DuckDuckGo proxy rotation

DuckDuckGo (the default retriever, no API key required) can rotate through a
list of proxies to avoid rate limits. Create a `proxies.txt` (one proxy per
line, `#` comments):

```
# HTTP proxies
http://user:pass@proxy1.example.com:8080
# SOCKS5
socks5://127.0.0.1:9050
# Tor shortcut (the ddgs library expands this to socks5h://127.0.0.1:9150)
tb
```

Point the pipeline at it:

| Field | Env var | Default |
|-------|---------|---------|
| `ddgs_proxy_file` | `DDGS_PROXY_FILE` | `None` (direct connection) |

```bash
DDGS_PROXY_FILE="./proxies.txt"
```

The rotator does **thread-safe round-robin** with per-proxy health checking:
3 consecutive failures → 60s cooldown → auto-retry later. On each search it
tries the next healthy proxy; on failure it records the failure against that
proxy and retries with the next one. If every proxy is cooling down, it falls
back to a direct connection, and if that also fails the search returns `[]`
(graceful degradation). Each research branch gets its own rotator instance to
avoid lock contention.

---

## MCP presets

Short names for common MCP servers, expanded into full configs at build time.
**No presets are enabled by default** — `mcp_presets` defaults to an empty
list. Opt in explicitly:

| Field | Env var | Default |
|-------|---------|---------|
| `mcp_presets` | `MCP_PRESETS` | `[]` (none) |

```python
config = Config(mcp_presets=["tavily", "firecrawl", "playwright"])
```

Available presets (all default to Streamable HTTP transport unless noted):

| Preset | Needs API key | Env var | Notes |
|--------|--------------|---------|-------|
| `tavily` | yes | `TAVILY_API_KEY` | Web search |
| `exa` | yes | `EXA_API_KEY` | Web search |
| `brave` | yes | `BRAVE_API_KEY` | Web search |
| `firecrawl` | yes | `FIRECRAWL_API_KEY` | Web extraction / scraping |
| `jina` | yes | `JINA_API_KEY` | Web extraction |
| `arxiv` | no | — | Academic search |
| `pubmed` | no | — | Biomedical literature |
| `semantic_scholar` | no | — | Academic search |
| `playwright` | no | — | Browser automation (HTTP). **Requires you to run the server separately:** `npx @playwright/mcp@latest --http --port 8080` |
| `obscura` | no | — | Stealth browser automation (**stdio**, auto-spawned). Defaults to the binary at `~/obscura`. See [Obscura](#obscura-browser-automation) below |
| `filesystem` | no | — | Local documents |
| `fetch` | no | — | URL fetch |

> **Note:** no browser-based scraper is enabled by default. The two
> browser-capable presets are `playwright` and `obscura`, and both must be
> explicitly enabled (e.g. `mcp_presets=["obscura"]`). Without that, the default
> retriever is DuckDuckGo and the default scraper is BeautifulSoup (`bs`).

Presets are defaults, not constants — override any of them by passing a raw
entry with the same `name` in `Config.mcp_configs`.
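
One way to picture the override rule is a merge keyed on `name`. The sketch below assumes a raw entry replaces the whole preset rather than being merged field by field; that detail, the helper name, and the preset contents shown (including the `example.invalid` URL) are illustrative assumptions, not the package's actual data.

```python
def merge_mcp_configs(presets: list[dict], overrides: list[dict]) -> list[dict]:
    """Raw entries replace presets with the same `name`; new names are appended."""
    merged = {cfg["name"]: cfg for cfg in presets}
    for cfg in overrides:
        merged[cfg["name"]] = cfg  # same name: the raw entry wins wholesale
    return list(merged.values())


# Placeholder preset expansions for illustration only.
expanded_presets = [
    {"name": "obscura", "command": "~/obscura", "args": ["mcp"]},
    {"name": "fetch", "connection_url": "https://mcp.example.invalid/fetch"},
]
raw_entries = [
    {"name": "obscura", "command": "/usr/local/bin/obscura", "args": ["mcp"]},
]
final = merge_mcp_configs(expanded_presets, raw_entries)
print([cfg["name"] for cfg in final])
```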

### Obscura (browser automation)

[Obscura](https://github.com/) ships an MCP server that exposes browser
automation tools to AI agents. The `obscura` preset uses **stdio** transport
(Obscura's documented default), so the pipeline auto-spawns the server
subprocess — no separate `obscura mcp --http` step needed.

**Prerequisite:** place the Obscura binary at `~/obscura` (the default the
preset looks for). A leading `~` is expanded to your home directory at build
time, so there's no hardcoded username. If the binary lives elsewhere, override
the preset's `command` (see below).
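
The `~` expansion mentioned above behaves like Python's standard `os.path.expanduser`. A short sketch follows; whether the package calls this exact function is an assumption, and `expand_command` is a hypothetical helper name.

```python
import os


def expand_command(command: str) -> str:
    """Expand a leading `~` to the current user's home directory."""
    return os.path.expanduser(command)


resolved = expand_command("~/obscura")
absolute = expand_command("/usr/local/bin/obscura")  # unchanged: no leading `~`
print(resolved)
```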

```python
config = Config(mcp_presets=["obscura"])
```

```bash
MCP_PRESETS="obscura"
```

**Tools exposed** (same surface as `playwright`, plus a few extras):

| Tool | Description |
|------|-------------|
| `browser_navigate` | Navigate to a URL (`url`, optional `waitUntil`: `load` / `domcontentloaded` / `networkidle0`) |
| `browser_snapshot` | Return the current page URL, title, and body text |
| `browser_click` | Click an element by CSS selector |
| `browser_fill` | Set an input value (triggers `input` + `change` events) |
| `browser_type` | Append text to an input |
| `browser_press_key` | Dispatch a keyboard event (`key`, optional `selector`) |
| `browser_select_option` | Select an `<option>` by value or text |
| `browser_evaluate` | Evaluate a JavaScript expression and return the result |
| `browser_wait_for` | Wait for a CSS selector to appear (`selector`, optional `timeout` in seconds) |
| `browser_network_requests` | List network requests made by the current page |
| `browser_console_messages` | Return console messages logged by the page |
| `browser_close` | Close the page and reset browser state |

**Optional flags** — Obscura supports `--proxy <URL>`, `--user-agent <UA>`, and
`--stealth` (anti-detection mode). Override the preset in `mcp_configs` to pass
them (or to point at a different binary path):

```python
# Custom binary path
config = Config(
    mcp_presets=["obscura"],
    mcp_configs=[{"name": "obscura", "command": "/usr/local/bin/obscura", "args": ["mcp"]}],
)

# Stealth mode (anti-detection)
config = Config(
    mcp_presets=["obscura"],
    mcp_configs=[{"name": "obscura", "command": "~/obscura", "args": ["mcp", "--stealth"]}],
)

# Route through a proxy + custom User-Agent + stealth
config = Config(
    mcp_presets=["obscura"],
    mcp_configs=[{
        "name": "obscura",
        "command": "~/obscura",
        "args": ["mcp", "--proxy", "socks5://127.0.0.1:9050",
                 "--user-agent", "Mozilla/5.0 ...", "--stealth"],
    }],
)
```

**HTTP transport instead of stdio** — run the server yourself and override with
a `connection_url`:

```bash
# Terminal 1: start the HTTP server
~/obscura mcp --http --port 8080   # endpoint: http://127.0.0.1:8080/mcp
```

```python
# Terminal 2: point the pipeline at it
config = Config(
    mcp_presets=["obscura"],
    mcp_configs=[{
        "name": "obscura",
        "connection_url": "http://127.0.0.1:8080/mcp",
        "transport": "streamable_http",
    }],
)
```

---

## Development

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.

## License

MIT License - see [LICENSE](LICENSE) for details.

## Changelog

See [CHANGELOG.md](CHANGELOG.md) for version history.
