Metadata-Version: 2.4
Name: web-search-tool
Version: 0.1.0
Summary: Unified CLI over Tavily, Firecrawl, and SerpAPI web-search/content SDKs
Project-URL: Homepage, https://github.com/vladistan/web-search-tool
Project-URL: Repository, https://github.com/vladistan/web-search-tool
Project-URL: Bug Tracker, https://github.com/vladistan/web-search-tool/issues
Author-email: Vlad Korolev <vlad@v-lad.org>
License: MIT
License-File: LICENSE
Keywords: agent-tools,cli,firecrawl,research,scraping,search,serpapi,tavily,web-search
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Requires-Python: >=3.13
Requires-Dist: firecrawl-py>=1.0.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: markdownify>=0.13.0
Requires-Dist: pydantic-settings>=2.2.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: rich>=13.0.0
Requires-Dist: sentry-sdk>=1.40.0
Requires-Dist: serpapi>=0.1.5
Requires-Dist: structlog>=24.0
Requires-Dist: tavily-python>=0.5.0
Requires-Dist: typer>=0.12.0
Provides-Extra: dev
Requires-Dist: mypy>=1.13; extra == 'dev'
Requires-Dist: pre-commit>=4.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.8; extra == 'dev'
Description-Content-Type: text/markdown

# web-search-tool

A unified [Typer](https://typer.tiangolo.com/) CLI over the **Tavily**,
**Firecrawl**, and **SerpAPI** web-search/content SDKs — one command surface and
one normalized output schema for humans, scripts, and agents.

- **One schema, every provider.** `search` and every vertical return the same
  envelope shape (`title`, `url`, `snippet`, `score`) plus documented
  vertical-specific fields, regardless of which provider served the request.
- **A resilient fetch cascade.** `fetch` walks `markdownify → tavily → firecrawl`,
  falling through on thin content, hard HTTP errors, and unsupported sites.
- **Agent-friendly.** Deterministic exit codes, JSON when piped, an explicit
  escalation signal on total fetch failure, and byte-bounded/chunked output.

## Install

```bash
cd web-search-tool
uv sync --extra dev
uv run web-search-tool --version
```

## Configuration

Settings resolve with the precedence **flag > env > config file > default**.

- **Tool settings** use the `SEARCH_TOOL_` env prefix
  (e.g. `SEARCH_TOOL_DEFAULT_PROVIDER=tavily`, `SEARCH_TOOL_TIMEOUT=30`).
- **Provider keys** use their conventional names: `TAVILY_API_KEY`,
  `FIRECRAWL_API_KEY`, `SERPAPI_API_KEY`.
- Set a key to `SKIP` to opt a provider out of launch validation. Requesting a
  vertical that *only* that provider serves (e.g. `scholar`/`jobs` with
  `SERPAPI_API_KEY=SKIP`) fails with a specific SKIP-conflict error.
- An optional TOML config lives at `~/.config/web-search-tool/config.toml`
  (override with `WEB_SEARCH_TOOL_CONFIG` or the `--config` flag). See
  [`config.example.toml`](config.example.toml) for every tunable.

```bash
# Minimal: keys in the environment
export SERPAPI_API_KEY=...        # required for the SerpAPI verticals
export TAVILY_API_KEY=...         # search, research, one cascade converter
export FIRECRAWL_API_KEY=SKIP     # opt out if you don't have a Firecrawl key
```

| Setting | Env var | Default | Purpose |
|---------|---------|---------|---------|
| Default provider | `SEARCH_TOOL_DEFAULT_PROVIDER` | `serpapi` | Provider for generic `search`/`resolve` |
| Timeout | `SEARCH_TOOL_TIMEOUT` | `20` | Per-request timeout (seconds) |
| Default format | `SEARCH_TOOL_DEFAULT_FORMAT` | _(auto)_ | `table` or `json`; auto-detects TTY when unset |
| Tavily key | `TAVILY_API_KEY` | — | Tavily search + deep research |
| Firecrawl key | `FIRECRAWL_API_KEY` | — | Firecrawl search + scrape converter |
| SerpAPI key | `SERPAPI_API_KEY` | — | All seven SerpAPI verticals |
| Sentry DSN | `SENTRY_DSN` | _(off)_ | Enables error tracking + tracing when set |
| Sentry environment | `SENTRY_ENVIRONMENT` | `local` | Deployment tag on Sentry events |
| Trace sample rate | `SENTRY_TRACES_SAMPLE_RATE` | `1.0` | Transaction sampling, 0.0–1.0 |

The Sentry settings also live under a `[monitoring]` table in the config file and
follow the same precedence. See [Monitoring](#monitoring).

## Output contract

- **stdout carries only the payload.** All logs go to **stderr**; API keys never
  appear in logs, output, or written files.
- **Format**: a TTY renders a rich table; a pipe emits JSON. `--format/-f
  table|json` overrides both; `--compact` removes JSON indentation.
- Every payload is wrapped in an **envelope** — `{command, ok, data}` — so
  callers can branch on `ok` without parsing the body.
- **Exit codes** are deterministic and category-specific:

  | Code | Name | Meaning |
  |------|------|---------|
  | 0 | SUCCESS | Completed |
  | 1 | GENERAL | Unclassified error |
  | 2 | USAGE | Bad CLI usage |
  | 3 | INPUT | Bad input / no URL resolved |
  | 4 | NOT_FOUND | No results |
  | 5 | NETWORK | Network/provider failure |
  | 6 | TIMEOUT | Bounded wait exceeded |
  | 7 | CONFIG | Missing/SKIP key, bad config |

## Commands

### Search & verticals

```bash
web-search-tool search "python typer cli" --limit 10 [--provider tavily|firecrawl|serpapi]
web-search-tool news "us treasury yields" -n 5
web-search-tool jobs "platform engineer" --location "Boston, MA"
web-search-tool images "golden gate bridge"
web-search-tool videos "rust async tutorial"
web-search-tool reverse-image "https://example.com/photo.jpg"
web-search-tool scholar "continual learning llm" \
    --author "Bengio" --min-cites 50 --since 2020 --until 2024 --sort cites
```

All search commands accept `--limit/-n`, `--format/-f`, and `--compact`.
`--limit` is enforced uniformly — SerpAPI verticals are capped client-side.

**Scholar filters** (`scholar` only):

| Flag | Effect |
|------|--------|
| `--author/-a NAME` | Native `author:"NAME"` operator |
| `--min-cites N` | Keep only results cited ≥ N times (client-side over a pool) |
| `--since YEAR` / `--until YEAR` | Native publication-year range |
| `--sort relevance\|date\|cites` | `relevance` (default), most-recent, or most-cited |

### Fetch — URL → markdown via the cascade

```bash
web-search-tool fetch "https://example.com/article"        # probe: first success wins
web-search-tool fetch URL --compare                        # all converters, labeled
web-search-tool fetch URL --raw-html                        # local markdownify only
web-search-tool fetch URL --order markdownify,firecrawl     # custom chain
web-search-tool fetch URL --max-bytes 20000 --offset 0      # bounded / chunked window
web-search-tool fetch URL --stdout                          # stream instead of writing a file
```

- Default mode is a **probe**: walk the chain, first success wins; total failure
  raises chain-exhaustion carrying the exact signal
  `Fetch chain completely failed, try using agent-browser`.
- `--max-bytes` overflow truncates and appends the marker
  `*OUTPUT TRUNCATED PLS INCREASE THE CAP OR DO CHUNKED REQUEST*`; combine with
  `--offset` for deterministic chunked continuation.
- Without `--stdout`, output is written to a collision-safe slug file
  (`<UTC-timestamp>-<slug>.md`) in `--out-dir` (default: current directory).

### Resolve — search, then fetch the top-N URLs

```bash
web-search-tool resolve "best rust web framework" --limit 5
web-search-tool resolve "fed rate decision" --vertical news -n 3
web-search-tool resolve "q" --stdout
```

Runs a search over `--vertical` (default `search`), then fetches each result URL
through the cascade in **auto-resolve** mode: a URL that exhausts its chain
records an escalation but never aborts the batch. Exits `0` if *any* URL
resolved, `3` (INPUT) only if all failed.

### Research — deep cited research via Tavily

```bash
web-search-tool research run "history of CRISPR patents"            # returns request_id
web-search-tool research run "q" --file notes.md --file data.json   # attach local context
web-search-tool research run "q" --wait --max-wait 300              # poll to completion
web-search-tool research status <request_id>                        # look up once
```

`research run` returns a `request_id` immediately; `--wait` polls until a
terminal status or `--max-wait` (then exits TIMEOUT with a resume hint).
`research status` reports status and, on completion, the cited markdown report
with numbered sources. `--file` attaches up to 5 local `.txt`/`.md`/`.json`
files as research context.

## Monitoring

Sentry is **off by default** and fully optional — set a DSN to turn it on. When
enabled, the tool reports errors and wraps each invocation in a transaction with
child spans around every outbound call (provider search, each fetch-cascade tier,
research launch/poll), tagged with non-secret data (provider, tier, status,
result counts). API key values are never attached to spans, tags, or events.

Configure it via env or the `[monitoring]` table in the config file (same
precedence as everything else):

```bash
export SENTRY_DSN="https://<key>@<org>.ingest.sentry.io/<project>"
export SENTRY_ENVIRONMENT=prod          # optional, defaults to "local"
export SENTRY_TRACES_SAMPLE_RATE=0.2    # optional, defaults to 1.0
```

Verify the setup end to end — emits a span tree plus one synthetic, harmless
exception, then prints the captured event id:

```bash
web-search-tool test sentry
```

It exits with the config code (7) when no DSN is configured (nothing to test).

## Development

```bash
uv run pytest                          # unit tests (live tests skipped without keys)
uv run pytest -m live                  # opt-in tests against real provider APIs
uv run ruff check . && uv run ruff format --check .
uv run mypy --strict src/web_search_tool
uv run pytest --cov=web_search_tool --cov-report=term-missing
```

See [CONTRIBUTING.md](CONTRIBUTING.md) for the full workflow.

## License

[MIT](LICENSE) © Vlad Korolev
