Metadata-Version: 2.4
Name: twat-search
Version: 2.7.13
Summary: Web search plugin for twat
Project-URL: Documentation, https://github.com/twardoch/twat-search#readme
Project-URL: Issues, https://github.com/twardoch/twat-search/issues
Project-URL: Source, https://github.com/twardoch/twat-search
Author-email: Adam Twardoch <adam+github@twardoch.com>
License-File: LICENSE
Keywords: api,plugin,search,search-engine,twat,web,web-api,web-scraper,web-scraper-api,web-scraper-api-client,web-scraping,web-search
Classifier: Development Status :: 4 - Beta
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Requires-Python: >=3.10
Requires-Dist: beautifulsoup4>=4.13.0
Requires-Dist: fire>=0.7.0
Requires-Dist: httpx>=0.28.1
Requires-Dist: klepto>=0.2.6
Requires-Dist: pydantic-settings>=2.8.1
Requires-Dist: pydantic>=2.10.6
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: requests>=2.32.3
Requires-Dist: rich>=13.9.4
Requires-Dist: twat-cache[all]>=2.6.7
Requires-Dist: twat>=2.7.0
Provides-Extra: all
Requires-Dist: duckduckgo-search>=7.5.0; extra == 'all'
Requires-Dist: googlesearch-python>=1.3.0; extra == 'all'
Requires-Dist: lxml>=5.3.1; extra == 'all'
Requires-Dist: playwright>=1.50.0; extra == 'all'
Requires-Dist: scrape-bing>=0.1.2.1; extra == 'all'
Requires-Dist: serpapi>=0.1.5; extra == 'all'
Requires-Dist: tavily-python>=0.5.1; extra == 'all'
Provides-Extra: bing-scraper
Requires-Dist: scrape-bing>=0.1.2.1; extra == 'bing-scraper'
Provides-Extra: brave
Provides-Extra: dev
Requires-Dist: absolufy-imports>=0.3.1; extra == 'dev'
Requires-Dist: isort>=6.0.1; extra == 'dev'
Requires-Dist: mypy>=1.15.0; extra == 'dev'
Requires-Dist: pre-commit>=4.1.0; extra == 'dev'
Requires-Dist: pyupgrade>=3.19.1; extra == 'dev'
Requires-Dist: ruff>=0.9.9; extra == 'dev'
Requires-Dist: uv>=0.1.18; extra == 'dev'
Provides-Extra: docs
Requires-Dist: myst-parser>=4.0.1; extra == 'docs'
Requires-Dist: sphinx-autodoc-typehints>=3.1.0; extra == 'docs'
Requires-Dist: sphinx-rtd-theme>=3.0.2; extra == 'docs'
Requires-Dist: sphinx>=8.3.0; extra == 'docs'
Provides-Extra: duckduckgo
Requires-Dist: duckduckgo-search>=7.5.0; extra == 'duckduckgo'
Provides-Extra: falla
Requires-Dist: lxml>=5.3.1; extra == 'falla'
Requires-Dist: playwright>=1.50.0; extra == 'falla'
Provides-Extra: google-scraper
Requires-Dist: googlesearch-python>=1.3.0; extra == 'google-scraper'
Provides-Extra: hasdata
Provides-Extra: pplx
Provides-Extra: serpapi
Requires-Dist: serpapi>=0.1.5; extra == 'serpapi'
Provides-Extra: tavily
Requires-Dist: tavily-python>=0.5.1; extra == 'tavily'
Provides-Extra: test
Requires-Dist: coverage[toml]>=7.6.12; extra == 'test'
Requires-Dist: pytest-asyncio>=0.25.3; extra == 'test'
Requires-Dist: pytest-benchmark[histogram]>=5.1.0; extra == 'test'
Requires-Dist: pytest-cov>=6.0.0; extra == 'test'
Requires-Dist: pytest-xdist>=3.6.1; extra == 'test'
Requires-Dist: pytest>=8.3.5; extra == 'test'
Description-Content-Type: text/markdown

---
this_file: README.md
---

# twat-search 2.0

`twat-search` is the search layer for the `twat` toolchain: one Python package
and CLI for API search, browser search, scraper search, LLM-assisted search,
metasearch, enrichment, deduplication, and evidence capture.

Version 2.0 treats search as a routing problem. A query can fan out through
official APIs, unofficial endpoints, browser automation, direct HTML scrapers,
and LLM research providers, then return one normalized result stream with
source provenance, ranking metadata, raw payloads, and reproducible diagnostics.
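
The merge step behind "one normalized result stream" can be sketched as below. This is only an illustration: `Hit`, `canonical_url`, and `merge_hits` are hypothetical names, and the canonicalization rule (lowercase host, no fragment, no trailing slash) is an assumption, not the package's documented behavior.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit, urlunsplit


@dataclass
class Hit:
    """Hypothetical normalized result; not the package's real model."""

    title: str
    url: str
    source: str
    seen_in: list[str] = field(default_factory=list)  # provenance: engines that returned this URL


def canonical_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def merge_hits(batches: dict[str, list[Hit]]) -> list[Hit]:
    """Merge per-engine batches, keeping the first hit seen for each canonical URL."""
    merged: dict[str, Hit] = {}
    for engine, hits in batches.items():
        for hit in hits:
            kept = merged.setdefault(canonical_url(hit.url), hit)
            if engine not in kept.seen_in:
                kept.seen_in.append(engine)
    return list(merged.values())
```

Keeping the first hit and appending later engines to `seen_in` preserves provenance without duplicating rows.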

## Install

```bash
uv pip install "twat-search[all]"
```

For local development:

```bash
uv sync --all-extras
uvx hatch test
```

## CLI

```bash
twat-search web q "Adam Twardoch" -e brave,serpapi,duckduckgo -n 5 --json
twat-search web info --plain
twat search web q "font engineering news" --route resilient --json
```

The `twat` host package exposes this plugin as `twat search`.

## Python API

```python
import asyncio

from twat_search.web.api import search, search_detailed


async def main() -> None:
    results = await search(
        "best current browser automation anti-bot patterns",
        engines=["brave", "google_serpapi", "duckduckgo"],
        num_results=5,
    )
    for result in results:
        print(result.source, result.title, result.url)

    detailed = await search_detailed("same query", route="resilient")
    for failure in detailed.failures:
        print(failure.engine, failure.kind, failure.message)


asyncio.run(main())
```

## Engines

2.0 supports four backend families:

- API engines: Brave, Brave News, SerpAPI, HasData, Tavily, You.com, You.com
  News, Perplexity, Critique, Google CSE, Serper, DataForSEO, Exa, Firecrawl,
  Jina, Search1API, Gensearch, and other configured providers.
- Scraper engines: DuckDuckGo, Google, Bing, qBittorrent-style plugin scrapers,
  GitHub/code search providers, and specialized dork engines.
- Browser engines: Playwright-controlled Google, Bing, Qwant, Yahoo, Yandex,
  Mojeek, Gibiru, and future browser-to-API adapters.
- LLM engines: user-selected OpenAI-compatible models for query planning,
  reranking, source critique, summary extraction, and answer synthesis.

Engines are independent adapters behind a common `SearchEngine` contract. A
failure in one adapter is reported as data and does not poison the whole run.
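
A minimal sketch of "failure reported as data", assuming a contract with a single async `search` method. The `SearchEngine` protocol shape, the `Failure` record, and both toy engines are hypothetical; only the `engine`, `kind`, and `message` field names mirror the Python API example above.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


class SearchEngine(Protocol):
    """Assumed shape of the adapter contract; the real one may differ."""

    name: str

    async def search(self, query: str) -> list[str]: ...


@dataclass
class Failure:
    engine: str
    kind: str
    message: str


class OkEngine:
    name = "ok"

    async def search(self, query: str) -> list[str]:
        return [f"https://example.com/?q={query}"]


class BrokenEngine:
    name = "broken"

    async def search(self, query: str) -> list[str]:
        raise TimeoutError("upstream timed out")


async def fan_out(query: str, engines: list[SearchEngine]) -> tuple[list[str], list[Failure]]:
    """Run every engine concurrently; convert exceptions into Failure records."""
    outcomes = await asyncio.gather(
        *(engine.search(query) for engine in engines),
        return_exceptions=True,  # one failing adapter must not cancel the others
    )
    urls: list[str] = []
    failures: list[Failure] = []
    for engine, outcome in zip(engines, outcomes):
        if isinstance(outcome, BaseException):
            failures.append(Failure(engine.name, type(outcome).__name__, str(outcome)))
        else:
            urls.extend(outcome)
    return urls, failures
```

Calling `asyncio.run(fan_out("fonts", [OkEngine(), BrokenEngine()]))` returns the URL from the working engine alongside a single `Failure` for the broken one.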

## Configuration

Configuration is loaded from `.env`, environment variables, optional JSON
config, and explicit Python arguments. API keys are discovered from
provider-specific variable names when possible.

Provider keys already present in `/Users/adam/.env.anon.txt` and targeted for
2.0 integration include:

- `BRAVE_API_KEY`
- `SERPAPI_API_KEY`
- `TAVILY_API_KEY`
- `HASDATA_API_KEY`
- `GOOGLE_CSE_API_KEY` and `GOOGLE_CSE_ID`
- `DATAFORSEO_API_KEY`
- `SERPER_API_KEY`
- `EXAAI_API_KEY`
- `FIRECRAWL_API_KEY`
- `JINA_API_KEY`
- `GENSEE_SEARCH_API_KEY`
- `SEARCH1API_KEY`
- `AISEARCH_API_KEY`
- `APIFY_API_KEY`
- `BRIGHTDATA_API_KEY`, `DECODO_API_KEY`, and Webshare proxy variables
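
Key discovery could look roughly like the sketch below. The variable names come from the list above; the provider labels, the `REQUIRED_VARS` table, and `configured_providers` are hypothetical and cover only a few providers.

```python
from collections.abc import Mapping

# Provider labels are illustrative; variable names are taken from the list above.
REQUIRED_VARS: dict[str, tuple[str, ...]] = {
    "brave": ("BRAVE_API_KEY",),
    "serpapi": ("SERPAPI_API_KEY",),
    "tavily": ("TAVILY_API_KEY",),
    "hasdata": ("HASDATA_API_KEY",),
    "google_cse": ("GOOGLE_CSE_API_KEY", "GOOGLE_CSE_ID"),  # needs both values
    "serper": ("SERPER_API_KEY",),
}


def configured_providers(env: Mapping[str, str]) -> list[str]:
    """Return providers whose required variables are all set and non-blank."""
    return [
        name
        for name, needed in REQUIRED_VARS.items()
        if all(env.get(var, "").strip() for var in needed)
    ]
```

Passing `os.environ` as `env` checks the live process environment; passing a plain dict keeps the check testable.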

### Rotating Proxies

`twat-search` understands Webshare-style rotating proxy variables:

```bash
export WEBSHARE_PROXY_USER="user"
export WEBSHARE_PROXY_PASS="pass"
export WEBSHARE_DOMAIN_NAME="p.webshare.io"
export WEBSHARE_PROXY_PORT="80"
```

When configured, HTTP and browser engines can use the proxy URL for resilient
scraping, relaxed pacing, and parallel browser sessions. Proxy use is opt-in
per route or engine, so API-backed searches connect directly by default.

### LLM Routing

LLM use is provider-neutral. Any OpenAI-compatible endpoint can be selected for
query rewriting, provider choice, result reranking, or synthesis:

```bash
export TWAT_SEARCH_LLM_PROVIDER="openai-compatible"
export TWAT_SEARCH_LLM_MODEL="gpt-5-mini"
export TWAT_SEARCH_LLM_API_KEY="$OPENAI_API_KEY"
export TWAT_SEARCH_LLM_BASE_URL="https://api.openai.com/v1"
export TWAT_SEARCH_LLM_QUERY_REWRITE="true"
export TWAT_SEARCH_LLM_RESULT_RERANK="true"
export TWAT_SEARCH_LLM_ANSWER_SYNTHESIS="true"
```

The package stores LLM decisions as provenance, not hidden magic. A result set
can show whether a title/snippet came from the source, an extractor, or a model.
Callers can override configured query rewriting per request with
`rewrite_query=True` or `rewrite_query=False`, and result reranking with
`rerank_results=True` or `rerank_results=False`. Answer synthesis is similarly
explicit via `synthesize_answer=True` or `synthesize_answer=False`; synthesized
answers cite result URLs and keep provider failures visible.

## Result Model

Every result carries:

- normalized `title`, `url`, `snippet`, `source`, and `rank`
- optional raw provider payload
- detailed `SearchResponse` data with the parsed `SearchRequest`, per-engine
  `EngineOutcome` records, and structured `SearchFailure` objects
- future engine timing, proxy, and route metadata
- future content artifacts such as fetched HTML, extracted text, screenshots,
  citations, and LLM summaries

## Reference Integrations

The 2.0 design borrows from the local `private/` reference projects instead of
adding dependencies blindly:

- `private/ytrix`: Webshare proxy URL construction, proxy-aware pacing, and
  parallelism decisions.
- `private/crapi`: browser lifecycle, stealth, browser-to-API, and proxy manager
  patterns.
- `private/abersetz`: OpenAI-compatible provider catalog and local/offline LLM
  selection patterns.
- `private/flchimp`: explicit validation, dirty-data handling, and operational
  CLI ergonomics.
- `private/TO_INTEGRATE.md`: external search project ideas including plugin
  search, deep search, GitHub search, dorks, and open deep research.

## Development Contract

2.0 is allowed to break compatibility with the current 0.x code. The priority
is a smaller, typed, testable core:

- clean engine registry
- strict provider capability metadata
- explicit proxy and browser configuration
- normalized result and error data
- no obsolete Falla-specific mess in the public API
- tests for each adapter boundary before live provider tests

Run the normal check before handing off changes:

```bash
uvx hatch test
```
