Metadata-Version: 2.4
Name: raysearch
Version: 0.1.0
Summary: Omni meta-search engine for agentic AI.
Author-email: Kotodama <jameswjj0416@gmail.com>
License-Expression: Apache-2.0
Keywords: search,serp,research,rag,crawler,llm
Classifier: Development Status :: 3 - Alpha
Classifier: Framework :: AsyncIO
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: <4,>=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: anyio==4.9.0
Requires-Dist: httpx<1,>=0.27
Requires-Dist: jsonschema>=4.23
Requires-Dist: pydantic<3,>=2.7
Requires-Dist: jieba
Requires-Dist: pyyaml
Requires-Dist: typing_extensions<5,>=4.10
Provides-Extra: extract
Requires-Dist: beautifulsoup4; extra == "extract"
Requires-Dist: selectolax>=0.3.26; extra == "extract"
Requires-Dist: trafilatura>=1.10.0; extra == "extract"
Requires-Dist: html-to-markdown>=2.28.0; extra == "extract"
Provides-Extra: extract-pdf
Requires-Dist: pymupdf>=1.24.0; extra == "extract-pdf"
Requires-Dist: pymupdf4llm>=0.3.4; extra == "extract-pdf"
Requires-Dist: pypdf>=5.2.0; extra == "extract-pdf"
Provides-Extra: extract-plus
Requires-Dist: inscriptis>=2.5.0; extra == "extract-plus"
Provides-Extra: crawl
Requires-Dist: curl_cffi; extra == "crawl"
Requires-Dist: playwright>=1.49.1; extra == "crawl"
Requires-Dist: beautifulsoup4; extra == "crawl"
Provides-Extra: rank
Requires-Dist: rank-bm25; extra == "rank"
Requires-Dist: scikit-learn>=1.8.0; extra == "rank"
Requires-Dist: sentence-transformers>=5.2.3; extra == "rank"
Provides-Extra: cache
Requires-Dist: aiosqlite>=0.22.1; extra == "cache"
Requires-Dist: aioredis>=2.0.1; extra == "cache"
Requires-Dist: aiomysql>=0.3.2; extra == "cache"
Requires-Dist: asyncmy>=0.2.11; extra == "cache"
Requires-Dist: sqlalchemy>=2.0.46; extra == "cache"
Provides-Extra: api
Requires-Dist: fastapi<1,>=0.110; extra == "api"
Requires-Dist: uvicorn[standard]<1,>=0.27; extra == "api"
Provides-Extra: overview
Requires-Dist: openai>=2.17.0; extra == "overview"
Requires-Dist: google-genai>=1.63.0; extra == "overview"
Requires-Dist: dashscope>=1.0.0; extra == "overview"
Provides-Extra: to-zh-tw
Requires-Dist: opencc>=1.2.0; extra == "to-zh-tw"
Provides-Extra: stopwords
Requires-Dist: marisa-trie>=1.3.1; extra == "stopwords"
Provides-Extra: tracking
Requires-Dist: structlog>=24.1.0; extra == "tracking"
Provides-Extra: full
Requires-Dist: raysearch[api,cache,crawl,extract,extract_pdf,overview,rank,tracking]; extra == "full"
Dynamic: license-file

# RaySearch

RaySearch is an async-first search orchestration engine for building AI-overview style workflows on top of multiple providers, crawlers, extractors, rankers, and LLM backends.

It exposes four high-level pipelines:

- `search`: multi-provider retrieval with optional fetch and rerank stages
- `fetch`: page crawling, content extraction, abstract generation, overview synthesis, and related-link discovery
- `answer`: search plus grounded answer generation with citations
- `research`: multi-round research reports with synthesis and structured output

## Why RaySearch

- Component-based architecture with pluggable providers, crawlers, extractors, rankers, caches, and LLM clients
- Async-only runtime with a single `Engine` entry point
- YAML/JSON settings loader plus environment injection for provider and model secrets
- Built-in tracking and metering sinks for observability
- Designed for search-heavy and research-heavy agent workflows rather than chat-only use cases

## Installation

Core install:

```bash
uv pip install raysearch
```

Common full install (equivalent to the `full` extra):

```bash
uv pip install "raysearch[extract,extract-pdf,crawl,rank,cache,api,overview,tracking]"
```

or simply:

```bash
uv pip install "raysearch[full]"
```

When using Playwright-based crawling, install browser binaries separately:

```bash
playwright install
```

## Public API

```python
from raysearch import Engine, SearchRequest, load_settings
```

Primary entry points:

- `load_settings(path=None, env=None)`
- `Engine.from_settings(setting_file=None, *, settings=None, overrides=None)`
- `await engine.search(request)`
- `await engine.fetch(request)`
- `await engine.answer(request)`
- `await engine.research(request)`

## Quick Start

```python
import asyncio

from raysearch import Engine, SearchRequest

async def main() -> None:
    async with Engine.from_settings("demo/search_config_example.yaml") as engine:
        response = await engine.search(
            SearchRequest(
                query="latest multimodal model papers",
                mode="deep",
                max_results=8,
            )
        )
        for item in response.results:
            print(item.title, item.url)

asyncio.run(main())
```

## Configuration

RaySearch loads settings in this order:

1. Explicit `path` passed to `load_settings(...)`
2. `RAYSEARCH_CONFIG_PATH`
3. `raysearch.yaml`
4. In-code defaults

The main configuration groups are:

- `components`: provider, crawl, extract, rank, llm, cache, tracking, metering, http, and rate limiting
- `telemetry`: tracking and metering emitter behavior
- `search`: search-mode profiles and query-expansion behavior
- `fetch`: extraction, abstract, and overview tuning
- `answer`: planning and generation model selection
- `research`: report-generation budgets and model routing
- `runner`: concurrency and queue limits

Component families use a simple default-plus-instance shape:

```yaml
components:
  provider:
    default: google
    google:
      enabled: true
      cookies:
        CONSENT: "YES+"
    duckduckgo:
      enabled: true
      base_url: https://html.duckduckgo.com/html
      allow_redirects: false
```

Reference configuration:

- `demo/search_config_example.yaml`

## Providers And Pipelines

Built-in provider coverage includes:

- `google`
- `google_news`
- `duckduckgo`
- `searxng`
- `github`
- `reddit`
- `reuters`
- `openalex`
- `semantic_scholar`
- `wikidata`
- `wikipedia`
- `arxiv`
- `marginalia`
- `blend` for combining providers

Built-in pipeline support includes:

- Search result expansion and reranking
- Markdown-first fetch extraction
- Abstract generation and page overview synthesis
- Citation-grounded answer generation
- Multi-round research report generation

## Environment Variables

The loader preserves the full process environment in `AppSettings.runtime_env`, and component config models pull values from there as needed.

Common examples:

- `OPENAI_API_KEY`
- `OPENAI_BASE_URL`
- `GEMINI_API_KEY`
- `GEMINI_BASE_URL`
- `DASHSCOPE_API_KEY`
- `DASHSCOPE_BASE_URL`
- Provider-specific overrides such as `GITHUB_TOKEN` or `SEARXNG_BASE_URL`
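
Because the loader snapshots the whole process environment, exporting secrets before engine startup is sufficient. A minimal sketch (the values below are placeholders, and the local SearXNG URL is a hypothetical example):

```python
import os

# Set secrets before creating the engine; the settings loader snapshots the
# entire process environment into AppSettings.runtime_env.
os.environ.setdefault("OPENAI_API_KEY", "placeholder-not-a-real-key")
os.environ.setdefault("SEARXNG_BASE_URL", "http://localhost:8080")

runtime_env = dict(os.environ)  # what the loader preserves
assert "OPENAI_API_KEY" in runtime_env
```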

## Tracking And Metering

Tracking and metering are configured independently from the request pipelines.

Default artifact names follow the package name:

- tracking JSONL: `.raysearch_tracking.jsonl`
- metering JSONL: `.raysearch_metering.jsonl`
- metering SQLite: `.raysearch_metering.sqlite3`
- cache SQLite: `.raysearch_cache.sqlite3`
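
The `.jsonl` sinks are plain JSON-lines files, so they can be inspected with a few lines of standard-library code. A sketch (the per-event schema is not documented here):

```python
import json
from pathlib import Path

def read_sink(path: str) -> list[dict]:
    """Load a JSONL sink file: one JSON object per non-empty line."""
    p = Path(path)
    if not p.exists():
        return []
    return [json.loads(line) for line in p.read_text().splitlines() if line.strip()]

# e.g. inspect the last few metering events:
for event in read_sink(".raysearch_metering.jsonl")[-10:]:
    print(event)
```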

## Development

The repo includes runnable demos:

- `demo/search.py`
- `demo/fetch.py`
- `demo/answer.py`
- `demo/research.py`

Example settings:

- `demo/search_config_example.yaml`

## Notes

- `search.mode` supports `fast`, `auto`, and `deep`
- RaySearch is async-only
- Component discovery loads from `raysearch.components`
- JS-heavy crawling requires Playwright plus installed browsers
