Metadata-Version: 2.4
Name: omniscout
Version: 0.1.0
Summary: OmniScout CLI (harness): local-first browser automation, semantic search, and research for AI agents
Project-URL: Homepage, https://github.com/sriramramnath/omniscout
Project-URL: Repository, https://github.com/sriramramnath/omniscout
Project-URL: Issues, https://github.com/sriramramnath/omniscout/issues
Author: OmniScout
License: MIT
Keywords: agent,cli,playwright,research,scraping,search
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Requires-Dist: aiohttp>=3.9
Requires-Dist: httpx>=0.27
Requires-Dist: markdownify>=0.13
Requires-Dist: nltk>=3.8
Requires-Dist: platformdirs>=4.2
Requires-Dist: playwright>=1.45
Requires-Dist: pydantic>=2.7
Requires-Dist: qdrant-client>=1.9
Requires-Dist: rich>=13.7
Requires-Dist: selectolax>=0.3.21
Requires-Dist: sentence-transformers>=2.7
Requires-Dist: sumy>=0.11
Requires-Dist: tomli>=2.0; python_version < '3.11'
Requires-Dist: trafilatura>=1.12
Requires-Dist: typer>=0.12
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: respx>=0.21; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Description-Content-Type: text/markdown

# scout — OmniScout CLI

Local-first browser automation, semantic search, and research for AI agents.
No cloud APIs, no hosted browser sessions, no MCP, no SDK.

The CLI is the interface.

## Install

Requires Python 3.11+ and Google Chrome (already installed on most macOS
machines at `/Applications/Google Chrome.app`).

### Recommended: install as a global tool

```bash
cd cli
uv tool install --editable .   # creates ~/.local/bin/omniscout on PATH
scout install              # verifies Chrome + prefetches embedding model
```

After this, `scout` works from any directory. Edits to source files are
picked up live (editable install).

`omniscout` and `harness` remain available as compatibility aliases.

### Alternative: project venv

If you prefer not to install globally:

```bash
cd cli
uv venv --python 3.11 .venv
uv pip install -e ".[dev]"
source .venv/bin/activate
scout install
```

If you don't have Chrome installed, add `--bundled` to also download
Playwright's bundled Chromium (~190MB).

`scout install` also prefetches the local sentence-transformers model into
OmniScout's app data directory so later commands do not need to fetch it again.
Use `--no-model` to skip model prefetch.

## Quickstart

```bash
# Search the web (DuckDuckGo HTML + local embedding rerank)
scout search "local-first browser agents"
# same command via alias:
scout search "local-first browser agents"

# Extract a URL to clean Markdown
scout extract https://example.com

# Capture a screenshot of a real page using your installed Chrome
scout browser screenshot https://example.com --out page.png

# Run a multi-step research pipeline (search -> crawl -> extract -> rerank -> summarize)
scout research "state of local AI agents in 2026"

# Manage persistent browser profiles (cookies, logins persist across runs)
scout profile create work
scout browser open https://news.ycombinator.com --profile work --headful

# Long-lived browser sessions (other tools can attach via CDP)
scout session start --headful
scout session list
scout session kill --all
```

## JSON output (for agents)

Every command emits structured JSON when invoked with `--json` (or with
`HARNESS_JSON=1` in the environment). Logs always go to stderr; stdout is
reserved for the structured result.

```bash
HARNESS_JSON=1 scout search "robotics simulators" --limit 5
```

## Architecture

```
harness/
  app.py              # Typer root
  commands/           # CLI sub-commands (thin)
  engines/
    browser.py        # Playwright + system Chrome
    extractor.py      # trafilatura + markdownify
    crawler.py        # async httpx + Chrome fallback
    search/
      ddg.py          # DuckDuckGo HTML
      embed.py        # sentence-transformers (all-MiniLM-L6-v2)
      index.py        # embedded Qdrant on-disk
      rerank.py       # cosine rerank
      pipeline.py     # ddg | index | hybrid
    research.py       # full pipeline (search -> crawl -> extract -> rerank -> summarize)
  store/
    cache.py          # SQLite + content-hashed HTML cache
    sessions.py       # SQLite registry of browser sessions
  models.py           # pydantic result types (the JSON contract)
```

On-disk state lives under `~/Library/Application Support/harness/` (macOS) /
`$XDG_DATA_HOME/harness/` (Linux):

| Path             | Purpose                                            |
| ---------------- | -------------------------------------------------- |
| `profiles/`      | Persistent Chrome user-data-dirs                   |
| `qdrant/`        | Embedded vector index                              |
| `sessions.sqlite`| Registry of long-lived browser sessions            |
| `cache/pages/`   | Content-hashed HTML cache used by extract+crawler |

Override via `HARNESS_DATA_DIR`, `HARNESS_CONFIG_DIR`, `HARNESS_CACHE_DIR`,
or settings in `~/Library/Application Support/harness/config.toml`.

## Configuration

`config.toml` example:

```toml
default_source = "ddg"           # search source default
search_limit = 10
research_results = 8
request_throttle_seconds = 1.0   # per-host throttle in the crawler
embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
embedding_local_only = true         # default; never fetch model files at query time
browser_channel = "chrome"       # uses installed Google Chrome
# browser_executable = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
summary_sentences = 6
```

Set `HARNESS_EMBED_LOCAL_ONLY=0` to allow runtime Hugging Face fetches.

## Testing

```bash
.venv/bin/pytest
```

Tests run offline — search, extract, and the research pipeline are all
exercised against saved HTML fixtures and patched network seams.

## Why local Chrome?

Using your system Chrome (channel = "chrome") gives you:

- Real cookies, login state, extensions, and font rendering
- No extra ~190MB Chromium download
- The same user-agent fingerprint as your daily browsing
- Cleaner integration with `omniscout session start` for long-lived sessions
  that other tools can attach to over CDP

If Chrome isn't available, the engine transparently falls back to Playwright's
bundled Chromium.
