Metadata-Version: 2.4
Name: edgeless-smart-scraper
Version: 0.1.0
Summary: Intelligent web scraping with automatic backend selection.
License: MIT License
        
        Copyright (c) 2024 Edgeless Labs
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/edgeless-ai/smart-scraper
Project-URL: Repository, https://github.com/edgeless-ai/smart-scraper
Project-URL: Issues, https://github.com/edgeless-ai/smart-scraper/issues
Keywords: scraping,web scraper,firecrawl,tavily,beautifulsoup
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.28
Requires-Dist: beautifulsoup4>=4.11
Provides-Extra: firecrawl
Requires-Dist: firecrawl-py>=1.0; extra == "firecrawl"
Provides-Extra: tavily
Requires-Dist: tavily-python>=0.3; extra == "tavily"
Provides-Extra: all
Requires-Dist: firecrawl-py>=1.0; extra == "all"
Requires-Dist: tavily-python>=0.3; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Requires-Dist: mypy>=1.5; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Requires-Dist: types-requests>=2.28; extra == "dev"
Requires-Dist: types-beautifulsoup4>=4.11; extra == "dev"
Dynamic: license-file

[![CI](https://github.com/edgeless-ai/smart-scraper/actions/workflows/ci.yml/badge.svg)](https://github.com/edgeless-ai/smart-scraper/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.8%2B-blue)](https://python.org)

# smart-scraper

Intelligent web scraping for Python with automatic backend selection.

Chooses the right scraping strategy for every URL — lightweight static fetch,
cloud-based JavaScript rendering, or search-powered extraction — so you do not
have to think about it.

## Features

- **Zero-config core** — works out of the box with `requests` + `beautifulsoup4`
- **Automatic backend selection** — static HTML, JS-heavy SPAs, and paywalled content each get the best tool
- **Optional cloud backends** — Firecrawl and Tavily activate when their SDK and API key are present
- **Graceful degradation** — always returns a `ScrapeResult`, never raises on network errors
- **Clean Markdown output** — navigation, footers, scripts, and styles are stripped automatically
- **Typed API** — full type annotations, compatible with mypy strict mode

## Install

```bash
# Core (no API keys required)
pip install edgeless-smart-scraper

# With Firecrawl (JS-heavy sites)
pip install "edgeless-smart-scraper[firecrawl]"

# With Tavily (search + paywalled content)
pip install "edgeless-smart-scraper[tavily]"

# Everything
pip install "edgeless-smart-scraper[all]"
```

## Quick start

```python
from smart_scraper import scrape_url

# Automatic backend selection — just pass a URL
result = scrape_url("https://docs.python.org/3/library/json.html")

if result.success:
    print(result.title)    # "json — JSON encoder and decoder"
    print(result.content)  # clean Markdown text
else:
    print(result.error)
```

## Backend comparison

| Backend | Best for | Dependencies | Free tier |
|---------|----------|--------------|-----------|
| **basic** | Static HTML, docs, blogs, GitHub raw files | `requests`, `beautifulsoup4` | Unlimited |
| **firecrawl** | SPAs, React/Next.js, social platforms, anti-bot sites | `firecrawl-py` + API key | 500 scrapes |
| **tavily** | Research queries, paywalled/login-walled content | `tavily-python` + API key | 1,000 credits/month |

## Automatic selection logic

`scrape_url()` runs the following decision tree before making any network call:

1. **URL pattern match** — raw file hosts (`raw.githubusercontent.com`, `pastebin.com/raw/`, `arxiv.org/abs/`) always use the basic backend regardless of what is installed.
2. **Paywall domains** — `wsj.com`, `ft.com`, `nytimes.com`, etc. prefer Tavily (its cached access often bypasses paywalls). Falls back to basic when Tavily is not configured.
3. **JS-heavy domains** — `medium.com`, `substack.com`, `notion.so`, `twitter.com`, etc. prefer Firecrawl. Falls back to basic when Firecrawl is not configured.
4. **Unknown domains** — prefer Firecrawl (highest quality), then Tavily, then basic.
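For illustration, the decision tree above can be sketched as a plain function. This is a simplified, hypothetical sketch: the `choose_backend` name and the domain tuples are illustrative stand-ins, and the library's actual logic lives in the `BackendSelector` covered later in this README.

```python
from urllib.parse import urlparse

# Illustrative domain lists drawn from the rules above; the real
# selector covers more domains and reads configuration at runtime.
RAW_HOSTS = ("raw.githubusercontent.com",)
RAW_PREFIXES = ("pastebin.com/raw/", "arxiv.org/abs/")
PAYWALL_DOMAINS = ("wsj.com", "ft.com", "nytimes.com")
JS_HEAVY_DOMAINS = ("medium.com", "substack.com", "notion.so", "twitter.com")


def choose_backend(url: str, firecrawl: bool, tavily: bool) -> str:
    parsed = urlparse(url)
    host = parsed.netloc
    if host.startswith("www."):
        host = host[4:]
    path = host + parsed.path

    def matches(domains: tuple) -> bool:
        return any(host == d or host.endswith("." + d) for d in domains)

    # 1. Raw-file hosts always use the basic backend.
    if host in RAW_HOSTS or any(path.startswith(p) for p in RAW_PREFIXES):
        return "basic"
    # 2. Paywalled domains prefer Tavily, falling back to basic.
    if matches(PAYWALL_DOMAINS):
        return "tavily" if tavily else "basic"
    # 3. JS-heavy domains prefer Firecrawl, falling back to basic.
    if matches(JS_HEAVY_DOMAINS):
        return "firecrawl" if firecrawl else "basic"
    # 4. Unknown domains: Firecrawl, then Tavily, then basic.
    if firecrawl:
        return "firecrawl"
    if tavily:
        return "tavily"
    return "basic"
```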

Override automatic selection at any time:

```python
# Force a specific backend
result = scrape_url("https://example.com", backend="firecrawl")
result = scrape_url("https://example.com", backend="tavily")
result = scrape_url("https://example.com", backend="basic")

# Shorthand flags
result = scrape_url("https://notion.so/page", force_firecrawl=True)
result = scrape_url("https://wsj.com/article", force_tavily=True)
```

## API reference

### `scrape_url(url, *, backend=None, force_firecrawl=False, force_tavily=False, only_main_content=True, timeout=30, firecrawl_api_key=None, tavily_api_key=None)`

Scrape a URL and return a `ScrapeResult`. This function always returns a result object and never raises on network or backend errors; failures are reported via `success=False` and `result.error`.

**Parameters**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | `str` | — | Fully-qualified URL |
| `backend` | `str \| Backend \| None` | `None` | Force a backend: `"basic"`, `"firecrawl"`, `"tavily"`, `"auto"` |
| `force_firecrawl` | `bool` | `False` | Shorthand for `backend="firecrawl"` |
| `force_tavily` | `bool` | `False` | Shorthand for `backend="tavily"` |
| `only_main_content` | `bool` | `True` | Strip nav/footer/sidebar (Firecrawl only) |
| `timeout` | `int` | `30` | Request timeout in seconds |
| `firecrawl_api_key` | `str \| None` | `None` | Per-call key override (reads env var otherwise) |
| `tavily_api_key` | `str \| None` | `None` | Per-call key override (reads env var otherwise) |

### `ScrapeResult`

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScrapeResult:
    url: str
    content: str           # Markdown text
    title: Optional[str]
    metadata: Optional[dict]
    source: str            # "basic" | "firecrawl" | "tavily"
    success: bool
    error: Optional[str]

    def __bool__(self) -> bool:
        # Lets `if result:` stand in for `if result.success:`
        return self.success
```

`bool(result)` returns `result.success` for convenient conditional checks.

## Using backends directly

### Basic backend

```python
from smart_scraper.backends.basic import BasicBackend

backend = BasicBackend(timeout=15)
result = backend.fetch("https://example.com")
print(result.content)
```

### Firecrawl backend

```python
from smart_scraper.backends.firecrawl import FirecrawlBackend

backend = FirecrawlBackend()  # reads FIRECRAWL_API_KEY

# Single page
result = backend.fetch("https://medium.com/@author/article", only_main_content=True)

# Crawl a site
pages = backend.crawl("https://docs.example.com", max_pages=20, max_depth=2)

# Structured extraction
schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}},
    },
}
data = backend.extract_structured("https://example.com", schema=schema)
```

### Tavily backend

```python
from smart_scraper.backends.tavily import TavilyBackend

backend = TavilyBackend()  # reads TAVILY_API_KEY

# Extract a specific URL
result = backend.fetch("https://wsj.com/articles/some-story")

# Search the web
response = backend.search("Python web scraping 2024", max_results=5, include_answer=True)
for item in response.results:
    print(item.title, item.url)

# Get an AI-generated answer
answer = backend.qna_search("What is the capital of France?")
```

### Inspecting backend selection

```python
from smart_scraper.selector import BackendSelector, Backend

selector = BackendSelector(firecrawl_available=True, tavily_available=False)
chosen = selector.select("https://medium.com/@author/post")
print(chosen)  # Backend.FIRECRAWL

reason = selector.explain("https://medium.com/@author/post")
print(reason)  # "Backend 'firecrawl' selected: domain requires JavaScript..."
```

## Environment variables

Copy `.env.example` to `.env` and fill in your keys:

```
FIRECRAWL_API_KEY=your_firecrawl_key_here
TAVILY_API_KEY=your_tavily_key_here
```

Keys are read at runtime via `os.environ`. Use `python-dotenv` or your preferred
env-management tool to load them:

```python
from dotenv import load_dotenv
load_dotenv()

from smart_scraper import scrape_url
result = scrape_url("https://notion.so/page")
```

## Running tests

```bash
# Install dev dependencies
pip install "edgeless-smart-scraper[dev]"

# Run offline tests (no network, no API keys)
pytest

# Include live network tests (requires internet)
pytest --live

# With coverage
pytest --cov=src --cov-report=term-missing
```

## Contributing

1. Fork the repo and create a feature branch.
2. Install dev dependencies: `pip install -e ".[dev]"`
3. Run linting: `ruff check . && mypy src/`
4. Run tests: `pytest`
5. Open a pull request — all CI checks must pass.

## License

MIT — see [LICENSE](LICENSE).
