Metadata-Version: 2.4
Name: matrx-scraper
Version: 0.1.0
Summary: Web scraping engine, HTML parsing, and search integration for the Matrx ecosystem
Project-URL: Homepage, https://github.com/AI-Matrix-Engine/aidream-current
Project-URL: Repository, https://github.com/AI-Matrix-Engine/aidream-current
Project-URL: Issues, https://github.com/AI-Matrix-Engine/aidream-current/issues
Author-email: Matrx <admin@aimatrx.com>
Maintainer-email: Matrx <admin@aimatrx.com>
License: MIT
Keywords: crawler,html,matrx,parsing,scraping,search,web-scraping
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup :: HTML
Requires-Python: >=3.12
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: httpx>=0.27
Requires-Dist: markdownify>=0.13
Requires-Dist: matrx-utils>=1.0.20
Requires-Dist: python-dotenv>=1.0
Requires-Dist: selectolax>=0.3.21
Requires-Dist: tabulate>=0.9
Requires-Dist: tldextract>=5.1
Provides-Extra: all
Requires-Dist: asyncpg>=0.31.0; extra == 'all'
Requires-Dist: cachetools>=5.3; extra == 'all'
Requires-Dist: curl-cffi>=0.7; extra == 'all'
Requires-Dist: datasketch>=1.6; extra == 'all'
Requires-Dist: extruct>=0.18; extra == 'all'
Requires-Dist: fastapi>=0.115; extra == 'all'
Requires-Dist: matrx-connect>=0.1.1; extra == 'all'
Requires-Dist: matrx-graph>=0.1.0; extra == 'all'
Requires-Dist: pillow>=11.0; extra == 'all'
Requires-Dist: playwright>=1.45; extra == 'all'
Requires-Dist: pymupdf>=1.24; extra == 'all'
Requires-Dist: pymupdfb>=1.24; extra == 'all'
Requires-Dist: pytesseract>=0.3.13; extra == 'all'
Requires-Dist: python-dotenv>=1.0; extra == 'all'
Requires-Dist: simhash>=2.1; extra == 'all'
Requires-Dist: uvicorn[standard]>=0.30; extra == 'all'
Provides-Extra: browser
Requires-Dist: curl-cffi>=0.7; extra == 'browser'
Requires-Dist: playwright>=1.45; extra == 'browser'
Provides-Extra: connect
Requires-Dist: matrx-connect>=0.1.1; extra == 'connect'
Provides-Extra: dedup
Requires-Dist: cachetools>=5.3; extra == 'dedup'
Requires-Dist: datasketch>=1.6; extra == 'dedup'
Requires-Dist: simhash>=2.1; extra == 'dedup'
Provides-Extra: graph
Requires-Dist: matrx-graph>=0.1.0; extra == 'graph'
Provides-Extra: metadata
Requires-Dist: extruct>=0.18; extra == 'metadata'
Provides-Extra: ocr
Requires-Dist: pillow>=11.0; extra == 'ocr'
Requires-Dist: pytesseract>=0.3.13; extra == 'ocr'
Provides-Extra: pdf
Requires-Dist: pymupdf>=1.24; extra == 'pdf'
Requires-Dist: pymupdfb>=1.24; extra == 'pdf'
Provides-Extra: server
Requires-Dist: asyncpg>=0.31.0; extra == 'server'
Requires-Dist: curl-cffi>=0.7; extra == 'server'
Requires-Dist: fastapi>=0.115; extra == 'server'
Requires-Dist: matrx-connect>=0.1.1; extra == 'server'
Requires-Dist: playwright>=1.45; extra == 'server'
Requires-Dist: python-dotenv>=1.0; extra == 'server'
Requires-Dist: uvicorn[standard]>=0.30; extra == 'server'
Description-Content-Type: text/markdown

# matrx-scraper

Web scraping + HTML parsing + site crawling + search client for Python. An 8-stage parser pipeline turns raw HTML into clean, AI-ready content plus structured extractions (tables, code blocks, links by category, metadata). The core install works standalone, fetching pages over plain HTTP with `httpx`; optional extras add headless browsing, PDF extraction, OCR, and a FastAPI server front end.

## Install

```bash
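pip install matrx-scraper                  # core: HTTP fetch + parse + crawl + Brave Search
pip install "matrx-scraper[browser]"       # + Playwright / curl_cffi for JS-rendered pages
pip install "matrx-scraper[pdf]"           # + PyMuPDF for PDF extraction
pip install "matrx-scraper[ocr]"           # + pytesseract / Pillow for OCR (Tesseract binary installed separately)
pip install "matrx-scraper[metadata]"      # + extruct for structured metadata
pip install "matrx-scraper[dedup]"         # + datasketch / simhash / cachetools
pip install "matrx-scraper[graph]"         # + matrx-graph
pip install "matrx-scraper[connect]"       # + matrx-connect (stream events to a Matrx app)
pip install "matrx-scraper[server]"        # + FastAPI server + uvicorn + asyncpg (also pulls in the browser and connect deps)
pip install "matrx-scraper[all]"           # everything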
```

Python 3.12+ required. Depends on `matrx-utils`; `matrx-connect` is optional.

## What's in the box

- **Scraping** (`matrx_scraper.scraper`, `matrx_scraper.orchestrator`): `scrape(url, **opts)`, `scrape_many(urls)`, `scrape_many_stream(urls)`, `ScrapeResult`, `ScrapeOptions`, `ScrapeService`.
- **Parser pipeline** (`matrx_scraper.parser`): 8-stage HTML pipeline — normalize → `NoiseRemover` → `ScrapeFilter` → `ElementExtractor` → `LinkExtractor` → metadata (extruct) → hashing (MinHash/SimHash) → markdown conversion. Entry points: `parse_html(html, **opts)` and `ParserOrchestrator`.
- **Crawling** (`matrx_scraper.crawler`): `crawl_site(base_url)`, `SiteCrawler` — async BFS site traversal, respects robots.txt.
- **Search** (`matrx_scraper.search`): `BraveSearchClient`.
- **Caching** (`matrx_scraper.cache`): `CacheBackend` with `MemoryCache` and `TwoTierCache` (memory + Postgres, via the optional server extras).
- **Per-URL / per-domain config** (`matrx_scraper.domain_config`): `DomainConfigBackend` — default is static, Postgres-backed variant available via the optional extras.
- **Browser automation** (optional): `PlaywrightBrowserPool`.
- **FastAPI server** (optional): `matrx-scraper` CLI at `server/__main__.py`; routers under `api/`.
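
The hashing stage listed above names MinHash and SimHash. As a rough illustration of what a SimHash fingerprint is, here is a minimal pure-Python sketch. It is a generic version of the technique, not this package's implementation; `simhash_fingerprint` and `hamming_distance` are hypothetical names chosen for the example.

```python
import hashlib
import re

BITS = 64  # fingerprint width; matches the 8-byte digest used below


def simhash_fingerprint(text: str) -> int:
    """Return a 64-bit SimHash fingerprint built from word-level tokens."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        # Use a stable hash: Python's built-in hash() is salted per process.
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        token_hash = int.from_bytes(digest, "big")
        for bit in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    # An output bit is set when the votes for that position are positive.
    return sum(1 << bit for bit in range(BITS) if weights[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    original = simhash_fingerprint("Web scraping and HTML parsing for Python")
    edited = simhash_fingerprint("Web scraping and HTML parsing in Python")
    print(hamming_distance(original, edited))
```

Texts that share most of their tokens tend to produce fingerprints that differ in only a few bits, so a small Hamming distance suggests near-duplicate pages. The cut-off is application-specific.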

## Usage

### One-off scrape

```python
import asyncio

from matrx_scraper import scrape


async def main() -> None:
    result = await scrape("https://example.com/article")
    print(result.title)
    print(result.ai_content)           # clean, AI-ready markdown
    print(result.links)                # categorized links
    print(result.tables)               # parsed tables
    print(result.organized_data)       # structured JSON of the page


asyncio.run(main())
```

`ScrapeResult` is a rich dataclass with ~20 fields: `url`, `success`, `content_type`, `title`, `ai_content`, `ai_research_content`, `markdown_renderable`, `organized_data`, `tables`, `code_blocks`, `links`, `hashes`, and more.

### Parse raw HTML (no HTTP)

```python
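from pathlib import Path

from matrx_scraper import parse_html

# Read the saved page from disk (UTF-8 assumed); no HTTP request is made.
parsed = parse_html(Path("page.html").read_text(encoding="utf-8"))
print(parsed.main_content)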
```

### Crawl a full site

```python
import asyncio
from matrx_scraper import crawl_site

async def main() -> None:
    async for page in crawl_site("https://example.com", max_pages=100):
        print(page.url, page.title)

asyncio.run(main())
```
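
To make "BFS site traversal, respects robots.txt" concrete, here is a minimal standard-library sketch of the idea. It is synchronous and walks a hypothetical in-memory link graph (`SITE`) instead of fetching pages, so it illustrates the traversal order and the robots check only. It is not how `SiteCrawler` is implemented, and `bfs_crawl`, `SITE`, `ROBOTS_TXT`, and the `example-bot` user agent are all made up for the example.

```python
from collections import deque
from urllib.robotparser import RobotFileParser

# Hypothetical site: URL -> outgoing links (stands in for fetched pages).
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/private/x"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/"],
    "https://example.com/b": [],
    "https://example.com/private/x": ["https://example.com/secret"],
}

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def bfs_crawl(start: str, max_pages: int = 100, user_agent: str = "example-bot") -> list[str]:
    """Visit pages breadth-first, skipping URLs that robots.txt disallows."""
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())

    seen = {start}          # URLs already queued, to avoid revisiting
    queue = deque([start])  # FIFO queue gives breadth-first order
    visited: list[str] = []

    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue  # disallowed: never fetched, so its links are never followed
        visited.append(url)
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


if __name__ == "__main__":
    for page_url in bfs_crawl("https://example.com/"):
        print(page_url)
```

Because `/private/x` is never fetched, the `/secret` page it links to is never discovered. A real crawler would also restrict traversal to the starting domain and normalize URLs before deduplicating.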

### Brave Search

```python
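import asyncio
import os

from matrx_scraper.search import BraveSearchClient

# The BRAVE_API_KEY environment variable name is an assumption; use your own settings source.
client = BraveSearchClient(api_key=os.environ["BRAVE_API_KEY"])
results = asyncio.run(client.search("matrx-scraper python"))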
```

## Integration with a Matrx host

When used inside a host that has `matrx-connect` available, you can stream scrape progress as typed events:

```python
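import matrx_scraper

# InfoPayload, WarningPayload, etc. are the event payload classes your host
# already uses with matrx-connect; import them from wherever the host defines them.
matrx_scraper.configure_ext(
    info_payload_cls=InfoPayload,
    warning_payload_cls=WarningPayload,
    # … other Matrx event types
)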
```

After this, `scrape_many_stream` and `ScrapeService` will emit `matrx-connect` event payloads. If `configure_ext` is not called, the package still works — it just doesn't emit stream events.
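
The "works either way" behaviour described above is a common optional-hook pattern. The sketch below shows one way such a bridge could look. It is an illustration under assumptions, not the package's internals: `EventBridge`, its methods, and the stand-in `InfoPayload` dataclass are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class InfoPayload:
    """Stand-in for a host-provided event payload type (hypothetical)."""

    message: str


class EventBridge:
    """Holds optional host hooks and emits nothing until configured."""

    def __init__(self) -> None:
        self._info_cls: Optional[type] = None
        self._sink: Optional[Callable[[Any], None]] = None

    def configure(self, info_payload_cls: type, sink: Callable[[Any], None]) -> None:
        """Register the host's payload class and a callable that receives events."""
        self._info_cls = info_payload_cls
        self._sink = sink

    def info(self, message: str) -> bool:
        """Emit an info event if configured; return whether one was sent."""
        if self._info_cls is None or self._sink is None:
            return False  # unconfigured: skip silently so callers keep working
        self._sink(self._info_cls(message=message))
        return True


if __name__ == "__main__":
    bridge = EventBridge()
    bridge.info("dropped, no host configured")

    received: list = []
    bridge.configure(InfoPayload, received.append)
    bridge.info("scrape started")
    print(received)
```

Keeping the payload classes injectable means the scraping code never has to import the host's event library directly.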

## Dependency posture

Core dependencies are a small set of well-known libraries (`httpx`, `beautifulsoup4`, `selectolax`, `markdownify`, `tldextract`, `tabulate`, `python-dotenv`) plus `matrx-utils`. All heavier dependencies (Playwright, PyMuPDF, Tesseract, FastAPI) live behind optional extras so lean installs stay lean.
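
Since the heavier features sit behind extras, it can be handy to check at runtime which ones are usable in the current environment. The following is a small standard-library sketch. The extra-to-module mapping is an assumption based on the import names those distributions normally expose, and `modules_available` and `installed_extras` are hypothetical helpers, not part of this package.

```python
from importlib.util import find_spec

# Extra name -> top-level modules it is expected to provide (assumed mapping).
EXTRA_MODULES = {
    "browser": ["playwright", "curl_cffi"],
    "ocr": ["PIL", "pytesseract"],
    "metadata": ["extruct"],
    "dedup": ["cachetools", "datasketch", "simhash"],
    "server": ["fastapi", "uvicorn", "asyncpg"],
}


def modules_available(modules: list[str]) -> bool:
    """Return True if every named top-level module can be located for import."""
    return all(find_spec(name) is not None for name in modules)


def installed_extras(extra_modules: dict[str, list[str]] = EXTRA_MODULES) -> dict[str, bool]:
    """Map each extra to whether all of its modules are importable."""
    return {extra: modules_available(mods) for extra, mods in extra_modules.items()}


if __name__ == "__main__":
    for extra, available in installed_extras().items():
        print(f"{extra}: {'available' if available else 'missing'}")
```

`find_spec` only locates a module without importing it, so the check stays cheap even for heavy packages such as Playwright.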

## Migration notes

This package replaces the legacy root-level `scraper/` folder in the aidream monorepo and parts of `research/`. Internal docs ([`MIGRATION_STATUS.md`](MIGRATION_STATUS.md), [`GAPS_TO_FIX.md`](GAPS_TO_FIX.md), [`LEGACY_AUDIT.md`](LEGACY_AUDIT.md), [`MIGRATION_GUIDE.md`](MIGRATION_GUIDE.md)) track what has been ported and what hasn't.

## Contributing

See [CLAUDE.md](CLAUDE.md) for package-specific rules. This package lives in the aidream monorepo at [github.com/AI-Matrix-Engine/aidream-current](https://github.com/AI-Matrix-Engine/aidream-current/tree/main/packages/matrx-scraper).

## License

MIT.
