Metadata-Version: 2.4
Name: ssscraper
Version: 0.1.0
Summary: Async web scraping library for SiteSudharo — downloads resources to local/S3/R2 with full metadata
Project-URL: Homepage, https://github.com/sitesudharo/ssscraper
Project-URL: Documentation, https://sitesudharo.github.io/ssscraper
Project-URL: Repository, https://github.com/sitesudharo/ssscraper
Project-URL: Bug Tracker, https://github.com/sitesudharo/ssscraper/issues
Project-URL: Changelog, https://github.com/sitesudharo/ssscraper/blob/main/CHANGELOG.md
Author-email: SiteSudharo <gagandeep.2020@gmail.com>
License: MIT
License-File: LICENSE
Keywords: async,crawl4ai,r2,s3,scraping,sitesudharo,web
Classifier: Development Status :: 3 - Alpha
Classifier: Framework :: AsyncIO
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: aioboto3>=12.0.0
Requires-Dist: aiofiles>=23.0.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: lxml>=5.0.0
Requires-Dist: markdownify>=0.13.0
Requires-Dist: patchright>=0.4.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-httpx>=0.30.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-gen-files>=0.5.0; extra == 'docs'
Requires-Dist: mkdocs-literate-nav>=0.6.0; extra == 'docs'
Requires-Dist: mkdocs-material>=9.5.0; extra == 'docs'
Requires-Dist: mkdocs>=1.6.0; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.25.0; extra == 'docs'
Description-Content-Type: text/markdown

# SSscraper

Async web scraping library for [SiteSudharo](https://sitesudharo.com).  
Built on [Crawl4ai](https://github.com/unclecode/crawl4ai), it downloads page resources to **local disk**, **AWS S3**, or **Cloudflare R2** and returns structured metadata for every file.

## Features

- Full-page scraping via Crawl4ai (headless browser)
- Downloads: images, CSS, fonts, JS, documents, video, audio
- Storage backends: Local · S3 · R2 (drop-in swappable)
- Per-resource metadata: URL, stored path, content-type, size, MD5 + SHA-256
- Async and concurrent, with configurable parallelism
- Retry with backoff and a per-file size cap
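
The concurrency and retry features above follow a common asyncio pattern. The block below is a minimal, standard-library-only sketch of that pattern, not the library's actual implementation: `retry_with_backoff`, `fetch_all`, and all of their parameters are hypothetical names chosen for illustration.

```python
import asyncio
import random


async def retry_with_backoff(make_attempt, retries=3, base_delay=0.5):
    """Await make_attempt() up to `retries` times, waiting longer after each failure."""
    for attempt in range(retries):
        try:
            return await make_attempt()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            # Exponential backoff (base, 2x base, 4x base, ...) plus a little jitter
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay / 10)
            await asyncio.sleep(delay)


async def fetch_all(urls, fetch, max_concurrency=5, retries=3, base_delay=0.5):
    """Run fetch(url) for every URL with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await retry_with_backoff(lambda: fetch(url), retries, base_delay)

    # return_exceptions=True keeps one failed URL from cancelling the others;
    # failures come back as exception objects in the result list.
    return await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)
```

Here `fetch` is any coroutine function taking a URL. Results are returned in the same order as `urls`.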

## Quick start

```python
import asyncio
from ssscraper import SScraper, LocalStorage, ScraperConfig

async def main():
    # Store downloads under ./downloads and fetch images, CSS, and fonts
    scraper = SScraper(
        storage=LocalStorage("./downloads"),
        config=ScraperConfig(download_images=True, download_css=True, download_fonts=True),
    )
    result = await scraper.scrape("https://example.com")
    print(f"Downloaded {len(result.succeeded)} resources")
    # Dump the metadata of every resource (including failures) as JSON
    for r in result.resources:
        print(r.model_dump_json())

asyncio.run(main())
```

## Storage backends

### Local
```python
from ssscraper import LocalStorage
storage = LocalStorage(base_dir="./downloads")
```

### AWS S3
```python
from ssscraper import S3Storage
storage = S3Storage(
    bucket="my-bucket",
    prefix="sitesudharo",
    region="us-east-1",
    aws_access_key_id="...",
    aws_secret_access_key="...",
)
```

### Cloudflare R2
```python
from ssscraper import R2Storage
storage = R2Storage(
    bucket="my-bucket",
    account_id="<cf-account-id>",
    access_key_id="...",
    secret_access_key="...",
    prefix="sitesudharo",
    public_domain="assets.yourdomain.com",  # optional
)
```
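
The cloud backends above take credentials as keyword arguments. To keep secrets out of source code, one option is to read them from the environment. The sketch below assumes the standard AWS variable names `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION`; the `SSSCRAPER_BUCKET` variable and the `s3_kwargs_from_env` helper are hypothetical and not part of the library.

```python
import os


def s3_kwargs_from_env(env=os.environ):
    """Build the S3Storage keyword arguments shown above from environment variables."""
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "SSSCRAPER_BUCKET")
    missing = [name for name in required if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "bucket": env["SSSCRAPER_BUCKET"],  # hypothetical variable name
        "region": env.get("AWS_DEFAULT_REGION", "us-east-1"),
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }


# Intended use, assuming the S3Storage signature shown above:
# storage = S3Storage(prefix="sitesudharo", **s3_kwargs_from_env())
```

Since `python-dotenv` is already a dependency, calling `load_dotenv()` before this helper would let the same variables come from a local `.env` file.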

## ScrapeResult shape

```python
result.page            # PageMetadata — url, title, description, scraped_at
result.html            # raw HTML
result.markdown        # Crawl4ai markdown
result.resources       # List[ResourceMetadata]
result.succeeded       # filter: status == SUCCESS
result.failed          # filter: status == FAILED
result.images          # filter: type == image
result.stylesheets     # filter: type == css
result.fonts           # filter: type == font
```
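
As an illustration of how these attributes might be combined, the sketch below builds a short report from a result. It relies only on `succeeded`, `failed`, and the `original_url`, `size_bytes`, and `error` fields documented in this README; the `summarize` helper itself is hypothetical.

```python
def summarize(result):
    """Return report lines for a ScrapeResult-like object."""
    # Treat a missing size as 0 so the total never fails on None
    total_bytes = sum(r.size_bytes or 0 for r in result.succeeded)
    lines = [
        f"{len(result.succeeded)} succeeded, "
        f"{len(result.failed)} failed, {total_bytes} bytes stored"
    ]
    # One line per failed download, with the recorded error message
    for r in result.failed:
        lines.append(f"FAILED {r.original_url}: {r.error}")
    return lines
```

Calling `print("\n".join(summarize(result)))` after a scrape prints a one-line total followed by one line per failure.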

### ResourceMetadata fields
| Field | Type | Description |
|---|---|---|
| `original_url` | str | Source URL |
| `storage_key` | str | Relative key inside the storage backend |
| `stored_path` | str | Absolute local path or full cloud URL |
| `resource_type` | ResourceType | image / css / javascript / font / … |
| `content_type` | str | HTTP Content-Type |
| `size_bytes` | int | File size |
| `checksum_md5` | str | MD5 hex digest |
| `checksum_sha256` | str | SHA-256 hex digest |
| `status` | ResourceStatus | success / failed / skipped |
| `error` | str | Error message if failed |
| `downloaded_at` | datetime | UTC timestamp |
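
The two checksum fields make it possible to confirm later that a stored file is unchanged. The sketch below recomputes both digests with the standard library's `hashlib` and compares them with the recorded values. It assumes the file is readable from the local filesystem (as with `LocalStorage`, where `stored_path` is a local path); `file_digests` and `matches_metadata` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def file_digests(path, chunk_size=1 << 20):
    """Return (md5_hex, sha256_hex) for a file, reading it in chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read fixed-size chunks so large files are never loaded whole
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()


def matches_metadata(path, checksum_md5, checksum_sha256):
    """True if the file at `path` still matches both recorded hex digests."""
    md5_hex, sha256_hex = file_digests(path)
    return md5_hex == checksum_md5.lower() and sha256_hex == checksum_sha256.lower()
```

For a locally stored resource `r`, the check would be `matches_metadata(r.stored_path, r.checksum_md5, r.checksum_sha256)`.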

## Install

```bash
pip install -e .          # from source
pip install ssscraper     # once published
```

Python ≥ 3.11 required.

## Config options

See [`ssscraper/config.py`](ssscraper/config.py) for the full list. Every field is optional and has a default.
