Metadata-Version: 2.4
Name: shadowcrawler
Version: 4.1.3
Summary: ShadowCrawler — High-performance modular crawling framework
Author: Allan Mancera
License: ShadowCrawler — Copyright © 2024–2030 Allan Mancera
        This software is licensed under the Business Source License 1.1 (BUSL‑1.1).
        
        Business Source License 1.1
        
        Parameters
        Licensor: Allan Mancera
        Licensed Work: ShadowCrawler Web Crawling Framework
        Additional Use Grant: You may make use of the Licensed Work, provided that you do not use the Licensed Work for a Production Use.
        Change Date: 2030-11-16
        Change License: Apache License, Version 2.0
        
        Terms
        The Licensor hereby grants you the right to copy, modify, create derivative works, redistribute, and make non-production use of the Licensed Work. The Licensor may make an Additional Use Grant, above, permitting limited production use.
        
        Any use of the Licensed Work in violation of this license will automatically terminate your rights under this license for the current and all other versions of the Licensed Work.
        
        Production Use
        Production Use means any use of the Licensed Work in a manner primarily intended for or directed toward commercial advantage or monetary compensation.
        
        Change License
        On the Change Date, the Change License will apply to the Licensed Work. If the Change License is the Apache License, Version 2.0, the Licensed Work will be made available under that license on the Change Date.
        
        No Trademark License
        This license does not grant you any right in the trademarks, service marks, brand names or logos of the Licensor.
        
        Disclaimer
        THE LICENSED WORK IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE LICENSOR BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE LICENSED WORK OR THE USE OR OTHER DEALINGS IN THE LICENSED WORK.
        
Project-URL: Homepage, https://shadowcrawlerframework.itch.io/shadowcrawler
Project-URL: Documentation, https://shadowcrawlerframework.itch.io/shadowcrawler
Project-URL: Source, https://shadowcrawlerframework.itch.io/shadowcrawler
Project-URL: Issues, https://shadowcrawlerframework.itch.io/shadowcrawler
Keywords: crawler,scraping,playwright,framework,automation,spider
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: Free To Use But Restricted
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: beautifulsoup4
Requires-Dist: playwright
Dynamic: license-file

---

# ShadowCrawler  
A modern, domain‑aware, hybrid web crawling framework for Python

ShadowCrawler is a modular, extensible crawling framework designed for developers who want full control over how websites are fetched, parsed, and processed.  
It combines speed, modularity, and browser‑level extraction into a single, clean architecture.

---

## ❤️ Origin Story  
ShadowCrawler began as a small personal project — a quiet gift, a spark of affection — and unexpectedly grew into a full, production‑ready crawling framework.  
It was built with care, curiosity, and intention.  
It was originally created for my guiding star and built with the help of my AI copilot, a companion in code, clarity, and curiosity.

---

## ✨ Features

- Automatic domain detection  
- Hybrid fetcher (HTTP + Playwright)  
- Persistent authentication  
- Modular spiders  
- Media pipeline  
- Checkpointing  
- Full CLI toolkit  

---

## Requirements

- Python 3.10+  
- Playwright browser binaries, installed once after the package itself:  
  ```
  playwright install
  ```

---

## 🚀 Installation

```
pip install shadowcrawler
```

---

## ⚡ Quickstart

Run with automatic spider detection:

```
shadowcrawler run --url https://quotes.toscrape.com
```

Run with browser mode:

```
shadowcrawler run --url https://demoqa.com/login --browser
```

List spiders:

```
shadowcrawler spiders list
```

---

## 🕷 Creating a Spider

```python
from shadowcrawler.core.spider_base import SpiderBase

class QuotesSpider(SpiderBase):
    domain = "quotes.toscrape.com"

    async def parse(self, response):
        for quote in response.css(".quote"):
            yield {
                "text": quote.css(".text::text").get(),
                "author": quote.css(".author::text").get(),
            }
```
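
Because `parse` in the spider above is an async generator, whatever runs it has to consume it with `async for`. The sketch below shows that pattern in isolation. `FakeSpider`, `collect`, and the list used as a response are hypothetical stand-ins for illustration, not ShadowCrawler classes, so the real runner may drive spiders differently.

```python
import asyncio


class FakeSpider:
    """Hypothetical stand-in with the same shape as a spider: parse() is an async generator."""

    async def parse(self, response):
        # Here `response` is just a list of (text, author) pairs.
        for text, author in response:
            yield {"text": text, "author": author}


async def collect(spider, response):
    # `async for` drives the generator and receives each yielded item in turn.
    return [item async for item in spider.parse(response)]


items = asyncio.run(collect(FakeSpider(), [("Quote one", "Author A")]))
print(items)
```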

---

## 🔍 Domain Autodetection

ShadowCrawler automatically selects the spider whose `domain` attribute matches the URL you pass:

```
shadowcrawler run --url https://example.com/page
```

If your spider declares:

```python
domain = "example.com"
```

…it will be used automatically.
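
One way to picture this lookup is a small registry keyed on each spider's `domain` attribute. The sketch below is a minimal illustration under assumptions: `SPIDERS`, `pick_spider`, and the subdomain-matching rule are hypothetical, and ShadowCrawler's own matching logic may differ.

```python
from urllib.parse import urlparse


# Hypothetical stand-ins; real spiders would subclass SpiderBase.
class ExampleSpider:
    domain = "example.com"


class QuotesSpider:
    domain = "quotes.toscrape.com"


SPIDERS = [ExampleSpider, QuotesSpider]


def pick_spider(url, spiders=SPIDERS):
    """Return the first spider whose domain matches the URL's host, else None."""
    host = (urlparse(url).hostname or "").lower()
    for spider in spiders:
        domain = spider.domain.lower()
        # Assumption: subdomains such as www.example.com also match.
        if host == domain or host.endswith("." + domain):
            return spider
    return None


print(pick_spider("https://example.com/page").__name__)  # prints ExampleSpider
```

Matching on the parsed hostname rather than on a substring of the URL avoids false positives such as `notexample.com`.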

---

## 🌐 Fetch Modes

**HTTP Mode (default)**  
Fast, lightweight, ideal for most sites.

**Browser Mode (Playwright)**  
Used automatically when:

- login is required  
- the site is dynamic  
- the spider requests browser mode  
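
The three conditions above amount to a simple decision rule. The following sketch expresses it as a function. `FetchContext` and `choose_fetch_mode` are hypothetical names, and how ShadowCrawler actually detects a login wall or a dynamic site is not shown here.

```python
from dataclasses import dataclass


@dataclass
class FetchContext:
    """Hypothetical flags describing one crawl target."""

    login_required: bool = False
    dynamic_site: bool = False
    spider_wants_browser: bool = False


def choose_fetch_mode(ctx: FetchContext) -> str:
    # Any single condition is enough to switch to the browser;
    # otherwise the faster HTTP mode stays the default.
    if ctx.login_required or ctx.dynamic_site or ctx.spider_wants_browser:
        return "browser"
    return "http"


print(choose_fetch_mode(FetchContext()))                     # prints http
print(choose_fetch_mode(FetchContext(login_required=True)))  # prints browser
```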

---

## 🔐 Persistent Authentication

- Login once  
- Session saved to JSON  
- BrowserManager loads it automatically  
- AuthHandler detects login state  
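
The save-once, reload-later flow can be sketched with the standard library alone. Everything below is an assumption made for illustration: the file layout, the `sessionid` cookie name, and the helper functions are hypothetical and do not reflect the actual `BrowserManager` or `AuthHandler` code.

```python
import json
import tempfile
import time
from pathlib import Path


def save_session(path, cookies):
    """Write the cookie list to a JSON file after a successful login."""
    Path(path).write_text(json.dumps({"cookies": cookies}, indent=2), encoding="utf-8")


def load_session(path):
    """Return the saved cookies, or an empty list if nothing was saved yet."""
    p = Path(path)
    if not p.exists():
        return []
    return json.loads(p.read_text(encoding="utf-8")).get("cookies", [])


def is_logged_in(cookies, name="sessionid", now=None):
    """Treat the session as valid if the named cookie exists and has not expired."""
    now = time.time() if now is None else now
    return any(
        c["name"] == name and c.get("expires", float("inf")) > now for c in cookies
    )


# Demo: save a one-hour session into a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    session_file = Path(tmp) / "session.json"
    save_session(
        session_file,
        [{"name": "sessionid", "value": "abc", "expires": time.time() + 3600}],
    )
    print(is_logged_in(load_session(session_file)))
```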

---

## 🖼 Media Pipeline

Automatically extracts:

- images  
- videos  
- GIFs  
- downloadable files  
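
As a rough idea of what such extraction involves, the sketch below collects media URLs from an HTML string using only the standard library. `MediaExtractor`, `extract_media`, and the `FILE_EXTS` set are hypothetical, and the real pipeline may classify media differently.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumption: which link extensions count as "downloadable files".
FILE_EXTS = {".pdf", ".zip", ".csv"}


class MediaExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.media = {"images": [], "videos": [], "gifs": [], "files": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            url = urljoin(self.base_url, attrs["src"])
            kind = "gifs" if self._ext(url) == ".gif" else "images"
            self.media[kind].append(url)
        elif tag in ("video", "source") and attrs.get("src"):
            # Simplification: every <source> is treated as video.
            self.media["videos"].append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            url = urljoin(self.base_url, attrs["href"])
            if self._ext(url) in FILE_EXTS:
                self.media["files"].append(url)

    @staticmethod
    def _ext(url):
        # Extension of the URL path, ignoring query string and fragment.
        return posixpath.splitext(urlparse(url).path)[1].lower()


def extract_media(html, base_url):
    parser = MediaExtractor(base_url)
    parser.feed(html)
    return parser.media
```

Relative URLs are resolved against the page URL with `urljoin`, so the returned lists contain absolute links that a downloader can fetch directly.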

---

## 🧰 CLI Commands

- run  
- resume  
- download  
- spiders list  
- spiders create  
- inspect  
- stats  
- version  

---

## 📁 Project Structure

```
shadowcrawler/
  core/
  spiders/
  site_extractors/
  auth/
  cli/
  models/
  parsing/
  tools/
```

---

## 🕸 Included Example Spiders

- QuotesSpider  
- WikiSpider  
- HTTPNewsSpider  
- GallerySpider  
- AuthBrowserDemoSpider  

---

## 🗺 Roadmap

- [x] PyPI release  
- [ ] Plugin system  
- [ ] Distributed crawling  
- [ ] Dashboard / Web UI  
- [ ] Cloud runner  
- [ ] Spider templates  
- [ ] Auto‑throttling  

---

## 📦 itch.io Distribution

ShadowCrawler is also distributed through itch.io, where you can get:

- The latest stable release  
- Optional Pro features  
- Example spiders  
- Early access builds  
- A way to support the project directly  

---

## ☕ Support the Project

If ShadowCrawler has helped you or you want to support future development, you can leave a tip on Ko‑fi.  
Every contribution helps keep the project alive and evolving.

```
https://ko-fi.com/shadowcrawlerframework
```

---

## 📜 License

ShadowCrawler is licensed under the Business Source License 1.1 (BUSL‑1.1).  
It will convert to Apache 2.0 on November 16, 2030.

---
