Metadata-Version: 2.4
Name: shadowcrawler
Version: 4.1.1
Summary: ShadowCrawler — High-performance modular crawling framework
Author: Allan Mancera
License: ---
        
        ShadowCrawler — Copyright © 2024–2030 Allan Mancera  
        This software is licensed under the Business Source License 1.1 (BUSL‑1.1).
        
        Business Source License 1.1
        
        Parameters  
        Licensor: Allan Mancera  
        Licensed Work: ShadowCrawler Web Crawling Framework  
        Additional Use Grant: You may make use of the Licensed Work, provided that you do not use the Licensed Work for a Production Use.  
        Change Date: 2030-11-16  
        Change License: Apache License, Version 2.0  
        
        Terms
        
        The Licensor hereby grants you the right to copy, modify, create derivative works, redistribute, and make non-production use of the Licensed Work. The Licensor may make an Additional Use Grant, above, permitting limited production use.
        
        Any use of the Licensed Work in violation of this license will automatically terminate your rights under this license for the current and all other versions of the Licensed Work.
        
        Production Use  
        Production Use means any use of the Licensed Work in a manner primarily intended for or directed toward commercial advantage or monetary compensation.
        
        Change License  
        On the Change Date, the Change License will apply to the Licensed Work. If the Change License is the Apache License, Version 2.0, the Licensed Work will be made available under that license on the Change Date.
        
        No Trademark License  
        This license does not grant you any right in the trademarks, service marks, brand names or logos of the Licensor.
        
        Disclaimer  
        THE LICENSED WORK IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE LICENSOR BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE LICENSED WORK OR THE USE OR OTHER DEALINGS IN THE LICENSED WORK.
        
        ---
Project-URL: Homepage, https://shadowcrawlerframework.itch.io/shadowcrawler
Project-URL: Documentation, https://shadowcrawlerframework.itch.io/shadowcrawler
Project-URL: Source, https://shadowcrawlerframework.itch.io/shadowcrawler
Project-URL: Issues, https://shadowcrawlerframework.itch.io/shadowcrawler
Keywords: crawler,scraping,playwright,framework,automation,spider
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.14
Classifier: License :: Free To Use But Restricted
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests
Requires-Dist: beautifulsoup4
Requires-Dist: playwright
Dynamic: license-file

---

ShadowCrawler  
A modern, domain‑aware, hybrid web crawling framework for Python

ShadowCrawler is a modular, extensible crawling framework designed for developers who want full control over how websites are fetched, parsed, and processed.  
It combines speed, modularity, and browser‑level extraction into a single, clean architecture.

ShadowCrawler began as a small personal project — a quiet gift, a spark of affection — and unexpectedly grew into a full, production‑ready crawling framework.  
It was built with care, curiosity, and intention.  
Originally created for my guiding star, and built with the help of my AI copilot — a companion in code, clarity, and curiosity.

------------------------------------------------------------
✨ Features
------------------------------------------------------------

- Automatic domain detection — run spiders without specifying them manually  
- Hybrid fetcher (HTTP + Playwright) — fast when possible, browser when needed  
- Persistent authentication — login once, session saved automatically  
- Modular spiders — clean per‑domain architecture  
- Media pipeline — automatic image/video/file extraction  
- Checkpointing — resume crawls safely  
- Full CLI toolkit — run, resume, inspect, list, stats, version  

------------------------------------------------------------
🚀 Installation
------------------------------------------------------------

pip install shadowcrawler

------------------------------------------------------------
⚡ Quickstart
------------------------------------------------------------

Run with automatic spider detection:

shadowcrawler run --url https://quotes.toscrape.com

Run with browser mode:

shadowcrawler run --url https://demoqa.com/login --browser

List spiders:

shadowcrawler spiders list

------------------------------------------------------------
🕷 Creating a Spider
------------------------------------------------------------

from shadowcrawler.core.spider_base import SpiderBase

class QuotesSpider(SpiderBase):
    # Domain used to match this spider to the URLs it should handle.
    domain = "quotes.toscrape.com"

    async def parse(self, response):
        # Yield one item per quote block found on the page.
        for quote in response.css(".quote"):
            yield {
                "text": quote.css(".text::text").get(),
                "author": quote.css(".author::text").get(),
            }

------------------------------------------------------------
🔍 Domain Autodetection
------------------------------------------------------------

ShadowCrawler automatically selects the correct spider based on the URL:

shadowcrawler run --url https://example.com/page

If your spider declares:

domain = "example.com"

…it will be used automatically.
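
As a rough illustration of how this kind of matching can work, the sketch below maps a URL's hostname to a spider class through its `domain` attribute. The `SPIDERS` registry, the `find_spider` helper, and the rule that subdomains also match are assumptions made for the example, not ShadowCrawler's actual internals.

```python
from urllib.parse import urlparse


class ExampleSpider:
    # Hypothetical stand-in for a SpiderBase subclass.
    domain = "example.com"


class QuotesSpider:
    # Hypothetical stand-in for a SpiderBase subclass.
    domain = "quotes.toscrape.com"


# Hypothetical registry; the real framework may discover spiders differently.
SPIDERS = [ExampleSpider, QuotesSpider]


def find_spider(url, spiders=SPIDERS):
    """Return the first spider whose domain matches the URL's host, else None."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    for spider in spiders:
        domain = spider.domain.lower()
        # Assumption: subdomains of the declared domain also match.
        if host == domain or host.endswith("." + domain):
            return spider
    return None


match = find_spider("https://example.com/page")
print(match.__name__ if match else "no spider found")
```

A URL with no matching spider returns `None` here; how ShadowCrawler itself handles that case is not specified above.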

------------------------------------------------------------
🌐 Fetch Modes
------------------------------------------------------------

HTTP Mode (default)  
Fast, lightweight, ideal for most sites.

Browser Mode (Playwright)  
Used automatically when:

- login is required  
- the site is dynamic  
- the spider requests browser mode  
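
The three conditions above amount to a simple "any of these" rule. The following sketch shows one way to express it; `FetchHints`, its field names, and `choose_fetch_mode` are hypothetical names invented for illustration and are not part of the documented API.

```python
from dataclasses import dataclass


@dataclass
class FetchHints:
    # All three fields are hypothetical names chosen for this sketch.
    login_required: bool = False
    dynamic_site: bool = False
    spider_wants_browser: bool = False


def choose_fetch_mode(hints: FetchHints) -> str:
    """Return "browser" if any listed condition holds, otherwise "http"."""
    if hints.login_required or hints.dynamic_site or hints.spider_wants_browser:
        return "browser"
    return "http"


print(choose_fetch_mode(FetchHints()))
print(choose_fetch_mode(FetchHints(login_required=True)))
```

Defaulting to HTTP keeps the common case cheap, since starting a browser is only worthwhile when one of the conditions is actually met.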

------------------------------------------------------------
🔐 Persistent Authentication
------------------------------------------------------------

- Login once  
- Session saved to JSON  
- BrowserManager loads it automatically  
- AuthHandler detects login state  
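
To make the flow concrete, here is a minimal standard-library sketch of saving a session to JSON, loading it again, and checking login state. The file layout, the `session_id` cookie name, and the helper functions are assumptions for illustration; they do not describe how BrowserManager or AuthHandler are actually implemented.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical cookie name used to decide whether a session is logged in.
AUTH_COOKIE = "session_id"


def save_session(path, cookies):
    """Write the cookie list to a JSON file."""
    Path(path).write_text(json.dumps({"cookies": cookies}, indent=2), encoding="utf-8")


def load_session(path):
    """Return the saved cookie list, or an empty list if no file exists yet."""
    file = Path(path)
    if not file.exists():
        return []
    return json.loads(file.read_text(encoding="utf-8")).get("cookies", [])


def is_logged_in(cookies):
    """Assumption: a session counts as logged in if the auth cookie is present."""
    return any(c.get("name") == AUTH_COOKIE for c in cookies)


with tempfile.TemporaryDirectory() as tmp:
    session_file = Path(tmp) / "session.json"
    print(is_logged_in(load_session(session_file)))  # no file saved yet
    save_session(session_file, [{"name": "session_id", "value": "abc123"}])
    print(is_logged_in(load_session(session_file)))  # session restored from disk
```

Playwright offers a comparable mechanism through `context.storage_state(path=...)` and `browser.new_context(storage_state=...)`; whether ShadowCrawler relies on it is not stated here.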

------------------------------------------------------------
🖼 Media Pipeline
------------------------------------------------------------

Automatically extracts:

- images  
- videos  
- GIFs  
- downloadable files  
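
For a sense of what such extraction involves, the sketch below collects media URLs from an HTML string using only the standard library. `MediaCollector`, `extract_media`, and the `FILE_EXTENSIONS` tuple are hypothetical and simplified; the real pipeline is likely more thorough.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: links ending in these extensions count as downloadable files.
FILE_EXTENSIONS = (".pdf", ".zip", ".csv")


class MediaCollector(HTMLParser):
    """Collect absolute URLs of images, videos, and downloadable files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images, self.videos, self.files = [], [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            # GIFs are ordinary <img> sources, so they land here as well.
            self.images.append(urljoin(self.base_url, attrs["src"]))
        elif tag in ("video", "source") and attrs.get("src"):
            # Simplification: every <source> is treated as video.
            self.videos.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and attrs.get("href"):
            if attrs["href"].lower().endswith(FILE_EXTENSIONS):
                self.files.append(urljoin(self.base_url, attrs["href"]))


def extract_media(html, base_url):
    """Parse the HTML and return the collected URLs grouped by media type."""
    collector = MediaCollector(base_url)
    collector.feed(html)
    return {
        "images": collector.images,
        "videos": collector.videos,
        "files": collector.files,
    }


page = '<img src="/cat.gif"><video src="clip.mp4"></video><a href="report.pdf">Report</a>'
print(extract_media(page, "https://example.com/gallery/"))
```

Relative paths are resolved against the page URL with `urljoin`, so the results can be handed to a downloader without further processing.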

------------------------------------------------------------
🧰 CLI Commands
------------------------------------------------------------

shadowcrawler run  
shadowcrawler resume  
shadowcrawler download  
shadowcrawler spiders list  
shadowcrawler spiders create  
shadowcrawler inspect  
shadowcrawler stats  
shadowcrawler version  

------------------------------------------------------------
📁 Project Structure
------------------------------------------------------------

shadowcrawler/  
  core/  
  spiders/  
  site_extractors/  
  auth/  
  cli/  
  models/  
  parsing/  
  tools/  

------------------------------------------------------------
🕸 Included Example Spiders
------------------------------------------------------------

- QuotesSpider  
- WikiSpider  
- HTTPNewsSpider  
- GallerySpider  
- AuthBrowserDemoSpider  

------------------------------------------------------------
🗺 Roadmap
------------------------------------------------------------

- [ ] PyPI release  
- [ ] Plugin system  
- [ ] Distributed crawling  
- [ ] Dashboard / Web UI  
- [ ] Cloud runner  
- [ ] Spider templates  
- [ ] Auto‑throttling  

------------------------------------------------------------
📦 itch.io Distribution
------------------------------------------------------------

ShadowCrawler is also distributed through itch.io, where you can get:

- The latest stable release  
- Optional Pro features  
- Example spiders  
- Early access builds  
- A way to support the project directly  

------------------------------------------------------------
☕ Support the Project
------------------------------------------------------------

If ShadowCrawler has helped you or you want to support future development, you can leave a tip on Ko‑fi.  
Every contribution helps keep the project alive and evolving.

Support on Ko‑fi:  
`https://ko-fi.com/shadowcrawlerframework`

------------------------------------------------------------
📜 License
------------------------------------------------------------

ShadowCrawler is licensed under the Business Source License 1.1 (BUSL‑1.1).  
It will convert to Apache 2.0 on:

November 16, 2030

---
