Metadata-Version: 2.4
Name: scrawlee
Version: 0.1.0
Summary: A stealth scraping library with advanced proxy rotation and auto-parsing.
Project-URL: Homepage, https://github.com/saimsajidirl/scrawlee
Project-URL: Issues, https://github.com/saimsajidirl/scrawlee/issues
Author-email: Muhammad Saim <saimsajidirl@gmail.com>
License: MIT License
        
        Copyright (c) 2026 Saim Sajid
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
        DISCLAIMER: This software is intended for educational and ethical web scraping 
        purposes only. The author is not responsible for any illegal or unethical 
        activity performed using this tool. Users are solely responsible for complying 
        with the terms of service of any website they target and all applicable laws.
        By using this software, you agree that any misuse or illicit activity is not 
        associated with or the responsibility of the author.
License-File: LICENSE
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Internet :: WWW/HTTP :: Dynamic Content
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Requires-Dist: curl-cffi>=0.7.1
Requires-Dist: lxml>=5.1.0
Requires-Dist: selectolax>=0.3.17
Provides-Extra: dev
Requires-Dist: build; extra == 'dev'
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: twine; extra == 'dev'
Description-Content-Type: text/markdown

# Scrawlee

A stealth scraping library built on top of `curl_cffi`, with advanced proxy rotation, auto-parsing for JSON/HTML responses, and built-in retries for rate limiting. Fully supports both synchronous and highly concurrent asynchronous scraping.

## Key Features

- **Ultimate Stealth**: Rotates through real-world TLS/JA3 fingerprints (Chrome, Edge, Safari).
- **Asynchronous Engine**: Comes with `AsyncScrawleeClient` for blazing-fast, highly concurrent scraping using `asyncio`.
- **Auto-Parsing Response**: The `.auto` property automatically returns a parsed Python dictionary or a high-speed `selectolax` object depending on whether the response is JSON or HTML.
- **Dual-Parser Support**: Run lightning-fast CSS queries via `.html` (selectolax) or robust XPath queries via `.lxml` (lxml).
- **Cookie Persistence**: Instantly save and load authenticated sessions to disk so you never have to log in or solve Cloudflare challenges twice.
- **Smart Retries**: Built-in exponential backoff for common HTTP error codes (429, 5xx); see the sketch after this list.
- **Advanced Proxy Management**: Supports random, round-robin, and sticky session rotation with out-of-band automated health checks.
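
For reference, exponential backoff generally behaves like the minimal sketch below. This is purely illustrative of the strategy, not Scrawlee's internal implementation; the function name and defaults are invented for the example:

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    # Delay doubles each attempt (1s, 2s, 4s, ...) up to a cap,
    # with random jitter so concurrent clients don't retry in lockstep.
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)

for attempt in range(5):
    print(f"attempt {attempt}: wait ~{backoff_delay(attempt):.1f}s before retrying")
```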

## Installation

```bash
pip install scrawlee
```
*(Requires Python 3.8+)*
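
The package also declares a `dev` extra that pulls in `build`, `pytest`, and `twine` for local development:

```bash
pip install "scrawlee[dev]"
```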

## Usage Guide

### 1. Basic Synchronous Scraping
```python
from scrawlee import ScrawleeClient

with ScrawleeClient(impersonate="chrome120") as client:
    res = client.get("https://httpbin.org/get")
    
    # .auto magically returns a Dictionary for JSON API responses!
    print(res.auto['headers']['User-Agent'])
    
    res_html = client.get("https://httpbin.org/html")
    
    # Lightning fast CSS queries via selectolax
    print(res_html.html.css_first("h1").text(strip=True))
    
    # Powerful XPath queries via lxml
    print(res_html.lxml.xpath("//h1/text()")[0])
```
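
Per the feature list, `.auto` detects the content type, so on an HTML response it should hand back the same selectolax tree that `.html` exposes. A minimal sketch, assuming that behavior:

```python
from scrawlee import ScrawleeClient

with ScrawleeClient() as client:
    res = client.get("https://httpbin.org/html")
    # For an HTML body, .auto returns a selectolax tree rather than a dict
    print(res.auto.css_first("h1").text(strip=True))
```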

### 2. Deep Dive: Extracting Data from HTML
Scrawlee eliminates the need for external parsing libraries such as BeautifulSoup: it ships with two blazing-fast, C-based parsing engines:

#### Extracting with CSS Selectors (via `.html`)
The `.html` property exposes the `selectolax` engine. It is the fastest way to parse data using standard CSS selectors.
```python
with ScrawleeClient() as client:
    res = client.get("https://example-store.com/products")
    
    # 1. Extract text from a single element
    title = res.html.css_first("h1.product-title").text(strip=True)
    
    # 2. Extract HTML attributes (e.g. data-id, href, src)
    product_id = res.html.css_first("div.product").attributes.get("data-product-id")
    
    # 3. Loop through lists of elements
    for feature_li in res.html.css("ul.features li"):
        print("Feature:", feature_li.text(strip=True))
```
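
One caveat: selectolax's `css_first` returns `None` when nothing matches, so guard lookups on pages whose markup can vary (the selector and URL here are placeholders):

```python
from scrawlee import ScrawleeClient

with ScrawleeClient() as client:
    res = client.get("https://example-store.com/products")
    node = res.html.css_first("h1.product-title")  # None if no match
    title = node.text(strip=True) if node else None
    print(title)
```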

#### Extracting with XPath Queries (via `.lxml`)
If you need complex DOM traversal (e.g., finding a parent element based on its child's value), CSS selectors fall short. The `.lxml` property provides industry-standard XPath extraction.
```python
with ScrawleeClient() as client:
    res = client.get("https://example-store.com/products")
    
    # Fetch an element exactly using an XPath query
    price = res.lxml.xpath('//div[@class="product-card" and @data-status="in-stock"]//span[@class="price"]/text()')[0]
    print(f"Price is: {price}")
```
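
To illustrate the parent-from-child traversal mentioned above, XPath's `ancestor::` axis walks upward from a matched child; the selectors here are hypothetical:

```python
from scrawlee import ScrawleeClient

with ScrawleeClient() as client:
    res = client.get("https://example-store.com/products")
    # Match a child <span class="title"> by its text, then climb to its product card
    cards = res.lxml.xpath(
        '//span[@class="title" and contains(text(), "Widget")]'
        '/ancestor::div[@class="product-card"]'
    )
    if cards:
        print(cards[0].get("data-product-id"))
```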

### 3. High-Speed Asynchronous Scraping
If you need to scrape 1,000 pages concurrently, use `AsyncScrawleeClient`.
```python
import asyncio
from scrawlee import AsyncScrawleeClient

async def run():
    async with AsyncScrawleeClient() as client:
        # Fire concurrent requests
        res1, res2 = await asyncio.gather(
            client.get("https://httpbin.org/get"),
            client.get("https://httpbin.org/html")
        )
        print("Async HTTPBin Status:", res1.status_code)

asyncio.run(run())
```
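
`asyncio.gather` alone places no ceiling on concurrency, so for very large batches it is worth capping in-flight requests with a standard `asyncio.Semaphore`. A sketch, with placeholder URLs:

```python
import asyncio
from scrawlee import AsyncScrawleeClient

async def scrape_all(urls, limit=20):
    sem = asyncio.Semaphore(limit)  # at most `limit` requests in flight

    async with AsyncScrawleeClient() as client:
        async def fetch(url):
            async with sem:
                return await client.get(url)

        return await asyncio.gather(*(fetch(u) for u in urls))

urls = [f"https://httpbin.org/get?page={i}" for i in range(100)]
results = asyncio.run(scrape_all(urls))
print(len(results), "responses fetched")
```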

### 4. Persistent Sessions (Save/Load Cookies)
If you bypass a DataDome/Cloudflare wall or log into a website, save your cookies to disk so you can instantly resume the session tomorrow!
```python
from scrawlee import ScrawleeClient

# Script 1: Save the session
with ScrawleeClient() as client:
    # ... Login logic or bypass challenge ...
    client.save_cookies("twitter_session.json")

# Script 2: Load the session instantly
with ScrawleeClient() as client:
    client.load_cookies("twitter_session.json")
    res = client.get("https://api.twitter.com/protected_route")
```

### 5. Advanced Proxy Management
Automatically rotates proxies and quarantines failing ones.
```python
from scrawlee import ScrawleeClient, ProxyManager

pm = ProxyManager(rotation_strategy="round_robin")
# Accepts raw proxy data
pm.add_proxy(ip="12.34.56.78", port="8080", username="user", password="pwd")

with ScrawleeClient(proxy_manager=pm) as client:
    res = client.get("https://api.myip.com")
    print("Masked IP:", res.auto['ip'])
```
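
In practice you would register several proxies so rotation has a pool to cycle through; the addresses below are placeholders:

```python
from scrawlee import ProxyManager

pm = ProxyManager(rotation_strategy="round_robin")
for ip, port in [("12.34.56.78", "8080"), ("98.76.54.32", "3128")]:
    pm.add_proxy(ip=ip, port=port, username="user", password="pwd")
```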
