Metadata-Version: 2.4
Name: scrapy-common-downloadhandler
Version: 0.1.0
Summary: A composite Scrapy download handler that integrates cloudscraper, curl_cffi (via scrapy-impersonate), and Twisted HTTP/1.1 into one handler with per-request routing via request.meta.
Project-URL: Homepage, https://github.com/yunjiagao/scrapy-common-downloadhandler
Project-URL: Repository, https://github.com/yunjiagao/scrapy-common-downloadhandler
Project-URL: Issues, https://github.com/yunjiagao/scrapy-common-downloadhandler/issues
Project-URL: Changelog, https://github.com/yunjiagao/scrapy-common-downloadhandler/blob/main/CHANGELOG.md
Author: yunjiagao
License-Expression: MIT
License-File: LICENSE
Keywords: anti-bot,cloudscraper,curl-cffi,download-handler,impersonate,scrapy
Classifier: Development Status :: 4 - Beta
Classifier: Framework :: Scrapy
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Requires-Dist: cloudscraper>=1.2.58
Requires-Dist: pip>=25.0.1
Requires-Dist: scrapy-impersonate>=1.0
Requires-Dist: scrapy>=2.7
Requires-Dist: twisted>=21.7
Description-Content-Type: text/markdown

# scrapy-common-downloadhandler

A composite Scrapy download handler that integrates **cloudscraper**, **curl_cffi** (via [scrapy-impersonate](https://github.com/jxlil/scrapy-impersonate)), and **Twisted HTTP/1.1** into a single handler with per-request routing via `request.meta`.

## Inheritance Chain

```
HTTP11DownloadHandler             <- Twisted HTTP/1.1 (fallback)
  └── ImpersonateDownloadHandler  <- curl_cffi (when meta["impersonate"] is set)
        └── CommonDownloadHandler <- cloudscraper (when meta["use_cloudscraper"] is True)
```

## Installation

```bash
pip install scrapy-common-downloadhandler
```

## Quick Start

### 1. Configure the download handler

In your project's `settings.py` or spider's `custom_settings`:

```python
DOWNLOAD_HANDLERS = {
    "http": "scrapy_common_downloadhandler.CommonDownloadHandler",
    "https": "scrapy_common_downloadhandler.CommonDownloadHandler",
}
USER_AGENT = ""
```

`USER_AGENT` must be set to an empty string. This prevents Scrapy's `UserAgentMiddleware` from injecting a default User-Agent header (e.g. `Scrapy/x.x.x`), which would conflict with the browser User-Agent that curl_cffi automatically provides during impersonation — resulting in a TLS fingerprint / User-Agent mismatch detectable by anti-bot systems.

No other settings or flags are needed. All three download modes are available once the handler is configured.

### 2. Use in your spider

```python
import scrapy

class MySpider(scrapy.Spider):
    name = "example"

    def start_requests(self):
        url = "https://example.com"

        # cloudscraper
        yield scrapy.Request(url, meta={"use_cloudscraper": True}, callback=self.parse)

        # curl_cffi impersonate
        yield scrapy.Request(url, meta={"impersonate": "chrome"}, callback=self.parse)

        # default Twisted HTTP/1.1
        yield scrapy.Request(url, callback=self.parse)
```

## Usage

### cloudscraper Requests

```python
# Basic
yield scrapy.Request(url, meta={"use_cloudscraper": True}, callback=self.parse)

# With create_scraper() parameter passthrough
yield scrapy.Request(url, meta={
    "use_cloudscraper": True,
    "cloudscraper_args": {
        "browser": {"browser": "chrome", "mobile": False, "platform": "windows"},
        "delay": 10,
        "interpreter": "nodejs",
    },
}, callback=self.parse)
```

All keys in `cloudscraper_args` are passed directly to `cloudscraper.create_scraper(**args)`.
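Conceptually, the passthrough amounts to unpacking the meta dict into `create_scraper()`. The sketch below illustrates this; `build_scraper_kwargs` is a hypothetical helper for illustration, not part of the package's API:

```python
# Hypothetical sketch of the documented passthrough; the handler's actual
# internals may differ.
def build_scraper_kwargs(meta):
    """Collect create_scraper() kwargs from request.meta."""
    return dict(meta.get("cloudscraper_args") or {})

meta = {
    "use_cloudscraper": True,
    "cloudscraper_args": {"delay": 10, "interpreter": "nodejs"},
}
kwargs = build_scraper_kwargs(meta)
# the handler would then call cloudscraper.create_scraper(**kwargs)
```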

### curl_cffi impersonate Requests

```python
# Basic
yield scrapy.Request(url, meta={"impersonate": "chrome"}, callback=self.parse)

# With parameter passthrough
yield scrapy.Request(url, meta={
    "impersonate": "chrome",
    "impersonate_args": {"timeout": 30},
}, callback=self.parse)
```

See [scrapy-impersonate](https://github.com/jxlil/scrapy-impersonate) for full details on `impersonate_args`.

### Default Twisted HTTP/1.1 Requests

```python
# No special meta needed
yield scrapy.Request(url, callback=self.parse)
```

## Parameter Passthrough Reference

| Mode | meta flag | passthrough key | passthrough target |
|---|---|---|---|
| cloudscraper | `use_cloudscraper: True` | `cloudscraper_args: {}` | `cloudscraper.create_scraper(**args)` |
| curl_cffi | `impersonate: "chrome"` | `impersonate_args: {}` | curl_cffi request method |
| Twisted | (none) | (none) | Scrapy default settings |
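The routing follows the inheritance chain: each handler checks its own meta flag and otherwise delegates to its parent. A minimal sketch of that decision, mirroring the table above rather than the package source (`pick_mode` is a hypothetical name):

```python
def pick_mode(meta):
    """Return which download path a request's meta would select (sketch)."""
    if meta.get("use_cloudscraper"):   # checked first, by CommonDownloadHandler
        return "cloudscraper"
    if meta.get("impersonate"):        # then by ImpersonateDownloadHandler
        return "impersonate"
    return "twisted"                   # fallback: HTTP11DownloadHandler

print(pick_mode({"use_cloudscraper": True}))   # cloudscraper
print(pick_mode({"impersonate": "chrome"}))    # impersonate
print(pick_mode({}))                           # twisted
```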

## Proxy Support

Proxy middlewares that set `request.meta["proxy"]` work seamlessly:

- **cloudscraper**: converts to `proxies={"http": proxy, "https": proxy}`
- **curl_cffi**: read by `ImpersonateDownloadHandler`'s `RequestParser`
- **Twisted**: handled by Scrapy's built-in `HttpProxyMiddleware`
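The cloudscraper conversion described above can be sketched as follows (`to_requests_proxies` is a hypothetical helper name; the package's internal code may differ):

```python
def to_requests_proxies(meta):
    """Map Scrapy's meta['proxy'] to the requests-style dict cloudscraper expects."""
    proxy = meta.get("proxy")
    return {"http": proxy, "https": proxy} if proxy else None

# Example: a proxy middleware has set meta["proxy"]
meta = {"proxy": "http://127.0.0.1:8080"}
print(to_requests_proxies(meta))
# {'http': 'http://127.0.0.1:8080', 'https': 'http://127.0.0.1:8080'}
```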

## scrapy-redis Compatibility

Fully compatible. scrapy-redis only handles scheduling and deduplication, which is independent of the download handler layer.

## Response Flags

Responses carry a flag indicating which download mode was used:

- `"cloudscraper"` in `response.flags` — downloaded via cloudscraper
- `"impersonate"` in `response.flags` — downloaded via curl_cffi
- Neither — downloaded via Twisted HTTP/1.1
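Inside a spider callback, the check reduces to membership tests on `response.flags`. A small sketch based on the flag names listed above (`download_mode` is a hypothetical helper, not part of the package):

```python
def download_mode(flags):
    """Infer the download path from a response's flags list (sketch)."""
    if "cloudscraper" in flags:
        return "cloudscraper"
    if "impersonate" in flags:
        return "impersonate"
    return "twisted-http11"
```

In a callback you would call it as `download_mode(response.flags)`, e.g. to log which path served each page.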

## Notes

- `USER_AGENT = ""` is **required**. Without it, Scrapy's `UserAgentMiddleware` will set the User-Agent header before the request reaches the download handler, overriding the browser-matched User-Agent that curl_cffi provides during impersonation.
- cloudscraper is a synchronous library (based on requests). The handler uses `deferToThread` to run it in a thread pool, avoiding reactor blocking.
- Internal redirects are disabled (`allow_redirects=False`) in cloudscraper mode. Redirects are handled by Scrapy's `RedirectMiddleware`.
- The `Content-Encoding` header is stripped from cloudscraper responses. Decompression is handled by Scrapy's `HttpCompressionMiddleware`.
- scrapy-impersonate requires the asyncio reactor. Recent Scrapy versions (2.13+) default to `AsyncioSelectorReactor`, so no extra configuration is needed there; on older versions, set `TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"` explicitly.

## License

MIT
