Metadata-Version: 2.4
Name: cf-killer
Version: 0.1.0
Summary: Automatic Cloudflare 5-second challenge solver and batch page fetcher (built on CloakBrowser)
Author: cf-killer contributors
License: MIT
Project-URL: Homepage, https://github.com/CloakHQ/CloakBrowser
Project-URL: Repository, https://github.com/CloakHQ/CloakBrowser
Keywords: cloudflare,anti-bot,web-scraping,cloakbrowser,turnstile-solver,playwright
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: cloakbrowser>=0.3.0
Requires-Dist: playwright>=1.40

# CF Killer

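A tool for **automatically solving the Cloudflare 5-second challenge and batch-fetching pages**, built on [CloakBrowser](https://github.com/erickirt/CloakBrowser), a Chromium build with anti-detection patches applied at the C++ source level.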

---

## 1. Requirements

| Item | Details |
|------|------|
| OS | Windows 10+ / Linux / macOS |
| Python | 3.9+ (3.11 recommended) |
| Browser | CloakBrowser's dedicated Chromium build (downloaded automatically, about 200 MB) |

## 2. Installing dependencies

```bash
# 1. Install cloakbrowser (pulls in Playwright)
pip install cloakbrowser

# 2. Download the patched Chromium binary (run once before first use)
python -c "import cloakbrowser; cloakbrowser.ensure_binary()"
```

Core dependency chain:

```
cloakbrowser (Chromium with C++ source-level anti-detection patches)
  ├── playwright >= 1.40        # browser automation
  ├── httpx >= 0.24             # HTTP client
  └── greenlet >= 3.1.1         # coroutine support
```

---

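## 3. Features

### 3.1 Automatic Cloudflare solving (`CFSolver`)

Automatically detects and solves Cloudflare Turnstile challenges. Four challenge types are supported:

| Type | Strategy |
|------|------|
| `non-interactive` | Pure polling until CF lets the page through on its own |
| `managed` | Wait for the iframe → click the checkbox → poll until it disappears |
| `interactive` | Same as `managed`, with a more involved click path |
| `embedded` | Solves Turnstile widgets embedded in the page |

Clicking follows a four-path escalation: precise selector inside the iframe → coordinate click on the iframe → coordinate click on the main-page container → Tab+Space as a last resort.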
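
The four-path escalation can be pictured as an ordered list of strategies that are tried until one reports success. The sketch below is only an illustration of that pattern, not `CFSolver`'s actual code: `try_in_order` and the four stand-in strategy functions are hypothetical names, and real click paths would drive a Playwright page instead of returning constants.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence, Tuple

# A strategy is a name plus an async callable that returns True on success.
Strategy = Tuple[str, Callable[[], Awaitable[bool]]]


async def try_in_order(strategies: Sequence[Strategy]) -> Optional[str]:
    """Run strategies in order; return the name of the first that succeeds."""
    for name, attempt in strategies:
        try:
            if await attempt():
                return name
        except Exception:
            # A failing path (e.g. selector not found) falls through to the next.
            continue
    return None


# Hypothetical stand-ins for the four click paths described above.
async def click_iframe_selector() -> bool:
    raise LookupError("checkbox selector not found inside iframe")


async def click_iframe_coords() -> bool:
    return False  # pretend the iframe bounding box was unavailable


async def click_container_coords() -> bool:
    return True  # pretend the main-page container was clickable


async def press_tab_space() -> bool:
    return True  # last resort, not reached in this run


CLICK_PATHS = [
    ("iframe-selector", click_iframe_selector),
    ("iframe-coords", click_iframe_coords),
    ("container-coords", click_container_coords),
    ("tab-space", press_tab_space),
]

if __name__ == "__main__":
    print(asyncio.run(try_in_order(CLICK_PATHS)))  # container-coords
```

Keeping the paths in a plain list makes the escalation order explicit and easy to reorder or extend.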

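### 3.2 Batch page fetching (`CFPageFetcher`)

- Built on a CloakBrowser persistent context, so the browser fingerprint and cookies are reused
- Built-in CF detection (including sites that set the page title late via JS, such as ScienceDirect)
- Automatic context recycling: the browser context is rebuilt after N pages to prevent memory leaks
- **Deferred recycling**: under concurrency, recycling waits until every active page has finished, which avoids race-condition crashes
- Proxy support (three modes: single instance, multiple instances, callable)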
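
The deferred recycling idea can be sketched with two counters: pages finished since the last recycle, and pages still in flight. The class below is a minimal illustration under assumptions, not `CFPageFetcher`'s implementation. `DeferredRecycler`, `run`, and `_recycle` are hypothetical names, and the recycle step only bumps a counter where a real fetcher would close and recreate its browser context.

```python
import asyncio


class DeferredRecycler:
    """Recycle every `max_pages` pages, but only once no page is in flight."""

    def __init__(self, max_pages: int) -> None:
        self.max_pages = max_pages
        self.pages_done = 0      # pages finished since the last recycle
        self.active = 0          # pages currently in flight
        self.recycle_count = 0   # how many times a recycle happened
        self._lock = asyncio.Lock()

    async def _recycle(self) -> None:
        # Placeholder: a real fetcher would rebuild its browser context here.
        self.recycle_count += 1
        self.pages_done = 0

    async def run(self, job):
        # New pages take the lock first, so they wait while a recycle runs.
        async with self._lock:
            self.active += 1
        try:
            return await job()
        finally:
            async with self._lock:
                self.active -= 1
                self.pages_done += 1
                # Threshold reached, but defer until nothing else is running.
                if self.pages_done >= self.max_pages and self.active == 0:
                    await self._recycle()


async def demo() -> int:
    recycler = DeferredRecycler(max_pages=3)

    async def fake_page(delay: float) -> None:
        await asyncio.sleep(delay)

    # Five overlapping pages: the threshold (3) is crossed while two pages
    # are still running, so the recycle is postponed until the last finishes.
    delays = (0.01, 0.02, 0.03, 0.04, 0.05)
    await asyncio.gather(*(recycler.run(lambda d=d: fake_page(d)) for d in delays))
    return recycler.recycle_count


if __name__ == "__main__":
    print(asyncio.run(demo()))  # 1
```

Recycling eagerly at the third page would tear the context down underneath the two pages still using it, which is the race the deferred variant avoids.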

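### 3.3 File download (`download_file`)

Once past CF, binary files (PDFs, images, etc.) are downloaded directly with an in-page `fetch()`. The request reuses the browser's cookies and TLS fingerprint, which gets it past anti-scraping restrictions.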
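
One way to move binary data from an in-page `fetch()` back to Python is to base64-encode it in the page and decode it on the Python side. The sketch below shows that approach only. It is an assumption about how such a download could work, not the body of `download_file`. `download_via_page` and `FETCH_JS` are hypothetical names, and `page` is assumed to be a Playwright async `Page` that is already on the target origin.

```python
import base64
from pathlib import Path

# JavaScript evaluated inside the page: fetch the URL with the page's own
# cookies, then return the body as base64 so it can cross back to Python.
FETCH_JS = """
async (url) => {
    const resp = await fetch(url, { credentials: "include" });
    if (!resp.ok) return null;
    const bytes = new Uint8Array(await resp.arrayBuffer());
    let binary = "";
    for (let i = 0; i < bytes.length; i++) {
        binary += String.fromCharCode(bytes[i]);
    }
    return btoa(binary);
}
"""


async def download_via_page(page, url: str, dest: str) -> bool:
    """Download `url` through an already-open page and write it to `dest`.

    `page` only needs to provide `await page.evaluate(expression, arg)`,
    as Playwright's async Page does.
    """
    encoded = await page.evaluate(FETCH_JS, url)
    if encoded is None:
        return False  # the in-page fetch returned a non-2xx status
    Path(dest).write_bytes(base64.b64decode(encoded))
    return True
```

Because the request is issued by the page itself, it carries the same cookies and network stack as the navigation that passed the challenge. A cross-origin URL would additionally be subject to CORS, which is one reason to open a page on the file's own origin first.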

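### 3.4 Multi-instance parallelism (`fetch_all`)

URLs are split evenly across multiple browser instances. Each instance gets its own event loop and its own proxy, and a `ThreadPoolExecutor` runs the instances in parallel to maximize throughput.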
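
The sharding scheme can be sketched as a round-robin split followed by one `asyncio.run` per worker thread. This is a simplified illustration, not the real `fetch_all`: `shard`, `fetch_shard`, `run_instance`, and `fetch_all_sketch` are hypothetical names, the per-shard coroutine is a placeholder for a browser instance, and the `"url"` key in the result dicts is an assumption.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import List


def shard(urls: List[str], instances: int) -> List[List[str]]:
    """Round-robin split, so shard sizes differ by at most one."""
    return [urls[i::instances] for i in range(instances)]


async def fetch_shard(urls: List[str]) -> List[dict]:
    # Placeholder for one browser instance working through its shard.
    await asyncio.sleep(0)
    return [{"url": u, "success": True} for u in urls]


def run_instance(urls: List[str]) -> List[dict]:
    # asyncio.run creates a fresh event loop owned by this worker thread.
    return asyncio.run(fetch_shard(urls))


def fetch_all_sketch(urls: List[str], instances: int = 2) -> List[dict]:
    # Drop empty shards so no thread is started without work.
    shards = [s for s in shard(urls, instances) if s]
    with ThreadPoolExecutor(max_workers=len(shards) or 1) as pool:
        parts = list(pool.map(run_instance, shards))
    # Flatten the per-instance results into one list.
    return [result for part in parts for result in part]


if __name__ == "__main__":
    print(shard(["a", "b", "c", "d", "e"], 2))  # [['a', 'c', 'e'], ['b', 'd']]
```

Giving each thread its own event loop keeps the instances independent: a slow or stuck shard does not block the coroutines of another instance.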

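### 3.5 Main entry points

| Function | Purpose |
|------|------|
| `fetch_url(url, ...)` | Synchronously fetch a single URL |
| `fetch_urls(urls, ...)` | Synchronous batch fetch (alias of `fetch_all`) |
| `fetch_all(urls, ...)` | Multi-instance parallel fetch with per-shard proxies |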

---

## 4. Test cases

### Case A: Batch page fetching

Fetches 31 mixed URLs (the Gut medical journal, American Football Wiki, and ScienceDirect) to verify CF solving and page fetching.

```python
# -*- coding: utf-8 -*-
"""CF auto-solve + page fetching: test script"""
import os
import sys

if sys.platform == "win32":
    import io
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", errors="replace")
    sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding="utf-8", errors="replace")

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from cf_killer import fetch_all

HEADLESS               = True
PROXY                  = None
CONCURRENCY            = 3
INSTANCES              = 2
MAX_PAGES_PER_CONTEXT  = 10
RETURN_COOKIES         = False

URLS = [
    "https://gut.bmj.com/content/75/6/1085",
    "https://gut.bmj.com/content/75/6/1087",
    "https://gut.bmj.com/content/75/6/1090",
    "https://gut.bmj.com/content/75/6/1092",
    "https://gut.bmj.com/content/75/6/1094",
    "https://gut.bmj.com/content/75/6/1097",
    "https://gut.bmj.com/content/75/6/1110",
    "https://gut.bmj.com/content/75/6/1123",
    "https://gut.bmj.com/content/75/6/1136",
    "https://gut.bmj.com/content/75/6/1147",
    "https://gut.bmj.com/content/75/6/1160",
    "https://gut.bmj.com/content/75/6/1169",
    "https://gut.bmj.com/content/75/6/1186",
    "https://gut.bmj.com/content/75/6/1201",
    "https://gut.bmj.com/content/75/6/1211",
    "https://gut.bmj.com/content/75/6/1226",
    "https://gut.bmj.com/content/75/6/1237",
    "https://gut.bmj.com/content/75/6/1248",
    "https://gut.bmj.com/content/75/6/1264",
    "https://gut.bmj.com/content/75/6/1266.1",
    "https://gut.bmj.com/content/75/6/1266.2",
    "https://gut.bmj.com/content/75/6/1267",
    "https://gut.bmj.com/content/75/6/1109",
    "http://americanfootball.fandom.com/1993_Kentucky_vs._Mississippi",
    "http://americanfootball.fandom.com/Isaiah_Foskey",
    "http://americanfootball.fandom.com/wiki/2014_Susquehanna_Crusaders",
    "http://americanfootball.fandom.com/wiki/2015_Lake_Forest_Foresters",
    "http://americanfootball.fandom.com/wiki/2023_Colorado_State_Rams",
    "http://americanfootballdatabase.fandom.com/Paul_Hackett_(American_football)",
    "http://americanfootballdatabase.fandom.com/wiki/100th_Grey_Cup",
    "https://www.sciencedirect.com/science/article/pii/S0039606025002491",
]

if __name__ == "__main__":
    print(f"Testing {len(URLS)} URLs")

    results = fetch_all(
        URLS,
        instances=INSTANCES,
        concurrency=CONCURRENCY,
        max_pages_per_context=MAX_PAGES_PER_CONTEXT,
        headless=HEADLESS,
        solve_cf=True,
        proxy=PROXY,
        return_cookies=RETURN_COOKIES,
        verbose=False,
    )

    # Count successes, then print one status line per URL.
    ok = sum(1 for r in results if r["success"])
    print(f"\n{'='*50}")
    for r in results:
        status = "✓" if r["success"] else "✗"
        print(f"  {status}  {(r['title'] or 'FAILED')[:60]}")
    print(f"{'='*50}")
    print(f"Result: {ok}/{len(results)} succeeded")
```

Run: `python test.py`

Expected output:

```
Testing 31 URLs

==================================================
  ✓  Gut-peritoneal-multisystem axis in endometriosis | Gut
  ✓  Hitting the mitotic spot of fibrolamellar carcinoma | Gut
  ...
  ✓  100th Grey Cup | American Football Database | Fandom
  ✓  Guidelines for perioperative care in elective colorectal sur
==================================================
Result: 31/31 succeeded
```

---

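### Case B: PDF file download

After passing CF, the PDF is downloaded with a `fetch()` issued from the browser page, reusing its TLS fingerprint and cookies.

```python
# -*- coding: utf-8 -*-
"""Test for the download_file method: PDF download"""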
import asyncio
import os
import sys

if sys.platform == "win32":
    import io
    sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", errors="replace")
    sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding="utf-8", errors="replace")

sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
from cf_killer import CFPageFetcher

PDF_URL = "https://www.myavls.org/assets/pdf/SuperficialVenousDiseaseGuidelinesPMS313-02.03.16.pdf"
OUTPUT = os.path.join(os.path.dirname(os.path.abspath(__file__)), "SuperficialVenousDiseaseGuidelines.pdf")


async def main():
    print(f"Target: {PDF_URL}")
    print(f"Save to: {OUTPUT}")

    async with CFPageFetcher(
        headless=True,
        verbose=True,
        solve_cf=True,
    ) as fetcher:
        ok = await fetcher.download_file(PDF_URL, OUTPUT)
        if ok:
            size_kb = os.path.getsize(OUTPUT) / 1024
            print(f"\n✅ Download succeeded! File: {OUTPUT} ({size_kb:.0f} KB)")
        else:
            print("\n❌ Download failed")


if __name__ == "__main__":
    asyncio.run(main())
```

Run: `python test_download.py`

Expected output (the four lines in the middle are the library's own verbose log, shown verbatim: context created, warm-up navigation, no CF challenge detected, file saved):

```
Target: https://www.myavls.org/assets/pdf/SuperficialVenousDiseaseGuidelinesPMS313-02.03.16.pdf
Save to: ...\SuperficialVenousDiseaseGuidelines.pdf
[上下文] 已创建
[下载] 预热: https://www.myavls.org/
非 CF url=https://www.myavls.org/
[下载] 已保存: ...\SuperficialVenousDiseaseGuidelines.pdf (121KB)

✅ Download succeeded! ... (121 KB)
```

---

## 5. Main API parameters

### `CFPageFetcher`

| Parameter | Type | Default | Description |
|------|------|--------|------|
| `headless` | bool | True | Run the browser headless |
| `humanize` | bool | False | Human-like mouse trajectories and keyboard timing |
| `solve_cf` | bool | True | Automatically solve CF challenges |
| `cf_max_retries` | int | 5 | Maximum number of CF solve retries |
| `timeout` | int | 90000 | Page navigation timeout (ms) |
| `proxy` | str | None | Proxy URL |
| `max_pages_per_context` | int | 20 | Recycle the browser context every N pages |
| `return_cookies` | bool | False | Whether results include cookies |

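### `fetch_all`

| Parameter | Type | Default | Description |
|------|------|--------|------|
| `urls` | list | - | List of URLs |
| `instances` | int | 1 | Number of parallel browser instances |
| `concurrency` | int | 3 | Concurrent tabs per instance |
| `max_pages_per_context` | int | 20 | Recycle the context automatically every N pages |
| `proxy` | str/list/callable | None | Single proxy, list of proxies, or proxy factory function |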
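
The three `proxy` forms can all be reduced to "one proxy per instance". The helper below is a guess at how that normalization could look, not the library's documented behavior. `resolve_proxies` is a hypothetical name, and two details are assumptions: a list is cycled when there are more instances than proxies, and a callable receives the instance index.

```python
from typing import Callable, List, Optional, Sequence, Union

ProxySpec = Union[None, str, Sequence[str], Callable[[int], Optional[str]]]


def resolve_proxies(proxy: ProxySpec, instances: int) -> List[Optional[str]]:
    """Return one proxy URL (or None) per browser instance."""
    # None or a single URL: every instance shares the same setting.
    if proxy is None or isinstance(proxy, str):
        return [proxy] * instances
    # Factory function: assumed here to be called with the instance index.
    if callable(proxy):
        return [proxy(i) for i in range(instances)]
    proxies = list(proxy)
    if not proxies:
        return [None] * instances
    # List: cycle through it when instances outnumber proxies (assumption).
    return [proxies[i % len(proxies)] for i in range(instances)]


if __name__ == "__main__":
    print(resolve_proxies(["http://p1:8080", "http://p2:8080"], 3))
    # ['http://p1:8080', 'http://p2:8080', 'http://p1:8080']
```

Resolving the proxies up front means each instance can be started with a plain string, whichever form the caller passed.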
