Metadata-Version: 2.3
Name: khspider
Version: 0.1.6
Summary: 
Author: Your Name
Author-email: you@example.com
Requires-Python: >=3.10
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Dist: beautifulsoup4 (>=4.12.3)
Requires-Dist: funboost (>=53.7)
Requires-Dist: loguru (>=0.7.2)
Requires-Dist: lxml (>=6.0.2)
Requires-Dist: nb-log (>=14.0)
Requires-Dist: packaging (>=24.0)
Requires-Dist: parsel (>=1.9.1)
Requires-Dist: pymongo (>=4.16.0)
Requires-Dist: redis (>=7.1.0)
Requires-Dist: requests (>=2.32.5)
Requires-Dist: six (>=1.16.0)
Requires-Dist: w3lib (>=2.1.2)
Description-Content-Type: text/markdown

# khspider

A Python crawler framework built on [funboost](https://github.com/ydf0509/funboost), with built-in DrissionPage browser integration, an onion-style middleware model, and two persistence modes: batched and direct writes.

---

## Installation

```bash
poetry add git+https://gitee.com/sugarysp/khspider.git
```

---

## Quick Start

```bash
# Generate spider templates
khspider create -s news_spider            # → spiders/news_spider.py
khspider create -s product -d jd          # → spiders/jd/product.py
```

### Minimal example

```python
from funboost import boost, BoosterParams, BrokerEnum
from khspider import Spider, Item

spider = Spider(
    pipeline="mongodb://127.0.0.1/mydb",
    batch=True,          # True = batched writes (default), False = write each item directly
    batch_size=500,
)

@boost(BoosterParams(queue_name="news", broker_kind=BrokerEnum.REDIS_ACK_ABLE))
def crawl(url):
    resp = spider.get(url)
    spider.save(Item(
        table_name="news",
        title=resp.xpath("//title/text()").get(),
        url=url,
    ))

if __name__ == "__main__":
    with spider:
        crawl.push("https://example.com")
        crawl.consume()
```

---

## Persistence Modes

### Batched writes (default, high throughput)

```python
spider = Spider(
    pipeline="mongodb://127.0.0.1/mydb",
    batch=True,
    batch_size=500,       # flush once 500 items have accumulated
    batch_interval=5.0,   # or flush every 5 seconds
)

spider.save(item)   # goes into the buffer, flushed to the database asynchronously in batches
```

### Direct writes (boost_spider style, real-time persistence)

```python
# Option 1: disable batching at construction time
spider = Spider(pipeline="mongodb://127.0.0.1/mydb", batch=False)
spider.save(item)         # written synchronously, right away

# Option 2: with batch=True, write one specific item immediately
spider.save_direct(item)  # skips the buffer and writes directly
```

---

## Middleware

Middlewares follow an onion model: `process_request` hooks run in registration order → the actual request is sent → `process_response` hooks run in reverse order.

```python
from khspider.middlewares import BaseMiddleware

class MyMiddleware(BaseMiddleware):
    def process_request(self, req, config):
        # modify the request here, or short-circuit by returning a Response
        return None

    def process_response(self, req, resp, config):
        # resp.status_code, req.url and req.meta["_start_at"] are all available here
        return resp

    def process_exception(self, req, exc, config):
        # return None = let the exception propagate, Request = retry, Response = swallow the exception
        return None

spider = Spider(
    pipeline="mongodb://...",
    middlewares=[MyMiddleware()],
)
```
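
To make the ordering concrete, here is a minimal sketch with two middlewares that only log; the middleware names are hypothetical and the sketch assumes the hooks shown above:

```python
from khspider import Spider
from khspider.middlewares import BaseMiddleware

class OuterMiddleware(BaseMiddleware):
    def process_request(self, req, config):
        print("outer: request")    # runs first
        return None

    def process_response(self, req, resp, config):
        print("outer: response")   # runs last
        return resp

class InnerMiddleware(BaseMiddleware):
    def process_request(self, req, config):
        print("inner: request")    # runs second
        return None

    def process_response(self, req, resp, config):
        print("inner: response")   # runs third
        return resp

spider = Spider(
    pipeline="mongodb://127.0.0.1/mydb",
    middlewares=[OuterMiddleware(), InnerMiddleware()],
)

# Each spider.get(url) then flows through:
#   outer.process_request → inner.process_request → HTTP request
#   → inner.process_response → outer.process_response
```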

Before a request is sent, the framework sets `req.meta["_start_at"]` to `time.time()`, so `process_response` can use it directly to compute the elapsed time.

### Metrics example: writing to InfluxDB

You implement your own `MetricsMiddleware`; the framework is not tied to any metrics backend:

```python
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS
from khspider.middlewares import BaseMiddleware
import time

class InfluxDBMiddleware(BaseMiddleware):
    def __init__(self, url, token, org, bucket):
        client = InfluxDBClient(url=url, token=token, org=org)
        self._write = client.write_api(write_options=SYNCHRONOUS)
        self._bucket = bucket
        self._org = org

    def process_response(self, req, resp, config):
        elapsed_ms = (time.time() - req.meta["_start_at"]) * 1000
        point = (
            Point("http_request")
            .tag("url", req.url)
            .field("status_code", resp.status_code)
            .field("elapsed_ms", elapsed_ms)
        )
        self._write.write(bucket=self._bucket, org=self._org, record=point)
        return resp

    def process_exception(self, req, exc, config):
        elapsed_ms = (time.time() - req.meta["_start_at"]) * 1000
        point = (
            Point("http_request")
            .tag("url", req.url)
            .tag("error", type(exc).__name__)
            .field("elapsed_ms", elapsed_ms)
            .field("status_code", -1)
        )
        self._write.write(bucket=self._bucket, org=self._org, record=point)
        return None  # keep raising so funboost can retry


# Usage
spider = Spider(
    pipeline="mongodb://127.0.0.1/mydb",
    middlewares=[
        InfluxDBMiddleware(
            url="http://localhost:8086",
            token="your-token",
            org="your-org",
            bucket="spider_metrics",
        ),
    ],
)
```

---

## DrissionPage Integration

### HTTP request metrics (the spider.get path)

```python
from khspider.core.browser_downloader import DrissionPageDownloader, make_options_factory

def my_on_response(url, elapsed_ms, status_code):
    # you decide where the metric goes
    print(f"{status_code} {elapsed_ms:.0f}ms {url}")
    # influx_client.write(...)

downloader = DrissionPageDownloader(
    options_factory=make_options_factory(
        browser_path=r"C:\Program Files\Chrome\chrome.exe",
        proxy="http://127.0.0.1:10808",
        port_start=12000,
    ),
    on_response=my_on_response,   # fired after spider.get(url) completes
)

spider = Spider(downloader=downloader)
```

### Driving the browser directly (downloads, interaction)

```python
from funboost import boost, BoosterParams, BrokerEnum

@boost(BoosterParams(queue_name="download", concurrent_num=5, broker_kind=BrokerEnum.REDIS_ACK_ABLE))
def download_file(file_url, save_dir):
    page = spider.page               # browser dedicated to the current thread (lazily created)
    page.set.download_path(save_dir) # must be set explicitly in DrissionPage 4.x
    page.run_js(f"window.location.href='{file_url}';")
    mission = page.wait.download_begin(timeout=15)
    if mission:
        mission.wait()
    else:
        raise Exception(f"下载失败: {file_url}")

    # Metrics for DP downloads: time them yourself in the business code,
    # e.g. set t0 = time.time() before run_js, compute the elapsed time once the download finishes, then write it to InfluxDB
```

> DP's direct downloads (`page.run_js` + `wait.download_begin`) bypass the middleware chain, so metrics must be added manually inside the business function; the framework's `on_response` callback only covers the `spider.get()` path.
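
A minimal sketch of that manual timing, as a hypothetical helper; it assumes the same `page` object as above, and where the elapsed time is written is left to you:

```python
import time

def timed_download(page, file_url, save_dir, timeout=15):
    """Hypothetical helper: time a direct DP download by hand, since no middleware hooks run here."""
    page.set.download_path(save_dir)
    t0 = time.time()
    page.run_js(f"window.location.href='{file_url}';")
    mission = page.wait.download_begin(timeout=timeout)
    if mission:
        mission.wait()
    elapsed_ms = (time.time() - t0) * 1000
    return elapsed_ms  # write this to your metrics backend, e.g. the InfluxDB client shown earlier
```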

### Thread-local concurrency model

```
concurrent_num=5 → funboost creates 5 worker threads
first time a thread touches spider.page → a dedicated browser is started for it (lazy loading)
later accesses from the same thread → the same browser instance is returned, nothing new is launched

spider (one global object)
  └── DrissionPageDownloader
        ├── Thread-1.page → ChromiumPage(port=12000)
        ├── Thread-2.page → ChromiumPage(port=12001)
        └── Thread-3.page → ChromiumPage(port=12002)
```
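
Conceptually, the per-thread lazy loading behaves like the sketch below. This only illustrates the model, not khspider's actual internals; `ThreadLocalPages` is a hypothetical name:

```python
import threading
from DrissionPage import ChromiumPage

class ThreadLocalPages:
    """Illustrative only: one lazily created browser per worker thread."""

    def __init__(self, options_factory):
        self._local = threading.local()
        self._options_factory = options_factory  # e.g. make_options_factory(...) from above

    @property
    def page(self):
        if not hasattr(self._local, "page"):
            # first access from this thread: launch a dedicated browser on its own port
            self._local.page = ChromiumPage(self._options_factory())
        return self._local.page  # later accesses reuse the same instance
```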

---

## CLI

```bash
khspider create -s <name>              # → spiders/<name>.py
khspider create -s <name> -d <site>   # → spiders/<site>/<name>.py
khspider create -s <name> -o <path>   # → custom output path
```

Organizing spiders into per-site directories is recommended:

```
spiders/
├── jd/
│   ├── product.py
│   └── review.py
└── xinhua/
    └── list.py
```

---

## Supported Pipelines

```python
Spider(pipeline="mongodb://user:pass@localhost/mydb")
Spider(pipeline="mysql://user:pass@localhost/mydb")
Spider(pipeline=MongoPipeline(...))   # pass a pipeline instance directly
```

