Metadata-Version: 2.4
Name: crawlo
Version: 1.6.9
Summary: Crawlo: A high-performance asynchronous Python web crawling framework with distributed support.
Home-page: https://github.com/crawl-coder/Crawlo.git
Author: crawl-coder
Author-email: crawlo@qq.com
License: BSD-3-Clause
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: aiohttp>=3.13.3
Requires-Dist: aiofiles>=25.1.0
Requires-Dist: aioredis>=2.0.1
Requires-Dist: chardet>=7.4.3
Requires-Dist: cssselect>=1.2.0
Requires-Dist: dateparser>=1.2.2
Requires-Dist: httpx[http2]>=0.27.0
Requires-Dist: curl-cffi>=0.13.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: parsel>=1.9.1
Requires-Dist: pydantic>=2.11.7
Requires-Dist: python-dateutil>=2.9.0.post0
Requires-Dist: redis>=6.2.0
Requires-Dist: ujson>=5.9.0
Requires-Dist: w3lib>=2.3.1
Requires-Dist: rich>=13.0.0
Requires-Dist: astor>=0.8.0
Requires-Dist: watchdog>=3.0.0
Requires-Dist: psutil>=7.0.0
Requires-Dist: anyio>=4.3.0
Requires-Dist: httpcore>=1.0.5
Requires-Dist: msgpack>=1.0.0
Requires-Dist: pytz>=2024.2
Requires-Dist: yarl>=1.17.0
Provides-Extra: database
Requires-Dist: asyncmy>=0.2.11; extra == "database"
Requires-Dist: motor>=3.7.0; extra == "database"
Requires-Dist: pymongo>=4.11; extra == "database"
Requires-Dist: aiosqlite>=0.19.0; extra == "database"
Requires-Dist: asyncpg>=0.29.0; extra == "database"
Requires-Dist: clickhouse-connect>=0.7.0; extra == "database"
Requires-Dist: elasticsearch<9.0.0,>=8.0.0; extra == "database"
Requires-Dist: happybase>=1.2.0; extra == "database"
Provides-Extra: sqlite
Requires-Dist: aiosqlite>=0.19.0; extra == "sqlite"
Provides-Extra: postgresql
Requires-Dist: asyncpg>=0.29.0; extra == "postgresql"
Provides-Extra: clickhouse
Requires-Dist: clickhouse-connect>=0.7.0; extra == "clickhouse"
Provides-Extra: elasticsearch
Requires-Dist: elasticsearch<9.0.0,>=8.0.0; extra == "elasticsearch"
Provides-Extra: hbase
Requires-Dist: happybase>=1.2.0; extra == "hbase"
Provides-Extra: bloom
Requires-Dist: pybloom-live>=0.3.1; extra == "bloom"
Provides-Extra: render
Requires-Dist: playwright; extra == "render"
Requires-Dist: camoufox>=0.3.0; extra == "render"
Requires-Dist: DrissionPage>=4.1.0; extra == "render"
Provides-Extra: stealth
Requires-Dist: cloakbrowser[geoip]>=0.3.14; extra == "stealth"
Provides-Extra: mcp
Requires-Dist: mcp>=1.0.0; extra == "mcp"
Requires-Dist: markdownify>=0.13.1; extra == "mcp"
Provides-Extra: db-all
Requires-Dist: aiosqlite>=0.19.0; extra == "db-all"
Requires-Dist: asyncpg>=0.29.0; extra == "db-all"
Requires-Dist: clickhouse-connect>=0.7.0; extra == "db-all"
Requires-Dist: elasticsearch<9.0.0,>=8.0.0; extra == "db-all"
Requires-Dist: happybase>=1.2.0; extra == "db-all"
Requires-Dist: asyncmy>=0.2.11; extra == "db-all"
Requires-Dist: motor>=3.7.0; extra == "db-all"
Requires-Dist: pymongo>=4.11; extra == "db-all"
Provides-Extra: all
Requires-Dist: bitarray>=1.5.3; extra == "all"
Requires-Dist: PyExecJS>=1.5.1; extra == "all"
Requires-Dist: pymongo>=3.10.1; extra == "all"
Requires-Dist: redis-py-cluster>=2.1.0; extra == "all"
Requires-Dist: asyncmy>=0.2.11; extra == "all"
Requires-Dist: motor>=3.7.0; extra == "all"
Requires-Dist: pymongo>=4.11; extra == "all"
Requires-Dist: aiosqlite>=0.19.0; extra == "all"
Requires-Dist: asyncpg>=0.29.0; extra == "all"
Requires-Dist: clickhouse-connect>=0.7.0; extra == "all"
Requires-Dist: elasticsearch<9.0.0,>=8.0.0; extra == "all"
Requires-Dist: happybase>=1.2.0; extra == "all"
Requires-Dist: playwright; extra == "all"
Requires-Dist: camoufox>=0.3.0; extra == "all"
Requires-Dist: DrissionPage>=4.1.0; extra == "all"
Requires-Dist: cloakbrowser[geoip]>=0.3.14; extra == "all"
Requires-Dist: mcp>=1.0.0; extra == "all"
Requires-Dist: markdownify>=0.13.1; extra == "all"

<p align="center">
  <img src="assets/logo.svg" alt="Crawlo Logo" width="150"/>
</p>

<h1 align="center">Crawlo</h1>

<p align="center">
  <strong>A Modern High-Performance Python Async Web Scraping Framework</strong>
</p>

<p align="center">
  <strong>Python 3.8+</strong> · <strong>Python 3.14 Compatible</strong>
</p>

<p align="center">
  <a href="README.zh.md">中文</a> ·
  <a href="README.md">English</a>
</p>

<p align="center">
  <a href="#quick-start-en">Quick Start</a> ·
  <a href="#features-en">Key Features</a> ·
  <a href="#docs-en">Docs</a> ·
  <a href="#examples-en">Examples</a>
</p>

---

## <a id="quick-start-en"></a>✨ Quick Start (3 Steps)

### 1. Install
```bash
pip install crawlo
```

### 2. Create a Spider
```bash
crawlo startproject myproject
cd myproject
crawlo genspider example example.com
```

### 3. Run
```bash
crawlo run example
```

👉 **[5-Minute Quickstart Tutorial →](docs/getting-started/5min-quickstart.md)**

---

## <a id="features-en"></a>🚀 Key Features

### ⚡ High-Performance Async Architecture
- Built on asyncio + aiohttp/httpx/curl-cffi multi-protocol downloaders
- Smart concurrency control, connection pool reuse, auto throughput optimization
- HTTP/2 support and TLS fingerprint emulation (to evade JA3-based detection)
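
The concurrency control mentioned above can be illustrated with plain `asyncio`. This is a minimal sketch, not Crawlo's implementation: `fetch`, `crawl`, and `run_demo` are hypothetical names, and the download is simulated with `asyncio.sleep` so the example runs without network access.

```python
import asyncio


async def fetch(url: str, limiter: asyncio.Semaphore, stats: dict) -> str:
    """Simulated download; the semaphore caps how many run at once."""
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for real network I/O
        stats["active"] -= 1
        return f"body of {url}"


async def crawl(urls: list, max_concurrency: int = 3) -> tuple:
    """Fetch all URLs while never exceeding max_concurrency in flight."""
    limiter = asyncio.Semaphore(max_concurrency)
    stats = {"active": 0, "peak": 0}
    bodies = await asyncio.gather(*(fetch(u, limiter, stats) for u in urls))
    return bodies, stats["peak"]


def run_demo() -> tuple:
    # Hypothetical URLs, used only as labels for the simulated fetches.
    urls = [f"https://example.com/page/{i}" for i in range(10)]
    return asyncio.run(crawl(urls, max_concurrency=3))
```

All ten tasks are scheduled immediately, but the semaphore lets only three proceed at a time; the rest wait without blocking the event loop.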

### 🛡️ Robust Anti-Bot Capabilities
- **HybridDownloader**: 6-level detection priority, auto-switch protocol/browser engine
- **Cloudflare Auto-Bypass**: Detects challenge pages and auto-switches to stealth browser
- **5 Browser Downloaders**: Playwright / Camoufox / CloakBrowser / DrissionPage / Chrome
- **BROWSER_* Unified Config Layer**: One set of params for all browser downloaders
- **Adaptive Selectors**: Auto-relocate elements when site structure changes (selector self-healing)

### 🤖 AI Integration (MCP Server)
- Claude and Cursor can invoke Crawlo's scraping capabilities directly
- Three scraping modes: `basic` (1-3s) → `stealth` (3-10s) → `max-stealth` (10s+)
- Browser singleton pool: stealth/max-stealth modes reuse instances
- Structured error responses: distinguish `TIMEOUT` / `CONNECTION_ERROR` / `STEALTH_UNAVAILABLE`, with suggestions
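
One way to picture the three modes and the structured errors together is a fallback chain. The sketch below is an assumption-heavy illustration: only the mode names and error codes come from the list above, while `ScrapeResult`, `scrape_with_escalation`, and the `backends` mapping are hypothetical. Whether Crawlo escalates between modes automatically is not stated here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Mode names and their order are taken from the feature list above.
MODES = ["basic", "stealth", "max-stealth"]


@dataclass
class ScrapeResult:
    ok: bool
    mode: str
    html: Optional[str] = None
    error_code: Optional[str] = None  # TIMEOUT, CONNECTION_ERROR, ...
    suggestion: Optional[str] = None


def scrape_with_escalation(
    url: str, backends: Dict[str, Callable[[str], str]]
) -> ScrapeResult:
    """Try each mode in order; return the first success or the last error."""
    last = ScrapeResult(False, MODES[0], error_code="STEALTH_UNAVAILABLE",
                        suggestion="configure at least one backend")
    for mode in MODES:
        backend = backends.get(mode)  # hypothetical callable: url -> html
        if backend is None:
            last = ScrapeResult(False, mode, error_code="STEALTH_UNAVAILABLE",
                                suggestion=f"install a backend for {mode!r}")
            continue
        try:
            return ScrapeResult(True, mode, html=backend(url))
        except TimeoutError:
            last = ScrapeResult(False, mode, error_code="TIMEOUT",
                                suggestion="retry with a stealthier mode")
        except ConnectionError:
            last = ScrapeResult(False, mode, error_code="CONNECTION_ERROR",
                                suggestion="check the URL or proxy settings")
    return last
```

Returning a result object instead of raising lets the caller (for example, an AI client) branch on `error_code` and show the `suggestion`.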

### 📊 Four-Level Backpressure Defense
- **Engine** layer: request generation control (enqueue + TaskManager dual checks)
- **QueueManager** layer: strategy-driven (`QueueSizeStrategy` / `AdaptiveStrategy` / `CompositeStrategy`)
- **MemoryQueue** layer: Mixin delegation + fallback logic
- **Hard limit**: direct rejection when queue is full
- Smart enhancement: `IntelligentBackpressureCalculator` + `BackpressureMonitor` optional integration
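
The soft-throttle and hard-limit ideas above can be sketched with a bounded `asyncio.Queue`. This is an illustrative sketch under stated assumptions, not Crawlo's code: `BoundedRequestQueue`, its `soft_limit` parameter, and `demo` are hypothetical, and the real layers (`Engine`, `QueueManager`, `MemoryQueue`) are not modeled.

```python
import asyncio


class BoundedRequestQueue:
    """Hypothetical queue with a soft threshold and a hard limit."""

    def __init__(self, hard_limit: int, soft_limit: int):
        self._queue = asyncio.Queue(maxsize=hard_limit)
        self._soft_limit = soft_limit

    def under_pressure(self) -> bool:
        # Soft signal: the producer should slow down.
        return self._queue.qsize() >= self._soft_limit

    def try_enqueue(self, request) -> bool:
        # Hard limit: reject immediately instead of waiting for space.
        try:
            self._queue.put_nowait(request)
            return True
        except asyncio.QueueFull:
            return False


async def demo() -> tuple:
    queue = BoundedRequestQueue(hard_limit=5, soft_limit=4)
    accepted = rejected = throttled = 0
    for i in range(8):
        if queue.under_pressure():
            throttled += 1
            await asyncio.sleep(0)  # yield so consumers could drain the queue
        if queue.try_enqueue(f"request-{i}"):
            accepted += 1
        else:
            rejected += 1
    return accepted, rejected, throttled
```

With no consumer running, `asyncio.run(demo())` accepts the first five requests and rejects the remaining three, which is the "direct rejection when queue is full" behavior in its simplest form.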

### 📬 Multi-Channel Notification
- **5 Channels**: DingTalk / Feishu / WeCom / Email / SMS
- **30+ Preset Templates**: task start/stop, anomaly alerts, progress updates, DB monitoring
- **Async Delivery**: `async_send_*` functions wrap blocking sends in `run_in_executor` to avoid blocking the event loop
- Message dedup + rate limiting to prevent notification storms
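
The delivery pattern above (a `run_in_executor` wrapper plus dedup and rate limiting) might look roughly like the following. It is a minimal sketch: the `Notifier` class, its parameters, and `demo` are hypothetical and stand in for Crawlo's own `async_send_*` functions, whose signatures are not shown here.

```python
import asyncio
import functools
import time


class Notifier:
    """Hypothetical wrapper around a blocking send function."""

    def __init__(self, send_blocking, min_interval: float = 1.0,
                 dedup_window: float = 60.0):
        self._send_blocking = send_blocking  # e.g. a webhook POST
        self._min_interval = min_interval    # seconds between any two sends
        self._dedup_window = dedup_window    # seconds to suppress repeats
        self._last_sent_at = None
        self._seen = {}                      # message -> time last sent

    async def async_send(self, message: str) -> bool:
        now = time.monotonic()
        last_same = self._seen.get(message)
        if last_same is not None and now - last_same < self._dedup_window:
            return False  # duplicate within the dedup window
        if (self._last_sent_at is not None
                and now - self._last_sent_at < self._min_interval):
            return False  # rate limited
        self._seen[message] = now
        self._last_sent_at = now
        # Run the blocking call in the default thread pool so the
        # event loop stays responsive.
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(
            None, functools.partial(self._send_blocking, message))
        return True


async def demo() -> tuple:
    delivered = []
    notifier = Notifier(delivered.append, min_interval=0.0)
    results = [
        await notifier.async_send("spider started"),
        await notifier.async_send("spider started"),   # suppressed repeat
        await notifier.async_send("spider finished"),
    ]
    return results, delivered
```

Here `delivered.append` plays the role of the blocking channel call; a real sender would perform an HTTP request or SMTP send instead.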

### 🔄 Three Deployment Modes

| Mode | Config | Coordination | Use Case |
|------|-------|-------------|----------|
| **Memory Mode** | `RUN_MODE='standalone'` `QUEUE_TYPE='memory'` | None (auto exit) | Dev/debug, quick validation |
| **Multi-Node** ⭐ | `RUN_MODE='auto'` `QUEUE_TYPE='redis'` | Competing consumption (BZPOPMIN) | Multi-machine, task loss acceptable |
| **Distributed** | `RUN_MODE='distributed'` `QUEUE_TYPE='redis_stream'` | ACK + heartbeat + failover | Production, high reliability |

> All three modes share the same priority model — switch without modifying spider code.
> [Learn More →](docs/concepts/architecture.md#2-部署模式-deployment-modes)
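
As a rough illustration of switching modes through configuration alone, the settings from the table can be grouped into presets. This sketch assumes project settings live in a Python module, as the `RUN_MODE='standalone'` syntax suggests; `MODE_PRESETS` and `settings_for` are hypothetical helpers, and the Redis connection settings that the Redis-backed modes would need are not shown.

```python
# settings.py (sketch): values copied from the table above.
MODE_PRESETS = {
    "memory": {"RUN_MODE": "standalone", "QUEUE_TYPE": "memory"},
    "multi-node": {"RUN_MODE": "auto", "QUEUE_TYPE": "redis"},
    "distributed": {"RUN_MODE": "distributed", "QUEUE_TYPE": "redis_stream"},
}


def settings_for(mode: str) -> dict:
    """Return a copy of the preset for the given mode name."""
    if mode not in MODE_PRESETS:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(MODE_PRESETS)}")
    return dict(MODE_PRESETS[mode])


# Pick one preset and expose it as module-level settings.
_selected = settings_for("memory")
RUN_MODE = _selected["RUN_MODE"]
QUEUE_TYPE = _selected["QUEUE_TYPE"]
```

Changing the argument to `settings_for` is the only edit needed to move between presets; spider code is untouched.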

---

## <a id="docs-en"></a>📚 Documentation

### 🎯 By Role

| You are? | Recommended Reading |
|----------|-------------------|
| **Beginner** | [5-Min Quickstart](docs/getting-started/5min-quickstart.md) → [Installation](docs/getting-started/installation.md) |
| **Developer** | [Configuration Guide](docs/guides/configuration/) → [Scheduling Guide](docs/guides/scheduling/) |
| **Ops** | [Run Mode Deep Dive](docs/guides/configuration/run-modes.md) → [Checkpoint System](docs/concepts/checkpoint-guide.md) |

### 📖 Full Docs Navigation

- 🚀 **[Getting Started](docs/getting-started/)** - Install, create your first spider
- 📚 **[Tutorials](docs/tutorials/)** - Complete guides from basics to production
- 🎯 **[Guides](docs/guides/)** - Scenario-based deep dives
  - [Configuration](docs/guides/configuration/), [Scheduling](docs/guides/scheduling/)
  - [Backpressure](docs/guides/scheduling/backpressure.md), [Run Modes](docs/guides/configuration/run-modes.md)
- 📖 **[Concepts](docs/concepts/)** - Architecture, lifecycle, error handling
- 🔧 **[API Reference](docs/reference/)** - Complete API docs
- 💡 **[Examples](docs/examples/)** - Real-world examples and best practices
- ❓ **[FAQ](docs/faq/)** - Common questions and troubleshooting

👉 **[Browse Complete Docs →](docs/index.md)**

---

## <a id="examples-en"></a>💡 Examples

Check out the [`examples/`](examples/) directory:
- **Basic** - Quick start
- **Advanced** - Complex scenarios
- **Production** - Ready for production

👉 **[View All Examples →](docs/examples/)**

---

## 🤝 Contributing

Issues and Pull Requests are welcome!

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

---

## 📄 License

Licensed under the BSD 3-Clause License. See the [LICENSE](LICENSE) file for details.

---

<p align="center">
  <strong>⭐ If this project helps you, please give us a Star!</strong>
</p>
