Metadata-Version: 2.4
Name: red-crawler
Version: 0.1.0
Summary: Xiaohongshu contact lead crawler for fashion creators
Project-URL: Homepage, https://github.com/Batxent/red-crawler
Project-URL: Repository, https://github.com/Batxent/red-crawler
Project-URL: Issues, https://github.com/Batxent/red-crawler/issues
Keywords: contacts,crawler,playwright,rednote,xiaohongshu
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
Requires-Python: >=3.9
Requires-Dist: beautifulsoup4>=4.13.3
Requires-Dist: browserforge>=1.2.4
Requires-Dist: playwright-stealth<2.0.0
Requires-Dist: playwright>=1.52.0
Requires-Dist: setuptools<70
Description-Content-Type: text/markdown

# red-crawler

CLI crawler for collecting Xiaohongshu beauty creator contact leads from profile bios and recommendation chains, with SQLite persistence and nightly automation.

## Usage

Install the published CLI:

```bash
uv tool install red-crawler==0.1.0
```

Install the Playwright browser runtime:

```bash
red-crawler install-browsers
```

For local development from a checkout:

```bash
uv sync
uv run playwright install chromium
```

Save a reusable login session first:

```bash
red-crawler login --save-state "./state.json"
```

This opens a visible browser window. Log in to Xiaohongshu there, then return to the terminal and press Enter to save the session file.

Run a manual crawl with an existing Playwright storage state file:

```bash
red-crawler crawl-seed \
  --seed-url "https://www.xiaohongshu.com/user/profile/USER_ID" \
  --storage-state "./state.json" \
  --max-accounts 20 \
  --max-depth 2 \
  --db-path "./data/red_crawler.db" \
  --output-dir "./output"
```

`crawl-seed` defaults to safe mode, which adds slower request pacing and dwell/scroll delays so traffic looks more like a normal browsing session. Pass `--no-safe-mode` only if you explicitly want a faster run.
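
For example, an explicitly faster run using the flags documented above:

```bash
red-crawler crawl-seed \
  --seed-url "https://www.xiaohongshu.com/user/profile/USER_ID" \
  --storage-state "./state.json" \
  --no-safe-mode
```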

`crawl-seed` does both of the following (a verification sketch follows the list):

- exports `accounts.csv`, `contact_leads.csv`, `run_report.json`
- upserts the same result into SQLite
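
To sanity-check the SQLite upsert, you can inspect the database directly. The `accounts` table name below is an assumption about the schema, not a documented interface:

```bash
# List the actual tables first; the schema is not documented here.
sqlite3 ./data/red_crawler.db ".tables"
# Assumed table name -- verify against the .tables output above.
sqlite3 ./data/red_crawler.db "SELECT COUNT(*) FROM accounts;"
```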

Optional note-page expansion:

```bash
red-crawler crawl-seed \
  --seed-url "https://www.xiaohongshu.com/user/profile/USER_ID" \
  --storage-state "./state.json" \
  --include-note-recommendations
```

List high-quality contactable creators from the SQLite database:

```bash
red-crawler list-contactable \
  --db-path "./data/red_crawler.db" \
  --min-relevance-score 0.7 \
  --limit 20
```

Run nightly auto-collection with queue, search bootstrap, seed promotion, and daily report output:

```bash
red-crawler collect-nightly \
  --storage-state "./state.json" \
  --db-path "./data/red_crawler.db" \
  --report-dir "./reports" \
  --cache-dir "./.cache/red-crawler" \
  --crawl-budget 30
```

Export weekly growth report and a contactable creator CSV:

```bash
red-crawler report-weekly \
  --db-path "./data/red_crawler.db" \
  --report-dir "./reports" \
  --days 7
```

Key outputs:

- manual crawl:
  - `accounts.csv`
  - `contact_leads.csv`
  - `run_report.json`
- nightly automation:
  - `reports/daily-run-report.json`
  - `reports/weekly-growth-report.json`
  - `reports/contactable_creators.csv`
- SQLite database:
  - `data/red_crawler.db`
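
To spot-check a report file without assuming its schema, pretty-print it with the standard library:

```bash
python3 -m json.tool reports/daily-run-report.json
```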

## OpenClaw

The OpenClaw skill for this project lives at `openclaw-skills/red-crawler-ops/`.

To install it from a local path, point OpenClaw at that folder, or copy the skill directory into your OpenClaw skills location and register the same path.

Use the OpenClaw skill actions in this order:

- `bootstrap` validates a local working directory and can run Chromium installation when explicitly requested.
- `login` creates the Playwright storage state explicitly.
- `crawl_seed` and `collect_nightly` require an authenticated Playwright storage state file.
- `report_weekly` and `list_contactable` run from the SQLite database and do not require `--storage-state`.

The skill does not clone repositories or create login sessions implicitly. Install the `red-crawler` CLI package first, point `workspace_path` at a local working directory, and run `bootstrap` only for reviewed local setup steps. Run `login` when you are ready to create `state.json`.

## Publishing

The package builds as a standard Python wheel and source distribution:

```bash
uv build
```
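
A minimal TestPyPI upload sketch, assuming a token is set in `UV_PUBLISH_TOKEN` (defer to docs/publishing.md for the authoritative commands):

```bash
uv publish --publish-url https://test.pypi.org/legacy/ dist/*
```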

See [docs/publishing.md](docs/publishing.md) for the release checklist and PyPI/TestPyPI commands.

## launchd

For macOS local scheduling, use the template at [docs/launchd/red-crawler.collect-nightly.plist](docs/launchd/red-crawler.collect-nightly.plist).

Replace the placeholder paths (a `sed` sketch for this follows the list):

- `__WORKDIR__`
- `__UV_BIN__`
- `__STORAGE_STATE__`
- `__DB_PATH__`
- `__REPORT_DIR__`
- `__CACHE_DIR__`
- `__LOG_DIR__`
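
One way to fill the placeholders in place is with BSD `sed`; the values below are illustrative, so substitute your own paths:

```bash
# Edits the checked-out template in place; all paths are examples only.
sed -i '' \
  -e "s|__WORKDIR__|$HOME/red-crawler|g" \
  -e "s|__UV_BIN__|$(command -v uv)|g" \
  -e "s|__STORAGE_STATE__|$HOME/red-crawler/state.json|g" \
  -e "s|__DB_PATH__|$HOME/red-crawler/data/red_crawler.db|g" \
  -e "s|__REPORT_DIR__|$HOME/red-crawler/reports|g" \
  -e "s|__CACHE_DIR__|$HOME/.cache/red-crawler|g" \
  -e "s|__LOG_DIR__|$HOME/red-crawler/logs|g" \
  docs/launchd/red-crawler.collect-nightly.plist
```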

Then load it with:

```bash
launchctl unload ~/Library/LaunchAgents/com.red-crawler.collect-nightly.plist 2>/dev/null || true
cp docs/launchd/red-crawler.collect-nightly.plist ~/Library/LaunchAgents/com.red-crawler.collect-nightly.plist
launchctl load ~/Library/LaunchAgents/com.red-crawler.collect-nightly.plist
```

