Metadata-Version: 2.4
Name: vidistill
Version: 0.3.3
Summary: Video content distillation CLI tool
Requires-Python: >=3.13
Requires-Dist: click>=8.3.2
Requires-Dist: playwright>=1.58.0
Requires-Dist: pydantic-ai>=1.82.0
Requires-Dist: python-dotenv>=1.2.2
Requires-Dist: rich>=15.0.0
Requires-Dist: tos>=2.9.0
Requires-Dist: yt-dlp>=2026.3.17
Description-Content-Type: text/markdown

# vidistill

A toolset to download, parse and analyse social-media video content.

## Install

```bash
uv sync
uv run playwright install chromium   # one-time: downloads ~150 MB browser binary
```

`playwright install chromium` is required by the `collect` subcommand, which
drives a headless browser against Bilibili pages.

## Environment

Copy `.env.example` → `.env` and fill in:

- `OPENROUTER_API_KEY` — LLM provider for `analyse`.
- `VOLC_ACCESS_KEY_ID` / `VOLC_SECRET_ACCESS_KEY` / `VOLC_TOS_BUCKET` —
  Volcengine TOS and Doubao (豆包) ASR credentials used by the default transcriber.
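
A quick pre-flight check can catch a half-filled `.env` before a long run. The sketch below is not part of vidistill; the helper name `missing_vars` is hypothetical, and it assumes all four variables listed above are required.

```python
import os

# Variable names taken from the list above; treating all of them as
# mandatory is an assumption of this sketch.
REQUIRED_VARS = (
    "OPENROUTER_API_KEY",
    "VOLC_ACCESS_KEY_ID",
    "VOLC_SECRET_ACCESS_KEY",
    "VOLC_TOS_BUCKET",
)


def missing_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Since `python-dotenv` is among the declared dependencies, calling `dotenv.load_dotenv()` before `missing_vars()` would load `.env` into `os.environ` first.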

## CLI

```
vidistill analyse <input> [--output-dir …] [--model …] [--transcriber …] [--diarize]
vidistill collect <page-path> [--rule …] [--num …] [--sort …] [--max-scrolls …] [--headful] [--storage-state …]
```

### `vidistill analyse`

Runs the full pipeline on a single video: download → audio extraction →
transcription → LLM analysis → markdown report.

```bash
vidistill analyse https://www.bilibili.com/video/BV1xx411c7mu
```

### `vidistill collect`

Scrapes a Bilibili listing page, filters videos by a rule expression, and writes
the top-N matches as JSON. Supported page types:

| Type     | URL shape                                                        |
|----------|------------------------------------------------------------------|
| homepage | `www.bilibili.com/` (the recommended feed)                       |
| user     | `space.bilibili.com/<uid>`                                       |
| search   | `search.bilibili.com/all?keyword=…`                              |
| channel  | `www.bilibili.com/c/<slug>`                                      |
| popular  | `www.bilibili.com/v/popular/<column>` (`all`, `weekly`, …)       |
| topic    | `www.bilibili.com/v/topic/detail?topic_id=<id>`                  |

```bash
vidistill collect https://space.bilibili.com/12345 \
    --rule='play>10000 and like>=500' --num=20

vidistill collect "https://search.bilibili.com/all?keyword=LLM" \
    -r 'play>50000 and duration<=600' -n 10 --sort=play
```
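
The page-type table above can be read as a small URL classifier. The following is a minimal sketch of that mapping, not vidistill's actual detection logic; the function name `classify_page` and the exact matching rules are assumptions derived only from the URL shapes in the table.

```python
from urllib.parse import urlparse


def classify_page(url):
    """Map a Bilibili listing URL to a page type from the table, or None."""
    # Accept scheme-less input such as "www.bilibili.com/c/<slug>".
    parsed = urlparse(url if "://" in url else f"https://{url}")
    host = parsed.netloc.lower()
    path = parsed.path.rstrip("/")

    if host == "space.bilibili.com" and path:
        return "user"
    if host == "search.bilibili.com" and path == "/all":
        return "search"
    if host == "www.bilibili.com":
        if path == "":
            return "homepage"
        if path.startswith("/c/"):
            return "channel"
        if path.startswith("/v/popular/"):
            return "popular"
        if path == "/v/topic/detail":
            return "topic"
    return None  # not a listing page this sketch recognises
```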

Rule DSL (`--rule`) supports comparisons (`== != > >= < <=`), boolean ops
(`and`, `or`, `not`), parentheses, and the following fields: `play`, `like`,
`coin`, `favorite` (alias `star`), `share`, `danmaku`, `comment` (alias
`reply`), `duration` (seconds), `publish_days_ago`, `title`, `author`. If a video
lacks a field that the rule references, the whole rule evaluates to `False` for
that video.
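
To make the rule semantics concrete, here is a minimal sketch of an evaluator with the same surface syntax. It is not vidistill's parser: it reuses Python's `ast` module because the DSL's operators happen to be valid Python expressions, and the names `matches`, `ALIASES`, and `MissingField` are hypothetical. It assumes each video is a plain dict keyed by the field names above.

```python
import ast
import operator

# Aliases named in the README, resolved to their primary field name.
ALIASES = {"star": "favorite", "reply": "comment"}

COMPARATORS = {
    ast.Eq: operator.eq,
    ast.NotEq: operator.ne,
    ast.Gt: operator.gt,
    ast.GtE: operator.ge,
    ast.Lt: operator.lt,
    ast.LtE: operator.le,
}


class MissingField(Exception):
    """Raised when a rule references a field the video does not have."""


def _eval(node, video):
    if isinstance(node, ast.BoolOp):
        # Evaluate every operand (no short-circuit) so that a missing
        # field anywhere in the rule is always noticed.
        values = [bool(_eval(v, video)) for v in node.values]
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return not _eval(node.operand, video)
    if isinstance(node, ast.Compare):
        operands = [_eval(n, video) for n in [node.left, *node.comparators]]
        for op, left, right in zip(node.ops, operands, operands[1:]):
            compare = COMPARATORS.get(type(op))
            if compare is None:
                raise ValueError(f"unsupported operator: {type(op).__name__}")
            if not compare(left, right):
                return False
        return True
    if isinstance(node, ast.Name):
        key = ALIASES.get(node.id, node.id)
        if video.get(key) is None:
            raise MissingField(key)
        return video[key]
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError(f"unsupported syntax: {type(node).__name__}")


def matches(rule, video):
    """Return True if `video` satisfies `rule`; missing fields give False."""
    tree = ast.parse(rule, mode="eval")
    try:
        return bool(_eval(tree.body, video))
    except MissingField:
        return False
```

Walking the syntax tree with an explicit whitelist of node types, rather than calling `eval`, keeps arbitrary code in a rule string from running.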

Output lands in `<output-dir>/collect_<page-type>_<identifier>_<ts>.json`.
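
Downstream scripts can pick up the newest result by globbing on that filename pattern. This is a hedged sketch: the helper names are hypothetical, and because the JSON schema is not documented here, the loader returns whatever `json.load` yields without interpreting it.

```python
import json
from pathlib import Path


def latest_collect_output(output_dir, page_type="*"):
    """Return the most recently modified collect_*.json file, or None."""
    candidates = sorted(
        Path(output_dir).glob(f"collect_{page_type}_*.json"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


def load_collect_output(path):
    """Parse one output file; the structure of the result is not assumed."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```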

#### Bilibili login & storage state

Bilibili's anti-bot layer rejects scraping from logged-out (guest) sessions. The
`--storage-state` flag points to a Playwright storage-state file (cookies +
localStorage) that persists the login across runs. Default:
`default.storage_state` in the current directory.

**First run (no login yet):**
```bash
vidistill collect https://space.bilibili.com/<uid> -n 10
```
If `default.storage_state` doesn't exist, a browser window opens. Log in to
Bilibili there. When you close the browser, storage state is saved
automatically and scraping proceeds. Subsequent runs reuse that state.

**Explicit path:**
```bash
vidistill collect <url> --storage-state ~/my-session.storage_state -n 10
```
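
For readers unfamiliar with Playwright's storage state, the flow described above can be sketched with its sync API. This is an illustration, not vidistill's implementation: `needs_login` and `open_context` are hypothetical names, and the sketch waits for an Enter key press instead of detecting that the browser was closed, which is a deliberate simplification.

```python
from pathlib import Path


def needs_login(state_path):
    """A login is needed when no storage-state file has been saved yet."""
    return not Path(state_path).exists()


def open_context(playwright, state_path="default.storage_state"):
    """Return (browser, context) with a logged-in session loaded.

    `playwright` is the object yielded by `sync_playwright()`.
    """
    if needs_login(state_path):
        # First run: open a visible browser so the user can log in.
        browser = playwright.chromium.launch(headless=False)
        context = browser.new_context()
        context.new_page().goto("https://www.bilibili.com/")
        input("Log in in the browser window, then press Enter here... ")
        # Persist cookies + localStorage before closing the browser.
        context.storage_state(path=state_path)
        browser.close()

    # Later runs (and the rest of the first run) reuse the saved state.
    browser = playwright.chromium.launch(headless=True)
    return browser, browser.new_context(storage_state=state_path)


# Usage (requires the playwright package and an installed Chromium):
#
#   from playwright.sync_api import sync_playwright
#   with sync_playwright() as p:
#       browser, context = open_context(p)
#       ...
#       browser.close()
```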

## Development

```bash
uv run pytest
```
