Metadata-Version: 2.4
Name: vidclaude
Version: 0.2.0
Summary: Multimodal video understanding for Claude Code — extract frames, transcribe audio, build timelines from any video
License-Expression: MIT
Keywords: video,claude,multimodal,whisper,analysis
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Multimedia :: Video
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: Pillow>=10.0
Requires-Dist: faster-whisper>=1.0
Provides-Extra: ocr
Requires-Dist: pytesseract>=0.3.10; extra == "ocr"
Provides-Extra: api
Requires-Dist: anthropic>=0.40.0; extra == "api"

# vidclaude — Multimodal Video Understanding

A Python CLI tool that extracts structured evidence from videos (frames, audio
transcript, OCR, temporal timeline) for analysis by Claude.

## Quick Start

### Option 1: npm (easiest)
```bash
npm install -g vidclaude
vidclaude your_video.mp4 --extract --mode standard --verbose
```

### Option 2: pip
```bash
pip install vidclaude
vidclaude your_video.mp4 --extract --mode standard --verbose
```

### Option 3: From source
```bash
git clone <repo-url> && cd claudevid
setup.bat          # Windows
bash setup.sh      # macOS / Linux
python video_understand.py your_video.mp4 --extract --mode standard --verbose
```

No API key needed. If you have a Claude Max/Pro plan, the tool works entirely
through Claude Code — Claude in your conversation does the reasoning.

## How It Works

```
Video File → ffmpeg extraction → Frames + Audio + Metadata
                                        ↓
                              faster-whisper large-v3 → Transcript (Hindi, English, 90+ languages)
                              pytesseract → OCR text (optional)
                              Shot detection → Scene boundaries
                                        ↓
                              Timeline builder → Unified event list
                                        ↓
                              evidence.md + cached frames → Claude reasons over it
```

**No API key needed.** Your Claude Max/Pro plan covers everything.
The tool extracts evidence, and Claude in your conversation reasons over it.

## Prerequisites

| Requirement | How to install |
|-------------|---------------|
| Python 3.10+ | [python.org](https://python.org) |
| ffmpeg | Windows: `winget install ffmpeg` / macOS: `brew install ffmpeg` / Linux: `sudo apt install ffmpeg` |

## Installation

### Option A: One-line setup (recommended)
```bash
# Windows
setup.bat

# macOS / Linux
bash setup.sh
```

### Option B: Manual
```bash
pip install -r requirements.txt
```

### Optional extras
```bash
pip install pytesseract    # OCR (also needs Tesseract binary)
pip install anthropic      # Only for standalone --api mode
```

## Usage

### Inside Claude Code (recommended)

1. Copy `SKILL.md` into your project (or keep it here)
2. Ask Claude: *"analyze the video at D:/path/to/video.mp4"*
3. Claude runs the extraction, reads the evidence, and answers
4. Ask follow-up questions — the cache is reused instantly

### From the command line

```bash
# Standard analysis (recommended for most videos)
python video_understand.py video.mp4 --extract --mode standard --verbose

# Quick analysis (fast, fewer frames)
python video_understand.py video.mp4 --extract --mode quick

# Deep analysis (dense frames, full OCR)
python video_understand.py video.mp4 --extract --mode deep --verbose

# Process a folder of videos
python video_understand.py ./videos/ --extract --verbose

# Skip audio / OCR
python video_understand.py video.mp4 --extract --no-audio --no-ocr

# Force fresh extraction (ignore cache)
python video_understand.py video.mp4 --extract --no-cache --verbose
```

## Processing Modes

| Mode | Frames | Audio model | OCR | Best for |
|------|--------|-------------|-----|----------|
| `quick` | ~20, uniform sampling | whisper base | skip | Fast overview, short clips |
| `standard` | ~60, shot-aware | whisper large-v3 | keyframes | General analysis |
| `deep` | ~150, burst sampling | whisper large-v3 | all frames | Detailed review, long videos |

## Caching

First run extracts everything to `.vidcache/<hash>/`:
```
.vidcache/a3f7b2c1/
  meta.json          # Video metadata
  frames/            # Extracted JPEG frames
  transcript.json    # Timestamped transcript
  ocr.json           # OCR results
  timeline.json      # Merged timeline
  evidence.md        # Human-readable report
```

Follow-up questions reuse the cache — no re-extraction needed.
Delete `.vidcache/` to free disk space.

## CLI Reference

| Flag | Default | Description |
|------|---------|-------------|
| `input` | required | Video file or folder path |
| `--extract` | - | Extract only (skill mode, no API key) |
| `-q "..."` | none | Question (for --api mode) |
| `--mode` | standard | `quick` / `standard` / `deep` |
| `-f N` | auto | FPS override |
| `-m N` | auto | Max frames override |
| `--no-audio` | - | Skip transcription |
| `--no-ocr` | - | Skip OCR |
| `--no-cache` | - | Force re-extraction |
| `--verbose` | - | Detailed progress |
| `-o file` | stdout | Output file |
| `--batch-summary` | - | Cross-video summary for folders |

## Project Structure

```
video_understand.py          # CLI entry point
SKILL.md                     # Claude Code skill definition
setup.bat / setup.sh         # One-click setup scripts
requirements.txt             # Python dependencies
vidclaude/
  cli.py                     # Argument parsing, orchestration
  models.py                  # Data model (VideoMeta, Frame, Shot, etc.)
  ingest.py                  # Layer A: Video validation + metadata
  segment.py                 # Layer B+C: Shot detection + adaptive sampling
  audio.py                   # Layer D: faster-whisper transcription
  ocr.py                     # Layer E: Text extraction from frames
  intent.py                  # Intent classification (adjusts pipeline)
  timeline.py                # Layer G: Temporal event merging
  memory.py                  # Layer I: Hierarchical summaries
  reason.py                  # Layer J: Evidence assembly
  util.py                    # Shared helpers
```

## Architecture

Based on [claude_video_understanding_architecture.md](claude_video_understanding_architecture.md),
this tool implements a multi-layer video understanding pipeline:

- **Layer A** (Ingestion): Format validation, ffprobe metadata
- **Layer B** (Segmentation): Shot boundary detection via scene filter
- **Layer C** (Adaptive Sampling): Content-aware frame selection
- **Layer D** (Audio): faster-whisper large-v3 ASR with timestamps (90+ languages)
- **Layer E** (OCR): Text extraction from key frames
- **Layer G** (Timeline): Unified temporal event list
- **Layer I** (Memory): Hierarchical summaries for long videos
- **Layer J** (Reasoning): Evidence assembly for Claude

Claude serves as the reasoning brain — the tool provides structured,
time-grounded evidence for Claude to analyze.

## Language Support

Uses faster-whisper with large-v3 model which supports 90+ languages including:
Hindi, English, Spanish, French, German, Chinese, Japanese, Arabic, and more.
Language is auto-detected with confidence scoring.
