Metadata-Version: 2.4
Name: vidclaude
Version: 0.2.2
Summary: Multimodal video understanding for Claude Code — extract frames, transcribe audio, build timelines from any video
License-Expression: MIT
Keywords: video,claude,multimodal,whisper,analysis
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Multimedia :: Video
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: Pillow>=10.0
Requires-Dist: faster-whisper>=1.0
Provides-Extra: ocr
Requires-Dist: pytesseract>=0.3.10; extra == "ocr"
Provides-Extra: api
Requires-Dist: anthropic>=0.40.0; extra == "api"

# vidclaude

Multimodal video understanding for Claude Code. Extract frames, transcribe audio in 90+ languages, build temporal timelines — all from a single command. No API key needed.

```bash
pip install vidclaude
vidclaude video.mp4 --mode standard --verbose
```

## What it does

Drop a video in, get structured evidence out. Claude in your conversation does the thinking.

```
Video File
  │
  ├─ ffmpeg ──────────► Frames (adaptive, shot-aware sampling)
  ├─ faster-whisper ──► Transcript with timestamps (large-v3, 90+ languages)
  ├─ pytesseract ─────► On-screen text / OCR (optional)
  ├─ scene detection ─► Shot boundaries
  │
  └─► Timeline ──► evidence.md ──► Claude reasons over it
```

Works with Hindi, English, Spanish, Japanese, Arabic — any of the 90+ languages Whisper supports. Language is auto-detected.

## Prerequisites

| Requirement | Install |
|-------------|---------|
| Python 3.10+ | [python.org](https://python.org) |
| ffmpeg | Windows: `winget install ffmpeg` / macOS: `brew install ffmpeg` / Linux: `sudo apt install ffmpeg` |

## Install

```bash
pip install vidclaude
```

That's it. First run downloads the Whisper model (~3GB, one time).

## Usage

### With Claude Code (recommended)

Set up the skill once:

```bash
vidclaude --install-skill
```

Then in Claude Code, just say:

> "analyze the video at C:/Users/me/Videos/meeting.mp4"
>
> "what does the speaker say about the budget in presentation.mp4?"
>
> "when does the logo appear in intro.mov?"

Claude runs the extraction, reads the evidence report + frames, and answers your question. Your Max/Pro plan covers everything — no API key needed.

Follow-up questions about the same video are instant (cached).

### From the command line

```bash
# Standard analysis — good for most videos
vidclaude video.mp4 --mode standard --verbose

# Quick — fewer frames, faster
vidclaude video.mp4 --mode quick

# Deep — dense frames, full OCR, detailed
vidclaude video.mp4 --mode deep --verbose

# Batch process a folder
vidclaude ./videos/ --verbose

# Skip audio transcription
vidclaude video.mp4 --no-audio

# Force fresh extraction (ignore cache)
vidclaude video.mp4 --no-cache
```

### Output

Every run creates a `.vidcache/` directory next to the video:

```
.vidcache/a3f7b2c1/
  evidence.md        ← Human-readable report (Claude reads this)
  frames/            ← Extracted JPEG frames
  transcript.json    ← Timestamped transcript
  timeline.json      ← Unified event timeline
  meta.json          ← Video metadata
  shots.json         ← Shot boundaries
  ocr.json           ← On-screen text
  summaries.json     ← Scene/chapter summaries
```
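
To check what a run produced before handing it to Claude, a small script can walk one cache entry. This is a minimal sketch, not part of vidclaude: `describe_cache` is a hypothetical helper, the file names come from the listing above, and the `.jpg` extension for frames is an assumption.

```python
from pathlib import Path

# File names taken from the cache listing; ".jpg" for frames is an assumption.
ARTIFACTS = [
    "evidence.md", "transcript.json", "timeline.json",
    "meta.json", "shots.json", "ocr.json", "summaries.json",
]


def describe_cache(cache_dir):
    """Report which artifacts exist in one cache entry and count its frames."""
    root = Path(cache_dir)
    present = [name for name in ARTIFACTS if (root / name).is_file()]
    missing = [name for name in ARTIFACTS if name not in present]
    frames = root / "frames"
    frame_count = len(list(frames.glob("*.jpg"))) if frames.is_dir() else 0
    return {"present": present, "missing": missing, "frame_count": frame_count}


if __name__ == "__main__":
    # The directory name is the example hash from the listing above.
    print(describe_cache(".vidcache/a3f7b2c1"))
```

Some files may legitimately be absent, for example `ocr.json` when OCR is skipped, so a non-empty `missing` list is not necessarily an error.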

## Modes

| Mode | Frames | Whisper model | OCR | Best for |
|------|--------|---------------|-----|----------|
| `quick` | ~20, uniform | base | skip | Short clips, fast overview |
| `standard` | ~60, shot-aware | large-v3 | keyframes | General use |
| `deep` | ~150, burst sampling | large-v3 | all frames | Long videos, detailed analysis |

**Smart frame budget**: If a video is too long for the frame limit, FPS is automatically reduced. Shot boundary frames are always prioritized.
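
The arithmetic behind the frame budget can be sketched as follows. This illustrates the idea only and is not vidclaude's actual code: `effective_fps` is a hypothetical helper, and the sketch ignores the priority given to shot boundary frames.

```python
def effective_fps(duration_s: float, requested_fps: float, max_frames: int) -> float:
    """Return a sampling rate that keeps the frame count within max_frames."""
    if duration_s <= 0 or max_frames <= 0:
        raise ValueError("duration_s and max_frames must be positive")
    if duration_s * requested_fps <= max_frames:
        return requested_fps  # already within budget
    return max_frames / duration_s  # lower the rate so the whole video fits


if __name__ == "__main__":
    # A 10-minute video at 1 fps would yield 600 frames; with a 60-frame
    # budget the rate drops to one frame every 10 seconds.
    print(effective_fps(600, 1.0, 60))  # prints 0.1
```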

## CLI Reference

```
vidclaude [input] [options]

positional:
  input                    Video file or folder

setup:
  --install-skill          Copy SKILL.md to current dir for Claude Code

processing:
  --mode {quick,standard,deep}   Processing mode (default: standard)
  -f, --fps N              Override frames per second
  -m, --max-frames N       Override max frame count
  --no-audio               Skip audio transcription
  --no-ocr                 Skip OCR extraction
  --no-cache               Force re-extraction

output:
  --extract                Print cache path summary
  -o, --output FILE        Write output to file
  --verbose                Show detailed progress
  --batch-summary          Cross-video summary for folders
```

## How it works

Built on a multi-layer video understanding architecture:

- **Layer A — Ingestion**: Validates format (MP4, MOV, MKV, WebM, AVI), extracts metadata via ffprobe
- **Layer B — Segmentation**: Detects shot boundaries using ffmpeg's scene change filter
- **Layer C — Adaptive Sampling**: Content-aware frame selection — more frames at scene transitions, smart frame budgets per mode
- **Layer D — Audio**: faster-whisper with large-v3 for multilingual transcription with word-level timestamps and VAD filtering
- **Layer E — OCR**: pytesseract text extraction from key frames (optional)
- **Layer G — Timeline**: Merges speech, OCR, and scene events into a single time-sorted list
- **Layer I — Memory**: Hierarchical summaries for longer videos (scene → chapter → global)
- **Layer J — Evidence Assembly**: Generates `evidence.md` with frame references, transcript, timeline for Claude to reason over
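
The timeline merge in Layer G can be pictured with a short sketch. The `Event` class and `merge_timeline` function are hypothetical names for illustration; the real data model in `models.py` and the merge logic in `timeline.py` may differ.

```python
from dataclasses import dataclass
from itertools import chain


@dataclass(frozen=True)
class Event:
    t: float    # seconds from the start of the video
    kind: str   # "speech", "ocr", or "shot"
    text: str


def merge_timeline(*streams):
    """Combine event streams into one list sorted by timestamp."""
    # sorted() is stable, so events with equal timestamps keep stream order.
    return sorted(chain.from_iterable(streams), key=lambda e: e.t)


if __name__ == "__main__":
    shots = [Event(0.0, "shot", "shot 1"), Event(2.8, "shot", "shot 2")]
    speech = [Event(0.5, "speech", "Welcome everyone"),
              Event(4.2, "speech", "Let's look at the budget")]
    ocr = [Event(3.0, "ocr", "Q3 Budget")]
    for e in merge_timeline(shots, speech, ocr):
        print(f"{e.t:6.1f}s  {e.kind:<6}  {e.text}")
```

Sorting the concatenated streams keeps the sketch simple and does not require each stream to be pre-sorted.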

## Intent-aware processing

The tool classifies your question and adjusts the pipeline:

| Question type | Example | What happens |
|--------------|---------|-------------|
| Description | "What happens in this video?" | Balanced extraction |
| Moment retrieval | "When does the person stand up?" | Prioritizes transcript + timeline |
| Temporal ordering | "Does X happen before Y?" | Prioritizes timeline events |
| Counting | "How many cars appear?" | Denser frame sampling |
| OCR / text | "What text is on the slide?" | Prioritizes OCR extraction |
| Speech | "What did they say about revenue?" | Prioritizes transcript |
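
One simple way to get this kind of routing is a list of keyword rules. The sketch below is an assumption about how such a classifier could look, not the logic in `intent.py`; the rule patterns and label names are invented for illustration.

```python
import re

# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("counting", r"\bhow many\b|\bcount\b"),
    ("temporal_ordering", r"\bbefore\b|\bafter\b"),
    ("ocr", r"\btext\b|\bslide\b|\bwritten\b"),
    ("speech", r"\bsay\b|\bsaid\b|\bmention"),
    ("moment_retrieval", r"\bwhen\b|\bat what time\b"),
]


def classify_intent(question: str) -> str:
    """Return a coarse intent label; fall back to "description"."""
    q = question.lower()
    for label, pattern in RULES:
        if re.search(pattern, q):
            return label
    return "description"


if __name__ == "__main__":
    for q in ["What happens in this video?",
              "When does the person stand up?",
              "How many cars appear?"]:
        print(f"{classify_intent(q):<18} {q}")
```

Rule order matters: a question such as "When is the text shown?" matches the `ocr` rule before `moment_retrieval`, so a real classifier would likely need something more robust than first-match keywords.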

## Optional extras

```bash
# OCR support (also needs the Tesseract binary installed)
pip install "vidclaude[ocr]"

# Standalone API mode (use outside Claude Code)
pip install "vidclaude[api]"
export ANTHROPIC_API_KEY=sk-...
vidclaude video.mp4 --api -q "What happens in this video?"
```

## Examples

**Analyze a meeting recording:**
```bash
vidclaude meeting.mp4 --mode standard --verbose
# → 55 frames, full transcript, timeline
# → In Claude Code: "summarize the key decisions from this meeting"
```

**Review security footage:**
```bash
vidclaude ./cameras/ --mode deep --verbose
# → Batch processes all videos in the folder
# → Dense frame extraction catches fast events
```

**Extract text from a lecture:**
```bash
pip install "vidclaude[ocr]"
vidclaude lecture.mp4 --mode standard --verbose
# → Captures slide text via OCR + speaker transcript
```

**Quick check on a short clip:**
```bash
vidclaude clip.mp4 --mode quick
# → 20 frames, fast whisper, done in seconds
```

## Troubleshooting

| Problem | Fix |
|---------|-----|
| `ffmpeg not found` | Install: `winget install ffmpeg` (Windows) / `brew install ffmpeg` (macOS) / `sudo apt install ffmpeg` (Linux) |
| `No module named 'faster_whisper'` | `pip install faster-whisper` |
| Slow first run | Normal — downloading Whisper large-v3 model (~3GB, one time) |
| Wrong language detected | Whisper auto-detects the language and is usually correct. When it misdetects, the transcript may still approximate the speech phonetically |
| Large `.vidcache` folder | Delete it to free space: `rm -rf .vidcache/` |
| Want fresh extraction | Use `--no-cache` flag |
| OCR not working | Install the `ocr` extra (`pip install "vidclaude[ocr]"`) and the Tesseract binary, and make sure `tesseract` is on your `PATH` |

## Project structure

```
vidclaude/
  cli.py          Argument parsing, orchestration, caching
  models.py       Data model (VideoMeta, Frame, Shot, TranscriptChunk, etc.)
  ingest.py       Video validation + ffprobe metadata
  segment.py      Shot detection + adaptive frame sampling
  audio.py        faster-whisper transcription (large-v3)
  ocr.py          pytesseract text extraction
  intent.py       Question intent classification
  timeline.py     Temporal event merging
  memory.py       Hierarchical summaries
  reason.py       Evidence assembly + optional API mode
  util.py         ffmpeg helpers, image encoding, caching
  SKILL.md        Claude Code skill definition (bundled)
```

## License

MIT
