Metadata-Version: 2.4
Name: vt-calc
Version: 0.0.3
Summary: Calculate the number of tokens used for images in VLMs
Home-page: https://github.com/thisisiron/vision-token-calculator
Author: Vision Token Calculator
Keywords: vision,tokens,language model,multimodal,ai,vlm,vision language model,vision language model token calculator
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers<5.0.0,>=4.30.0
Requires-Dist: Pillow>=9.0.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: requests>=2.25.0
Requires-Dist: av>=10.0.0
Requires-Dist: fastapi>=0.104.0
Requires-Dist: uvicorn[standard]>=0.24.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: python-multipart>=0.0.6
Requires-Dist: rich
Provides-Extra: video-advanced
Requires-Dist: decord>=0.6.0; extra == "video-advanced"
Requires-Dist: torchcodec>=0.1.0; extra == "video-advanced"
Provides-Extra: test
Requires-Dist: pytest>=7.0.0; extra == "test"
Requires-Dist: qwen-vl-utils>=0.0.8; extra == "test"
Requires-Dist: opencv-python>=4.5.0; extra == "test"
Provides-Extra: quality
Requires-Dist: ruff; extra == "quality"
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Vision Token Calculator

A Python tool for calculating the number of tokens generated when processing images with Vision Language Models (VLMs).

## Features

- Calculate image and video tokens for VLMs
- Supports existing image files and dummy images of a given size
- Supports remote images via URL (http/https)
- Supports batch processing of a directory of images
- Simple command line interface (CLI)

## Installation

### Option 1: PyPI (recommended)

```bash
pip install vt-calc
```

### Option 2: From source (editable for development)

```bash
pip install -e .
```

## Usage

After installing with either option above, you can use the `vt-calc` command directly:

```bash
# Single image
vt-calc --image path/to/your/image.jpg

# Image from URL
vt-calc --image https://example.com/image.jpg

# Directory (batch processing)
vt-calc --image path/to/your/images_dir

# Dummy image with specific dimensions (Width x Height)
vt-calc --size 1920 1080

# Choose a short model name (default: qwen2.5-vl)
vt-calc --image path/to/your/image.jpg -m qwen2.5-vl

# Calculate tokens for a video file
vt-calc --video path/to/video.mp4 -m qwen2.5-vl

# Specify frame sampling rate (FPS)
vt-calc --video video.mp4 --fps 2.0

# Limit maximum number of frames
vt-calc --video video.mp4 --max-frames 100

# Show help
vt-calc --help
```

### CLI options

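- `-i, --image`: Path to an image file, a directory of images, or an image URL
- `-s, --size WIDTH HEIGHT`: Create a dummy image of the given size
- `-m, --model-name`: Short model name to use (default: `qwen2.5-vl`); see Supported Models for the accepted values
- `--video`: Path to a video file
- `--fps`: Frame sampling rate for video input, in frames per second
- `--max-frames`: Maximum number of frames to sample from a video
- `--help`: Show the help message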

Supported input formats for directory processing: `.jpg`, `.jpeg`, `.png`, `.webp` (case-insensitive).
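
To preview which files in a directory match the extensions listed above, a minimal sketch follows. The helper name `list_supported_images` is hypothetical and not part of `vt-calc`, and the non-recursive scan is an assumption; the tool's own file discovery may differ.

```python
from pathlib import Path

# Extensions listed in this README for directory processing.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def list_supported_images(directory):
    """Return the files in `directory` that have a supported extension.

    Hypothetical helper, not part of vt-calc. Matching is case-insensitive
    and non-recursive; the non-recursive part is an assumption.
    """
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


if __name__ == "__main__":
    # Print the matching files in the current working directory.
    for image_path in list_supported_images("."):
        print(image_path.name)
```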

### Example output (single image)

```text
Using dummy image: 1024 x 768
                        ╔══════════════════════════════╗
                        ║ VISION TOKEN ANALYSIS REPORT ║
                        ╚══════════════════════════════╝
╭───────────────────────────────── MODEL INFO ─────────────────────────────────╮
│                                                                              │
│   Model Name                qwen2.5-vl                                       │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── IMAGE INFO ─────────────────────────────────╮
│                                                                              │
│   Image Source              Dummy image                                      │
│   Original Size (H x W)     1024 x 768                                       │
│   Resized Size (H x W)      1036 x 756                                       │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── PATCH INFO ─────────────────────────────────╮
│                                                                              │
│   Patch Size (ViT)          14                                               │
│   Grid Size (H x W)         74 x 54                                          │
│   Number of Patches         3996                                             │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────────── TOKEN INFO ─────────────────────────────────╮
│                                                                              │
│   Image Token               999                                              │
│   (<|image_pad|>)                                                            │
│   Image Start Token         1                                                │
│   (<|vision_start|>)                                                         │
│   Image End Token           1                                                │
│   (<|vision_end|>)                                                           │
│                                                                              │
╰──────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────── TOKEN FORMAT ────────────────────────────────╮
│               <|vision_start|><|image_pad|>*999<|vision_end|>                │
╰──────────────────────────────────────────────────────────────────────────────╯
```
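
The figures in this report can be reproduced by hand. The sketch below assumes that each side is rounded to the nearest multiple of 28 (patch size 14 times a merge size of 2) and that every 2 x 2 block of patches becomes one `<|image_pad|>` token. Both assumptions are inferred from the numbers shown here for `qwen2.5-vl`; the function name is hypothetical, and any minimum or maximum pixel limits the real processor applies are ignored.

```python
PATCH_SIZE = 14  # "Patch Size (ViT)" in the report
MERGE_SIZE = 2   # assumption: 2 x 2 patches are merged into one token


def estimate_image_tokens(height, width):
    """Estimate the report values for an image of the given size.

    Simplified sketch, not vt-calc's implementation: pixel-count limits
    are ignored, so very small or very large images may differ.
    """
    factor = PATCH_SIZE * MERGE_SIZE  # 28
    # Round each side to the nearest multiple of the factor.
    resized_h = round(height / factor) * factor
    resized_w = round(width / factor) * factor
    # Number of ViT patches along each side.
    grid_h = resized_h // PATCH_SIZE
    grid_w = resized_w // PATCH_SIZE
    patches = grid_h * grid_w
    return {
        "resized": (resized_h, resized_w),
        "grid": (grid_h, grid_w),
        "patches": patches,
        "image_tokens": patches // (MERGE_SIZE**2),
    }


result = estimate_image_tokens(1024, 768)
print(result)
# {'resized': (1036, 756), 'grid': (74, 54), 'patches': 3996, 'image_tokens': 999}
```

Adding the start and end tokens gives 999 + 2 = 1001 tokens in total for this image.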

### Example output (directory of images)

```text
Processing directory: test_images/
Found 8 images to process...

[1/8] Processing: test_1_640x480.jpg ✓ (393 tokens)
[2/8] Processing: test_2_800x600.jpg ✓ (611 tokens)
[3/8] Processing: test_3_1024x768.jpg ✓ (1001 tokens)
[4/8] Processing: test_4_1280x720.jpg ✓ (1198 tokens)
[5/8] Processing: test_5_1920x1080.jpg ✓ (2693 tokens)
[6/8] Processing: test_6_512x512.jpg ✓ (326 tokens)
[7/8] Processing: test_7_256x256.jpg ✓ (83 tokens)
[8/8] Processing: test_8_2048x1536.jpg ✓ (4017 tokens)

       BATCH ANALYSIS REPORT
╭────────────────────────┬────────────╮
│ Model                  │ qwen2.5-vl │
│ Total Images Processed │ 8          │
│ Average Vision Tokens  │ 1290.2     │
│ Minimum Vision Tokens  │ 83         │
│ Maximum Vision Tokens  │ 4017       │
│ Standard Deviation     │ 1370.5     │
╰────────────────────────┴────────────╯
```
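
The summary rows can be checked against the per-image counts with Python's standard library. This is a sketch for verification, not the tool's own code; the `summarize` helper is hypothetical. The reported standard deviation matches the sample standard deviation (`statistics.stdev`), which is an observation from these numbers rather than a documented guarantee.

```python
import statistics

# Per-image token counts copied from the example output above.
token_counts = [393, 611, 1001, 1198, 2693, 326, 83, 4017]


def summarize(counts):
    """Compute the statistics shown in the batch report (hypothetical helper)."""
    return {
        "count": len(counts),
        "mean": statistics.mean(counts),
        "min": min(counts),
        "max": max(counts),
        # Sample standard deviation (divides by n - 1).
        "stdev": statistics.stdev(counts),
    }


summary = summarize(token_counts)
print(f"Total Images Processed: {summary['count']}")
print(f"Average Vision Tokens:  {summary['mean']:.1f}")
print(f"Minimum Vision Tokens:  {summary['min']}")
print(f"Maximum Vision Tokens:  {summary['max']}")
print(f"Standard Deviation:     {summary['stdev']:.1f}")
```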

## Supported Models

| Model | Option |
|-------|--------|
| Qwen2-VL | qwen2-vl |
| Qwen2.5-VL | qwen2.5-vl |
| Qwen3-VL | qwen3-vl |
| InternVL3 | internvl3 |
| LLaVA | llava |

## License

This project is licensed under the MIT License; see the `LICENSE` file for details.
