Metadata-Version: 2.4
Name: vector-db-sizer
Version: 0.1.0
Summary: Analytical vector database storage estimator
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: pydantic>=2.7.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: typer>=0.12.0
Description-Content-Type: text/markdown

# vector-db-sizer

## What it is

`vector-db-sizer` is an analytical CLI estimator for vector database disk and RAM sizing.

## When to use it

Use it for fast pre-implementation sizing work, such as:
- early architecture decisions;
- comparing vector dimensions;
- comparing engines;
- comparing index types;
- estimating metadata/payload impact;
- generating Markdown/CSV/JSON artifacts for architecture discussions.

## What it does not do

- No live database connections.
- No ingestion or load execution.
- No latency/recall benchmarking.
- No pricing calculations.
- No guarantee that estimates match production resource usage.

## Quick start

```bash
uv sync
uv run vector-db-sizer validate examples/qdrant_text_hnsw.yaml
uv run vector-db-sizer estimate examples/qdrant_text_hnsw.yaml --format markdown --out report.md
uv run vector-db-sizer estimate examples/multi_scenario.yaml --format csv --out comparison.csv
```

## Input YAML

```yaml
name: qdrant_text_hnsw

dataset:
  source_type: text
  total_tokens: 50000000
  chunk_tokens: 512
  chunk_overlap: 64

embedding:
  kind: dense
  dimensions: 1536
  dtype: float32

database:
  engine: qdrant
  index_type: hnsw
```
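
As a rough illustration of what the inputs above imply, the following sketch derives a chunk count and raw vector size by hand. It assumes a single continuous token stream chunked with a sliding window whose stride is `chunk_tokens - chunk_overlap`; the function names and the `DTYPE_BYTES` table are hypothetical, and the formulas `vector-db-sizer` actually uses may differ.

```python
# Hypothetical back-of-the-envelope arithmetic, not the tool's own code.
DTYPE_BYTES = {"float32": 4, "float16": 2, "int8": 1}  # assumed dtype widths


def chunk_count(total_tokens: int, chunk_tokens: int, chunk_overlap: int) -> int:
    """Count sliding-window chunks over one continuous token stream."""
    if chunk_overlap >= chunk_tokens:
        raise ValueError("chunk_overlap must be smaller than chunk_tokens")
    if total_tokens <= chunk_tokens:
        return 1 if total_tokens > 0 else 0
    stride = chunk_tokens - chunk_overlap
    # Integer ceiling division: each chunk after the first adds `stride` new tokens.
    return -(-(total_tokens - chunk_overlap) // stride)


def raw_vector_bytes(chunks: int, dimensions: int, dtype: str) -> int:
    """Uncompressed vector bytes: one vector per chunk."""
    return chunks * dimensions * DTYPE_BYTES[dtype]


chunks = chunk_count(total_tokens=50_000_000, chunk_tokens=512, chunk_overlap=64)
raw_bytes = raw_vector_bytes(chunks, dimensions=1536, dtype="float32")
print(f"{chunks} chunks, {raw_bytes / 2**20:.1f} MiB of raw vectors")
```

Under these assumptions the example scenario yields 111,607 chunks and roughly 654 MiB of raw `float32` vectors, before any index, payload, or engine overhead.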

## Single-scenario example

```bash
uv run vector-db-sizer estimate examples/qdrant_text_hnsw.yaml --format markdown
```

## Multi-scenario example

```bash
uv run vector-db-sizer estimate examples/multi_scenario.yaml --format csv
uv run vector-db-sizer estimate examples/multi_scenario.yaml --format json
```

## Output formats

- `json` (machine-readable)
- `markdown` (human report)
- `csv` (comparison table)

## Supported engines

- generic
- pgvector
- qdrant
- milvus
- elasticsearch
- opensearch
- weaviate
- pinecone

## How to interpret the report

- **Raw vectors**: uncompressed/base vector bytes.
- **Quantized vectors**: bytes for the additional quantized representation, when one is modeled.
- **Record payload**: IDs + metadata/text/provenance payload bytes.
- **Index disk**: index structure bytes on disk.
- **Engine overhead**: engine/profile-level overhead approximation.
- **Final disk estimate**: replicated storage plus WAL/snapshot/safety factors.
- **Final RAM estimate**: vectors + payload + index + overhead RAM approximation.
- **Warnings**: profile caveats and scenario assumptions to review.
- **Confidence**: per-component confidence levels for planning.
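
One way to read how these fields relate is sketched below. This is an assumption for illustration only: the class, function, and parameter names are hypothetical, and the way replication, WAL, snapshot, and safety factors are combined here (additive WAL and snapshot fractions, multiplicative safety factor) may not match the tool's actual model.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskComponents:
    """Per-replica disk components in bytes (hypothetical field names)."""
    raw_vectors: int
    quantized_vectors: int
    record_payload: int
    index_disk: int
    engine_overhead: int

    def per_replica(self) -> int:
        return (
            self.raw_vectors
            + self.quantized_vectors
            + self.record_payload
            + self.index_disk
            + self.engine_overhead
        )


def final_disk_estimate(
    components: DiskComponents,
    replicas: int = 1,
    wal_fraction: float = 0.0,       # assumed: extra space as a fraction of replicated data
    snapshot_fraction: float = 0.0,  # assumed: extra space as a fraction of replicated data
    safety_factor: float = 1.0,      # assumed: multiplicative headroom
) -> int:
    """Replicated storage plus WAL, snapshot, and safety allowances."""
    replicated = components.per_replica() * replicas
    return math.ceil(replicated * (1 + wal_fraction + snapshot_fraction) * safety_factor)


example = DiskComponents(
    raw_vectors=600, quantized_vectors=0, record_payload=200,
    index_disk=150, engine_overhead=50,
)
total = final_disk_estimate(
    example, replicas=2, wal_fraction=0.25, snapshot_fraction=0.25, safety_factor=1.5
)
print(total)
```

The toy byte counts are chosen only to make the arithmetic easy to follow; real reports carry the values computed for the scenario.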

## Confidence levels

- `high`: formulaic or type-level estimate.
- `medium`: useful engineering approximation.
- `low`: heuristic and engine-dependent; validate with pilot load.
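
When reviewing a report, it can help to list the components that most need a pilot load. The helper below is a minimal sketch: it assumes the per-component confidence values have already been loaded into a plain dict, and the component names used in the example are hypothetical rather than the tool's actual output keys.

```python
VALID_LEVELS = {"high", "medium", "low"}


def needs_pilot_validation(confidence: dict[str, str]) -> list[str]:
    """Return the components rated `low`, sorted by name."""
    unknown = {level for level in confidence.values() if level not in VALID_LEVELS}
    if unknown:
        raise ValueError(f"unknown confidence levels: {sorted(unknown)}")
    return sorted(name for name, level in confidence.items() if level == "low")


# Hypothetical component names, for illustration only.
report_confidence = {
    "raw_vectors": "high",
    "record_payload": "medium",
    "index_disk": "low",
    "engine_overhead": "low",
}
print(needs_pilot_validation(report_confidence))
```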

## Production sizing warning

The estimates are analytical and should be calibrated with a representative pilot load before production capacity planning.

## Development

```bash
uv sync
uv run pytest
uv run ruff check .
```

## Current limitations

- Engine profiles are approximate.
- No vendor pricing model.
- No measurements from live database systems.
- No latency/recall estimation.
- No automatic database selection.
