Metadata-Version: 2.4
Name: mllm-annotator
Version: 0.1.1
Summary: Resumable multimodal-LLM annotator and embedder for folders of audio or image files.
Author-email: Matteo Boi <matteo.boi@unibe.ch>
License-Expression: MIT
Project-URL: Homepage, https://github.com/BoiMat/mllm-annotator
Project-URL: Repository, https://github.com/BoiMat/mllm-annotator
Keywords: gemini,annotation,labeling,multimodal,audio,image,embeddings,llm
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: google-genai>=1.0.0
Requires-Dist: keyring>=24
Provides-Extra: ui
Requires-Dist: customtkinter>=5.2.0; extra == "ui"
Provides-Extra: viz
Requires-Dist: umap-learn>=0.5; extra == "viz"
Requires-Dist: matplotlib>=3.8; extra == "viz"
Requires-Dist: numpy>=1.24; extra == "viz"
Provides-Extra: all
Requires-Dist: mllm-annotator[ui,viz]; extra == "all"
Dynamic: license-file

# mllm-annotator

A small, resumable tool for sending folders of audio or image files to a
multimodal LLM for **automatic annotation**, plus an **embedding + UMAP
visualization** workflow. Gemini is the current backend; the design keeps the
provider behind a thin seam so others can be added later.

It ships both a command-line tool and a desktop GUI.

## Install

```powershell
# CLI only
pip install mllm-annotator

# with the desktop GUI and the embed/visualize feature
pip install "mllm-annotator[ui,viz]"
```

Or, for development from a clone:

```powershell
uv sync --extra ui --extra viz
```

The embed/visualize feature also uses **ffmpeg**, when it is available on your
`PATH`, to handle audio formats Gemini can't embed directly (e.g. `.aac`,
`.opus`). It is an optional system dependency, not a pip package; without it
those files are skipped.
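
To see ahead of time which files would be skipped, a check along these lines
can help. This is only a sketch: the helper names are hypothetical, and the
extension set contains just the two examples mentioned above, not the tool's
actual list.

```python
import shutil
from pathlib import PurePosixPath

# Assumption: only the two example extensions from this README; the real
# list used by mllm-annotator may be longer.
NEEDS_FFMPEG = {".aac", ".opus"}


def ffmpeg_available():
    """Return True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


def split_by_ffmpeg_need(filenames, has_ffmpeg):
    """Split file names into (usable, skipped) given ffmpeg availability."""
    usable, skipped = [], []
    for name in filenames:
        needs = PurePosixPath(name).suffix.lower() in NEEDS_FFMPEG
        (skipped if needs and not has_ffmpeg else usable).append(name)
    return usable, skipped


if __name__ == "__main__":
    files = ["a.wav", "b.opus", "c.aac"]
    print(split_by_ffmpeg_need(files, ffmpeg_available()))
```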

## API key

Provide a Gemini API key in any one of these ways (checked in this order):

1. environment variable `GEMINI_API_KEY` (or `GOOGLE_API_KEY`):

   ```powershell
   $env:GEMINI_API_KEY="your_api_key"
   ```

2. a `.env` file in the current working directory:

   ```text
   GEMINI_API_KEY=your_api_key
   ```

3. saved from inside the GUI: click **Set API Key** in the top-left corner (the
   dialog also opens automatically on first launch if no key is found) and
   paste the key. It is stored securely in your OS keyring (Windows Credential
   Manager / macOS Keychain / Linux Secret Service). No plaintext file is written.

`.env` is ignored by git, and keys are never written into the built package.
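
The lookup order above can be sketched roughly as follows. This is an
illustration, not the package's actual code: `resolve_api_key` and
`read_dotenv` are hypothetical helpers, the keyring service and user names
are placeholders, and honoring `GOOGLE_API_KEY` inside `.env` is an
assumption.

```python
import os
from pathlib import Path

KEY_NAMES = ("GEMINI_API_KEY", "GOOGLE_API_KEY")


def read_dotenv(path):
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values = {}
    p = Path(path)
    if not p.is_file():
        return values
    for line in p.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(env=None, dotenv_path=".env", use_keyring=True):
    """Return the first key found: environment, then .env, then keyring."""
    env = os.environ if env is None else env
    for name in KEY_NAMES:  # 1. environment variables
        if env.get(name):
            return env[name]
    dotenv = read_dotenv(dotenv_path)
    for name in KEY_NAMES:  # 2. .env in the working directory
        if dotenv.get(name):
            return dotenv[name]
    if use_keyring:  # 3. OS keyring (placeholder service/user names)
        try:
            import keyring
            return keyring.get_password("mllm-annotator", "gemini_api_key")
        except Exception:
            return None
    return None
```

Passing `env` and `dotenv_path` explicitly keeps the sketch easy to try
without touching your real environment.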

## Command line

```powershell
mllm-annotator --help
```

### Examples

Horse cough annotation:

```powershell
mllm-annotator `
  --input-folder "C:\data\horse_audio" `
  --media-type audio `
  --instruction "Annotate if the audio contains a horse cough or another sound such as the horse smacking the microphone." `
  --daily-limit 500
```

Swiss German transcription validation:

```powershell
mllm-annotator `
  --input-folder "C:\data\swiss_german_audio" `
  --media-type audio `
  --labels-csv "C:\data\transcriptions.csv" `
  --instruction "Confirm whether the attached Swiss German audio matches the associated transcription. If it is wrong, rewrite the correct transcription." `
  --daily-limit 500
```

Image captioning:

```powershell
mllm-annotator `
  --input-folder "C:\data\images" `
  --media-type image `
  --instruction "Caption the attached image." `
  --daily-limit 500
```

## Desktop GUI

```powershell
mllm-annotator-ui
```

The GUI lets you browse for the data folder, choose audio or image mode,
optionally select a `filename,label` CSV, write the instruction, preview the
file table, and start or resume processing. It shows the rewritten prompt and
updates each row as Gemini responses arrive, using the same JSONL result/state
files as the CLI. A second tab embeds the media and shows an interactive 2-D
UMAP projection (zoom/pan toolbar; hover over a point to see its file name).

## CSV format

The optional labels CSV must contain exactly one row per media file and these
columns:

```csv
filename,label
audio_001.wav,expected transcription or label
audio_002.wav,another label
```

For `--recursive`, `filename` must be the relative path with forward slashes,
for example `speaker_a/audio_001.wav`.
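
If you want to check a CSV against a folder yourself, the rules above can be
approximated as below. This is a sketch, not the tool's own validation:
`check_labels_csv` is a hypothetical helper, and the set of extensions that
count as media files is left to the caller.

```python
import csv
from pathlib import Path


def check_labels_csv(csv_path, input_folder, extensions, recursive=False):
    """Return (missing, extra, duplicates) comparing CSV rows to media files.

    missing:    media files with no CSV row
    extra:      CSV rows with no matching media file
    duplicates: filenames listed more than once
    """
    root = Path(input_folder)
    pattern = "**/*" if recursive else "*"
    # Relative paths with forward slashes, matching the CSV convention.
    on_disk = {
        p.relative_to(root).as_posix()
        for p in root.glob(pattern)
        if p.is_file() and p.suffix.lower() in extensions
    }
    seen, duplicates = set(), set()
    with open(csv_path, newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh)
        if not {"filename", "label"} <= set(reader.fieldnames or []):
            raise ValueError("CSV needs 'filename' and 'label' columns")
        for row in reader:
            name = row["filename"]
            if name in seen:
                duplicates.add(name)
            seen.add(name)
    return sorted(on_disk - seen), sorted(seen - on_disk), sorted(duplicates)
```

All three lists being empty means the CSV has exactly one row per media file.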

## Resume behavior

By default, results are appended to `runs/results.jsonl` and progress is saved
in `runs/state.json`. If the daily limit is reached or the API returns a quota
or rate-limit error, run the same command again later or the next day. Files
that were already processed are skipped.
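
To inspect how far a run has progressed, the results file can be read line by
line. The sketch below assumes each JSONL record is an object with a
`filename` field; that field name is a guess, so check one line of your own
`runs/results.jsonl` and adjust `key` accordingly. `load_processed` and
`pending` are hypothetical helpers, not part of the package.

```python
import json
from pathlib import Path


def load_processed(results_path="runs/results.jsonl", key="filename"):
    """Collect the values of `key` from every valid JSONL record."""
    done = set()
    path = Path(results_path)
    if not path.is_file():
        return done
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a partially written last line
            if isinstance(record, dict) and key in record:
                done.add(record[key])
    return done


def pending(all_files, done):
    """Return the files that still need processing, in their original order."""
    return [name for name in all_files if name not in done]
```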

The first run rewrites your natural-language instruction with `gemini-3.5-flash`
and stores it in the state file. The media files are processed with
`gemini-3.1-flash-lite`. Use `--no-rewrite` to skip the rewrite call.

Before spending API calls, you can validate the folder and optional CSV:

```powershell
mllm-annotator `
  --input-folder "C:\data\images" `
  --media-type image `
  --instruction "Caption the attached image." `
  --dry-run
```
