Metadata-Version: 2.4
Name: glasskit.ai
Version: 0.1.1
Summary: Recorded-video eval tools for smart-glasses apps
Keywords: ai,cli,evals,smart-glasses,testing,video
Author: RealComputer
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Video
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Utilities
Requires-Dist: av
Requires-Dist: pillow
Requires-Dist: pydantic>=2
Requires-Dist: pyyaml
Requires-Dist: rich
Requires-Dist: typer
Requires-Python: >=3.12
Project-URL: Homepage, https://glasskit.ai
Project-URL: Documentation, https://github.com/RealComputer/GlassKit/blob/main/cli/README.md
Project-URL: Repository, https://github.com/RealComputer/GlassKit
Project-URL: Issues, https://github.com/RealComputer/GlassKit/issues
Description-Content-Type: text/markdown

# GlassKit CLI

This is the GlassKit command-line package. Its first command family is `glasskit eval`, a recorded-video evaluator for smart-glasses apps.

Smart-glasses apps are hard to test by hand because the input is physical, visual, and timing-sensitive. Given a recording of a workflow, `glasskit eval` lets you label the important moments and rerun the same checks whenever your prompts, model, parser, or app logic changes. Your adapter owns the app-specific call; the CLI handles video decoding, timestamp sampling, comparisons, reports, failure artifacts, and quality gates.

The current implementation assumes you use a `uv`-managed Python pipeline. If your use case does not fit the current model, please open an issue. We want to expand support based on real app needs.

## Why Use `glasskit eval`?

- Turn real camera recordings into repeatable tests instead of relying on memory, screenshots, or manual replay.
- Test the vision path that users actually depend on: frames, prompts, model calls, response parsing, app logic, and thresholds.
- Label only stable moments in the video and skip ambiguous transitions that would make a test noisy.
- Keep app-specific behavior in your adapter while reusing the CLI for video handling, comparison modes, JSON reports, and failure images.
- Run the same suite locally and in CI with exit codes that distinguish setup errors from quality-gate failures.

## What You Need

- A short recording of the workflow you want to evaluate.
- An `expected.yaml` file that says which timestamps or ranges should be checked and what each result should be.
- A Python adapter function or object that receives decoded frames and returns JSON-like observations.
- A `uv` command environment that can install `glasskit.ai` and your app's runtime dependencies.

## Install

Use the CLI from the app repository that contains the eval suite, adapter, and app dependencies. Running from that directory keeps imports and relative paths predictable. If your app uses a `.env` file, pass it through the command runner, for example with `uv run --env-file .env`; the CLI does not load `.env` files by itself. With `uv`, install the published `glasskit.ai` package into the command environment and invoke the `glasskit` console script in one command:

```bash
cd path/to/your-app
uv run --with glasskit.ai glasskit eval --help
```

If you already installed the command another way (`uv add --dev glasskit.ai`), you can drop the `uv run --with glasskit.ai` prefix and run `glasskit eval ...` directly.

Run help when you need the exact options for the installed version:

```bash
uv run --with glasskit.ai glasskit --help
uv run --with glasskit.ai glasskit eval --help
uv run --with glasskit.ai glasskit eval run --help
```

## Quick Start

Run these commands from your app repository so local imports, adapter files, and relative asset paths resolve the same way they do in your app. Add `--env-file .env` to `uv run` if your adapter expects environment variables from that file.

### 1. Create a Case from a Recording

```bash
uv run --with glasskit.ai glasskit eval init-case \
  --suite eval-suite \
  --case fold-step-001 \
  --video path/to/recording.mp4 \
  --target step_1 \
  --label "Step 1"
```

This creates `eval-suite/fold-step-001/`, copies the recording into the case directory, and writes a starter `expected.yaml`.

### 2. Label the Moments You Care About

Edit `eval-suite/fold-step-001/expected.yaml` so the timestamps and expected values match the video. Start with one or two unambiguous samples:

```yaml
version: 1
video: "video.mp4"
sampling:
  every_s: 0.5
targets:
  step_1:
    label: Step 1
    samples:
      - at: 2.0
        field: matches
        expect: true
      - range: [3.0, 5.0]
        field: matches
        expect: true
```

Use `at` for a single moment and `range` for a stable window. Avoid transition frames until you specifically want to measure transition behavior.

### 3. Check the Wiring with a Fake Adapter

Create `eval_adapter.py` in your app repository:

```python
def evaluate_sample(sample, target):
    return {
        "target": target.id,
        "timestamp_s": sample.timestamp_s,
        "matches": target.id == "step_1",
    }
```

This adapter does not judge the image. It only proves that the suite, video decoding, field extraction, comparison, and command wiring work before you connect a model backend.

### 4. Validate and Inspect the Schedule

```bash
uv run --with glasskit.ai glasskit eval validate --suite eval-suite --adapter eval_adapter.py:evaluate_sample
uv run --with glasskit.ai glasskit eval list-samples --suite eval-suite
```

### 5. Run the Eval

Run the fake adapter first:

```bash
uv run --with glasskit.ai glasskit eval run \
  --adapter eval_adapter.py:evaluate_sample \
  --suite eval-suite \
  --min-pass-rate 1.0
```

After that passes, replace the fake adapter body with the real call into your app or model backend and run with the options you want for local debugging or CI:

```bash
uv run --with glasskit.ai glasskit eval run \
  --adapter eval_adapter.py:create_evaluator \
  --suite eval-suite \
  --min-pass-rate 0.9 \
  --output-json tmp/eval-results.json \
  --save-failures \
  --artifacts-dir tmp/eval-artifacts
```

The command exits `0` when all quality gates pass, `1` when the eval ran but one or more gates failed, and `2` for setup or runtime errors such as invalid YAML, unreadable videos, or adapter failures that are not being collected with `--keep-going`.

## Recommended App Repo Layout

Keep the eval suite next to the adapter and app code that it exercises. This makes imports and relative asset paths predictable, and keeps environment handling close to the code that needs it.

```text
your-app/
  eval_adapter.py
  adapter-config.yaml
  eval-suite/
    fold-step-001/
      video.mp4
      expected.yaml
    fold-step-002/
      video.mp4
      expected.yaml
```

Commit the suite files that your team should share. Keep secrets in environment variables or uncommitted environment files, not in `expected.yaml` or `adapter-config.yaml`.

## Eval Suite Layout

An eval suite is a directory containing one or more case directories. Each case has an `expected.yaml` file and either one video file in the case directory or a `video:` path in `expected.yaml`.

```text
eval-suite/
  fold-step-001/
    video.mp4
    expected.yaml
  fold-step-002/
    camera-recording.mov
    expected.yaml
```

A single-case suite is also supported by placing `expected.yaml` directly in the suite directory.

Supported video suffixes are `.mp4`, `.mov`, `.m4v`, `.webm`, and `.mkv`. Timestamps in `expected.yaml` are seconds from the start of the clip, even when the container stores non-zero presentation timestamps internally.

## Writing `expected.yaml`

Here is a representative case file:

```yaml
version: 1
video: video.mp4
description: Fold step 1 should be detected after the crease is completed.
sampling:
  every_s: 0.5
workflow:
  targets:
    - id: step_1
      label: Step 1
      prompt_id: origami.step_1
targets:
  step_1:
    label: Step 1
    config:
      reference_image: assets/step_1.png
    samples:
      - range: [0.0, 6.8]
        expect: false
      - range: [7.4, 11.8]
        expect: true
  step_2:
    label: Step 2
    samples:
      - at: [4.0, 6.0]
        expect: false
thresholds:
  min_pass_rate: 0.9
  max_failures: 2
  per_target:
    step_1:
      min_pass_rate: 0.95
```

Ranges are interpreted as `[start, end)`. With `sampling.every_s: 0.5`, `range: [7.4, 8.6]` expands to samples at `7.4`, `7.9`, and `8.4` seconds. Only declared `range` and `at` samples are evaluated; unlabeled gaps are skipped. A sample block must contain exactly one of `range` or `at`.

Use ranges for stable windows where the expected answer should be unchanged. Use `at` for isolated moments or when a transition is too short to sample safely. Avoid labeling ambiguous transition frames unless the ambiguity is exactly what you want to measure.

### Case Fields

- `version` must be `1`.
- `video` is an optional path to the case video, resolved relative to the case directory. If it is omitted, the case directory must contain exactly one supported video file.
- `description` is optional and only for humans.
- `sampling.every_s` sets the default sample interval for `range` blocks in the case. The default is `0.5` seconds.
- `workflow.targets` is optional metadata matched to targets by each entry's `id`. Each entry must have `id`; `label` is optional; extra fields are passed to the adapter as `target.config` unless overridden by `targets.<id>.config`.
- `targets.<target_id>.label` is optional display text for reports.
- `targets.<target_id>.config` is optional adapter-specific metadata for that target. This is where you can put prompt ids, reference image paths, class names, or other app-level data that the core CLI should not know about.
- `targets.<target_id>.samples` is the required list of labeled sample blocks.
- `thresholds` is optional case-level gating. It can contain `min_pass_rate`, `max_failures`, and `per_target.<target_id>.min_pass_rate`.

## Expected Values and Comparison

The adapter returns a JSON-like value: `null`, boolean, number, string, array, or object. Each sample's `expect` value is compared with that returned value.

By default, booleans, strings, and `null` use exact comparison. Numbers use numeric comparison with zero tolerance unless you set one. Arrays and objects use exact comparison unless you choose another mode.

Use `field` when the adapter returns a structured object but the sample only cares about one nested value:

```yaml
targets:
  detector:
    samples:
      - at: 2.0
        field: result.matches
        expect: true
```

Field paths are dot-separated. Mapping keys are matched by name, and list indexes can be addressed with non-negative numeric path parts such as `detections.0.label`.

Supported `compare.mode` values are:

```yaml
targets:
  score:
    samples:
      - at: 1.0
        expect: 0.75
        compare:
          mode: numeric
          tolerance: 0.05
  metadata:
    samples:
      - at: 1.0
        expect:
          result:
            matches: true
        compare:
          mode: json_subset
  objects:
    samples:
      - at: 1.0
        expect: ["paper", "crease"]
        compare:
          mode: set_contains_all
```

- `exact` requires the observed value to equal `expect`.
- `numeric` requires both values to be numbers and allows `tolerance`.
- `json_subset` requires every key and value in `expect` to be present in the observed object. For arrays, each expected item must match at least one observed item.
- `set_equals` compares arrays as unordered sets.
- `set_contains_any` passes when at least one expected array item is present in the observed array.
- `set_contains_all` passes when every expected array item is present in the observed array.

## Suite-Level Thresholds

Put thresholds that should apply to the selected run as a whole in `suite.yaml` at the suite root. Suite-level `min_pass_rate` and `max_failures` gates are evaluated against the combined selected results. Suite-level `per_target` entries apply across the selected samples for each target id:

```yaml
thresholds:
  min_pass_rate: 0.9
  max_failures: 5
  per_target:
    step_1:
      min_pass_rate: 0.95
    step_2:
      min_pass_rate: 0.85
```

Every run also includes an `adapter_errors` gate. The run only succeeds if the adapter produced no runtime or comparison errors and every configured quality gate passed.

Failed comparisons are intentionally controlled by quality gates instead of a built-in default failure policy. If you run without `--min-pass-rate`, `--min-target-pass-rate`, `--max-failures`, or YAML thresholds, failed comparisons are reported in the summary and JSON output but do not make the command exit nonzero; adapter, runtime, or comparison errors still fail through the `adapter_errors` gate. Configure a pass-rate or max-failures gate for CI or any run where failed observations should fail the command.

CLI gates are useful for one-off CI jobs or local experiments:

```bash
uv run --with glasskit.ai glasskit eval run \
  --adapter eval_adapter.py:create_evaluator \
  --suite eval-suite \
  --min-pass-rate 0.9 \
  --min-target-pass-rate 0.85 \
  --max-failures 3
```

`--min-pass-rate` and `--max-failures` override suite-level YAML values for the run. Because those flags define a run-level pass/fail policy, case-level YAML gates are not applied when either flag is set. `--min-target-pass-rate` applies the same target pass-rate gate to every target present in the selected results; when it is set, suite-level `per_target` gates are replaced by the uniform CLI gate. With `--case`, suite-level per-target gates for targets outside the selected case are skipped.

## Commands

### `glasskit eval init-case`

`init-case` creates a case directory, copies the source video into it when needed, and writes a starter `expected.yaml`:

```bash
uv run --with glasskit.ai glasskit eval init-case \
  --suite eval-suite \
  --case fold-step-001 \
  --video recordings/fold-step-001.mp4 \
  --target step_1 \
  --label "Step 1"
```

The case name must be a single directory name under the suite. If the source video is already inside the case directory, the generated `video:` path is written relative to the case directory. Use `--force` to overwrite an existing `expected.yaml` or case video.

### `glasskit eval validate`

`validate` checks suite structure, YAML schema, video readability, sample timestamps, and optional adapter importability:

```bash
uv run --with glasskit.ai glasskit eval validate --suite eval-suite
uv run --with glasskit.ai glasskit eval validate --suite eval-suite --adapter eval_adapter.py:create_evaluator
uv run --with glasskit.ai glasskit eval validate --suite eval-suite --case fold-step-001
```

Use validation before long or paid model evals. It catches most local mistakes without decoding sample frames or calling `evaluate`. Passing `--adapter` imports, constructs, and closes the adapter, so adapter setup side effects can still run.

### `glasskit eval list-samples`

`list-samples` prints the expanded sample schedule:

```bash
uv run --with glasskit.ai glasskit eval list-samples --suite eval-suite
uv run --with glasskit.ai glasskit eval list-samples --suite eval-suite --case fold-step-001
```

This is the quickest way to confirm that your ranges, point samples, fields, and explicit comparison modes expand as intended. When a sample omits `compare.mode`, the Mode column is blank because the default mode is inferred when the sample is evaluated.

### `glasskit eval run`

`run` decodes sample frames, calls the adapter, compares results, applies gates, prints a summary, and optionally writes JSON and failure artifacts:

```bash
uv run --with glasskit.ai glasskit eval run \
  --adapter eval_adapter.py:create_evaluator \
  --suite eval-suite \
  --case fold-step-001 \
  --adapter-config adapter-config.yaml \
  --keep-going \
  --verbose \
  --output-json tmp/eval-results.json \
  --save-failures \
  --artifacts-dir tmp/eval-artifacts
```

- `--case` limits the run to one case directory by name.
- `--adapter-config` reads a YAML or JSON object and passes it to the adapter factory.
- `--keep-going` records adapter or comparison errors as errored sample results instead of aborting the run on the first error.
- `--verbose` prints every sample result as it is produced and sets `AdapterConfig.verbose` for the adapter.
- `--output-json` writes a machine-readable report with summary counts, elapsed run duration, gate results, and per-sample observations. The final console summary also shows the elapsed duration.
- `--save-failures` saves failed sample frames and per-result JSON files. If `--artifacts-dir` is omitted, artifacts are written under `.glasskit-artifacts` in the suite directory.
- `--allow-empty` allows suites or cases with no samples. This is mainly useful while drafting a suite, not for real quality gates.

## Writing an Adapter

An adapter is the bridge between the generic CLI and your app. It receives decoded video frames plus target metadata, calls your app or model backend, and returns a JSON-like observation for each sample.

Pass an adapter as `<module-or-file>:<callable>`. The module side can be a Python import path such as `my_app.eval_adapter` or a file path such as `eval_adapter.py`. The callable side can name a function, class, or nested attribute such as `create_evaluator` or `EvalAdapters.fold_checker`.

The recommended shape is a factory that accepts one required `AdapterConfig` argument and returns an evaluator object:

```python
from __future__ import annotations

import os
from typing import Any


def create_evaluator(config: Any) -> "FoldEvaluator":
    settings = dict(config.config)
    return FoldEvaluator(
        api_key=os.environ["MODEL_API_KEY"],
        model=settings.get("model", "default-model"),
        verbose=bool(config.verbose),
    )


class FoldEvaluator:
    def __init__(self, *, api_key: str, model: str, verbose: bool) -> None:
        self._api_key = api_key
        self._model = model
        self._verbose = verbose

    async def evaluate(self, sample: Any, target: Any) -> bool:
        image = sample.image
        target_id = target.id
        prompt_id = target.config.get("prompt_id", target_id)
        return await call_model_backend(
            api_key=self._api_key,
            model=self._model,
            image=image,
            prompt_id=prompt_id,
            timestamp_s=sample.timestamp_s,
        )

    async def close(self) -> None:
        await close_model_client()
```

No-argument factories and evaluator classes are also supported, but they will not receive `AdapterConfig`. If the factory needs `--adapter-config`, `--artifacts-dir`, `--verbose`, or the suite path, give it one required argument.

`evaluate(sample, target)` may be synchronous or asynchronous. If the evaluator also implements `evaluate_many(samples, target)`, the runner calls it once per target and uses the returned list as the observations for that target's samples. `evaluate_many` must return exactly one observation for each input sample in the same order.

`close()` is optional and may be synchronous or asynchronous. Use it to close HTTP clients, model sessions, or temporary resources.

### Adapter Inputs

The factory receives `AdapterConfig` when it declares one required argument. `AdapterConfig` has these fields:

- `suite_path` is the resolved path to the eval suite.
- `config` is the object loaded from `--adapter-config`; it is an empty mapping when the option is omitted.
- `artifacts_dir` is the path from `--artifacts-dir`, or `None` when the option is omitted.
- `verbose` mirrors `--verbose`.

The evaluator receives a `sample` object with these fields:

- `image` is a decoded RGB `PIL.Image.Image` for the requested timestamp.
- `timestamp_s` is the requested sample timestamp in seconds from the start of the clip.
- `frame_index` is the decoded video frame index chosen for that timestamp.
- `sample_index` is the case-local sample index.
- `video_path` is the source video path as a string.
- `case_name` is the case directory name.

The evaluator also receives a `target` object with these fields:

- `id` is the target id from `expected.yaml`.
- `index` is the target's zero-based order in the case file.
- `label` is the optional target label.
- `config` is the merged target metadata from `workflow.targets` and `targets.<id>.config`.

### Simple Function Adapters

For smoke checks or fake local evals, the adapter target can be a function whose first two positional arguments are either `image, target_id` or `sample, target`:

```python
def evaluate_frame(image, target_id):
    return target_id == "step_1"
```

```python
async def evaluate_sample(sample, target):
    return {
        "target": target.id,
        "bright": sample.image.convert("L").getextrema()[1] > 180,
    }
```

Function adapters are useful for testing suite wiring because they do not need a model backend. Production adapters should usually use the object shape so they can reuse clients and close resources cleanly.

### Adapter Config Files

Use `--adapter-config` for values that should not live in `expected.yaml`, such as backend URLs, model names, thresholds owned by the adapter, or local asset paths:

```yaml
api_url: "https://example.test/v1"
model: "vision-checker"
jpeg_quality: 90
```

The CLI only parses the file as YAML or JSON and passes the resulting object to `AdapterConfig.config`; it does not expand environment variables inside the file. Read secrets directly from environment variables in the adapter and use `--adapter-config` for non-secret runtime settings.

### Adapter Return Values

Return the smallest stable value that answers the target. For a binary detector, return `true` or `false`. For a classifier, return a string label. For richer workflows, return an object and use `field` or `json_subset` in `expected.yaml`.

Good observations are deterministic, JSON-like, and easy to inspect in the JSON report:

```python
return {
    "result": {
        "matches": True,
        "confidence": 0.94,
        "label": "folded",
    }
}
```

Avoid returning SDK objects, dataclasses, images, bytes, or other values that cannot be serialized to JSON. They make reports harder to read and may fail when written to `--output-json`.

## Practical Adapter Pattern

A model-backed adapter usually follows this flow:

```python
import os


def create_evaluator(config):
    settings = dict(config.config)
    return MyEvaluator(
        api_key=settings.get("api_key") or os.environ["MODEL_API_KEY"],
        api_url=settings.get("api_url", "https://api.example.test"),
        model=settings.get("model", "vision-model"),
    )


class MyEvaluator:
    def __init__(self, *, api_key, api_url, model):
        self._client = make_async_client(api_key=api_key, base_url=api_url)
        self._model = model

    async def evaluate_many(self, samples, target):
        return [await self.evaluate(sample, target) for sample in samples]

    async def evaluate(self, sample, target):
        prompt = target.config.get("prompt", f"Check {target.id}.")
        image_payload = encode_image(sample.image)
        response = await self._client.check(
            model=self._model,
            prompt=prompt,
            image=image_payload,
        )
        return parse_response(response)

    async def close(self):
        await self._client.aclose()
```

Keep retries, response parsing, prompt construction, and backend-specific error handling in the adapter. Keep generic eval semantics in `expected.yaml` and the CLI.

## Debugging Failed Runs

Start with validation:

```bash
uv run --with glasskit.ai glasskit eval validate --suite eval-suite --adapter eval_adapter.py:create_evaluator
```

Then list samples and run one case:

```bash
uv run --with glasskit.ai glasskit eval list-samples --suite eval-suite --case fold-step-001
uv run --with glasskit.ai glasskit eval run --suite eval-suite --case fold-step-001 --adapter eval_adapter.py:create_evaluator --verbose
```

If the adapter is unstable or expensive, add `--keep-going --save-failures --output-json tmp/eval-results.json`. The saved failure images show exactly what frame the adapter saw, and the JSON report includes the raw observation, extracted field, comparison mode, and reason for each sample.

Common issues:

- `adapter target not found` means the `<module-or-file>:<callable>` path imported successfully but the callable name could not be resolved.
- `adapter import failed` usually means the working directory, `PYTHONPATH`, or app environment does not include the adapter's dependencies.
- `video file does not exist` means the `video:` path is wrong or is being resolved from the case directory differently than expected.
- `sample ... exceeds video duration` means a timestamp is beyond the readable video duration. Check the recording length and the units in `expected.yaml`.
- `missing field` means the adapter returned a value that does not contain the sample's `field` path.
- `invalid_observation: adapter returned null` means the adapter returned `None` for a sample whose expected value was not `null`.

## Technical Details

For contributor-oriented implementation notes, see [AGENTS.md](AGENTS.md).
