Metadata-Version: 2.4
Name: data-annotations
Version: 2.2.0
Summary: Annotate generated data artifacts
Keywords: annotations,data,metadata,provenance,reproducibility
Author: Rodrigo C.  G.  Pena
Author-email: Rodrigo C.  G.  Pena <rodrigo.cerqueiragonzalezpena@unibas.ch>
License-Expression: BSD-3-Clause
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: pydantic>=2.13.1
Requires-Dist: questionary>=2.1.1 ; extra == 'cli'
Requires-Dist: typer>=0.16.0 ; extra == 'cli'
Requires-Python: >=3.12
Project-URL: Source, https://gitlab.com/ceda-unibas/tools/data-annotations
Project-URL: Changelog, https://gitlab.com/ceda-unibas/tools/data-annotations/-/blob/main/CHANGELOG.md
Project-URL: Issues, https://gitlab.com/ceda-unibas/tools/data-annotations/-/issues
Provides-Extra: cli
Description-Content-Type: text/markdown

# data-annotations

A Python package for attaching provenance and structured descriptions to the
files and directories your workflows produce.

It is designed for lightweight research and reproducibility pipelines where you want
generated datasets, tables, plots, or reports to carry enough context to explain
where they came from and what they contain.

The package captures common provenance automatically and writes plain JSON and
Markdown artifacts that are easy to inspect or archive. The canonical on-disk
format uses one JSON annotation document per artifact:

- Files use `artifact.ext.annotation.json`
- Directories carry `data-annotations.json` at their root

Each annotation document stores four top-level sections:

- `annotation_version`
- `subject`
- `provenance`
- `description`

Here's the mental model: files get a visible sibling annotation, and
directories carry one visible annotation at their root. Treat the annotation as
part of the research output bundle.
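
To make the naming convention concrete, here is a minimal sketch that uses only the standard library. The helpers `expected_annotation_path` and `missing_sections` are hypothetical names chosen for illustration and are not part of the package; they only encode the conventions stated above.

```python
from pathlib import Path

# The four top-level sections listed above.
TOP_LEVEL_SECTIONS = ("annotation_version", "subject", "provenance", "description")


def expected_annotation_path(artifact, is_directory):
    """Return where the annotation document is expected to live (hypothetical helper)."""
    artifact = Path(artifact)
    if is_directory:
        # Directories carry one annotation at their root.
        return artifact / "data-annotations.json"
    # Files get a visible sibling named `<file name>.annotation.json`.
    return artifact.with_name(artifact.name + ".annotation.json")


def missing_sections(document):
    """List the top-level sections absent from a loaded annotation document."""
    return [name for name in TOP_LEVEL_SECTIONS if name not in document]


print(expected_annotation_path("outputs/participants.csv", is_directory=False))
print(expected_annotation_path("outputs/run-001", is_directory=True))
```

A check like `missing_sections(json.load(handle))` can serve as a quick sanity test before archiving a bundle, although it says nothing about whether the section contents are valid.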

See the [changelog](CHANGELOG.md) for release history and upgrade-oriented notes.

## Installation

Install the core library from PyPI with `pip`:

```bash
pip install data-annotations
```

Or add it to a project with [uv](https://astral.sh/uv/):

```bash
uv add data-annotations
```

The command-line interface uses optional dependencies. Install the package with
CLI support when you want to run `data-annotations` commands:

```bash
pip install "data-annotations[cli]"
uv add "data-annotations[cli]"
```

For development or unreleased source installs, install directly from GitLab:

```bash
uv add "data-annotations @ git+https://gitlab.com/ceda-unibas/tools/data-annotations.git"
pip install "data-annotations @ git+https://gitlab.com/ceda-unibas/tools/data-annotations.git"
```

Pin a source install to a particular release tag `x.y.z` with:

```bash
uv add "data-annotations @ git+https://gitlab.com/ceda-unibas/tools/data-annotations.git@x.y.z"
```

## What gets captured automatically

Every annotation document includes provenance with:

- A UTC creation timestamp
- Hostname and username
- The script path and command-line arguments
- The script path relative to the Git repo root when it can be determined
- Git commit, branch, dirty state, canonical repository remote, exact tags, and
  `git describe` output when available
- The current `SLURM_JOB_ID` when available

You can also attach your own parameters, input file paths, and function names.
Local filesystem paths in provenance are stored as absolute paths. URI-style inputs
such as `s3://...` or `https://...` are preserved as provided.
Git tags and `git_describe` are human-friendly hints only; `git_sha` remains the
source of truth for reproducibility, matching, and source checkout.
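
For intuition about where such values come from, the sketch below gathers comparable information with the standard library only. It is not the package's implementation, and every dictionary key other than `git_sha` is an illustrative name rather than the documented schema.

```python
import getpass
import os
import socket
import subprocess
import sys
from datetime import datetime, timezone


def current_git_sha():
    """Return the current commit hash, or None outside a usable Git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Git is not installed, or the working directory is not a repository.
        return None
    return result.stdout.strip()


def current_username():
    """Return the login name, or None when it cannot be determined."""
    try:
        return getpass.getuser()
    except Exception:
        return None


def collect_runtime_context():
    """Gather runtime context similar to the list above (illustrative key names)."""
    script = sys.argv[0] if sys.argv else ""
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "hostname": socket.gethostname(),
        "username": current_username(),
        "script_path": os.path.abspath(script) if script else None,
        "arguments": sys.argv[1:],
        "git_sha": current_git_sha(),
        "slurm_job_id": os.environ.get("SLURM_JOB_ID"),
    }


context = collect_runtime_context()
print(sorted(context))
```

Values that cannot be determined are recorded as `None` in this sketch, mirroring the "when available" wording above.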

## Quick Start

The recommended way to annotate your data artifacts is to decorate pipeline
functions that consume some inputs and parameters, then write those artifacts.
This keeps the artifact-writing logic explicit while letting `data-annotations` capture
provenance and emit sidecars automatically.

For example, here is a complete file-level annotation workflow using the
`record_file_annotation(...)` decorator. Once `write_participants` is called, it
automatically generates sidecars `participants.csv.annotation.json` and `participants.csv.README.md`.
The JSON sidecar will contain provenance and description metadata, and the Markdown sidecar
will have a human-friendly rendering of the description provided in the decorator.

```python
from pathlib import Path

from data_annotations.annotations import record_file_annotation
from data_annotations.description import AllowedValue, FieldDefinition

@record_file_annotation(
    title="Participant Cohort",
    summary="Participant-level cohort assignments for the validation split.",
    fields=[
        FieldDefinition(
            name="participant_id",
            data_type="string",
            summary="Stable participant identifier.",
            required=True,
            nullable=False,
        ),
        FieldDefinition(
            name="group",
            data_type="string",
            summary="Assigned study group.",
            allowed_values=[
                AllowedValue(value="control"),
                AllowedValue(value="treatment"),
            ],
        ),
    ],
    primary_key=["participant_id"],
    artifact_kind="dataset",
    acquisition_context={"source": "Study A registry export"},
    generation_context={"pipeline": "baseline-v1"},
)
def write_participants(
    artifact_path: Path,
    input_path: Path,
    split: str,
) -> Path:
    participant_ids = [
        line.strip()
        for line in input_path.read_text(encoding="utf-8").splitlines()[1:]
        if line.strip()
    ]
    artifact_path.parent.mkdir(parents=True, exist_ok=True)
    artifact_path.write_text(
        "\n".join(
            [
                "participant_id,group,split",
                *[
                    f"{participant_id},control,{split}"
                    for participant_id in participant_ids
                ],
            ]
        )
        + "\n",
        encoding="utf-8",
    )
    return artifact_path

# Annotation sidecars are written automatically
# when the decorated function is called:
artifact_path = Path("outputs") / "participants.csv"
write_participants(
    artifact_path=artifact_path,
    input_path=Path("data/raw/participants.csv"),
    split="validation",
)

print(f"{artifact_path}.annotation.json")
print(f"{artifact_path}.README.md")
```

### Decorator Contract

You write a normal Python function, and the decorated function returns the
original return value unchanged.

For provenance-bearing decorators, recorded inputs are inferred from named
function arguments such as `input_path` and `input_paths`. Those arguments
should correspond to real data dependencies used inside the wrapped function.

For file decorators:

- `record_file_manifest(...)`
- `record_file_annotation(...)`
- `record_file_description(...)`

Your function should:

- accept one argument pointing at the output file path. By default this argument
  is named `artifact_path`, but you can change the expected name with
  `artifact_path_arg=...`.
- use any other normal Python arguments you need for the pipeline step.
- for provenance-bearing decorators, use argument names listed in `input_args`
  for real upstream dependencies you want recorded as provenance inputs. By
  default those names are `("input_path", "input_paths")`.

Your function may return any value. File decorators do not inspect that return
value. Returning the generated `artifact_path` is recommended because it is
convenient for callers, but it is not required.

For directory decorators:

- `record_directory_manifest(...)`
- `record_directory_annotation(...)`
- `record_directory_description(...)`

Your function should:

- accept one argument pointing at the output directory. By default this argument
  is named `output_dir`, but you can change the expected name with
  `output_dir_arg=...`.
- return a materialized iterable, preferably a `list` or `tuple` rather than a
  generator, describing the files that were produced in that directory. The
  decorator needs to iterate over these outputs to write sidecars.

Accepted directory return items are:

- `DocumentedArtifact` when you want per-artifact title, summary, fields,
  keys, or missing-value metadata.
- `DocumentedArtifactGroup` for `record_directory_annotation(...)` and
  `record_directory_description(...)` when many files share one title, summary,
  kind, and optional schema metadata.
- `ProducedFile` when you only need path, kind, and optional precomputed hash.
- `ChildBundle` when an annotated child directory should be referenced as its
  own independently shareable bundle.
- `(path, kind)` tuples when path and artifact kind are enough.
- plain path-like values when the artifact kind can default to `"other"`.
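
As an illustration of the two simplest item forms, the sketch below normalizes `(path, kind)` tuples and plain path-like values into a uniform pair. `normalize_simple_item` is a hypothetical helper written for this README, not a function exported by the package, and it deliberately ignores the model-based forms.

```python
import os
from pathlib import Path


def normalize_simple_item(item):
    """Return a (path, kind) pair for a tuple or a plain path-like value."""
    if isinstance(item, tuple):
        path, kind = item
        return os.fspath(path), kind
    # Plain path-like values fall back to the default kind "other".
    return os.fspath(item), "other"


items = [(Path("scores.csv"), "dataset"), Path("summary.txt"), "notes.md"]
normalized = [normalize_simple_item(item) for item in items]
print(normalized)
# [('scores.csv', 'dataset'), ('summary.txt', 'other'), ('notes.md', 'other')]
```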

For provenance-bearing directory decorators, `input_args` works the same way as
for file decorators: matching argument names are recorded as inputs, and the
remaining bound arguments become provenance params.
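
The split between inputs and params can be pictured with `inspect.signature`. This is a sketch of the idea under the defaults stated above, not the package's code. The helper `split_bound_arguments` is hypothetical, and leaving the output path argument out of the params is an assumption made for illustration.

```python
import inspect

DEFAULT_INPUT_ARGS = ("input_path", "input_paths")


def split_bound_arguments(func, args, kwargs, input_args=DEFAULT_INPUT_ARGS, exclude=()):
    """Split a call's bound arguments into recorded inputs and params."""
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()  # include defaults such as threshold=0.5
    inputs, params = {}, {}
    for name, value in bound.arguments.items():
        if name in exclude:
            continue  # assumption: the output path itself is not a param
        target = inputs if name in input_args else params
        target[name] = value
    return inputs, params


def write_report(artifact_path, input_path, threshold=0.5):
    """Stand-in pipeline step; only its signature matters here."""


inputs, params = split_bound_arguments(
    write_report,
    args=("outputs/summary.txt",),
    kwargs={"input_path": "data/raw/participants.csv"},
    exclude=("artifact_path",),
)
print(inputs)  # {'input_path': 'data/raw/participants.csv'}
print(params)  # {'threshold': 0.5}
```

Binding against the signature means positional and keyword calls are treated alike, which is why argument names rather than positions decide what counts as an input.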

Here is another decorator pattern example with `record_directory_annotation(...)`:

```python
from pathlib import Path

from data_annotations.annotations import record_directory_annotation
from data_annotations.description import (
    DocumentedArtifact,
    DocumentedArtifactGroup,
    FieldDefinition,
)
from data_annotations.provenance import ProducedFile

@record_directory_annotation(
    title="Validation Outputs",
    summary="Directory-level documentation for the validation run outputs.",
    acquisition_context={"source": "Study A registry export"},
    generation_context={"pipeline": "baseline-v1"},
)
def build_outputs(
    output_dir: Path,
    input_path: Path,
    split: str,
):
    participant_ids = [
        line.strip()
        for line in input_path.read_text(encoding="utf-8").splitlines()[1:]
        if line.strip()
    ]
    output_dir.mkdir(parents=True, exist_ok=True)

    table_path = output_dir / "scores.csv"
    table_path.write_text(
        "\n".join(
            [
                "participant_id,score,split",
                *[
                    f"{participant_id},0.94,{split}"
                    for participant_id in participant_ids
                ],
            ]
        )
        + "\n",
        encoding="utf-8",
    )

    report_path = output_dir / "summary.txt"
    report_path.write_text(
        (
            f"Validated {len(participant_ids)} participants from "
            f"{input_path.name} for the {split} split.\n"
        ),
        encoding="utf-8",
    )

    plot_paths = []
    for day in ["2024-01-01", "2024-01-02", "2024-01-03"]:
        plot_path = output_dir / f"sma_{day}.png"
        plot_path.write_bytes(
            (
                f"plot placeholder for the SMA variable on {day}, "
                f"derived from {input_path.name}\n"
            ).encode("utf-8")
        )
        plot_paths.append(plot_path)

    return [
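        DocumentedArtifact(
            path=str(table_path),
            kind="dataset",
            title="Scores Table",
            # Field definitions mirror the columns written to scores.csv above.
            fields=[
                FieldDefinition(
                    name="participant_id",
                    data_type="string",
                    summary="Stable participant identifier.",
                ),
                FieldDefinition(
                    name="score",
                    data_type="float",
                    summary="Validation score.",
                ),
                FieldDefinition(
                    name="split",
                    data_type="string",
                    summary="Data split the row belongs to.",
                ),
            ],
        ),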
        ProducedFile(path=str(report_path), kind="report"),
        DocumentedArtifactGroup(
            title="Daily SMA plots",
            summary="Plots of the same variable on different days.",
            kind="plot",
            paths=[str(path) for path in plot_paths],
            selector="sma_*.png",
        ),
    ]


output_dir = Path("outputs") / "run-001"
build_outputs(
    output_dir=output_dir,
    input_path=Path("data/raw/participants.csv"),
    split="validation",
)

print(output_dir / "data-annotations.json")
print(output_dir / "README.md")
```

The decorator and direct APIs write the same canonical document shape. If you need
metadata to vary per call instead of staying fixed at decoration time, use
`annotate_file(...)`, `annotate_directory(...)`, `write_file_annotation(...)`, or
`write_directory_annotation(...)` directly instead. See the example gallery in
`examples/` for runnable examples of all approaches.

### When To Use Decorators Vs Direct Functions

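Use the decorators when the metadata is fixed for a function and its upstream
dependencies arrive as named arguments such as `input_path`. If a function is
only a final serializer for already-prepared data, prefer the direct annotation
and writer APIs. They let you attach `inputs=[...]` explicitly.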

## Canonical Document Shape

File annotations store:

- `subject.path`
- `subject.kind`
- `subject.sha256`
- `provenance.*`
- `description.title`
- `description.summary`
- `description.fields`
- `description.primary_key`
- `description.missing_value_codes`
- `description.acquisition_context`
- `description.generation_context`
- `description.description_updated_at`

Directory annotations store:

- `subject.path`
- `subject.produced_files[]`
- `subject.child_bundles[]`
- `subject.content_digest`
- `provenance.*`
- `description.title`
- `description.summary`
- `description.artifact_groups[]`
- `description.artifacts[]`
- `description.acquisition_context`
- `description.generation_context`
- `description.description_updated_at`

Use `description.artifact_groups[]` when many files have the same meaning, and
use `description.artifacts[]` only for file-specific notes, overrides, or schema.
Groups are descriptive only. Integrity still lives in `subject.produced_files[]`,
which tracks every concrete file by path, kind, and checksum.

The `description` section intentionally excludes provenance linkage fields.
Directory `produced_files[].path` values are stored relative to `subject.path`,
which keeps verification stable when a complete output directory is copied or
archived elsewhere. `subject.content_digest` is computed from sorted tracked file
paths, file checksums, and referenced child bundle digests.
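
To illustrate why such a digest stays stable when a directory is copied elsewhere, here is a sketch that hashes sorted relative paths together with per-file checksums. The exact byte layout the package uses is not described here, so `sketch_content_digest` is a hypothetical illustration and will not reproduce real `subject.content_digest` values. Child bundle digests are passed in as plain strings.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path):
    """Return the hex SHA-256 of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def sketch_content_digest(root, relative_paths, child_digests=()):
    """Hash sorted relative paths, file checksums, and child digests (illustrative)."""
    digest = hashlib.sha256()
    for relative in sorted(relative_paths):
        checksum = file_sha256(Path(root) / relative)
        digest.update(f"{relative}\n{checksum}\n".encode("utf-8"))
    for child in sorted(child_digests):
        digest.update(f"child\n{child}\n".encode("utf-8"))
    return digest.hexdigest()


# The digest depends on relative paths and contents, not on where the directory lives.
with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    for root in (first, second):
        Path(root, "scores.csv").write_text("participant_id,score\n", encoding="utf-8")
        Path(root, "summary.txt").write_text("ok\n", encoding="utf-8")
    names = ["summary.txt", "scores.csv"]
    same = sketch_content_digest(first, names) == sketch_content_digest(second, names)
    print(same)  # True
```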

## Artifact Groups

Artifact groups are for homogeneous sets of files that researchers naturally
understand as one output family: for example, 100 PNG plots of the same variable,
one per acquisition day. A group stores the shared title, summary, kind, optional
schema fields, and the concrete member paths. It can also store an informational
`selector`, such as `plots/*.png`, to show how the group was chosen.

Rules of thumb:

- Use artifact groups when many files have the same meaning.
- Use individual artifacts for file-specific notes, exceptions, or overrides.
- It is OK for an individual artifact to also appear in a group.
- Do not rely on groups for integrity. `subject.produced_files[]` remains the
  complete checksum inventory.
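
A selector such as `plots/*.png` can be pictured as a glob evaluated relative to the annotated directory. The sketch below uses `pathlib.Path.glob` to expand a pattern into sorted, relative member paths. `expand_selector` is a hypothetical helper, and the package's own matching rules may differ.

```python
import tempfile
from pathlib import Path


def expand_selector(root, selector):
    """Expand a glob-style selector into sorted POSIX paths relative to root."""
    root = Path(root)
    return sorted(
        match.relative_to(root).as_posix()
        for match in root.glob(selector)
        if match.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    plots = Path(tmp) / "plots"
    plots.mkdir()
    for day in ("2024-01-01", "2024-01-02"):
        (plots / f"sma_{day}.png").write_bytes(b"placeholder")
    (plots / "notes.txt").write_text("not a plot\n", encoding="utf-8")
    members = expand_selector(tmp, "plots/*.png")
    print(members)
    # ['plots/sma_2024-01-01.png', 'plots/sma_2024-01-02.png']
```

Storing the expanded member paths next to the selector keeps the group readable even if files matching the pattern are added to the directory later.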

## Nested Directory Policy

Annotate the smallest thing you would share as a unit. If a directory is one
research output, give that directory one `data-annotations.json`, even when its
tracked files live in nested subdirectories.

Use recursive directory annotations for one bundle with nested files:

```bash
data-annotations annotate directory path/to/run-001 --recursive
data-annotations annotate directory path/to/run-001 --max-depth 2
```

Use child bundle annotations when a subdirectory is independently meaningful,
shareable, or reusable. In that case, annotate the child directory first, then
annotate the parent. The parent records a compact `child_bundles[]` reference
with the child path, child annotation path, and child content digest; it does not
copy the child file inventory into the parent JSON.

Post-hoc directory discovery follows the same rule. `--recursive` discovers
nested files, but it stops at annotated child directories containing
`data-annotations.json` and records them as child bundles.
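
The stopping rule can be sketched with `os.walk`, pruning any subdirectory that already contains `data-annotations.json`. This illustrates the policy only. `discover` is a hypothetical function, not the CLI's implementation, and skipping annotation files themselves is an assumption of the sketch.

```python
import os
import tempfile
from pathlib import Path

ANNOTATION_NAME = "data-annotations.json"


def discover(root):
    """Return (tracked_files, child_bundles) as sorted paths relative to root."""
    root = Path(root)
    files, children = [], []
    for current, dirnames, filenames in os.walk(root):
        current = Path(current)
        kept = []
        for name in sorted(dirnames):
            if (current / name / ANNOTATION_NAME).is_file():
                # Annotated child: record it as a bundle and do not descend into it.
                children.append((current / name).relative_to(root).as_posix())
            else:
                kept.append(name)
        dirnames[:] = kept  # in-place edit tells os.walk which directories to visit
        for name in filenames:
            if name != ANNOTATION_NAME:  # assumption: the annotation is not tracked
                files.append((current / name).relative_to(root).as_posix())
    return sorted(files), sorted(children)


with tempfile.TemporaryDirectory() as tmp:
    run = Path(tmp)
    (run / "plots").mkdir()
    (run / "plots" / "sma.png").write_bytes(b"placeholder")
    (run / "models").mkdir()
    (run / "models" / "weights.bin").write_bytes(b"placeholder")
    (run / "models" / ANNOTATION_NAME).write_text("{}", encoding="utf-8")
    tracked, bundles = discover(run)
    print(tracked)  # ['plots/sma.png']
    print(bundles)  # ['models']
```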

## Provenance Decorators And Writers

The `data_annotations.provenance` namespace provides provenance-only entry points.
Prefer the decorators when you already have a small function that writes artifacts:

```python
from pathlib import Path

from data_annotations.provenance import record_file_manifest


@record_file_manifest(artifact_kind="report")
def write_report(
    artifact_path: Path,
    input_path: Path,
    threshold: float = 0.5,
):
    artifact_path.parent.mkdir(parents=True, exist_ok=True)
    artifact_path.write_text(
        f"threshold applied: {threshold}\nsource={input_path.name}\n",
        encoding="utf-8",
    )


write_report(
    artifact_path=Path("outputs/summary.txt"),
    input_path=Path("data/raw/participants.csv"),
    threshold=0.75,
)
```

Use `record_directory_manifest(...)` for directory outputs. Directory decorators
accept `DocumentedArtifact`, `ProducedFile`, `(path, kind)`, and plain path-like
return values. Provenance-only APIs do not accept description groups; use
unified annotation or description APIs when groups should appear in the JSON or
README.

If you want the direct writer approach instead, use `write_file_manifest(...)` and
`write_directory_manifest(...)` (see `examples/`).

## Description Layer

The `data_annotations.description` sub-package provides the structured description
models used by annotation writers and the Markdown sidecar renderers.
Within those models, the primary human-written narrative field is named `summary`.

Key public description models:

- `AllowedValue`
- `FieldDefinition`
- `DocumentedArtifact`
- `DocumentedArtifactGroup`
- `ArtifactDescription`
- `ArtifactGroupDescription`
- `FileDescription`
- `DirectoryDescription`

Description decorators and helpers:

- `record_file_description(...)`
- `record_directory_description(...)`
- `write_file_description(...)`
- `write_directory_description(...)`
- `render_file_readme(...)`
- `render_directory_readme(...)`

The alias helpers `write_file_readme(...)` and `write_directory_readme(...)` are also supported.

Use the decorator forms when the description metadata is stable
for a function, and use the direct helpers when you want to assemble descriptions
per call.

## Recovery Helpers

Use `artifact_matches_manifest(...)` to verify whether a detached artifact still
matches an annotation document, and `checkout_manifest_source(...)` to recover the
recorded code state from Git metadata.

```python
from pathlib import Path

from data_annotations.provenance import (
    artifact_matches_manifest,
    checkout_manifest_source,
)

annotation_path = Path("outputs/participants.csv.annotation.json")
artifact_path = Path("downloads/participants.csv")

if artifact_matches_manifest(artifact_path, annotation_path):
    recovered = checkout_manifest_source(annotation_path)
    print(recovered.checkout_path)
    print(recovered.script_path)
```
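
Conceptually, verifying a detached file comes down to comparing a fresh checksum with the recorded one. The sketch below assumes `subject.sha256` holds the hex SHA-256 of the file's bytes. `sha256_matches` is a hypothetical stand-in and not a replacement for `artifact_matches_manifest(...)`, which may check more than this.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_matches(artifact_path, annotation_path):
    """Compare a file's SHA-256 with the value stored under subject.sha256."""
    document = json.loads(Path(annotation_path).read_text(encoding="utf-8"))
    recorded = document["subject"]["sha256"]
    actual = hashlib.sha256(Path(artifact_path).read_bytes()).hexdigest()
    return actual == recorded


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "participants.csv"
    artifact.write_text("participant_id,group\n", encoding="utf-8")
    annotation = Path(tmp) / "participants.csv.annotation.json"
    # Minimal stand-in document; real annotation documents carry more sections.
    checksum = hashlib.sha256(artifact.read_bytes()).hexdigest()
    annotation.write_text(json.dumps({"subject": {"sha256": checksum}}), encoding="utf-8")

    unchanged = sha256_matches(artifact, annotation)
    artifact.write_text("participant_id,group\nP001,control\n", encoding="utf-8")
    modified = sha256_matches(artifact, annotation)
    print(unchanged, modified)  # True False
```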

## Post-Hoc Annotation

The strongest workflow is to create provenance and description at the same time
as the artifact itself. When annotations are written during generation, the
package can capture runtime context directly and the resulting records are
typically more complete, precise, and trustworthy.

For existing artifacts, the CLI provides a post-hoc annotation path so you can
still attach provenance and description after the fact.

Post-hoc descriptions can still be very useful, but the quality of post-hoc
provenance depends on how exact the supplied answers are. In particular, fields
such as the generating script, command, function, Git commit, repository path,
Git tags, `git describe` output, inputs, and parameters are only as reliable as
the information entered during annotation.

## CLI Workflow

This package provides a command-line interface (CLI) for retrospective annotation
and provenance inspection.

For post-hoc annotation:

```bash
data-annotations annotate file path/to/participants.csv
data-annotations annotate directory path/to/run-001
data-annotations annotate directory path/to/run-001 --recursive
data-annotations annotate directory path/to/run-001 --max-depth 2
data-annotations annotate directory path/to/run-001 \
  --recursive \
  --group-selector "plots/*.png" \
  --group-title "Daily SMA plots" \
  --group-summary "Plots of the same variable on different days." \
  --group-kind plot
```

These commands prompt for missing details, write `*.annotation.json` or `data-annotations.json`,
and optionally derive README sidecars. Post-hoc records are marked with
`capture_mode="post_hoc"`.

When group selectors are provided, the CLI expands them to concrete member paths
at annotation time. Grouped files are tracked in `subject.produced_files[]` but
are skipped by the per-file prompt flow, so you do not have to answer the same
questions for every matching file.

For post-hoc provenance, use repeatable `--git-tag` and optional
`--git-describe` when you know the original code state. These values are stored
as human-readable hints; `--git-sha` remains the field used for recovery.

For provenance inspection and source recovery:

```bash
data-annotations provenance match path/to/artifact
data-annotations provenance checkout path/to/artifact
```

The `match` command auto-discovers `*.annotation.json` for files and
`data-annotations.json` for directories, prints a verification summary, and
suggests the exact `checkout` command to run next when Git recovery metadata is
available.

### Run With `uvx`

```bash
uvx --from "data-annotations[cli]" data-annotations provenance match path/to/participants.csv
```

### Install And Use With `uv tool`

```bash
uv tool install "data-annotations[cli]"
data-annotations provenance match path/to/participants.csv
```

### Run From Repository Root

From the repository root while developing locally, run `task install` first.
That task uses `uv sync --extra cli`, so the CLI commands are available in
the project environment. You can then run:

```bash
uv run data-annotations annotate file path/to/participants.csv
uv run data-annotations annotate directory path/to/run-001
uv run data-annotations provenance match path/to/participants.csv
uv run data-annotations provenance checkout path/to/participants.csv
```

## API Overview

### Annotation Models

- `FileArtifactSubject`
- `DirectoryArtifactSubject`
- `FileAnnotationDocument`
- `DirectoryAnnotationDocument`
- `FileAnnotationResult`
- `DirectoryAnnotationResult`

### Annotation Decorators

- `record_file_annotation(...)`
- `record_directory_annotation(...)`

### Annotation Functions

- `write_file_annotation(...)`
- `write_directory_annotation(...)`
- `annotate_file(...)`
- `annotate_directory(...)`

### Description Models

- `AllowedValue`
- `FieldDefinition`
- `DocumentedArtifact`
- `DocumentedArtifactGroup`
- `ArtifactDescription`
- `ArtifactGroupDescription`
- `FileDescription`
- `DirectoryDescription`

### Description Functions

- `record_file_description(...)`
- `record_directory_description(...)`
- `write_file_description(...)`
- `write_directory_description(...)`
- `write_file_readme(...)`
- `write_directory_readme(...)`
- `render_file_readme(...)`
- `render_directory_readme(...)`

### Provenance Models

- `ProducedFile`
- `ChildBundle`
- `BaseProvenance`
- `FileManifest`
- `DirectoryManifest`
- `RecoveredSource`

### Provenance Functions

- `record_file_manifest(...)`
- `record_directory_manifest(...)`
- `write_file_manifest(...)`
- `write_directory_manifest(...)`
- `directory_content_digest(...)`
- `artifact_matches_manifest(...)`
- `checkout_manifest_source(...)`

## Examples

Runnable examples live in `examples/` and mirror the README workflows.
Run them from the repository root with:

```bash
uv run python examples/record_file_annotation.py
uv run python examples/record_directory_annotation.py
uv run python examples/record_file_manifest.py
uv run python examples/record_directory_manifest.py
uv run python examples/record_file_description.py
uv run python examples/record_directory_description.py
uv run python examples/annotate_file.py
uv run python examples/annotate_directory.py
uv run python examples/write_file_manifest.py
uv run python examples/write_directory_manifest.py
uv run python examples/write_file_description.py
uv run python examples/write_directory_description.py
uv run python examples/recover_provenance.py
uv run python examples/recover_provenance_cli.py
```

Each example writes its outputs to a fresh temporary directory and prints the
location so you can inspect the generated annotation documents and README sidecars.
