Metadata-Version: 2.4
Name: data-annotations
Version: 2.11.0
Summary: Annotate data artifacts with provenance and descriptions
Keywords: annotations,data,metadata,provenance,reproducibility
Author: Rodrigo C. G. Pena
Author-email: Rodrigo C. G. Pena <rodrigo.cerqueiragonzalezpena@unibas.ch>
License-Expression: BSD-3-Clause
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: pydantic>=2.13.1
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: questionary>=2.1.1 ; extra == 'cli'
Requires-Dist: typer>=0.16.0 ; extra == 'cli'
Requires-Python: >=3.12
Project-URL: Source, https://gitlab.com/ceda-unibas/tools/data-annotations
Project-URL: Changelog, https://gitlab.com/ceda-unibas/tools/data-annotations/-/blob/main/CHANGELOG.md
Project-URL: Issues, https://gitlab.com/ceda-unibas/tools/data-annotations/-/issues
Provides-Extra: cli
Description-Content-Type: text/markdown

# data-annotations

[![PyPI](https://img.shields.io/pypi/v/data-annotations?label=pypi)](https://pypi.org/project/data-annotations/)
[![Documentation](https://img.shields.io/badge/docs-latest-blue)](https://ceda-unibas.gitlab.io/tools/data-annotations/)
[![License](https://img.shields.io/pypi/l/data-annotations?label=license)](https://gitlab.com/ceda-unibas/tools/data-annotations/-/blob/main/LICENSE)
[![CI](https://gitlab.com/ceda-unibas/tools/data-annotations/badges/main/pipeline.svg)](https://gitlab.com/ceda-unibas/tools/data-annotations/-/pipelines)

`data-annotations` is a Python package for attaching provenance and structured
descriptions to the files and directories your workflows produce.

It writes plain JSON annotation sidecars that are easy to inspect, archive, and
publish with research outputs:

- files use `artifact.ext.annotation.json`
- directories use `data-annotations.json` at their root

Optional Markdown README sidecars can be generated for human-readable summaries.

## Documentation

The [full documentation](https://ceda-unibas.gitlab.io/tools/data-annotations/) is organized as a [Diátaxis](https://diataxis.fr/) site.

Other links:

- [Source code](https://gitlab.com/ceda-unibas/tools/data-annotations)
- [Changelog](https://gitlab.com/ceda-unibas/tools/data-annotations/-/blob/main/CHANGELOG.md)
- [Work items](https://gitlab.com/ceda-unibas/tools/data-annotations/-/work_items)

## Installation

Install the core library from PyPI:

```bash
pip install data-annotations
```

Or add it to a project with `uv`:

```bash
uv add data-annotations
```

Install the `cli` extra, with either `pip` or `uv`, when you want the `data-annotations` command:

```bash
pip install "data-annotations[cli]"
uv add "data-annotations[cli]"
```

## Quick start

Decorate a function that writes an artifact. When the function runs,
`data-annotations` records provenance and writes the JSON sidecar.

```python
from pathlib import Path

from data_annotations.annotations import record_file_annotation
from data_annotations.description import FieldDefinition


@record_file_annotation(
    title="Participant Cohort",
    summary="Participant-level cohort assignments.",
    fields=[
        FieldDefinition(
            name="participant_id",
            data_type="string",
            summary="Stable participant identifier.",
            required=True,
            nullable=False,
        ),
    ],
    primary_key=["participant_id"],
    artifact_kind="dataset",
    write_readme=True,
)
def write_participants(artifact_path: Path, input_path: Path) -> Path:
    # Skip the header row of the input and drop blank lines.
    participant_ids = [
        line.strip()
        for line in input_path.read_text(encoding="utf-8").splitlines()[1:]
        if line.strip()
    ]
    # Write a single-column CSV with its own header row.
    artifact_path.parent.mkdir(parents=True, exist_ok=True)
    artifact_path.write_text(
        "participant_id\n" + "\n".join(participant_ids) + "\n",
        encoding="utf-8",
    )
    return artifact_path


artifact_path = Path("outputs") / "participants.csv"
write_participants(
    artifact_path=artifact_path,
    input_path=Path("data/raw/participants.csv"),
)
```

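This writes the artifact, its JSON annotation sidecar, and, because `write_readme=True`, a Markdown README sidecar: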

```text
outputs/participants.csv
outputs/participants.csv.annotation.json
outputs/participants.csv.README.md
```
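
Because the sidecar is plain JSON, it can be inspected with the standard library alone. The sketch below uses hypothetical helpers (`sidecar_for`, `load_annotation`) that are not part of the package. It assumes the top level of the sidecar is a JSON object, and it lists only the top-level keys because the schema is not described here.

```python
import json
from pathlib import Path


def sidecar_for(artifact_path: Path) -> Path:
    """Return the expected JSON sidecar path for a file artifact."""
    return artifact_path.with_name(artifact_path.name + ".annotation.json")


def load_annotation(artifact_path: Path) -> dict:
    """Load the JSON sidecar that sits next to ``artifact_path``.

    Hypothetical helper for illustration; not provided by data-annotations.
    """
    with sidecar_for(artifact_path).open(encoding="utf-8") as handle:
        return json.load(handle)


artifact = Path("outputs") / "participants.csv"
if sidecar_for(artifact).exists():
    # Assumption: the top level is a JSON object, so iterating yields its keys.
    print(sorted(load_annotation(artifact)))
```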

## CLI

The CLI supports retrospective annotation, provenance inspection, source
recovery, and sanitized publish bundles.

```bash
data-annotations annotate file path/to/participants.csv --write-readme
data-annotations annotate directory path/to/run-001 --recursive
data-annotations provenance match path/to/participants.csv
data-annotations provenance chain path/to/participants.csv
data-annotations provenance checkout path/to/participants.csv
data-annotations publish path/to/run-001 path/to/publish-bundle
```
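
To annotate many existing files in one pass, the `annotate file` command shown above could be driven from Python. This is a sketch under stated assumptions: `annotate_file_command` and `annotate_files` are hypothetical names, only the subcommand and the `--write-readme` flag from the listing above are used, and whether the command prompts for input is not covered here.

```python
import shutil
import subprocess
from pathlib import Path


def annotate_file_command(artifact: Path, write_readme: bool = True) -> list[str]:
    """Build the argument list for ``data-annotations annotate file``."""
    command = ["data-annotations", "annotate", "file", str(artifact)]
    if write_readme:
        command.append("--write-readme")
    return command


def annotate_files(directory: Path, pattern: str = "*.csv") -> None:
    """Run the annotate command once per matching file in ``directory``.

    Hypothetical helper for illustration; not provided by data-annotations.
    """
    if shutil.which("data-annotations") is None:
        raise RuntimeError("data-annotations CLI not found; install the 'cli' extra")
    for artifact in sorted(directory.glob(pattern)):
        # check=True stops the batch at the first failing invocation.
        subprocess.run(annotate_file_command(artifact), check=True)
```

Calling `annotate_files(Path("outputs"))` would then invoke the CLI for each CSV file directly under `outputs`. Building the argument list in a separate function keeps it easy to inspect without running anything.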

## Development

From a source checkout (assuming you have [Task installed](https://taskfile.dev/docs/installation)):

```bash
task install
task lint
task type-check
task test
```

Build or preview the documentation site:

```bash
task docs-build
task docs-serve
```
