Metadata-Version: 2.4
Name: vcti-fileloader-hdf5
Version: 5.1.1
Summary: h5py-backed HDF5 file loader for the vcti-fileloader framework
Author: Visual Collaboration Technologies Inc.
Requires-Python: <3.15,>=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: h5py>=3.0
Requires-Dist: vcti-fileloader>=5.1.0
Requires-Dist: vcti-tree>=1.0.0
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Provides-Extra: lint
Requires-Dist: ruff; extra == "lint"
Provides-Extra: typecheck
Requires-Dist: mypy; extra == "typecheck"
Dynamic: license-file

# FileLoader HDF5

HDF5 file loader using h5py — attaches an HDF5 file's content as a
locked subtree under a caller-supplied parent in any `LockableTree`.

## Overview

`vcti-fileloader-hdf5` is the HDF5 plugin for the `vcti-fileloader`
framework. It implements the `Loader` protocol with one operation
that does the real work: `populate(handle, tree, parent)` walks the
HDF5 hierarchy once via `h5py.File.visit()` and grafts the file's
groups and datasets as a subtree under `parent`.

The loader's attribute pass-through is **strict**: every node's
`file_attributes` is populated verbatim from the corresponding HDF5
object's `obj.attrs`, with no synthesised keys, no derived storage
metadata, and no file path. Application-domain attributes (file
paths, derived storage info, category tags) belong in a
`before_lock` hook or a downstream enricher.

Groups become `DataNode` payloads carrying their HDF5 attributes;
datasets become `LazyDataNode` payloads with `shape` / `dtype`
populated at construction so consumers can browse without
materialising. The returned subtree is structure-locked and
payload-locked. The loader is backing-agnostic — it writes against
the `LockableTree` protocol from `vcti-tree`, so the caller picks
the tree implementation (`DictTree`, `vcti-nptree.ArrayTree`, or
their own).

## Installation

```bash
# Quote the requirement so the shell does not treat ">" as a redirect.
pip install "vcti-fileloader-hdf5>=5.1.1"
```

---

## Quick Start

```python
from pathlib import Path

from vcti.fileloader.core import DataNode, LoaderRegistry, materialise_subtree
from vcti.fileloader.hdf5 import H5pyLoader, get_loader_descriptor
from vcti.tree import DictTree                # or any other LockableTree backing

# Context manager (recommended for the one-shot case)
loader = H5pyLoader()
tree: DictTree[DataNode] = DictTree(DataNode())
with loader.open(Path("data.h5")) as handle:
    # Eager: read every dataset into memory before the file closes.
    subtree_root = loader.populate(handle, tree, tree.root_handle, lazy=False)
# Tree is fully usable here; handle is closed.

# Lazy loading: keep the handle open while you browse and materialise on demand.
loader = H5pyLoader()
tree = DictTree(DataNode())
handle = loader.load(Path("data.h5"))
try:
    subtree_root = loader.populate(handle, tree, tree.root_handle)  # lazy=True default
    # ... browse the tree, call payload.load() on datasets you need ...
finally:
    loader.unload(handle)

# Materialise then unload — close the file but keep the tree usable.
loader = H5pyLoader()
tree = DictTree(DataNode())
handle = loader.load(Path("data.h5"))
try:
    subtree_root = loader.populate(handle, tree, tree.root_handle)
    materialise_subtree(tree, subtree_root)
finally:
    loader.unload(handle)

# Registry-based usage
registry = LoaderRegistry()
registry.register(get_loader_descriptor())
desc = registry.get("hdf5-h5py-loader")
tree = DictTree(DataNode())
with desc.loader.open(Path("data.h5")) as handle:
    subtree_root = desc.loader.populate(handle, tree, tree.root_handle, lazy=False)
```

## Quick Start — with a `before_lock` hook

Stamping `file_path` and arbitrary domain tags is the caller's
job, run through the `before_lock` hook (which fires after the
subtree is built but before it is locked):

```python
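# Continues from the Quick Start: `loader`, `tree` and `Path` are already defined.
path = Path("data.h5")

def stamp_file_path(tree, root):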
    tree.payload(root).enricher_attributes["file_path"] = str(path)

with loader.open(path) as handle:
    root = loader.populate(handle, tree, tree.root_handle, before_lock=stamp_file_path)
```

For rule-driven enrichment, pair the hook with
[`vcti-attribute-enricher`](https://github.com/vcollab/vcti-python-attribute-enricher):

```python
from vcti.attribute_enricher import EnrichRule, apply_rules
from vcti.tree import descendants
from vcti.lookup import Rule

def enrich(tree, root):
    apply_rules(
        descendants(tree, root, include_self=True),
        rules=[
            EnrichRule(set={"file_path": str(path)}),
            EnrichRule(set={"category": "mechanical"},
                       when=(Rule("name", "^=", "stress"),)),
        ],
    )

with loader.open(path) as handle:
    root = loader.populate(handle, tree, tree.root_handle, before_lock=enrich)
```

`vcti-attribute-enricher` is an optional sibling package — this
loader has no dependency on it.

---

## What the subtree looks like

Given an HDF5 file with this structure:

```
/  (file_attr="test_value")
├── results/  (solver="NASTRAN")
│   └── stress  (units="MPa"), shape=(3,), dtype=float64
└── ids  shape=(3,), dtype=int64
```

`populate(handle, tree, tree.root_handle)` produces this subtree:

| Node            | name        | Payload type   | data   | file_attributes        | shape / dtype |
|-----------------|-------------|----------------|--------|------------------------|---------------|
| subtree root    | `None`      | `DataNode`     | `None` | `{file_attr: "test_value"}` | — |
| results (group) | `"results"` | `DataNode`     | `None` | `{solver: "NASTRAN"}`   | — |
| stress (lazy)   | `"stress"`  | `LazyDataNode` | (lazy) | `{units: "MPa"}`        | `(3,)` / `float64` |
| ids (lazy)      | `"ids"`     | `LazyDataNode` | (lazy) | `{}`                    | `(3,)` / `int64` |

Note that `name` is a first-class field, and `shape` / `dtype` are
first-class fields on `LazyDataNode` — none of these are in
`file_attributes`. The enricher side (`enricher_attributes`) starts
empty; the merged `attributes` ChainMap reflects only the file's
native keys until a `before_lock` hook (or other enricher) adds to
the enricher side.
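
The merged view can be pictured with `collections.ChainMap` from the standard library. This is only a sketch: plain dicts stand in for a node's two attribute maps, and the lookup order between the two sides (enricher first) is an assumption made here, so check the `vcti-fileloader` documentation before relying on precedence.

```python
from collections import ChainMap

# Plain dicts standing in for a node's two attribute maps (illustrative only).
file_attributes = {"units": "MPa"}   # copied verbatim from obj.attrs
enricher_attributes = {}             # empty until a hook or enricher writes to it

# Assumption: the enricher side is consulted first. The real ordering is
# defined by vcti-fileloader, not by this sketch.
attributes = ChainMap(enricher_attributes, file_attributes)

before_hook = dict(attributes)       # only the file's native keys

# What a before_lock hook might add.
enricher_attributes["file_path"] = "data.h5"
after_hook = dict(attributes)        # the merged view now shows both sides

print(before_hook)
print(after_hook)
```

Writing to the enricher side leaves `file_attributes` untouched, which is what keeps the file's native keys distinguishable from anything added later.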

---

## API

### H5pyLoader

| Method | Description |
|--------|-------------|
| `load(path, **options)` | Open HDF5 file, return `h5py.File` handle |
| `open(path, **options)` | Context manager — loads and auto-unloads |
| `populate(handle, tree, parent, *, before_lock=None, lazy=True, **options)` | Attach subtree, run hook, lock, return subtree root handle |
| `unload(handle)` | Close HDF5 file handle (idempotent) |
| `can_load(path)` | Check extension (`.h5`, `.hdf5`) |

### Helpers

| | Description |
|---|---|
| `get_loader_descriptor()` | Create `LoaderDescriptor` for registry |
| `H5pyValidator` | Check h5py availability |
| `H5pySetup` | No-op setup (h5py needs no config) |

---

## Lazy vs Eager Loading

`populate(..., lazy=True)` (default) attaches each dataset as a
`LazyDataNode` — its `data` is `None` until `.load()` is called, at
which point the closure reads `handle[path][:]`. Use lazy when:

- you want to browse the file's structure (names, shapes, dtypes,
  attributes) before deciding which arrays to materialise, or
- the file is too large to load entirely into memory.

`populate(..., lazy=False)` reads every dataset into a `DataNode` at
populate time. Use eager when:

- the file is small and you want everything loaded immediately, or
- you cannot guarantee the handle will stay open after `populate`
  returns and you do not want to use `materialise_subtree`.

### Handle lifetime contract with lazy nodes

Each `LazyDataNode` produced by `populate(..., lazy=True)` holds a
closure over `handle`. Once `loader.unload(handle)` runs, those
closures cannot fulfil further `.load()` calls. Three patterns avoid
the problem:

1. **Keep the handle open** for the lifetime of the tree.
2. **Materialise then unload:**
   `root = populate(handle, tree, parent); materialise_subtree(tree, root); unload(handle)`.
   After this, every lazy node is loaded, and the tree is fully
   usable without the handle.
3. **Use eager mode:** `populate(..., lazy=False)`.

`materialise_subtree(tree, root_handle)` is exported from
`vcti.fileloader.core` and walks the subtree, calling `.load()` on
every `LazyDataNode`.
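
The lifetime contract can be illustrated without any HDF5 file at all. The sketch below uses two hypothetical stand-in classes, `FakeHandle` and `FakeLazyNode`, that are not part of this package. It assumes that a loaded node caches its data and that reading through a closed handle raises `ValueError`; the real classes may behave differently in both respects.

```python
class FakeHandle:
    """Hypothetical stand-in for an open h5py.File; not part of any package."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)
        self.closed = False

    def read(self, path):
        if self.closed:
            raise ValueError("handle is closed")
        return list(self._datasets[path])

    def close(self):
        self.closed = True


class FakeLazyNode:
    """Hypothetical stand-in for a lazy dataset node."""

    def __init__(self, reader):
        self._reader = reader   # closure that captures the handle
        self.data = None        # stays None until load() is called

    def load(self):
        if self.data is None:   # assumption: loaded data is cached on the node
            self.data = self._reader()
        return self.data


handle = FakeHandle({"/ids": [1, 2, 3], "/results/stress": [0.5, 1.5, 2.5]})
paths = ("/ids", "/results/stress")
nodes = {p: FakeLazyNode(lambda p=p: handle.read(p)) for p in paths}

# Pattern 2: materialise everything, then close the handle.
for node in nodes.values():
    node.load()
handle.close()
ids_after_close = nodes["/ids"].load()   # served from node.data, no handle needed

# The failure mode: a node that was never loaded before the handle closed.
late = FakeLazyNode(lambda: handle.read("/ids"))
try:
    late.load()
    late_error = None
except ValueError as exc:
    late_error = str(exc)
```

Nodes loaded before the close keep working, while the late node fails because its closure still points at the closed handle.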

---

## Error Handling

```python
from vcti.fileloader.core import (
    LoadError,
    UnloadError,
    UnsupportedFormatError,
    TreeAttachmentError,
)

try:
    with loader.open(Path("data.h5")) as handle:
        subtree_root = loader.populate(handle, tree, tree.root_handle)
except FileNotFoundError:
    ...
except UnsupportedFormatError:
    ...
except LoadError:
    ...
except TreeAttachmentError:
    # parent is missing, deleted, or structure-locked in `tree`
    ...
except ValueError:
    # populate() was called on a closed handle
    ...
```

If `populate` fails partway through (an I/O error during the walk,
or an exception in `before_lock`), the partial subtree is removed
before the exception propagates — callers never see a half-built
subtree.
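
The all-or-nothing guarantee follows a common build-then-roll-back pattern. The sketch below shows that pattern in isolation, with a plain dict as a hypothetical stand-in for a tree backing and a hypothetical `build_subtree` helper. It is not the loader's actual implementation.

```python
def build_subtree(tree, parent, names, before_lock=None):
    """Attach `names` under `parent`; undo everything if any step fails.

    `tree` is a plain dict mapping a parent key to a list of child names,
    a hypothetical stand-in for a real tree backing.
    """
    added = []
    try:
        for name in names:
            tree.setdefault(parent, []).append(name)
            added.append(name)
        if before_lock is not None:
            before_lock(tree, parent)   # the hook runs before any locking
    except Exception:
        # Remove only what this call attached, then re-raise unchanged.
        for name in added:
            tree[parent].remove(name)
        raise
    return added


def failing_hook(tree, parent):
    raise RuntimeError("hook failed")


tree = {"root": ["existing"]}
try:
    build_subtree(tree, "root", ["results", "ids"], before_lock=failing_hook)
    outcome = "built"
except RuntimeError as exc:
    outcome = f"rolled back after: {exc}"

print(outcome, tree)
```

The caller still receives the original exception, but the tree is back in the state it had before the call.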

---

## Soft Links and Hard Links

HDF5 files can contain **soft links** (symbolic references to other
paths) and **hard links** (multiple names pointing to the same
object). `h5py.File.visit()` — which this loader uses — follows
hard links but **does not follow soft links** or external links by
default:

- **Hard-linked objects** appear once in the subtree (at the first
  path `visit()` encounters). They are not duplicated.
- **Soft links** are silently skipped and will not appear as nodes
  in the subtree.
- **External links** are also skipped.

This behaviour is inherited from h5py/libhdf5 and is not
configurable in this loader.
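
The link behaviour can be checked directly against h5py, independently of this loader. The sketch below builds a small in-memory file (the `core` driver with `backing_store=False` writes nothing to disk); the file name and the link names `alias_hard` and `alias_soft` are made up for illustration.

```python
import h5py
import numpy as np

# In-memory file: nothing is written to disk.
with h5py.File("links-demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("results/stress", data=np.arange(3.0))
    f["alias_hard"] = f["results/stress"]                # second name, same object
    f["alias_soft"] = h5py.SoftLink("/results/stress")   # stored path, not an object

    visited = []
    f.visit(visited.append)   # list.append returns None, so the walk continues

    link_names = sorted(f.keys())
    soft = f.get("alias_soft", getlink=True)   # inspect the link without following it

print("visited:", visited)
print("root link names:", link_names)
print("alias_soft is a soft link:", isinstance(soft, h5py.SoftLink))
```

The dataset shows up under only one of its two hard-link names, and `alias_soft` is listed among the root group's links but never reaches the `visit()` callback. A caller that needs soft-link targets in the tree would have to resolve them separately, for example in a `before_lock` hook.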

---

## Thread Safety

h5py file handles are **not thread-safe**. Do not share a single
`h5py.File` handle across threads. Open a separate handle per
thread, or serialise access with a lock.

Tree backings (`DictTree`, `ArrayTree`, etc.) are likewise not
thread-safe. Calling `populate` on the same tree from multiple
threads concurrently is undefined behaviour.
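
One way to serialise access is to route every read through a single lock. The sketch below is a minimal illustration of that idea: `LockedReader` is a hypothetical helper, and a plain dict stands in for the open handle so the example runs without an HDF5 file.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedReader:
    """Serialise every read on one shared, non-thread-safe handle."""

    def __init__(self, handle):
        self._handle = handle
        self._lock = threading.Lock()

    def read(self, path):
        with self._lock:   # only one thread touches the handle at a time
            # With a real h5py.File, the actual data read must also happen
            # inside the lock, not after it is released.
            return self._handle[path]


# A dict stands in for an open h5py.File here (hypothetical data).
shared = LockedReader({"/ids": [1, 2, 3], "/results/stress": [0.5, 1.5, 2.5]})

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(shared.read, ["/ids", "/results/stress"] * 4))

print(len(results), results[0], results[1])
```

A lock is simple but removes any parallelism on the file; opening one handle per thread avoids the contention at the cost of more open file descriptors.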

---

## Dependencies

- [h5py](https://www.h5py.org/) (>=3.0)
- [numpy](https://numpy.org/) (>=1.24)
- [vcti-fileloader](https://pypi.org/project/vcti-fileloader/) (>=5.1.0) — `Loader` protocol, `SubtreeBuilder`, `DataNode`, `LazyDataNode`, `materialise_subtree` (import from `vcti.fileloader.core`)
- [vcti-tree](https://pypi.org/project/vcti-tree/) (>=1.0.0) — `LockableTree` protocol
