Metadata-Version: 2.4
Name: vcti-fileloader-numpy
Version: 1.0.1
Summary: NumPy-backed NPY, NPZ, and CSV file loaders for the vcti-fileloader framework
Author: Visual Collaboration Technologies Inc.
Project-URL: Homepage, https://github.com/vcollab/vcti-python-fileloader-numpy
Project-URL: Repository, https://github.com/vcollab/vcti-python-fileloader-numpy
Project-URL: Issues, https://github.com/vcollab/vcti-python-fileloader-numpy/issues
Requires-Python: <3.15,>=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.26
Requires-Dist: vcti-fileloader>=5.1.0
Requires-Dist: vcti-tree>=1.0.0
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Provides-Extra: lint
Requires-Dist: ruff; extra == "lint"
Provides-Extra: typecheck
Requires-Dist: mypy; extra == "typecheck"
Dynamic: license-file

# FileLoader NumPy

NumPy-backed NPY, NPZ, and CSV file loaders for the
`vcti-fileloader` framework.

## Overview

`vcti-fileloader-numpy` ships three loader plugins for the
`vcti-fileloader` framework, all implemented against NumPy:

- **`NpyLoader`** — loads `.npy` single-array files into a one-node
  subtree. Supports memory-mapped reads via the `mmap_mode` option.
- **`NpzLoader`** — loads `.npz` archives (zip-of-NPY) into a
  multi-child subtree, one child per array name. Defaults to
  `LazyDataNode` children so callers can browse `shape` / `dtype`
  without materialising; pass `lazy=False` for eager reads.
- **`CsvLoader`** — loads delimited-text files (`.csv` / `.tsv` /
  `.txt`) via `numpy.genfromtxt` into a one-node subtree. Passes
  through any `genfromtxt` keyword; when `names=True` is used, the
  loader stamps the column names on the subtree root as
  `file_attributes["columns"]`.

All three implement the `vcti.fileloader.core.Loader` protocol and
write against the `LockableTree` protocol from `vcti-tree`, so the
caller chooses the tree implementation that backs the data (the
examples below use `DictTree`).

## Installation

```bash
pip install "vcti-fileloader-numpy>=1.0.1"
```

### In `pyproject.toml` dependencies

```toml
dependencies = [
    "vcti-fileloader-numpy>=1.0.1",
]
```

---

## Quick Start

### NPY

```python
from pathlib import Path

from vcti.fileloader.core import DataNode
from vcti.fileloader.numpy import NpyLoader
from vcti.tree import DictTree

loader = NpyLoader()
tree: DictTree[DataNode] = DictTree(DataNode())
with loader.open(Path("data.npy")) as handle:
    root = loader.populate(handle, tree, tree.root_handle)

[child] = list(tree.children(root))
print(tree.payload(child).data)        # the loaded array
```

Memory-map a large file instead of reading into memory:

```python
arr = loader.load(Path("big.npy"), mmap_mode="r")
```
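
What `mmap_mode="r"` buys you can be sketched with plain NumPy. This is a minimal illustration, assuming the loader forwards the option to `numpy.load` unchanged (this README does not spell that out); the file name and array size are made up for the example.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "big.npy"
    np.save(path, np.arange(1_000_000, dtype=np.float64))

    # mmap_mode="r" maps the file read-only instead of reading it all.
    arr = np.load(path, mmap_mode="r")
    is_memmap = isinstance(arr, np.memmap)
    writeable = arr.flags.writeable

    # Only the pages backing this slice need to be touched.
    window_sum = float(arr[10:13].sum())

    # Drop the mapping before the temporary directory is removed.
    del arr
```

A read-only map is the safe default for shared data; slices of it behave like ordinary arrays but are backed by the file rather than by process memory.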

### NPZ

```python
from pathlib import Path

from vcti.fileloader.core import DataNode, materialise_subtree
from vcti.fileloader.numpy import NpzLoader
from vcti.tree import DictTree

loader = NpzLoader()
tree: DictTree[DataNode] = DictTree(DataNode())

with loader.open(Path("data.npz")) as handle:
    root = loader.populate(handle, tree, tree.root_handle)   # lazy=True default
    materialise_subtree(tree, root)                          # read everything
# `handle` is closed here, but the tree is still usable.

for child in tree.children(root):
    p = tree.payload(child)
    print(p.name, p.shape, p.dtype)
```

For eager reads (no closure, no handle-lifetime concern):

```python
with loader.open(Path("data.npz")) as handle:
    root = loader.populate(handle, tree, tree.root_handle, lazy=False)
```
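
To see why `shape` and `dtype` can be browsed without materialising, note that each member of an `.npz` archive is an `.npy` file whose header carries both. The sketch below reads only those headers using `zipfile` and `numpy.lib.format`. It is an illustration of the idea, not necessarily how `NpzLoader` does it, and `peek_npz` is a hypothetical helper name.

```python
import tempfile
import zipfile
from pathlib import Path

import numpy as np
from numpy.lib import format as npy_format


def peek_npz(path):
    """Return {array_name: (shape, dtype)} from the member headers only."""
    info = {}
    with zipfile.ZipFile(path) as archive:
        for member in archive.namelist():
            if not member.endswith(".npy"):
                continue
            with archive.open(member) as fp:
                version = npy_format.read_magic(fp)
                if version == (1, 0):
                    shape, _fortran, dtype = npy_format.read_array_header_1_0(fp)
                elif version == (2, 0):
                    shape, _fortran, dtype = npy_format.read_array_header_2_0(fp)
                else:
                    raise ValueError(f"unsupported .npy version {version}")
            # Strip the ".npy" suffix to recover the array name.
            info[member[: -len(".npy")]] = (shape, dtype)
    return info


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.npz"
    np.savez(path, a=np.zeros((3, 4)), b=np.arange(5, dtype=np.int32))
    summary = peek_npz(path)
```

The array data itself is never read here, which is the property that makes a lazy default cheap for archives with many or large arrays.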

### CSV / TSV

```python
from pathlib import Path

from vcti.fileloader.core import DataNode
from vcti.fileloader.numpy import CsvLoader
from vcti.tree import DictTree

loader = CsvLoader()
tree: DictTree[DataNode] = DictTree(DataNode())

with loader.open(Path("data.csv"), delimiter=",") as handle:
    root = loader.populate(handle, tree, tree.root_handle)
```

With a header row producing a structured array:

```python
with loader.open(Path("data.csv"), delimiter=",", names=True) as handle:
    root = loader.populate(handle, tree, tree.root_handle)
# Root's file_attributes["columns"] now lists the column names.
```
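
For orientation, this is what `names=True` does in `numpy.genfromtxt` itself: the header row becomes the field names of a structured array. That the loader derives `file_attributes["columns"]` from `dtype.names` is an assumption here, not something this README states; the sample data is invented.

```python
import io

import numpy as np

text = "x,y,label\n1.0,2.0,3\n4.0,5.0,6\n"

# names=True consumes the first row as field names.
arr = np.genfromtxt(io.StringIO(text), delimiter=",", names=True)

columns = list(arr.dtype.names)   # field names taken from the header row
n_rows = arr.shape[0]             # one record per data row
y_values = arr["y"].tolist()      # fields are addressed by name
```

Without `names=True` the same input parses to a plain 2-D float array, and the header row has to be skipped explicitly (for example with `skip_header=1`).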

---

## Subtree shapes

| Loader | Root payload | Children |
|---|---|---|
| `NpyLoader` | empty `DataNode` | 1 × `DataNode(data=<array>)` |
| `NpzLoader` (lazy) | empty `DataNode` | N × `LazyDataNode(name=key, shape, dtype)` |
| `NpzLoader` (eager) | empty `DataNode` | N × `DataNode(name=key, data=<array>)` |
| `CsvLoader` (homogeneous) | empty `DataNode` | 1 × `DataNode(data=<array>)` |
| `CsvLoader` (structured, `names=True`) | `DataNode(file_attributes={"columns": [...]})` | 1 × `DataNode(data=<array>)` |

---

## Handle lifetime contract (NPZ lazy nodes)

`NpzLoader.populate(..., lazy=True)` attaches `LazyDataNode`s whose
closures hold the open `NpzFile` handle. Once
`loader.unload(handle)` runs, those closures cannot fulfil further
`.load()` calls. Three patterns avoid the problem:

1. **Keep the handle open** for the lifetime of the tree.
2. **Materialise then unload:** call
   `materialise_subtree(tree, root)` before `unload`. Every lazy
   node loads, and the tree is fully usable without the handle.
3. **Use eager mode:** `populate(..., lazy=False)`.

`materialise_subtree(tree, root_handle)` is exported from
`vcti.fileloader.core`.
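
The same lifetime rule is visible in NumPy's own `NpzFile`, which makes pattern 2 easy to picture. The sketch below uses only `numpy.load` and a plain dict as a stand-in for the tree; it does not use the loader API.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "data.npz"
    np.savez(path, a=np.arange(3), b=np.ones((2, 2)))

    npz = np.load(path)
    names = list(npz.files)

    # "Materialise": read every array while the archive is still open.
    materialised = {name: npz[name] for name in names}

    # "Unload": close the archive.
    npz.close()

    # A lazy read after close has nothing left to read from.
    try:
        npz["a"]
        read_after_close = True
    except Exception:
        read_after_close = False
```

The arrays in `materialised` are ordinary in-memory copies, so they stay valid after the archive is closed.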

---

## Error handling

```python
from pathlib import Path

from vcti.fileloader.core import (
    LoadError,
    UnsupportedFormatError,
    TreeAttachmentError,
)

# `loader` and `tree` are set up as in the Quick Start examples.
try:
    with loader.open(Path("data.npy")) as handle:
        root = loader.populate(handle, tree, tree.root_handle)
except FileNotFoundError:
    ...
except UnsupportedFormatError:
    # File extension is not recognised by this loader
    ...
except LoadError:
    # File could not be parsed
    ...
except TreeAttachmentError:
    # parent is missing, deleted, or structure-locked in `tree`
    ...
```

If `populate` fails partway (a parse error during NPZ traversal, or
an exception in `before_lock`), the partial subtree is removed
before the exception propagates — callers never see a half-built
subtree.

---

## What this package does NOT do

- **Pandas-flavoured CSV.** This loader uses `numpy.genfromtxt`,
  which is fast and dependency-free but lacks the schema inference
  and rich missing-value handling of `pandas.read_csv`. A separate
  `vcti-fileloader-pandas` is the right home for that.
- **Schema validation.** The loaders accept whatever NumPy parses.
  Validation belongs in a `before_lock` hook or downstream pass.
- **Streaming reads.** `numpy.load` and `numpy.genfromtxt` read the
  whole file (or the whole array, for NPZ keys). Data that does not
  fit in memory needs a custom loader, or `mmap_mode` for NPY files.
- **Attribute synthesis.** No `file_path`, no derived storage
  metadata. Stamp those via the `before_lock` hook or a downstream
  enricher (e.g. `vcti-attribute-enricher`).
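
As an example of the schema-validation point above, a downstream pass can be a small function over the loaded array. This is a sketch using only NumPy; `check_array` and its parameters are hypothetical, and wiring it into a `before_lock` hook is left out because the hook's signature is not described here.

```python
import numpy as np


def check_array(arr, *, dtype, ndim):
    """Return a list of human-readable problems; empty means the array passed."""
    problems = []
    if arr.dtype != np.dtype(dtype):
        problems.append(f"dtype is {arr.dtype}, expected {np.dtype(dtype)}")
    if arr.ndim != ndim:
        problems.append(f"ndim is {arr.ndim}, expected {ndim}")
    # Non-finite values are only meaningful for floating-point data.
    if np.issubdtype(arr.dtype, np.floating) and not np.isfinite(arr).all():
        problems.append("array contains NaN or infinite values")
    return problems


good = check_array(np.zeros((2, 3)), dtype="float64", ndim=2)
bad = check_array(np.array([1.0, np.nan]), dtype="float64", ndim=2)
```

Returning a list instead of raising on the first mismatch lets the caller decide whether to reject the data or just log the findings.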

---

## Dependencies

- [numpy](https://numpy.org/) (>=1.26)
- [vcti-fileloader](https://pypi.org/project/vcti-fileloader/) (>=5.1.0) — `Loader` protocol, `SubtreeBuilder`, `DataNode`, `LazyDataNode`, `materialise_subtree` (import from `vcti.fileloader.core`)
- [vcti-tree](https://pypi.org/project/vcti-tree/) (>=1.0.0) — `LockableTree` protocol
