Metadata-Version: 2.4
Name: vcti-nputils
Version: 1.5.0
Summary: NumPy structured array utilities — joining, flattening, field views, enum mapping, position arrays, and a dynamic (std::vector-like) array
Author: Visual Collaboration Technologies Inc.
Requires-Python: <3.15,>=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=2.0
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Provides-Extra: lint
Requires-Dist: ruff; extra == "lint"
Provides-Extra: typecheck
Requires-Dist: mypy; extra == "typecheck"
Requires-Dist: numpy; extra == "typecheck"
Provides-Extra: bench
Requires-Dist: tabulate; extra == "bench"
Dynamic: license-file

# NumPy Utils

NumPy structured array utilities — dtype construction, field views, joining, enum mapping, position arrays, byte/string conversion for C++ interop, and growable (`std::vector`-like) arrays.

## Overview

`vcti-nputils` collects the low-level NumPy helpers shared across the vcti
stack. Most of it is stateless functions over structured arrays and
dtypes — building and reshaping dtypes, taking zero-copy field views,
joining arrays, mapping enum values, and converting byte fields for
pybind11/C++ interop. Alongside those it provides two small stateful
containers for building a numpy array of unknown final size:
`GrowableArray` (append-only — the common case) and `DynamicArray`
(the same growth model plus soft deletion). Both grow on demand and hand
off a plain contiguous array via `to_numpy()` for the read-heavy phase.

**Which to use:** append-only? `GrowableArray`. Need to remove elements
mid-build? `DynamicArray`.
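
The growth model described above can be sketched in plain NumPy. This is a minimal illustration, not the library's implementation: `GrowableSketch` is a hypothetical name, and the doubling policy and default capacity are assumptions chosen for the example.

```python
import numpy as np


class GrowableSketch:
    """Hypothetical illustration only; not the vcti-nputils implementation."""

    def __init__(self, dtype, initial_capacity=4, growth_factor=2.0):
        self._buf = np.empty(initial_capacity, dtype=dtype)
        self._size = 0
        self._growth = growth_factor

    def append(self, item):
        if self._size == len(self._buf):
            # Buffer is full: allocate a larger one and copy the live prefix.
            new_cap = max(self._size + 1, int(len(self._buf) * self._growth))
            new_buf = np.empty(new_cap, dtype=self._buf.dtype)
            new_buf[: self._size] = self._buf[: self._size]
            self._buf = new_buf
        self._buf[self._size] = item
        self._size += 1

    def to_numpy(self):
        # Copy so the caller owns an exact-size, contiguous array.
        return self._buf[: self._size].copy()


acc = GrowableSketch(np.dtype([("id", "i4"), ("value", "f8")]))
for i in range(10):
    acc.append((i, i * 0.5))
result = acc.to_numpy()
```

Because capacity grows geometrically, the ten appends above trigger only two reallocations (4 to 8 to 16 slots), which is what makes `append` amortised O(1).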

## Installation

```bash
pip install "vcti-nputils>=1.5.0"
```

### In `pyproject.toml` dependencies

```toml
dependencies = [
    "vcti-nputils>=1.5.0",
]
```

---

## Quick Start

```python
import numpy as np
from vcti.nputils import (
    as_ndarray,
    check_overflow,
    decode_field,
    drop_fields,
    DynamicArray,
    encode_field,
    fields_view,
    flatten_dtype,
    GrowableArray,
    join_struct_arrays,
    merge_adjacent_fields,
    name_array,
    position_array,
    rename_fields,
    structured_dtype,
    with_encoding,
)

# Join structured arrays horizontally
dt1 = np.dtype([('id', 'i4'), ('value', 'f8')])
dt2 = np.dtype([('name', 'U10')])
arr1 = np.array([(1, 1.5), (2, 2.5)], dtype=dt1)
arr2 = np.array([('Alice',), ('Bob',)], dtype=dt2)
joined = join_struct_arrays([arr1, arr2])
# dtype: [('id', 'i4'), ('value', 'f8'), ('name', 'U10')]

# Create a zero-copy view with selected fields
view = fields_view(joined, ['id', 'name'])

# Drop fields from a structured array (zero-copy)
clean = drop_fields(joined, ['value'])

# Build a structured dtype from a scalar dtype + names
coord_dt = structured_dtype('f8', ['x', 'y', 'z'])
# dtype([('x', '<f8'), ('y', '<f8'), ('z', '<f8')])

# Rename fields in a dtype
new_dt = rename_fields(dt1, {'id': 'node_id', 'value': 'temperature'})

# Flatten array fields into individual columns (default naming)
dt = np.dtype([('id', 'i4'), ('coords', 'f8', (3,))])
_, cols = flatten_dtype(dt)
# cols: ['id', 'coord_0', 'coord_1', 'coord_2']

# Flatten with explicit per-field names
_, cols = flatten_dtype(dt, field_names={'coords': ['x', 'y', 'z']})
# cols: ['id', 'x', 'y', 'z']

# Flatten with a custom format string
_, cols = flatten_dtype(dt, fmt="{name}[{dim}]")
# cols: ['id', 'coord[0]', 'coord[1]', 'coord[2]']

# Merge adjacent 'S' fields into one (pure dtype view). Multiple merges
# can be specified at once; same-field overlap and name collisions are
# validated before anything is returned.
dt = np.dtype([
    ('first', 'S4'), ('last', 'S6'),
    ('city', 'S8'), ('state', 'S2'),
    ('age', 'i4'),
])
merged = merge_adjacent_fields(dt, {
    'name':    ['first', 'last'],
    'address': ['city', 'state'],
})
# dtype([('name', 'S10'), ('address', 'S10'), ('age', '<i4')])

# Map numeric enum values to names
enum_dict = {1: 'ACTIVE', 2: 'INACTIVE', 3: 'PENDING'}
names = name_array(np.array([1, 2, 1, 3]), enum_dict)

# Convert counts to cumulative offsets
offsets = position_array(np.array([3, 2, 4, 1]))
# array([0, 3, 5, 9, 10])

# Safely coerce inputs to ndarray
arr = as_ndarray([1, 2, 3], dtype=np.float64)
empty = as_ndarray(None)  # array([], dtype=float64)

# Byte <-> string conversion for C++/pybind11 interop
dt = np.dtype([('name', 'S10'), ('name_length', 'i4')])
sa = np.zeros(2, dtype=dt)
encode_field(sa, 'name', ['Alice', 'Bob'], length_field='name_length')
decoded = decode_field(sa, 'name')
overflow = check_overflow(sa, 'name', 'name_length')

# Attach encoding to a dtype so decode_field/encode_field use it automatically
name_dt = with_encoding(np.dtype('S32'), 'latin-1')

# Build an array incrementally without knowing the final size (append-only)
ga = GrowableArray(np.dtype([('id', 'i4'), ('value', 'f8')]))
ga.append((1, 1.5))
ga.extend([(2, 2.5), (3, 3.5)])
result = ga.to_numpy()  # independent, exact-size array you own

# Need to remove elements mid-build? DynamicArray adds soft deletion
da = DynamicArray(np.dtype('i8'))
da.extend([10, 20, 30, 40])
da.delete(1)            # elements shift: da[1] is now 30
clean = da.to_numpy()   # array([10, 30, 40])
```

> These are typed, numpy-backed accumulators, not faster `list`s. For a
> method-by-method comparison, measured trade-offs, and guidance on when to
> use them versus a Python `list`, see
> [docs/design/comparison.md](docs/design/comparison.md); for usage recipes,
> [docs/patterns.md](docs/patterns.md). Reproduce the numbers with
> `python benchmarks/benchmark_growable_array.py`.

---

## Module layout

Each category lives in its own module. All public names (functions, classes, and constants) are re-exported from `vcti.nputils`.

| Module | Functions |
|--------|-----------|
| `dtype_utils` | `structured_dtype`, `flatten_dtype` (+ `flatten_record_dtype` alias), `merge_adjacent_fields`, `rename_fields` |
| `view_utils` | `fields_view`, `drop_fields` |
| `join_utils` | `join_struct_arrays` |
| `mapping_utils` | `name_array` |
| `offset_utils` | `position_array` |
| `coerce_utils` | `as_ndarray` |
| `growable_array` | `GrowableArray` |
| `dynamic_array` | `DynamicArray` |
| `byte_utils` | `string_from_bytes`, `bytes_from_string`, `decode_column`, `encode_column`, `decode_field`, `encode_field`, `check_overflow`, `get_encoding`, `with_encoding`, `ZERO_CHAR` |

---

## Functions

### Dtype construction & transformation
| Function | Purpose |
|----------|---------|
| `structured_dtype(dtype, names)` | Build a structured dtype from a scalar or subdtype plus field names |
| `flatten_dtype(dt, *, field_names, fmt, strip_plural)` | Expand array fields into scalars with flexible naming |
| `flatten_record_dtype(dt, ...)` | Legacy alias for `flatten_dtype` |
| `merge_adjacent_fields(dt, merges)` | Merge one or more groups of adjacent 'S' fields into a single field each (pure dtype view) |
| `rename_fields(dt, mapping)` | Return a new dtype with fields renamed |
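
To make "pure dtype view" concrete, here is a sketch using only plain NumPy (`ndarray.view`), not the package's functions. It assumes packed (unaligned) dtypes whose item sizes match; the array and field names are made up for illustration.

```python
import numpy as np

# Merge two adjacent byte fields by reinterpreting the same memory.
src_dt = np.dtype([("first", "S4"), ("last", "S6"), ("age", "i4")])
merged_dt = np.dtype([("name", "S10"), ("age", "i4")])
people = np.array([(b"Anna", b"Smith", 30), (b"Al", b"Jones", 41)], dtype=src_dt)

merged = people.view(merged_dt)  # same bytes, new field layout, no copy
full_name = merged["name"][0]    # the 4 + 6 bytes read as one field
padded_name = merged["name"][1]  # null padding of 'first' stays inside the value

# Expand a subarray field into scalar columns the same way.
pts_dt = np.dtype([("id", "i4"), ("coords", "f8", (3,))])
flat_dt = np.dtype([("id", "i4"), ("x", "f8"), ("y", "f8"), ("z", "f8")])
pts = np.zeros(2, dtype=pts_dt)
pts["coords"][1] = (1.0, 2.0, 3.0)
flat = pts.view(flat_dt)
```

Note the second row: when the first field is shorter than its width, its null padding ends up in the middle of the merged value, so a merge of this kind is a reinterpretation of bytes rather than a string concatenation.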

### Zero-copy views
| Function | Purpose |
|----------|---------|
| `fields_view(sa, fields)` | View containing only the selected fields |
| `drop_fields(sa, exclude)` | View containing all fields except those excluded |
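
For readers unfamiliar with zero-copy field views, the sketch below shows the underlying NumPy behaviour with plain multi-field indexing. It does not call `fields_view` or `drop_fields`, whose internals may differ; `table` and `subset` are illustrative names.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

dt = np.dtype([("id", "i4"), ("value", "f8"), ("name", "U10")])
table = np.array([(1, 1.5, "Alice"), (2, 2.5, "Bob")], dtype=dt)

# Multi-field indexing returns a view: same buffer, fewer visible fields.
subset = table[["id", "name"]]
subset["id"][0] = 99  # writes through to `table`

# The view keeps the original item size (gaps where 'value' lives).
# repack_fields produces a tightly packed array when that is needed.
packed = rfn.repack_fields(subset)
```

The trade-off is that a view costs nothing to create but carries the full row stride; repacking is worth it mainly before serialising or handing data to code that expects a packed layout.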

### Joining
| Function | Purpose |
|----------|---------|
| `join_struct_arrays(arrays)` | Join structured arrays horizontally by combining fields |
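
A horizontal join can be sketched in plain NumPy as "build the combined dtype, allocate, copy field by field". The helper below is a hypothetical illustration of that idea, not the code behind `join_struct_arrays`; its error handling in particular is an assumption.

```python
import numpy as np


def join_fields_sketch(arrays):
    """Hypothetical helper: combine the fields of equal-length structured arrays."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same length")
    # np.dtype raises ValueError itself if a field name appears twice.
    descr = [(name, a.dtype[name]) for a in arrays for name in a.dtype.names]
    out = np.empty(n, dtype=np.dtype(descr))
    for a in arrays:
        for name in a.dtype.names:
            out[name] = a[name]
    return out


left = np.array([(1, 1.5), (2, 2.5)], dtype=[("id", "i4"), ("value", "f8")])
right = np.array([("Alice",), ("Bob",)], dtype=[("name", "U10")])
joined_sketch = join_fields_sketch([left, right])
```

Unlike a field view, a join of separately allocated arrays necessarily copies, because the rows of the result must be contiguous in one buffer.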

### Mapping, offsets, coercion
| Function | Purpose |
|----------|---------|
| `name_array(nparray, enum_dict, default)` | Map numeric values to string names |
| `position_array(counts, dtype)` | Convert a count array to a cumulative offset array (one element longer than `counts`, starting at 0) |
| `as_ndarray(value, dtype)` | Coerce `None`, a list, or an ndarray to an ndarray (`None` becomes an empty array) |
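
The count-to-offset conversion is a cumulative sum with a leading zero. The sketch below reproduces it in plain NumPy to show how the offsets are typically consumed; `offsets_from_counts` is a hypothetical name, not part of the package.

```python
import numpy as np


def offsets_from_counts(counts):
    """Hypothetical equivalent: cumulative sum with a leading zero."""
    counts = np.asarray(counts, dtype=np.int64)
    offsets = np.zeros(counts.size + 1, dtype=np.int64)
    np.cumsum(counts, out=offsets[1:])
    return offsets


counts = np.array([3, 2, 4, 1])
offsets = offsets_from_counts(counts)

# Segment i of a flat array is flat[offsets[i]:offsets[i + 1]].
flat = np.arange(counts.sum())
segments = [flat[offsets[i]:offsets[i + 1]] for i in range(counts.size)]
```

The extra trailing element means every segment, including the last, can be sliced with the same `offsets[i]:offsets[i + 1]` expression.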

### Containers
| Type | Purpose |
|------|---------|
| `GrowableArray(dtype, initial_capacity, growth_factor)` | Append-only growable numpy array: amortised O(1) `append`/`extend` (`append_get_index` returns the index), `reserve`/`shrink_to_fit`/`clear`, `full`/`resize` for sized fills, zero-copy `as_array()` view, and `to_numpy()` for an independent copy |
| `DynamicArray(dtype, initial_capacity, growth_factor)` | Same growth model **plus soft deletion**: lazy `delete` (shifting semantics), `compact`, `active_indices` |
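
One way to picture soft deletion with shifting semantics is a boolean "alive" mask over the backing buffer. The sketch below is an assumption about the general technique, written in plain NumPy; it is not a description of how `DynamicArray` stores its state.

```python
import numpy as np

data = np.array([10, 20, 30, 40], dtype=np.int64)
alive = np.ones(data.size, dtype=bool)


def physical_index(alive, logical):
    """Map a logical index (deleted slots skipped) to a buffer slot."""
    return np.flatnonzero(alive)[logical]


# Lazy delete of logical index 1: flip a flag, move no data.
alive[physical_index(alive, 1)] = False

active = np.flatnonzero(alive)                # surviving buffer slots
second = data[physical_index(alive, 1)]       # logical index 1 after the delete
compacted = data[alive]                       # one pass to drop the holes
```

Deletion stays cheap because nothing moves until compaction, at the cost of an indirection when reading by logical index.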

### Byte / string conversion (pybind11 interop)
| Function | Purpose |
|----------|---------|
| `string_from_bytes(value, encoding)` | Decode a single bytes value, stripping null padding |
| `bytes_from_string(value, length, encoding)` | Encode to fixed-length bytes (pad or truncate) |
| `decode_column(byte_array, encoding)` | Vectorized decode of a byte column to strings |
| `encode_column(strings, length, encoding)` | Vectorized encode to `(bytes, lengths)` |
| `decode_field(sa, field_name, *, encoding)` | Decode a byte field in a structured array |
| `encode_field(sa, field_name, strings, *, length_field, encoding)` | Encode strings into a byte field, optionally populating a paired length field |
| `check_overflow(sa, field_name, length_field)` | Detect rows where the original encoded byte length exceeded the field |
| `get_encoding(dtype, default)` | Read encoding from `dtype.metadata['encoding']` |
| `with_encoding(dtype, encoding)` | Attach encoding to a scalar dtype via metadata |
| `ZERO_CHAR` | The null character (`"\x00"`) used to strip/pad byte fields |
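
The fixed-width byte layout these helpers target can be demonstrated with plain NumPy. The sketch below does not call the package; it shows null padding, silent truncation, why a separate length column allows overflow detection, and how an encoding can ride along in dtype metadata. The sample names and the 8-byte width are arbitrary.

```python
import numpy as np

names = ["Bob", "Alexandria"]
encoded = [s.encode("utf-8") for s in names]

# An 'S8' field null-pads short values and silently truncates long ones.
field = np.array(encoded, dtype="S8")
raw = field.tobytes()

# Recording the original encoded length makes truncation detectable later.
lengths = np.array([len(b) for b in encoded], dtype=np.int32)
overflow = lengths > field.dtype.itemsize

# Element access strips trailing nulls; decoding is vectorised.
decoded = np.char.decode(field, "utf-8")

# An encoding can be carried in dtype metadata.
tagged = np.dtype("S8", metadata={"encoding": "latin-1"})
```

Truncating at a byte boundary can split a multi-byte UTF-8 character, so code in this style should either pick a single-byte encoding or check for overflow before decoding.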

---

## Dependencies

- [numpy](https://numpy.org/) (>=2.0)
