Metadata-Version: 2.4
Name: pypff
Version: 1.0.0
Summary: High-performance PanoSETI File Format (PFF) I/O library
Project-URL: Homepage, https://github.com/panoseti/pypff
Project-URL: Repository, https://github.com/panoseti/pypff
Project-URL: Issues, https://github.com/panoseti/pypff/issues
Author-email: Wei Liu <liuwei_berkeley@berkeley.edu>, Nicolas Rault-Wang <nraultwang@berkeley.edu>
License-File: LICENSE
Keywords: astronomy,dask,high-performance,io,panoseti,pff,xarray,zarr
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Astronomy
Requires-Python: >=3.11
Requires-Dist: astropy
Requires-Dist: dask[distributed]>=2024.1
Requires-Dist: matplotlib
Requires-Dist: numcodecs>=0.12
Requires-Dist: numpy>=1.24.0
Requires-Dist: orjson>=3.9.0
Requires-Dist: plotly
Requires-Dist: pydantic>=2.5.0
Requires-Dist: rich>=13.0.0
Requires-Dist: scipy
Requires-Dist: tqdm
Requires-Dist: typer>=0.9.0
Requires-Dist: xarray>=2024.1
Requires-Dist: zarr>=3.0
Provides-Extra: dev
Requires-Dist: dask; extra == 'dev'
Requires-Dist: hatchling; extra == 'dev'
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pytest-asyncio; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: zarr
Requires-Dist: dask[distributed]>=2024.1; extra == 'zarr'
Requires-Dist: numcodecs>=0.12; extra == 'zarr'
Requires-Dist: xarray>=2024.1; extra == 'zarr'
Requires-Dist: zarr>=3.0; extra == 'zarr'
Description-Content-Type: text/markdown

# pypff - High-performance PanoSETI I/O Library

[![pypff-CI](https://github.com/panoseti/pypff/actions/workflows/egg.yml/badge.svg)](https://github.com/panoseti/pypff/actions)
[![Version](https://img.shields.io/badge/version-1.0.0-blue)](https://github.com/panoseti/pypff)
[![Python](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![Coverage](https://img.shields.io/badge/coverage-65%25-green)](https://github.com/panoseti/pypff)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

A high-performance Python package for reading and analyzing data files in the PanoSETI File Format (PFF).

## Features
- **Streaming by default:** `iter_batches(size=256)` and `__iter__` yield zero-copy views without materializing the full sequence into RAM — safe for Jupyter and HPC pipelines alike.
- **Distributed chunked reads:** `iter_byte_range(file_idx, byte_start, byte_end)` lets Dask/Nextflow workers parse frame-aligned byte ranges in parallel without coordination.
- **Zero-copy random access:** `seq[i]` returns a strided mmap view; slicing and `read_images(indices)` use a sort + inverse-permutation for disk locality.
- **Single-pass metadata:** `get_metadata_arrays(keys)` extracts any number of header fields in one `np.frombuffer` pass per file via a composite NumPy structured dtype. Supports virtual `unix_t_ns` key.
- **Nanosecond-precise timestamps:** `timestamps()` returns `int64` ns (no float precision loss); `timestamps(as_datetime=True)` returns a zero-copy `datetime64[ns]` view for matplotlib and pandas.
- **PFF → Zarr v3 conversion:** `pypff[zarr]` optional extra converts any `.pffd` run to Zarr v3 stores readable by xarray, dask, numpy, TensorStore, and Julia — lossless, compressed, HPC/ML-ready. See [Zarr v3 spec](docs/zarr_v3_spec.md).
- **Bounded resources:** LRU mmap handle cache (default 16 files); `PFFSequence` is a context manager.
- **Multiprocessing-safe:** pickle-compatible — file handles are dropped on serialisation and lazily reopened in workers.
- **Pydantic validation:** strict schema validation for all PFF headers and PanoSETI config files.
- **Run discovery:** `PanosetiRun` lazily scans a `.pffd` directory and exposes typed configs, housekeeping, and data products.

## Installation
The package uses `uv` for dependency management. Run the following from the directory that contains your clone of the repository:

```bash
cd pypff
uv sync                   # core library
uv sync --extra zarr      # + Zarr v3 conversion (zarr-python, xarray, dask)
```

## Quick Start

### Run discovery
```python
from pypff import PanosetiRun

run = PanosetiRun("path/to/run_directory.pffd")
run.show()                          # pretty-print structure

print(run.list_products())          # ['dp_img16.bpp_2.module_1', ...]
seq = run.get_product("dp_img16.bpp_2.module_1")

obs_cfg  = run.get_config("obs_config")   # returns Pydantic model
data_cfg = run.get_config("data_config")
hk       = run.get_hk()                   # dict[device, dict[field, np.ndarray]]
```

### Streaming (memory-bounded)
```python
# Iterate frame-by-frame (zero-copy views into mmap)
for img in seq:
    process(img)

# Batch iteration — never holds more than one batch in RAM
for batch in seq.iter_batches(size=256):
    batch  # shape (256, H, W), dtype matches file

# With timestamps in the same pass
for batch, ts in seq.iter_batches(size=256, with_timestamps=True):
    ts  # int64 nanoseconds, shape (256,)

# Distributed byte-range reads (Dask / Nextflow workers)
# `size` is a placeholder for the byte length of file 0; define it before use
for batch in seq.iter_byte_range(file_idx=0, byte_start=0, byte_end=size):
    ...
```
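
One way a scheduler might produce frame-aligned byte ranges to hand to workers is sketched below. This is an illustration only: `frame_aligned_ranges` is a hypothetical helper, not part of pypff, and it assumes every frame in a file occupies the same number of bytes (`frame_bytes`) and that ranges are half-open `[byte_start, byte_end)`. Check both assumptions against the library before relying on them.

```python
def frame_aligned_ranges(file_size, frame_bytes, n_chunks):
    """Split a file into at most n_chunks half-open byte ranges,
    each starting and ending on a frame boundary.

    Hypothetical helper: assumes a constant frame size in bytes.
    """
    n_frames = file_size // frame_bytes          # whole frames only
    base, extra = divmod(n_frames, n_chunks)     # spread the remainder
    ranges, start = [], 0
    for k in range(n_chunks):
        count = base + (1 if k < extra else 0)   # frames in this chunk
        if count == 0:
            continue                             # more chunks than frames
        end = start + count * frame_bytes
        ranges.append((start, end))
        start = end
    return ranges


# Ten 100-byte frames split across three workers
ranges = frame_aligned_ranges(file_size=1000, frame_bytes=100, n_chunks=3)
```

Each `(byte_start, byte_end)` pair could then be passed to one worker's `iter_byte_range` call, so that no two workers parse the same frame.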

### Random access and slicing
```python
import numpy as np

img   = seq[42]            # zero-copy view, shape (H, W)
imgs  = seq[0:100:2]       # every other frame, shape (50, H, W)
imgs  = seq.read_images(np.array([5, 0, 3]))   # unsorted — sorted internally for locality
imgs  = seq.read_images_range(start=0, count=500)
```
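
The sort plus inverse-permutation idea behind unsorted index reads can be sketched with plain NumPy. This is a generic illustration of the technique, not pypff's actual implementation: `read_sorted` is a hypothetical function, and an in-memory array stands in for the memory-mapped file.

```python
import numpy as np


def read_sorted(frames, indices):
    """Gather frames in ascending index order, then restore the caller's order."""
    indices = np.asarray(indices)
    order = np.argsort(indices, kind="stable")   # positions that sort the request
    gathered = frames[indices[order]]            # monotonic, disk-friendly access
    inverse = np.empty_like(order)
    inverse[order] = np.arange(order.size)       # inverse permutation of `order`
    return gathered[inverse]                     # back to the requested order


# Stand-in for a sequence of six 2x2 frames
frames = np.arange(6 * 2 * 2).reshape(6, 2, 2)
out = read_sorted(frames, [5, 0, 3])
```

The caller still receives frames in the order requested (5, 0, 3), while the underlying reads happen in ascending order (0, 3, 5).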

### Timestamps — always `int64` nanoseconds
```python
# Full cached array — int64 ns, no float precision loss
ts = seq.timestamps()                      # np.ndarray[int64]

# Zero-copy datetime64[ns] view for matplotlib / pandas
ts_dt = seq.timestamps(as_datetime=True)   # np.ndarray[datetime64[ns]]

# Indexed subset
ts_sub = seq.timestamps(indices=np.arange(0, len(seq), 100))

# Single frame
t_ns = seq.timestamp_at(42)               # int, nanoseconds

# Time-based navigation
idx = seq.seek_time(t_ns + 1_000_000_000)  # 1 s later

# Arithmetic rule: subtract epoch in integer space before dividing
rel_s = (ts - ts[0]) / 1e9               # CORRECT — small values, no precision loss
# ts / 1e9 - t0_s                        # WRONG  — loses ns precision at ~1.7e9 s
```
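
The arithmetic rule can be checked with NumPy alone. The sketch below uses made-up timestamps (three frames 1 ns apart near Unix time 1.7e9 s), not real PanoSETI data, and shows that converting to float seconds before subtracting erases nanosecond differences, while subtracting in integer space keeps them. It also shows that a `datetime64[ns]` view of an `int64` array shares memory with it.

```python
import numpy as np

# Made-up timestamps: three frames 1 ns apart (illustrative values only)
t0 = 1_700_000_000_000_000_000
ts = np.array([t0, t0 + 1, t0 + 2], dtype=np.int64)

# Subtract in integer space first, then convert to seconds
rel_ok = (ts - ts[0]) / 1e9

# Convert to float seconds first, then subtract: the 1 ns steps vanish
rel_bad = ts / 1e9 - ts[0] / 1e9

# Gap between adjacent float64 values at ~1.7e9 s, expressed in nanoseconds
float_step_ns = np.spacing(ts[0] / 1e9) * 1e9

# Reinterpreting int64 ns as datetime64[ns] does not copy the buffer
ts_dt = ts.view("datetime64[ns]")
shares = np.shares_memory(ts, ts_dt)
```

Here `float_step_ns` comes out at a few hundred nanoseconds, which is why absolute float seconds cannot resolve individual nanosecond ticks at current epoch values.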

### Metadata extraction (single pass)
```python
# One np.frombuffer call per file regardless of number of keys
meta = seq.get_metadata_arrays(["pkt_num", "tv_sec", "unix_t_ns"])
meta["unix_t_ns"]   # int64 ns, same as timestamps()
meta["pkt_num"]     # int64

# All fields at once
all_meta = seq.get_all_metadata()
```
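
To illustrate how a structured dtype lets one `np.frombuffer` call expose several fields at once, here is a minimal sketch. The record layout is hypothetical and is not the real PFF on-disk format; the field names are borrowed from the examples above purely for readability, and `tv_nsec` is an assumed name.

```python
import numpy as np

# Hypothetical fixed-size record (NOT the real PFF layout)
record = np.dtype([
    ("pkt_num", "<u4"),
    ("tv_sec", "<i8"),
    ("tv_nsec", "<u4"),       # assumed field name, for illustration
    ("image", "<u2", (4,)),   # tiny stand-in for pixel data
])

# Build three records and serialise them, standing in for file contents
recs = np.zeros(3, dtype=record)
recs["pkt_num"] = [10, 11, 12]
recs["tv_sec"] = 1_700_000_000
recs["tv_nsec"] = [0, 500, 1000]
buf = recs.tobytes()

# One pass over the buffer; each field is a strided view, not a copy
parsed = np.frombuffer(buf, dtype=record)
pkt_num = parsed["pkt_num"]

# A derived ("virtual") key computed from two stored fields, in int64
unix_t_ns = parsed["tv_sec"] * 1_000_000_000 + parsed["tv_nsec"].astype(np.int64)
```

Adding more keys only adds more field lookups on `parsed`; the buffer itself is interpreted once.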

### Headers and Pydantic validation
```python
# Fast path — raw dict, no Pydantic overhead
header, img = seq.get_frame(0)
header["pkt_num"]                 # quabo file
header["quabo_0"]["pkt_tai"]      # module file

# Validated path — full Pydantic model
from pypff.models import ModuleHeader
hdr, img = seq.get_frame_validated(0)
assert isinstance(hdr, ModuleHeader)
```

### Context manager and multiprocessing
```python
# Deterministic handle cleanup
with run.get_product("dp_ph256.bpp_2.module_254") as seq:
    data = seq.read_images_range(0, 1000)

# PFFSequence is pickle-safe: handles are dropped on serialisation
import concurrent.futures

def frame_sum(s, i):
    # Module-level function: a lambda cannot be pickled for worker processes
    return s[i].sum()

with concurrent.futures.ProcessPoolExecutor() as ex:
    futures = [ex.submit(frame_sum, seq, i) for i in range(len(seq))]
```
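
The general pattern of dropping handles on serialisation and reopening them lazily can be sketched as follows. `LazyReader` is a hypothetical stand-in written for this example; it is not `PFFSequence` and makes no claim about how pypff implements the behaviour.

```python
import os
import pickle
import tempfile


class LazyReader:
    """Hypothetical reader that opens its file on first use."""

    def __init__(self, path):
        self.path = path
        self._fh = None                      # opened lazily

    def _handle(self):
        if self._fh is None:
            self._fh = open(self.path, "rb")
        return self._fh

    def read_at(self, offset, n):
        fh = self._handle()
        fh.seek(offset)
        return fh.read(n)

    def close(self):
        if self._fh is not None:
            self._fh.close()
            self._fh = None

    def __getstate__(self):
        # Open file objects cannot be pickled, so ship only the path
        state = self.__dict__.copy()
        state["_fh"] = None
        return state


# Write a small file to read back
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")

reader = LazyReader(tmp.name)
first = reader.read_at(0, 4)                 # opens the handle in this process

clone = pickle.loads(pickle.dumps(reader))   # what a worker would receive
clone_had_handle = clone._fh is not None     # handle was not carried over
second = clone.read_at(4, 4)                 # reopened lazily on first read

reader.close()
clone.close()
os.remove(tmp.name)
```

Because only the path crosses the process boundary, each worker ends up with its own independent file handle.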

## PFF → Zarr v3 conversion (`pypff[zarr]`)

Convert a `.pffd` observation run to Zarr v3 stores readable by xarray, dask,
numpy, TensorStore, Julia, and Rust. See [docs/zarr_v3_spec.md](docs/zarr_v3_spec.md)
for the full layout specification.

### Write

```python
from pypff import PanosetiRun
from pypff.zarr import convert_run

run = PanosetiRun("path/to/obs.pffd")
stores = convert_run(run, "output/L0_zarr")
# → one .zarr per (data_product, module)
# → one .panoseti-meta/ sidecar bundle (configs, logs, hk.pff)
```

Or via the CLI:

```bash
uv run pypff zarr path/to/obs.pffd output/L0_zarr
```

### Read

```python
from pypff.zarr import PanosetiZarrRun
import xarray as xr

# High-level wrapper (mirrors PanosetiRun)
zrun = PanosetiZarrRun("output/L0_zarr")
store = zrun.get_product("dp_ph256.bpp_2.module_254")
ts    = store.timestamps()           # int64 ns array
ds    = store.to_dataset()           # xarray.Dataset
cfgs  = zrun.configs                 # parsed run configs

# Or open directly with xarray; all arrays are visible as variables
# (`stores` is the value returned by convert_run in the Write example)
ds = xr.open_zarr(str(stores[0]), consolidated=False)
# ds.images      (T, H, W)   int16 / uint16
# ds.unix_t_ns   (T,)        int64  — nanosecond timestamps
# ds.pkt_num     (T,)        uint32 ─┐ per-frame header fields
# ds.pkt_nsec    (T,)        uint32  │ (single-level: ph256)
# ds.quabo_num   (T,)        uint8  ─┘
# ds.quabo_0_pkt_num  (T,)   uint32 ─┐ module-level headers
# …                                  ─┘ (img16, ph1024)
```

### Zarr store layout

All arrays live at the **root** of each store (no sub-groups), so
`xr.open_zarr(store)` surfaces every variable — images, timestamps, and all
header fields — as a single Dataset aligned on the shared `time` dimension.
Logical grouping of headers is expressed via `header_fields` and `quabo_fields`
root attributes. Full specification: [docs/zarr_v3_spec.md](docs/zarr_v3_spec.md).

## Testing
Run the test suite via the built-in CLI:

```bash
uv run pypff test all
```

The test suite includes:
- **Tier 1 (Unit):** Basic logic and timing tests.
- **Tier 2 (Logic):** Higher-level I/O, slicing, and concurrency tests.
- **Legacy Integration:** The original `pypff` test suite using provided sample data.

## Dockerized CI
Build the CI environment:
```bash
docker build -t pypff-ci -f src/ci/Dockerfile.ci .
```
