Metadata-Version: 2.4
Name: dumpduck
Version: 0.2.3
Summary: Compact, lazy-readable HDF5 trajectories with incremental atomistic property storage.
License: MIT
Keywords: ase,hdf5,lammps,molecular-dynamics,nmr,trajectory
Requires-Python: >=3.10
Requires-Dist: ase>=3.22
Requires-Dist: h5py>=3.8
Requires-Dist: numpy>=1.22
Requires-Dist: pip>=26.1.1
Requires-Dist: tqdm>=4.67.3
Provides-Extra: compression
Requires-Dist: hdf5plugin>=4; extra == 'compression'
Provides-Extra: test
Requires-Dist: pytest>=7; extra == 'test'
Description-Content-Type: text/markdown

# dumpDUCK

`dumpDUCK` stores atomistic trajectories as compact, lazy-readable HDF5 files. It is designed for large MD trajectories where you want to read one frame at a time, and for incremental labelling workflows where new properties are added after the trajectory already exists.

## Installation

```bash
pip install -e .
```

Optional Zstandard/Blosc compression:

```bash
pip install -e '.[compression]'
```

## Convert a trajectory

LAMMPS dump:

```bash
dumpduck convert 4-azif_hda_512FUs_seed4-quenched_1500K_to_300K_rate0.1Kps-equilibrated300K_5ns.dump 4-azif_hda_512FUs_seed4-quenched_1500K_to_300K_rate0.1Kps-equilibrated300K_5ns.h5 \
  --format lammpstrj \
  --type-map '1:C,2:H,3:N,4:Zn' \
  --compression gzip \
  --compression-level 6 \
  --float-dtype float32 \
  --chunk-frames 16
```

With Blosc/Zstandard compression (requires the `compression` extra):

```bash
dumpduck convert azif_rmc_2010_nmr_300K_10fs.lammpstrj azif_rmc_2010_nmr_nvt_300K_10fs.h5 \
  --format lammpstrj \
  --type-map '1:C,2:H,3:N,4:Zn' \
  --compression blosc-zstd \
  --compression-level 9 \
  --float-dtype float32 \
  --chunk-frames 64
```

With many LAMMPS types per element, the type map can be built up in a shell variable:

```bash
TYPE_MAP='1:Zn,2:Zn'
TYPE_MAP="${TYPE_MAP},3:H,4:H,5:H,6:H,7:H,8:H,9:H,10:H,11:H,12:H,13:H,14:H"
TYPE_MAP="${TYPE_MAP},15:C,16:C,17:C,18:C,19:C,20:C,21:C,22:C,23:C,24:C,25:C,26:C"
TYPE_MAP="${TYPE_MAP},27:N,28:N,29:N,30:N,31:N,32:N,33:N,34:N"

dumpduck convert \
  zif4_2x2x2_300K_nvt_nmr_nvt_300K_1ns_10fs.lammpstrj \
  zif4_2x2x2_300K_nvt_nmr_nvt_300K_1ns_10fs.h5 \
  --format lammpstrj \
  --type-map "${TYPE_MAP}" \
  --compression blosc-zstd \
  --compression-level 7 \
  --float-dtype float32 \
  --chunk-frames 100 \
  --n-frames 100000
```

```bash
dumpduck info zif4_2x2x2_300K_nvt_nmr_nvt_300K_1ns_10fs.h5
```
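When many LAMMPS types map onto a few elements, the `--type-map` string can be generated rather than typed by hand. The following is a minimal Python sketch that assumes only the comma-separated `type:element` syntax shown above; `build_type_map` is a hypothetical helper and not part of `dumpduck`.

```python
def build_type_map(ranges):
    """Build a 'type:element' string from (first, last, element) ranges (inclusive)."""
    pairs = []
    for first, last, element in ranges:
        for lammps_type in range(first, last + 1):
            pairs.append(f"{lammps_type}:{element}")
    return ",".join(pairs)


# Same mapping as the shell example above: types 1-2 Zn, 3-14 H, 15-26 C, 27-34 N.
type_map = build_type_map([(1, 2, "Zn"), (3, 14, "H"), (15, 26, "C"), (27, 34, "N")])
print(type_map)
```

The printed string can be passed to `--type-map` unchanged.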

ASE-readable trajectory:

```bash
dumpduck convert trajectory.xyz trajectory.h5 --chunk-frames 16
```

## Inspect a file

```bash
dumpduck info trajectory.h5
```

Example output:

```text
file: trajectory.h5
format: dumpduck-hdf5
version: 0.2.0
frames: 100001
atoms: 4352

core datasets:
  positions        shape=(100001, 4352, 3) dtype=float32 chunks=(16, 4352, 3) compression=gzip

properties:
  atomic/shielding_tensors
    shape: (100001, 4352, 3, 3)
    dtype: float32
    valid frames: 2183 / 100001
    units: ppm
```

## Lazy reading

```python
from dump_duck import H5Trajectory

with H5Trajectory('trajectory.h5') as traj:
    atoms = traj[0]

    for atoms in traj.iter_frames(start=0, stop=1000, step=10):
        print(atoms.info['timestep'], atoms.positions.shape)
```

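Only the requested frames are loaded into memory. Because the datasets are chunked, HDF5 reads and decompresses the whole chunk containing a frame, so the cost of a single-frame access grows with `--chunk-frames`.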
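The same lazy behaviour can be seen with plain `h5py`, independent of `dumpduck`. The sketch below writes a small chunked `positions` dataset to a temporary file and reads back a single frame. The file name, sizes, and random data are made up for illustration, and this is not necessarily how `H5Trajectory` is implemented internally.

```python
import os
import tempfile

import h5py
import numpy as np

n_frames, n_atoms, chunk_frames = 64, 10, 16

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")

    # Write a small chunked, gzip-compressed positions dataset.
    with h5py.File(path, "w") as f:
        f.create_dataset(
            "positions",
            data=np.random.default_rng(0).random((n_frames, n_atoms, 3), dtype=np.float32),
            chunks=(chunk_frames, n_atoms, 3),
            compression="gzip",
            compression_opts=6,
        )

    # Opening the file loads no frame data; indexing returns one frame.
    with h5py.File(path, "r") as f:
        positions = f["positions"]   # h5py.Dataset handle, nothing loaded yet
        frame = positions[40]        # NumPy array holding frame 40 only
        chunk_shape = positions.chunks

print(frame.shape, frame.dtype, chunk_shape)
```

Indexing the dataset handle returns a NumPy array for that frame alone, while the rest of the trajectory stays on disk.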

## Incremental properties

Properties live under `/properties/atomic/<name>` or `/properties/frame/<name>`.
Each property group holds two datasets:

```text
data   # actual data
valid  # bool mask saying which frames have been written
```

This allows sparse labelling: the property dataset spans all frames, but only the frames flagged in `valid` hold computed values.
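The `data`/`valid` pairing can be illustrated with two in-memory NumPy arrays. This is a conceptual sketch only: the array sizes and the `write_frame` helper are hypothetical, and the real datasets live in the HDF5 file.

```python
import numpy as np

n_frames, n_atoms = 6, 4

# In this sketch both arrays cover every frame up front; nothing is valid yet.
data = np.zeros((n_frames, n_atoms), dtype=np.float32)
valid = np.zeros(n_frames, dtype=bool)


def write_frame(index, values):
    """Store one frame of values and mark that frame as written."""
    data[index] = values
    valid[index] = True


# Label only frames 0 and 3.
write_frame(0, np.full(n_atoms, 1.5, dtype=np.float32))
write_frame(3, np.full(n_atoms, 2.5, dtype=np.float32))

pending = np.flatnonzero(~valid)   # frames still to be computed
labelled = data[valid]             # rows for the written frames only

print(pending.tolist(), labelled.shape)
```

A labelling job that is interrupted can resume from `pending` instead of recomputing frames that are already flagged.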

### NMR shielding tensors, one frame at a time

```python
from dump_duck import H5Trajectory

with H5Trajectory('trajectory.h5', mode='r+') as traj:
    if not traj.has_property('shielding_tensors', kind='atomic'):
        traj.create_property(
            'shielding_tensors',
            kind='atomic',
            frame_shape=(3, 3),
            dtype='float32',
            units='ppm',
            description='Per-atom NMR shielding tensors',
            compression='gzip',
            compression_level=6,
            chunk_frames=1,
        )

    for i, atoms in enumerate(traj.iter_frames()):
        if traj.property_valid('shielding_tensors', i, kind='atomic'):
            continue

        # `calculator` is a placeholder for your own model; it must return
        # an array of shape (n_atoms, 3, 3) for the given frame.
        shielding = calculator.predict_shielding_tensors(atoms)
        traj.write_property('shielding_tensors', i, shielding, kind='atomic')
```

### Chemical shifts

```python
from dump_duck import H5Trajectory

with H5Trajectory('trajectory.h5', mode='r+') as traj:
    traj.create_property(
        'chemical_shifts',
        kind='atomic',
        frame_shape=(),
        dtype='float32',
        units='ppm',
        description='Per-atom NMR chemical shifts',
    )

    # `shifts` is a placeholder array of shape (n_atoms,) computed elsewhere.
    traj.write_property('chemical_shifts', 0, shifts, kind='atomic')
```

### Frame-wise energies

```python
from dump_duck import H5Trajectory

with H5Trajectory('trajectory.h5', mode='r+') as traj:
    traj.create_property('energy', kind='frame', dtype='float64', units='eV')
    traj.write_property('energy', 0, 123.4, kind='frame')
```

## Extract frames

```bash
dumpduck extract trajectory.h5 frame_1000.xyz --index 1000
```

With valid properties included as ASE arrays/info:

```bash
dumpduck extract trajectory.h5 labelled.xyz --start 0 --stop 100 --step 10 --include-properties
```

## Compression notes

Portable built-in options:

```text
none, lzf, gzip
```

Optional plugin options with `dumpduck[compression]`:

```text
zstd, blosc-zstd
```

For MD trajectories, a good default is:

```text
gzip level 6, float32, chunk_frames 16
```

For single-frame random access, use smaller chunks. For better compression and sequential reading, use larger chunks such as 32 or 64.
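To make the trade-off concrete, the uncompressed size of one `positions` chunk is `chunk_frames × n_atoms × 3 × itemsize`. The sketch below evaluates this for the 4352-atom system from the example `dumpduck info` output, assuming `float32` (4 bytes per value); `chunk_bytes` is a hypothetical helper.

```python
def chunk_bytes(chunk_frames, n_atoms, itemsize=4, components=3):
    """Uncompressed size in bytes of one positions chunk."""
    return chunk_frames * n_atoms * components * itemsize


n_atoms = 4352  # atom count from the example `dumpduck info` output

for chunk_frames in (1, 16, 64):
    size = chunk_bytes(chunk_frames, n_atoms)
    print(f"chunk_frames={chunk_frames:>3}: {size / 1024:.0f} KiB per chunk")
```

Reading a single frame means decompressing one chunk of this size, which is why smaller chunks favour random access and larger ones favour compression ratio and sequential throughput.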

## HDF5 layout

```text
/
  atomic_numbers        (n_atoms,)
  ids                   (n_atoms,)
  lammps_types          optional, (n_atoms,)
  mol_ids               optional, (n_atoms,)

  positions             (n_frames, n_atoms, 3)
  cells                 (n_frames, 3, 3)
  pbc                   (n_frames, 3)
  timesteps             (n_frames,)

  properties/
    atomic/
      <name>/
        data            (n_frames, n_atoms, *frame_shape)
        valid           (n_frames,)
    frame/
      <name>/
        data            (n_frames, *frame_shape)
        valid           (n_frames,)
```
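Because the layout is plain HDF5, a file can also be inspected with `h5py` alone. The sketch below builds a tiny stand-in file that follows the paths listed above, then lists its datasets and reads only the labelled frames of one property. The file name, sizes, and the boolean storage type of `valid` are assumptions made for illustration; a real `dumpduck` file may differ in those details.

```python
import os
import tempfile

import h5py
import numpy as np

n_frames, n_atoms = 5, 3

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "layout_demo.h5")

    # Build a tiny stand-in file that follows the listed paths.
    with h5py.File(path, "w") as f:
        f["positions"] = np.zeros((n_frames, n_atoms, 3), dtype=np.float32)
        prop = f.create_group("properties/atomic/chemical_shifts")
        prop["data"] = np.zeros((n_frames, n_atoms), dtype=np.float32)
        valid = np.zeros(n_frames, dtype=bool)
        valid[[1, 4]] = True
        prop["valid"] = valid

    with h5py.File(path, "r") as f:
        # Collect every dataset path and shape in the file.
        shapes = {}

        def record(name, obj):
            if isinstance(obj, h5py.Dataset):
                shapes[name] = obj.shape

        f.visititems(record)

        # Load the validity mask, then read only the labelled frames.
        prop = f["properties/atomic/chemical_shifts"]
        valid_frames = np.flatnonzero(prop["valid"][:])
        shifts = prop["data"][valid_frames.tolist()]

for name, shape in sorted(shapes.items()):
    print(name, shape)
print(valid_frames.tolist(), shifts.shape)
```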
