Metadata-Version: 2.4
Name: unifiedefficientloader
Version: 0.4.4
Summary: A unified interface for memory-efficient, per-tensor loading of safetensors files as raw bytes from offsets, handling CPU/GPU pinned transfers, and converting between tensors and dicts.
Author: silveroxides
License: MIT License
        
        Copyright (c) 2026 silveroxides
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: comfy-aimdo==0.3.0
Provides-Extra: torch
Requires-Dist: torch; extra == "torch"
Provides-Extra: safetensors
Requires-Dist: safetensors; extra == "safetensors"
Provides-Extra: tqdm
Requires-Dist: tqdm; extra == "tqdm"
Provides-Extra: all
Requires-Dist: torch; extra == "all"
Requires-Dist: safetensors; extra == "all"
Requires-Dist: tqdm; extra == "all"
Dynamic: license-file

# unifiedefficientloader

A unified interface for loading safetensors, handling CPU/GPU pinned transfers, and converting between tensors and dicts.

## Documentation

Full API reference and guides in [docs/](docs/index.md).

## Installation

You can install this package via pip. `torch`, `safetensors`, and `tqdm` are optional extras rather than hard dependencies, so make sure they are present in your environment, either installed explicitly or pulled in via the `all` extra:

```bash
pip install unifiedefficientloader
pip install torch safetensors tqdm
# or, equivalently:
pip install "unifiedefficientloader[all]"
```

## Usage

### Unified Safetensors Loader

```python
from unifiedefficientloader import UnifiedSafetensorsLoader

# Standard mode (preload all)
with UnifiedSafetensorsLoader("model.safetensors", low_memory=False) as loader:
    tensor = loader.get_tensor("weight_name")

# Low memory mode (streaming)
with UnifiedSafetensorsLoader("model.safetensors", low_memory=True) as loader:
    for key in loader.keys():
        tensor = loader.get_tensor(key)
        # Process tensor...
        loader.mark_processed(key) # Frees memory
```

### Incremental Safetensors Writer

```python
from unifiedefficientloader import UnifiedSafetensorsLoader, IncrementalSafetensorsWriter

# The writer supports the context manager protocol; using `with`
# guarantees the output file is finalized even if processing raises.
with IncrementalSafetensorsWriter(output_path, metadata=metadata) as writer:
    # Load model tensors and process them.
    with UnifiedSafetensorsLoader("model.safetensors", low_memory=True) as loader:
        for key in loader.keys():
            tensor = loader.get_tensor(key)
            # Process tensor...
            writer.write(key, tensor)
            del tensor
            loader.mark_processed(key)  # Frees memory
```

### Loading Specific Tensors Dynamically (Header Analysis)

You can analyze the file's header without loading the entire multi-gigabyte safetensors file into memory. This allows you to locate specific data (like embedded JSON dictionaries stored as `uint8` tensors) and load *only* those specific tensors directly from their file offsets.

```python
from unifiedefficientloader import UnifiedSafetensorsLoader, tensor_to_dict

with UnifiedSafetensorsLoader("model.safetensors", low_memory=True) as loader:
    # 1. Analyze the header metadata without loading any tensors
    # loader._header contains the full safetensors header directory
    uint8_tensor_keys = [
        key for key, info in loader._header.items()
        if isinstance(info, dict) and info.get("dtype") == "U8"
    ]

    # 2. Load ONLY those specific tensors using their keys
    for key in uint8_tensor_keys:
        # get_tensor dynamically reads only the bytes for this tensor
        # based on the offsets found in the header
        loaded_tensor = loader.get_tensor(key)

        # 3. Decode the uint8 tensor back into a Python dictionary
        extracted_dict = tensor_to_dict(loaded_tensor)
        print(f"Decoded {key}:", extracted_dict)
```
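The header trick above relies on the safetensors on-disk layout: the file begins with an unsigned 64-bit little-endian length prefix, followed by that many bytes of JSON mapping each tensor name to its dtype, shape, and byte offsets. A stdlib-only sketch of reading such a header, independent of this package:

```python
import json
import struct

def read_safetensors_header(path):
    """Read only the JSON header of a safetensors file.

    The format starts with an unsigned 64-bit little-endian integer
    giving the header length, followed by that many bytes of JSON.
    No tensor data is touched, so this is cheap even for huge files.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

# Build a tiny safetensors file on disk to demonstrate.
header = {"weight": {"dtype": "U8", "shape": [4], "data_offsets": [0, 4]}}
header_bytes = json.dumps(header).encode("utf-8")
with open("demo.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(header_bytes)))
    f.write(header_bytes)
    f.write(b"\x00\x01\x02\x03")  # the tensor payload

parsed = read_safetensors_header("demo.safetensors")
print(parsed["weight"]["data_offsets"])  # [0, 4]
```

This is why filtering by `dtype == "U8"` costs only a header read: the offsets tell the loader exactly which byte range to fetch later.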

### Optimized Asynchronous Streaming via ThreadPoolExecutor

For maximum I/O throughput while maintaining strict memory backpressure, use `async_stream`. It uses a `ThreadPoolExecutor` for background disk reading and a bounded queue to prevent memory exhaustion. With `pin_memory=True`, memory pinning is performed sequentially in the main thread to avoid OS-level lock contention and preserve high DMA transfer speeds.

```python
from unifiedefficientloader import UnifiedSafetensorsLoader, transfer_to_gpu_pinned

with UnifiedSafetensorsLoader("model.safetensors", low_memory=True) as loader:
    keys_to_load = loader.keys()

    # Create the continuous streaming generator
    # prefetch_batches controls how many batches to buffer in memory
    stream = loader.async_stream(
        keys_to_load,
        batch_size=8,
        prefetch_batches=2,
        pin_memory=True
    )

    # Iterate directly over the generator
    for batch in stream:
        for key, pinned_tensor in batch:
            # Transfer directly to GPU via DMA (pinning is already done)
            gpu_tensor = transfer_to_gpu_pinned(pinned_tensor, device="cuda")

            # ... process gpu_tensor ...
            loader.mark_processed(key)
```
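The backpressure pattern described above (background workers reading ahead while a bounded queue caps how much sits buffered) can be sketched with stdlib primitives alone. This illustrates the mechanism, not the package's actual internals:

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def bounded_prefetch(items, load_fn, prefetch=2, workers=4):
    """Yield load_fn(item) for each item, reading ahead in background
    threads while never holding more than `prefetch` pending results."""
    q = queue.Queue(maxsize=prefetch)
    sentinel = object()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        def produce():
            for item in items:
                # put() blocks when the queue is full, so submission
                # never runs more than `prefetch` loads ahead of the
                # consumer: strict memory backpressure.
                q.put(pool.submit(load_fn, item))
            q.put(sentinel)

        threading.Thread(target=produce, daemon=True).start()
        # Futures come out in submission order, so results stay ordered.
        while (fut := q.get()) is not sentinel:
            yield fut.result()

# Simulate disk reads that each return a payload
loaded = list(bounded_prefetch(range(5), lambda i: i * 10, prefetch=2))
print(loaded)  # [0, 10, 20, 30, 40]
```

Swapping the lambda for a real per-tensor disk read gives the same shape as `async_stream`: I/O overlaps with processing, but memory use is bounded by the queue size.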

### Unified Data Loader

A high-performance, threaded alternative to PyTorch's standard `DataLoader`. It eliminates multiprocessing IPC overhead and features a zero-copy pipeline capable of streaming batches directly from pinned CPU memory to VRAM (`direct_gpu=True`).

```python
from unifiedefficientloader import UnifiedDataLoader
from torchvision import datasets, transforms

dataset = datasets.FakeData(transform=transforms.ToTensor())

# Replaces torch.utils.data.DataLoader
# Pre-allocates pinned buffer pools and streams directly to GPU
loader = UnifiedDataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
    direct_gpu=True
)

for batch_image, batch_label in loader:
    # batch_image and batch_label are already on the GPU (device="cuda")
    pass
```

### Direct-to-GPU Streaming (Zero-Copy)

For the absolute fastest loading times on CUDA devices, use the `direct_gpu=True` flag. This creates a pipeline that pre-allocates pinned memory pools and GPU memory slabs. Tensors are loaded from disk directly into pinned buffers, and immediately asynchronously copied to the GPU using CUDA streams, hiding the PCIe transfer latency completely behind the disk I/O.

```python
from unifiedefficientloader import UnifiedSafetensorsLoader

with UnifiedSafetensorsLoader("model.safetensors", low_memory=True, direct_gpu=True) as loader:
    stream = loader.async_stream(
        loader.keys(),
        batch_size=8,
        prefetch_batches=2,
    )
    for batch in stream:
        for key, gpu_tensor in batch:
            # gpu_tensor is already on the GPU
            assert gpu_tensor.device.type == "cuda"
            # ... process gpu_tensor ...
            loader.mark_processed(key)  # releases GPU buffer back to pool
```

### Zero-Copy MMAP Loading

`use_mmap=True` maps the file into virtual memory via the `uel` native extension. No tensor data is copied into process-owned buffers; PyTorch holds a direct pointer into the OS page cache.

```python
from unifiedefficientloader import UnifiedSafetensorsLoader

with UnifiedSafetensorsLoader("model.safetensors", low_memory=True, use_mmap=True) as loader:
    state_dict = loader.load_all()
    # all tensors are zero-copy views into mapped memory
```

Requires the `uel` native extension to be compiled. Falls back silently to standard IO if unavailable. See [docs/mmap.md](docs/mmap.md) and [docs/building.md](docs/building.md).
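The zero-copy idea itself can be demonstrated with the stdlib `mmap` module: a `memoryview` over a mapped file reads straight from the OS page cache, and nothing is copied until a slice is explicitly materialized. This is a conceptual sketch, separate from the `uel` extension:

```python
import mmap

# Write a small file to map.
with open("demo.bin", "wb") as f:
    f.write(bytes(range(16)))

with open("demo.bin", "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = memoryview(mapped)  # zero-copy view into the mapping
    # Indexing the view reads via the page cache; bytes are only
    # copied when a slice is materialized, as below.
    first_four = bytes(view[:4])
    print(first_four)  # b'\x00\x01\x02\x03'
    view.release()
    mapped.close()
```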

### Tensor/Dict Conversion

```python
from unifiedefficientloader import dict_to_tensor, tensor_to_dict

my_dict = {"param": 1.0, "name": "test"}
tensor = dict_to_tensor(my_dict)
recovered_dict = tensor_to_dict(tensor)
```
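One plausible encoding for such a round trip (the package's actual wire format may differ) is to JSON-serialize the dict into a UTF-8 byte buffer, i.e. exactly the payload a `uint8` tensor would carry. A stdlib-only sketch:

```python
import json

def dict_to_bytes(d):
    """Encode a JSON-serializable dict as a flat byte buffer,
    the same shape of data a uint8 tensor holds."""
    return json.dumps(d).encode("utf-8")

def bytes_to_dict(buf):
    """Decode the byte buffer back into a dict."""
    return json.loads(bytes(buf).decode("utf-8"))

my_dict = {"param": 1.0, "name": "test"}
buf = dict_to_bytes(my_dict)
assert bytes_to_dict(buf) == my_dict
```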

### Pinned Memory Transfers

```python
import torch
from unifiedefficientloader import transfer_to_gpu_pinned

tensor = torch.randn(100, 100)
# Transfers using pinned memory if CUDA is available, otherwise falls back gracefully
gpu_tensor = transfer_to_gpu_pinned(tensor, device="cuda:0")
```
