Metadata-Version: 2.4
Name: merkle-weight-verify
Version: 0.2.0
Summary: ML model weight verification via hierarchical Merkle trees — O(1) integrity check, layer-aware diff, incremental sync. The missing verification layer for safetensors, PyTorch, and HuggingFace Hub.
Author-email: Geoffrey Wang <geoffreywang1117@gmail.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/GeoffreyWang1117/merkle-weight-verify
Project-URL: Issues, https://github.com/GeoffreyWang1117/merkle-weight-verify/issues
Keywords: merkle-tree,integrity,verification,model-weights,hashing,provenance
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Security :: Cryptography
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Provides-Extra: fast
Requires-Dist: blake3>=1.0.0; extra == "fast"
Provides-Extra: safetensors
Requires-Dist: safetensors>=0.4.0; extra == "safetensors"
Provides-Extra: torch
Requires-Dist: torch>=2.0.0; extra == "torch"
Provides-Extra: huggingface
Requires-Dist: safetensors>=0.4.0; extra == "huggingface"
Requires-Dist: huggingface_hub>=0.20.0; extra == "huggingface"
Provides-Extra: all
Requires-Dist: blake3>=1.0.0; extra == "all"
Requires-Dist: safetensors>=0.4.0; extra == "all"
Requires-Dist: torch>=2.0.0; extra == "all"
Requires-Dist: huggingface_hub>=0.20.0; extra == "all"
Dynamic: license-file

# merkle-weight-verify

Hierarchical Merkle tree for verifying the integrity of large files -- ML model weights, datasets, or any binary blob.

## Why

Downloading a 70B model from HuggingFace? Fine-tuning and sharing weights across a team? You need to answer:

- **"Is this file exactly what I expect?"** -- O(1) root hash comparison once the tree is built
- **"What changed between two versions?"** -- O(k log C) tree-walk diff, where k = changed chunks and C = total chunks
- **"How much do I need to re-download?"** -- incremental sync estimates (50-68% bandwidth savings for partial fine-tuning)

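The core has zero ML dependencies: it uses only the Python standard library (`hashlib`, `dataclasses`, `json`). The optional extras (`fast`, `safetensors`, `torch`, `huggingface`, `all`) pull in BLAKE3 and the framework adapters.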

## Test Suite

142 tests covering every public API surface, including real-model benchmarks with ResNet18:

```bash
pip install pytest torch torchvision   # test dependencies
python -m pytest tests/ -v
```

| Module | Tests | What's covered |
|--------|-------|----------------|
| `test_hasher` | 15 | All 4 hash algorithms, default switching, `verify_hash`, `detect_algorithm` |
| `test_chunking` | 10 | `chunk_bytes`, iterator, tensor, `estimate_chunk_count`, edge cases |
| `test_merkle_tree` | 30 | Tree construction (1/2/4/odd chunks), proofs, `diff_tree`, serialisation, `LayerMerkleTree`, `ModelMerkleTree`, `build_model_merkle_tree`, `_extract_layer_name` |
| `test_comparison` | 10 | `compare_model_trees` (detailed/not, layer-only-in-one), `estimate_sync_savings` |
| `test_strategies` | 12 | Fixed, Adaptive, TheoreticalOptimal strategies, bounds and edge cases |
| `test_flat_tree` | 7 | `FlatModelTree`, `flat_compare`, `TwoLevelModelTree` alias |
| `test_benchmark_resnet18` | 15 | Real ResNet18: build, verify, proof, JSON roundtrip, fine-tuning diff, sync savings, parallel vs serial, timing benchmarks |

ResNet18 benchmark results (11.7M params):

| Operation | Time | Detail |
|-----------|------|--------|
| Build tree (parallel) | 0.034s | All 60 param tensors |
| Build tree (serial) | 0.034s | Single-threaded baseline |
| Compare (1 param changed) | 0.1ms | 264 hash comparisons vs 2953 total chunks |
| Verify | 0.1ms | Full re-hash and root check |

## Install

```bash
pip install merkle-weight-verify
```

## Quick Start

### Hash and verify a file

```python
from merkle_verify import MerkleTree, compute_hash, chunk_bytes

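# Read the file inside a context manager so the handle is closed promptly.
# Note: this loads the whole file into memory.
with open("model.safetensors", "rb") as f:
    data = f.read()
chunks = chunk_bytes(data, chunk_size=16384)  # 16KB chunks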
hashes = [compute_hash(c) for c in chunks]

tree = MerkleTree(hashes)
print(tree.root_hash)  # single hash represents entire file
```
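For files too large to read in one go, the leaf hashes can be computed while streaming. The sketch below uses only `hashlib` and is not the package's implementation: `iter_chunks`, `sha256_hex`, and `merkle_root` are hypothetical helpers, and the pairing convention (duplicate the last node on odd levels, hash the concatenated hex digests) is an assumption made for illustration. It shows how a list of chunk hashes folds into a single root.

```python
import hashlib
from typing import Iterator, List


def iter_chunks(path: str, chunk_size: int = 16384) -> Iterator[bytes]:
    """Yield fixed-size chunks so only one chunk is held in memory at a time."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def merkle_root(leaf_hashes: List[str]) -> str:
    """Fold leaf hashes pairwise until one hash remains.

    Assumed convention: an odd level duplicates its last node, and a parent
    is the hash of its two children's hex digests concatenated.
    """
    if not leaf_hashes:
        raise ValueError("need at least one leaf hash")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])
        level = [
            sha256_hex((level[i] + level[i + 1]).encode())
            for i in range(0, len(level), 2)
        ]
    return level[0]
```

A root for a file on disk would then be `merkle_root([sha256_hex(c) for c in iter_chunks(path)])`. The roadmap lists `MerkleTree.from_file()` for the same purpose inside the package.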

### Compare two versions

```python
# hashes_v1 / hashes_v2: chunk-hash lists for each version,
# built the same way as `hashes` in the previous example
tree_v1 = MerkleTree(hashes_v1)
tree_v2 = MerkleTree(hashes_v2)

changed_indices, comparisons = tree_v1.diff_tree(tree_v2)
print(f"{len(changed_indices)} chunks changed, {comparisons} hash comparisons")
# vs. naive linear scan: would need len(hashes_v1) comparisons
```

### Merkle proofs

```python
proof = tree.get_proof(chunk_index=42)
assert proof.verify()  # cryptographic proof that chunk 42 is part of this tree
```

### Multi-layer model trees

```python
from merkle_verify import ModelMerkleTree, LayerMerkleTree

model_tree = ModelMerkleTree(model_name="llama-3-70b")
# ... build layer trees from state_dict ...
model_tree.compute_model_root()

# Compare two model versions
# other_tree: a ModelMerkleTree built the same way for the second version
changed_layers = model_tree.get_changed_layers(other_tree)
# Only the changed layers need to be re-downloaded
```

### Estimate sync savings

```python
from merkle_verify import estimate_sync_savings

# old_tree / new_tree: model trees for the version you have and the version you want
savings = estimate_sync_savings(old_tree, new_tree)
print(f"Save {savings['savings_percentage']:.1f}% bandwidth with incremental sync")
```
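As a rough mental model, the saving is the share of bytes that sit in unchanged chunks and therefore need no re-transfer. The sketch below is a simplified stand-in, not the formula used by `estimate_sync_savings()`; the function name and inputs are hypothetical, and it ignores the cost of shipping hashes or proofs.

```python
from typing import Iterable, Sequence


def savings_percentage(chunk_sizes: Sequence[int], changed: Iterable[int]) -> float:
    """Percentage of bytes that do not need to be re-sent.

    chunk_sizes: size in bytes of every chunk in the new version
    changed: indices of chunks whose hash differs from the old version
    """
    total = sum(chunk_sizes)
    if total == 0:
        return 0.0
    resend = sum(chunk_sizes[i] for i in set(changed))
    return 100.0 * (total - resend) / total


# Illustrative numbers only: 10 equal chunks, 4 of them modified
print(savings_percentage([100] * 10, [0, 1, 2, 3]))  # 60.0
```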

## Features

| Feature | Description |
|---------|-------------|
| **O(1) verification** | Compare root hashes to verify entire file integrity |
| **O(k log C) diff** | Tree-walk finds only changed chunks without scanning all |
| **Merkle proofs** | Cryptographic proof that a chunk belongs to a tree |
| **4 hash algorithms** | SHA-256, SHA-512, SHA3-256, BLAKE2b (plus an optional BLAKE3 backend via the `fast` extra) |
| **Hierarchical trees** | Model > Layer > Parameter > Chunk (4-level hierarchy) |
| **Chunking strategies** | Fixed, adaptive (size-based), theoretical optimal (c*=1/p) |
| **Incremental sync** | Estimate bandwidth savings for partial updates |
| **Serialization** | JSON import/export for tree persistence |
| **Parallel builds** | ThreadPoolExecutor for large models |

## Chunking Strategies

```python
from merkle_verify import FixedChunkStrategy, AdaptiveChunkStrategy, TheoreticalOptimalStrategy

# Fixed: 16KB chunks for everything (default)
fixed = FixedChunkStrategy(chunk_size=16384)

# Adaptive: chunk size scales with parameter size
# Small params (3KB bias) -> small chunks, large params (150MB embedding) -> larger chunks
adaptive = AdaptiveChunkStrategy(target_chunks=64)

# Theoretical optimal: c* = 1/p where p = modification probability per byte
optimal = TheoreticalOptimalStrategy(modification_prob=0.001)
```
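To make `c* = 1/p` concrete, here is a small worked calculation. It assumes each byte is modified independently with probability `p`, which is a simplification, and the helper names are hypothetical. `TheoreticalOptimalStrategy` may round or clamp the result differently.

```python
def optimal_chunk_size(p: float) -> int:
    """c* = 1/p in bytes, rounded to the nearest integer and at least 1."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return max(1, round(1 / p))


def touched_fraction(p: float, chunk_size: int) -> float:
    """Probability that a chunk contains at least one modified byte,
    assuming independent per-byte modifications."""
    return 1 - (1 - p) ** chunk_size


p = 0.001
c_star = optimal_chunk_size(p)
print(c_star)                                 # 1000
print(round(touched_fraction(p, c_star), 3))  # 0.632
print(touched_fraction(p, 16384) > 0.999)     # True
```

Under this model, with `p = 0.001` a 16KB chunk is almost certain to contain a modified byte, so nearly every chunk would be re-sent, while at `c* = 1000` bytes about 63% of chunks are touched.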

## API Reference

### Core Classes

- `MerkleTree(chunk_hashes, chunk_size)` -- Build tree from chunk hashes
- `LayerMerkleTree(layer_name)` -- Group parameter trees by layer
- `ModelMerkleTree(model_name)` -- Top-level model tree

### Hashing

- `compute_hash(data, algorithm=None)` -- Hash bytes/string
- `hash_pair(left, right)` -- Hash two child hashes (internal node)
- `verify_hash(data, expected_hash)` -- Check data matches hash
- `HashAlgorithm.SHA256 | SHA512 | SHA3_256 | BLAKE2B`

### Comparison

- `compare_model_trees(tree_a, tree_b)` -- Full model diff with 3-level pruning
- `estimate_sync_savings(tree_a, tree_b)` -- Bandwidth savings estimate
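The pruning idea behind the model-level comparison can be sketched with plain dictionaries: skip every layer whose root matches, then skip every parameter whose root matches, and run a chunk-level diff only on what is left. The data shape and function name below are hypothetical and are not the return format of `compare_model_trees()`.

```python
from typing import Dict, List, Tuple

# Hypothetical summary shape: layer name -> (layer root, {param name: param root})
LayerSummary = Tuple[str, Dict[str, str]]
ModelSummary = Dict[str, LayerSummary]


def changed_params(a: ModelSummary, b: ModelSummary) -> List[Tuple[str, str]]:
    """List (layer, param) pairs that differ, pruning identical layers early."""
    changed: List[Tuple[str, str]] = []
    for layer in sorted(set(a) | set(b)):
        if layer not in a or layer not in b:
            # Layer exists on one side only: every parameter counts as changed
            only = a[layer] if layer in a else b[layer]
            changed.extend((layer, name) for name in sorted(only[1]))
            continue
        root_a, params_a = a[layer]
        root_b, params_b = b[layer]
        if root_a == root_b:
            continue  # whole layer identical: no parameter comparison needed
        for name in sorted(set(params_a) | set(params_b)):
            if params_a.get(name) != params_b.get(name):
                changed.append((layer, name))
    return changed


old = {"enc": ("r1", {"w": "h1", "b": "h2"}), "head": ("r2", {"w": "h3"})}
new = {"enc": ("r1", {"w": "h1", "b": "h2"}), "head": ("r9", {"w": "h8"})}
print(changed_params(old, new))  # [('head', 'w')]
```

Only the parameters returned here would need a chunk-level tree walk.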

## Performance

Tested on GPT-2 (124M params) through LLaMA-3-70B:

| Operation | GPT-2 | LLaMA-7B | LLaMA-70B |
|-----------|-------|----------|-----------|
| Build tree | 0.3s | 2.1s | 18s |
| Root compare | <1ms | <1ms | <1ms |
| Full diff (1% change) | 5ms | 12ms | 45ms |
| Full diff (naive) | 50ms | 340ms | 2.8s |

## Roadmap

### v0.2 — Production Hardening + Ecosystem Foundations

**Tier 1: Immediate (high feasibility, high impact)**

- [x] **BLAKE3 fast hash backend** — 5-10x speedup via `blake3` package (AVX2/NEON SIMD, multithreading). Optional dep: `pip install merkle-weight-verify[fast]`
- [x] **Safetensors plugin** — `merkle_verify.safetensors_adapter`: `sign()`, `verify()`, `diff()`, `verify_tensor()`. Leverages `safe_open()` per-tensor mmap access. Merkle manifest stored in sidecar `.merkle.json`. Fills gap left by safetensors Issue #220 (closed "not planned").
- [x] **PyTorch integration** — `merkle_verify.pytorch_adapter`: `merkle_save()`, `merkle_load()`, `verify_checkpoint()`. Addresses PyTorch Issue #126952 (open, unassigned).
- [x] **CLI tool** — `merkle-verify hash|sign|verify|diff|info` with auto-detection of safetensors/PyTorch/generic files
- [x] **Streaming build** — `MerkleTree.from_file()` and `build_file_merkle_tree()` with O(chunk_size) memory
- [x] **Fix Merkle proof for odd-count edge nodes** — all leaves (including duplicated last-node) now produce valid proofs
- [x] **Optimize `get_changed_chunks()`** — replaced O(C) linear scan with O(k log C) `diff_tree()` call

### v0.3 — Ecosystem Integration

**Tier 2: Strategic (medium feasibility, high value)**

- [ ] **HuggingFace Hub / Xet complement** — `merkle-verify hf-check <repo-id>`, `ModelMerkleTree.from_hf_cache()`. Post-download semantic verification on top of Xet's byte-level transport. xet-core is Apache-2.0 but `hf-xet` has no public chunk API, so we operate on downloaded safetensors files.
- [ ] **Sigstore model signing integration** — Generate `.merkle.json` sidecar compatible with `model-signing` (v1.1.1) workflow. Sign Merkle manifest alongside model files for combined provenance + fine-grained verification.
- [ ] **Delta sync protocol** — given diff, produce minimal binary patch for incremental transfer
- [ ] **Adaptive strategy auto-tuning** — profile modification density from git history to pick optimal `c*`

### v0.4+ — Scale

**Tier 3: Future (low feasibility now, revisit when ecosystem matures)**

- [ ] **NVIDIA cuPQC GPU backend** — 388-891 GB/s hash throughput on Blackwell GPUs. Currently blocked: C++/CUDA only, no Python bindings, v0.4 pre-release. Revisit when NVIDIA ships pip package.
- [ ] **Distributed tree construction** — split across workers for 100B+ models
- [ ] **Persistent tree store** — SQLite/LevelDB backend for caching trees across runs
- [ ] **Formal verification** — prove diff_tree correctness with property-based testing (Hypothesis)

### Research Directions

- Empirical evaluation of `c* = 1/p` theory across fine-tuning regimes (LoRA, full, distillation)
- Comparison of content-defined chunking (CDC/Rabin) with fixed-size chunking for weight files
- Integration with federated learning: per-client diff aggregation via Merkle proofs

## License

Apache-2.0 — Copyright 2026 Geoffrey Wang
