Metadata-Version: 2.4
Name: seedbraid
Version: 2.0.1
Summary: Reference-based file reconstruction with CDC chunking, SBD1 binary seed format, and IPFS transport
Author: aimsise
License-Expression: MIT
Project-URL: Homepage, https://github.com/aimsise/seedbraid
Project-URL: Repository, https://github.com/aimsise/seedbraid
Project-URL: Documentation, https://github.com/aimsise/seedbraid#readme
Project-URL: Issues, https://github.com/aimsise/seedbraid/issues
Project-URL: Changelog, https://github.com/aimsise/seedbraid/blob/main/CHANGELOG.md
Keywords: cdc,chunking,deduplication,ipfs,binary-format,file-reconstruction,seed
Classifier: Development Status :: 5 - Production/Stable
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: System :: Archiving
Classifier: Typing :: Typed
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typer>=0.12
Provides-Extra: zstd
Requires-Dist: zstandard>=0.23; extra == "zstd"
Provides-Extra: crypto
Requires-Dist: cryptography>=43.0; extra == "crypto"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.6; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: cryptography>=43.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.6; extra == "docs"
Requires-Dist: mkdocs-material>=9.5; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.25; extra == "docs"
Dynamic: license-file

# Seedbraid

[![CI](https://github.com/aimsise/seedbraid/actions/workflows/ci.yml/badge.svg)](https://github.com/aimsise/seedbraid/actions/workflows/ci.yml)

Seedbraid is a reference-based reconstruction tool for large, similar binary artifacts.

It combines deterministic content-defined chunking (CDC), a compact binary `SBD1` seed format, reusable genome storage, and optional IPFS transport so you can ship reconstruction intent instead of repeatedly shipping full blobs.

## Why Seedbraid

Seedbraid is designed for workflows where ordinary file distribution becomes wasteful:

- large binary artifacts change often, but stay mostly similar
- fixed-size chunking loses reuse under shifted offsets
- you want compact transport plus bit-perfect restore guarantees
- you want one CLI surface for encode, verify, decode, publish, and fetch

In short: Seedbraid helps you move less data, reuse more content, and still verify exact reconstruction.

## When Seedbraid Is a Good Fit

Seedbraid works especially well for:

- large binary versioning: datasets, ML models, media assets, VM images
- distribution of many similar files across releases
- shift-heavy changes such as insertions that break fixed chunk reuse
- IPFS-based distribution and retrieval with integrity validation
- environments where transfer size, dedup reuse, and reproducibility matter

## Core Capabilities

- Lossless encode/decode with SHA-256 verification
- Deterministic chunking with `fixed`, `cdc_buzhash`, and `cdc_rabin`
- Genome storage backed by SQLite for deduplicated chunk reuse
- `SBD1` binary seed container with manifest, recipe, optional RAW, and integrity data
- IPFS publish/fetch transport
- Optional remote pin integration
- Strict verification mode for production-grade restore checks
- Optional signing and encryption support

## Installation

### pip
```bash
pip install seedbraid
```

### pipx
```bash
pipx install seedbraid
seedbraid --help
```

### uvx
```bash
uvx seedbraid --help
uvx seedbraid doctor
```

### Optional extras
```bash
# pip
pip install "seedbraid[zstd]"
pip install "seedbraid[crypto]"    # encryption / signing support

# pipx
pipx install "seedbraid[zstd]"
pipx install "seedbraid[crypto]"

# uvx
uvx --from "seedbraid[zstd]" seedbraid doctor
uvx --from "seedbraid[crypto]" seedbraid doctor
```

## Quick Start

### 1. Encode a file into a seed
```bash
seedbraid encode input.bin --genome ./genome --out seed.sbd --portable
```

### 2. Verify the seed
```bash
seedbraid verify seed.sbd --genome ./genome --strict
```

### 3. Decode the file back
```bash
seedbraid decode seed.sbd --genome ./genome --out recovered.bin
```

### 4. Compare the result
```bash
cmp -s input.bin recovered.bin && echo "bit-perfect roundtrip: OK"
```

> **Note:** If you run Seedbraid through `uvx` rather than installing it, prefix each command with `uvx` (e.g. `uvx seedbraid encode ...`).
> For development builds, use `uv run --no-editable seedbraid` instead.

## Typical Workflow

A common Seedbraid workflow looks like this:

1. Prime or learn reusable chunks into a genome
2. Encode a target artifact into a compact `SBD1` seed
3. Verify integrity before distribution
4. Publish the seed if needed, including via IPFS
5. Fetch and decode later using the genome
6. Run strict verification when exact restore is required

## Stability

Seedbraid has been production-ready since v2.0.0.

Before deploying to your environment, validate behavior in your own runtime, storage, and network configuration.

Treat successful `verify --strict` and bit-perfect restore checks as release gates.

## Production Validation Checklist

Before using Seedbraid in CI/CD or production pipelines, run a strict smoke workflow like this:

```bash
uv sync --no-editable --extra dev

workdir="$(mktemp -d)"
python3 - <<'PY' "$workdir/input.bin"
from pathlib import Path
import sys

out = Path(sys.argv[1])
payload = (b"seedbraid-beta-smoke" * 20000) + bytes(range(256)) * 200
out.write_bytes(payload)
print(f"wrote {out} bytes={len(payload)}")
PY

uv run --no-sync --no-editable seedbraid encode "$workdir/input.bin" \
  --genome "$workdir/genome" \
  --out "$workdir/seed.sbd" \
  --chunker cdc_buzhash \
  --avg 65536 --min 16384 --max 262144 \
  --learn --portable --compression zlib

uv run --no-sync --no-editable seedbraid verify "$workdir/seed.sbd" \
  --genome "$workdir/genome" \
  --strict

uv run --no-sync --no-editable seedbraid decode "$workdir/seed.sbd" \
  --genome "$workdir/genome" \
  --out "$workdir/decoded.bin"

cmp -s "$workdir/input.bin" "$workdir/decoded.bin" \
  && echo "bit-perfect roundtrip: OK"
```

## CLI Reference

> All examples below use bare `seedbraid`. If you run Seedbraid through `uvx` rather than installing it, prefix each command with `uvx`.
> For development builds, use `uv run --no-editable seedbraid`.

### Core Commands

#### Encode
```bash
seedbraid encode input.bin --genome ./genome --out seed.sbd

seedbraid encode input.bin --genome ./genome --out seed.sbd \
  --chunker cdc_buzhash --avg 65536 --min 16384 --max 262144 \
  --learn --no-portable --compression zlib

seedbraid encode input.bin --genome ./genome --out seed.private.sbd \
  --manifest-private

export SB_ENCRYPTION_KEY='your-secret-passphrase'
seedbraid encode input.bin --genome ./genome --out seed.encrypted.sbd \
  --encrypt --manifest-private
```

#### Decode
```bash
seedbraid decode seed.sbd --genome ./genome --out recovered.bin

seedbraid decode seed.encrypted.sbd --genome ./genome --out recovered.bin \
  --encryption-key "$SB_ENCRYPTION_KEY"
```

#### Verify
```bash
seedbraid verify seed.sbd --genome ./genome
seedbraid verify seed.sbd --genome ./genome --strict
seedbraid verify seed.sbd --genome ./genome --require-signature --signature-key "$SB_SIGNING_KEY"
seedbraid verify seed.encrypted.sbd --genome ./genome --strict \
  --encryption-key "$SB_ENCRYPTION_KEY"
```

`verify` supports two modes:

- Quick mode: checks seed integrity and required chunk availability
- Strict mode: reconstructs all content and enforces source size and SHA-256 match
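
The strict-mode check can be illustrated with a minimal Python sketch. It is not Seedbraid's implementation: `strict_check` and its parameters are hypothetical names, and the expected size and digest are assumed to come from the seed's manifest.

```python
import hashlib
from typing import Iterable


def strict_check(chunks: Iterable[bytes], expected_size: int, expected_sha256: str) -> bool:
    """Stream reconstructed chunks and compare total size and SHA-256 with the expected values."""
    digest = hashlib.sha256()
    total = 0
    for chunk in chunks:
        digest.update(chunk)
        total += len(chunk)
    return total == expected_size and digest.hexdigest() == expected_sha256


# Illustrative data only: split a payload into pieces, then check the full and a truncated stream.
original = b"seedbraid" * 1000
pieces = [original[i:i + 4096] for i in range(0, len(original), 4096)]
expected = hashlib.sha256(original).hexdigest()

print(strict_check(pieces, len(original), expected))       # True
print(strict_check(pieces[:-1], len(original), expected))  # False: last piece missing
```

Hashing chunk by chunk keeps memory use flat, which matters when the reconstructed artifact is large.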

#### Prime
```bash
seedbraid prime "./dataset/**/*" --genome ./genome --chunker cdc_buzhash
```

#### Doctor
```bash
seedbraid doctor --genome ./genome
```

`doctor` checks:

- Python runtime compatibility (`>=3.12`)
- kubo API reachability (`SB_KUBO_API`)
- `IPFS_PATH` state
- genome path writability
- compression support (`zlib`, optional `zstd`)

### Advanced Commands

#### Genome Snapshot / Restore
```bash
seedbraid genome snapshot --genome ./genome --out genome.sgs
seedbraid genome restore genome.sgs --genome ./genome-dr --replace
```

#### Publish Chunks to IPFS
```bash
seedbraid publish-chunks seed.sbd --genome ./genome
seedbraid publish-chunks seed.sbd --genome ./genome \
  --manifest-out chunks.json --workers 32
seedbraid publish-chunks seed.sbd --genome ./genome \
  --pin --remote-pin \
  --remote-endpoint https://pin.example/api/v1 \
  --remote-token "$SB_PINNING_TOKEN"
```

`publish-chunks` publishes all CDC chunks referenced by a seed to IPFS as raw blocks, generates a chunk manifest sidecar (`.sbd.chunks.json`), and optionally pins the chunk DAG locally or via a remote pinning provider.

#### Fetch and Decode from IPFS
```bash
seedbraid fetch-decode seed.sbd --out recovered.bin
seedbraid fetch-decode seed.sbd --out recovered.bin \
  --workers 64 --batch-size 200 --retries 5
seedbraid fetch-decode seed.sbd --out recovered.bin \
  --gateway https://ipfs.io/ipfs
```

`fetch-decode` reads a seed and its chunk manifest, fetches all chunks from IPFS in parallel batches, and reconstructs the original file. It requires the chunk manifest sidecar (`.sbd.chunks.json`) alongside the seed.
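
As a rough sketch of what "parallel batches" means, the Python below splits a list of chunk identifiers into batches and fetches each batch with a thread pool. The names `batched`, `fetch_all`, and `fetch_one` are hypothetical, and how Seedbraid actually maps `--workers` and `--batch-size` onto its fetch loop is an assumption here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fetch_all(cids: Sequence[str], fetch_one: Callable[[str], bytes],
              workers: int, batch_size: int) -> list[bytes]:
    """Fetch every CID, one batch at a time, with up to `workers` concurrent calls."""
    results: list[bytes] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(cids, batch_size):
            # pool.map returns results in input order, so chunk order is preserved.
            results.extend(pool.map(fetch_one, batch))
    return results


# Stand-in fetcher for illustration; a real one would talk to IPFS.
fake_cids = [f"cid-{n}" for n in range(10)]
chunks = fetch_all(fake_cids, lambda cid: cid.encode(), workers=4, batch_size=3)
print(len(chunks))  # 10
```

Preserving order matters because the recipe in the seed determines how chunks are concatenated.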

#### Decode with IPFS Genome
```bash
seedbraid decode seed.sbd --genome ipfs:// --out recovered.bin
seedbraid decode seed.sbd --genome ipfs:///path/to/cache --out recovered.bin
seedbraid decode seed.sbd --genome ipfs:// --out recovered.bin \
  --gateway https://ipfs.io/ipfs
```

Using `--genome ipfs://` activates hybrid storage: chunks are fetched from IPFS with local SQLite caching. `ipfs://` uses a temporary cache; `ipfs:///path/to/cache` persists fetched chunks for future reuse.
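
The read-through caching idea can be sketched in Python with the standard `sqlite3` module. This is an illustration under stated assumptions, not Seedbraid's storage layer: the `CachedChunkStore` class, the `chunks` table schema, and the `fetch_remote` callback are all hypothetical.

```python
import sqlite3
from typing import Callable


class CachedChunkStore:
    """Look up chunks in a local SQLite cache first, falling back to a remote fetch."""

    def __init__(self, db_path: str, fetch_remote: Callable[[str], bytes]) -> None:
        # ":memory:" gives a temporary cache; a file path persists across runs.
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS chunks (key TEXT PRIMARY KEY, data BLOB NOT NULL)"
        )
        self._fetch_remote = fetch_remote

    def get(self, key: str) -> bytes:
        row = self._db.execute("SELECT data FROM chunks WHERE key = ?", (key,)).fetchone()
        if row is not None:
            return row[0]  # cache hit: no network access needed
        data = self._fetch_remote(key)
        self._db.execute("INSERT OR REPLACE INTO chunks (key, data) VALUES (?, ?)", (key, data))
        self._db.commit()
        return data


remote_calls: list[str] = []


def fake_remote(key: str) -> bytes:
    """Stand-in for an IPFS fetch; records each call so cache hits are visible."""
    remote_calls.append(key)
    return b"chunk:" + key.encode()


store = CachedChunkStore(":memory:", fake_remote)
store.get("abc")
store.get("abc")
print(remote_calls)  # ['abc']: the second lookup was served from the cache
```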

#### Publish to IPFS
```bash
seedbraid publish seed.sbd --no-pin
seedbraid publish seed.sbd --pin
seedbraid publish seed.sbd --remote-pin \
  --remote-endpoint https://pin.example/api/v1 --remote-token "$SB_PINNING_TOKEN"
```

`publish` emits a warning when the seed is unencrypted. For sensitive data, prefer:

```bash
seedbraid encode --encrypt --manifest-private ...
```

When `--remote-pin` is enabled, Seedbraid also registers the CID with a configured Pinning Services API-compatible provider.
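
For orientation, the IPFS Pinning Service API registers a pin with a `POST` to `<endpoint>/pins`, a bearer token, and a JSON body containing the CID. The Python sketch below only builds such a request without sending it. `build_pin_request` is a hypothetical helper, the endpoint and token are the placeholders used above, and the exact request Seedbraid sends is not shown here.

```python
import json
import urllib.request


def build_pin_request(endpoint: str, token: str, cid: str, name: str) -> urllib.request.Request:
    """Build (but do not send) a Pinning Service API 'add pin' request."""
    body = json.dumps({"cid": cid, "name": name}).encode("utf-8")
    return urllib.request.Request(
        url=endpoint.rstrip("/") + "/pins",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; substitute your provider's endpoint, token, and a real CID.
request = build_pin_request("https://pin.example/api/v1", "your-api-token", "<cid>", "seed.sbd")
print(request.get_method(), request.full_url)
```

Inspecting the request before sending it is a quick way to confirm the endpoint and token wiring when a provider rejects a pin.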

#### Fetch from IPFS
```bash
seedbraid fetch <cid> --out fetched.sbd
seedbraid fetch <cid> --out fetched.sbd --retries 5 --backoff-ms 300
seedbraid fetch <cid> --out fetched.sbd --gateway https://ipfs.io/ipfs
```

`fetch` retrieves the seed through the kubo HTTP API, retrying with exponential backoff on failure, and can fall back to an HTTP gateway when `--gateway` is given.
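
Exponential backoff can be sketched in a few lines of Python. The sketch assumes that `--retries` counts retries after the first attempt and that `--backoff-ms` is a base delay that doubles each time; both are assumptions about the CLI, and `backoff_delays` and `fetch_with_retry` are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delays(retries: int, backoff_ms: int) -> list[float]:
    """Delay in seconds before each retry: base, 2x base, 4x base, ..."""
    return [backoff_ms * (2 ** attempt) / 1000 for attempt in range(retries)]


def fetch_with_retry(fetch: Callable[[], T], retries: int, backoff_ms: int,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fetch(); after each OSError wait the next delay, then try again."""
    for delay in backoff_delays(retries, backoff_ms):
        try:
            return fetch()
        except OSError:
            sleep(delay)
    return fetch()  # final attempt; any error now propagates to the caller


attempts = {"count": 0}


def flaky() -> bytes:
    """Stand-in fetch that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError("temporary failure")
    return b"seed-bytes"


waited: list[float] = []
result = fetch_with_retry(flaky, retries=5, backoff_ms=300, sleep=waited.append)
print(result, waited)  # b'seed-bytes' [0.3, 0.6]
```

Passing `sleep` as a parameter lets the example record delays instead of actually waiting.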

#### Pin Health
```bash
seedbraid pin-health <cid>
```

#### Remote Pin Add
```bash
export SB_PINNING_ENDPOINT='https://pin.example/api/v1'
export SB_PINNING_TOKEN='your-api-token'
seedbraid pin remote-add <cid>
```

#### Sign Seed
```bash
export SB_SIGNING_KEY='your-shared-secret'
seedbraid sign seed.sbd --out seed.signed.sbd --key-env SB_SIGNING_KEY --key-id team-a
```

#### Export / Import Genes
```bash
seedbraid export-genes seed.sbd --genome ./genome --out genes.pack
seedbraid import-genes genes.pack --genome ./another-genome
```

## Generate an Encryption Key

Generate a high-entropy key for `SB_ENCRYPTION_KEY`:

```bash
seedbraid gen-encryption-key
```

Print shell export format:

```bash
seedbraid gen-encryption-key --shell
```

Set the current shell variable directly:

```bash
eval "$(seedbraid gen-encryption-key --shell)"
```

## IPFS Setup

Start the kubo daemon:

```bash
ipfs daemon
```

By default, Seedbraid connects to the kubo HTTP API at
`http://127.0.0.1:5001/api/v0`. Override it with the `SB_KUBO_API`
environment variable:

```bash
export SB_KUBO_API=http://127.0.0.1:5001/api/v0
```

Run `seedbraid doctor` to verify connectivity.

## Remote Pinning Setup

To use a remote pinning service, set the endpoint and token as environment variables.

Using a shell profile (`~/.bashrc`, `~/.zshrc`):

```bash
export SB_PINNING_ENDPOINT='https://api.pinata.cloud/psa'
export SB_PINNING_TOKEN='your-api-token'
```

Using [direnv](https://direnv.net/) (`.envrc` in your project directory):

```bash
# .envrc
export SB_PINNING_ENDPOINT='https://api.pinata.cloud/psa'
export SB_PINNING_TOKEN='your-api-token'
```

With these variables set, `--remote-pin` works without passing `--remote-endpoint` and `--remote-token` each time.

### Verifying a Remote Pin

After publishing with `--remote-pin`, confirm the pin is active:

```bash
# 1. Check local pin and block availability
seedbraid pin-health <cid>

# 2. Verify the pinned content is fetchable from the network
seedbraid fetch <cid> --out /tmp/verify.sbd
seedbraid verify /tmp/verify.sbd --genome ./genome --strict
```

If `pin-health` reports the CID is pinned and `fetch` + `verify --strict` succeed, the remote pin is working correctly.

## Common Failures

- `kubo daemon not reachable`
  - Install Kubo, start the daemon with `ipfs daemon`, and verify with `seedbraid doctor`
- `Missing required chunk` on decode or verify
  - Provide the correct `--genome`, or re-encode with `--portable`
- `zstd` compression error
  - Install optional dependency `zstandard`, or use `--compression zlib`

## Data Recovery Guide

Reconstructing a file requires **two things**: a **seed** (the recipe describing chunk order) and the **chunks** themselves (the actual data). If either is missing, recovery is impossible.

### When Recovery Succeeds

| Scenario | Why It Works |
|---|---|
| Seed on hand + local genome available | Recipe and ingredients are both local |
| Seed on hand + own IPFS node running with chunks pinned | Recipe is local; ingredients are in your node's storage |
| Seed on hand + chunks held by a pinning service (Pinata, etc.) | Recipe is local; ingredients are in a paid storage provider |
| Seed on hand + teammate's IPFS node holds the chunks | Recipe is local; ingredients are on a peer's node |
| Seed created with `--portable` (chunks embedded in seed) | Recipe and ingredients are bundled together in one file |
| Seed on hand + genome snapshot (`.sgs` backup) exists | Recipe is local; ingredients are in a backup archive |

### When Recovery Fails

| Scenario | Why It Fails |
|---|---|
| **Seed file lost** | Without the recipe, there is no way to know which chunks to fetch or how to reassemble them |
| Seed exists, but genome deleted and chunks never published to IPFS | Recipe exists, but all ingredients have been discarded |
| Seed exists, but IPFS node stopped and no other node holds the chunks | Recipe exists, but the only store that had the ingredients is offline |
| Seed exists, but IPFS pin removed and garbage collection ran | Recipe exists, but automatic cleanup deleted the ingredients |
| Seed exists, but pinning service subscription expired | Recipe exists, but the storage provider disposed of the ingredients |
| Seed exists, but **even one chunk** is missing from all sources | Partial recovery is not supported; every chunk is required |
| Seed is encrypted and the **encryption key is lost** | The recipe is unreadable without the key |
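
The all-or-nothing rule in the table above can be made concrete with a small Python sketch. It models the recipe as an ordered list of chunk digests and the chunk source as a dictionary; `reconstruct` and `MissingChunkError` are hypothetical names, and the real `SBD1` recipe layout is described in `docs/FORMAT.md`, not here.

```python
import hashlib


class MissingChunkError(Exception):
    """Raised when at least one chunk named by the recipe is unavailable."""


def reconstruct(recipe: list[str], store: dict[str, bytes]) -> bytes:
    """Concatenate chunks in recipe order, refusing to return partial output."""
    missing = [digest for digest in recipe if digest not in store]
    if missing:
        raise MissingChunkError(f"{len(missing)} chunk reference(s) cannot be resolved")
    return b"".join(store[digest] for digest in recipe)


# Illustrative data: the repeated chunk is stored once but referenced twice.
parts = [b"alpha", b"beta", b"alpha"]
store = {hashlib.sha256(p).hexdigest(): p for p in parts}
recipe = [hashlib.sha256(p).hexdigest() for p in parts]

print(reconstruct(recipe, store))  # b'alphabetaalpha'

del store[hashlib.sha256(b"beta").hexdigest()]
try:
    reconstruct(recipe, store)
except MissingChunkError as exc:
    print("recovery failed:", exc)
```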

### Protecting Against Data Loss

| Action | Risk Mitigated |
|---|---|
| Back up seed files | Prevents seed loss |
| Use `--pin` when publishing chunks | Keeps chunks from being removed by IPFS garbage collection |
| Use a pinning service (`--remote-pin`) | Survives local node shutdown |
| Encode with `--portable` | Self-contained seed; no external chunk source needed (seed size increases) |
| Keep encryption keys in a secret manager | Prevents key loss for encrypted seeds |
| Take genome snapshots (`genome snapshot`) | Preserves local chunk data independently of IPFS |

> **Safest option:** `--portable` embeds all chunks in the seed, making it fully self-contained. The trade-off is that the seed grows to roughly the size of the original file, reducing the benefit of IPFS distribution.

## Troubleshooting Matrix

| Symptom | Error Code | Next Action |
|---|---|---|
| Encryption requested but key missing | `SB_E_ENCRYPTION_KEY_MISSING` | Pass `--encryption-key` or set `SB_ENCRYPTION_KEY`. |
| Signing requested but key missing | `SB_E_SIGNING_KEY_MISSING` | Export signing key env var and retry `seedbraid sign`. |
| Kubo daemon unreachable | `SB_E_IPFS_NOT_FOUND` | Install Kubo, run `ipfs daemon`, and set `SB_KUBO_API` if the endpoint is non-default. |
| IPFS fetch/publish failure | `SB_E_IPFS_FETCH` / `SB_E_IPFS_PUBLISH` | Check daemon/network, retry, use gateway fallback if needed. |
| Remote pin configuration missing | `SB_E_REMOTE_PIN_CONFIG` | Set endpoint/token env vars or pass options. |
| Remote pin auth failed | `SB_E_REMOTE_PIN_AUTH` | Verify provider token permissions and retry. |
| Remote pin request invalid | `SB_E_REMOTE_PIN_REQUEST` | Check CID/provider options and retry. |
| Remote pin timeout/failure | `SB_E_REMOTE_PIN_TIMEOUT` / `SB_E_REMOTE_PIN` | Increase retries/timeout or check provider health. |
| Seed parse/integrity failure | `SB_E_SEED_FORMAT` | Re-fetch/rebuild seed and verify source integrity. |
| IPFS chunk publish failed | `SB_E_IPFS_CHUNK_PUT` | Check IPFS daemon, retry, verify chunk availability. |
| IPFS chunk fetch failed | `SB_E_IPFS_CHUNK_GET` | Check daemon/network, retry, use `--gateway` fallback. |
| Chunk manifest invalid | `SB_E_CHUNK_MANIFEST_FORMAT` | Regenerate manifest with `publish-chunks`. |
| IPFS MFS operation failed | `SB_E_IPFS_MFS` | Verify daemon is running with `seedbraid doctor`. |

---

# Development & Contributing

The sections below are for contributors and developers working on Seedbraid itself.

## Development Setup

```bash
uv sync --no-editable --extra dev
```

Optional zstd support:

```bash
uv sync --no-editable --extra dev --extra zstd
```

Refresh the lockfile after dependency changes:

```bash
uv lock
```

## Local Checks

```bash
UV_CACHE_DIR=.uv-cache uv run --no-editable ruff check .
PYTHONPATH=src uv run --no-editable python -m pytest
PYTHONPATH=src uv run --no-editable python -m pytest tests/test_compat_fixtures.py
```

IPFS tests auto-skip when the kubo daemon is not reachable.

Compatibility fixtures are stored in `tests/fixtures/compat/v1/` and validated by `tests/test_compat_fixtures.py`.

To regenerate them intentionally:

```bash
uv run --no-editable python scripts/gen_compat_fixtures.py
```

## CI

GitHub Actions workflows:

- `.github/workflows/ci.yml`
  - `ruff check .`
  - `python -m pytest`
  - compatibility fixtures validation
  - benchmark gate
- `.github/workflows/publish-seed.yml`
  - manual only, `dry_run=true` by default
  - generates a seed from `source_path`
  - runs `seedbraid verify --strict`
  - publishes to IPFS only when `dry_run=false`
  - installs Kubo when needed
  - verifies Kubo release signature status and checksum
  - supports `pin`, `portable`, `manifest_private`, and optional `encrypt`

Local parity commands:

```bash
uv sync --no-editable --extra dev
uv run --no-sync --no-editable ruff check .
PYTHONPATH=src uv run --no-sync --no-editable python -m pytest
PYTHONPATH=src uv run --no-sync --no-editable python -m pytest tests/test_compat_fixtures.py
uv run --no-sync --no-editable python scripts/bench_gate.py \
  --min-reuse-improvement-bps 1 \
  --max-seed-size-ratio 1.20 \
  --min-cdc-throughput-mib-s 0.10 \
  --json-out .artifacts/bench-report.json
```

## Benchmarking

### 1-byte insertion dedup benchmark
```bash
uv run --no-editable python scripts/bench_shifted_dedup.py
uv run --no-editable python scripts/bench_gate.py \
  --min-reuse-improvement-bps 1 \
  --max-seed-size-ratio 1.20 \
  --min-cdc-throughput-mib-s 0.10 \
  --json-out .artifacts/bench-report.json
```

Expected behavior:

- `cdc_buzhash` should show better reuse than `fixed` when a single-byte insertion shifts offsets
- `bench_gate.py` exits non-zero when configured thresholds are violated
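
The reason behind the first expectation can be reproduced with a toy experiment in Python. The chunker below is a simplified gear-style rolling hash written for illustration; it is not Seedbraid's `cdc_buzhash`, and the table seed, chunk sizes, and `reuse_ratio` metric are all assumptions chosen to keep the example small.

```python
import hashlib
import random

# Fixed pseudo-random table so chunk boundaries are deterministic across runs.
_rng = random.Random(1)
_GEAR = [_rng.getrandbits(32) for _ in range(256)]


def fixed_chunks(data: bytes, size: int = 256) -> list[bytes]:
    """Cut at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, min_size: int = 64, mask: int = 0xFF,
               max_size: int = 1024) -> list[bytes]:
    """Cut where a rolling hash of recent bytes matches a pattern (content-defined)."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + _GEAR[byte]) & 0xFFFFFFFF  # older bytes shift out of the hash
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def reuse_ratio(old: list[bytes], new: list[bytes]) -> float:
    """Fraction of the new data's bytes covered by chunks already seen in the old data."""
    known = {hashlib.sha256(c).digest() for c in old}
    reused = sum(len(c) for c in new if hashlib.sha256(c).digest() in known)
    return reused / sum(len(c) for c in new)


data = random.Random(42).randbytes(65536)
shifted = b"\x00" + data  # single-byte insertion at the front shifts every offset

print("fixed reuse:", reuse_ratio(fixed_chunks(data), fixed_chunks(shifted)))
print("cdc reuse:  ", reuse_ratio(cdc_chunks(data), cdc_chunks(shifted)))
```

With fixed offsets every chunk after the insertion contains different bytes, while content-defined boundaries move with the data and realign after the edit.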

## Integrations

### DVC Integration
- Minimal DVC bridge lives in `examples/dvc/`
- Pipeline stages are `encode -> verify --strict -> fetch`
- The integration recipe and artifact layout are documented in `examples/dvc/README.md`

### OCI Integration
- ORAS bridge scripts and usage docs live in `examples/oci/`
- Default OCI metadata convention:
  - artifact type: `application/vnd.seedbraid.seed.v1`
  - layer media type: `application/vnd.seedbraid.seed.layer.v1+sbd`
  - annotations: source SHA-256, chunker, manifest-private flag, seed title
- Push/pull scripts:
  - `examples/oci/scripts/push_seed.sh <seed.sbd> <registry/repository:tag>`
  - `examples/oci/scripts/pull_seed.sh <registry/repository:tag> <out.sbd>`
- After pull, run strict verification:
  - `seedbraid verify <out.sbd> --genome <genome-path> --strict`

### ML Tooling Hooks
- Scripts for MLflow metadata logging and Hugging Face upload live in `examples/ml/`
- MLflow hook logs seed metadata fields
- Hugging Face hook uploads `seed.sbd` and a metadata sidecar
- Restore workflow is documented in `examples/ml/README.md`

## Roadmap

Current adoption priorities include:

- a faster onboarding path
- stronger benchmark evidence versus alternatives
- security and operator tooling such as signing, encryption, `doctor`, `snapshot`, and `restore`
- stable format governance and backward-compatibility policy for long-lived seed archives

## Project Documents

- Format spec: `docs/FORMAT.md`
- Design rationale: `docs/DESIGN.md`
- Threat model: `docs/THREAT_MODEL.md`
- Error codes: `docs/ERROR_CODES.md`
- Performance gates: `docs/PERFORMANCE.md`
- DVC example: `examples/dvc/README.md`
- OCI example: `examples/oci/README.md`
- ML tooling example: `examples/ml/README.md`

## Support Seedbraid

Seedbraid is maintained as an open-source project.

If Seedbraid helps your workflow, please consider supporting the project through the repository `Sponsor` button. Support goes directly toward maintenance, documentation, and compatibility/performance validation.

## Open Source Governance

- License: `MIT` (`LICENSE`)
- Security policy: `SECURITY.md`
- Contributing guide: `CONTRIBUTING.md`
- Code of Conduct: `CODE_OF_CONDUCT.md`
