Metadata-Version: 2.4
Name: seedbraid
Version: 1.1.2
Summary: Seedbraid reference-based reconstruction with CDC and IPFS seed transport
Author: aimsise
License-Expression: MIT
Project-URL: Homepage, https://github.com/aimsise/seedbraid
Project-URL: Repository, https://github.com/aimsise/seedbraid
Project-URL: Documentation, https://github.com/aimsise/seedbraid#readme
Project-URL: Issues, https://github.com/aimsise/seedbraid/issues
Project-URL: Changelog, https://github.com/aimsise/seedbraid/blob/main/CHANGELOG.md
Classifier: Development Status :: 4 - Beta
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: System :: Archiving
Classifier: Typing :: Typed
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typer>=0.12
Provides-Extra: zstd
Requires-Dist: zstandard>=0.23; extra == "zstd"
Provides-Extra: crypto
Requires-Dist: cryptography>=43.0; extra == "crypto"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.6; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: cryptography>=43.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.6; extra == "docs"
Requires-Dist: mkdocs-material>=9.5; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.25; extra == "docs"
Dynamic: license-file

# Seedbraid

[![CI](https://github.com/aimsise/seedbraid/actions/workflows/ci.yml/badge.svg)](https://github.com/aimsise/seedbraid/actions/workflows/ci.yml)

Seedbraid provides reference-based reconstruction with deterministic content-defined chunking (CDC), a binary SBD1 seed format, and IPFS publish/fetch transport.

## Beta Status (Read First)
- Seedbraid is currently in beta.
- Before production use, run strict validation in your own runtime/storage/network environment.
- Treat successful `verify --strict` and bit-perfect restore checks as release gates for your team.

## Strict Validation Workflow (Required Before Production)
Run the following smoke workflow before relying on Seedbraid in CI/CD or production pipelines:

```bash
uv sync --no-editable --extra dev

workdir="$(mktemp -d)"
python3 - <<'PY' "$workdir/input.bin"
from pathlib import Path
import sys

out = Path(sys.argv[1])
payload = (b"seedbraid-beta-smoke" * 20000) + bytes(range(256)) * 200
out.write_bytes(payload)
print(f"wrote {out} bytes={len(payload)}")
PY

uv run --no-sync --no-editable seedbraid encode "$workdir/input.bin" \
  --genome "$workdir/genome" \
  --out "$workdir/seed.sbd" \
  --chunker cdc_buzhash \
  --avg 65536 --min 16384 --max 262144 \
  --learn --portable --compression zlib

uv run --no-sync --no-editable seedbraid verify "$workdir/seed.sbd" \
  --genome "$workdir/genome" \
  --strict

uv run --no-sync --no-editable seedbraid decode "$workdir/seed.sbd" \
  --genome "$workdir/genome" \
  --out "$workdir/decoded.bin"

cmp -s "$workdir/input.bin" "$workdir/decoded.bin" \
  && echo "bit-perfect roundtrip: OK" \
  || echo "bit-perfect roundtrip: FAILED" >&2

UV_CACHE_DIR=.uv-cache uv run --no-sync --no-editable ruff check .
PYTHONPATH=src UV_CACHE_DIR=.uv-cache uv run --no-sync --no-editable python -m pytest
```
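
The `cmp -s` step above can also be scripted when the gate runs inside a Python test harness. The following is a minimal sketch, not part of Seedbraid: the helper names `sha256_file` and `roundtrip_ok` are hypothetical, and it only compares file size and SHA-256 digest of the original and decoded files.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while block := handle.read(block_size):
            digest.update(block)
    return digest.hexdigest()


def roundtrip_ok(original: Path, decoded: Path) -> bool:
    """True when both files have the same size and the same SHA-256 digest."""
    if original.stat().st_size != decoded.stat().st_size:
        return False
    return sha256_file(original) == sha256_file(decoded)
```

A caller would pass the paths used in the workflow, for example `roundtrip_ok(Path(workdir) / "input.bin", Path(workdir) / "decoded.bin")`, and fail the pipeline when it returns `False`.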

## Features
- Lossless encode/decode with SHA-256 verification.
- Chunkers: `fixed`, `cdc_buzhash`, `cdc_rabin`.
- Genome storage (SQLite) for deduplicated chunk reuse.
- SBD1 binary seed container (`manifest + recipe + optional RAW + integrity`).
- IPFS CLI integration (`publish`, `fetch`).
- Optional remote pin integration (`pin remote-add`, publish-time remote pin).

## Why Seedbraid
- Seed-first architecture: reconstruction intent is shipped as a compact `SBD1` seed (`manifest + recipe`) instead of shipping full blobs repeatedly.
- End-to-end integrity posture: strict verify mode, compatibility fixtures, and performance gates are built into the project workflow.
- Practical Web3 distribution: CID publish/fetch is part of the same CLI surface as encode/decode, reducing operational handoffs.
- Shift-resilient dedup by default: CDC is first-class and benchmarked against fixed chunking with reproducible scripts.

## Best-Fit Use Cases
- Large binary versioning: datasets, ML models, media assets, and VM images.
- Distribution of many similar files: share a common genome and distribute compact seeds.
- IPFS-based distribution and retrieval: distribute by CID and verify reconstruction integrity.
- Shift-heavy changes (for example, single-byte insertion): CDC improves reuse over fixed chunking.

## What It Takes for OSS Adoption
- A 5-minute onboarding path (installation + first encode/decode tutorial).
- Benchmark evidence that Seedbraid wins against alternatives on size, transfer time, and restore speed.
- Security and operations readiness: signing/encryption and operator tooling (`doctor`, `snapshot`, `restore`).
- Stable format governance and backward-compatibility policy for long-lived seed archives.

## Installation

### pip (PyPI)
```bash
pip install seedbraid
```

### pipx (isolated global install)
```bash
pipx install seedbraid
seedbraid --help
```

### uvx (ephemeral, no install needed)
```bash
uvx seedbraid --help
uvx seedbraid doctor
```

### With optional extras
```bash
# pip
pip install "seedbraid[zstd]"

# pipx
pipx install "seedbraid[zstd]"

# uvx
uvx --from "seedbraid[zstd]" seedbraid doctor
```

## Development Setup
```bash
uv sync --no-editable --extra dev
```

Optional zstd support:
```bash
uv sync --no-editable --extra dev --extra zstd
```

Refresh lockfile after dependency changes:
```bash
uv lock
```

## Generate Encryption Key
Generate a high-entropy key for `SB_ENCRYPTION_KEY`:
```bash
uv run --no-editable seedbraid gen-encryption-key
```

Print shell export format:
```bash
uv run --no-editable seedbraid gen-encryption-key --shell
```

Set current shell variable directly:
```bash
eval "$(uv run --no-editable seedbraid gen-encryption-key --shell)"
```

## CLI
### Encode
```bash
uv run --no-editable seedbraid encode input.bin --genome ./genome --out seed.sbd \
  --chunker cdc_buzhash --avg 65536 --min 16384 --max 262144 \
  --learn --no-portable --compression zlib

uv run --no-editable seedbraid encode input.bin --genome ./genome --out seed.private.sbd \
  --manifest-private

export SB_ENCRYPTION_KEY='your-secret-passphrase'
uv run --no-editable seedbraid encode input.bin --genome ./genome --out seed.encrypted.sbd \
  --encrypt --manifest-private
```

### Decode
```bash
uv run --no-editable seedbraid decode seed.sbd --genome ./genome --out recovered.bin
uv run --no-editable seedbraid decode seed.encrypted.sbd --genome ./genome --out recovered.bin \
  --encryption-key "$SB_ENCRYPTION_KEY"
```

### Verify
```bash
uv run --no-editable seedbraid verify seed.sbd --genome ./genome
uv run --no-editable seedbraid verify seed.sbd --genome ./genome --strict
uv run --no-editable seedbraid verify seed.sbd --genome ./genome --require-signature --signature-key "$SB_SIGNING_KEY"
uv run --no-editable seedbraid verify seed.encrypted.sbd --genome ./genome --strict \
  --encryption-key "$SB_ENCRYPTION_KEY"
```

`verify` supports two modes:
- Quick mode (default): checks seed integrity and required chunk availability.
- Strict mode (`--strict`): reconstructs all content and enforces source size and SHA-256 match.
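
When `verify` is used as a release gate from Python tooling, the two modes can be wrapped as below. This is a sketch under one stated assumption: that `seedbraid verify` exits with a non-zero status when verification fails. The function names are hypothetical; only flags shown in this README are used.

```python
import subprocess
from pathlib import Path


def build_verify_command(seed: Path, genome: Path, strict: bool = True) -> list[str]:
    """Build the argv for quick (default) or strict verification."""
    cmd = ["seedbraid", "verify", str(seed), "--genome", str(genome)]
    if strict:
        cmd.append("--strict")
    return cmd


def verify_gate(seed: Path, genome: Path, strict: bool = True) -> bool:
    """Run verify and report success.

    Assumption: a failed verification yields a non-zero exit code.
    """
    result = subprocess.run(build_verify_command(seed, genome, strict), check=False)
    return result.returncode == 0
```

A pipeline step could call `verify_gate(Path("seed.sbd"), Path("genome"))` and stop the release when it returns `False`.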

### Prime
```bash
uv run --no-editable seedbraid prime "./dataset/**/*" --genome ./genome --chunker cdc_buzhash
```

### Genome Snapshot / Restore
```bash
uv run --no-editable seedbraid genome snapshot --genome ./genome --out genome.sgs
uv run --no-editable seedbraid genome restore genome.sgs --genome ./genome-dr --replace
```

### Publish (IPFS)
```bash
uv run --no-editable seedbraid publish seed.sbd --no-pin
uv run --no-editable seedbraid publish seed.sbd --pin
uv run --no-editable seedbraid publish seed.sbd --remote-pin \
  --remote-endpoint https://pin.example/api/v1 --remote-token "$SB_PINNING_TOKEN"
```

`publish` emits a warning when the seed is unencrypted. For sensitive data, prefer
`seedbraid encode --encrypt --manifest-private ...` before publishing.
When `--remote-pin` is enabled, Seedbraid also registers the CID with the configured remote
pin provider (Pinning Services API-compatible).

### Fetch (IPFS)
```bash
uv run --no-editable seedbraid fetch <cid> --out fetched.sbd
uv run --no-editable seedbraid fetch <cid> --out fetched.sbd --retries 5 --backoff-ms 300
uv run --no-editable seedbraid fetch <cid> --out fetched.sbd --gateway https://ipfs.io/ipfs
```

`fetch` retries `ipfs cat` with exponential backoff and can fall back to an HTTP gateway.
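
The retry behavior can be pictured with a small sketch. It is illustrative only: the doubling factor, the reading of `--retries` as "retries after the first attempt", and the helper names are assumptions, not Seedbraid's documented implementation. Gateway fallback is left out.

```python
import time
from collections.abc import Callable
from typing import TypeVar

T = TypeVar("T")


def backoff_delays_ms(retries: int, backoff_ms: int, factor: float = 2.0) -> list[int]:
    """Delay before each retry: backoff_ms, then multiplied by factor each time."""
    return [int(backoff_ms * factor**attempt) for attempt in range(retries)]


def retry_with_backoff(
    operation: Callable[[], T],
    retries: int,
    backoff_ms: int,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, sleeping with exponential backoff after each OSError."""
    for delay in backoff_delays_ms(retries, backoff_ms):
        try:
            return operation()
        except OSError:
            sleep(delay / 1000)
    # Final attempt: any error now propagates to the caller.
    return operation()
```

Under these assumptions, `--retries 5 --backoff-ms 300` would correspond to waits of 300, 600, 1200, 2400, and 4800 ms between attempts.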

### Pin Health (IPFS)
```bash
uv run --no-editable seedbraid pin-health <cid>
```

### Remote Pin Add (IPFS)
```bash
export SB_PINNING_ENDPOINT='https://pin.example/api/v1'
export SB_PINNING_TOKEN='your-api-token'
uv run --no-editable seedbraid pin remote-add <cid>
```

### Doctor
```bash
uv run --no-editable seedbraid doctor --genome ./genome
```

`doctor` checks:
- Python runtime compatibility (>=3.12)
- IPFS CLI availability/version
- `IPFS_PATH` state
- genome path writability
- compression support (`zlib`, optional `zstd`)
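
For environments where Seedbraid is not installed yet, similar preflight checks can be approximated with the standard library. This is a rough sketch, not the `doctor` implementation: the function name and result keys are hypothetical, and it does not inspect the IPFS version or the contents of `IPFS_PATH`.

```python
import importlib.util
import os
import shutil
import sys


def preflight(genome_dir: str) -> dict[str, bool]:
    """Approximate the categories of checks listed for `doctor`."""
    return {
        # Python runtime compatibility (>=3.12)
        "python>=3.12": sys.version_info >= (3, 12),
        # IPFS CLI present on PATH
        "ipfs_cli": shutil.which("ipfs") is not None,
        # Whether IPFS_PATH is set at all
        "ipfs_path_set": "IPFS_PATH" in os.environ,
        # Genome directory exists and is writable
        "genome_writable": os.access(genome_dir, os.W_OK),
        # Compression support: zlib is stdlib, zstandard is optional
        "zlib": importlib.util.find_spec("zlib") is not None,
        "zstd": importlib.util.find_spec("zstandard") is not None,
    }
```

Printing the returned dictionary gives a quick pass/fail view per check; `seedbraid doctor` remains the authoritative diagnostic.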

### Sign Seed (optional)
```bash
export SB_SIGNING_KEY='your-shared-secret'
uv run --no-editable seedbraid sign seed.sbd --out seed.signed.sbd --key-env SB_SIGNING_KEY --key-id team-a
```

### Export / Import Genes (optional)
```bash
uv run --no-editable seedbraid export-genes seed.sbd --genome ./genome --out genes.pack
uv run --no-editable seedbraid import-genes genes.pack --genome ./another-genome
```

## IPFS Installation/Check
Check if IPFS CLI is available:
```bash
ipfs --version
```

If missing, install Kubo (IPFS CLI) and ensure `ipfs` is on your PATH.

## Common Failures
- `ipfs CLI not found`:
  - Install IPFS and verify with `ipfs --version`.
- `Missing required chunk` on decode/verify:
  - Provide the correct `--genome`, or re-encode with `--portable`.
- `zstd` compression error:
  - Install optional dependency `zstandard`, or use `--compression zlib`.

## Troubleshooting Matrix
| Symptom | Error Code | Next Action |
|---|---|---|
| Encryption requested but key missing | `SB_E_ENCRYPTION_KEY_MISSING` | Pass `--encryption-key` or set `SB_ENCRYPTION_KEY`. |
| Signing requested but key missing | `SB_E_SIGNING_KEY_MISSING` | Export signing key env var and retry `seedbraid sign`. |
| IPFS CLI missing | `SB_E_IPFS_NOT_FOUND` | Install Kubo and confirm `ipfs --version`. |
| IPFS fetch/publish failure | `SB_E_IPFS_FETCH` / `SB_E_IPFS_PUBLISH` | Check daemon/network, retry, use gateway fallback if needed. |
| Remote pin configuration missing | `SB_E_REMOTE_PIN_CONFIG` | Set endpoint/token env vars or pass options. |
| Remote pin auth failed | `SB_E_REMOTE_PIN_AUTH` | Verify provider token permissions and retry. |
| Remote pin request invalid | `SB_E_REMOTE_PIN_REQUEST` | Check CID/provider options and retry. |
| Remote pin timeout/failure | `SB_E_REMOTE_PIN_TIMEOUT` / `SB_E_REMOTE_PIN` | Increase retries/timeout or check provider health. |
| Seed parse/integrity failure | `SB_E_SEED_FORMAT` | Re-fetch/rebuild seed and verify source integrity. |
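
In automation, the matrix above can be turned into a lookup that scans captured CLI output for error codes. The sketch below assumes the `SB_E_*` code appears verbatim somewhere in the output; the exact message layout is not specified here, and the function name is hypothetical.

```python
import re

# Next actions copied from the troubleshooting matrix.
NEXT_ACTIONS = {
    "SB_E_ENCRYPTION_KEY_MISSING": "Pass --encryption-key or set SB_ENCRYPTION_KEY.",
    "SB_E_SIGNING_KEY_MISSING": "Export signing key env var and retry seedbraid sign.",
    "SB_E_IPFS_NOT_FOUND": "Install Kubo and confirm ipfs --version.",
    "SB_E_IPFS_FETCH": "Check daemon/network, retry, use gateway fallback if needed.",
    "SB_E_IPFS_PUBLISH": "Check daemon/network, retry, use gateway fallback if needed.",
    "SB_E_REMOTE_PIN_CONFIG": "Set endpoint/token env vars or pass options.",
    "SB_E_REMOTE_PIN_AUTH": "Verify provider token permissions and retry.",
    "SB_E_REMOTE_PIN_REQUEST": "Check CID/provider options and retry.",
    "SB_E_REMOTE_PIN_TIMEOUT": "Increase retries/timeout or check provider health.",
    "SB_E_REMOTE_PIN": "Increase retries/timeout or check provider health.",
    "SB_E_SEED_FORMAT": "Re-fetch/rebuild seed and verify source integrity.",
}

ERROR_CODE = re.compile(r"\bSB_E_[A-Z0-9_]+\b")


def suggest_actions(cli_output: str) -> list[tuple[str, str]]:
    """Return (code, next action) pairs for known codes, in first-seen order."""
    found: dict[str, str] = {}
    for code in ERROR_CODE.findall(cli_output):
        if code in NEXT_ACTIONS:
            found.setdefault(code, NEXT_ACTIONS[code])
    return list(found.items())
```

Feeding the captured stderr of a failed command to `suggest_actions` yields the matching rows of the matrix; unknown codes are ignored rather than guessed at.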

## CI (SBD-ECO-001)
GitHub Actions workflows:
- `.github/workflows/ci.yml`
  - Lint: `ruff check .`
  - Test: `python -m pytest`
  - Compatibility fixtures: `python -m pytest tests/test_compat_fixtures.py`
  - Benchmark gate: `python scripts/bench_gate.py ...`
- `.github/workflows/publish-seed.yml` (manual only, `dry_run=true` default)
  - Generates seed from `source_path` via `seedbraid encode`
  - Runs strict integrity check via `seedbraid verify --strict`
  - Publishes to IPFS only when `dry_run=false`
  - Installs Kubo (`ipfs` CLI) on runner when `dry_run=false` (version configurable via `kubo_version`)
  - Verifies Kubo release tag signature status via GitHub API before install
  - Verifies downloaded Kubo archive checksum (`sha512`) before extraction
  - Supports `pin`, `portable`, `manifest_private`, and optional `encrypt`
    (`SB_ENCRYPTION_KEY` secret required when `encrypt=true`)

Local parity commands:
```bash
uv sync --no-editable --extra dev
uv run --no-sync --no-editable ruff check .
PYTHONPATH=src uv run --no-sync --no-editable python -m pytest
PYTHONPATH=src uv run --no-sync --no-editable python -m pytest tests/test_compat_fixtures.py
uv run --no-sync --no-editable python scripts/bench_gate.py \
  --min-reuse-improvement-bps 1 \
  --max-seed-size-ratio 1.20 \
  --min-cdc-throughput-mib-s 0.10 \
  --json-out .artifacts/bench-report.json
```

## DVC Integration (SBD-ECO-003)
- Minimal DVC bridge lives in `examples/dvc/`.
- Pipeline stages are `encode -> verify --strict -> fetch`.
- The `verify` stage is strict and must fail pipeline reproduction on an integrity mismatch.
- Integration recipe and artifact layout are documented in `examples/dvc/README.md`.

## OCI Integration (SBD-ECO-004)
- ORAS bridge scripts and usage docs live in `examples/oci/`.
- Default OCI metadata convention:
  - artifact type: `application/vnd.seedbraid.seed.v1`
  - layer media type: `application/vnd.seedbraid.seed.layer.v1+sbd`
  - annotations: source SHA-256, chunker, manifest-private flag, seed title
- Push/pull scripts:
  - `examples/oci/scripts/push_seed.sh <seed.sbd> <registry/repository:tag>`
  - `examples/oci/scripts/pull_seed.sh <registry/repository:tag> <out.sbd>`
- After pull, run strict verification:
  - `seedbraid verify <out.sbd> --genome <genome-path> --strict`

## ML Tooling Hooks (SBD-ECO-005)
- Scripts for MLflow metadata logging and Hugging Face upload live in `examples/ml/`.
- MLflow hook logs seed metadata fields (seed digest, manifest provenance, optional transport refs).
- Hugging Face hook uploads `seed.sbd` + metadata sidecar with env-provided token credentials.
- Restore workflow from logged metadata is documented in `examples/ml/README.md`.

## Tests and CI-Equivalent Local Commands
```bash
uv run --no-editable ruff check .
uv run --no-editable python -m pytest
uv run --no-editable python -m pytest tests/test_compat_fixtures.py
```

IPFS tests auto-skip when `ipfs` is not installed.
Compatibility fixtures are stored in `tests/fixtures/compat/v1/` and are
validated by `tests/test_compat_fixtures.py`.
Regenerate intentionally with:
`uv run --no-editable python scripts/gen_compat_fixtures.py`.

## 1-byte Insertion Dedup Benchmark
Run:
```bash
uv run --no-editable python scripts/bench_shifted_dedup.py
uv run --no-editable python scripts/bench_gate.py \
  --min-reuse-improvement-bps 1 \
  --max-seed-size-ratio 1.20 \
  --min-cdc-throughput-mib-s 0.10 \
  --json-out .artifacts/bench-report.json
```

Expected behavior:
- `cdc_buzhash` should show better reuse than `fixed` when a single-byte insertion shifts offsets.
- `bench_gate.py` exits non-zero when configured thresholds are violated.
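
The effect behind the first expectation can be reproduced with a toy model. The sketch below is not Seedbraid's `cdc_buzhash`: it hashes a sliding window directly with BLAKE2b instead of using a rolling hash, and the window, mask, and size limits are arbitrary illustrative values. It only shows why content-defined cut points survive an insertion while fixed offsets do not.

```python
import hashlib
import random


def fixed_chunks(data: bytes, size: int = 128) -> list[bytes]:
    """Cut at fixed offsets regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, window: int = 16, mask: int = 0x3F,
               min_size: int = 32, max_size: int = 512) -> list[bytes]:
    """Cut where a hash of the trailing `window` bytes has its low bits zero."""
    chunks = []
    start = 0
    for end in range(1, len(data) + 1):
        length = end - start
        if length < min_size:
            continue
        tail = data[end - window:end]
        fingerprint = int.from_bytes(
            hashlib.blake2b(tail, digest_size=4).digest(), "big"
        )
        if (fingerprint & mask) == 0 or length >= max_size:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def reuse_ratio(original: list[bytes], modified: list[bytes]) -> float:
    """Fraction of modified chunks whose digest already exists in original."""
    known = {hashlib.sha256(c).digest() for c in original}
    reused = sum(1 for c in modified if hashlib.sha256(c).digest() in known)
    return reused / len(modified)


data = random.Random(0).randbytes(20000)
modified = data[:1000] + b"\x00" + data[1000:]  # single-byte insertion

fixed_reuse = reuse_ratio(fixed_chunks(data), fixed_chunks(modified))
cdc_reuse = reuse_ratio(cdc_chunks(data), cdc_chunks(modified))
print(f"fixed reuse={fixed_reuse:.3f} cdc reuse={cdc_reuse:.3f}")
```

With fixed chunking, every chunk after the insertion point shifts by one byte and stops matching. With content-defined cuts, boundaries realign shortly after the insertion, so most chunks are reused.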

## Project Documents
- Format spec: `docs/FORMAT.md`
- Design rationale: `docs/DESIGN.md`
- Threat model: `docs/THREAT_MODEL.md`
- Error codes: `docs/ERROR_CODES.md`
- Performance gates: `docs/PERFORMANCE.md`
- DVC workflow bridge example: `examples/dvc/README.md`
- OCI/ORAS distribution example: `examples/oci/README.md`
- ML tooling hooks example: `examples/ml/README.md`

## Support Seedbraid
- Seedbraid is maintained as an open-source project.
- If Seedbraid helps your workflow, please consider donating via the repository `Sponsor` button.
- Donations directly support maintenance, documentation, and compatibility/performance validation.

## Open Source Governance
- License: `MIT` (`LICENSE`)
- Security policy: `SECURITY.md`
- Contributing guide: `CONTRIBUTING.md`
- Code of Conduct: `CODE_OF_CONDUCT.md`
