Metadata-Version: 2.4
Name: coldcrate
Version: 0.1.0
Summary: Row-wise, self-describing, single-file cold-storage format with per-entry compression and encryption
Project-URL: Homepage, https://github.com/Larryvrh/coldcrate
Project-URL: Repository, https://github.com/Larryvrh/coldcrate
Project-URL: Issues, https://github.com/Larryvrh/coldcrate/issues
Author-email: larryvrh <larryvrh@gmail.com>
License: MIT
License-File: LICENSE
Keywords: append-only,archive,cold-storage,compression,encryption,file-format
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Security :: Cryptography
Classifier: Topic :: System :: Archiving
Requires-Python: >=3.9
Requires-Dist: xxhash>=3.0
Provides-Extra: all
Requires-Dist: cryptography>=41; extra == 'all'
Requires-Dist: lz4>=4.0; extra == 'all'
Requires-Dist: zstandard>=0.21; extra == 'all'
Provides-Extra: crypto
Requires-Dist: cryptography>=41; extra == 'crypto'
Provides-Extra: dev
Requires-Dist: cryptography>=41; extra == 'dev'
Requires-Dist: lz4>=4.0; extra == 'dev'
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: zstandard>=0.21; extra == 'dev'
Provides-Extra: lz4
Requires-Dist: lz4>=4.0; extra == 'lz4'
Provides-Extra: zstd
Requires-Dist: zstandard>=0.21; extra == 'zstd'
Description-Content-Type: text/markdown

<p align="center">
  <img src="https://raw.githubusercontent.com/Larryvrh/coldcrate/main/assets/logo.png" alt="ColdCrate" width="520">
</p>

<p align="center">
  <b>A row-wise, self-describing, single-file format for cold storage.</b><br>
  Structured rows + heavy blobs, archived once and read back by offset — with built-in compression and encryption.
</p>

<p align="center">
  <a href="https://pypi.org/project/coldcrate/"><img alt="PyPI" src="https://img.shields.io/pypi/v/coldcrate?color=2b7bba"></a>
  <img alt="Python" src="https://img.shields.io/badge/python-3.9%2B-2b7bba">
  <a href="LICENSE"><img alt="License" src="https://img.shields.io/badge/license-MIT-2b7bba"></a>
  <img alt="Format" src="https://img.shields.io/badge/format-v1-2b7bba">
  <img alt="Dependencies" src="https://img.shields.io/badge/core%20deps-xxhash-2b7bba">
</p>

<p align="center">
  <b>English</b> · <a href="README.zh-CN.md">简体中文</a>
</p>

---

## What is ColdCrate?

ColdCrate is a file **format** (and a small, dependency-light Python library) for archiving datasets where each record is a **structured row plus a heavy blob** — think images with metadata, embeddings, documents, model shards.

A *chunk* is a single file: a small header, an **embedded JSON schema** that fully describes every row, then an **append-only** stream of length-prefixed entries. Each entry's payload can be **compressed** (LZ4 / Zstd) and **encrypted** (AES-256-XTS, with a key derived from a passphrase) independently.

```
┌──────────────┬───────────────┬──────── append-only ────────────────┐
│  Header 128B │  JSON schema  │  entry · entry · entry · entry · …   │
└──────────────┴───────────────┴──────────────────────────────────────┘
                  ▲ travels with the data — the file explains itself
```

It deliberately ships **no built-in index**. Fast lookup is the caller's job via an external manifest of `(resource_id, offset)` — so the format stays a clean, predictable byte container.

## Why ColdCrate?

- 📦 **Self-describing.** The schema lives in the file. Hand someone a chunk and they can read every field name, type, and description — no side-channel docs. `scan()` can fully rebuild a lost manifest.
- 🧬 **Real types, real nesting.** `u8…u64 / i8…i64 / f32 / f64 / bool / bytes / utf8 / uuid / timestamp`, plus fixed/variable arrays and nested structs, arbitrarily composed.
- 🗜️ **Compression built in.** Per-entry LZ4 or Zstd with a tunable level; skip it per-entry for already-compressed blobs.
- 🔐 **Encryption built in.** AES-256-XTS, keyed from a passphrase via scrypt (the random salt and scrypt parameters are stored in the header). The schema is encrypted too, so field names don't leak. A wrong passphrase fails fast at `open()`.
- ➕ **Append-only, no seal.** Keep appending anytime; a crash never corrupts prior entries. Checksummed `scan()` recovers what's valid; `repair()` truncates trailing garbage.
- 🧊 **Built for scale.** Streaming write/read (flat memory regardless of chunk size), 8-byte aligned for `mmap`, and embarrassingly parallel across chunks — multi-GB chunks, thousands of them, are the design target.
- 🪶 **Light.** Core install pulls in only `xxhash`. Compression and crypto backends are optional extras, imported lazily.
- 🔎 **Fully typed.** Ships `py.typed` (PEP 561) and is mypy-strict clean, so your type checker sees every signature.

## Install

```bash
pip install coldcrate            # core (xxhash checksums only)
pip install coldcrate[zstd]      # + Zstd compression
pip install coldcrate[lz4]       # + LZ4 compression
pip install coldcrate[crypto]    # + AES-256-XTS encryption
pip install coldcrate[all]       # everything
```

Backends are imported lazily; using one you didn't install raises a clear `CompressionError` / `EncryptionError`.

## Quick start

```python
import coldcrate as cc

schema = cc.Schema(
    description="image dataset",
    fields=[
        cc.Field("source",     "utf8", description="origin URL"),
        cc.Field("category",   "utf8"),
        cc.Field("dimensions", cc.Struct([
            cc.Field("width",  "u32"),
            cc.Field("height", "u32"),
        ])),
        cc.Field("tags",       cc.VarArray("utf8")),
        cc.Field("embedding",  cc.FixedArray("f32", 768), nullable=True),
        cc.Field("image_data", "bytes"),
    ],
)

# --- write ---
manifest = []
with cc.ChunkWriter.create("images.coldcrate", schema, compression="zstd") as w:
    res = w.append(b"img-001", {
        "source": "http://example.com/a.jpg",
        "category": "cat",
        "dimensions": {"width": 800, "height": 600},
        "tags": ["cute", "outdoor"],
        "embedding": None,
        "image_data": jpeg_bytes,
    })
    manifest.append((b"img-001", res.offset))   # remember where it landed

# --- read by offset (manifest-driven) ---
with cc.ChunkReader.open("images.coldcrate") as r:
    entry = r.read_at(manifest[0][1])
    print(entry.fields["category"], entry.checksum_ok)

    # or sweep everything in order
    for entry in r.scan():
        ...
```

A **row is a `dict`** matching the schema: nested structs are nested dicts, arrays are lists. Validation is strict — out-of-range ints, wrong types, missing non-nullable fields, and unknown keys all raise `SchemaError` instead of being silently coerced.

## When to use it — and when not to

**Reach for ColdCrate when:**

- You archive many structured records, each with a sizable blob, and read them back **by offset** (or by full scan) rather than by ad-hoc query.
- You want **one self-contained file** that explains itself, optionally compressed and encrypted, with no database to run.
- Your access pattern is **write-mostly-once, read-occasionally** — cold storage, dataset shipping, ML training shards.

**Look elsewhere when:**

| You need… | Use instead |
| --- | --- |
| Ad-hoc queries / secondary indexes | SQLite, DuckDB, a real DB |
| Columnar analytics over wide tables | Parquet / Arrow |
| In-place random update or delete | a mutable store (ColdCrate is append-only) |
| Authentication against active tampering, out of the box | add an HMAC/signature yourself (XTS has no MAC — see [Encryption](#encryption)) |
| Reading one sub-field without touching the rest | — reads decode the **whole** row eagerly |
| A non-Python reader today | — only the Python implementation exists (the format is simple, though) |

## Core concepts

- **Chunk** — one file. Created once with a fixed header + schema, then appended to.
- **Schema** — the row definition, embedded as JSON. One schema per chunk; every entry conforms to it.
- **resource_id** — an opaque per-entry handle (1–512 bytes), always stored in plaintext (even in encrypted chunks), used to reference entries from a manifest.
- **Manifest** — *your* external index: typically `(resource_id, chunk_path, offset, …)` rows you collect from each `append()`. ColdCrate has no built-in index because one would only duplicate the manifest: without a manifest there is nothing to look a `resource_id` up against, and with one the offset already lives there. If the manifest is lost, a full `scan()` rebuilds it.
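
One way to keep such a manifest is a small SQLite table. The sketch below uses only the standard library; the table layout and the `open_manifest` / `record` / `lookup` helpers are assumptions for illustration, not part of ColdCrate. The `offset` values would come from each `append()` result.

```python
import sqlite3

def open_manifest(db_path=":memory:"):
    """Open (or create) a manifest database; the table layout is illustrative."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS manifest ("
        " resource_id BLOB NOT NULL,"
        " chunk_path  TEXT NOT NULL,"
        " offset      INTEGER NOT NULL)"
    )
    db.execute("CREATE INDEX IF NOT EXISTS by_rid ON manifest (resource_id)")
    return db

def record(db, resource_id, chunk_path, offset):
    """Remember where an appended entry landed."""
    db.execute(
        "INSERT INTO manifest VALUES (?, ?, ?)",
        (resource_id, chunk_path, offset),
    )

def lookup(db, resource_id):
    """Return (chunk_path, offset) of the most recently recorded entry, or None."""
    return db.execute(
        "SELECT chunk_path, offset FROM manifest"
        " WHERE resource_id = ? ORDER BY rowid DESC LIMIT 1",
        (resource_id,),
    ).fetchone()
```

After each append you would call something like `record(db, b"img-001", "images.coldcrate", res.offset)`, then `db.commit()` once per batch so the manifest survives a restart.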

## What's stored vs what you supply

A chunk is self-describing: the algorithms and parameters needed to read it are written into the header and schema, so there's no way to mis-specify them on open and silently corrupt a read.

| set at `create()` | stored in the chunk? | needed at read time? |
| --- | --- | --- |
| `schema` | ✅ embedded JSON | no — read from the file |
| `compression` algorithm | ✅ header | no |
| `encryption` algorithm | ✅ header | no |
| `kdf` params + random salt | ✅ header | no |
| `chunk_id`, `created_at` | ✅ header | no |
| `compression_level` | ❌ writer-side only | never — decompression doesn't need it |
| `passphrase` | ❌ it's the secret | **yes**, for encrypted chunks |

So the **only** thing you pass to `ChunkReader.open()` is the `passphrase`, and only for encrypted chunks. You can't "mismatch" the compression/encryption algorithm, level, or KDF — they come from the file, not from you. A **wrong passphrase fails fast at `open()`** (the encrypted schema won't decrypt), never a silent garbage read.

## Guide

### Schema & types

Primitives are strings; composites are helper objects, nesting arbitrarily.

| Type | Python value |
| --- | --- |
| `u8 u16 u32 u64 i8 i16 i32 i64` | `int` (range-checked) |
| `f32 f64` | `float` |
| `bool` | `bool` (strict — not `0/1`) |
| `bytes` | `bytes` / `bytearray` / `memoryview` |
| `utf8` | `str` |
| `uuid` | `uuid.UUID` (or 16 bytes) |
| `timestamp` | `int` — Unix microseconds (no timezone magic) |
| `Struct([Field, …])` | nested `dict` |
| `FixedArray(elem, n)` | `list` of exactly `n` |
| `VarArray(elem)` | `list` of any length |

Any `Field` can be `nullable=True` (value `None`, or omit the key). Nesting is capped at 64 levels and the embedded schema at 8 MiB on read, so a pathological or hostile schema raises a clean error instead of exhausting the stack or memory. A single variable-length field (`bytes`/`utf8`/array) is bounded by a u32 length prefix (~4 GiB).
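
Because `timestamp` is a plain `int` of Unix microseconds, converting to and from `datetime` is left to the caller. A minimal standard-library sketch (the helper names are hypothetical) that uses integer `timedelta` division to avoid float rounding:

```python
from datetime import datetime, timedelta, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def to_unix_micros(dt: datetime) -> int:
    """Convert a timezone-aware datetime to integer Unix microseconds."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone first")
    return (dt - _EPOCH) // timedelta(microseconds=1)

def from_unix_micros(us: int) -> datetime:
    """Convert integer Unix microseconds back to a UTC datetime."""
    return _EPOCH + timedelta(microseconds=us)
```

Rejecting naive datetimes keeps the stored value unambiguous, since the format itself carries no timezone information.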

### Compression

Set per chunk; opt out per entry. The level is a writer-side speed/ratio knob and is **not** stored (decompression never needs it).

```python
with cc.ChunkWriter.create("c.coldcrate", schema, compression="zstd", compression_level=19) as w:
    ...
    w.append(rid, row, compress=False)   # this blob is already compressed (e.g. JPEG)
```

### Encryption

The **passphrase is the only secret you supply.** A random salt and scrypt parameters are stored in the header, so the file fully describes how to re-derive its own key; the same passphrase yields different ciphertext across chunks.

```python
with cc.ChunkWriter.create(
    "secret.coldcrate", schema,
    compression="zstd", encryption="aes-256-xts", passphrase="correct horse",
    kdf=(18, 8, 1),                  # optional: raise scrypt log2n for cold storage
) as w:
    w.append(b"k", {...})

with cc.ChunkReader.open("secret.coldcrate", passphrase="correct horse") as r:
    entry = r.read_at(off)           # decrypted transparently
```
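
To see why a stored random salt makes the same passphrase produce different key material per chunk, here is a standard-library sketch of scrypt derivation. It only illustrates the general shape: the 64-byte output (AES-256-XTS uses two 256-bit keys), the 16-byte salt, and the `derive_key` helper are assumptions, not ColdCrate's documented internals.

```python
import hashlib
import os

def derive_key(passphrase: bytes, salt: bytes,
               log2n: int = 15, r: int = 8, p: int = 1) -> bytes:
    """Derive 64 bytes of key material with scrypt (illustrative parameters)."""
    n = 1 << log2n
    return hashlib.scrypt(
        passphrase, salt=salt, n=n, r=r, p=p,
        maxmem=2 * 128 * n * r,   # about twice scrypt's working set of 128*n*r bytes
        dklen=64,
    )

# Same passphrase, two random salts: the derived keys differ.
salt_a, salt_b = os.urandom(16), os.urandom(16)
key_a = derive_key(b"correct horse", salt_a, log2n=14)
key_b = derive_key(b"correct horse", salt_b, log2n=14)
```

Re-running `derive_key` with the same passphrase, salt, and cost parameters reproduces the same key, which is why storing the salt and parameters in the header is enough to reopen the chunk.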

When a chunk is encrypted, its **schema is encrypted too** — field names are as sensitive as values. A keyless `open()` still exposes the header, `resource_id`s, integrity checks, and `scan_raw()` (stored bytes), but `reader.schema is None` and field decoding needs the key. A **wrong passphrase fails at `open()`** (the schema won't decrypt).

> **Threat model.** AES-256-XTS provides confidentiality, not authentication (length-preserving → nowhere for a MAC). The XXH64 checksum detects *corruption*, not *tampering* (it's unkeyed). If active modification is in scope, layer an HMAC or signature over the chunk yourself.
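
If tampering is in scope, one option is a detached HMAC-SHA256 over the whole chunk file, using a key kept separately from the passphrase. The sketch below uses only the standard library; `chunk_mac` and `verify_chunk` are hypothetical helpers, and where the tag is stored (a sidecar file, the manifest) is up to you.

```python
import hashlib
import hmac

def chunk_mac(path, key, block_size=1 << 20):
    """Stream the file through HMAC-SHA256 and return the 32-byte tag."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            mac.update(block)
    return mac.digest()

def verify_chunk(path, key, expected_tag):
    """Compare against a previously stored tag in constant time."""
    return hmac.compare_digest(chunk_mac(path, key), expected_tag)
```

Because chunks are append-only, a whole-file tag goes stale after every append session and has to be recomputed once the writer is closed.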

### Deletion (tombstones)

Append-only ⇒ deletion is logical. `append_tombstone(rid)` writes a marker (tombstone flag, empty payload); a reader returns it with `tombstone is True` and `fields is None`. Deciding which earlier entries a tombstone cancels is the caller's job, as is enforcing `resource_id` uniqueness:

```python
w.append(b"img-001", {...})
w.append_tombstone(b"img-001")       # later marker logically deletes it

live = {}
for e in r.scan():
    if e.tombstone:
        live.pop(e.resource_id, None)
    else:
        live[e.resource_id] = e.offset
```

### Durability & recovery

`append()` writes the entry and nothing else. The header's `entry_count` / `tail_offset` counters are committed on `flush()` / `close()`; `flush(sync=True)` adds `fsync`. `tail_offset` is written last as a commit marker: a reader trusts the cached counters **only** if `tail_offset == file size`, otherwise both read as `None` — never a misleading stale value.
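
The commit-marker rule fits in a few lines. This is a sketch of the reader-side decision only, with hypothetical names; it is not ColdCrate's actual code.

```python
import os

def trusted_counters(path, cached_entry_count, cached_tail_offset):
    """Return (entry_count, tail_offset) if the cached header counters can be
    trusted, else (None, None).

    The counters are trusted only when the cached tail_offset equals the
    current file size, i.e. nothing was appended after the last commit.
    """
    if cached_tail_offset == os.path.getsize(path):
        return cached_entry_count, cached_tail_offset
    return None, None
```

Writing `tail_offset` last means a crash between an append and the next `flush()` leaves a file longer than the recorded tail, so the stale counters are simply ignored.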

After a crash, `coldcrate.repair(path)` finds the longest valid run of entries (checksum-validated, no passphrase needed), truncates the trailing partial bytes, and rewrites the counters. `ChunkWriter.open()` refuses to append to a dirty chunk until you have repaired it. A corrupt chunk never crashes the reader: `scan()` resyncs past damage and yields what's valid; any malformed input raises a clean `ColdCrateError`.

### Concurrency

One chunk has a **single writer**: `create()` / `open()` take a best-effort advisory exclusive lock (`fcntl.flock` where available), so a second writer fails fast instead of interleaving. **Readers take no lock** — many `ChunkReader`s (and threads sharing one, since `read_at` is positional) can read concurrently, including while a writer appends. Parallelism across chunks is unrestricted: each chunk is an independent file.

### Performance & scale

ColdCrate is streaming: only one entry is in memory at a time for both write and scan, so memory stays flat regardless of chunk or dataset size, and a single chunk can far exceed RAM. Indicative single-core throughput on **512 KiB incompressible payloads** (a realistic entry size for already-compressed media; your numbers will depend on data and hardware):

| pipeline | write | scan | random `read_at` |
| --- | --- | --- | --- |
| plain | ~2.5 GiB/s | ~3.8 GiB/s | ~4.3 GiB/s |
| + zstd | ~1.7 GiB/s | ~3.3 GiB/s | ~3.5 GiB/s |
| + AES-256-XTS | ~1.0 GiB/s | ~2.0 GiB/s | ~2.1 GiB/s |

It's memory-bandwidth-bound at these sizes. Encryption roughly halves write throughput; the gap is much larger for tiny entries (a few KiB), where the per-entry cipher setup dominates rather than the AES itself — so size your entries accordingly. Because chunks are independent, **aggregate throughput scales with cores** — on a 22-core host, 8 parallel encrypted+zstd writers reach ~3.5 GiB/s vs ~700 MiB/s for one (~5×). For multi-TB datasets, shard across chunks and run roughly one writer per core (or per machine):

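```python
from concurrent.futures import ProcessPoolExecutor

import coldcrate as cc

# SCHEMA (a cc.Schema) and shard_tasks (an iterable of (path, rows) pairs)
# are assumed to be defined by the caller at module level.

def write_shard(task):
    path, rows = task
    with cc.ChunkWriter.create(path, SCHEMA, compression="zstd") as w:
        for rid, row in rows:
            w.append(rid, row)

if __name__ == "__main__":   # needed where workers are spawned (Windows, macOS)
    with ProcessPoolExecutor(max_workers=16) as ex:
        list(ex.map(write_shard, shard_tasks))
```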
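
The `shard_tasks` iterable used above is left to the caller. One possible way to build it is to cut a stream of `(resource_id, row)` pairs into fixed-size batches, one per chunk file. The helper name and the path template below are hypothetical:

```python
from itertools import islice

def make_shard_tasks(rows, rows_per_shard, path_template="shard-{:05d}.coldcrate"):
    """Yield (path, batch) pairs, each batch holding up to rows_per_shard rows."""
    if rows_per_shard <= 0:
        raise ValueError("rows_per_shard must be positive")
    it = iter(rows)
    index = 0
    while True:
        batch = list(islice(it, rows_per_shard))   # pull the next batch lazily
        if not batch:
            break
        yield path_template.format(index), batch
        index += 1
```

Each batch is pickled and sent to a worker process, so pick a batch size that keeps a single task comfortably within memory.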

Measure on your own hardware:

```bash
python benchmarks/bench.py                          # compression × encryption matrix
python benchmarks/bench.py --codec                  # pure encode/decode throughput
python benchmarks/bench.py --parallel-chunks 16     # multi-process scaling
python benchmarks/bench.py --stress --target-gb 10  # sustained large write
python benchmarks/gil_scaling_probe.py              # why encryption isn't threaded
```

## Caveats & gotchas

Things worth knowing before you depend on it:

- **No built-in index.** You must keep a manifest of offsets, or `scan()` to find things. This is by design.
- **No authentication.** XTS protects confidentiality only; the checksum is unkeyed. Layer your own MAC/signature if tampering is a threat.
- **Reads decode the whole row.** There's no lazy/partial field access — fetching one sub-field still materializes the entire entry.
- **Big single payloads use a few × their size in RAM** transiently during compress/encrypt/decode. A single variable-length field caps at ~4 GiB; split larger blobs across entries.
- **Encrypted random access: reuse the reader.** `open()` runs scrypt (tens of ms). Opening per read makes the KDF dominate — keep readers open / pool them.
- **Compression `level` isn't stored**; it only affects the writer. Decompression works regardless.
- **One writer per chunk.** Concurrent writers are blocked where `flock` exists; where it doesn't (e.g. Windows), behaviour is undefined, so enforce single-writer discipline yourself.
- **Pre-1.0 format.** The on-disk layout (`FORMAT_VERSION = 1`) may change before 1.0; no cross-version compatibility guarantee yet.
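
For blobs above the ~4 GiB per-field cap, one workaround is to split the blob into parts and append each part as its own entry. A sketch with hypothetical naming follows; the `<resource_id>/<part index>` convention is an assumption for illustration, not a ColdCrate feature.

```python
def split_blob(resource_id: bytes, blob: bytes, part_size: int):
    """Yield (part_resource_id, part_bytes) pairs covering blob in order."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    view = memoryview(blob)   # slice the view, copying only once per part
    for index, start in enumerate(range(0, len(view), part_size)):
        part_id = resource_id + b"/%06d" % index
        yield part_id, bytes(view[start:start + part_size])
```

Each yielded pair can then be appended like any other entry. Keep the derived ids within the 1–512 byte `resource_id` limit, and record every part's offset in the manifest so the blob can be reassembled in order.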

---

## API reference

Everything is re-exported from the top-level `coldcrate` package (`import coldcrate as cc`). The package ships `py.typed` (PEP 561) and is fully annotated, so mypy / pyright resolve every signature below.

### Schema definition

#### `Schema(fields: list[Field], description: str | None = None, version: int = 1)`
The row definition embedded in a chunk. Validated on construction (`SchemaError` on a bad shape; nesting capped at 64).

| | |
| --- | --- |
| `fields: list[Field]` | ordered field definitions (serialisation order) |
| `description: str \| None` | optional human description |
| `version: int` | schema-format version (default `1`) |

Methods: `encode_row(row: dict) -> bytes`, `decode_row(buf: bytes | bytearray | memoryview) -> dict`, `to_dict() -> dict`, `to_json_bytes() -> bytes`, and classmethods `from_dict(d: dict) -> Schema`, `from_json_bytes(raw: bytes) -> Schema`.

#### `Field(name: str, type: TypeExpr, nullable: bool = False, description: str | None = None)`
One field. `type` is a type string or a composite (`Struct` / `FixedArray` / `VarArray`). `nullable=True` allows `None` (or omitting the key).

#### Type strings
`"u8" "u16" "u32" "u64" "i8" "i16" "i32" "i64" "f32" "f64" "bool" "bytes" "utf8" "uuid" "timestamp"`

#### `Struct(fields: list[Field])` · `FixedArray(elem: TypeExpr, count: int)` · `VarArray(elem: TypeExpr)`
Composite types. `Struct.fields` is a `list[Field]`; `elem` is any nested type; `count` is the fixed length (≥ 1).

### Writing

#### `ChunkWriter.create(path: str | os.PathLike, schema: Schema, *, compression: str = "none", compression_level: int | None = None, encryption: str = "none", passphrase: str | bytes | None = None, kdf: tuple[int, int, int] | None = None, chunk_id: uuid.UUID | None = None, created_at: int | None = None) -> ChunkWriter`
Create a new chunk (exclusive create — `FileExistsError` if it exists) and write its header + schema.

| param | meaning |
| --- | --- |
| `compression` | `"none"` / `"lz4"` / `"zstd"` |
| `compression_level` | backend level (zstd ~1–22, lz4 HC); `None` = default. Writer-side, not stored |
| `encryption` | `"none"` / `"aes-256-xts"` (requires `passphrase`) |
| `passphrase` | `str` / `bytes`; the only encryption secret |
| `kdf` | `(log2n, r, p)` scrypt cost; default `(15, 8, 1)`. `log2n ≤ 32` |
| `chunk_id` / `created_at` | override the generated UUID / Unix-µs timestamp |

#### `ChunkWriter.open(path: str | os.PathLike, *, passphrase: str | bytes | None = None, compression_level: int | None = None) -> ChunkWriter`
Open an existing chunk to append more entries. Requires the `passphrase` if encrypted; raises `InvalidChunkError` on a dirty chunk (call `repair()` first), `ColdCrateError` if another writer holds the lock.

#### `writer.append(resource_id: bytes, row: dict, *, compress: bool | None = None, encrypt: bool | None = None) -> AppendResult`
Serialize `row` (a dict matching the schema), optionally compress + encrypt, and append it. `resource_id` is 1–512 bytes. `compress` / `encrypt` default to the chunk's settings; pass `False` to skip for this entry. Returns where it landed.

#### `writer.append_tombstone(resource_id: bytes) -> AppendResult`
Append a deletion marker (tombstone flag, empty payload) for `resource_id`.

#### `writer.append_many(items: Iterable[tuple[bytes, dict]]) -> list[AppendResult]`
Convenience over `append` for an iterable of `(resource_id, row)` pairs. No implicit flush.

#### `writer.flush(*, sync: bool = False) -> None` · `writer.close(*, sync: bool = False) -> None`
Commit the mutable header counters (and `fsync` if `sync=True`). `close` flushes then closes (also a context-manager exit). Properties: `header`, `schema`, `entry_count`, `tail_offset`.

#### `AppendResult`
`offset: int` (absolute offset of the entry) · `checksum: int` (XXH64 of `resource_id ‖ stored payload`). Feed these into your manifest.

### Reading

#### `ChunkReader.open(path: str | os.PathLike, *, passphrase: str | bytes | None = None, mmap: bool = True) -> ChunkReader`
Open a chunk for reading. `passphrase` is needed only to decode fields of an encrypted chunk (header / `resource_id` / `scan_raw` work without it). `mmap=True` memory-maps for random access.

#### `reader.read_at(offset: int) -> Entry`
Read a single entry by absolute offset (from a manifest). Verifies the checksum (reported via `Entry.checksum_ok`, never raised). Raises `InvalidEntryError` if the offset isn't a valid entry, `EncryptionError` if encrypted and opened without a key.

#### `reader.scan(*, verify: bool = True) -> Iterator[Entry]`
Yield decoded entries from start to end. `verify=True` (default) checksums each entry and **resyncs past corruption** (cold-storage recovery); `verify=False` is faster, trusts the framing, and stops at the first anomaly (`checksum_ok` is `None`).

#### `reader.scan_raw(*, verify: bool = True) -> Iterator[RawEntry]`
Like `scan` but yields **stored** (still compressed/encrypted) payloads — needs no passphrase. Use for integrity patrol or copying.

Properties: `header -> ChunkHeader`, `schema -> Schema | None` (`None` for an encrypted chunk opened without the passphrase).

#### `Entry`
`offset: int` · `resource_id: bytes` · `fields: dict | None` (`None` for a tombstone) · `checksum_ok: bool | None` (`None` if unverified) · `flags: int`. Properties: `tombstone`, `compressed`, `encrypted`.

#### `RawEntry`
`offset` · `resource_id` · `payload: bytes` (stored form) · `checksum_ok` · `flags`, with the same flag properties.

### Maintenance

#### `coldcrate.repair(path: str | os.PathLike) -> RepairResult`
Recover a chunk left dirty by a crash: scan the longest valid contiguous run (checksum-validated, keyless), truncate trailing partial bytes, and rewrite the header counters. Returns `RepairResult(entry_count, tail_offset, truncated_bytes)`.

#### `ChunkHeader`
Frozen dataclass returned by `reader.header` / `writer.header`: `version, flags, chunk_id, created_at, schema_size, compression, encryption, kdf_salt, kdf_log2n, kdf_r, kdf_p`, plus `entry_count` and `tail_offset` (both `int` or `None` — non-`None` ⇒ exact) and a `data_start` property.

### Errors

All derive from **`ColdCrateError`**:

| exception | raised when |
| --- | --- |
| `InvalidChunkError` | not a valid chunk (bad magic/version/header, oversized schema, dirty on append) |
| `InvalidEntryError` | a malformed / truncated entry, or a row that doesn't fit the schema |
| `SchemaError` | invalid schema, or a row value that doesn't match it |
| `CompressionError` | missing backend, or a (de)compression failure |
| `EncryptionError` | missing passphrase/backend, or bad KDF parameters |

### Constants
`coldcrate.__version__` · `MAGIC` (`b"COLDCRT\0"`) · `FORMAT_VERSION` (`1`).

---

## License

[MIT](LICENSE) © larryvrh
