Metadata-Version: 2.3
Name: buckethead
Version: 0.1.2
Summary: In-memory SQLite backed by periodic snapshots to S3-compatible bucket storage.
Author: Dylan Miracle
Author-email: Dylan Miracle <me@dylanmiracle.net>
Requires-Dist: boto3>=1.42.91
Requires-Dist: pydantic>=2.9
Requires-Dist: pydantic-settings>=2.6
Requires-Dist: rich>=13.9
Requires-Dist: typer>=0.15
Requires-Dist: ipykernel>=6.29 ; extra == 'examples'
Requires-Dist: jupyter>=1.1 ; extra == 'examples'
Requires-Dist: matplotlib>=3.9 ; extra == 'examples'
Requires-Dist: pandas>=2.2 ; extra == 'examples'
Requires-Dist: scikit-learn>=1.5 ; extra == 'examples'
Requires-Dist: memray>=1.19.3 ; extra == 'profiling'
Requires-Dist: pyinstrument>=5.1.2 ; extra == 'profiling'
Requires-Python: >=3.13
Provides-Extra: examples
Provides-Extra: profiling
Description-Content-Type: text/markdown

# BucketHead

In-memory SQLite backed by periodic snapshots to S3-compatible bucket
storage (AWS S3, Cloudflare R2, MinIO).

Your app writes to a regular `sqlite3.Connection`. BucketHead keeps the
database in memory, snapshots the whole thing to your bucket on a timer,
and restores it on startup. No Redis, no managed service — just SQLite
and one bucket.

## Why

- **Fast.** Reads and writes hit an in-memory SQLite; microsecond
  latencies. BucketHead is never in the hot path.
- **Durable.** Snapshots land on S3 / R2 / MinIO at a configurable
  cadence, plus one final flush on shutdown.
- **Cheap.** Dirty-bit optimization skips the upload whenever the
  database hasn't changed since the last flush — so an idle workload
  costs nothing.
- **Structured.** You're using SQLite, not a hash table. Schemas,
  indexes, transactions, joins — all the usual stuff.
- **R2-friendly.** Zero egress fees plus the dirty-bit optimization mean
  snapshot cost is dominated by storage, not requests.
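
One cheap way to implement such a dirty bit — an illustration, not necessarily how BucketHead does it — is `sqlite3.Connection.total_changes`, which counts rows inserted, updated, or deleted through a connection:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")   # DDL doesn't bump the counter
last_flushed = conn.total_changes            # record the counter at flush time

def needs_upload(conn: sqlite3.Connection, last: int) -> bool:
    # If no rows changed since the last flush, the snapshot would be
    # byte-identical and the upload can be skipped.
    return conn.total_changes != last

print(needs_upload(conn, last_flushed))   # False — nothing changed yet
conn.execute("INSERT INTO t VALUES (1)")
print(needs_upload(conn, last_flushed))   # True — one row inserted
```

(`total_changes` only sees writes made through that connection; a real implementation would need to account for every writer.)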

## Install

```bash
uv add buckethead                # runtime only
uv add 'buckethead[examples]'    # + notebook / pandas / scikit-learn example deps
uv add 'buckethead[profiling]'   # + memray / pyinstrument hooks
```

Python 3.13+.

## Quickstart

```python
from pathlib import Path
from buckethead import BucketConfig, BucketSQLite

cfg = BucketConfig.for_r2(
    account_id="<cloudflare account id>",
    bucket="my-bucket",
    access_key_id="<r2 s3 api access key>",
    secret_access_key="<r2 s3 api secret>",
)

with BucketSQLite(cfg) as bh:
    # Raw SQL
    bh.connection.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
    bh.connection.execute("INSERT INTO kv VALUES ('answer', '42')")
    bh.connection.commit()

    # Key/value interface
    bh.kv.set("user/123", "alice")
    bh.kv.get("user/123")                      # "alice"

    # File store — content-addressable, dedup'd
    bh_key = bh.files.put(Path("/tmp/big.bin"))
    bh.files.get(bh_key, dest=Path("/tmp/out.bin"))

# On exit: final flush → snapshot uploaded to R2.
# On next startup: restored automatically.
```

### Env-driven config

For 12-factor deployments, use `BucketSettings`:

```python
from buckethead import BucketSettings, BucketSQLite

cfg = BucketSettings.from_env().to_bucket_config()
# reads R2_ACCOUNT_ID (or R2_ENDPOINT_URL), R2_BUCKET,
# R2_ACCESS_KEY_ID, R2_SECRET_ACCESS_KEY, and BUCKETHEAD_KEY
bh = BucketSQLite(cfg)
```

## Three typed views on your data

| Attribute | Access pattern | What it's for |
|---|---|---|
| `bh.connection` | raw SQL | anything — full SQLite is yours |
| `bh.kv` | string-keyed `set` / `get` / dict protocol | small configuration, cache entries, session data |
| `bh.files` | content-addressable SHA-256 `bh-key` → bytes in R2 | arbitrary files (uploads, artifacts, ML inputs) |
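
Content addressing means the key is derived from the file's bytes, so identical content always maps to the same object and re-uploads dedupe for free. A sketch of the idea (the exact `bh-key` format is BucketHead's own and may differ):

```python
import hashlib
import tempfile
from pathlib import Path

def content_key(path: Path) -> str:
    """Derive a content-addressable key: identical bytes -> identical key."""
    # Illustrative key format; BucketHead's actual bh-key layout may differ.
    return "sha256-" + hashlib.sha256(path.read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as d:
    a = Path(d, "a.bin"); a.write_bytes(b"hello")
    b = Path(d, "b.bin"); b.write_bytes(b"hello")
    # Same content at different paths -> same key, so the second put is a no-op.
    key = content_key(a)
    assert key == content_key(b)
```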

## Branches

BucketHead can maintain multiple named snapshots against the same bucket,
one per "branch". Useful for trying risky migrations or running experiments
without polluting main state.

```python
bh.branches.create("experiment-1")      # fork from current
bh.branches.switch("experiment-1")      # flush outgoing, reload from target
bh.connection.execute("...")             # writes go to experiment-1's snapshot
bh.branches.switch("main")              # back to main
bh.branches.list()                      # ["experiment-1", "main"]

# Experiment succeeded — make main look like experiment-1:
bh.branches.switch("experiment-1")
bh.branches.overwrite("main")           # identity becomes main; in-memory unchanged
bh.branches.delete("experiment-1")      # optional cleanup
```

- Branches are R2 keys: `main` is `<BucketConfig.key>`; branch `X` is
  `<BucketConfig.key>.branch.X`. Listed via live ListObjectsV2 — no registry.
- Names must match `[a-zA-Z0-9_-]+`; `main` is reserved.
- **FileStore blobs are shared across branches.** Content-hash keys dedupe
  automatically; deleting a branch does not delete blobs that were unique
  to it. Run `bh.files.gc()` for orphan cleanup.
- Default branch at startup is `main`; override via
  `BucketConfig(initial_branch="X")` or `BUCKETHEAD_BRANCH=X`.
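
The naming rule is simple enough to sketch as a hypothetical helper (not part of the public API — it just reproduces the key layout and name rule described above):

```python
import re

SNAPSHOT_KEY = "bucketsqlite/snap.db"       # BucketConfig.key default
BRANCH_NAME = re.compile(r"[a-zA-Z0-9_-]+")  # allowed branch names

def branch_key(branch: str, key: str = SNAPSHOT_KEY) -> str:
    # `main` lives at the snapshot key itself; branch X at `<key>.branch.X`.
    if not BRANCH_NAME.fullmatch(branch):
        raise ValueError(f"invalid branch name: {branch!r}")
    return key if branch == "main" else f"{key}.branch.{branch}"

print(branch_key("main"))          # bucketsqlite/snap.db
print(branch_key("experiment-1"))  # bucketsqlite/snap.db.branch.experiment-1
```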

The `connection` lives in memory; `kv` rows and `files` metadata also live
in memory (they're SQLite tables). Only `files` blobs are stored outside
SQLite — one R2 object per blob, under a configurable `files/` prefix.

## CLI

```bash
buckethead inspect                         # schema + row counts of the snapshot
buckethead restore <local.db>              # download the snapshot to disk
buckethead files list                      # paginated listing
buckethead files get <bh-key> <dest>       # download one blob
buckethead files gc --dry-run              # what would orphan cleanup delete?
buckethead files gc --grace-seconds 300    # actually clean up
```

All CLI commands read credentials from the same env vars as
`BucketSettings.from_env()`.

## Observability

Pass callbacks to `BucketSQLite` to wire metrics or tracing:

```python
def on_flush_start() -> None: ...
def on_flush_complete(duration_s: float, bytes_uploaded: int) -> None: ...
def on_flush_error(exc: BaseException) -> None: ...

bh = BucketSQLite(
    cfg,
    on_flush_start=on_flush_start,
    on_flush_complete=on_flush_complete,
    on_flush_error=on_flush_error,
)
```

`bytes_uploaded == 0` means the dirty-bit skipped the upload — the DB
didn't change since the last flush.
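
For example, the callbacks can close over a small stats object (a sketch — only the callback signatures above come from BucketHead; `FlushStats` is made up for illustration):

```python
class FlushStats:
    """Accumulates flush metrics; method signatures match the callbacks above."""

    def __init__(self) -> None:
        self.flushes = 0
        self.skipped = 0       # dirty-bit skips (bytes_uploaded == 0)
        self.bytes_total = 0
        self.errors = 0

    def on_flush_start(self) -> None:
        self.flushes += 1

    def on_flush_complete(self, duration_s: float, bytes_uploaded: int) -> None:
        if bytes_uploaded == 0:
            self.skipped += 1
        self.bytes_total += bytes_uploaded

    def on_flush_error(self, exc: BaseException) -> None:
        self.errors += 1

stats = FlushStats()
# bh = BucketSQLite(cfg, on_flush_start=stats.on_flush_start,
#                   on_flush_complete=stats.on_flush_complete,
#                   on_flush_error=stats.on_flush_error)
```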

For deeper profiling, enable the built-in hooks:

```python
from buckethead import BucketSQLite, ProfilingConfig

bh = BucketSQLite(
    cfg,
    profiling_config=ProfilingConfig(
        io_counters=True,     # JSON summary of bytes / ops per R2 call
        memory=True,          # requires buckethead[profiling]
        cpu=True,             # requires buckethead[profiling]
    ),
)
```

On `bh.stop()`, profiling artifacts are written under
`ProfilingConfig.output_dir` (default `./buckethead-profiles`).

## Configuration

### `BucketConfig`

| Field | Default | Notes |
|---|---|---|
| `bucket` | required | bucket name |
| `key` | `bucketsqlite/snap.db` | where the snapshot lives |
| `endpoint_url` | `None` | set for R2/MinIO; leave `None` for AWS S3 |
| `region` | `"auto"` | R2 wants `"auto"`; AWS wants a real region |
| `access_key_id` / `secret_access_key` | `None` | or use IAM / env |
| `files_prefix` | `"files/"` | FileStore objects go under this prefix |

For R2, use `BucketConfig.for_r2(account_id, bucket, access_key_id, secret_access_key)`
to skip the endpoint-URL boilerplate.

### `SnapshotConfig`

| Field | Default | Notes |
|---|---|---|
| `interval_seconds` | `60.0` | background flush cadence |
| `min_interval_seconds` | `5.0` | debounce for `flush()` (manual) |
| `keep_previous` | `True` | save `<key>.prev` before overwriting |
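
The `min_interval_seconds` debounce can be sketched with `time.monotonic` (illustrative only; `Debouncer` and `should_flush` are invented names, not BucketHead API):

```python
import time

class Debouncer:
    """Skip a manual flush if one ran less than min_interval_seconds ago."""

    def __init__(self, min_interval_seconds: float = 5.0) -> None:
        self.min_interval = min_interval_seconds
        self._last = float("-inf")   # so the first call always flushes

    def should_flush(self) -> bool:
        now = time.monotonic()
        if now - self._last < self.min_interval:
            return False  # too soon after the previous flush
        self._last = now
        return True

d = Debouncer(min_interval_seconds=5.0)
print(d.should_flush())  # True  — first call always flushes
print(d.should_flush())  # False — within the debounce window
```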

### Env vars (for `BucketSettings`)

| Variable | Purpose |
|---|---|
| `R2_ACCOUNT_ID` | Cloudflare account id (used to build endpoint URL) |
| `R2_ENDPOINT_URL` | full endpoint; overrides `R2_ACCOUNT_ID` |
| `R2_BUCKET` | bucket name |
| `R2_ACCESS_KEY_ID` | R2 S3 API token access key |
| `R2_SECRET_ACCESS_KEY` | R2 S3 API token secret |
| `BUCKETHEAD_KEY` | snapshot key (default `bucketsqlite/snap.db`) |
| `BUCKETHEAD_BRANCH` | initial branch the process attaches to (default `main`) |

Both `R2_ACCOUNT_ID` and `R2_ENDPOINT_URL` are optional — if neither is
set, `endpoint_url` stays `None` and boto3 connects to AWS S3 with its
normal credential discovery.
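
A typical R2 environment, matching the table above (all values are placeholders):

```bash
export R2_ACCOUNT_ID="<cloudflare account id>"
export R2_BUCKET="my-bucket"
export R2_ACCESS_KEY_ID="<r2 s3 api access key>"
export R2_SECRET_ACCESS_KEY="<r2 s3 api secret>"
export BUCKETHEAD_KEY="myapp/snap.db"   # optional; defaults to bucketsqlite/snap.db
export BUCKETHEAD_BRANCH="main"         # optional
```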

## Constraints and scope

- **Single-process.** The in-memory database lives in the process that
  constructs `BucketSQLite`. Multi-threaded access in that process is fine
  (`bh.connect()` vends a new connection per thread), but cross-process
  sharing is not supported.
- **Durability window.** Hard crash (OOM, SIGKILL, power loss) loses
  up to `interval_seconds` of writes. Call `bh.force_flush()` after
  any write that must not be lost.
- **DB size.** Snapshot wall-time is ~1 ms per MB of DB. Comfortable
  below 100 MB, usable up to ~500 MB, noticeable stalls above that.
- **Not a Redis drop-in.** No wire protocol, no pub/sub, no replication.
- **Not a distributed database.** Single writer, no HA.

## Deeper reading

- [`plan/project-spec.md`](plan/project-spec.md) — the design
- [`plan/build-plan.md`](plan/build-plan.md) — phased build log + decisions
- [`docs/diagrams.md`](docs/diagrams.md) — sequence diagrams

## License

TBD.
