Metadata-Version: 2.3
Name: buckethead
Version: 0.3.1
Summary: In-memory SQLite backed by periodic snapshots to S3-compatible bucket storage.
Author: Dylan Miracle
Author-email: Dylan Miracle <me@dylanmiracle.net>
Requires-Dist: boto3>=1.42.91
Requires-Dist: pydantic>=2.9
Requires-Dist: pydantic-settings>=2.6
Requires-Dist: rich>=13.9
Requires-Dist: tomlkit>=0.14.0
Requires-Dist: typer>=0.15
Requires-Dist: ipykernel>=6.29 ; extra == 'examples'
Requires-Dist: jupyter>=1.1 ; extra == 'examples'
Requires-Dist: matplotlib>=3.9 ; extra == 'examples'
Requires-Dist: pandas>=2.2 ; extra == 'examples'
Requires-Dist: scikit-learn>=1.5 ; extra == 'examples'
Requires-Dist: memray>=1.19.3 ; extra == 'profiling'
Requires-Dist: pyinstrument>=5.1.2 ; extra == 'profiling'
Requires-Python: >=3.13
Provides-Extra: examples
Provides-Extra: profiling
Description-Content-Type: text/markdown

# BucketHead

In-memory SQLite backed by periodic snapshots to S3-compatible bucket
storage (AWS S3, Cloudflare R2, MinIO).

Your app writes to a regular `sqlite3.Connection`. BucketHead keeps the
database in memory, snapshots the whole thing to your bucket on a timer,
and restores it on startup. No Redis, no managed service — just SQLite
and one bucket.

## Why

- **Fast.** Reads and writes hit an in-memory SQLite database at
  microsecond latencies. BucketHead is never in the hot path.
- **Durable.** Snapshots land on S3 / R2 / MinIO at a configurable
  cadence, plus one final flush on shutdown.
- **Cheap.** Dirty-bit optimization skips the upload whenever the
  database hasn't changed since the last flush — so an idle workload
  costs nothing.
- **Structured.** You're using SQLite, not a hash table. Schemas,
  indexes, transactions, joins — all the usual stuff.
- **R2-friendly.** Zero egress fees plus the dirty-bit optimization
  mean snapshot cost is dominated by storage, not requests.

## Install

```bash
uv add buckethead                # runtime only
uv add 'buckethead[profiling]'   # + memray / pyinstrument hooks
uv add 'buckethead[examples]'    # + jupyter / pandas / scikit-learn example deps
```

Python 3.13+.

## Quickstart

```python
from pathlib import Path
from buckethead import BucketConfig, BucketSQLite

cfg = BucketConfig.for_r2(
    account_id="<cloudflare account id>",
    bucket="my-bucket",
    access_key_id="<r2 s3 api access key>",
    secret_access_key="<r2 s3 api secret>",
)

with BucketSQLite(cfg) as bh:
    # Raw SQL
    bh.connection.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
    bh.connection.execute("INSERT INTO kv VALUES ('answer', '42')")
    bh.connection.commit()

    # Key/value interface
    bh.kv.set("user/123", "alice")
    bh.kv.get("user/123")                      # "alice"

    # File store — content-addressable, dedup'd
    bh_key = bh.files.put(Path("/tmp/big.bin"))
    bh.files.get(bh_key, dest=Path("/tmp/out.bin"))

# On exit: final flush → snapshot uploaded to R2.
# On next startup: restored automatically.
```

### Env-driven config

For 12-factor deployments, skip the `BucketConfig` entirely —
`BucketSQLite()` auto-loads it from `BUCKETHEAD_*` env vars and
`~/.config/buckethead/config.toml`:

```python
from buckethead import BucketSQLite

bh = BucketSQLite()
# reads BUCKETHEAD_BUCKET__NAME, BUCKETHEAD_BUCKET__ACCESS_KEY_ID,
# BUCKETHEAD_BUCKET__SECRET_ACCESS_KEY, and optionally
# BUCKETHEAD_BUCKET__ENDPOINT_URL or BUCKETHEAD_CLOUDFLARE__ACCOUNT_ID.
# [env] entries in the user config TOML are merged in via setdefault.
```

If you need the intermediate `BucketConfig` object, build it explicitly
via `BucketHeadSettings().to_bucket_config()` and pass it in.
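
For example (a short sketch; `BucketHeadSettings` and
`to_bucket_config()` are the names mentioned above, and the top-level
import location is an assumption):

```python
from buckethead import BucketHeadSettings, BucketSQLite

# Loads BUCKETHEAD_* env vars plus ~/.config/buckethead/config.toml,
# then materializes the plain BucketConfig from them.
settings = BucketHeadSettings()
cfg = settings.to_bucket_config()

bh = BucketSQLite(cfg)
```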

## Four typed views on your data

| Attribute | Access pattern | What it's for |
|---|---|---|
| `bh.connection` | raw SQL | anything — full SQLite is yours |
| `bh.kv` | string-keyed `set` / `get` / dict protocol | small configuration, cache entries, session data |
| `bh.docs` | named collections of JSON documents with a Mongo-lite filter DSL | structured-ish records you want to query by field (users, events, config blobs) |
| `bh.files` | content-addressable SHA-256 `bh-key` → bytes in R2 | arbitrary files (uploads, artifacts, ML inputs) |

```python
users = bh.docs.collection("users")
users.insert({"name": "alice", "age": 30, "tags": ["beta"]})
users.find({"age": {"$gte": 18}, "tags": {"$in": ["beta"]}})
```

DocStore rows live in SQLite — they snapshot and branch with the rest
of the database. Escape hatch for queries the DSL doesn't cover:
`bh.connection.execute("SELECT doc FROM bh_docs WHERE ...")` with
`json_extract`.
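
A sketch of that escape hatch, assuming only the `bh_docs` table and
`doc` column named above:

```python
# Find adult users directly in SQL; json_extract reads fields out of
# the stored JSON document.
adults = bh.connection.execute(
    "SELECT doc FROM bh_docs WHERE json_extract(doc, '$.age') >= ?",
    (18,),
).fetchall()
```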

## Branches

BucketHead can maintain multiple named snapshots against the same bucket,
one per "branch". Useful for trying risky migrations or running experiments
without polluting main state.

```python
bh.branches.create("experiment-1")      # fork from current
bh.branches.switch("experiment-1")      # flush outgoing, reload from target
bh.connection.execute("...")             # writes go to experiment-1's snapshot
bh.branches.switch("main")              # back to main
bh.branches.list()                      # ["experiment-1", "main"]

# Experiment succeeded — make main look like experiment-1:
bh.branches.switch("experiment-1")
bh.branches.overwrite("main")           # identity becomes main; in-memory unchanged
bh.branches.delete("experiment-1")      # optional cleanup
```

- Branches are R2 keys: `main` is `<BucketConfig.key>`; branch `X` is
  `<BucketConfig.key>.branch.X`. Listed via live ListObjectsV2 — no registry.
- Names must match `[a-zA-Z0-9_-]+`; `main` is reserved.
- **FileStore blobs are shared across branches.** Content-hash keys dedupe
  automatically; deleting a branch does not delete blobs that were unique
  to it. Run `bh.files.gc()` for orphan cleanup.
- Default branch at startup is `main`; override via
  `BucketConfig(initial_branch="X")` or `BUCKETHEAD_SNAPSHOT__BRANCH=X`.

The `connection` lives in memory; `kv` rows and `files` metadata also live
in memory (they're SQLite tables). Only `files` blobs are stored outside
SQLite — one R2 object per blob, under a configurable `files/` prefix.

## Tracking files on disk

`LocalFileTracker` keeps local filesystem paths in sync with FileStore
blobs and retains a full version history per path. Each `sync()` that
detects changed content appends a `FileVersion` row — nothing is ever
overwritten in place.

```python
from pathlib import Path
from buckethead import BucketSQLite, LocalFileTracker

with BucketSQLite(cfg) as bh:
    tracker = LocalFileTracker(bh.connection, bh.files)

    # Initial track — hashes the file, uploads the blob, records version 1.
    tracker.track(Path("/etc/app/settings.json"))

    # Periodically (or on demand) — re-hash every tracked path,
    # append a new version if anything changed.
    report = tracker.sync()
    # SyncReport(scanned=1, unchanged=0, updated=1, missing=[])

    # Which bh-key does a path currently point at?
    tracker.current(Path("/etc/app/settings.json"))

    # Full history, newest first.
    for v in tracker.history(Path("/etc/app/settings.json")):
        print(v.synced_at, v.bh_key, v.size)
```

Metadata lives in `bh_local_files` (one row per tracked path) and
`bh_local_file_versions` (append-only history). Both tables snapshot
with the rest of the database, so the version log survives restarts
and travels across branches. Blobs themselves live in FileStore under
content-addressable keys — identical content across different paths or
branches dedupes automatically.

## Sharing files

A FileStore blob can be exposed to the outside world through a separate
**share bucket** — either a public-read R2 bucket where the URL is a
stable public path, or a private bucket where the URL is a sig-v4
presigned GET. Provision it once per project:

```bash
buckethead provision share-bucket --project my-project
```

Then point `BucketSQLite` at the project by name:

```python
from buckethead import BucketSQLite

bh = BucketSQLite(project="my-project")
bh.start()

bh_key = bh.files.put(b"hello", filename="note.txt")
result = bh.shares.share(bh_key)    # copies into share bucket
print(result.url)                   # https://files.example.com/note/abcd1234.txt
```

`project=` reads the share bucket name from
`~/.config/buckethead/config.toml` and pulls bucket credentials from the
configured secret store (the same conventions `buckethead shares` uses).
If a project has no share bucket attached, accessing `bh.shares`
raises, so you only pay for the lookup when sharing is actually
configured. Build
the `ShareConfig` standalone via `ShareConfig.from_project("my-project")`
if you'd rather compose it yourself.
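
The standalone route might look like this; `ShareConfig.from_project`
is quoted from above, while the top-level import and the
`share_config=` keyword are assumptions:

```python
from buckethead import BucketSQLite, ShareConfig

# Resolve the share bucket and credentials for the named project.
share_cfg = ShareConfig.from_project("my-project")
bh = BucketSQLite(cfg, share_config=share_cfg)   # keyword name is an assumption
```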

## CLI

```bash
buckethead status                          # probe bucket + current snapshot key, no hydration
buckethead inspect                         # schema + row counts of the snapshot
buckethead restore <local.db>              # download the snapshot to disk

buckethead files list                      # paginated listing
buckethead files get <bh-key> <dest>       # download one blob
buckethead files gc --dry-run              # what would orphan cleanup delete?
buckethead files gc --grace-seconds 300    # actually clean up

buckethead provision bucket --project <n>  # one-time: create bucket + store creds
buckethead shares share <bh-key> --project <n>  # copy to share bucket, print URL
buckethead config show                     # inspect ~/.config/buckethead/config.toml
buckethead bench run <preset>              # KV latency / throughput / YCSB
buckethead stress run <scenario>           # cost + perf scenarios against real R2
```

All CLI commands read credentials from the same env vars as
`BucketHeadSettings()`. See [the CLI reference](docs/cli.md) for the
full surface — `provision`, `shares`, `config`, `bench`, and `stress`
each have sub-commands beyond the ones shown above.

## Observability

Pass callbacks to `BucketSQLite` to wire metrics or tracing:

```python
def on_flush_start() -> None: ...
def on_flush_complete(duration_s: float, bytes_uploaded: int) -> None: ...
def on_flush_error(exc: BaseException) -> None: ...

bh = BucketSQLite(
    cfg,
    on_flush_start=on_flush_start,
    on_flush_complete=on_flush_complete,
    on_flush_error=on_flush_error,
)
```

`bytes_uploaded == 0` means the dirty-bit skipped the upload — the DB
didn't change since the last flush.
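
For instance, a logging-only `on_flush_complete`, using nothing beyond
the signature shown above:

```python
import logging

log = logging.getLogger("buckethead.flush")

def on_flush_complete(duration_s: float, bytes_uploaded: int) -> None:
    if bytes_uploaded == 0:
        # Dirty bit was clear; nothing was uploaded this cycle.
        log.debug("flush skipped (no changes)")
    else:
        log.info("uploaded %d bytes in %.3f s", bytes_uploaded, duration_s)
```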

For deeper profiling, enable the built-in hooks:

```python
from buckethead import BucketSQLite, ProfilingConfig

bh = BucketSQLite(
    cfg,
    profiling_config=ProfilingConfig(
        io_counters=True,     # JSON summary of bytes / ops per R2 call
        memory=True,          # requires buckethead[profiling]
        cpu=True,             # requires buckethead[profiling]
    ),
)
```

On `bh.stop()`, profiling artifacts are written under
`ProfilingConfig.output_dir` (default `./buckethead-profiles`).

## Configuration

### `BucketConfig`

| Field | Default | Notes |
|---|---|---|
| `bucket` | required | bucket name |
| `key` | `bucketsqlite/snap.db` | where the snapshot lives |
| `endpoint_url` | `None` | set for R2/MinIO; leave `None` for AWS S3 |
| `region` | `"auto"` | R2 wants `"auto"`; AWS wants a real region |
| `access_key_id` / `secret_access_key` | `None` | or use IAM / env |
| `files_prefix` | `"files/"` | FileStore objects go under this prefix |

For R2, use `BucketConfig.for_r2(account_id, bucket, access_key_id, secret_access_key)`
to skip the endpoint-URL boilerplate.
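
For MinIO or another S3-compatible endpoint, construct the config
explicitly. A sketch built from the fields in the table; the endpoint
and credentials below are stock local-MinIO defaults, not BucketHead
values:

```python
from buckethead import BucketConfig

cfg = BucketConfig(
    bucket="my-bucket",
    key="bucketsqlite/snap.db",            # the default, spelled out
    endpoint_url="http://localhost:9000",  # local MinIO's usual port
    region="auto",
    access_key_id="minioadmin",            # MinIO's out-of-the-box creds
    secret_access_key="minioadmin",
)
```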

### `SnapshotConfig`

| Field | Default | Notes |
|---|---|---|
| `interval_seconds` | `60.0` | background flush cadence |
| `min_interval_seconds` | `5.0` | debounce for manual `flush()` calls |
| `keep_previous` | `True` | save `<key>.prev` before overwriting |
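
Wiring it up might look like the sketch below; the field names come
from the table, while the `snapshot_config=` keyword on `BucketSQLite`
is an assumption:

```python
from buckethead import BucketSQLite, SnapshotConfig

bh = BucketSQLite(
    cfg,
    snapshot_config=SnapshotConfig(   # keyword name is an assumption
        interval_seconds=30.0,        # background flush twice a minute
        min_interval_seconds=2.0,     # debounce manual flush() calls
        keep_previous=True,           # keep <key>.prev as a one-step rollback
    ),
)
```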

### Env vars (for `BucketHeadSettings`)

All vars use the `BUCKETHEAD_` prefix with `__` as the nesting delimiter
(pydantic-settings convention). So `BUCKETHEAD_BUCKET__NAME` lands in
`settings.bucket.name`.

| Variable | Purpose |
|---|---|
| `BUCKETHEAD_BUCKET__NAME` | bucket name (required) |
| `BUCKETHEAD_BUCKET__ACCESS_KEY_ID` | S3-API access key (required) |
| `BUCKETHEAD_BUCKET__SECRET_ACCESS_KEY` | S3-API secret (required) |
| `BUCKETHEAD_BUCKET__ENDPOINT_URL` | full endpoint; set for R2/MinIO/B2 |
| `BUCKETHEAD_BUCKET__REGION` | defaults to `auto` (R2); set a real region for AWS |
| `BUCKETHEAD_CLOUDFLARE__ACCOUNT_ID` | alternative to `ENDPOINT_URL` for R2 — endpoint is derived when `BUCKETHEAD_CLOUD=cloudflare-r2` (default) |
| `BUCKETHEAD_SNAPSHOT__KEY` | snapshot key (default `bucketsqlite/snap.db`) |
| `BUCKETHEAD_SNAPSHOT__BRANCH` | initial branch the process attaches to (default `main`) |
| `BUCKETHEAD_CLOUD` | cloud backend name; default `cloudflare-r2` |
| `BUCKETHEAD_SECRET_STORE` | secret-store backend name; default `1password` |

Both `BUCKETHEAD_BUCKET__ENDPOINT_URL` and
`BUCKETHEAD_CLOUDFLARE__ACCOUNT_ID` are optional — if neither is set,
`endpoint_url` stays `None` and boto3 connects to AWS S3 with its normal
credential discovery.
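
End to end, the env-driven path is just this (values are placeholders;
variable names come from the table above):

```python
import os

# R2 wiring via env vars only.
os.environ["BUCKETHEAD_BUCKET__NAME"] = "my-bucket"
os.environ["BUCKETHEAD_BUCKET__ACCESS_KEY_ID"] = "<r2 s3 api access key>"
os.environ["BUCKETHEAD_BUCKET__SECRET_ACCESS_KEY"] = "<r2 s3 api secret>"
os.environ["BUCKETHEAD_CLOUDFLARE__ACCOUNT_ID"] = "<cloudflare account id>"

from buckethead import BucketSQLite

with BucketSQLite() as bh:    # settings are picked up automatically
    bh.kv.set("hello", "world")
```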

## Constraints and scope

- **Single-process.** The in-memory database lives in the process that
  constructs `BucketSQLite`. Multi-threaded access in that process is
  fine (`bh.connect()` vends a new connection per thread), but
  cross-process sharing is not supported.
- **Durability window.** Hard crash (OOM, SIGKILL, power loss) loses
  up to `interval_seconds` of writes. Call `bh.force_flush()` after
  any write that must not be lost (see the sketch after this list).
- **DB size.** Snapshot wall-time is ~1 ms per MB of DB. Comfortable
  below 100 MB, usable up to ~500 MB, noticeable stalls above that.
- **Not a Redis drop-in.** No wire protocol, no pub/sub, no replication.
- **Not a distributed database.** Single writer, no HA.
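
Closing the durability window after a critical write, as a sketch (the
`payments` table is hypothetical):

```python
with BucketSQLite(cfg) as bh:
    # `payments` stands in for any write that must survive a crash.
    bh.connection.execute(
        "CREATE TABLE IF NOT EXISTS payments (id TEXT PRIMARY KEY, amount_cents INTEGER)"
    )
    bh.connection.execute("INSERT INTO payments VALUES (?, ?)", ("tx-1", 999))
    bh.connection.commit()
    bh.force_flush()   # snapshot now instead of waiting up to interval_seconds
```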

## Deeper reading

- [`docs/diagrams.md`](docs/diagrams.md) — sequence diagrams for the
  lifecycle, `FileStore.put`, and `FileStore.gc` flows
- [Full docs site](https://cloutfront.github.io/buckethead/) — API
  reference and CLI usage

## License

TBD.
