Write-safety fix + dedup alias trail
spice-kernel-db design & implementation plan
Draft for review — no code yet. Settle decisions D1–D3 (bottom) before implementation.
Trigger. update bc_ops.tm crashed with ValueError: Hash mismatch for bc_mpo_hga_zero_s20191107_v02.bc.
Root cause (proven). Dedup symlinks collapse many distinct kernel names onto one inode; download_kernel does open(dest,"wb"), which follows those symlinks, so a single parallel run writes several different kernels into one physical file.
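The write-through behaviour can be reproduced in isolation. The following is a minimal sketch, not the project's code: the file names are hypothetical, and only the open(dest, "wb") call shape is taken from the plan.

```python
import os
import tempfile
from pathlib import Path


def demo_write_through_symlink() -> dict:
    """Show that open(path, 'wb') on a symlink overwrites the link's target."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        target = root / "canonical.bc"  # hypothetical dedup target
        alias = root / "alias.bc"       # hypothetical dedup symlink
        target.write_bytes(b"kernel A")
        os.symlink(target, alias)

        # Same call shape the plan attributes to download_kernel.
        with open(alias, "wb") as fh:
            fh.write(b"kernel B")

        return {
            "alias_is_still_symlink": alias.is_symlink(),
            "target_bytes": target.read_bytes(),
            "same_inode": os.stat(alias).st_ino == os.stat(target).st_ino,
        }


result = demo_write_through_symlink()
print(result)
```

The alias stays a symlink and the canonical file now holds the alias's bytes, which is the corruption described above.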
This plan covers two complementary workstreams:
- A — Write-path correctness (the bug): downloads must never write through a dedup symlink.
- B — Dedup alias trail (the design gap): every filename a content hash was ever seen under must be retained and followable, so dedup is transparent and trustworthy.
A is required to stop corruption. B is required so the user can follow what dedup did. They ship together.
[Figure 1: os.replace fix]
See Figure 1 for the end-to-end picture.
Background — what the code does today
- register_file (db.py:444) sets p = Path(path).resolve() (line 467) and stores str(p) in locations (line 575). It resolves the symlink first, so an aliased name like hga_scm_2018… is recorded under the target’s path (hga_zero…); the alias name never reaches the DB.
- kernels keeps only the first-registered filename as canonical (lines 512–514); a later identical-content name logs "Hash match: X is identical to already-registered Y" and is discarded (lines 552–556).
- download_kernel (remote.py:350) writes with open(dest,"wb") (line 379), which follows symlinks.
- Net effect: the only record that hga_scm_2018… ≡ this content is the on-disk symlink, pointing at a confusingly different name, with no DB trail.
Workstream A — Write-path correctness
A1. download_kernel → temp file + atomic replace
src/spice_kernel_db/remote.py (download_kernel, ~lines 371–412).
- Download to dest.with_name(f".{dest.name}.<pid>.tmp") in the same directory (same filesystem → atomic rename).
- On success (after completeness + optional expected_hash checks), os.replace(tmp, dest). os.replace onto a symlink path replaces the link entry itself, atomically; it does not follow the link into the shared target. This is the core fix.
- On any exception, unlink the temp file (replace the current dest.unlink() cleanup, which today risks unlinking a real file reached via symlink).
- Keep the streaming SHA-256 (still the trusted hash of the bytes written).
Result: every download lands as its own independent real file. A shared dedup target can no longer be clobbered, even with 8 concurrent workers, even if several aliases point at it.
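The A1 steps can be sketched as follows. This is an illustration under stated assumptions, not the final download_kernel: write_atomically, its signature, and the chunk iterable standing in for the HTTP stream are all hypothetical, and the completeness check is omitted.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def write_atomically(dest: Path, chunks: Iterable[bytes],
                     expected_hash: Optional[str] = None) -> str:
    """Stream chunks to a sibling temp file, then os.replace it onto dest.

    Hypothetical helper; name and signature are illustrative only.
    """
    dest = Path(dest)
    tmp = dest.with_name(f".{dest.name}.{os.getpid()}.tmp")
    sha = hashlib.sha256()
    try:
        with open(tmp, "wb") as fh:
            for chunk in chunks:
                fh.write(chunk)
                sha.update(chunk)  # streaming hash of the bytes written
        digest = sha.hexdigest()
        if expected_hash is not None and digest != expected_hash:
            raise ValueError(f"Hash mismatch for {dest.name}")
        # Swaps the directory entry at dest; a symlink there is replaced,
        # not followed.
        os.replace(tmp, dest)
    except BaseException:
        tmp.unlink(missing_ok=True)  # clean up only the temp file
        raise
    return digest


def demo() -> dict:
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        canonical, alias = root / "a.bc", root / "b.bc"
        canonical.write_bytes(b"kernel A")
        os.symlink(canonical, alias)
        digest = write_atomically(alias, [b"kernel ", b"B"])
        return {
            "canonical": canonical.read_bytes(),
            "alias": alias.read_bytes(),
            "alias_is_symlink": alias.is_symlink(),
            "digest": digest,
            "leftovers": [p.name for p in root.iterdir() if p.name.endswith(".tmp")],
        }


outcome = demo()
```

Because the temp name embeds only the destination name and the pid, two workers in one process writing the same dest would share a temp path; the A2 guard covers that case.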
A2. (defense in depth) skip duplicate dest within one batch
db.py get_metakernel task-build loop (~1798–1823). If two tasks resolve to the same physical path, log and keep one. Cheap guard; A1 already makes this non-fatal.
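A possible shape for the A2 guard, as a sketch only: the (url, dest) task tuple and the function name are assumptions, since the real task structure in get_metakernel is not shown here.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, List, Tuple

Task = Tuple[str, Path]  # (url, dest); hypothetical task shape


def drop_duplicate_dests(tasks: Iterable[Task]) -> Tuple[List[Task], List[Task]]:
    """Keep the first task per physical destination; return (kept, skipped)."""
    seen = set()
    kept: List[Task] = []
    skipped: List[Task] = []
    for url, dest in tasks:
        # realpath follows symlinks, so two aliases of one target compare equal.
        key = os.path.realpath(dest)
        if key in seen:
            skipped.append((url, dest))  # caller would log these
        else:
            seen.add(key)
            kept.append((url, dest))
    return kept, skipped


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        real = root / "a.bc"
        real.write_bytes(b"x")
        link = root / "b.bc"
        os.symlink(real, link)
        tasks = [("url-a", real), ("url-b", link), ("url-c", root / "c.bc")]
        kept, skipped = drop_duplicate_dests(tasks)
        return [u for u, _ in kept], [u for u, _ in skipped]


kept_urls, skipped_urls = demo()
```

One design point to settle: keying on the fully resolved path skips an alias even when its upstream content differs from the target's. Once A1 lands, keying on the unresolved destination path may be the safer choice.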
A3. Repair runbook for the already-corrupted tree
The shared _zero_ / _sct_ / _fcp_ files currently hold whichever aliased kernel won the last race, so both the canonical file and every symlink pointing at it are suspect.
After A1 lands:
# For each affected BepiColombo metakernel, force a clean re-fetch:
spice-kernel-db get https://spiftp.esac.esa.int/data/SPICE/BEPICOLOMBO/kernels/mk/bc_ops.tm --force
spice-kernel-db get https://spiftp.esac.esa.int/data/SPICE/BEPICOLOMBO/kernels/mk/bc_plan.tm --force
# then re-dedup honestly:
spice-kernel-db dedup
--force re-downloads every entry; with A1, each name now gets its own correct content, and dedup re-links only the genuinely identical ones.
Do not delete the current corrupted files first — they are the evidence for the regression test (A-test below). Capture their hashes before repair.
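Capturing the evidence could look like the sketch below. The helper names and the manifest layout are assumptions; the idea is only to record, for every entry, its name, its link target if it is a symlink, and the SHA-256 of the bytes it currently resolves to.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB blocks (follows symlinks, like a normal read)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def snapshot_tree(root: Path) -> list:
    """Return one record per file or symlink under root."""
    root = Path(root)
    records = []
    for path in sorted(root.rglob("*")):
        if path.is_dir():
            continue
        records.append({
            "path": str(path.relative_to(root)),
            "link_target": os.readlink(path) if path.is_symlink() else None,
            # exists() follows links, so a dangling symlink gets no hash.
            "sha256": sha256_of(path) if path.exists() else None,
        })
    return records


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.bc").write_bytes(b"A-bytes")
        os.symlink("a.bc", root / "b.bc")
        manifest = snapshot_tree(root)
        json.dumps(manifest)  # the manifest is plain JSON-serialisable data
        return manifest


manifest = demo()
```

Writing the result with json.dumps to a file outside the kernel tree keeps the pre-repair state available as a fixture after the re-fetch.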
Workstream B — Dedup alias trail
B1. New table kernel_aliases
db.py _init_schema (~line 264), mirroring the existing CREATE TABLE IF NOT EXISTS + column-migration pattern used for superseded_by (lines 274–277).
CREATE TABLE IF NOT EXISTS kernel_aliases (
sha256 VARCHAR NOT NULL,
filename VARCHAR NOT NULL,
first_seen TIMESTAMP DEFAULT current_timestamp,
source_url VARCHAR,
PRIMARY KEY (sha256, filename)
);
Semantics: every distinct filename a given content hash has ever been registered under, deduped or not. The physical file still dedups to one inode; this table remembers all the names.
B2. Populate in register_file
Inside the existing transaction (db.py:519–576), after the kernels upsert:
INSERT OR IGNORE INTO kernel_aliases (sha256, filename, source_url)
VALUES (?, ?, ?)
bound to (h, fname, source_url). Record fname (the as-referenced basename) before any resolve() collapses it, i.e. capture Path(path).name at function entry, not the resolved target’s name. This is the one subtlety: today fname = p.name is taken after p = Path(path).resolve().
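B1 and B2 together can be exercised with a small sketch. It uses the standard-library sqlite3 module purely as a stand-in for the project's database layer (an assumption; the real connection object may differ), and record_alias plus the example paths are hypothetical.

```python
import sqlite3
from pathlib import Path
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS kernel_aliases (
    sha256     VARCHAR NOT NULL,
    filename   VARCHAR NOT NULL,
    first_seen TIMESTAMP DEFAULT current_timestamp,
    source_url VARCHAR,
    PRIMARY KEY (sha256, filename)
);
"""


def record_alias(con: sqlite3.Connection, sha256: str, path: str,
                 source_url: Optional[str] = None) -> str:
    """Insert (sha256, basename) once; repeats are ignored by the primary key."""
    # Take the basename BEFORE any resolve(), so a symlinked alias keeps
    # its own name instead of the target's.
    fname = Path(path).name
    con.execute(
        "INSERT OR IGNORE INTO kernel_aliases (sha256, filename, source_url) "
        "VALUES (?, ?, ?)",
        (sha256, fname, source_url),
    )
    return fname


con = sqlite3.connect(":memory:")
con.executescript(SCHEMA)
record_alias(con, "8c79", "/data/ck/hga_zero.bc")
record_alias(con, "8c79", "/data/ck/hga_scm.bc")
record_alias(con, "8c79", "/data/ck/hga_scm.bc")  # duplicate, ignored
rows = con.execute(
    "SELECT filename FROM kernel_aliases WHERE sha256 = ? ORDER BY filename",
    ("8c79",),
).fetchall()
```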
B3. Backfill migration for existing DBs
One-time, in _init_schema right after the table is created (guard on “table was just created / empty”):
- Seed from current kernels: INSERT OR IGNORE INTO kernel_aliases SELECT sha256, filename, … FROM kernels.
- Seed from on-disk symlinks: for each distinct directory in locations.abs_path, find symlinks; for each link → target, look up the target’s sha in locations, insert (sha, basename(link)). This recovers the trail that resolve() discarded (e.g. all the hga_scm_* names).
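The symlink half of the backfill might be sketched like this. The function name is hypothetical, and the sha_by_path dict is an assumed stand-in for a lookup against locations (absolute path → sha256).

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def recover_symlink_aliases(directories: Iterable[Path],
                            sha_by_path: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (sha256, alias_basename) for symlinks whose target is known."""
    found: List[Tuple[str, str]] = []
    for directory in directories:
        with os.scandir(directory) as it:
            entries = sorted(it, key=lambda e: e.name)
        for entry in entries:
            if not entry.is_symlink():
                continue
            target = os.path.realpath(entry.path)  # where the link points
            sha = sha_by_path.get(target)
            if sha is not None:
                found.append((sha, entry.name))
    return found


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(os.path.realpath(tmp))
        target = root / "hga_zero.bc"
        target.write_bytes(b"ck")
        os.symlink(target, root / "hga_scm_a.bc")
        os.symlink(target, root / "hga_scm_b.bc")
        known = {str(target): "8c79"}  # stand-in for the locations table
        return recover_symlink_aliases([root], known)


recovered = demo()
```

Each returned pair would feed the same INSERT OR IGNORE used in B2, so rerunning the backfill is harmless.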
See decision D2 for automatic-during-migration vs. an explicit dedup --rebuild-aliases.
B4. New command aliases <name|hash>
cli.py: sub.add_parser("aliases", …) near the other read-only commands (~line 358), dispatch elif args.command == "aliases": (~line 503 region). Read-only (add to read_only_commands).
New KernelDB.aliases(name_or_hash) -> dict in db.py:
- Resolve input → content hash (accept a full/partial sha or any known filename via kernel_aliases/kernels).
- Return: content hash, canonical name, kernel_type, size_bytes, all alias filenames (from kernel_aliases), and all on-disk locations (from locations).
- CLI renders a rich.Table/Panel (house style).
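The lookup described in the list above can be sketched as a free function. Everything here is an assumption for illustration: sqlite3 stands in for the real database, the column sets of kernels and locations are reduced to the fields the plan names, and kernels is assumed to hold exactly one row per content hash.

```python
import sqlite3


def aliases(con: sqlite3.Connection, name_or_hash: str) -> dict:
    """Resolve a filename or hash prefix to one content hash and describe it."""
    # 1. Try the input as a filename (alias or canonical).
    hashes = [r[0] for r in con.execute(
        "SELECT sha256 FROM kernel_aliases WHERE filename = ? "
        "UNION SELECT sha256 FROM kernels WHERE filename = ?",
        (name_or_hash, name_or_hash))]
    # 2. Otherwise treat it as a full or partial hash.
    if not hashes:
        hashes = [r[0] for r in con.execute(
            "SELECT sha256 FROM kernels WHERE substr(sha256, 1, ?) = ?",
            (len(name_or_hash), name_or_hash))]
    if len(hashes) != 1:
        raise LookupError(f"{name_or_hash!r} matched {len(hashes)} content hashes")
    sha = hashes[0]
    canonical, ktype, size = con.execute(
        "SELECT filename, kernel_type, size_bytes FROM kernels WHERE sha256 = ?",
        (sha,)).fetchone()
    names = [r[0] for r in con.execute(
        "SELECT filename FROM kernel_aliases WHERE sha256 = ? ORDER BY filename", (sha,))]
    paths = [r[0] for r in con.execute(
        "SELECT abs_path FROM locations WHERE sha256 = ? ORDER BY abs_path", (sha,))]
    return {"sha256": sha, "canonical": canonical, "kernel_type": ktype,
            "size_bytes": size, "aliases": names, "locations": paths}


con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE kernels (sha256 VARCHAR, filename VARCHAR, kernel_type VARCHAR, size_bytes INTEGER);
CREATE TABLE kernel_aliases (sha256 VARCHAR, filename VARCHAR, PRIMARY KEY (sha256, filename));
CREATE TABLE locations (sha256 VARCHAR, abs_path VARCHAR);
INSERT INTO kernels VALUES ('8c7922ab', 'hga_zero.bc', 'CK', 16384);
INSERT INTO kernel_aliases VALUES ('8c7922ab', 'hga_zero.bc'), ('8c7922ab', 'hga_scm.bc');
INSERT INTO locations VALUES ('8c7922ab', '/data/ck/hga_zero.bc');
""")
by_alias = aliases(con, "hga_scm.bc")
by_hash = aliases(con, "8c79")
by_canonical = aliases(con, "hga_zero.bc")
```

Raising on zero or multiple matches keeps an ambiguous partial hash from silently picking one kernel; the CLI layer can turn the LookupError into a friendly message.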
$ spice-kernel-db aliases bc_mpo_hga_scm_20210101_20220101_s20230309_v01.bc
Content 8c7922… (CK, 16 KB)
Canonical filename: bc_mpo_hga_zero_s20191107_v02.bc
Also known as (7): bc_mpo_hga_scm_20181020_…, bc_mpo_hga_scm_20200101_…, …
Locations (1): …/BEPICOLOMBO/ck/bc_mpo_hga_zero_s20191107_v02.bc
B5. Surface aliases in existing output
- browse/get kernel tables: annotate deduped rows, e.g. bc_mpo_hga_zero_…v02.bc (+7 aliases).
- resolve_kernel dedup warning: name the canonical explicitly, e.g. “loaded hga_scm_2018…; content-identical to hga_zero… (alias).”
B6. Display-name policy — decision D1
With B1–B5, every name is recoverable, so the physical canonical matters less. Recommended UX:
- Keep physical canonical = first-registered (stable, no churn).
- Display by the name the user queried / the metakernel referenced, annotating the dedup relationship — so the user always sees the name they expect, never a surprise substitution.
Alternative: switch canonical to “most-referenced across metakernels.” More code, marginal benefit once aliases are visible. Recommend not doing this now.
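The recommended display policy, together with the B5 annotations, reduces to two small formatting rules. The helper names below are hypothetical, the message wording follows the examples in B5, and the example file names are made up.

```python
from typing import Optional


def annotate_row(canonical: str, other_names: int) -> str:
    """Table cell for a canonical file, with a count of its other names."""
    if other_names <= 0:
        return canonical
    noun = "alias" if other_names == 1 else "aliases"
    return f"{canonical} (+{other_names} {noun})"


def dedup_notice(requested: str, canonical: str) -> Optional[str]:
    """Message shown when the requested name is an alias of another file."""
    if requested == canonical:
        return None  # nothing to explain: the user asked for the canonical
    return f"loaded {requested}; content-identical to {canonical} (alias)."


row = annotate_row("hga_zero.bc", 7)
notice = dedup_notice("hga_scm.bc", "hga_zero.bc")
```

The user-facing name is always the one passed in as requested; the canonical appears only inside the annotation.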
Tests
- A1 regression (the bug): build a tree where b.bc is a symlink → a.bc; download different content to b.bc; assert a.bc is untouched, b.bc is a new real file with the streamed bytes, and register_file does not raise. (Mirrors the BepiColombo hga_scm→hga_zero case.)
- A1 concurrency: two tasks whose dests are symlinks to the same inode, downloaded in parallel; assert both produce correct independent files.
- B2: registering two identical-content files under different names yields two kernel_aliases rows, one kernels row, one inode.
- B3 backfill: DB + on-disk symlink → aliases lists the symlink name after migration.
- B4: aliases resolves by hash, by canonical name, and by alias name.
- Patterns per tests/ conventions: tmp_spice_tree, pytest-tmp-files, mocked network.
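The A1 concurrency case could be prototyped as below before it is ported to the project's fixtures. This is a standalone sketch: atomic_write is a compact hypothetical stand-in for the fixed download path, threads stand in for the download workers, and it deliberately does not use tmp_spice_tree, whose shape is not shown here.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def atomic_write(dest: Path, payload: bytes) -> None:
    """Write to a sibling temp file, then replace the entry at dest."""
    tmp = dest.with_name(f".{dest.name}.{os.getpid()}.tmp")
    try:
        tmp.write_bytes(payload)
        os.replace(tmp, dest)
    except BaseException:
        tmp.unlink(missing_ok=True)
        raise


def run_concurrency_check(n_aliases: int = 8) -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        shared = root / "shared.bc"
        shared.write_bytes(b"shared")
        links = []
        for i in range(n_aliases):
            link = root / f"alias_{i}.bc"
            os.symlink(shared, link)  # every alias starts on one inode
            links.append(link)
        # Each worker writes its own name as the payload.
        with ThreadPoolExecutor(max_workers=n_aliases) as pool:
            list(pool.map(lambda p: atomic_write(p, p.name.encode()), links))
        return {
            "shared": shared.read_bytes(),
            "contents_ok": all(p.read_bytes() == p.name.encode() for p in links),
            "all_real_files": not any(p.is_symlink() for p in links),
        }


outcome = run_concurrency_check()
```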
Docs (project rule — ship with the feature)
- docs/cli.qmd: new aliases subcommand (options table + example); note the new (+N aliases) annotation on browse/get.
- docs/api.qmd: new KernelDB.aliases() method; note kernel_aliases table in the schema section.
- docs/troubleshooting.qmd: entry for the historical Hash mismatch … computed X, expected Y crash, covering what it meant (write-through-symlink corruption), that it’s fixed, and the get --force then dedup repair runbook.
- CHANGELOG.md: unreleased/next-version entries for both the fix and the alias feature.
Versioning / rollout
- Suggest 0.17.0 (new aliases command + schema table = feature): single release covering A + B.
- Schema change is additive + auto-migrating (CREATE IF NOT EXISTS + backfill); no manual user action beyond the optional repair re-fetch.
- Follow CLAUDE.md Release Process; update conda/meta.yaml manually per the standing note.
Decisions needed before coding
- D1 (display-name policy, B6). Recommend “keep first-registered as physical canonical, display the queried/metakernel name + annotate.” Confirm?
- D2 (alias backfill, B3). Automatic filesystem walk during migration, vs. explicit dedup --rebuild-aliases. Recommend automatic, with a log line listing how many aliases were recovered.
- D3 (repair scope, A3). Just BepiColombo (bc_ops, bc_plan), or sweep every mission’s tree for cross-name symlinks whose targets diverged upstream?
Suggested build order
- A1 (write-safety) + A1 regression test ← stops further corruption immediately
- A3 repair runbook (capture corrupted-file hashes first for the test fixture)
- B1–B3 (table + populate + backfill)
- B4–B5 (aliases command + output annotations)
- Docs + CHANGELOG + version bump + release