Write-safety fix + dedup alias trail

spice-kernel-db design & implementation plan

Author

Michael

Published

June 29, 2026

Important: Status

Draft for review — no code yet. Settle decisions D1–D3 (bottom) before implementation.

Trigger. update bc_ops.tm crashed with ValueError: Hash mismatch for bc_mpo_hga_zero_s20191107_v02.bc.

Root cause (proven). Dedup symlinks collapse many distinct kernel names onto one inode; download_kernel does open(dest,"wb"), which follows those symlinks, so a single parallel run writes several different kernels into one physical file.
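The write-through behaviour can be reproduced in isolation. The following is a minimal sketch, not the project's code: the file names zero.bc and scm.bc are hypothetical stand-ins for a dedup target and one of its aliases, and only the open(dest, "wb") call shape is taken from the description above.

```python
import os
import tempfile
from pathlib import Path


def write_through_symlink(workdir: Path) -> tuple:
    """Show that open(link, "wb") overwrites the link's target, not the link."""
    target = workdir / "zero.bc"      # hypothetical canonical (dedup target)
    alias = workdir / "scm.bc"        # hypothetical alias created by dedup
    target.write_bytes(b"ZERO-CONTENT")
    os.symlink(target.name, alias)    # relative link inside the same directory

    # Same call shape the plan attributes to download_kernel.
    with open(alias, "wb") as fh:
        fh.write(b"SCM-CONTENT")

    # The alias is still a symlink; the shared target now holds the new bytes.
    return target.read_bytes(), alias.is_symlink()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        content, still_link = write_through_symlink(Path(tmp))
        print(content, still_link)    # b'SCM-CONTENT' True
```

With several aliases pointing at the same target, each parallel writer performs this same overwrite on one inode, which is consistent with the hash mismatch seen in the trigger.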

This plan covers two complementary workstreams: A, write-path correctness, and B, a dedup alias trail.

A is required to stop corruption. B is required so the user can follow what dedup did. They ship together.

Figure 1: Failure chain and fix: legitimate February dedup → today’s write-through-symlink corruption → the os.replace fix. The three-phase flow diagram shows how identical bytes were deduped to one file in February, how a bc_ops update run had eight parallel threads write different kernels into that one shared inode (causing the hash-mismatch crash), and how replacing the symlink before writing fixes it.

See Figure 1 for the end-to-end picture.

Background — what the code does today

  • register_file (db.py:444) sets p = Path(path).resolve() (line 467) and stores str(p) in locations (line 575). It resolves the symlink first, so an aliased name like hga_scm_2018… is recorded under the target’s path (hga_zero…) — the alias name never reaches the DB.
  • kernels keeps only the first-registered filename as canonical (lines 512–514); a later identical-content name logs "Hash match: X is identical to already-registered Y" and is discarded (lines 552–556).
  • download_kernel (remote.py:350) writes with open(dest,"wb") (line 379) — follows symlinks.
  • Net effect: the only record that hga_scm_2018… ≡ this content is the on-disk symlink, pointing at a confusingly different name, with no DB trail.

Workstream A — Write-path correctness

A1. download_kernel → temp file + atomic replace

src/spice_kernel_db/remote.py (download_kernel, ~lines 371–412).

  • Download to dest.with_name(f".{dest.name}.<pid>.tmp") in the same directory (same filesystem → atomic rename).
  • On success (after completeness + optional expected_hash checks), os.replace(tmp, dest).
    • os.replace onto a symlink path replaces the link entry itself, atomically — it does not follow the link into the shared target. This is the core fix.
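  • On any exception, unlink the temp file only. This replaces the current dest.unlink() cleanup, which removes whatever entry sits at dest: unlink does not follow symlinks, so today a failed download deletes the alias link itself, or a previously good real file if dest is not a link.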
  • Keep the streaming SHA-256 (still the trusted hash of the bytes written).

Result: every download lands as its own independent real file. A shared dedup target can no longer be clobbered, even with 8 concurrent workers, even if several aliases point at it.
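The A1 steps can be sketched as follows. This is an illustration under stated assumptions, not the planned download_kernel body: write_atomically and its chunks argument are hypothetical names standing in for the streaming HTTP loop, and the completeness check is omitted.

```python
import hashlib
import os
from pathlib import Path
from typing import Iterable, Optional


def write_atomically(
    dest: Path, chunks: Iterable[bytes], expected_hash: Optional[str] = None
) -> str:
    """Stream chunks to a sibling temp file, verify, then os.replace onto dest."""
    # Same directory as dest, so the final rename stays on one filesystem.
    tmp = dest.with_name(f".{dest.name}.{os.getpid()}.tmp")
    digest = hashlib.sha256()
    try:
        with open(tmp, "wb") as fh:
            for chunk in chunks:
                fh.write(chunk)
                digest.update(chunk)      # streaming SHA-256 of the bytes written
        actual = digest.hexdigest()
        if expected_hash is not None and actual != expected_hash:
            raise ValueError(f"Hash mismatch for {dest.name}")
        # Replaces the directory entry at dest; a symlink there is swapped out,
        # its target is never opened.
        os.replace(tmp, dest)
    except BaseException:
        tmp.unlink(missing_ok=True)       # clean up the temp file, never dest
        raise
    return actual
```

Because the temp name embeds only the process id, two threads writing the same dest name in one process would collide on the temp file; the A2 guard, or a per-task suffix, would cover that case.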

A2. (defense in depth) skip duplicate dest within one batch

db.py get_metakernel task-build loop (~1798–1823). If two tasks resolve to the same physical path, log and keep one. Cheap guard; A1 already makes this non-fatal.
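A possible shape for this guard, assuming each task can be reduced to its destination path; drop_duplicate_dests is a hypothetical helper, not an existing function in db.py.

```python
from pathlib import Path


def drop_duplicate_dests(dests: list) -> tuple:
    """Keep the first task per physical path; report the ones skipped."""
    seen = {}                 # resolved physical path -> first dest that claimed it
    kept, skipped = [], []
    for dest in dests:
        physical = Path(dest).resolve()
        if physical in seen:
            # Caller would log this pair: (skipped dest, dest that was kept).
            skipped.append((dest, seen[physical]))
            continue
        seen[physical] = dest
        kept.append(dest)
    return kept, skipped
```

Keying on resolve() follows the wording above ("resolve to the same physical path"), which means an alias symlink counts as a duplicate of its target within the batch.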

A3. Repair runbook for the already-corrupted tree

The shared _zero_ / _sct_ / _fcp_ files currently hold whichever aliased kernel won the last race, so both the canonical file and every symlink pointing at it are suspect.

After A1 lands:

# For each affected BepiColombo metakernel, force a clean re-fetch:
spice-kernel-db get https://spiftp.esac.esa.int/data/SPICE/BEPICOLOMBO/kernels/mk/bc_ops.tm  --force
spice-kernel-db get https://spiftp.esac.esa.int/data/SPICE/BEPICOLOMBO/kernels/mk/bc_plan.tm --force
# then re-dedup honestly:
spice-kernel-db dedup

--force re-downloads every entry; with A1 each name now gets its own correct content; dedup re-links only the genuinely-identical ones.

Warning

Do not delete the current corrupted files first — they are the evidence for the regression test (A-test below). Capture their hashes before repair.
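One way to capture that evidence before the repair is a small snapshot script. This is a sketch only; sha256_of and snapshot_tree are hypothetical helpers, and the directory to scan is left to the operator.

```python
import hashlib
import os
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large kernels are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot_tree(directory: Path) -> dict:
    """Map each entry name to its current content hash and symlink target."""
    snapshot = {}
    for entry in sorted(directory.iterdir()):
        if not entry.is_file():           # skips subdirectories and dangling links
            continue
        snapshot[entry.name] = {
            # open() follows symlinks, so this is the content a reader sees today.
            "sha256": sha256_of(entry),
            "link_target": os.readlink(entry) if entry.is_symlink() else None,
        }
    return snapshot
```

Dumping the result with json.dumps(snapshot_tree(ck_dir), indent=2) into a file next to the test fixtures would preserve both the suspect hashes and the link layout for the A1 regression test.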

Workstream B — Dedup alias trail

B1. New table kernel_aliases

db.py _init_schema (~line 264), mirroring the existing CREATE TABLE IF NOT EXISTS + column-migration pattern used for superseded_by (lines 274–277).

CREATE TABLE IF NOT EXISTS kernel_aliases (
    sha256      VARCHAR NOT NULL,
    filename    VARCHAR NOT NULL,
    first_seen  TIMESTAMP DEFAULT current_timestamp,
    source_url  VARCHAR,
    PRIMARY KEY (sha256, filename)
);

Semantics: every distinct filename a given content hash has ever been registered under — deduped or not. The physical file still dedups to one inode; this table remembers all the names.
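The intended semantics can be tried out against an in-memory database. The sketch below uses Python's sqlite3 purely as a stand-in engine (an assumption; the project's actual database layer may differ), and record_alias and names_for are hypothetical helpers.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS kernel_aliases (
    sha256      VARCHAR NOT NULL,
    filename    VARCHAR NOT NULL,
    first_seen  TIMESTAMP DEFAULT current_timestamp,
    source_url  VARCHAR,
    PRIMARY KEY (sha256, filename)
);
"""


def record_alias(con, sha256, filename, source_url=None):
    """Remember one (content hash, name) pair; repeats are silently ignored."""
    con.execute(
        "INSERT OR IGNORE INTO kernel_aliases (sha256, filename, source_url) "
        "VALUES (?, ?, ?)",
        (sha256, filename, source_url),
    )


def names_for(con, sha256):
    """Every filename this content hash has been registered under."""
    rows = con.execute(
        "SELECT filename FROM kernel_aliases WHERE sha256 = ? ORDER BY filename",
        (sha256,),
    )
    return [row[0] for row in rows]
```

Registering the same hash under two names yields two rows, and registering either name again changes nothing, which is the "remembers all the names" behaviour described above.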

B2. Populate in register_file

Inside the existing transaction (db.py:519–576), after the kernels upsert:

INSERT OR IGNORE INTO kernel_aliases (sha256, filename, source_url)
VALUES (?, ?, ?)

with bound parameters (h, fname, source_url); the primary key (sha256, filename) is what makes a repeat registration a no-op. Record fname (the as-referenced basename) before any resolve() collapses it: capture Path(path).name at function entry, not the resolved target’s name. This is the one subtlety: today fname = p.name is taken after p = Path(path).resolve().
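The ordering issue is easy to see on a symlink. A minimal sketch, where split_names is a hypothetical helper showing only the order of the two operations:

```python
from pathlib import Path


def split_names(path):
    """Return (as-referenced basename, resolved physical path)."""
    as_referenced = Path(path)
    fname = as_referenced.name        # captured first: survives as the alias name
    p = as_referenced.resolve()       # follows symlinks to the physical file
    return fname, p
```

For a link scm.bc pointing at zero.bc, fname is "scm.bc" while p.name is "zero.bc"; taking the name from p, as the current code is described as doing, loses the alias.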

B3. Backfill migration for existing DBs

One-time, in _init_schema right after the table is created (guard on “table was just created / empty”):

  1. Seed from current kernels: INSERT OR IGNORE INTO kernel_aliases SELECT sha256, filename, … FROM kernels.
  2. Seed from on-disk symlinks: for each distinct directory in locations.abs_path, find symlinks; for each link → target, look up the target’s sha in locations, insert (sha, basename(link)). This recovers the trail that resolve() discarded (e.g. all the hga_scm_* names).

See decision D2 for automatic-during-migration vs. an explicit dedup --rebuild-aliases.
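Step 2 of the backfill could look roughly like this. It is a sketch under assumptions: iter_symlink_aliases is a hypothetical helper, and sha_by_path stands in for the lookup from an absolute path in locations to its content hash.

```python
from pathlib import Path
from typing import Dict, Iterable, Iterator, Tuple


def iter_symlink_aliases(
    directories: Iterable[str], sha_by_path: Dict[str, str]
) -> Iterator[Tuple[str, str]]:
    """Yield (sha256, alias basename) for every symlink with a registered target."""
    for directory in sorted(set(directories)):      # each distinct directory once
        for entry in sorted(Path(directory).iterdir()):
            if not entry.is_symlink():
                continue
            target = entry.resolve()                # physical file behind the link
            sha = sha_by_path.get(str(target))
            if sha is None:
                continue                            # dangling or unregistered target
            yield sha, entry.name
```

Each yielded pair would feed the same INSERT OR IGNORE used in B2, and counting the pairs gives the "aliases recovered" figure for the migration log line.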

B4. New command aliases <name|hash>

cli.py: sub.add_parser("aliases", …) near the other read-only commands (~line 358), dispatch elif args.command == "aliases": (~line 503 region). Read-only (add to read_only_commands).

New KernelDB.aliases(name_or_hash) -> dict in db.py:

  • Resolve input → content hash (accept a full/partial sha or any known filename via kernel_aliases/kernels).
  • Return: content hash, canonical name, kernel_type, size_bytes, all alias filenames (from kernel_aliases), and all on-disk locations (from locations).
  • CLI renders a rich.Table/Panel (house style).

Intended output:

$ spice-kernel-db aliases bc_mpo_hga_scm_20210101_20220101_s20230309_v01.bc
Content  8c7922…  (CK, 16 KB)
Canonical filename:  bc_mpo_hga_zero_s20191107_v02.bc
Also known as (7):   bc_mpo_hga_scm_20181020_…, bc_mpo_hga_scm_20200101_…, …
Locations (1):       …/BEPICOLOMBO/ck/bc_mpo_hga_zero_s20191107_v02.bc
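The lookup behind the command could be sketched as below. The table shapes are assumptions (a kernels table with one canonical filename per sha256, plus kernel_aliases as in B1), sqlite3 is used only as a stand-in engine, and kernel_type, size_bytes and locations are left out for brevity.

```python
import sqlite3

HEX = set("0123456789abcdef")


def resolve_hash(con: sqlite3.Connection, name_or_hash: str) -> str:
    """Map a filename (canonical or alias) or a sha256 prefix to one full hash."""
    rows = con.execute(
        "SELECT sha256 FROM kernel_aliases WHERE filename = ? "
        "UNION SELECT sha256 FROM kernels WHERE filename = ?",
        (name_or_hash, name_or_hash),
    ).fetchall()
    query = name_or_hash.lower()
    if not rows and query and set(query) <= HEX:
        # Only hex characters reach LIKE, so no wildcard escaping is needed.
        rows = con.execute(
            "SELECT sha256 FROM kernels WHERE sha256 LIKE ? || '%'", (query,)
        ).fetchall()
    if len(rows) != 1:
        raise LookupError(f"{name_or_hash!r} matched {len(rows)} content hashes")
    return rows[0][0]


def aliases(con: sqlite3.Connection, name_or_hash: str) -> dict:
    """Return the content hash, canonical name and other known names."""
    sha = resolve_hash(con, name_or_hash)
    canonical = con.execute(
        "SELECT filename FROM kernels WHERE sha256 = ?", (sha,)
    ).fetchone()[0]
    others = con.execute(
        "SELECT filename FROM kernel_aliases "
        "WHERE sha256 = ? AND filename != ? ORDER BY filename",
        (sha, canonical),
    ).fetchall()
    return {"sha256": sha, "canonical": canonical, "aliases": [r[0] for r in others]}
```

Raising on ambiguous input (a short prefix matching several hashes, or a filename seen with more than one content) is a design choice in this sketch; the CLI could instead list the candidates.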

B5. Surface aliases in existing output

  • browse / get kernel tables: annotate deduped rows, e.g. bc_mpo_hga_zero_…v02.bc (+7 aliases).
  • resolve_kernel dedup warning: name the canonical explicitly — “loaded hga_scm_2018…; content-identical to hga_zero… (alias).”

B6. Display-name policy — decision D1

With B1–B5, every name is recoverable, so the physical canonical matters less. Recommended UX:

  • Keep physical canonical = first-registered (stable, no churn).
  • Display by the name the user queried / the metakernel referenced, annotating the dedup relationship — so the user always sees the name they expect, never a surprise substitution.

Alternative: switch canonical to “most-referenced across metakernels.” More code, marginal benefit once aliases are visible. Recommend not doing this now.

Tests

  • A1 regression (the bug): build a tree where b.bc is a symlink → a.bc; download different content to b.bc; assert a.bc is untouched, b.bc is a new real file with the streamed bytes, and register_file does not raise. (Mirrors the BepiColombo hga_scm→hga_zero case.)
  • A1 concurrency: two tasks whose dests are symlinks to the same inode, downloaded in parallel; assert both produce correct independent files.
  • B2: registering two identical-content files under different names yields two kernel_aliases rows, one kernels row, one inode.
  • B3 backfill: DB + on-disk symlink → aliases lists the symlink name after migration.
  • B4: aliases resolves by hash, by canonical name, and by alias name.
  • Patterns per tests/ conventions: tmp_spice_tree, pytest-tmp-files, mocked network.
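The A1 concurrency case might be exercised along these lines. This is a sketch with no network and no project fixtures: atomic_write is a hypothetical stand-in for the fixed download path, and the file names are invented for the test.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def atomic_write(dest: Path, data: bytes) -> None:
    """Stand-in for the fixed download: temp file in dest's directory, then replace."""
    tmp = dest.with_name(f".{dest.name}.{os.getpid()}.tmp")
    try:
        tmp.write_bytes(data)
        os.replace(tmp, dest)             # swaps the link entry, not its target
    except BaseException:
        tmp.unlink(missing_ok=True)
        raise


def run_parallel_aliases(root: Path, n_aliases: int = 8) -> dict:
    """Point n symlinks at one file, overwrite them all in parallel, read back."""
    target = root / "zero.bc"
    target.write_bytes(b"ZERO")
    payloads = {}
    for i in range(n_aliases):
        alias = root / f"scm_{i}.bc"
        os.symlink(target.name, alias)    # every alias shares the target's inode
        payloads[alias] = f"SCM-{i}".encode()
    with ThreadPoolExecutor(max_workers=n_aliases) as pool:
        # list() forces completion and re-raises any worker exception.
        list(pool.map(lambda item: atomic_write(*item), payloads.items()))
    return {p.name: p.read_bytes() for p in sorted(root.iterdir())}
```

A pytest test would call run_parallel_aliases(tmp_path) and assert that zero.bc still holds its original bytes while each scm_N.bc is a real file with its own payload.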

Docs (project rule — ship with the feature)

  • docs/cli.qmd — new aliases subcommand (options table + example); note the new (+N aliases) annotation on browse/get.
  • docs/api.qmd — new KernelDB.aliases() method; note kernel_aliases table in the schema section.
  • docs/troubleshooting.qmd: entry for the historical Hash mismatch … computed X, expected Y crash: what it meant (write-through-symlink corruption), that it’s fixed, and the get --force + dedup repair runbook.
  • CHANGELOG.md — unreleased/next-version entries for both the fix and the alias feature.

Versioning / rollout

  • Suggest 0.17.0 (new aliases command + schema table = feature) — single release covering A + B.
  • Schema change is additive + auto-migrating (CREATE IF NOT EXISTS + backfill); no manual user action beyond the optional repair re-fetch.
  • Follow CLAUDE.md Release Process; update conda/meta.yaml manually per the standing note.

Decisions needed before coding

Note: D1 (display-name policy)

Recommend “keep first-registered as physical canonical, display the queried/metakernel name + annotate.” Confirm?

Note: D2 (backfill mechanism)

Automatic filesystem walk during migration, vs. explicit dedup --rebuild-aliases. Recommend automatic, with a log line listing how many aliases were recovered.

Note: D3 (scope of repair now)

Just BepiColombo (bc_ops, bc_plan), or sweep every mission’s tree for cross-name symlinks whose targets diverged upstream?

Suggested build order

  1. A1 (write-safety) + A1 regression test ← stops further corruption immediately
  2. A3 repair runbook (capture corrupted-file hashes first for the test fixture)
  3. B1–B3 (table + populate + backfill)
  4. B4–B5 (aliases command + output annotations)
  5. Docs + CHANGELOG + version bump + release