scitex_clew

scitex-clew — Hash-based verification for reproducible science.

Standalone package with zero third-party dependencies (pure standard library, including sqlite3). When used with scitex, integration is automatic via @stx.session and stx.io.

Public API:

import scitex_clew as clew

# Verification
clew.status()                      # git-status-like overview
clew.run(session_id)               # verify one run (hash check)
clew.chain(target_file)            # trace file → source chain
clew.dag(targets)                  # verify full DAG
clew.rerun(target)                 # re-execute & compare (sandbox)
clew.rerun_dag(targets)            # rerun full DAG in topo order
clew.rerun_claims()                # rerun all claim-backing sessions
clew.list_runs(limit=100)          # list tracked runs
clew.stats()                       # database statistics

# Claims
clew.add_claim(...)                # register manuscript assertion
clew.list_claims(...)              # list registered claims
clew.verify_claim(...)             # verify a specific claim

# Stamping
clew.stamp(...)                    # create temporal proof
clew.list_stamps(...)              # list stamps
clew.check_stamp(...)              # verify a stamp

# Hashing
clew.hash_file(path)               # SHA256 of a file
clew.hash_directory(path)          # SHA256 of all files in dir

# Visualization
clew.mermaid(...)                  # generate Mermaid DAG diagram

# Examples
clew.init_examples(dest)           # scaffold example pipeline

# Session lifecycle hooks (invoked by @stx.session)
clew.on_session_start(session_id)  # open a tracked run
clew.on_session_close(status=...)  # finalize run + combined hash

scitex_clew.status()

Get verification status summary (like git status).

scitex_clew.run(session_id, from_scratch=False)

Verify a specific run.

Parameters:
  • session_id (str) – Session identifier

  • from_scratch (bool, optional) – If True, re-execute the script and verify outputs (slow but thorough). If False, only compare hashes (fast).
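
The fast path (from_scratch=False) amounts to comparing recorded hashes with hashes recomputed from disk. The following is a minimal sketch of that idea using only hashlib. The recorded mapping and the verify_hashes helper are hypothetical names for illustration; they do not describe how scitex_clew stores or checks hashes internally.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_hex(path):
    """Return the full SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_hashes(recorded):
    """Compare recorded {path: digest} pairs with the files now on disk.

    Returns {path: "match" | "mismatch" | "missing"}.
    """
    report = {}
    for path, expected in recorded.items():
        if not Path(path).exists():
            report[path] = "missing"
        elif sha256_hex(path) == expected:
            report[path] = "match"
        else:
            report[path] = "mismatch"
    return report


def demo():
    """Record hashes for two files, modify one, and report the outcome."""
    with tempfile.TemporaryDirectory() as tmp:
        ok = Path(tmp) / "ok.csv"
        changed = Path(tmp) / "changed.csv"
        ok.write_text("a,b\n1,2\n")
        changed.write_text("a,b\n1,2\n")
        recorded = {
            str(ok): sha256_hex(ok),
            str(changed): sha256_hex(changed),
            str(Path(tmp) / "gone.csv"): "0" * 64,  # never written to disk
        }
        changed.write_text("a,b\n9,9\n")  # simulate a modified output
        report = verify_hashes(recorded)
        return [report[key] for key in recorded]
```

Calling demo() returns one status per recorded file, in recording order.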

scitex_clew.chain(target)

Verify the dependency chain for a target file.

scitex_clew.dag(targets=None, claims=False, strict=False)

Verify the DAG for multiple targets or all claims.

Parameters:
  • targets (list of str or Path, optional) – Target files to verify (mutually exclusive with claims).

  • claims (bool, optional) – If True, build the DAG from every registered claim.

  • strict (bool, optional) – If True (F2), return a failure-attribution dict with failed_node / root_cause / invalidated_claims / still_valid_claims instead of a DAGVerification.
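
To make the idea of failure attribution concrete, here is a small sketch that splits claims into invalidated and still valid ones given a set of failed nodes. Everything in it is an assumption for illustration: the parents/claims data shapes, the attribute_failure helper, the pluralized result keys, and the rule that a failed node with no failed ancestor counts as a root cause. The actual dict returned by dag(strict=True) may be built differently.

```python
def ancestors(node, parents):
    """Return every upstream node reachable from `node` (excluding itself)."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def attribute_failure(parents, failed, claims):
    """Split claims into invalidated and still valid given failed nodes.

    parents: {node: [upstream nodes]}
    failed:  set of nodes whose hashes no longer match
    claims:  {claim_id: node that backs the claim}
    """
    # Assumption: a failed node with no failed ancestor is a root cause.
    root_causes = sorted(
        node for node in failed if not (ancestors(node, parents) & failed)
    )
    invalidated, still_valid = [], []
    for claim_id, node in sorted(claims.items()):
        lineage = ancestors(node, parents) | {node}
        if lineage & failed:
            invalidated.append(claim_id)
        else:
            still_valid.append(claim_id)
    return {
        "failed_nodes": sorted(failed),
        "root_causes": root_causes,
        "invalidated_claims": invalidated,
        "still_valid_claims": still_valid,
    }


# Hypothetical pipeline: raw -> clean -> stats, and raw -> plot
parents = {"clean": ["raw"], "stats": ["clean"], "plot": ["raw"]}
claims = {"claim_p_value": "stats", "claim_figure": "plot"}
result = attribute_failure(parents, {"clean", "stats"}, claims)
```

In this example only the claim backed by stats is invalidated, because the plot branch never passes through a failed node.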

scitex_clew.rerun(target, timeout=300, cleanup=True)

Re-execute a session in a sandbox and compare outputs.

Parameters:
  • target (str or list[str]) – Session ID, script path, or artifact path.

  • timeout (int, optional) – Maximum execution time in seconds (default: 300).

  • cleanup (bool, optional) – Remove sandbox outputs after verification (default: True).
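
The re-execute-and-compare idea can be sketched with subprocess and a temporary directory. This is an illustration under stated assumptions, not the library's sandbox: rerun_in_sandbox, its arguments, and its status strings are hypothetical, and a temporary working directory provides no real isolation.

```python
import hashlib
import subprocess
import sys
import tempfile
from pathlib import Path


def rerun_in_sandbox(script_text, output_name, expected_digest, timeout=300):
    """Run a script in a throwaway directory and compare one output's hash."""
    # The directory is deleted when the block exits (the "cleanup" step).
    with tempfile.TemporaryDirectory() as sandbox:
        script = Path(sandbox) / "script.py"
        script.write_text(script_text)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=sandbox,
                timeout=timeout,
                capture_output=True,
            )
        except subprocess.TimeoutExpired:
            return "timeout"
        output = Path(sandbox) / output_name
        if proc.returncode != 0 or not output.exists():
            return "failed"
        digest = hashlib.sha256(output.read_bytes()).hexdigest()
        return "verified" if digest == expected_digest else "mismatch"


# A tiny deterministic script standing in for a tracked session.
SCRIPT = "with open('out.txt', 'wb') as f:\n    f.write(b'42\\n')\n"
expected = hashlib.sha256(b"42\n").hexdigest()
status = rerun_in_sandbox(SCRIPT, "out.txt", expected, timeout=60)
```

A deterministic script reproduces the recorded digest; any nondeterminism (timestamps, unseeded randomness) would surface as a mismatch.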

scitex_clew.rerun_dag(targets=None, timeout=300, cleanup=True)

Rerun-verify an entire DAG in topological order.

Each session is re-executed in a sandbox against its ORIGINAL stored inputs (not freshly rerun outputs from upstream), then compared to the original outputs.

Parameters:
  • targets (list of str, optional) – Target output files whose upstream DAG should be rerun. If None, all runs in the database are used and their output files become the targets.

  • timeout (int, optional) – Maximum execution time per session in seconds (default: 300).

  • cleanup (bool, optional) – Whether to remove sandbox output directories after each rerun.

Returns:

Unified verification result for the entire DAG.

Return type:

DAGVerification
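
Topological order means every session is visited only after the sessions it depends on. The standard library's graphlib can illustrate the ordering; the session names and the depends_on mapping below are hypothetical, and nothing here implies that scitex_clew uses graphlib internally.

```python
from graphlib import TopologicalSorter

# Hypothetical DAG: each session maps to the set of sessions it depends on.
depends_on = {
    "03_plot": {"02_analyze"},
    "02_analyze": {"01_load"},
    "01_load": set(),
}

# static_order() yields dependencies before the sessions that need them.
order = list(TopologicalSorter(depends_on).static_order())
```

For this linear chain the order is 01_load, 02_analyze, 03_plot. Because each rerun reads its ORIGINAL stored inputs, the order determines the sequence of checks rather than feeding one rerun's outputs into the next.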

scitex_clew.rerun_claims(file_path=None, claim_type=None, timeout=300, cleanup=True)

Rerun-verify all sessions that produced files referenced by claims.

Collects unique source files from matching claims, then delegates to rerun_dag with those files as targets.

Parameters:
  • file_path (str, optional) – Filter claims by manuscript file path.

  • claim_type (str, optional) – Filter claims by type (statistic, figure, table, text, value).

  • timeout (int, optional) – Maximum execution time per session in seconds (default: 300).

  • cleanup (bool, optional) – Whether to remove sandbox output directories after each rerun.

Returns:

Unified verification result for the upstream DAG of all source files referenced by the matching claims.

Return type:

DAGVerification
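
The "collect unique source files from matching claims" step can be pictured as a filter followed by an order-preserving de-duplication. The sketch below uses plain dicts as stand-ins for claim records; the collect_targets helper and the record layout are assumptions for illustration only.

```python
def collect_targets(claims, file_path=None, claim_type=None):
    """Return unique source files of the claims that pass both filters."""
    targets = []
    for claim in claims:
        if file_path is not None and claim["file_path"] != file_path:
            continue
        if claim_type is not None and claim["claim_type"] != claim_type:
            continue
        source = claim.get("source_file")
        if source and source not in targets:  # keep first occurrence only
            targets.append(source)
    return targets


# Hypothetical claim records
claims = [
    {"file_path": "paper.tex", "claim_type": "statistic", "source_file": "stats.csv"},
    {"file_path": "paper.tex", "claim_type": "figure", "source_file": "fig1.png"},
    {"file_path": "paper.tex", "claim_type": "statistic", "source_file": "stats.csv"},
    {"file_path": "supp.tex", "claim_type": "table", "source_file": "table_s1.csv"},
]

all_targets = collect_targets(claims)
stat_targets = collect_targets(claims, file_path="paper.tex", claim_type="statistic")
```

The resulting list is what would then be handed to a DAG rerun as its targets.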

scitex_clew.list_runs(limit=100, status=None)

List tracked runs.

scitex_clew.stats()

Get database statistics.

scitex_clew.add_claim(file_path, claim_type, line_number=None, claim_value=None, source_file=None, source_session=None)

Register a claim linking a manuscript assertion to the verification chain.

Parameters:
  • file_path (str) – Path to the manuscript file (e.g., paper.tex).

  • claim_type (str) – One of: statistic, figure, table, text, value.

  • line_number (int, optional) – Line number in the manuscript.

  • claim_value (str, optional) – The asserted value (e.g., “p = 0.003”).

  • source_file (str, optional) – Path to the source file that produced this claim.

  • source_session (str, optional) – Session ID that produced the source.

Returns:

The registered claim object.

Return type:

Claim

scitex_clew.list_claims(file_path=None, claim_type=None, status=None, limit=100)

List registered claims with optional filters.

Parameters:
  • file_path (str, optional) – Filter by manuscript file path.

  • claim_type (str, optional) – Filter by claim type.

  • status (str, optional) – Filter by verification status.

  • limit (int) – Maximum number of claims to return.

Return type:

list of Claim

scitex_clew.verify_claim(claim_id_or_location)

Verify a specific claim by checking its source against the verification chain.

Parameters:

claim_id_or_location (str) – Either a claim_id or a location string like “paper.tex:L42”.

Returns:

Verification result with claim details and chain status.

Return type:

dict
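
Since the single argument accepts either a claim id or a location string, a caller-side helper can tell the two apart. The sketch below assumes the location format is exactly "<file>:L<line>" as in the example "paper.tex:L42"; parse_claim_ref and the regular expression are hypothetical, and the library may distinguish the two forms differently.

```python
import re

# Assumed location format: "<file>:L<line>", e.g. "paper.tex:L42"
LOCATION_RE = re.compile(r"^(?P<file>.+):L(?P<line>\d+)$")


def parse_claim_ref(ref):
    """Classify a reference as ("location", file, line) or ("claim_id", id, None)."""
    match = LOCATION_RE.match(ref)
    if match:
        return ("location", match.group("file"), int(match.group("line")))
    return ("claim_id", ref, None)


location = parse_claim_ref("paper.tex:L42")
claim_id = parse_claim_ref("claim_ab12")  # hypothetical id format
```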

scitex_clew.export_claims_json(path=None, *, file_path_filter=None, read_only=True)

Export every registered claim to a canonical JSON artifact.

The exported file is the single human-readable + machine-consumable view of the claims table in db.sqlite. The DB remains the source of truth; this JSON is a regenerable artifact.

Path resolution (mirrors scitex_clew._db._core._default_db_path()):

1. Explicit path argument.
2. $SCITEX_CLEW_CLAIMS_JSON environment variable (escape hatch).
3. <project_root>/.scitex/clew/runtime/claims.json, where the project root is the nearest ancestor directory containing .git or pyproject.toml (falling back to the current working directory if none is found).

Parameters:
  • path (str | Path, optional) – Override the resolved path. Useful for tests / one-off dumps.

  • file_path_filter (str, optional) – When set, only claims registered against this manuscript file path are exported. Default: every claim in the DB.

  • read_only (bool, optional) – After writing, chmod 0o444 the file so accidental edits fail loudly at the OS layer. Default True (the file IS derived). Set False for tests that need to mutate the file.

Returns:

The path the artifact was written to (absolute).

Return type:

Path

Examples

>>> import scitex_clew as clew
>>> clew.add_claim("paper.tex", "value", 42, "0.94", source_file="r.csv")
>>> # claims.json now auto-exported under ./.scitex/clew/runtime/
>>> clew.export_claims_json()  # idempotent — re-emit on demand
PosixPath('.../.scitex/clew/runtime/claims.json')
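
The three-step path resolution and the read-only write described for export_claims_json can be sketched with pathlib and os. The helper names (find_project_root, resolve_claims_path, write_read_only) are hypothetical, and the sketch assumes the starting directory itself counts as a candidate project root; the library's own resolver may differ in such details.

```python
import os
import stat
import tempfile
from pathlib import Path


def find_project_root(start):
    """Nearest directory (start or an ancestor) with .git or pyproject.toml."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / ".git").exists() or (candidate / "pyproject.toml").exists():
            return candidate
    return start  # fall back to the starting directory


def resolve_claims_path(path=None, cwd=None, env=None):
    """Apply the order: explicit path, then env var, then project default."""
    env = os.environ if env is None else env
    if path is not None:
        return Path(path).resolve()
    override = env.get("SCITEX_CLEW_CLAIMS_JSON")
    if override:
        return Path(override).resolve()
    root = find_project_root(cwd or Path.cwd())
    return root / ".scitex" / "clew" / "runtime" / "claims.json"


def write_read_only(path, text):
    """Write a derived artifact, then mark it read-only (mode 0o444)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.exists():
        path.chmod(0o644)  # allow re-emitting over a previous read-only copy
    path.write_text(text)
    path.chmod(0o444)
    return path
```

Restoring write permission before rewriting is what keeps a repeated export idempotent despite the read-only flag.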

scitex_clew.register_intermediate(name, value, supports=None, session_id=None, claim_type='value')

Register a computed intermediate as a Clew claim.

Use this from inside a @stx.session script (or from an agent loop) to record any non-trivial intermediate value with explicit upstream support. The claim becomes part of the DAG and can be queried via clew.chain, clew.dag, or the MCP clew_chain / clew_dag tools.

Parameters:
  • name (str) – Descriptive identifier (e.g. “acute_n_sig_pathways”). Avoid generic names like “result_3” — the id is the only handle a future inspector has on the value.

  • value (Any) – The computed result. Coerced to string for storage; the hash chain sees repr(value) so types matter.

  • supports (Optional[List[str]]) – List of upstream claim ids or session ids that this value depends on. Stored as JSON in the claim’s value field for retrieval. None means no explicit upstream (use sparingly).

  • session_id (Optional[str]) – The session this value belongs to. If None, read from the SCITEX_SESSION_ID env var that @stx.session sets at start.

  • claim_type (str) – One of statistic, figure, table, text, value. Defaults to value since intermediates are usually scalar / categorical results.

Returns:

The registered claim object.

Return type:

Claim

Raises:

ValueError – If no session_id can be determined (env var unset and not passed).

Examples

Inside a @stx.session script:

>>> from scitex_clew import register_intermediate
>>> n_sig = sum(1 for p in pathways if p.padj < 0.05)
>>> register_intermediate(
...     name="chronic_r2_n_sig_pathways",
...     value=n_sig,
...     supports=["chronic_r2_min_pvals", "reactome_pathways_v2024"],
... )
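
Two details from the parameter descriptions are easy to miss: the session id falls back to the SCITEX_SESSION_ID environment variable, and hashing repr(value) makes the value's type significant. The sketch below illustrates both. The helpers resolve_session_id and encode_intermediate, and the payload layout, are hypothetical and are not the library's storage format.

```python
import json
import os


def resolve_session_id(session_id=None, env=None):
    """Use the explicit id if given, else SCITEX_SESSION_ID, else fail."""
    env = os.environ if env is None else env
    resolved = session_id or env.get("SCITEX_SESSION_ID")
    if not resolved:
        raise ValueError("no session_id given and SCITEX_SESSION_ID is unset")
    return resolved


def encode_intermediate(value, supports=None):
    """Hypothetical payload: string form for storage, repr for hashing."""
    return {
        "stored": json.dumps({"value": str(value), "supports": supports or []}),
        "hashed_repr": repr(value),
    }


as_int = encode_intermediate(3, supports=["chronic_r2_min_pvals"])
as_str = encode_intermediate("3", supports=["chronic_r2_min_pvals"])
# Same stored string, different repr: 3 versus '3'
same_stored = as_int["stored"] == as_str["stored"]
same_hashed = as_int["hashed_repr"] == as_str["hashed_repr"]
```

Under these assumptions an integer 3 and a string "3" store identically but hash differently, so converting a value's type between runs would register as a change.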

scitex_clew.stamp(backend='file', service_url=None, session_ids=None, output_dir=None)

Record root hash with external timestamp.

Parameters:
  • backend (str) – One of: file, rfc3161, zenodo.

  • service_url (str, optional) – URL for RFC 3161 TSA or Zenodo API.

  • session_ids (list of str, optional) – Specific sessions to stamp. If None, stamps all successful runs.

  • output_dir (str, optional) – Directory for file-based stamps (default: <db_dir>/stamps, i.e. .scitex/clew/runtime/stamps/).

Returns:

The timestamp proof record.

Return type:

Stamp

scitex_clew.list_stamps(limit=20)

List all stamps.

Return type:

List[Stamp]

scitex_clew.check_stamp(stamp_id=None)

Verify a stamp against current verification state.

Parameters:

stamp_id (str, optional) – Specific stamp to check. If None, checks the latest stamp.

Returns:

{stamp, current_root_hash, matches, details}

Return type:

dict
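
Checking a stamp boils down to recomputing a root hash from the current state and comparing it with the stamped one. The sketch below assumes, purely for illustration, that the root hash is a SHA-256 over sorted "session_id:hash" lines; the real root-hash construction is not documented here. The helpers root_hash and check_against_stamp and the stamp dict layout are hypothetical.

```python
import hashlib


def root_hash(session_hashes):
    """Combine {session_id: hash} into one digest, independent of dict order."""
    digest = hashlib.sha256()
    for session_id in sorted(session_hashes):
        digest.update(f"{session_id}:{session_hashes[session_id]}\n".encode())
    return digest.hexdigest()


def check_against_stamp(stamp, session_hashes):
    """Return a result shaped like {stamp, current_root_hash, matches, details}."""
    current = root_hash(session_hashes)
    matches = current == stamp["root_hash"]
    details = "unchanged since stamp" if matches else "state changed after stamp"
    return {
        "stamp": stamp,
        "current_root_hash": current,
        "matches": matches,
        "details": details,
    }


hashes = {"s1": "aa", "s2": "bb"}  # hypothetical per-session hashes
stamp = {"stamp_id": "demo", "root_hash": root_hash(hashes)}
unchanged = check_against_stamp(stamp, hashes)
tampered = check_against_stamp(stamp, {"s1": "aa", "s2": "cc"})
```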

scitex_clew.hash_file(path, algorithm='sha256', chunk_size=8192)

Compute hash of a file.

Parameters:
  • path (str or Path) – Path to the file to hash

  • algorithm (str, optional) – Hash algorithm (default: sha256)

  • chunk_size (int, optional) – Size of chunks to read (default: 8192)

Returns:

Hexadecimal hash string (first 32 characters)

Return type:

str

Examples

>>> hash_file("data.csv")
'a1b2c3d4e5f6...'
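
Chunked hashing keeps memory use constant regardless of file size. The following sketch mirrors the documented parameters and the 32-character truncation with plain hashlib. hash_file_sketch is a hypothetical stand-in and is not guaranteed to match hash_file byte for byte.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file_sketch(path, algorithm="sha256", chunk_size=8192, length=32):
    """Hash a file in fixed-size chunks and return a truncated hex digest."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()[:length]


def demo():
    """Hash a small temporary file with a tiny chunk size."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "data.csv"
        target.write_bytes(b"hello")
        return hash_file_sketch(target, chunk_size=2)
```

The chunk size affects only how the file is read, not the resulting digest.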

scitex_clew.hash_directory(path, pattern='*', recursive=True, algorithm='sha256')

Compute hashes for all files in a directory.

Parameters:
  • path (str or Path) – Directory path

  • pattern (str, optional) – Glob pattern for files (default: “*”)

  • recursive (bool, optional) – Whether to search recursively (default: True)

  • algorithm (str, optional) – Hash algorithm (default: sha256)

Returns:

Mapping of relative paths to hashes

Return type:

dict

Examples

>>> hash_directory("./data/")
{'input.csv': 'a1b2...', 'config.yaml': 'c3d4...'}
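
A directory hash map of this shape can be sketched with pathlib globbing. hash_directory_sketch below is a hypothetical stand-in; the POSIX-style relative keys and the 32-character truncation are assumptions and may not match what hash_directory returns.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_directory_sketch(path, pattern="*", recursive=True, algorithm="sha256"):
    """Map each matching file's relative path to a truncated hex digest."""
    root = Path(path)
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    result = {}
    for item in sorted(matches):
        if item.is_file():  # skip directories matched by the pattern
            hasher = hashlib.new(algorithm, item.read_bytes())
            result[item.relative_to(root).as_posix()] = hasher.hexdigest()[:32]
    return result


def demo():
    """Hash a small tree recursively and non-recursively."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "input.csv").write_bytes(b"a,b\n")
        (root / "sub").mkdir()
        (root / "sub" / "config.yaml").write_bytes(b"k: v\n")
        deep = hash_directory_sketch(root)
        flat = hash_directory_sketch(root, recursive=False)
        return deep, flat
```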

scitex_clew.mermaid(session_id=None, target_file=None, target_files=None, claims=False, grouper=None, **kwargs)

Generate a Mermaid DAG diagram.

Parameters:
  • session_id (str, optional) – Start from this session.

  • target_file (str, optional) – Start from the session that produced this file.

  • target_files (list of str, optional) – Multiple target files (multi-target DAG).

  • claims (bool, optional) – If True, build DAG from all registered claims.

  • grouper (callable | dict | None, optional) – File grouping strategy. Callable or JSON/dict spec (see scitex_clew.groupers.resolve_spec). If None, falls back to .scitex/clew/config.yaml (key grouper) if present.
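
For readers unfamiliar with Mermaid, a DAG diagram is plain text listing nodes and arrows. The sketch below renders a hypothetical edge list as a Mermaid flowchart. to_mermaid and its node-id scheme are illustrative and do not describe the output format of clew.mermaid.

```python
def to_mermaid(edges):
    """Render (upstream, downstream) pairs as a top-down Mermaid flowchart."""
    names = sorted({name for edge in edges for name in edge})
    ids = {name: f"n{index}" for index, name in enumerate(names)}
    lines = ["graph TD"]
    for name in names:
        lines.append(f'    {ids[name]}["{name}"]')  # node with a quoted label
    for upstream, downstream in edges:
        lines.append(f"    {ids[upstream]} --> {ids[downstream]}")
    return "\n".join(lines)


# Hypothetical three-step pipeline
edges = [("01_load.py", "02_analyze.py"), ("02_analyze.py", "03_plot.py")]
diagram = to_mermaid(edges)
```

The resulting text can be pasted into any Mermaid renderer to view the graph.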

scitex_clew.init_examples(dest, variant='sequential', *, find_examples_dir=<function _find_examples_dir>)

Copy Clew example scripts to a destination directory.

Copies only the runnable scripts (.py, .sh) and README — not the output directories. Users run 00_run_all.sh themselves to generate outputs and populate the verification database.

Parameters:
  • dest (str or Path) – Destination directory. Created if it does not exist. Existing script files are overwritten.

  • variant (str, optional) – Example variant: “sequential” (default) or “multi_parent”.

  • find_examples_dir (callable, optional) – Locator callable (variant: str) -> Optional[Path] used to resolve the bundled examples source. Production callers should not pass this; it is a dependency-injection seam (PA-306 §1) that lets tests inject a fake returning a tmp_path-rooted directory or None.

Returns:

{"path": str, "files": list[str], "file_count": int, "variant": str}

Return type:

dict

Raises:

scitex_clew.on_session_start(session_id, script_path=None, parent_session=None, verbose=False, metadata=None)

Hook called when a session starts.

Parameters:
  • session_id (str) – Unique session identifier

  • script_path (str, optional) – Path to the script being run

  • parent_session (str, optional) – Parent session ID for chain tracking

  • verbose (bool, optional) – Whether to log status messages

  • metadata (dict, optional) – Additional metadata (e.g. notebook_path, cell_index)

Return type:

None

scitex_clew.on_session_close(status='success', exit_code=0, verbose=False, register=None)

Hook called when a session closes.

Parameters:
  • status (str, optional) – Final status (success, failed, error)

  • exit_code (int, optional) – Exit code of the script

  • verbose (bool, optional) – Whether to log status messages

  • register (bool, optional) – If True, register session hashes with remote Clew Registry. If None, checks SCITEX_AUTO_REGISTER environment variable.

Return type:

None