scitex_clew

scitex-clew — Hash-based verification for reproducible science.

Standalone package with zero third-party dependencies (pure standard library, including sqlite3). When used with scitex, integration is automatic via @stx.session and stx.io.

Public API:

import scitex_clew as clew

# Verification
clew.status()                      # git-status-like overview
clew.run(session_id)               # verify one run (hash check)
clew.chain(target_file)            # trace file → source chain
clew.dag(targets)                  # verify full DAG
clew.rerun(target)                 # re-execute & compare (sandbox)
clew.rerun_dag(targets)            # rerun full DAG in topo order
clew.rerun_claims()                # rerun all claim-backing sessions
clew.list_runs(limit=100)          # list tracked runs
clew.stats()                       # database statistics

# Claims
clew.add_claim(...)                # register manuscript assertion
clew.list_claims(...)              # list registered claims
clew.verify_claim(...)             # verify a specific claim

# Stamping
clew.stamp(...)                    # create temporal proof
clew.list_stamps(...)              # list stamps
clew.check_stamp(...)              # verify a stamp

# Hashing
clew.hash_file(path)               # SHA256 of a file
clew.hash_directory(path)          # SHA256 of all files in dir

# Visualization
clew.mermaid(...)                  # generate Mermaid DAG diagram

# Examples
clew.init_examples(dest)           # scaffold example pipeline

# Session lifecycle hooks (invoked by @scitex.session)
clew.on_session_start(session_id)  # open a tracked run
clew.on_session_close(status=...)  # finalize run + combined hash

scitex_clew.status()

Get verification status summary (like git status).

scitex_clew.run(session_id, from_scratch=False)

Verify a specific run.

Parameters:
  • session_id (str) – Session identifier

  • from_scratch (bool, optional) – If True, re-execute the script and verify outputs (slow but thorough). If False, only compare hashes (fast).
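
The fast path (from_scratch=False) amounts to comparing recorded hashes with the files currently on disk. The following is a minimal sketch of that idea using only the standard library. The helper names and the status labels ("verified", "mismatch", "missing") are assumptions for illustration and are not taken from scitex_clew.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the full SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_recorded_hashes(recorded):
    """Compare recorded hashes with the files currently on disk.

    recorded maps a file path to the hash stored when the run finished.
    The status labels are hypothetical, chosen only for this sketch.
    """
    report = {}
    for path, expected in recorded.items():
        if not Path(path).exists():
            report[path] = "missing"
        elif sha256_of(path) == expected:
            report[path] = "verified"
        else:
            report[path] = "mismatch"
    return report


with tempfile.TemporaryDirectory() as tmp:
    kept = Path(tmp) / "kept.csv"
    edited = Path(tmp) / "edited.csv"
    kept.write_text("x,y\n1,2\n")
    edited.write_text("x,y\n1,2\n")
    # Record hashes as they would be stored at the end of a run.
    recorded = {str(kept): sha256_of(kept), str(edited): sha256_of(edited)}

    edited.write_text("x,y\n1,3\n")  # simulate a later manual edit
    report = verify_recorded_hashes(recorded)
    statuses = [report[str(kept)], report[str(edited)]]

print(statuses)  # ['verified', 'mismatch']
```

A hash comparison of this kind detects that an output changed but cannot tell whether the script would still reproduce it, which is the gap from_scratch=True is described as closing.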

scitex_clew.chain(target)

Verify the dependency chain for a target file.
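
As a rough picture of what tracing a file back to its sources involves, the sketch below walks a hypothetical provenance record from a target file to the sessions that produced it and its inputs. The two mappings (produced_by, inputs_of) and the function name are assumptions made for this example. The actual storage layout in scitex_clew is not described here.

```python
def trace_chain(target, produced_by, inputs_of):
    """Walk from a target file back towards its sources.

    produced_by maps a file to the session that wrote it.
    inputs_of maps a session to the list of files it read.
    Returns (session, output_file) pairs, starting at the target.
    """
    chain = []
    frontier = [target]
    seen = set()
    while frontier:
        current = frontier.pop(0)
        if current in seen or current not in produced_by:
            continue  # already visited, or a raw source with no producer
        seen.add(current)
        session = produced_by[current]
        chain.append((session, current))
        frontier.extend(inputs_of.get(session, []))
    return chain


# Hypothetical three-step pipeline: raw.csv -> clean.csv -> stats.json -> fig1.png
produced_by = {
    "fig1.png": "s3_plot",
    "stats.json": "s2_analyze",
    "clean.csv": "s1_preprocess",
}
inputs_of = {
    "s3_plot": ["stats.json"],
    "s2_analyze": ["clean.csv"],
    "s1_preprocess": ["raw.csv"],
}

chain = trace_chain("fig1.png", produced_by, inputs_of)
for session, output in chain:
    print(f"{output} <- {session}")
```

Each link in such a chain can then be checked by hash, so a single mismatch localizes where the recorded history and the files on disk diverge.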

scitex_clew.dag(targets=None, claims=False, strict=False)

Verify the DAG for multiple targets or all claims.

Parameters:
  • targets (list of str or Path, optional) – Target files to verify (mutually exclusive with claims).

  • claims (bool, optional) – If True, build the DAG from every registered claim.

  • strict (bool, optional) – If True (F2), return a failure-attribution dict with failed_node / root_cause / invalidated_claims / still_valid_claims instead of a DAGVerification.
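
To make the failure-attribution fields more concrete, here is a minimal sketch of how a root cause and the sets of invalidated and still-valid claims could be derived from a DAG. The graph encoding, the attribute_failure helper, and the rule "the root cause is the first failed node in topological order" are assumptions for illustration. The sketch does not model failed_node, whose exact meaning is not specified above.

```python
from graphlib import TopologicalSorter


def attribute_failure(parents, failed, claims):
    """Attribute a verification failure to claims.

    parents maps each node to the set of nodes it depends on.
    failed is the set of nodes whose own hash check failed.
    claims maps a claim id to the node that backs it.
    """
    tainted = set()
    root_cause = None
    # static_order() yields upstream nodes before their dependents.
    for node in TopologicalSorter(parents).static_order():
        upstream_tainted = any(p in tainted for p in parents.get(node, ()))
        if node in failed or upstream_tainted:
            tainted.add(node)
            if root_cause is None:
                root_cause = node  # first tainted node is always a failed one
    return {
        "root_cause": root_cause,
        "invalidated_claims": sorted(c for c, n in claims.items() if n in tainted),
        "still_valid_claims": sorted(c for c, n in claims.items() if n not in tainted),
    }


# Hypothetical DAG: two outputs depend on clean.csv, one on an unrelated file.
parents = {
    "clean.csv": {"raw.csv"},
    "stats.json": {"clean.csv"},
    "fig1.png": {"clean.csv"},
    "table1.csv": {"other.csv"},
}
claims = {
    "claim_p_value": "stats.json",
    "claim_fig1": "fig1.png",
    "claim_table1": "table1.csv",
}

result = attribute_failure(parents, failed={"clean.csv"}, claims=claims)
print(result)
```

In this example a single modified intermediate file invalidates the two claims downstream of it while the claim on the independent branch remains valid.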

scitex_clew.rerun(target, timeout=300, cleanup=True)

Re-execute a session in a sandbox and compare outputs.

Parameters:
  • target (str or list[str]) – Session ID, script path, or artifact path.

  • timeout (int, optional) – Maximum execution time in seconds (default: 300).

  • cleanup (bool, optional) – Remove sandbox outputs after verification (default: True).

scitex_clew.rerun_dag(targets=None, timeout=300, cleanup=True)

Rerun-verify an entire DAG in topological order.

Each session is re-executed in a sandbox against its ORIGINAL stored inputs (not freshly rerun outputs from upstream), then compared to the original outputs.

Parameters:
  • targets (list of str, optional) – Target output files whose upstream DAG should be rerun. If None, all runs in the database are used and their output files become the targets.

  • timeout (int, optional) – Maximum execution time per session in seconds (default: 300).

  • cleanup (bool, optional) – Whether to remove sandbox output directories after each rerun (default: True).

Returns:

Unified verification result for the entire DAG.

Return type:

DAGVerification
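
The two properties described above, topological ordering and rerunning each session against its original stored inputs, can be sketched with the standard library. Everything in this example (the session records, the digest helper, the toy "scripts") is hypothetical and only illustrates the idea. It is not how scitex_clew sandboxes or stores runs.

```python
import hashlib
from graphlib import TopologicalSorter


def digest(text):
    """SHA-256 hex digest of a string, standing in for an output-file hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical records: each session keeps its ORIGINAL input and output hash.
sessions = {
    "s1_preprocess": {
        "upstream": set(),
        "func": lambda text: text.strip().lower(),  # script changed since the original run
        "stored_input": " Raw ",
        "stored_output_hash": digest("Raw"),
    },
    "s2_analyze": {
        "upstream": {"s1_preprocess"},
        "func": str.upper,
        "stored_input": "Raw",  # original upstream output, not the fresh rerun
        "stored_output_hash": digest("RAW"),
    },
}


def rerun_all(sessions):
    """Rerun every session in topological order against its stored input."""
    graph = {sid: rec["upstream"] for sid, rec in sessions.items()}
    order = list(TopologicalSorter(graph).static_order())
    results = {}
    for sid in order:
        rec = sessions[sid]
        fresh_output = rec["func"](rec["stored_input"])
        results[sid] = digest(fresh_output) == rec["stored_output_hash"]
    return order, results


order, results = rerun_all(sessions)
print(order)    # ['s1_preprocess', 's2_analyze']
print(results)  # {'s1_preprocess': False, 's2_analyze': True}
```

Because s2_analyze is fed the original stored input rather than the drifted output of s1_preprocess, the failure stays attributed to the session that actually changed instead of cascading downstream.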

scitex_clew.rerun_claims(file_path=None, claim_type=None, timeout=300, cleanup=True)

Rerun-verify all sessions that produced files referenced by claims.

Collects unique source files from matching claims, then delegates to rerun_dag with those files as targets.

Parameters:
  • file_path (str, optional) – Filter claims by manuscript file path.

  • claim_type (str, optional) – Filter claims by type (statistic, figure, table, text, value).

  • timeout (int, optional) – Maximum execution time per session in seconds (default: 300).

  • cleanup (bool, optional) – Whether to remove sandbox output directories after each rerun (default: True).

Returns:

Unified verification result for the upstream DAG of all source files referenced by the matching claims.

Return type:

DAGVerification
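
The collection step described above (filter claims, then gather unique source files) could look roughly like the sketch below. ClaimRecord is a hypothetical stand-in with only three fields and is not the library's Claim class. The helper name is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClaimRecord:
    """Hypothetical, simplified stand-in for a registered claim."""
    file_path: str
    claim_type: str
    source_file: Optional[str]


def collect_source_files(claims, file_path=None, claim_type=None):
    """Return unique source files of matching claims, in first-seen order."""
    unique = {}
    for claim in claims:
        if file_path is not None and claim.file_path != file_path:
            continue
        if claim_type is not None and claim.claim_type != claim_type:
            continue
        if claim.source_file:
            unique.setdefault(claim.source_file, None)  # dict keeps insertion order
    return list(unique)


claims = [
    ClaimRecord("paper.tex", "statistic", "results/stats.json"),
    ClaimRecord("paper.tex", "figure", "figures/fig1.png"),
    ClaimRecord("paper.tex", "statistic", "results/stats.json"),  # duplicate source
    ClaimRecord("supplement.tex", "table", "results/table_s1.csv"),
    ClaimRecord("paper.tex", "text", None),  # claim without a source file
]

all_targets = collect_source_files(claims)
stat_targets = collect_source_files(claims, file_path="paper.tex", claim_type="statistic")
print(all_targets)
print(stat_targets)
```

The resulting list is what would then be handed on as the targets of a DAG rerun.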

scitex_clew.list_runs(limit=100, status=None)

List tracked runs.

scitex_clew.stats()

Get database statistics.

scitex_clew.add_claim(file_path, claim_type, line_number=None, claim_value=None, source_file=None, source_session=None)

Register a claim linking a manuscript assertion to the verification chain.

Parameters:
  • file_path (str) – Path to the manuscript file (e.g., paper.tex).

  • claim_type (str) – One of: statistic, figure, table, text, value.

  • line_number (int, optional) – Line number in the manuscript.

  • claim_value (str, optional) – The asserted value (e.g., “p = 0.003”).

  • source_file (str, optional) – Path to the source file that produced this claim.

  • source_session (str, optional) – Session ID that produced the source.

Returns:

The registered claim object.

Return type:

Claim

scitex_clew.list_claims(file_path=None, claim_type=None, status=None, limit=100)

List registered claims with optional filters.

Parameters:
  • file_path (str, optional) – Filter by manuscript file path.

  • claim_type (str, optional) – Filter by claim type.

  • status (str, optional) – Filter by verification status.

  • limit (int) – Maximum number of claims to return.

Return type:

list of Claim

scitex_clew.verify_claim(claim_id_or_location)

Verify a specific claim by checking its source against the verification chain.

Parameters:

claim_id_or_location (str) – Either a claim_id or a location string like “paper.tex:L42”.

Returns:

Verification result with claim details and chain status.

Return type:

dict
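
Since the argument accepts either a claim id or a location string, some disambiguation has to happen. The sketch below shows one plausible way to tell the two apart, assuming locations always end in ":L" followed by a line number as in the "paper.tex:L42" example. The regular expression and the returned dictionary shape are assumptions, and the library's own parsing may differ.

```python
import re

# Assumed location format: "<file path>:L<line number>", e.g. "paper.tex:L42".
LOCATION_RE = re.compile(r"^(?P<path>.+):L(?P<line>\d+)$")


def parse_claim_ref(ref):
    """Classify a reference as a manuscript location or a plain claim id."""
    match = LOCATION_RE.match(ref)
    if match:
        return {
            "kind": "location",
            "file_path": match["path"],
            "line_number": int(match["line"]),
        }
    return {"kind": "claim_id", "claim_id": ref}


location = parse_claim_ref("paper.tex:L42")
claim_id = parse_claim_ref("claim_ab12cd")  # hypothetical id format
print(location)
print(claim_id)
```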

scitex_clew.register_intermediate(name, value, supports=None, session_id=None, claim_type='value')

Register a computed intermediate as a Clew claim.

Use this from inside a @stx.session script (or from an agent loop) to record any non-trivial intermediate value with explicit upstream support. The claim becomes part of the DAG and can be queried via clew.chain, clew.dag, or the MCP clew_chain / clew_dag tools.

Parameters:
  • name (str) – Descriptive identifier (e.g. “acute_n_sig_pathways”). Avoid generic names like “result_3” — the id is the only handle a future inspector has on the value.

  • value (Any) – The computed result. Coerced to string for storage; the hash chain sees repr(value) so types matter.

  • supports (Optional[List[str]]) – List of upstream claim ids or session ids that this value depends on. Stored as JSON in the claim’s value field for retrieval. None means no explicit upstream (use sparingly).

  • session_id (Optional[str]) – The session this value belongs to. If None, read from the SCITEX_SESSION_ID env var that @stx.session sets at start.

  • claim_type (str) – One of statistic, figure, table, text, value. Defaults to value since intermediates are usually scalar / categorical results.

Returns:

The registered claim object.

Return type:

Claim

Raises:

ValueError – If no session_id can be determined (env var unset and not passed).

Examples

Inside a @stx.session script:

>>> from scitex_clew import register_intermediate
>>> n_sig = sum(1 for p in pathways if p.padj < 0.05)
>>> register_intermediate(
...     name="chronic_r2_n_sig_pathways",
...     value=n_sig,
...     supports=["chronic_r2_min_pvals", "reactome_pathways_v2024"],
... )
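
Two details from the parameter descriptions are easy to overlook: the session id falls back to the SCITEX_SESSION_ID environment variable, and the hash chain sees repr(value), so the type of the value matters. The following sketch illustrates both behaviours in isolation. The helper names and the choice of SHA-256 over repr(value) are assumptions for illustration rather than the library's internals.

```python
import hashlib
import os


def resolve_session_id(session_id=None, environ=None):
    """Use the explicit id if given, else fall back to SCITEX_SESSION_ID."""
    environ = os.environ if environ is None else environ
    resolved = session_id or environ.get("SCITEX_SESSION_ID")
    if not resolved:
        raise ValueError("session_id not passed and SCITEX_SESSION_ID is unset")
    return resolved


def value_digest(value):
    """Hash repr(value), so 12, 12.0 and '12' all produce different digests."""
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()


# Explicit argument wins over the environment.
explicit = resolve_session_id("sess_explicit", environ={"SCITEX_SESSION_ID": "sess_env"})
from_env = resolve_session_id(environ={"SCITEX_SESSION_ID": "sess_env"})

try:
    resolve_session_id(environ={})
    missing_raises = False
except ValueError:
    missing_raises = True

digests = {value_digest(12), value_digest(12.0), value_digest("12")}
print(explicit, from_env, missing_raises, len(digests))
```

A practical consequence is that casting a count to float, or formatting it as a string before registering it, would change the recorded hash even though the number is the same.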

scitex_clew.stamp(backend='file', service_url=None, session_ids=None, output_dir=None)

Record the root hash with an external timestamp.

Parameters:
  • backend (str) – One of: file, rfc3161, zenodo.

  • service_url (str, optional) – URL for RFC 3161 TSA or Zenodo API.

  • session_ids (list of str, optional) – Specific sessions to stamp. If None, stamps all successful runs.

  • output_dir (str, optional) – Directory for file-based stamps (default: <db_dir>/stamps, i.e. .scitex/clew/runtime/stamps/).

Returns:

The timestamp proof record.

Return type:

Stamp

scitex_clew.list_stamps(limit=20)

List all stamps.

Return type:

List[Stamp]

scitex_clew.check_stamp(stamp_id=None)

Verify a stamp against current verification state.

Parameters:

stamp_id (str, optional) – Specific stamp to check. If None, checks the latest stamp.

Returns:

Dict with the keys stamp, current_root_hash, matches, and details.

Return type:

dict
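
The stamp-then-check cycle can be pictured as: derive one root hash from the hashes of the stamped sessions, store it with a timestamp, and later recompute the root hash and compare. The sketch below shows that cycle under stated assumptions. The way the root hash is combined (SHA-256 over sorted "session:hash" lines), the stamp fields, and the wording of details are all hypothetical, and only the four result keys come from the description above.

```python
import hashlib
from datetime import datetime, timezone


def root_hash(session_hashes):
    """Combine per-session hashes into one digest (assumed scheme)."""
    hasher = hashlib.sha256()
    for session_id in sorted(session_hashes):  # sort for a stable result
        hasher.update(f"{session_id}:{session_hashes[session_id]}\n".encode("utf-8"))
    return hasher.hexdigest()


def make_stamp(session_hashes):
    """Create a hypothetical file-style stamp record."""
    return {
        "backend": "file",
        "root_hash": root_hash(session_hashes),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def check(stamp, session_hashes):
    """Recompute the root hash and compare it with the stamped one."""
    current = root_hash(session_hashes)
    matches = current == stamp["root_hash"]
    return {
        "stamp": stamp,
        "current_root_hash": current,
        "matches": matches,
        "details": "unchanged since stamp" if matches else "state differs from stamp",
    }


state = {"sess_a": "a1b2", "sess_b": "c3d4"}
stamp = make_stamp(state)
unchanged = check(stamp, state)

state["sess_b"] = "ffff"  # simulate an output changing after the stamp
changed = check(stamp, state)
print(unchanged["matches"], changed["matches"])  # True False
```

A local file stamp of this kind only proves consistency with what was recorded. The rfc3161 and zenodo backends listed for stamp() are the ones that involve an external party.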

scitex_clew.hash_file(path, algorithm='sha256', chunk_size=8192)

Compute hash of a file.

Parameters:
  • path (str or Path) – Path to the file to hash

  • algorithm (str, optional) – Hash algorithm (default: sha256)

  • chunk_size (int, optional) – Size of chunks to read (default: 8192)

Returns:

Hexadecimal hash string (first 32 characters)

Return type:

str

Examples

>>> hash_file("data.csv")
'a1b2c3d4e5f6...'
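
The parameters above suggest a standard chunked read, which keeps memory use constant for large files. Below is a minimal stdlib sketch of that technique, truncated to 32 hex characters to match the documented return value. The function name hash_file_sketch is hypothetical, and this is an illustration rather than the library's source.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_file_sketch(path, algorithm="sha256", chunk_size=8192):
    """Hash a file in fixed-size chunks and return the first 32 hex characters."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()[:32]


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data.csv"
    payload = b"a,b\n1,2\n" * 5000  # larger than one chunk
    sample.write_bytes(payload)
    short_hash = hash_file_sketch(sample)
    tiny_chunks = hash_file_sketch(sample, chunk_size=7)

print(short_hash)
```

The chunk size affects only I/O behaviour, not the digest, so any value gives the same result for the same file.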

scitex_clew.hash_directory(path, pattern='*', recursive=True, algorithm='sha256')

Compute hashes for all files in a directory.

Parameters:
  • path (str or Path) – Directory path

  • pattern (str, optional) – Glob pattern for files (default: “*”)

  • recursive (bool, optional) – Whether to search recursively (default: True)

  • algorithm (str, optional) – Hash algorithm (default: sha256)

Returns:

Mapping of relative paths to hashes

Return type:

dict

Examples

>>> hash_directory("./data/")
{'input.csv': 'a1b2...', 'config.yaml': 'c3d4...'}
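
A mapping of relative paths to hashes can be built with pathlib globbing. The sketch below shows one way to do it with the standard library. The function name is hypothetical, and truncating each digest to 32 characters is an assumption carried over from hash_file, since the return format of individual hashes is not stated for this function.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_directory_sketch(path, pattern="*", recursive=True, algorithm="sha256"):
    """Map each matching file's relative POSIX path to a truncated digest."""
    root = Path(path)
    candidates = root.rglob(pattern) if recursive else root.glob(pattern)
    hashes = {}
    for candidate in sorted(candidates):
        if candidate.is_file():  # glob results also include directories
            data = candidate.read_bytes()
            key = candidate.relative_to(root).as_posix()
            hashes[key] = hashlib.new(algorithm, data).hexdigest()[:32]
    return hashes


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "input.csv").write_text("x\n1\n")
    (root / "sub").mkdir()
    (root / "sub" / "config.yaml").write_text("seed: 0\n")

    recursive_keys = sorted(hash_directory_sketch(root))
    flat_keys = sorted(hash_directory_sketch(root, recursive=False))
    csv_only = sorted(hash_directory_sketch(root, pattern="*.csv"))

print(recursive_keys)  # ['input.csv', 'sub/config.yaml']
print(flat_keys)       # ['input.csv']
```

Keying by relative path means the mapping stays comparable when the whole directory is moved or checked out elsewhere.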

scitex_clew.mermaid(session_id=None, target_file=None, target_files=None, claims=False, grouper=None, **kwargs)

Generate a Mermaid DAG diagram.

Parameters:
  • session_id (str, optional) – Start from this session.

  • target_file (str, optional) – Start from the session that produced this file.

  • target_files (list of str, optional) – Multiple target files (multi-target DAG).

  • claims (bool, optional) – If True, build DAG from all registered claims.

  • grouper (callable | dict | None, optional) – File grouping strategy. Callable or JSON/dict spec (see scitex_clew.groupers.resolve_spec). If None, falls back to .scitex/clew/config.yaml (key grouper) if present.
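
For readers unfamiliar with Mermaid, a DAG diagram is plain text in which each edge is written as "source --> target". The sketch below renders a small provenance graph into that syntax. The to_mermaid helper, the node id scheme, and the example edges are assumptions for illustration, and the diagram that scitex_clew actually emits may be styled and grouped differently.

```python
def to_mermaid(edges, direction="TD"):
    """Render (source, target) label pairs as a Mermaid flowchart."""
    node_ids = {}

    def node_id(label):
        # Assign short stable ids (n0, n1, ...) in first-seen order.
        return node_ids.setdefault(label, f"n{len(node_ids)}")

    lines = [f"graph {direction}"]
    for source, target in edges:
        src_id = node_id(source)
        dst_id = node_id(target)
        lines.append(f'    {src_id}["{source}"] --> {dst_id}["{target}"]')
    return "\n".join(lines)


# Hypothetical pipeline edges: file -> script -> file.
edges = [
    ("raw.csv", "clean.py"),
    ("clean.py", "clean.csv"),
    ("clean.csv", "plot.py"),
    ("plot.py", "fig1.png"),
]

diagram = to_mermaid(edges)
print(diagram)
```

The printed text can be pasted into any Mermaid renderer to view the graph.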

scitex_clew.init_examples(dest, variant='sequential', *, find_examples_dir=<function _find_examples_dir>)

Copy Clew example scripts to a destination directory.

Copies only the runnable scripts (.py, .sh) and README — not the output directories. Users run 00_run_all.sh themselves to generate outputs and populate the verification database.

Parameters:
  • dest (str or Path) – Destination directory. Created if it does not exist. Existing script files are overwritten.

  • variant (str, optional) – Example variant: “sequential” (default) or “multi_parent”.

  • find_examples_dir (callable, optional) – Locator callable (variant: str) -> Optional[Path] used to resolve the bundled examples source. Production callers should not pass this; it is the dependency-injection seam (PA-306 §1) through which tests inject a hand-rolled fake that returns a tmp_path-rooted directory or None.

Returns:

{"path": str, "files": list[str], "file_count": int, "variant": str}

Return type:

dict

Raises:

scitex_clew.on_session_start(session_id, script_path=None, parent_session=None, verbose=False, metadata=None)

Hook called when a session starts.

Parameters:
  • session_id (str) – Unique session identifier

  • script_path (str, optional) – Path to the script being run

  • parent_session (str, optional) – Parent session ID for chain tracking

  • verbose (bool, optional) – Whether to log status messages

  • metadata (dict, optional) – Additional metadata (e.g. notebook_path, cell_index)

Return type:

None

scitex_clew.on_session_close(status='success', exit_code=0, verbose=False, register=None)

Hook called when a session closes.

Parameters:
  • status (str, optional) – Final status (success, failed, error)

  • exit_code (int, optional) – Exit code of the script

  • verbose (bool, optional) – Whether to log status messages

  • register (bool, optional) – If True, register session hashes with remote Clew Registry. If None, checks SCITEX_AUTO_REGISTER environment variable.

Return type:

None
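
The two hooks are meant to bracket a script run. The sketch below shows that bracketing pattern with a decorator and stand-in hooks that only record their calls, so it runs without scitex or scitex_clew installed. The decorator, the fake hooks, the session id format, and the mapping of an exception to status "failed" with exit code 1 are all assumptions for illustration. The real @stx.session wiring may differ.

```python
import functools
import uuid

events = []  # records hook calls so the order can be inspected


def fake_on_session_start(session_id, script_path=None):
    """Stand-in for the start hook; only records the call."""
    events.append(("start", session_id))


def fake_on_session_close(status="success", exit_code=0):
    """Stand-in for the close hook; only records the call."""
    events.append(("close", status, exit_code))


def tracked(func):
    """Call the start hook before func and the close hook after it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        fake_on_session_start(session_id=f"demo_{uuid.uuid4().hex[:8]}")
        try:
            result = func(*args, **kwargs)
        except Exception:
            fake_on_session_close(status="failed", exit_code=1)
            raise  # the original error still reaches the caller
        fake_on_session_close(status="success", exit_code=0)
        return result
    return wrapper


@tracked
def analysis():
    return 42


@tracked
def broken_analysis():
    raise RuntimeError("simulated crash")


value = analysis()
try:
    broken_analysis()
except RuntimeError:
    pass

kinds = [event[0] for event in events]
closes = [event[1:] for event in events if event[0] == "close"]
print(kinds)
print(closes)
```

Closing the session in both the success and the failure path is what lets a tracker finalize every run it opened, including runs that crashed.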