Metadata-Version: 2.4
Name: further
Version: 0.1.0
Summary: The Further Framework
Author-email: "Richard T. Llewellyn" <richard.trent.llewellyn@gmail.com>
License-File: LICENSE.md
License-File: NOTICE.txt
Requires-Python: >=3.14
Requires-Dist: click>=8.0.0
Requires-Dist: cloudpickle>=2.0.0
Requires-Dist: dask>=2023.0.0
Requires-Dist: distributed>=2023.0.0
Requires-Dist: docker>=7.0.0
Requires-Dist: fsspec>=2023.0.0
Requires-Dist: grandalf>=0.8
Requires-Dist: networkx>=3.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: prompt-toolkit>=3.0.0
Requires-Dist: psycopg-pool>=3.0.0
Requires-Dist: psycopg[binary]>=3.0.0
Requires-Dist: pyarrow>=20.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: pyzmq>=25.0.0
Requires-Dist: resolvelib>=1.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: s3fs>=2023.0.0
Requires-Dist: semantic-version>=2.10.0
Requires-Dist: typing-extensions>=4.0.0
Provides-Extra: backends
Requires-Dist: igraph>=1.0.0; extra == 'backends'
Requires-Dist: zarr>=3.0.0; extra == 'backends'
Provides-Extra: dev
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest-xdist; extra == 'dev'
Provides-Extra: graph
Requires-Dist: igraph>=1.0.0; extra == 'graph'
Provides-Extra: prefect
Requires-Dist: prefect-dask>=0.3.0; extra == 'prefect'
Requires-Dist: prefect>=3.0.0; extra == 'prefect'
Provides-Extra: zarr
Requires-Dist: zarr>=3.0.0; extra == 'zarr'
Description-Content-Type: text/markdown

# Further: A High-Level Conceptual Overview

NOTE: Further is currently in initial private development.
Public release is planned for the second quarter of 2026.

---

## What Is Further?

Further is an open-source Python framework for structured scientific and data analysis. It is designed
for researchers working alone, in a lab, or across collaborative multi-institution consortia who need
to move naturally between exploratory analysis and rigorous, reproducible pipelines — without
rewriting code when the line between the two shifts.

The central promise of Further is: **focus on the analysis you are writing today, and trust the
framework to manage how it connects to everything else.**

One core design challenge Further addresses that most pipeline frameworks sidestep is the
**non-determinism of external data**. Real research pipelines depend on files, databases, APIs, and
other resources that change independently of the analysis code. Further treats these as first-class
citizens through its *resource cell* abstraction: a principled mechanism for integrating external
data into an otherwise deterministic dependency graph, with automatic cache invalidation, configurable
freshness policies, and full traceability of which version of an external resource produced which
result.

---

## The Foundational Abstraction: The Cell

The atomic unit in Further is the **cell** — a named, versioned, self-contained computation. A cell
declares:

- **What it needs:** input parameters (typed, validated Pydantic models), divided into *Specs*
  (parameters that affect the result identity) and *Opts* (operational settings such as logging
  verbosity or output format that do not change the result).
- **What it depends on:** other cells, listed explicitly in a `req_cells` manifest. This is the
  author's complete statement of dependencies.
- **What it produces:** named *contributions* — typed outputs such as DataFrames, model weights,
  summary statistics, or any Python object.
- **How it runs:** a `Maker` class whose `make()` method contains the actual computation logic.

Cells are grouped into **projects**, which provide a namespace, shared project-level
parameters, and a manifest that the framework uses to discover and version all cells at load time.

This design enforces a discipline of **modular, locally-comprehensible analysis units**. Authors
think about one cell's purpose and immediate dependencies; the framework typically handles the rest.

---

## Dependency Graphs and Automatic Pipeline Assembly

When a user invokes a cell through a **Session**, Further creates the dependency graph. By
convention, the graph is drawn with the root at the top and terminal nodes (leaves) at the bottom,
so that 'up' means toward the most dependent cells and 'down' means into the subgraph of
dependencies. The invoked cell is the root of the graph.

Further assembles the dependency graph in three graduated stages:

1. **Definition Graph (DG):** A static, import-time representation of all declared cell
   dependencies across the loaded projects. No parameters have been resolved yet.
2. **Abstract Graph (AG):** An expanded, path-sensitive structure built when a session call is
   initiated. Parameters are resolved symbolically; the graph is analyzed for parallelism
   opportunities, pre-memoization candidates, and potential deadlocks.
3. **Instance Tree (IT):** The concrete execution tree with all parameter values resolved. This is
   what actually runs.

Researchers author at the DG level (declaring dependencies and parameters in cell definitions,
writing the Maker logic that produces the cellular contributions) and operate at the Session level
(calling a root cell with initial parameters). The intermediate layers are handled entirely 
by the framework.
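
The assembly idea can be illustrated with a few lines of ordinary Python. This is a conceptual sketch, not Further's API: the `Cell` class and `assemble` function below are hypothetical stand-ins for cells declaring `req_cells` manifests and the framework walking those declarations from the invoked root.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for a cell: each declares its dependencies by name,
# mirroring the req_cells manifest described above.
@dataclass
class Cell:
    name: str
    req_cells: list = field(default_factory=list)

def assemble(root, registry):
    """Walk declared dependencies depth-first and return cell names in an
    order where every cell appears after all of its dependencies."""
    order, seen = [], set()
    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in registry[name].req_cells:
            visit(dep)
        order.append(name)
    visit(root)
    return order

registry = {
    "load":  Cell("load"),
    "clean": Cell("clean", ["load"]),
    "model": Cell("model", ["clean", "load"]),
}
assert assemble("model", registry) == ["load", "clean", "model"]
```

The author only ever wrote each cell's local `req_cells` list; the full topology falls out of the walk.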

---

## Granular Memoization and Reproducibility

Every cell in Further has a **logic key** — a hash derived from its logic version, its concrete
input parameters (Specs), the versions and logic keys of all its dependencies, and optionally the
versions of tracked external libraries. When a cell is executed, its contributions are stored
durably and indexed under this key.

On subsequent calls, the framework checks the database before running any computation. If an
identical logic key exists and its stored contributions are intact, the cell is skipped entirely —
its previous result is returned directly. This is *granular memoization*: each cell in the graph
is independently cached, so a partial change (e.g., altering a parameter for a lower cell)
triggers re-execution only of the affected subgraph, not the entire pipeline.

This design has several important consequences:

- **Incremental computation is inexpensive.** Running the same analysis with one modified parameter costs
  only the work that is actually new.
- **Results are stable across sessions.** A result computed months ago with the same logic key is
  returned instantly in a new session.
- **Cache invalidation is explicit and versioned.** Authors bump a cell's `logic_version` when
  computation logic changes; this immediately invalidates all cached results for that cell,
  ensuring stale results cannot be silently reused.
- **Reproducibility is structural, not procedural.** The researcher does not need to manually
  track what ran and when — the framework does it through the dependency graph and logic key system.
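
The mechanics can be sketched with nothing more than a hash function. This is an illustration of the concept, not Further's actual key derivation; `logic_key` and the in-memory `cache` dict are hypothetical simplifications of the durable database index.

```python
import hashlib, json

def logic_key(logic_version, specs, dep_keys):
    """Illustrative logic key: a stable hash over the cell's logic version,
    its result-identity parameters (Specs), and its dependencies' keys."""
    payload = json.dumps(
        {"v": logic_version, "specs": specs, "deps": sorted(dep_keys)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

cache = {}

def run(cell_fn, logic_version, specs, dep_keys):
    key = logic_key(logic_version, specs, dep_keys)
    if key in cache:                 # memo hit: skip the computation entirely
        return cache[key]
    cache[key] = cell_fn(specs)      # memo miss: compute and store durably
    return cache[key]

k1 = logic_key("1.0", {"threshold": 0.5}, [])
k2 = logic_key("1.0", {"threshold": 0.5}, [])
k3 = logic_key("1.1", {"threshold": 0.5}, [])   # logic_version bump
assert k1 == k2 and k1 != k3
```

Because a dependency's key feeds into its callers' keys, bumping one cell's `logic_version` transitively invalidates every cached result above it.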

---

## External Resources and Non-Determinism

Standard cells in Further are purely functional: given the same input parameters and the same
dependency results, they always produce the same output. Memoization is straightforward — the cache
key is a hash of the inputs.

External resources break this guarantee. A file on disk, a REST API, a database table, or a cloud
storage object can change at any time, independently of the analysis code. If a pipeline reads such
a source, the cached result may become stale without any change to the code or declared parameters.
Most pipeline frameworks either ignore this problem (silently returning stale results) or solve it
crudely (always re-fetching).

Further addresses this through **resource cells** — a specialized cell type that extends the
memoization model to include an explicitly tracked *release* identifier representing the version of
the external data:

```
standard cell:   output = f(inputs)
resource cell:   output = f(inputs, external_state)
                          where external_state is tracked as a "release"
```

### The Release Concept

A **release** is a lightweight string that uniquely identifies the current state of an external
resource. The author defines it by implementing a `get_release()` classmethod — a fast, stateless
check that returns a version identifier *without reading the actual data*:

| Resource type | Example release |
|---|---|
| File on disk | Modification timestamp (`"2024-01-15T10:30:00Z"`) |
| REST API | ETag or `Last-Modified` header |
| External database | High-water mark (`MAX(updated_at)`) |
| S3 object | Object ETag or version ID |

The release is incorporated into the cell's cache key. If the release is unchanged, the cached
result is returned. If it has changed, `make()` is called to fetch fresh data — and the new result
is stored under the new release.

### Pre-Memoization: Checking Without Fetching

The critical insight is that determining whether the cache is valid does not require reading the
data. The framework calls `get_release()` *before* deciding whether to execute `make()`:

```
1. Call get_release()         ← file stat, HTTP HEAD, count query: milliseconds
2. Release matches cache?
   YES → return cached result ← no data transfer
   NO  → call make()          ← full fetch, only when necessary
```

This *pre-memoization* step is especially valuable for large resources — a gigabyte file, a full
database table — where the cost of an unnecessary re-fetch would be substantial.
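
The check-before-fetch pattern can be sketched with a file-backed resource. The `FileResource` class is a hypothetical illustration, not Further's resource-cell API; here the release combines the file's mtime and size from a single cheap `stat` call.

```python
import os, tempfile

class FileResource:
    """Illustrative resource cell: the release is derived from a stat call,
    so validity is checked without reading the data."""
    def __init__(self, path):
        self.path = path
        self._cache = {}          # release -> contents
        self.fetches = 0

    def get_release(self):
        st = os.stat(self.path)                     # fast, no data read
        return f"{st.st_mtime_ns}:{st.st_size}"

    def make(self):
        self.fetches += 1
        with open(self.path) as f:
            return f.read()                         # full fetch

    def get(self):
        release = self.get_release()                # pre-memoization check
        if release not in self._cache:
            self._cache[release] = self.make()      # only when necessary
        return self._cache[release]

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("v1")
    path = f.name

res = FileResource(path)
res.get(); res.get(); res.get()
assert res.fetches == 1            # release unchanged: a single fetch

with open(path, "w") as f:
    f.write("v2-updated")          # external change bumps the release
assert res.get() == "v2-updated" and res.fetches == 2
os.unlink(path)
```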

### Freshness Policies

Researchers have different needs regarding how current the external data must be. Further provides
a set of `ResourceChoice` policies that control how the release is selected:

| Policy | Behavior | Typical use |
|---|---|---|
| `latest` | Always use the most recent release | Live dashboards |
| `current` | Accept any release within a configurable shelf life | Periodic batch analyses |
| `present` | Accept any cached release, no expiration | Historical studies |
| `release` | Pin to a specific named release | Reproducible publications |
| `date_range` | Accept any release within a date window | Quarterly reports |

These policies can be set at the cell level or overridden at the **project level**, which is
particularly powerful: by passing a single project-level instruction at session call time, a
researcher can pin every resource cell in an entire pipeline to the same historical snapshot —
running a full analysis against the data as it existed on a given date, without modifying any
cell code.
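
A minimal sketch of how a few of these policies might resolve a release, assuming the framework records when each release was observed. The `choose_release` function and its semantics are illustrative guesses, not Further's implementation.

```python
from datetime import datetime, timedelta, timezone

def choose_release(policy, releases, *, shelf_life=None, pinned=None, now=None):
    """Illustrative resolver: `releases` maps release id -> datetime at
    which that release was observed. Returns None when a fresh fetch or a
    missing pin must be handled by the caller."""
    now = now or datetime.now(timezone.utc)
    ordered = sorted(releases, key=releases.get)       # oldest -> newest
    if policy == "latest":
        return ordered[-1]
    if policy == "present":
        return ordered[-1] if ordered else None        # any cached release
    if policy == "current":
        newest = ordered[-1]
        if now - releases[newest] <= shelf_life:
            return newest                              # within shelf life
        return None                                    # stale: re-fetch
    if policy == "release":
        return pinned if pinned in releases else None  # explicit pin
    raise ValueError(policy)

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
releases = {
    "r1": now - timedelta(days=30),
    "r2": now - timedelta(minutes=3),
}
assert choose_release("latest", releases, now=now) == "r2"
assert choose_release("current", releases,
                      shelf_life=timedelta(minutes=5), now=now) == "r2"
assert choose_release("release", releases, pinned="r1", now=now) == "r1"
```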

### Traceability

Every stored result carries the release string that produced it. An upper cell or database
query can always answer: *which version of the external data was used to compute this result?*
This is the resource equivalent of `logic_version` for code: an explicit, persistent record of
the external state that contributed to a stored outcome.

### The Instruction vs. Identity Distinction

An important subtlety: the *policy instruction* (e.g., "give me whatever is current within
5 minutes") does not participate in the cache identity. Only the *resolved release string* does.
Two callers using different policies that happen to resolve to the same release produce exactly the
same stored result. This mirrors Further's handling of incremental computation, where the
instruction for *how many increments to compute* is distinct from the *increment identity* that
defines a stored result.

---

## The Parameter System

Parameters in Further are designed to flow through the dependency graph automatically, so that
authors can focus on the parameterization relevant to the cell they are writing.

Parameters are declared with explicit *kinds* that tell the framework how to handle them:

| Kind | Meaning |
|---|---|
| `CONST` | A fixed value known at definition time |
| `VAR` | A set of values — each value spawns a separate execution branch (Cartesian expansion) |
| `DYN` | A value computed at runtime inside the calling cell's `make()` |
| `ITER` | An iterative parameter whose value cycles until a termination condition is met |
| `RAND` | A random value generated fresh per execution |
| `QUERY` | A value resolved from a SQL query against the Further database at execution time |

Parameters exist at three scopes — **session**, **project**, and **cell** — with well-defined
precedence rules. Project-level parameters (shared across all cells in a project) propagate through
the dependency graph automatically; cells subscribe to the project parameters they need without
having to receive them through every intermediate caller.

When cells from different projects interact, **translations** allow a calling project to rename and
map its own parameters to the target project's expected names, so that shared analytical quantities
can be expressed consistently within each project's own vocabulary. **Addressed parameters** allow
highly targeted injection at specific edges deep in the graph, bypassing or complementing
project-level translations.

**VAR expansion** deserves particular note: by declaring a parameter as VAR with a list of values
(at the session level or within a `@cell_call` decorator), the researcher triggers a Cartesian
product of execution instances — essentially a sweep across a parameter space — with a single call.
Instances can run in parallel and are independently memoized.
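
The expansion itself is a Cartesian product over the declared value sets. A stdlib sketch of the idea (illustrative only; `expand_var` is not a Further function):

```python
from itertools import product

def expand_var(base_specs, var_specs):
    """Illustrative VAR expansion: every combination of VAR values becomes
    one independently memoizable execution instance."""
    names = list(var_specs)
    for combo in product(*(var_specs[n] for n in names)):
        yield {**base_specs, **dict(zip(names, combo))}

instances = list(expand_var(
    {"dataset": "trial_a"},
    {"threshold": [0.1, 0.5], "mode": ["fast", "exact"]},
))
assert len(instances) == 4          # 2 thresholds x 2 modes
assert {"dataset": "trial_a", "threshold": 0.1, "mode": "fast"} in instances
```

Each emitted spec dict would hash to its own logic key, which is why the instances memoize independently.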

---

## Configuration Regimes

Research is rarely conducted in a single operating mode. A data processing pipeline might need to
run in a "fast exploration" mode during development (coarser thresholds, smaller samples) and a
"production" mode for publication (full data, strict thresholds). A consortium pipeline might apply
different configurations to the same analytical library depending on which upstream module is
calling it. Further addresses this through **project initialization configurations** — named,
reusable configuration bundles called *ProjectInits*.

### Named Configuration Bundles

Each project can define a `project_inits.yaml` file containing labeled configuration contexts.
A label bundles a complete set of parameter values for that project — cell-level specs, project-wide
specs, resource freshness instructions, and framework opts — under a single named key:

```yaml
# project_inits.yaml
fast:
  purpose: "Exploratory mode — coarse thresholds, fast turnaround"
  cell_specs:
    analyzer:
      threshold: 0.5
      mode: "approximate"

thorough:
  purpose: "Production mode — strict thresholds, full data"
  cell_specs:
    analyzer:
      threshold: 0.01
      mode: "exhaustive"
```

A researcher selects a label at session call time (`project_init_label="thorough"`), and the
framework applies the corresponding configuration across all subscribing cells. No code changes
required — the same cells, the same project, different behavior.

### Composing Orthogonal Concerns

Labels can be merged at call time. When two labels govern independent concerns — for example one
label controls computational intensity, another controls output formatting — they can be combined
freely:

```python
session.call("project.root", project_init_label=["high_intensity", "detailed_output"])
```

Further's domain system enforces mutual exclusivity where it matters: labels within the same domain
(e.g., two competing speed configs) cannot be combined, while labels from different domains (speed
and style) compose without conflict.
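
The exclusivity rule amounts to a per-domain merge. A hypothetical sketch (the `DOMAINS` mapping and `merge_labels` helper are illustrative; in Further the domain assignments would come from project configuration):

```python
# Hypothetical domain assignments for illustration only.
DOMAINS = {"fast": "speed", "thorough": "speed", "detailed_output": "style"}

def merge_labels(labels):
    """Reject two labels from the same domain; allow cross-domain merges."""
    chosen = {}
    for label in labels:
        domain = DOMAINS[label]
        if domain in chosen:
            raise ValueError(
                f"labels {chosen[domain]!r} and {label!r} conflict "
                f"in domain {domain!r}")
        chosen[domain] = label
    return chosen

assert merge_labels(["fast", "detailed_output"]) == {
    "speed": "fast", "style": "detailed_output"}
try:
    merge_labels(["fast", "thorough"])       # same domain: must fail
except ValueError:
    pass
else:
    raise AssertionError("expected a domain conflict")
```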

### Routing Configurations Across Projects

In a multi-project pipeline the same called library project might need different configuration
depending on which part of the calling project invokes it. *Called assignments* handle this:
an assignment routes a specific ProjectInit label to a called project based on which cell is
making the call. A `cell_quick` caller routes the library to its "fast" config; a `cell_deep`
caller routes the same library to its "thorough" config — within the same session call, without
ambiguity.

This allows a shared analytical library to serve diverse purposes across a consortium while
remaining a single, consistently versioned codebase. The configuration regime is a property of
how the library is called, not of the library itself.

---

## Cross-Project Collaboration and Modularity

Further's **project** concept is designed explicitly for collaborative and multi-institutional
research. Different teams can develop and version their own projects independently. When one project
calls another, the framework:

- Enforces declared dependency relationships (calling projects declare which other projects they
  call).
- Isolates parameter namespaces (called project's cells cannot see the calling project's parameters
  unless explicitly bridged by a translation).
- Manages trust for cross-project type sharing (pickle-based contributions from an external project
  require explicit session-level trust declarations).
- Supports calls to cells running in separate compute containers, with automatic serialization of
  parameters and contributions across boundaries.

This makes it practical for a consortium of labs to each maintain a library project of reusable
analysis components, while a coordinating project assembles them into a cohesive pipeline —
all tracked and memoized across sessions and institutions.

---

## From Exploration to Production-Quality Code

A persistent tension in research computing is the gap between getting results quickly and writing
efficient, reproducible, long-lived code. Further acknowledges this tension and provides a
deliberate pathway between the two — with the same cell structure throughout.

### The Framework Favors What It Can See Early

Further's most powerful features — pre-memoization, static parallelism planning, VAR expansion,
and graph-level deduplication — all depend on the framework knowing parameter values *before*
execution begins. The more of a cell's parameter logic that lives in statically-declared
classmethods, the more the framework can do on the researcher's behalf without waiting.

The corollary is that parameters resolved only at runtime (inside `make()`) are opaque to the
static analysis machinery. They require the framework to first execute the calling cell's logic to produce the value,
then use that value to dispatch the child cell — a two-step runtime sequence rather than a
pre-planned dispatch. No pre-memoization, no advance parallelism planning for that branch.

### Exploratory: Top-Down Interventions

When a researcher is exploring — before the right parameterization is clear — Further offers
mechanisms that minimize authoring friction at the cost of some efficiency:

**STUB parameters** signal a runtime-resolved value that is expected to be temporary. A STUB cell
call computes its parameter inside `make()` and passes it dynamically, bypassing static analysis
for that edge. The framework treats it as an explicit marker of intent: *"I know this might be
computed earlier eventually — I am not there yet."* A STUB carries no memoization penalty for
the cells above it; only the immediate dynamic dispatch is affected.

**Transformer cells**, injected at session call time without any modification to the cells being
transformed, allow a researcher to intercept parameter flow at specific graph edges and reshape it.
This is a top-down intervention: rather than wiring transformation logic into the cells themselves,
the researcher applies it externally — useful when quickly testing normalization or scaling
strategies without committing them to the cell definitions.

**Addressed parameters**, similarly, allow precise injection of parameter values at specific edges
deep in the graph from the outside — from a calling cell's form, or from the session — without
modifying the called cell. A researcher can control a deep dependency's behavior during exploration
without touching that dependency's code.

Together these mechanisms allow rapid iteration: run an analysis, observe the results, adjust
parameters externally, re-run — without a code change cycle on every cell in the graph.

### Maturing: Moving Logic Into Classmethods

As an analysis stabilizes, the natural direction is to move parameter logic from runtime into
static declarations. Concretely:

- **STUB → VAR or CONST.** When it becomes clear what values a parameter should take, replace
  the runtime computation with a classmethod that emits those values explicitly. The framework
  can now see the values before execution: pre-memoization kicks in, parallel dispatch is
  planned in advance, and VAR expansion produces a full parameter sweep automatically.

- **Addressed params → explicit `@cell_call` declarations.** When a parameter injection pattern
  stabilizes, moving it into the calling cell's explicit `@cell_call` decorator makes the
  dependency visible in the Definition Graph. Static analysis can validate it, and other
  researchers reading the code can see it without tracing session-level configurations.

- **Transformers → first-class cells.** When a transformation proves durable, promoting it from
  an injected session-level transformer to a proper cell in the graph makes the flow more predictable
  and efficient, eliminating unexpected top-down redirection.

Each of these moves makes the cell's behavior more legible to the framework — and to collaborators.
The cell itself does not need to be rewritten from scratch; the authoring surface changes
incrementally.

### The Practical Guidance

Further does not force researchers to choose between quick results and good code. It allows starting
in exploratory mode — with STUBs, addressed params, and injected transformers — and migrating
toward a fully statically-declared graph as understanding accumulates. The memoization system
ensures this migration costs nothing computationally: results from the exploratory phase that
remain valid are reused exactly, and only genuinely new computations run.

The destination — a cell graph where most parameter values are declared in exposed classmethods,
versions are bumped intentionally, and configurations are named and composable — is a graph that
the framework can schedule, deduplicate, and pre-memoize with maximum efficiency, and that
collaborators can read, reuse, and extend with confidence.

---

## Versioning: Evolving Analyses Without Breaking Reproducibility

A defining challenge in long-running research is that "final" results from one phase become inputs
to the next — but the underlying analyses continue to evolve. Further addresses this through a
two-axis versioning system on every cell:

- **`logic_version`**: Tracks what the cell *computes*. A bump invalidates cached results.
- **`api_version`**: Tracks the cell's *public interface* (its parameters and contributions).
  A bump signals compatibility changes to callers.

Old cell versions can be **archived** alongside the current version. Both versions coexist in the
same session's dependency graph. Callers can pin to specific version ranges (e.g., `">=1.2,<2.0"`)
or always use the current stable version. When a later analysis needs the result of an older
analytical approach for comparison, it simply pins to the archived cell — no manual file management
required.

Projects have analogous versioning for their public interfaces. A project at version 2.0 can
coexist with archived version 1.x, allowing gradual migration of dependents.
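
Pinning to a range reduces to picking the highest archived version that satisfies the caller's constraint. A deliberately minimal sketch (`satisfies` and `pin` are illustrative; a real implementation would use a semantic-versioning library such as the `semantic-version` dependency listed above):

```python
def satisfies(version, spec):
    """Minimal range check for specs like '>=1.2,<2.0' (illustration only)."""
    v = tuple(int(p) for p in version.split("."))
    for clause in spec.split(","):
        if clause[1] == "=":
            op, raw = clause[:2], clause[2:]
        else:
            op, raw = clause[0], clause[1:]
        b = tuple(int(p) for p in raw.split("."))
        # pad so (1, 2) compares cleanly with (1, 2, 0)
        n = max(len(v), len(b))
        a, c = v + (0,) * (n - len(v)), b + (0,) * (n - len(b))
        ok = {"<": a < c, "<=": a <= c,
              ">": a > c, ">=": a >= c, "==": a == c}[op]
        if not ok:
            return False
    return True

def pin(versions, spec):
    """Pick the highest archived version satisfying the caller's range."""
    matches = [v for v in versions if satisfies(v, spec)]
    if not matches:
        return None
    return max(matches, key=lambda v: tuple(int(p) for p in v.split(".")))

assert pin(["1.1", "1.4", "2.0"], ">=1.2,<2.0") == "1.4"
assert pin(["1.1", "1.4", "2.0"], ">=2.0") == "2.0"
```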

---

## Execution Modes

Further is designed for heavy computational workloads. It provides six execution modes organized
in a two-branch hierarchy:

```
              INLINE
             /      \
          LOCAL      DASK
           |          |
   LOCAL_PROCESSES   DASK_STATIC_DAG
                      |
               PREFECT_STATIC_DAG
```

The **session** sets the infrastructure ceiling — the most capable mode available for the run.
Individual **cells** and **projects** can select a mode at or below that ceiling within their
branch, but cannot escalate across branches (a LOCAL session cannot use DASK, and vice versa).

### Local Branch

- **INLINE**: Synchronous, single-thread execution. Suitable for development, debugging, and cells
  that manage their own parallelism internally.
- **LOCAL**: Parallel execution using a thread pool (`ThreadPoolExecutor`). Independent cells run
  concurrently without requiring any external infrastructure.
- **LOCAL_PROCESSES**: Parallel execution using a process pool (`ProcessPoolExecutor`). Useful when
  cells are CPU-bound and benefit from true multiprocessing, at the cost of serialization overhead
  across process boundaries.
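
The LOCAL dispatch pattern is the familiar one from `concurrent.futures`. A sketch of the shape of it (the three cell functions are hypothetical; Further's scheduler, not the author, decides what is independent):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cells: two independent loaders and one dependent combiner.
def load_a():
    return [1, 2, 3]

def load_b():
    return [10, 20]

def combine(a, b):
    return sum(a) + sum(b)

# Independent cells are submitted to the thread pool concurrently; the
# dependent cell runs only once both of its inputs are ready.
with ThreadPoolExecutor(max_workers=4) as pool:
    fut_a = pool.submit(load_a)
    fut_b = pool.submit(load_b)
    result = combine(fut_a.result(), fut_b.result())

assert result == 36
```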

### Distributed Branch

- **DASK**: Distributed parallel execution using a Dask cluster (local or remote). The dependency
  graph drives automatic parallelism — independent cells are submitted to Dask workers concurrently
  without any author intervention.
- **DASK_STATIC_DAG**: An optimization over standard DASK. When a fully-static subtree — a region
  of the graph with no runtime-resolved parameters — exceeds a configurable size threshold
  (`static_dag_node_thresh`), the framework pre-computes all logic keys, performs a single batch
  cache check, and submits the entire subtree as a Dask-native dependency graph in one operation.
  This eliminates per-node round trips between the Hub and the scheduler. A single session may
  produce multiple such "static chapters," each submitted independently as the execution graph
  unfolds.
- **PREFECT_STATIC_DAG**: Wraps the Dask static DAG submission in a Prefect `@flow` with
  `DaskTaskRunner`. Prefect is *not* a separate compute engine — Dask still performs all
  computation. The wrapper gives the Prefect UI visibility into the true DAG topology, enabling
  monitoring, tagging, and retry configuration through Prefect's observability layer. If Prefect is
  not installed, the framework falls back gracefully to plain Dask static DAG submission with a
  warning.

### Parallel Opt-Out

Any cell can set `in_parallel=False` to force synchronous (INLINE) execution regardless of the
session's mode. This is useful for cells that create their own thread or process pools internally
and cannot safely be dispatched in parallel by the framework.

### Container Dispatch

Individual cells can be assigned to specific compute containers (e.g., GPU-enabled environments)
while the rest of the graph runs on standard hardware. The framework manages the serialization and
routing of parameters and contributions across container boundaries.

---

## Advanced Computation Patterns

Beyond standard batch pipelines, Further supports several patterns tailored to research workflows:

**Iterative cycles (ITER):** A cell can call another cell with an ITER parameter, creating a static
unrolling of a fixed-depth loop in the dependency graph. Call switches provide termination
conditions evaluated on Specs/Opts, enabling conditional short-circuiting at known stopping points.

**Incrementing cells ("anytime algorithms"):** For computations that produce useful partial results
at every step — MCMC chains, iterative optimizers, phylogenetic tree construction — Further
supports incrementing cells. Each increment is independently memoized. A caller can request
exactly N increments, at least N, the best currently stored, or a fixed amount of new work beyond
what exists. The framework resumes from the last stored increment rather than restarting.
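
The resume-from-storage behavior can be sketched as follows. `IncrementingCell` is an illustrative stand-in, with a plain list in place of Further's durable per-increment storage:

```python
class IncrementingCell:
    """Illustrative anytime computation: each increment is stored, and a
    request for N increments resumes from the last stored one."""
    def __init__(self, step_fn, initial):
        self.step_fn = step_fn
        self.increments = [initial]    # stand-in for durable storage
        self.steps_run = 0

    def at_least(self, n):
        while len(self.increments) - 1 < n:   # only missing work runs
            self.increments.append(self.step_fn(self.increments[-1]))
            self.steps_run += 1
        return self.increments[n]

cell = IncrementingCell(lambda x: x * 2, initial=1)
assert cell.at_least(5) == 32 and cell.steps_run == 5
assert cell.at_least(3) == 8 and cell.steps_run == 5    # cached, no rerun
assert cell.at_least(7) == 128 and cell.steps_run == 7  # only 2 new steps
```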

**Recursive units:** Multi-cell cycles with dynamic termination (convergence detection inside
`make()`), modeled as a composite node in the dependency graph.

**Transformer cells:** Cells that intercept and reshape parameters flowing through an edge,
enabling context-sensitive transformations (e.g., normalization, scaling) to be injected without
modifying the analysis cells themselves.

---

## Persistence Architecture

Further stores results at two levels:

1. **PostgreSQL metadata database:** Tracks cell definitions, execution history, logic keys, run
   status, parameters, and the full dependency topology. This is the source of truth for
   memoization and enables introspection.
2. **Blob storage:** Large contributions (DataFrames, arrays, model weights) are stored in
   configurable blob storage — local filesystem, S3/MinIO, or PostgreSQL large objects. They are
   keyed by a content-addressing scheme and loaded on demand.

The separation means that small scalar results (summary statistics, counts, flags) can be stored
directly in the database as *recorded contributions*, making them immediately queryable via SQL
without any blob retrieval. Larger objects travel through blob storage and are loaded only when
the calling cell's `make()` method requests them.
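
A content-addressing scheme of the kind described can be sketched in a few lines. `BlobStore` is an illustration of the concept, with a dict standing in for the filesystem, S3, or PostgreSQL large objects:

```python
import hashlib, pickle

class BlobStore:
    """Illustrative content-addressed blob store: the key is a hash of the
    serialized object, so identical contributions share one blob."""
    def __init__(self):
        self.blobs = {}      # stand-in for filesystem / S3 / large objects

    def put(self, obj):
        data = pickle.dumps(obj)
        key = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(key, data)    # dedup: store once per content
        return key

    def get(self, key):
        return pickle.loads(self.blobs[key])

store = BlobStore()
k1 = store.put({"weights": [0.1, 0.2]})
k2 = store.put({"weights": [0.1, 0.2]})
assert k1 == k2 and len(store.blobs) == 1   # identical content, one blob
assert store.get(k1) == {"weights": [0.1, 0.2]}
```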

---

## Introspection and Self-Referential Analysis

The Further database is queryable from within Further itself. Cells can be authored with
**database makers** that issue SQL queries against the execution history — for example, to retrieve
all stored results matching a particular parameter sweep, compare increments of a computation, or
aggregate contributions across sessions. This allows today's higher-level analysis to be literally
constructed from the persistent record of yesterday's lower-level runs.

The `QUERY` parameter kind takes this further: a parameter value can be resolved at execution time
by a SQL query against the database, so that the set of execution instances in a VAR-like expansion
is drawn dynamically from stored results rather than hard-coded by the author.

---

## Design Philosophy Summary

Further is built around five interlocking ideas:

1. **Local focus, global coherence.** Authors think about one cell and its immediate dependencies; 
   the framework maintains the whole graph.
2. **Reproducibility as infrastructure, not discipline.** Memoization, versioning, and result
   tracking are structural properties, not conventions to be followed.
3. **Exploration and production on the same track.** The same cell can be run exploratorily today
   and serve as a memoized dependency in a larger pipeline tomorrow, with no code changes.
4. **Parameter state belongs to the graph.** Global parameter consistency across a complex
   dependency graph is a framework responsibility, not an author responsibility.
5. **Open-ended time horizons.** Results are persisted across sessions, versions coexist, and
   incremental computations resume — because research rarely ends.
