Metadata-Version: 2.4
Name: snakemake-scheduler-plugin-grapheonrl
Version: 0.6.0
Summary: Snakemake scheduler plugin: cascade scheduler (MILP/HEFT/GNNRL) with self-calibrating runtimes, digital twin run history, and GNNRL workflow-specific training
Project-URL: Homepage, https://github.com/AasishKumarSharma/snakemake-scheduler-plugin-grapheonrl
Project-URL: Repository, https://github.com/AasishKumarSharma/snakemake-scheduler-plugin-grapheonrl
Author-email: Aasish Kumar Sharma <aasish.sharma@uni-goettingen.de>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering
Requires-Python: <4.0,>=3.11
Requires-Dist: numpy>=1.24.0
Requires-Dist: pulp>=2.7
Requires-Dist: pyyaml>=6.0
Requires-Dist: snakemake-interface-common>=1.20.1
Requires-Dist: snakemake-interface-scheduler-plugins>=2.0.2
Requires-Dist: torch>=2.0.0
Provides-Extra: gnnrl
Requires-Dist: numpy>=1.24.0; extra == 'gnnrl'
Requires-Dist: torch>=2.0.0; extra == 'gnnrl'
Provides-Extra: milp
Requires-Dist: pulp>=2.7; extra == 'milp'
Description-Content-Type: text/markdown

# Snakemake Scheduler Plugin: GrapheonRL

[![Python](https://img.shields.io/badge/python-%E2%89%A53.11-blue.svg)](https://python.org)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![Version](https://img.shields.io/badge/version-0.6.0-blue.svg)](CHANGELOG.md)
[![Status](https://img.shields.io/badge/status-beta-yellow.svg)]()

Cascade scheduler for heterogeneous HPC workflows. It reimplements and extends the
design of [milp_snakemake_scheduler](https://github.com/AasishKumarSharma/milp_snakemake_scheduler)
using the Snakemake 9 plugin interface, adding HEFT ordering, a GNNRL policy,
and self-calibrating runtimes. The plugin does not import from
milp_snakemake_scheduler; it shares that project's config file format for
cross-plugin compatibility.

## What this plugin provides

### Scheduling algorithms (cascade order)

**MILP with node placement** (primary for small subgraphs, <=30 jobs)
Assigns each job to a specific node using binary ILP variables. Formulation:
x[j][n] binary (job j to node n), start/end continuous, makespan minimized.
Feature compatibility is enforced: GPU jobs go only to GPU nodes, and core and
memory capacity is respected per node. Falls back to RCPSP
(resource-constrained project scheduling problem) ordering when no nodes are
configured. Uses the PuLP/CBC solver. MILP is an exact solver: it is accepted
unconditionally when feasible, with no quality-gate comparison against HEFT.
Benchmark verification: plugin MILP equals the time-indexed MILP certified
optimal (14s) on the 10-job known-optimal test workflow.
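
The placement decision can be illustrated on a toy instance. The sketch below is not the plugin's PuLP model: it swaps the solver for exhaustive enumeration over the same binary choice (job j to node n). It assumes independent jobs that run one after another on their node, and the `Job` and `Node` records are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Job:
    name: str
    duration: float  # seconds
    cores: int
    needs_gpu: bool = False


@dataclass(frozen=True)
class Node:
    name: str
    cores: int
    has_gpu: bool = False


def compatible(job: Job, node: Node) -> bool:
    """Feature and capacity check: GPU jobs need a GPU node, cores must fit."""
    return job.cores <= node.cores and (node.has_gpu or not job.needs_gpu)


def best_placement(jobs, nodes):
    """Enumerate every job-to-node assignment and keep the smallest makespan.

    Simplification (assumption): jobs are independent and each node runs its
    jobs serially, so a node's finish time is the sum of its job durations.
    """
    best = None
    for choice in product(nodes, repeat=len(jobs)):
        if not all(compatible(j, n) for j, n in zip(jobs, choice)):
            continue
        load = {n.name: 0.0 for n in nodes}
        for j, n in zip(jobs, choice):
            load[n.name] += j.duration
        makespan = max(load.values())
        if best is None or makespan < best[0]:
            best = (makespan, {j.name: n.name for j, n in zip(jobs, choice)})
    return best


jobs = [Job("align", 6, 4), Job("train", 5, 2, needs_gpu=True), Job("plot", 3, 1)]
nodes = [Node("cpu01", 8), Node("gpu01", 4, has_gpu=True)]
makespan, placement = best_placement(jobs, nodes)
print(makespan, placement)
```

Enumeration grows as (number of nodes) to the power of (number of jobs), which is one reason an exact method is limited to small subgraphs.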

**GNNRL** (experimental, <=300 jobs)
3-layer message-passing network with symmetric neighbourhood aggregation and
LayerNorm. Scores task priorities from 12 task features (cores, memory,
duration, indegree, outdegree, level, upward rank, GPU flag, etc.) and 6 node
features (cores, memory, storage, bandwidth, speed, utilisation). Global
one-shot inference on the full DAG at startup; result cached for all scheduling
rounds (O(n log n) per round). GPU-aware node assignment during inference.
Quality gate: the generic pretrained model uses a tight threshold (×1.01), the
workflow-trained model a relaxed threshold (×1.05). Training: behaviour-cloning
(BC) warm-start from a HEFT teacher, then PPO fine-tuning with reward
`-(beta * makespan + alpha * resource_waste * makespan)` (default: alpha=0,
beta=1). Note: PPO training does not pass graph edges to the update step (each
training step operates on node features only); the graph structure influences
inference but not the gradient updates. Pre-trained model shipped;
workflow-specific fine-tuning available via `--scheduler-grapheonrl-train`.
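
As a worked example of the reward formula, here is a minimal sketch. The function name `ppo_reward` is hypothetical, and treating `resource_waste` as a fraction between 0 and 1 is an assumption; the source only gives the formula and the default weights.

```python
def ppo_reward(makespan: float, resource_waste: float,
               alpha: float = 0.0, beta: float = 1.0) -> float:
    """Reward as written above: -(beta * makespan + alpha * resource_waste * makespan)."""
    return -(beta * makespan + alpha * resource_waste * makespan)


# Defaults (alpha=0, beta=1): the reward is the negative makespan.
print(ppo_reward(120.0, 0.25))             # -120.0
# A non-zero alpha adds a penalty that scales with both waste and makespan.
print(ppo_reward(120.0, 0.25, alpha=0.5))  # -135.0
```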

**HEFT** (fallback, any size)
HEFT-inspired critical-path ordering (Topcuoglu et al. 2002). Computes
upward rank using calibrated per-rule durations (single-machine estimate) and
schedules tasks in decreasing rank order. Node assignment uses greedy
earliest-finish-time per compatible node. Used when MILP is not configured
and GNNRL is not loaded or fails the quality gate.
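
The upward-rank ordering can be sketched as follows. This is an illustrative version, not the plugin's code: it omits communication costs (consistent with a single-machine estimate), uses recursion that suits small DAGs only, and the rule names and durations are made up.

```python
from functools import lru_cache


def upward_ranks(durations, successors):
    """rank(t) = duration(t) + max(rank(s) for s in successors of t), or
    duration(t) alone for tasks without successors."""

    @lru_cache(maxsize=None)
    def rank(task):
        return durations[task] + max(
            (rank(s) for s in successors.get(task, ())), default=0.0
        )

    return {t: rank(t) for t in durations}


# Hypothetical four-rule workflow: fetch -> {align, qc} -> report
durations = {"fetch": 2.0, "align": 10.0, "qc": 3.0, "report": 1.0}
successors = {"fetch": ["align", "qc"], "align": ["report"], "qc": ["report"]}

ranks = upward_ranks(durations, successors)
order = sorted(ranks, key=ranks.get, reverse=True)  # decreasing rank
print(ranks)
print(order)  # ['fetch', 'align', 'qc', 'report']
```

With positive durations, decreasing upward rank is always a valid topological order, because every task ranks strictly higher than each of its successors.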

### Self-calibrating runtime estimation

On every execution the plugin measures actual per-rule wall-clock time from
Snakemake's `benchmark:` directive TSV files (most accurate) or from
SLURM `sacct` (for cluster runs), falling back to timing between scheduling
rounds. Results are stored in `scheduler_config.yaml [rules]` and loaded at
the next startup so HEFT's critical-path computation uses real measured
durations instead of the cores-as-proxy fallback.
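
To illustrate the first calibration source: a Snakemake `benchmark:` TSV has one row per repeat, and its `s` column holds wall-clock seconds. The sketch below averages that column. How the plugin itself aggregates repeats is not stated here, so the mean is an assumption, and the sample shows only the first three columns.

```python
import csv
import io
from statistics import mean


def mean_wallclock_seconds(tsv_text: str) -> float:
    """Average the `s` (wall-clock seconds) column of a benchmark TSV."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return mean(float(row["s"]) for row in rows)


# Abbreviated sample with two repeats of the same rule.
sample = (
    "s\th:m:s\tmax_rss\n"
    "12.5\t0:00:12\t150.2\n"
    "13.5\t0:00:13\t151.0\n"
)
print(mean_wallclock_seconds(sample))  # 13.0
```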

### Node placement and SLURM steering

When `system_profile.json` is configured (same format as
milp_snakemake_scheduler), the MILP assigns each job to a specific node.
The plugin then attempts to set `job.resources.slurm_partition` to steer
the SLURM executor to the correct partition. Controllable via
`scheduler_config.yaml [grapheonrl.node_assignment]`.

### Digital twin

On every execution, the plugin writes `.snakemake/grapheonrl/dag_export.json`
capturing the task graph, HEFT oracle schedule, per-rule calibrated duration
history, and a run log that accumulates across runs (`run_history[]`,
`rule_stats`, `best_run`). Used for GNNRL warm-start training and offline
analysis. Use `--scheduler-grapheonrl-export-dag PATH` to write to a custom path
instead. Disable with `--scheduler-grapheonrl-disable-auto-twin true`.
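
For offline analysis, the export can be read with the standard `json` module. The sketch below relies only on the three key names mentioned above and assumes they sit at the top level, that `run_history` is a list, and that `rule_stats` is a mapping keyed by rule name. The layout of individual entries is not documented here, so the example leaves them empty.

```python
import json


def summarise_twin(text: str) -> dict:
    """Count runs and list rules in a digital-twin JSON document."""
    doc = json.loads(text)
    return {
        "runs": len(doc.get("run_history", [])),
        "rules": sorted(doc.get("rule_stats", {})),
        "has_best_run": doc.get("best_run") is not None,
    }


# Stand-in document; a real file would come from
# .snakemake/grapheonrl/dag_export.json
example = json.dumps({
    "run_history": [{}, {}],
    "rule_stats": {"qc": {}, "align": {}},
    "best_run": {},
})
print(summarise_twin(example))
```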

## Installation

```bash
# From GitHub (recommended)
pip install git+https://github.com/AasishKumarSharma/snakemake-scheduler-plugin-grapheonrl.git

# Or clone for development
git clone https://github.com/AasishKumarSharma/snakemake-scheduler-plugin-grapheonrl
cd snakemake-scheduler-plugin-grapheonrl
pip install -e .
```

Dependencies installed automatically: `torch`, `numpy`, `pulp`, `pyyaml`.

## Quick start

```bash
# Default: cascade with auto-training. GNNRL trains on the workflow before
# the first scheduling round, then executes using the trained model.
# The digital twin is updated automatically after each run.
snakemake --cores 8 --scheduler grapheonrl

# Explicit training control: more iterations for stronger policy
snakemake --cores 8 --scheduler grapheonrl \
    --scheduler-grapheonrl-train-iters 200

# Disable auto-training (use generic pretrained model + quality gate)
snakemake --cores 8 --scheduler grapheonrl \
    --scheduler-grapheonrl-disable-auto-train true

# HEFT only (fastest, no GNNRL inference)
snakemake --cores 8 --scheduler grapheonrl --scheduler-grapheonrl-strategy heft

# Export digital twin to a custom path
snakemake --cores 8 --scheduler grapheonrl \
    --scheduler-grapheonrl-export-dag dag_export/dag_export.json

# With heterogeneous cluster (system_profile.json)
snakemake --cores 8 --scheduler grapheonrl \
    --scheduler-grapheonrl-node-config system_profile.json
```

## Configuration files

Each of the two files below is looked up in this order: current working directory, Snakefile directory, `~/.snakemake/`, package default.
Both use the same format as `milp_snakemake_scheduler` for cross-plugin compatibility.
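
The lookup order can be sketched with `pathlib`. The function `find_config` and its parameters are hypothetical, and returning the first existing file is an assumption about how the order is applied.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_config(name: str, snakefile_dir: Path, package_default: Path,
                cwd: Optional[Path] = None,
                home: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing candidate in the documented order."""
    cwd = cwd or Path.cwd()
    home = home or Path.home()
    candidates = [
        cwd / name,                  # 1. current working directory
        snakefile_dir / name,        # 2. directory containing the Snakefile
        home / ".snakemake" / name,  # 3. per-user directory
        package_default / name,      # 4. default shipped with the package
    ]
    return next((p for p in candidates if p.is_file()), None)


# Demo in a throwaway directory tree: the file exists next to the Snakefile
# and in the package, but not in the CWD, so the Snakefile copy wins.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for d in ("cwd", "workflow", "home/.snakemake", "pkg"):
        (root / d).mkdir(parents=True)
    (root / "workflow" / "scheduler_config.yaml").write_text("grapheonrl: {}\n")
    (root / "pkg" / "scheduler_config.yaml").write_text("grapheonrl: {}\n")
    found = find_config("scheduler_config.yaml", root / "workflow", root / "pkg",
                        cwd=root / "cwd", home=root / "home")
    found_name = found.parent.name

print(found_name)  # workflow
```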

**`scheduler_config.yaml`** - solver and calibration settings.
Copy the template from the repo root and edit as needed.

**`system_profile.json`** - cluster node definitions (clusters -> nodes ->
resources/features/properties). Copy from repo root and add your nodes.

Key `scheduler_config.yaml` sections (see template for all options):
```yaml
grapheonrl:
  strategy: cascade          # milp | heft | gnnrl | cascade
  milp_threshold: 30         # max remaining jobs for MILP
  gnnrl_threshold: 300       # max remaining jobs for GNNRL
  quality_gate: 1.01         # GNNRL only: accept if makespan <= HEFT * quality_gate
  node_assignment:
    enabled: true            # write slurm_partition to job resources
    mode: best_fit
  history:
    max_days: 90
    max_entries: 100
  calibration:
    use_sacct: true
    use_benchmark_files: true
  training:
    alpha: 0.0   # resource utilization penalty weight (0 = makespan only)
    beta:  1.0   # makespan weight
```

## Settings reference

All CLI flags take the form `--scheduler-grapheonrl-<name>`; the table lists `<name>` only.

| Flag | Default | Description |
|------|---------|-------------|
| `strategy` | `cascade` | Algorithm: `cascade`, `heft`, `gnnrl` *(experimental)*, `milp`, `priority` |
| `disable-auto-train` | `false` | Disable auto-training; use generic pretrained model |
| `disable-auto-twin` | `false` | Disable automatic digital twin updates |
| `train` | `false` | Force train GNNRL before executing (explicit override) |
| `train-iters` | `50` | PPO iterations (50 takes seconds, 200 takes minutes) |
| `train-after` | None | Auto-train after N runs in digital twin |
| `export-dag` | None | Path to write/update digital twin JSON (overrides auto path) |
| `model-path` | None | Explicit model path (auto-discovered if omitted) |
| `gnnrl-threshold` | `300` | Skip GNNRL above this job count |
| `milp-threshold` | `30` | Skip MILP above this job count |
| `milp-timeout` | `10.0` | MILP solver timeout in seconds |
| `node-config` | None | Path to system_profile.json |

## Tests

```bash
bash tests/run_all_tests.sh --quick   # 17 checks (~3 min, includes integration + optimality)
bash tests/run_all_tests.sh           # 24 checks (~10 min)
python tests/test_gnnrl.py            # 112 GNNRL + infrastructure checks
python tests/test_doc_claims.py       # 97 doc-claim verification checks
python tests/test_integration.py      # 58 end-to-end integration checks (~3 min)
```

## Assessment of GNNRL *(Experimental)*

### Workflow-trained GNNRL (default behavior)

By default, when no trained model exists for the current workflow, the plugin
automatically trains GNNRL before the first scheduling round. The trained model
learns the workflow's critical-path structure, resource requirements, and task
ordering from BC warm-start (HEFT teacher) followed by PPO fine-tuning.

Benchmark results on rnc workflows (rnc50 to rnc300: gap from certified MILP
optimal; rnc5000: comparison against HEFT):

| Scale | HEFT gap | GNNRL (workflow-trained) gap |
|-------|----------|-----------------------------|
| rnc50  | 1.2% | **0.3-0.5%** |
| rnc100 | 1.0-1.5% | **0.3-0.6%** |
| rnc300 | 0.9-2.1% | **0.4-0.8%** |
| rnc5000 homo | baseline | **+2.2% improvement over HEFT** |
| rnc5000 hetero | 209,839 obj | **Self-iter300: 208,969 obj (beats HEFT)** |

Workflow-trained GNNRL is within 0.3-0.8% of certified optimal at small-to-medium
scale and consistently outperforms HEFT at large scale. This is the intended
operating mode.

### Generic pretrained model (auto-training disabled)

When `--scheduler-grapheonrl-disable-auto-train true` is set, the shipped generic model
runs without workflow-specific training. The generic model was trained on 3000
synthetic DAGs and evaluated on 300 held-out DAGs:

| Metric | Value |
|--------|-------|
| Win rate vs HEFT | 44.3% |
| Average relative improvement | -3.6% |
| Cases >5% better than HEFT | 15% |
| Cases >5% worse than HEFT | 26.7% |

The generic model is unreliable on unseen workflows. The quality gate (×1.01 vs
HEFT simulation) guards against bad generic-model decisions; HEFT runs as fallback
when the gate rejects. Do not use the generic model as your primary scheduler.

### Acceptance logic

**MILP** is always accepted when feasible. It is an exact solver, so applying a
quality gate would mean rejecting a provably optimal solution. Benchmark
verification: on the 10-job known-optimal test workflow, the plugin MILP equals
the time-indexed MILP certified optimal (14s).

**GNNRL (workflow-trained)** uses a relaxed quality gate (×1.05 vs HEFT simulation)
because the simulation consistently underestimates the trained model's actual benefit.

**GNNRL (generic pretrained)** uses a tight quality gate (×1.01 vs HEFT simulation)
because the generic model is unreliable on unseen workflows.
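
Taken together, the cascade and the acceptance rules amount to the decision sketched below. This is a simplified illustration, not the plugin's code: `choose_schedule` is a hypothetical function, makespans are passed in as precomputed numbers, and `None` stands for "scheduler unavailable or infeasible".

```python
def choose_schedule(n_jobs, heft_makespan, milp_makespan=None,
                    gnnrl_makespan=None, workflow_trained=False,
                    milp_threshold=30, gnnrl_threshold=300):
    """Pick a scheduler following the cascade order MILP, GNNRL, HEFT."""
    # MILP: accepted unconditionally when within its size limit and feasible.
    if n_jobs <= milp_threshold and milp_makespan is not None:
        return "milp"
    # GNNRL: must pass the quality gate against the simulated HEFT makespan.
    gate = 1.05 if workflow_trained else 1.01
    if (n_jobs <= gnnrl_threshold and gnnrl_makespan is not None
            and gnnrl_makespan <= heft_makespan * gate):
        return "gnnrl"
    # HEFT: fallback for any size.
    return "heft"


print(choose_schedule(20, heft_makespan=100.0, milp_makespan=95.0))    # milp
print(choose_schedule(120, heft_makespan=100.0, gnnrl_makespan=103.0)) # heft
print(choose_schedule(120, heft_makespan=100.0, gnnrl_makespan=103.0,
                      workflow_trained=True))                          # gnnrl
```

The second and third calls differ only in the gate: a GNNRL makespan 3% above HEFT fails the ×1.01 gate of the generic model but passes the ×1.05 gate of a workflow-trained model.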

## Related work

- [milp_snakemake_scheduler](https://github.com/AasishKumarSharma/milp_snakemake_scheduler):
  original MILP scheduler (old Snakemake API). GrapheonRL reimplements its design
  for the Snakemake 9 plugin interface.

## Citation

If you use this plugin, please cite the papers it is based on:

```bibtex
@inproceedings{sharma2025grapheonrl,
  title     = {Grapheon {RL}: A Graph Neural Network and Reinforcement Learning
               Framework for Constraint and Data-Aware Workflow Mapping and
               Scheduling in Heterogeneous {HPC} Systems},
  author    = {Sharma, Aasish Kumar and Kunkel, Julian},
  booktitle = {Proceedings of the 2025 IEEE 49th Annual Computers, Software,
               and Applications Conference (COMPSAC)},
  pages     = {489--494},
  year      = {2025},
  doi       = {10.1109/COMPSAC65507.2025.00341}
}

@inproceedings{sharma2025workflow,
  title     = {Workflow-Driven Modeling for the Compute Continuum: An
               Optimization Approach to Automated System and Workload Scheduling},
  author    = {Sharma, Aasish Kumar and Boehme, Christian and Gel{\ss}, Patrick
               and Yahyapour, Ramin and Kunkel, Julian},
  booktitle = {Proceedings of the 2025 IEEE 49th Annual Computers, Software,
               and Applications Conference (COMPSAC)},
  pages     = {2170--2175},
  year      = {2025},
  doi       = {10.1109/COMPSAC65507.2025.00343}
}
```

## Author

Aasish Kumar Sharma, Institute of Computer Science, GWDG, University of Göttingen

## License

MIT License
