Metadata-Version: 2.4
Name: stream-dse
Version: 1.13.0
Summary: Stream - Multi-core accelerator design space exploration with layer-fused scheduling
Author-email: Arne Symons <arne.symons@kuleuven.be>, Linyan Mei <linyan.mei@kuleuven.be>
License: MIT License
        
        Copyright (c) 2023 MICAS (KU Leuven)
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/KULeuven-MICAS/stream
Keywords: stream,multi-core,accelerator,layer-fused,scheduling,zigzag,dse,design-space-exploration,machine-learning,deep-learning,mapping
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: zigzag-dse==3.8.5
Requires-Dist: cerberus
Requires-Dist: ortools>=9.15
Requires-Dist: pydantic<2.12,>=2.0
Requires-Dist: pydot
Requires-Dist: xdsl<0.30,>=0.29.1
Provides-Extra: mcp
Requires-Dist: fastmcp>=3.2.4; extra == "mcp"
Provides-Extra: gurobi
Requires-Dist: gurobipy; extra == "gurobi"
Provides-Extra: dev
Requires-Dist: bumpver; extra == "dev"
Requires-Dist: pip-tools; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Dynamic: license-file

# 🌊 Stream

[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Python 3.12+](https://img.shields.io/badge/python-3.12%2B-blue)](https://www.python.org/)
[![Docs](https://img.shields.io/badge/docs-stream-blue)](https://kuleuven-micas.github.io/stream/)

**Stream** is a design space exploration (DSE) and constraint-optimization framework for **heterogeneous dataflow accelerators**: accelerator systems built by combining cores that each have their own dataflow and performance model (**AIE** and **TPU-like** are two example core types among others). Scheduling is **layer-fused**, and the **TETRA constraint optimization** uses MILP (Mixed-Integer Linear Programming) to decide tensor placement and transfer paths across the cores of such a system. Stream builds on top of [ZigZag](https://zigzag-project.github.io/zigzag/) for per-core cost estimation.

### 📖 [**Explore the Documentation**](https://kuleuven-micas.github.io/stream/)
### 🚀 [**Getting Started Guide**](https://kuleuven-micas.github.io/stream/getting-started/)

---

## ✨ Key Features

✔ **Heterogeneous dataflow cores**: compose an accelerator from cores that each carry their own dataflow and cost model (AIE, TPU-like, pooling, SIMD, and more).

✔ **Layer-fused scheduling** across the whole system of cores.

✔ **TETRA constraint optimization**: a MILP (`TransferAndTensorAllocator`) decides tensor placement and transfer-path routing.

✔ **Pluggable solver backends**: OR-Tools GSCIP (default, license-free), OR-Tools HiGHS, and Gurobi behind one unified `SolverModel` API.

✔ **ONNX workloads** with auto-generated or hand-written mappings.

✔ **AMD AIE code generation**: emit `aie` / `aiex` MLIR for the Ryzen AI NPU, ready for the [mlir-aie](https://github.com/Xilinx/mlir-aie) / IRON toolchain.

✔ **Built for AI agents**: an MCP server and typed IR models expose the pipeline programmatically.

The pipeline runs as a chain of stages: parse → tile → cost → MILP allocation → memory estimation.
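
To make the "chain of stages" idea concrete, here is a minimal sketch of stages that thread one shared context through the chain. It is not Stream's implementation: `make_stage`, `run_pipeline`, and the `trace` key are hypothetical names, and the stage bodies are placeholders that only record the order in which they ran.

```python
from typing import Callable, Optional

# A stage takes the shared context dict and returns it, possibly extended.
Stage = Callable[[dict], dict]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran (hypothetical)."""
    def stage(ctx: dict) -> dict:
        ctx.setdefault("trace", []).append(name)
        return ctx
    return stage


def run_pipeline(stages: list, ctx: Optional[dict] = None) -> dict:
    """Run the stages in order, threading one context through all of them."""
    ctx = {} if ctx is None else ctx
    for stage in stages:
        ctx = stage(ctx)
    return ctx


# Stage names follow the order listed above; the bodies are placeholders.
STAGE_NAMES = ["parse", "tile", "cost", "milp_allocation", "memory_estimation"]
result = run_pipeline([make_stage(name) for name in STAGE_NAMES])
print(" -> ".join(result["trace"]))
```

Each stage only sees the context, so a stage can be swapped or skipped without touching its neighbours.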

---

## 🚀 Installation

Python `>=3.12` is required.

Full install with MCP server support (from the repo root):

```bash
pip install -e ".[mcp]"
```

Base install (no MCP server):

```bash
pip install -e .
```

The authoritative dependency source is `pyproject.toml` (package `stream-dse`). The base install pulls in `zigzag-dse`, `ortools>=9.15` (the default, license-free MILP backend), `pydantic`, `pydot`, and `xdsl`. Optional extras: `[mcp]` adds `fastmcp` (required for the MCP server); `[gurobi]` adds `gurobipy` (commercial solver, opt-in).

### AIE code generation

AIE-target MLIR codegen and tracing additionally need the AMD AIE toolchain (`mlir_aie`, `llvm-aie`, `xdsl-aie`, `snax-mlir`, `aie-python-extras`). These are git/URL installs that PyPI does not allow in package metadata, so a console script installs them after the base install rather than via an extra:

```bash
pip install -e .       # or, once published: pip install stream-dse
stream-setup-aie       # installs the AIE toolchain into the current environment
```

`stream-setup-aie --dry-run` prints exactly what it will install without making changes.

> ⚠️ **Platform caveat:** the AIE toolchain is Linux x86_64 only (manylinux wheels), CPython 3.12 or 3.13.
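
A quick pre-flight check for this caveat could look like the sketch below. It is not part of Stream; `aie_toolchain_supported` is a hypothetical helper that simply encodes the three conditions stated above using the standard library.

```python
import platform
import sys

# Python versions named in the platform caveat above.
SUPPORTED_PYTHONS = {(3, 12), (3, 13)}


def aie_toolchain_supported(system: str, machine: str,
                            implementation: str, py: tuple) -> bool:
    """Return True only for Linux x86_64 on CPython 3.12 or 3.13."""
    return (
        system == "Linux"
        and machine == "x86_64"
        and implementation == "CPython"
        and tuple(py) in SUPPORTED_PYTHONS
    )


supported_here = aie_toolchain_supported(
    platform.system(),
    platform.machine(),
    platform.python_implementation(),
    sys.version_info[:2],
)
print("AIE toolchain supported on this machine:", supported_here)
```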

> 💡 **Solver license note:** OR-Tools (`ortools_gscip`, the default backend) is open-source and needs no license. Gurobi requires the `[gurobi]` extra (`pip install -e ".[gurobi]"`) plus a separate commercial license; `backend="gurobi"` errors at solve time without a valid license.
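
One way to choose a backend name defensively is sketched below. `pick_backend` is a hypothetical helper, not a Stream function; it only checks whether `gurobipy` is importable, which says nothing about whether a valid Gurobi license is present.

```python
import importlib.util


def pick_backend(prefer_gurobi: bool = False) -> str:
    """Return "gurobi" only if requested and importable, else the default."""
    # Importability is a necessary condition only: a missing or invalid
    # license still makes backend="gurobi" fail at solve time.
    if prefer_gurobi and importlib.util.find_spec("gurobipy") is not None:
        return "gurobi"
    return "ortools_gscip"  # open-source default, no license needed


print(pick_backend())
```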

Optional pre-commit setup:

```bash
pre-commit install
```

---

## ⚡ Quick Start

Run the constraint-optimization (CO) pipeline on a small two-Conv workload (a committed test fixture) with an auto-generated mapping; the run takes roughly 11 seconds:

```bash
python scripts/main_stream_co.py \
  --hardware stream/inputs/examples/hardware/tpu_like_quad_core.yaml \
  --workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx
```

Or simply run `just co-2conv`: this repo uses [`just`](https://github.com/casey/just) as a task runner, and the recipe defaults to the `tpu_like_quad_core` hardware, a TPU-like quad-core system (see the [matrix](#workload--hardware-matrix) below). Because `--mapping` is omitted, the pipeline auto-generates the mapping.

Expected output:

```
Total latency: 14344.0
  Group 0: 14344 (100.0%, wall=9.4s)
```

A YAML summary is written to `outputs/.../summary.yaml` with `total_latency: 14344.0`, plus workload/tiling/schedule PNG visualizations.
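
To pick the latency back up in a script, a sketch like the following may be enough. It assumes `total_latency` is a top-level scalar key in `summary.yaml`, as the line above suggests, and uses a plain line scan rather than a YAML library. `read_total_latency` is a hypothetical helper, and the demo writes a one-line stand-in file instead of reading a real run directory.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_total_latency(summary_path: Path) -> Optional[float]:
    """Return the top-level `total_latency` value, or None if it is absent."""
    for line in summary_path.read_text().splitlines():
        key, sep, value = line.partition(":")
        # An exact key match (no leading spaces) keeps this to top-level keys.
        if sep and key == "total_latency":
            return float(value.strip())
    return None


# Stand-in file for demonstration; a real run writes its own summary.yaml.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "summary.yaml"
    demo.write_text("total_latency: 14344.0\n")
    latency = read_total_latency(demo)

print(latency)
```

For anything beyond a single scalar, a proper YAML parser is the more robust choice.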

---

## 🧩 Hardware and Core Types

An accelerator in Stream is described as a **system of heterogeneous dataflow cores**. Core roles include compute, memory, shim, and offchip; example dataflow core types include **AIE**, **TPU-like**, and **pooling**.

Hardware and mapping files are organized as follows:

- `stream/inputs/examples/hardware/` - system-level hardware YAMLs (e.g. `tpu_like_quad_core.yaml`, `eyeriss_like_*.yaml`, `simba*.yaml`, `fusemax.yaml`).
- `stream/inputs/examples/hardware/cores/` - per-core-type YAMLs (e.g. `tpu_like.yaml`, `pooling.yaml`, `simd.yaml`, `offchip.yaml`, `eyeriss_like.yaml`).
- `stream/inputs/aie/hardware/` and `stream/inputs/aie/hardware/cores/` - AMD AIE example core types (e.g. `aie_tile.yaml`, `mem_tile_256KB.yaml`, `shim_dma.yaml`).
- `stream/inputs/examples/mapping/`, `stream/inputs/aie/mapping/`, and `stream/inputs/testing/mapping/` - mapping descriptions.

A mapping can be auto-generated (as in Quick Start above) or hand-written and passed via `--mapping`.

---

## 📊 Workload × Hardware Matrix

The generic CO pipeline runs any ONNX workload on any of the example hardware systems. The repo ships two small workloads and exercises them across all eight non-AIE example architectures, both from the `scripts/main_stream_co.py` entry point and from the pytest suite (`tests/test_hardware_combinations.py`).

**Workloads** - committed test fixtures under `stream/inputs/testing/workload/`. Weight values are cleared because only tensor shapes matter for cost estimation, so the ONNX files stay tiny; `just gen-workloads` regenerates them via the builders:

- **2-conv** - two chained Conv layers (`make_2_conv.py`).
- **swiglu** - a 5-node SwiGLU block: two Gemms, SiLU, an elementwise Mul, and a down-projection Gemm (`make_swiglu.py`).

| Hardware (`stream/inputs/examples/hardware/`) | Description | 2-conv | swiglu |
|---|---|:---:|:---:|
| `eyeriss_like_single_core` | one Eyeriss-like compute core (+ pooling, SIMD, DRAM) | ✓ | ✓ |
| `eyeriss_like_dual_core` | two Eyeriss-like compute cores | ✓ | ✓ |
| `eyeriss_like_quad_core` | four Eyeriss-like compute cores | ✓ | ✓ |
| `tpu_like_quad_core` | four TPU-like compute cores | ✓ | ✓ |
| `simba_small` | small Simba chiplet mesh | ✓ | ✓ |
| `simba` | 36-core Simba chiplet mesh | ✓ | ✓ |
| `fusemax` | FuseMax array + vector + DRAM | ✓ | ✓ |
| `meta_prototype_dual_core_simd_offchip` | two Meta-prototype compute cores (+ pooling, SIMD, DRAM) | ✓ | ✓ |

✓ = completes through the generic CO pipeline. All combinations run in the default fast suite; on these small single-fusion-group workloads even the 36-core `simba` mesh finishes in seconds.

**Run one combination** - the `justfile` wraps `scripts/main_stream_co.py`; `hw` is any hardware stem from the table (default `tpu_like_quad_core`):

```bash
just co-2conv fusemax           # 2-conv on an architecture
just co-swiglu simba_small      # swiglu on an architecture
```

Equivalently, the raw entry-point call:

```bash
python scripts/main_stream_co.py \
  --hardware stream/inputs/examples/hardware/fusemax.yaml \
  --workload stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx
```

**Run the whole matrix** - the `justfile` wraps `pytest tests/test_hardware_combinations.py`, which runs 2-conv + swiglu over all eight architectures plus a parse-only check confirming every hardware definition loads:

```bash
just matrix          # parse + 2-conv + swiglu over all 8 architectures (incl. simba)
```
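
For scripting individual combinations instead of the whole pytest run, the matrix can be enumerated as below. This is a sketch that only builds the `just` command strings from the hardware stems and recipes listed above; it does not execute anything.

```python
from itertools import product

# Hardware stems from the table above.
HARDWARE = [
    "eyeriss_like_single_core",
    "eyeriss_like_dual_core",
    "eyeriss_like_quad_core",
    "tpu_like_quad_core",
    "simba_small",
    "simba",
    "fusemax",
    "meta_prototype_dual_core_simd_offchip",
]
# The two per-workload recipes shown earlier in this section.
RECIPES = ["co-2conv", "co-swiglu"]

commands = [f"just {recipe} {hw}" for recipe, hw in product(RECIPES, HARDWARE)]

for command in commands:
    print(command)
```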

---

## 🖥️ Command-Line Entry Points

All entry-point scripts live in `scripts/` and are run from the repo root (so relative input paths resolve and `stream` imports as the installed package).

| Script | Purpose |
|--------|---------|
| `scripts/main_stream_co.py` | Generic CO pipeline for any workload + hardware pair; manual or auto-generated mapping; YAML summary output. **General-purpose (non-AIE).** |
| `scripts/main_gemm.py` | CO allocation + optional AIE MLIR codegen for GEMM workloads (AMD Strix AIE). |
| `scripts/main_swiglu.py` | CO allocation + optional AIE MLIR codegen for SwiGLU workloads (AMD Strix AIE). |
| `scripts/main_swiglu_dse_single.py` | Single-mapping SwiGLU DSE evaluation (AIE). |
| `scripts/main_swiglu_dse.py` | Multi-mapping SwiGLU DSE sweep over tile sizes (AIE). |
| `scripts/main_aie_co.py` | CO allocation for a hard-coded single AIE tile workload (no args; run as `python scripts/main_aie_co.py`). |
| `scripts/main_gemm_codegen.py` | Direct GEMM → AIE MLIR codegen via xDSL transforms (no CO pipeline); `--M/--N/--K`. |

`scripts/main_stream_co.py` is the general-purpose entry point. The others are AIE-specific: they hardwire AMD Strix or single-tile AIE hardware, and codegen requires NPU hardware. Note that `scripts/main_aie_co.py` takes no arguments (all paths are hard-coded). Plotting and trace post-processing utilities live in `scripts/analysis/`.

Full `scripts/main_stream_co.py` CLI syntax:

```bash
python scripts/main_stream_co.py \
  --hardware PATH_TO_HW_YAML \
  --workload PATH_TO_ONNX \
  [--mapping PATH_TO_MAPPING_YAML]  # omit for auto-generated mapping
  [--output OUTPUT_DIR]             # default: "outputs"
  [--experiment-id ID]
  [--skip-if-exists]
```
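
When driving the script from Python, the flags above can be assembled into an argument list. The sketch below uses a hypothetical helper, `build_co_command`, which emits only the flags documented in the syntax block and leaves optional ones out so the script's own defaults apply.

```python
import sys
from typing import Optional


def build_co_command(
    hardware: str,
    workload: str,
    mapping: Optional[str] = None,
    output: Optional[str] = None,
    experiment_id: Optional[str] = None,
    skip_if_exists: bool = False,
) -> list:
    """Assemble the argument list for scripts/main_stream_co.py."""
    cmd = [sys.executable, "scripts/main_stream_co.py",
           "--hardware", hardware, "--workload", workload]
    if mapping is not None:        # omitted -> mapping is auto-generated
        cmd += ["--mapping", mapping]
    if output is not None:         # omitted -> the script's default applies
        cmd += ["--output", output]
    if experiment_id is not None:
        cmd += ["--experiment-id", experiment_id]
    if skip_if_exists:
        cmd.append("--skip-if-exists")
    return cmd


cmd = build_co_command(
    hardware="stream/inputs/examples/hardware/fusemax.yaml",
    workload="stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx",
)
print(" ".join(cmd[1:]))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repo root.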

---

## 🐍 Public API

The public API lives in `stream/api.py`.

The primary entry point is `optimize_allocation_co_generic`, which auto-generates the mapping from the workload and hardware (no hand-written mapping YAML needed). This snippet is confirmed to run and print `total_latency: 14344.0` (the 2-conv ONNX it references is produced by `just gen-workloads`):

```python
import tempfile
from stream.api import configure_logging, optimize_allocation_co_generic

configure_logging()

with tempfile.TemporaryDirectory() as tmp:
    ctx = optimize_allocation_co_generic(
        hardware="stream/inputs/examples/hardware/tpu_like_quad_core.yaml",
        workload="stream/inputs/testing/workload/2conv_1_8_32_32_16_32_3.onnx",
        experiment_id="my-first-run",
        output_path=tmp,
    )
    print("total_latency:", ctx.get("total_latency"))
    print("group_latencies:", ctx.get("group_latencies"))
```

Expected output: `total_latency: 14344.0`.

The other two public functions:

- `optimize_allocation_co_with_mapping(hardware, workload, mapping, experiment_id, output_path, ...)` - runs CO with a hand-written mapping YAML. `optimize_allocation_co` is a backward-compatible **alias** for it (both names importable).
- `optimize_mapping(hardware, workload, experiment_id, output_path, max_nb_mappings=20, ...)` - DSE pipeline: enumerates mapping variants and runs CO for each.

All three return a `StageContext`. Useful keys: `ctx.get("total_latency")`, `ctx.get("group_latencies")`, `ctx.get("scheduler")`, `ctx.get("workload")`, `ctx.get("accelerator")`.
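
As an illustration of consuming these keys, the sketch below formats a small latency report. It relies only on `ctx.get(...)`, so a plain dict stands in for the `StageContext` here. The shape of `group_latencies` is an assumption (either a sequence of per-group numbers or a dict keyed by group id), and `latency_report` is a hypothetical helper.

```python
def latency_report(ctx) -> list:
    """Format total and per-group latency from anything with a .get() method."""
    total = ctx.get("total_latency")
    groups = ctx.get("group_latencies") or []
    # Assumption: groups is a sequence of numbers or a {group_id: number} dict.
    items = groups.items() if isinstance(groups, dict) else enumerate(groups)
    lines = [f"Total latency: {total}"]
    for group_id, latency in items:
        share = 100.0 * latency / total if total else 0.0
        lines.append(f"  Group {group_id}: {latency} ({share:.1f}%)")
    return lines


# A dict stands in for the StageContext returned by the API functions.
fake_ctx = {"total_latency": 14344.0, "group_latencies": [14344.0]}
report = latency_report(fake_ctx)
print("\n".join(report))
```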

---

## 🤖 MCP Server (for AI agents)

Stream ships an MCP server (`stream/mcp/server.py`, server name `stream`) that lets an AI agent submit and inspect TETRA CO jobs. Requires the `[mcp]` extra (`pip install -e ".[mcp]"`).

Launch command (from the repo root):

```bash
python3 -c "from stream.mcp.server import mcp; mcp.run(transport='stdio')"
```

The server uses the stdio transport (JSON-RPC) and blocks until the client disconnects.

The 6 tools:

| Tool | Purpose |
|------|---------|
| `run_optimization(hardware, workload, mapping, output_path, backend, ...)` | Submit a TETRA CO job; returns a `job_id` immediately; solve runs in the background. |
| `poll_optimization(job_id)` | Check job status (`pending` / `running` / `complete` / `failed` / `not_found`). |
| `get_workload_ir(workload=None, experiment_id=None)` | Return the workload DAG as `WorkloadIR` JSON. |
| `get_accelerator_ir(hardware=None, experiment_id=None)` | Return the hardware model as `AcceleratorIR` JSON. |
| `get_allocation_ir(job_id)` | Return the TETRA allocation result as `AllocationIR` JSON (3 persona views). |
| `get_solve_stats(job_id)` | Return MILP solve statistics (objective, time, gap, node count, backend). |

**Run / poll / inspect flow:**

1. `run_optimization(...)` returns `{"job_id": "...", "status": "pending"}`.
2. Poll `poll_optimization(job_id)` until `{"status": "complete"}`.
3. Inspect with `get_allocation_ir(job_id)` for the `AllocationIR` (algorithmic / hardware / compiler views) and `get_solve_stats(job_id)` for solve statistics.
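
The polling step of this flow can be sketched client-side as follows. `call_tool` is a hypothetical callable standing in for whatever MCP client invokes a tool by name and returns its result as a dict; only the tool name and status values come from the table above. The demo uses a fake that replays three statuses.

```python
import time
from typing import Callable

# Statuses after which polling should stop (taken from the tool table above).
TERMINAL = {"complete", "failed", "not_found"}


def wait_for_job(call_tool: Callable[[str, dict], dict], job_id: str,
                 interval_s: float = 1.0, timeout_s: float = 600.0) -> str:
    """Poll `poll_optimization` until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = call_tool("poll_optimization", {"job_id": job_id})["status"]
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status!r} after {timeout_s}s")
        time.sleep(interval_s)


# Fake client for demonstration: replays pending -> running -> complete.
_replies = iter(["pending", "running", "complete"])


def fake_call_tool(name: str, arguments: dict) -> dict:
    return {"status": next(_replies)}


final = wait_for_job(fake_call_tool, "demo-job", interval_s=0.0)
print(final)
```

Only call `get_allocation_ir(job_id)` and `get_solve_stats(job_id)` when the returned status is `complete`; `failed` and `not_found` also end the loop so the caller can report them.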

---

## 🧠 Working in This Repo (AI agents)

**Programmatic / IR API** for structured JSON output:

```python
from stream.ir import WorkloadIR, AcceleratorIR, AllocationIR

# `ctx` is the StageContext returned by optimize_allocation_co_generic(...)
workload_ir = WorkloadIR.from_internal(ctx.get("workload"))
accelerator_ir = AcceleratorIR.from_internal(ctx.get("accelerator"))
allocation_ir = AllocationIR.from_internal(ctx.get("scheduler"))

workload_data = workload_ir.model_dump()      # JSON-compatible dict
hardware_data = accelerator_ir.model_dump()
allocation_data = allocation_ir.model_dump()
```

`AllocationIR` offers `.algorithmic_view()`, `.hardware_view()`, and `.compiler_view()` persona views.

---

## 📚 Further Documentation

- **Hosted documentation site:** [kuleuven-micas.github.io/stream](https://kuleuven-micas.github.io/stream/). The human-facing docs cover installation, getting started, the workload/hardware/mapping input formats, and driving Stream from an AI agent via the MCP server and IR models. The site is rebuilt from `docs/` on every push to `main`.
- **Stream paper (IEEE):** [A. Symons, L. Mei, S. Colleman, P. Houshmand, S. Karl and M. Verhelst, "Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators"](https://ieeexplore.ieee.org/abstract/document/10713407).
- **ZigZag:** [zigzag-project.github.io/zigzag](https://zigzag-project.github.io/zigzag/), the per-core cost-estimation framework Stream builds on.
