Metadata-Version: 2.4
Name: simplex-tensor
Version: 1.0.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Compilers
Classifier: Topic :: System :: Hardware :: Hardware Drivers
Classifier: Operating System :: POSIX :: Linux
Requires-Dist: numpy>=1.24
Requires-Dist: pytest>=7.0 ; extra == 'dev'
Requires-Dist: maturin>=1.0 ; extra == 'dev'
Requires-Dist: scipy>=1.10 ; extra == 'ml'
Provides-Extra: dev
Provides-Extra: ml
Summary: SympleX – Polyhedral Tensor Superoptimizer with JAX-style purity enforcement and x86-64 JIT compilation
Keywords: jit,compiler,tensor,polyhedral,optimization,simd,avx512,numpy,scientific-computing
Author: SympleX Contributors
License: AGPL-3.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/hollowguy898-cloud/SympleX
Project-URL: Repository, https://github.com/hollowguy898-cloud/SympleX

# SympleX — Polyhedral Tensor Superoptimizer

**SympleX** is a C++20 compiler engine that combines equality saturation (e-graph superoptimization) with the polyhedral model to automatically discover mathematically equivalent but faster programs for AI training. It explores the space of valid program transformations — not just tile sizes — and maps the best discovered program directly onto GPU Tensor Cores, SRAM hierarchies, and distributed cluster topologies.

## Architecture

```
[AI Model Graph / Iteration Space]
              │
              ▼
┌──────────────────────────────────────────────────────┐
│  Level 1: Mathematical Superoptimizer (E-Graph)      │
│  ┌────────────────────────────────────────────────┐  │
│  │  E-Graph: compactly represents exponentially    │  │
│  │  many equivalent programs as equivalence classes │  │
│  └────────────────────────────────────────────────┘  │
│  Rewrite Rules:                                      │
│   - A*B + A*C → A*(B+C)   (factor / distribute)    │
│   - ReLU(MatMul(A,B)) → FusedMatMulReLU(A,B)       │
│   - MatMul(A,B)+bias → FusedMatMulAdd(A,B,bias)    │
│   - A + 0 == A,  A * 1 == A,  A * 0 == 0           │
│   - (A@B)^T == B^T@A^T,  Transpose(Transpose(A))=A │
│   - Softmax decomposition, LayerNorm decomposition   │
│   - Tiling decomposition, tile-pointwise distribution│
│  Polyhedral Guardrails:                              │
│   - Rejects rewrites that violate dependencies       │
│  Cost-Guided Extraction:                             │
│   - Picks cheapest program from saturated e-graph    │
└──────────────────────────────────────────────────────┘
              │
              ▼
┌──────────────────────────────────────────────────────┐
│  Level 2: Hardware-Mapping Search                    │
│  Phase 1: Roofline pruning (memory-bound filtering)  │
│  Phase 2: Compute-symmetry alignment (TC multiples)  │
│  Phase 3: Hardware occupancy sieve (analytical/emp.) │
└──────────────────────────────────────────────────────┘
              │
              ▼
┌──────────────────────────────────────────────────────┐
│  Code Generator (PTX Emitter)                        │
│  - WMMA/MMA Tensor Core instructions                 │
│  - Shared memory swizzling (XOR bank-conflict avoid) │
│  - Async TMA / cp.async pipelines                    │
│  - Double-buffering & software pipelining             │
└──────────────────────────────────────────────────────┘
              │
              ▼
┌──────────────────────────────────────────────────────┐
│  Distributed Sharding & Fault Tolerance              │
│  - 2D mesh (TP × PP × DP)                           │
│  - NCCL collective scheduling                         │
│  - 1F1B pipeline overlap                             │
│  - Resilient forward recovery (no rollback)           │
│  - Dynamic micro-batching                            │
│  - Activation checkpointing (save/recompute/offload)  │
└──────────────────────────────────────────────────────┘
              │
              ▼
     [Optimized GPU Binary / PTX]
```

## Modules

| Module | Path | Description |
|--------|------|-------------|
| **Polyhedral** | `include/symplex/polyhedral/` | Integer polytopes, affine maps, dependency analysis, iteration spaces |
| **Schedule** | `include/symplex/schedule/` | Schedule trees, tiling, operator fusion, GPU parallelization mapping |
| **Hardware** | `include/symplex/hardware/` | GPU topology, Tensor Core specs, memory hierarchy, roofline model |
| **Optimizer** | `include/symplex/optimizer/` | **E-graph superoptimizer** + 3-phase hardware search (roofline → symmetry → occupancy) |
| **Cost Model** | `include/symplex/costmodel/` | Roofline, analytical, empirical, and hybrid cost models |
| **Codegen** | `include/symplex/codegen/` | PTX emitter, WMMA/MMA instruction generation, swizzling, register allocation |
| **Distributed** | `include/symplex/distributed/` | Cluster mesh, SPMD sharding, NCCL bridge, pipeline overlap |
| **Fault Tolerance** | `include/symplex/fault_tolerance/` | Health monitoring, forward recovery, communicator repair, activation checkpointing |
| **Training** | `include/symplex/training/` | Training loop orchestrator, dynamic batch sizing, memory watchdog, JIT compiler pipeline |

## Quick Start

### Prerequisites

- C++20 compiler (GCC 12+, Clang 15+)
- CMake 3.20+
- Optional: CUDA Toolkit 12+ (for empirical profiling and GPU execution)
- Optional: ISL (Integer Set Library) for enhanced polyhedral analysis
- Optional: NCCL (for distributed training)

### Build

```bash
git clone https://github.com/hollowguy898-cloud/SympleX.git
cd SympleX
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
```

### Run Tests

```bash
./symplex_tests
```

### Run the Matmul Optimization Example

```bash
./example_matmul
```

This optimizes a 4096×4096×2048 matrix multiplication for the H100 GPU target, running the full 3-phase superoptimizer search and generating PTX code.

## Usage

### Optimize a Matrix Multiplication

```cpp
#include "symplex/training/compiler_pipeline.h"
#include "symplex/hardware/hardware_target.h"

using namespace symplex::training;
using namespace symplex::hardware;

// Target: NVIDIA H100
HardwareTarget target = HardwareTarget::H100();

// Create compiler pipeline
CompilerPipeline pipeline(target);

// Optimize matmul C[M,N] += A[M,K] * B[K,N]
auto result = pipeline.compile_matmul(4096, 4096, 2048);

// result.ptx_source     — generated PTX kernel
// result.estimated_latency_ns  — predicted latency
// result.speedup_vs_naive     — speedup over naive tiling
// result.grid_dims / block_dims — GPU launch parameters
```

### Use the Superoptimizer Directly

```cpp
#include "symplex/optimizer/superoptimizer.h"
#include "symplex/polyhedral/iteration_space.h"
#include "symplex/hardware/hardware_target.h"

using namespace symplex::optimizer;
using namespace symplex::polyhedral;
using namespace symplex::hardware;

HardwareTarget target = HardwareTarget::H100();
Superoptimizer opt(target);

auto ispace = make_matmul_iteration_space(1024, 1024, 512);
auto result = opt.optimize(ispace);

// result.best_tile         — optimal TileConfig
// result.estimated_latency_ns — predicted latency
// result.speedup_vs_naive  — speedup vs smallest Tensor Core tile
```

### Distributed Training with Fault Tolerance

```cpp
#include "symplex/training/training_loop.h"
#include "symplex/hardware/hardware_target.h"

using namespace symplex::training;
using namespace symplex::hardware;

TrainingConfig config;
config.global_batch_size = 2048;
config.enable_fault_tolerance = true;
config.enable_dynamic_batching = true;

HardwareTarget target = HardwareTarget::H100();
TrainingLoop loop(config, target);

auto ispace = make_matmul_iteration_space(4096, 4096, 2048);
loop.initialize(ispace);

auto results = loop.execute_epoch();
```

## Key Mathematical Concepts

### Iteration Space (I)
Every AI loop nest is modeled as an **integer polytope**:

```
I = { i ∈ Z^n | A·i + b ≥ 0 }
```

### Data Dependency Polyhedron (D)
Dependencies are vectors in the polyhedral space that must remain lexicographically positive:

```
d = i_sink - i_source,  d ≥ 0
```

### Schedule Map (Φ)
The central optimization maps iteration points to hardware coordinates and time:

```
Φ(i) → (DeviceID, SM_ID, Warp_ID, Thread_ID, TimeStep)
```

### 3-Phase Superoptimizer Search (Level 2)

1. **Roofline Pruning**: Drop 90% of tile configurations using analytical operational intensity bounds
2. **Compute-Symmetry Alignment**: Only evaluate tile sizes that are exact multiples of Tensor Core fragment dimensions (16×8×16 for H100)
3. **Hardware Occupancy Sieve**: Micro-benchmark the top candidates, selecting the configuration that maximizes SM occupancy

### Equality Saturation (Level 1 — the real superoptimizer)

Unlike traditional autotuners that only sweep hardware parameters, SympleX's superoptimizer explores the **space of equivalent programs** using equality saturation:

1. **E-Graph Construction**: The input program is represented as an e-graph — a data structure that compactly represents exponentially many equivalent expressions as equivalence classes
2. **Rewrite Rule Application**: Algebraic identities, fusion patterns, and tiling decompositions are applied iteratively, growing the e-graph to represent all discovered equivalent programs
3. **Polyhedral Guardrails**: Before any extracted program is accepted, it is validated against the original computation's data dependencies — rewrites that would violate semantics are rejected
4. **Cost-Guided Extraction**: The cheapest program is extracted from the saturated e-graph using a bottom-up dynamic programming approach, where fused operations (e.g., `FusedMatMulReLU`) have lower cost than their unfused equivalents (`ReLU(MatMul(A,B))`)

## Hardware Targets

Built-in profiles for:

| GPU | SMs | Tensor Core | HBM BW | SRAM/SM |
|-----|-----|-------------|--------|---------|
| **H100** (Hopper) | 132 | 16×8×16 FP16 | 3.35 TB/s | 228 KB |
| **B200** (Blackwell) | 160 | 16×8×32 FP16 | 8.0 TB/s | 304 KB |
| **Generic** | 84 | 16×8×16 FP16 | 2.0 TB/s | 164 KB |

Custom targets can be constructed via `HardwareTarget` fields.

## Project Structure

```
SympleX/
├── CMakeLists.txt
├── LICENSE                          # GNU AGPL v3
├── README.md
├── include/symplex/
│   ├── polyhedral/                  # Core polyhedral types
│   │   ├── integer_polytope.h
│   │   ├── affine_map.h
│   │   ├── dependency.h
│   │   ├── iteration_space.h
│   │   └── union_map.h
│   ├── schedule/                    # Schedule tree & transformations
│   │   ├── schedule_tree.h
│   │   ├── tiling.h
│   │   ├── fusion.h
│   │   ├── parallelization.h
│   │   └── schedule_map.h
│   ├── hardware/                    # GPU hardware models
│   │   └── hardware_target.h
│   ├── optimizer/                   # Superoptimizer search
│   │   ├── tile_config.h
│   │   ├── search_phase1.h
│   │   ├── search_phase2.h
│   │   ├── search_phase3.h
│   │   └── superoptimizer.h
│   ├── costmodel/                   # Performance cost models
│   │   ├── roofline.h
│   │   ├── analytical.h
│   │   ├── empirical.h
│   │   └── cost_model.h
│   ├── codegen/                     # PTX code generation
│   │   ├── wmma.h
│   │   ├── swizzle.h
│   │   ├── register_allocator.h
│   │   ├── ptx_emitter.h
│   │   └── code_generator.h
│   ├── distributed/                 # Distributed training
│   │   ├── mesh.h
│   │   ├── sharding.h
│   │   ├── nccl_bridge.h
│   │   └── pipeline_overlap.h
│   ├── fault_tolerance/             # Fault tolerance
│   │   ├── health_monitor.h
│   │   ├── forward_recovery.h
│   │   ├── communicator_repair.h
│   │   └── checkpoint.h
│   └── training/                    # Training orchestrator
│       ├── dynamic_batch.h
│       ├── memory_watchdog.h
│       ├── training_loop.h
│       └── compiler_pipeline.h
├── src/                             # Implementation files (mirrors include/)
├── tests/                           # Unit tests
├── benchmarks/                      # Performance benchmarks
└── examples/                        # Usage examples
```

## License

GNU Affero General Public License v3 — see [LICENSE](LICENSE).

