Metadata-Version: 2.2
Name: mujofil-warp
Version: 0.1.1
Summary: Photoreal Filament PBR rendering for GPU-resident MuJoCo (MJWarp), zero-copy to PyTorch
Keywords: mujoco,mjwarp,filament,pbr,rendering,reinforcement-learning,sim-to-real
Author: Tau Intelligence
License: Apache-2.0
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: POSIX :: Linux
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Multimedia :: Graphics :: 3D Rendering
Project-URL: Homepage, https://github.com/tau-intelligence/mujofil-warp
Project-URL: Repository, https://github.com/tau-intelligence/mujofil-warp
Project-URL: Issues, https://github.com/tau-intelligence/mujofil-warp/issues
Requires-Python: >=3.10
Requires-Dist: mujoco>=3.2
Requires-Dist: mujoco-warp
Requires-Dist: warp-lang
Requires-Dist: numpy
Provides-Extra: torch
Requires-Dist: torch>=2.4; extra == "torch"
Provides-Extra: examples
Requires-Dist: pillow; extra == "examples"
Description-Content-Type: text/markdown

# mujofil-warp

**Photoreal PBR rendering for GPU-resident MuJoCo (MJWarp), zero-copy to PyTorch.**

[MJWarp](https://github.com/google-deepmind/mujoco_warp) simulates thousands of
parallel MuJoCo worlds entirely on the GPU, but its built-in batch renderer is a
deliberately **low-fidelity single-hit raycaster** (flat Lambertian, no PBR / IBL
/ reflections, and it cannot load GLB environments).

`mujofil-warp` pairs MJWarp's GPU-resident physics with
[Google Filament](https://github.com/google/filament)'s **physically-based
renderer** (PBR materials, image-based lighting, soft shadows, SSAO) and delivers
each rendered frame **straight to PyTorch as a CUDA tensor — no CPU round-trip**.

## Highlights

- **Zero-copy to `torch.cuda`.** Filament renders into GPU memory that CUDA
  imports directly; observations arrive as `torch.cuda` tensors with no
  GPU→CPU→GPU bounce.
- **GPU-resident pipeline.** MJWarp steps physics on the GPU; only a tiny
  transform array crosses to the host. Pixels never leave the GPU.
- **Photoreal.** Full PBR metalness/roughness, IBL, soft shadows, SSAO, MSAA,
  and filmic tone mapping. Renders complete GLB environments that MJWarp and
  vanilla MuJoCo cannot load.
- **Two backends.** An OpenGL single-sync path and a Vulkan shared-device path,
  selectable at runtime.
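
To put "only a tiny transform array crosses to the host" in perspective, the back-of-envelope sketch below compares per-step host traffic with the size of the rendered batch. It assumes float32 poses (3 position + 9 rotation values per geom, matching `geom_xpos` / `geom_xmat`), RGBA `uint8` frames, and a hypothetical scene with 50 geoms; these are illustrative figures, not measurements.

```python
# Back-of-envelope comparison; the geom count and dtypes are illustrative assumptions.
NWORLD = 32
NGEOM = 50                # hypothetical scene size
FLOATS_PER_GEOM = 3 + 9   # position (3) + flattened rotation matrix (9)
BYTES_PER_FLOAT = 4       # float32 assumed

WIDTH = HEIGHT = 256
CHANNELS = 4              # RGBA, one uint8 per channel

# Data that crosses to the host each step: one pose per geom per world.
transform_bytes = NWORLD * NGEOM * FLOATS_PER_GEOM * BYTES_PER_FLOAT
# Data that stays on the GPU: the rendered observation batch.
pixel_bytes = NWORLD * WIDTH * HEIGHT * CHANNELS

print(f"transforms to host: {transform_bytes / 1024:.0f} KiB")
print(f"pixels kept on GPU: {pixel_bytes / 2**20:.0f} MiB")
print(f"ratio: {pixel_bytes / transform_bytes:.0f}x")
```

Under these assumptions the pose data is roughly two orders of magnitude smaller than the frames it produces.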

## Performance (RTX 4060 Laptop, 8 GiB)

All numbers are env-steps/s (= cameras/s), MJWarp GPU physics → `torch.cuda`.

**vs vanilla MuJoCo, same scene, same workload** (ours adds PBR + zero-copy):

| | 128px N=512 | 256px N=512 | 256px N=1024 |
|---|---|---|---|
| **mujofil-warp (GL)** | **10,675** | **9,949** | **10,628** |
| vanilla `mujoco.Renderer` | 8,394 | 4,808 | 5,021 |
| **speedup** | **1.27×** | **2.07×** | **2.12×** |

On equal work, `mujofil-warp` is **1.27–2.12×** faster than vanilla MuJoCo. The gap
widens at higher resolution because zero-copy avoids the CPU readback that scales
with pixel count.
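
The speedup row is the ratio of the two throughput rows. The short check below recomputes it from the table's figures and also lists the pixels rendered per batch, the quantity a CPU readback scales with. The dictionary keys are just labels for the table columns.

```python
# Throughput figures (env-steps/s) copied from the table above.
ours = {"128px N=512": 10675, "256px N=512": 9949, "256px N=1024": 10628}
vanilla = {"128px N=512": 8394, "256px N=512": 4808, "256px N=1024": 5021}

# (resolution, batch size) for each column, used to count pixels per batch.
shapes = {"128px N=512": (128, 512), "256px N=512": (256, 512), "256px N=1024": (256, 1024)}

speedup = {k: round(ours[k] / vanilla[k], 2) for k in ours}
pixels = {k: res * res * n for k, (res, n) in shapes.items()}

for k in ours:
    print(f"{k}: {speedup[k]:.2f}x speedup, {pixels[k]:,} pixels per batch")
```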

**Full photoreal warehouse** (3 GLB meshes + IBL + 16 spotlights + SSAO — geometry
vanilla MuJoCo and MJWarp cannot even load): **~3,200 cam/s** at 128px, holding flat
from N=64 to N=2048.

**GL vs Vulkan backend** (full warehouse): the GL single-sync path is **1.3×**
faster and, critically, its sync cost is **constant** across N (one `flushAndWait`),
whereas the Vulkan path's sync cost grows linearly with batch size.
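
One way to see why an in-flight cap produces linear sync cost is to count blocking waits in a toy queue model. The sketch below is not the renderer's code: `count_blocking_waits` is a hypothetical helper, and the model simply assumes that each frame submitted beyond the cap must wait for the oldest pending frame, plus one final wait to drain the queue. The cap of 2 corresponds to the Vulkan path's 2-frame in-flight limit described under Backends.

```python
from collections import deque


def count_blocking_waits(num_frames: int, max_in_flight: int) -> int:
    """Count host-side waits needed to render num_frames under an in-flight cap."""
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")
    in_flight = deque()
    waits = 0
    for frame in range(num_frames):
        if len(in_flight) == max_in_flight:
            in_flight.popleft()   # block until the oldest frame completes
            waits += 1
        in_flight.append(frame)
    if in_flight:
        waits += 1                # one final wait drains whatever is still pending
    return waits


for n in (64, 512, 2048):
    single_sync = count_blocking_waits(n, max_in_flight=n)  # whole batch in one bracket
    capped = count_blocking_waits(n, max_in_flight=2)       # 2-frame cap
    print(f"N={n}: single-sync waits={single_sync}, capped waits={capped}")
```

In this model the single-sync count stays at 1 for every N, while the capped count is N - 1.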

**vs MJWarp's own raycaster:** MJWarp scales to ~42,000 cam/s at N=2048 — but that
is **flat Lambertian on bare objects** (no PBR/IBL, no GLB environments). At small
N (≤32) `mujofil-warp` is faster *and* photoreal; at large N MJWarp wins raw
throughput by trading away all visual fidelity. Different categories: MJWarp is a
parallel raycaster, this is a photoreal rasterizer.

## Quickstart

```python
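import mujoco
import mujoco_warp as mjw
import torch
import warp as wp
from mujofil_warp import WarpRenderer

NWORLD = 32

# Load the model on the host, then create the GPU-resident model and batched data.
mjm = mujoco.MjModel.from_xml_path("scene.xml")
m = mjw.put_model(mjm)
d = mjw.make_data(mjm, nworld=NWORLD)
# One host-side MjData per world; these receive the geom poses for rendering.
host = [mujoco.MjData(mjm) for _ in range(NWORLD)]

r = WarpRenderer(width=256, height=256, batch_size=NWORLD, preset="high")
r.load_model(mjm)

# Step physics on the GPU and wait for the kernels to finish.
mjw.step(m, d)
wp.synchronize()

# Copy only the geom transforms to the host.
gx = d.geom_xpos.numpy()
gm = d.geom_xmat.numpy().reshape(NWORLD, mjm.ngeom, 9)
for i, h in enumerate(host):
    h.geom_xpos[:] = gx[i]
    h.geom_xmat[:] = gm[i]

obs = r.render_batch(mjm, host, cam_id=0)   # (32, 256, 256, 4) uint8 torch.cuda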
```

See [examples/minimal_render.py](examples/minimal_render.py) for a runnable demo.

## Quality toggles

Every fidelity feature is an independent toggle so you can reproduce the
throughput/fidelity trade-offs in `benchmarks/` on your own hardware:

```python
from mujofil_warp import WarpRenderer, make_config

# keyword toggles
r = WarpRenderer(width=256, batch_size=32, ssao=False, shadows=True, msaa=True)

# or a named preset, optionally overriding individual toggles
r = WarpRenderer(width=256, batch_size=32, preset="fast")          # SSAO off, ~2x
r = WarpRenderer(width=256, batch_size=32, preset="high", bloom=True)

# or an explicit config
cfg = make_config(width=256, height=256, batch_size=32, exposure=1.6)
r = WarpRenderer(config=cfg)
```

| Toggle | Effect | Notes |
|---|---|---|
| `ssao` | screen-space ambient occlusion | **biggest cost — ~2× faster when off** |
| `ssao_quality` | SSAO quality `low`/`medium`/`high`/`ultra` | affects look more than speed |
| `ssao_ssct` | SSAO cone tracing (contact shadows) | small extra cost on top of SSAO |
| `shadows` | soft shadow maps | |
| `msaa` / `msaa_samples` | multi-sample AA | 2 / 4 / 8 |
| `bloom` | HDR bloom | off by default |
| `fxaa` | fast approximate AA | alternative to MSAA |
| `exposure` | linear exposure | before tone mapping |
| `tone_mapping` | FILMIC vs LINEAR | |
| `dithering` | temporal dithering | reduces banding |

**Presets:** `high` (photoreal, default), `medium` (high-quality SSAO, no cone
tracing), `fast` (SSAO off, ~2× faster), `ultra` (8× MSAA + bloom), `raw` (no
AO/shadows/AA, ~3× faster).
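
The "preset plus per-toggle override" pattern can be pictured as a layered dictionary merge. The sketch below is illustrative only: `BASE`, `PRESET_OVERRIDES`, and `resolve_toggles` are hypothetical names, and the default values are assumptions inferred from the preset descriptions above rather than the package's actual tables.

```python
# Illustrative only: the real preset tables live inside mujofil_warp and may differ.
BASE = {
    "ssao": True,
    "ssao_ssct": True,
    "shadows": True,
    "msaa": True,
    "fxaa": False,
    "bloom": False,   # documented as off by default
}

PRESET_OVERRIDES = {
    "high": {},                                    # baseline
    "medium": {"ssao_ssct": False},                # no cone tracing
    "fast": {"ssao": False},                       # SSAO off
    "ultra": {"msaa_samples": 8, "bloom": True},   # 8x MSAA + bloom
    "raw": {"ssao": False, "ssao_ssct": False, "shadows": False,
            "msaa": False, "fxaa": False},         # no AO / shadows / AA
}


def resolve_toggles(preset="high", **overrides):
    """Merge base defaults, then the preset, then explicit keyword overrides."""
    if preset not in PRESET_OVERRIDES:
        raise ValueError(f"unknown preset: {preset!r}")
    return {**BASE, **PRESET_OVERRIDES[preset], **overrides}


print(resolve_toggles("fast"))
print(resolve_toggles("high", bloom=True))
```

Later layers win, which is why an explicit keyword such as `bloom=True` can override whatever the named preset chose.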

## Backends

Select at runtime with `MUJOFIL_WARP_BACKEND`:

- **`gl`** (default): OpenGL single-sync. Renders N worlds into N imported GL
  textures bracketed by one `flushAndWait`, then exports via GL↔CUDA interop. Sync
  cost is constant in N; **fastest in the warehouse.** Runs headless on surfaceless
  EGL (no X display needed) and falls back to Vulkan only if the GL module fails
  to initialize.
- **`vulkan`**: shared Vulkan device + exportable swapchain + CUDA external-memory
  import. Also fully headless (no X), but the 2-frame in-flight cap makes its sync
  cost grow with batch size.

```bash
# default is gl; force a backend explicitly with the env var:
MUJOFIL_WARP_BACKEND=gl     python examples/minimal_render.py --preset high
MUJOFIL_WARP_BACKEND=vulkan python examples/minimal_render.py --preset high
```
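
The selection rule described above (environment variable, `gl` default, Vulkan fallback) could be expressed roughly as follows. This is a sketch, not the package's code: `choose_backend` is a hypothetical helper, and the case-insensitive matching and the error on unknown values are assumptions.

```python
import os

VALID_BACKENDS = ("gl", "vulkan")


def choose_backend(env=None, gl_init_ok=True):
    """Pick a backend name from MUJOFIL_WARP_BACKEND, defaulting to 'gl'."""
    env = os.environ if env is None else env
    requested = env.get("MUJOFIL_WARP_BACKEND", "gl").strip().lower()
    if requested not in VALID_BACKENDS:
        # Assumption: unknown values are rejected rather than silently ignored.
        raise ValueError(f"unknown backend {requested!r}; expected one of {VALID_BACKENDS}")
    if requested == "gl" and not gl_init_ok:
        return "vulkan"   # GL could not initialize, so fall back
    return requested


print(choose_backend({}))                                    # no variable set
print(choose_backend({"MUJOFIL_WARP_BACKEND": "vulkan"}))    # explicit choice
print(choose_backend({}, gl_init_ok=False))                  # GL failed to initialize
```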

## Installation

```bash
pip install mujofil-warp
```

The wheel is **self-contained**: Filament and the CUDA runtime are statically
baked in, the compiled materials ship inside it, and `libc++` is bundled. There
is **no CUDA toolkit, no Filament, and no `mujofil` to install** — the only hard
requirement at runtime is an **NVIDIA GPU + driver**.

### Supported environments

Because the package contains **no CUDA device code** (only host-side runtime
calls), a single wheel is portable across GPUs and driver versions:

| Dimension | Support |
|---|---|
| GPU | Any NVIDIA GPU (Turing / Ampere / Ada / Hopper / …) — no compute-capability lock-in |
| Driver / CUDA | NVIDIA driver **≥ R525** (CUDA 12.0+). One wheel, all newer drivers |
| OS | Linux **x86_64**, glibc ≥ 2.34 (Ubuntu 22.04+, Debian 12+, RHEL/Alma/Rocky 9+, Fedora 35+) |
| Python | CPython 3.10 – 3.13 |

Not yet supported: aarch64 (Jetson/Grace), glibc < 2.34 (Ubuntu 20.04 / RHEL 8),
non-NVIDIA GPUs. These need a from-source Filament build (planned).
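
A quick pre-install check against the table above can be written with the standard library alone. The helper below, `unsupported_reasons`, is a hypothetical sketch: it covers only OS, architecture, glibc, and Python version, and does not inspect the GPU or the NVIDIA driver.

```python
import platform
import sys


def parse_version(text):
    """Turn a string such as '2.35' into a comparable tuple (2, 35)."""
    return tuple(int(part) for part in text.split(".")[:2])


def unsupported_reasons(system, machine, libc_name, libc_version, python_version):
    """Return a list of reasons this platform falls outside the support table."""
    reasons = []
    if system != "Linux":
        reasons.append(f"OS is {system}, need Linux")
    if machine != "x86_64":
        reasons.append(f"architecture is {machine}, need x86_64")
    if libc_name != "glibc" or parse_version(libc_version) < (2, 34):
        reasons.append(f"libc is {libc_name or 'unknown'} {libc_version}, need glibc >= 2.34")
    if not (3, 10) <= tuple(python_version[:2]) <= (3, 13):
        reasons.append("Python must be 3.10 to 3.13")
    return reasons


# Check the interpreter this script is running under.
libc_name, libc_version = platform.libc_ver()
problems = unsupported_reasons(
    platform.system(), platform.machine(), libc_name, libc_version, sys.version_info
)
print("supported" if not problems else "\n".join(problems))
```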

### Headless / display

Both backends are **fully headless** — no X server, no display, nothing extra to
install beyond the NVIDIA driver:

- **GL** (default) uses **surfaceless EGL**, so it renders headless at full speed
  on a bare GPU server (cloud, cluster, container). This is the recommended path
  for vision-RL training.
- **Vulkan** is also headless (shared device + exportable swapchain).

GL auto-falls back to Vulkan only if the GL module fails to initialize.

### Building from source

Most users never need this — `pip install mujofil-warp` ships prebuilt wheels.
Build from source only to hack on the C++ or target an unsupported environment.

**Prerequisites** (the native modules and Filament are built with Clang + libc++):

| Tool | Debian/Ubuntu | RHEL/Fedora/Alma |
|---|---|---|
| Clang + libc++ dev | `clang libc++-dev libc++abi-dev` | `clang` + libc++ (LLVM release) |
| CUDA toolkit (headers + static cudart) | `nvidia-cuda-toolkit` | `cuda-cudart-devel-12-x cuda-driver-devel-12-x` |
| EGL / GL dev headers | `libegl1-mesa-dev libgl1-mesa-dev` | `mesa-libEGL-devel mesa-libGL-devel` |
| Build tools (source-built Filament only) | `git cmake ninja-build` | `git cmake ninja-build` |

Then:

```bash
git clone https://github.com/tau-intelligence/mujofil-warp
cd mujofil-warp
CC=clang CXX=clang++ pip install .
```

**How Filament is resolved** (the GL backend's headless EGL rendering needs a
**custom EGL-enabled Filament** — Google's prebuilt Linux Filament is GLX-only).
`CMakeLists.txt` tries, in order:

1. **`FILAMENT_DIR=/path/to/egl-filament`** if you set it — used as-is (fastest).
2. **Download** a prebuilt EGL Filament artifact (seconds). The default path.
3. **Build from source** via `packaging/build_filament_egl.sh` (~20–30 min) if
   the download is unavailable — this is the step that needs git/cmake/ninja.

So a plain `pip install .` is **one command**; supply `FILAMENT_DIR` to skip the
download/build entirely:

```bash
CC=clang CXX=clang++ FILAMENT_DIR=/path/to/egl-filament pip install .
```
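
The three-step resolution order can be summarized as a small decision function. The real logic lives in `CMakeLists.txt`; the Python below is only a restatement of the order described above, and `resolve_filament` and its return labels are hypothetical.

```python
def resolve_filament(filament_dir, download_available):
    """Return (strategy, detail) following the documented resolution order."""
    if filament_dir:
        # 1. An explicit FILAMENT_DIR is used as-is.
        return ("use-existing", filament_dir)
    if download_available:
        # 2. Otherwise fetch the prebuilt EGL Filament artifact.
        return ("download-prebuilt", None)
    # 3. Last resort: build from source (needs git, cmake, ninja).
    return ("build-from-source", "packaging/build_filament_egl.sh")


print(resolve_filament("/path/to/egl-filament", download_available=True))
print(resolve_filament(None, download_available=True))
print(resolve_filament(None, download_available=False))
```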

The EGL Filament artifact is reproducible from source:

```bash
packaging/build_filament_egl.sh ./_filament_egl   # clone + patch + build
```

### Dev rebuilds (no full reinstall)

For iterating on the C++ without a full `pip install`, the two helper scripts
build the modules in place (point `FILAMENT_DIR` at the EGL Filament build):

```bash
bash native/build_gl.sh   # OpenGL single-sync, headless EGL -> _mujofil_warp_gl
bash native/build.sh      # Vulkan zero-copy                  -> _mujofil_warp
```

## Architecture & porting

`mujofil-warp` is **one core with pluggable rendering backends**, so new platforms
are added as a backend — not a fork.

```
mujofil_warp/__init__.py     Python API, presets, backend selection   (shared)
native/render_module.cpp     pybind bindings, batching                (shared)
native/vendor/core/          scene / material / light bridge          (shared)
native/renderer_gl.cpp       Linux: surfaceless EGL  + CUDA interop   (backend)
native/renderer_warp.cpp     Linux: Vulkan device    + CUDA interop   (backend)
```

Everything platform-specific lives behind the `vf_mujoco::Renderer` interface
(context creation, GPU→tensor interop). Adding **macOS** or **Windows** means
adding one `renderer_*.{cpp,mm}` implementing that interface — the scene,
material, lighting, Python API, and batching layers are reused unchanged.

- **Windows** would use a WGL/EGL context + `OPAQUE_WIN32` external-memory handles
  for the CUDA interop.
- **macOS** is a different target: there is **no CUDA on Apple platforms**, so a
  Mac backend would use Filament's **Metal** backend and export to PyTorch via
  **MPS** (`MTLBuffer` → torch-MPS) rather than `torch.cuda`.

These are not yet implemented (they need the respective hardware to develop and
validate on), but the codebase is structured so they slot in without a fork.
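
The "one core, pluggable backends" shape can be illustrated with a small registry. The actual interface is the C++ `vf_mujoco::Renderer`; the Python analogue below is only a sketch of the pattern, and every name in it (`RendererBackend`, `create_context`, `export_frames`, `register`, `NullBackend`) is hypothetical.

```python
from abc import ABC, abstractmethod

BACKENDS = {}


def register(name):
    """Class decorator that adds a backend to the registry under `name`."""
    def decorator(cls):
        BACKENDS[name] = cls
        return cls
    return decorator


class RendererBackend(ABC):
    """Platform-specific pieces: context creation and GPU-to-tensor export."""

    @abstractmethod
    def create_context(self):
        """Set up the graphics context for this platform."""

    @abstractmethod
    def export_frames(self, batch_size):
        """Hand back one frame handle per world."""


@register("null")
class NullBackend(RendererBackend):
    """Stand-in backend that produces placeholder frame handles."""

    def create_context(self):
        return "null-context"

    def export_frames(self, batch_size):
        return [f"frame-{i}" for i in range(batch_size)]


# Shared code only ever talks to the abstract interface.
backend = BACKENDS["null"]()
backend.create_context()
print(backend.export_frames(4))
```

A new platform would register one more class; the shared code that looks up `BACKENDS[...]` would not change.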

## Layout

```
mujofil_warp/        Python package (WarpRenderer, make_config, presets)
native/              C++ renderer + pybind module + build scripts
  renderer_gl.cpp      OpenGL single-sync zero-copy backend
  renderer_warp.cpp    Vulkan shared-device zero-copy backend
  render_module.cpp    pybind bindings (shared by both backends)
examples/            runnable demos
benchmarks/          the benchmark suite behind the numbers above
spikes/              isolated feasibility proofs (GL↔CUDA, Vulkan↔CUDA, DLPack)
docs/ARCHITECTURE.md design + phased integration plan
```

## Relationship to `mujofil`

`mujofil-warp` reuses the CPU-MuJoCo `mujofil` renderer's scene/material/light
source but is a **separate build** — the published `mujofil` package is untouched.
Use `mujofil` for high-fidelity CPU-MuJoCo vector-env rendering; use
`mujofil-warp` when you want MJWarp's GPU-resident physics with photoreal,
zero-copy observations.

## License

Apache-2.0.
