# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

> The *Unreleased* section lists changes that have landed but are not yet released; they will ship in the next version.

## [0.6.0] - 2026-06-04

### Added

- **FP32 CUDA execution path.** The CUDA biquad / SOS parallel-scan kernels are
  now templated on `scalar_t` and dispatch on the input dtype: a float32 input
  runs the native FP32 path instead of being silently upcast to float64. On a
  consumer GPU (RTX 3070, 1:32 FP32:FP64) this is a **3.0–3.6× throughput win**
  for IIR cascades and resolves the multichannel inversion, in which the GPU
  path was slower than the CPU: a 60 s / 8-channel 8th-order Butterworth runs
  in ~24 ms (FP32) vs ~89 ms (FP64), now *beating* the ~28 ms CPU path it
  previously lost to. float64 inputs still run FP64 for maximum precision; pass
  the dtype you want. See `benchmarks/bench_fp32_speedup.py`.
- **`tests/test_fp32_precision.py`** — validates the FP64 path against
  `scipy.signal.sosfilt` to double precision and the FP32 path against the same
  reference within a documented float32 bound (`max_abs` + RMS-relative), across
  Butterworth / Chebyshev1 at orders 2/4/8/16 on CPU and CUDA, recording
  per-design errors so the FP32-safe vs needs-FP64 boundary is tracked.
- **Explicit CUDA architectures.** The build now compiles native SASS for
  `sm_75;80;86;89` (Turing…Ada) by default — override with the
  `TORCHFX_CUDA_ARCHITECTURES` env var. Eliminates the first-CUDA-call PTX→SASS
  JIT stall and pins occupancy instead of leaving it to the toolkit default.
- **Tunable, dtype-aware dispatch threshold.** `PARALLEL_SCAN_THRESHOLD` (the
  sequential-vs-parallel-scan boundary) is now threaded into the kernels — it was
  previously dead in Python while the kernel hard-coded 2048. The default is
  dtype-aware (`float32` → 2048, `float64` → 1024): a crossover sweep
  (`benchmarks/bench_threshold_sweep.py`) showed the FP64 sequential kernel hits
  the parallel scan's overhead sooner, so a single 2048 left FP64 ~57% slower at
  T≈2048. `parallel_iir_forward` / `biquad_forward` accept a `threshold=` override
  (`0` forces parallel scan, a large value forces sequential).
- **Static-gain folding in the fusion planner.** A constant linear `Gain`
  (`clamp=False`) between SOS filters is now folded into the fused cascade's
  numerator instead of breaking the fused run, so `wave | IIR | Gain | IIR`
  materialises as a single `FusedSOSCascade` rather than three stages. The fold is
  exact (a scalar commutes through a linear filter); a clamping `Gain` or a dynamic
  `Normalize` stays its own stage. The planner is exposed as the testable
  `Wave._build_plan`.
- **`torchfx.realtime.CudaGraphRunner`** — captures a fixed-shape fused-cascade GPU
  forward into a `torch.cuda.CUDAGraph` and replays it per chunk, collapsing the
  `~4·K` per-section kernel launches into one graph launch. The win is largest in
  the short-chunk / realtime regime where launch overhead dominates: **4.0× faster
  at 128-sample chunks** (52 vs 210 µs), 2.6× at 512, 1.8× at 1024 on an RTX 3070.
  Streaming DF1 state is carried across replays; `reset_state()` restarts a stream.

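The reason the gain fold is exact can be checked numerically. The sketch below is a pure-Python illustration rather than the planner itself: `biquad_df1`, `fold_gain`, and the coefficient values are hypothetical, chosen only for the example. It scales the first section's numerator by the gain and compares the result against running filter, gain, and filter as three separate stages.

```python
def biquad_df1(x, sos):
    """Direct-form-I biquad; sos = (b0, b1, b2, a0, a1, a2) with a0 == 1."""
    b0, b1, b2, _a0, a1, a2 = sos
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn)
    return out


def fold_gain(sos, gain):
    """Fold a constant linear gain into a section's numerator (b0, b1, b2)."""
    b0, b1, b2, a0, a1, a2 = sos
    return (gain * b0, gain * b1, gain * b2, a0, a1, a2)


# Two arbitrary stable sections and a constant gain (illustrative values).
sec_a = (0.2, 0.4, 0.2, 1.0, -0.5, 0.3)
sec_b = (0.5, -0.1, 0.05, 1.0, 0.2, 0.1)
gain = 0.5
signal = [1.0, 0.5, -0.25, 0.0, 0.75, -1.0, 0.3, 0.0]

# Unfused: filter, then gain, then filter (three stages).
staged = biquad_df1([gain * v for v in biquad_df1(signal, sec_a)], sec_b)

# Fused: the gain lives in the first section's numerator (one cascade).
fused = biquad_df1(biquad_df1(signal, fold_gain(sec_a, gain)), sec_b)

max_diff = max(abs(s - f) for s, f in zip(staged, fused))
```

`max_diff` stays at floating-point round-off, which is why the separate `Gain` stage can be dropped. A clamping gain is nonlinear and would not commute through the filter this way.
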
- **Realtime architecture: producer/consumer split.** `RealtimeProcessor`
  now runs DSP in a dedicated worker thread (`torchfx-realtime-dsp`),
  not inside the PortAudio callback. The audio callback only moves
  samples between the backend buffers and a pair of `TensorRingBuffer`s
  (input + output ring); the worker thread reads input chunks, runs the
  effect chain, and writes results into the output ring. The output
  ring is primed with one chunk of silence at start so the first
  callback never underflows.
- **Realtime instrumentation.** New properties / methods on
  `RealtimeProcessor`:
  - `latency_log_ns()` — snapshot of per-callback wall-clock durations.
  - `latency_stats_ms()` — `count`/`min`/`median`/`mean`/`p95`/`p99`/`max`.
  - `xrun_count`, `input_overflow_count`, `output_underflow_count`,
    `backend_xrun_count`, `callback_count` — granular xrun attribution.
  - `deadline_ms`, `reset_metrics()`.
- **`BackendStatus` dataclass** (`torchfx.realtime.backend`) — carries
  PortAudio's `paInputOverflow` / `paOutputUnderflow` etc. flags from
  the backend into `RealtimeProcessor` so xruns are counted at their
  source. `AudioCallback` accepts an optional 4th `status` argument;
  `SoundDeviceBackend` populates it from sounddevice's `CallbackFlags`.
- **`RealtimeProcessor.process_pending()`** — public method that drains
  the input ring synchronously. Used by `start_worker=False` callers
  (deterministic tests, host-scheduled embeddings).
- **`start_worker=True/False` constructor flag** on `RealtimeProcessor`
  to opt out of the worker thread.
- **`benchmarks/test_realtime_bench.py`** — deterministic realtime
  benchmarks driving `RealtimeProcessor` via a `DeterministicMockBackend`.
  Reports per-callback round-trip wall time and p50/p95/p99 latency vs
  buffer size and cascade depth, plus a `budget_p99 = p99 / deadline`
  ratio in `extra_info` for downstream plotting.
- **Realtime chunk-length validation.** The DSP worker raises
  `RealtimeError` if any effect changes the time dimension of a chunk,
  mirroring `StreamProcessor`'s check. Prevents `Delay`-style effects
  from corrupting the output ring.
- Tests for the new architecture (`tests/test_realtime.py`): latency
  log, xrun counters under synthetic backend status, chunk-length
  rejection in the realtime callback, worker-thread drainage.
- **IS² 2026 benchmark infrastructure** (under `IS22026/`,
  `benchmarks/`, and `tools/`):
  - `benchmarks/test_torchaudio_bench.py` — head-to-head comparators
    against `torchaudio.functional.biquad`, looped `biquad` cascades,
    and `torchaudio.functional.fftconvolve`. On Linux x86_64 / Python
    3.10 / torch 2.10.0+cu128 (local CPU run), TorchFX is **2.82×
    faster** than torchaudio on single biquad and **5.7×–8.9×** faster
    on order-4 / 8 / 16 IIR cascades thanks to fusion; FIR FFT is
    **2.06×** faster. The benchmarks skip cleanly if torchaudio is
    not installed.
  - `tools/count_kernel_launches.py` — counts native-extension
    dispatches (and CUDA `launchKernel` events when CUDA is available)
    for fused vs unfused IIR cascades at depths {2, 5, 10, 20, 50}.
    Validates claim C1 of the IS² 2026 plan: a depth-K unfused chain
    issues K Python-level dispatches; the fused path collapses these
    to 1, with a measured 2.6×–2.8× wall-time speedup at depth 20–50.
  - `tools/aggregate_benchmarks.py` — flattens one or more
    pytest-benchmark JSON outputs into per-row p50 / p95 / p99 /
    p99.9 / max / IQR; emits `table`, `csv`, or `json`. Propagates
    each benchmark's `extra_info` so realtime jitter and budget
    fractions survive aggregation. Non-pytest-benchmark JSON inputs
    are skipped with a warning.
  - `IS22026/Makefile` — single-command runner. `make all` writes
    JSON per benchmark family into `IS22026/results/` and aggregates
    into `summary.csv` + `summary.txt`. Targets: `bench-cpu`,
    `bench-cuda`, `bench-comparators`, `bench-realtime`, `launches`,
    `aggregate`. Outputs are host-tagged (`*-<hostname>.json`) so
    multi-machine results coexist.
- **`torchaudio>=2.6.0,<2.11`** added to the `dev` dependency group as
  an optional comparator. Not a runtime dependency. The upper bound is
  because torchaudio 2.11 requires CUDA 13 runtime libraries which
  break on systems whose torch is built against CUDA 12.x.
- `tests/test_wave.py::test_filter_coefficients_recompute_on_fs_change`
  — regression test locking in the existing correct behaviour that a
  filter object piped through two Waves with different sample rates
  recomputes its SOS coefficients (the `_has_computed_coeff` lazy
  guard does not skip when `fs` changes).
- **Energy + power measurement harness** (`tools/energy_meter.py`) —
  context manager and CLI wrapper that records Intel RAPL CPU package
  energy (via `/sys/class/powercap/intel-rapl:*/energy_uj`) and
  GPU power-over-time (via streaming `nvidia-smi -lms`, integrated by
  trapezoidal rule). Wraps any subprocess and emits a JSON with
  `cpu_joules`, `gpu_joules`, `duration_s`, `gpu_mean_w`, `gpu_peak_w`,
  and notes about any RAPL wraps. Degrades gracefully on systems
  without RAPL access or `nvidia-smi`. Required for table T10 of the
  IS² 2026 paper.
- **`PARALLEL_SCAN_THRESHOLD` ablation tool** (`tools/threshold_sweep.py`)
  — sweeps signal length `T` densely around the current 2048-sample
  threshold (256 → 32 768 by default) at several channel counts, times
  each point with CUDA sync, and emits per-(T, channels) median /
  p95 / min. Reads the empirical crossover off the curve so the paper
  can either justify 2048 or replace it with a measured value, per
  C-7 of the IS² 2026 plan.
- **Paper figure pipeline** (`IS22026/figs/plot.py` + matplotlib in
  the dev dep group): single CLI with F4 (kernel-launch reduction),
  F6 (TorchFX vs torchaudio pipeline), F7 (threshold crossover), F9
  (realtime latency CCDF), F11 (Pi 5 RTF) subcommands. Uses the Wong
  colour-blind-safe palette and emits PDF + companion PNG. Consumes
  the JSON written by the bench / tool targets directly.
- **Makefile extensions** in `IS22026/Makefile`: new targets
  `launches-cuda`, `threshold-cuda`, `figs` (PHONY: regenerates all
  available figures from `results/`), `paper-data` (end-to-end:
  benches → aggregate → figs), and `ENERGY=1` flag that wraps any
  target in the energy meter.
- **SLURM driver for the L40S** (`IS22026/slurm/is2_l40s.sbatch`):
  one-submission job that runs `bench-cuda`, `bench-comparators`,
  `launches-cuda`, `threshold-cuda`, and `aggregate` on the cluster.
- **`IS22026/RUN.md`** — copy-paste handoff for running the
  Alienware (i9 + RTX 3070) and L40S batches from a less-capable
  development box, plus rsync recipes for pulling JSON results back.

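The energy arithmetic behind the measurement harness listed above can be sketched in a few lines. This is a minimal illustration under assumptions, not the code in `tools/energy_meter.py`: the function names, the `(timestamp_s, watts)` sample format, and the single-wrap rule are hypothetical. The wrap modulus is taken as an argument; on Linux it would typically come from the `max_energy_range_uj` file next to `energy_uj`.

```python
def integrate_power(samples):
    """Trapezoidal integral of (timestamp_s, watts) samples, in joules."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += 0.5 * (p0 + p1) * (t1 - t0)
    return joules


def rapl_delta_joules(start_uj, end_uj, max_range_uj):
    """Energy between two RAPL counter reads, allowing for at most one wrap."""
    delta = end_uj - start_uj
    wrapped = delta < 0
    if wrapped:
        delta += max_range_uj          # assumed wrap modulus
    return delta / 1e6, wrapped


# Illustrative GPU power samples taken every 100 ms.
gpu_samples = [(0.0, 50.0), (0.1, 70.0), (0.2, 90.0), (0.3, 90.0)]
gpu_joules = integrate_power(gpu_samples)
gpu_mean_w = gpu_joules / (gpu_samples[-1][0] - gpu_samples[0][0])
gpu_peak_w = max(p for _, p in gpu_samples)

# Illustrative counter reads where the counter wrapped once in between.
cpu_joules, wrapped = rapl_delta_joules(900_000, 100_000, 1_000_000)
```

A run long enough for the counter to wrap more than once cannot be recovered from two reads alone, which is one reason to record wraps in the output notes.
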
### Changed

- **CUDA SOS cascade is allocation-free per forward.** `sos_forward_cuda` now
  updates DF1 state in place and reuses one set of scratch buffers (forcing,
  ping-pong output, block-aggregate) across all cascade sections instead of
  allocating per section. Cuts the GPU streaming SOS path ~27% (e.g. a 4-section
  cascade at 1024-sample chunks: ~276 → ~202 µs/chunk on an RTX 3070).
- **`RealtimeProcessor` no longer processes effects in the audio
  callback.** This is a behavioural change: with the worker-thread
  architecture, processed output appears on the *next* callback after
  its input was pushed. The end-to-end latency therefore grows by one
  buffer (now reported correctly by `latency_ms`, which includes the
  priming chunk).
- **`AudioCallback` type alias** broadened to
  `Callable[..., None]`. New backends pass `(input, output, frames,
  status)`; legacy 3-arg callbacks are still supported by
  `SoundDeviceBackend` via a one-time signature inspection.

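The one-buffer latency follows directly from the priming chunk. The sketch below models the two rings with `collections.deque` and drains the input synchronously, in the spirit of `start_worker=False`. The class `TinyRealtime` and its methods are hypothetical stand-ins, not the `RealtimeProcessor` API.

```python
from collections import deque


class TinyRealtime:
    """Toy model: the callback only moves chunks; DSP runs outside it."""

    def __init__(self, effect, chunk_len):
        self.effect = effect
        self.chunk_len = chunk_len
        self.in_ring = deque()
        # Prime the output ring with one chunk of silence so the first
        # callback always has something to play.
        self.out_ring = deque([[0.0] * chunk_len])
        self.underflows = 0

    def callback(self, in_chunk):
        """Audio-thread side: push input, pop whatever output is ready."""
        self.in_ring.append(list(in_chunk))
        if self.out_ring:
            return self.out_ring.popleft()
        self.underflows += 1
        return [0.0] * self.chunk_len

    def process_pending(self):
        """Worker side: drain the input ring through the effect."""
        while self.in_ring:
            self.out_ring.append(self.effect(self.in_ring.popleft()))


rt = TinyRealtime(effect=lambda c: [2.0 * v for v in c], chunk_len=2)
played = []
for chunk in ([1.0, 1.0], [2.0, 2.0], [3.0, 3.0]):
    played.append(rt.callback(chunk))   # what the device hears this callback
    rt.process_pending()                # worker catches up before the next one
```

In `played`, the first entry is the priming silence and each processed chunk appears one callback after its input was pushed.
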
### Fixed

- **CUDA kernels now launch on the current stream.** The hand-written kernels
  launched with `<<<grid, block>>>` (the default stream); they now pass
  `c10::cuda::getCurrentCUDAStream()`. This is what lets `torch.cuda.graph` capture
  record them (capture runs on a side stream and ignores default-stream work — so a
  captured cascade previously replayed as a no-op), and it is also more correct in
  eager mode (work runs on PyTorch's stream rather than relying on implicit
  default-stream synchronisation).
- The previously unused `TensorRingBuffer` instances inside
  `RealtimeProcessor` are now the actual I/O path. The "lock-free
  SPSC" claim in the docstring now reflects how the buffers are used:
  one writer (callback) and one reader (worker), coordinated by an
  `Event`.
- **Stale-coefficient bug on sample-rate change.** `IIR` and `Biquad`
  filters now track the `fs` their coefficients were designed for
  (`_coeff_fs`) and recompute on the *direct* `forward()` path when
  `fs` changes — previously a filter reused across sample rates would
  silently keep applying coefficients designed for the old rate. The
  fs change also clears accumulated DF1 state, and `Wave.__or__`'s
  eager-compute path records `_coeff_fs` so it does not redundantly
  recompute on the first forward.
- **State leak across `Wave` reuse.** `Wave._materialize()` now resets
  the state of stateful modules before running the offline pipeline, so
  piping one filter instance into two different `Wave`s no longer bleeds
  DF1 state from the first into the second. Streaming
  (`StreamProcessor` / `RealtimeProcessor`) is unaffected — it drives
  effects directly and manages its own chunk-to-chunk state.
- **Half-precision inputs are now rejected** by the native filter
  dispatch (`_select_native_dtype`) with a `TypeError` instead of being
  silently upcast to `float64`. The IIR feedback recurrence is not
  numerically safe in `float16` / `bfloat16`; callers should cast to
  `float32` or `float64` explicitly.

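The stale-coefficient fix above amounts to remembering which sample rate the coefficients were designed for. A minimal sketch of that guard, using a hypothetical one-pole lowpass (`TinyLowpass` and `_ensure_coefficients` are illustrative names; only the `_coeff_fs` attribute name is taken from the entry above):

```python
import math


class TinyLowpass:
    """One-pole lowpass whose coefficient depends on the sample rate."""

    def __init__(self, cutoff_hz):
        self.cutoff_hz = cutoff_hz
        self.fs = None
        self._coeff_fs = None   # sample rate the coefficient was designed for
        self._alpha = None
        self._state = 0.0
        self.design_count = 0

    def _ensure_coefficients(self):
        if self.fs is None:
            raise ValueError("fs must be set before filtering")
        if self._coeff_fs == self.fs:
            return                      # coefficients are current
        self._alpha = 1.0 - math.exp(-2.0 * math.pi * self.cutoff_hz / self.fs)
        self._coeff_fs = self.fs
        self._state = 0.0               # old state belongs to the old rate
        self.design_count += 1

    def forward(self, x):
        self._ensure_coefficients()
        out = []
        for xn in x:
            self._state += self._alpha * (xn - self._state)
            out.append(self._state)
        return out


f = TinyLowpass(cutoff_hz=1000.0)
f.fs = 44100
f.forward([1.0, 1.0])
alpha_44k = f._alpha
f.forward([1.0, 1.0])          # same fs: no redesign
f.fs = 96000
f.forward([1.0, 1.0])          # fs changed: redesign and state reset
alpha_96k = f._alpha
```

Comparing against the recorded rate, rather than checking only whether coefficients exist, is what catches reuse of one filter object across sample rates.
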
## [0.5.4] - 2026-05-26

### Removed

- **`scipy` runtime dependency.** The `scipy.signal.{butter, cheby1, cheby2,
  ellip, firwin}` calls used by Butterworth, Chebyshev I/II, Elliptic,
  Linkwitz-Riley, and `DesignableFIR` are gone — coefficient design is now
  performed natively in pure PyTorch (see Added). `scipy` remains a *dev/test*
  dependency (in `[dependency-groups.dev]`) for numerical-equivalence
  verification only; downstream installs no longer pull it in.
- `_set_coefficients` helper on `Biquad`: replaced by `_finalize_coeffs`, which
  takes raw RBJ-cookbook `(b0, b1, b2, a0, a1, a2)` and normalizes by `a0`
  centrally instead of having every subclass repeat the divide.
- `_gain_db` and `Delay._extend_waveform` private helpers: each was called from
  a single site, now inlined.

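Central normalization by `a0` can be sketched as follows. The lowpass formulas are the standard RBJ Audio EQ Cookbook ones; the free functions `finalize_coeffs` and `rbj_lowpass` are hypothetical stand-ins and do not reproduce the signature of `Biquad._finalize_coeffs`.

```python
import math


def finalize_coeffs(b0, b1, b2, a0, a1, a2):
    """Normalize raw biquad coefficients by a0; returns one SOS row."""
    inv = 1.0 / a0
    return (b0 * inv, b1 * inv, b2 * inv, 1.0, a1 * inv, a2 * inv)


def rbj_lowpass(cutoff_hz, q, fs):
    """RBJ cookbook lowpass, written in raw (un-normalized) form."""
    w0 = 2.0 * math.pi * cutoff_hz / fs
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / (2.0 * q)
    raw = (
        (1.0 - cos_w0) / 2.0,   # b0
        1.0 - cos_w0,           # b1
        (1.0 - cos_w0) / 2.0,   # b2
        1.0 + alpha,            # a0
        -2.0 * cos_w0,          # a1
        1.0 - alpha,            # a2
    )
    return finalize_coeffs(*raw)    # the divide happens in exactly one place


sos = rbj_lowpass(cutoff_hz=1000.0, q=0.7071, fs=48000.0)
dc_gain = sum(sos[:3]) / sum(sos[3:])   # H(z) evaluated at z = 1
```

Each filter type then only states its cookbook coefficients, and the unity `a0` that the kernels expect is guaranteed by the shared helper.
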
### Added

- **`torchfx.filter._design`**: pure-PyTorch filter-design module containing
  native equivalents of every scipy.signal design function the package used:
  `design_butterworth_sos`, `design_cheby1_sos`, `design_cheby2_sos`,
  `design_ellip_sos`, and `design_firwin` (multi-band capable). All return
  canonical CPU `float64` SOS / FIR tensors compatible with the existing
  C++/CUDA kernels. Implementation follows the standard analog-prototype +
  bilinear-transform pipeline; the elliptic design is a faithful port of
  scipy's algorithm (Orfanidis, "Lecture Notes on Elliptic Filter Design").
- **`tests/test_native_design.py`**: 509 numerical-equivalence tests covering
  the full parameter sweep (orders 1-16, lowpass/highpass, the major windows,
  Rp/Rs combinations). Tolerances are `1e-12` (Butterworth, FIR), `1e-9`
  (Chebyshev), `1e-7` (Elliptic up to N=8), `1e-5` (Elliptic N=12-16).
- **`benchmarks/test_design_benchmarks.py`**: paired side-by-side benchmarks
  comparing native design time to scipy across all five filter types and
  orders 2-16. Run with
  `uv run pytest benchmarks/test_design_benchmarks.py --benchmark-enable`.
  Native design is **14-50× faster than scipy** for Butterworth, Chebyshev I/II
  (e.g. N=16: 26-37 µs native vs 470-490 µs scipy) and **7-20× faster** for
  Elliptic (N=16: 92 µs native vs 670 µs scipy). The IIR design pipeline runs
  on Python `complex` instead of `torch` tensors because per-op tensor dispatch
  dominates over arithmetic for small (N≤16) problems; `firwin` stays in torch
  because windows are length-N (~31..1024) where vectorization wins. Numerical
  equivalence to scipy is preserved (rtol 1e-10/1e-9/1e-7 across IIR types).
- Regression coverage for runtime edge cases across
  `tests/test_wave.py`, `tests/test_effects.py`, `tests/test_fir.py`,
  `tests/test_realtime.py`, and `tests/test_cli.py` (Wave shape/channel
  invariants, Delay sample-rate recomputation, Reverb batched/dtype handling,
  deferred `DesignableFIR` initialization, stream chunk-length constraints,
  and CLI invalid-parameter error paths).

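The analog-prototype plus bilinear-transform pipeline mentioned above can be illustrated on plain Python `complex` values. This is a textbook sketch for even-order Butterworth lowpass only, not the code in `torchfx.filter._design`: the function names are hypothetical, and the section ordering and per-section gain placement are choices of this sketch that need not match scipy or TorchFX.

```python
import cmath
import math


def butter_lowpass_sos(order, wn):
    """Even-order Butterworth lowpass; wn is the cutoff as a fraction of Nyquist."""
    if order < 2 or order % 2:
        raise ValueError("this sketch handles even orders only")
    fs = 2.0                                         # normalized sample rate
    warped = 2.0 * fs * math.tan(math.pi * wn / fs)  # pre-warped analog cutoff
    sos = []
    for k in range(order // 2):
        # Upper-half-plane analog prototype pole; its conjugate is implied.
        theta = math.pi * (2 * k + order + 1) / (2 * order)
        p = warped * cmath.exp(1j * theta)
        z = (2.0 * fs + p) / (2.0 * fs - p)          # bilinear transform
        a1, a2 = -2.0 * z.real, abs(z) ** 2
        g = (1.0 + a1 + a2) / 4.0                    # unity DC gain per section
        sos.append((g, 2.0 * g, g, 1.0, a1, a2))     # double zero at z = -1
    return sos


def magnitude(sos, omega):
    """Magnitude response of an SOS cascade at omega radians per sample."""
    zinv = cmath.exp(-1j * omega)
    h = 1.0 + 0.0j
    for b0, b1, b2, a0, a1, a2 in sos:
        h *= (b0 + b1 * zinv + b2 * zinv**2) / (a0 + a1 * zinv + a2 * zinv**2)
    return abs(h)


sos = butter_lowpass_sos(order=8, wn=0.25)
gain_dc = magnitude(sos, 0.0)
gain_cutoff = magnitude(sos, math.pi * 0.25)
```

Two quick sanity checks for a design like this are unity gain at DC and a gain of 1/√2 (about -3 dB) at the cutoff, which pre-warping preserves exactly.
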
### Changed

- Six biquad subclasses (`BiquadLPF`, `BiquadHPF`, `BiquadNotch`, `BiquadBPF`,
  `BiquadBPFPeak`, `BiquadAllPass`) now express coefficients in raw RBJ form
  and call the shared `_finalize_coeffs` for normalization, removing ~50 lines
  of duplicated `a0_inv` plumbing. Same applies to `Shelving`, `ParametricEQ`,
  `Notch`, and `AllPass` in `iir.py`.
- `LinkwitzRiley.compute_coefficients`: `np.vstack` replaced with `torch.cat`,
  removing the only `numpy.*` math call from `src/torchfx/`. After this work,
  numpy is used only at the I/O boundary (`soundfile`, `sounddevice`), never
  for signal-processing math in core.
- `FIR.__init__`: switched coefficient construction to
  `torch.as_tensor(...).clone().flip(0)` to silence the
  `torch.tensor(sourceTensor)` warning when the new native designs hand a
  tensor straight in.
- `Wave` now enforces internal `(channels, samples)` shape invariants,
  normalizes 1D mono inputs to `(1, T)`, preserves metadata/device in
  transforms and channel extraction, and keeps merge behavior explicit:
  split-channel merges zero-pad to longest length while mix mode requires
  matching channel counts.
- `_ops.delay_line_forward` now validates rank/dtype, supports tensors with
  arbitrary leading batch dimensions by flattening/restoring `(..., T)`, and
  safely round-trips non-native float dtypes via the extension's supported
  float32/float64 execution path.
- Native IIR/biquad dispatch now uses dtype-aware execution in `_ops` and the
  SOS forward path: CPU routes non-float64 inputs through float32, while CUDA
  remains on float64 for the current kernel generation. CPU native kernels in
  `_csrc/cpu/iir_cpu.cpp` now run in float32 or float64 based on input dtype
  instead of unconditionally upcasting to float64.
- `DesignableFIR` now initializes `nn.Module` state even when `fs=None`, then
  updates kernel coefficients in `compute_coefficients()` once sample rate is
  available.
- `Delay` and `Gain` input validation now consistently raises `ValueError` for
  invalid user parameters. BPM-synced `Delay` also recomputes
  `delay_samples` when `fs` changes after initialization.
- Realtime processing now resets stateful effects at file/chunk boundaries and
  rejects effects that change chunk length in `StreamProcessor`.
- CLI processing now surfaces config/effect parse failures as user-facing
  messages (including invalid constructor parameters) instead of raw tracebacks.

### Fixed

- `Wave.get_channel()` no longer returns flattened 1D tensors that break
  downstream assumptions about channel count and `len()` semantics.
- Delayed/BPM-synced effects now stay sample-accurate when reused across
  different sample rates.
- Stream processing no longer leaks state from one file into the next, and
  now fails fast when an effect violates fixed-size chunk assumptions.

## [0.5.3] - 2026-05-03

### Changed

- **Build system migration**: replaced hatchling + runtime JIT compilation with
  scikit-build-core + CMake. The C++/CUDA extension is now compiled at install
  time (no more 10-30s first-import delay, no compiler requirement for end users
  installing from wheels).
- **Removed pure-PyTorch fallback**: the C++ extension is now required. The slow
  Python fallback paths for IIR filtering and delay have been removed.

### Added

- **CPU delay C++ kernel** (`delay_cpu.cpp`): `delay_line_forward` now dispatches
  to a C++ kernel on CPU (previously CUDA-only, with Python fallback for CPU).
- **CI/CD wheel pipeline** (`.github/workflows/wheels.yml`): automated CPU wheel
  builds via `cibuildwheel` for Python 3.10-3.14 across manylinux x86_64,
  macOS x86_64 + arm64, and Windows x86_64, with PyPI publishing on tag.
  PyTorch's shared libraries are excluded from the repaired wheels so the
  install stays small and links against the user's PyTorch build.
- **Doctest in CI**: `pytest --doctest-modules` now runs in the test job and
  `make doctest` runs in the docs job, exercising docstring and `.. doctest::`
  examples across the codebase.
- **CUDA wheels for Linux x86_64** (`.github/workflows/wheels-cuda.yml`):
  cu124 and cu128 wheels for Python 3.10-3.14, published as a PEP 503 index
  on GitHub Pages at `https://matteospanio.github.io/torchfx/wheels/cuXXX/`
  on every tagged release. Wheels carry a `+cuXXX` local-version segment
  and exclude PyTorch / CUDA shared libraries from the bundle.
- **Auto-release on version bump** (`.github/workflows/release.yml`):
  pushes to `master` are scanned for a change in `pyproject.toml`'s
  `version`. When a new version lands, the workflow tags the commit
  `vX.Y.Z` and dispatches `wheels.yml` + `wheels-cuda.yml` against the
  tag, so a merged version-bump PR automatically yields a PyPI release
  and refreshed CUDA wheel index without any manual `git tag`.

### Removed

- `setuptools` runtime dependency (was only needed for JIT compilation).
- `_biquad_df1_fallback` Python DF1 loop (replaced by C++ `iir_cpu.cpp`).
- JIT compilation logic in `_ops.py` (`_load_extension`, `torch.utils.cpp_extension.load`).

## [0.5.2] - 2026-04-13

### Added

- **`FilterChain`** (`chain.py`): auto-flattening `nn.Sequential` subclass so
  that `(f1 | f2) | f3` produces a flat `FilterChain(f1, f2, f3)` rather than
  nested containers. Exported from the top-level `torchfx` package.
- **`FX.__or__`**: the pipe operator now works directly between filters/effects
  (`f1 | f2` → `FilterChain`), fixing the long-standing Phase 2 doc/API bug
  where only `Wave.__or__` was defined.
- **Performance baseline** (`docs/source/perf/baseline.md`): Phase 0 of the
  optimization campaign — captures pre-optimization coverage, CPU benchmark
  numbers, and CPU `torch.profiler` findings so every follow-up PR has a
  concrete before/after to diff against.
- **SLURM harness for CUDA runs** (`benchmarks/slurm/`): `run_profiles.sbatch`
  + README covering the `sbatch → rsync → pytest-benchmark compare` cycle for
  CUDA benchmarks and profiles measured on the cluster.
- **Profile scenarios** (`benchmarks/profiles/scenarios.py`) plus `run_cpu_profile.py`
  and `run_cuda_profile.py` so the same scenarios run on local CPU and on
  cluster GPUs.
- **Coverage gate**: `[tool.coverage.report] fail_under = 87` plus an HTML
  coverage CI job on Python 3.12. Local CPU coverage (with the sounddevice
  backend omitted) is now **88%** — up from the Phase 0 baseline of 74%.
- New test files closing Phase 0 coverage gaps:
  `tests/test_fused.py`, `tests/test_filter_base.py`,
  `tests/test_filter_utils.py`, `tests/test_ops_dispatch.py`,
  `tests/test_filterbank.py`, `tests/test_iir_gaps.py`. Per-module jumps:
  `filter/fused.py` 9% → 75%, `filter/utils.py` 0% → 91%,
  `filter/__base.py` 44% → 89%, `filter/filterbank.py` 73% → 100%,
  `filter/iir.py` 71% → 86%, `_ops.py` 64% → 84%.

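The auto-flattening behaviour of the pipe operator can be sketched without `torch`. The classes `Stage` and `Chain` below are hypothetical stand-ins for `FX` and `FilterChain`; the real classes are `nn.Module` subclasses and may flatten differently.

```python
class Stage:
    """Stand-in for a filter/effect that supports the pipe operator."""

    def __init__(self, name):
        self.name = name

    def __or__(self, other):
        return Chain(self, other)


class Chain(Stage):
    """Sequential container that flattens nested chains on construction."""

    def __init__(self, *stages):
        super().__init__("chain")
        self.stages = []
        for s in stages:
            # Splice a nested chain's members instead of nesting the container.
            self.stages.extend(s.stages if isinstance(s, Chain) else [s])


f1, f2, f3 = Stage("f1"), Stage("f2"), Stage("f3")
left = (f1 | f2) | f3
right = f1 | (f2 | f3)
names_left = [s.name for s in left.stages]
names_right = [s.name for s in right.stages]
```

Both groupings yield the same flat sequence, so later passes only ever see a single level of stages.
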
### Changed

- **Deferred pipeline with auto-fusion** (`wave.py`): `Wave.__or__` now
  accumulates filters in a deferred pipeline instead of executing immediately.
  Materialization happens lazily on `.ys` access, at which point consecutive IIR
  and biquad filters are automatically fused into a single `FusedSOSCascade` —
  eliminating per-filter Python dispatch and kernel launch overhead (~2.5× faster
  for IIR chains). All three chaining syntaxes benefit: `wave | f1 | f2 | f3`,
  `wave | (f1 | f2 | f3)`, and `wave | nn.Sequential(f1, f2, f3)`.
- **Unified Biquad/IIR forward path** (`biquad.py`): `Biquad` now stores
  coefficients as a `[1, 6]` SOS tensor and delegates its `forward()` to the
  shared `_sos_cascade_forward` helper, eliminating ~150 lines of duplicated
  forward logic (shape normalization, state management, native dispatch, fallback).
  Biquad filters now auto-fuse with IIR filters in the deferred pipeline.
  Backward-compatible `b`/`a` properties provide read-only access to coefficients.
- **SOS coefficient caching** (`iir.py`, `fused.py`): device-matched SOS tensor
  is now cached between forward calls, eliminating per-call `.to()` transfers.
  The canonical CPU copy is passed directly to native kernels, avoiding a
  per-call CUDA→CPU synchronisation that was 21% of Self CPU in batch profiles.
- **Biquad coefficient caching** (`biquad.py`, `_ops.py`): feedback coefficients
  `a1`, `a2` are pre-extracted as Python floats at coefficient-computation time
  and passed through to native dispatch, eliminating per-forward `float()` calls
  that trigger GPU→CPU sync on CUDA.
- **In-place state updates** (`iir.py`, `fused.py`, `biquad.py`): replaced
  per-section `torch.stack([...])` allocations in fallback DF1 loops with
  `copy_()` into pre-existing state buffers.
- **Reverb op fusion** (`effect.py`): algebraically simplified the PyTorch
  fallback from 5 tensor ops (`pad → slice → mul → add → lerp`) to 2
  (`clone → add_` with alpha), eliminating one intermediate allocation.
- **Delay wet/dry mix** (`effect.py`): replaced `(1-mix)*x + mix*y` (3 ops)
  with `torch.lerp(x, y, mix)` (1 fused op).
- **Consolidated SOS cascade logic** (`iir.py`, `fused.py`): extracted the
  shared SOS forward path (device caching, state init, native dispatch, DF1
  fallback) into a single `_sos_cascade_forward` helper. `FusedSOSCascade` now
  delegates to the same code as `IIR`, eliminating ~80 lines of duplication.
- **Wave pipeline nested module support** (`wave.py`): `Wave.__or__` now walks
  all submodules via `nn.Module.modules()` instead of only one level of
  `nn.Sequential`, so arbitrarily nested module trees get their `fs` configured.

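The fusion step of the deferred pipeline can be pictured as grouping consecutive SOS-capable stages at materialization time. The planner below is a hypothetical sketch: `build_plan`, the `(kind, payload)` stage encoding, and the `"fused_sos"` tag are assumptions made for illustration, not the internals of `wave.py`.

```python
def build_plan(stages):
    """Group consecutive SOS-capable stages into single fused entries.

    Each stage is a (kind, payload) pair; for kind == "sos" the payload is a
    list of 6-element SOS rows, otherwise it is an opaque effect.
    """
    plan = []
    for kind, payload in stages:
        if kind == "sos" and plan and plan[-1][0] == "fused_sos":
            plan[-1][1].extend(payload)          # extend the open fused run
        elif kind == "sos":
            plan.append(("fused_sos", list(payload)))
        else:
            plan.append((kind, payload))         # non-fusable stage ends the run
    return plan


row = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)             # identity section
pipeline = [
    ("sos", [row, row]),        # e.g. a 4th-order IIR (two sections)
    ("sos", [row]),             # e.g. a biquad (one section)
    ("effect", "reverb"),       # breaks the fused run
    ("sos", [row]),
]
plan = build_plan(pipeline)
kinds = [kind for kind, _ in plan]
```

Deferring execution until the output is read is what gives the planner the whole chain to look at before any kernel runs.
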
### Fixed

- `AbstractFilter._has_computed_coeff` returned `True` for IIR subclasses whose
  `_sos is None` and which have no `b`/`a` attributes — silently claiming
  coefficients were ready before `compute_coefficients()` had run. The fallthrough
  now correctly returns `False`.
- `ParallelFilterCombination.__init__` assigned `self.fs = fs` before
  `self.filters = filters`, crashing with `AttributeError` when `fs` was passed
  at construction. Swapped assignment order so the `fs` setter can safely
  iterate child filters.
- `iir_cpu.cpp` `sos_forward_cpu` used a hardcoded `double sec_sx0[16]` stack
  array for per-section state, silently overflowing for filters with >16 SOS
  sections (order >32). Now uses stack arrays for K≤16 and falls back to
  heap-allocated `std::vector<double>` for higher orders.
- Benchmark scripts (`test_pipeline_bench.py`, `test_biquad_bench.py`,
  `scenarios.py`) called the removed `move_coeff()` method on IIR filters.
  Replaced with standard `nn.Module.to(device)`.

## [0.5.1] - 2026-04-02

### Fixed

- Fixed `LinkwitzRiley` filter where the `order` parameter was overwritten in the constructor, causing incorrect coefficient computation
- Fixed `BadCoefficients` scipy warning on high-order IIR filters with low normalized cutoffs (e.g., 24th-order highpass at 50 Hz / 32 kHz) by computing SOS directly from scipy design functions instead of going through the numerically unstable `ba` intermediate

### Changed

- IIR filters now use SOS (second-order sections) as the sole coefficient representation, computed directly via `scipy.signal.butter(..., output='sos')` and friends — the `ba` transfer function form is no longer used
- `IIR.forward()` always uses the SOS cascade path; the old stateless `torchaudio.lfilter` path has been removed (SOS cascade is both more numerically stable and 1.5–4x faster for typical signal lengths)
- `LinkwitzRiley` now cascades filters by stacking SOS sections (`np.vstack`) instead of convolving `ba` polynomials
- Removed dead code from `IIR` base class: `_compute_ba_from_sos()`, `move_coeff()`, `_bootstrap_state()`, `_stateful` flag, and `a`/`b` attributes
- Removed `a`/`b` constructor parameters from `Butterworth` and `Chebyshev1`

## [0.5.0] - 2026-03-27

### Added

- CUDA kernel implementations for biquad, SOS cascade, and delay line filters using parallel prefix scan algorithm
- JIT-compiled C++/CUDA native extension (`torchfx._ops`) with automatic fallback to pure-PyTorch
- CPU-only C++ extension support: native kernels now compile and load without CUDA toolkit, providing ~2400x speedup for stateful IIR filtering on CPU
- `LogFilterBank` for logarithmically-spaced frequency band decomposition
- CUDA kernel tests and fallback behavior tests
- FFT-based 1D convolution (`fft_conv1d`) adapted from [Julius](https://github.com/adefossez/julius) (MIT License) for fast FIR filtering using the overlap-save method
- `conv_mode` parameter on `FIR` and `DesignableFIR` filters: `"fft"` (default), `"direct"`, or `"auto"`
- Benchmark suite for FFT vs direct convolution across kernel sizes (64–1024) and signal durations

### Changed

- Benchmarks migrated from standalone scripts (`benchmark/`) to pytest-benchmark suite (`benchmarks/`) with unified structure, numba CUDA baselines for fair GPU comparison, and `--benchmark-disable` by default
- Stateful biquad and IIR SOS fallback paths replaced sample-by-sample Python loops with vectorized `lfilter`-based zero-state/zero-input decomposition (~100-500x faster when C++ extension is unavailable)
- IIR SOS matrix (`_compute_sos`) is now computed eagerly after `compute_coefficients()` instead of lazily in the forward path
- State tensor device transfers in stateful filter paths are now guarded to avoid redundant `.to(device)` calls
- Removed synchronous CUDA calls from native kernels for improved GPU throughput
- Added `setuptools` as a runtime dependency (required by `torch.utils.cpp_extension` for JIT compilation)
- FIR filters now default to FFT convolution, up to 10x faster for kernel sizes ≥ 64
- CUDA parallel scan replaced Hillis-Steele algorithm with work-efficient Blelloch scan, reducing total work from O(N log N) to O(N) and shared memory usage from 48 KB to 24 KB per block
- Sequential CUDA biquad kernel now batches 128 channels per thread block instead of 1, improving GPU occupancy for short signals
- Eliminated GPU→CPU synchronization in CUDA biquad kernel by passing all biquad coefficients (`b0`, `b1`, `b2`, `a1`, `a2`) as scalar arguments instead of extracting them from device tensors
- Pre-computed SOS convolution kernels and cached constant tensors (`b_delta`) in Python fallback paths to avoid per-call tensor allocation
- C++ CPU extension now compiles with `-O3 -ffast-math -march=native` and OpenMP parallelization, achieving ~2x faster than scipy for multi-channel IIR filtering
- `TORCHFX_NO_CUDA` environment variable added to force CPU-only extension compilation

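The work-efficient scan named above has a simple two-phase structure. The sketch below is a sequential pure-Python rendering of the Blelloch exclusive scan over integer addition, for power-of-two lengths only. It is meant to show the up-sweep and down-sweep, not the CUDA kernels, which presumably scan over a filter recurrence rather than plain sums.

```python
def blelloch_exclusive_scan(values):
    """Work-efficient exclusive prefix sum; len(values) must be a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(values)

    # Up-sweep (reduce): build partial sums in place, O(n) additions in total.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Down-sweep: push prefixes back down the tree, O(n) additions in total.
    data[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
        stride //= 2
    return data


scanned = blelloch_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])
```

Each inner `for` loop is the part that runs in parallel on a GPU; the total number of additions is linear in the input length, compared with the O(N log N) of Hillis-Steele.
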
### Fixed

- Native C++ extension was unreachable on CPU-only machines due to `torch.cuda.is_available()` gate in `_ops.py`
- Segfault in CUDA biquad kernel caused by dereferencing device pointer on host when reading b-coefficients; fixed by passing b0/b1/b2 as CPU scalars from the SOS copy
- Various mypy and ruff errors

## [0.4.0] - 2026-02-15

### Added

- Input validation layer with custom exception hierarchy (`torchfx.validation` module)
  - Base exception `TorchFXError` for catching all library errors
  - Specific exceptions: `InvalidParameterError`, `InvalidSampleRateError`, `InvalidRangeError`, `InvalidShapeError`, `InvalidTypeError`, `AudioProcessingError`, `CoefficientComputationError`, `FilterInstabilityError`
  - Validator functions for sample rates, parameter ranges, tensor shapes, types, and audio-specific validation (cutoff frequency, filter order, Q factor)
- Logging infrastructure (`torchfx.logging` module)
  - Structured logging following Python best practices (NullHandler by default)
  - Convenience functions: `enable_logging()`, `enable_debug_logging()`, `disable_logging()`, `get_logger()`
  - Performance profiling: `log_performance()` context manager and `LogPerformance` decorator
  - Hierarchical logger support for fine-grained control
- Realtime processing module (`torchfx.realtime`) with streaming processors and audio backends
- Biquad filter implementations (LPF, HPF, BPF, BPF peak, notch, all-pass) with stateful and stateless processing paths
- CLI application (Epic 3 — Phase 1) with Typer framework
  - `torchfx process` command: single-file, batch (glob + progress bar), and Unix pipe processing
  - `torchfx info` command: display audio file metadata in a Rich table
  - `torchfx play` command: play audio through speakers with optional effects (requires `sounddevice`)
  - `torchfx record` command: record from microphone with duration/sample-rate/channels control
  - Effect-chain parser: `--effect "name:param1=val1,param2=val2"` syntax with 30+ registered effects/filters
  - TOML configuration file support for defining reusable effect chains
  - GPU acceleration via `--device cuda` global option

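The `--effect "name:param1=val1,param2=val2"` syntax is straightforward to parse. The sketch below is a hypothetical parser written for illustration: `parse_effect_spec`, the numeric coercion rule, and the example effect names are assumptions, not the CLI's actual implementation or registry.

```python
def _coerce(text):
    """Turn a value into int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def parse_effect_spec(spec):
    """Parse 'name:param1=val1,param2=val2' into (name, params)."""
    name, _, arg_text = spec.partition(":")
    name = name.strip()
    if not name:
        raise ValueError(f"missing effect name in {spec!r}")
    params = {}
    for item in filter(None, (part.strip() for part in arg_text.split(","))):
        key, sep, raw = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        params[key.strip()] = _coerce(raw.strip())
    return name, params


# Effect names here are placeholders for whatever the registry provides.
name, params = parse_effect_spec("lowpass:cutoff=1000,q=0.707")
bare_name, bare_params = parse_effect_spec("normalize")
```

Raising `ValueError` with the offending fragment keeps malformed specs easy to report as user-facing messages instead of tracebacks.
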
## [0.3.0] - 2026-01-12

### Added

- method `Wave.save` to save audio files with custom format, encoding and bits per sample
- LoShelving filter implementation based on Audio EQ Cookbook
- ParametricEQ
- Elliptic filters (HiElliptic, LoElliptic)
- deprecation logic to improve code maintainability and backward compatibility
- migration guide to help users transition between versions
- style guidelines for contributors to maintain code quality and consistency
- documentation blog section for project updates and announcements
- contribution guidelines to facilitate community involvement

### Changed

- type hint `BitRate` to include 8 bits per sample option
- uniform `q` naming across all filters (renamed from `Q` to `q`); this is a breaking change!
- documentation theme to pydata-sphinx-theme for better readability and navigation

### Fixed

- `Wave.merge` had a bug due to incorrect tensor concatenation along the channel dimension

## [0.2.1] - 2025-12-13

### Added

- `Delay` effect with BPM synchronization option by @itsuzef, with many delay strategies available
- new examples in the `examples/` folder:
    - `delay.py` showcasing the new Delay effect
- citation file `CITATION.cff` for easy referencing of the library in academic works

### Changed

- the documentation to include the new Delay effect and update existing examples
- the github workflow to run checks in parallel jobs for faster feedback

### Fixed

- pre-commit configuration to properly run `mypy`, `docformatter` and `black`
- many type hints across the codebase

## [0.2.0] - 2025-09-04

### Added

- CustomNormalizationStrategy class to allow custom normalization functions
- ability to pass a callable as strategy to the Normalize effect
- `ParallelFilterCombination` to combine a set of filters in parallel
- `torch.no_grad` decorators where possible, to improve performance

### Changed

- `effects` module renamed to `effect`, for consistency with the `filter` module name

## [0.1.2] - 2025-06-30

### Added

- third-party acknowledgments section in README and LICENSE files
- effects tests
- reverb effect

## [0.1.1] - 2025-06-16

### Added

- filters:
    - LinkwitzRiley
    - HiLinkwitzRiley
    - LoLinkwitzRiley
- effects:
    - Normalization
    - Gain
- merge method for Wave class

### Fixed

- Shelving and Peaking filters now work as expected; they were missing some instance variables

## [0.1.0] - 2025-04-20

### Added

- sphinx documentation

### Changed

- old parameters in the benchmark script

## [0.1.0rc] - 2025-04-14

### Added

- documentation
- filters
- wave class
- torch support
