# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

> The *Unreleased* section lists changes that have not been released yet but will ship in the next version.

## [0.5.4] - 2026-05-26

### Removed

- **`scipy` runtime dependency.** The `scipy.signal.{butter, cheby1, cheby2,
  ellip, firwin}` calls used by Butterworth, Chebyshev I/II, Elliptic,
  Linkwitz-Riley, and `DesignableFIR` are gone — coefficient design is now
  performed natively in pure PyTorch (see Added). `scipy` remains a *dev/test*
  dependency (in `[dependency-groups.dev]`) for numerical-equivalence
  verification only; downstream installs no longer pull it in.
- `_set_coefficients` helper on `Biquad`: replaced by `_finalize_coeffs`, which
  takes raw RBJ-cookbook `(b0, b1, b2, a0, a1, a2)` and normalizes by `a0`
  centrally instead of having every subclass repeat the division.
- `_gain_db` and `Delay._extend_waveform` private helpers: each was called from
  a single site and is now inlined.
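
To illustrate the centralized normalization that `_finalize_coeffs` is described as performing, here is a minimal sketch. It is not the library's code: `rbj_lowpass_raw` and `finalize_coeffs` are hypothetical names, and the formulas are the standard RBJ Audio EQ Cookbook low-pass expressions, assumed here for the sake of a concrete example.

```python
import math


def rbj_lowpass_raw(fs: float, cutoff: float, q: float) -> tuple[float, ...]:
    """Raw (un-normalized) RBJ cookbook low-pass coefficients."""
    w0 = 2.0 * math.pi * cutoff / fs
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    b0 = (1.0 - cos_w0) / 2.0
    b1 = 1.0 - cos_w0
    b2 = (1.0 - cos_w0) / 2.0
    a0 = 1.0 + alpha
    a1 = -2.0 * cos_w0
    a2 = 1.0 - alpha
    return b0, b1, b2, a0, a1, a2


def finalize_coeffs(b0, b1, b2, a0, a1, a2):
    """Hypothetical stand-in: divide every coefficient by a0 in one place."""
    if a0 == 0.0:
        raise ValueError("a0 must be non-zero")
    inv = 1.0 / a0
    return (b0 * inv, b1 * inv, b2 * inv), (1.0, a1 * inv, a2 * inv)


b, a = finalize_coeffs(*rbj_lowpass_raw(fs=48_000.0, cutoff=1_000.0, q=0.707))
dc_gain = sum(b) / sum(a)  # H(z) at z = 1; a low-pass passes DC at unity gain
```

Keeping the division in one helper means a subclass only has to state the cookbook formulas, which is the duplication the entry above says was removed.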

### Added

- **`torchfx.filter._design`**: pure-PyTorch filter-design module containing
  native equivalents of every `scipy.signal` design function the package used:
  `design_butterworth_sos`, `design_cheby1_sos`, `design_cheby2_sos`,
  `design_ellip_sos`, and `design_firwin` (multi-band capable). All return
  canonical CPU `float64` SOS / FIR tensors compatible with the existing
  C++/CUDA kernels. Implementation follows the standard analog-prototype +
  bilinear-transform pipeline; the elliptic design is a faithful port of
  scipy's algorithm (Orfanidis, "Lecture Notes on Elliptic Filter Design").
- **`tests/test_native_design.py`**: 509 numerical-equivalence tests covering
  the full parameter sweep (orders 1-16, lowpass/highpass, the major windows,
  Rp/Rs combinations). Tolerances are `1e-12` (Butterworth, FIR), `1e-9`
  (Chebyshev), `1e-7` (Elliptic up to N=8), `1e-5` (Elliptic N=12-16).
- **`benchmarks/test_design_benchmarks.py`**: paired side-by-side benchmarks
  comparing native design time to scipy across all five filter types and
  orders 2-16. Run with
  `uv run pytest benchmarks/test_design_benchmarks.py --benchmark-enable`.
  Native design is **14-50× faster than scipy** for Butterworth and Chebyshev I/II
  (e.g. N=16: 26-37 µs native vs 470-490 µs scipy) and **7-20× faster** for
  Elliptic (N=16: 92 µs native vs 670 µs scipy). The IIR design pipeline runs
  on Python `complex` instead of `torch` tensors because per-op tensor dispatch
  dominates over arithmetic for small (N≤16) problems; `firwin` stays in torch
  because windows are length-N (~31..1024) where vectorization wins. Numerical
  equivalence to scipy is preserved (rtol 1e-10/1e-9/1e-7 across IIR types).
- Regression coverage for runtime edge cases across
  `tests/test_wave.py`, `tests/test_effects.py`, `tests/test_fir.py`,
  `tests/test_realtime.py`, and `tests/test_cli.py` (Wave shape/channel
  invariants, Delay sample-rate recomputation, Reverb batched/dtype handling,
  deferred `DesignableFIR` initialization, stream chunk-length constraints,
  and CLI invalid-parameter error paths).
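
The analog-prototype plus bilinear-transform pipeline mentioned above can be sketched on plain Python `complex` values. This is a textbook illustration under stated assumptions, not the contents of `torchfx.filter._design`: the function names are hypothetical, only the Butterworth low-pass case is covered, and the result stops at zeros/poles/gain rather than pairing poles into SOS rows.

```python
import cmath
import math


def butterworth_lowpass_zpk(order: int, wn: float):
    """Digital Butterworth low-pass as (zeros, poles, gain).

    `wn` is the cutoff normalized to Nyquist (0 < wn < 1).
    """
    if order < 1 or not 0.0 < wn < 1.0:
        raise ValueError("need order >= 1 and 0 < wn < 1")
    fs = 2.0  # sample rate that makes Nyquist equal to 1
    warped = 2.0 * fs * math.tan(math.pi * wn / fs)  # pre-warped analog cutoff

    # Analog prototype: left-half-plane poles on a circle of radius `warped`.
    analog_poles = [
        warped * cmath.exp(1j * math.pi * (2 * k + order + 1) / (2 * order))
        for k in range(order)
    ]
    analog_gain = warped ** order  # gives unity gain at s = 0

    # Bilinear transform: s = 2*fs*(z - 1)/(z + 1).
    fs2 = 2.0 * fs
    poles = [(fs2 + p) / (fs2 - p) for p in analog_poles]
    zeros = [-1.0 + 0j] * order  # analog zeros at infinity map to z = -1
    denom = 1.0 + 0j
    for p in analog_poles:
        denom *= fs2 - p
    gain = (analog_gain / denom).real
    return zeros, poles, gain


def magnitude(zeros, poles, gain, wn: float) -> float:
    """|H| at a frequency normalized to Nyquist."""
    z = cmath.exp(1j * math.pi * wn)
    num = gain + 0j
    for zero in zeros:
        num *= z - zero
    den = 1.0 + 0j
    for pole in poles:
        den *= z - pole
    return abs(num / den)


zeros, poles, gain = butterworth_lowpass_zpk(order=4, wn=0.25)
dc = magnitude(zeros, poles, gain, 0.0)        # unity at DC
cutoff = magnitude(zeros, poles, gain, 0.25)   # 1/sqrt(2) at the cutoff
```

For a handful of poles, scalar `complex` arithmetic like this avoids any per-operation tensor dispatch, which is the trade-off the benchmark entry describes for small orders.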

### Changed

- Six biquad subclasses (`BiquadLPF`, `BiquadHPF`, `BiquadNotch`, `BiquadBPF`,
  `BiquadBPFPeak`, `BiquadAllPass`) now express coefficients in raw RBJ form
  and call the shared `_finalize_coeffs` for normalization, removing ~50 lines
  of duplicated `a0_inv` plumbing. The same applies to `Shelving`, `ParametricEQ`,
  `Notch`, and `AllPass` in `iir.py`.
- `LinkwitzRiley.compute_coefficients`: `np.vstack` replaced with `torch.cat`,
  removing the only `numpy.*` math call from `src/torchfx/`. After this work,
  numpy is used only at the I/O boundary (`soundfile`, `sounddevice`), never
  for signal-processing math in core.
- `FIR.__init__`: switched coefficient construction to
  `torch.as_tensor(...).clone().flip(0)` to silence the
  `torch.tensor(sourceTensor)` warning when the new native designs hand a
  tensor straight in.
- `Wave` now enforces internal `(channels, samples)` shape invariants,
  normalizes 1D mono inputs to `(1, T)`, preserves metadata/device in
  transforms and channel extraction, and keeps merge behavior explicit:
  split-channel merges zero-pad to longest length while mix mode requires
  matching channel counts.
- `_ops.delay_line_forward` now validates rank/dtype, supports tensors with
  arbitrary leading batch dimensions by flattening/restoring `(..., T)`, and
  safely round-trips non-native float dtypes via the extension's supported
  float32/float64 execution path.
- Native IIR/biquad dispatch now uses dtype-aware execution in `_ops` and the
  SOS forward path: CPU routes non-float64 inputs through float32, while CUDA
  remains on float64 for the current kernel generation. CPU native kernels in
  `_csrc/cpu/iir_cpu.cpp` now run in float32 or float64 based on input dtype
  instead of unconditionally upcasting to float64.
- `DesignableFIR` now initializes `nn.Module` state even when `fs=None`, then
  updates kernel coefficients in `compute_coefficients()` once sample rate is
  available.
- `Delay` and `Gain` input validation now consistently raises `ValueError` for
  invalid user parameters. BPM-synced `Delay` also recomputes
  `delay_samples` when `fs` changes after initialization.
- Realtime processing now resets stateful effects at file/chunk boundaries and
  rejects effects that change chunk length in `StreamProcessor`.
- CLI processing now surfaces config/effect parse failures as user-facing
  messages (including invalid constructor parameters) instead of raw tracebacks.

### Fixed

- `Wave.get_channel()` no longer returns flattened 1D tensors that break
  downstream assumptions about channel count and `len()` semantics.
- Delayed/BPM-synced effects now stay sample-accurate when reused across
  different sample rates.
- Stream processing no longer leaks state from one file into the next, and
  now fails fast when an effect violates fixed-size chunk assumptions.
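
The sample-rate recomputation for BPM-synced delays can be pictured with a small sketch. Everything here is an assumption made for illustration: `BpmDelay` is a hypothetical class, not the torchfx `Delay`, and the conversion `60 / bpm * beats * fs` is the usual tempo-to-samples formula rather than something this changelog specifies.

```python
class BpmDelay:
    """Hypothetical BPM-synced delay holder; not the torchfx `Delay` class."""

    def __init__(self, bpm: float, beats: float = 1.0, fs: "int | None" = None) -> None:
        if bpm <= 0 or beats <= 0:
            raise ValueError("bpm and beats must be positive")
        self.bpm = bpm
        self.beats = beats
        self.delay_samples = None
        self.fs = fs  # goes through the setter below

    @property
    def fs(self):
        return self._fs

    @fs.setter
    def fs(self, value):
        if value is not None and value <= 0:
            raise ValueError("fs must be positive")
        self._fs = value
        # Recompute on every change so a reused instance never keeps a stale length.
        if value is not None:
            self.delay_samples = round(60.0 / self.bpm * self.beats * value)
        else:
            self.delay_samples = None


delay = BpmDelay(bpm=120.0, beats=1.0)
delay.fs = 44_100
at_44k = delay.delay_samples
delay.fs = 48_000
at_48k = delay.delay_samples
```

Tying the recomputation to the `fs` setter keeps the delay at half a second in both cases, instead of reusing the sample count from the first rate.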

## [0.5.3] - 2026-05-03

### Changed

- **Build system migration**: replaced hatchling + runtime JIT compilation with
  scikit-build-core + CMake. The C++/CUDA extension is now compiled at install
  time (no more 10-30s first-import delay, no compiler requirement for end users
  installing from wheels).
- **Removed pure-PyTorch fallback**: the C++ extension is now required. The slow
  Python fallback paths for IIR filtering and delay have been removed.

### Added

- **CPU delay C++ kernel** (`delay_cpu.cpp`): `delay_line_forward` now dispatches
  to a C++ kernel on CPU (previously CUDA-only, with Python fallback for CPU).
- **CI/CD wheel pipeline** (`.github/workflows/wheels.yml`): automated CPU wheel
  builds via `cibuildwheel` for Python 3.10-3.14 across manylinux x86_64,
  macOS x86_64 + arm64, and Windows x86_64, with PyPI publishing on tag.
  PyTorch's shared libraries are excluded from the repaired wheels so the
  install stays small and links against the user's PyTorch build.
- **Doctest in CI**: `pytest --doctest-modules` now runs in the test job and
  `make doctest` runs in the docs job, exercising docstring and `.. doctest::`
  examples across the codebase.
- **CUDA wheels for Linux x86_64** (`.github/workflows/wheels-cuda.yml`):
  cu124 and cu128 wheels for Python 3.10-3.14, published as a PEP 503 index
  on GitHub Pages at `https://matteospanio.github.io/torchfx/wheels/cuXXX/`
  on every tagged release. Wheels carry a `+cuXXX` local-version segment
  and exclude PyTorch / CUDA shared libraries from the bundle.
- **Auto-release on version bump** (`.github/workflows/release.yml`):
  pushes to `master` are scanned for a change in `pyproject.toml`'s
  `version`. When a new version lands, the workflow tags the commit
  `vX.Y.Z` and dispatches `wheels.yml` + `wheels-cuda.yml` against the
  tag, so a merged version-bump PR automatically yields a PyPI release
  and refreshed CUDA wheel index without any manual `git tag`.

### Removed

- `setuptools` runtime dependency (was only needed for JIT compilation).
- `_biquad_df1_fallback` Python DF1 loop (replaced by C++ `iir_cpu.cpp`).
- JIT compilation logic in `_ops.py` (`_load_extension`, `torch.utils.cpp_extension.load`).

## [0.5.2] - 2026-04-13

### Added

- **`FilterChain`** (`chain.py`): auto-flattening `nn.Sequential` subclass so
  that `(f1 | f2) | f3` produces a flat `FilterChain(f1, f2, f3)` rather than
  nested containers. Exported from the top-level `torchfx` package.
- **`FX.__or__`**: the pipe operator now works directly between filters/effects
  (`f1 | f2` → `FilterChain`), fixing the long-standing Phase 2 doc/API bug
  where only `Wave.__or__` was defined.
- **Performance baseline** (`docs/source/perf/baseline.md`): Phase 0 of the
  optimization campaign — captures pre-optimization coverage, CPU benchmark
  numbers, and CPU `torch.profiler` findings so every follow-up PR has a
  concrete before/after to diff against.
- **SLURM harness for CUDA runs** (`benchmarks/slurm/`): `run_profiles.sbatch`
  + README covering the `sbatch → rsync → pytest-benchmark compare` cycle for
  CUDA benchmarks and profiles measured on the cluster.
- **Profile scenarios** (`benchmarks/profiles/scenarios.py`) plus `run_cpu_profile.py`
  and `run_cuda_profile.py` so the same scenarios run on local CPU and on
  cluster GPUs.
- **Coverage gate**: `[tool.coverage.report] fail_under = 87` plus an HTML
  coverage CI job on Python 3.12. Local CPU coverage (with the sounddevice
  backend omitted) is now **88%** — up from the Phase 0 baseline of 74%.
- New test files closing Phase 0 coverage gaps:
  `tests/test_fused.py`, `tests/test_filter_base.py`,
  `tests/test_filter_utils.py`, `tests/test_ops_dispatch.py`,
  `tests/test_filterbank.py`, `tests/test_iir_gaps.py`. Per-module jumps:
  `filter/fused.py` 9% → 75%, `filter/utils.py` 0% → 91%,
  `filter/__base.py` 44% → 89%, `filter/filterbank.py` 73% → 100%,
  `filter/iir.py` 71% → 86%, `_ops.py` 64% → 84%.
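
The auto-flattening behaviour of the pipe operator can be sketched without torch. The classes below are hypothetical stand-ins (`Stage` for a filter or effect, `Chain` for the container); the real `FilterChain` is described above as an `nn.Sequential` subclass, which this sketch does not reproduce.

```python
class Stage:
    """Hypothetical stand-in for a filter/effect; scales its input."""

    def __init__(self, name: str, factor: float) -> None:
        self.name = name
        self.factor = factor

    def __call__(self, x: float) -> float:
        return x * self.factor

    def __or__(self, other: "Stage") -> "Chain":
        return Chain(self, other)


class Chain(Stage):
    """Flat container: nested chains are spliced in rather than wrapped."""

    def __init__(self, *stages: Stage) -> None:
        self.stages: list[Stage] = []
        for stage in stages:
            if isinstance(stage, Chain):
                # Children are already flat, so one level of splicing is enough.
                self.stages.extend(stage.stages)
            else:
                self.stages.append(stage)

    def __call__(self, x: float) -> float:
        for stage in self.stages:
            x = stage(x)
        return x


f1, f2, f3 = Stage("f1", 2.0), Stage("f2", 3.0), Stage("f3", 5.0)
left = (f1 | f2) | f3
right = f1 | (f2 | f3)
names_left = [s.name for s in left.stages]
names_right = [s.name for s in right.stages]
```

Both groupings end up with the same three-element list, so downstream code that inspects the chain never has to recurse into nested containers.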

### Changed

- **Deferred pipeline with auto-fusion** (`wave.py`): `Wave.__or__` now
  accumulates filters in a deferred pipeline instead of executing immediately.
  Materialization happens lazily on `.ys` access, at which point consecutive IIR
  and biquad filters are automatically fused into a single `FusedSOSCascade` —
  eliminating per-filter Python dispatch and kernel launch overhead (~2.5× faster
  for IIR chains). All three chaining syntaxes benefit: `wave | f1 | f2 | f3`,
  `wave | (f1 | f2 | f3)`, and `wave | nn.Sequential(f1, f2, f3)`.
- **Unified Biquad/IIR forward path** (`biquad.py`): `Biquad` now stores
  coefficients as a `[1, 6]` SOS tensor and delegates its `forward()` to the
  shared `_sos_cascade_forward` helper, eliminating ~150 lines of duplicated
  forward logic (shape normalization, state management, native dispatch, fallback).
  Biquad filters now auto-fuse with IIR filters in the deferred pipeline.
  Backward-compatible `b`/`a` properties provide read-only access to coefficients.
- **SOS coefficient caching** (`iir.py`, `fused.py`): device-matched SOS tensor
  is now cached between forward calls, eliminating per-call `.to()` transfers.
  The canonical CPU copy is passed directly to native kernels, avoiding a
  per-call CUDA→CPU synchronisation that was 21% of Self CPU in batch profiles.
- **Biquad coefficient caching** (`biquad.py`, `_ops.py`): feedback coefficients
  `a1`, `a2` are pre-extracted as Python floats at coefficient-computation time
  and passed through to native dispatch, eliminating per-forward `float()` calls
  that trigger GPU→CPU sync on CUDA.
- **In-place state updates** (`iir.py`, `fused.py`, `biquad.py`): replaced
  per-section `torch.stack([...])` allocations in fallback DF1 loops with
  `copy_()` into pre-existing state buffers.
- **Reverb op fusion** (`effect.py`): algebraically simplified the PyTorch
  fallback from 5 tensor ops (`pad → slice → mul → add → lerp`) to 2
  (`clone → add_` with alpha), eliminating one intermediate allocation.
- **Delay wet/dry mix** (`effect.py`): replaced `(1-mix)*x + mix*y` (3 ops)
  with `torch.lerp(x, y, mix)` (1 fused op).
- **Consolidated SOS cascade logic** (`iir.py`, `fused.py`): extracted the
  shared SOS forward path (device caching, state init, native dispatch, DF1
  fallback) into a single `_sos_cascade_forward` helper. `FusedSOSCascade` now
  delegates to the same code as `IIR`, eliminating ~80 lines of duplication.
- **Wave pipeline nested module support** (`wave.py`): `Wave.__or__` now walks
  all submodules via `nn.Module.modules()` instead of only one level of
  `nn.Sequential`, so arbitrarily nested module trees get their `fs` configured.

### Fixed

- `AbstractFilter._has_computed_coeff` returned `True` for IIR subclasses whose
  `_sos is None` and which have no `b`/`a` attributes — silently claiming
  coefficients were ready before `compute_coefficients()` had run. The fallthrough
  now correctly returns `False`.
- `ParallelFilterCombination.__init__` assigned `self.fs = fs` before
  `self.filters = filters`, crashing with `AttributeError` when `fs` was passed
  at construction. Swapped assignment order so the `fs` setter can safely
  iterate child filters.
- `iir_cpu.cpp` `sos_forward_cpu` used a hardcoded `double sec_sx0[16]` stack
  array for per-section state, silently overflowing for filters with >16 SOS
  sections (order >32). Now uses stack arrays for K≤16 and falls back to
  heap-allocated `std::vector<double>` for higher orders.
- Benchmark scripts (`test_pipeline_bench.py`, `test_biquad_bench.py`,
  `scenarios.py`) called the removed `move_coeff()` method on IIR filters.
  Replaced with standard `nn.Module.to(device)`.

## [0.5.1] - 2026-04-02

### Fixed

- Fixed `LinkwitzRiley` filter where the `order` parameter was overwritten in the constructor, causing incorrect coefficient computation
- Fixed `BadCoefficients` scipy warning on high-order IIR filters with low normalized cutoffs (e.g., 24th-order highpass at 50 Hz / 32 kHz) by computing SOS directly from scipy design functions instead of going through the numerically unstable `ba` intermediate

### Changed

- IIR filters now use SOS (second-order sections) as the sole coefficient representation, computed directly via `scipy.signal.butter(..., output='sos')` and friends — the `ba` transfer function form is no longer used
- `IIR.forward()` always uses the SOS cascade path; the old stateless `torchaudio.lfilter` path has been removed (SOS cascade is both more numerically stable and 1.5–4x faster for typical signal lengths)
- `LinkwitzRiley` now cascades filters by stacking SOS sections (`np.vstack`) instead of convolving `ba` polynomials
- Removed dead code from `IIR` base class: `_compute_ba_from_sos()`, `move_coeff()`, `_bootstrap_state()`, `_stateful` flag, and `a`/`b` attributes
- Removed `a`/`b` constructor parameters from `Butterworth` and `Chebyshev1`

## [0.5.0] - 2026-03-27

### Added

- CUDA kernel implementations for biquad, SOS cascade, and delay line filters using a parallel prefix scan algorithm
- JIT-compiled C++/CUDA native extension (`torchfx._ops`) with automatic fallback to pure-PyTorch
- CPU-only C++ extension support: native kernels now compile and load without CUDA toolkit, providing ~2400x speedup for stateful IIR filtering on CPU
- `LogFilterBank` for logarithmically-spaced frequency band decomposition
- CUDA kernel tests and fallback behavior tests
- FFT-based 1D convolution (`fft_conv1d`) adapted from [Julius](https://github.com/adefossez/julius) (MIT License) for fast FIR filtering using the overlap-save method
- `conv_mode` parameter on `FIR` and `DesignableFIR` filters: `"fft"` (default), `"direct"`, or `"auto"`
- Benchmark suite for FFT vs direct convolution across kernel sizes (64–1024) and signal durations

### Changed

- Benchmarks migrated from standalone scripts (`benchmark/`) to pytest-benchmark suite (`benchmarks/`) with unified structure, numba CUDA baselines for fair GPU comparison, and `--benchmark-disable` by default
- Stateful biquad and IIR SOS fallback paths replaced sample-by-sample Python loops with vectorized `lfilter`-based zero-state/zero-input decomposition (~100-500x faster when C++ extension is unavailable)
- IIR SOS matrix (`_compute_sos`) is now computed eagerly after `compute_coefficients()` instead of lazily in the forward path
- State tensor device transfers in stateful filter paths are now guarded to avoid redundant `.to(device)` calls
- Removed synchronous CUDA calls from native kernels for improved GPU throughput
- Added `setuptools` as a runtime dependency (required by `torch.utils.cpp_extension` for JIT compilation)
- FIR filters now default to FFT convolution, up to 10x faster for kernel sizes ≥ 64
- CUDA parallel scan now uses the work-efficient Blelloch algorithm instead of Hillis-Steele, reducing total work from O(N log N) to O(N) and shared memory usage from 48 KB to 24 KB per block
- Sequential CUDA biquad kernel now batches 128 channels per thread block instead of 1, improving GPU occupancy for short signals
- Eliminated GPU→CPU synchronization in CUDA biquad kernel by passing all biquad coefficients (`b0`, `b1`, `b2`, `a1`, `a2`) as scalar arguments instead of extracting them from device tensors
- Pre-computed SOS convolution kernels and cached constant tensors (`b_delta`) in Python fallback paths to avoid per-call tensor allocation
- C++ CPU extension now compiles with `-O3 -ffast-math -march=native` and OpenMP parallelization, running ~2x faster than scipy for multi-channel IIR filtering
- `TORCHFX_NO_CUDA` environment variable added to force CPU-only extension compilation
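
The work reduction claimed for the Blelloch scan can be checked with a sequential sketch. This is an illustration only: it scans with plain addition on a Python list, whereas the CUDA kernels scan filter recurrences in parallel, and both function names are hypothetical.

```python
def blelloch_exclusive_scan(values: list[float]) -> tuple[list[float], int]:
    """Exclusive prefix sum; returns (result, number of additions).

    The length must be a power of two.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    data = list(values)
    adds = 0

    # Up-sweep (reduce): build partial sums in a balanced tree.
    stride = 1
    while stride < n:
        for i in range(2 * stride - 1, n, 2 * stride):
            data[i] += data[i - stride]
            adds += 1
        stride *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    data[n - 1] = 0.0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            left = data[i - stride]
            data[i - stride] = data[i]
            data[i] += left
            adds += 1
        stride //= 2
    return data, adds


def hillis_steele_add_count(n: int) -> int:
    """Additions performed by a Hillis-Steele scan on n elements."""
    count, stride = 0, 1
    while stride < n:
        count += n - stride
        stride *= 2
    return count


values = [float(i) for i in range(1, 1025)]
result, blelloch_adds = blelloch_exclusive_scan(values)
hillis_adds = hillis_steele_add_count(len(values))
```

For 1024 elements the Blelloch variant performs `2 * (N - 1)` additions, while the Hillis-Steele count grows as `N log2 N - (N - 1)`; for very short inputs the difference is negligible or even reversed.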

### Fixed

- Native C++ extension was unreachable on CPU-only machines due to `torch.cuda.is_available()` gate in `_ops.py`
- Segfault in CUDA biquad kernel caused by dereferencing device pointer on host when reading b-coefficients; fixed by passing b0/b1/b2 as CPU scalars from the SOS copy
- Various mypy and ruff errors



## [0.4.0] - 2026-02-15

### Added

- Input validation layer with custom exception hierarchy (`torchfx.validation` module)
  - Base exception `TorchFXError` for catching all library errors
  - Specific exceptions: `InvalidParameterError`, `InvalidSampleRateError`, `InvalidRangeError`, `InvalidShapeError`, `InvalidTypeError`, `AudioProcessingError`, `CoefficientComputationError`, `FilterInstabilityError`
  - Validator functions for sample rates, parameter ranges, tensor shapes, types, and audio-specific validation (cutoff frequency, filter order, Q factor)
- Logging infrastructure (`torchfx.logging` module)
  - Structured logging following Python best practices (NullHandler by default)
  - Convenience functions: `enable_logging()`, `enable_debug_logging()`, `disable_logging()`, `get_logger()`
  - Performance profiling: `log_performance()` context manager and `LogPerformance` decorator
  - Hierarchical logger support for fine-grained control
- Realtime processing module (`torchfx.realtime`) with streaming processors and audio backends
- Biquad filter implementations (LPF, HPF, BPF, BPF peak, notch, all-pass) with stateful and stateless processing paths
- CLI application (Epic 3 — Phase 1) with Typer framework
  - `torchfx process` command: single-file, batch (glob + progress bar), and Unix pipe processing
  - `torchfx info` command: display audio file metadata in a Rich table
  - `torchfx play` command: play audio through speakers with optional effects (requires `sounddevice`)
  - `torchfx record` command: record from microphone with duration/sample-rate/channels control
  - Effect-chain parser: `--effect "name:param1=val1,param2=val2"` syntax with 30+ registered effects/filters
  - TOML configuration file support for defining reusable effect chains
  - GPU acceleration via `--device cuda` global option
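
The `name:param1=val1,param2=val2` syntax accepted by `--effect` can be parsed along the following lines. This is a sketch under assumptions, not the CLI's actual parser: `parse_effect_spec` and `_coerce` are hypothetical helpers, the effect names in the example are placeholders, and the int/float/string coercion order is a guess at reasonable behaviour.

```python
def parse_effect_spec(spec: str) -> tuple[str, dict[str, object]]:
    """Split 'name:key=val,key=val' into (name, params)."""
    name, _, raw_params = spec.partition(":")
    name = name.strip()
    if not name:
        raise ValueError(f"missing effect name in {spec!r}")
    params: dict[str, object] = {}
    for item in filter(None, (p.strip() for p in raw_params.split(","))):
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        params[key.strip()] = _coerce(value.strip())
    return name, params


def _coerce(text: str) -> object:
    """Try int, then float, otherwise keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


with_params = parse_effect_spec("lowpass:cutoff=1000,q=0.707")
bare = parse_effect_spec("reverb")
```

Raising `ValueError` with the offending fragment gives a caller enough context to turn the failure into a readable message rather than a traceback.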

## [0.3.0] - 2026-01-12

### Added

- method `Wave.save` to save audio files with custom format, encoding and bits per sample
- LoShelving filter implementation based on Audio EQ Cookbook
- ParametricEQ
- Elliptic filters (HiElliptic, LoElliptic)
- deprecation logic to improve code maintainability and backward compatibility
- migration guide to help users transition between versions
- style guidelines for contributors to maintain code quality and consistency
- documentation blog section for project updates and announcements
- contribution guidelines to facilitate community involvement

### Changed

- type hint `BitRate` to include 8 bits per sample option
- uniform `q` naming across all filters (renamed from `Q` to `q`); this is a breaking change
- documentation theme to pydata-sphinx-theme for better readability and navigation

### Fixed

- `Wave.merge` had a bug due to incorrect tensor concatenation along the channel dimension

## [0.2.1] - 2025-12-13

### Added

- `Delay` effect with BPM synchronization option by @itsuzef, with many delay strategies available
- new examples in the `examples/` folder:
    - `delay.py` showcasing the new Delay effect
- citation file `CITATION.cff` for easy referencing of the library in academic works

### Changed

- the documentation to include the new Delay effect and update existing examples
- the GitHub workflow to run checks in parallel jobs for faster feedback

### Fixed

- pre-commit configuration to properly run `mypy`, `docformatter` and `black`
- many type hints across the codebase

## [0.2.0] - 2025-09-04

### Added

- CustomNormalizationStrategy class to allow custom normalization functions
- ability to pass a callable as strategy to the Normalize effect
- `ParallelFilterCombination` to combine a set of filters in parallel
- `torch.no_grad` decorators where possible to improve performance

### Changed

- `effects` module renamed to `effect` for consistency with the `filter` module name

## [0.1.2] - 2025-06-30

### Added

- third-party acknowledgments section in README and LICENSE files
- effects tests
- reverb effect

## [0.1.1] - 2025-06-16

### Added

- filters:
    - LinkwitzRiley
    - HiLinkwitzRiley
    - LoLinkwitzRiley
- effects:
    - Normalization
    - Gain
- merge method for Wave class

### Fixed

- Shelving and Peaking filters now work as expected; they were missing some instance variables

## [0.1.0] - 2025-04-20

### Added

- Sphinx documentation

### Changed

- old parameters in the benchmark script

## [0.1.0rc] - 2025-04-14

### Added

- documentation
- filters
- wave class
- torch support
