# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

> The *Unreleased* section is for changes that are not yet released, but are going to be released in the next version.

## [0.5.2] - 2026-04-13

### Added

- **`FilterChain`** (`chain.py`): auto-flattening `nn.Sequential` subclass so
  that `(f1 | f2) | f3` produces a flat `FilterChain(f1, f2, f3)` rather than
  nested containers. Exported from the top-level `torchfx` package.
- **`FX.__or__`**: the pipe operator now works directly between filters/effects
  (`f1 | f2` → `FilterChain`), fixing the long-standing Phase 2 doc/API bug
  where only `Wave.__or__` was defined.
- **Performance baseline** (`docs/source/perf/baseline.md`): Phase 0 of the
  optimization campaign — captures pre-optimization coverage, CPU benchmark
  numbers, and CPU `torch.profiler` findings so every follow-up PR has a
  concrete before/after to diff against.
- **SLURM harness for CUDA runs** (`benchmarks/slurm/`): `run_profiles.sbatch`
  + README covering the `sbatch → rsync → pytest-benchmark compare` cycle for
  CUDA benchmarks and profiles measured on the cluster.
- **Profile scenarios** (`benchmarks/profiles/scenarios.py`) plus `run_cpu_profile.py`
  and `run_cuda_profile.py` so the same scenarios run on local CPU and on
  cluster GPUs.
- **Coverage gate**: `[tool.coverage.report] fail_under = 87` plus an HTML
  coverage CI job on Python 3.12. Local CPU coverage (with the sounddevice
  backend omitted) is now **88%** — up from the Phase 0 baseline of 74%.
- New test files closing Phase 0 coverage gaps:
  `tests/test_fused.py`, `tests/test_filter_base.py`,
  `tests/test_filter_utils.py`, `tests/test_ops_dispatch.py`,
  `tests/test_filterbank.py`, `tests/test_iir_gaps.py`. Per-module jumps:
  `filter/fused.py` 9% → 75%, `filter/utils.py` 0% → 91%,
  `filter/__base.py` 44% → 89%, `filter/filterbank.py` 73% → 100%,
  `filter/iir.py` 71% → 86%, `_ops.py` 64% → 84%.
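
The auto-flattening behaviour described for `FilterChain` and `FX.__or__` can be illustrated with a minimal, torch-free sketch. `Stage` and `Chain` are hypothetical stand-ins invented for this note (the real classes build on `nn.Sequential` and are not reproduced here); the sketch only shows how splicing nested containers keeps `(f1 | f2) | f3` flat.

```python
class Stage:
    """Hypothetical stand-in for a filter or effect; not a torchfx class."""

    def __init__(self, name):
        self.name = name

    def __or__(self, other):
        # Piping two stages (or a stage and a chain) always yields a Chain.
        return Chain(self, other)


class Chain(Stage):
    """Container that splices nested chains into one flat tuple of stages."""

    def __init__(self, *stages):
        super().__init__("chain")
        flat = []
        for stage in stages:
            if isinstance(stage, Chain):
                flat.extend(stage.stages)  # splice instead of nesting
            else:
                flat.append(stage)
        self.stages = tuple(flat)


f1, f2, f3 = Stage("f1"), Stage("f2"), Stage("f3")
left = (f1 | f2) | f3
right = f1 | (f2 | f3)
names_left = [s.name for s in left.stages]
names_right = [s.name for s in right.stages]
```

Both groupings end up with the same three stages in the same order, and no element of `left.stages` is itself a `Chain`.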

### Changed

- **Deferred pipeline with auto-fusion** (`wave.py`): `Wave.__or__` now
  accumulates filters in a deferred pipeline instead of executing immediately.
  Materialization happens lazily on `.ys` access, at which point consecutive IIR
  and biquad filters are automatically fused into a single `FusedSOSCascade` —
  eliminating per-filter Python dispatch and kernel launch overhead (~2.5× faster
  for IIR chains). All three chaining syntaxes benefit: `wave | f1 | f2 | f3`,
  `wave | (f1 | f2 | f3)`, and `wave | nn.Sequential(f1, f2, f3)`.
- **Unified Biquad/IIR forward path** (`biquad.py`): `Biquad` now stores
  coefficients as a `[1, 6]` SOS tensor and delegates its `forward()` to the
  shared `_sos_cascade_forward` helper, eliminating ~150 lines of duplicated
  forward logic (shape normalization, state management, native dispatch, fallback).
  Biquad filters now auto-fuse with IIR filters in the deferred pipeline.
  Backward-compatible `b`/`a` properties provide read-only access to coefficients.
- **SOS coefficient caching** (`iir.py`, `fused.py`): device-matched SOS tensor
  is now cached between forward calls, eliminating per-call `.to()` transfers.
  The canonical CPU copy is passed directly to native kernels, avoiding a
  per-call CUDA→CPU synchronization that accounted for 21% of Self CPU time in batch profiles.
- **Biquad coefficient caching** (`biquad.py`, `_ops.py`): feedback coefficients
  `a1`, `a2` are pre-extracted as Python floats at coefficient-computation time
  and passed through to native dispatch, eliminating per-forward `float()` calls
  that trigger GPU→CPU sync on CUDA.
- **In-place state updates** (`iir.py`, `fused.py`, `biquad.py`): replaced
  per-section `torch.stack([...])` allocations in fallback DF1 loops with
  `copy_()` into pre-existing state buffers.
- **Reverb op fusion** (`effect.py`): algebraically simplified the PyTorch
  fallback from 5 tensor ops (`pad → slice → mul → add → lerp`) to 2
  (`clone → add_` with alpha), eliminating one intermediate allocation.
- **Delay wet/dry mix** (`effect.py`): replaced `(1-mix)*x + mix*y` (3 ops)
  with `torch.lerp(x, y, mix)` (1 fused op).
- **Consolidated SOS cascade logic** (`iir.py`, `fused.py`): extracted the
  shared SOS forward path (device caching, state init, native dispatch, DF1
  fallback) into a single `_sos_cascade_forward` helper. `FusedSOSCascade` now
  delegates to the same code as `IIR`, eliminating ~80 lines of duplication.
- **Wave pipeline nested module support** (`wave.py`): `Wave.__or__` now walks
  all submodules via `nn.Module.modules()` instead of only one level of
  `nn.Sequential`, so arbitrarily nested module trees get their `fs` configured.
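
The deferred pipeline with auto-fusion can be pictured with the following sketch, which assumes nothing about the actual `wave.py` internals. `LazyWave`, `SOSStage`, `GainStage`, and `fuse` are hypothetical names; the numeric IIR filtering is deliberately left out, so SOS stages only log that they ran. The point is that `|` records stages, and the first `.ys` access merges consecutive SOS stages before executing anything.

```python
class SOSStage:
    """Hypothetical IIR-like stage described by a list of second-order sections."""

    def __init__(self, sections):
        self.sections = list(sections)

    def __call__(self, samples, log):
        log.append(f"sos[{len(self.sections)}]")
        return samples  # numeric filtering is omitted in this sketch


class GainStage:
    """Hypothetical non-IIR stage; it breaks a run of fusable stages."""

    def __init__(self, gain):
        self.gain = gain

    def __call__(self, samples, log):
        log.append("gain")
        return [s * self.gain for s in samples]


def fuse(stages):
    """Merge each run of consecutive SOSStage objects into a single SOSStage."""
    fused = []
    for stage in stages:
        if isinstance(stage, SOSStage) and fused and isinstance(fused[-1], SOSStage):
            fused[-1] = SOSStage(fused[-1].sections + stage.sections)
        else:
            fused.append(stage)
    return fused


class LazyWave:
    def __init__(self, samples, pending=()):
        self._samples = list(samples)
        self._pending = tuple(pending)
        self._cache = None
        self.log = []

    def __or__(self, stage):
        # Only record the stage; no processing happens here.
        return LazyWave(self._samples, self._pending + (stage,))

    @property
    def ys(self):
        # Materialize once, on first access, using the fused plan.
        if self._cache is None:
            out = self._samples
            for stage in fuse(self._pending):
                out = stage(out, self.log)
            self._cache = out
        return self._cache


wave = (LazyWave([1.0, 2.0]) | SOSStage(["s1"]) | SOSStage(["s2", "s3"])
        | GainStage(0.5) | SOSStage(["s4"]))
log_before = list(wave.log)   # nothing has run yet
result = wave.ys              # triggers fusion and execution
log_after = list(wave.log)
```

Four recorded stages collapse into three executed ones: the two adjacent SOS stages run as one three-section cascade, while the gain stage keeps the last SOS stage separate.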

### Fixed

- `AbstractFilter._has_computed_coeff` returned `True` for IIR subclasses whose
  `_sos is None` and which have no `b`/`a` attributes — silently claiming
  coefficients were ready before `compute_coefficients()` had run. The fallthrough
  now correctly returns `False`.
- `ParallelFilterCombination.__init__` assigned `self.fs = fs` before
  `self.filters = filters`, crashing with `AttributeError` when `fs` was passed
  at construction. Swapped assignment order so the `fs` setter can safely
  iterate child filters.
- `iir_cpu.cpp` `sos_forward_cpu` used a hardcoded `double sec_sx0[16]` stack
  array for per-section state, silently overflowing for filters with >16 SOS
  sections (order >32). Now uses stack arrays for K≤16 and falls back to
  heap-allocated `std::vector<double>` for higher orders.
- Benchmark scripts (`test_pipeline_bench.py`, `test_biquad_bench.py`,
  `scenarios.py`) called the removed `move_coeff()` method on IIR filters.
  Replaced with standard `nn.Module.to(device)`.
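
The `ParallelFilterCombination` ordering bug is an instance of a general Python pitfall: a property setter that reads another attribute fails if it runs before that attribute is assigned. The sketch below reproduces the pattern with hypothetical classes (`Child`, `Combo`, `BrokenCombo`); it is not the library's code.

```python
class Child:
    """Hypothetical child filter that just stores a sample rate."""

    def __init__(self):
        self.fs = None


class Combo:
    """Hypothetical container whose `fs` setter forwards the rate to its children."""

    def __init__(self, *filters, fs=None):
        self.filters = list(filters)  # must exist before the setter runs
        self.fs = fs

    @property
    def fs(self):
        return self._fs

    @fs.setter
    def fs(self, value):
        self._fs = value
        if value is not None:
            for f in self.filters:
                f.fs = value


class BrokenCombo(Combo):
    """Same container with the two assignments in the buggy order."""

    def __init__(self, *filters, fs=None):
        self.fs = fs  # setter iterates self.filters, which is not set yet
        self.filters = list(filters)


ok = Combo(Child(), Child(), fs=44100)
try:
    BrokenCombo(Child(), fs=44100)
    broken_error = None
except AttributeError as exc:
    broken_error = type(exc).__name__
```

With `fs=None` the broken ordering goes unnoticed because the setter never touches `self.filters`, which matches the report that the crash only appeared when `fs` was passed at construction.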

## [0.5.1] - 2026-04-02

### Fixed

- Fixed `LinkwitzRiley` filter where the `order` parameter was overwritten in the constructor, causing incorrect coefficient computation
- Fixed `BadCoefficients` scipy warning on high-order IIR filters with low normalized cutoffs (e.g., 24th-order highpass at 50 Hz / 32 kHz) by computing SOS directly from scipy design functions instead of going through the numerically unstable `ba` intermediate

### Changed

- IIR filters now use SOS (second-order sections) as the sole coefficient representation, computed directly via `scipy.signal.butter(..., output='sos')` and friends — the `ba` transfer function form is no longer used
- `IIR.forward()` always uses the SOS cascade path; the old stateless `torchaudio.lfilter` path has been removed (SOS cascade is both more numerically stable and 1.5–4x faster for typical signal lengths)
- `LinkwitzRiley` now cascades filters by stacking SOS sections (`np.vstack`) instead of convolving `ba` polynomials
- Removed dead code from `IIR` base class: `_compute_ba_from_sos()`, `move_coeff()`, `_bootstrap_state()`, `_stateful` flag, and `a`/`b` attributes
- Removed `a`/`b` constructor parameters from `Butterworth` and `Chebyshev1`
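
The `LinkwitzRiley` change relies on the fact that stacking SOS sections describes the same filter as convolving the `ba` polynomials. A minimal, torch-free sketch of that equivalence follows. `biquad`, `sos_cascade`, `polymul`, and `lfilter_ba` are hypothetical helpers written for this note (not torchfx or scipy functions), and the section coefficients are illustrative values for a stable second-order low-pass.

```python
def biquad(section, x):
    """Filter x with one section [b0, b1, b2, a0, a1, a2] (transposed direct form II)."""
    b0, b1, b2, a0, a1, a2 = section
    b0, b1, b2, a1, a2 = b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0
    z1 = z2 = 0.0
    y = []
    for xn in x:
        yn = b0 * xn + z1
        z1 = b1 * xn - a1 * yn + z2
        z2 = b2 * xn - a2 * yn
        y.append(yn)
    return y


def sos_cascade(sos, x):
    """Run the signal through every section in order."""
    for section in sos:
        x = biquad(section, x)
    return x


def polymul(p, q):
    """Multiply two polynomials given as coefficient lists (a convolution)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out


def lfilter_ba(b, a, x):
    """Evaluate the direct-form difference equation for transfer function b/a."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y


# Illustrative section: coefficients are assumptions chosen for this example.
section = [0.2929, 0.5858, 0.2929, 1.0, 0.0, 0.1716]
impulse = [1.0] + [0.0] * 31

stacked = sos_cascade([section, section], impulse)   # "vstack" of two sections
b = polymul(section[:3], section[:3])                # old approach: convolve b
a = polymul(section[3:], section[3:])                # ... and convolve a
direct = lfilter_ba(b, a, impulse)
max_diff = max(abs(s - d) for s, d in zip(stacked, direct))
```

At this low order the two impulse responses agree to rounding error. The SOS form avoids building the high-degree polynomials that make the `ba` route numerically unstable at high orders.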

## [0.5.0] - 2026-03-27

### Added

- CUDA kernel implementations for biquad, SOS cascade, and delay line filters using a parallel prefix scan algorithm
- JIT-compiled C++/CUDA native extension (`torchfx._ops`) with automatic fallback to pure-PyTorch
- CPU-only C++ extension support: native kernels now compile and load without the CUDA toolkit, providing ~2400x speedup for stateful IIR filtering on CPU
- `LogFilterBank` for logarithmically-spaced frequency band decomposition
- CUDA kernel tests and fallback behavior tests
- FFT-based 1D convolution (`fft_conv1d`) adapted from [Julius](https://github.com/adefossez/julius) (MIT License) for fast FIR filtering using the overlap-save method
- `conv_mode` parameter on `FIR` and `DesignableFIR` filters: `"fft"` (default), `"direct"`, or `"auto"`
- Benchmark suite for FFT vs direct convolution across kernel sizes (64–1024) and signal durations
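
As a rough illustration of what "logarithmically-spaced" means for a filter bank, the sketch below computes band edges with a constant ratio between neighbours. The function name `log_spaced_edges` and its parameters are hypothetical; this is not the `LogFilterBank` constructor, whose signature is not documented in this changelog.

```python
import math


def log_spaced_edges(f_min, f_max, n_bands):
    """Return n_bands + 1 band edges with a constant ratio between neighbours."""
    if not (0 < f_min < f_max) or n_bands < 1:
        raise ValueError("need 0 < f_min < f_max and n_bands >= 1")
    ratio = (f_max / f_min) ** (1.0 / n_bands)
    return [f_min * ratio**i for i in range(n_bands + 1)]


# Ten bands between 20 Hz and 20480 Hz: each band spans about one octave.
edges = log_spaced_edges(20.0, 20480.0, 10)
ratios = [hi / lo for lo, hi in zip(edges, edges[1:])]
```

Every band covers the same interval on a logarithmic frequency axis, so low bands are narrow in Hz and high bands are wide.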

### Changed

- Benchmarks migrated from standalone scripts (`benchmark/`) to pytest-benchmark suite (`benchmarks/`) with unified structure, numba CUDA baselines for fair GPU comparison, and `--benchmark-disable` by default
- Stateful biquad and IIR SOS fallback paths replaced sample-by-sample Python loops with vectorized `lfilter`-based zero-state/zero-input decomposition (~100-500x faster when C++ extension is unavailable)
- IIR SOS matrix (`_compute_sos`) is now computed eagerly after `compute_coefficients()` instead of lazily in the forward path
- State tensor device transfers in stateful filter paths are now guarded to avoid redundant `.to(device)` calls
- Removed synchronous CUDA calls from native kernels for improved GPU throughput
- Added `setuptools` as a runtime dependency (required by `torch.utils.cpp_extension` for JIT compilation)
- FIR filters now default to FFT convolution, up to 10x faster for kernel sizes ≥ 64
- CUDA parallel scan replaced Hillis-Steele algorithm with work-efficient Blelloch scan, reducing total work from O(N log N) to O(N) and shared memory usage from 48 KB to 24 KB per block
- Sequential CUDA biquad kernel now batches 128 channels per thread block instead of 1, improving GPU occupancy for short signals
- Eliminated GPU→CPU synchronization in CUDA biquad kernel by passing all biquad coefficients (`b0`, `b1`, `b2`, `a1`, `a2`) as scalar arguments instead of extracting them from device tensors
- Pre-computed SOS convolution kernels and cached constant tensors (`b_delta`) in Python fallback paths to avoid per-call tensor allocation
- C++ CPU extension now compiles with `-O3 -ffast-math -march=native` and OpenMP parallelization, running ~2x faster than scipy for multi-channel IIR filtering
- `TORCHFX_NO_CUDA` environment variable added to force CPU-only extension compilation
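
For readers unfamiliar with the Blelloch scan mentioned above, here is a sequential Python sketch of its two phases (up-sweep and down-sweep). It is an illustration of the textbook algorithm on a power-of-two input, not the CUDA kernel: the function name is hypothetical, and plain addition stands in for whatever associative operator the real kernel uses.

```python
from itertools import accumulate


def blelloch_exclusive_scan(values):
    """Exclusive prefix sum via up-sweep/down-sweep; length must be a power of two."""
    n = len(values)
    if n == 0:
        return []
    if n & (n - 1):
        raise ValueError("length must be a power of two in this sketch")
    a = list(values)

    # Up-sweep (reduce): build partial sums in a balanced binary tree.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            left = a[i - d]
            a[i - d] = a[i]
            a[i] += left
        d //= 2
    return a


data = [3, 1, 7, 0, 4, 1, 6, 3]
scanned = blelloch_exclusive_scan(data)
# Reference: shift the inclusive running sum right by one position.
reference = [0] + list(accumulate(data))[:-1]
```

Each phase performs N − 1 additions, which is where the O(N) total work comes from. On a GPU the inner `for` loops are the parts that run in parallel.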

### Fixed

- Native C++ extension was unreachable on CPU-only machines due to a `torch.cuda.is_available()` gate in `_ops.py`
- Segfault in CUDA biquad kernel caused by dereferencing a device pointer on the host when reading b-coefficients; fixed by passing b0/b1/b2 as CPU scalars from the SOS copy
- Various mypy and ruff errors

## [0.4.0] - 2026-02-15

### Added

- Input validation layer with custom exception hierarchy (`torchfx.validation` module)
  - Base exception `TorchFXError` for catching all library errors
  - Specific exceptions: `InvalidParameterError`, `InvalidSampleRateError`, `InvalidRangeError`, `InvalidShapeError`, `InvalidTypeError`, `AudioProcessingError`, `CoefficientComputationError`, `FilterInstabilityError`
  - Validator functions for sample rates, parameter ranges, tensor shapes, types, and audio-specific validation (cutoff frequency, filter order, Q factor)
- Logging infrastructure (`torchfx.logging` module)
  - Structured logging following Python best practices (NullHandler by default)
  - Convenience functions: `enable_logging()`, `enable_debug_logging()`, `disable_logging()`, `get_logger()`
  - Performance profiling: `log_performance()` context manager and `LogPerformance` decorator
  - Hierarchical logger support for fine-grained control
- Realtime processing module (`torchfx.realtime`) with streaming processors and audio backends
- Biquad filter implementations (LPF, HPF, BPF, BPF peak, notch, all-pass) with stateful and stateless processing paths
- CLI application (Epic 3 — Phase 1) with Typer framework
  - `torchfx process` command: single-file, batch (glob + progress bar), and Unix pipe processing
  - `torchfx info` command: display audio file metadata in a Rich table
  - `torchfx play` command: play audio through speakers with optional effects (requires `sounddevice`)
  - `torchfx record` command: record from microphone with duration/sample-rate/channels control
  - Effect-chain parser: `--effect "name:param1=val1,param2=val2"` syntax with 30+ registered effects/filters
  - TOML configuration file support for defining reusable effect chains
  - GPU acceleration via `--device cuda` global option
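
The `--effect "name:param1=val1,param2=val2"` syntax can be parsed along the following lines. This is a sketch under stated assumptions, not the CLI's parser: `parse_effect_spec` and `_coerce` are hypothetical helpers, the numeric coercion rule is a guess, and the parameter names in the example are made up.

```python
def _coerce(text):
    """Turn numeric-looking values into floats; leave everything else as a string."""
    try:
        return float(text)
    except ValueError:
        return text


def parse_effect_spec(spec):
    """Split 'name:key=val,key=val' into (name, params)."""
    name, _, raw_params = spec.partition(":")
    name = name.strip()
    if not name:
        raise ValueError("effect name is missing")
    params = {}
    for item in filter(None, (p.strip() for p in raw_params.split(","))):
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        params[key.strip()] = _coerce(value.strip())
    return name, params


# Parameter names below are illustrative, not the registered effect arguments.
with_params = parse_effect_spec("reverb:decay=0.5,mix=0.3")
bare_name = parse_effect_spec("normalize")
mixed = parse_effect_spec("delay:time=0.25,strategy=mono")
```

A bare effect name yields an empty parameter dictionary, and values that are not numeric stay as strings for the effect to interpret.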

## [0.3.0] - 2026-01-12

### Added

- method `Wave.save` to save audio files with custom format, encoding and bits per sample
- LoShelving filter implementation based on Audio EQ Cookbook
- ParametricEQ
- Elliptic filters (HiElliptic, LoElliptic)
- deprecation logic to improve code maintainability and backward compatibility
- migration guide to help users transition between versions
- style guidelines for contributors to maintain code quality and consistency
- documentation blog section for project updates and announcements
- contribution guidelines to facilitate community involvement

### Changed

- type hint `BitRate` to include 8 bits per sample option
- uniform `q` naming across all filters (renamed from `Q` to `q`); this is a breaking change!
- documentation theme to pydata-sphinx-theme for better readability and navigation

### Fixed

- `Wave.merge` had a bug due to incorrect tensor concatenation along the channel dimension

## [0.2.1] - 2025-12-13

### Added

- `Delay` effect with BPM synchronization option by @itsuzef, with many delay strategies available
- new examples in the `examples/` folder:
    - `delay.py` showcasing the new Delay effect
- citation file `CITATION.cff` for easy referencing of the library in academic works

### Changed

- the documentation to include the new Delay effect and update existing examples
- the github workflow to run checks in parallel jobs for faster feedback

### Fixed

- pre-commit configuration to properly run `mypy`, `docformatter` and `black`
- many type hints across the codebase

## [0.2.0] - 2025-09-04

### Added

- CustomNormalizationStrategy class to allow custom normalization functions
- ability to pass a callable as strategy to the Normalize effect
- `ParallelFilterCombination` to combine a set of filters in parallel
- `torch.no_grad` decorators where possible to increase performance

### Changed

- `effects` module renamed to `effect` for consistency with the `filter` module name

## [0.1.2] - 2025-06-30

### Added

- third-party acknowledgments section in README and LICENSE files
- effects tests
- reverb effect

## [0.1.1] - 2025-06-16

### Added

- filters:
    - LinkwitzRiley
    - HiLinkwitzRiley
    - LoLinkwitzRiley
- effects:
    - Normalization
    - Gain
- merge method for Wave class

### Fixed

- Shelving and Peaking filters now work as expected; they were missing some instance variables

## [0.1.0] - 2025-04-20

### Added

- sphinx documentation

### Changed

- old parameters in the benchmark script

## [0.1.0rc] - 2025-04-14

### Added

- documentation
- filters
- wave class
- torch support
