Chapter 5: Approximation Theory
“Every approximation has an error. The question is: can we bound it, minimize it, and understand when it fails?”
Connection to the Central Thesis
We’ve established that exact derivatives from data are impossible. We approximate instead. But how good is the approximation? When does it work? When does it fail?
Approximation theory provides the mathematical framework to answer these questions. It tells us:
Convergence rates: How fast does error decrease with more data?
Error bounds: What’s the worst-case error for a given method?
Optimality: Is there a better method, or are we at the limit?
This chapter gives you the tools to reason rigorously about approximation quality—essential for trusting your derivative estimates.
The Approximation Problem
Setup
We want to approximate an unknown function f from a class F (e.g., smooth functions) using a function f̂ from a class G (e.g., polynomials, splines).
Key quantities:
Approximation error: ‖f - f̂‖ in some norm
Best approximation: inf_{g ∈ G} ‖f - g‖
Achievable rate: How does error scale with complexity of G?
For Derivatives
If we approximate f with f̂, how well does f̂’ approximate f’?
Key insight: Derivative approximation is harder than function approximation. If ‖f - f̂‖ = O(hᵏ), then typically ‖f’ - f̂’‖ = O(hᵏ⁻¹).
Differentiation “costs” one order of accuracy.
Function Spaces
Smoothness Classes
Class |
Definition |
Example |
|---|---|---|
C⁰ |
Continuous |
|
C¹ |
Continuous first derivative |
x |
Cᵏ |
k continuous derivatives |
Polynomials |
C^∞ |
Infinitely differentiable |
sin(x), exp(x) |
Analytic |
Equals its Taylor series |
sin(x), exp(x) |
Smoother functions are easier to approximate.
Sobolev Spaces
Functions with derivatives in Lᵖ:
These spaces are natural for PDE theory and provide sharp approximation results.
Why Smoothness Matters
Theorem (Jackson): If f ∈ Cᵏ[a,b], then the best polynomial approximation of degree n satisfies:
Smoother functions (larger k) → faster convergence (higher power of n).
Polynomial Approximation
Weierstrass Theorem
Any continuous function on [a,b] can be uniformly approximated by polynomials.
Implication: Polynomials are “universal approximators” for continuous functions.
Rate of Convergence
For f ∈ Cᵏ:
Best polynomial of degree n: error O(n⁻ᵏ)
Interpolating polynomial at n+1 points: error O(n⁻ᵏ) if points chosen well
Runge’s Phenomenon
Equally-spaced interpolation points can cause divergence:
for some smooth functions (e.g., 1/(1+25x²)).
Solution: Use Chebyshev points (clustered near endpoints) or piecewise methods.
Spline Approximation
Why Splines Work
Instead of one high-degree polynomial, use many low-degree pieces.
Theorem: For f ∈ Cᵏ and cubic splines with n knots:
where h = 1/n is the knot spacing.
Optimal Rates
For smoothing splines minimizing ∫(f’’)² + λ⁻¹∑(yᵢ - f(xᵢ))²:
when f ∈ C² and noise is Gaussian. This is minimax optimal—no method can do better without additional assumptions.
Derivative Approximation
For cubic splines:
Function: O(h⁴)
First derivative: O(h³)
Second derivative: O(h²)
Each derivative costs one order.
Local Polynomial Regression
Bias-Variance for Local Methods
For kernel regression with bandwidth h:
where p is polynomial order, d is dimension, σ² is noise variance.
Optimal Bandwidth
Minimize MSE = Bias² + Variance:
For local linear (p=1) in 1D (d=1):
Same rate as smoothing splines—this is the optimal rate for C² functions.
For Derivatives
Estimating f’ with local polynomials:
For local quadratic (p=2): MSE(f’) = O(n^{-2/7}) ≈ O(n^{-0.29})
Slower than function estimation—derivatives are harder.
Neural Network Approximation
Universal Approximation
Theorem (Cybenko, Hornik): A feedforward network with one hidden layer can approximate any continuous function on a compact set to arbitrary accuracy.
Caveat: Says nothing about:
How many neurons needed
How to find the weights
Generalization to unseen data
Approximation Rates
For ReLU networks with L layers and W total weights:
for f ∈ C² in d dimensions. Deep networks can achieve better rates for certain function classes.
Derivatives via Autodiff
Neural network derivatives are exact (via automatic differentiation), but the network itself is an approximation. The derivative of an approximate function is an approximate derivative.
Error Analysis Framework
Total Error Decomposition
Approximation: Best possible in function class (bias)
Estimation: From finite, noisy data (variance)
Computational: From numerical precision (usually negligible)
For Derivative Estimation
where f̂_true is the best approximation to f in the function class.
First term: estimation error (from noise) Second term: approximation error (from function class)
Convergence Diagnostics
How to Know If Approximation Is Good
Residual analysis: Do residuals look like noise?
Cross-validation: Does error decrease with more data?
Sensitivity analysis: Do results change with smoothing parameter?
Physical plausibility: Do derivatives make sense?
Warning Signs
Derivatives change sign rapidly (noise dominating)
Derivatives are very different at nearby points (instability)
Cross-validation error increases with model complexity (overfitting)
Results sensitive to smoothing parameter (ill-conditioned)
Practical Guidelines
Choosing Approximation Method
Data Characteristics |
Recommended Method |
Expected Rate |
|---|---|---|
Smooth, low noise |
Splines |
O(n^{-4/5}) |
Smooth, high noise |
Smoothing splines |
O(n^{-4/5}) |
Non-smooth, low noise |
Local linear |
O(n^{-2/3}) |
Complex patterns |
Neural networks |
Depends on architecture |
High dimensional |
Neural networks |
Avoids curse of dimensionality |
Rules of Thumb
More data always helps (until computational limits)
Smoother functions are easier (higher convergence rates)
Derivatives are harder than values (lose one order)
Higher dimensions are harder (curse of dimensionality)
Noise hurts more for derivatives (amplification)
Key Takeaways
Approximation error is unavoidable but can be bounded
Smoothness determines convergence rate (smoother = faster)
Derivatives cost one order of accuracy compared to function values
Optimal rates exist (n^{-4/5} for C² functions in 1D)
Method choice matters but optimal methods achieve similar rates
Diagnostics are essential to verify approximation quality
Exercises
Jackson’s theorem: Verify numerically that polynomial approximation to sin(x) converges as O(n^{-k}) where k depends on which derivative you’re approximating.
Spline convergence: Generate data from a known C² function. Fit splines with increasing numbers of knots. Verify O(h⁴) convergence for function, O(h³) for derivative.
Optimal bandwidth: For local linear regression on sin(x) + noise, find the bandwidth that minimizes cross-validation error. Compare to theoretical n^{-1/5}.
Curse of dimensionality: Repeat exercise 3 in 2D and 3D. How does optimal bandwidth and error scale with dimension?
Previous: ← Multivariate Derivatives | Next: Differential Equations →