Chapter 3: Differentiation Rules
“We don’t compute derivatives from scratch. We use rules that let us differentiate complex functions by breaking them into simple parts.”
Why Rules Matter
Computing derivatives from the limit definition every time would be tedious and error-prone. Instead, we use a small set of rules that let us differentiate almost any function by breaking it into pieces.
This is exactly what automatic differentiation does—it applies these rules systematically through your computational graph.
The Basic Rules
Rule 1: Constant Rule
If f(x) = c (a constant), then f’(x) = 0.
Intuition: A constant doesn’t change, so its rate of change is zero.
# f(x) = 5
# f'(x) = 0
Rule 2: Power Rule
If f(x) = xⁿ, then f’(x) = n·xⁿ⁻¹.
Intuition: Bring down the exponent, reduce it by one.
# f(x) = x³
# f'(x) = 3x²
# f(x) = x^(-1) = 1/x
# f'(x) = -1·x^(-2) = -1/x²
# f(x) = √x = x^(1/2)
# f'(x) = (1/2)·x^(-1/2) = 1/(2√x)
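The three power-rule examples above can be checked numerically. The following is a minimal sketch, assuming a central finite difference is an acceptable reference; the helper names `central_diff` and `power_rule` are illustrative, not part of any library.

```python
def central_diff(f, x, h=1e-5):
    """Approximate f'(x) with a symmetric finite difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

def power_rule(n):
    """Return the derivative of x**n predicted by the power rule."""
    return lambda x: n * x ** (n - 1)

x0 = 2.0
for n in (3, -1, 0.5):  # x³, 1/x, √x
    numeric = central_diff(lambda x: x ** n, x0)
    analytic = power_rule(n)(x0)
    print(f"n={n}: numeric={numeric:.6f}, power rule={analytic:.6f}")
```

The two columns should agree to several decimal places for each exponent, including the negative and fractional ones.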
Rule 3: Constant Multiple Rule
If f(x) = c·g(x), then f’(x) = c·g’(x).
Intuition: Constants factor out of derivatives.
# f(x) = 5x²
# f'(x) = 5·(2x) = 10x
Rule 4: Sum Rule
If f(x) = g(x) + h(x), then f’(x) = g’(x) + h’(x).
Intuition: Differentiate term by term.
# f(x) = x² + sin(x)
# f'(x) = 2x + cos(x)
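Taken together, Rules 1 to 4 are enough to differentiate any polynomial term by term. A small sketch, assuming a polynomial is stored as a coefficient list (the representation and the names `poly_derivative` and `poly_eval` are choices made here for illustration):

```python
def poly_derivative(coeffs):
    """Differentiate c0 + c1*x + c2*x**2 + ... given as [c0, c1, c2, ...].

    Each term c_k * x**k becomes k * c_k * x**(k-1) (constant multiple and
    power rules), terms are handled independently (sum rule), and the
    constant c0 drops out (constant rule).
    """
    return [k * c for k, c in enumerate(coeffs)][1:]

def poly_eval(coeffs, x):
    """Evaluate the polynomial at x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))

p = [7, 3, 5]              # 7 + 3x + 5x²
dp = poly_derivative(p)    # coefficients of 3 + 10x
print(dp)
print(poly_eval(dp, 2.0))  # slope of p at x = 2
```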
The Product Rule
If f(x) = g(x)·h(x), then:
f’(x) = g’(x)·h(x) + g(x)·h’(x)
Mnemonic: “First times derivative of second, plus second times derivative of first.”
Example
# f(x) = x²·sin(x)
# g(x) = x², g'(x) = 2x
# h(x) = sin(x), h'(x) = cos(x)
# f'(x) = 2x·sin(x) + x²·cos(x)
ML Connection
The product rule appears when you have:
Attention mechanisms (query × key)
Gating mechanisms (gate × value)
Any multiplicative interaction between learned features
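As a rough illustration of the gating case, the sketch below assumes a scalar gate sigmoid(x) multiplied by a value tanh(x). This toy function is hypothetical and not taken from any particular architecture; it only shows the product rule at work.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated(x):
    """Hypothetical scalar gate: gate(x) * value(x)."""
    return sigmoid(x) * math.tanh(x)

def gated_grad(x):
    """Product rule: gate'(x) * value(x) + gate(x) * value'(x)."""
    g, v = sigmoid(x), math.tanh(x)
    dg = g * (1.0 - g)   # derivative of sigmoid
    dv = 1.0 - v * v     # derivative of tanh
    return dg * v + g * dv

def central_diff(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 0.7
print(gated_grad(x0), central_diff(gated, x0))  # the two values should agree closely
```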
The Quotient Rule
If f(x) = g(x)/h(x) with h(x) ≠ 0, then:
f’(x) = (g’(x)·h(x) - g(x)·h’(x)) / h(x)²
Mnemonic: “Low d-high minus high d-low, over low squared.”
Example
# f(x) = sin(x)/x
# g(x) = sin(x), g'(x) = cos(x)
# h(x) = x, h'(x) = 1
# f'(x) = (cos(x)·x - sin(x)·1) / x²
# = (x·cos(x) - sin(x)) / x²
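The result for sin(x)/x can be sanity-checked the same way. This sketch assumes x ≠ 0 (the formula is undefined at zero) and uses an illustrative `central_diff` helper as the reference.

```python
import math

def f(x):
    return math.sin(x) / x

def f_prime(x):
    """Quotient rule result: (x*cos(x) - sin(x)) / x**2, valid for x != 0."""
    return (x * math.cos(x) - math.sin(x)) / x ** 2

def central_diff(func, x, h=1e-5):
    return (func(x + h) - func(x - h)) / (2 * h)

for x0 in (0.5, 1.0, 3.0):
    print(x0, f_prime(x0), central_diff(f, x0))
```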
The Chain Rule (The Most Important Rule)
If f(x) = g(h(x)), a composition of functions, then:
f’(x) = g’(h(x))·h’(x)
Intuition: Multiply the derivatives along the chain.
Example
# f(x) = sin(x²)
# Outer function: g(u) = sin(u), g'(u) = cos(u)
# Inner function: h(x) = x², h'(x) = 2x
# f'(x) = cos(x²) · 2x = 2x·cos(x²)
The Chain Rule IS Backpropagation
When you call loss.backward() in PyTorch, you’re applying the chain rule through your entire network:
Input → Layer1 → Layer2 → ... → LayerN → Loss
  x   →   h₁   →   h₂   → ... →   hₙ   →  L
∂L/∂x = ∂L/∂hₙ · ∂hₙ/∂hₙ₋₁ · ... · ∂h₂/∂h₁ · ∂h₁/∂x
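Each layer computes its local derivative (a Jacobian matrix when the layer's input and output are vectors), and backpropagation multiplies them together, starting from the loss and working back toward the input.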
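The multiplication of local derivatives can be sketched without any framework. The example below assumes three hypothetical scalar "layers" (a linear map, tanh, and a square standing in for a loss); it is a toy illustration of the idea, not how PyTorch is implemented.

```python
import math

# Each hypothetical scalar layer is a pair: (function, local derivative).
layers = [
    (lambda x: 3.0 * x + 1.0, lambda x: 3.0),                      # linear
    (math.tanh,               lambda x: 1.0 - math.tanh(x) ** 2),  # tanh
    (lambda x: x * x,         lambda x: 2.0 * x),                  # square (stand-in loss)
]

def forward_backward(x):
    """Return (output, d output / d x) by multiplying local derivatives."""
    # Forward pass: remember the input each layer saw.
    inputs = []
    for fn, _ in layers:
        inputs.append(x)
        x = fn(x)
    # Backward pass: start from dL/dL = 1 and walk the chain in reverse.
    grad = 1.0
    for (_, dfn), layer_input in zip(reversed(layers), reversed(inputs)):
        grad *= dfn(layer_input)
    return x, grad

loss, dloss_dx = forward_backward(0.2)
print(loss, dloss_dx)
```

Storing each layer's input during the forward pass is what lets the backward pass evaluate every local derivative at the right point.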
Why Deep Networks Have Gradient Problems
The chain rule multiplies many terms together:
If each term has magnitude below 1: gradients vanish (exponential decay with depth)
If each term has magnitude above 1: gradients explode (exponential growth with depth)
This is why:
ReLU is popular (derivative is exactly 1 for positive inputs)
Residual connections help (add identity, so gradient flows directly)
Normalization helps (keeps activations in reasonable range)
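A back-of-the-envelope calculation shows how quickly the product shrinks or grows. The sketch assumes every layer contributes the same scalar factor, which real networks do not (weights also enter each factor); the depth and the factor 1.5 are arbitrary choices for illustration.

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

depth = 50
vanishing = sigmoid_grad(0.0) ** depth  # sigmoid'(0) = 0.25 is the largest value sigmoid' takes
exploding = 1.5 ** depth                # a factor slightly above 1 at every layer
relu_like = 1.0 ** depth                # derivative of ReLU on positive inputs

print(f"sigmoid-like: {vanishing:.3e}")
print(f"factor 1.5:   {exploding:.3e}")
print(f"ReLU-like:    {relu_like:.3e}")
```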
Derivatives of Special Functions
Exponential and Logarithm
| Function | Derivative | Notes |
|---|---|---|
| eˣ | eˣ | Equal to its own derivative (the only such functions are constant multiples of eˣ) |
| aˣ | aˣ·ln(a) | General exponential, a > 0 |
| ln(x) | 1/x | Natural log, x > 0 |
| log_a(x) | 1/(x·ln(a)) | General log, a > 0, a ≠ 1 |
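The four entries in the table above can be spot-checked numerically. This is a sketch under the assumption that base a = 2 and the point x = 1.5 are representative; `central_diff` is an illustrative helper.

```python
import math

def central_diff(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

a, x0 = 2.0, 1.5
checks = {
    "e^x":      (math.exp,                 math.exp(x0)),
    "a^x":      (lambda x: a ** x,         a ** x0 * math.log(a)),
    "ln(x)":    (math.log,                 1.0 / x0),
    "log_a(x)": (lambda x: math.log(x, a), 1.0 / (x0 * math.log(a))),
}
for name, (f, expected) in checks.items():
    print(f"{name:9s} numeric={central_diff(f, x0):.6f} table={expected:.6f}")
```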
Trigonometric Functions
| Function | Derivative |
|---|---|
| sin(x) | cos(x) |
| cos(x) | -sin(x) |
| tan(x) | sec²(x) = 1/cos²(x) |
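The tan(x) entry need not be memorized: it follows from the sin and cos entries and the quotient rule. A short derivation, written in LaTeX:

```latex
\begin{aligned}
\frac{d}{dx}\tan x
  &= \frac{d}{dx}\,\frac{\sin x}{\cos x}
   = \frac{\cos x \cdot \cos x - \sin x \cdot (-\sin x)}{\cos^2 x} \\
  &= \frac{\cos^2 x + \sin^2 x}{\cos^2 x}
   = \frac{1}{\cos^2 x}
   = \sec^2 x
\end{aligned}
```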
Activation Functions
| Function | Formula | Derivative |
|---|---|---|
| Sigmoid | σ(x) = 1/(1+e⁻ˣ) | σ(x)(1-σ(x)) |
| Tanh | tanh(x) | 1 - tanh²(x) |
| ReLU | max(0,x) | 1 if x>0, 0 if x<0 (undefined at x=0) |
| Leaky ReLU | max(αx,x), 0<α<1 | 1 if x>0, α if x<0 (undefined at x=0) |
| Softplus | ln(1+eˣ) | σ(x) |
| GELU | x·Φ(x), with Φ the standard normal CDF | Φ(x) + x·φ(x), with φ the standard normal PDF |
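The activation derivatives can be compared against finite differences. The sketch below assumes α = 0.01 for Leaky ReLU and, for GELU, uses the derivative Φ(x) + x·φ(x), which follows from the product rule applied to x·Φ(x). The test points avoid x = 0, where ReLU and Leaky ReLU have a kink.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def norm_cdf(x):
    """Standard normal CDF, Φ(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    """Standard normal PDF, φ(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

ALPHA = 0.01  # assumed Leaky ReLU slope for negative inputs

# name -> (activation, derivative)
activations = {
    "sigmoid":    (sigmoid, lambda x: sigmoid(x) * (1.0 - sigmoid(x))),
    "tanh":       (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    "relu":       (lambda x: max(0.0, x), lambda x: 1.0 if x > 0 else 0.0),
    "leaky_relu": (lambda x: max(ALPHA * x, x), lambda x: 1.0 if x > 0 else ALPHA),
    "softplus":   (lambda x: math.log1p(math.exp(x)), sigmoid),
    "gelu":       (lambda x: x * norm_cdf(x), lambda x: norm_cdf(x) + x * norm_pdf(x)),
}

def central_diff(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

for name, (act, grad) in activations.items():
    for x0 in (-1.3, 0.8):
        print(f"{name:10s} x={x0:+.1f} table={grad(x0):.6f} numeric={central_diff(act, x0):.6f}")
```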
Putting It All Together: A Complex Example
Let’s differentiate a function that might appear in a neural network:
f(x) = σ(w₂·ReLU(w₁x + b₁) + b₂)
where σ is the sigmoid function.
Step by Step
Innermost: h₁(x) = w₁x + b₁, so h₁’(x) = w₁
ReLU: h₂ = ReLU(h₁), so h₂’ = 1 if h₁ > 0, else 0
Linear: h₃ = w₂h₂ + b₂, so ∂h₃/∂h₂ = w₂
Sigmoid: f = σ(h₃), so ∂f/∂h₃ = σ(h₃)(1-σ(h₃))
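Chain Rule Application
df/dx = ∂f/∂h₃ · ∂h₃/∂h₂ · ∂h₂/∂h₁ · ∂h₁/∂x = σ(h₃)(1-σ(h₃)) · w₂ · (1 if h₁ > 0, else 0) · w₁
This is the same computation PyTorch performs when you call .backward().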
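The four steps can be coded by hand and checked against a finite difference. This sketch assumes the function is f(x) = σ(w₂·ReLU(w₁x + b₁) + b₂), as the steps imply; the parameter values are arbitrary and chosen only so that both ReLU branches are exercised.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(x, w1, b1, w2, b2):
    h1 = w1 * x + b1        # innermost linear map
    h2 = max(0.0, h1)       # ReLU
    h3 = w2 * h2 + b2       # second linear map
    return sigmoid(h3)

def df_dx(x, w1, b1, w2, b2):
    """Multiply the four local derivatives from the step-by-step list."""
    h1 = w1 * x + b1
    h2 = max(0.0, h1)
    h3 = w2 * h2 + b2
    s = sigmoid(h3)
    relu_grad = 1.0 if h1 > 0 else 0.0
    return s * (1.0 - s) * w2 * relu_grad * w1

params = dict(w1=1.5, b1=-0.5, w2=-2.0, b2=0.3)  # hypothetical values

def central_diff(x, h=1e-6):
    return (f(x + h, **params) - f(x - h, **params)) / (2 * h)

for x0 in (1.0, 0.0):  # h1 > 0 at x = 1.0, h1 < 0 at x = 0.0
    print(x0, df_dx(x0, **params), central_diff(x0))
```

At x = 0.0 the ReLU is inactive, so the whole product is zero regardless of the other factors.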
Implicit Differentiation
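Sometimes y is defined implicitly, by an equation relating x and y, rather than by an explicit formula y = f(x).
To find dy/dx, differentiate both sides with respect to x, treating y as a function of x (so every term in y picks up a factor dy/dx from the chain rule), then solve for dy/dx.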
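As an illustrative assumption (the equation is chosen here as an example, not taken from the text), consider the unit circle x² + y² = 1. Differentiating both sides gives 2x + 2y·dy/dx = 0, so dy/dx = -x/y. The sketch below checks this against the explicit upper branch y = √(1 - x²).

```python
import math

def y_upper(x):
    """Explicit upper branch of x**2 + y**2 = 1, used only to check the result."""
    return math.sqrt(1.0 - x * x)

def dy_dx_implicit(x, y):
    """Solved from 2x + 2y * dy/dx = 0; requires y != 0."""
    return -x / y

def central_diff(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

x0 = 0.6
y0 = y_upper(x0)
print(dy_dx_implicit(x0, y0), central_diff(y_upper, x0))
```

The implicit formula needs only the point (x, y), not an explicit expression for y, which is what makes it useful when no such expression exists.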
ML Connection
Implicit differentiation is used in:
Implicit layers (DEQ, Neural ODEs)
Constrained optimization (Lagrange multipliers)
Physics-informed networks (enforcing constraints)
Automatic Differentiation in Practice
PyDelt’s neural network interpolator uses autodiff:
from pydelt.interpolation import NeuralNetworkInterpolator
import numpy as np
# Data
x = np.linspace(0, 2*np.pi, 100)
y = np.sin(x)
# Neural network learns the function
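nn_interp = NeuralNetworkInterpolator(
    hidden_layers=[32, 32],
    epochs=1000
)
nn_interp.fit(x, y)
# Autodiff differentiates the fitted network exactly; the result
# approximates the true derivative only as well as the network fits the data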
derivative_func = nn_interp.differentiate(order=1)
derivatives = derivative_func(x)
# Compare to analytical
print(f"Max error: {np.max(np.abs(derivatives - np.cos(x))):.4f}")
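PyDelt's internals are not shown here, so as a library-independent illustration, the sketch below implements forward-mode autodiff with dual numbers. Every class and function name in it is hypothetical. Frameworks such as PyTorch use reverse mode instead, but the principle of applying the sum, product, and chain rules operation by operation is the same.

```python
import math

class Dual:
    """A value paired with its derivative; arithmetic applies the differentiation rules."""

    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    @staticmethod
    def lift(x):
        # Plain numbers are constants, so their derivative is 0 (constant rule).
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):  # sum rule
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):  # product rule
        other = Dual.lift(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def sin(u):  # chain rule: d/dx sin(u) = cos(u) * u'
    u = Dual.lift(u)
    return Dual(math.sin(u.value), math.cos(u.value) * u.deriv)

def derivative(f, x):
    """Evaluate f'(x) by seeding dx/dx = 1 and reading off the derivative part."""
    return f(Dual(x, 1.0)).deriv

f = lambda x: x * x * sin(x * x) + 5 * x
print(derivative(f, 1.3))
```

No symbolic expression for f' is ever built: each arithmetic operation carries its own derivative along, which is why the result is exact up to floating-point rounding.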
Common Mistakes to Avoid
Mistake 1: Forgetting the Chain Rule
# WRONG: d/dx[sin(x²)] = cos(x²)
# RIGHT: d/dx[sin(x²)] = cos(x²) · 2x = 2x·cos(x²)
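A quick numerical check makes this kind of mistake easy to catch. The sketch compares both candidate answers with a finite difference; the helper and the test point are illustrative choices.

```python
import math

def central_diff(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: math.sin(x ** 2)
wrong = lambda x: math.cos(x ** 2)          # inner derivative forgotten
right = lambda x: 2 * x * math.cos(x ** 2)  # chain rule applied

x0 = 1.2
numeric = central_diff(f, x0)
print(f"numeric={numeric:.6f} wrong={wrong(x0):.6f} right={right(x0):.6f}")
```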
Mistake 2: Confusing d/dx with ∂/∂x
d/dx: Total derivative (counts every way the function depends on x, including through other variables that themselves depend on x)
∂/∂x: Partial derivative (other variables held constant)
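The difference shows up as soon as another variable depends on x. In the sketch below, f(x, y) = x·y and y = x² are assumptions chosen for illustration: the partial derivative ∂f/∂x = y holds y fixed, while the total derivative df/dx = y + x·dy/dx = 3x² lets y move with x.

```python
def f(x, y):
    return x * y

def y_of_x(x):
    return x ** 2

def partial_x(x, y, h=1e-6):
    """∂f/∂x: vary x while y is held constant."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def total_x(x, h=1e-6):
    """df/dx: vary x and let y = y_of_x(x) change with it."""
    return (f(x + h, y_of_x(x + h)) - f(x - h, y_of_x(x - h))) / (2 * h)

x0 = 2.0
print(partial_x(x0, y_of_x(x0)))  # close to y = 4
print(total_x(x0))                # close to 3x² = 12
```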
Mistake 3: Forgetting That Derivatives Are Functions
The derivative of f(x) = x² is the function f’(x) = 2x, defined for every x, not a single number. Evaluating it at a point gives the slope there: f’(3) = 6.
Key Takeaways
Basic rules (power, sum, constant) handle simple functions
Product and quotient rules handle combinations
Chain rule handles compositions—and IS backpropagation
Autodiff applies these rules automatically through computational graphs
Gradient problems (vanishing/exploding) come from chain rule multiplication
Exercises
Differentiate by hand:
f(x) = x³ - 3x² + 2x - 1
f(x) = e^(x²)
f(x) = ln(sin(x))
Verify with PyDelt: Use SplineInterpolator to numerically verify your answers.
Trace backprop: For a simple 2-layer network f(x) = σ(w₂·σ(w₁x)), write out the full chain rule expression for ∂f/∂w₁.