Chapter 6: Multivariate Calculus

“Real ML has millions of parameters. Multivariate calculus extends derivatives to functions of many variables: gradients, Jacobians, Hessians.”

From One Variable to Many

So far, we’ve focused on functions of a single variable: f(x). But in machine learning:

  • A neural network has millions of parameters

  • An image has thousands of pixels

  • A dataset has many features

We need calculus for functions of many variables: f(x₁, x₂, …, xₙ) or f(x).

Partial Derivatives

The Idea

A partial derivative measures how the function changes when you vary one input while holding all others constant.

For f(x, y):

  • ∂f/∂x: How f changes when x changes (y fixed)

  • ∂f/∂y: How f changes when y changes (x fixed)

Example

\[f(x, y) = x^2 + xy + y^2\]
\[\frac{\partial f}{\partial x} = 2x + y \quad \text{(treat y as constant)}\]
\[\frac{\partial f}{\partial y} = x + 2y \quad \text{(treat x as constant)}\]

In Code

import numpy as np

def f(x, y):
    return x**2 + x*y + y**2

def partial_x(x, y, h=1e-8):
    """Numerical partial derivative with respect to x."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(x, y, h=1e-8):
    """Numerical partial derivative with respect to y."""
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

# At point (1, 2)
print(f"∂f/∂x at (1,2) = {partial_x(1, 2):.4f}")  # Should be 2(1) + 2 = 4
print(f"∂f/∂y at (1,2) = {partial_y(1, 2):.4f}")  # Should be 1 + 2(2) = 5

The Gradient

Definition

The gradient of a scalar function f: ℝⁿ → ℝ is the vector of all partial derivatives:

\[\begin{split}\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix}\end{split}\]

Geometric Meaning

The gradient points in the direction of steepest ascent. Its magnitude tells you how steep that direction is.

  • To maximize f: move in direction of ∇f

  • To minimize f: move in direction of -∇f ← This is gradient descent!

Example

For f(x, y) = x² + y²:

\[\begin{split}\nabla f = \begin{bmatrix} 2x \\ 2y \end{bmatrix}\end{split}\]

At point (1, 1), the gradient is [2, 2]. This points away from the minimum at (0, 0).

PyDelt’s Gradient Computation

from pydelt.multivariate import MultivariateDerivatives
from pydelt.interpolation import SplineInterpolator
import numpy as np

# Generate 2D data: f(x,y) = x² + y²
x = np.linspace(-2, 2, 20)
y = np.linspace(-2, 2, 20)
X, Y = np.meshgrid(x, y)
Z = X**2 + Y**2

# Prepare data
input_data = np.column_stack([X.flatten(), Y.flatten()])
output_data = Z.flatten()

# Compute gradient
mv = MultivariateDerivatives(SplineInterpolator, smoothing=0.1)
mv.fit(input_data, output_data)
gradient_func = mv.gradient()

# Evaluate at specific points
test_points = np.array([[1.0, 1.0], [0.0, 0.0], [-1.0, 2.0]])
gradients = gradient_func(test_points)

print("Point (1,1): gradient =", gradients[0])   # Should be [2, 2]
print("Point (0,0): gradient =", gradients[1])   # Should be [0, 0]
print("Point (-1,2): gradient =", gradients[2])  # Should be [-2, 4]

The Jacobian

When Output Is a Vector

For a vector-valued function f: ℝⁿ → ℝᵐ, the Jacobian is the matrix of all partial derivatives:

\[\begin{split}J_{\mathbf{f}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}\end{split}\]

Row i, column j contains ∂fᵢ/∂xⱼ.

Example

For f(x, y) = [x² + y, xy]:

\[\begin{split}J = \begin{bmatrix} 2x & 1 \\ y & x \end{bmatrix}\end{split}\]

ML Connection: Backpropagation

In a neural network, each layer is a vector-valued function. The Jacobian of a layer tells you how each output depends on each input.

Backpropagation multiplies Jacobians:

\[\frac{\partial L}{\partial \mathbf{x}} = \frac{\partial L}{\partial \mathbf{h}_n} \cdot J_{h_n} \cdot J_{h_{n-1}} \cdot ... \cdot J_{h_1}\]

PyDelt’s Jacobian Computation

# Vector-valued function: f(x,y) = [x² + y, xy]
# We need multiple outputs
input_data = np.column_stack([X.flatten(), Y.flatten()])
output_data = np.column_stack([
    (X**2 + Y).flatten(),  # f1 = x² + y
    (X * Y).flatten()       # f2 = xy
])

mv = MultivariateDerivatives(SplineInterpolator, smoothing=0.1)
mv.fit(input_data, output_data)
jacobian_func = mv.jacobian()

# Evaluate Jacobian at (1, 2)
test_point = np.array([[1.0, 2.0]])
J = jacobian_func(test_point)
print("Jacobian at (1,2):")
print(J[0])
# Should be approximately:
# [[2, 1],   <- [∂f1/∂x, ∂f1/∂y] = [2x, 1] = [2, 1]
#  [2, 1]]   <- [∂f2/∂x, ∂f2/∂y] = [y, x] = [2, 1]

The Hessian

Second-Order Information

The Hessian is the matrix of second partial derivatives:

\[\begin{split}H_f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix}\end{split}\]

What It Tells You

The Hessian describes the curvature of the function:

  • Positive definite H: Local minimum (bowl shape)

  • Negative definite H: Local maximum (dome shape)

  • Indefinite H: Saddle point

Example

For f(x, y) = x² + y²:

\[\begin{split}H = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}\end{split}\]

This is positive definite (both eigenvalues are 2 > 0), confirming (0, 0) is a minimum.

ML Connection: Second-Order Optimization

Newton’s method uses the Hessian:

\[\theta_{new} = \theta_{old} - H^{-1} \nabla L\]

This converges faster than gradient descent but requires computing and inverting the Hessian—expensive for large models!

Approximations:

  • L-BFGS: Approximate Hessian from gradient history

  • Adam: Diagonal approximation using second moments

  • Natural gradient: Fisher information matrix

PyDelt’s Hessian Computation

# Compute Hessian
hessian_func = mv.hessian()
H = hessian_func(np.array([[0.0, 0.0]]))
print("Hessian at (0,0):")
print(H[0])
# For f(x,y) = x² + y², should be [[2, 0], [0, 2]]

The Laplacian

Definition

The Laplacian is the trace of the Hessian (sum of diagonal elements):

\[\nabla^2 f = \sum_{i=1}^{n} \frac{\partial^2 f}{\partial x_i^2}\]

Physical Meaning

The Laplacian measures how much a point differs from its neighbors:

  • Positive Laplacian: Point is lower than average of neighbors (local minimum tendency)

  • Negative Laplacian: Point is higher than average of neighbors (local maximum tendency)

Applications

  • Heat equation: ∂u/∂t = α∇²u (heat flows from hot to cold)

  • Diffusion: Concentration spreads out over time

  • Image processing: Edge detection, smoothing

  • Graph neural networks: Graph Laplacian for message passing

PyDelt’s Laplacian Computation

laplacian_func = mv.laplacian()
lap = laplacian_func(np.array([[0.0, 0.0], [1.0, 1.0]]))
print("Laplacian values:", lap)
# For f(x,y) = x² + y², Laplacian = 2 + 2 = 4 everywhere

Directional Derivatives

Beyond Coordinate Directions

The gradient gives derivatives along coordinate axes. The directional derivative gives the derivative along any direction v:

\[D_{\mathbf{v}} f = \nabla f \cdot \mathbf{v} = |\nabla f| \cos(\theta)\]

where θ is the angle between ∇f and v.

Key Insight

  • Maximum directional derivative: along ∇f (θ = 0)

  • Zero directional derivative: perpendicular to ∇f (θ = 90°)

  • Minimum directional derivative: opposite to ∇f (θ = 180°)

This is why gradient descent works—we move in the direction of maximum decrease.

Chain Rule in Multiple Dimensions

Scalar Composition

If z = f(x, y) and x = g(t), y = h(t):

\[\frac{dz}{dt} = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}\]

Vector Composition

If z = f(g(x)):

\[J_{\mathbf{z}} = J_{\mathbf{f}} \cdot J_{\mathbf{g}}\]

The Jacobians multiply! This is the multivariate chain rule—the foundation of backpropagation.

Optimization in High Dimensions

Critical Points

A critical point satisfies ∇f = 0. To classify it:

  1. Compute the Hessian H at the critical point

  2. Find eigenvalues of H:

    • All positive → local minimum

    • All negative → local maximum

    • Mixed signs → saddle point

The Saddle Point Problem

In high dimensions, saddle points are much more common than local minima:

  • For n dimensions, a random critical point has ~50% chance of each eigenvalue being positive

  • Probability of all n eigenvalues positive: ~(1/2)ⁿ

This is why gradient descent often works in deep learning—we’re unlikely to get stuck in local minima; we’re more likely at saddle points, which have escape directions.

Practical Example: Gradient Descent

import numpy as np
from pydelt.multivariate import MultivariateDerivatives
from pydelt.interpolation import SplineInterpolator

# Objective: minimize f(x,y) = (x-1)² + (y-2)²
# Minimum is at (1, 2)

# Generate training data
x = np.linspace(-3, 5, 30)
y = np.linspace(-2, 6, 30)
X, Y = np.meshgrid(x, y)
Z = (X - 1)**2 + (Y - 2)**2

input_data = np.column_stack([X.flatten(), Y.flatten()])
output_data = Z.flatten()

# Fit multivariate derivatives
mv = MultivariateDerivatives(SplineInterpolator, smoothing=0.1)
mv.fit(input_data, output_data)
gradient_func = mv.gradient()

# Gradient descent
point = np.array([[4.0, 5.0]])  # Starting point
learning_rate = 0.1
history = [point.copy()]

for i in range(20):
    grad = gradient_func(point)
    point = point - learning_rate * grad
    history.append(point.copy())
    
print(f"Final point: {point[0]}")  # Should be close to [1, 2]
print(f"True minimum: [1, 2]")

Key Takeaways

  1. Partial derivatives measure change in one variable at a time

  2. Gradient points toward steepest ascent—negate for descent

  3. Jacobian is the matrix of derivatives for vector functions

  4. Hessian captures curvature—crucial for optimization

  5. Laplacian measures deviation from neighbors

  6. Chain rule multiplies Jacobians—this IS backpropagation

Exercises

  1. Compute by hand: For f(x, y, z) = xyz, find ∇f and evaluate at (1, 2, 3).

  2. Classify critical points: For f(x, y) = x³ - 3xy², find critical points and classify them using the Hessian.

  3. Implement gradient descent: Use PyDelt to minimize f(x, y) = sin(x) + cos(y) starting from (0, 0).

  4. Explore saddle points: For f(x, y) = x² - y², verify that (0, 0) is a saddle point by computing the Hessian eigenvalues.


Previous: ← Approximation Theory | Next: Complex Analysis →