Differentiation in Multiple Dimensions #
1 - Introduction #
Differentiation provides the mathematical framework for understanding how functions change locally. While single-variable calculus introduces derivatives, most applications require working with functions of multiple variables. This chapter extends differentiation concepts to multivariate and matrix-valued functions, building the tools needed for optimization and analysis in higher dimensions.
2 - Monovariate Reminders #
Derivative of a Function #
Definition 0.1 (Derivative)
The derivative represents the instantaneous rate of change of the function at a specific point. Geometrically, $f'(x_0)$ gives the slope of the tangent line to the curve $y = f(x)$ at the point $(x_0, f(x_0))$. This tangent line provides the best linear approximation to the function near $x_0$.
For practical computation, we use two fundamental rules:
- Product rule: $(uv)' = u'v + uv'$
- Chain rule: $(f(g(x)))' = f'(g(x))g'(x)$
These rules allow us to differentiate complex expressions by breaking them down into simpler components.
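As a minimal numerical sketch of these two rules, assuming NumPy is available (the test functions, point, and step size are illustrative choices, not part of the text):

```python
import numpy as np

# Finite-difference sanity check of the product and chain rules,
# using u(x) = sin(x), v(x) = x**2 and f(g(x)) with f = exp, g(x) = x**2.
x0, h = 0.7, 1e-6

u, du = np.sin, np.cos                      # u and its derivative
v, dv = lambda x: x**2, lambda x: 2*x       # v and its derivative

# Product rule: (uv)' = u'v + uv'
lhs = (u(x0 + h) * v(x0 + h) - u(x0) * v(x0)) / h
rhs = du(x0) * v(x0) + u(x0) * dv(x0)
print(np.isclose(lhs, rhs, atol=1e-4))      # True

# Chain rule: (f(g(x)))' = f'(g(x)) g'(x) with f = exp, g(x) = x**2
lhs = (np.exp((x0 + h)**2) - np.exp(x0**2)) / h
rhs = np.exp(x0**2) * 2 * x0
print(np.isclose(lhs, rhs, atol=1e-4))      # True
```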
3 - Extension to Multivariate Setup: $f:\mathbb{R}^d \to \mathbb{R}$ #
Limits and Continuity #
Definition 0.2 (Open Disk)
Definition 0.3 (Limit)
A function is continuous at a point $\mathbf{x}_0$ if $\lim_{\mathbf{x} \to \mathbf{x}_0} f(\mathbf{x}) = f(\mathbf{x}_0)$. These definitions generalize the single-variable concepts using the Euclidean norm to measure distances in $\mathbb{R}^d$.
Directional Derivative #
Definition 0.4 (Directional Derivative)
When $\|\mathbf{v}\|_2 = 1$, the directional derivative $Df(\mathbf{x}_0)[\mathbf{v}]$ represents the rate of change of $f$ in the direction of $\mathbf{v}$ at the point $\mathbf{x}_0$. This generalizes the concept of derivative to any direction in the input space.
We also use the notation $\nabla_{\mathbf{v}}f(\mathbf{x}_0)$ for the directional derivative.
Gradient #
Definition 0.5 (Gradient)
The gradient points in the direction of steepest ascent of the function $f$ at the point $\mathbf{x}_0$. It encodes all the first-order information about how the function changes locally.
For any vector $\mathbf{v} \in \mathbb{R}^d$, the directional derivative can be expressed as: $$Df(\mathbf{x}_0)[\mathbf{v}] = \nabla f(\mathbf{x}_0)^\mathrm{T} \mathbf{v}$$
This shows that the gradient contains all the information needed to compute directional derivatives in any direction.
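A minimal numerical sketch of this identity, assuming NumPy; the function $f(\mathbf{x}) = \|\mathbf{x}\|_2^2$ (with gradient $2\mathbf{x}$), the point, and the direction are illustrative choices:

```python
import numpy as np

# Check Df(x0)[v] = grad f(x0)^T v for f(x) = sum(x**2), whose gradient is 2x,
# using a central finite difference along the direction v.
f = lambda x: np.sum(x**2)
x0 = np.array([1.0, -2.0, 0.5])
v = np.array([0.3, 0.4, -0.2])

t = 1e-6
dir_deriv_fd = (f(x0 + t*v) - f(x0 - t*v)) / (2*t)   # numerical Df(x0)[v]
dir_deriv_grad = (2*x0) @ v                          # grad f(x0)^T v

print(np.isclose(dir_deriv_fd, dir_deriv_grad))      # True
```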
Gradient and Partial Derivatives #
Definition 0.6 (Partial Derivative)
The gradient can be expressed in terms of partial derivatives as: $$\nabla f(\mathbf{x}_0) = \left( \frac{\partial f}{\partial x_1}(\mathbf{x}_0), \frac{\partial f}{\partial x_2}(\mathbf{x}_0), \ldots, \frac{\partial f}{\partial x_d}(\mathbf{x}_0) \right)^\mathrm{T}$$
This representation makes it clear that the gradient is a vector containing all the partial derivatives of the function at the point $\mathbf{x}_0$.
Gradient Properties and Practical Computation #
When computing gradients in practice, we use the following rules:
Theorem 0.1 (Product Rule for Gradients)
Theorem 0.2 (Chain Rule for Gradients)
For composition of functions, we have two main cases:
- If $f=h\circ g$ where $h:\mathbb{R}\to\mathbb{R}$ and $g:\mathbb{R}^d\to\mathbb{R}$, then: $$\nabla f(\mathbf{x}) = h'(g(\mathbf{x}))\nabla g(\mathbf{x})$$ where $h'$ is the derivative of $h$ (a numerical sketch follows this list).
- If $f=h\circ g$ where $h:\mathbb{R}^d\to\mathbb{R}$ and $g:\mathbb{R}^{d'}\to\mathbb{R}^d$, we need the more general chain rule discussed later.
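A sketch of the first case, assuming NumPy; the choices $h(t) = e^t$ and $g(\mathbf{x}) = \|\mathbf{x}\|_2^2$ are illustrative:

```python
import numpy as np

# Check grad f(x) = h'(g(x)) grad g(x) for h(t) = exp(t), g(x) = ||x||^2,
# so f(x) = exp(||x||^2) and grad f(x) = exp(||x||^2) * 2x.
g = lambda x: np.sum(x**2)
f = lambda x: np.exp(g(x))
x0 = np.array([0.2, -0.5, 0.1])

grad_formula = np.exp(g(x0)) * 2*x0        # h'(g(x)) * grad g(x)

# Compare against a componentwise central finite difference of f.
eps = 1e-6
grad_fd = np.array([
    (f(x0 + eps*e) - f(x0 - eps*e)) / (2*eps)
    for e in np.eye(3)
])
print(np.allclose(grad_formula, grad_fd, atol=1e-6))   # True
```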
Hessian Matrix #
Definition 0.7 (Hessian Matrix)
The Hessian matrix captures the second-order behavior of the function, providing information about its curvature at the point $\mathbf{x}_0$.
Exercise 1: Compute the gradient and Hessian matrix of the function $f(x,y) = x^2 + 3xy + y^2$ at the point $(1,2)$.
Exercise 2: Using the chain rule, compute the gradient of $f(\mathbf{x}) = \left(\sum_{i=1}^{d}x_i^2\right)^{1/2}$.
Hessian Matrix Properties #
The Hessian matrix has several important properties:
Symmetry: If $f$ is twice continuously differentiable, then $\mathbf{H}(\mathbf{x}_0) = \mathbf{H}(\mathbf{x}_0)^\mathrm{T}$ because mixed partial derivatives are equal: $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$.
Curvature information: The eigenvalues of the Hessian determine the local curvature (a numerical sketch follows this list):
- All eigenvalues positive: $f$ is locally convex at $\mathbf{x}_0$
- All eigenvalues negative: $f$ is locally concave at $\mathbf{x}_0$
- Mixed positive and negative eigenvalues: $f$ has a saddle point at $\mathbf{x}_0$
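A minimal sketch of this classification, assuming NumPy; the example $f(x,y) = x^2 - y^2$ at the origin is an illustrative choice:

```python
import numpy as np

# Classify curvature from Hessian eigenvalues for f(x, y) = x**2 - y**2 at (0, 0):
# H = [[2, 0], [0, -2]] has one positive and one negative eigenvalue, so saddle point.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
eigvals = np.linalg.eigvalsh(H)   # eigvalsh: eigenvalues of a symmetric matrix

if np.all(eigvals > 0):
    print("locally convex (local minimum candidate)")
elif np.all(eigvals < 0):
    print("locally concave (local maximum candidate)")
else:
    print("saddle point")          # printed here
```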
Exercise 3 (Rosenbrock function): The Rosenbrock function is defined as: $$f(x,y) = (a - x)^2 + b(y - x^2)^2$$ where $a$ and $b$ are constants (commonly $a=1$ and $b=100$).
- Compute the gradient $\nabla f(x,y)$ and find stationary points.
- Compute the Hessian matrix $\mathbf{H}(x,y)$ and analyze local curvature at the stationary points.
4 - Multivariate Case: $f:\mathbb{R}^d \to \mathbb{R}^p$ #
Multivariate Functions #
Definition 0.8 (Vector-Valued Function)
Gradient and Jacobian #
For scalar-valued functions, we defined the gradient. For vector-valued functions, we need the Jacobian matrix.
Definition 0.9 (Jacobian Matrix)
The Jacobian matrix generalizes the gradient to vector-valued functions. Each row is the transpose of the gradient of the corresponding component function.
Jacobian and Directional Derivative #
The directional derivative of a vector-valued function $f:\mathbb{R}^d \to \mathbb{R}^p$ in the direction of a vector $\mathbf{v} \in \mathbb{R}^d$ is: $$Df(\mathbf{x})[\mathbf{v}] = \mathbf{J}_f(\mathbf{x})\mathbf{v} = \begin{pmatrix} \nabla f_1(\mathbf{x})^T \mathbf{v} \\ \nabla f_2(\mathbf{x})^T \mathbf{v} \\ \vdots \\ \nabla f_p(\mathbf{x})^T \mathbf{v} \end{pmatrix} \in \mathbb{R}^p$$
This shows how the Jacobian matrix encodes all directional derivative information.
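A minimal numerical sketch, assuming NumPy; the map $g(\mathbf{x}) = (x_1 + x_2,\ x_1 x_2)$, with Jacobian $\begin{pmatrix} 1 & 1 \\ x_2 & x_1 \end{pmatrix}$, the point, and the direction are illustrative choices:

```python
import numpy as np

# Check that J_g(x0) v matches the finite-difference directional derivative of g.
g = lambda x: np.array([x[0] + x[1], x[0] * x[1]])
x0 = np.array([2.0, 3.0])
v = np.array([0.5, -1.0])

J = np.array([[1.0, 1.0],          # Jacobian of g at x0, row per component
              [x0[1], x0[0]]])

t = 1e-6
dir_fd = (g(x0 + t*v) - g(x0 - t*v)) / (2*t)
print(np.allclose(J @ v, dir_fd))   # True
```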
Chain Rule for Composition of Functions #
Theorem 0.3 (General Chain Rule)
Chain Rule: Special Cases #
Case 1: If $f:\mathbb{R}^d \to \mathbb{R}$ and $g:\mathbb{R}^m \to \mathbb{R}^d$, then for $h = f \circ g : \mathbb{R}^m \to \mathbb{R}$: $$\nabla h(\mathbf{y}) = \mathbf{J}_g(\mathbf{y})^T \nabla f(g(\mathbf{y}))$$
Case 2: If $f:\mathbb{R} \to \mathbb{R}$ and $g:\mathbb{R}^m \to \mathbb{R}$, then for $h = f \circ g : \mathbb{R}^m \to \mathbb{R}$: $$\nabla h(\mathbf{y}) = f'(g(\mathbf{y})) \nabla g(\mathbf{y})$$
Worked Examples #
Example 1: Given:
- $f(\mathbf{x}) = \mathbf{x}^T\mathbf{x}$ where $f: \mathbb{R}^2 \to \mathbb{R}$
- $g(\mathbf{y}) = \begin{pmatrix} y_1 + y_2 \\ y_1 - y_2 \end{pmatrix}$ where $g: \mathbb{R}^2 \to \mathbb{R}^2$
- $h = f \circ g$
Find $\nabla h(\mathbf{y})$ using the chain rule.
Solution:
- First, $\nabla f(\mathbf{x}) = 2\mathbf{x}$
- The Jacobian is $\mathbf{J}_g(\mathbf{y}) = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$
- Applying the chain rule: $$\nabla h(\mathbf{y}) = \mathbf{J}_g(\mathbf{y})^T \nabla f(g(\mathbf{y})) = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \cdot 2g(\mathbf{y})$$ $$= 2\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}\begin{pmatrix} y_1 + y_2 \\ y_1 - y_2 \end{pmatrix} = \begin{pmatrix} 4y_1 \\ 4y_2 \end{pmatrix}$$
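A quick numerical check of this result, assuming NumPy; the evaluation point is an illustrative choice:

```python
import numpy as np

# Numerical check of Example 1: grad h(y) should equal (4*y1, 4*y2).
g = lambda y: np.array([y[0] + y[1], y[0] - y[1]])
h = lambda y: g(y) @ g(y)          # h = f o g with f(x) = x^T x

y0 = np.array([1.5, -0.7])
eps = 1e-6
grad_fd = np.array([(h(y0 + eps*e) - h(y0 - eps*e)) / (2*eps) for e in np.eye(2)])

print(np.allclose(grad_fd, 4*y0))  # True
```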
Example 2 (General Quadratic Forms): Given:
- $f(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{b}^T\mathbf{x}$ where $\mathbf{A}$ is symmetric
- $g(\mathbf{y}) = \mathbf{C}\mathbf{y}$ (linear transformation)
Find $\nabla h(\mathbf{y})$ for $h = f \circ g$.
Solution:
- $\nabla f(\mathbf{x}) = 2\mathbf{A}\mathbf{x} + \mathbf{b}$
- $\mathbf{J}_g(\mathbf{y}) = \mathbf{C}$
- Therefore: $$\nabla h(\mathbf{y}) = \mathbf{C}^T [2\mathbf{A}(\mathbf{C}\mathbf{y}) + \mathbf{b}] = 2\mathbf{C}^T\mathbf{A}\mathbf{C}\mathbf{y} + \mathbf{C}^T\mathbf{b}$$
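A minimal sketch verifying this formula against finite differences, assuming NumPy; the random matrices, seed, and dimensions are illustrative choices:

```python
import numpy as np

# Check grad h(y) = 2 C^T A C y + C^T b for h(y) = f(Cy), f(x) = x^T A x + b^T x.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3)); A = (A + A.T) / 2   # symmetric A
b = rng.standard_normal(3)
C = rng.standard_normal((3, 4))

f = lambda x: x @ A @ x + b @ x
h = lambda y: f(C @ y)

y0 = rng.standard_normal(4)
grad_formula = 2 * C.T @ A @ C @ y0 + C.T @ b

eps = 1e-6
grad_fd = np.array([(h(y0 + eps*e) - h(y0 - eps*e)) / (2*eps) for e in np.eye(4)])
print(np.allclose(grad_formula, grad_fd, atol=1e-5))   # True
```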
5 - Matrix Functions: $f:\mathbb{R}^{m \times n} \to \mathbb{R}$ #
Fréchet Derivative #
Definition 0.10 (Fréchet Differentiability)
The Fréchet derivative can also be characterized using the Gateaux derivative: $$Df(\mathbf{X})[\mathbf{V}] = \left.\frac{d}{dt}\right|_{t=0} f(\mathbf{X}+t\mathbf{V}) = \lim_{t\to 0} \frac{f(\mathbf{X}+t\mathbf{V}) - f(\mathbf{X})}{t}$$
If this limit is not linear in $\mathbf{V}$, then $f$ is not Fréchet differentiable.
Often it is useful to view this derivative as the linear operator satisfying: $$f(\mathbf{X}+\boldsymbol{\xi}) = f(\mathbf{X}) + Df(\mathbf{X})[\boldsymbol{\xi}] + o(\lVert\boldsymbol{\xi}\rVert)$$
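A small numerical sketch of the Gateaux-derivative formula, assuming NumPy; the helper name `gateaux`, the step size, and the test function $f(\mathbf{X}) = \mathrm{Tr}(\mathbf{X})$ (whose derivative is $\mathrm{Tr}(\mathbf{V})$) are illustrative choices:

```python
import numpy as np

# Approximate Df(X)[V] by a symmetric difference quotient in t (heuristic step size).
def gateaux(f, X, V, t=1e-6):
    return (f(X + t*V) - f(X - t*V)) / (2*t)

# Sanity check on f(X) = trace(X), which is linear, so Df(X)[V] = trace(V).
X = np.arange(9.0).reshape(3, 3)
V = np.ones((3, 3))
print(np.isclose(gateaux(np.trace, X, V), np.trace(V)))   # True
```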
Matrix-to-Scalar Functions #
For a function $f:\mathbb{R}^{m \times n} \to \mathbb{R}$, the directional derivative at $\mathbf{X}$ in direction $\mathbf{V}$ is: $$Df(\mathbf{X})[\mathbf{V}] = \lim_{h \to 0} \frac{f(\mathbf{X} + h\mathbf{V}) - f(\mathbf{X})}{h}$$
Definition 0.11 (Matrix Gradient)
The gradient can be computed element-wise as: $$\nabla f(\mathbf{X}) = \begin{pmatrix} \frac{\partial f}{\partial X_{11}} & \frac{\partial f}{\partial X_{12}} & \cdots & \frac{\partial f}{\partial X_{1n}} \\ \frac{\partial f}{\partial X_{21}} & \frac{\partial f}{\partial X_{22}} & \cdots & \frac{\partial f}{\partial X_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial X_{m1}} & \frac{\partial f}{\partial X_{m2}} & \cdots & \frac{\partial f}{\partial X_{mn}} \end{pmatrix}$$
Examples of Matrix-to-Scalar Functions #
Example 1: $f(\mathbf{X}) = \|\mathbf{X}\|_F^2 = \mathrm{Tr}(\mathbf{X}^\mathrm{T}\mathbf{X})$
Using the Gateaux derivative: $$Df(\mathbf{X})[\mathbf{V}] = \left.\frac{d}{dt}\right|_{t=0} \mathrm{Tr}((\mathbf{X}+t\mathbf{V})^\mathrm{T}(\mathbf{X}+t\mathbf{V}))$$
Expanding and differentiating: $$= \left.\frac{d}{dt}\right|_{t=0} [\mathrm{Tr}(\mathbf{X}^\mathrm{T}\mathbf{X}) + 2t\mathrm{Tr}(\mathbf{X}^\mathrm{T}\mathbf{V}) + t^2\mathrm{Tr}(\mathbf{V}^\mathrm{T}\mathbf{V})]$$ $$= 2\mathrm{Tr}(\mathbf{X}^\mathrm{T}\mathbf{V})$$
Therefore: $\nabla f(\mathbf{X}) = 2\mathbf{X}$
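A minimal numerical check, assuming NumPy; the random matrices and seed are illustrative choices:

```python
import numpy as np

# Check Df(X)[V] = 2 Tr(X^T V) for f(X) = ||X||_F^2, i.e. grad f(X) = 2X.
rng = np.random.default_rng(1)
X = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))

f = lambda M: np.trace(M.T @ M)
t = 1e-6
deriv_fd = (f(X + t*V) - f(X - t*V)) / (2*t)
deriv_formula = np.trace((2*X).T @ V)      # <grad f(X), V> = Tr(grad^T V)

print(np.isclose(deriv_fd, deriv_formula, atol=1e-6))   # True
```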
Example 2: $f(\mathbf{X}) = \log\det(\mathbf{X})$ (for invertible $\mathbf{X}$)
For this function: $$Df(\mathbf{X})[\mathbf{V}] = \left.\frac{d}{dt}\right|_{t=0} \log\det(\mathbf{X}+t\mathbf{V}) = \mathrm{Tr}(\mathbf{X}^{-1}\mathbf{V})$$
Therefore: $\nabla f(\mathbf{X}) = \mathbf{X}^{-\mathrm{T}}$
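A minimal numerical check, assuming NumPy; the positive definite test matrix and seed are illustrative choices:

```python
import numpy as np

# Check Df(X)[V] = Tr(X^{-1} V) for f(X) = log det(X), i.e. grad f(X) = X^{-T}.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
X = A @ A.T + 3*np.eye(3)          # well-conditioned positive definite X
V = rng.standard_normal((3, 3))

f = lambda M: np.log(np.linalg.det(M))
t = 1e-6
deriv_fd = (f(X + t*V) - f(X - t*V)) / (2*t)
deriv_formula = np.trace(np.linalg.inv(X) @ V)

print(np.isclose(deriv_fd, deriv_formula, atol=1e-5))   # True
```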
6 - Matrix Functions: $f:\mathbb{R}^{m \times n} \to \mathbb{R}^{p \times q}$ #
Matrix-to-Matrix Functions #
For a function $f:\mathbb{R}^{m \times n} \to \mathbb{R}^{p \times q}$, the derivative $Df(\mathbf{X})$ is a linear mapping from $\mathbb{R}^{m \times n}$ to $\mathbb{R}^{p \times q}$, and $Df(\mathbf{X})[\mathbf{V}] \in \mathbb{R}^{p \times q}$ is its value in the direction $\mathbf{V}$.
Since $Df(\mathbf{X})$ is linear, there exists a matrix $\mathbf{M}_{\mathbf{X}} \in \mathbb{R}^{pq \times mn}$ such that: $$\mathrm{vec}(Df(\mathbf{X})[\mathbf{V}]) = \mathbf{M}_{\mathbf{X}} \mathrm{vec}(\mathbf{V})$$ where $\mathrm{vec}(\cdot)$ stacks matrix columns into a vector.
This representation transforms the problem of computing matrix derivatives into standard matrix-vector multiplication. The matrix $\mathbf{M}_{\mathbf{X}}$ is sometimes called the derivative matrix or Jacobian matrix of the vectorized function.
The power of this representation becomes clear when combined with the Kronecker product identity:
Theorem 0.4 (Kronecker Product Identity)
Example: Consider $f(\mathbf{X}) = \mathbf{A}\mathbf{X}\mathbf{B}$ where $\mathbf{A} \in \mathbb{R}^{p \times m}$ and $\mathbf{B} \in \mathbb{R}^{n \times q}$ are fixed matrices.
To find the derivative, we compute: $$Df(\mathbf{X})[\mathbf{V}] = f(\mathbf{X} + \mathbf{V}) - f(\mathbf{X}) = \mathbf{A}(\mathbf{X} + \mathbf{V})\mathbf{B} - \mathbf{A}\mathbf{X}\mathbf{B} = \mathbf{A}\mathbf{V}\mathbf{B}$$
Using the Kronecker product identity: $$\mathrm{vec}(Df(\mathbf{X})[\mathbf{V}]) = \mathrm{vec}(\mathbf{A}\mathbf{V}\mathbf{B}) = (\mathbf{B}^\mathrm{T} \otimes \mathbf{A}) \mathrm{vec}(\mathbf{V})$$
Therefore, $\mathbf{M}_{\mathbf{X}} = \mathbf{B}^\mathrm{T} \otimes \mathbf{A}$, which is independent of $\mathbf{X}$ since $f$ is linear.
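A minimal numerical check of this example, assuming NumPy; the dimensions and seed are illustrative choices (note that `vec` must stack columns, hence `order="F"`):

```python
import numpy as np

# Verify vec(A V B) = (B^T kron A) vec(V), with column-stacking vec.
rng = np.random.default_rng(3)
A = rng.standard_normal((2, 3))    # p x m
V = rng.standard_normal((3, 4))    # m x n
B = rng.standard_normal((4, 5))    # n x q

vec = lambda M: M.flatten(order="F")          # stack columns
lhs = vec(A @ V @ B)
rhs = np.kron(B.T, A) @ vec(V)

print(np.allclose(lhs, rhs))   # True
```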
Vectorization Identities #
Key identities for working with matrix derivatives:
- $\mathrm{vec}(\mathbf{A}\mathbf{B}\mathbf{C}) = (\mathbf{C}^\mathrm{T} \otimes \mathbf{A}) \mathrm{vec}(\mathbf{B})$
- $\mathrm{Tr}(\mathbf{A}\mathbf{B}) = \mathrm{vec}(\mathbf{A}^\mathrm{T})^\mathrm{T}\mathrm{vec}(\mathbf{B})$
- $\mathrm{Tr}(\mathbf{A}^\mathrm{T}\mathbf{B}) = \mathrm{vec}(\mathbf{A})^\mathrm{T}\mathrm{vec}(\mathbf{B})$
where $\otimes$ denotes the Kronecker product.
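A quick numerical check of the trace identities above, assuming NumPy; the matrices and seed are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))    # so A @ B is square and has a trace
C = rng.standard_normal((3, 4))    # same shape as A

vec = lambda M: M.flatten(order="F")
print(np.isclose(np.trace(A @ B), vec(A.T) @ vec(B)))    # Tr(AB)    = vec(A^T)^T vec(B)
print(np.isclose(np.trace(A.T @ C), vec(A) @ vec(C)))    # Tr(A^T C) = vec(A)^T vec(C)
```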
Examples of Matrix-to-Matrix Functions #
Example 1: $f(\mathbf{X}) = \mathbf{X}^2$
Using the Gateaux derivative: $$Df(\mathbf{X})[\mathbf{V}] = \left.\frac{d}{dt}\right|_{t=0} (\mathbf{X}+t\mathbf{V})^2 = \mathbf{X}\mathbf{V} + \mathbf{V}\mathbf{X}$$
Example 2: $f(\mathbf{X}) = \mathbf{X}^{-1}$ (for invertible $\mathbf{X}$)
From the identity $\mathbf{X}\mathbf{X}^{-1} = \mathbf{I}$ and differentiating: $$Df(\mathbf{X})[\mathbf{V}] = -\mathbf{X}^{-1}\mathbf{V}\mathbf{X}^{-1}$$
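A minimal finite-difference check of both examples, assuming NumPy; the test matrix (shifted to stay comfortably invertible), direction, and step size are illustrative choices:

```python
import numpy as np

# Check D(X^2)[V] = XV + VX and D(X^{-1})[V] = -X^{-1} V X^{-1}.
rng = np.random.default_rng(5)
X = rng.standard_normal((3, 3)) + 5*np.eye(3)   # keep X comfortably invertible
V = rng.standard_normal((3, 3))
t = 1e-6

fd = lambda f: (f(X + t*V) - f(X - t*V)) / (2*t)   # symmetric difference quotient

print(np.allclose(fd(lambda M: M @ M), X @ V + V @ X, atol=1e-5))
print(np.allclose(fd(np.linalg.inv),
                  -np.linalg.inv(X) @ V @ np.linalg.inv(X), atol=1e-5))
```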
Properties of Matrix Function Derivatives #
The derivatives of matrix functions follow familiar rules:
Linearity: For $f = \alpha g + \beta h$: $$Df(\mathbf{X})[\mathbf{V}] = \alpha\, Dg(\mathbf{X})[\mathbf{V}] + \beta\, Dh(\mathbf{X})[\mathbf{V}]$$
Product rule: For $f(\mathbf{X}) = g(\mathbf{X}) \cdot h(\mathbf{X})$: $$Df(\mathbf{X})[\mathbf{V}] = Dg(\mathbf{X})[\mathbf{V}] \cdot h(\mathbf{X}) + g(\mathbf{X}) \cdot Dh(\mathbf{X})[\mathbf{V}]$$
Chain rule: For $f(\mathbf{X}) = g(h(\mathbf{X}))$: $$Df(\mathbf{X})[\mathbf{V}] = Dg(h(\mathbf{X}))[Dh(\mathbf{X})[\mathbf{V}]]$$
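A minimal sketch of the product rule, assuming NumPy; the choice $f(\mathbf{X}) = \mathbf{X}\mathbf{X}^\mathrm{T}$ (with $g(\mathbf{X}) = \mathbf{X}$, $h(\mathbf{X}) = \mathbf{X}^\mathrm{T}$, so $Df(\mathbf{X})[\mathbf{V}] = \mathbf{V}\mathbf{X}^\mathrm{T} + \mathbf{X}\mathbf{V}^\mathrm{T}$), the seed, and the step size are illustrative:

```python
import numpy as np

# Check the product rule on f(X) = X X^T: Df(X)[V] = V X^T + X V^T.
rng = np.random.default_rng(6)
X = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
t = 1e-6

fd = ((X + t*V) @ (X + t*V).T - (X - t*V) @ (X - t*V).T) / (2*t)
print(np.allclose(fd, V @ X.T + X @ V.T, atol=1e-5))   # True
```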