BFGS and SR1 Quasi-Newton Methods #
Introduction #
Quasi-Newton methods are a class of optimization algorithms that approximate the Newton direction without requiring explicit computation of the Hessian matrix. These methods achieve superlinear convergence while avoiding the computational expense and potential numerical difficulties associated with second derivatives. The two most prominent quasi-Newton methods are the BFGS (Broyden-Fletcher-Goldfarb-Shanno) and SR1 (Symmetric Rank-One) methods.
Mathematical setup #
Consider the optimization problem: $\min_{\mathbf{x} \in \mathbb{R}^n} f(\mathbf{x})$
where $f: \mathbb{R}^n \to \mathbb{R}$ is twice continuously differentiable. At iteration $k$, we have:
- Current point: $\mathbf{x}_k$
- Hessian approximation: $\mathbf{B}_k$ (or inverse approximation $\mathbf{H}_k = \mathbf{B}_k^{-1}$)
- Search direction: $\mathbf{p}_k = -\mathbf{B}_k^{-1} \nabla f_k = -\mathbf{H}_k \nabla f_k$
The secant equation #
Quasi-Newton methods are built on the secant equation, which requires the updated Hessian approximation to map the most recent step onto the observed change in the gradient:
$\mathbf{B}_{k+1} \mathbf{s}_k = \mathbf{y}_k$
where:
- $\mathbf{s}_k = \mathbf{x}_{k+1} - \mathbf{x}_k$ (step vector)
- $\mathbf{y}_k = \nabla f_{k+1} - \nabla f_k$ (gradient difference)
This equation ensures that the Hessian approximation captures the local curvature information from the most recent step.
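As a quick numerical check of this equation (a minimal sketch, not part of either algorithm), note that for a quadratic $f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^{\mathrm{T}} \mathbf{A} \mathbf{x}$ the exact Hessian $\mathbf{A}$ satisfies the secant equation for any pair of points, since $\mathbf{y}_k = \mathbf{A}\mathbf{s}_k$:
import numpy as np

# Quadratic test problem f(x) = 0.5 * x^T A x with a symmetric positive definite A
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
grad = lambda x: A @ x  # gradient of the quadratic

x_k = np.array([1.0, -1.0])   # two arbitrary points
x_k1 = np.array([0.5, 0.2])

s = x_k1 - x_k                # step vector s_k
y = grad(x_k1) - grad(x_k)    # gradient difference y_k

print(np.allclose(A @ s, y))  # True: the exact Hessian satisfies B s = y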
BFGS method #
BFGS update formula #
The BFGS method uses a rank-two update to maintain positive definiteness:
$\mathbf{B}_{k+1} = \mathbf{B}_k - \frac{\mathbf{B}_k \mathbf{s}_k \mathbf{s}_k^{\mathrm{T}} \mathbf{B}_k}{\mathbf{s}_k^{\mathrm{T}} \mathbf{B}_k \mathbf{s}_k} + \frac{\mathbf{y}_k \mathbf{y}_k^{\mathrm{T}}}{\mathbf{y}_k^{\mathrm{T}} \mathbf{s}_k}$
Inverse BFGS formula #
For computational efficiency, the inverse approximation is often updated directly:
$\mathbf{H}_{k+1} = \left(\mathbf{I} - \rho_k \mathbf{s}_k \mathbf{y}_k^{\mathrm{T}}\right) \mathbf{H}_k \left(\mathbf{I} - \rho_k \mathbf{y}_k \mathbf{s}_k^{\mathrm{T}}\right) + \rho_k \mathbf{s}_k \mathbf{s}_k^{\mathrm{T}}$
where $\rho_k = \frac{1}{\mathbf{y}_k^{\mathrm{T}} \mathbf{s}_k}$.
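Written as code, the inverse update is a one-liner plus a safeguard; the sketch below is an illustrative helper (the function name and the curvature threshold are assumptions, not a library API):
import numpy as np

def bfgs_inverse_update(H, s, y, curv_tol=1e-10):
    """Inverse BFGS update of H from step s and gradient difference y."""
    sy = y @ s
    if sy <= curv_tol:  # curvature condition violated: leave H unchanged
        return H
    rho = 1.0 / sy
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    # (I - rho s y^T) H (I - rho y s^T) + rho s s^T
    return V @ H @ V.T + rho * np.outer(s, s)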
Properties of BFGS #
- Maintains positive definiteness when $\mathbf{s}_k^{\mathrm{T}} \mathbf{y}_k > 0$
- Guarantees descent direction when positive definite
- Superlinear convergence rate
- Most robust and widely used quasi-Newton method
SR1 method #
SR1 update formula #
The Symmetric Rank-One method uses a simpler rank-one update:
$\mathbf{B}_{k+1} = \mathbf{B}_k + \frac{(\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k)(\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k)^{\mathrm{T}}}{(\mathbf{y}_k - \mathbf{B}_k \mathbf{s}_k)^{\mathrm{T}} \mathbf{s}_k}$
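A direct transcription of this formula, with the usual safeguard against a vanishing denominator (the function name and the relative threshold r are illustrative choices):
import numpy as np

def sr1_update(B, s, y, r=1e-8):
    """SR1 update of the Hessian approximation B from step s and gradient difference y."""
    v = y - B @ s
    denom = v @ s
    # Skip the update when the denominator is tiny relative to ||v|| ||s||
    if abs(denom) < r * np.linalg.norm(v) * np.linalg.norm(s):
        return B
    return B + np.outer(v, v) / denom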
Properties of SR1 #
- Simpler structure (rank-one vs rank-two)
- Does not guarantee positive definiteness
- Can approximate indefinite Hessians better than BFGS
- May produce non-descent directions
- Often used in trust region methods
Algorithm steps #
Step 1: Initialize parameters #
- Choose initial point $\mathbf{x}_0$
- Set initial inverse Hessian approximation $\mathbf{H}_0 = \mathbf{I}$ (or a scaled identity)
- Set convergence tolerance $\epsilon > 0$
- Set $k = 0$
Step 2: Check convergence #
If $\|\nabla f_k\| \leq \epsilon$, stop and return $\mathbf{x}_k$.
Step 3: Compute search direction #
$\mathbf{p}_k = -\mathbf{H}_k \nabla f_k$
Step 4: Line search #
Find a step size $\alpha_k$ that gives sufficient decrease, for example with a backtracking (Armijo) line search or a line search satisfying the Wolfe conditions, as sketched below.
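A Wolfe-condition line search is a common alternative to plain backtracking because it also guarantees the curvature condition needed in Step 6. One option is SciPy's scipy.optimize.line_search, which implements a strong-Wolfe search; the small wrapper and its fallback step below are illustrative assumptions, not part of the algorithm itself:
from scipy.optimize import line_search  # strong-Wolfe line search from SciPy

def wolfe_step(f, grad_f, x, p, alpha_fallback=1e-3):
    """Return a step size satisfying the (strong) Wolfe conditions, or a small fallback."""
    alpha = line_search(f, grad_f, x, p)[0]
    return alpha if alpha is not None else alpha_fallback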
Step 5: Update iterate #
$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k$
Step 6: Update Hessian approximation #
Compute $\mathbf{s}_k = \mathbf{x}_{k+1} - \mathbf{x}_k$ and $\mathbf{y}_k = \nabla f_{k+1} - \nabla f_k$. Update $\mathbf{H}_{k+1}$ using BFGS or SR1 formula.
Step 7: Increment and repeat #
Set $k \leftarrow k + 1$ and go to Step 2.
Algorithmic description #
Algorithm: BFGS Quasi-Newton Method
Input: x₀, ε, max_iter
Output: x*
1. Set H₀ = I, k = 0
2. While ‖∇f_k‖ > ε and k < max_iter do
3. p_k = -H_k ∇f_k
4. α_k = backtracking_line_search(x_k, p_k)
5. x_{k+1} = x_k + α_k p_k
6. s_k = x_{k+1} - x_k
7. y_k = ∇f_{k+1} - ∇f_k
8. if s_k^T y_k > 0 then
9. ρ_k = 1/(y_k^T s_k)
10. H_{k+1} = (I - ρ_k s_k y_k^T) H_k (I - ρ_k y_k s_k^T) + ρ_k s_k s_k^T
11. else
12. H_{k+1} = H_k // Skip update
13. k ← k + 1
14. End while
15. Return x_k
Theoretical properties #
Convergence rate #
Under suitable conditions:
- BFGS: Superlinear convergence when started sufficiently close to the solution
- SR1: Superlinear convergence in trust region frameworks
- Both methods reduce to Newton’s method when the Hessian approximation becomes exact
Curvature condition #
For BFGS, the condition $\mathbf{s}_k^{\mathrm{T}} \mathbf{y}_k > 0$ is crucial for maintaining positive definiteness. It holds automatically for strictly convex functions and, for general $f$, whenever the line search satisfies the Wolfe conditions (an exact line search also suffices).
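To see why the Wolfe conditions are enough, the curvature condition $\nabla f_{k+1}^{\mathrm{T}} \mathbf{p}_k \geq c_2 \nabla f_k^{\mathrm{T}} \mathbf{p}_k$ with $0 < c_2 < 1$ gives
$$\mathbf{y}_k^{\mathrm{T}} \mathbf{s}_k = \alpha_k \left(\nabla f_{k+1} - \nabla f_k\right)^{\mathrm{T}} \mathbf{p}_k \geq \alpha_k (c_2 - 1)\, \nabla f_k^{\mathrm{T}} \mathbf{p}_k > 0,$$
since $\alpha_k > 0$, $c_2 - 1 < 0$, and $\nabla f_k^{\mathrm{T}} \mathbf{p}_k < 0$ for a descent direction.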
Implementation considerations #
Initial Hessian approximation #
Common choices for $\mathbf{H}_0$:
- Identity matrix: $\mathbf{H}_0 = \mathbf{I}$
- Scaled identity: $\mathbf{H}_0 = \gamma \mathbf{I}$ where $\gamma > 0$
- Diagonal scaling based on the gradient
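One common way to choose the scale $\gamma$ (a standard heuristic from the quasi-Newton literature; the function name below is an illustrative choice) is to rescale after the first step using $\gamma = \mathbf{y}_0^{\mathrm{T}} \mathbf{s}_0 / \mathbf{y}_0^{\mathrm{T}} \mathbf{y}_0$:
import numpy as np

def scaled_initial_H(s0, y0):
    """Scaled identity H0 = gamma * I with gamma = (y0^T s0) / (y0^T y0)."""
    gamma = (y0 @ s0) / (y0 @ y0)
    return gamma * np.eye(len(s0))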
Skipping updates #
When $|\mathbf{s}_k^{\mathrm{T}} \mathbf{y}_k|$ is too small, skip the update to avoid numerical instability:
if |s_k^T y_k| < epsilon_skip * ‖s_k‖ * ‖y_k‖:
    H_{k+1} = H_k // Skip update
Memory considerations #
For large-scale problems, consider:
- L-BFGS: Limited memory version storing only recent vector pairs
- Matrix-free implementations: Avoid storing full matrices
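The heart of L-BFGS is the two-loop recursion, which applies the inverse Hessian approximation to a vector using only the $m$ most recent $(\mathbf{s}_i, \mathbf{y}_i)$ pairs. The sketch below follows the textbook recursion; the names and the handling of an empty history are illustrative choices:
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: return p = -H_k grad using only stored (s, y) pairs."""
    q = grad.copy()
    alphas = []
    # First loop: newest pair to oldest
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    # Initial scaling gamma = (s^T y) / (y^T y) from the most recent pair
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1]) if s_list else 1.0
    r = gamma * q
    # Second loop: oldest pair to newest
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        rho = 1.0 / (y @ s)
        b = rho * (y @ r)
        r = r + s * (a - b)
    return -r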
Applications #
Quasi-Newton methods are widely used in:
- Machine learning: Training neural networks, logistic regression
- Engineering optimization: Design optimization, parameter estimation
- Economics: Portfolio optimization, economic modeling
- Scientific computing: Parameter fitting, inverse problems
Example implementation #
import numpy as np

def bfgs_optimizer(f, grad_f, x0, tol=1e-6, max_iter=100):
    """
    BFGS quasi-Newton optimization algorithm

    Parameters:
    - f: objective function
    - grad_f: gradient function
    - x0: initial point
    - tol: convergence tolerance
    - max_iter: maximum iterations

    Returns:
    - x: optimal point
    - history: optimization history
    """
    n = len(x0)
    x = x0.copy()
    H = np.eye(n)  # Initial inverse Hessian approximation
    history = {'x': [x.copy()], 'f': [f(x)], 'grad_norm': [np.linalg.norm(grad_f(x))]}

    for k in range(max_iter):
        grad = grad_f(x)

        # Check convergence
        if np.linalg.norm(grad) < tol:
            break

        # Compute search direction
        p = -H @ grad

        # Line search (simplified backtracking)
        alpha = backtracking_line_search(f, grad_f, x, p)

        # Update iterate
        x_new = x + alpha * p
        grad_new = grad_f(x_new)

        # Compute vectors for BFGS update
        s = x_new - x
        y = grad_new - grad

        # BFGS update (only if the curvature condition s^T y > 0 is satisfied)
        if s @ y > 1e-10:
            rho = 1.0 / (y @ s)
            I = np.eye(n)
            # BFGS inverse Hessian update
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s)

        # Update for next iteration
        x = x_new

        # Store history
        history['x'].append(x.copy())
        history['f'].append(f(x))
        history['grad_norm'].append(np.linalg.norm(grad_new))

    return x, history
def sr1_optimizer(f, grad_f, x0, tol=1e-6, max_iter=100):
    """
    SR1 quasi-Newton optimization algorithm (inverse-Hessian form)

    Parameters similar to BFGS
    """
    n = len(x0)
    x = x0.copy()
    H = np.eye(n)
    history = {'x': [x.copy()], 'f': [f(x)], 'grad_norm': [np.linalg.norm(grad_f(x))]}

    for k in range(max_iter):
        grad = grad_f(x)
        if np.linalg.norm(grad) < tol:
            break

        # Note: H may become indefinite, so p is not guaranteed to be a descent direction
        p = -H @ grad
        alpha = backtracking_line_search(f, grad_f, x, p)
        x_new = x + alpha * p
        grad_new = grad_f(x_new)
        s = x_new - x
        y = grad_new - grad

        # SR1 update applied to the inverse approximation:
        # H_{k+1} = H_k + (s - H y)(s - H y)^T / ((s - H y)^T y)
        v = s - H @ y
        denominator = v @ y

        # Skip update if the denominator is too small (standard SR1 safeguard)
        if abs(denominator) > 1e-8 * np.linalg.norm(v) * np.linalg.norm(y):
            H = H + np.outer(v, v) / denominator

        x = x_new
        history['x'].append(x.copy())
        history['f'].append(f(x))
        history['grad_norm'].append(np.linalg.norm(grad_new))

    return x, history
def backtracking_line_search(f, grad_f, x, p, alpha0=1.0, rho=0.5, c1=1e-4):
    """Simple backtracking line search"""
    alpha = alpha0
    f_x = f(x)
    grad_x = grad_f(x)
    while f(x + alpha * p) > f_x + c1 * alpha * (grad_x @ p):
        alpha *= rho
        if alpha < 1e-10:  # Prevent infinite loop
            break
    return alpha
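As a usage example, the driver below runs both optimizers on the Rosenbrock function (an illustrative test problem chosen here, separate from the exercises that follow). Consistent with the properties above, SR1 combined with a plain line search may stall on non-descent directions; it is normally embedded in a trust-region framework instead.
import numpy as np

def rosenbrock(x):
    return (1.0 - x[0])**2 + 100.0 * (x[1] - x[0]**2)**2

def rosenbrock_grad(x):
    return np.array([
        -2.0 * (1.0 - x[0]) - 400.0 * x[0] * (x[1] - x[0]**2),
        200.0 * (x[1] - x[0]**2),
    ])

x0 = np.array([-1.2, 1.0])
x_bfgs, hist_bfgs = bfgs_optimizer(rosenbrock, rosenbrock_grad, x0, max_iter=200)
x_sr1, hist_sr1 = sr1_optimizer(rosenbrock, rosenbrock_grad, x0, max_iter=200)
print("BFGS:", x_bfgs, "iterations:", len(hist_bfgs['f']) - 1)
print("SR1 :", x_sr1, "iterations:", len(hist_sr1['f']) - 1)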
Exercises #
Exercise 1: Himmelblau’s Function #
Implement both BFGS and SR1 methods to minimize Himmelblau’s function: $$f(x_1, x_2) = (x_1^2 + x_2 - 11)^2 + (x_1 + x_2^2 - 7)^2$$
Starting points to try:
- $(0, 0)$
- $(1, 1)$
- $(-1, 1)$
Compare convergence rates and final solutions. Himmelblau’s function has four global minima.
Exercise 2: Mixed Function #
Minimize the function: $$f(x_1, x_2) = \frac{1}{2}x_1^2 + x_1 \cos(x_2)$$
Starting points:
- $(2, 0)$
- $(0, \pi)$
- $(1, \pi/2)$
Analyze how the cosine term affects convergence behavior and compare BFGS vs SR1 performance.
Exercise 3: Comparative Analysis #
For both test functions:
- Plot the convergence history (function values and gradient norms)
- Count the number of function and gradient evaluations
- Analyze the condition number of the final Hessian approximations
- Compare with gradient descent and Newton’s method (when applicable)
Exercise 4: Parameter Sensitivity #
Study the effect of:
- Different initial Hessian approximations ($\mathbf{H}_0 = \gamma \mathbf{I}$ for various $\gamma$)
- Line search parameters in the backtracking procedure
- Skipping criteria for Hessian updates