Optimization in Deep Learning, Slides of Artificial Intelligence

These slides survey optimization methods in deep learning, including descent direction iteration, gradient descent, stochastic gradient descent, minibatch gradient descent, and conjugate gradient descent. They also cover related topics such as random initializations, multi-starts, vanishing and exploding gradients, batch normalization, and bagging. They offer insights into the trade-offs between speed of convergence and robustness and into the use of noisy update processes to avoid local minima, and they discuss the limitations of second-order and direct methods, as well as genetic algorithms.

Typology: Slides

2022/2023

Uploaded on 03/14/2023

Optimization in Deep Learning
Jesús Fernández-Villaverde¹ and Galo Nuño²
September 1, 2022
¹ University of Pennsylvania
² Banco de España

Descent direction iteration

  • Most training of neural networks is done with (first-order) descent direction iteration methods.
  • Starting at a point $\theta^{(1)}$ (determined by domain knowledge), a descent direction algorithm generates a sequence of steps (called iterates) that converges to a local minimum.
  • The descent direction iteration algorithm:
    1. At iteration $k$, check whether $\theta^{(k)}$ satisfies the termination condition. If so, stop; otherwise go to step 2.
    2. Determine the descent direction $d^{(k)}$ using local information such as the gradient or the Hessian.
    3. Compute the step size $\alpha^{(k)}$.
    4. Compute the next candidate point: $\theta^{(k+1)} \leftarrow \theta^{(k)} + \alpha^{(k)} d^{(k)}$.
  • Choice of α and d determines the flavor of the algorithm.
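
The four steps above can be sketched as a generic loop. This is only an illustrative sketch: the function name `descent_direction_iteration`, the gradient-norm termination test, and the quadratic test objective are assumptions made for the example, not part of the slides.

```python
from typing import Callable, List

Vector = List[float]


def descent_direction_iteration(
    grad: Callable[[Vector], Vector],
    direction: Callable[[Vector, Vector], Vector],
    step_size: Callable[[int], float],
    theta: Vector,
    tol: float = 1e-8,
    max_iter: int = 10_000,
) -> Vector:
    """Generic descent direction loop (hypothetical helper, for illustration)."""
    for k in range(1, max_iter + 1):
        g = grad(theta)
        # Step 1: termination condition (assumed here: small gradient norm).
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        # Step 2: descent direction from local information.
        d = direction(theta, g)
        # Step 3: step size for iteration k.
        alpha = step_size(k)
        # Step 4: next candidate point theta + alpha * d.
        theta = [t + alpha * di for t, di in zip(theta, d)]
    return theta


# Plain gradient descent is the special case d = -g with a fixed alpha.
# Assumed test objective: E(theta) = theta_1^2 + 2 * theta_2^2.
theta_star = descent_direction_iteration(
    grad=lambda th: [2.0 * th[0], 4.0 * th[1]],
    direction=lambda th, g: [-gi for gi in g],
    step_size=lambda k: 0.1,
    theta=[3.0, -2.0],
)
```

Swapping in a different `direction` or `step_size` callable changes the flavor of the algorithm without touching the loop.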

Gradient descent method, II

Gradient descent method, III

  • One way to set the step size is to solve a line search:

$\alpha^{(k)} = \arg\min_{\alpha} E(\theta^{(k)} + \alpha d^{(k)})$

for example with the Brent-Dekker method.

  • Under this step size choice, it can be shown that $d^{(k+1)}$ and $d^{(k)}$ are orthogonal.
  • In practice, line search can be costly and we settle for a fixed $\alpha$, an $\alpha^{(k)}$ that decays geometrically, or an approximate line search.
  • Trade-off between speed of convergence and robustness.
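
The orthogonality of successive directions can be checked numerically. The sketch below assumes a quadratic objective $E(\theta) = \frac{1}{2}\theta^\top A \theta - b^\top \theta$, for which the line search has a closed form, so no Brent-type solver is needed. The matrix `A`, the vector `b`, and the helper names are hypothetical choices for the example.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(M: List[Vector], v: Vector) -> Vector:
    return [dot(row, v) for row in M]


# Assumed objective: E(theta) = 0.5 * theta' A theta - b' theta, A positive definite.
A = [[3.0, 1.0], [1.0, 2.0]]
b = [1.0, -1.0]


def gradient(theta: Vector) -> Vector:
    return [ai - bi for ai, bi in zip(matvec(A, theta), b)]


def exact_step(g: Vector, d: Vector) -> float:
    # Minimizer of E(theta + alpha * d) over alpha for a quadratic E.
    return -dot(g, d) / dot(d, matvec(A, d))


def geometric_step(k: int, alpha0: float = 0.1, rho: float = 0.95) -> float:
    # Cheaper alternative: a step size that decays geometrically with k.
    return alpha0 * rho ** k


theta = [2.0, 2.0]
directions = []
for k in range(5):
    g = gradient(theta)
    d = [-gi for gi in g]  # gradient descent direction
    alpha = exact_step(g, d)
    theta = [t + alpha * di for t, di in zip(theta, d)]
    directions.append(d)

# Inner products of consecutive directions; each should be zero up to rounding.
inner_products = [dot(directions[k], directions[k + 1]) for k in range(4)]
```

The zigzag implied by these right-angle turns is one reason exact line search does not, by itself, cure slow progress in narrow valleys.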

Heard in Minnesota Econ grad student lab: "If you do not know where you are going, at least go slowly."

SGD, I

  • Even with backpropagation, evaluating the gradient for the whole training set can be costly: thousands of points to evaluate!
  • Stochastic gradient descent (SGD): We use only one data point to evaluate (an approximation to) the gradient.
  • We accept a slower convergence rate in exchange for faster computation and early insight into the network's behavior.
  • Invented by Herbert Robbins and Sutton Monro: A Stochastic Approximation Method (1951).
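
A minimal sketch of the idea, assuming a one-parameter linear model and synthetic data (the data-generating process, the step size, and all names are hypothetical): each update uses the gradient of the loss at a single randomly drawn data point.

```python
import random

random.seed(0)

# Synthetic data (assumed for illustration): y = 2 * x + small Gaussian noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]


def sample_gradient(w: float, x: float, y: float) -> float:
    # Gradient with respect to w of the single-point loss 0.5 * (w * x - y)^2.
    return (w * x - y) * x


w = 0.0       # initial weight
alpha = 0.1   # fixed step size (an assumption, not a recommendation)
for step in range(5000):
    i = random.randrange(len(xs))  # draw one data point
    w -= alpha * sample_gradient(w, xs[i], ys[i])
```

Each iteration touches one observation instead of all 1000, which is the cost saving the bullets describe; the price is a noisy path for `w`.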

SGD, II

  • Intuition from randomized algorithms: replace sure convergence with almost sure convergence (think about Monte Carlo integration vs. quadrature).
  • Also, the noisy update process can allow the model to avoid local minima (implicit regularization).
  • In fact, this feature can be improved using Entropy-SGD, sharpness-aware minimization (SAM), and stochastic weight averaging (SWA).

SGD, III

  • Under suitable conditions on the step size, SGD converges almost surely to a global minimum when the objective function is convex (and to a local minimum otherwise).
  • SGD converges exponentially fast to a neighborhood of the solution and then bounces around in a "zone of confusion." See https://fa.bianp.net/blog/2021/exponential-sgd/.
  • SGD can be modeled as a Markov chain with infinite states that makes monotonic progress towards its invariant distribution.
  • In practice, we do not need a global min (≠ likelihood). Optimization is not an end in and of itself (also, there is a subtle issue of non-uniqueness when models are over-parametrized).
  • You can offload the algorithm to a graphics processing unit (GPU) or a tensor processing unit (TPU) instead of a standard CPU.
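
The "zone of confusion" can be seen in a toy problem. The sketch below assumes a fixed step size and a two-point data set, so the average loss $\frac{1}{2}(\theta - x_i)^2$ is minimized at $\theta = 0$; all names and values are hypothetical.

```python
import random

random.seed(1)

data = [-1.0, 1.0]  # minimizer of the average of 0.5 * (theta - x)^2 is 0
theta = 10.0        # start far from the solution
alpha = 0.1         # fixed step size (assumed)

path = [theta]
for k in range(300):
    x = random.choice(data)        # one data point per update
    theta -= alpha * (theta - x)   # single-point gradient step
    path.append(theta)

# Late iterates stay close to the minimizer but keep moving.
tail = path[-100:]
tail_spread = max(tail) - min(tail)
```

The distance to the solution shrinks geometrically at first, since each step multiplies the error by `1 - alpha` before adding noise. After that the iterates stay within a band around 0 and `tail_spread` does not shrink further unless `alpha` is reduced.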

SGD, IV

  • Example: https://colab.research.google.com/drive/1o0Ds4FWpo8rEfHkKn0_8wkOZ6LMejkxL?usp=sharing.
  • For a lot of practical ideas, check Stochastic Gradient Descent Tricks, at https://www.microsoft.com/en-us/research/wp-content/uploads/2012/01/tricks-2012.pdf.

Improving gradient descent

  • Gradient descent can perform poorly in narrow valleys (it may require many steps to make progress).
  • Famous example: the Rosenbrock function, $f(x, y) = (a - x)^2 + b\,(y - x^2)^2$.
  • Unfortunately, such valleys are not exotica.
  • We are often minimizing over hundreds of thousands of weights.
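
A small experiment illustrates the slow progress. The sketch below assumes the common parameter choice $a = 1$, $b = 100$ (minimum at $(1, 1)$), a fixed step size of $10^{-3}$, and the starting point $(-1.2, 1)$; none of these values come from the slides.

```python
def rosenbrock(x: float, y: float, a: float = 1.0, b: float = 100.0) -> float:
    return (a - x) ** 2 + b * (y - x * x) ** 2


def rosenbrock_grad(x: float, y: float, a: float = 1.0, b: float = 100.0):
    # Partial derivatives of (a - x)^2 + b * (y - x^2)^2.
    dx = -2.0 * (a - x) - 4.0 * b * x * (y - x * x)
    dy = 2.0 * b * (y - x * x)
    return dx, dy


x, y = -1.2, 1.0   # assumed starting point
alpha = 1e-3       # assumed fixed step size
f_start = rosenbrock(x, y)

for k in range(1000):
    gx, gy = rosenbrock_grad(x, y)
    x, y = x - alpha * gx, y - alpha * gy

f_end = rosenbrock(x, y)
# Euclidean distance from the iterate to the minimizer (1, 1).
distance_to_min = ((x - 1.0) ** 2 + (y - 1.0) ** 2) ** 0.5
```

The first few steps drop quickly into the curved valley $y \approx x^2$, but after 1000 iterations the iterate is still far from $(1, 1)$, because the gradient along the valley floor is small.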

A real example

[Figure: "Training neural networks"]

Conjugate descent method, I

  • The conjugate gradient method overcomes this problem by constructing a direction conjugate to the old gradient and to all previous directions traversed.
  • Define $g(\theta) = \nabla E(\theta)$, with the shorthand $g^{(k)} = g(\theta^{(k)})$.
  • In the first iteration, set $d^{(1)} = -g(\theta^{(1)})$ and $\theta^{(2)} = \theta^{(1)} + \alpha^{(1)} d^{(1)}$. Here, $\alpha^{(1)}$ is arbitrary.
  • Subsequent iterations set $d^{(k+1)} = -g^{(k+1)} + \beta^{(k)} d^{(k)}$.