Optimization in Deep Learning, Slides of Artificial Intelligence

University of Pennsylvania (UPenn), Artificial Intelligence

These slides cover optimization methods in deep learning, including descent direction iteration, gradient descent, stochastic gradient descent, minibatch gradient descent, and conjugate gradient descent. They also cover approaches to traditional optimization, such as random initializations, multi-starts, vanishing and exploding gradients, batch normalization, and bagging. They offer insights into the trade-offs between speed of convergence and robustness, and into the use of noisy update processes to avoid local minima. They also discuss the limitations of second-order and direct methods, as well as genetic algorithms.

Typology: Slides

2022/2023

Uploaded on 03/14/2023


Optimization in Deep Learning

Jesús Fernández-Villaverde¹ and Galo Nuño²

September 1, 2022

¹University of Pennsylvania

²Banco de España


Descent direction iteration

Most training of neural networks is done with (first-order) descent direction iteration methods.
Starting at a point $\theta^{(1)}$ (determined by domain knowledge), a descent direction algorithm generates a sequence of steps (called iterates) that converge to a local minimum.
The descent direction iteration algorithm:
1. At iteration $k$, check whether $\theta^{(k)}$ satisfies the termination condition. If so, stop; otherwise go to step 2.
2. Determine the descent direction $d^{(k)}$ using local information such as the gradient or the Hessian.
3. Compute the step size $\alpha^{(k)}$.
4. Compute the next candidate point: $\theta^{(k+1)} \leftarrow \theta^{(k)} + \alpha^{(k)} d^{(k)}$.
The choice of $\alpha$ and $d$ determines the flavor of the algorithm.
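
The four steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `descent_direction_iteration`, the gradient-norm termination rule, the choice of the negative gradient as the direction, and the quadratic test objective are all assumptions made for the example.

```python
def descent_direction_iteration(grad, theta, step_size, tol=1e-8, max_iter=10_000):
    """Generic descent direction loop (hypothetical helper).

    grad      : function returning the gradient of E at theta (a list of floats)
    step_size : function k -> alpha^(k)
    """
    for k in range(1, max_iter + 1):
        g = grad(theta)
        # Step 1: termination condition (assumed here: small gradient norm).
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        # Step 2: descent direction (assumed here: the negative gradient).
        d = [-gi for gi in g]
        # Step 3: step size.
        alpha = step_size(k)
        # Step 4: next candidate point.
        theta = [t + alpha * di for t, di in zip(theta, d)]
    return theta


# Assumed test objective: E(theta) = (theta_1 - 3)^2 + 2 (theta_2 + 1)^2.
def grad_E(theta):
    return [2.0 * (theta[0] - 3.0), 4.0 * (theta[1] + 1.0)]


theta_star = descent_direction_iteration(grad_E, [0.0, 0.0], step_size=lambda k: 0.1)
```

Swapping the rule for `d` (step 2) or the `step_size` function (step 3) changes the flavor of the algorithm without touching the loop.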

Gradient descent method, II

Gradient descent method, III

One way to set the step size is to solve a line search:

$\alpha^{(k)} = \arg\min_{\alpha} E(\theta^{(k)} + \alpha d^{(k)})$

for example, with the Brent-Dekker method.

Under this step size choice, it can be shown that $d^{(k+1)}$ and $d^{(k)}$ are orthogonal.
In practice, line search can be costly and we settle for a fixed $\alpha$, an $\alpha^{(k)}$ that decays geometrically, or an approximate line search.
Trade-off between speed of convergence and robustness.
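
The orthogonality claim follows from the first-order condition of the line search: at the optimal step size, the derivative of $E$ along $d^{(k)}$ is zero, so the new gradient (and hence the new gradient descent direction) is orthogonal to $d^{(k)}$. The sketch below checks this numerically. It is only an illustration under stated assumptions: the objective is an assumed quadratic $E(\theta) = \frac{1}{2}\theta^\top A\theta - b^\top\theta$, for which the line search has a closed form, so no Brent-Dekker solver is used; the matrix `A`, the vector `b`, and all function names are hypothetical.

```python
def matvec(A, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def steepest_descent_exact(A, b, theta, n_iter):
    """Gradient descent with exact line search on E = 0.5 θ'Aθ - b'θ.

    Returns the final point and the list of directions used.
    """
    directions = []
    for _ in range(n_iter):
        g = [ax - bi for ax, bi in zip(matvec(A, theta), b)]  # gradient Aθ - b
        d = [-gi for gi in g]
        # Closed-form minimizer of E(θ + αd) over α for a quadratic objective.
        alpha = dot(g, g) / dot(d, matvec(A, d))
        theta = [t + alpha * di for t, di in zip(theta, d)]
        directions.append(d)
    return theta, directions


A = [[4.0, 1.0], [1.0, 3.0]]  # assumed symmetric positive definite matrix
b = [1.0, 2.0]
theta, dirs = steepest_descent_exact(A, b, [2.0, 1.0], n_iter=3)
inner_products = [dot(dirs[k], dirs[k + 1]) for k in range(len(dirs) - 1)]
```

Each entry of `inner_products` is zero up to floating-point error, which is the zig-zag pattern of gradient descent under exact line search.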

Heard in the Minnesota Econ grad student lab: "If you do not know where you are going, at least go slowly."

SGD, I

Even with backpropagation, evaluating the gradient for the whole training set can be costly: thousands of points to evaluate!
Stochastic gradient descent (SGD): We use only one data point to evaluate (an approximation to) the gradient.
We accept a slower convergence rate in exchange for faster computation and early insights into the network's behavior.
Invented by Herbert Robbins and Sutton Monro: A Stochastic Approximation Method (1951).
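
A minimal sketch of the idea, assuming a one-variable linear model with squared-error loss rather than a neural network: each update uses the gradient at a single randomly drawn data point. The function name `sgd_linear_regression`, the learning rate, and the synthetic noise-free data are assumptions made for the example.

```python
import random


def sgd_linear_regression(data, lr=0.05, epochs=200, seed=0):
    """Fit y ≈ w * x + b with SGD, one data point per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for _ in range(len(data)):
            x, y = rng.choice(data)   # a single data point, drawn at random
            err = (w * x + b) - y     # residual of the prediction at that point
            # Gradient of 0.5 * err^2 with respect to (w, b) is (err * x, err).
            w -= lr * err * x
            b -= lr * err
    return w, b


# Assumed synthetic data lying exactly on the line y = 2x + 1.
data = [(k / 10, 2.0 * (k / 10) + 1.0) for k in range(-10, 11)]
w, b = sgd_linear_regression(data)
```

Each update costs one gradient evaluation instead of one per data point, which is the computational saving the slide refers to.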

SGD, II

Intuition from randomized algorithms: substitute sure convergence with almost sure convergence (think about Monte Carlo integration vs. quadrature).
Also, a noisy update process can allow the model to avoid local minima (implicit regularization).
In fact, this feature can be improved using entropy SGD, sharpness-aware minimization, and stochastic weight averaging (SWA).

SGD, III

Under suitable conditions on the step sizes, SGD converges almost surely to a global minimum when the objective function is convex (and to a local minimum otherwise).
SGD converges exponentially fast to a neighborhood of the solution and then bounces around a "zone of confusion." Check https://fa.bianp.net/blog/2021/exponential-sgd/.
SGD can be modeled as a Markov chain with infinite states that makes monotonic progress towards its invariant distribution.
In practice, we do not need a global min (≠ likelihood). Optimization is not an end in and of itself (also, there is a subtle issue of non-uniqueness when models are over-parametrized).
You can offload the algorithm to a graphics processing unit (GPU) or a tensor processing unit (TPU) instead of a standard CPU.
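
The "zone of confusion" can be seen in a toy problem. The sketch below is an illustration under stated assumptions, not an example from the slides: it runs constant-step SGD on the assumed objective $E(\theta) = \frac{1}{N}\sum_i \frac{1}{2}(\theta - x_i)^2$, whose minimizer is the sample mean, with synthetic Gaussian data and hypothetical names throughout.

```python
import random


def sgd_mean(data, theta, lr, n_steps, seed=0):
    """Constant-step SGD on E(θ) = mean of 0.5 * (θ - x_i)^2; returns all iterates."""
    rng = random.Random(seed)
    path = []
    for _ in range(n_steps):
        x = rng.choice(data)        # one data point per update
        theta -= lr * (theta - x)   # gradient of 0.5 * (θ - x)^2 is (θ - x)
        path.append(theta)
    return path


data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]  # assumed synthetic data
target = sum(data) / len(data)                          # minimizer: the sample mean

path = sgd_mean(data, theta=10.0, lr=0.1, n_steps=2000)
errors = [abs(t - target) for t in path]
first_error = errors[0]
tail_mean_error = sum(errors[-500:]) / 500
```

The error falls quickly from `first_error`, but with a constant step size `tail_mean_error` stays bounded away from zero: the iterates keep bouncing around the minimizer. Shrinking `lr` narrows that zone at the cost of slower initial progress.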

SGD, IV

Example: https://colab.research.google.com/drive/1o0Ds4FWpo8rEfHkKn0_8wkOZ6LMejkxL?usp=sharing.
Check, for a lot of practical ideas, Stochastic Gradient Descent Tricks, at https://www.microsoft.com/en-us/research/wp-content/uploads/2012/01/tricks-2012.pdf.

Improving gradient descent

Gradient descent can perform poorly in narrow valleys (it may require many steps to make progress).
Famous example: the Rosenbrock function, $f(x, y) = (a - x)^2 + b(y - x^2)^2$.
Unfortunately, such functions are not exotic.
We are often minimizing over hundreds of thousands of weights.
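
A small sketch of the narrow-valley problem on the Rosenbrock function. The parameter values $a = 1$, $b = 100$, the starting point $(-1.2, 1)$, and the fixed step size are assumptions (common choices for this test function, not values from the slides); the global minimum is then at $(1, 1)$.

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def rosenbrock_grad(x, y, a=1.0, b=100.0):
    dx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    dy = 2.0 * b * (y - x ** 2)
    return dx, dy


def gradient_descent(x, y, lr=1e-3, n_steps=1000):
    """Plain gradient descent with a fixed step size (assumed value)."""
    for _ in range(n_steps):
        dx, dy = rosenbrock_grad(x, y)
        x, y = x - lr * dx, y - lr * dy
    return x, y


x0, y0 = -1.2, 1.0                      # assumed starting point
x1, y1 = gradient_descent(x0, y0)       # 1,000 gradient steps
f_start = rosenbrock(x0, y0)
f_end = rosenbrock(x1, y1)
distance_to_min = ((x1 - 1.0) ** 2 + (y1 - 1.0) ** 2) ** 0.5
```

The objective drops quickly as the iterates fall into the valley, yet after 1,000 steps the point is still far from $(1, 1)$: progress along the valley floor is slow because the step size must stay small enough to remain stable across the steep valley walls.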

A real example

[Figure: Training neural networks]

Conjugate descent method, I

The conjugate gradient method overcomes this problem by constructing a direction conjugate to the old gradient, and to all previous directions traversed.
Define $g(\theta) = \nabla E(\theta)$.
In the first iteration, set $d^{(1)} = -g(\theta^{(1)})$ and $\theta^{(2)} = \theta^{(1)} + \alpha^{(1)} d^{(1)}$. Here, $\alpha^{(1)}$ is arbitrary.
Subsequent iterations set $d^{(k+1)} = -g^{(k+1)} + \beta^{(k)} d^{(k)}$, where $g^{(k+1)} = g(\theta^{(k+1)})$.
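
The update can be sketched on a small quadratic problem. This is an illustration under stated assumptions: the text above does not define $\beta^{(k)}$, so the sketch uses the Fletcher-Reeves choice $\beta^{(k)} = \|g^{(k+1)}\|^2 / \|g^{(k)}\|^2$ as one common option; every step size, including the first, comes from an exact line search on the assumed objective $E(\theta) = \frac{1}{2}\theta^\top A\theta - b^\top\theta$; and `A`, `b`, and the function names are hypothetical.

```python
def matvec(A, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, theta, n_iter):
    """Conjugate gradient on E = 0.5 θ'Aθ - b'θ (Fletcher-Reeves beta, assumed)."""
    g = [ax - bi for ax, bi in zip(matvec(A, theta), b)]  # g(θ) = Aθ - b
    d = [-gi for gi in g]                                 # first direction: -g
    for _ in range(n_iter):
        if dot(g, g) < 1e-30:                             # already at the minimum
            break
        Ad = matvec(A, d)
        alpha = -dot(g, d) / dot(d, Ad)                   # exact line search along d
        theta = [t + alpha * di for t, di in zip(theta, d)]
        g_new = [gi + alpha * adi for gi, adi in zip(g, Ad)]
        beta = dot(g_new, g_new) / dot(g, g)              # Fletcher-Reeves (assumed)
        d = [-gn + beta * di for gn, di in zip(g_new, d)] # d^(k+1) = -g^(k+1) + β d^(k)
        g = g_new
    return theta


A = [[4.0, 1.0], [1.0, 3.0]]  # assumed symmetric positive definite matrix
b = [1.0, 2.0]
theta = conjugate_gradient(A, b, [2.0, 1.0], n_iter=2)
```

For this two-dimensional quadratic, two iterations reach the minimizer $(1/11, 7/11)$ up to floating-point error, whereas plain gradient descent would keep zig-zagging.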