Advanced Machine Learning Lecture Slides Weeks 1-3, Lecture notes of Machine Learning

Lecture slides for the Advanced Machine Learning course taught by John Cunningham at Columbia University. The course covers topics such as computer vision, reinforcement learning, natural language processing, and neural networks. The slides also cover administrative reminders and the catalysts of data, computational power, and software, and give an overview of the course content and the challenges associated with deep learning.

Typology: Lecture notes

2021/2022

Uploaded on 05/11/2023

STAT GR5242: Advanced Machine Learning
Lecture slides: Weeks 1-3
John Cunningham
Department of Statistics
Columbia University


ADMINISTRATIVE REMINDERS

Welcome! Let’s discuss the syllabus...

BUT WHAT ABOUT ALL THE AI HYPE?

Modern AI/ML is the same recipe

  • Gather data, choose $F = \{f_\theta : \theta \in \Theta\}$, specify loss, minimize empirical risk
  • All the same potential issues exist (wrong $F$, under/overfitting, optimization issues, ...)
  • The same statistical and computational thinking is necessary
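The recipe above can be sketched end to end on a toy problem. This is a minimal illustration, not code from the course: the data, the family of lines $f_\theta(x) = ax + b$, the squared loss, the learning rate, and all function names are assumptions chosen for the example.

```python
# Step 1: gather data. Toy data from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


# Step 2: choose a function family F = {f_theta}. Here: lines a*x + b.
def f(theta, x):
    a, b = theta
    return a * x + b


# Step 3: specify a loss. Here: squared error, averaged into the empirical risk.
def empirical_risk(theta):
    return sum((f(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(theta):
    """Gradient of the empirical risk with respect to (a, b)."""
    n = len(xs)
    grad_a = sum(2.0 * (f(theta, x) - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2.0 * (f(theta, x) - y) for x, y in zip(xs, ys)) / n
    return grad_a, grad_b


# Step 4: minimize the empirical risk. Here: plain gradient descent.
def minimize(theta=(0.0, 0.0), lr=0.05, steps=2000):
    for _ in range(steps):
        grad_a, grad_b = gradient(theta)
        theta = (theta[0] - lr * grad_a, theta[1] - lr * grad_b)
    return theta


theta_hat = minimize()
```

Swapping in a richer family $F$ (such as a neural network) changes step 2 only; the other steps keep the same shape.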

The four catalysts of the AI explosion

  1. Large and readily available datasets
  2. Massive and cheap computational power
  3. Flexible and general function families F
  4. Open-source ML software libraries with powerful abstractions

We will study some neural network families F. While neural networks are powerful, there is nothing magical about them and nothing fundamentally different from what you already know.

CATALYST 1: DATA

Computer Vision

SVHN, CIFAR10, ImageNet, ...

Reinforcement Learning

OpenAI Breakout, OpenAI Cartpole, UCB Pacman, ...

Natural Language Processing

Wikipedia (English), Twitter, Jeopardy, ...

And so much more...

  • https://www.data.gov/
  • https://opendata.cityofnewyork.us/
  • https://github.com/caesar0301/awesome-public-datasets
  • ...

CATALYST 3: NEURAL NETWORKS

[Figure: a single neural unit, and a network with an input layer, hidden layer 1, hidden layer 2, and an output layer]

Neural unit: $x^1_j = \sigma\big(\sum_i w^0_{ij} x^0_i + b^1_j\big)$

With enough layers and enough units per layer, the network is a universal function approximator: essentially any (e.g. continuous) function can be approximated arbitrarily well (given enough data...).

  • Inputs $x^0_i$ enter into unit $j$, weighted by edges $w^0_{ij}$, and are summed with bias $b^1_j$
  • $\sigma(\cdot)$ provides elementwise nonlinearity
  • The result $x^1_j$ is transmitted to layer 2, the next layer
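The three steps above (weight, sum with bias, apply the nonlinearity) can be sketched in plain Python. This is an illustrative sketch only: the function names, the choice of the sigmoid for $\sigma$, and the example numbers are assumptions, and `W[i][j]` is used here to mirror the slide's $w^0_{ij}$ (edge from input $i$ to unit $j$).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def unit(x_prev, w_col, b, sigma=sigmoid):
    """One neural unit: sigma(sum_i w_i * x_i + b)."""
    return sigma(sum(w * x for w, x in zip(w_col, x_prev)) + b)


def layer(x_prev, W, b, sigma=sigmoid):
    """All units of one layer. W[i][j] is the weight from input i to unit j."""
    return [
        unit(x_prev, [row[j] for row in W], b[j], sigma)
        for j in range(len(b))
    ]


# Hypothetical numbers: three inputs feeding two units.
x0 = [1.0, -2.0, 0.5]
W0 = [[0.0, 1.0],
      [0.0, 0.0],
      [0.0, 2.0]]
b1 = [0.0, 0.0]

x1 = layer(x0, W0, b1)  # this output would be passed on to the next layer
```

Unit 0 has all-zero weights, so it outputs `sigmoid(0) = 0.5`; unit 1 sees the pre-activation `1*1.0 + 2*0.5 = 2.0`.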

Learning/Training is then minimizing an empirical risk over the parameter set

$\theta = \{w^\ell_{ij}, b^\ell_j\}_{i,j,\ell} = \{W^\ell, b^\ell\}_\ell$

EXAMPLE: LOGISTIC REGRESSION → NEURAL NETWORKS

Logistic Regression

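[Diagram: input $x$, parameters $W$, $b$, output $f_\theta(x)$]

$f_\theta(x) = \sigma(Wx + b)$

Neural Network

[Diagram: input $x$ passed through two stacked layers with parameters $W_1, b_1$ and $W_2, b_2$]

$f^{(1)}_\theta(x) = \sigma(W_1 x + b_1)$

$f^{(2)}_\theta(x) = \sigma(W_2 f^{(1)}(x) + b_2)$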
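To make the step from logistic regression to a neural network concrete, here is a small sketch in which both are built from the same building block $x \mapsto \sigma(Wx + b)$. The helper name `dense` and all weight values are hypothetical; `W` is stored as a list of rows, one row per output unit, to match the matrix-vector form $Wx$.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(W, b):
    """Return the map x -> sigma(Wx + b), with sigma applied elementwise."""
    def f(x):
        return [
            sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)
        ]
    return f


# Logistic regression: a single map sigma(Wx + b) with one output.
logistic = dense(W=[[1.0, -1.0]], b=[0.0])

# Neural network: the same kind of map, stacked twice.
f1 = dense(W=[[1.0, -1.0], [0.5, 0.5]], b=[0.0, -1.0])  # hidden layer
f2 = dense(W=[[2.0, -2.0]], b=[0.0])                    # output layer


def network(x):
    return f2(f1(x))
```

In this reading, logistic regression is the special case of a network with no hidden layer.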

...DEEP LEARNING IS HARD

  • How do I choose the width of $f^{(1)}$, the number of units in the hidden layers?
  • How do I choose $L$, the number of layers?
  • How do I choose the activation function $\sigma(\cdot)$?
    • sigmoid: $\frac{1}{1 + e^{-x}}$
    • tanh: $\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
    • relu: $\max(0, x)$
    • softplus: $\log(1 + e^{x})$
    • softmax: $\frac{e^{x_i}}{\sum_k e^{x_k}}$
    • ...
  • Are there other choices to make?
  • What about overfitting?
  • Will my optimizer converge?
  • Is my problem solvable with a particular architecture $F$?
  • Can my data be fit by a particular architecture $F$? (e.g. MNIST vs. SVHN)

Deep learning requires engineering skill, statistical thinking, and thoughtful empiricism.
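For reference, the activation functions named above can be written out directly. This is a plain-Python sketch using the standard definitions; subtracting the maximum inside `softmax` is a common numerical-stability choice and an addition here, not something stated on the slide.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


def relu(x):
    return max(0.0, x)


def softplus(x):
    return math.log(1.0 + math.exp(x))


def softmax(xs):
    """Maps a vector to positive entries that sum to one."""
    m = max(xs)  # shifting by the max leaves the result unchanged
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Note that softmax acts on a whole vector, while the other four act on each coordinate separately.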

CATALYST 4: SOFTWARE

Machine Learning libraries have abstracted {math, stats, optimization, ...} → engineering

...

Under the hood are several essential elements to understand:

  • Neural networks in detail (sounds obvious, but we’ll spend some time here...)
  • Automatic differentiation
  • Stochastic optimization (much more to come here also...)

To understand modern ML, we need to understand why these work... and when they don’t.

ADMINISTRATIVE REMINDERS

  • Slides and syllabus on courseworks (and Assignment 1 soon)
  • A few comments about textbooks:
    • There is no textbook for this course... for a good reason.
    • When there is a relevant background reading or survey/review, I will note it in class.
    • Mathematics for Machine Learning (A. Aldo Faisal, Cheng Soon Ong, and Marc Peter Deisenroth)
    • Probabilistic Machine Learning (Kevin P. Murphy)
    • Deep Learning (Aaron Courville, Yoshua Bengio, Ian Goodfellow)
    • Pattern Recognition and Machine Learning (Christopher Bishop)
  • Ask questions in class. Don’t wait until after class: that divides the impact of the question by 100.
  • Also, so you don’t think I’m just making stuff up, a DALL-E sample:

A MOST IMPORTANT REMINDER

A neural network represents a function $f_\theta : \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$.

READING NEURAL NETWORKS

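$f : \mathbb{R}^3 \to \mathbb{R}^3$ with input $x = (x_1, x_2, x_3)^\top$

[Figure: inputs $x_1, x_2, x_3$ connected by edges with weights $w_{11}, \dots, w_{33}$ to units $\phi_1, \phi_2, \phi_3$, which output $f_1(x) = \phi_1(\langle w_1, x \rangle)$, $f_2(x) = \phi_2(\langle w_2, x \rangle)$, $f_3(x) = \phi_3(\langle w_3, x \rangle)$]

$f(x) = (f_1(x), f_2(x), f_3(x))^\top$ with $f_i(x) = \phi_i\big(\sum_{j=1}^{3} w_{ji} x_j\big)$

(recall inner product $\langle w_i, x \rangle = w_i^\top x = \sum_j w_{ji} x_j$)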
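Reading the diagram as a formula, each output coordinate is one inner product followed by that unit's own function $\phi_i$. A minimal sketch, assuming `w[j][i]` holds the weight on the edge from input $j$ to unit $i$ (matching $w_{ji}$ above); the particular weights and the choices of $\phi_i$ are hypothetical.

```python
import math


def forward(x, w, phis):
    """f_i(x) = phi_i(sum_j w[j][i] * x[j]) for each unit i."""
    return [
        phi(sum(w[j][i] * x[j] for j in range(len(x))))
        for i, phi in enumerate(phis)
    ]


x = [1.0, 2.0, 3.0]

# Column i of w is the weight vector w_i of unit i.
w = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0],
     [0.0, 0.0, -1.0]]

# Each unit may use a different function phi_i.
phis = [
    lambda z: z,            # identity
    lambda z: max(0.0, z),  # relu
    math.tanh,              # tanh
]

y = forward(x, w, phis)
```

Here the three inner products are 1, 2, and 0, so `y` is `[1.0, 2.0, 0.0]`.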

FEED-FORWARD NETWORKS

A feed-forward network is a neural network whose units can be arranged into groups $L_1, \dots, L_K$ so that connections (arrows) only pass from units in group $L_k$ to units in group $L_{k+1}$. The groups are called layers. In a feed-forward network:

  • There are no connections within a layer.
  • There are no backwards connections.
  • There are no connections that skip layers, e.g. from $L_k$ to units in group $L_{k+2}$ (but see Huang...Weinberger 2017 CVPR).

[Figure: three example networks over layers $L_1$, $L_2$, $L_3$, labeled "feed-forward", "not feed-forward", and "not feed-forward (but still useful...)"]
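The definition can be checked mechanically for a given grouping of units into layers. The sketch below is an illustration under assumptions: the graph encoding (a dict from unit name to layer index, plus a list of directed edges) and the function name are made up for the example, and it only tests one proposed grouping rather than searching for a valid one.

```python
def is_feed_forward(layer_of, edges):
    """True iff every edge goes from some layer k to layer k + 1.

    layer_of: dict mapping unit name -> layer index k
    edges:    list of (source_unit, target_unit) pairs
    """
    return all(layer_of[dst] - layer_of[src] == 1 for src, dst in edges)


# Hypothetical network: units a, b in L1, unit c in L2, unit d in L3.
layer_of = {"a": 1, "b": 1, "c": 2, "d": 3}
feed_forward_edges = [("a", "c"), ("b", "c"), ("c", "d")]

within_layer = feed_forward_edges + [("a", "b")]  # connection inside L1
backwards = feed_forward_edges + [("d", "c")]     # connection from L3 to L2
skipping = feed_forward_edges + [("a", "d")]      # connection from L1 to L3
```

Each of the three modified edge lists violates exactly one of the bullet points above.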

LAYERS

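[Figure: two nodes at the top connected by edges with weights $w^1_{11}, w^1_{12}, w^1_{21}, w^1_{22}$ to units $\phi^1_1, \phi^1_2$, which together form the layer $f^{(2)}$]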

  • Each layer represents a function, which takes the output values of the previous layers as its arguments.
  • Suppose the output values of the two nodes at the top are $y_1, y_2$.
  • Then the second layer defines the (two-dimensional) function

$f^{(2)}(y) = \big(\phi^1_1(\langle w^1_1, y \rangle),\ \phi^1_2(\langle w^1_2, y \rangle)\big)^\top$

COMPOSITION OF FUNCTIONS

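Basic composition. Suppose $f$ and $g$ are two functions $\mathbb{R} \to \mathbb{R}$. Their composition $g \circ f$ is the function

$(g \circ f)(x) := g(f(x))$.

For example: $f(x) = x + 1$ and $g(y) = y^2$ give $(g \circ f)(x) = (x + 1)^2$. We could combine the same functions the other way around:

$(f \circ g)(x) = x^2 + 1$

In multiple dimensions. Suppose $f : \mathbb{R}^{d_1} \to \mathbb{R}^{d_2}$ and $g : \mathbb{R}^{d_2} \to \mathbb{R}^{d_3}$. Then

$(g \circ f)(x) = g(f(x))$ is a function $\mathbb{R}^{d_1} \to \mathbb{R}^{d_3}$.

For example: $f(x) = \langle x, v \rangle - c$ and $g(y) = \operatorname{sgn}(y)$ give $(g \circ f)(x) = \operatorname{sgn}(\langle x, v \rangle - c)$.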
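Both examples translate directly into code. In this sketch the helper names `compose`, `affine`, and `classifier`, the values of `v` and `c`, and the convention `sgn(0) = 0` are assumptions made for illustration.

```python
def compose(g, f):
    """Return g ∘ f, the function x -> g(f(x))."""
    return lambda x: g(f(x))


# One-dimensional example: the order of composition matters.
f = lambda x: x + 1
g = lambda y: y ** 2
g_of_f = compose(g, f)  # x -> (x + 1)^2
f_of_g = compose(f, g)  # x -> x^2 + 1


# Multi-dimensional example: R^2 -> R -> {-1, 0, 1}.
def sgn(y):
    return (y > 0) - (y < 0)


def affine(v, c):
    """Return the map x -> <x, v> - c."""
    return lambda x: sum(xi * vi for xi, vi in zip(x, v)) - c


classifier = compose(sgn, affine(v=[1.0, -1.0], c=0.5))
```

For instance, `g_of_f(2)` is 9 while `f_of_g(2)` is 5, and `classifier([2.0, 0.0])` is 1 because $\langle x, v \rangle - c = 1.5 > 0$.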