Introduction to Linear Regression and Machine Learning Methods, Notes for Introduction to Machine Learning

Machine Learning notes, Politecnico di Milano

Type: Notes

2019/2020

Uploaded on 15/05/2020

jorge-santisteban-1


Summary – Machine Learning exercises

V 1.0

Index

  • Linear Regression
    • Model results analysis
    • Normal vs Lasso vs Ridge
    • Model Selection
    • Gradient Optimization (update parameters)
  • Classification
    • Accuracy metrics
    • Gradient descent
    • Perceptron
    • Quantitative and Qualitative features
    • Parametric vs non parametric models
  • Bias-Variance dilemma
    • Dataset splitting
    • Model selection
    • No free lunch theorem
    • VC Dimension
    • PAC Learning
  • Feature selection and Kernel Methods
    • Kernel methods
      • SVM
    • PCA
  • Markov Decision Process
    • Prediction and Control problems
      • Prediction
      • Control
    • Reinforcement Learning
      • Components of learning agent
        • Policy
        • Value function
          • Discount factor
        • Model
  • Reinforcement Learning
    • Temporal Difference
      • SARSA
      • Q Learning
      • MC first visit vs every visit
  • Multi-Armed Bandit
  • EXAMS
    • 23-01-2019
    • 22-01-18
    • 19-06-2018

Linear Regression

  • Linear regression extends to non-linear problems through non-linear transformations (basis functions) of the inputs: $M(w) = \sum_{i=1}^{M} \phi_i w_i + \phi_0$
  • If data is noisy, simpler models can outperform complex ones, because complex models also learn the noise during training. In these cases regularization methods can help simplify the model (thus reducing the variance at the expense of the bias).
  • Simpler models are easier to interpret, so they can help understand more complex models and give suggestions about them.
  • How fast a linear model is computed depends largely on the cost function we want to minimize and on whether it is solved in closed form (if available) or approximately.
  • The LS (least squares) method is viable if the number of parameters is small (matrix inversion is costly). Only differentiable cost functions are compatible.
  • A 95% confidence interval contains the true parameter 95% of the time: if the linear regression were repeated on newly sampled data, the interval computed each time would contain the true parameter in 95% of the repetitions.
  • Linearly dependent inputs in linear models produce:
    ◦ a badly conditioned design matrix
    ◦ high variance, since an infinite number of parameter combinations minimize the cost function

    This can be partially solved by Ridge Regression.

Model results analysis

  • P-value << 0.05 → significant feature (this also applies to the intercept, i.e. the bias)
  • F-statistic with a low p-value (~ 0) → reject the null hypothesis (all weights = 0): the constant model is worse than the fitted one, which means that at least one weight is significant.

$RSS(w) = \sum_n (t_n^{real} - t_n^{predicted})^2$

$TSS(w) = \sum_n (t_n^{real} - t^{mean})^2$ (variance in the data)

$ESS(w) = \sum_n (t_n^{predicted} - t^{mean})^2$ (variance explained by the model with respect to the data)

$R^2 = 1 - \frac{RSS}{TSS}$ (variance explained by the model: 1 explained, 0 not explained)

$RMSE = \sqrt{\frac{RSS}{N}}$ (average error of the model on new data, in the same unit of measure as the data)

  • $dfe = N - M$, $\text{adj } R^2 = 1 - (1 - R^2) \frac{N - 1}{dfe}$

If the error is high, the weights are large and p-value << 0.05 → the dataset has probably not been normalized.

Ideal number of observations: #samples ≥ 10 · #parameters
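The quantities in this section can be computed directly from the targets and the predictions. The following is a minimal sketch in plain Python; the function name `regression_metrics` and the sample numbers are hypothetical, and the adjusted R² is assumed to follow the common definition 1 − (1 − R²)(N − 1)/dfe with dfe = N − M.

```python
import math

def regression_metrics(t_real, t_pred, n_params):
    """Return RSS, TSS, ESS, R^2, RMSE and adjusted R^2 for one fitted model."""
    n = len(t_real)
    t_mean = sum(t_real) / n
    rss = sum((t - p) ** 2 for t, p in zip(t_real, t_pred))  # residual sum of squares
    tss = sum((t - t_mean) ** 2 for t in t_real)              # total sum of squares
    ess = sum((p - t_mean) ** 2 for p in t_pred)              # explained sum of squares
    r2 = 1 - rss / tss
    rmse = math.sqrt(rss / n)
    dfe = n - n_params                                        # degrees of freedom of the error
    adj_r2 = 1 - (1 - r2) * (n - 1) / dfe                     # assumed definition, see lead-in
    return {"RSS": rss, "TSS": tss, "ESS": ess,
            "R2": r2, "RMSE": rmse, "adj_R2": adj_r2}

# Hypothetical targets and predictions, for illustration only.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], n_params=2)
print(metrics)
```

The adjusted R² is always at most R², and the gap grows with the number of parameters, which is why it is the safer figure when comparing models of different size.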

Normal vs Lasso vs Ridge

Lasso and Ridge regression penalize the weights of linear models in order to reduce overfitting, thus improving the generalization of the model (by smoothing the model curve). Lasso can also be considered a feature selection method, as it sets some of the weights to zero.

| Lasso | Ridge |
|---|---|
| Zeroes some weights | Reduces weights |
| Feature selection | No feature selection |
| Simpler models (more bias, less variance), also easier to interpret | Slightly simpler models |

Formula:

$L(w) = \underbrace{\sum_{i=1}^{N} \left( t_i - w^T \phi(x_i) \right)^2}_{RSS} + \text{penalty}$, where the penalty is $\lambda \|w\|_2^2$ for Ridge and $\lambda \|w\|_1$ for Lasso.

  • λ = 0: no penalization
  • λ → ∞: strong penalization

Given the optimal λ for Ridge, as λ increases from 0:

  • RSS on training set: increases, as the model becomes less complex and overfits less
  • RSS on test set: decreases until the optimal λ value is reached, then increases
  • variance: decreases as we use simpler models
  • bias: increases as we use simpler models
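The different effect of the two penalties is easiest to see with a single weight and no intercept, where both problems have a closed-form minimizer. The sketch below is an illustration under that simplifying assumption; the function names and the data are hypothetical.

```python
def ridge_1d(x, t, lam):
    """Minimizer of sum((t_i - w*x_i)^2) + lam * w^2 for a single weight w."""
    sxt = sum(xi * ti for xi, ti in zip(x, t))
    sxx = sum(xi * xi for xi in x)
    return sxt / (sxx + lam)

def lasso_1d(x, t, lam):
    """Minimizer of sum((t_i - w*x_i)^2) + lam * |w| (soft thresholding)."""
    sxt = sum(xi * ti for xi, ti in zip(x, t))
    sxx = sum(xi * xi for xi in x)
    shrunk = max(abs(sxt) - lam / 2, 0.0)  # the weight becomes exactly 0 once lam is large enough
    return (shrunk if sxt >= 0 else -shrunk) / sxx

x = [1.0, 2.0, 3.0]
t = [0.5, 1.0, 1.5]  # hypothetical data generated with weight 0.5
for lam in (0.0, 5.0, 20.0):
    print(lam, ridge_1d(x, t, lam), lasso_1d(x, t, lam))
```

With λ = 0 both return the least squares weight. As λ grows, Ridge keeps shrinking the weight without ever reaching zero, while Lasso sets it exactly to zero, which is the feature selection behaviour described above.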

Model Selection

Given the plot of the RSS with respect to 2 model parameters, the best model is the one that lies at the lowest point, in the center of the plot.

$sgn(x) = \begin{cases} -1 & x < 0 \\ 0 & x = 0 \\ 1 & x > 0 \end{cases}$

… are perfectly separable.

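  • Logistic Regression
    ◦ Hp space: $y(x_n) = \sigma(w^T x_n)$, with $\sigma(x) = \frac{1}{1 + e^{-x}}$
    ◦ Loss measure: $L(w) = -\sum_{n=1}^{N} \left[ C_n \ln y_n + (1 - C_n) \ln(1 - y_n) \right]$
    ◦ Optimization method: gradient descent
  • Naive Bayes
    ◦ Hp space: $y(x_n) = \arg\max_k \; p(C_k) \prod_j N(x_j \mid \mu_{jk}, \sigma_{jk}^2)$
    ◦ Loss measure: log likelihood
    ◦ Optimization method: Maximum Likelihood Estimation
  • KNN
    ◦ Non-parametric: majority voting among the K nearest points (Euclidean distance)
    ◦ The larger K, the smoother the result (a sort of regularization). K can only be an integer.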

2+ classes scenario:

  • Perceptron: create a binary classifier for each class and select the class whose classifier returns the highest score
  • Logistic regression: softmax logistic regression, $y_k(\phi) = \frac{e^{w_k^T \phi}}{\sum_j e^{w_j^T \phi}}$ (only class k at the numerator)
  • Naive Bayes: no modification, as it can already handle 2+ classes by modeling the posterior probability for each class.
  • KNN: no modification, as it can already handle 2+ classes by performing a majority voting. It is important to decide how to break ties.
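As an illustration of the softmax used for multi-class logistic regression, here is a minimal sketch in plain Python. The scores stand for the values $w_k^T \phi$ of each class and are hypothetical; subtracting the maximum score is a standard numerical precaution that does not change the result.

```python
import math

def softmax(scores):
    """Turn the per-class scores w_k^T phi into probabilities that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])           # hypothetical scores for 3 classes
predicted_class = probs.index(max(probs))  # class with the highest probability
print(probs, predicted_class)
```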

Gradient descent

  • Stochastic: update the weights iteratively, one sample at a time: $w^{(k+1)} = w^{(k)} - \alpha^{(k)} (x_u^T w^{(k)} - t_u)\, x_u$ (executed N times, once per sample)
  • Batch: sum the cost of all samples to update the weights: $w^{(k+1)} = w^{(k)} - \frac{\alpha^{(k)}}{N} \sum_{u=1}^{N} (x_u^T w^{(k)} - t_u)\, x_u$
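The two variants can be sketched for the least squares cost as follows. This is a minimal illustration in plain Python: the function names, the data and the learning rates are hypothetical choices, picked small enough for both procedures to converge on this toy problem.

```python
def batch_step(w, xs, ts, alpha):
    """One batch update: average the gradient over all N samples, then move once."""
    n = len(xs)
    grad = [0.0] * len(w)
    for x, t in zip(xs, ts):
        err = sum(wi * xi for wi, xi in zip(w, x)) - t
        grad = [g + err * xi / n for g, xi in zip(grad, x)]
    return [wi - alpha * g for wi, g in zip(w, grad)]

def stochastic_epoch(w, xs, ts, alpha):
    """N stochastic updates: the weights move after every single sample."""
    for x, t in zip(xs, ts):
        err = sum(wi * xi for wi, xi in zip(w, x)) - t
        w = [wi - alpha * err * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical data generated by t = 1 + 2*x; each input is [1, x] (bias term first).
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
ts = [1.0, 3.0, 5.0, 7.0]

w_batch, w_sgd = [0.0, 0.0], [0.0, 0.0]
for _ in range(2000):
    w_batch = batch_step(w_batch, xs, ts, alpha=0.1)
    w_sgd = stochastic_epoch(w_sgd, xs, ts, alpha=0.05)
print(w_batch, w_sgd)
```

Both runs approach the weights [1, 2] because the toy data is noiseless. On real data the stochastic version would normally shuffle the samples at every epoch.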

Perceptron

  • Shuffling is important to reduce the bias with respect to the order in which training data is presented
  • Error is guaranteed not to increase through gradient descent
  • Usually there are multiple (possibly infinitely many) solutions with the same loss
  • A low learning rate is less likely to miss local minima, but it might take a long time to converge

Quantitative and Qualitative features

  • Quantitative: have an intrinsic order
  • Qualitative: have no intrinsic order, but can be converted to numeric features (example: one-hot encoding)

Parametric vs non parametric models

Parametric models base the prediction solely on a function defined by its parameters (learned during a training phase), while non-parametric models let the data define the prediction function. Non-parametric functions can be approximated by a model with infinitely many parameters.

  • Big dataset → parametric: the data is summarized by a parametric function, while a non-parametric method would need to load the whole dataset.
  • Embedded system → non-parametric if the training phase would have to be performed on the device, because training is costly. Parametric if the model is trained first on another machine and then deployed on the embedded device: this is even better, because prediction with parametric models is less costly.
  • Prior information on data distribution → parametric: prior information is easier to include in the model than directly in the data.
  • Learning in a real-time scenario → non-parametric, because it does not require a training phase, which would affect real-time performance. Parametric if an effective online learning method is implemented.

Bias-Variance dilemma

Dataset splitting

  • training 50%: learn model parameters
  • validation 25%: model selection
  • test 25%: test model performance

OR

  • cross validation: split the dataset in k parts, train on k-1 and validate on the remaining one, for each of the k parts (LOO: cross validation with k = #samples)
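The cross validation split described above can be sketched as a small index generator. This is a minimal illustration in plain Python with a hypothetical function name; it keeps the samples in their original order, whereas in practice the indices are usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation exactly once."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample so that all samples are covered.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
loo = list(k_fold_indices(n_samples=10, k=10))  # leave-one-out: k = #samples
print(folds[0])
```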

Model selection

$AIC = 2k - 2\ln(L)$, where $k$ is the number of parameters and $L$ is the maximum of the likelihood. It penalizes the number of parameters with respect to the error (penalizes complex models). The lower the better.
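A short sketch of how AIC can be used to compare two candidate models. The candidate names, parameter counts and log-likelihood values are hypothetical and only serve to show the trade-off between fit and complexity.

```python
def aic(n_params, log_likelihood):
    """AIC = 2k - 2 ln(L), where log_likelihood is ln(L); the lower the better."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical candidates: (number of parameters, maximized log-likelihood).
candidates = {"simple": (3, -120.0), "complex": (10, -118.5)}
scores = {name: aic(k, ll) for name, (k, ll) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Here the complex model fits slightly better, but not enough to compensate for its seven extra parameters, so the simple model has the lower AIC.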

  • d = 2, 2 points: H shatters 2 points.
  • d = 2, 3 points: H shatters 3 points (at least …). [Figures: one configuration of 3 points that works and one that doesn’t work]
  • d = 2, 4 points: H does NOT shatter 4 points. There is no way to find a dataset of 4 points which can be shattered by a line.

Hence the VC dimension of a line is 3.

PAC Learning

Estimates how many examples are necessary to avoid overfitting.

$P(\exists h \in H, L_{true}(h) > \epsilon) \le |H| e^{-N\epsilon} = \delta \;\rightarrow\; \epsilon \ge \frac{1}{N} \left( \ln|H| + \ln(1/\delta) \right)$

To halve the error, double the samples.
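The bound can be turned into two small helper functions, one for the error guaranteed by a given number of samples and one for the number of samples needed for a given error. This is a sketch for a finite hypothesis space; the function names and the numbers in the usage lines are hypothetical.

```python
import math

def pac_error_bound(h_size, n_samples, delta):
    """Error epsilon obtained from |H| * exp(-N * epsilon) = delta."""
    return (math.log(h_size) + math.log(1 / delta)) / n_samples

def pac_samples_needed(h_size, epsilon, delta):
    """Smallest integer N such that |H| * exp(-N * epsilon) <= delta."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

eps_100 = pac_error_bound(h_size=1000, n_samples=100, delta=0.05)
eps_200 = pac_error_bound(h_size=1000, n_samples=200, delta=0.05)
n_needed = pac_samples_needed(h_size=1000, epsilon=0.1, delta=0.05)
print(eps_100, eps_200, n_needed)
```

Doubling the samples from 100 to 200 halves the bound on the error, as stated above.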

Feature selection and Kernel Methods

  • SVM
    ◦ Hp space: $y(x_n) = sgn(w^T x_n + b)$
    ◦ Loss measure: $\|w\|^2 + C \sum_n \zeta_n$, subject to $t_n (w^T x_n + b) \ge 1 - \zeta_n \;\; \forall n$
    ◦ Optimization method: quadratic optimization
    ◦ C: bias-variance trade-off
    ◦ ζ: slack variable (allows violating the constraint, at a cost)

  • Feature selection is not simple and usually requires a priori knowledge.
  • Adding features that are not useful increases the variance of the model without reducing its bias.
  • Adding features increases prediction time (less so with kernel methods) and increases training time (only for supervised methods).
  • It is not important to know exactly which features are needed, as kernel methods can project the features into a higher-dimensional space.

Kernel methods

Mercer’s theorem: any continuous, symmetric, positive semi-definite kernel function k(x,y) can be expressed as a dot product in a high-dimensional space.
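A small worked check of this statement for one specific kernel: the homogeneous polynomial kernel of degree 2 on 2-D inputs, $k(x, y) = (x \cdot y)^2$, equals the dot product of the explicit features $(x_1^2, \sqrt{2} x_1 x_2, x_2^2)$. The kernel choice, the function names and the input vectors are illustrative assumptions.

```python
import math

def poly_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2: k(x, y) = (x . y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def feature_map(x):
    """Explicit feature map for 2-D inputs such that k(x, y) = phi(x) . phi(y)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

x, y = [1.0, 2.0], [3.0, -1.0]  # hypothetical inputs
k_value = poly_kernel(x, y)
dot_value = sum(a * b for a, b in zip(feature_map(x), feature_map(y)))
print(k_value, dot_value)
```

The kernel gives the same number without ever building the higher-dimensional features, which is the point of the kernel trick.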

Kernels can be defined over any data for which a similarity function can be measured (colors → binary code, graphs → similarity metric).

  • It is better to have a large disk to store samples, since kernel methods need the data to perform predictions; computations are not as intensive thanks to the kernel trick.
  • If a mapping to a space where the data is linearly separable is known, use kernels when that space has high/infinite dimension, otherwise transform the inputs.

a) Gaussian RBF kernel
b) exploit polar coordinates by transforming the data (better than a kernel if it’s a simple case)
c) linearly separable
d) structured data, but not easy to find the correlation → resort to kernels and “hope”

Markov Decision Process

Used for:

  • Prediction : given a policy and a specific starting state, estimate reward
  • Control : learn optimal policy to max reward

Prediction and Control problems

Prediction

Goal: given the actions for each state, compute the value of the state (and best action).

Can either use:

  • Bellman expectation equation (closed form): $V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$ (needs a matrix inversion)
  • Bellman expectation equation (recursive): $V^\pi(s) = \sum_{a \in A} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^\pi(s') \right]$

Both lead to the same $V^\pi$: the Bellman operator is a contraction, so iterating the recursive form converges to it.

Use the recursive operator (iterating) if the state space is large, otherwise the exact solution.
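The iterative use of the Bellman expectation equation can be sketched as follows. The example assumes the policy has already been folded into a transition matrix `P_pi` and a reward vector `R_pi`; these names, the 2-state chain and the tolerance are hypothetical.

```python
def evaluate_policy(P_pi, R_pi, gamma, tol=1e-10):
    """Iterate V <- R_pi + gamma * P_pi V until the largest change is below tol."""
    n = len(R_pi)
    V = [0.0] * n
    while True:
        V_new = [R_pi[s] + gamma * sum(P_pi[s][s2] * V[s2] for s2 in range(n))
                 for s in range(n)]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new

# Hypothetical 2-state chain under a fixed policy: state 1 is absorbing with reward 0.
P_pi = [[0.9, 0.1],
        [0.0, 1.0]]
R_pi = [1.0, 0.0]
V = evaluate_policy(P_pi, R_pi, gamma=0.9)
print(V)
```

For this chain the exact solution can be checked by hand: V(1) = 0 and V(0) = 1 + 0.81 V(0), so V(0) = 1 / 0.19.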

Control

  • Brute force: enumerate and evaluate all policies → choose the best one. First check whether the policy space is too large for brute force: there are $|A|^{|S|}$ deterministic policies.
  • Policy iteration: iteratively compute the greedy values for a policy (Bellman optimality equation: $V^*(s) = \max_{a \in A} \left\{ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^*(s') \right\}$) and modify the policy accordingly. Can either:
    ◦ use the equation as is
    ◦ or split the process in 2 phases:
      1. Policy evaluation: compute the value of the policy
      2. Policy improvement: change the policy according to the newly estimated value
  • Value iteration: iteratively apply the Bellman optimality equation (it cannot be solved in closed form, since the max operator is not linear)
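Value iteration can be sketched on a tiny tabular MDP as follows. The data layout (`P[s][a][s2]` for transition probabilities, `R[s][a]` for rewards), the function name and the 2-state example are assumptions made for illustration.

```python
def value_iteration(P, R, gamma, tol=1e-10):
    """Repeatedly apply the Bellman optimality equation; return (V*, greedy policy)."""
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states

    def q(s, a):
        # Value of taking action a in state s and then following the current estimate V.
        return R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(n_states))

    while True:
        V_new = [max(q(s, a) for a in range(n_actions)) for s in range(n_states)]
        delta = max(abs(a - b) for a, b in zip(V_new, V))
        V = V_new
        if delta < tol:
            break
    policy = [max(range(n_actions), key=lambda a: q(s, a)) for s in range(n_states)]
    return V, policy

# Hypothetical deterministic MDP with 2 states and 2 actions.
# State 0: action 0 stays (reward 0), action 1 moves to state 1 (reward 1).
# State 1: action 0 stays (reward 2), action 1 moves back to state 0 (reward 0).
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]
V_star, policy = value_iteration(P, R, gamma=0.5)
print(V_star, policy)
```

By hand: staying in state 1 forever is worth 2 / (1 − 0.5) = 4, and moving there from state 0 is worth 1 + 0.5 · 4 = 3, so the greedy policy is "move" in state 0 and "stay" in state 1.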

RL is always an option for solving an MDP (e.g. when exact methods run into computational problems).

If the MDP is non-stationary, the policies applied to it are also influenced by other learning processes.

Examples:

| | Robotic navigation in grid world | Stock Investment | Robotic Soccer | Carcassonne (boardgame) |
|---|---|---|---|---|
| S (states) | Position of the robot on the grid | Balance, owned stocks | Robots position, ball position | Map, points |
| A (actions) | Up, down, left, right, no action in goal state | Amount of stocks to sell or buy at next time instant | Move, kick for each robot | Place tile + place guy, place tile |
| P (probabilities) | For each, 1 if can get to tile and 0 otherwise | Deterministic | Stochastic because of non-deterministic behaviour of the ball and environment | Deterministic |
| R (rewards) | 1 in goals and 0 otherwise | Lost and gained value | Number of goals | Obtained points |
| γ (discount factor) | High since a path needs to be found | High, depends on the trends and seasonalities that need to be observed | 1 since game ends after amount of time | 1 since tiles are finite |
| $\mu_i^0$ (initial probabilities) | 1 in initial location, 0 otherwise | 1 in state with initial value, 0 otherwise | 1 in state with initial disposition of ball and players | 1 if empty board and no guys on board and 0 points for all, 0 otherwise |

Modeling a classification problem as an MDP makes sense; however, MDPs are unnecessarily complex for it (e.g. they consider time).

Examples of problems for MDP dichotomies:

  • Actions: finite (robotic navigation) vs infinite (pole balancing with a continuous applied force)
  • Transitions: deterministic (chess) vs stochastic (blackjack)
  • Rewards: deterministic (robotic navigation) vs stochastic (ad banner allocation)
  • Stationary (doesn’t depend on time) vs non-stationary (depends on time)

Policy examples:

  • Markovian, Stochastic, Non-stationary: $\pi(s_i, i) = P(a_i \mid s_i, i)$
  • History based, Stochastic, Stationary: $\pi(h_i) = P(a_i \mid h_i)$
  • Markovian, Deterministic, Stationary: $\pi(s_i) = a_i$

Value function

Value function: how good each state and/or action is. It is the expected future reward, used by the agent to evaluate how good a state is and thus select the appropriate action.

Discount factor

Far-sighted (γ ∼ 1) vs myopic (γ ∼ 0) policies. γ is related to the problem and not to the policy (it is not a policy hyperparameter), but different γ values may give different optimal policies.

γ = 1 is not good for an infinite-horizon MDP, as it may lead to an infinite cumulative reward, but it is fine for a finite-horizon MDP, where convergence is guaranteed.

Model

The agent’s representation of the environment.

A learnable transition function P predicts the next state, i.e. the dynamics of the environment (a probability distribution over the next state).

  • $v_\pi(s)$: how good it is to be in a particular state
  • $q_\pi(s, a)$: how good it is to take a particular action from a given state (same as above, but also with respect to actions)

| | State-value function | Action-value function |
|---|---|---|
| State evaluation | $v_\pi(s)$ returns the value given a state | $q_\pi(s, a)$ returns the value given an action and a state |
| Solution evaluation (optimal values) | The optimal state-value function $V^*(s)$ is the maximum value function over all policies. Tells the best value that can be achieved | The optimal action-value function $q^*(s, a)$ is the maximum action-value function over all policies. Tells the best action to take |

Optimal policy: the best stochastic/deterministic mapping between states and actions.

A partial ordering between policies exists (one policy can be better than another one). For any MDP there exists an optimal policy that is better than or equal to all other policies.

The optimal policy is found by maximizing over q.

Reinforcement Learning

  • Model-dependent learning (value/policy iteration): the agent learns from an already available transition table
  • Model-free learning (Temporal Difference): the agent learns through actual experience, not knowing in advance states, transitions, …

Temporal Difference

Agent learns from every action rather than on every episode (reaching goal).

$NewEstimate \leftarrow OldEstimate + \underbrace{StepSize}_{\text{learning rate}} \cdot \underbrace{[Target - OldEstimate]}_{\text{target error}}$

  • StepSize: learning rate of the agent, how quickly it learns at every step.
  • Target: utility of a state (Bellman eq), the expected value of the discounted reward.

Uses a modified Bellman operator to compute the value of a state.

Updates prediction when new tuple (s,a,r) is available.

Value can be computed incrementally.

$V(s_t) \leftarrow V(s_t) + \alpha \left( r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right)$
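The incremental value update can be sketched as a single function applied after every observed transition. The value table, the state names and the numbers are hypothetical.

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """Move V[s] towards the target r + gamma * V[s_next] by a fraction alpha."""
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V

V = {"A": 0.0, "B": 1.0}  # hypothetical value table
td0_update(V, s="A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
print(V)
```

Here the target is 0.5 + 0.9 · 1.0 = 1.4, so V("A") moves from 0 to 0.1 · 1.4 = 0.14 after one step, without waiting for the end of the episode.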

SARSA

SARSA (State Action Reward State Action): on-policy TD method. It chooses the action for each state during learning by following a policy.

Goal: estimate $Q^\pi(s, a)$ for the current policy and all state-action pairs

Update rule: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R' + \gamma Q(s', a') - Q(s, a) \right]$

ε-greedy policy

  1. Generate a random number r ∈ [0,1]
  2. If r > ε choose a random action, else choose an action derived from the Q values (the one which yields the maximum utility)

ε value:

  • large ε: less likely to get random actions, which means less exploration, more exploitation
  • small ε: more likely to get random actions, which means more exploration, less exploitation
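A minimal sketch of the ε-greedy choice and the SARSA update, following the convention of these notes (a random action is taken when r > ε). Note that many textbooks use the opposite convention, where ε is the probability of a random action. The Q table layout (a dict keyed by `(state, action)`), the function names and the numbers are hypothetical.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Convention of these notes: explore when r > epsilon, otherwise exploit."""
    r = rng.random()
    if r > epsilon:
        return rng.choice(actions)                    # exploration: random action
    return max(actions, key=lambda a: Q[(state, a)])  # exploitation: best known action

def sarsa_update(Q, s, a, reward, s_next, a_next, alpha, gamma):
    """On-policy update: the target uses the action a_next actually chosen in s_next."""
    target = reward + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

actions = ["left", "right"]
Q = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 2.0, ("s1", "right"): 0.0}

a = epsilon_greedy(Q, "s0", actions, epsilon=1.0)  # epsilon = 1: always greedy here
sarsa_update(Q, "s0", a, reward=1.0, s_next="s1", a_next="left", alpha=0.5, gamma=0.9)
print(a, Q[("s0", "right")])
```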

Q Learning

Same as SARSA, but the next action used in the update is not chosen by the policy but by the utility of the next state (off-policy method).

Goal: estimate the optimal $Q^*(s, a)$ for all state-action pairs, independently of the policy being followed

Update rule: $Q(s, a) \leftarrow Q(s, a) + \alpha \left[ R' + \gamma \max_{a''} Q(s', a'') - Q(s, a) \right]$
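The Q-learning update can be sketched as follows. The only difference from an on-policy update is the `max` over the actions available in the next state. The Q table layout (a dict keyed by `(state, action)`), the function name and the numbers are hypothetical.

```python
def q_learning_update(Q, s, a, reward, s_next, actions, alpha, gamma):
    """Off-policy update: the target uses the best action in s_next, whatever is played next."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

actions = ["left", "right"]
Q = {("s0", "left"): 0.0, ("s0", "right"): 1.0,
     ("s1", "left"): 2.0, ("s1", "right"): 0.0}

q_learning_update(Q, "s0", "right", reward=1.0, s_next="s1",
                  actions=actions, alpha=0.5, gamma=0.9)
print(Q[("s0", "right")])
```

The target is 1.0 + 0.9 · 2.0 = 2.8 regardless of which action the agent actually takes in s1, so Q(s0, right) moves from 1.0 to 1.9.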