






























Machine Learning Notes, Politecnico de Milan
Type: Notes































- V 1. exercises functions) of the inputs: $M(w) = \sum_{i=1}^{M} \phi_i \cdot w_i$
0
training also learn the noise, in these cases regularization methods can help simplifying the
model (thus reducing the variance at the expense of the bias).
give suggestions on more complex ones.
minimize and whether it is solved with the closed-form solution (if available) or an approximated one.
costly). Only differentiable cost functions are compatible.
regression. The confidence interval will contain the true parameter 95% of the time from
sampled data.
◦ badly conditioned design matrix
◦ high variance, since an infinite number of combinations of parameters minimize the cost function
can be partially solved by Ridge Regression
worse than the actual one. It means that at least one weight is significant.
$RSS(w) = \sum_{n} \left( t_n^{real} - t_n^{predicted} \right)^2$
$TSS(w) = \sum_{n} \left( t_n^{real} - t^{mean} \right)^2$ (variance in the data)
$ESS(w) = \sum_{n} \left( t_n^{predicted} - t^{mean} \right)^2$ (variance explained by the model wrt the data)
$R^2$ (variance explained by the model: 1 explained, 0 not explained)
$RMSE = \sqrt{RSS(w) / N}$ (average error on new data by the model, same measure unit as the data)
2
2
dfe
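As a rough illustration of the error measures listed above, the following sketch computes them with NumPy. The function name `regression_metrics`, the relation `R^2 = 1 - RSS / TSS`, and the RMSE taken as `sqrt(RSS / N)` are assumptions made for this example rather than definitions taken from the notes.

```python
import numpy as np

def regression_metrics(t_real, t_pred):
    """Return RSS, TSS, ESS, R^2 and RMSE for real vs. predicted targets."""
    t_real = np.asarray(t_real, dtype=float)
    t_pred = np.asarray(t_pred, dtype=float)
    t_mean = t_real.mean()
    rss = float(np.sum((t_real - t_pred) ** 2))   # residual sum of squares
    tss = float(np.sum((t_real - t_mean) ** 2))   # total variation in the data
    ess = float(np.sum((t_pred - t_mean) ** 2))   # variation explained by the model
    r2 = 1.0 - rss / tss                          # assumed definition of R^2
    rmse = float(np.sqrt(rss / len(t_real)))      # same unit as the targets
    return {"rss": rss, "tss": tss, "ess": ess, "r2": r2, "rmse": rmse}

if __name__ == "__main__":
    # Hypothetical targets and predictions, only for illustration.
    print(regression_metrics([1, 2, 3, 4], [1.5, 2.0, 3.0, 3.5]))
```

A model that always predicts the mean gets `R^2 = 0`, while a perfect fit gets `R^2 = 1`.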
If high error and high weights and p-value << 0.5 → probably the dataset has not been normalized.
Ideal number of observations: #samples ≥ 10 ⋅ #parameters
Lasso and Ridge regression penalize weights of linear models in order to decrease over-fitting thus
increasing generalization of the model (by smoothing the model curve). Lasso can be considered as
a selection method as it zeroes some parameters of the feature vector.
| Lasso | Ridge |
|---|---|
| Zeroes some weights | Reduces weights |
| Feature selection | No feature selection |
| Simpler models (more bias, less variance), also easier to interpret | Slightly simpler models |
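Formula:
$L(w) = \underbrace{\sum_{i=1}^{N} \left( t_i - w^T \cdot \phi(x_i) \right)^2}_{RSS} + \underbrace{\lambda \|w\|_2^2}_{Ridge}$, or with the penalty $\underbrace{\lambda \cdot \|w\|_1}_{Lasso}$ in place of the Ridge term.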
λ = 0: no penalization
After getting an optimal λ with Ridge, as we start increasing λ from 0:
Given the plot of the RSS wrt 2 model parameters, the best model is the one that lies in the lowest
point in the center.
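To make the difference between the two penalties concrete, here is a minimal sketch assuming the loss is RSS plus `lam * ||w||_2^2` (Ridge) or `lam * ||w||_1` (Lasso). The function names are hypothetical. Ridge is solved in closed form, while Lasso is solved here with proximal gradient descent (ISTA), which is one possible solver chosen for brevity and not necessarily the method used in the course.

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    """Closed-form minimizer of RSS(w) + lam * ||w||_2^2."""
    n_features = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n_features), Phi.T @ t)

def lasso_fit(Phi, t, lam, n_iter=5000):
    """ISTA (proximal gradient) for RSS(w) + lam * ||w||_1."""
    w = np.zeros(Phi.shape[1])
    # Step size 1/L, where L = 2 * sigma_max(Phi)^2 bounds the curvature of RSS.
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ w - t)       # gradient of RSS
        z = w - step * grad                      # plain gradient step
        # Soft-thresholding: shrinks every weight and sets small ones exactly to 0.
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w

if __name__ == "__main__":
    # Toy orthogonal design (an assumption): each weight is fitted independently.
    Phi = np.eye(3)
    t = np.array([3.0, -2.0, 0.1])
    print("ridge:", ridge_fit(Phi, t, lam=1.0))  # all weights reduced, none zero
    print("lasso:", lasso_fit(Phi, t, lam=1.0))  # the small weight becomes exactly 0
```

On this toy problem Ridge halves every weight, while Lasso subtracts a fixed amount from each weight and zeroes the third one, which is the feature-selection behaviour described above.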
$\mathrm{sgn}(x) = \begin{cases} -1 & x < 0 \\ 0 & x = 0 \\ 1 & x > 0 \end{cases}$
are perfectly separable.
Logistic Regression
Hp space: $y(x_n) = \sigma(w^T \cdot x_n)$, with $\sigma(x) = \frac{1}{1 + e^{-x}}$
Loss measure: $L(w) = -\sum_{n \in M} \left[ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \right]$
Optimization method: Gradient descent

Naive Bayes
Hp space: $y(x_n) = \arg\max_k \; p(C_k) \prod_j N(x_j \mid \mu_{jk}, \sigma_{jk}^2)$
Loss measure: Log likelihood
Optimization method: Maximum Likelihood Estimation

KNN
Not parametric. Majority voting between the K nearest points (Euclidean distance). The larger K, the smoother (sort of regularization). K can only be an integer.
2+ classes scenario: probability of class $k$: $\frac{e^{w_k^T \phi}}{\sum_j e^{w_j^T \phi}}$ (at the numerator only class $k$)
probability for each class.
Important to decide how to break ties.
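The multi-class probability above (one exponential for class k at the numerator, the sum over all classes at the denominator) can be sketched as follows. The name `softmax_probabilities` and the layout of `W` (one weight vector per row) are assumptions for this example. Subtracting the maximum score is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax_probabilities(W, phi):
    """Class probabilities exp(w_k^T phi) / sum_j exp(w_j^T phi).

    W: array of shape (n_classes, n_features), one weight vector per class.
    phi: feature vector of shape (n_features,).
    """
    scores = W @ phi
    scores = scores - scores.max()      # stability: avoids overflow in exp
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

if __name__ == "__main__":
    # Hypothetical weights for 3 classes and 2 features.
    W = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
    phi = np.array([2.0, 0.0])
    probs = softmax_probabilities(W, phi)
    print(probs, "predicted class:", int(np.argmax(probs)))
```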
$w^{(k+1)} = w^{(k)} - \alpha^{(k)} \sum_{u=1}^{N} \left( x_u^T w^{(k)} - t_u \right) x_u$
$w^{(k+1)} = w^{(k)} - \alpha^{(k)} \cdot \left( x_u^T w^{(k)} - t_u \right) x_u$ (execute N times)
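The two update rules above differ only in how many samples feed each step: the first sums the error over all N samples, the second applies one sample at a time and is executed N times. A minimal sketch, with hypothetical function names and a fixed step size `alpha` (the notes allow it to change with the iteration):

```python
import numpy as np

def batch_step(w, X, t, alpha):
    """One update using the error summed over all N samples."""
    return w - alpha * X.T @ (X @ w - t)

def stochastic_epoch(w, X, t, alpha):
    """N updates, each using a single sample (x_u, t_u)."""
    for x_u, t_u in zip(X, t):
        w = w - alpha * (x_u @ w - t_u) * x_u
    return w

if __name__ == "__main__":
    # Toy data (an assumption): with X = identity and alpha = 1,
    # both variants reach the targets t after one pass.
    X = np.eye(2)
    t = np.array([2.0, 3.0])
    w0 = np.zeros(2)
    print(batch_step(w0, X, t, alpha=1.0))
    print(stochastic_epoch(w0, X, t, alpha=1.0))
```

On real data the two variants generally give different iterates; the stochastic one is cheaper per update but noisier.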
Parametric models base the prediction solely on a function defined by its parameters (acquired
during a training phase), while non-parametric models let the data define the prediction
function. Non-parametric functions can be approximated by a model with infinitely many parameters.
parametric method would need to load the whole dataset.
because it's costly. parametric: if trained first on another machine and then deployed on the
embedded device, it would be even better because prediction with parametric models is less
costly.
included into the model then directly into the data.
phase which affects real time performances. parametric: if an effective online learning
method is implemented.
parts. (LOO: cross validation with k=#samples)
$AIC = 2k - 2\ln(L)$, with $k$ = #params and $L$ = max of likelihood.
It penalizes the number of parameters wrt the error (penalizes complex models). The lower the better.
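A small sketch of how the AIC could be used to compare two models. The helper `gaussian_max_log_likelihood` assumes Gaussian noise with variance estimated by maximum likelihood (`RSS / N`); that assumption, the function names, and the numbers in the example are illustrative and not taken from the notes.

```python
import math

def gaussian_max_log_likelihood(rss, n_samples):
    """Maximized log-likelihood of a regression model under Gaussian noise."""
    sigma2 = rss / n_samples  # ML estimate of the noise variance
    return -0.5 * n_samples * (math.log(2.0 * math.pi * sigma2) + 1.0)

def aic(n_params, max_log_likelihood):
    """AIC = 2k - 2 ln(L): lower is better."""
    return 2 * n_params - 2 * max_log_likelihood

if __name__ == "__main__":
    n = 50
    # Hypothetical models: the complex one fits slightly better but uses 4 more parameters.
    aic_simple = aic(2, gaussian_max_log_likelihood(rss=12.0, n_samples=n))
    aic_complex = aic(6, gaussian_max_log_likelihood(rss=11.5, n_samples=n))
    print(aic_simple, aic_complex)  # the simpler model has the lower AIC here
```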
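d = 2, 2 points: H shatters 2 points.
d = 2, 3 points: H shatters 3 points (at least in some arrangements). [Figure: "This one works" shows an arrangement of 3 points that can be shattered; "This one doesn't work" shows one that cannot.]
d = 2, 4 points: H does not shatter 4 points. No way we can find a dataset of 4 points which can be shattered by a line.
Hence VC dimension of a line is 3.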
Estimates how many examples are necessary to avoid overfitting.
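$P(\exists h \in H, L_{true}(h) > \epsilon) \le |H| e^{-N\epsilon} = \delta \;\rightarrow\; \epsilon \ge \frac{1}{N} \left( \ln|H| + \ln\frac{1}{\delta} \right)$, where $N$ is the number of samples.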
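As a hedged illustration of the bound above, the sketch below assumes a finite hypothesis space of size `|H|` and the bound `|H| * exp(-N * epsilon) <= delta`, and solves it for the smallest integer N. The function name `samples_needed` and the example numbers are hypothetical.

```python
import math

def samples_needed(hypothesis_space_size, epsilon, delta):
    """Smallest N with |H| * exp(-N * epsilon) <= delta (finite H assumed)."""
    n = (math.log(hypothesis_space_size) + math.log(1.0 / delta)) / epsilon
    return math.ceil(n)

if __name__ == "__main__":
    # Hypothetical setting: 1000 hypotheses, error 0.1, failure probability 0.05.
    print(samples_needed(1000, epsilon=0.1, delta=0.05))
```

The dependence on `|H|` and `1/delta` is only logarithmic, while the dependence on `1/epsilon` is linear.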
Hp space: $y(x_n) = \mathrm{sgn}(w^T \cdot x_n)$
Loss measure: minimize $\frac{1}{2} \|w\|_2^2 + C \sum_n \zeta_n$ subject to $t_n (w^T x_n) \ge 1 - \zeta_n \;\; \forall n$
C: bias-variance trade-off
ζ: slack variable (violate constraint with a cost)
Optimization method: Quadratic optimization
time (only if supervised methods)
higher dimension space
Mercer’s theorem : any continuous, symmetric, positive semi-definite kernel function k(x,y) can be
expressed as a dot product in a high-dimensional space.
A kernel can be defined over any data that can be measured in some space and compared through a
similarity function (colors → binary code, graphs → similarity metric).
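To illustrate the statement that a kernel is a dot product in a higher-dimensional space, the sketch below uses the degree-2 polynomial kernel on 2-D inputs as an example (chosen here for simplicity, it is not mentioned in the notes). The explicit feature map `feature_map` is the standard one for this kernel; the function names are hypothetical.

```python
import numpy as np

def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2, computed in input space."""
    return float(x @ y) ** 2

def feature_map(x):
    """Explicit 3-D feature map whose dot product reproduces poly_kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

if __name__ == "__main__":
    x = np.array([1.0, 2.0])
    y = np.array([3.0, 4.0])
    # Both values coincide: the kernel never builds the 3-D vectors explicitly.
    print(poly_kernel(x, y), feature_map(x) @ feature_map(y))
```

The kernel costs one 2-D dot product, while the explicit route needs the mapped vectors; this gap grows quickly with the degree and the input dimension.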
and computations are not as intense thanks to the kernel trick.
high/infinite dimensions, otherwise transform the inputs.
a) Gaussian rbf kernel
b) exploit polar coordinates by transforming
the data (better than a kernel if it's a
simple case)
c) linearly separable
d) structured data, but not easy to find the
correlation → resort to kernels and
“hope”
Used for:
Goal: given actions for each state, compute value of the state (and best action).
Can either use:
◦ $V^{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}$ (need to invert matrix)
◦ $V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s) \cdot \left[ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a) \cdot V^{\pi}(s') \right]$
both converge as they are contraction operators.
Use operator if state space is large (iterate), otherwise exact solution.
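The two options for policy evaluation can be sketched as follows: the exact solution solves the linear system, while the iterative one applies the Bellman expectation operator until the values stop changing. `P_pi` and `R_pi` denote the transition matrix and reward vector already averaged over the policy; the function names and the toy MDP are assumptions for this example.

```python
import numpy as np

def evaluate_policy_exact(P_pi, R_pi, gamma):
    """Solve (I - gamma * P_pi) V = R_pi, equivalent to inverting the matrix."""
    n_states = len(R_pi)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

def evaluate_policy_iterative(P_pi, R_pi, gamma, tol=1e-10):
    """Repeatedly apply V <- R_pi + gamma * P_pi V (requires gamma < 1)."""
    V = np.zeros(len(R_pi))
    while True:
        V_new = R_pi + gamma * P_pi @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

if __name__ == "__main__":
    # Toy chain: state 0 moves to state 1, state 1 is absorbing with reward 1.
    P_pi = np.array([[0.0, 1.0], [0.0, 1.0]])
    R_pi = np.array([0.0, 1.0])
    print(evaluate_policy_exact(P_pi, R_pi, gamma=0.5))      # values 1 and 2
    print(evaluate_policy_iterative(P_pi, R_pi, gamma=0.5))  # same, up to tol
```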
Check if policy space is too wide for brute-force ($|A|^{|S|}$ deterministic policies)
$V^{*}(s) = \max_{a \in A} \left\{ R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a) \cdot V^{*}(s') \right\}$
) for a policy and modify policy
accordingly.
Can either:
◦ use equation as is
◦ or split process in 2 phases
max operator is not linear)
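Using the optimality equation "as is" amounts to iterating the max-based update until convergence, often called value iteration. The sketch below is one possible implementation under assumed array layouts: `P[s, a, s']` for transitions and `R[s, a]` for rewards. The function name and the toy MDP are hypothetical.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10):
    """Iterate V(s) <- max_a { R(s,a) + gamma * sum_s' P(s'|s,a) V(s') }.

    P: array of shape (n_states, n_actions, n_states).
    R: array of shape (n_states, n_actions).
    Returns the converged values and a greedy policy (one action index per state).
    """
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * (P @ V)          # action values, shape (n_states, n_actions)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

if __name__ == "__main__":
    # Toy MDP: in state 0, action 0 stays and action 1 moves to state 1;
    # state 1 is absorbing and pays reward 1 for any action.
    P = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[0.0, 1.0], [0.0, 1.0]]])
    R = np.array([[0.0, 0.0], [1.0, 1.0]])
    V, policy = value_iteration(P, R, gamma=0.5)
    print(V, policy)  # in state 0 the greedy action is 1 (move to state 1)
```

Because of the max, this cannot be solved as a single linear system, which is why it is iterated.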
RL is always an option for MDP (eg: computational problems).
Policies applied to an MDP are influenced also by other learning processes if non-stationary MDP.
Examples:
| | Robotic navigation in grid world | Stock Investment | Robotic Soccer | Carcassonne (boardgame) |
|---|---|---|---|---|
| S (states) | Position of the robot on the grid | Balance, owned stocks | Robots position, ball position | Map, points |
| A (actions) | Up, down, left, right, no action in goal state | Amount of stocks to sell or buy at next time instant | Move, kick for each robot | Place tile + place guy, place tile |
| P (probabilities) | For each, 1 if can get to tile and 0 otherwise | Deterministic | Stochastic because of non-deterministic behaviour of the ball and environment | Deterministic |
| R (rewards) | 1 in goals and 0 otherwise | Lost and gained value | Number of goals | Obtained points |
| γ (discount factor) | High since a path needs to be found | High, depends on the trends and seasonalities that need to be observed | 1 since game ends after amount of time | 1 since tiles are finite |
| $\mu_i^0$ (initial probabilities) | 1 in initial location, 0 otherwise | 1 in state with initial value, 0 otherwise | 1 in state with initial disposition of ball and players | 1 if empty board and no guys on board and 0 points for all, 0 otherwise |
Correspondence between classification and MDP makes sense, however, MDP are unnecessarily
complex (eg: consider time).
Examples of problems for MDP dichotomies:
| Actions | Finite: robotic navigation | Infinite: pole balancing with continuous space applied force |
| Transitions | Deterministic: chess | Stochastic: blackjack |
| Rewards | Deterministic: robotic navigation | Stochastic: ad banner allocation |
Stationary: doesn't depend on time. Non-stationary: depends on time.
Policy examples:
$\pi(s_i, i) = P(a_i \mid s_i, i)$
$\pi(h_i) = P(a_i \mid h_i)$
$\pi(s_i) = a_i$
Value function: how good each state and/or action is. It is the expected future reward, used by the
agent to evaluate how good a state is and thus to select the appropriate action.
Discount factor
Far-sighted ( γ∼ 1 ) vs myopic ( γ∼ 0 ) policies. γ related to problem and not to policy (not
a policy hyperparameter), but different γ values may give different optimal policies.
γ = 1 is not good for infinite horizon MDP, since it may lead to an infinite cumulated reward, but
it is good for finite horizon MDP, where convergence is guaranteed.
Agent representation of the environment.
Learnable transition function P which predicts the next state or the dynamics of the environment
(probability distribution over the next state).
$v_{\pi}(s)$: how good it is to be in a particular state
$q_{\pi}(s, a)$: how good it is to take a particular action from a given state (same but also wrt
actions)
State evaluation:
◦ $v_{\pi}(s)$ returns the value given a state
◦ $q_{\pi}(s, a)$ returns the value given an action and a state
Solution evaluation (optimal values):
◦ Optimal state-value function $V^{*}(s)$ is the max value function over all policies
◦ Optimal action-value function $q^{*}(s,a)$ is the maximum action-value function over all policies
Tells the best action to take. Tells the best value that can be achieved.
Optimal policy : best stochastic/deterministic mapping between states and actions.
Partial ordering between policies exists (one policy can be better than another one):
For any MDP there exists an optimal policy that is better than or equal to all other policies.
Optimal policy is found by maximizing over q.
transition table
knowing in advance states, transitions, …
Agent learns from every action rather than on every episode (reaching goal).
NewEstimate ← OldEstimate + StepSize ⋅ [Target − OldEstimate]
(StepSize: learning rate; [Target − OldEstimate]: target error)
Uses a modified Bellman operator to compute the value of a state.
Updates prediction when new tuple (s,a,r) is available.
Value can be computed incrementally.
$V(s_t) \leftarrow V(s_t) + \alpha \left( r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right)$
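The incremental update above can be written as a one-line function. In this sketch the value table `V` is a plain dictionary keyed by state and the function name `td0_update` is hypothetical; the numbers in the example are only illustrative.

```python
def td0_update(V, s, r_next, s_next, alpha, gamma):
    """Move V[s] toward the target r_next + gamma * V[s_next] by a step alpha."""
    td_error = r_next + gamma * V[s_next] - V[s]
    V[s] = V[s] + alpha * td_error
    return V

if __name__ == "__main__":
    V = {0: 0.0, 1: 1.0}
    # One observed transition: from state 0 to state 1 with reward 1.
    td0_update(V, s=0, r_next=1.0, s_next=1, alpha=0.5, gamma=0.9)
    print(V)  # V[0] moves halfway toward the target 1 + 0.9 * 1 = 1.9
```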
SARSA (State Action Reward State Action) : on policy TD method. Chooses action for each state
during learning by following a policy.
Goal: estimate $Q^{\pi}(s, a)$ for current policy and all state-action pairs
Update rule: $Q(s,a) \leftarrow Q(s,a) + \alpha \cdot \left[ R' + \gamma \cdot Q(s',a') - Q(s,a) \right]$
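A minimal sketch of the SARSA update, where the next action `a_next` is the one actually chosen by the behaviour policy. The table `Q` is a `defaultdict` keyed by `(state, action)`; the helper `epsilon_greedy` is one common way to pick actions and, like the function names, is an assumption of this example.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """With probability epsilon pick a random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, reward, s_next, a_next, alpha, gamma):
    """On-policy update: the target uses the action a_next actually selected."""
    target = reward + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

if __name__ == "__main__":
    actions = ["left", "right"]
    Q = defaultdict(float)
    Q[(1, "right")] = 2.0
    a_next = epsilon_greedy(Q, 1, actions, epsilon=0.0)  # greedy: "right"
    sarsa_update(Q, 0, "left", 1.0, 1, a_next, alpha=0.5, gamma=0.5)
    print(Q[(0, "left")])
```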
ε-greedy policy
Else choose an action derived from the Q values (which yields the maximum utility)
ε value:
Same as SARSA, but next action not chosen by policy but by utility of next state (off policy
method).
Goal: estimate $Q^{\pi}(s, a)$ for current policy and all state-action pairs
Update rule: $Q(s,a) \leftarrow Q(s,a) + \alpha \cdot \left[ R' + \gamma \cdot \max_{a''} Q(s',a'') - Q(s,a) \right]$
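The only change with respect to SARSA is the target: instead of the action actually taken next, the update uses the best action value available in the next state. A minimal sketch, with a hypothetical function name and a `defaultdict` table keyed by `(state, action)`:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, reward, s_next, actions, alpha, gamma):
    """Off-policy update: the target uses max over next actions."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

if __name__ == "__main__":
    actions = ["left", "right"]
    Q = defaultdict(float)
    Q[(1, "right")] = 2.0
    # Whatever action the agent takes next, the target uses Q[(1, "right")].
    q_learning_update(Q, 0, "left", 1.0, 1, actions, alpha=0.5, gamma=0.5)
    print(Q[(0, "left")])
```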