




























































































The document presents the tools for unsupervised learning (Unsupervised Statistical Learning) with explanations, formulas and examples. Contents: 1) Univariate Statistical modelling, 2) Basics of Matrices, 3) Basics of Multivariate Statistics, 4) Principal Component Analysis, 5) Cluster Analysis, 6) Cluster Validation, 7) Model-Based Clustering
Type: Notes





























































































A random variable (rv) is a variable whose numerical values are determined by the
outcome of a random experiment. A rv can be: 1) discrete, if it can take no more than
a countable number of values; 2) continuous, if it can take any value in an interval.
Probability (mass) function: The probability mass function (pmf) $p(x)$ of a discrete rv $X$ expresses the probability that $X$ takes the value $x$, as a function of $x$. The pmf $p : \mathbb{R} \to [0, 1]$ is defined as $p(x) = P(X = x)$.
Support: the support $S_x$ of a discrete rv $X$ is defined as the set of possible values of $X$, that is $S_x = \{x \in \mathbb{R} : p(x) > 0\}$.
Properties: The pmf satisfies the following properties:
1) $0 \le p(x) \le 1$ for every $x \in S_x$;
2) $\sum_{x \in S_x} p(x) = 1$.
Cumulative distribution function: The cumulative distribution function (cdf) $F(x)$ of a discrete rv $X$ expresses the probability that $X$ does not exceed the value $x$, as a function of $x$. The cdf $F : \mathbb{R} \to [0, 1]$ is defined as
$F(x) = P(X \le x) = \sum_{\{t \in S_x;\; t \le x\}} p(t)$,
where the notation indicates that the summation is over all possible values $t \in S_x$ that are less than or equal to $x$.
Properties: The cdf satisfies the following properties:
1) $0 \le F(x) \le 1$ for every $x \in \mathbb{R}$;
2) if $x_0$ and $x_1$ are two numbers such that $x_0 < x_1$, then $F(x_0) \le F(x_1)$.
Expectation: The expectation (also called mean or expected value) of a discrete rv $X$ is defined as $E(X) = \mu_X = \sum_{x \in S_x} x\,p(x)$, where the summation is over all possible values $x \in S_x$.
Generalization: Let $X$ be a discrete rv with pmf $p(x)$. Moreover, let $g(X)$ be some function of $X$. The expected value of $g(X)$ is defined as $E[g(X)] = \sum_{x \in S_x} g(x)\,p(x)$.
Variance: The expectation of the squared discrepancy about the mean, $(X - \mu_X)^2$, is called the variance, commonly denoted by $\sigma_X^2$, and it is given by
$\sigma_X^2 = E[(X - \mu_X)^2] = \sum_{x \in S_x} (x - \mu_X)^2\,p(x)$.
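The definitions above for a discrete rv can be checked on a small case. The sketch below assumes a hypothetical fair six-sided die (not an example from the notes) and uses exact fractions so that the pmf, cdf, expectation and variance can be compared with hand calculations.

```python
from fractions import Fraction

# Hypothetical example: pmf of a fair six-sided die, stored as {value: probability}.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
support = sorted(pmf)

def cdf(x):
    """F(x): sum of p(t) over the support points t that do not exceed x."""
    return sum((pmf[t] for t in support if t <= x), Fraction(0))

def expectation(g=lambda x: x):
    """E[g(X)]: sum of g(x) * p(x) over the support."""
    return sum(g(x) * pmf[x] for x in support)

mean = expectation()                                # E(X)
variance = expectation(lambda x: (x - mean) ** 2)   # E[(X - mu)^2]

print(sum(pmf.values()))   # total probability over the support
print(cdf(3))              # P(X <= 3)
print(mean, variance)
```

Passing a function `g` to `expectation` mirrors the generalization $E[g(X)]$, so the variance is obtained without a separate formula.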
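Probability density function: Let $X$ be a continuous rv. The probability density function (pdf) of $X$ is a function $f : \mathbb{R} \to [0, \infty)$ with the following properties:
1) $P(a \le X \le b) = \int_a^b f(x)\,dx$ for any $a, b \in \mathbb{R}$ with $a \le b$;
2) $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
If $X$ is a continuous rv, then the probability of a single point $x_0$ is null, that is $P(X = x_0) = 0$.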
Cumulative distribution function: The cumulative distribution function $F(x)$ of a continuous rv $X$ is $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$.
It easily follows that $f(x) = \frac{dF(x)}{dx}$.
Expectation: The expectation of a continuous rv $X$ is defined as $E(X) = \mu_X = \int_{-\infty}^{\infty} x\,f(x)\,dx$.
Generalization: Let $X$ be a continuous rv with pdf $f(x)$. Moreover, let $g(X)$ be some function of $X$. The expected value of $g(X)$ is defined as $E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx$.
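The integrals that define probabilities and expectations for a continuous rv can be approximated numerically. The following is a minimal sketch, assuming an exponential density with a hypothetical rate of 2 and a simple midpoint rule; the names `RATE`, `UPPER` and `integrate` are illustrative choices, not part of the notes.

```python
import math

RATE = 2.0  # hypothetical rate parameter of an exponential distribution

def pdf(x):
    """Exponential density: RATE * exp(-RATE * x) for x >= 0, and 0 otherwise."""
    return RATE * math.exp(-RATE * x) if x >= 0 else 0.0

def integrate(h, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of h over [a, b]."""
    width = (b - a) / steps
    return width * sum(h(a + (k + 0.5) * width) for k in range(steps))

UPPER = 40.0  # the density is negligible beyond this point when RATE = 2

total = integrate(pdf, 0.0, UPPER)                    # integral of f, close to 1
mean = integrate(lambda x: x * pdf(x), 0.0, UPPER)    # E(X), close to 1 / RATE
variance = integrate(lambda x: (x - mean) ** 2 * pdf(x), 0.0, UPPER)
prob_ab = integrate(pdf, 0.5, 1.0)                    # P(0.5 <= X <= 1)

print(total, mean, variance, prob_ab)
```

The variance is computed as $E[g(X)]$ with $g(x) = (x - \mu_X)^2$, in the same way as in the discrete case but with an integral in place of the sum.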
Beta: suited for phenomena with support within an interval (for instance, a rate).
Exponential: suited for describing the time between events (births, deaths, etc.).
Uniform: suited to describe phenomena with maximum uncertainty.
Random sample
A random sample of size $n$ is the set $X_1, \ldots, X_i, \ldots, X_n$ of rv’s associated to $n$ independent and identically distributed (iid) observations of the rv $X$.
Observed sample: an observed sample of size $n$ is the set $x_1, \ldots, x_i, \ldots, x_n$ constituting the realizations of the rv’s $X_1, \ldots, X_i, \ldots, X_n$ through the $n$ sample units.
Sample (joint) distribution: the sample (joint) distribution is the joint distribution $f_n$ (or $p_n$) of $X_1, \ldots, X_i, \ldots, X_n$.
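If $X$ is a continuous rv and has pdf $f(x; \vartheta)$, the joint density of the sample is given by
$f_n(x_1, \ldots, x_i, \ldots, x_n; \vartheta) = \prod_{i=1}^{n} f(x_i; \vartheta)$.
If $X$ is a discrete rv and has pmf $p(x; \vartheta)$, the joint probability of the sample is given by
$p_n(x_1, \ldots, x_i, \ldots, x_n; \vartheta) = \prod_{i=1}^{n} p(x_i; \vartheta)$.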
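As an illustration of the product formula for a discrete rv, the sketch below assumes a Bernoulli pmf with a hypothetical parameter value 0.6 and a made-up observed sample of size 5. Neither comes from the notes.

```python
import math

def bernoulli_pmf(x, theta):
    """p(x; theta): theta when x == 1, and 1 - theta when x == 0."""
    return theta if x == 1 else 1.0 - theta

def joint_pmf(sample, theta):
    """Joint probability of an iid sample: product of the individual pmf values."""
    return math.prod(bernoulli_pmf(x, theta) for x in sample)

observed = [1, 0, 1, 1, 0]                  # hypothetical observed sample, n = 5
p_joint = joint_pmf(observed, theta=0.6)    # equals 0.6**3 * 0.4**2

print(p_joint)
```

Independence is what justifies the product: each observation contributes its own factor $p(x_i; \vartheta)$.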
Parametric inference
Parametric estimation: let $X_1, \ldots, X_i, \ldots, X_n$ be a random sample from a continuous rv $X$ having pdf $f(x; \vartheta)$, where $\vartheta = (\vartheta_1, \ldots, \vartheta_r, \ldots, \vartheta_k)'$.
The parameter vector $\vartheta$ is unknown and we wish to “estimate” it based on the random sample. This is the typical context for parametric estimation. In other words, the functional form $f$ is known (or at least assumed to be!), but $\vartheta$ is not!
Example : Let 𝜇 be the (population) mean of 𝑋. Suppose 𝜇 is unknown and we
want to estimate it using the random sample. Note that here no functional form
𝑓 is assumed for the pdf of 𝑋!
This easily extends to discrete rv’s.
Point estimation
Let $X_1, \ldots, X_i, \ldots, X_n$ be a random sample from a population with density $f(x; \vartheta)$. We will indicate with $\tau(\vartheta)$ a function of the unknown parameters $\vartheta$ that we wish to estimate using the random sample.
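Example: Let $f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$. In this case $\vartheta = (\mu, \sigma^2)'$ and we might want to estimate: 1) the coefficient of variation $\tau(\mu, \sigma^2) = \sqrt{\sigma^2}/\mu$; 2) the mean $\tau(\mu, \sigma^2) = \mu$; 3) the variance $\tau(\mu, \sigma^2) = \sigma^2$.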
Estimator: An estimator of $\tau(\vartheta)$ is defined as some (hopefully appropriate) function $T = T(X_1, \ldots, X_i, \ldots, X_n)$ of the sample variables. Note that the estimator $T$, being a function of random variables, is itself a random variable!
Unfortunately, there is no estimator whose mean squared error, $MSE(T) = E[(T - \tau(\vartheta))^2]$, is smaller than that of every other estimator uniformly in $\vartheta$.
Our goal: set up “desirable” conditions that $T$ must fulfil in order to restrict the range of admissible estimators; hopefully there will be one better than the others!
Unbiased estimator (the ones we will study)
Unbiased estimator: the estimator $T$ of $\tau(\vartheta)$ is unbiased if $E(T) = \tau(\vartheta)$ for every $\vartheta$. The “central value of $T$”, that is $E(T)$, coincides with $\tau(\vartheta)$.
This property identifies a subset of estimators of great interest, the unbiased estimators. Moreover, since their MSE coincides with their variance, to choose among unbiased estimators it is enough to look for the lowest variance.
Bias of an estimator
If $E(T) \ne \tau(\vartheta)$, we say that the estimator is biased.
Bias of an estimator: the bias of an estimator $T$ of $\tau(\vartheta)$ is the difference $B(T) = E(T) - \tau(\vartheta)$.
If $E(T) > \tau(\vartheta)$, the estimator $T$ is positively biased, with bias $B(T) > 0$: $T$ tends to overestimate $\tau(\vartheta)$;
If $E(T) < \tau(\vartheta)$, the estimator $T$ is negatively biased, with bias $B(T) < 0$: $T$ tends to underestimate $\tau(\vartheta)$.
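The bias of an estimator can be computed exactly in a small discrete setting. The sketch below assumes a hypothetical population (a fair die) and samples of size 2, and compares two estimators of the variance: the sum of squared deviations divided by $n$ and divided by $n - 1$. The choice of population, sample size and function names is illustrative only.

```python
from fractions import Fraction
from itertools import product

population = range(1, 7)   # hypothetical population: outcomes of a fair die
n = 2                      # sample size

mu = Fraction(sum(population), 6)
sigma2 = sum((Fraction(x) - mu) ** 2 for x in population) / 6   # true variance

def var_divide_by(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample) / divisor

# All 6**n samples are equally likely, so E(T) is a plain average over them.
samples = list(product(population, repeat=n))
e_biased = sum(var_divide_by(s, n) for s in samples) / len(samples)
e_unbiased = sum(var_divide_by(s, n - 1) for s in samples) / len(samples)

bias_biased = e_biased - sigma2        # negative: tends to underestimate sigma^2
bias_unbiased = e_unbiased - sigma2    # zero: the estimator is unbiased

print(sigma2, e_biased, e_unbiased)
```

Enumerating every possible sample replaces simulation, so $E(T)$ and the bias are obtained without sampling error.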
Cramér-Rao Lower Bound (related to unbiased estimators)
Within the class of unbiased estimators, the most important feature is the variance.
Under certain regularity conditions it is possible to show that the variance of any unbiased estimator is greater than, or equal to, a fixed quantity, which is the lower bound of the variance of unbiased estimators.
An unbiased estimator $T$ is efficient if its variance reaches the CR lower bound.
Maximum likelihood method
Maximum likelihood (ML) is perhaps the most widespread method for building estimators, although arguably not the easiest one (see the method of moments).
Operationally, the ML method can be summarized in 3 steps.
→ Step 1
Keeping in mind that we are dealing with a random sample of iid observations, first we compute the likelihood function
$L(\vartheta; x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \vartheta)$.
Remark: note that the formula is the same as the sample joint density $f_n(x_1, \ldots, x_n; \vartheta)$, but this is now a function of $\vartheta$ given the sample.
Interpretation: $L(\vartheta; x_1, \ldots, x_n)$ indicates the probability, or likelihood if you prefer, that the observed sample originates from $X$ having density $f(x; \vartheta)$.
Objective: our goal is to maximize $L(\vartheta; x_1, \ldots, x_n)$ with respect to $\vartheta$.
→ Step 2
From the likelihood function we get to the log-likelihood function
$l(\vartheta; x_1, \ldots, x_n) = \ln L(\vartheta; x_1, \ldots, x_n)$
for two main reasons:
1) $l(\vartheta; x_1, \ldots, x_n)$ does not alter the location of the minima and the maxima of $L(\vartheta; x_1, \ldots, x_n)$, since the logarithm is a monotonically increasing function;
2) it is more convenient to work with, because:
a) it easily handles the exponentials ($\ln a^x = x \ln a$ and $\ln e^x = x$) that the most widespread distributions have (normal, gamma, etc.);
b) it easily handles products ($\ln \prod_{i=1}^{n} f(x_i) = \sum_{i=1}^{n} \ln f(x_i)$), like those of $L(\vartheta; x_1, \ldots, x_n)$;
c) it accentuates differences (useful from a computational point of view).
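The computational advantage of the log-likelihood can be seen on a large sample. The sketch below assumes a standard normal density and a hypothetical sample of 2000 evenly spaced values. With that many factors below 1, the product underflows in floating point, while the sum of logarithms stays finite.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

# Hypothetical sample: 2000 values spread evenly over [-1, 1].
sample = [-1.0 + 2.0 * i / 1999 for i in range(2000)]

# Likelihood as a product of densities: every factor is below 0.4,
# so the product is too small to be represented as a float.
likelihood = math.prod(normal_pdf(x, 0.0, 1.0) for x in sample)

# Log-likelihood as a sum of log-densities: an ordinary negative number.
log_likelihood = sum(math.log(normal_pdf(x, 0.0, 1.0)) for x in sample)

print(likelihood, log_likelihood)
```

Because the logarithm is increasing, maximizing the finite sum gives the same maximizer as the product would, had it been representable.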
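→ Step 3 (perhaps not too important)
The ML estimator of $\vartheta$ is found by maximizing $l(\vartheta; x_1, \ldots, x_n)$ with respect to $\vartheta = (\vartheta_1, \ldots, \vartheta_r, \ldots, \vartheta_k)'$, that is, by solving the system of $k$ equations obtained by equating the $k$ partial derivatives to 0:
$\frac{\partial\, l(\vartheta; x_1, \ldots, x_n)}{\partial \vartheta_r} = 0$, for $r = 1, \ldots, k$.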
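The three steps can be sketched for the normal model, where equating the two partial derivatives to 0 gives the well-known closed-form solution (sample mean, and sum of squared deviations divided by $n$). The observed sample and the function names below are hypothetical, and the check against nearby parameter values is only a sanity test, not a proof of optimality.

```python
import math

def log_likelihood(mu, sigma2, sample):
    """Normal log-likelihood l(mu, sigma^2; x_1, ..., x_n)."""
    n = len(sample)
    ss = sum((x - mu) ** 2 for x in sample)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - ss / (2.0 * sigma2)

def normal_mle(sample):
    """Closed-form solution of the two equations from Step 3 for a normal sample."""
    n = len(sample)
    mu_hat = sum(sample) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in sample) / n   # divides by n, not n - 1
    return mu_hat, sigma2_hat

sample = [4.1, 5.3, 4.8, 6.0, 5.2, 4.6]   # hypothetical observed sample
mu_hat, sigma2_hat = normal_mle(sample)
best = log_likelihood(mu_hat, sigma2_hat, sample)

# Sanity check: nearby parameter values never give a larger log-likelihood.
neighbours = [(mu_hat + dm, sigma2_hat + ds)
              for dm in (-0.1, 0.0, 0.1) for ds in (-0.05, 0.0, 0.05)]
is_maximum = all(log_likelihood(m, s, sample) <= best + 1e-12 for m, s in neighbours)

print(mu_hat, sigma2_hat, is_maximum)
```

Note that the ML estimator of the variance divides by $n$, so it is a biased estimator of $\sigma^2$.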
Goodness-of-fit tests
There are 3 main statistical tests to evaluate the goodness-of-fit of a statistical (theoretical) model (which we put under the null hypothesis $H_0$) to the empirical distribution (namely, the distribution of the observed sample). In other words, there are 3 types of test that can be used to assess how well a data set is described by a hypothesized distribution:
Pearson’s chi-square test;
Kolmogorov-Smirnov test;
Likelihood-ratio test.
Note: The null hypothesis is a hypothesis that is assumed true until there is evidence to the contrary. The alternative hypothesis is the hypothesis opposite to the null one, and it is accepted only if there is strong evidence in its favour.
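In Pearson’s chi-square goodness-of-fit test the sample data of size $n$, if of a continuous type, are divided into $s$ intervals (or classes or bins) $A_1, \ldots, A_s$. Then the numbers of points $n_1, \ldots, n_s$ that fall into the intervals $A_1, \ldots, A_s$ are compared with the expected numbers of points $n_1^*, \ldots, n_s^*$ (under the null model) in those intervals.
The null hypothesis $H_0$ assigns the probabilities $\pi_1, \ldots, \pi_s$ to $A_1, \ldots, A_s$, so that $n_j^* = n\pi_j$.
Hypotheses: the null and alternative hypotheses are:
$H_0: p_1 = \pi_1, \ldots, p_s = \pi_s$
$H_1: p_1 \ne \pi_1$ or $\ldots$ or $p_s \ne \pi_s$ (that is, $p_j \ne \pi_j$ for at least one $j$)
where $p_1, \ldots, p_s$ represent the true but unknown probabilities of $A_1, \ldots, A_s$.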
Remark: it is natural to assume that the rejection of $H_0$ depends on the discrepancy between the observed counts $n_1, \ldots, n_s$ and the counts expected under $H_0$.
Remarks:
→ The distribution of $\chi^2$ is asymptotic and evaluated under the null hypothesis.
→ Let $q$ be the number of unknown parameters of the null model. If these parameters need to be estimated from the sample data, then the degrees of freedom of the asymptotic $\chi^2$ distribution reduce from $s - 1$ to $s - 1 - q$.
The critical region of level $\alpha$ is defined as $\{\chi^2 : \chi^2 > c\}$, where $c = \chi^2_{[(s-1);\,1-\alpha]}$ is the quantile of order $1 - \alpha$ from $\chi^2_{(s-1)}$.
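A minimal sketch of the test follows. It assumes the standard Pearson statistic, $\chi^2 = \sum_{j=1}^{s} (n_j - n\pi_j)^2 / (n\pi_j)$, which does not appear in the surviving text, and a null model with no estimated parameters ($q = 0$). The die-roll counts are made up, and the critical value is the tabulated 0.95 quantile of the chi-square distribution with 5 degrees of freedom.

```python
def pearson_chi_square(observed, null_probs):
    """Pearson statistic: sum over classes of (n_j - n*pi_j)^2 / (n*pi_j)."""
    n = sum(observed)
    return sum((n_j - n * pi_j) ** 2 / (n * pi_j)
               for n_j, pi_j in zip(observed, null_probs))

# Hypothetical data: 120 rolls of a die; H0 gives every face probability 1/6.
observed = [22, 17, 20, 26, 22, 13]
null_probs = [1 / 6] * 6

statistic = pearson_chi_square(observed, null_probs)

# Quantile of order 0.95 of the chi-square with s - 1 = 5 degrees of freedom,
# copied from a standard chi-square table (alpha = 0.05).
CRITICAL_VALUE = 11.070
reject_h0 = statistic > CRITICAL_VALUE

print(statistic, reject_h0)
```

Here the statistic falls below the critical value, so the observed counts do not fall in the critical region and $H_0$ is not rejected at level 0.05.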
The Kolmogorov-Smirnov test (KS test) is a goodness-of-fit test that can be used for two different purposes.
One-sample KS test: it is used to compare the empirical cumulative distribution function $F(x)$ with the theoretical cumulative distribution function $\phi(x)$ under the null. The KS statistic quantifies the distance between $F(x)$ and $\phi(x)$ as $D = \sup_x |F(x) - \phi(x)|$.
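The one-sample KS distance can be computed by checking the gap between the two cdfs on both sides of each jump of the empirical cdf. The sketch below assumes a continuous null cdf and takes the largest absolute difference, the usual form of the statistic. The Uniform(0, 1) null model and the three-point sample are hypothetical.

```python
def ks_statistic(sample, null_cdf):
    """One-sample KS distance sup_x |F(x) - phi(x)| for a continuous null cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        phi = null_cdf(x)
        # The empirical cdf jumps from (i - 1)/n to i/n at x, so check both sides.
        d = max(d, i / n - phi, phi - (i - 1) / n)
    return d

def uniform_cdf(x):
    """cdf of the Uniform(0, 1) distribution, used here as the null model."""
    return min(max(x, 0.0), 1.0)

sample = [0.1, 0.4, 0.7]           # hypothetical observed sample
d = ks_statistic(sample, uniform_cdf)

print(d)
```

The largest gap occurs at the last observation, where the empirical cdf reaches 1 while the null cdf is only 0.7.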
In the likelihood-ratio (LR) test, the critical region of level $\alpha$ is defined as $\{-2 \ln LR > c\}$, where $c = \chi^2_{[m;\,1-\alpha]}$ is the quantile of order $1 - \alpha$ from $\chi^2_m$.
Interpretation of LR:
→ The numerator of LR corresponds to the likelihood of an observed outcome under the null hypothesis.
→ The denominator of LR corresponds to the maximum likelihood of an observed outcome, varying the parameters over the whole parameter space Θ.
The numerator of this ratio cannot, by definition, exceed the denominator; so LR is between 0 and 1.
→ Low values of LR mean that the observed result was much less likely to occur under $H_0$ as compared to $H_1$, which is evidence against the null hypothesis.
→ High values (close to 1) of LR mean that the observed outcome was nearly as likely to occur under the null hypothesis as under the alternative, and so the null hypothesis cannot be rejected.
Remark: The LR test requires the competing models to be nested, i.e. the more complex model can be transformed into the simpler model by imposing constraints on the former’s parameters.
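A small sketch of the LR test for nested models follows, under several assumptions not stated in the notes: a normal model with known variance 1, a null hypothesis that fixes the mean at a hypothetical value 4, the usual statistic $-2 \ln LR$, and one constraint (so the chi-square has 1 degree of freedom). The sample is made up and the critical value is taken from a standard chi-square table.

```python
import math

def normal_log_likelihood(mu, sigma2, sample):
    """Normal log-likelihood for mean mu and variance sigma2."""
    n = len(sample)
    ss = sum((x - mu) ** 2 for x in sample)
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - ss / (2.0 * sigma2)

sample = [4.1, 5.3, 4.8, 6.0, 5.2, 4.6]   # hypothetical observed sample
SIGMA2 = 1.0    # variance assumed known
MU_0 = 4.0      # null hypothesis H0: mu = MU_0 (one constraint on the full model)

# Denominator of LR: likelihood maximized over the whole parameter space.
mu_hat = sum(sample) / len(sample)

log_lr = (normal_log_likelihood(MU_0, SIGMA2, sample)
          - normal_log_likelihood(mu_hat, SIGMA2, sample))
lr = math.exp(log_lr)          # always in (0, 1]
statistic = -2.0 * log_lr

# Quantile of order 0.95 of the chi-square with 1 degree of freedom (table value).
CRITICAL_VALUE = 3.841
reject_h0 = statistic > CRITICAL_VALUE

print(lr, statistic, reject_h0)
```

The models are nested because the null model is obtained from the full one by constraining the mean to `MU_0`. The small value of LR leads to rejecting $H_0$ at level 0.05 in this made-up case.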
SLIDE 2 – Basics of Matrices
Triangular matrix : In the mathematical discipline of linear algebra, a triangular
matrix is a special kind of square matrix. A square matrix is called lower triangular if
all the entries above the main diagonal are equal to 0. Similarly, a square matrix is
called upper triangular if all the entries below the main diagonal are equal to 0.
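The two definitions translate directly into checks on the entries above or below the main diagonal. The sketch below assumes a square matrix stored as a list of rows; the function names and the example matrices are illustrative.

```python
def is_lower_triangular(matrix):
    """True if every entry above the main diagonal is 0 (square list of rows)."""
    n = len(matrix)
    return all(matrix[i][j] == 0 for i in range(n) for j in range(i + 1, n))

def is_upper_triangular(matrix):
    """True if every entry below the main diagonal is 0 (square list of rows)."""
    n = len(matrix)
    return all(matrix[i][j] == 0 for i in range(n) for j in range(i))

lower = [[1, 0, 0],
         [2, 3, 0],
         [4, 5, 6]]
upper = [[1, 2, 3],
         [0, 4, 5],
         [0, 0, 6]]
diagonal = [[1, 0],
            [0, 2]]

print(is_lower_triangular(lower), is_upper_triangular(upper))
```

A diagonal matrix passes both checks, since all its off-diagonal entries are 0.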