Data Analysis - Unsupervised Statistical Learning (PCA, CA, Model-Based Clustering), Data Analysis Notes

The document presents the tools for unsupervised statistical learning, with explanations, formulas and examples. Contents: 1) Univariate Statistical modelling, 2) Basics of Matrices, 3) Basics of Multivariate Statistics, 4) Principal Component Analysis, 5) Cluster Analysis, 6) Cluster Validation, 7) Model-Based Clustering

Type: Notes

2021/2022

On sale since 19/04/2022

crasssass

Data Analysis

SLIDE 1 – Univariate Statistical modelling

Random variable

A random variable (rv) is a variable whose numerical values are determined by the outcome of a random experiment. A rv can be: 1) discrete, if it can take no more than a countable number of values; 2) continuous, if it can take any value in an interval.

1) Discrete random variable

Probability (mass) function: The probability mass function (pmf) 𝑝(𝑥) of a discrete rv 𝑋 expresses the probability that 𝑋 takes the value 𝑥, as a function of 𝑥. The pmf $p : \mathbb{R} \to [0, 1]$ is defined as $p(x) = P(X = x)$.

Support: the support $S_x$ of a discrete rv 𝑋 is defined as the set of possible values of 𝑋, that is $S_x = \{x \in \mathbb{R} : p(x) > 0\}$.

Properties: The pmf satisfies the following properties:
1) $p(x) > 0$ for $x \in S_x$;
2) $p(x) = 0$ for $x \notin S_x$;
3) $\sum_{x \in S_x} p(x) = 1$.

Cumulative distribution function: The cumulative distribution function (cdf) 𝐹(𝑥) of a discrete rv 𝑋 expresses the probability that 𝑋 does not exceed the value 𝑥, as a function of 𝑥. The cdf $F : \mathbb{R} \to [0, 1]$ is defined as
$F(x) = P(X \le x) = \sum_{\{t \in S_x;\, t \le x\}} p(t)$
where the notation indicates that the summation is over all possible values $t \in S_x$ that are less than or equal to 𝑥.

Properties: The cdf satisfies the following properties:
1) $0 \le F(x) \le 1$ for every $x \in \mathbb{R}$;
2) if $x_0$ and $x_1$ are two numbers such that $x_0 < x_1$, then $F(x_0) \le F(x_1)$.
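
As a minimal illustration of the pmf, support and cdf definitions above, the following Python sketch uses a fair six-sided die. The die and the helper names `p`, `support` and `F` are assumptions chosen for illustration; they do not come from the notes.

```python
from fractions import Fraction

# Hypothetical example: pmf of a fair six-sided die (not taken from the notes).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def p(x):
    """pmf: P(X = x); it is zero for every x outside the support."""
    return pmf.get(x, Fraction(0))


def support():
    """Support S_x: the values with strictly positive probability."""
    return sorted(x for x, px in pmf.items() if px > 0)


def F(x):
    """cdf: P(X <= x), summing p(t) over the support points t <= x."""
    return sum((p(t) for t in support() if t <= x), Fraction(0))


# Property 3 of the pmf: the probabilities over the support sum to 1.
total = sum(p(x) for x in support())
```

Using exact fractions keeps the checks free of rounding error; for instance `F(3.5)` equals `F(3)` because the cdf of a discrete rv is a step function.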


Expectation: The expectation (also called mean or expected value) of a discrete rv 𝑋 is defined as $E(X) = \mu_X = \sum_{x \in S_x} x\,p(x)$, where the summation is over all possible values $x \in S_x$.

Generalization: Let 𝑋 be a discrete rv with pmf 𝑝(𝑥). Moreover, let 𝑔(𝑋) be some function of 𝑋. The expected value of 𝑔(𝑋) is defined as $E[g(X)] = \sum_{x \in S_x} g(x)\,p(x)$.

Variance: The expectation of the squared discrepancy about the mean, $(X - \mu_X)^2$, is called the variance, commonly denoted by $\sigma_X^2$, and it is given by $E[(X - \mu_X)^2] = \sigma_X^2 = \sum_{x \in S_x} (x - \mu_X)^2\,p(x)$.
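
The expectation, its generalization to 𝑔(𝑋) and the variance can be sketched in a few lines of Python. The pmf below (number of heads in two fair coin tosses) and the function name `expectation` are illustrative assumptions, not part of the notes.

```python
from fractions import Fraction

# Hypothetical pmf: number of heads in two fair coin tosses.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}


def expectation(pmf, g=lambda x: x):
    """E[g(X)] = sum over the support of g(x) * p(x); g defaults to the identity."""
    return sum(g(x) * px for x, px in pmf.items())


mu = expectation(pmf)                              # E(X)
var = expectation(pmf, lambda x: (x - mu) ** 2)    # E[(X - mu)^2]
second_moment = expectation(pmf, lambda x: x**2)   # E(X^2)
```

The variance is obtained here simply by passing the squared discrepancy as the function 𝑔, which mirrors how the definition of variance is a special case of the generalization.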

2) Continuous random variables

Probability density function: Let 𝑋 be a continuous rv. The probability density function (pdf) of 𝑋 is a function $f : \mathbb{R} \to \mathbb{R}_0^+$ with the following properties:

  1. $f(x) \ge 0$ for any $x \in \mathbb{R}$;

  2. $P(a \le X \le b) = \int_a^b f(x)\,dx$ for any $a, b \in \mathbb{R}$ with $a \le b$;

  3. $\int_{-\infty}^{\infty} f(x)\,dx = 1$.

If 𝑋 is a continuous rv, then the probability of a single point $x_0$ is null, that is $P(X = x_0) = 0$.

Cumulative distribution function: The cumulative distribution function 𝐹(𝑥) of a continuous rv 𝑋 is $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$. It easily follows that $f(x) = \frac{dF(x)}{dx}$.
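
The pdf properties and the link between pdf and cdf can be checked numerically. The sketch below assumes an exponential density with a hypothetical rate of 2 and a simple midpoint rule; both are choices made only for illustration.

```python
import math

RATE = 2.0  # hypothetical rate parameter of an exponential density


def f(x):
    """pdf of the exponential distribution: non-negative everywhere."""
    return RATE * math.exp(-RATE * x) if x >= 0 else 0.0


def integrate(func, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    h = (b - a) / steps
    return h * sum(func(a + (i + 0.5) * h) for i in range(steps))


def F(x):
    """cdf: integral of f up to x (f is zero below 0, so we start at 0)."""
    return integrate(f, 0.0, x) if x > 0 else 0.0


total_mass = integrate(f, 0.0, 50.0)    # approximates the integral over the real line
prob_ab = integrate(f, 0.5, 1.5)        # P(0.5 <= X <= 1.5)
# Central difference of the cdf at 1.0, which should be close to f(1.0).
slope = (F(1.0 + 1e-4) - F(1.0 - 1e-4)) / 2e-4
```

The upper limit 50 stands in for infinity because the remaining tail mass is negligible at that point for this particular density.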

Expectation: The expectation of a continuous rv 𝑋 is defined as $E(X) = \mu_X = \int_{-\infty}^{\infty} x\,f(x)\,dx$.

Generalization: Let 𝑋 be a continuous rv with pdf 𝑓(𝑥). Moreover, let 𝑔(𝑋) be some function of 𝑋. The expected value of 𝑔(𝑋) is defined as $E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx$.

  • Beta: suited for phenomena with support within an interval (for instance, a rate).

  • Exponential: suited for describing time-between-events data (births, deaths, etc.).

  • Uniform: suited to describe phenomena with maximum uncertainty.

Random sample

A random sample of size 𝑛 is the set $X_1, \ldots, X_i, \ldots, X_n$ of rv’s associated to 𝑛 independent and identically distributed (iid) observations of the rv 𝑋.

Observed sample: an observed sample of size 𝑛 is the set $x_1, \ldots, x_i, \ldots, x_n$ constituting the realizations of the rv’s $X_1, \ldots, X_i, \ldots, X_n$ through the 𝑛 sample units.

Sample (joint) distribution: the sample (joint) distribution is the joint distribution $f_n$ (or $p_n$) of $X_1, \ldots, X_i, \ldots, X_n$.

If 𝑋 is a continuous rv and has pdf 𝑓(𝑥; 𝜗), the joint density of the sample is given by
$f_n(x_1, \ldots, x_i, \ldots, x_n; \vartheta) = \prod_{i=1}^{n} f(x_i; \vartheta)$.

If 𝑋 is a discrete rv and has pmf 𝑝(𝑥; 𝜗), the joint probability of the sample is given by
$p_n(x_1, \ldots, x_i, \ldots, x_n; \vartheta) = \prod_{i=1}^{n} p(x_i; \vartheta)$.
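
Because the observations are iid, the joint density (or probability) of the sample is just a product of one-dimensional terms. The sketch below assumes a normal pdf for the continuous case and a Bernoulli pmf for the discrete case; the observed values are made up for illustration.

```python
import math


def normal_pdf(x, mu, sigma2):
    """Normal density with mean mu and variance sigma2."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2 * math.pi * sigma2)


def joint_density(sample, mu, sigma2):
    """f_n(x_1, ..., x_n; theta): product of the individual densities."""
    return math.prod(normal_pdf(x, mu, sigma2) for x in sample)


def bernoulli_pmf(x, theta):
    """Bernoulli pmf; assumes x is either 0 or 1."""
    return theta if x == 1 else 1 - theta


def joint_probability(sample, theta):
    """p_n(x_1, ..., x_n; theta): product of the individual probabilities."""
    return math.prod(bernoulli_pmf(x, theta) for x in sample)


observed = [0.3, -1.2, 0.8]                            # hypothetical continuous sample
density = joint_density(observed, mu=0.0, sigma2=1.0)
prob = joint_probability([1, 0, 1, 1], theta=0.5)      # hypothetical binary sample
```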

Parametric inference

Parametric estimation: let $X_1, \ldots, X_i, \ldots, X_n$ be a random sample from a continuous rv 𝑋 having pdf 𝑓(𝑥; 𝜗), where $\vartheta = (\vartheta_1, \ldots, \vartheta_r, \ldots, \vartheta_k)'$. The parameter vector 𝜗 is unknown and we wish to “estimate” it based on the random sample. This is the typical context for parametric estimation. In other words, the functional form 𝑓 is known (or at least assumed to be!), but 𝜗 is not!

Example: Let 𝜇 be the (population) mean of 𝑋. Suppose 𝜇 is unknown and we want to estimate it using the random sample. Note that here no functional form 𝑓 is assumed for the pdf of 𝑋!

This easily extends to discrete rv’s.

Point estimation

Let $X_1, \ldots, X_i, \ldots, X_n$ be a random sample from a population with density 𝑓(𝑥; 𝜗). We will indicate with 𝜏(𝜗) a function of the unknown parameters 𝜗 that we wish to estimate using the random sample.

Example: Let $f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$. In this case $\vartheta = (\mu, \sigma^2)'$ and we might want to estimate: 1) the coefficient of variation $\tau(\mu, \sigma^2) = \sqrt{\sigma^2}/\mu$; 2) the mean $\tau(\mu, \sigma^2) = \mu$; 3) the variance $\tau(\mu, \sigma^2) = \sigma^2$.

Estimator: An estimator of 𝜏(𝜗) is defined as some (hopefully appropriate) function $T = T(X_1, \ldots, X_i, \ldots, X_n)$ of the sample variables. Note that the estimator 𝑇, being a function of random variables, is itself a random variable!
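
To make the idea of an estimator as a function of the sample concrete, the sketch below computes estimates of the mean, the variance and the coefficient of variation from two observed samples. The specific estimators (sample mean, sample variance with divisor 𝑛 − 1, and their ratio) and the data are assumptions chosen for illustration; the notes do not prescribe them here.

```python
import math
from statistics import mean, variance


def estimate_mean(sample):
    """Sample mean, used here as an estimator of mu."""
    return mean(sample)


def estimate_variance(sample):
    """Sample variance with divisor n - 1, used as an estimator of sigma^2."""
    return variance(sample)


def estimate_cv(sample):
    """Plug-in estimator of the coefficient of variation sqrt(sigma^2) / mu."""
    return math.sqrt(estimate_variance(sample)) / estimate_mean(sample)


# Two hypothetical observed samples from the same population.
sample_a = [4.0, 6.0, 5.0, 7.0, 3.0]
sample_b = [5.0, 5.0, 6.0, 4.0, 8.0]

estimates_a = (estimate_mean(sample_a), estimate_variance(sample_a), estimate_cv(sample_a))
estimates_b = (estimate_mean(sample_b), estimate_variance(sample_b), estimate_cv(sample_b))
```

The two samples give different estimates, which is the practical face of the remark that an estimator is itself a random variable.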

Unfortunately, an estimator with 𝑀𝑆𝐸(𝑇) uniformly smaller with respect to 𝜗 than any other estimator does not exist.

Our goal: set up “desirable” conditions that 𝑇 must fulfil to restrict the range of admissible estimators; hopefully there will be one better than the others!

Unbiased estimators (the ones we will study)

Unbiased estimator: the estimator 𝑇 of 𝜏(𝜗) is unbiased if 𝐸(𝑇) = 𝜏(𝜗), for every 𝜗. The “central value of 𝑇”, that is 𝐸(𝑇), coincides with 𝜏(𝜗).

This property allows us to single out a subset of estimators of much interest, the unbiased estimators. On top of this, given that their MSE coincides with their variance, to choose among unbiased estimators it is enough to look for the one with the lowest variance.

Bias of an estimator

If 𝐸(𝑇) ≠ 𝜏(𝜗), we say that the estimator is biased.

Bias of an estimator: the bias of an estimator 𝑇 of 𝜏(𝜗) is the difference $E(T) - \tau(\vartheta)$.

If 𝐸(𝑇) > 𝜏(𝜗) the estimator 𝑇 is positively biased, with bias $E(T) - \tau(\vartheta) > 0$: 𝑇 tends to overestimate 𝜏(𝜗);

If 𝐸(𝑇) < 𝜏(𝜗) the estimator 𝑇 is negatively biased, with bias $E(T) - \tau(\vartheta) < 0$: 𝑇 tends to underestimate 𝜏(𝜗).
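
Bias can be computed exactly for a small discrete population by enumerating every possible sample. The sketch below compares two estimators of the variance, one with divisor 𝑛 and one with divisor 𝑛 − 1; the population pmf, the sample size and the function names are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population pmf and sample size.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
n = 3

mu = sum(x * p for x, p in pmf.items())
sigma2 = sum((x - mu) ** 2 * p for x, p in pmf.items())   # quantity to estimate


def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)


def t_divisor_n(sample):
    return sum_sq_dev(sample) / len(sample)


def t_divisor_n_minus_1(sample):
    return sum_sq_dev(sample) / (len(sample) - 1)


def exact_expectation(estimator):
    """E(T): weight the estimate from every possible iid sample by its probability."""
    total = Fraction(0)
    for sample in product(pmf, repeat=n):
        prob = Fraction(1)
        for x in sample:
            prob *= pmf[x]
        total += estimator(sample) * prob
    return total


bias_n = exact_expectation(t_divisor_n) - sigma2
bias_n_minus_1 = exact_expectation(t_divisor_n_minus_1) - sigma2
```

In this setting the divisor-𝑛 estimator has a negative bias (it tends to underestimate the variance), while the divisor-(𝑛 − 1) estimator has zero bias.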

Cramér-Rao Lower Bound (related to unbiased estimators)

Within the class of unbiased estimators, the most important feature is variance. Under certain regularity conditions it is possible to show that the variance of any unbiased estimator is greater than, or equal to, a quantity, which is the lower bound of the variance of unbiased estimators.

An unbiased estimator 𝑇 is efficient if its variance reaches the CR lower bound.

Maximum likelihood method

Perhaps ML is the most widespread method for building estimators, although arguably not the easiest one (see the method of moments). Operationally, the ML method can be summarized in 3 steps.

→ Step 1

Keeping in mind that we are dealing with a random sample of iid observations, first we compute the likelihood function
$L(\vartheta; x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i; \vartheta)$.

Remark: note that the formula is the same as the sample joint density $f_n(x_1, \ldots, x_n; \vartheta)$, but this is now a function of 𝜗 given the sample.

Interpretation: $L(\vartheta; x_1, \ldots, x_n)$ indicates the likelihood (informally, the probability) that the observed sample originates from 𝑋 having density 𝑓(𝑥; 𝜗).

Objective: our goal is to maximize $L(\vartheta; x_1, \ldots, x_n)$ with respect to 𝜗.

→ Step 2

From the likelihood function we get to the log-likelihood function $l(\vartheta; x_1, \ldots, x_n) = \ln L(\vartheta; x_1, \ldots, x_n)$ for two main reasons:

1) taking the logarithm does not alter the location of the minima and the maxima of $L(\vartheta; x_1, \ldots, x_n)$, since the logarithm is a monotonically increasing function;

2) the logarithmic function has some useful properties:

a) it easily handles the exponentials ($\ln a^x = x \ln a$ and $\ln e^x = x$) that most widespread distributions have (normal, gamma, etc.);

b) it easily handles products ($\ln \prod_{i=1}^{n} f(x_i) = \sum_{i=1}^{n} \ln f(x_i)$) like those of $L(\vartheta; x_1, \ldots, x_n)$;

c) it magnifies differences (useful from a computational point of view).
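
The computational advantage of the log-likelihood can be seen with a short numerical sketch. It assumes a normal model with known unit standard deviation and a made-up deterministic sample of 2000 values; both are illustrative choices only.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)


def likelihood(sample, mu, sigma):
    """Product of densities: with many observations it underflows to 0.0."""
    return math.prod(normal_pdf(x, mu, sigma) for x in sample)


def log_likelihood(sample, mu, sigma):
    """Sum of log-densities: stays finite and keeps candidates comparable."""
    return sum(math.log(normal_pdf(x, mu, sigma)) for x in sample)


# Hypothetical sample: 2000 deterministic values spread over [-1, 1].
sample = [0.1 * (i % 21) - 1.0 for i in range(2000)]

L_a = likelihood(sample, 0.0, 1.0)
L_b = likelihood(sample, 0.5, 1.0)
l_a = log_likelihood(sample, 0.0, 1.0)
l_b = log_likelihood(sample, 0.5, 1.0)
```

In double precision both likelihood values collapse to zero, so they cannot be compared, whereas the two log-likelihoods remain finite and clearly favour the mean value 0 over 0.5 for this sample.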

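→ Step 3 (perhaps not too important)

The ML estimator of 𝜗 is found by maximizing $l(\vartheta; x_1, \ldots, x_n)$ with respect to $\vartheta = (\vartheta_1, \ldots, \vartheta_r, \ldots, \vartheta_k)'$, that is by solving the system of 𝑘 equations obtained by equating the 𝑘 partial derivatives to 0:
$\frac{\partial\, l(\vartheta; x_1, \ldots, x_n)}{\partial \vartheta_r} = 0, \quad r = 1, \ldots, k$.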
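
The three steps can be sketched for a one-parameter case. The example assumes an exponential density $f(x; \vartheta) = \vartheta e^{-\vartheta x}$, for which the log-likelihood is $n \ln \vartheta - \vartheta \sum_i x_i$ and setting the derivative to zero gives $\hat{\vartheta} = n / \sum_i x_i$. The model, the sample and the grid search used as a cross-check are all illustrative assumptions.

```python
import math

# Hypothetical observed sample (values chosen so that the sum is exactly 5.0).
sample = [0.5, 1.25, 0.25, 2.0, 1.0]


def log_likelihood(theta, xs):
    """l(theta) = n ln(theta) - theta * sum(x) for the exponential model."""
    return len(xs) * math.log(theta) - theta * sum(xs)


def score(theta, xs):
    """Derivative of the log-likelihood with respect to theta."""
    return len(xs) / theta - sum(xs)


# Closed-form solution of score(theta) = 0.
theta_hat = len(sample) / sum(sample)

# Cross-check: brute-force maximization of l over a grid of candidate values.
grid = [0.01 * k for k in range(1, 501)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, sample))
```

The grid search is only a sanity check that the stationary point is indeed a maximum; with more than one parameter one would solve the system of 𝑘 equations or use a numerical optimizer.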

Goodness-of-fit tests

There are 3 main statistical tests to evaluate the goodness-of-fit of a statistical (theoretical) model (that we put under the null hypothesis $H_0$) to the empirical distribution (namely the distribution of the observed sample). In other words, there are 3 types of test that can be carried out to assess how well a data set is described by a hypothesized distribution:

  1. Pearson’s chi-square test;

  2. Kolmogorov-Smirnov test;

  3. Likelihood-ratio test.

Note: The null hypothesis is a hypothesis which is assumed true until there is evidence to the contrary. The alternative hypothesis is a hypothesis opposite to the null one, which is accepted only if there is strong evidence in its favour.

1) Pearson’s chi-square test

In Pearson’s chi-square goodness-of-fit test the sample data of size 𝑛, if of a continuous type, are divided into 𝑠 intervals (or classes or bins) $A_1, \ldots, A_s$. Then the numbers of points $n_1, \ldots, n_s$ that fall into the intervals $A_1, \ldots, A_s$ are compared with the expected numbers of points $n_1^{*}, \ldots, n_s^{*}$ (under the null model) in those intervals. The null hypothesis $H_0$ assigns the probabilities $\pi_1, \ldots, \pi_s$ to $A_1, \ldots, A_s$.

Hypotheses: the null and alternative hypotheses are:
$H_0: p_1 = \pi_1, \ldots, p_s = \pi_s$
$H_1: p_j \neq \pi_j$ for at least one $j$
where $p_1, \ldots, p_s$ represent the true but unknown probabilities of $A_1, \ldots, A_s$.

Remark: it is natural to assume that the rejection of $H_0$ depends on the discrepancy between $n_1, \ldots, n_s$ and $n_1^{*}, \ldots, n_s^{*}$. The test statistic is $\chi^2 = \sum_{j=1}^{s} \frac{(n_j - n_j^{*})^2}{n_j^{*}}$, with $n_j^{*} = n\pi_j$.

Remarks:

  • The distribution of $\chi^2$ is asymptotic and evaluated under the null hypothesis.

  • Let 𝑞 be the number of unknown parameters of the null model. If these parameters need to be estimated from the sample data, then the degrees of freedom of $\chi^2$ reduce from 𝑠 − 1 to 𝑠 − 1 − 𝑞.

The critical region of level 𝛼 is defined as $\{\chi^2 : \chi^2 > c\}$, where $c = \chi^2_{[(s-1);\,1-\alpha]}$ is the quantile of order 1 − 𝛼 from $\chi^2_{(s-1)}$.
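
A minimal sketch of the test for a fully specified null model follows. It assumes the standard Pearson statistic, the sum over classes of (observed − expected)² / expected with expected count 𝑛𝜋, and uses hypothetical counts from 60 rolls of a die. The critical value 11.0705 is the tabulated quantile of order 0.95 of the chi-square distribution with 5 degrees of freedom, hard-coded to keep the example dependency-free.

```python
# Hypothetical observed counts for the six faces of a die (n = 60).
observed = [8, 12, 9, 11, 6, 14]
# Null model: a fair die, so every class has probability 1/6.
null_probs = [1 / 6] * 6


def pearson_statistic(observed, null_probs):
    """Sum over classes of (observed - expected)^2 / expected."""
    n = sum(observed)
    expected = [n * pi for pi in null_probs]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Quantile of order 0.95 of the chi-square distribution with s - 1 = 5 df.
CRITICAL_95_DF5 = 11.0705

chi2 = pearson_statistic(observed, null_probs)
reject_h0 = chi2 > CRITICAL_95_DF5
```

With these counts the statistic stays well below the critical value, so the fair-die hypothesis would not be rejected at the 5% level.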

2) Kolmogorov-Smirnov test

The Kolmogorov-Smirnov test (KS test) is a goodness-of-fit test that can be used for two different purposes.

One-sample KS test: it is used to compare the empirical cumulative distribution function 𝐹(𝑥) with the theoretical cumulative distribution function 𝜙(𝑥) under the null. The KS statistic quantifies the distance between 𝐹(𝑥) and 𝜙(𝑥) as the largest absolute difference between them, $\sup_x |F(x) - \phi(x)|$.
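
The one-sample KS distance can be sketched as follows, assuming the usual form of the statistic (the largest absolute difference between the empirical cdf and the theoretical cdf) and a continuous theoretical cdf. The uniform null model and the sample are hypothetical.

```python
def ks_statistic(sample, cdf):
    """Largest absolute gap between the empirical cdf and a theoretical cdf.

    The empirical cdf jumps from (i - 1)/n to i/n at the i-th ordered value,
    so the gap is checked on both sides of each jump.
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        phi = cdf(x)
        d = max(d, i / n - phi, phi - (i - 1) / n)
    return d


def uniform_cdf(x):
    """cdf of the uniform distribution on [0, 1] (the assumed null model)."""
    return min(1.0, max(0.0, x))


sample = [0.1, 0.4, 0.45, 0.7, 0.95]   # hypothetical observations
d = ks_statistic(sample, uniform_cdf)
```

Only the distance is computed here; deciding whether it is large enough to reject the null requires the critical values of the KS distribution, which are not covered in this sketch.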

3) Likelihood-ratio test

The critical region of level 𝛼 is defined as $\{-2 \ln LR > c\}$, where $c = \chi^2_{[m;\,1-\alpha]}$ is the quantile of order 1 − 𝛼 from $\chi^2_m$.

Interpretation of LR:

→ The numerator of LR corresponds to the likelihood of an observed outcome under the null hypothesis.

→ The denominator of LR corresponds to the maximum likelihood of an observed outcome, varying parameters over the whole parameter space Θ. The numerator of this ratio is, by definition, no greater than the denominator; so, LR is between 0 and 1.

→ Low values of LR mean that the observed result was much less likely to occur under $H_0$ as compared to $H_1$.

→ High values (close to 1) of LR mean that the observed outcome was nearly as likely to occur under the null hypothesis as the alternative, and so the null hypothesis cannot be rejected.

Remark: The LR test requires the competing models to be nested, i.e. the more complex model can be transformed into the simpler model by imposing constraints on the former’s parameters.
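
The interpretation above can be illustrated with a small nested comparison. The sketch assumes an exponential model with rate 𝜗, a null hypothesis that fixes 𝜗 = 1, and the common large-sample rule that compares −2 ln LR with a chi-square quantile with 1 degree of freedom (one constraint). The data, the null value and the hard-coded quantile 3.8415 (order 0.95, 1 df) are assumptions for illustration.

```python
import math

# Hypothetical observed sample (sum is exactly 2.0, n = 5).
sample = [0.25, 0.5, 0.25, 0.5, 0.5]
THETA_0 = 1.0   # rate fixed by the null hypothesis (simpler, nested model)


def log_likelihood(theta, xs):
    """Exponential log-likelihood: n ln(theta) - theta * sum(x)."""
    return len(xs) * math.log(theta) - theta * sum(xs)


# Denominator: likelihood maximized over the whole parameter space.
theta_hat = len(sample) / sum(sample)

# LR = L(theta_0) / L(theta_hat), computed on the log scale for stability.
log_lr = log_likelihood(THETA_0, sample) - log_likelihood(theta_hat, sample)
lr = math.exp(log_lr)
statistic = -2.0 * log_lr

# Quantile of order 0.95 of the chi-square distribution with 1 df.
CRITICAL_95_DF1 = 3.8415
reject_h0 = statistic > CRITICAL_95_DF1
```

The ratio falls between 0 and 1 as expected, and with this sample the statistic stays below the critical value, so the constrained model would not be rejected at the 5% level.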

SLIDE 2 – Basics of Matrices

Triangular matrix: In the mathematical discipline of linear algebra, a triangular matrix is a special kind of square matrix. A square matrix is called lower triangular if all the entries above the main diagonal are equal to 0. Similarly, a square matrix is called upper triangular if all the entries below the main diagonal are equal to 0.
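
The two definitions translate directly into a pair of checks. The sketch below represents a matrix as a list of rows; the function names and the example matrices are illustrative assumptions.

```python
def is_square(m):
    """A matrix (list of rows) is square if every row has as many entries as there are rows."""
    return all(len(row) == len(m) for row in m)


def is_lower_triangular(m):
    """Square, and every entry above the main diagonal (column > row) is 0."""
    n = len(m)
    return is_square(m) and all(m[i][j] == 0 for i in range(n) for j in range(i + 1, n))


def is_upper_triangular(m):
    """Square, and every entry below the main diagonal (column < row) is 0."""
    n = len(m)
    return is_square(m) and all(m[i][j] == 0 for i in range(n) for j in range(i))


lower = [[1, 0, 0],
         [2, 3, 0],
         [4, 5, 6]]
upper = [[1, 2, 3],
         [0, 4, 5],
         [0, 0, 6]]
diagonal = [[7, 0],
            [0, 9]]
```

A diagonal matrix passes both checks, since it has zeros both above and below the main diagonal.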