Data Analysis: Univariate and Multivariate - Prof. Punzo, Course summary, Statistics

An introduction to the theory and techniques of univariate and multivariate data analysis, including the concepts of continuous and discrete random variables, random sample, parametric estimation, and multivariate methods such as PCA and cluster analysis. It also presents techniques such as the assessment of cluster tendency and several probability distributions such as the normal, gamma and Poisson.

Type: Course summary

2020/2021

Uploaded on 17/02/2022

AleStudioUni
Data Analysis

  • Univariate analysis
    • Continuous Random Variable
    • Discrete Random Variable
    • Random Sample
    • Parametric Estimation
  • Multivariate Analysis
    • Principal Component Analysis
    • Cluster Analysis
      • One) Assessing cluster tendency (cluster validation technique)
      • Two) Dissimilarity Matrix
        • Continuous
        • Categorical
        • Binary
        • Mixed
      • Three) Choosing k (cluster validation)
      • Four / one) Hierarchical clustering
      • Four / two) Partitioning k-means clustering
      • Four / three) Partitioning k-medoids clustering
      • Four / four) Model-based Clustering
      • Five) Cluster validation
      • Six) Choosing best algorithm
Univariate analysis

  • Probability Density Function f(x) (continuous random variable). Properties:
  1. $f(x) \ge 0$ for every x on the real line.
  2. $P(a < X < b) = \int_a^b f(x)\,dx$ for every a < b on the real line (the area under the curve over the interval).
  3. The whole area under the curve is equal to 1: $\int_{-\infty}^{+\infty} f(x)\,dx = 1$.
  4. The probability of a single point is equal to zero: $P(X = x_0) = 0$.
  • Probability Mass Function p(x) (discrete random variable). Properties:
  1. $p(x) = P(X = x)$ is the probability of a single point.
  2. $p(x) > 0$ if x belongs to the support $S_X$.
  3. $p(x) = 0$ if x does not belong to the support.
  4. The probabilities of all the points of the support sum to 1: $\sum_{x \in S_X} p(x) = 1$.
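The properties above can be checked numerically. The following is a minimal sketch, assuming a standard Normal density for the continuous case and a Binomial(10, 0.3) mass function for the discrete case; the helper names (`normal_pdf`, `integrate`, `binomial_pmf`) and the parameter values are illustrative choices, not part of the notes.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Normal(mu, sigma^2) distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a, b, steps=20_000):
    """Trapezoidal approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

def binomial_pmf(x, m, p):
    """P(X = x) for a Binomial(m, p) distribution."""
    return math.comb(m, x) * p**x * (1.0 - p) ** (m - x)

# Continuous case: the area under the density is 1
# (the tails beyond -10 and +10 are negligible for a standard Normal).
total_area = integrate(normal_pdf, -10.0, 10.0)

# P(-1 < X < 1) is the area under the density between -1 and 1.
prob_interval = integrate(normal_pdf, -1.0, 1.0)

# Discrete case: the probabilities over the support {0, ..., m} sum to 1.
m, p = 10, 0.3
total_mass = sum(binomial_pmf(x, m, p) for x in range(m + 1))

print(total_area, prob_interval, total_mass)
```

Shrinking the interval [a, b] toward a single point drives the integral to zero, which is the numerical counterpart of $P(X = x_0) = 0$.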

  • Cumulative Distribution Function:
    • Continuous: $F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$
    • Discrete: $F(x) = P(X \le x) = \sum_{t \in S_X,\ t \le x} p(t)$
    • In both cases $0 \le F(x) \le 1$ and F is non-decreasing: if $x_0 < x_1$ then $F(x_0) \le F(x_1)$.

  • Expectation:
    • Continuous: $E(X) = \int_{-\infty}^{+\infty} x\, f(x)\,dx$
    • Discrete: $E(X) = \sum_{x \in S_X} x\, p(x)$

  • Variance (expected squared distance from the mean):
    • Continuous: $Var(X) = \int_{-\infty}^{+\infty} (x - E(X))^2 f(x)\,dx$
    • Discrete: $Var(X) = \sum_{x \in S_X} (x - E(X))^2 p(x)$
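As a worked illustration of the expectation and variance formulas, the sketch below evaluates the sum for a Binomial(10, 0.3) variable and the integral for a Normal(2, 1.5^2) variable. The distributions, parameter values and helper names are assumptions chosen for the example; the integral is approximated with a simple trapezoidal rule over a wide but finite interval.

```python
import math

def binomial_pmf(x, m, p):
    """P(X = x) for a Binomial(m, p) distribution."""
    return math.comb(m, x) * p**x * (1.0 - p) ** (m - x)

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a, b, steps=40_000):
    """Trapezoidal approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

# Discrete case: sums over the support {0, ..., m}.
m, p = 10, 0.3
support = range(m + 1)
disc_mean = sum(x * binomial_pmf(x, m, p) for x in support)
disc_var = sum((x - disc_mean) ** 2 * binomial_pmf(x, m, p) for x in support)

# Continuous case: integrals, truncated to mu +- 10 sigma.
mu, sigma = 2.0, 1.5
lo, hi = mu - 10 * sigma, mu + 10 * sigma
cont_mean = integrate(lambda x: x * normal_pdf(x, mu, sigma), lo, hi)
cont_var = integrate(lambda x: (x - cont_mean) ** 2 * normal_pdf(x, mu, sigma), lo, hi)

print(disc_mean, disc_var)   # compare with m*p and m*p*(1-p)
print(cont_mean, cont_var)   # compare with mu and sigma^2
```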

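  • Examples of distributions:
    • Continuous:
  1. Normal distribution: $f(x; \mu, \sigma^2)$, with $\mu \in \mathbb{R}$ and $\sigma^2 > 0$.
  2. Gamma distribution (support: R+): $f(x; \alpha, \beta)$, with $x > 0$ and $\alpha > 0$.
    • Discrete:
  1. Bernoulli (support: {0, 1}): $P(X = 0) = 1 - p$ and $P(X = 1) = p$; $E(X) = p$, $Var(X) = p(1 - p)$.
  2. Binomial (support: {0, 1, ..., m}, a finite set of non-negative integers): $p(x; m, p)$; $E(X) = mp$.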
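The link between the Bernoulli and Binomial moments can be illustrated by simulation. This is a sketch under stated assumptions: a Binomial(m, p) draw is built as the number of successes in m independent Bernoulli(p) trials, the seed and parameter values are arbitrary, and the Binomial variance mp(1 - p) is a standard result that the notes do not list.

```python
import random

rng = random.Random(42)  # fixed seed so the run is reproducible

def bernoulli(p):
    """Draw 1 with probability p and 0 with probability 1 - p."""
    return 1 if rng.random() < p else 0

def binomial(m, p):
    """Draw a Binomial(m, p) value as the number of successes in m Bernoulli trials."""
    return sum(bernoulli(p) for _ in range(m))

m, p, n_draws = 10, 0.3, 20_000
draws = [binomial(m, p) for _ in range(n_draws)]

sample_mean = sum(draws) / n_draws
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (n_draws - 1)

print(sample_mean, "vs m*p =", m * p)
print(sample_var, "vs m*p*(1-p) =", m * p * (1 - p))
```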

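Parametric Estimation

Bias of an estimator T of the parameter theta:

  • E(T) = theta: unbiased estimator
  • E(T) > theta: positively biased estimator (it overestimates theta)
  • E(T) < theta: negatively biased estimator (it underestimates theta)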
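A classic example of the three cases is the estimation of a variance. The sketch below, under the assumption of Normal data with known true variance 4 and small samples of size 5, averages two estimators over many repetitions: dividing the sum of squares by n gives a negatively biased estimator, while dividing by n - 1 gives an unbiased one. The seed, sample size and number of repetitions are arbitrary choices.

```python
import random

rng = random.Random(1)   # fixed seed so the run is reproducible
true_var = 4.0           # theta: the variance of the population
n, repetitions = 5, 20_000

sum_div_n = 0.0
sum_div_n_minus_1 = 0.0
for _ in range(repetitions):
    sample = [rng.gauss(0.0, true_var ** 0.5) for _ in range(n)]
    mean = sum(sample) / n
    sum_of_squares = sum((x - mean) ** 2 for x in sample)
    sum_div_n += sum_of_squares / n                 # T1: divides by n
    sum_div_n_minus_1 += sum_of_squares / (n - 1)   # T2: divides by n - 1

# Monte Carlo approximations of E(T1) and E(T2).
avg_div_n = sum_div_n / repetitions
avg_div_n_minus_1 = sum_div_n_minus_1 / repetitions

print("E(T1) ~", avg_div_n, "(theory: (n-1)/n * theta =", (n - 1) / n * true_var, ")")
print("E(T2) ~", avg_div_n_minus_1, "(theory: theta =", true_var, ")")
```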

  1. Goodness of fit: check whether the distribution with the estimated parameters fits the distribution observed in the random sample.
  • Pearson's chi-square test: compares observed and expected frequencies ($\chi^2$ statistic and p-value).
  • Kolmogorov-Smirnov test: measures the maximum distance between an empirical distribution and a theoretical one, or between two empirical distributions (distance D and p-value).
  • Likelihood-ratio test: compares two nested models (LR statistic and p-value).
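The distance used by the Kolmogorov-Smirnov test can be computed directly from its definition. The sketch below is illustrative only: it computes the one-sample statistic (maximum gap between the empirical and the theoretical cumulative distribution function) and leaves out the p-value; the simulated sample, the seed and the two candidate Normal distributions are assumptions made for the example.

```python
import math
import random

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution function of a Normal(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Maximum distance between the empirical CDF of the sample and a theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x: check both sides of the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

rng = random.Random(7)
sample = [rng.gauss(0.0, 1.0) for _ in range(200)]

# Well-specified model: the data really come from a Normal(0, 1).
d_good = ks_statistic(sample, normal_cdf)
# Misspecified model: the mean is wrong by two standard deviations.
d_bad = ks_statistic(sample, lambda x: normal_cdf(x, mu=2.0))

print(d_good, d_bad)
```

A small distance supports the fitted distribution, while a large one signals a poor fit; the test turns this distance into a p-value.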

Principal Component Analysis

  • Dimensionality reduction
  • Eigendecomposition of the covariance matrix
  • $values -> diagonal -> variance of each Principal Component
  • $vectors -> columns -> weights (loadings) of each Principal Component
  • Matrix multiplication of the centered data by the loadings -> Principal Component scores
  • Original space: variables are correlated and carry high redundancy VS Principal Component space: orthogonal axes that carry as much information (variance) as possible and are uncorrelated
  • Biplot: Principal Component space + variables (arrows):
    o Between arrows: angle < 90° positive correlation, angle close to 180° negative correlation, angle = 90° no correlation; the length of the arrow shows how well the variable is explained
    o Arrows and PC: small angle = variable well explained by the PC, same direction = positive correlation, opposite direction = negative correlation
    o Origin: the data are mean-centered
  • Choosing the number of PCs:
    o Kaiser's rule: keep the components with variance (eigenvalue) > 1, for standardized variables
    o Proportion of variance explained: percentage of the total variance explained by the PCs
    o Scree plot: elbow method
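The PCA steps listed above can be traced by hand on two variables. This is a minimal sketch, assuming a small made-up data set and using the closed-form eigendecomposition of a symmetric 2x2 matrix (valid here because the covariance between the two variables is not zero); the variable names `eigenvalues`, `loadings` and `scores` play the roles of $values, $vectors and the Principal Component scores.

```python
import math

# Hypothetical data: two correlated variables, one tuple per observation.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
n = len(data)

# Mean-center the data (the origin of the biplot).
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
centered = [(x - mean_x, y - mean_y) for x, y in data]

# Sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix, largest first: variances of the PCs.
trace = sxx + syy
det = sxx * syy - sxy ** 2
half_gap = math.sqrt(trace ** 2 / 4.0 - det)
eigenvalues = [trace / 2.0 + half_gap, trace / 2.0 - half_gap]

def unit_eigenvector(lam):
    """Unit-length eigenvector for eigenvalue lam (requires sxy != 0)."""
    v1, v2 = sxy, lam - sxx
    norm = math.hypot(v1, v2)
    return (v1 / norm, v2 / norm)

# Loadings: one eigenvector per Principal Component.
loadings = [unit_eigenvector(lam) for lam in eigenvalues]

# Scores: centered data multiplied by the loadings.
scores = [tuple(x * v1 + y * v2 for v1, v2 in loadings) for x, y in centered]

# Proportion of variance explained by each component.
explained = [lam / trace for lam in eigenvalues]
print(eigenvalues, explained)
```

The variance of the scores on each component equals the corresponding eigenvalue, and the two score columns are uncorrelated, which is what the Principal Component space is meant to achieve.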


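Choosing k

  • Elbow method:
    o Run the clustering for a range of values of k
    o For each k, compute the total within-cluster sum of squares (WSS)
    o Plot the WSS against the number of clusters k
    o Look for the elbow of the curve
  • Average silhouette method:
    o Run the clustering for a range of values of k
    o Compute the average silhouette (avg.sil) for each k
    o Plot avg.sil against k
    o Choose the k with the maximum avg.sil
  • Gap statistic:
    o For each k, compute the within-cluster sum of squares $WSS_k$ of the clustered data
    o Generate B = 500 reference data sets from a uniform distribution, cluster them, and compute their within-cluster sum of squares $WSS_{kb}$
    o Compute $Gap(k) = \frac{1}{B} \sum_{b=1}^{B} \log(WSS_{kb}) - \log(WSS_k)$
    o Choose the smallest k such that $Gap(k) \ge Gap(k+1) - s_{k+1}$, where $s_{k+1}$ measures the simulation error of the reference values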
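The average silhouette method can be sketched as follows. The example assumes one-dimensional data with absolute difference as dissimilarity, and it takes the candidate partitions for k = 2, 3, 4 as given (hypothetical labelings standing in for the output of a clustering algorithm), so that only the silhouette computation itself is shown.

```python
def average_silhouette(points, labels):
    """Average silhouette width of a partition of 1-D points."""
    n = len(points)
    cluster_ids = set(labels)
    total = 0.0
    for i in range(n):
        # Mean distance from observation i to the members of each cluster.
        mean_dist = {}
        for c in cluster_ids:
            dists = [abs(points[i] - points[j])
                     for j in range(n) if labels[j] == c and j != i]
            if dists:
                mean_dist[c] = sum(dists) / len(dists)
        if labels[i] not in mean_dist or len(mean_dist) < 2:
            continue  # singleton cluster (or a single cluster): silhouette 0 by convention
        a = mean_dist[labels[i]]                                    # cohesion
        b = min(d for c, d in mean_dist.items() if c != labels[i])  # separation
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

# Hypothetical data with three visible groups.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.2, 8.7]

# Hypothetical partitions for each candidate k.
candidate_labels = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],
    4: [0, 0, 1, 2, 2, 2, 3, 3, 3],
}

avg_sil = {k: average_silhouette(points, labels)
           for k, labels in candidate_labels.items()}
best_k = max(avg_sil, key=avg_sil.get)
print(avg_sil, best_k)
```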

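Hierarchical clustering

Clusters with low dissimilarity are merged together, and the resulting tree (dendrogram) can be cut at the height we prefer.

Linkage method:

  • Single linkage method: the distance between two clusters is the smallest distance between their observations
  • Complete linkage method: the distance between two clusters is the greatest distance between their observations
  • Average linkage method: the distance between two clusters is the average distance between their observations
  • Centroid linkage method: the distance between two clusters is the distance between their centroids
  • Ward's linkage method: merge the two clusters that give the minimum increase in within-cluster deviance

Algorithm:

  1. Compute the dissimilarity/distance matrix D and treat each observation as its own cluster
  2. Merge the two nearest clusters according to the linkage method
  3. Update D
  4. Repeat until all the observations are in a single cluster

Validation:

Correlation between the cophenetic distances and the original distances: the closer to 1, the better the tree preserves the original distances.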
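The agglomerative algorithm can be sketched in a few lines. The version below is a simplified illustration, assuming one-dimensional points with absolute difference as distance; instead of updating a stored matrix D it recomputes the pairwise distances at each step, and the `linkage` argument (a hypothetical parameter) takes `min` for single linkage or `max` for complete linkage. The `max_height` argument stops the merging early, which corresponds to cutting the tree at that height.

```python
def agglomerate(points, linkage=min, max_height=float("inf")):
    """Agglomerative clustering of 1-D points.

    Returns the clusters left when no pair is closer than max_height,
    together with the list of merges (cluster, cluster, height).
    """
    clusters = [[i] for i in range(len(points))]  # each observation starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_distances = [abs(points[i] - points[j])
                                  for i in clusters[a] for j in clusters[b]]
                d = linkage(pair_distances)  # distance between the two clusters
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > max_height:
            break  # cutting the tree: the remaining clusters are too far apart
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]  # merge the nearest pair
        del clusters[b]
    return clusters, merges

# Hypothetical data: indices 0-1, 2-3 and 4 form three visible groups.
points = [1.0, 1.5, 5.0, 5.2, 9.0]

full_tree, all_merges = agglomerate(points, linkage=min)
clusters_at_1, _ = agglomerate(points, linkage=min, max_height=1.0)
complete_tree, complete_merges = agglomerate(points, linkage=max)

print(all_merges)
print(clusters_at_1)
```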

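Partitioning k-means clustering

Each cluster is represented by its centroid, the mean of all the observations in the cluster.

Algorithm:

  • Specify k
  • Randomly select k observations as initial centroids
  • Assign each observation to the nearest centroid (Euclidean distance)
  • Recompute the mean of each cluster as its new centroid
  • Repeat the last two steps until convergence (the centroids no longer change)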
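The steps above can be sketched as follows. This is a minimal illustration rather than a production implementation: the data set, the seed and the function names are assumptions, and an empty cluster simply keeps its previous centroid. The helper `total_wss` computes the within-cluster sum of squares used by the elbow method.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Basic k-means on a list of coordinate tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k random observations as initial centroids
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid in Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break  # convergence: the centroids did not move
        centroids = new_centroids
    return labels, centroids

def total_wss(points, labels, centroids):
    """Total within-cluster sum of squared distances to the centroids."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical data: two well-separated groups of three observations.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

labels, centroids = kmeans(points, k=2)
wss_by_k = {}
for k in (1, 2, 3):
    labs, cents = kmeans(points, k)
    wss_by_k[k] = total_wss(points, labs, cents)

print(labels, centroids)
print(wss_by_k)
```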

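Partitioning k-medoids clustering

Each cluster is represented by its medoid: the observation of the cluster with minimum total dissimilarity to all the other observations of the cluster.

Algorithm:

  • Specify k clusters
  • Randomly select k observations as initial medoids
  • Assign each observation to the nearest medoid
  • In each cluster, select as new medoid the observation with minimum total dissimilarity to the others
  • Repeat until convergence (the medoids no longer change)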
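A simple alternating version of these steps is sketched below. It is an illustration under stated assumptions, not the full PAM swap procedure: the data, the seed, the function names and the choice of Manhattan distance as dissimilarity are all made up for the example. Unlike k-means, the representatives are always actual observations, and any dissimilarity function can be plugged in.

```python
import random

def manhattan(p, q):
    """Manhattan dissimilarity between two coordinate tuples."""
    return sum(abs(a - b) for a, b in zip(p, q))

def kmedoids(points, k, dissimilarity=manhattan, seed=0, max_iter=100):
    """Alternating k-medoids; returns (labels, medoid_indices)."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)  # indices of the initial medoids
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest medoid.
        labels = [min(range(k), key=lambda c: dissimilarity(p, points[medoids[c]]))
                  for p in points]
        # Update step: the member with the smallest total dissimilarity to the others.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                new_medoids.append(medoids[c])  # empty cluster: keep old medoid
                continue
            new_medoids.append(min(
                members,
                key=lambda i: sum(dissimilarity(points[i], points[j]) for j in members)))
        if new_medoids == medoids:
            break  # convergence: the medoids did not change
        medoids = new_medoids
    return labels, medoids

# Hypothetical data: two well-separated groups of three observations.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

labels, medoids = kmedoids(points, k=2)
print(labels, [points[m] for m in medoids])
```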

Model-based clustering

Each cluster comes from its own distribution and has its own vector of parameters (mixture of Gaussian distributions):

$p(x; \theta) = w_1 f_1(x; \theta_1) + \dots + w_k f_k(x; \theta_k)$

where $w_j$ is the weight of component j and $\theta_j = (\mu_j, \Sigma_j)$ collects its mean vector and covariance matrix.

The covariance matrix determines the volume, orientation and shape of each cluster (E = equal, V = variable, I = axis-aligned for the orientation or spherical for the shape).

Choosing k: maximize the Bayesian Information Criterion (BIC)
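A one-dimensional sketch of model-based clustering is given below. It is an illustration under several assumptions: the data are simulated from two Normal components, the mixture is fitted with a basic EM algorithm with a fixed number of iterations, each component has its own variance (the "V" case), and BIC is written as 2 log-likelihood minus (number of parameters) times log n, the convention under which larger values are better. Function names, seed and safeguards are illustrative choices.

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a Normal distribution with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_density(x, weights, means, variances):
    """p(x; theta) = sum over components of weight_j * f_j(x; theta_j)."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))

def fit_mixture(data, k, iterations=200):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances, loglik)."""
    n = len(data)
    ordered = sorted(data)
    chunk = n // k
    # Initial values: means of k equal-sized chunks of the sorted data.
    means = [sum(ordered[c * chunk:(c + 1) * chunk]) / chunk for c in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: probability that each observation belongs to each component.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = max(sum(dens), 1e-300)  # guard against underflow
            resp.append([d / total for d in dens])
        # M-step: update weight, mean and variance of each component.
        for c in range(k):
            nc = max(sum(r[c] for r in resp), 1e-12)
            weights[c] = nc / n
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            variances[c] = max(
                sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / nc,
                1e-6)  # floor to avoid a collapsing component
    loglik = sum(math.log(max(mixture_density(x, weights, means, variances), 1e-300))
                 for x in data)
    return weights, means, variances, loglik

def bic(loglik, k, n):
    """BIC in the 'larger is better' convention (an assumption of this sketch)."""
    n_params = 3 * k - 1  # (k - 1) weights + k means + k variances
    return 2.0 * loglik - n_params * math.log(n)

rng = random.Random(0)
data = ([rng.gauss(0.0, 1.0) for _ in range(100)]
        + [rng.gauss(6.0, 1.0) for _ in range(100)])

fits = {k: fit_mixture(data, k) for k in (1, 2, 3)}
bic_by_k = {k: bic(fit[3], k, len(data)) for k, fit in fits.items()}
best_k = max(bic_by_k, key=bic_by_k.get)
print(bic_by_k, best_k)
```

Software that reports BIC with the opposite sign would select the minimum instead; the choice of k is the same.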