Statistics and Machine Learning, Lecture Notes for Machine learning

Università Politecnica delle Marche (UNIVPM), Machine learning

Prof. Gloria Polinesi

Lecture notes made up of slides and notes taken in class. Course "Statistics and Machine Learning" taught by Prof. Gloria Polinesi, DEB programme. Year 2024/2025

Type: Lecture notes

2024/2025

On sale since 30/07/2025

Gaiaaaa- 🇮🇹


Statistics and ML

Knowledge Discovery from Data (KDD)

Data mining (DM) is one step of a larger process of extracting knowledge from data, known as KDD: a sequence of integrated activities.

KDD process:

1. Data Collection: organize data in a model (a data matrix)
2. Data Selection: choosing the relevant data for analysis (formalizing the research goal)
3. Data Preprocessing: cleaning and transforming the data to remove errors
4. Data Mining: analysis of data in order to extract knowledge, patterns and trends (core of the analysis)
5. Interpretation: analysing the results to extract knowledge

Data Mining (DM)

Data mining is the use of efficient techniques to analyse data in order to extract knowledge that is coherent with the goal of the research. It is a sequence of activities that goes from the collection of data to the interpretation of results (through tables, graphs, etc.).

We get information from the real world, so we have to clean the data and carry out a preliminary (pre-processing) analysis. Then, once we have a definitive model, we can apply it, analyse it and get information.

Data mining is needed because we have complex and huge amounts of data. The data are complex because:

- There are several types of data, like tables, images and graphs
- There are spatial and temporal aspects of the data
- INTERCONNECTED DATA of different types = datasets that combine multiple forms of data and establish relationships between them. Examples: transaction data, document data, network data and behavioral data.

Connection of DM with other areas

DM draws ideas from machine learning/AI, pattern recognition, statistics and database systems. Data mining can be seen as the meeting point of different areas:

- DATA WAREHOUSE: system that stores and organizes data from multiple sources for analysis and reporting
- AI: involves systems that mimic human behaviour and intelligence (they can learn)
- STATISTICS: allows us to interpret and analyse data; we use statistics mainly to make inferences from our data
- MACHINE LEARNING: subset of AI; allows us to learn from the patterns and schemes in the data
- VISUALIZATION TECHNIQUES: allow us to understand the data in the first part (= preliminary analysis) and to understand the results in the last activity of data mining

We can summarize the connections of data mining as being with statistics, AI, machine learning and pattern recognition.


Machine Learning (ML)

Machine learning is a subset of AI that focuses on creating algorithms that learn from data and make predictions or discover hidden insights. Types of ML:

SUPERVISED LEARNING: used when we have labelled data (= we have an outcome variable y in the dataset). GOAL: make predictions; the model is assessed by comparing its predictions with the actual outcomes.
- Training process: the dataset is divided into a training set (= where the model is trained) and a test set (= where it is evaluated)
- Examples: linear regression model or the nearest-neighbour technique

UNSUPERVISED LEARNING: used when we don't have the outcome variable (only x variables). GOAL: group similar statistical units according to their values and discover hidden patterns, schemes and structure in the data.
- No need to split the dataset into training and test sets
- Example: cluster analysis

ML learns from experience. GOAL: minimize a loss function (like the mean error) to optimize the model.

Designing a ML system

1. Formalize the learning task: set the research goal and choose an appropriate performance measure for evaluating the learned system (like the number of misclassified characters)
2. Collect and organize data: for example, from a survey to a data matrix = 5 rows (n) x 2 columns (p)
3. Extract features: select the variables you need for the analysis
4. Choose the class of learning models. The choice depends on:
- The research goal and the learning task decided before
- Whether we have y (supervised) or just x (unsupervised): the choice has to be coherent with the data
It is possible to choose more than one model and compare them; in both cases we are analysing the data, but in a different way.
5. Train the model: implement the model on your data
- In supervised learning, the model learns the rule on the training set, and its performance in predicting y (which can be qualitative or quantitative) is evaluated on the test set
- In unsupervised learning we do not divide the data set
6. Evaluate the model performance:
- In supervised learning, if y is a numerical variable we use the mean square error (MSE): lower MSE = better performance. Otherwise, if y is a categorical variable we use the misclassification error rate (how many observations are in the wrong class).

Cheat: outcome variable (y)

Exploratory data analysis (EDA)

EDA is the initial step in data analysis, where we explore and summarize the dataset to identify potential issues and patterns before applying machine learning models. It uses descriptive statistical techniques to describe the structure and relationships present in the data.

Key aspects of EDA: identifying outliers, missing values and duplicate data, understanding the distribution of variables, detecting potential data inconsistencies or noise.

EDA can be:
- Univariate (1 variable): measures of location, variability, skewness and kurtosis
- Bivariate (2 variables): scatterplot and correlation coefficient (ρ)
- Multivariate (more than 2 variables): matrix notation

Skewness

Skewness is a statistical index that focuses on the symmetry or asymmetry of the distribution of the variable.

$\text{Skewness} = m_3 / sd^3$, where $m_3$ is the 3rd central moment and $sd$ is the standard deviation.

- if skewness = 0: the distribution is symmetric, as in the normal distribution (mean = median)
- if skewness > 0: the distribution is skewed to the right, with a long tail to the right (the mean is greater than the median)
o Most values are on the left and some high values pull the mean up
- if skewness < 0: the distribution is skewed to the left, with a long tail to the left (the mean is smaller than the median)
o Most values are on the right and some low values pull the mean down

Kurtosis

Kurtosis is a statistical index that allows us to measure the distance between the empirical distribution and the theoretical distribution corresponding to perfect normality.

$\text{Kurtosis} = m_4 / (m_2)^2$, where $m_2$ and $m_4$ are the 2nd and 4th central moments.

- if kurtosis = 3: as in the normal distribution
- if kurtosis > 3: the distribution is sharper and fatter-tailed (greater frequency for values very distant from the mean)
- if kurtosis < 3: the distribution is flatter and thinner-tailed, with lower frequency for values distant from the mean (with respect to a normal distribution)
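
As a rough illustration of the two indices, the sketch below computes skewness and kurtosis from central moments. It assumes the moments divide by n and that sd is the square root of the 2nd central moment; the function names and the two small datasets are made up for the example.

```python
def central_moment(values, order):
    """Central moment of the given order (sum of powered deviations divided by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n

def skewness(values):
    # m_3 / sd^3, with sd = sqrt(m_2)
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    # m_4 / (m_2)^2
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]        # hypothetical data, symmetric around 3
right_tailed = [1, 1, 2, 2, 10]    # hypothetical data with one high value

print(skewness(symmetric))      # zero for symmetric data
print(skewness(right_tailed))   # positive: long tail to the right
print(kurtosis(symmetric))      # below 3: flatter than a normal distribution
```

Statistical packages often apply small-sample corrections or report "excess" kurtosis (kurtosis minus 3), so their output may differ from these raw moment ratios.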

Population and Sampling

POPULATION: the entire group we want to draw conclusions about.

SAMPLE: a subset of the population. It is the group we collect data from, and it must be representative (so that we can generalize the empirical evidence of the sample to the target population). The sample size is less than the population size.

The process of selecting a sample is known as sampling.

Why Sample?

Limited resources (time, money) and workload
Results with known accuracy (mathematically calculated)

Sampling Frame

The sampling frame is the list from which the potential respondents are drawn: a list or database that includes all the elements from which a sample is selected. The sampling frame considers all the units and so must be representative of the population.

POPULATION OF INTEREST: for example, students at the university (as a sampling frame we can use the database of registered students).

Factors that influence the representativeness:
- Sampling procedure: the method used to select the sample (like random, stratified, systematic)
- Sample size: number of units included in the sample
- Participation (response rate): percentage of people who actually respond out of those selected

When might you sample the entire population? When the population is small, when you have extensive resources and when you expect a high response rate.

Probability Sampling

A probability sampling scheme is one in which every unit in the population has a chance (greater than zero) of being selected in the sample, and this probability can be accurately determined. Probability sampling types:

Simple random sampling
Systematic random sampling
Stratified random sampling
Cluster sampling

1. Simple Random Sampling

The sample is obtained by randomly selecting units from the target population. There are 2 types:

- WR (with replacement): each element of the population can be selected more than once in the same sample
- WOR (without replacement): each element of the population can be selected just once in the same sample

Pros:
- Applicable when the population is small, homogeneous and available
- All subsets of the frame have an equal probability of being selected
- It provides the greatest number of possible samples
- Estimates are easy to calculate
- Simple random sampling is always an EPS (equal probability of selection) design, but not all EPS designs are simple random sampling

Cons:
- Not practical for large sampling frames
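
The difference between WR and WOR can be sketched with Python's standard `random` module. This is only an illustration: the population of 20 numbered units, the sample size of 5 and the seed are assumptions chosen for the example.

```python
import random

def sample_with_replacement(population, n, rng):
    """WR: each draw is independent, so a unit can appear more than once."""
    return [rng.choice(population) for _ in range(n)]

def sample_without_replacement(population, n, rng):
    """WOR: a unit that has been drawn cannot be drawn again."""
    return rng.sample(population, n)

rng = random.Random(42)            # fixed seed so the draw is repeatable
population = list(range(1, 21))    # hypothetical sampling frame: units 1..20

wr = sample_with_replacement(population, 5, rng)
wor = sample_without_replacement(population, 5, rng)
print(wr, wor)
```

In the WOR sample all five units are necessarily distinct, while the WR sample may contain repeats.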

Pros:
- Reduces costs and effort
- Useful for geographically dispersed populations

Cons:
- Higher sampling error compared to simple random sampling

Cluster analysis

In cluster analysis, we study the relationships between the rows of a data matrix, that is, between observations. To measure how similar or different 2 observations $x_i$ and $x_j$ are, we use a proximity index, also called a distance measure.

Distance measures

A distance measure is a function that takes the corresponding row vectors from the data matrix and returns a number that indicates how close or far apart the two observations are (= it measures the relationship).

Consider a data matrix containing only quantitative (or binary) variables. If x and y are rows, then a function d(x, y) is called a distance between 2 observations if it satisfies these properties:

- Non-negativity: d(x, y) >= 0, for all x and y. The distance is always 0 or positive
- Identity: d(x, y) = 0 if and only if x = y, for all x and y. 2 identical observations have 0 distance
- Symmetry: d(x, y) = d(y, x), for all x and y. The distance from x to y is the same as from y to x
- Triangle inequality: d(x, y) <= d(x, z) + d(y, z), for all x, y and z. The direct distance from x to y is never greater than the distance going through a third point

Euclidean distance

The Euclidean distance is the most used distance measure for numerical values. All distances can be represented in a distance matrix, where each element d(i,j) represents the distance between the row vectors $x_i$ and $x_j$.

DISTANCE MATRIX:
− The matrix has dimension n x n
− It is symmetric: the lower triangle is equal to the upper triangle
− On the diagonal, the values are 0 because the distance of a point from itself is 0

How to calculate the Euclidean distance? For any two observations i and j in a p-dimensional Euclidean space, the Euclidean distance d(i,j) is defined as:

$d_2(x_i, x_j) = \sqrt{\sum_{s=1}^{p} (x_{is} - x_{js})^2}$

Meaning:
− p is the number of variables (dimensions)
− $x_{is}$ and $x_{js}$ are the values of the s-th variable for observations i and j, respectively
− You square the difference for each variable, sum them up, and then take the square root
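
The recipe just described (square the differences, sum them, take the square root) can be sketched as follows. The three 2-dimensional observations are made-up values, and the function names are illustrative.

```python
import math

def euclidean(x, y):
    """Square the difference for each variable, sum, then take the square root."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distance_matrix(rows):
    """n x n matrix whose element (i, j) is the distance between rows i and j."""
    return [[euclidean(r1, r2) for r2 in rows] for r1 in rows]

data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]   # hypothetical data matrix, n = 3, p = 2
D = distance_matrix(data)
for row in D:
    print(row)
```

The result has zeros on the diagonal and is symmetric, matching the properties of the distance matrix listed above.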

Similarity measures

When we consider binary values we use a similarity measure. An index $S_{ij}$ is a similarity measure if it satisfies the following properties:

- Non-negativity: $S_{ij} \ge 0$, for all $u_i, u_j \in U$. Similarity values are always 0 or positive
- Normalization: $S_{ii} = 1$, for all $u_i \in U$. An observation is perfectly similar to itself
- Symmetry: $S_{ij} = S_{ji}$, for all $u_i, u_j \in U$. The similarity from $u_i$ to $u_j$ is the same as from $u_j$ to $u_i$

1. Russell-Rao similarity index

It measures the ratio between the number of co-presences (both equal to 1) and the total number of binary variables P. $S_{ij} = CP / P$, where:

CP = number of co-presences (both variables = 1)
P = total number of binary variables

2. Jaccard similarity index

It measures the ratio between the number of co-presences and the number of cases excluding co-absences (when both variables = 0). $S_{ij} = CP / (CP + PA + AP)$, where:

CP = co-presences
PA = presence in ui and absence in uj
AP = absence in ui and presence in uj

3. Sokal-Michener similarity index

It measures the ratio between the number of co-presences or co-absences and the total number of binary variables P. $S_{ij} = (CP + CA) / P$, where:

CP = co-presences
CA = co-absences (both variables = 0)
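
The three indices can be compared on one pair of binary observations. The sketch below is a minimal illustration; the two vectors `ui` and `uj` are invented, and the helper `counts` is a hypothetical name.

```python
def counts(u, v):
    """Return (CP, CA, PA, AP) for two binary vectors of equal length."""
    cp = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)  # co-presences
    ca = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)  # co-absences
    pa = sum(1 for a, b in zip(u, v) if a == 1 and b == 0)  # presence in u only
    ap = sum(1 for a, b in zip(u, v) if a == 0 and b == 1)  # presence in v only
    return cp, ca, pa, ap

def russell_rao(u, v):
    cp, _, _, _ = counts(u, v)
    return cp / len(u)

def jaccard(u, v):
    cp, _, pa, ap = counts(u, v)
    return cp / (cp + pa + ap)   # undefined if the two vectors are all zeros

def sokal_michener(u, v):
    cp, ca, _, _ = counts(u, v)
    return (cp + ca) / len(u)

ui = [1, 1, 0, 0, 1]   # hypothetical observation, P = 5 binary variables
uj = [1, 0, 0, 1, 1]
print(russell_rao(ui, uj), jaccard(ui, uj), sokal_michener(ui, uj))
```

Here CP = 2, CA = 1, PA = 1 and AP = 1, so the indices are 2/5, 2/4 and 3/5 respectively: the more credit an index gives to co-absences, the higher the similarity for the same pair.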

n clusters (n = 5 in this case), and each cluster contains only one statistical unit, while the root has 1 cluster containing all the units. In divisive clustering we move from the root to the branches.

Phases of agglomerative Hierarchical Clustering:

1. Initialization: start with n clusters (each observation represents a cluster)
2. Selection: find and select the 2 closest units (= clusters), based on a distance measure, like the Euclidean distance
3. Updating: merge the 2 selected clusters into 1 and update the number of clusters to n-1. Update the distance matrix; depending on the updating technique we have different algorithms.
4. Repetition: repeat selection and merging n-1 times (repeat steps 2 and 3 n-1 times)
5. Termination: end when all the statistical units are merged (included) into a single cluster

In UPDATING we have 3 different techniques:

1. Single Linkage: here we use the minimum distance between clusters when we update the distance matrix; more sensitive to outliers
2. Complete Linkage: here we use the maximum distance; less sensitive to outliers
3. Average Linkage: here we use the average distance between 2 clusters

These methods require only the distance matrix.

Exercise: agglomerative Hierarchical Clustering

Use single and complete agglomerative clustering to group the statistical units described by the following distance matrix. Show the dendrograms.

SINGLE LINKAGE

n = 4 (number of statistical units)

Step 1: initialization phase. We have 4 clusters: 1 for unit A, 1 for unit B, 1 for unit C and 1 for unit D (each is a SINGLETON).

Step 2: selection phase. We have to merge the 2 closest units in terms of distance. Unit A and unit B are the 2 closest units (distance = 1), so we merge A and B into a new cluster: d(A,B) = 1 and G1 = {A,B}.

Step 3: updating of the distance matrix. We use the single linkage algorithm, so we have to update the distance between the group G1 = {A,B} and units C and D:
o d({A,B},C) = min{d(A,C), d(B,C)} = min{4, 2} = 2
o d({A,B},D) = min{d(A,D), d(B,D)} = min{5, 6} = 5

Step 4: now we have to repeat steps 2 and 3 (n-1 merges in total).
- Step 2: join unit C with the group G1 = {A,B}, creating another group: G2 = {G1,C} = {A,B,C}
- Step 3: updating the distance matrix
o Calculate the distance between group G2 = {A,B,C} and D: d({A,B,C},D) = min{d({A,B},D), d(C,D)} = min{5, 3} = 3

FINAL STEP: join the group G2 = {A,B,C} with unit D, creating the final group G3 = {G2,D}. We go from singletons to 1 group.

Let's consider the dendrogram. Steps 2 and 3 are repeated n-1 = 4-1 = 3 times:
- G1: join A and B
- G2: join G1 and C
- G3: join G2 and D

Final output: the dendrogram, where:
- y = distance of aggregation
- x = statistical units
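
The merging procedure can be sketched in a few lines of Python. This is a naive illustration, not an efficient implementation: instead of updating a matrix, it recomputes each cluster-to-cluster distance from the original unit distances, which gives the same result for the minimum and maximum rules. The pairwise distances are read off the worked steps (d(A,B) = 1, d(A,C) = 4, d(B,C) = 2, d(A,D) = 5, d(B,D) = 6, d(C,D) = 3); the function names are illustrative.

```python
from itertools import combinations

def cluster_distance(c1, c2, dist, linkage):
    """Apply the linkage rule (min or max) to all unit-to-unit distances."""
    return linkage(dist[frozenset((a, b))] for a in c1 for b in c2)

def agglomerate(dist, labels, linkage=min):
    """Return the list of merges as (merged cluster, aggregation distance)."""
    clusters = [frozenset([label]) for label in labels]   # start from singletons
    merges = []
    while len(clusters) > 1:
        # Selection: the two closest clusters under the chosen linkage
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], dist, linkage),
        )
        height = cluster_distance(c1, c2, dist, linkage)
        # Updating: replace the two clusters with their union
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append(("".join(sorted(c1 | c2)), height))
    return merges

labels = ["A", "B", "C", "D"]
dist = {  # distances taken from the steps of the exercise
    frozenset("AB"): 1, frozenset("AC"): 4, frozenset("AD"): 5,
    frozenset("BC"): 2, frozenset("BD"): 6, frozenset("CD"): 3,
}
print(agglomerate(dist, labels, linkage=min))   # single linkage
print(agglomerate(dist, labels, linkage=max))   # complete linkage
```

With `linkage=min` the merges are AB at distance 1, ABC at 2 and ABCD at 3, as in the steps above. Passing `linkage=max` runs the complete linkage rule on the same distances.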

COMPLETE LINKAGE

Step 1: initialization. 4 clusters = A, B, C, D

Step 2: selection. G1 = {A,B}

Step 3: updating the matrix. Distance between G1 = {A,B} and C and D:

d({A,B},C) = max{d(A,C), d(B,C)} = max{4, 2} = 4

So, there is a trade-off between internal homogeneity and information loss.

How to evaluate the hierarchical clustering? There are 2 main indices (see notebook):

R-squared: the closer it is to 1, the better the partition (where 0 = worst case and 1 = best case)
- R-squared = B/T
- T = total deviance (or total variation)
- We can decompose T = W + B, where W = within-group deviance and B = between-group deviance

Pseudo-F: a high value of the Pseudo-F statistic indicates distinct clusters; if it drops sharply, it means that very different groups have been merged, that is, W (= within-group deviance) is increasing. In general, we want B to be as large as possible and W as small as possible, as in the R^2.
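
A small numerical sketch of the decomposition T = W + B may help. The notes give R-squared = B/T but not the Pseudo-F formula, so the code assumes the commonly used form (B/(k-1)) / (W/(n-k)), with k clusters and n units; the two clusters of 2-dimensional points are invented.

```python
def centroid(points):
    """Mean of each variable over the given points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def deviance(points, centre):
    """Sum of squared deviations of the points from a centre."""
    return sum((x - c) ** 2 for p in points for x, c in zip(p, centre))

def partition_quality(clusters):
    all_points = [p for cluster in clusters for p in cluster]
    n, k = len(all_points), len(clusters)
    total = deviance(all_points, centroid(all_points))        # T
    within = sum(deviance(c, centroid(c)) for c in clusters)  # W
    between = total - within                                  # B = T - W
    r_squared = between / total
    # Assumed form of the Pseudo-F statistic (not stated in the notes)
    pseudo_f = (between / (k - 1)) / (within / (n - k))
    return r_squared, pseudo_f

# Hypothetical partition: two compact, well-separated groups
clusters = [[[1.0, 1.0], [2.0, 1.0]], [[8.0, 9.0], [9.0, 9.0]]]
r2, pf = partition_quality(clusters)
print(r2, pf)
```

For these points T = 114, W = 1 and B = 113, so R-squared is close to 1 and the Pseudo-F is large, which is what we expect from distinct clusters.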

Hierarchical clustering: problems and limitations

- Computational complexity in time and space, especially when we have many statistical units and many statistical variables
- The decision to merge 2 groups is irreversible: we have n-1 iterations, in each iteration we merge the 2 closest clusters, and we update the distance matrix n-1 times
- No objective function is directly minimized (unlike in K-means)
- Different methods have issues with one or more of the following:
o Sensitivity to noise and outliers: the choice of the algorithm depends on the structure of the data and the aim of the analysis (for example, if we want a model that is less sensitive to outliers and gives more compact clusters, it is better to choose complete linkage, which is less sensitive than single linkage)
o Difficulty in handling clusters of different sizes (especially when we have outliers)
o Confusion with large clusters: sometimes we have clusters of different sizes (= a problem

Supervised learning (label data)

In SUPERVISED LEARNING the main GOAL is to make predictions about the class or value of y.

- If the y variable is categorical, we are talking about classification tasks
- If the y variable is continuous, we are talking about regression tasks

We divide our dataset into 2 parts:

- Training set: used to train (= implement) the model. It contains 80% of the observations = the bigger part
- Test set: used to evaluate the model in terms of accuracy (= compare the predicted value and the real value). It contains 20% of the observations. Of course, the two parts together make up 100% of the original dataset.

p = number of predictors

To evaluate our model, we use the following measures of performance:

Regression tasks: we use the Mean Square Error (MSE), the mean of the squared differences between actual and predicted values: $MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$
o In the regression task y is a numerical variable
o MSE is a measure of the accuracy of our model: the lower the MSE, the more accurate the model

Classification tasks: we use the Misclassification Error Rate = number of incorrect predictions / total number of instances (that is, the rows)
o In the classification task y is a categorical variable (y could be a multiclass or binary variable)
o The Misclassification Error Rate measures the accuracy of the model

The difference between regression and classification tasks is the y variable.
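
Both performance measures are short enough to write out directly. In the sketch below the actual and predicted values are invented for illustration, and the function names are not taken from any library.

```python
def mean_squared_error(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def misclassification_rate(actual, predicted):
    """Number of incorrect predictions divided by the number of instances."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)

# Regression: hypothetical numerical y, squared errors are 1, 0 and 4
mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])

# Classification: hypothetical binary y, one wrong prediction out of four
err = misclassification_rate([0, 1, 1, 0], [0, 1, 0, 0])

print(mse, err)
```

Both measures are computed on the test set, so that they reflect performance on observations the model has not seen during training.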

Training data: key points

Used to teach the model to recognize patterns and relationship in the data
The larger the training set, the better the model learns
Used to compare different models

Test data: key points

Test set used to evaluate how well the model performs on new data (unseen examples)
Helps to avoid bias and the overfitting phenomenon, which occurs when a model works well on training data but not on new data
The test set should be a good representation of the original data

Example of classification task: binary

- Each row is an observation: 10 observations in our data set
- Columns: X1, X2, ..., Xp are the features or characteristics (= independent variables)
- Classification task: y = categorical variable

Unit   Y (0 or 1)
1      0
2      1
3      0
4      0
5      1
6      0
7      1
8      0
9      0
10     1

(The table also has columns X1, X2, ..., Xp; their values are not shown.)

We randomly split the data set into:
- Training set (80%): 8 observations (units: 1, 2, 3, 5, 7, 8, 9, 10)
- Test set (20%): 2 observations (units: 4, 6), used to verify the performance. We want to predict the class ŷ for units 4 and 6

If the accuracy on the training set is much higher than the accuracy on the test set, we have overfitting.
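
A random 80/20 split of 10 units can be sketched with the standard `random` module. The seed and the helper name `train_test_split` are assumptions for this example; a different seed draws different units, so the test units will not necessarily be 4 and 6.

```python
import random

def train_test_split(units, test_fraction, rng):
    """Shuffle the units and hold out a fraction of them as the test set."""
    shuffled = units[:]
    rng.shuffle(shuffled)
    n_test = round(len(units) * test_fraction)
    return sorted(shuffled[n_test:]), sorted(shuffled[:n_test])

units = list(range(1, 11))                       # units 1..10
train, test = train_test_split(units, 0.2, random.Random(0))
print("training:", train)
print("test:", test)
```

Every unit ends up in exactly one of the two sets: 8 in the training set and 2 in the test set.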

o Then, we look at the class (y) of these K closest points and apply the majority vote rule: we use the prevalent class

In the case of REGRESSION, the predicted value (ŷ) is equal to the mean of the y values of the K nearest neighbours
o To predict the value of y we use the mean
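
The two prediction rules (majority vote for classification, mean for regression) can be sketched as follows. The sketch assumes Euclidean distance, and the training points, labels and values are made up; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def nearest(train_x, query, k):
    """Indices of the k training points closest to the query (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return order[:k]

def knn_classify(train_x, train_y, query, k):
    """Majority vote among the classes of the k nearest neighbours."""
    votes = Counter(train_y[i] for i in nearest(train_x, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_x, train_y, query, k):
    """Mean of the y values of the k nearest neighbours."""
    idx = nearest(train_x, query, k)
    return sum(train_y[i] for i in idx) / k

# Hypothetical training data: three points near (1, 1), two near (8, 8)
train_x = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.0]]
labels = [0, 0, 0, 1, 1]                    # categorical y
values = [10.0, 12.0, 14.0, 50.0, 54.0]     # numerical y

query = [1.5, 1.5]
print(knn_classify(train_x, labels, query, k=3))
print(knn_regress(train_x, values, query, k=3))
```

For this query the three nearest neighbours are the first three training points, so the predicted class is 0 and the predicted value is the mean of 10, 12 and 14.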

K-Nearest Neighbour Regression Algorithm

Choose the optimal value of K :

- If K is too small, it is sensitive to noise points (like in a)
- If K is too large, it may include points from other classes

Typically, K is chosen between 1 and 20, preferring odd numbers to avoid 50/50 ties in the case of the majority vote rule.

Cross-Validation criterion

To choose the optimal number K, we can use the CROSS-VALIDATION CRITERION. It is a resampling method: a technique for estimating model performance by repeatedly drawing samples from the data. The training data is randomly divided into k equally sized folds (= blocks); this k is different from the K of the nearest neighbours.

- One fold is used as the validation set
- The remaining k-1 folds are used for training

The process is repeated k times and the results are averaged. The procedure stops when all k folds have been used as a validation set.

Example of K-fold cross-validation

Training data: k = 5 folds

Step 1: we divide the training data into 5 folds, of which:
- 1 fold is the validation set
- The remaining 4 folds are the training set
Step 2: consider the 2nd fold for validation (= just 1) and the other 4 folds for training
Step 3: consider the 3rd fold for validation and the other 4 folds for training
Step 4: consider the 4th fold for validation and the other 4 folds for training
Step 5: consider the 5th fold for validation and the other 4 folds for training

The procedure stops when all folds have been used for validation.

For each candidate K of the nearest neighbours (for example K=1, K=2, K=3), we want to choose the optimal value, and in order to do so we use the cross-validation criterion. For each K (1, 2, 3) we split the training data into 5 folds:
- Training: use 4 of the folds as training data
- Validation: use the remaining 1 fold as validation data

Then, we calculate the average performance for each K and we compare the averages: we choose the value of K with the best performance.
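
The whole selection procedure can be sketched end to end. This is a minimal illustration under several assumptions: a toy two-class dataset generated in the code, a K-NN classifier with Euclidean distance and majority vote, the misclassification rate as the performance measure, and the candidates K = 1, 3, 5. All names are illustrative.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """train is a list of (features, label) pairs; majority vote among k nearest."""
    neighbours = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def make_folds(rows, n_folds, rng):
    """Shuffle the rows and deal them into n_folds blocks of (almost) equal size."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]

def cv_error(rows, k, n_folds, rng):
    """Average validation misclassification rate over the folds."""
    folds = make_folds(rows, n_folds, rng)
    errors = []
    for i, validation in enumerate(folds):
        # The other n_folds - 1 folds form the training data
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        wrong = sum(1 for x, y in validation if knn_predict(training, x, k) != y)
        errors.append(wrong / len(validation))
    return sum(errors) / n_folds

# Toy training data (made up): class 0 around x = 0..9, class 1 around x = 20..29
rows = [([float(i), float(i % 3)], 0) for i in range(10)]
rows += [([float(i) + 20.0, float(i % 3)], 1) for i in range(10)]

# Same seed for every candidate, so all K values are compared on the same folds
results = {k: cv_error(rows, k, 5, random.Random(1)) for k in (1, 3, 5)}
best_k = min(results, key=results.get)
print(results, best_k)
```

Because the two toy classes are far apart, every candidate K classifies all validation points correctly here; on real data the averages would differ and the K with the lowest average error would be chosen.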