Statistics and ML
Knowledge Discovery from Data (KDD)
DM is a step of a larger process of extracting knowledge from data known as KDD, a sequence of integrated activities. KDD process:
- Data Collection : organize data in a model data matrix
- Data Selection : choosing the relevant data for analysis (formalizing the research goal)
- Data Preprocessing : cleaning and transforming the data to remove errors
- Data Mining : analysis of data in order to extract knowledge, patterns and trends (core of the analysis)
- Interpretation : analysing the results to extract knowledge
Data Mining (DM)
= the use of efficient techniques for the analysis of data in order to extract knowledge that is coherent with the goal of the research. It is a sequence of activities that goes from the collection of data to the interpretation of results (through tables, graphs...). We get information from the real world, so we have to clean the data and carry out a pre-processing analysis. Then, when we have a definitive model, we can apply it, analyse it and get information.
Data mining is needed because we have complex and huge amounts of data. Complex because:
- Several types of data like tables, images and graphs
- There are spatial and temporal aspects of the data
- INTERCONNECTED DATA of different types = datasets that combine multiple forms of data and establish relationships between them. Examples: transaction data, document data, network data and behavioral data.
Connection of DM with other areas
DM draws ideas from machine learning/AI, pattern recognition, statistics and database systems. Data mining can be seen as a connection of different areas:
- DATA WAREHOUSE : system that stores and organizes data from multiple sources for analysis and reporting
- AI: involves systems that mimic the human behaviour and intelligence (can learn)
- STATISTICS: allows us to interpret and analyse data; we use statistics mainly to make inferences from our data.
- MACHINE LEARNING: subset of AI; allows us to learn from patterns and schemes in the data
- VISUALIZATION TECHNIQUES: allow us to understand the data in the first part (= preliminary analysis) and to understand the results in the last activity of data mining
We can summarize the connection of data mining with: statistics, AI, machine learning and pattern recognition.
Machine Learning (ML)
= is a subset of the AI focusing on creating algorithms that learn from data and make predictions or discover hidden insights. Types of ML:
- SUPERVISED LEARNING: used when we have labelled data (= the dataset contains an outcome variable y). GOAL: make predictions about y; the model is assessed by comparing its predictions with the actual outcomes.
o Training process: the dataset is divided into a training set (= where the model is trained) and a test set (= where it is evaluated)
o Examples: linear regression or the nearest neighbour technique
- UNSUPERVISED LEARNING: used when we don't have the outcome variable (only x variables). GOAL: group similar statistical units according to their values and discover hidden patterns, schemes and structure in the data
o No need to split the dataset into training and test sets
o Example: cluster analysis
ML learns from experience. GOAL: minimize a loss function (for example, the mean squared error) to optimize the model.
Designing a ML system
- Formalize the learning task: set the research goal and choose an appropriate performance measure for evaluating the learned system (like the number of misclassified characters)
- Collect and organize data: for example, from a survey to a data matrix = 5 rows (n) x 2 columns (p)
- Extract features: select the variables you need for the analysis
- Choose the class of learning models: the choice depends on:
o The research goal and the learning task decided before
o Whether we have y (supervised) or just x (unsupervised): the choice has to be coherent with the data
It is possible to choose more than one model and compare them; each model analyses the same data in a different way.
- Train the model: fit the model on your data
o In supervised learning, the model learns the rule on the training set and its predictions of y (which can be qualitative or quantitative) are evaluated on the test set
o In unsupervised learning we do not divide the dataset
- Evaluate the model performance:
o In supervised learning, if y is numerical we use the mean squared error (MSE): lower MSE = better performance. Otherwise, if y is a categorical variable we use the misclassification error rate (the share of observations assigned to the wrong class).
- Cheat: outcome variable (y)
Exploratory data analysis (EDA)
= the initial step in data analysis, where we explore and summarize the dataset to identify potential issues and patterns before applying machine learning models. It uses descriptive statistical techniques to describe the structure and relationships present in the data. Key aspects of EDA: identifying outliers, missing values and duplicate data, understanding the distribution of variables, detecting potential data inconsistencies or noise. EDA can be:
- Univariate (1 variable): measures of location, variability, skewness and kurtosis
- Bivariate (2 variables): scatterplot and correlation coefficient (ρ)
- Multivariate (more than 2 variables): matrix notation
Skewness
= a statistical index that focuses on the symmetry or asymmetry of the distribution of a variable.
Skewness = $m_3 / sd^3$, where $m_3$ is the 3rd central moment and $sd$ is the standard deviation
- if skewness = 0 the distribution is symmetric, as in a normal distribution (mean = median)
- if skewness > 0 the distribution is skewed to the right, with a long tail to the right (the mean is greater than the median)
o Most values are on the left and some high values pull the mean up
- if skewness < 0 the distribution is skewed to the left, with a long tail to the left (the mean is smaller than the median)
o Most values are on the right and some low values pull the mean down
Kurtosis
= a statistical index that measures the distance between the empirical distribution and the theoretical distribution corresponding to perfect normality.
Kurtosis = $m_4 / m_2^2$, where $m_2$ and $m_4$ are the 2nd and 4th central moments
- if kurtosis = 3: same as a normal distribution
- if kurtosis > 3: the distribution is sharper and fatter-tailed (greater frequency for values very distant from the mean)
- if kurtosis < 3: the distribution is flatter and thinner-tailed, with values spread more evenly around the mean and lower frequency for values distant from the mean (with respect to a normal distribution)
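The two indices above can be computed directly from the central moments. The following is a minimal sketch, assuming the population form of the moments (division by n); the function names and the sample values are illustrative and not part of the notes.

```python
def central_moment(values, order):
    """Central moment of the given order (population form: divide by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n


def skewness(values):
    """Skewness = m_3 / sd^3, where sd = sqrt(m_2)."""
    m2 = central_moment(values, 2)
    m3 = central_moment(values, 3)
    return m3 / m2 ** 1.5


def kurtosis(values):
    """Kurtosis = m_4 / m_2^2 (reference value 3 for a normal distribution)."""
    m2 = central_moment(values, 2)
    m4 = central_moment(values, 4)
    return m4 / m2 ** 2


# Hypothetical samples, chosen only to illustrate the sign of the index
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 10]

print(skewness(symmetric))     # zero: the sample is symmetric around its mean
print(skewness(right_tailed))  # positive: a few high values form a right tail
print(kurtosis(symmetric))     # below 3: flatter than a normal distribution
```

Statistical software often reports "excess kurtosis" (kurtosis minus 3), so a value of 0 there corresponds to the reference value 3 used here.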
Population and Sampling
POPULATION: the entire group we want to draw conclusions about
SAMPLE: a subset of the population. It is the group we collect data from and it must be representative (so that we can generalize the empirical evidence of the sample to the target population). The sample size is less than the population size. The process of selecting the sample is known as sampling
Why Sample?
- Limited resources (time, money) and workload
- Results with known accuracy (mathematically calculated)
Sampling Frame
= a list or database that includes all the elements from which a sample is selected (the potential respondents are drawn from it). The sampling frame should cover all the units and so must be representative of the population.
POPULATION OF INTEREST: for example, students at the university (as a sampling frame we can use the registration database)
Factors that influence the representativeness:
- Sampling procedure : the method used to select the sample (like random, stratified, systematic)
- Sample size: number of units included in the sample
- Participation (response rate): percentage of people who actually respond out of those selected
When might you sample the entire population? When the population is small, when you have extensive resources and when you expect a high response rate.
Probability Sampling
= a probability sampling scheme is one in which every unit in the population has a chance (greater than zero) of being selected in the sample and this probability can be accurately determined. Probability Sampling types:
- Simple random sampling
- Systematic random sampling
- Stratified random sampling
- Cluster sampling
1. Simple Random Sampling
The sample is obtained by randomly selecting units from the target population. 2 types:
- WR (with replacement): each element of the population can be selected more than once in the same sample
- WOR (without replacement): each element of the population can be selected just once in the same sample
Pros:
- Applicable when the population is small, homogeneous and available
- All subsets of the frame of the same size have an equal probability of being selected
- It provides the greatest number of possible samples
- Estimates are easy to calculate
- Simple random sampling is always an EPS (equal probability of selection) design, but not all EPS designs are simple random sampling
Cons:
- Not practical for large sampling frames
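The difference between the WR and WOR schemes can be sketched with Python's standard `random` module. The sampling frame of 20 unit labels, the sample size and the seed are hypothetical and only make the example reproducible.

```python
import random

rng = random.Random(42)  # fixed seed, only to make the sketch reproducible

population = list(range(1, 21))  # hypothetical sampling frame: units 1..20
sample_size = 5

# WOR: each unit can appear at most once in the sample
sample_wor = rng.sample(population, k=sample_size)

# WR: the same unit may be drawn more than once
sample_wr = rng.choices(population, k=sample_size)

print("WOR:", sample_wor)
print("WR: ", sample_wr)
```

In both calls every unit of the frame has the same chance of being drawn at each step, which is what makes the design an equal-probability one.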
Pros:
- Reduces costs and effort
- Useful for geographically dispersed populations
Cons:
- Higher sampling error compared to simple random sampling
Cluster analysis
In cluster analysis, we study the relationships between the rows of a data matrix, that is, between observations. To measure how similar or different 2 observations xi and xj are, we use a proximity index, also called a distance measure.
Distance measures
= a function that takes the corresponding row vectors from the data matrix and returns a number that indicates how close or far apart the two observations are (= measures the relationship). Consider a data matrix containing only quantitative (or binary) variables. If x and y are rows, then a function d(x, y) is called a distance between 2 observations if it satisfies these properties:
- Non-negativity: d(x, y) >= 0, for all x and y (distance is always 0 or positive)
- Identity: d(x, y) = 0 <-> x = y, for all x and y (2 identical observations have 0 distance, and only identical observations do)
- Symmetry: d(x, y) = d(y, x), for all x and y (the distance from x to y is the same as from y to x)
- Triangle inequality: d(x, y) <= d(x, z) + d(y, z), for all x, y and z (the direct distance from x to y is never greater than going through a third point)
Euclidean distance
= the most used distance measure for numerical values. All distances can be represented in a distance matrix, where each element d(i,j) represents the distance between the row vectors xi and xj.
DISTANCE MATRIX:
− The matrix has dimension n x n
− It is symmetric: the lower triangle is equal to the upper triangle
− On the diagonal, the values are 0 because the distance of a point from itself is 0
How to calculate the Euclidean distance? For any two observations i and j in a p-dimensional Euclidean space, the Euclidean distance d(i,j) is defined as:
$d_2(x_i, x_j) = \sqrt{\sum_{s=1}^{p} (x_{is} - x_{js})^2}$
Meaning:
− p is the number of variables (dimensions)
− $x_{is}$ and $x_{js}$ are the values of the s-th variable for observations i and j, respectively
− You square the difference for each variable, sum them up, and then take the square root
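As an illustration of the definition, the sketch below builds the n x n distance matrix for a small data matrix. The three observations are hypothetical and the function names are not from the notes.

```python
import math


def euclidean(x, y):
    """Square the difference for each variable, sum, then take the square root."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(rows):
    """n x n matrix whose element (i, j) is the distance between rows i and j."""
    n = len(rows)
    return [[euclidean(rows[i], rows[j]) for j in range(n)] for i in range(n)]


# Hypothetical data matrix: n = 3 observations, p = 2 variables
data = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = distance_matrix(data)

for row in D:
    print(row)
```

The result has zeros on the diagonal and is symmetric, so in practice only the lower (or upper) triangle needs to be stored.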
Similarity measures
= when we consider binary variables we use a similarity measure. An index Sij is a similarity measure if it satisfies the following properties:
- Non-negativity: Sij ≥ 0, for all ui, uj ∈ U (similarity values are always 0 or positive)
- Normalization: Sii = 1, for all ui ∈ U (an observation is perfectly similar to itself)
- Symmetry: Sij = Sji, for all ui, uj ∈ U (similarity from ui to uj is the same as from uj to ui)
1. Russell-Rao similarity index
= measures the ratio between the number of co-presences (both equal to 1) and the total number of binary variables P. $S_{ij} = \frac{CP}{P}$ where:
- CP = number of co-presences (both variables = 1)
- P = total number of binary variables
2. Jaccard similarity index
= measures the ratio between the number of co-presences and the number of cases excluding co-absences (when both variables = 0). $S_{ij} = \frac{CP}{CP + PA + AP}$ where:
- CP = co-presences
- PA = presence in ui and absence in uj
- AP = absence in ui and presence in uj
3. Sokal-Michener similarity index
= measures the ratio between the number of co-presences or co-absences and the total number of binary variables P. $S_{ij} = \frac{CP + CA}{P}$ where:
- CP = co-presences
- CA = co-absences (both variables = 0)
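The three indices differ only in how the counts CP, CA, PA and AP are combined. A minimal sketch follows, using two hypothetical binary observations; the choice of returning 0 for the Jaccard index when the two units share only co-absences is an assumption of this sketch, not something stated in the notes.

```python
def counts(u, v):
    """Return (CP, CA, PA, AP) for two binary vectors of equal length."""
    cp = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)  # co-presences
    ca = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)  # co-absences
    pa = sum(1 for a, b in zip(u, v) if a == 1 and b == 0)  # presence in u only
    ap = sum(1 for a, b in zip(u, v) if a == 0 and b == 1)  # presence in v only
    return cp, ca, pa, ap


def russell_rao(u, v):
    cp, _, _, _ = counts(u, v)
    return cp / len(u)


def jaccard(u, v):
    cp, _, pa, ap = counts(u, v)
    denominator = cp + pa + ap
    # Assumption of this sketch: similarity 0 when only co-absences are observed
    return cp / denominator if denominator > 0 else 0.0


def sokal_michener(u, v):
    cp, ca, _, _ = counts(u, v)
    return (cp + ca) / len(u)


# Hypothetical observations on P = 5 binary variables
ui = [1, 1, 0, 0, 1]
uj = [1, 0, 0, 1, 1]

print(counts(ui, uj))          # CP = 2, CA = 1, PA = 1, AP = 1
print(russell_rao(ui, uj))     # 2 / 5
print(jaccard(ui, uj))         # 2 / 4
print(sokal_michener(ui, uj))  # 3 / 5
```

Jaccard ignores co-absences, which is useful when a shared 0 carries little information about similarity; Sokal-Michener treats shared 0s and shared 1s alike.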
n clusters (n = 5 in this case), and each cluster contains only one statistical unit, while the root has 1 cluster containing all the units. In divisive clustering we move from the root to the branches.
Phases of agglomerative Hierarchical Clustering:
- Initialization : Start with n clusters (each observation represents a cluster)
- Selection: find and select the 2 closest units (= clusters), based on a distance measure, like the Euclidean distance
- Updating: merge the 2 selected clusters into 1; the number of clusters becomes n-1. Update the distance matrix: according to the updating technique we have different algorithms.
- Repetition: repeat selection and updating (steps 2 and 3) so that n-1 merges are made in total
- Termination: end when all the statistical units are merged (included) into a single cluster
In UPDATING we have 3 different techniques:
1. Single Linkage: here we use the minimum distance between clusters when we update the distance matrix; more sensitive to outliers
2. Complete Linkage: here we use the maximum distance; less sensitive to outliers
3. Average Linkage: here we use the average distance between 2 clusters
These methods require only the distance matrix.
Exercise: agglomerative Hierarchical Clustering
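Use single and complete linkage agglomerative clustering to group the statistical units described by the following distance matrix (values as used in the solution): d(A,B) = 1, d(A,C) = 4, d(A,D) = 5, d(B,C) = 2, d(B,D) = 6, d(C,D) = 3. Show the dendrograms.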
SINGLE LINKAGE
n = 4: number of statistical units
Step 1: initialization phase. We have 4 clusters: 1 for unit A, 1 for unit B, 1 for unit C and 1 for unit D (each is a SINGLETON).
Step 2: selection phase. We have to merge the 2 closest units in terms of distance. Unit A and unit B are the 2 closest units (distance = 1), so we merge A and B into a new cluster: d(A,B) = 1 and G1 = {A,B}
Step 3: updating of the distance matrix. We use the single linkage algorithm:
- we have to update the distance between the G1 group (A,B) and units C and D
o d{(A,B),C} = min{d(A,C); d(B,C)} = min{4; 2} = 2
o d{(A,B),D} = min{d(A,D); d(B,D)} = min{5; 6} = 5
Step 4: now we have to repeat steps 2 and 3 until a single cluster remains.
- Step 2: join unit C with the G1 group (A,B), the closest pair (distance = 2), creating another group G2 = {G1,C} = {A,B,C}
- Step 3: updating the distance matrix
o Calculate the distance between group G2 (A,B,C) and D: d{(A,B,C),D} = min{d((A,B),D); d(C,D)} = min{5; 3} = 3
FINAL STEP: join the G2 (A,B,C) group with unit D (distance = 3), creating the final group G3 = {G2,D}: from singletons to 1 group.
Let's consider the dendrogram. Steps 2 and 3 are repeated n-1 times: 4-1 = 3 times
- G1: join A and B
- G2: join G1 and C
- G3: join G2 and D
Final output: dendrogram where:
- y = distance of aggregation
- x = statistical units
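The single linkage solution can be checked with a short, naive implementation of the agglomerative phases (initialization, selection, updating, repetition). This is only a sketch: the function name and the dictionary representation of the distance matrix are choices made here, and the distances are the ones used in the worked solution.

```python
def agglomerative(labels, dist, linkage=min):
    """Naive agglomerative clustering.

    labels:  list of unit names
    dist:    dict mapping frozenset({a, b}) to the distance between units a and b
    linkage: min for single linkage, max for complete linkage
    Returns the merges as (cluster_1, cluster_2, aggregation_distance).
    """
    clusters = [frozenset([u]) for u in labels]  # initialization: n singletons
    merges = []

    def cluster_distance(c1, c2):
        # linkage applied to all unit-to-unit distances between the two clusters
        return linkage(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # selection: find the closest pair of clusters
        pairs = [(cluster_distance(c1, c2), i, j)
                 for i, c1 in enumerate(clusters)
                 for j, c2 in enumerate(clusters) if i < j]
        d, i, j = min(pairs)
        # updating: merge the pair and record the distance of aggregation
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


labels = ["A", "B", "C", "D"]
dist = {
    frozenset("AB"): 1, frozenset("AC"): 4, frozenset("AD"): 5,
    frozenset("BC"): 2, frozenset("BD"): 6, frozenset("CD"): 3,
}

single = agglomerative(labels, dist, linkage=min)
for left, right, height in single:
    print(left, "+", right, "at distance", height)  # heights 1, 2, 3 as in the solution
```

Passing `linkage=max` runs complete linkage on the same matrix. The merge heights returned are the values on the y axis of the dendrogram.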
COMPLETE LINKAGE
Step 1: initialization. 4 clusters = A, B, C, D
Step 2: selection. G1 = (A,B)
Step 3: updating the matrix. Distance between G1 (A,B) and C and D
- d{(A,B),C} = max{d(A,C); d(B,C)} = max{4; 2} = 4
So, there is a trade-off between internal homogeneity and information loss.
How to evaluate the hierarchical clustering? There are 2 main indices (see notebook):
- R-squared: the closer it is to 1, the better the partition (where 0 = worst case and 1 = best case)
o R-squared = B/T
o T = total deviance (or total variation)
o We can decompose T = W + B, where W = within-group deviance and B = between-group deviance
- Pseudo-F: a high value of the Pseudo-F statistic indicates distinct clusters; if it drops sharply, it means that very different groups have been merged, that is, W (= within-group deviance) is increasing. In general, we want B as large as possible and W as small as possible, as for the R-squared.
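The decomposition T = W + B can be illustrated on a single variable. The notes do not give the Pseudo-F formula, so the sketch below assumes the usual Calinski-Harabasz form, (B / (g - 1)) / (W / (n - g)) with g clusters and n units; the data and the function names are hypothetical.

```python
def deviance(points, center):
    """Sum of squared deviations of the points from a center."""
    return sum((p - center) ** 2 for p in points)


def r_squared_and_pseudo_f(groups):
    """groups: list of clusters, each a list of values of one variable."""
    all_points = [p for g in groups for p in g]
    n, g_count = len(all_points), len(groups)
    grand_mean = sum(all_points) / n
    total = deviance(all_points, grand_mean)                    # T
    within = sum(deviance(g, sum(g) / len(g)) for g in groups)  # W
    between = total - within                                    # B = T - W
    r2 = between / total
    # Assumed Calinski-Harabasz form; undefined when W = 0 or g = 1
    pseudo_f = (between / (g_count - 1)) / (within / (n - g_count))
    return r2, pseudo_f


# Hypothetical partitions of the same six values
good_partition = [[1, 2, 3], [10, 11, 12]]
poor_partition = [[1, 2, 10], [3, 11, 12]]

print(r_squared_and_pseudo_f(good_partition))  # R-squared close to 1, large Pseudo-F
print(r_squared_and_pseudo_f(poor_partition))  # lower on both indices
```

Both indices reward partitions with small W and large B, so they move in the same direction when comparing partitions with the same number of clusters.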
Hierarchical clustering: problems and limitations
- Computational complexity in time and space, especially when we have many statistical units and many statistical variables
- The decision to merge 2 groups is irreversible: in each of the n-1 iterations we merge the 2 closest clusters and update the distance matrix, and a merge is never undone
- No objective function is directly minimized (unlike in K-means)
- Different methods have issues with one or more of the following:
o Sensitivity to noise and outliers: the choice of the algorithm depends on the structure of the data and the goal of the analysis (for example: if we want a model that is less sensitive to outliers and more compact, it is better to choose complete linkage, which is less sensitive than single linkage)
o Difficulty in handling clusters of different sizes (especially when we have outliers)
o Confusion with large clusters: sometimes we have clusters of different size (= a problem
Supervised learning (labelled data)
In SUPERVISED LEARNING the main GOAL is to make predictions about the class or value of y.
- If the y variable is categorical, we are talking about classification tasks
- If the y variable is continuous, we are talking about regression tasks
We divide our dataset into 2 parts:
- Training set: used to train (= fit) the model; typically contains 80% of the observations = bigger part
- Test set: used to evaluate the model in terms of accuracy (= compare the predicted value and the real value); typically contains 20% of the observations (of course, 100% in the original dataset)
p = number of predictions
To evaluate our model, we use measures of performance:
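- Regression tasks: we use the Mean Square Error, $MSE = \frac{1}{p} \sum_{i=1}^{p} (y_i - \hat{y}_i)^2$, the mean of the squared differences between the actual value $y_i$ and the predicted value $\hat{y}_i$ over the p predictions
o In the regression task y is a numerical variable
o MSE is a measure of the accuracy of our model: the lower the MSE, the more accurate the model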
- Classification task: we use the Misclassification Error Rate = number of incorrect predictions / total number of instances (that are the rows)
o In the classification task y is a categorical variable (y could be a multiclass or binary variable)
o The Misclassification Error Rate measures the accuracy of the model: the lower it is, the more accurate the model
The difference between regression and classification tasks is the type of the y variable.
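Both performance measures take only a few lines to compute. The sketch below uses hypothetical actual and predicted values; the function names are illustrative.

```python
def mean_squared_error(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def misclassification_rate(actual, predicted):
    """Number of incorrect predictions divided by the number of instances."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


# Regression task: y is numerical (hypothetical values)
y_true_reg = [3.0, 5.0, 2.0]
y_pred_reg = [2.0, 5.0, 4.0]
print(mean_squared_error(y_true_reg, y_pred_reg))      # (1 + 0 + 4) / 3

# Classification task: y is categorical (hypothetical labels)
y_true_cls = [0, 1, 0, 1]
y_pred_cls = [0, 1, 1, 1]
print(misclassification_rate(y_true_cls, y_pred_cls))  # 1 wrong out of 4
```

Because the errors are squared, a single large error weighs more on the MSE than several small ones.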
Training data: key points
- Used to teach the model to recognize patterns and relationship in the data
- In general, the larger the training set, the better the model learns
- Used to compare different models
Test data: key points
- Test set used to evaluate how well the model performs on new data (unseen examples)
- Helps to detect bias and overfitting, which occurs when a model works well on training data but not on new data
- The test set should be a good representation of the original data
Example of classification task: binary
- Each row is an observation: 10 observations in our dataset
- Columns X1, X2, ..., Xp are the features or characteristics (= independent variables)
- Classification task: y = categorical variable
Values of Y (0 or 1) for each unit (the values of X1, X2, ..., Xp are not reported):
Unit   1  2  3  4  5  6  7  8  9  10
Y      0  1  0  0  1  0  1  0  0  1
We randomly split the dataset into:
- Training set (80%): 8 observations (units: 1, 2, 3, 5, 7, 8, 9, 10)
- Test set (20%): 2 observations (units: 4, 6), used to verify the performance
We want to predict the class ŷ for units 4 and 6.
If the accuracy on the training set is much higher than the accuracy on the test set, we have overfitting.
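A random 80/20 split of the 10 units can be sketched as follows. The helper name `train_test_split` and the seed are assumptions of this example; because the split is random, the units that end up in the test set will generally differ from units 4 and 6 of the example above.

```python
import random


def train_test_split(units, test_fraction=0.2, seed=0):
    """Shuffle the units and hold out a fraction of them as the test set."""
    rng = random.Random(seed)          # fixed seed, only for reproducibility
    shuffled = units[:]                # copy, so the original order is preserved
    rng.shuffle(shuffled)
    n_test = round(len(units) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


units = list(range(1, 11))             # the 10 observations of the example
train, test = train_test_split(units)

print("training units:", sorted(train))
print("test units:    ", sorted(test))
```

The test units are set aside before training, so the model is evaluated on observations it has not seen.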
o Then, we look at the class (y) of these K closest points and apply the majority vote rule: we use the prevalent class
- In the case of REGRESSION the predicted value (ŷ) is equal to the mean of the y values of the K nearest neighbours
o To predict the value of y we use the mean
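The two prediction rules (majority vote for classification, mean for regression) can be sketched on a tiny hypothetical training set. The sketch assumes Euclidean distance to find the neighbours and Python 3.8 or later for `math.dist`; all names and values are illustrative.

```python
import math
from collections import Counter


def k_nearest(train_X, query, k):
    """Indices of the k training points closest to the query (Euclidean distance)."""
    dists = [(math.dist(x, query), i) for i, x in enumerate(train_X)]
    return [i for _, i in sorted(dists)[:k]]


def knn_classify(train_X, train_y, query, k):
    """Majority vote: the prevalent class among the k nearest neighbours."""
    votes = Counter(train_y[i] for i in k_nearest(train_X, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_X, train_y, query, k):
    """Mean of y over the k nearest neighbours."""
    neighbours = k_nearest(train_X, query, k)
    return sum(train_y[i] for i in neighbours) / k


# Hypothetical training data: two groups of points
X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
y_class = [0, 0, 0, 1, 1, 1]
y_value = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

print(knn_classify(X, y_class, (2, 2), k=3))  # the three closest points are class 0
print(knn_regress(X, y_value, (2, 2), k=3))   # mean of 1.0, 2.0 and 3.0
```

With an even K and two classes the vote can end in a tie; this sketch then returns whichever class `Counter` lists first, so an odd K is the simpler choice.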
K-Nearest Neighbour Regression Algorithm
Choose the optimal value of K :
- If K is too small, it is sensitive to noise points (like in a)
- If K is too large, it may include points from other classes
Typically, K is chosen between 1 and 20, preferring odd numbers to avoid 50/50 ties in the majority vote rule.
Cross-Validation criterion
To choose the optimal number K, we can use the CROSS-VALIDATION CRITERION. It is a resampling method: a technique for estimating model performance by repeatedly drawing samples from the data. The training data is randomly divided into k equally sized folds (= blocks); this k is different from the K of the NN.
- One fold is used as the validation set
- The remaining k-1 folds are used for training
The process is repeated k times, until every fold has been used once as the validation set, and the results are averaged.
Example of K-fold cross-validation
Training data: k = 5 folds
Step 1: we divide the training data into 5 folds, of which:
- 1 fold (the 1st) is the validation set
- The remaining 4 folds are the training set
Step 2: consider the 2nd fold for validation (= just 1) and the other 4 folds for training
Step 3: consider the 3rd fold for validation and the other 4 folds for training
Step 4: consider the 4th fold for validation and the other 4 folds for training
Step 5: consider the 5th fold for validation and the other 4 folds for training
The procedure stops when all folds have been used for validation.
We run this procedure for each candidate K of the NN (example: K=1, K=2, K=3), because we want to choose the optimal K. For each K (1, 2, 3) we split the training data into 5 folds:
- Training: use 4 of the folds as training data
- Validation: use the remaining 1 fold as validation
Then, we calculate the average performance for each K and compare the averages: we choose the value of K with the best performance.
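The whole procedure (5 folds, one average score per candidate K, pick the best) can be sketched as below. The training data, the candidate values of K, the seed and the use of the misclassification rate as the performance measure are assumptions of this example.

```python
import math
import random
from collections import Counter


def knn_predict(train, query, k_neighbours):
    """Majority vote among the k_neighbours training rows closest to the query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k_neighbours]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_error(data, k_neighbours, n_folds=5, seed=0):
    """Average misclassification rate of K-NN over n_folds validation folds."""
    rows = data[:]
    random.Random(seed).shuffle(rows)                  # random assignment to folds
    folds = [rows[i::n_folds] for i in range(n_folds)]
    fold_errors = []
    for i, validation in enumerate(folds):
        # the other n_folds - 1 folds form the training set
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        wrong = sum(1 for x, y in validation
                    if knn_predict(training, x, k_neighbours) != y)
        fold_errors.append(wrong / len(validation))
    return sum(fold_errors) / n_folds


# Hypothetical training data: two well-separated groups of 5 points each
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 0), ((0.5, 0.5), 0),
        ((10, 10), 1), ((10, 11), 1), ((11, 10), 1), ((11, 11), 1), ((10.5, 10.5), 1)]

candidates = [1, 3, 7]
scores = {k: cv_error(data, k) for k in candidates}
best_k = min(scores, key=scores.get)   # lowest average validation error

print(scores)
print("chosen K:", best_k)
```

The same seed is used for every candidate K, so all candidates are compared on identical folds. The test set is not touched during this selection and stays available for the final evaluation.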