REI502M - Introduction to Data Mining
Solutions to homework 2
Elías Snorrason
10. september 2019

Problem 3.2

Consider the training examples shown in the following table for a binary classification problem.

    Customer ID   Gender   Car Type   Shirt Size    Class
     1            M        Family     Small         C0
     2            M        Sports     Medium        C0
     3            M        Sports     Medium        C0
     4            M        Sports     Large         C0
     5            M        Sports     Extra Large   C0
     6            M        Sports     Extra Large   C0
     7            F        Sports     Small         C0
     8            F        Sports     Small         C0
     9            F        Sports     Medium        C0
    10            F        Luxury     Large         C0
    11            M        Family     Large         C1
    12            M        Family     Extra Large   C1
    13            M        Family     Medium        C1
    14            M        Luxury     Extra Large   C1
    15            F        Luxury     Small         C1
    16            F        Luxury     Small         C1
    17            F        Luxury     Medium        C1
    18            F        Luxury     Medium        C1
    19            F        Luxury     Medium        C1
    20            F        Luxury     Large         C1

A. Compute the Gini index for the overall collection of training examples.

The whole collection is a single partition with 20 records and two classes with relative frequencies p and (1 - p). Here C0 and C1 occur equally often (p = 1 - p = 1/2), so

    Gini = 1 - p^2 - (1 - p)^2 = 2p(1 - p) = 2p^2 = 2 * (1/4) = 1/2 = 0.5

B. Compute the Gini index for the Customer ID attribute.

Splitting on Customer ID gives 20 partitions, each containing a single record, so every partition's Gini index is zero. The weighted average over the partitions is therefore

    Gini = sum_{i=1}^{20} (1/20) * Gini(ID_i) = 0

C. Compute the Gini index for the Gender attribute.

Splitting on Gender gives two partitions (M and F) of 10 records each. With p the relative frequency of C0 in each partition:

    Gini(M) = 2p(1 - p) = 2 * (6/10) * (4/10) = 48/100 = 0.48
    Gini(F) = 2p(1 - p) = 2 * (4/10) * (6/10) = 48/100 = 0.48

The weighted average of the two indices gives the Gini index of the split:

    Gini = (10/20) * Gini(M) + (10/20) * Gini(F) = 48/100 = 0.48
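The Gini calculations above all follow one pattern: an impurity per partition, then a record-count-weighted average over the partitions of a split. A minimal sketch of that pattern (not part of the original solutions; the helper names are my own), checked against parts A and C:

```python
def gini(counts):
    """Gini index of one partition, given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def weighted_gini(partitions):
    """Record-count-weighted average Gini over the partitions of a split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

# Part A -- overall collection: 10 records of C0 and 10 of C1.
print(round(gini([10, 10]), 4))                    # 0.5

# Part C -- Gender split: M has 6 C0 / 4 C1, F has 4 C0 / 6 C1.
print(round(weighted_gini([[6, 4], [4, 6]]), 4))   # 0.48
```

The same `weighted_gini` helper covers the multiway splits in the later parts; each attribute value simply contributes one more partition to the list.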
D. Compute the Gini index for the Car Type attribute using multiway split.

This gives three partitions: Family (4 records), Sports (8 records) and Luxury (8 records).

    Gini(Family) = 2 * (1/4) * (3/4) = 6/16 = 0.375
    Gini(Sports) = 2 * (8/8) * (0/8) = 0
    Gini(Luxury) = 2 * (1/8) * (7/8) = 14/64 = 0.2188

The weighted average of these indices is:

    Gini = (4/20) * Gini(Family) + (8/20) * Gini(Sports) + (8/20) * Gini(Luxury)
         = (4/20) * (6/16) + (8/20) * (14/64) = 13/80 = 0.1625

E. Compute the Gini index for the Shirt Size attribute using multiway split.

Here we get four partitions: Small (5 records), Medium (7 records), Large (4 records) and Extra Large (4 records).

    Gini(Small)       = 2 * (3/5) * (2/5) = 12/25 = 0.48
    Gini(Medium)      = 2 * (3/7) * (4/7) = 24/49 = 0.4898
    Gini(Large)       = 2 * (2/4) * (2/4) = 1/2 = 0.5
    Gini(Extra Large) = 2 * (2/4) * (2/4) = 1/2 = 0.5

The weighted average of these indices is:

    Gini = (5/20) * (12/25) + (7/20) * (24/49) + (4/20) * (1/2) + (4/20) * (1/2) = 86/175 = 0.4914

F. What is the best split (between a1 and a2) according to the Gini index?

Use the same relative frequencies as in the previous part.

a1:
    p(a1)     +      -
    True     3/4    1/4
    False    1/5    4/5

    Gini(a1) = (4/9) * (1 - (3/4)^2 - (1/4)^2) + (5/9) * (1 - (1/5)^2 - (4/5)^2) = 0.3444

a2:
    p(a2)     +      -
    True     2/5    3/5
    False    2/4    2/4

    Gini(a2) = (5/9) * (1 - (2/5)^2 - (3/5)^2) + (4/9) * (1 - (2/4)^2 - (2/4)^2) = 0.4889

Since Gini(a1) < Gini(a2), splitting on a1 is the better choice.

Problem 3.5

Consider the following data set for a binary class problem.

    A   B   Class Label
    T   F   +
    T   T   +
    T   T   +
    T   F   -
    T   T   +
    F   F   -
    F   F   -
    F   F   -
    T   T   -
    T   F   -

A. Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

Determine the impurity before splitting:

    Entropy(Class) = -(4/10) log2(4/10) - (6/10) log2(6/10) = 0.9710

Determine the count matrices (and relative frequencies p), then calculate the corresponding impurities.
A:
    Counts       +    -        p(A)      +      -
    A = True     4    3        True     4/7    3/7
    A = False    0    3        False    0/3    3/3

    Entropy(A) = (7/10) * (-(4/7) log2(4/7) - (3/7) log2(3/7)) + (3/10) * 0 = 0.6897
    => Gain(A) = 0.9710 - 0.6897 = 0.2813

B:
    Counts       +    -        p(B)      +      -
    B = True     3    1        True     3/4    1/4
    B = False    1    5        False    1/6    5/6

    Entropy(B) = (4/10) * (-(3/4) log2(3/4) - (1/4) log2(1/4)) + (6/10) * (-(1/6) log2(1/6) - (5/6) log2(5/6)) = 0.7145
    => Gain(B) = 0.9710 - 0.7145 = 0.2565

According to this, attribute A should be picked for splitting.

B. Calculate the gain in the Gini index when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

Start with the Gini index of the unsplit set:

    Gini(Class) = 2 * (4/10) * (6/10) = 48/100 = 0.48

Using the same count matrices and relative frequencies as in the previous part, we calculate the corresponding Gini indices:

    Gini(A) = (7/10) * (24/49) + (3/10) * 0 = 0.3429    => Gain(A) = 0.48 - 0.3429 = 0.1371
    Gini(B) = (4/10) * (6/16) + (6/10) * (10/36) = 0.3167    => Gain(B) = 0.48 - 0.3167 = 0.1633

From this, attribute B should be chosen.

C. Figure 3.11 shows that entropy and the Gini index are both monotonically increasing on the range [0, 0.5] and both monotonically decreasing on the range [0.5, 1]. Is it possible that information gain and the gain in the Gini index favor different attributes? Explain.

Yes. Figure 3.11 plots only the impurity measures themselves, not the gains. Although both curves peak at p = 0.5 and fall off toward the pure distributions, their shapes differ, so the record-count-weighted averages taken over a split's partitions can rank attributes differently under the two measures. Parts A and B demonstrate exactly this: information gain favors attribute A, while the gain in the Gini index favors attribute B.
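The entropy and information-gain computations in part A can be checked numerically. A short sketch using the standard definitions (my own helper names, not code from the original solutions):

```python
from math import log2

def entropy(counts):
    """Entropy of one partition, given its per-class record counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_gain(parent, partitions):
    """Information gain of splitting `parent` into `partitions`."""
    n = sum(parent)
    children = sum(sum(p) / n * entropy(p) for p in partitions)
    return entropy(parent) - children

# Class distribution: 4 positive, 6 negative.
# Split on A: A=T -> (4+, 3-), A=F -> (0+, 3-).
# Split on B: B=T -> (3+, 1-), B=F -> (1+, 5-).
print(round(info_gain([4, 6], [[4, 3], [0, 3]]), 4))  # 0.2813
print(round(info_gain([4, 6], [[3, 1], [1, 5]]), 4))  # 0.2564
```

Note the unrounded gain for B is 0.2564; the 0.2565 above comes from subtracting the already-rounded intermediate entropies (0.9710 - 0.7145). Either way, Gain(A) > Gain(B), so the conclusion is unchanged.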