



Solution to a series of exercises




Consider the training examples shown in the following table for a binary classification problem.
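| Customer ID | Gender | Car Type | Shirt Size | Class |
|---|---|---|---|---|
| 1 | M | Family | Small | C0 |
| 2 | M | Sports | Medium | C0 |
| 3 | M | Sports | Medium | C0 |
| 4 | M | Sports | Large | C0 |
| 5 | M | Sports | Extra Large | C0 |
| 6 | M | Sports | Extra Large | C0 |
| 7 | F | Sports | Small | C0 |
| 8 | F | Sports | Small | C0 |
| 9 | F | Sports | Medium | C0 |
| 10 | F | Luxury | Large | C0 |
| 11 | M | Family | Large | C1 |
| 12 | M | Family | Extra Large | C1 |
| 13 | M | Family | Medium | C1 |
| 14 | M | Luxury | Extra Large | C1 |
| 15 | F | Luxury | Small | C1 |
| 16 | F | Luxury | Small | C1 |
| 17 | F | Luxury | Medium | C1 |
| 18 | F | Luxury | Medium | C1 |
| 19 | F | Luxury | Medium | C1 |
| 20 | F | Luxury | Large | C1 |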
A. Compute the Gini index for the overall collection of training examples.
This results in a single partition with 20 records and two possible classes with relative frequencies p and (1 − p), respectively. In this case, C0 and C1 have the same relative frequencies ($p = 1 - p = \tfrac{1}{2}$):
$$\text{Gini} = 1 - p^2 - (1 - p)^2 = 2p(1 - p) = 2p^2 = 2 \cdot \tfrac{1}{4} = 0.5$$
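The formula above can be checked numerically. The following is a minimal sketch, assuming the node is described by its per-class record counts; the function name `gini` is illustrative and not part of the original solution.

```python
def gini(counts):
    """Gini index of a node, given the number of records in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0  # convention for an empty node
    return 1.0 - sum((c / total) ** 2 for c in counts)


# 20 records split evenly between C0 and C1, i.e. p = 1/2.
overall = gini([10, 10])
print(overall)
```

A node holding a single record, such as each partition of a split on a unique identifier, gives `gini([1]) == 0.0`.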
B. Compute the Gini index for the 'Customer ID' attribute.
Split the entire collection into 20 partitions based on the 'Customer ID' attribute. Since each partition contains only a single record, its Gini index is zero. The weighted average of the Gini indices over all partitions becomes:
$$\text{Gini} = \sum_{i=1}^{20} \frac{1}{20}\,\text{Gini}(\text{ID}_i) = 0$$
C. Compute the Gini index for the Gender attribute.
Split the entire collection into two partitions based on the Gender attribute (M or F). Each partition has 10 records. Set p as relative frequency of C0 in each case.
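Of the 10 M records, 6 belong to C0; of the 10 F records, 4 belong to C0.
$$\text{Gini}(M) = 2p(1 - p) = 2 \cdot \tfrac{6}{10} \cdot \tfrac{4}{10} = 0.48$$
$$\text{Gini}(F) = 2p(1 - p) = 2 \cdot \tfrac{4}{10} \cdot \tfrac{6}{10} = 0.48$$
Take the weighted average of both Gini indices to determine the total Gini index for the given split.
$$\text{Gini} = \tfrac{10}{20}\,\text{Gini}(M) + \tfrac{10}{20}\,\text{Gini}(F) = 0.48$$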
D. Compute the Gini index for the Car Type attribute using multiway split.
This gives us 3 partitions (Family (4 records), Sports (8 records) and Luxury (8 records)).
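The C0 counts are 1 of 4 for Family, 8 of 8 for Sports and 1 of 8 for Luxury.
$$\text{Gini}(\text{Family}) = 2 \cdot \tfrac{1}{4} \cdot \tfrac{3}{4} = 0.375$$
$$\text{Gini}(\text{Sports}) = 2 \cdot \tfrac{8}{8} \cdot \tfrac{0}{8} = 0$$
$$\text{Gini}(\text{Luxury}) = 2 \cdot \tfrac{1}{8} \cdot \tfrac{7}{8} = 0.2188$$
The weighted average of these indices is:
$$\text{Gini} = \tfrac{4}{20}\,\text{Gini}(\text{Family}) + \tfrac{8}{20}\,\text{Gini}(\text{Sports}) + \tfrac{8}{20}\,\text{Gini}(\text{Luxury}) = 0.1625$$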
E. Compute the Gini index for the Shirt Size attribute using multiway split.
Here we get 4 partitions (Small (5 records), Medium (7 records), Large (4 records) and Extra Large (4 records)).
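The C0 counts are 3 of 5 for Small, 3 of 7 for Medium, 2 of 4 for Large and 2 of 4 for Extra Large.
$$\text{Gini}(\text{Small}) = 2 \cdot \tfrac{3}{5} \cdot \tfrac{2}{5} = 0.48$$
$$\text{Gini}(\text{Medium}) = 2 \cdot \tfrac{3}{7} \cdot \tfrac{4}{7} = 0.4898$$
$$\text{Gini}(\text{Large}) = 2 \cdot \tfrac{2}{4} \cdot \tfrac{2}{4} = 0.5$$
$$\text{Gini}(\text{Extra Large}) = 2 \cdot \tfrac{2}{4} \cdot \tfrac{2}{4} = 0.5$$
The weighted average of these indices is:
$$\text{Gini} = \tfrac{5}{20} \cdot 0.48 + \tfrac{7}{20} \cdot 0.4898 + \tfrac{4}{20} \cdot 0.5 + \tfrac{4}{20} \cdot 0.5 = 0.4914$$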
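The multiway splits in parts B to E all follow the same recipe: group the records by attribute value, compute the Gini index of each group, and weight by group size. The sketch below cross-checks that recipe on the customer table. One assumption is made loudly here: customers 1 to 10 are taken to belong to C0 and customers 11 to 20 to C1, which is consistent with the equal class frequencies stated in part A. The names `RECORDS`, `class_of` and `split_gini` are illustrative.

```python
from collections import Counter, defaultdict

# (customer_id, gender, car_type, shirt_size), copied from the table above.
RECORDS = [
    (1, "M", "Family", "Small"), (2, "M", "Sports", "Medium"),
    (3, "M", "Sports", "Medium"), (4, "M", "Sports", "Large"),
    (5, "M", "Sports", "Extra Large"), (6, "M", "Sports", "Extra Large"),
    (7, "F", "Sports", "Small"), (8, "F", "Sports", "Small"),
    (9, "F", "Sports", "Medium"), (10, "F", "Luxury", "Large"),
    (11, "M", "Family", "Large"), (12, "M", "Family", "Extra Large"),
    (13, "M", "Family", "Medium"), (14, "M", "Luxury", "Extra Large"),
    (15, "F", "Luxury", "Small"), (16, "F", "Luxury", "Small"),
    (17, "F", "Luxury", "Medium"), (18, "F", "Luxury", "Medium"),
    (19, "F", "Luxury", "Medium"), (20, "F", "Luxury", "Large"),
]
COLUMNS = {"Customer ID": 0, "Gender": 1, "Car Type": 2, "Shirt Size": 3}


def class_of(record):
    # ASSUMPTION: IDs 1-10 are class C0 and IDs 11-20 are class C1.
    return "C0" if record[0] <= 10 else "C1"


def gini(labels):
    """Gini index of a non-empty list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_gini(records, column):
    """Size-weighted average Gini index of a multiway split on one column."""
    groups = defaultdict(list)
    for record in records:
        groups[record[column]].append(class_of(record))
    total = len(records)
    return sum(len(g) / total * gini(g) for g in groups.values())


results = {name: split_gini(RECORDS, idx) for name, idx in COLUMNS.items()}
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Under the stated class assumption, the sketch gives 0 for Customer ID, 0.48 for Gender, 0.1625 for Car Type and about 0.4914 for Shirt Size.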
C. For a3, which is a continuous attribute, compute the information gain for every possible split.
The values of a3 that occur in the given table lie in the range [1.0, 8.0]. After sorting, we set split positions midway between neighboring values. Table sorted by a3 (then by ID):

| Instance | a3 | Target Class |
|---|---|---|
| 1 | 1.0 | + |
| 6 | 3.0 | - |
| 4 | 4.0 | + |
| 3 | 5.0 | - |
| 9 | 5.0 | - |
| 2 | 6.0 | + |
| 5 | 7.0 | - |
| 8 | 7.0 | + |
| 7 | 8.0 | - |

For each split position, a count matrix records the class counts for a3 <= split and a3 > split, and each case yields a weighted entropy and an information gain (E/G). The maximal information gain is obtained by splitting at a3 = 2.0.
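The search over split positions can be sketched as follows. This is an illustrative sketch rather than the original worksheet's procedure: it assumes that only midpoints between distinct neighboring values of a3 are tried, and the names `DATA`, `candidate_splits` and `information_gain` are hypothetical.

```python
from math import log2

# (a3 value, target class) pairs from the sorted table above.
DATA = [(1.0, "+"), (3.0, "-"), (4.0, "+"), (5.0, "-"), (5.0, "-"),
        (6.0, "+"), (7.0, "-"), (7.0, "+"), (8.0, "-")]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * log2(p)
    return result


def candidate_splits(values):
    """Midpoints between distinct neighboring values (an assumption)."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


def information_gain(data, threshold):
    """Entropy before the split minus the weighted entropy after it."""
    labels = [cls for _, cls in data]
    left = [cls for value, cls in data if value <= threshold]
    right = [cls for value, cls in data if value > threshold]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(data)
    return entropy(labels) - weighted


gains = {t: information_gain(DATA, t) for t in candidate_splits(v for v, _ in DATA)}
best = max(gains, key=gains.get)
for threshold, gain in gains.items():
    print(f"a3 <= {threshold}: gain = {gain:.4f}")
print("best split:", best)
```

The candidate positions are 2.0, 3.5, 4.5, 5.5, 6.5 and 7.5, and the largest gain (about 0.1427) occurs at 2.0, in agreement with the text.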
D. What is the best split (among a1, a2, and a3) according to the information gain?
By maximizing information gain:
Gain(a1) = 0.2294, Gain(a2) = 0.0072, Gain(a3) = 0.1427
we get the best split from a1.
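E. What is the best split (between a1 and a2) according to the misclassification error rate?
We've already calculated the relative frequencies. Start with a1:

| p(a1) | + | - |
|---|---|---|
| True | 3/4 | 1/4 |
| False | 1/5 | 4/5 |

$$\text{Error}(a_1) = \tfrac{4}{9}\left(1 - \tfrac{3}{4}\right) + \tfrac{5}{9}\left(1 - \tfrac{4}{5}\right) = \tfrac{1}{9} + \tfrac{1}{9} = \tfrac{2}{9}$$

Then a2:

| p(a2) | + | - |
|---|---|---|
| True | 2/5 | 3/5 |
| False | 2/4 | 2/4 |

$$\text{Error}(a_2) = \tfrac{5}{9}\left(1 - \tfrac{3}{5}\right) + \tfrac{4}{9}\left(1 - \tfrac{2}{4}\right) = \tfrac{2}{9} + \tfrac{2}{9} = \tfrac{4}{9}$$

Since Error(a1) < Error(a2), a1 gives the better split.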
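The error rates can be reproduced exactly with rational arithmetic. The sketch below assumes the class counts implied by the relative-frequency tables (for example, 3/4 and 1/4 on a branch of 4 records means 3 positive and 1 negative); `COUNTS` and `split_error` are illustrative names.

```python
from fractions import Fraction

# (positive, negative) counts for the True and False branches of each attribute.
COUNTS = {
    "a1": [(3, 1), (1, 4)],
    "a2": [(2, 3), (2, 2)],
}


def split_error(partitions):
    """Misclassification error of a split: each branch predicts its majority class."""
    total = sum(sum(counts) for counts in partitions)
    misclassified = sum(sum(counts) - max(counts) for counts in partitions)
    return Fraction(misclassified, total)


errors = {name: split_error(parts) for name, parts in COUNTS.items()}
best = min(errors, key=errors.get)
print(errors, "best:", best)
```

Counting misclassified records directly is equivalent to the weighted sum of (1 − max p) per branch, because each branch weight n_k/n cancels the denominator of its relative frequency.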
F. What is the best split (between a1 and a2) according to the Gini index?
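Use the same relative frequencies as in the previous part. a1:

| p(a1) | + | - |
|---|---|---|
| True | 3/4 | 1/4 |
| False | 1/5 | 4/5 |

$$\text{Gini}(a_1) = \tfrac{4}{9}\left(1 - \tfrac{3^2}{4^2} - \tfrac{1^2}{4^2}\right) + \tfrac{5}{9}\left(1 - \tfrac{1^2}{5^2} - \tfrac{4^2}{5^2}\right) = 0.3444$$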
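a2:

| p(a2) | + | - |
|---|---|---|
| True | 2/5 | 3/5 |
| False | 2/4 | 2/4 |

$$\text{Gini}(a_2) = \tfrac{5}{9}\left(1 - \tfrac{2^2}{5^2} - \tfrac{3^2}{5^2}\right) + \tfrac{4}{9}\left(1 - \tfrac{2^2}{4^2} - \tfrac{2^2}{4^2}\right) = 0.4889$$

The split on a1 has the lower Gini index, so a1 again gives the better split.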
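Parts D to F differ only in the impurity measure applied to each branch. A possible way to make that explicit is to pass the measure as a function, as in the sketch below. The branch counts are the ones implied by the relative-frequency tables; `weighted_impurity` and `MEASURES` are hypothetical names.

```python
from math import log2

# (positive, negative) counts for the True and False branches of each attribute.
COUNTS = {"a1": [(3, 1), (1, 4)], "a2": [(2, 3), (2, 2)]}


def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)


def gini(probs):
    return 1.0 - sum(p * p for p in probs)


def error(probs):
    return 1.0 - max(probs)


MEASURES = {"entropy": entropy, "gini": gini, "error": error}


def weighted_impurity(partitions, measure):
    """Size-weighted average of an impurity measure over the branches of a split."""
    total = sum(sum(counts) for counts in partitions)
    result = 0.0
    for counts in partitions:
        n = sum(counts)
        result += n / total * measure([c / n for c in counts])
    return result


table = {
    attr: {name: weighted_impurity(parts, fn) for name, fn in MEASURES.items()}
    for attr, parts in COUNTS.items()
}
for attr, row in table.items():
    print(attr, {name: round(value, 4) for name, value in row.items()})
```

For these counts, all three measures rank a1 ahead of a2, although that agreement is a property of this data set and not a general guarantee.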
Problem 3.
Consider the following data set for a binary class problem.
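| A | B | Class Label |
|---|---|---|
| T | F | + |
| T | T | + |
| T | T | + |
| T | F | - |
| T | T | + |
| F | F | - |
| F | F | - |
| F | F | - |
| T | T | - |
| T | F | - |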
A. Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?
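Determine the impurity before splitting (4 positive and 6 negative records):
$$\text{Entropy}(\text{Class}) = -\tfrac{4}{10}\log_2\tfrac{4}{10} - \tfrac{6}{10}\log_2\tfrac{6}{10} = 0.9710$$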
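Determine the count matrices (and relative frequencies p). Then calculate the corresponding impurities. A:

| A | + | - |
|---|---|---|
| True | 4 | 3 |
| False | 0 | 3 |

and

| p(A) | + | - |
|---|---|---|
| True | 4/7 | 3/7 |
| False | 0 | 1 |

$$\text{Entropy}(A) = \tfrac{7}{10}\left(-\tfrac{4}{7}\log_2\tfrac{4}{7} - \tfrac{3}{7}\log_2\tfrac{3}{7}\right) + \tfrac{3}{10} \cdot 0 = 0.6897 \;\Rightarrow\; \text{Gain}(A) = 0.2813$$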
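B:

| B | + | - |
|---|---|---|
| True | 3 | 1 |
| False | 1 | 5 |

and

| p(B) | + | - |
|---|---|---|
| True | 3/4 | 1/4 |
| False | 1/6 | 5/6 |

$$\text{Entropy}(B) = \tfrac{4}{10}\left(-\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4}\right) + \tfrac{6}{10}\left(-\tfrac{1}{6}\log_2\tfrac{1}{6} - \tfrac{5}{6}\log_2\tfrac{5}{6}\right) = 0.7145 \;\Rightarrow\; \text{Gain}(B) = 0.2565$$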
According to this, attribute A should be picked for splitting.
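The comparison of A and B can be reproduced directly from the ten records. The following is a minimal sketch, with the rows copied from the data set above and the helper names `entropy` and `information_gain` chosen for illustration.

```python
from collections import Counter
from math import log2

# (A, B, class label) rows from the data set above.
RECORDS = [
    ("T", "F", "+"), ("T", "T", "+"), ("T", "T", "+"), ("T", "F", "-"),
    ("T", "T", "+"), ("F", "F", "-"), ("F", "F", "-"), ("F", "F", "-"),
    ("T", "T", "-"), ("T", "F", "-"),
]


def entropy(labels):
    """Shannon entropy (base 2) of a non-empty list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def information_gain(records, column):
    """Entropy of the class label minus the weighted entropy after splitting on a column."""
    labels = [r[-1] for r in records]
    remainder = 0.0
    for value in set(r[column] for r in records):
        subset = [r[-1] for r in records if r[column] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return entropy(labels) - remainder


gain_a = information_gain(RECORDS, 0)
gain_b = information_gain(RECORDS, 1)
print(f"Gain(A) = {gain_a:.4f}, Gain(B) = {gain_b:.4f}")
```

The larger of the two gains belongs to A, which matches the choice made above.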