REI502M - Introduction to Data Mining
Solutions to homework 2
Elías Snorrason, 10 September 2019
Problem 3.2
Consider the training examples shown in the following table for a binary classification problem.
Customer ID Gender Car Type Shirt Size Class
1 M Family Small C0
2 M Sports Medium C0
3 M Sports Medium C0
4 M Sports Large C0
5 M Sports Extra Large C0
6 M Sports Extra Large C0
7 F Sports Small C0
8 F Sports Small C0
9 F Sports Medium C0
10 F Luxury Large C0
11 M Family Large C1
12 M Family Extra Large C1
13 M Family Medium C1
14 M Luxury Extra Large C1
15 F Luxury Small C1
16 F Luxury Small C1
17 F Luxury Medium C1
18 F Luxury Medium C1
19 F Luxury Medium C1
20 F Luxury Large C1
A. Compute the Gini index for the overall collection of training examples.
The overall collection is a single partition with 20 records and two possible classes with relative frequencies $p$ and $(1 - p)$, respectively. In this case, C0 and C1 have the same relative frequency ($p = 1 - p = \frac{1}{2}$), so

$$\text{Gini} = 1 - p^2 - (1 - p)^2 = 2p(1 - p) = 2p^2 = 2 \cdot \frac{1}{4} = \frac{1}{2} = 0.5$$
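The computation above can be checked numerically. The following is a minimal sketch, not part of the original solution; the function name `gini` and the list `classes` are chosen here for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini index of a collection of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Class column of the table above: records 1-10 are C0, records 11-20 are C1.
classes = ["C0"] * 10 + ["C1"] * 10
overall_gini = gini(classes)
print(overall_gini)  # 0.5
```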

B. Compute the Gini index for the Customer ID attribute.

Split the entire collection into 20 partitions based on the Customer ID attribute. Since each partition contains only a single record, its Gini index is zero. The weighted average of the Gini indices over all partitions becomes:

$$\text{Gini} = \sum_{i=1}^{20} \frac{1}{20}\,\text{Gini}(\text{ID}_i) = 0$$

C. Compute the Gini index for the Gender attribute.

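Split the entire collection into two partitions based on the Gender attribute (M or F). Each partition has 10 records. Let $p$ be the relative frequency of C0 in each partition: $p = \frac{6}{10}$ for M and $p = \frac{4}{10}$ for F.

$$\text{Gini}(M) = 2p(1 - p) = 2 \cdot \frac{6}{10} \cdot \frac{4}{10} = 0.48$$

$$\text{Gini}(F) = 2p(1 - p) = 2 \cdot \frac{4}{10} \cdot \frac{6}{10} = 0.48$$

Take the weighted average of both Gini indices to determine the total Gini index for the given split.

$$\text{Gini} = \frac{10}{20}\,\text{Gini}(M) + \frac{10}{20}\,\text{Gini}(F) = 0.48$$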

D. Compute the Gini index for the Car Type attribute using multiway split.

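This gives us 3 partitions: Family (4 records, 1 of them C0), Sports (8 records, all C0) and Luxury (8 records, 1 of them C0). With $p$ the relative frequency of C0 in each partition:

$$\text{Gini}(\text{Family}) = 2 \cdot \frac{1}{4} \cdot \frac{3}{4} = 0.375$$

$$\text{Gini}(\text{Sports}) = 2 \cdot \frac{8}{8} \cdot \frac{0}{8} = 0$$

$$\text{Gini}(\text{Luxury}) = 2 \cdot \frac{1}{8} \cdot \frac{7}{8} \approx 0.2188$$

The weighted average of these indices is:

$$\text{Gini} = \frac{4}{20}\,\text{Gini}(\text{Family}) + \frac{8}{20}\,\text{Gini}(\text{Sports}) + \frac{8}{20}\,\text{Gini}(\text{Luxury}) = 0.1625$$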

E. Compute the Gini index for the Shirt Size attribute using multiway split.

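Here we get 4 partitions: Small (5 records, 3 of them C0), Medium (7 records, 3 of them C0), Large (4 records, 2 of them C0) and Extra Large (4 records, 2 of them C0).

$$\text{Gini}(\text{Small}) = 2 \cdot \frac{3}{5} \cdot \frac{2}{5} = 0.48$$

$$\text{Gini}(\text{Medium}) = 2 \cdot \frac{3}{7} \cdot \frac{4}{7} \approx 0.4898$$

$$\text{Gini}(\text{Large}) = 2 \cdot \frac{2}{4} \cdot \frac{2}{4} = 0.5$$

$$\text{Gini}(\text{Extra Large}) = 2 \cdot \frac{2}{4} \cdot \frac{2}{4} = 0.5$$

The weighted average of these indices is:

$$\text{Gini} = \frac{5}{20}\,\text{Gini}(\text{Small}) + \frac{7}{20}\,\text{Gini}(\text{Medium}) + \frac{4}{20}\,\text{Gini}(\text{Large}) + \frac{4}{20}\,\text{Gini}(\text{Extra Large}) \approx 0.4914$$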
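The weighted Gini indices of parts B to E can be recomputed directly from the training table. The sketch below is an illustration rather than the author's code; `RECORDS`, `gini` and `split_gini` are names introduced here, and the data are copied from the Problem 3.2 table.

```python
from collections import Counter, defaultdict

# Training table from Problem 3.2: (gender, car type, shirt size, class).
# The position in the list, counted from 1, is the Customer ID.
RECORDS = [
    ("M", "Family", "Small", "C0"),
    ("M", "Sports", "Medium", "C0"),
    ("M", "Sports", "Medium", "C0"),
    ("M", "Sports", "Large", "C0"),
    ("M", "Sports", "Extra Large", "C0"),
    ("M", "Sports", "Extra Large", "C0"),
    ("F", "Sports", "Small", "C0"),
    ("F", "Sports", "Small", "C0"),
    ("F", "Sports", "Medium", "C0"),
    ("F", "Luxury", "Large", "C0"),
    ("M", "Family", "Large", "C1"),
    ("M", "Family", "Extra Large", "C1"),
    ("M", "Family", "Medium", "C1"),
    ("M", "Luxury", "Extra Large", "C1"),
    ("F", "Luxury", "Small", "C1"),
    ("F", "Luxury", "Small", "C1"),
    ("F", "Luxury", "Medium", "C1"),
    ("F", "Luxury", "Medium", "C1"),
    ("F", "Luxury", "Medium", "C1"),
    ("F", "Luxury", "Large", "C1"),
]

def gini(labels):
    """Gini index of one partition: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(keys, labels):
    """Weighted average Gini index of a multiway split on `keys`."""
    partitions = defaultdict(list)
    for key, label in zip(keys, labels):
        partitions[key].append(label)
    n = len(labels)
    return sum(len(part) / n * gini(part) for part in partitions.values())

labels = [r[3] for r in RECORDS]
results = {
    "Customer ID": split_gini(range(1, len(RECORDS) + 1), labels),
    "Gender": split_gini([r[0] for r in RECORDS], labels),
    "Car Type": split_gini([r[1] for r in RECORDS], labels),
    "Shirt Size": split_gini([r[2] for r in RECORDS], labels),
}
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

Among the three non-trivial attributes, Car Type gives the lowest weighted Gini index. Customer ID reaches zero only because every partition holds a single record.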

C. For a3, which is a continuous attribute, compute the information gain for every possible split.

The values of a3 that occur in the given table lie in the range [1.0, 8.0]. After sorting, we set split positions midway between neighboring values. Table sorted by a3 (then by ID):

Instance a3 Target Class
1 1.0 +
6 3.0 -
4 4.0 +
3 5.0 -
9 5.0 -
2 6.0 +
5 7.0 -
8 7.0 +
7 8.0 -

The table below lists, for each split position of a3, the class counts on both sides of the split (a3 <= split and a3 > split), the weighted entropy (E) and the information gain (G). The entropy before splitting is 0.9911 (4 positive and 5 negative records). The information gain is maximal when splitting at a3 = 2.0.

Split +(<=) -(<=) +(>) -(>) E G
0.5 0 0 4 5 0.9911 0
2.0 1 0 3 5 0.8484 0.1427
3.5 1 1 3 4 0.9885 0.0026
4.5 2 1 2 4 0.9183 0.0728
5.5 2 3 2 2 0.9839 0.0072
6.5 3 3 1 2 0.9728 0.0183
7.5 4 4 0 1 0.8889 0.1022
8.5 4 5 0 0 0.9911 0
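The scan over split positions can be reproduced with a short script. This is a sketch under stated assumptions: the names `DATA`, `entropy` and `split_entropy` are introduced here, the first end position 0.5 follows the text, and the last end position 8.5 is assumed by symmetry.

```python
from math import log2

# (a3 value, target class) pairs from the sorted table above.
DATA = [(1.0, "+"), (3.0, "-"), (4.0, "+"), (5.0, "-"), (5.0, "-"),
        (6.0, "+"), (7.0, "-"), (7.0, "+"), (8.0, "-")]

def entropy(labels):
    """Shannon entropy in bits; an empty partition contributes 0."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

def split_entropy(data, threshold):
    """Weighted entropy of the binary split a3 <= threshold vs. a3 > threshold."""
    left = [c for v, c in data if v <= threshold]
    right = [c for v, c in data if v > threshold]
    n = len(data)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

parent = entropy([c for _, c in DATA])
values = sorted({v for v, _ in DATA})
# Midpoints between neighboring distinct values, plus the two end positions
# (0.5 as in the text, 8.5 assumed) where all records fall on one side.
thresholds = [0.5] + [(a + b) / 2 for a, b in zip(values, values[1:])] + [8.5]
gains = {t: parent - split_entropy(DATA, t) for t in thresholds}
best = max(gains, key=gains.get)

for t in thresholds:
    print(f"split {t}: E = {split_entropy(DATA, t):.4f}, G = {gains[t]:.4f}")
print("best split:", best)
```

Only midpoints between distinct values need to be tested, because moving a threshold between two equal or adjacent sorted values does not change the partition.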

D. What is the best split (among a1, a2, and a3) according to the information gain?

By maximizing information gain:

Gain(a1) = 0.2294, Gain(a2) = 0.0072, Gain(a3) = 0.1427

we get the best split from a1.

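E. What is the best split (between a1 and a2) according to the misclassification error rate?

We've already calculated the relative frequencies. Start with a1:

p(a1) + -
True 3/4 1/4
False 1/5 4/5

$$\Rightarrow \text{Error}(a_1) = \frac{4}{9}\left(1 - \frac{3}{4}\right) + \frac{5}{9}\left(1 - \frac{4}{5}\right) = \frac{1}{9} + \frac{1}{9} = \frac{2}{9}$$

Then a2:

p(a2) + -
True 2/5 3/5
False 2/4 2/4

$$\Rightarrow \text{Error}(a_2) = \frac{5}{9}\left(1 - \frac{3}{5}\right) + \frac{4}{9}\left(1 - \frac{2}{4}\right) = \frac{2}{9} + \frac{2}{9} = \frac{4}{9}$$

Since Error(a1) < Error(a2), a1 gives the better split.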
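The two error rates can be verified with exact fractions. This sketch assumes the count matrices implied by the relative frequencies and partition weights above (for example, 4 records with frequencies 3/4 and 1/4 means 3 positive and 1 negative); `COUNTS` and `split_error` are names introduced here.

```python
from fractions import Fraction

# Class counts (+, -) per attribute value, in the order (True, False).
# Derived from the relative frequencies above, not given explicitly there.
COUNTS = {
    "a1": [(3, 1), (1, 4)],
    "a2": [(2, 3), (2, 2)],
}

def split_error(rows):
    """Weighted misclassification error: sum_j (n_j / n) * (1 - max_k p_jk)."""
    n = sum(sum(row) for row in rows)
    return sum(Fraction(sum(row), n) * (1 - Fraction(max(row), sum(row)))
               for row in rows)

errors = {name: split_error(rows) for name, rows in COUNTS.items()}
for name, err in errors.items():
    print(f"Error({name}) = {err}")
```

Using `Fraction` keeps the results in the same form as the hand computation (2/9 and 4/9) instead of rounded decimals.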

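F. What is the best split (between a1 and a2) according to the Gini index?

Use the same relative frequencies as in the previous part. a1:

p(a1) + -
True 3/4 1/4
False 1/5 4/5

$$\Rightarrow \text{Gini}(a_1) = \frac{4}{9}\left(1 - \frac{3^2}{4^2} - \frac{1^2}{4^2}\right) + \frac{5}{9}\left(1 - \frac{1^2}{5^2} - \frac{4^2}{5^2}\right) = 0.3444$$

a2:

p(a2) + -
True 2/5 3/5
False 2/4 2/4

$$\Rightarrow \text{Gini}(a_2) = \frac{5}{9}\left(1 - \frac{2^2}{5^2} - \frac{3^2}{5^2}\right) + \frac{4}{9}\left(1 - \frac{2^2}{4^2} - \frac{2^2}{4^2}\right) = 0.4889$$

Since Gini(a1) < Gini(a2), a1 gives the better split.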
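As a numerical check of the two weighted Gini indices, here is a minimal sketch. It assumes the class counts implied by the relative frequencies and the partition weights 4/9 and 5/9; `COUNTS` and `split_gini` are illustrative names.

```python
# Class counts (+, -) per attribute value, derived from the relative
# frequencies above (assumed, since the count matrices are not shown here).
COUNTS = {
    "a1": {"True": (3, 1), "False": (1, 4)},
    "a2": {"True": (2, 3), "False": (2, 2)},
}

def split_gini(partitions):
    """Weighted average of 1 - sum_k p_k^2 over the partitions of a split."""
    n = sum(sum(counts) for counts in partitions.values())
    total = 0.0
    for counts in partitions.values():
        size = sum(counts)
        impurity = 1.0 - sum((c / size) ** 2 for c in counts)
        total += size / n * impurity
    return total

gini_by_attribute = {name: split_gini(parts) for name, parts in COUNTS.items()}
for name, value in gini_by_attribute.items():
    print(f"Gini({name}) = {value:.4f}")
```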

Problem 3.

Consider the following data set for a binary class problem.

A B Class Label
T F +
T T +
T T +
T F -
T T +
F F -
F F -
F F -
T T -
T F -

A. Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

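Determine the impurity before splitting (4 positive and 6 negative records):

$$\text{Entropy}(\text{Class}) = -\frac{4}{10}\log_2\frac{4}{10} - \frac{6}{10}\log_2\frac{6}{10} = 0.9710$$

Determine the count matrices (and relative frequencies $p$). Then calculate the corresponding impurities. A:

A + -
True 4 3
False 0 3

and

p(A) + -
True 4/7 3/7
False 0 1

$$\text{Entropy}(A) = \frac{7}{10}\left(-\frac{4}{7}\log_2\frac{4}{7} - \frac{3}{7}\log_2\frac{3}{7}\right) + \frac{3}{10} \cdot 0 = 0.6897 \;\Rightarrow\; \text{Gain}(A) = 0.2813$$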

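B:

B + -
True 3 1
False 1 5

and

p(B) + -
True 3/4 1/4
False 1/6 5/6

$$\text{Entropy}(B) = \frac{4}{10}\left(-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) + \frac{6}{10}\left(-\frac{1}{6}\log_2\frac{1}{6} - \frac{5}{6}\log_2\frac{5}{6}\right) = 0.7145 \;\Rightarrow\; \text{Gain}(B) = 0.2565$$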

According to this, attribute A should be picked for splitting.
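The comparison can be reproduced from the raw data set. The sketch below is illustrative, with `ROWS`, `entropy` and `information_gain` as names introduced here. Because it does not round intermediate entropies, the last printed digit of a gain may differ slightly from the hand computation, which subtracts values already rounded to four decimals.

```python
from collections import Counter, defaultdict
from math import log2

# (A, B, class label) rows of the data set above.
ROWS = [("T", "F", "+"), ("T", "T", "+"), ("T", "T", "+"), ("T", "F", "-"),
        ("T", "T", "+"), ("F", "F", "-"), ("F", "F", "-"), ("F", "F", "-"),
        ("T", "T", "-"), ("T", "F", "-")]

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, column):
    """Entropy before the split minus the weighted entropy after it."""
    labels = [r[-1] for r in rows]
    groups = defaultdict(list)
    for r in rows:
        groups[r[column]].append(r[-1])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

gain_a = information_gain(ROWS, 0)
gain_b = information_gain(ROWS, 1)
print(f"Gain(A) = {gain_a:.4f}, Gain(B) = {gain_b:.4f}")
print("chosen attribute:", "A" if gain_a > gain_b else "B")
```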