Solved exercise, Exams of Telecommunication electronics

Université du Centre, Sousse, Telecommunication electronics

Solution to a series of exercises

Typology: Exams

2021/2022

Uploaded on 11/01/2022

farah-jbara 🇹🇳


REI502M - Introduction to Data Mining

Solutions to homework 2

Elías Snorrason, 10 September 2019

Problem 3.2

Consider the training examples shown in the following table for a binary classification problem.

Customer ID  Gender  Car Type  Shirt Size   Class
1            M       Family    Small        C0
2            M       Sports    Medium       C0
3            M       Sports    Medium       C0
4            M       Sports    Large        C0
5            M       Sports    Extra Large  C0
6            M       Sports    Extra Large  C0
7            F       Sports    Small        C0
8            F       Sports    Small        C0
9            F       Sports    Medium       C0
10           F       Luxury    Large        C0
11           M       Family    Large        C1
12           M       Family    Extra Large  C1
13           M       Family    Medium       C1
14           M       Luxury    Extra Large  C1
15           F       Luxury    Small        C1
16           F       Luxury    Small        C1
17           F       Luxury    Medium       C1
18           F       Luxury    Medium       C1
19           F       Luxury    Medium       C1
20           F       Luxury    Large        C1

A. Compute the Gini index for the overall collection of training examples.

This results in a single partition with 20 records and two possible classes with relative frequencies $p$ and $(1 - p)$, respectively. In this case, C0 and C1 have the same relative frequency ($p = 1 - p = \frac{1}{2}$), so

\[ \mathrm{Gini} = 1 - p^2 - (1 - p)^2 = 2p(1 - p) = 2p^2 = 2 \cdot \frac{1}{4} = \frac{1}{2} = 0.5 \]
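As a quick check of the formula, here is a minimal Python sketch that computes the Gini index of a node from its per-class record counts. The function name `gini` and the list-of-counts input are choices made for this illustration, not part of the original solution.

```python
def gini(counts):
    """Gini index of a node, given the number of records in each class."""
    total = sum(counts)
    if total == 0:
        return 0.0  # convention for an empty node (assumption of this sketch)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# The full training set holds 10 records of class C0 and 10 of class C1.
overall = gini([10, 10])
print(overall)
```

The same function also covers single-record partitions: `gini([1, 0])` evaluates to 0, since a node with one record is pure.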


B. Compute the Gini index for the Customer ID attribute.

Split the entire collection into 20 partitions based on the Customer ID attribute. Since each partition contains only a single record, its Gini index is zero. The weighted average of the Gini indices over all the partitions becomes:

\[ \mathrm{Gini} = \sum_{i=1}^{20} \frac{1}{20}\,\mathrm{Gini}(\mathrm{ID}_i) = 0 \]

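C. Compute the Gini index for the Gender attribute.

Split the entire collection into two partitions based on the Gender attribute (M or F). Each partition has 10 records. Let $p$ be the relative frequency of C0 in each partition: $p = \frac{6}{10}$ for M and $p = \frac{4}{10}$ for F.

\[ \mathrm{Gini}(M) = 2p(1 - p) = 2 \cdot \frac{6}{10} \cdot \frac{4}{10} = 0.48 \]

\[ \mathrm{Gini}(F) = 2p(1 - p) = 2 \cdot \frac{4}{10} \cdot \frac{6}{10} = 0.48 \]

Take the weighted average of both Gini indices to determine the total Gini index for the given split.

\[ \mathrm{Gini} = \frac{10}{20}\,\mathrm{Gini}(M) + \frac{10}{20}\,\mathrm{Gini}(F) = 0.48 \]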

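D. Compute the Gini index for the Car Type attribute using multiway split.

This gives us 3 partitions: Family (4 records, 1 in C0), Sports (8 records, all in C0) and Luxury (8 records, 1 in C0).

\[ \mathrm{Gini}(\mathrm{Family}) = 2 \cdot \frac{1}{4} \cdot \frac{3}{4} = 0.375 \]

\[ \mathrm{Gini}(\mathrm{Sports}) = 2 \cdot \frac{8}{8} \cdot \frac{0}{8} = 0 \]

\[ \mathrm{Gini}(\mathrm{Luxury}) = 2 \cdot \frac{1}{8} \cdot \frac{7}{8} = 0.2188 \]

The weighted average of these indices is:

\[ \mathrm{Gini} = \frac{4}{20}\,\mathrm{Gini}(\mathrm{Family}) + \frac{8}{20}\,\mathrm{Gini}(\mathrm{Sports}) + \frac{8}{20}\,\mathrm{Gini}(\mathrm{Luxury}) = 0.1625 \]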

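E. Compute the Gini index for the Shirt Size attribute using multiway split.

Here we get 4 partitions: Small (5 records, 3 in C0), Medium (7 records, 3 in C0), Large (4 records, 2 in C0) and Extra Large (4 records, 2 in C0).

\[ \mathrm{Gini}(\mathrm{Small}) = 2 \cdot \frac{3}{5} \cdot \frac{2}{5} = 0.48 \]

\[ \mathrm{Gini}(\mathrm{Medium}) = 2 \cdot \frac{3}{7} \cdot \frac{4}{7} = 0.4898 \]

\[ \mathrm{Gini}(\mathrm{Large}) = 2 \cdot \frac{2}{4} \cdot \frac{2}{4} = 0.5 \]

\[ \mathrm{Gini}(\mathrm{Extra\ Large}) = 2 \cdot \frac{2}{4} \cdot \frac{2}{4} = 0.5 \]

The weighted average of these indices is:

\[ \mathrm{Gini} = \frac{5}{20} \cdot 0.48 + \frac{7}{20} \cdot 0.4898 + \frac{4}{20} \cdot 0.5 + \frac{4}{20} \cdot 0.5 = 0.4914 \]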
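The weighted Gini index of each multiway split can be recomputed directly from the training table. The following is a minimal sketch: the tuple layout of `RECORDS` and the helper names `gini` and `split_gini` are assumptions of this example, while the data is copied from the table for Problem 3.2.

```python
from collections import Counter, defaultdict

# (gender, car type, shirt size, class) for customers 1..20, in table order.
RECORDS = [
    ("M", "Family", "Small", "C0"), ("M", "Sports", "Medium", "C0"),
    ("M", "Sports", "Medium", "C0"), ("M", "Sports", "Large", "C0"),
    ("M", "Sports", "Extra Large", "C0"), ("M", "Sports", "Extra Large", "C0"),
    ("F", "Sports", "Small", "C0"), ("F", "Sports", "Small", "C0"),
    ("F", "Sports", "Medium", "C0"), ("F", "Luxury", "Large", "C0"),
    ("M", "Family", "Large", "C1"), ("M", "Family", "Extra Large", "C1"),
    ("M", "Family", "Medium", "C1"), ("M", "Luxury", "Extra Large", "C1"),
    ("F", "Luxury", "Small", "C1"), ("F", "Luxury", "Small", "C1"),
    ("F", "Luxury", "Medium", "C1"), ("F", "Luxury", "Medium", "C1"),
    ("F", "Luxury", "Medium", "C1"), ("F", "Luxury", "Large", "C1"),
]
COLUMNS = {"Gender": 0, "Car Type": 1, "Shirt Size": 2}


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gini(records, column):
    """Weighted Gini index of a multiway split on the given column."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[column]].append(rec[-1])  # collect class labels per value
    n = len(records)
    return sum(len(g) / n * gini(g) for g in groups.values())


results = {name: split_gini(RECORDS, col) for name, col in COLUMNS.items()}
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

A lower weighted Gini index means purer partitions, so among these three attributes Car Type yields the purest split.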

C. For $a_3$, which is a continuous attribute, compute the information gain for every possible split.

The values of $a_3$ that occur in the given table lie in the range [1.0, 8.0]. After sorting, we set split positions midway between neighboring values, plus one position below the smallest value and one above the largest.

Table sorted by $a_3$ (then by ID):

Instance  a3   Target Class
1         1.0  +
6         3.0  -
4         4.0  +
3         5.0  -
9         5.0  -
2         6.0  +
5         7.0  -
8         7.0  +
7         8.0  -

Each row below gives a split position of $a_3$, the class counts on either side of it, the weighted entropy (E) and the information gain (G). The split position with maximal information gain is marked with an asterisk. This corresponds to splitting at $a_3 = 2.0$.

Split   <= (+, -)   > (+, -)   E        G
0.5     0, 0        4, 5       0.9911   0
2.0 *   1, 0        3, 5       0.8484   0.1427
3.5     1, 1        3, 4       0.9885   0.0026
4.5     2, 1        2, 4       0.9183   0.0728
5.5     2, 3        2, 2       0.9839   0.0072
6.5     3, 3        1, 2       0.9728   0.0183
7.5     4, 4        0, 1       0.8889   0.1022
8.5     4, 5        0, 0       0.9911   0
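The search over split positions for a continuous attribute can be sketched in a few lines of Python. This is an illustration under stated assumptions: the `(value, class)` pairs are copied from the sorted table for a3, and the helper names `entropy` and `split_gain` are invented for this example.

```python
from math import log2

# (a3 value, target class) pairs, sorted by a3.
DATA = [(1.0, "+"), (3.0, "-"), (4.0, "+"), (5.0, "-"), (5.0, "-"),
        (6.0, "+"), (7.0, "-"), (7.0, "+"), (8.0, "-")]


def entropy(labels):
    """Entropy (base 2) of a list of class labels; 0 for an empty list."""
    n = len(labels)
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * log2(p)
    return result


def split_gain(data, threshold):
    """Return (weighted entropy, information gain) for the test value <= threshold."""
    left = [c for v, c in data if v <= threshold]
    right = [c for v, c in data if v > threshold]
    n = len(data)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    parent = entropy([c for _, c in data])
    return weighted, parent - weighted


# Candidate positions: midpoints between distinct neighboring values,
# plus one position below the minimum and one above the maximum.
values = sorted({v for v, _ in DATA})
thresholds = ([values[0] - 0.5]
              + [(a + b) / 2 for a, b in zip(values, values[1:])]
              + [values[-1] + 0.5])

gains = {t: split_gain(DATA, t) for t in thresholds}
best = max(gains, key=lambda t: gains[t][1])
for t in thresholds:
    e, g = gains[t]
    print(f"{t:>4}: E = {e:.4f}, G = {g:.4f}")
print("best split:", best)
```

Only midpoints between distinct values are generated, so the repeated values 5.0 and 7.0 do not produce duplicate candidates.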

D. What is the best split (among $a_1$, $a_2$, and $a_3$) according to the information gain?

Comparing the information gains

\[ \mathrm{Gain}(a_1) = 0.2294, \quad \mathrm{Gain}(a_2) = 0.0072, \quad \mathrm{Gain}(a_3) = 0.1427 \]

the maximum is attained by $a_1$, so $a_1$ gives the best split.

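E. What is the best split (between $a_1$ and $a_2$) according to the misclassification error rate?

We've already calculated the relative frequencies. Start with $a_1$:

p(a1)   +     -
True    3/4   1/4
False   1/5   4/5

\[ \Rightarrow \mathrm{Error}(a_1) = \frac{4}{9}\left(1 - \frac{3}{4}\right) + \frac{5}{9}\left(1 - \frac{4}{5}\right) = \frac{1}{9} + \frac{1}{9} = \frac{2}{9} \]

Then $a_2$:

p(a2)   +     -
True    2/5   3/5
False   2/4   2/4

\[ \Rightarrow \mathrm{Error}(a_2) = \frac{5}{9}\left(1 - \frac{3}{5}\right) + \frac{4}{9}\left(1 - \frac{2}{4}\right) = \frac{2}{9} + \frac{2}{9} = \frac{4}{9} \]

Since $\frac{2}{9} < \frac{4}{9}$, $a_1$ gives the better split.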
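The error rates can be verified with exact fractions. In this sketch the class counts per branch are inferred from the relative frequencies and branch weights shown in this part (for example, a weight of 4/9 with frequencies 3/4 and 1/4 implies counts 3 and 1); the dictionary layout and the name `split_error` are assumptions of the example.

```python
from fractions import Fraction

# (count of +, count of -) per branch, inferred from the frequencies in part E.
A1 = {"True": (3, 1), "False": (1, 4)}
A2 = {"True": (2, 3), "False": (2, 2)}


def split_error(branches):
    """Weighted misclassification error: each branch predicts its majority class."""
    n = sum(sum(counts) for counts in branches.values())
    total = Fraction(0)
    for counts in branches.values():
        size = sum(counts)
        total += Fraction(size, n) * (1 - Fraction(max(counts), size))
    return total


error_a1 = split_error(A1)
error_a2 = split_error(A2)
print(error_a1, error_a2)
```

Using `Fraction` keeps the results in the same exact form as the hand calculation, which avoids rounding when comparing the two attributes.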

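F. What is the best split (between $a_1$ and $a_2$) according to the Gini index?

Use the same relative frequencies as in the previous part. $a_1$:

p(a1)   +     -
True    3/4   1/4
False   1/5   4/5

\[ \Rightarrow \mathrm{Gini}(a_1) = \frac{4}{9}\left(1 - \frac{3^2}{4^2} - \frac{1^2}{4^2}\right) + \frac{5}{9}\left(1 - \frac{1^2}{5^2} - \frac{4^2}{5^2}\right) = 0.3444 \]

$a_2$:

p(a2)   +     -
True    2/5   3/5
False   2/4   2/4

\[ \Rightarrow \mathrm{Gini}(a_2) = \frac{5}{9}\left(1 - \frac{2^2}{5^2} - \frac{3^2}{5^2}\right) + \frac{4}{9}\left(1 - \frac{2^2}{4^2} - \frac{2^2}{4^2}\right) = 0.4889 \]

Since $0.3444 < 0.4889$, $a_1$ again gives the better split.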
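The two Gini values can be reproduced numerically from the branch counts. As before, the counts are inferred from the relative frequencies in this part rather than read from the original data table, and the names `node_gini` and `weighted_gini` are hypothetical helpers for this sketch.

```python
# (count of +, count of -) for the True and False branches of each attribute.
A1_BRANCHES = [(3, 1), (1, 4)]
A2_BRANCHES = [(2, 3), (2, 2)]


def node_gini(counts):
    """Gini index of one branch from its class counts."""
    size = sum(counts)
    return 1.0 - sum((c / size) ** 2 for c in counts)


def weighted_gini(branches):
    """Gini index of a split: branch Gini values weighted by branch size."""
    n = sum(sum(counts) for counts in branches)
    return sum(sum(counts) / n * node_gini(counts) for counts in branches)


gini_a1 = weighted_gini(A1_BRANCHES)
gini_a2 = weighted_gini(A2_BRANCHES)
print(f"Gini(a1) = {gini_a1:.4f}, Gini(a2) = {gini_a2:.4f}")
```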

Problem 3.

Consider the following data set for a binary class problem.

A  B  Class Label
T  F  +
T  T  +
T  T  +
T  F  -
T  T  +
F  F  -
F  F  -
F  F  -
T  T  -
T  F  -

A. Calculate the information gain when splitting on A and B. Which attribute would the decision tree induction algorithm choose?

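Determine the impurity before splitting (4 positive and 6 negative records):

\[ \mathrm{Entropy}(\mathrm{Class}) = -\frac{4}{10}\log_2\frac{4}{10} - \frac{6}{10}\log_2\frac{6}{10} = 0.9710 \]

Determine the count matrices (and relative frequencies $p$), then calculate the corresponding impurities.

A:

A       +   -
True    4   3
False   0   3

and

p(A)    +     -
True    4/7   3/7
False   0     1

\[ \mathrm{Entropy}(A) = \frac{7}{10}\left(-\frac{4}{7}\log_2\frac{4}{7} - \frac{3}{7}\log_2\frac{3}{7}\right) + \frac{3}{10} \cdot 0 = 0.6897 \Rightarrow \mathrm{Gain}(A) = 0.2813 \]

B:

B       +   -
True    3   1
False   1   5

and

p(B)    +     -
True    3/4   1/4
False   1/6   5/6

\[ \mathrm{Entropy}(B) = \frac{4}{10}\left(-\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) + \frac{6}{10}\left(-\frac{1}{6}\log_2\frac{1}{6} - \frac{5}{6}\log_2\frac{5}{6}\right) = 0.7145 \Rightarrow \mathrm{Gain}(B) = 0.2565 \]

Since $\mathrm{Gain}(A) > \mathrm{Gain}(B)$, attribute A should be picked for splitting.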
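The comparison can be reproduced from the raw rows of the data set. This is a minimal sketch: the tuple layout of `ROWS` and the helper names `entropy` and `info_gain` are assumptions of this example. Without rounding intermediate values, the gain for B comes out as about 0.2564 rather than 0.2565, which does not change the conclusion.

```python
from collections import Counter, defaultdict
from math import log2

# (A, B, class label) rows, in the order given in the problem statement.
ROWS = [("T", "F", "+"), ("T", "T", "+"), ("T", "T", "+"), ("T", "F", "-"),
        ("T", "T", "+"), ("F", "F", "-"), ("F", "F", "-"), ("F", "F", "-"),
        ("T", "T", "-"), ("T", "F", "-")]


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def info_gain(rows, column):
    """Information gain of splitting the rows on the given attribute column."""
    parent = entropy([r[-1] for r in rows])
    groups = defaultdict(list)
    for r in rows:
        groups[r[column]].append(r[-1])  # class labels per attribute value
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


gain_a = info_gain(ROWS, 0)
gain_b = info_gain(ROWS, 1)
print(f"Gain(A) = {gain_a:.4f}, Gain(B) = {gain_b:.4f}")
```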
