Assignment problems Solved - Advanced Logarithm Data Processing | CIS 4930, Assignments of Computer Science

Material Type: Assignment; Class: ADV LG DATA PROCESSG; Subject: COMPUTER SCIENCE AND INFORMATION SYSTEMS; University: University of Florida; Term: Fall 2007;

Uploaded on 03/13/2009

koofers-user-n0e

Chapter 4, Problem 3:
a)
For the classification function, there are 4 positive (+) examples and 5 negative (–) examples.
Hence, the entropy of this collection with respect to the class label is:
Entropy = – (4/9) log2 (4/9) – (5/9) log2 (5/9) = 0.99
b)
Now if we split on a1 and a2, then:

a1    N1     N2
+     3      1
–     1      4
Entropy(N1) = 0.81
Entropy(N2) = 0.72
Δgain = 0.99 – (4/9)(0.81) – (5/9)(0.72) = 0.23

a2    N1     N2
+     2      2
–     2      3
Entropy(N1) = 1.00
Entropy(N2) = 0.97
Δgain = 0.99 – (4/9)(1.00) – (5/9)(0.97) = 0.01
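
The entropy and gain figures in parts (a) and (b) can be checked with a short script. This is a minimal sketch rather than part of the original solution: the helper names `entropy` and `info_gain` are made up for illustration, and the class counts are read off the tables for a1 and a2.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a node, given its per-class record counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [4, 5]                 # 4 positive and 5 negative examples
a1_children = [[3, 1], [1, 4]]  # (+, -) counts in N1 and N2 when splitting on a1
a2_children = [[2, 2], [2, 3]]  # (+, -) counts in N1 and N2 when splitting on a2

print(round(entropy(parent), 2))                 # 0.99
print(round(info_gain(parent, a1_children), 2))  # 0.23
print(round(info_gain(parent, a2_children), 2))  # 0.01
```

Skipping zero counts in `entropy` avoids evaluating log2(0) for pure nodes.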
c)
Hence, the best split for a3 is 2.0.
d)
Based on the calculations we made in part (b) and (c), the best split is a1.
e)
Classification error rate for the classification function
= 1 – max[4/9, 5/9]
= 1 – 5/9
= 0.44
Splitting on a1 and a2:

a1    N1     N2
+     3      1
–     1      4
Error 0.25   0.20
Δ = 0.44 – (4/9)(0.25) – (5/9)(0.20) = 0.22

a2    N1     N2
+     2      2
–     2      3
Error 0.50   0.40
Δ = 0.44 – (4/9)(0.50) – (5/9)(0.40) = 0.00
So, based on classification error rate, a1 is the best split.
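
The same comparison can be reproduced with exact fractions, which avoids rounding noise when a gain is exactly zero. The sketch below is illustrative only; `error_rate` and `error_gain` are hypothetical helper names, not something defined in the assignment.

```python
from fractions import Fraction

def error_rate(counts):
    """Classification error of a node: 1 - (majority count / node size)."""
    return 1 - Fraction(max(counts), sum(counts))

def error_gain(parent, children):
    """Parent error minus the size-weighted error of the child nodes."""
    total = sum(parent)
    weighted = sum(Fraction(sum(ch), total) * error_rate(ch) for ch in children)
    return error_rate(parent) - weighted

parent = [4, 5]                                 # (+, -) counts before splitting
gain_a1 = error_gain(parent, [[3, 1], [1, 4]])  # split on a1
gain_a2 = error_gain(parent, [[2, 2], [2, 3]])  # split on a2

print(gain_a1)  # 2/9, about 0.22
print(gain_a2)  # 0
```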
f)
For the classification function, GINI = 1 – (4/9)^2 – (5/9)^2 = 0.49

Now if we split on a1 and a2, then:

a1    N1     N2
+     3      1
–     1      4
GINI  0.38   0.32
Δ = 0.49 – (4/9)(0.38) – (5/9)(0.32) = 0.14

a2    N1     N2
+     2      2
–     2      3
GINI  0.50   0.48
Δ = 0.49 – (4/9)(0.50) – (5/9)(0.48) = 0.00

Hence, a1 is the best split.

Chapter 4, Problem 7:
a)
For the classification function,
Error (without any split) = 1 – max(50/100, 50/100) = 1 – ½ = ½

Attribute A:
           T                        F
Class +    25                       25
Class –    0                        50
Error      1 – max(25/25, 0/25)     1 – max(25/75, 50/75)
           = 1 – 1 = 0              = 1 – 2/3 = 1/3
Δ = ½ – [25/100 • 0 + 75/100 • 1/3] = ½ – ¼ = 0.25

Attribute B:
           T                        F
Class +    30                       20
Class –    20                       30
Error      1 – max(30/50, 20/50)    1 – max(20/50, 30/50)
           = 1 – 3/5 = 2/5          = 1 – 3/5 = 2/5
Δ = ½ – [50/100 • 2/5 + 50/100 • 2/5] = ½ – 2/5 = 1/10 = 0.10

Attribute C:
           T                        F
Class +    25                       25
Class –    25                       25
Error      1 – max(25/50, 25/50)    1 – max(25/50, 25/50)
           = 1 – ½ = ½              = 1 – ½ = ½
Δ = ½ – [50/100 • ½ + 50/100 • ½] = ½ – ½ = 0

Clearly, attribute A is the best first split since it has the highest gain. Hence, the tree would look like:

          A
      T /   \ F
       +     ?
    25 +    25 +
     0 –    50 –
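
The greedy choice of the root in part (a) amounts to computing the reduction in classification error for each candidate attribute and taking the largest. A minimal sketch, assuming the class counts from the three tables; the names `error_rate`, `gain`, and `splits` are hypothetical and used only for illustration.

```python
from fractions import Fraction

def error_rate(pos, neg):
    """Classification error of a node with pos positive and neg negative records."""
    return 1 - Fraction(max(pos, neg), pos + neg)

def gain(parent, children):
    """Reduction in classification error achieved by a split."""
    n = sum(parent)
    weighted = sum(Fraction(p + q, n) * error_rate(p, q) for p, q in children)
    return error_rate(*parent) - weighted

# (positive, negative) counts in the T and F branches of each attribute
splits = {
    "A": [(25, 0), (25, 50)],
    "B": [(30, 20), (20, 30)],
    "C": [(25, 25), (25, 25)],
}

gains = {attr: gain((50, 50), kids) for attr, kids in splits.items()}
best = max(gains, key=gains.get)

for attr, value in gains.items():
    print(attr, value)  # A 1/4, B 1/10, C 0
print(best)             # A
```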

result in this tree. Let's start with finding the second split for the left child (C = true). Also note that the classification error for this node is ½, as computed in part (a).

Attribute A:
           T                        F
Class +    25                       0
Class –    0                        25
Error      1 – max(25/25, 0/25)     1 – max(0/25, 25/25)
           = 1 – 1 = 0              = 1 – 1 = 0
Δ = ½ – [25/50 • 0 + 25/50 • 0] = ½ – 0 = 0.5

Attribute B:
           T                        F
Class +    5                        20
Class –    20                       5
Error      1 – max(5/25, 20/25)     1 – max(20/25, 5/25)
           = 1 – 4/5 = 1/5          = 1 – 4/5 = 1/5
Δ = ½ – [25/50 • 1/5 + 25/50 • 1/5] = ½ – 1/5 = 3/10 = 0.3

Clearly, attribute A is the better split for the left subtree.

Now let's find the second splitting attribute for the right child of the root (C = false). The classification error for this node is ½, as computed in part (a).

Attribute A:
           T                        F
Class +    0                        25
Class –    0                        25
Error                               1 – max(25/50, 25/50)
                                    = 1 – ½ = ½
Δ = ½ – 50/50 • ½ = ½ – ½ = 0

Attribute B:
           T                        F
Class +    25                       0
Class –    0                        25
Error      1 – max(25/25, 0/25)     1 – max(0/25, 25/25)
           = 1 – 1 = 0              = 1 – 1 = 0
Δ = ½ – [25/50 • 0 + 25/50 • 0] = ½ – 0 = 0.5

Hence, attribute B is the better split for the right subtree.

                  C
            T /       \ F
             A         B
         T /   \ F T /   \ F
          +     –   +     –
       25 +   0 +  25 +   0 +
        0 –  25 –   0 –  25 –

Since all the leaf nodes are pure, 0 out of the 100 instances will be misclassified. This tree correctly classifies all instances.
e)
It is evident from the results of parts (c) and (d) that the greedy approach does not always lead to the decision tree with the lowest misclassification error.

Chapter 4, Problem 9:
Tree (a)
Number of non-leaf nodes = 2
Number of leaf nodes = 3
Number of errors = 7
Number of classes = 3
Number of attributes = 16
Number of records = n
Hence,
Cost(tree) = 2 log2 16 + 3 log2 3 = 2 • 4 + 3 • 1.585 = 8 + 4.755 = 12.755 bits
Cost(data | tree) = 7 log2 n bits
Cost(tree, data) = 12.755 + 7 log2 n bits

Tree (b)
Number of non-leaf nodes = 4
Number of leaf nodes = 5
Number of errors = 4
Number of classes = 3
Number of attributes = 16
Number of records = n
Hence,
Cost(tree) = 4 log2 16 + 5 log2 3 = 4 • 4 + 5 • 1.585 = 16 + 7.925 = 23.925 bits
Cost(data | tree) = 4 log2 n bits
Cost(tree, data) = 23.925 + 4 log2 n bits

Solving for n in 23.925 + 4 log2 n = 12.755 + 7 log2 n, we get n = 13.208. Hence, according to the MDL principle, decision tree (b) is better if n ≥ 14 and tree (a) is better otherwise.

Chapter 5, Problem 4:
a)
Accuracy(R1) = 4 / (4 + 1) = 4/5 = 0.8
Accuracy(R2) = 30 / (30 + 10) = ¾ = 0.75
Accuracy(R3) = 100 / (100 + 90) = 10/19 = 0.5263
R1 best, R3 worst
b)
For FOIL's information gain, we extend a rule that has equal positive and negative coverage (i.e. p0 = n0) with each of the given rules, and then compare the results.
FOIL(R1) = 4 • (log2(4/5) – log2(1/2)) = 4 • (–0.3219 + 1) = 2.71
FOIL(R2) = 30 • (log2(30/40) – log2(1/2)) = 30 • (–0.4150 + 1) = 17.55
FOIL(R3) = 100 • (log2(100/190) – log2(1/2)) = 100 • (–0.9260 + 1) = 7.4001
R2 best, R1 worst
c)
For R1, k = 2, f+ = 4, e+ = 5 • 100/500 = 1, f– = 1, e– = 5 • 400/500 = 4
LSR(R1) = 2 • (4 • log2(4/1) + 1 • log2(1/4)) = 2 • (4 • 2 + 1 • (–2)) = 2 • 6 = 12
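
The three rule-evaluation measures used in this problem are easy to recompute. The following is a hedged sketch, not the original author's code: the function names are invented for illustration, the class totals of 100 positive and 400 negative records are taken from the e+ and e– terms in part (c), and p0 = n0 = 1 stands in for any starting rule with equal positive and negative coverage.

```python
from math import log2

def accuracy(pos, neg):
    """Fraction of covered records that are positive."""
    return pos / (pos + neg)

def foil_gain(p0, n0, p1, n1):
    """FOIL's information gain when a rule covering (p0, n0) is extended to cover (p1, n1)."""
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

def likelihood_ratio(pos, neg, total_pos, total_neg):
    """Likelihood ratio statistic 2 * sum(f_i * log2(f_i / e_i)) over the two classes."""
    covered = pos + neg
    total = total_pos + total_neg
    expected = (covered * total_pos / total, covered * total_neg / total)
    observed = (pos, neg)
    return 2 * sum(f * log2(f / e) for f, e in zip(observed, expected) if f > 0)

rules = {"R1": (4, 1), "R2": (30, 10), "R3": (100, 90)}  # (positive, negative) coverage

for name, (p, n) in rules.items():
    print(name, round(accuracy(p, n), 4), round(foil_gain(1, 1, p, n), 4))

print(likelihood_ratio(4, 1, 100, 400))  # 12.0 for R1
```

Carrying full precision through the logarithms gives FOIL values of about 2.7123, 17.5489, and 7.4001 for R1, R2, and R3, so the ranking (R2 best, R1 worst) is unchanged.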

c)
P(A=1) = 5/10 = 0.5
P(B=1) = 4/10 = 0.4
P(A=1) • P(B=1) = 0.5 • 0.4 = 0.2
P(A=1, B=1) = P(B=1 | A=1) • P(A=1) = 2/5 • 5/10 = 2/10 = 0.2
Since P(A=1, B=1) = P(A=1) • P(B=1), we can say that the random variables A and B are independent.
d)
P(A=1) = 5/10 = 0.5
P(B=0) = 6/10 = 0.6
P(A=1) • P(B=0) = 0.5 • 0.6 = 0.3
P(A=1, B=0) = P(B=0 | A=1) • P(A=1) = 3/5 • 5/10 = 3/10 = 0.3
Since P(A=1, B=0) = P(A=1) • P(B=0), we can say that the random variables A and B are independent.
e)
P(A=1, B=1 | +) = P(A=1, B=1, class=+) / P(+) = (1/10) / (5/10) = 1/5 = 0.2
Using values from part (a), P(A=1 | +) • P(B=1 | +) = (3/5) • (2/5) = 6/25 = 0.24
Since P(A=1, B=1 | +) ≠ P(A=1 | +) • P(B=1 | +), we can say that the random variables A and B are not conditionally independent given the class '+'.

Chapter 5, Problem 12:
a)
P(B=g, F=e, G=e, S=y)
= P(S=y | B=g, F=e, G=e) • P(B=g, F=e, G=e)
= P(S=y | B=g, F=e) • P(G=e | B=g, F=e) • P(B=g, F=e)    (S does not depend on G)
= (1 – P(S=n | B=g, F=e)) • P(G=e | B=g, F=e) • P(B=g) • P(F=e)    (B and F are independent)
= (1 – P(S=n | B=g, F=e)) • P(G=e | B=g, F=e) • (1 – P(B=b)) • P(F=e)
= (1 – 0.8) • 0.8 • (1 – 0.1) • 0.2
= 0.2 • 0.8 • 0.9 • 0.2 = 0.0288
b)
P(B=b, F=e, G=ne, S=n)
= P(S=n | B=b, F=e, G=ne) • P(B=b, F=e, G=ne)
= P(S=n | B=b, F=e) • P(G=ne | B=b, F=e) • P(B=b, F=e)    (S does not depend on G)
= P(S=n | B=b, F=e) • (1 – P(G=e | B=b, F=e)) • P(B=b) • P(F=e)    (B and F are independent)
= 1 • (1 – 0.9) • 0.1 • 0.2
= 1 • 0.1 • 0.1 • 0.2 = 0.002
c)
P(S=y | B=b) = 1 – P(S=n | B=b) = 1 – P(S=n, B=b) / P(B=b) = 1 – 0.92 = 0.08
where
P(S=n, B=b) / P(B=b)
= (P(S=n, B=b, F=e) + P(S=n, B=b, F=ne)) / P(B=b)
= (P(S=n | B=b, F=e) • P(B=b) • P(F=e) + P(S=n | B=b, F=ne) • P(B=b) • P(F=ne)) / P(B=b)
= P(S=n | B=b, F=e) • P(F=e) + P(S=n | B=b, F=ne) • P(F=ne)
= 1 • 0.2 + 0.9 • (1 – 0.2)
= 0.2 + 0.72 = 0.92
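
The factorisation used in Problem 12 can be turned into a small calculator. This is only a sketch under stated assumptions: the dictionary layout and function names are invented for illustration, and the tables hold just the conditional probabilities quoted in the solution, not the full network.

```python
# Probabilities quoted in the solution; entries not needed for parts (a)-(c)
# are left out rather than guessed.
P_BATTERY_BAD = 0.1
P_FUEL_EMPTY = 0.2
P_GAUGE_EMPTY = {("good", "empty"): 0.8, ("bad", "empty"): 0.9}  # P(G=e | B, F)
P_NO_START = {                                                   # P(S=n | B, F)
    ("good", "empty"): 0.8,
    ("bad", "empty"): 1.0,
    ("bad", "not_empty"): 0.9,
}

def p_battery(state):
    return P_BATTERY_BAD if state == "bad" else 1 - P_BATTERY_BAD

def p_fuel(state):
    return P_FUEL_EMPTY if state == "empty" else 1 - P_FUEL_EMPTY

def joint(battery, fuel, gauge_empty, starts):
    """P(B, F, G, S) = P(B) * P(F) * P(G | B, F) * P(S | B, F)."""
    p_g = P_GAUGE_EMPTY[(battery, fuel)]
    p_s = P_NO_START[(battery, fuel)]
    return (p_battery(battery) * p_fuel(fuel)
            * (p_g if gauge_empty else 1 - p_g)
            * (1 - p_s if starts else p_s))

part_a = joint("good", "empty", gauge_empty=True, starts=True)
part_b = joint("bad", "empty", gauge_empty=False, starts=False)

# Part (c): marginalise the fuel state out of P(S=n | B=b, F).
p_no_start_given_bad = sum(P_NO_START[("bad", f)] * p_fuel(f)
                           for f in ("empty", "not_empty"))
part_c = 1 - p_no_start_given_bad

print(round(part_a, 4), round(part_b, 4), round(part_c, 4))  # 0.0288 0.002 0.08
```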