Decision Trees-Artificial Intelligence-Quiz, Exercises of Artificial Intelligence

Madam Amrita Ahuja took this quiz in the Artificial Intelligence class at Central University of Jammu and Kashmir. The quiz covers: Decision, Trees, Data, Points, Boundaries, Dimension, Average, Entropy, Fraction, Positive, Nearest, Neighbors

Typology: Exercises

2011/2012

Uploaded on 07/31/2012

shaina_44kin
6.034 Quiz 2, Spring 2005
Open Book, Open Notes
Name:
| Problem | Score |
|---|---|
| 1 | (13 pts) |
| 2 | (8 pts) |
| 3 | (7 pts) |
| 4 | (9 pts) |
| 5 | (8 pts) |
| 6 | (16 pts) |
| 7 | (15 pts) |
| 8 | (12 pts) |
| 9 | (12 pts) |
| Total | (100 pts) |

1 Decision Trees (13 pts)

Data points are: Negative: (-1, 0) (2, 1) (2, -2); Positive: (0, 0) (1, 0). Construct a decision tree using the algorithm described in the notes for the data above.

  1. Show the tree you constructed in the diagram below. The diagram is more than big enough; leave any parts that you don't need blank.

[Diagram: decision tree; the filled-in tests read "f_1 > 1." and "f_1 > -0.", and the remaining nodes are left blank.]

  2. Draw the decision boundaries on the graph at the top of the page.
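The split selection can be sketched in Python. This is a minimal sketch, not the algorithm from the course notes verbatim: it assumes the tree is grown greedily by minimizing average entropy over thresholds placed midway between adjacent feature values, and it assumes the data set is Negative: (-1, 0), (2, 1), (2, -2); Positive: (0, 0), (1, 0), with the signs inferred from the worked answers. The names `entropy`, `average_entropy`, `candidate_splits`, and `best_split` are illustrative. Feature index 0 corresponds to f_1 in the quiz notation.

```python
from math import log2

# Assumed data set (signs inferred): label 0 = negative, 1 = positive.
points = [((-1, 0), 0), ((2, 1), 0), ((2, -2), 0), ((0, 0), 1), ((1, 0), 1)]

def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def average_entropy(data, feature, threshold):
    """Entropy of the two sides of a split, weighted by their sizes."""
    left = [y for x, y in data if x[feature] <= threshold]
    right = [y for x, y in data if x[feature] > threshold]
    n = len(data)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

def candidate_splits(data):
    """Yield (feature, threshold) pairs at midpoints between adjacent values."""
    for feature in range(2):
        values = sorted({x[feature] for x, _ in data})
        for lo, hi in zip(values, values[1:]):
            yield feature, (lo + hi) / 2

def best_split(data):
    """Return the candidate split with the lowest average entropy."""
    return min(candidate_splits(data), key=lambda s: average_entropy(data, *s))

root = best_split(points)                       # (0, 1.5)
left = [(x, y) for x, y in points if x[0] <= 1.5]
second = best_split(left)                       # (0, -0.5)
```

Under these assumptions the root test is on the first feature with threshold 1.5, and the remaining mixed branch is split on the same feature at -0.5, after which every leaf is pure.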

2 Nearest Neighbors (8 pts)

Data points are: Negative: (-1, 0) (2, 1) (2, -2); Positive: (0, 0) (1, 0)

  1. Draw the decision boundaries for 1-Nearest Neighbors on the graph above. Try to get the integer-valued coordinate points in the diagram on the correct side of the boundary lines.
  2. What class does 1-NN predict for the new point (1, -1.01)? Explain why.

Positive (+), since this is the class of the closest data point, (1, 0).

  3. What class does 3-NN predict for the new point (1, -1.01)? Explain why.

Positive (+), since it is the majority class of the three closest data points: (0, 0), (1, 0) and (2, -2).
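The two nearest-neighbor answers can be checked with a short Python sketch. It assumes Euclidean distance and simple majority voting, and it assumes the data set is Negative: (-1, 0), (2, 1), (2, -2); Positive: (0, 0), (1, 0), with query point (1, -1.01); the signs are inferred from the stated answers. The function name `knn_predict` is illustrative.

```python
from collections import Counter
from math import dist

# Assumed training set (signs inferred from the worked answers).
train = [((-1, 0), "-"), ((2, 1), "-"), ((2, -2), "-"), ((0, 0), "+"), ((1, 0), "+")]

def knn_predict(train, query, k):
    """Majority label among the k training points closest to the query."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

one_nn = knn_predict(train, (1, -1.01), k=1)    # "+"
three_nn = knn_predict(train, (1, -1.01), k=3)  # "+"
```

With k = 3 the neighbors are (1, 0), (2, -2) and (0, 0), so the vote is two positives against one negative.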

3 Perceptron (7 pts)

[Figure: plot of the data points on axes x_1 and x_2, with the negative points marked - and the positive point marked +.]

Data points are: Negative: (-1, 0) (2, -2); Positive: (1, 0). Assume that the points are examined in the order given here. Recall that the perceptron algorithm uses the extended form of the data points, in which a 1 is added as the 0th component.

  1. The linear separator obtained by the standard perceptron algorithm (using a step size of 1.0 and a zero initial weight vector) is (0 1 2). Explain how this result was obtained.

The perceptron algorithm cycles through the augmented points, updating the weights according to the rule w_new = w + y · x after each misclassified point. The intermediate weights are given in the table below.

| Test point | Misclassified? | Updated weights |
|---|---|---|
| Initial weights | | 0 0 0 |
| -: (1 -1 0) | yes | -1 1 0 |
| -: (1 2 -2) | yes | -2 -1 2 |
| +: (1 1 0) | yes | -1 0 2 |
| -: (1 -1 0) | no | |
| -: (1 2 -2) | no | |
| +: (1 1 0) | yes | 0 1 2 |
| -: (1 -1 0) | no | |
| -: (1 2 -2) | no | |
| +: (1 1 0) | no | |

  2. What class does this linear classifier predict for the new point (2.0, -1.01)? The margin of the point is -0.01, so it would be classified as negative.
  3. Imagine we apply the perceptron learning algorithm to the 5-point data set we used in Problem 1: Negative: (-1, 0) (2, 1) (2, -2); Positive: (0, 0) (1, 0). Describe qualitatively what the result would be. The perceptron algorithm would not converge, since the 5-point data set is not linearly separable.
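The weight updates can be traced with a small Python sketch of the perceptron rule w_new = w + y · x. It rests on a few assumptions: the data are Negative: (-1, 0), (2, -2); Positive: (1, 0), with signs inferred from the final weights (0 1 2); a point with activation exactly 0 counts as misclassified; and `max_epochs` is an illustrative cap so the loop stops on non-separable data. The names `perceptron` and `predict` are hypothetical.

```python
def perceptron(data, max_epochs=100):
    """Run the perceptron rule with step size 1 and zero initial weights.

    Returns (weights, history); weights is None if no separator was found
    within max_epochs passes. history lists the weights after each update.
    """
    w = [0, 0, 0]
    history = []
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            xa = (1, *x)  # augmented point: 1 added as the 0th component
            activation = sum(wi * xi for wi, xi in zip(w, xa))
            if y * activation <= 0:  # assumption: zero counts as a mistake
                w = [wi + y * xi for wi, xi in zip(w, xa)]
                history.append(tuple(w))
                mistakes += 1
        if mistakes == 0:
            return w, history
    return None, history

def predict(w, x):
    """Class (+1 or -1) assigned by weights w to the unaugmented point x."""
    activation = sum(wi * xi for wi, xi in zip(w, (1, *x)))
    return 1 if activation > 0 else -1

three_points = [((-1, 0), -1), ((2, -2), -1), ((1, 0), 1)]
w, history = perceptron(three_points)
# w == [0, 1, 2]
# history == [(-1, 1, 0), (-2, -1, 2), (-1, 0, 2), (0, 1, 2)]

five_points = [((-1, 0), -1), ((2, 1), -1), ((2, -2), -1), ((0, 0), 1), ((1, 0), 1)]
w5, _ = perceptron(five_points)  # None: the cap is reached without a clean pass
```

On the 5-point set the positive point (1, 0) lies inside the triangle formed by the three negative points, so no pass can finish without a mistake and the sketch returns `None`.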

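  1. What would be the change in w_2 as determined by backpropagation using a step size (η) of 1.0? Assume that the input is x = (2, −2) and the initial weights are as specified above. Show the formula you are using as well as the numerical result.

(a) Δw_2 =

Solution:

$$
\begin{aligned}
\Delta w_2 &= -\eta \frac{\partial E}{\partial w_2} \\
&= -\eta \, \frac{\partial E}{\partial y} \, \frac{\partial y}{\partial z} \, \frac{\partial z}{\partial w_2} \\
&= -\eta \, (y - y_i) \, y (1 - y) \, x_2 \\
&= (-1)(0.5 + 0)(0.5)(0.5)(-2) \\
&= 0.25
\end{aligned}
$$

Derivations:

$$
E = \tfrac{1}{2}(y - y_i)^2, \qquad y = s(z), \qquad z = \sum_i w_i x_i
$$

$$
\frac{\partial E}{\partial y} = y - y_i, \qquad \frac{\partial y}{\partial z} = y(1 - y), \qquad \frac{\partial z}{\partial w_2} = x_2
$$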
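The numerical result can be reproduced with a few lines of Python. This is a sketch of the single-sigmoid-unit update Δw_2 = −η (y − y_i) y (1 − y) x_2 only; the unit output y = 0.5 is taken from the substitution in the solution, and the target value y_i = 0 is an assumption chosen so that (y − y_i) = 0.5, since the page that defines the network is not part of this text. The function name `delta_w` is illustrative.

```python
def delta_w(eta, y, target, x_k):
    """Gradient-descent weight change for one sigmoid unit with squared error.

    Implements -eta * dE/dy * dy/dz * dz/dw_k with
    dE/dy = (y - target), dy/dz = y * (1 - y), dz/dw_k = x_k.
    """
    return -eta * (y - target) * y * (1 - y) * x_k

# eta = 1.0, y = 0.5, x_2 = -2 as in the solution; target = 0 is an assumption.
change = delta_w(eta=1.0, y=0.5, target=0.0, x_k=-2.0)  # 0.25
```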

5 Naive Bayes (8 pts)

Consider a Naive Bayes problem with three features, x_1 ... x_3. Imagine that we have seen a total of 12 training examples, 6 positive (with y = 1) and 6 negative (with y = 0). Here is a table with some of the counts:

| | y = 0 | y = 1 |
|---|---|---|
| x_1 = 1 | 6 | 6 |
| x_2 = 1 | 0 | 0 |
| x_3 = 1 | 2 | 4 |

  1. Supply the following estimated probabilities. Use the Laplacian correction.
    • $\Pr(x_1 = 1 \mid y = 0) = \frac{6+1}{6+2} = \frac{7}{8}$
    • $\Pr(x_2 = 1 \mid y = 1) = \frac{0+1}{6+2} = \frac{1}{8}$
    • $\Pr(x_3 = 0 \mid y = 0) = 1 - \frac{2+1}{6+2} = \frac{5}{8}$

  2. Which feature plays the largest role in deciding the class of a new instance? Why?

x_3, because it has the biggest difference in the likelihood of being true for the two different classes. The other two features carry no information about the class.
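The Laplace-corrected estimates can be recomputed exactly with Python's `fractions` module. This is a minimal sketch assuming binary features, so the correction adds 1 to the count and 2 to the class total; the names `laplace` and `counts` are illustrative.

```python
from fractions import Fraction

def laplace(count, total, num_values=2):
    """Laplace-corrected estimate of Pr(feature = value | class)."""
    return Fraction(count + 1, total + num_values)

# counts[feature] = (count with y = 0, count with y = 1) of feature = 1,
# out of 6 examples in each class, as in the table.
counts = {"x1": (6, 6), "x2": (0, 0), "x3": (2, 4)}

p_x1_given_neg = laplace(6, 6)          # 7/8
p_x2_given_pos = laplace(0, 6)          # 1/8
p_x3_zero_given_neg = 1 - laplace(2, 6) # 5/8

# Gap between the two class-conditional estimates for each feature.
gaps = {f: abs(laplace(n0, 6) - laplace(n1, 6)) for f, (n0, n1) in counts.items()}
```

Only x3 has a nonzero gap (3/8 versus 5/8), which matches the answer that the other two features carry no information about the class.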

7 Error versus complexity (15 pts)

Most learning algorithms we have seen try to find a hypothesis that minimizes error. But how do they attempt to control complexity? Here are some possible approaches:

A: Use a fixed-complexity hypothesis class

B: Include a complexity penalty in the measure of error

C: Nothing

For each of the following algorithms, specify which approach it uses and say what hypothesis class it uses (including any restrictions) and what complexity criterion (if any) is included in the measure of error. If the algorithm attempts to optimize the error measure, say whether it is guaranteed to find an optimal solution or just an approximation.

  1. perceptron
     A. It uses a fixed hypothesis class of linear separators. It is guaranteed to find a separator if one exists.
  2. linear SVM
     B. It includes a complexity penalty in the error criterion (which is to maximize the margin while separating the data). It optimizes this criterion.
  3. decision tree with fixed depth
     A. It uses a fixed hypothesis class, which is the class of fixed-depth trees. Implicitly, it tries to find the lowest-error tree within this class, but isn't guaranteed to optimize that criterion.
  4. neural network (no weight decay or early stopping)
     A. It uses a fixed hypothesis class, which is determined by the wiring diagram of the network.
  5. SVM (with arbitrary data and c < ∞)
     B. It includes a complexity penalty in the error criterion (which is to maximize the margin subject to assigning an α < c to each data point).

8 Regression (12 pts)

Consider a one-dimensional regression problem (predict y as a function of x). For each of the algorithms below, draw the approximate shape of the output of the algorithm, given the data points shown in the graph.

[Figure: graph of the data points, y versus x.]

  1. 2-nearest-neighbor (equally weighted averaging)
  2. regression trees (with leaf size 1)

[Figure: graph of the data points, y versus x.]
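To see why the 2-nearest-neighbor output is piecewise constant, a minimal Python sketch helps. The data points of the quiz graph are not recoverable from this text, so `xs` and `ys` below are hypothetical values chosen only for illustration, and `knn_regress` is an illustrative name.

```python
def knn_regress(xs, ys, query, k=2):
    """Equally weighted average of the y values of the k nearest x values."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - query))[:k]
    return sum(ys[i] for i in nearest) / k

# Hypothetical training data (not the points from the quiz graph).
xs = [1, 2, 3, 4]
ys = [1.0, 3.0, 2.0, 5.0]

a = knn_regress(xs, ys, 1.4)  # neighbors x = 1, 2 -> 2.0
b = knn_regress(xs, ys, 1.9)  # same two neighbors -> 2.0
c = knn_regress(xs, ys, 2.6)  # neighbors x = 3, 2 -> 2.5
```

The prediction stays flat while the set of two nearest neighbors is unchanged and jumps when that set changes, which gives a step-shaped curve.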

9 SVM (12 pts)

[Figure: plot of the data points on axes x_1 and x_2, with the negative points marked - and the positive point marked +.]

Data points are: Negative: (-1, 0) (2, -2); Positive: (1, 0)

Recall that for SVMs, the negative class is represented by a desired output of -1 and the positive class by a desired output of +1.

  1. For each of the following separators (for the data shown above), indicate whether they satisfy all the conditions required for a support vector machine, assuming a linear kernel. Justify your answers very briefly.

(a) x_1 + x_2 = 0
No. It goes through the (2, -2) point, so it is obviously not maximal margin.

(b) x_1 + 1.5 x_2 = 0
Yes. All three points are support vectors, with margin = 1.

(c) x_1 + 2 x_2 = 0
No. Three points are needed to define a line; with two support vectors there is no unique maximal margin line.

(d) 2 x_1 + 3 x_2 = 0
No. The margin for the points is 2, not 1.
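The four answers can be checked by computing the functional margin y (w · x + b) of each training point under each candidate separator. The sketch below assumes the data are Negative: (-1, 0), (2, -2); Positive: (1, 0) (signs inferred from the answers), uses labels -1 and +1, and takes b = 0 because every candidate has the form w · x = 0. The names `functional_margins` and `separators` are illustrative.

```python
# Assumed data (signs inferred from the answers); labels are -1 / +1.
data = [((-1, 0), -1), ((2, -2), -1), ((1, 0), 1)]

# Weight vectors (w_1, w_2) of the four candidate separators, all with b = 0.
separators = {"a": (1, 1), "b": (1, 1.5), "c": (1, 2), "d": (2, 3)}

def functional_margins(w, data, b=0.0):
    """y * (w . x + b) for every training point, in the order given."""
    return [y * (w[0] * x[0] + w[1] * x[1] + b) for x, y in data]

margins = {name: functional_margins(w, data) for name, w in separators.items()}
# a: [1, 0, 1]  -> one point lies on the separator
# b: [1, 1, 1]  -> all three points sit exactly at margin 1
# c: [1, 2, 1]  -> only two points at margin 1
# d: [2, 2, 2]  -> all points at margin 2, i.e. (b) scaled by 2
```

Separator (d) describes the same line as (b), but its weights are not scaled so that the support vectors have margin exactly 1.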

  2. For each of the kernel choices below, find the decision boundary diagram (on the next page) that best matches. In these diagrams, the brightness of a point represents the magnitude of the SVM output; red means positive output and blue means negative. The black circles are the negative training points and the white circles are the positive training points.

(a) Polynomial kernel, degree 2: D
(b) Polynomial kernel, degree 3: B
(c) Radial basis kernel, sigma = 0.5: A
(d) Radial basis kernel, sigma = 1.0: C