KNN, ID Trees, and Neural Nets-Artificial Intelligence-Tutorial Handout, Exercises of Artificial Intelligence

Madam Amrita Ahuja distributed this handout in the Artificial Intelligence course at Central University of Jammu and Kashmir. It explains important concepts including KNN, ID trees, neural nets, decision boundaries, distance functions, Euclidean distance, and dimensions.

Typology: Exercises

2011/2012

Uploaded on 07/31/2012

shaina_44kin
KNN, ID Trees, and Neural Nets
Intro to Learning Algorithms
KNN, decision trees, and neural nets are all supervised learning algorithms.
Their general goal is to make accurate predictions about unknown data after being trained on known data.
Data comes in the form of examples with the general form:
$(x_1, \ldots, x_n, y)$
$x_1, \ldots, x_n$ are also known as features, inputs, or dimensions; $y$ is the output or class label.
Both the $x_i$ and $y$ can be discrete (taking on specific values, e.g. {0, 1}) or continuous (taking on a range of values, e.g. [0, 1]).
In training we are given $(x_1, \ldots, x_n, y)$ tuples. In testing (classification) we are given only $(x_1, \ldots, x_n)$, and the goal is to predict $y$ with high accuracy.
Training error is the classification error measured on the same data that was used for training.
Testing error is the classification error on data not seen in the training phase.
K Nearest Neighbors
1-NN
Given an unknown point, pick the single closest neighbor by some distance measure.
The class of the unknown point is that nearest neighbor's label.
k-NN
Given an unknown point, pick the k closest neighbors by some distance function.
The class of the unknown point is the mode (most common value) of the k nearest neighbors' labels.
k is usually an odd number to facilitate tie breaking (with two classes, an odd k cannot produce a tied vote).
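The k-NN rule above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `knn_classify`, the toy data, and the choice of Euclidean distance are assumptions made here, not part of the handout.

```python
import math
from collections import Counter

def knn_classify(train, query, k=1):
    """Return the most common label among the k training points nearest to query.

    train is a list of (features, label) pairs; features are equal-length tuples.
    """
    # Sort training examples by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda ex: math.dist(ex[0], query))
    # Take the labels of the k closest examples and return their mode.
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

# Hypothetical toy data: two clusters labeled "-" and "+".
train = [((0, 0), "-"), ((1, 0), "-"), ((0, 1), "-"),
         ((3, 3), "+"), ((4, 3), "+"), ((3, 4), "+")]
print(knn_classify(train, (0.5, 0.5), k=1))
print(knn_classify(train, (2.5, 2.5), k=3))
```

With k=1 this is exactly 1-NN; larger k smooths the decision at the cost of blurring sharp boundaries.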
How to draw 1-NN decision boundaries
Decision boundaries are lines on which it is equally likely to be in any of the classes.
1. Examine the region where you think decision boundaries should occur.
2. Find pairs of oppositely labeled points (+/-).
3. Draw the perpendicular bisectors between them. (Use pencil.)
4. Extend and join all bisectors. Erase the parts of lines that extend too far.
5. Remember to draw boundaries to the edge of the graph and indicate this with arrows! (Forgetting this is a very common mistake.)
6. Your 1-NN boundaries generally should have sharp edges and corners (otherwise, you are doing something wrong or drawing boundaries for k-NN with a higher k).
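The bisector in step 3 can be checked numerically. The sketch below assumes 2-D points and Euclidean distance; `perpendicular_bisector` is a hypothetical helper name. It returns the line a*x + b*y = c on which every point is equally far from the two given points.

```python
import math

def perpendicular_bisector(p, q):
    """Return (a, b, c) such that a*x + b*y = c is the perpendicular bisector of p and q."""
    (px, py), (qx, qy) = p, q
    a, b = qx - px, qy - py                      # normal vector points from p to q
    c = (qx**2 + qy**2 - px**2 - py**2) / 2      # makes the line pass through the midpoint
    return a, b, c

# Hypothetical pair of oppositely labeled points.
minus, plus = (0, 0), (2, 2)
a, b, c = perpendicular_bisector(minus, plus)
print(a, b, c)

# Any point on the line, e.g. (0, 2), is equidistant from both points.
on_line = (0, 2)
print(math.dist(on_line, minus), math.dist(on_line, plus))
```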
Distance Functions
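Distance functions determine which points are "nearest". Here are some standard distance functions:
Euclidean Distance
$D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
Manhattan Distance (Block distance)
- Sum of the absolute distances in each dimension: $D(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
Hamming Distance
- Sum of differences in each dimension: $D(x, y) = \sum_{i=1}^{n} I(x_i, y_i)$
where $I(x_i, y_i) = 0$ if identical, 1 if different.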
Cosine Similarity

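  • Used in text classification: words are dimensions, documents are vectors of words, and vector component i is 1 if word i exists in the document (0 otherwise).
  • $\cos(x, y) = \dfrac{x \cdot y}{\|x\| \, \|y\|}$; a larger value means the two vectors are more similar.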
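As a rough illustration, the four measures named above can be written as plain Python functions. The function names are chosen here for illustration; vectors are assumed to be equal-length sequences of numbers (or of symbols, for Hamming distance).

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared differences in each dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # Sum of absolute differences in each dimension.
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    # Count of dimensions in which the two vectors differ.
    return sum(1 for a, b in zip(x, y) if a != b)

def cosine_similarity(x, y):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

print(euclidean((0, 0), (3, 4)))
print(manhattan((0, 0), (3, 4)))
print(hamming((1, 0, 1), (1, 1, 0)))
print(cosine_similarity((1, 0), (0, 1)))
```

Note that cosine similarity is a similarity, not a distance: for k-NN the "nearest" neighbors are the ones with the largest value.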

(Optional) How to Weigh Dimensions Differently

In Euclidean distance all dimensions are treated the same. But in practice not all dimensions are equally important or useful!

For example, suppose we represent documents as vectors of words, and consider the task of classifying documents related to "Red Sox". If all words are equal, then the word "the" weighs the same as the word "Sox". But almost every English document has the word "the", while only sports-related documents have the word "Sox". So we want k-NN distance metrics to weight meaningful words like "Sox" more than function words like "the".

For text classification, a weighting scheme used to make some dimensions (words) more important than others is known as TF-IDF (term frequency, inverse document frequency).

Here:
tf: Words that occur frequently in a document should be weighted more.
idf: Words that occur in all the documents (function words like "the", "of", etc.) should be weighted less.
Using this weighting scheme with a distance metric, k-NN would produce better (more relevant) classifications.
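TF-IDF has several variants, and the handout's exact formula is not shown here. The sketch below assumes one common form, weight = tf * log(N / df), where tf is the raw count of the word in the document, N is the number of documents, and df is the number of documents containing the word. The function name `tf_idf` and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document, using weight = tf * log(N / df)."""
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # df[word] = number of documents that contain the word at least once.
    df = Counter(word for words in tokenized for word in set(words))
    weights = []
    for words in tokenized:
        tf = Counter(words)  # raw term counts within this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

docs = ["the red sox won the game",
        "the senate passed the bill",
        "the sox lost the series"]
w = tf_idf(docs)
print(w[0]["the"], w[0]["sox"], w[0]["red"])
```

Because "the" appears in every document, its idf term is log(1) = 0 and it drops out of the distance entirely, while "sox" and "red" keep positive weight.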

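Another way to vary the importance of different dimensions is to use the Mahalanobis Distance:

$D(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$

Here S is the covariance matrix of the data. Because the differences are scaled by the inverse of S, dimensions that show more variance are weighted less, and dimensions that show less variance are weighted more.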
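To see the effect of scaling by variance, here is a simplified sketch that assumes S is diagonal (the dimensions are treated as uncorrelated), so inverting S just means dividing each squared difference by that dimension's variance. The full Mahalanobis distance needs a matrix inverse; the helper names and the toy data below are assumptions for illustration.

```python
import math

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def diagonal_mahalanobis(x, y, data):
    """Mahalanobis distance under the assumption that S is diagonal."""
    # Per-dimension variances estimated from the data set (the diagonal of S).
    # Assumes every dimension has non-zero variance.
    variances = [variance(column) for column in zip(*data)]
    return math.sqrt(sum((a - b) ** 2 / var
                         for a, b, var in zip(x, y, variances)))

# Hypothetical data: dimension 0 varies little, dimension 1 varies a lot.
data = [(0, 0), (2, 100), (4, 200)]
step_in_dim0 = diagonal_mahalanobis((0, 0), (1, 0), data)
step_in_dim1 = diagonal_mahalanobis((0, 0), (0, 1), data)
print(step_in_dim0, step_in_dim1)
```

A unit step along the low-variance dimension counts for far more distance than a unit step along the high-variance one.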

Table of H(fraction) values

/3 to /9                                          /10 to /13
numerator  denominator  fraction  H(fraction)     numerator  denominator  fraction  H(fraction)
1          3            0.33      0.92            1          10           0.10      0.47
2          3            0.67      0.92            2          10           0.20      0.72
1          4            0.25      0.81            3          10           0.30      0.88
2          4            0.50      1.00            4          10           0.40      0.97
1          5            0.20      0.72            1          11           0.09      0.44
2          5            0.40      0.97            2          11           0.18      0.68
3          5            0.60      0.97            3          11           0.27      0.85
1          6            0.17      0.65            4          11           0.36      0.95
2          6            0.33      0.92            5          11           0.45      0.99
3          6            0.50      1.00            1          12           0.08      0.41
1          7            0.14      0.59            2          12           0.17      0.65
2          7            0.29      0.86            3          12           0.25      0.81
3          7            0.43      0.99            5          12           0.42      0.98
1          8            0.13      0.54            1          13           0.08      0.39
2          8            0.25      0.81            2          13           0.15      0.62
3          8            0.38      0.95            3          13           0.23      0.78
4          8            0.50      1.00            4          13           0.31      0.89
1          9            0.11      0.50            5          13           0.38      0.96
2          9            0.22      0.76            6          13           0.46      1.00
3          9            0.33      0.92
4          9            0.44      0.99
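The legible entries in the H column match the binary entropy (disorder) function H(f) = -f log2(f) - (1 - f) log2(1 - f). That identification is inferred from the values, not stated in this extract. Under that assumption, any entry can be recomputed with a short sketch (`binary_entropy` is a name chosen here):

```python
import math

def binary_entropy(numerator, denominator):
    """Disorder of a two-class set in which numerator/denominator belong to one class."""
    p = numerator / denominator
    if p in (0.0, 1.0):
        return 0.0  # a perfectly homogeneous set has no disorder
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for num, den in [(2, 3), (1, 4), (2, 4), (1, 9)]:
    print(num, den, round(num / den, 2), round(binary_entropy(num, den), 2))
```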

Neural Networks:

First, Read Professor Winston's Notes on NNs!

General Back Propagation Algorithm:

function train (examples)

  1. Initialize weights
  2. While true:
    1. foreach (inputs, outputs) = example in examples
      1. Run backward-propagation(inputs, outputs)
    2. If termination conditions met then quit

Possible Termination conditions

  1. When the error rate is 0
  2. When performance function P over the output and desired output is within an epsilon of 0.
  3. When the weight change from the last iteration is within an epsilon of 0.

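function backward-propagation (inputs, outputs)

  1. Fix inputs (x_1 .. x_n) to input values; fix desired outputs d to outputs.
  2. Forward propagation step (working forwards from inputs to outputs):
    for l from layer 1 to L:
      for each neuron n in layer l:
        ■ compute $z_n = \sum_i w_{i \to n} \cdot i$, where i = inputs to the weights [ x_i when at the first layer, o_i (output of the previous layer) at other layers ]
        ■ compute $o_n = \mathrm{sigmoid}(z_n) = 1 / (1 + e^{-z_n})$
  3. Backward propagation step (working backwards from the output layer to the input layer):
    for l from layer L to 1:
      for each neuron n in layer l:
        ■ compute $\delta_n$. For the standard sigmoid unit and performance function $P = -\frac{1}{2}(d - o)^2$:
          If n is in the Lth (last) layer: $\delta_n = o_n (1 - o_n)(d_n - o_n)$
          Else: $\delta_n = o_n (1 - o_n) \sum_j w_{n \to j} \, \delta_j$
          Note: $w_{n \to j}$ are links that come out of n into j nodes.
        ■ compute $\Delta w_{i \to n} = \alpha \cdot i \cdot \delta_n$
        ■ compute new weights $w_{i \to n} \leftarrow w_{i \to n} + \Delta w_{i \to n}$

function test (inputs) => outputs

  Run the forward propagation step; return o_n in the last layer as outputs.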
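The pseudocode above can be sketched in Python as follows. Several things here are assumptions made for the sketch: each neuron gets an extra threshold input fixed at -1 (as in the worked example), the weights are stored as nested lists, the standard formulas are used (delta = o(1 - o)(d - o) in the last layer, delta = o(1 - o) times the weighted sum of downstream deltas elsewhere), and all deltas are computed before any weight is changed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, inputs):
    """Return the outputs of every layer; weights[l][n] is the weight list of neuron n."""
    outputs = [list(inputs)]
    for layer in weights:
        prev = outputs[-1] + [-1.0]            # append the threshold input T = -1
        outputs.append([sigmoid(sum(w * i for w, i in zip(neuron, prev)))
                        for neuron in layer])
    return outputs

def backward_propagation(weights, inputs, desired, alpha=1.0):
    """One forward pass plus one weight update; modifies weights in place."""
    outputs = forward(weights, inputs)
    last = len(weights) - 1
    deltas = [None] * len(weights)
    # Work backwards from the last layer, computing delta for every neuron.
    for l in range(last, -1, -1):
        o = outputs[l + 1]
        if l == last:
            deltas[l] = [o_n * (1 - o_n) * (d_n - o_n)
                         for o_n, d_n in zip(o, desired)]
        else:
            # nxt[n] is the weight on the link from neuron n into downstream neuron j.
            deltas[l] = [o_n * (1 - o_n) *
                         sum(nxt[n] * d_j
                             for nxt, d_j in zip(weights[l + 1], deltas[l + 1]))
                         for n, o_n in enumerate(o)]
    # Update every weight: w <- w + alpha * input * delta.
    for l, layer in enumerate(weights):
        prev = outputs[l] + [-1.0]
        for n, neuron in enumerate(layer):
            for i in range(len(neuron)):
                neuron[i] += alpha * prev[i] * deltas[l][n]
    return outputs[-1]

# A single neuron with weights (W_A, W_B, W_T) = (0, 0, 1), trained once on (0, 0) -> 1.
w1 = [[[0.0, 0.0, 1.0]]]
out = backward_propagation(w1, [0, 0], [1])
print(round(out[0], 2), [round(w, 2) for w in w1[0][0]])
```

A full `train` function would call `backward_propagation` for each example in a loop until one of the termination conditions holds.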

Simulating The Steps of Back Propagation

This is a detailed step-by-step answer to the first 2 steps of the Fall 2009 Quiz 3 Neural Nets part B.

The given network has the following architecture:

[Figure not reproduced. Judging from the table below, the network is a single sigmoid unit with inputs A and B, a threshold input T = -1, and weights W_A, W_B, W_T.]

Your task is to fill in the following table (non-shaded cells). Detailed calculations a) - l) for the fill-in boxes are included below. In forward steps, your goal is to compute z, o, and d-o. In backward steps, your goal is to compute δ and the weight updates ΔW, and to find the new Ws.

step      A  B  T   W_A   W_B      W_T      z         o        d  d-o
forward   0  0  -1  0     0        1        a) -1     b) 0.27  1  (1-0.27) = 0.73
backward
forward   0  1  -1  c) 0  d) 0     e) 0.86  f) -0.86  g) 0.30  1  (1-0.30) = 0.70
backward
forward   1  0  -1  h) 0  i) 0.15  j) 0.71  k) -0.71  l) 0.33  0  (0-0.33) = -0.33

a) z = A*W_A + B*W_B + T*W_T = 0*0 + 0*0 + (-1)*1 = -1

b) o = sigmoid(z) = sigmoid(-1) = 1/(1+e^-(-1)) = 0.27

c) δ = (d - o)(o(1-o))   # because it's the last (and only) layer
     = (1 - 0.27) * (0.27*(1-0.27)) = 0.14
   ΔW_A = alpha * δ * A = 1 * 0.14 * 0 = 0
   W_A = 0 + ΔW_A = 0 + 0 = 0

d) ΔW_B = alpha * δ * B = 1 * 0.14 * 0 = 0
   W_B = 0 + ΔW_B = 0 + 0 = 0

e) ΔW_T = alpha * δ * T = 1 * 0.14 * -1 = -0.14
   W_T = 1 + ΔW_T = 1 - 0.14 = 0.86

f) z = A*W_A + B*W_B + T*W_T = 0*0 + 1*0 + (-1)*0.86 = -0.86

g) o = sigmoid(-0.86) = 0.30

h) δ = (d - o)(o(1-o)) = (1 - 0.3) * (0.3*(1-0.3)) = 0.15
   ΔW_A = alpha * δ * A = 1 * 0.15 * 0 = 0
   W_A = 0 + ΔW_A = 0

i) ΔW_B = alpha * δ * B = 1 * 0.15 * 1 = 0.15
   W_B = 0 + ΔW_B = 0.15

j) ΔW_T = alpha * δ * T = 1 * 0.15 * -1 = -0.15
   W_T = 0.86 + ΔW_T = 0.86 + -0.15 = 0.71

k) z = A*W_A + B*W_B + T*W_T = 1*0 + 0*0.15 + (-1)*0.71 = -0.71

l) o = sigmoid(-0.71) = 0.33
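The hand calculation can be replayed in a few lines of Python as a sanity check. This sketch assumes what the calculations use: a single sigmoid unit, inputs ordered (A, B, T) with T = -1, initial weights (0, 0, 1), and alpha = 1. The function name `simulate` is made up. The code keeps full precision between steps, while the handout rounds to two decimals at each step, so compare the values after rounding.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate(examples, weights, alpha=1.0):
    """Return one (weights, z, o, d - o) row per forward step."""
    rows = []
    w = list(weights)
    for inputs, d in examples:
        # Forward step: weighted sum, then sigmoid.
        z = sum(wi * xi for wi, xi in zip(w, inputs))
        o = sigmoid(z)
        rows.append((tuple(w), z, o, d - o))
        # Backward step: delta for the last (and only) layer, then update weights.
        delta = (d - o) * o * (1 - o)
        w = [wi + alpha * delta * xi for wi, xi in zip(w, inputs)]
    return rows

# Inputs are (A, B, T); the second element is the desired output d.
examples = [((0, 0, -1), 1), ((0, 1, -1), 1), ((1, 0, -1), 0)]
rows = simulate(examples, (0, 0, 1))
for w, z, o, err in rows:
    print([round(v, 2) for v in w], round(z, 2), round(o, 2), round(err, 2))
```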

Deriving the Back-Propagation Weight Update Equations

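The general equation for weight updates in back propagation (for a weight with input i going into a node n):

$$w_{i \to n}' = w_{i \to n} + \alpha \, \frac{\partial P}{\partial w_{i \to n}}$$

$\alpha$ is the learning rate (also denoted by r). The lower the learning rate, the longer it takes to converge. But if the learning rate is too high, we may never find the maximum (we keep oscillating!).

To find the partial, we employ the chain rule:

$$\frac{\partial P}{\partial w_{i \to n}} = \frac{\partial P}{\partial o_n} \cdot \frac{\partial o_n}{\partial z_n} \cdot \frac{\partial z_n}{\partial w_{i \to n}}$$

Here:

$$z_n = \sum_i w_{i \to n} \cdot i \quad \Rightarrow \quad \frac{\partial z_n}{\partial w_{i \to n}} = i$$

where i is an input into neuron n.

Hence the third term is always the input.

$\delta_n = \dfrac{\partial P}{\partial o_n} \cdot \dfrac{\partial o_n}{\partial z_n}$ is used as the shorthand for the first 2 partials.

For the very last layer:

For the standard performance function (for gradient ascent, i.e. maximization):

$$P = -\frac{1}{2}(d_n - o_n)^2 \quad \Rightarrow \quad \frac{\partial P}{\partial o_n} = d_n - o_n$$

Partial on the sigmoid function:

$$o_n = \frac{1}{1 + e^{-z_n}} \quad \Rightarrow \quad \frac{\partial o_n}{\partial z_n} = o_n (1 - o_n)$$

Hence for the last layer:

$$\delta_n = (d_n - o_n) \, o_n (1 - o_n)$$

When the layer is not the last layer, suppose m is the layer above n. The output o_n affects P only through the nodes m it feeds into, so

$$\frac{\partial P}{\partial o_n} = \sum_m \delta_m \, w_{n \to m} \quad \Rightarrow \quad \delta_n = o_n (1 - o_n) \sum_m w_{n \to m} \, \delta_m$$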
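One way to gain confidence in the last-layer result is a numerical gradient check. The sketch assumes a single sigmoid unit and the performance function P = -1/2 (d - o)^2, and compares the analytic gradient (d - o) * o * (1 - o) * input against a central finite-difference estimate of dP/dw. The weights and inputs used are arbitrary illustration values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def performance(w, x, d):
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -0.5 * (d - o) ** 2

def analytic_gradient(w, x, d):
    # dP/dw_i = delta * input_i, with delta = (d - o) * o * (1 - o).
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    delta = (d - o) * o * (1 - o)
    return [delta * xi for xi in x]

def numeric_gradient(w, x, d, eps=1e-6):
    # Central difference: (P(w + eps) - P(w - eps)) / (2 * eps) for each weight.
    grads = []
    for k in range(len(w)):
        up = list(w)
        up[k] += eps
        down = list(w)
        down[k] -= eps
        grads.append((performance(up, x, d) - performance(down, x, d)) / (2 * eps))
    return grads

w, x, d = [0.3, -0.2, 0.5], [1.0, 2.0, -1.0], 1.0
print(analytic_gradient(w, x, d))
print(numeric_gradient(w, x, d))
```

The two printed lists should agree to several decimal places.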

How to solve Network Weights - An XOR example:

Solve for the 9 unknown weights for the network that will perfectly classify the concept in the picture.

Step 1. First, think of the input-level units (units A and B) as defining regions (that divide +s from -s) in the X, Y graph. These regions should be depicted as linear boundary lines with arrows pointing towards the +ve data points. Next, think of the next-level neural unit (unit C) as some logical operator (a linearly separable operator) that combines the regions defined by the input-level units.

So in this case: units A and B represent the diagonal boundaries (with arrows) on the graph, defining two distinct ways of separating the space. Unit C represents a logical AND that intersects the two regions to create the bounded region in the middle.

Step 2. Write the line equations for the regions you defined in the graph.

A) The boundary equation for the region defined by line A:

   Y < -1 X + 3/2

B) The boundary equation for the region defined by line B:

   Y > -1 X + 1/2

Step 3. Rewrite the line equations into the form aX + bY > c, where a, b, and c are integers.

A) Y < -1 X + 3/2
   X + Y < 3/2
   -2X + -2Y > -3

B) Y > -1 X + 1/2
   X + Y > 1/2
   2X + 2Y > 1

Step 4. Note that the sum-of-weights-and-inputs (z) for each neural unit can also be written in this form.

For Unit A:
   z = W_XA X + W_YA Y + W_A(-1) > 0
   W_XA X + W_YA Y > W_A

For Unit B:
   z = W_XB X + W_YB Y + W_B(-1) > 0
   W_XB X + W_YB Y > W_B

Why W_XA X + W_YA Y + W_A(-1) > 0 vs. < 0?
When z = W_XA X + W_YA Y + W_A(-1) > 0, sigmoid(z) approaches 1: +ve points.
When z = W_XA X + W_YA Y + W_A(-1) < 0, sigmoid(z) approaches 0: -ve points.

So when the inequality is expressed as > 0, the region it defines points towards the +ve points. When expressed as < 0, the region it defines points towards the -ve points.

Step 5. Easy! Just read off the weights by correspondence. (Note: In the 2006 Quiz, the L and Pineapple problem wants you to match the correct equation by constraining the value of some weights.)

   -2 X + -2 Y > -3          line A's inequality
   W_XA X + W_YA Y > W_A     z equation for unit A

   W_XA = -2   W_YA = -2   W_A = -3

   2 X + 2 Y > 1             line B's inequality
   W_XB X + W_YB Y > W_B     z equation for unit B

   W_XB = 2   W_YB = 2   W_B = 1

Step 6. Solve the logic in the second layer. We want to compute (A AND B), so build a truth table and solve for the constraints!

A  B  desired output  Equations                Simplified
0  0  0               -W_C < 0                 W_C > 0
0  1  0               W_BC - W_C < 0           W_BC < W_C
1  0  0               W_AC - W_C < 0           W_AC < W_C
1  1  1               W_AC + W_BC - W_C > 0    W_AC + W_BC > W_C

We notice a symmetry in W_BC and W_AC, so we make a guess that they have the same value:

W_BC = 2 and W_AC = 2

Then the inequalities in the table above condense down to:

W_C > 0
W_C > 2 (twice)
W_C < 2 + 2 = 4

So 2 < W_C < 4, and W_C = 3 will work. An acceptable solution:

W_BC = 2   W_AC = 2   W_C = 3

The following solution also works, because it also obeys the stated constraints:

W_BC = 109   W_AC = 109   W_C = 110

But quizzes will ask for the smallest integer solutions.
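The solved weights can be checked with a small sketch. Two assumptions are made here: each sigmoid unit is replaced by a hard threshold (output 1 when z > 0, else 0), and W_A is taken as -3, which is the value the line A inequality Y < -X + 3/2 requires. The four test points are the usual XOR corners, chosen for illustration since the original picture is not reproduced.

```python
def fires(z):
    # Hard-threshold stand-in for a sigmoid unit: 1 when z > 0, else 0.
    return 1 if z > 0 else 0

def network(x, y):
    # Unit A: W_XA = -2, W_YA = -2, W_A = -3  (region X + Y < 3/2)
    a = fires(-2 * x + -2 * y + (-3) * (-1))
    # Unit B: W_XB = 2, W_YB = 2, W_B = 1     (region X + Y > 1/2)
    b = fires(2 * x + 2 * y + 1 * (-1))
    # Unit C: W_AC = 2, W_BC = 2, W_C = 3     (A AND B)
    return fires(2 * a + 2 * b + 3 * (-1))

for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, network(*point))
```

Only the points between the two diagonal lines, (0, 1) and (1, 0), come out as 1, which is the XOR pattern.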

Try different models. Pick the model that gives you the lowest CV error.

Examples:

KNN - vary k from 1 to N-1. Run cross-validation under different ks. Choose the k with the lowest CV error.

Decision Tree - Try trees of varying depth, and varying number of tests. Run cross-validation to pick the tree of lowest CV Error.

Neural Net - Try different Neural Net architectures. Run CV to pick the architecture of lowest CV Error.

Models chosen using cross-validation can generalize better, and are less likely to overfit or underfit.
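As an illustration of the KNN case, the sketch below picks k by leave-one-out cross-validation, one particular form of CV chosen here for brevity (the handout does not say which form to use). The helper names and the toy data are assumptions.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    # Label of the majority among the k training points nearest to query.
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_cv_error(data, k):
    """Leave-one-out cross-validation error rate of k-NN on data."""
    errors = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]   # train on everything except example i
        if knn_predict(rest, features, k) != label:
            errors += 1
    return errors / len(data)

def choose_k(data, candidate_ks):
    # Pick the k with the lowest CV error (the first such k if several tie).
    return min(candidate_ks, key=lambda k: loo_cv_error(data, k))

# Hypothetical toy data: two well-separated clusters.
data = [((0, 0), "-"), ((1, 0), "-"), ((0, 1), "-"),
        ((3, 3), "+"), ((4, 3), "+"), ((3, 4), "+")]
for k in range(1, len(data)):
    print(k, loo_cv_error(data, k))
print("chosen k:", choose_k(data, range(1, len(data))))
```

The same loop works for any model family: replace `knn_predict` with the model being tuned and k with whatever setting is being varied (tree depth, network architecture, and so on).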