







Madam Amrita Ahuja distributed this handout in the Artificial Intelligence course at Central University of Jammu and Kashmir. The handout explains important concepts including k-NN, ID trees, neural nets, decision boundaries, distance functions, Euclidean distance, and dimensions.








k-NN, decision trees, and neural nets are all supervised learning algorithms. Their general goal is to make accurate predictions about unknown data after being trained on known data.
Data comes in the form of examples with the general form:
(x_1, ..., x_n, y)
x_1, ..., x_n are also known as features, inputs, or dimensions; y is the output or class label.
Both the x_i and y can be discrete (taking on specific values, e.g. {0, 1}) or continuous (taking on a range of values, e.g. [0, 1]).
In training we are given (x_1, ..., x_n, y) tuples. In testing (classification) we are given only (x_1, ..., x_n), and the goal is to predict y with high accuracy.
Training error is the classification error measured on the same data that was used for training. Testing error is the classification error on data not seen in the training phase.
1-NN
● Given an unknown point, pick the closest 1 neighbor by some distance measure.
● The class of the unknown point is its 1-nearest neighbor's label.
k-NN
● Given an unknown point, pick the k closest neighbors by some distance function.
● The class of the unknown point is the mode of the k nearest neighbors' labels.
● k is usually an odd number to facilitate tie breaking.
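The 1-NN and k-NN rules can be sketched in a few lines of Python. This is a minimal illustration, not an optimized implementation: the function names and the toy data are assumptions made for the example, Euclidean distance is used as the default distance measure, and ties in the vote are resolved by whichever label `Counter` encountered first.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(train, query, k=1, dist=euclidean):
    """train is a list of (features, label) pairs.
    Returns the mode of the labels of the k nearest training examples."""
    neighbors = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: a "-" cluster near the origin, a "+" cluster near (3, 3).
train = [((0.0, 0.0), "-"), ((0.0, 1.0), "-"), ((1.0, 0.0), "-"),
         ((3.0, 3.0), "+"), ((3.0, 4.0), "+"), ((4.0, 3.0), "+")]

label_1nn = knn_classify(train, (2.5, 2.5), k=1)   # nearest neighbor is (3, 3)
label_3nn = knn_classify(train, (0.5, 0.5), k=3)   # three nearest are all "-"
print(label_1nn, label_3nn)
```

With k = 1 the function reduces to the 1-NN rule; any other distance function can be passed in through `dist`.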
Decision boundaries are lines on which a point is equally likely to be in any of the classes.
Distance Functions
How do we determine which points are "nearest"? Here are some standard distance functions:
Euclidean Distance
Manhattan Distance (Block distance)
Hamming Distance
Cosine Similarity
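The four measures listed above can be written out as follows. This is a sketch using the standard textbook definitions; the function names are chosen for this example. Note that cosine similarity is a similarity rather than a distance: larger values mean the vectors are closer in direction, so a k-NN implementation would pick the neighbors with the largest values (or use 1 minus the similarity as a distance).

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences (block distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def hamming(p, q):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for a, b in zip(p, q) if a != b)

def cosine_similarity(p, q):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

print(euclidean((0, 0), (3, 4)))          # 3-4-5 triangle
print(manhattan((0, 0), (3, 4)))
print(hamming("karolin", "kathrin"))
print(cosine_similarity((1, 0), (0, 1)))  # perpendicular vectors
```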
In Euclidean distance all dimensions are treated the same. But in practice not all dimensions are equally important or useful!
For example, suppose we represent documents as vectors of words, and consider the task of classifying documents related to the "Red Sox". If all words are equal, then the word "the" weighs the same as the word "Sox". But almost every English document contains the word "the", while only sports-related documents contain the word "Sox". So we want k-NN distance metrics to weight meaningful words like "Sox" more than function words like "the".
For text classification, a weighting scheme used to make some dimensions (words) more important than others is known as TF-IDF.
Here:
tf (term frequency): words that occur frequently within a document should be weighted more.
idf (inverse document frequency): words that occur in all the documents (function words like "the", "of", etc.) should be weighted less.
Using this weighting scheme with a distance metric, k-NN would produce better (more relevant) classifications.
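A small sketch of the idea, under stated assumptions: TF-IDF has many variants, and the one used here (raw count for tf, log(N / document frequency) for idf) is chosen only because it is simple; it is not necessarily the variant the handout had in mind. The documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs is a list of token lists. Returns one {word: weight} dict per document.
    Assumed variant: weight = (count of word in doc) * log(N / number of docs containing word)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))            # count each word once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({w: c * math.log(n_docs / doc_freq[w])
                        for w, c in counts.items()})
    return weights

docs = [["the", "sox", "won", "the", "game"],
        ["the", "election", "results"],
        ["the", "sox", "lost"]]
weights = tf_idf(docs)
print(weights[0]["the"], weights[0]["sox"])
```

Because "the" appears in every document, its idf is log(1) = 0 and it contributes nothing to any distance computed on the weighted vectors, while "sox" keeps a positive weight.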
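Another way to vary the importance of different dimensions is to use the Mahalanobis distance:
d(x, y) = sqrt( (x - y)^T S^-1 (x - y) )
Here S is a covariance matrix. Because the distance uses the inverse of S, differences along dimensions that show more variance are weighted less, and differences along low-variance dimensions are weighted more.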
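To make the effect of the covariance matrix concrete, here is a sketch of the standard Mahalanobis distance, sqrt((x - y)^T S^-1 (x - y)), restricted to two dimensions so that the matrix inverse can be written by hand. The function name and the example matrix are assumptions for illustration.

```python
import math

def mahalanobis_2d(p, q, S):
    """Mahalanobis distance between 2-D points p and q for a 2x2 covariance matrix S."""
    (a, b), (c, d) = S
    det = a * d - b * c
    inv = [[d / det, -b / det],
           [-c / det, a / det]]               # inverse of a 2x2 matrix
    dx, dy = p[0] - q[0], p[1] - q[1]
    quad = (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(quad)

# Hypothetical covariance: variance 100 along x, variance 1 along y, no correlation.
S = [[100.0, 0.0],
     [0.0, 1.0]]
along_x = mahalanobis_2d((0, 0), (10, 0), S)
along_y = mahalanobis_2d((0, 0), (0, 10), S)
print(along_x, along_y)
```

In this sketch the same raw difference of 10 gives a distance of 1 along the high-variance x dimension and 10 along the low-variance y dimension; with the identity matrix for S the function reduces to Euclidean distance.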
First, Read Professor Winston's Notes on NNs!
General Back Propagation Algorithm:
function train (examples)
Possible Termination conditions
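function backward-propagation (inputs, outputs)
■ Forward step: for each neuron n, layer by layer:
   compute z_n = sum over the inputs i of n of (w_in * i)
   compute o_n = sigmoid(z_n)
■ Backward step: for each neuron n, from the last layer back to the first, compute δ_n.
   For the standard sigmoid unit and performance function:
   If n is in the Lth (last) layer: δ_n = (d_n - o_n) * o_n * (1 - o_n)
   Else: δ_n = o_n * (1 - o_n) * sum over j of (w_nj * δ_j)
   Note: w_nj are the links that come out of n into the nodes j.
■ For each weight w_in:
   compute Δw_in = alpha * δ_n * i
   compute the new weight w_in = w_in + Δw_in
function test (inputs) => outputs
   Run the forward propagation step and return o_n in the last layer as outputs.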
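The outline above can be turned into a short runnable sketch. Several things here are assumptions rather than part of the handout: the representation of a network as a list of layers (each neuron a list of weights, the last weight belonging to a constant threshold input of -1), the names `forward` and `predict` (the latter plays the role of the handout's test function), and the termination conditions (a maximum number of epochs or a squared-error tolerance). The δ rules are the standard ones for sigmoid units: (d - o) o (1 - o) in the last layer, and o (1 - o) times the weighted sum of downstream δs elsewhere.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(layers, inputs):
    """Return the activations of every layer, starting with the inputs themselves."""
    activations = [list(inputs)]
    for layer in layers:
        prev = activations[-1] + [-1.0]       # append the constant threshold input
        activations.append([sigmoid(sum(w * x for w, x in zip(neuron, prev)))
                            for neuron in layer])
    return activations

def backward_propagation(layers, inputs, desired, alpha=1.0):
    """One forward pass, one backward pass, then an in-place weight update."""
    activations = forward(layers, inputs)
    deltas = [None] * len(layers)
    for l in reversed(range(len(layers))):
        outs = activations[l + 1]
        if l == len(layers) - 1:              # last layer
            deltas[l] = [(d - o) * o * (1 - o) for d, o in zip(desired, outs)]
        else:                                 # sum over the links leaving neuron n
            nxt = layers[l + 1]
            deltas[l] = [o * (1 - o) * sum(nxt[j][n] * deltas[l + 1][j]
                                           for j in range(len(nxt)))
                         for n, o in enumerate(outs)]
    for l, layer in enumerate(layers):        # all deltas are computed before any update
        prev = activations[l] + [-1.0]
        for n, neuron in enumerate(layer):
            for i in range(len(neuron)):
                neuron[i] += alpha * deltas[l][n] * prev[i]

def predict(layers, inputs):
    """Forward propagation only; returns the outputs of the last layer."""
    return forward(layers, inputs)[-1]

def train(layers, examples, alpha=1.0, max_epochs=5000, tolerance=0.01):
    """Assumed termination: stop after max_epochs or when squared error < tolerance."""
    error = float("inf")
    for _ in range(max_epochs):
        for inputs, desired in examples:
            backward_propagation(layers, inputs, desired, alpha)
        error = sum((d - o) ** 2
                    for inputs, desired in examples
                    for d, o in zip(desired, predict(layers, inputs)))
        if error < tolerance:
            break
    return error

# Hypothetical usage: a single sigmoid unit learning logical OR from zero weights.
or_examples = [((0, 0), [0]), ((0, 1), [1]), ((1, 0), [1]), ((1, 1), [1])]
or_net = [[[0.0, 0.0, 0.0]]]                  # weights for A, B, and the threshold
final_error = train(or_net, or_examples)
print(final_error, [round(predict(or_net, x)[0], 2) for x, _ in or_examples])
```

The same functions accept networks with hidden layers, for example `[[[...], [...]], [[...]]]` for two hidden units feeding one output unit.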
Step      A  B  T   W_A    W_B      W_T      z         o        d  d-o
forward   0  0  -1  0      0        1        a) -1     b) 0.27  1  (1 - 0.27) = 0.73
backward            c) 0   d) 0     e) 0.86
forward   0  1  -1  0      0        0.86     f) -0.86  g) 0.30  1
backward            h) 0   i) 0.15  j) 0.71
forward   1  0  -1  0      0.15     0.71     k) -0.71  l) 0.33  0
This is a detailed step-by-step answer to the first 2 steps of the Fall 2009 Quiz 3, Neural Nets part B.
The given network has the following architecture:
[Figure not reproduced. From the calculations a) - l): a single sigmoid unit with inputs A and B, a constant threshold input T = -1, weights W_A, W_B, W_T, and learning rate alpha = 1.]
Your task is to fill in the table (non-shaded cells). Detailed calculations a) - l) for the fill-in boxes are included below. In forward steps, your goal is to compute z, o, and d - o. In backward steps, your goal is to compute δ and the weight updates ΔW, and to find the new Ws.
a) z = A*W_A + B*W_B + T*W_T = 0*0 + 0*0 + (-1)*1 = -1
b) o = sigmoid(z) = sigmoid(-1) = 1/(1 + e^-(-1)) = 0.27
c) δ = (d - o) * (o * (1 - o))   # because it's the last (and only) layer
     = (1 - 0.27) * (0.27 * (1 - 0.27)) = 0.14
   ΔW_A = alpha * δ * A = 1 * 0.14 * 0 = 0
   W_A = 0 + ΔW_A = 0 + 0 = 0
d) ΔW_B = alpha * δ * B = 1 * 0.14 * 0 = 0
   W_B = 0 + ΔW_B = 0 + 0 = 0
e) ΔW_T = alpha * δ * T = 1 * 0.14 * (-1) = -0.14
   W_T = 1 + ΔW_T = 1 - 0.14 = 0.86
f) z = A*W_A + B*W_B + T*W_T = 0*0 + 1*0 + (-1)*0.86 = -0.86
g) o = sigmoid(-0.86) = 0.30
h) δ = (d - o) * (o * (1 - o)) = (1 - 0.3) * (0.3 * (1 - 0.3)) = 0.15
   ΔW_A = alpha * δ * A = 1 * 0.15 * 0 = 0
   W_A = 0 + ΔW_A = 0
i) ΔW_B = alpha * δ * B = 1 * 0.15 * 1 = 0.15
   W_B = 0 + ΔW_B = 0.15
j) ΔW_T = alpha * δ * T = 1 * 0.15 * (-1) = -0.15
   W_T = 0.86 + ΔW_T = 0.86 + -0.15 = 0.71
k) z = A*W_A + B*W_B + T*W_T = 1*0 + 0*0.15 + (-1)*0.71 = -0.71
l) o = sigmoid(-0.71) = 0.33
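The three forward/backward steps a) - l) can be replayed in code as a check on the arithmetic. This sketch assumes what the calculations themselves show: a single sigmoid unit with inputs (A, B, T), initial weights (0, 0, 1), learning rate 1, and desired outputs 1, 1, 0 for the three examples. The function and variable names are made up for the example. The code keeps full precision, so intermediate values differ slightly from the hand calculation, which rounds at each step, but they agree to two decimal places.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def run_steps(examples, weights, alpha=1.0):
    """examples is a list of ((A, B, T), d) pairs, processed in order.
    Returns one record per example with z, o, delta, and the updated weights."""
    w = list(weights)
    history = []
    for inputs, d in examples:
        z = sum(wi * xi for wi, xi in zip(w, inputs))               # forward step
        o = sigmoid(z)
        delta = (d - o) * o * (1 - o)                               # last-layer delta
        w = [wi + alpha * delta * xi for wi, xi in zip(w, inputs)]  # backward step
        history.append({"z": z, "o": o, "delta": delta, "weights": list(w)})
    return history

examples = [((0, 0, -1), 1), ((0, 1, -1), 1), ((1, 0, -1), 0)]
history = run_steps(examples, weights=[0.0, 0.0, 1.0])
for step in history:
    print(round(step["z"], 2), round(step["o"], 2),
          [round(w, 2) for w in step["weights"]])
```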
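The general equation for weight updates in back propagation (for a weight $w_{in}$ carrying input $i$ into a node $n$):
$$w_{in}' = w_{in} + \alpha \, \frac{\partial P}{\partial w_{in}}$$
$\alpha$ is the learning rate (also denoted by $r$). The lower the learning rate, the longer it takes to converge. But if the learning rate is too high, we may never find the maximum (we keep oscillating!).
To find the partial, we employ the chain rule:
$$\frac{\partial P}{\partial w_{in}} = \frac{\partial P}{\partial o_n} \cdot \frac{\partial o_n}{\partial z_n} \cdot \frac{\partial z_n}{\partial w_{in}}$$
Here:
$$z_n = \sum_i w_{in} \, i, \qquad o_n = \mathrm{sigmoid}(z_n), \qquad \frac{\partial z_n}{\partial w_{in}} = i$$
where $i$ is an input into neuron $n$.
Hence the third term is always the input.
$\delta_n$ is used as the shorthand for the first 2 partials:
$$\delta_n = \frac{\partial P}{\partial o_n} \cdot \frac{\partial o_n}{\partial z_n}, \qquad \Delta w_{in} = \alpha \, \delta_n \, i$$
For the very last layer:
For the standard performance function (for gradient ascent, i.e. maximization):
$$P = -\tfrac{1}{2} (d_n - o_n)^2, \qquad \frac{\partial P}{\partial o_n} = d_n - o_n$$
Partial of the sigmoid function:
$$\frac{\partial o_n}{\partial z_n} = o_n (1 - o_n)$$
Hence for the last layer:
$$\delta_n = (d_n - o_n) \, o_n (1 - o_n)$$
For when the layer is not the last layer, suppose $m$ ranges over the nodes in the layer above $n$ (the nodes that $n$ feeds into):
$$\frac{\partial P}{\partial o_n} = \sum_m \frac{\partial P}{\partial o_m} \cdot \frac{\partial o_m}{\partial z_m} \cdot \frac{\partial z_m}{\partial o_n} = \sum_m \delta_m \, w_{nm}, \qquad \delta_n = o_n (1 - o_n) \sum_m w_{nm} \, \delta_m$$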
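One way to gain confidence in a derivation like this is a numerical gradient check. The sketch below assumes a single sigmoid unit and the performance function P = -1/2 (d - o)^2; it compares the chain-rule result (d - o) o (1 - o) times the input against a central finite-difference estimate of the partial of P with respect to each weight. The function names and the test values are assumptions for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def performance(w, x, d):
    """P = -1/2 (d - o)^2 for a single sigmoid unit with weights w and inputs x."""
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -0.5 * (d - o) ** 2

def analytic_gradient(w, x, d):
    """Chain-rule gradient: dP/dw_i = delta * x_i, with delta = (d - o) o (1 - o)."""
    o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    delta = (d - o) * o * (1 - o)
    return [delta * xi for xi in x]

def numeric_gradient(w, x, d, eps=1e-6):
    """Central finite-difference estimate of dP/dw_i for each weight."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grads.append((performance(up, x, d) - performance(down, x, d)) / (2 * eps))
    return grads

w, x, d = [0.3, -0.2, 1.0], [1.0, 0.5, -1.0], 1.0   # arbitrary test values
exact = analytic_gradient(w, x, d)
approx = numeric_gradient(w, x, d)
max_gap = max(abs(a - b) for a, b in zip(exact, approx))
print(exact, approx, max_gap)
```

If the two gradients disagree by more than rounding noise, either the derivation or the code contains a sign or indexing error.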
Solve for the 9 unknown weights for the network that will perfectly classify the concept in the picture.
Step 1. First, think of the input-level units (units A and B) as defining regions (that divide +s from -s) in the X, Y graph. These regions should be depicted as linear boundary lines with arrows pointing towards the +ve data points. Next, think of the next-level unit (unit C) as some logical operator (a linearly separable operator) that combines the regions defined by the input-level units.
So in this case, units A and B represent the diagonal boundaries (with arrows) on the graph, defining two distinct ways of separating the space. Unit C represents a logical AND that intersects the two regions to create the bounded region in the middle.
Step 2. Write the line equations for the regions you defined in the graph.
A) The boundary equation for the region defined by line A:
   Y < -1 X + 3/2
B) The boundary equation for the region defined by line B:
   Y > -1 X + 1/2
Step 3. Rewrite the line equations into the form aX + bY > c, where a, b, and c are integers.
A) Y < -1 X + 3/2
   X + Y < 3/2
   -2X + -2Y > -3
B) Y > -1 X + 1/2
   X + Y > 1/2
   2X + 2Y > 1
Step 4. Note that the sum of weights and inputs (z) for each neural unit can also be written in this form.
For Unit A:
   z = W_XA X + W_YA Y + W_A(-1) > 0
   W_XA X + W_YA Y > W_A
For Unit B:
   z = W_XB X + W_YB Y + W_B(-1) > 0
   W_XB X + W_YB Y > W_B
Truth table for unit C (A AND B), used in Step 6:
A  B  desired output  Equations                Simplified
0  0  0               -W_C < 0                 W_C > 0
0  1  0               W_BC - W_C < 0           W_BC < W_C
1  0  0               W_AC - W_C < 0           W_AC < W_C
1  1  1               W_AC + W_BC - W_C > 0    W_AC + W_BC > W_C
Why W_XA X + W_YA Y + W_A(-1) > 0 rather than < 0?
When z = W_XA X + W_YA Y + W_A(-1) > 0, sigmoid(z) approaches 1: +ve points.
When z = W_XA X + W_YA Y + W_A(-1) < 0, sigmoid(z) approaches 0: -ve points.
So when the inequality is expressed as > 0, the region it defines points towards the +ve points. When it is expressed as < 0, the region points towards the -ve points.
Step 5. Easy! Just read off the weights by correspondence. (Note: in the 2006 quiz, the L and Pineapple problem wants you to match the correct equation by constraining the value of some weights.)
   -2 X + -2 Y > -3          line A's inequality
   W_XA X + W_YA Y > W_A     z equation for unit A
   So W_XA = -2, W_YA = -2, W_A = -3.
   2 X + 2 Y > 1             line B's inequality
   W_XB X + W_YB Y > W_B     z equation for unit B
   So W_XB = 2, W_YB = 2, W_B = 1.
Step 6. Solve the logic in the second layer. We want to compute (A AND B), so build a truth table and solve for the constraints!
We notice a symmetry between W_BC and W_AC, so we make a guess that they have the same value:
W_BC = 2 and W_AC = 2
Then the inequalities in the table above condense down to:
W_C > 0
W_C > 2 (twice)
W_C < 2 + 2 = 4
So 2 < W_C < 4, and W_C = 3 will work. An acceptable solution:
W_BC = 2, W_AC = 2, W_C = 3
The following solution also works, because it also obeys the stated constraints (with W_BC = W_AC = 109 they become 109 < W_C < 218):
W_BC = 109, W_AC = 109, W_C = 110
But quizzes will ask for the smallest integer solutions.
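The nine weights can be sanity-checked with a few lines of code. This sketch makes two assumptions: it replaces each sigmoid unit with a hard threshold (output 1 when z > 0, else 0), which is the limiting behavior described in the "why > 0" discussion, and it takes the positive region to be the band 1/2 < X + Y < 3/2 between the two boundary lines. Unit A uses the threshold W_A = -3, which follows from rewriting X + Y < 3/2 as -2X + -2Y > -3. The test points are made up for the example.

```python
def step(z):
    """Hard-threshold stand-in for a sigmoid unit: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0

def classify(x, y, w_ac=2, w_bc=2, w_c=3):
    """Three-unit network: A and B define half-planes, C computes A AND B."""
    a = step(-2 * x + -2 * y + (-3) * (-1))   # W_XA = -2, W_YA = -2, W_A = -3
    b = step(2 * x + 2 * y + 1 * (-1))        # W_XB = 2,  W_YB = 2,  W_B = 1
    return step(w_ac * a + w_bc * b + w_c * (-1))

inside = classify(0.5, 0.5)    # X + Y = 1, between the two lines
below = classify(0.0, 0.0)     # X + Y = 0, below line B
above = classify(1.0, 1.0)     # X + Y = 2, above line A
print(inside, below, above)

# The larger solution obeys the same constraints and gives the same answers.
same = [classify(x, y, w_ac=109, w_bc=109, w_c=110)
        for x, y in [(0.5, 0.5), (0.0, 0.0), (1.0, 1.0)]]
print(same)
```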
Try different models. Pick the model that gives you the lowest cross-validation (CV) error.
Examples:
k-NN - Vary k from 1 to N-1. Run cross-validation under the different ks. Choose the k with the lowest CV error.
Decision Tree - Try trees of varying depth and varying number of tests. Run cross-validation to pick the tree with the lowest CV error.
Neural Net - Try different neural net architectures. Run cross-validation to pick the architecture with the lowest CV error.
Models chosen using cross-validation tend to generalize better, and are less likely to overfit or underfit.
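As an illustration of the k-NN case, the sketch below picks k by leave-one-out cross-validation, one common form of CV in which each example is held out in turn and classified by the rest. The handout does not say which form of CV it intends, so this choice, the function names, and the toy data (two clusters plus one mislabeled-looking "+" point inside the "-" cluster) are all assumptions.

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Mode of the labels of the k training examples nearest to query (Euclidean)."""
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_cv_error(data, k):
    """Fraction of examples misclassified when each is held out in turn."""
    errors = 0
    for i, (features, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, features, k) != label:
            errors += 1
    return errors / len(data)

def choose_k(data, candidates):
    """Candidate k with the lowest CV error (the first such k on ties)."""
    return min(candidates, key=lambda k: loo_cv_error(data, k))

data = [((0, 0), "-"), ((0, 1), "-"), ((1, 0), "-"), ((1, 1), "-"),
        ((5, 5), "+"), ((5, 6), "+"), ((6, 5), "+"), ((6, 6), "+"),
        ((0.5, 0.5), "+")]                    # noisy point inside the "-" cluster

for k in (1, 3, 5):
    print(k, loo_cv_error(data, k))
best_k = choose_k(data, [1, 3, 5])
print("chosen k:", best_k)
```

On this data k = 1 is misled by the single noisy point, while larger values of k outvote it, which is the kind of overfitting that the CV error exposes.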