Learning as Search - Artificial Intelligence - Lecture Handout

Artificial Intelligence (CS607)
© Copyright Virtual University of Pakistan

2.2 Concept learning as search

Now that we are familiar with most of the terminology of machine learning, we can define the learning process in technical terms as:

“We assume that the concept lies in the hypothesis space. So we search for a hypothesis belonging to this hypothesis space that best fits the training examples, such that the output given by the hypothesis is the same as the true output of the concept.”

In short: assume C ∈ H, and search for an h ∈ H that best fits D, such that ∀ xi ∈ D, h(xi) = C(xi).

The stress here is on the word ‘search’: we need to somehow search through the hypothesis space.

2.2.1 General-to-specific ordering of the hypothesis space

Many algorithms for concept learning organize the search through the hypothesis space by relying on a very useful structure that exists for any concept learning problem: a general-to-specific ordering of hypotheses. By taking advantage of this naturally occurring structure over the hypothesis space, we can design learning algorithms that exhaustively search even infinite hypothesis spaces without explicitly enumerating every hypothesis. To illustrate the general-to-specific ordering, consider two hypotheses:

h1 = < H, H >
h2 = < ?, H >

Now consider the sets of instances that are classified positive by h1 and by h2. Because h2 imposes fewer constraints on the instance, it classifies more instances as positive. In fact, any instance classified positive by h1 will also be classified positive by h2. Therefore, we say that h2 is more general than h1.

So all the hypotheses in H can be ordered according to their generality, starting from < ?, ? >, which is the most general hypothesis since it always classifies all instances as positive. At the other extreme we have < Ø, Ø >, which is the most specific hypothesis, since it does not classify a single instance as positive.

2.2.2 FIND-S

FIND-S finds the maximally specific hypothesis possible within the version space, given a set of training data. How can we use the general-to-specific ordering of the hypothesis space to organize the search for a hypothesis consistent with the observed training examples? One way is to begin with the most specific possible hypothesis in H, then generalize the hypothesis each time it fails to cover an observed positive training example. (We say that a hypothesis “covers” a positive example if it correctly classifies the example as positive.) To be more precise about how the partial ordering is used, consider the FIND-S algorithm:

Initialize h to the most specific hypothesis in H
For each positive training instance x
    For each attribute constraint ai in h
        If the constraint ai is satisfied by x
        Then do nothing
        Else replace ai in h by the next more general constraint that is satisfied by x
Output hypothesis h

To illustrate this algorithm, let us assume that the learner is given the following sequence of training examples from the SICK domain:

D    T    BP    SK
x1   H    H     1
x2   L    L     0
x3   N    H     1

The first step of FIND-S is to initialize h to the most specific hypothesis in H:

h = < Ø, Ø >

Upon observing the first training example (< H, H >, 1), which happens to be a positive example, it becomes obvious that our hypothesis is too specific. In particular, none of the “Ø” constraints in h are satisfied by this training example, so each Ø is replaced by the next more general constraint that fits this particular example, namely the attribute values of this very training example:

h = < H, H >

This is our h after we have seen the first example, but this h is still very specific: it asserts that all instances are negative except for the single positive training example we have observed.

Upon encountering the second example, in this case a negative example, the algorithm makes no change to h. In fact, the FIND-S algorithm simply ignores every negative example. While this may at first seem strange, notice that in the current case our hypothesis h is already consistent with the new negative example (i.e. h correctly classifies this example as negative), and hence no revision is needed. In the general case, as long as we assume that the hypothesis space H contains a hypothesis that describes the true target concept c and that the training data contain no errors or conflicts, the current hypothesis h can never require a revision in response to a negative example.
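The FIND-S steps traced above can be reproduced with a few lines of code. The following is a minimal Python sketch using the same SICK training set; the encoding ('0' standing in for Ø, '?' for the fully general constraint) and the function name find_s are my own choices for illustration, not part of the handout.

def find_s(examples):
    h = ['0', '0']                       # start from the most specific hypothesis < Ø, Ø >
    for x, label in examples:
        if label != 1:                   # FIND-S simply ignores negative examples
            continue
        for i, value in enumerate(x):
            if h[i] == '0':              # an Ø constraint: take this example's value
                h[i] = value
            elif h[i] != value:          # conflicting values: generalize to '?'
                h[i] = '?'
    return h

# Training data from the table above: (T, BP) -> SK
D = [(('H', 'H'), 1), (('L', 'L'), 0), (('N', 'H'), 1)]
print(find_s(D))                         # ['?', 'H']

Running this on all three examples yields < ?, H >, which matches the final hypothesis discussed in the version-space section that follows.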
Since S0 has only one hypothesis, < Ø, Ø >, which implies S0(x1) = 0, which is not consistent with d1, we have to remove < Ø, Ø > from S1. We also add minimally general hypotheses from H to S1, such that those hypotheses are consistent with d1. The obvious candidates are < H, H >, < H, N >, < H, L >, < N, H >, ..., < L, N >, < L, L >, but none of these except < H, H > is consistent with d1. So S1 becomes:

S1 = {< H, H >}
G1 = {< ?, ? >}

The second training example is d2 = (< L, L >, 0) [a negative example].

S2 = S1 = {< H, H >}, since < H, H > is consistent with d2: both give negative outputs for x2.

G1 has only one hypothesis, < ?, ? >, which gives a positive output on x2 and hence is not consistent, since SK(x2) = 0, so we have to remove it and add in its place the hypotheses that are minimally specialized. While adding, we have to take care of two things; recall the statement of the algorithm for negative examples:

“Add to G all minimal specializations h of g, such that h is consistent with d, and some member of S is more specific than h”

The immediate one-step specializations of < ?, ? > are:

{< H, ? >, < N, ? >, < L, ? >, < ?, H >, < ?, N >, < ?, L >}

Out of these we have to get rid of the hypotheses that are not consistent with d2 = (< L, L >, 0). All of the hypotheses listed above give a 0 (negative) output on x2 = < L, L >, except for < L, ? > and < ?, L >, which give a 1 (positive) output on x2; these two are therefore not consistent with d2 and will not be added to G2. This leaves us with {< H, ? >, < N, ? >, < ?, H >, < ?, N >}.

This takes care of the inconsistent hypotheses, but there is another condition in the algorithm that we must take care of before adding all these hypotheses to G2. We repeat the statement, this time highlighting the point under consideration:

“Add to G all minimal specializations h of g, such that h is consistent with d, AND SOME MEMBER OF S IS MORE SPECIFIC THAN H”

This is a very important condition, which is often ignored and which results in the wrong final version space. The current S we have is S2 = {< H, H >}. Now, out of {< H, ? >, < N, ? >, < ?, H >, < ?, N >}, which hypotheses do you think < H, H > is more specific than? Certainly < H, H > is more specific than < H, ? > and < ?, H >, so we remove < N, ? > and < ?, N > to get the final G2:

G2 = {< H, ? >, < ?, H >}
S2 = {< H, H >}

The third and final training example is d3 = (< N, H >, 1) [a positive example].

We see that in G2, < H, ? > is not consistent with d3, so we remove it:

G3 = {< ?, H >}

We also see that in S2, < H, H > is not consistent with d3, so we remove it and add the minimal generalizations of < H, H >. The two choices we have are < H, ? > and < ?, H >. We keep only < ?, H >, since the other one is not consistent with d3. So our final version space is encompassed by S3 and G3:

G3 = {< ?, H >}
S3 = {< ?, H >}

It is only a coincidence that both the G and S sets are the same. In bigger problems, or even here if we had more examples, chances are that we would get different but consistent sets. These two sets, G and S, outline the version space of a concept. Note that the final hypothesis is the same one that was computed by FIND-S.
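The version-space updates traced above can be checked with a short program. The sketch below is my own simplified rendering of the bookkeeping for this two-attribute domain, not code from the handout; hypotheses are tuples over {H, N, L}, with '?' for the most general constraint and '0' standing in for Ø.

VALUES = ('H', 'N', 'L')          # possible values for T and BP

def covers(h, x):
    # Does hypothesis h classify instance x as positive?
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general_or_equal(h1, h2):
    # True if h1 is at least as general as h2 (h1 covers everything h2 covers).
    def weaker(c1, c2):
        return c1 == '?' or c1 == c2 or c2 == '0'
    return all(weaker(c1, c2) for c1, c2 in zip(h1, h2))

def min_generalization(h, x):
    # Minimally generalize h so that it covers the positive instance x.
    return tuple(v if c == '0' else (c if c == v else '?') for c, v in zip(h, x))

def min_specializations(h, x):
    # One-step specializations of h (replace one '?' by a value) that exclude x.
    specs = []
    for i, c in enumerate(h):
        if c == '?':
            for v in VALUES:
                if v != x[i]:
                    specs.append(h[:i] + (v,) + h[i + 1:])
    return specs

def candidate_elimination(examples):
    S = {('0', '0')}
    G = {('?', '?')}
    for x, label in examples:
        if label == 1:                                    # positive example
            G = {g for g in G if covers(g, x)}
            S = {min_generalization(s, x) for s in S}
            S = {s for s in S if any(more_general_or_equal(g, s) for g in G)}
        else:                                             # negative example
            S = {s for s in S if not covers(s, x)}
            new_G = set()
            for g in G:
                if not covers(g, x):
                    new_G.add(g)
                    continue
                for h in min_specializations(g, x):
                    # keep only specializations with some member of S more specific
                    if any(more_general_or_equal(h, s) for s in S):
                        new_G.add(h)
            G = new_G
    return S, G

D = [(('H', 'H'), 1), (('L', 'L'), 0), (('N', 'H'), 1)]
S, G = candidate_elimination(D)
print('S =', S)   # {('?', 'H')}
print('G =', G)   # {('?', 'H')}

Running it on the three SICK examples reproduces the trace above: after d2 the general boundary is {< H, ? >, < ?, H >}, and after d3 both boundaries collapse to < ?, H >.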
2.3 Decision trees learning

Up until now we have been searching in conjunctive spaces, which are formed by ANDing the attributes, for instance:

IF Temperature = High AND Blood Pressure = High THEN Person = SICK

But this is a very restrictive search: we saw the reduction in hypothesis space from 2^9 = 512 total possible concepts to 17. This can be risky if we are not sure that the true concept lies in the conjunctive space. A safer approach is to relax the search constraints. One way is to involve OR in the search. Do you think we would have a bigger search space if we employed OR? Yes, most certainly; consider, for example, the statement:

IF Temperature = High OR Blood Pressure = High THEN Person = SICK

If we could use this kind of OR statement, we would have a better chance of finding the true concept when the concept does not lie in the conjunctive space. Such spaces are also called disjunctive spaces.

2.3.1 Decision tree representation

Decision trees give us disjunctions of conjunctions, that is, they have the form:

(A AND B) OR (C AND D)

In tree representation, this would translate into:

[Figure: a decision tree over the attributes A, B, C and D]

where A, B, C and D are the attributes of the problem. This tree gives a positive output if either the attributes A AND B are present in the instance, OR the attributes C AND D are present. Through decision trees, this is how we reach the final hypothesis. This is a hypothetical tree; in real problems, every tree has to have a root node. There are various algorithms, such as ID3 and C4.5, for finding decision trees for learning problems.

2.3.2 ID3

ID3 stands for Iterative Dichotomiser 3; it was the third revision of the algorithm, and the one that gained wide acclaim. The first step of ID3 is to find the root node. It uses a special function, GAIN, to evaluate the information gain of each attribute. For example, if there are 3 attributes, it calculates the information gain of each. Whichever attribute has the maximum information gain becomes the root node. The remaining attributes then compete for the next slots.

2.3.2.1 Entropy

In order to define information gain precisely, we begin by defining a measure commonly used in statistics and information theory, called entropy, which characterizes the purity/impurity of an arbitrary collection of examples. Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this Boolean classification is:

Entropy(S) = - p+ log2 p+ - p- log2 p-

where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S. In all calculations involving entropy we define 0 log2 0 to be 0.

To illustrate, suppose S is a collection of 14 examples of some Boolean concept, including 9 positive and 5 negative examples. Then the entropy of S relative to this Boolean classification is:

Entropy(S) = - (9/14) log2 (9/14) - (5/14) log2 (5/14) ≈ 0.940
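The entropy formula above is easy to check numerically. Below is a small Python sketch (my own illustration, not from the handout) that reproduces the 9-positive / 5-negative calculation and the two boundary cases.

import math

def entropy(p_pos, p_neg):
    # Entropy of a Boolean-labelled collection; 0 * log2(0) is taken as 0, as in the text.
    def term(p):
        return 0.0 if p == 0 else -p * math.log2(p)
    return term(p_pos) + term(p_neg)

print(entropy(9/14, 5/14))   # about 0.940
print(entropy(0.5, 0.5))     # 1.0: half positive, half negative is maximally impure
print(entropy(1.0, 0.0))     # 0.0: a pure collection has zero impurity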
G(S, E) = E(S) - (|Se1|/|S|) E(Se1) - (|Se2|/|S|) E(Se2) = 0.02

This tells us that the information gain for A is the highest, so we simply choose A as the root of our decision tree. Having done that, we check whether there are any conflicting leaf nodes in the tree. We get a better picture in the pictorial representation shown below:

[Figure: a decision tree of height one with root node A; branch a1 leads to YES, branch a3 leads to NO, and branch a2 is not yet resolved.]

This is a tree of height one, and we have built it after only one iteration. This tree correctly classifies 3 out of 5 training samples (d1, d3 and d5), based on only one attribute, A, which gave the maximum information gain. It will classify every forthcoming sample that has the value a1 in attribute A as YES, and every sample having a3 as NO. The training samples are shown below; the correctly classified ones are d1, d3 and d5:

S    A    B    E    C
d1   a1   b1   e2   YES
d2   a2   b2   e1   YES
d3   a3   b2   e1   NO
d4   a2   b2   e2   NO
d5   a3   b1   e2   NO

Note that a2 is not a good determinant for classifying the output C, because it gives both YES and NO, for d2 and d4 respectively. This means that we now have to look at the other attributes, B and E, to resolve this conflict. To build the tree further we will ignore the samples already covered by the tree above. Our new sample space is given by S' in the table below:

S'   A    B    E    C
d2   a2   b2   e1   YES
d4   a2   b2   e2   NO

S' = [d2, d4]
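The gain-based choice of the root, and the follow-up split on S', can be reproduced with the short sketch below. This is my own code for illustration (the dictionaries and function names are not from the handout); it uses the five-sample table above and the reduced sample space S' just defined.

import math
from collections import Counter

samples = [                                   # (A, B, E) -> C
    ({'A': 'a1', 'B': 'b1', 'E': 'e2'}, 'YES'),   # d1
    ({'A': 'a2', 'B': 'b2', 'E': 'e1'}, 'YES'),   # d2
    ({'A': 'a3', 'B': 'b2', 'E': 'e1'}, 'NO'),    # d3
    ({'A': 'a2', 'B': 'b2', 'E': 'e2'}, 'NO'),    # d4
    ({'A': 'a3', 'B': 'b1', 'E': 'e2'}, 'NO'),    # d5
]

def entropy(labels):
    # Entropy of a list of class labels (any number of classes, counts never zero).
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(data, attribute):
    # Information gain of splitting data on the given attribute.
    base = entropy([label for _, label in data])
    remainder = 0.0
    for value in {x[attribute] for x, _ in data}:
        subset = [label for x, label in data if x[attribute] == value]
        remainder += len(subset) / len(data) * entropy(subset)
    return base - remainder

for a in ('A', 'B', 'E'):
    print(a, round(gain(samples, a), 3))      # A: 0.571, B: 0.02, E: 0.02 -> A becomes the root

s_prime = [s for s in samples if s[0]['A'] == 'a2']   # the unresolved branch: d2 and d4
for a in ('B', 'E'):
    print(a, round(gain(s_prime, a), 3))      # B: 0.0, E: 1.0 -> E resolves the conflict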
We now apply the same process again. First we calculate the entropy for this sub-sample space S':

E(S') = - p+ log2 p+ - p- log2 p-
      = - (1/2) log2 (1/2) - (1/2) log2 (1/2)
      = 1

This gives us an entropy of 1, which is the maximum value entropy can take. This is also obvious from the data, since half of the samples are positive (YES) and half are negative (NO). Since our tree already has a node for A, ID3 assumes that the tree will not have that attribute repeated again, which is true: A has already divided the data as much as it can, so it makes no sense to repeat A in the intermediate nodes. Give this a thought yourself too.

Meanwhile, we calculate the information gain of B and E with respect to this new sample space S':

|S'| = 2
|S'b2| = 2

G(S', B) = E(S') - (|S'b2|/|S'|) E(S'b2)
         = 1 - (2/2) ( - (1/2) log2 (1/2) - (1/2) log2 (1/2) )
         = 1 - 1 = 0

Similarly for E:

|S'| = 2
|S'e1| = 1   [there is only one observation of e1, and it outputs a YES]
E(S'e1) = - 1 log2 1 - 0 log2 0 = 0   [since log2 1 = 0]
|S'e2| = 1   [there is only one observation of e2, and it outputs a NO]
E(S'e2) = - 0 log2 0 - 1 log2 1 = 0   [since log2 1 = 0]

Hence:

G(S', E) = E(S') - (|S'e1|/|S'|) E(S'e1) - (|S'e2|/|S'|) E(S'e2)
         = 1 - (1/2)(0) - (1/2)(0)
         = 1 - 0 - 0 = 1

Therefore E gives us the maximum information gain, which also makes sense intuitively: looking at the table for S', B has only one value, b2, which does not help us decide anything, since it gives both a YES and a NO, whereas E has two values, e1 and e2; e1 gives a YES and e2 gives a NO. So we put the node E into the tree we are already building. The pictorial representation is shown below:

[Figure: the final decision tree. The root node A has three branches: a1 leads to YES, a3 leads to NO, and a2 leads to a node E, whose branch e1 leads to YES and whose branch e2 leads to NO.]

Now we stop further iterations, since there are no conflicting leaves left to expand. This is our hypothesis h that satisfies every training example.

3 LEARNING: Connectionist

Although ID3 spans more of the concept space, there is still a possibility that the true concept is not simply a mixture of disjunctions of conjunctions but some more complex arrangement of attributes. Artificial Neural Networks (ANNs) can compute more complicated functions, ranging from linear to higher-order ones, especially for non-Boolean concepts. This new learning paradigm takes its roots from a biology-inspired approach to learning. It is primarily a form of parallel distributed computing in which the focus of the algorithms is on training rather than on explicit programming. Tasks for which the connectionist approach is well suited include:

• Classification
  • Fruits: apple or orange
• Pattern recognition
  • Fingerprint, face recognition
• Prediction
  • Stock market analysis, weather forecasting

3.1 Biological aspects and structure of a neuron

The brain is a collection of about 100 billion interconnected neurons. Each neuron is a cell that uses biochemical reactions to receive, process and transmit information. A neuron's dendritic tree is connected to a thousand neighboring neurons.

3.2.2 Response of changing weight

The change in weight results in the rotation of the decision line. Hence this up-and-down shift, together with the rotation of the straight line, can achieve any linear decision.
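As a concrete illustration of this last point, the sketch below assumes a simple two-input threshold unit with weights w1, w2 and a bias term b (this particular unit and its parameter names are my own assumption, since the excerpt does not spell them out). Rewriting the decision line w1*x1 + w2*x2 + b = 0 in slope-intercept form makes the effect visible: changing a weight rotates the line, while changing the bias shifts it up or down.

def neuron(x1, x2, w1, w2, b):
    # Threshold unit: outputs 1 on one side of the line w1*x1 + w2*x2 + b = 0, else 0.
    return 1 if w1 * x1 + w2 * x2 + b >= 0 else 0

def slope_and_intercept(w1, w2, b):
    # Rewrite the decision line as x2 = m*x1 + c (assumes w2 != 0).
    return -w1 / w2, -b / w2

print(slope_and_intercept(1.0, 1.0, 0.0))    # (-1.0, 0.0): a reference orientation
print(slope_and_intercept(2.0, 1.0, 0.0))    # (-2.0, 0.0): changing a weight rotates the line
print(slope_and_intercept(1.0, 1.0, -1.0))   # (-1.0, 1.0): changing b shifts the line up or down
print(neuron(0.2, 0.9, 1.0, 1.0, -1.0))      # 1: this point lies on the firing side of the line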