Version Spaces - Artificial Intelligence - Lecture Slides, Slides of Artificial Intelligence

Some concepts of Artificial Intelligence are Agents and Problem Solving, Autonomy, Programs, Classical and Modern Planning, First-Order Logic, Resolution Theorem Proving, Search Strategies, Structure Learning. Main points of this lecture are: Version Spaces, Decision Trees, Machine Learning, Variables, Data Type Definition, Binary, Supervised Learning Problem, Describe the General Concept, Forecast, Sport.

Typology: Slides

2012/2013

Uploaded on 04/29/2013

shantii
Lecture 35 of 41
Machine Learning:
Version Spaces and Decision Trees

Example: Learning A Concept (EnjoySport) from Data

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
0        Sunny  Warm     Normal    Strong  Warm   Same      Yes
1        Sunny  Warm     High      Strong  Warm   Same      Yes
2        Rainy  Cold     High      Strong  Warm   Change    No
3        Sunny  Warm     High      Strong  Cool   Change    Yes

  • Specification for Training Examples
    • Similar to a data type definition
    • 6 variables (aka attributes, features): Sky, AirTemp, Humidity, Wind, Water, Forecast
    • Nominal-valued (symbolic) attributes: enumerative data type
  • Binary (Boolean-Valued, i.e., {0, 1}-Valued) Concept
  • Supervised Learning Problem: Describe the General Concept

Typical Concept Learning Tasks

  • Given
    • Instances X: possible days, each described by attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
    • Target function c ≡ EnjoySport: X → {0, 1}, where X ≡ {Rainy, Sunny} × {Warm, Cold} × {Normal, High} × {None, Mild, Strong} × {Cool, Warm} × {Same, Change}
    • Hypotheses H: conjunctions of literals (e.g., <?, Cold, High, ?, ?, ?>)
    • Training examples D = {<x1, c(x1)>, …, <xm, c(xm)>}: positive and negative examples of the target function
  • Determine
    • Hypothesis h ∈ H such that h(x) = c(x) for all x ∈ D
    • Such h are consistent with the training data
  • Training Examples
    • Assumption: no missing X values
    • Noise in values of c (contradictory labels)?
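The consistency condition h(x) = c(x) for all x in D can be checked mechanically. The following is a minimal Python sketch, not part of the original slides: it assumes a hypothesis is a tuple of constraints where "?" accepts any value and "Ø" accepts none, and the names `matches`, `consistent`, and `ENJOY_SPORT` are illustrative choices.

```python
# Sketch only: hypotheses are tuples of attribute constraints.
# "?" accepts any value, "Ø" accepts no value (assumed encoding).

ENJOY_SPORT = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]


def matches(h, x):
    """h(x): True when every constraint in h is satisfied by instance x."""
    return all(c == "?" or c == v for c, v in zip(h, x))


def consistent(h, examples):
    """True when h(x) equals the label c(x) for every training example."""
    return all(matches(h, x) == label for x, label in examples)


if __name__ == "__main__":
    print(consistent(("Sunny", "Warm", "?", "Strong", "?", "?"), ENJOY_SPORT))
    print(consistent(("?", "Cold", "High", "?", "?", "?"), ENJOY_SPORT))
```

The first hypothesis agrees with all four labels; the second rejects the positive examples and accepts the negative one, so it is not consistent with D.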

Inductive Learning Hypothesis

  • Fundamental Assumption of Inductive Learning
  • Informal Statement
    • Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples
    • Definitions deferred: sufficiently large, approximate well, unobserved
  • Formal Statements, Justification, Analysis
    • Statistical (Mitchell, Chapter 5; statistics textbook)
    • Probabilistic (R&N, Chapters 14-15 and 19; Mitchell, Chapter 6)
    • Computational (R&N, Section 18.6; Mitchell, Chapter 7)
  • More on This Topic: Machine Learning and Pattern Recognition (CIS732)
  • Next: How to Find This Hypothesis?

Find-S Algorithm

1. Initialize h to the most specific hypothesis in H
   (H: the hypothesis space, a partially ordered set under the relation Less-Specific-Than)

2. For each positive training instance x
     For each attribute constraint ai in h
       IF the constraint ai in h is satisfied by x
       THEN do nothing
       ELSE replace ai in h by the next more general constraint that is satisfied by x

3. Output hypothesis h
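The three steps above can be sketched in Python for conjunctive hypotheses. This is an illustrative sketch rather than the lecture's own code: it assumes the encoding "Ø" (no value allowed) and "?" (any value allowed), for which the "next more general constraint" is the observed value when the constraint is "Ø" and "?" otherwise. The function name `find_s` and the data layout are assumptions.

```python
# Sketch of Find-S for conjunctive hypotheses over nominal attributes.
# Assumed encoding: "Ø" = most specific constraint, "?" = most general.

ENJOY_SPORT = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]


def find_s(examples, n_attributes):
    # Step 1: start from the most specific hypothesis <Ø, ..., Ø>.
    h = ["Ø"] * n_attributes
    # Step 2: generalize only on positive examples; negatives are ignored.
    for x, is_positive in examples:
        if not is_positive:
            continue
        for i, value in enumerate(x):
            if h[i] == "Ø":
                h[i] = value      # first positive example: copy its value
            elif h[i] != "?" and h[i] != value:
                h[i] = "?"        # conflicting values: relax to don't-care
    # Step 3: output the resulting hypothesis.
    return tuple(h)


if __name__ == "__main__":
    print(find_s(ENJOY_SPORT, 6))
```

On the four EnjoySport examples this yields <Sunny, Warm, ?, Strong, ?, ?>. Because negative examples are never consulted, the sketch also inherits the shortcomings of Find-S: it cannot detect inconsistent training data.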

Hypothesis Space Search by Find-S

Instances X
  x1 = <Sunny, Warm, Normal, Strong, Warm, Same>, +
  x2 = <Sunny, Warm, High, Strong, Warm, Same>, +
  x3 = <Rainy, Cold, High, Strong, Warm, Change>, -
  x4 = <Sunny, Warm, High, Strong, Cool, Change>, +

Hypotheses H
  h1 = <Ø, Ø, Ø, Ø, Ø, Ø>
  h2 = <Sunny, Warm, Normal, Strong, Warm, Same>
  h3 = <Sunny, Warm, ?, Strong, Warm, Same>
  h4 = <Sunny, Warm, ?, Strong, Warm, Same>
  h5 = <Sunny, Warm, ?, Strong, ?, ?>

  • Shortcomings of Find-S
    • Can’t tell whether it has learned concept
    • Can’t tell when training data inconsistent
    • Picks a maximally specific h (why?)
    • Depending on H , there might be several!


Candidate Elimination Algorithm [1]

1. Initialization
   G ← (singleton) set containing most general hypothesis in H, denoted {<?, … , ?>}
   S ← set of most specific hypotheses in H, denoted {<Ø, … , Ø>}

2. For each training example d
   If d is a positive example (Update-S)
     Remove from G any hypotheses inconsistent with d
     For each hypothesis s in S that is not consistent with d
       Remove s from S
       Add to S all minimal generalizations h of s such that
         1. h is consistent with d
         2. Some member of G is more general than h
         (These are the greatest lower bounds, or meets, s ∧ d, in VS_H,D)
       Remove from S any hypothesis that is more general than another hypothesis in S (remove any dominated elements)

Candidate Elimination Algorithm [2]

(continued)

   If d is a negative example (Update-G)
     Remove from S any hypotheses inconsistent with d
     For each hypothesis g in G that is not consistent with d
       Remove g from G
       Add to G all minimal specializations h of g such that
         1. h is consistent with d
         2. Some member of S is more specific than h
         (These are the least upper bounds, or joins, g ∨ d, in VS_H,D)
       Remove from G any hypothesis that is less general than another hypothesis in G (remove any dominating elements)
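The Update-S and Update-G steps can be sketched in Python for the conjunctive EnjoySport language. This is a hedged illustration, not the lecture's implementation: the encoding ("?" for any value, "Ø" for no value), the helper names, and the `DOMAINS` table (taken from the attribute value sets listed for X) are assumptions made for the example.

```python
# Sketch of Candidate Elimination for conjunctive hypotheses.
# Assumed encoding: "?" accepts any value, "Ø" accepts none.

DOMAINS = [("Sunny", "Rainy"), ("Warm", "Cold"), ("Normal", "High"),
           ("None", "Mild", "Strong"), ("Warm", "Cool"), ("Same", "Change")]

ENJOY_SPORT = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Cool", "Change"), True),
]


def matches(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))


def more_general_or_equal(h1, h2):
    """True when h1 covers every instance that h2 covers."""
    if "Ø" in h2:                      # h2 covers no instance at all
        return True
    return all(a == "?" or a == b for a, b in zip(h1, h2))


def min_generalization(s, x):
    """Least generalization of s that covers the positive instance x."""
    return tuple(v if c == "Ø" else (c if c == v else "?")
                 for c, v in zip(s, x))


def min_specializations(g, x, domains):
    """Replace one '?' in g by a value that excludes the negative x."""
    return [g[:i] + (value,) + g[i + 1:]
            for i, c in enumerate(g) if c == "?"
            for value in domains[i] if value != x[i]]


def candidate_elimination(examples, domains):
    S = {("Ø",) * len(domains)}
    G = {("?",) * len(domains)}
    for x, positive in examples:
        if positive:                   # Update-S
            G = {g for g in G if matches(g, x)}
            new_S = set()
            for s in S:
                h = s if matches(s, x) else min_generalization(s, x)
                if any(more_general_or_equal(g, h) for g in G):
                    new_S.add(h)
            S = {s for s in new_S
                 if not any(s != t and more_general_or_equal(s, t)
                            for t in new_S)}
        else:                          # Update-G
            S = {s for s in S if not matches(s, x)}
            new_G = set()
            for g in G:
                candidates = ([g] if not matches(g, x)
                              else min_specializations(g, x, domains))
                new_G.update(h for h in candidates
                             if any(more_general_or_equal(h, s) for s in S))
            G = {g for g in new_G
                 if not any(g != t and more_general_or_equal(t, g)
                            for t in new_G)}
    return S, G


if __name__ == "__main__":
    S, G = candidate_elimination(ENJOY_SPORT, DOMAINS)
    print("S =", S)
    print("G =", G)
```

On the four EnjoySport examples the sketch ends with S = {<Sunny, Warm, ?, Strong, ?, ?>} and G = {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}. Unlike Find-S, the two boundary sets show how much of the version space remains undetermined.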

An Unbiased Learner

  • Example of A Biased H
    • Conjunctive concepts with don’t cares
    • What concepts can H not express? (Hint: what are its syntactic limitations?)
  • Idea
    • Choose H’ that expresses every teachable concept
    • i.e., H’ is the power set of X
    • Recall: |A → B| = |B|^|A| (A = X; B = {labels}; H’ = A → B)
    • {{Rainy, Sunny} × {Warm, Cold} × {Normal, High} × {None, Mild, Strong} × {Cool, Warm} × {Same, Change}} → {0, 1}
  • An Exhaustive Hypothesis Language
    • Consider: H’ = disjunctions (∨), conjunctions (∧), negations (¬) over previous H
    • |H’| = 2^(2 • 2 • 2 • 3 • 2 • 2) = 2^96; |H| = 1 + (3 • 3 • 3 • 4 • 3 • 3) = 973
  • What Are S, G For The Hypothesis Language H’?
    • S ← disjunction of all positive examples
    • G ← conjunction of all negated negative examples
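The two counts can be reproduced with a few lines of Python. This is a small arithmetic sketch; the list `sizes` holds the number of values per attribute as given for X (2, 2, 2, 3, 2, 2), and the variable names are illustrative.

```python
from math import prod

# Number of values for Sky, AirTemp, Humidity, Wind, Water, Forecast.
sizes = [2, 2, 2, 3, 2, 2]

# |X|: every combination of attribute values is a distinct instance.
n_instances = prod(sizes)

# |H'|: one hypothesis per subset of X (the power set), i.e. 2^|X|.
n_unbiased = 2 ** n_instances

# |H|: each attribute takes a specific value or "?", plus one extra
# hypothesis standing for every conjunction that contains Ø.
n_conjunctive = 1 + prod(k + 1 for k in sizes)

if __name__ == "__main__":
    print(n_instances)      # 96
    print(n_unbiased)       # 2**96, roughly 7.9e28
    print(n_conjunctive)    # 973
```

The gap between 973 and 2^96 is the bias of the conjunctive language: it can express only a vanishing fraction of the concepts definable over X.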

Decision Trees

  • Classifiers: Instances (Unlabeled Examples)
  • Internal Nodes: Tests for Attribute Values
    • Typical: equality test (e.g., “Wind = ?”)
    • Inequality, other tests possible
  • Branches: Attribute Values
    • One-to-one correspondence (e.g., “Wind = Strong”, “Wind = Light”)
  • Leaves: Assigned Classifications (Class Labels)
  • Representational Power: Propositional Logic (Why?)

Outlook?
  Sunny → Humidity?
    High → No
    Normal → Yes
  Overcast → Maybe
  Rain → Wind?
    Strong → No
    Light → Maybe

Decision Tree for Concept PlayTennis
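Classification with such a tree amounts to following one branch per test until a leaf is reached. The sketch below is illustrative only: it encodes a PlayTennis tree as nested tuples and dicts, and the pairing of branch values with leaf labels (for example Overcast with "Maybe") is an assumption read off the figure labels, whose layout is ambiguous in the extracted text.

```python
# Sketch: an internal node is (attribute, {value: subtree}); a leaf is a label.
# The leaf assignment below is an assumed reading of the PlayTennis figure.
PLAY_TENNIS_TREE = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Maybe",
    "Rain": ("Wind", {"Strong": "No", "Light": "Maybe"}),
})


def classify(tree, instance):
    """Follow the branch matching each tested attribute until a leaf."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[instance[attribute]]
    return tree


if __name__ == "__main__":
    day = {"Outlook": "Sunny", "Humidity": "Normal", "Wind": "Strong"}
    print(classify(PLAY_TENNIS_TREE, day))
```

Each root-to-leaf path is a conjunction of attribute tests, and the tree as a whole is a disjunction of those paths, which is one way to see the link to propositional logic.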

Decision Tree Learning: Top-Down Induction (ID3)

  • Algorithm Build-DT (Examples, Attributes)

IF all examples have the same label THEN RETURN (leaf node with label)
ELSE IF set of attributes is empty THEN RETURN (leaf with majority label)
ELSE
  Choose best attribute A as root
  FOR each value v of A
    Create a branch out of the root for the condition A = v
    IF {x ∈ Examples: x.A = v} = Ø
    THEN attach (leaf with majority label) to the branch
    ELSE attach Build-DT ({x ∈ Examples: x.A = v}, Attributes − {A}) to the branch
  RETURN the tree rooted at A

  • But Which Attribute Is Best?

Splitting [29+, 35-] on A1: True → [21+, 5-], False → [8+, 30-]
Splitting [29+, 35-] on A2: True → [18+, 33-], False → [11+, 2-]
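The recursion in Build-DT can be sketched in Python with the attribute-selection rule left as a parameter, since "best" has not yet been defined at this point. Everything named here is an assumption made for illustration: the function names, the dict-based example format, the toy data, and the placeholder chooser that simply picks the alphabetically first attribute.

```python
from collections import Counter


def majority_label(examples):
    """Most frequent label among (instance, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def build_dt(examples, attributes, values, choose):
    """Return a leaf label or a node (attribute, {value: subtree})."""
    labels = {label for _, label in examples}
    if len(labels) == 1:                    # all examples share one label
        return labels.pop()
    if not attributes:                      # nothing left to test
        return majority_label(examples)
    a = choose(examples, attributes)        # pluggable "best attribute" rule
    branches = {}
    for v in values[a]:
        subset = [(x, y) for x, y in examples if x[a] == v]
        if not subset:                      # empty branch: majority leaf
            branches[v] = majority_label(examples)
        else:
            branches[v] = build_dt(subset, attributes - {a}, values, choose)
    return (a, branches)


# Hypothetical toy data, not taken from the slides.
TOY = [
    ({"Wind": "Strong", "Humidity": "High"}, "No"),
    ({"Wind": "Light", "Humidity": "High"}, "Yes"),
    ({"Wind": "Light", "Humidity": "Normal"}, "Yes"),
]
TOY_VALUES = {"Wind": ["Strong", "Light"], "Humidity": ["High", "Normal"]}


def first_attribute(examples, attributes):
    """Placeholder chooser: alphabetically first remaining attribute."""
    return min(attributes)


if __name__ == "__main__":
    print(build_dt(TOY, frozenset(TOY_VALUES), TOY_VALUES, first_attribute))
```

Note that an empty branch attaches a majority-label leaf and the loop continues, rather than returning from the whole call. Swapping `first_attribute` for an information-gain-based chooser gives the ID3 behavior.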

Choosing the “Best” Root Attribute

  • Objective
    • Construct a decision tree that is as small as possible (Occam’s Razor)
    • Subject to: consistency with labels on training data
  • Obstacles
    • Finding the minimal consistent hypothesis (i.e., decision tree) is NP-hard (D’oh!)
    • Recursive algorithm ( Build-DT )
      • A greedy heuristic search for a simple tree
      • Cannot guarantee optimality (D’oh!)
  • Main Decision: Next Attribute to Condition On
    • Want: attributes that split examples into sets that are relatively pure in one label
    • Result: closer to a leaf node
    • Most popular heuristic
      • Developed by J. R. Quinlan
      • Based on information gain
      • Used in ID3 algorithm

Entropy: Information Theoretic Definition

  • Components
    • D: a set of examples {<x1, c(x1)>, <x2, c(x2)>, …, <xm, c(xm)>}
    • p+ = Pr(c(x) = +), p- = Pr(c(x) = -)
  • Definition
    • H is defined over a probability distribution p
    • D contains examples whose frequency of + and - labels indicates p+ and p- for the observed data
    • The entropy of D relative to c is: $H(D) \equiv -p_+ \log_b(p_+) - p_- \log_b(p_-)$
  • What Units is H Measured In?
    • Depends on the base b of the log (bits for b = 2, nats for b = e , etc.)
    • A single bit is required to encode each example in the worst case ( p+ = 0.5)
    • If there is less uncertainty (e.g., p+ = 0.8), we can use less than 1 bit each
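The two-label entropy can be computed directly from p+. The following is a minimal sketch; the function name `entropy` and the convention that a term with p = 0 contributes 0 are assumptions of the example.

```python
import math


def entropy(p_pos, base=2.0):
    """Two-class entropy for a given probability of the + label.

    Terms with probability 0 are skipped, using the usual convention
    that 0 * log(0) is treated as 0.
    """
    total = 0.0
    for p in (p_pos, 1.0 - p_pos):
        if p > 0.0:
            total -= p * math.log(p, base)
    return total


if __name__ == "__main__":
    print(entropy(0.5))            # worst case: about 1 bit per example
    print(entropy(0.8))            # less uncertainty: about 0.72 bits
    print(entropy(0.8, math.e))    # same distribution measured in nats
```

With p+ = 0.5 the result is 1 bit, and with p+ = 0.8 it drops to roughly 0.72 bits, matching the claim that less uncertainty needs less than 1 bit per example.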

Information Gain: Information Theoretic Definition

  • Partitioning on Attribute Values
    • Recall: a partition of D is a collection of disjoint subsets whose union is D
    • Goal: measure the uncertainty removed by splitting on the value of attribute A
  • Definition
    • The information gain of D relative to attribute A is the expected reduction in entropy due to splitting (“sorting”) on A:

$$Gain(D, A) \equiv H(D) - \sum_{v \in values(A)} \frac{|D_v|}{|D|} \, H(D_v)$$

where $D_v$ is {x ∈ D : x.A = v}, the set of examples in D where attribute A has value v

  • Idea: partition on A; scale entropy to the size of each subset $D_v$
  • Which Attribute Is Best?

Splitting [29+, 35-] on A1: True → [21+, 5-], False → [8+, 30-]
Splitting [29+, 35-] on A2: True → [18+, 33-], False → [11+, 2-]
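The comparison between A1 and A2 can be worked out numerically from the label counts in the figure. This is a sketch under stated assumptions: each subset is represented only by its (positive, negative) counts, entropy is measured in bits, and the function names are illustrative.

```python
import math


def entropy_counts(pos, neg):
    """Entropy in bits of a set with pos positive and neg negative labels."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:                       # skip empty classes (0 log 0 = 0)
            p = count / total
            h -= p * math.log2(p)
    return h


def information_gain(parent, children):
    """Gain = H(parent) minus the size-weighted entropy of the subsets.

    parent is a (pos, neg) pair; children is a list of (pos, neg) pairs
    that partition the parent.
    """
    n = sum(parent)
    remainder = sum((p + q) / n * entropy_counts(p, q) for p, q in children)
    return entropy_counts(*parent) - remainder


if __name__ == "__main__":
    D = (29, 35)
    gain_a1 = information_gain(D, [(21, 5), (8, 30)])    # about 0.27
    gain_a2 = information_gain(D, [(18, 33), (11, 2)])   # about 0.12
    print(round(gain_a1, 3), round(gain_a2, 3))
```

Under this measure A1 removes more uncertainty than A2 (roughly 0.27 bits against 0.12 bits), so a gain-based learner such as ID3 would split on A1 first.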