Some notes and solutions to Tom Mitchell’s
Machine Learning (McGraw Hill, 1997)
Peter Danenberg
24 October 2011
Contents
1 TODO An empty module that gathers the exercises’ dependencies
2 Exercises
  2.1 DONE 1.1
  2.2 DONE 1.2
  2.3 DONE 1.3
  2.4 DONE 1.4
  2.5 TODO 1.5
3 Notes
  3.1 Chapters
    3.1.1 1
  3.2 Exercises
    3.2.1 1.3
    3.2.2 1.4
    3.2.3 1.5
1 TODO An empty module that gathers the exercises’ dependencies

such that running chicken-install -s installs them.
2 Exercises

2.1 DONE 1.1

CLOSED: 2011-10-12 Wed 04:

Appropriate
  • animal languages: could craft appropriate responses and prompts, perhaps, though ignorant of the semantics.
  • fugues: train on Bach data, or Buxtehude. performance measure? perfect authentic cadence, of course. ;) no, not that simple.
  • narratives: learn the structure of narratives? performance measure is tricky here.

Not appropriate
  • comedy: requires a bizarre ex nihilo and spontaneity (distinguishable from three above?) in fact, the second and third above are inappropriate, rather?
  • define “inappropriate”: difficult? vague performance measure? data representation and search; or have meta-learning-problems been solved?
  • new science and mathematics: can “creativity” be modelled?

So we can’t, indeed, escape the question of modelling; once the mechanics of learning have been mastered, there lies the ex nihilo.

2.2 DONE 1.2

CLOSED: 2011-10-12 Wed 04:

Learning task: produces melodic answers to query phrases. Given a phrase that ends on a dominant, say, within a key; gives an appropriate response that ends on the tonic. Must follow a constrained set of progressions (subdominant to dominant, dominant to tonic, flat-six to Neapolitan, etc.), and be of an appropriate length.

task T: constructing answering phrases to musical prompts (chords)

performance measure P: percent of answers that return to the tonic at the end (given appropriate length and progression constraints)

training experience E: expert (Bach, Chopin, Beethoven) prompts and answers.

target function: $V : \text{progression} \to \mathbb{R}$; $V(b = \text{final tonic}) = 100$, $V(b = \text{final non-tonic}) = -100$.

target function representation: $\hat{V}(b) = w_0 + w_1 x_1$, where $x_1 = \text{length of prompt} - \text{number of chords in answer}$

2.5 TODO 1.5

3 Notes

3.1 Chapters

  • a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
  • neural network, hidden markov models, decision tree
  • artificial intelligence :: symbolic representations of concepts
  • bayesian :: estimating values of unobserved variables
  • statistics :: characterization of errors, confidence intervals
  • attributes of training experience:
    - type of training experience from which our system will learn
      ∗ direct or indirect feedback
        direct: individual checkers board states and the correct move for each
        indirect: move sequences, final outcomes
        · credit assignment: game can be lost even when early moves are optimal
    - degree to which learner controls sequence of training examples
    - how well it represents the distribution of examples over which the final system performance P must be measured
      ∗ mastery of one distribution of examples will not necessary (sic) lead to strong performance over some other distribution
  • task T: playing checkers; performance measure P: percent of games won; training experience E: games played against itself.
    1. the exact type of knowledge to be learned; 2. a representation for this target knowledge; 3. a learning mechanism.
  • program: generate legal moves: needs to learn how to choose the best move; some large search space
  • class for which the legal moves that define some large search space are known a priori, but for which the best search strategy is not known
  • target function :: choosemove : B -> M (some B from legal board states to some M from legal moves)
    - very difficult to learn given the kind of indirect training experience available
    - alternative target function: assigns a numerical score to any given board state
  • alternative target function :: V : B -> R (V maps legal board state B to some real value)
    - higher scores to better board states
  • V(b = finally won) = 100
  • V(b = finally lost) = -100
  • V(b = finally drawn) = 0
  • else V(b) = V(b') where b' is the best final board state starting from b and playing optimally until the end of the game (assuming the opponent plays optimally, as well).
    - red black trees? greedy optimization?
  • this definition is not efficiently computable; requires searching ahead to end of game.
  • nonoperational definition
  • goal: operational definition
  • function approximation: $\hat{V}$ (distinguished from ideal target function V)
  • the more expressive the representation, the more training data program will require to choose among alternative hypotheses
  • $\hat{V}$: linear combination of the following board features:
    $x_1$: number of black pieces
    $x_2$: number of red pieces
    $x_3$: number of black kings
    $x_4$: number of red kings
    $x_5$: number of black pieces threatened by red
    $x_6$: number of red pieces threatened by black
  • $\hat{V}(b) = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + w_5 x_5 + w_6 x_6$
  • $w_0, \ldots, w_6$ are weights chosen by the learning algorithm
  • partial design, learning program:

T playing checkers
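The linear evaluation function described in the notes above (a constant term plus one weight per board feature) can be sketched in a few lines of plain Scheme. This is only an illustration: the procedure name `v-hat`, the example weights and the example feature values are assumptions made here, not part of the original notes.

```scheme
;; Sketch only: `v-hat`, `example-weights` and `opening-features` are
;; hypothetical names and values chosen for illustration.

;; weights:  (w0 w1 ... w6), with w0 the constant term
;; features: (x1 ... x6), the six board features from the notes
(define (v-hat weights features)
  ;; Prepend x0 = 1 so that w0 is treated like every other weight.
  (apply + (map * weights (cons 1 features))))

;; Opening position: 12 black pieces, 12 red pieces, no kings,
;; nothing threatened.
(define opening-features '(12 12 0 0 0 0))

;; Made-up weights that reward black material and penalise red material.
(define example-weights '(0 1 -1 2 -2 -1 1))

;; 0*1 + 1*12 - 1*12 + 0 + 0 + 0 + 0 = 0: a balanced position.
(display (v-hat example-weights opening-features))
(newline)
```

Prepending a constant feature of 1 keeps the code a single dot product, which is also the form the weight-update rule later in the notes works on.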

is a risk function, corresponding to the expected value of the squared error loss or quadratic loss... the difference occurs because of randomness or because the estimator doesn’t account for information that could produce a more accurate estimate.

http://en.wikipedia.org/wiki/Mean_squared_error

  • thus we seek the weights, or equivalently the $\hat{V}$, that minimize E for the observed training examples
    - damn, statistics would make this all intuitive and clear
  • several algorithms are known for finding weights of a linear function that minimize E; we require an algorithm that will incrementally refine the weights as new training examples become available and that will be robust to errors in these estimated training values.
  • one such algorithm is called the least mean squares, or LMS training rule.
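The definition of E does not survive in this excerpt. The sketch below assumes the usual sum of squared errors from Mitchell's chapter 1, $E = \sum (V_{train}(b) - \hat{V}(b))^2$ over the training examples. The names `squared-error` and `training-examples`, and the numbers, are made up for illustration.

```scheme
;; Sketch only: assumes E is the sum of squared differences between the
;; training value and the current estimate; all names and data here are
;; hypothetical.

(define (v-hat weights features)
  (apply + (map * weights (cons 1 features))))

;; examples: list of (features . v-train) pairs
(define (squared-error weights examples)
  (let loop ((examples examples) (sum 0))
    (if (null? examples)
        sum
        (let* ((example (car examples))
               (err (- (cdr example) (v-hat weights (car example)))))
          (loop (cdr examples) (+ sum (* err err)))))))

(define training-examples
  (list (cons '(12 12 0 0 0 0) 0)     ; balanced opening, labelled 0
        (cons '(3 0 1 0 0 0) 100)))   ; black has won, labelled 100

;; First example: estimate 0, error 0. Second: estimate 3, error 97.
;; E = 0 + 97^2 = 9409.
(display (squared-error '(0 1 -1 0 0 0 0) training-examples))
(newline)
```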

least mean squares (LMS) algorithms are a class of adaptive filter used to mimic a desired filter by finding the filter coefficients that relate to producing the least mean squares of the error signal (difference between the desired and the actual signal). it is a stochastic gradient descent method in that the filter is only adapted based on the error at the current time.

the idea behind LMS filters is to use steepest descent to find filter weights $\hat{h}(n)$ which minimize a cost function:

$$C(n) = E\{|e(n)|^2\}$$

where $e(n)$ is the error at the current sample $n$ and $E\{\cdot\}$ denotes the expected value. this cost function is the mean square error, and is minimized by the LMS. applying steepest descent means to take the partial derivatives with respect to the individual entries of the filter coefficient (weight) vector, where $\nabla$ is the gradient operator:

$$\hat{h}(n+1) = \hat{h}(n) - \frac{\mu}{2} \nabla C(n) = \hat{h}(n) + \mu E\{x(n) e^*(n)\}$$

where $\frac{\mu}{2}$ is the step size. that means we have found a sequential update algorithm which minimizes the cost function. unfortunately, this algorithm is not realizable until we know $E\{x(n) e^*(n)\}$.

for most systems, the expectation function must be approximated. this can be done with the following unbiased estimator:

$$\hat{E}\{x(n) e^*(n)\} = \frac{1}{N} \sum_{i=0}^{N-1} x(n-i) e^*(n-i)$$

where $N$ indicates the number of samples we use for that estimate. the simplest case is $N = 1$:

$$\hat{h}(n+1) = \hat{h}(n) + \mu x(n) e^*(n)$$

http://en.wikipedia.org/wiki/Least_mean_squares_filter

in probability theory and statistics, the expected value (or expectation value, or mathematical expectation, or mean, or first moment) of a random variable is the integral of the random variable with respect to its probability measure. for discrete random variables this is equivalent to the probability-weighted sum of the possible values. for continuous random variables with a density function it is the probability density-weighted integral of the possible values. it is often helpful to interpret the expected value of a random variable as the long-run average value of the variable over many independent repetitions of an experiment. the expected value, when it exists, is almost surely the limit of the sample mean as sample size grows to infinity.

http://en.wikipedia.org/wiki/Expected_value
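For the discrete case, the "probability-weighted sum of the possible values" quoted above is easy to make concrete. The following is a small sketch, with `expected-value` and `fair-die` as names invented here, using a fair six-sided die as the example distribution.

```scheme
;; Sketch only: `expected-value` and `fair-die` are hypothetical names.

;; outcomes: list of (value . probability) pairs whose probabilities
;; are assumed to sum to 1
(define (expected-value outcomes)
  (apply + (map (lambda (outcome)
                  (* (car outcome) (cdr outcome)))
                outcomes)))

;; A fair die: faces 1..6, each with probability 1/6.
(define fair-die
  (map (lambda (face) (cons face (/ 1.0 6)))
       '(1 2 3 4 5 6)))

;; (1 + 2 + 3 + 4 + 5 + 6) / 6, i.e. approximately 3.5 in floating point.
(display (expected-value fair-die))
(newline)
```

Floating-point probabilities are used so that the sketch behaves the same on Scheme implementations with and without exact rationals.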

- damn, every time we encroach on something interesting; find out why differential equations, linear algebra, probability and statistics are so important. that’s like two years of fucking work, isn’t it? or at least one? maybe it’s worth it, if we can pull it

  • LMS weight update rule: for each training example $\langle b, V_{train}(b) \rangle$:
    - use the current weights to calculate $\hat{V}(b)$
    - for each weight $w_i$, update it as: $w_i \leftarrow w_i + \eta (V_{train}(b) - \hat{V}(b)) x_i$
  • here $\eta$ is a small constant (e.g., 0.1) that moderates the size of the weight update.
  • notice that when the error $V_{train}(b) - \hat{V}(b)$ is zero, no weights are changed. when $V_{train}(b) - \hat{V}(b)$ is positive (i.e., when $\hat{V}(b)$ is too low), then each weight is increased in proportion to the value of its corresponding feature. this will raise the value of $\hat{V}(b)$, reducing the error. notice that if the value of some feature $x_i$ is zero, then its weight is not altered regardless of the error, so that the only weights updated are those whose features actually occur on the training example board.
    - mastering these things takes practice; the practice, indeed, of mastering things; long haul, if crossfit, for instance, is to be believed; and raising kids
    - don’t forget: $V_{train}(b)$ (for intermediate values) is $\hat{V}(\text{Successor}(b))$, where $\hat{V}$ is the learner’s current approximation to V and where Successor(b) denotes the next board state following b for which it is again the program’s turn to move
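The update rule above can be tried out on a single training example. This is a minimal sketch, assuming the weights and features are plain lists and that the constant term $w_0$ is paired with a fixed feature $x_0 = 1$. The names `lms-update`, `initial-weights` and `board-features` and the numbers are illustrative, not from the notes.

```scheme
;; Sketch only: one LMS step on one hypothetical training example.

(define (v-hat weights features)
  (apply + (map * weights (cons 1 features))))

;; w_i <- w_i + eta * (v-train - v-hat(b)) * x_i, with x_0 = 1 for w_0.
(define (lms-update weights features v-train eta)
  (let ((err (- v-train (v-hat weights features))))
    (map (lambda (w x) (+ w (* eta err x)))
         weights
         (cons 1 features))))

(define initial-weights '(0 0 0 0 0 0 0))
;; 3 black pieces, 1 black king, no red pieces: a won board for black.
(define board-features '(3 0 1 0 0 0))

(define updated-weights
  (lms-update initial-weights board-features 100 0.1))

;; The error was 100, so w0, w1 and w3 grow by about 10, 30 and 10;
;; weights whose feature is zero stay where they were.
(display updated-weights)
(newline)

;; The estimate moves from 0 to about 110, so the absolute error
;; shrinks from 100 to about 10.
(display (v-hat updated-weights board-features))
(newline)
```

With η = 0.1 the estimate overshoots slightly here, because the squared feature values sum to 11. A smaller η would approach the training value more gradually.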

- program represents the learned eval function using an artificial neural network that considers the complete description of the board state rather than a subset of board features.

  • nearest neighbor :: store training examples, try to find “closest” stored situation
  • genetic algorithm :: generate large number of candidate checkers programs allow them to play against each other, keeping only the most successful programs
  • explanation-based learning :: analyze reasons underlying specific successes and failures
  • learning involves searching a very large space of possible hypotheses to determine one that best fits the observed data and any prior knowledge held by the learner.
  • many chapters present algorithms that search a hypothesis space defined by some underlying representation (linear functions, logical descriptions, decision trees, neural networks); for each of these hypothesis representations, the corresponding learning algorithm takes advantage of a different underlying structure to organize the search through the hypothesis space.
    - ... confidence we can have that a hypothesis consistent with the training data will correctly generalize to unseen examples
  • what algorithms exist?
  • how much training data?
  • prior knowledge?
  • choosing useful next training experience?
  • how to reduce the learning task to one or more function approximation problems?
  • can the learner alter its representation to improve its ability to represent and learn the target function?
  • determine type of training experience (games against experts, games against self, table of correct moves,... ); determine target function (board -> move, board -> value,... ); determine representation of learned function (polynomial, linear function, neural network,... ); determine learning algorithm (gradient descent, linear programming,... ).

3.2 Exercises

From page 11: “The LMS training rule can be viewed as performing a stochastic gradient-descent search through the space of possible hypotheses (weight values) to minimize the squared error E.”

  • Gradient descent is a first-order optimization algorithm. To find a local minimum of a function... one takes steps proportional to the negative of the gradient of the function at the current point.
    - If one takes steps proportional to the positive of the gradient, one approaches a local maximum: gradient ascent.
  • Known as steepest descent.
  • If $F(x)$ is defined and differentiable in a neighborhood of point $a$, $F(x)$ decreases fastest if one goes from $a$ in the direction of the negative gradient of $F$ at $a$, $-\nabla F(a)$.
  • If $b = a - \gamma \nabla F(a)$ for a small enough $\gamma > 0$, then $F(a) \geq F(b)$.
  • One starts with a guess $x_0$ for a local minimum of $F$, and considers the sequence $x_0, x_1, \ldots$ such that $x_{n+1} = x_n - \gamma_n \nabla F(x_n)$, $n \geq 0$.
  • We have $F(x_0) \geq F(x_1) \geq \cdots$.
  • Gradient descent can be used to solve a system of linear equations, reformulated as a quadratic minimization problem, e.g., using linear least squares.
  • Convergence can be made faster by using an adaptive step size.
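The iteration $x_{n+1} = x_n - \gamma \nabla F(x_n)$ from the list above can be sketched for a one-dimensional function with a fixed step size. The function $F(x) = (x - 3)^2$, the step size 0.1 and the names `gradient-descent`, `f` and `f-gradient` are all assumptions chosen for the example. The adaptive step size mentioned in the last bullet is not implemented.

```scheme
;; Sketch only: fixed-step gradient descent in one dimension.

;; Repeat x <- x - gamma * gradient(x) for a given number of steps.
(define (gradient-descent gradient x0 gamma steps)
  (let loop ((x x0) (n 0))
    (if (= n steps)
        x
        (loop (- x (* gamma (gradient x))) (+ n 1)))))

;; F(x) = (x - 3)^2 has its minimum at x = 3; F'(x) = 2 (x - 3).
(define (f x) (* (- x 3) (- x 3)))
(define (f-gradient x) (* 2 (- x 3)))

;; Each step multiplies the distance to 3 by 0.8, so 100 steps from
;; x0 = 0 end up very close to 3.
(display (gradient-descent f-gradient 0.0 0.1 100))
(newline)
```

For this quadratic, any fixed γ strictly between 0 and 1 converges, which is a concrete instance of the "small enough γ" condition.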

Training (games against self) → V (board → value) → Representation (linear function) → Algorithm (gradient descent) → Design

Figure 1: Summary of design

Experiment generator: takes as input the current hypothesis and outputs a new problem for the performance system to explore. Our experiment generator always proposes the same initial game board. More sophisticated

$x_9$: X empty corner?
$x_{10}$: O empty corner?
$x_{11}$: X empty side?
$x_{12}$: O empty side?

Page 8: “In general, this choice of representation involves a crucial tradeoff. On one hand, we wish to pick a very expressive representation to allow representing as close an approximation as possible to the ideal target function V. On the other hand, the more expressive the representation, the more training data the program will require in order to choose among the alternative hypotheses it can represent.”

Here’s a crazy thought: since the state-space complexity of tic-tac-toe is utterly tractable, let’s have nine features: one corresponding to each of the squares.

How do we deal with training the opposite direction, by the way: invert the outcome of the training data?

I have no idea how much training data nine variables need: we’ll have to plot it; interesting to compare a strategy containing e.g. forks and wins. Is it interesting that each variable is binary?

Let’s start with the generalizer and a catalog of games; in order to map the number of training-examples... Ah, I see: the second player has a fixed evaluation function.

Can we abstract xkcd? Problem is, the space for O is much more complicated. Maybe we can abstract the Wikipedia strategy:

  1. Win
  2. Block
  3. Fork
  4. Block a fork
  5. Center
  6. Opposite corner
  7. Empty side

(It looks like the Wikipedia strategy was abstracted from here, by the way; damn: it looks like there are separate X- and O-heuristics.) Represent the board as a vector of nine values; can we set up abstractions for < x, y > as well as {map,reduce,for-each}-{row,column,diagonal,triplet}? Meh; maybe we can implement the X/O-agnostic heuristics.

;;;; Tic-tac-toe with heuristic player

(use debug vector-lib srfi- srfi- srfi-26)

;;;; General tic-tac-toe definitions

(define n (make-parameter 3))

(define (n-by-n) (* (n) (n)))

(define (row start) (iota (n) start))

(define (column start) (iota (n) start (n)))

(define (a) 0)

(define (b) (- (n) 1))

(define (c) (- (n-by-n) 1))

(define (d) (- (c) (- (n) 1)))

(define (ac-diagonal) (iota (n) (a) (+ (n) 1)))

(define (bd-diagonal) (iota (n) (b) (- (n) 1)))

(define (rows) (map row (iota (n) (a) (n))))

(define (columns) (map column (iota (n))))

(define (diagonals) (list (ac-diagonal) (bd-diagonal)))

(define (tuplets) (append (rows) (columns) (diagonals)))

        (vector-map (lambda (i mark)
                      (cond ((X? mark) "X")
                            ((O? mark) "O")
                            (else " ")))
                    board))))

(define (display-board board) (display (board->string board)))

(define (make-empty-board) (make-board (n-by-n) ))

;;; Functional variant of Knuth shuffle: partitions the cards around a
;;; random pivot, takes the first card of the right-partition, repeat.
(define shuffle
  (case-lambda
   ((deck) (shuffle '() deck))
   ((shuffled-deck deck)
    (if (null? deck)
        shuffled-deck
        (let ((pivot (random (length deck))))
          (let ((left-partition (take deck pivot))
                (right-partition (drop deck pivot)))
            (shuffle (cons (car right-partition) shuffled-deck)
                     (append left-partition (cdr right-partition)))))))))

(define (make-random-board)
  (let ((board (make-empty-board)))
    (let iter ((moves (random (n-by-n)))
               (indices (shuffle (iota (n-by-n)))))
      (if (zero? moves)
          board
          (let ((mark (random (length indices))))
            ;; You may end up with a board where there are more Os
            ;; than Xs.
            (vector-set! board (car indices) (if (even? moves) X O))
            (iter (- moves 1) (cdr indices)))))))

(define (fold-tuplet cons nil board)
  (fold (lambda (tuplet accumulatum)
          (cons tuplet accumulatum))
        nil
        (tuplets)))

;;;; Play mechanics

(define (empty-spaces board)
  (vector-fold (lambda (space empty-spaces mark)
                 (if (empty? mark)
                     (cons space empty-spaces)
                     empty-spaces))
               '()
               board))

;;; Putting tuplet first would allow you to use many boards.
(define (first-empty-space board tuplet)
  (find (cute empty? board <>) tuplet))

(define (winning-tuplet? player? tuplet board)
  (let ((non-player-marks
         (filter (cute (complement player?) board <>) tuplet)))
    (equal? (map (cute board-ref board <>) non-player-marks)
            `(,))))

;;; The solutions here may be non-unique: in which case, we have a
;;; convergent fork.
(define (winning-spaces player? board)
  (fold-tuplet
   (lambda (tuplet winning-spaces)
     (if (winning-tuplet? player? tuplet board)
         (cons (first-empty-space board tuplet) winning-spaces)
         winning-spaces))
   '()
   board))

(define (forking-space? player? player space board)
  (let ((board (board-copy board)))
    (board-set! board space player)
    (> (length (winning-spaces player? board)) 1)))

(define (forking-spaces player? player board)
  (filter (lambda (space)
            (forking-space? player? player space board))
          (empty-spaces board)))

(define (center-space board) (/ (- (n-by-n) 1) 2))

(define (center-empty? board)

(lambda (board) (random-empty-space board)))

;;; http://www.buzzle.com/articles/tic-tac-toe-strategy-guide.html
(define (make-heuristic-player player? player opponent? opponent)
  (lambda (board)
    (let ((my-winning-spaces (winning-spaces player? board)))
      (if (null? my-winning-spaces)
          (let ((losing-spaces (winning-spaces opponent? board)))
            (if (null? losing-spaces)
                (let ((my-forking-spaces
                       (forking-spaces player? player board)))
                  (if (null? my-forking-spaces)
                      (let ((opponent-forking-spaces
                             (forking-spaces opponent? opponent board)))
                        (if (null? opponent-forking-spaces)
                            (if (center-empty? board)
                                (center-space board)
                                (let ((opposite-corners
                                       (opposite-corners player? board)))
                                  (if (null? opposite-corners)
                                      (let ((empty-corners
                                             (empty-corners board)))
                                        (if (null? empty-corners)
                                            (random-empty-space board)
                                            (random-ref empty-corners)))
                                      (random-ref opposite-corners))))
                            (random-ref opponent-forking-spaces)))
                      (random-ref my-forking-spaces)))
                (random-ref losing-spaces)))
          (random-ref my-winning-spaces)))))

(define (make-heuristic-X-player) (make-heuristic-player X? X O? O))

(define (make-heuristic-O-player) (make-heuristic-player O? O X? X))

(define make-default-X-player (make-parameter make-heuristic-X-player))

(define make-default-O-player (make-parameter make-heuristic-O-player))

;;; Can we get rid of =move= if we simply cycle through X and O;
;;; thence recurse?
(define play
  (case-lambda
   (()
    (play (make-empty-board)))
   ((board)
    (play ((make-default-X-player))
          ((make-default-O-player))
          board))
   ((X-player O-player board)
    (play 0 X-player O-player board))
   ((move X-player O-player board)
    (debug move)
    (display-board board)
    (or (outcome board)
        (let-values (((token player)
                      (if (even? move)
                          (values X X-player)
                          (values O O-player))))
          (let ((next-move (player board)))
            (board-set! board next-move token)
            (play (+ move 1) X-player O-player board)))))))

(use test debug)

(include "tic-tac-toe.scm")

(let ((board (board X X O O X O O X)))
  (test "winning-spaces with X" '(2 2) (winning-spaces X? board))
  (test "winning-spaces with O" '(2) (winning-spaces O? board)))

(let ((board (board X X O O X O X)))
  (test "empty-spaces" '(7 2) (empty-spaces board)))

(let ((board (board X X O