
















Various aspects of machine learning, focusing on types of learning, protocols, and the process of learning a function from training data. It covers concepts such as direct learning, supervised learning, and the importance of determining the hypothesis space and loss function. The document also introduces the concept of gradient descent and its use in minimizing error.

















The previous lecture notes introduced the basic problems and questions involved in Machine Learning: How do we recognize the relevant features for a task in a particular domain? What are the basic issues involved in representing the features algorithmically? These lecture notes will discuss two basic paradigms, PAC (Risk Minimization) and Bayesian Theory; various learning protocols (Online/Batch; Supervised/Unsupervised/Semi-supervised); and some common algorithms such as Decision Trees, Rules and ILP, Linear Threshold Units, Probabilistic Representations, and Unsupervised/Semi-supervised methods. They will also discuss Clustering and Dimensionality Reduction.
Many things can be learned. In classification tasks, for instance, it might be a matter of learning a hidden function. A face-recognition task, for example, involves presenting examples of what a face is and trying to discern how to classify it most accurately. This is a type of concept learning in which the machine learns to recognize a concept across new instances based on previous input. Classification tasks are also used for diagnosis in areas such as medical diagnosis and risk assessment. It is also necessary to learn models. For example, a machine can learn a map and use it to navigate, learn a distribution and use it to answer queries, learn a language model, or learn to make multiple related decisions.
In a language model, a probability can be assigned to each legitimate sentence, with non-sentences receiving zero probability and likely sentences receiving a high probability. Once the model exists, specific predictions can be made about the language. It is also possible to learn different skills, such as how to play games or how to plan. There can also be a task of acquiring a representation and using it for reasoning. Using clustering, a machine can learn the shapes of objects, functionality, segmentation, and other abstract concepts. Most of the work in Machine Learning is either classification or uses classification as a building block.
Direct Learning involves learning a function that maps an input instance to the sought-after property. In Model Learning, a model of a domain is learned and then used to answer various questions about the domain. These notes will focus more on direct learning. In both types of learning, there are several learning protocols. In supervised learning, someone or something provides the labels for the concept; the data, in this case, already has labels. With unsupervised learning there is data, but no labels. Sometimes data is only partially labeled; in this case, the protocol is semi-supervised learning. As humans, we usually learn about the world and develop concepts, but when working with Machine Learning, something must convert the representation of the real-world data into a labeled state. This can be done by using domain knowledge to construct labels or with a sophisticated algorithm. Developing a theory for how to convert data into a supervised state is an open area of research. Some other learning protocols include reinforcement learning and teaching. In teaching, the teacher should provide the students with helpful examples of the concept, and the students, if they learn successfully, learn a rule to recognize the concept.
3 What is learning?
As an exercise in working with labelled data, students were asked to look at a Badges Game. In this game, a list of names was provided and some were labelled with a ‘+’ and some with a ‘-’. Students were told that the label was assigned based
5 A Learning Problem
Assuming there is a black box that computes a function, f, of four inputs, x1, x2, x3, x4, with output y, how is the function learned? The table below shows such a function, f. There are various ways that f could be described based on the data. Assuming that there exists a function which can describe the data, it could be characterized by a series of negative disjunctions. The result would be a Boolean function that works on the training set of examples over x1, x2, x3, x4.
Table 1: Sample Learning Problem
Example  x1  x2  x3  x4  y
1        0   0   1   0   0
2        0   1   0   0   0
3        0   0   1   1   1
4        1   0   0   1   1
5        0   1   1   0   0
6        1   1   0   0   0
7        0   1   0   1   0
Given the training data, the function f could be ¬x2 ∧ x4. This function completely describes the training data. The function f could also be (x1 ∨ x3) ∧ x4, which also completely describes the training data. Both functions are correct for the training data, but they are not the same function. When a new example, xn, is presented, the two functions can make different predictions. Table 1 only shows a portion of the possible values for four variables.
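The claim can be checked mechanically. The sketch below assumes the two candidate hypotheses are ¬x2 ∧ x4 and (x1 ∨ x3) ∧ x4 (the negation and grouping are chosen so that both agree with every row of Table 1); the names `h_a`, `h_b`, and `unseen` are illustrative only.

```python
# Table 1 as (x1, x2, x3, x4) -> y
TABLE_1 = [
    ((0, 0, 1, 0), 0),
    ((0, 1, 0, 0), 0),
    ((0, 0, 1, 1), 1),
    ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0),
    ((1, 1, 0, 0), 0),
    ((0, 1, 0, 1), 0),
]

def h_a(x):
    """Candidate hypothesis (not x2) and x4."""
    x1, x2, x3, x4 = x
    return int((not x2) and x4)

def h_b(x):
    """Candidate hypothesis (x1 or x3) and x4."""
    x1, x2, x3, x4 = x
    return int((x1 or x3) and x4)

def consistent(h, data):
    """True if h reproduces every label in data."""
    return all(h(x) == y for x, y in data)

unseen = (0, 0, 0, 1)  # an input that is not among the seven training rows

print("h_a consistent:", consistent(h_a, TABLE_1))
print("h_b consistent:", consistent(h_b, TABLE_1))
print("predictions on unseen input:", h_a(unseen), h_b(unseen))
```

Both hypotheses fit the training set, yet they disagree on the unseen input, which is why the training data alone cannot single out f.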
Very little can actually be known about f based on the training data in Table 1. There are 2^16 = 65536 possible functions over four input features. There is no way to know which is the correct function until every possible input-output pair has been labeled. After the seven examples in Table 1, 2^9 = 512 possibilities for f still remain, because 9 of the 16 possible inputs are still unlabeled. Given the size of the hypothesis space, is learning even possible? As examples are labeled, the list of possible functions narrows significantly. But no simple rule can explain the data.
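The counting argument can be reproduced by brute force. This sketch enumerates every Boolean function of four inputs as a truth table and keeps those that agree with the seven rows of Table 1; the helper name `count_consistent` is an assumption made for illustration.

```python
from itertools import product

# Table 1 as a mapping (x1, x2, x3, x4) -> y
TABLE_1 = {
    (0, 0, 1, 0): 0,
    (0, 1, 0, 0): 0,
    (0, 0, 1, 1): 1,
    (1, 0, 0, 1): 1,
    (0, 1, 1, 0): 0,
    (1, 1, 0, 0): 0,
    (0, 1, 0, 1): 0,
}

inputs = list(product((0, 1), repeat=4))  # the 16 possible input vectors

def count_consistent(labeled):
    """Count Boolean functions on 4 inputs that agree with the labeled examples."""
    total = 0
    # Each assignment of 16 output bits is one function.
    for outputs in product((0, 1), repeat=len(inputs)):
        f = dict(zip(inputs, outputs))
        if all(f[x] == y for x, y in labeled.items()):
            total += 1
    return total

n_functions = 2 ** len(inputs)
n_remaining = count_consistent(TABLE_1)
print(n_functions, n_remaining)
```

Each labeled example halves the set of surviving functions, so seven examples leave 2^(16-7) of them.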
It is possible to consider all of the possible conjunctive rules of the form y = xi ∧ xj ∧ xk. There are only 16 such rules, and a counterexample, based on the data in Table 1, exists for each.
Table 2: Sample Learning Problem
Rule                  Counterexample (x1x2x3x4 y)
y = c
x1                    1100 0
x2                    0100 0
x3                    0110 0
x4                    0101 0
x1 ∧ x2               1100 0
x1 ∧ x3               0011 1
x1 ∧ x4               0011 1
x2 ∧ x3               0011 1
x2 ∧ x4               0011 1
x3 ∧ x4               1001 1
x1 ∧ x2 ∧ x3          0011 1
x1 ∧ x2 ∧ x4          0011 1
x1 ∧ x3 ∧ x4          0011 1
x1 ∧ x2 ∧ x3 ∧ x4     0011 1
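A short search confirms that every conjunctive rule is refuted by Table 1. The sketch below treats a rule as the conjunction of a subset of the four variables, with the empty subset standing for the constant rule; that encoding is an assumption. The first counterexample it finds may differ from the one listed in Table 2, since several rows can refute the same rule.

```python
from itertools import combinations

TABLE_1 = [
    ((0, 0, 1, 0), 0),
    ((0, 1, 0, 0), 0),
    ((0, 0, 1, 1), 1),
    ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0),
    ((1, 1, 0, 0), 0),
    ((0, 1, 0, 1), 0),
]

def conjunction(indices):
    """Rule y = AND of x[i] for i in indices (the empty set gives the constant 1)."""
    return lambda x: int(all(x[i] for i in indices))

def first_counterexample(rule, data):
    """Return the first (x, y) on which the rule is wrong, or None."""
    for x, y in data:
        if rule(x) != y:
            return x, y
    return None

# All subsets of {x1, x2, x3, x4}, written with 0-based indices.
rules = [s for k in range(5) for s in combinations(range(4), k)]
counterexamples = {s: first_counterexample(conjunction(s), TABLE_1) for s in rules}

for subset, cx in counterexamples.items():
    name = " ∧ ".join(f"x{i + 1}" for i in subset) or "constant"
    print(f"{name:20s} {cx}")
```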
There is no simple rule of this form which explains the data. The same is true for simple clauses. It is possible to consider a more complex type of rule for the hypothesis space. For example, m-of-n rules can be used. There are 32 possible rules of the form ‘y = 1 if and only if at least m of the following n variables are 1.’ Table 3 provides the index of the counterexample from Table 1, and ‘***’ represents a case where no counterexample is found. With this representation of the hypothesis space, a consistent hypothesis was found.
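The m-of-n search can be sketched in the same way. The encoding below is an assumption: a rule is a pair (m, subset) with 1 ≤ m ≤ |subset|, and it fires when at least m of the chosen variables are 1. Under that encoding the enumeration yields 32 candidate rules.

```python
from itertools import combinations

TABLE_1 = [
    ((0, 0, 1, 0), 0),
    ((0, 1, 0, 0), 0),
    ((0, 0, 1, 1), 1),
    ((1, 0, 0, 1), 1),
    ((0, 1, 1, 0), 0),
    ((1, 1, 0, 0), 0),
    ((0, 1, 0, 1), 0),
]

def m_of_n(m, subset):
    """Rule: y = 1 iff at least m of the variables in subset are 1."""
    return lambda x: int(sum(x[i] for i in subset) >= m)

# Every non-empty subset of the four variables, with every valid m.
rules = [
    (m, subset)
    for n in range(1, 5)
    for subset in combinations(range(4), n)
    for m in range(1, n + 1)
]

consistent = [
    (m, subset)
    for m, subset in rules
    if all(m_of_n(m, subset)(x) == y for x, y in TABLE_1)
]

print(len(rules), "rules;", "consistent:", consistent)
```

With 0-based indices, the single surviving rule (2, (0, 2, 3)) reads "at least 2 of {x1, x3, x4}".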
6 Views of Learning
Learning is the removal of our remaining uncertainty. Suppose, in the example above, that it was known that the unknown function was an m-of-n Boolean
automatically. In this case, the hypothesis space has a flexible size. There are advantages to fixed versus flexible hypothesis sizes. But, no matter what type of hypothesis spaces are used, it is important to develop algorithms for finding a hypothesis that fits the data and hope that it generalizes well. It is also important to be able to quantify how well an algorithm is expected to perform on unseen examples.
8 Terminology
This section presents some of the basic vocabulary:
Training examples are examples of the form (x, f(x)).
A target function or concept is the true, unknown function f.
A hypothesis is a proposed function h, believed to be similar to f.
A concept is a Boolean function. Examples for which f(x) = 1 are positive examples, and those for which f(x) = 0 are negative examples (instances).
A classifier is a discrete-valued function. The possible values of f, {1, 2, ..., K}, are the classes or class labels. (In most algorithms, the classifier will actually return a real value that has to be interpreted.)
A hypothesis space is the space of all hypotheses that can, in principle, be output by the learning algorithm.
The version space is the space of all hypotheses in the hypothesis space that have not yet been ruled out.
9 Key Issues in Machine Learning
The key issues in Machine Learning are modeling the problem, representing the problem, and choosing the best algorithms.
Modeling involves thinking about how to formulate application problems as machine learning problems. Various learning protocols can be chosen based on where the data is coming from and how it is represented. There are various applications for modeling. For instance, in e-mail, if a seminar announcement is received, the relevant information can be extracted and put into a calendar, or a message can be categorized according to the appropriate folder or mail label. In the domain of image processing, machine learning could be used to give photographs the proper rotation. There can be models which recognize a particular user and perform some operation only for that user, for example, unlocking the office door. A language model could be used for context-sensitive spelling rules and work from an online server or be incorporated into word processing software.
Given a particular problem, the best means of representing it will vary. Later these notes will discuss how to determine what is a good hypothesis space and look for rigorous ways to define this.
The algorithm used is also an important factor in machine learning. It is important to decide what is a good algorithm for the problem and to define ‘success’ in a quantifiable way. Algorithms need to avoid being too general but also avoid overfitting the data. For example, if someone wanted to model the concept of a tree, she could use various intuitions about the concept. In an extreme case, she could ask a botanist, who is very rigorous in her field of study and defines a tree as ‘something with leaves that I have seen before’. For a second informant, she asks her brother, who isn’t terribly concerned with the details and defines a tree as ‘a green thing’. When presented with a new tree, neither definition is ideal. The botanist’s definition will not be general enough to accept new input in many cases. Her brother’s definition will be too general and will include things such as shrubs, broccoli, and lizards in the ‘tree’ concept.
There are also computational issues with the algorithm chosen, with very expressive algorithms often being more costly.
10 An Example
In the sentence below, there are two possible spellings, but only one is correct in the context:
I don’t know {whether, weather} to laugh or cry.
How can this distinction be made into a learning problem? The problem is to find a function F : Sentences → {whether, weather}. But the domain of the function must first be defined better. One possibility is to define a Boolean feature x_w for each word w in English, where x_w = 1 iff w is in the sentence. This mapping takes a sentence to a point in {0, 1}^50,000, based on approximately how
x = data representation
w = the classifier
y = sgn{x^T w}
The weights, w, are real numbers, and the vectors are column vectors. The function computes x^T w and takes its sign; one sign represents whether and the other represents weather.
Not all functions are linearly separable. For example, XOR (exclusive-or) and its complement, (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2), cannot be represented with a linear function. More generally, these are parity functions:
$$x_i \in \{0, 1\}, \qquad f(x_1, x_2, \ldots, x_n) = 1 \iff \sum_i x_i \text{ is even}$$
There is no way a linear function in the x1-by-x2 space will solve the learning problem. The next section will discuss how to address the fact that linear functions are not universal representations while still restricting the problem to one that is solved using only linear functions.
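The contrast between a separable and a non-separable function can be illustrated with a small search. The sketch assumes the threshold convention "output 1 iff w1·x1 + w2·x2 − θ ≥ 0" and looks only at a small integer grid, so a failed search is an illustration rather than a proof; the function name `find_separator` is hypothetical.

```python
from itertools import product

POINTS = list(product((0, 1), repeat=2))  # the four Boolean inputs

def find_separator(target, grid=range(-4, 5)):
    """Search integer (w1, w2, theta) such that
    target(x1, x2) == [w1*x1 + w2*x2 - theta >= 0] on all four points."""
    for w1, w2, theta in product(grid, repeat=3):
        if all(int(w1 * x1 + w2 * x2 - theta >= 0) == target(x1, x2)
               for x1, x2 in POINTS):
            return w1, w2, theta
    return None

def and_fn(x1, x2):
    return x1 & x2

def xor_fn(x1, x2):
    return x1 ^ x2

def even_parity(x1, x2):
    return 1 - (x1 ^ x2)  # 1 iff x1 + x2 is even

print("AND:        ", find_separator(and_fn))
print("XOR:        ", find_separator(xor_fn))
print("even parity:", find_separator(even_parity))
```

A separator is found for AND, while none exists for XOR or for even parity, on this grid or on any other.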
12 Functions Can be Made Linear
A function can sometimes be made linear if the variables and space are re-defined. For example, assume that the function describing the occurrence of whether vs. weather can be represented in Disjunctive Normal Form (DNF):
x1x2x4 ∨ x2x4x5 ∨ x1x3x7
The data can be made linearly separable if new variables, y1, ..., yn, are introduced, where each y represents a conjunction of three x variables.
Space: X = x1, x2, ..., xn
New Space: Y = {y1, y2, ...} = {xi xj xk}
After this transformation of the input, the function is a disjunction of the new variables, and a disjunction is a linear function.
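The transformation can be made concrete for the DNF x1x2x4 ∨ x2x4x5 ∨ x1x3x7. The sketch assumes seven input variables, one new variable per triple of inputs (35 in total), weight 1 on the three triples that occur in the DNF, and threshold 1; these choices are illustrative.

```python
from itertools import combinations, product

N = 7
TRIPLES = list(combinations(range(N), 3))   # one new variable y_k per triple
TERMS = [(0, 1, 3), (1, 3, 4), (0, 2, 6)]   # x1x2x4, x2x4x5, x1x3x7 (0-based)

def dnf(x):
    """The original DNF over x in {0,1}^7."""
    return int(any(all(x[i] for i in term) for term in TERMS))

def transform(x):
    """Map x to y in {0,1}^35: y_k = 1 iff all three inputs of triple k are 1."""
    return [int(all(x[i] for i in t)) for t in TRIPLES]

# Linear threshold unit in the new space: weight 1 on the DNF's terms, 0 elsewhere.
weights = [1 if t in TERMS else 0 for t in TRIPLES]
theta = 1

def linear(x):
    y = transform(x)
    return int(sum(w * yi for w, yi in zip(weights, y)) - theta >= 0)

agree = all(dnf(x) == linear(x) for x in product((0, 1), repeat=N))
print(len(TRIPLES), "new variables; linear unit matches DNF:", agree)
```

The price of linearity is dimensionality: the new space has one coordinate per triple rather than one per input variable.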
Sometimes the data is not separable in one dimension, or it is not separable when the problem is restricted to a certain class of functions. For example, data that is not separable by the function class in the original x space might be separable in ⟨x, x^2⟩ space. In this case, the information source only provides the x dimension, but the algorithm is given both the x and x^2 dimensions.
13 A General Framework for Learning
The goal of learning is to predict an unobserved output value y ∈ Y based on an observed input vector x ∈ X. In order to do this, a functional relationship y = f(x) is estimated from a set of examples $\{(x_i, y_i)\}_{i=1}^{n}$. There are various ways of characterizing the problem; the most relevant here is classification, with y ∈ {0, 1} or y ∈ {1, 2, ..., k}, although regression, with y ∈ ℝ, can be described in the same framework.
Part of finding the best function is determining what we want f(x) to satisfy. Ideally, we want to minimize the loss (risk):
$$L(f) = E_{x,y}\big[\,[f(x) \neq y]\,\big]$$
where $E_{x,y}$ denotes the expectation with respect to the true distribution and [...] is an indicator function. Intuitively, the loss is simply the expected number of mistakes. If the errors are minimized, the function should behave well on unseen examples.
Unfortunately, it is not possible to minimize the loss as defined above, because the true distribution is unknown. Instead, the empirical classification error is minimized. For a set of training examples $\{(X_i, Y_i)\}_{i=1}^{n}$, the following function is minimized:
$$L'(f) = \frac{1}{n} \sum_{i=1}^{n} [f(X_i) \neq Y_i]$$
This minimization problem is typically NP-hard. To alleviate this computational problem, a new function is minimized in its place: a convex upper bound of the classification error function
$$I(f(x), y) = [f(x) \neq y] = \begin{cases} 1 & \text{when } f(x) \neq y \\ 0 & \text{otherwise} \end{cases}$$
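The relationship between the empirical classification error and a convex surrogate can be illustrated numerically. The sketch below makes several assumptions that are not in the notes: labels are coded as −1/+1, the classifier outputs a real-valued score whose sign is the prediction, and squared loss is used as the convex upper bound. The scores are made up for the example.

```python
def zero_one(score, y):
    """Classification error [sgn(score) != y] for labels y in {-1, +1}."""
    prediction = 1 if score >= 0 else -1
    return int(prediction != y)

def squared(score, y):
    """Squared loss (score - y)^2, convex in the score."""
    return (score - y) ** 2

def empirical_risk(loss, scores, labels):
    """Average loss over the training set: (1/n) * sum_i loss(f(X_i), Y_i)."""
    return sum(loss(s, y) for s, y in zip(scores, labels)) / len(labels)

scores = [2.0, 0.3, -0.4, -1.5, 0.1]   # hypothetical real-valued outputs f(X_i)
labels = [1, 1, 1, -1, -1]

error_rate = empirical_risk(zero_one, scores, labels)
surrogate = empirical_risk(squared, scores, labels)
upper_bound_holds = all(squared(s, y) >= zero_one(s, y)
                        for s, y in zip(scores, labels))

print(error_rate, surrogate, upper_bound_holds)
```

With this coding, a misclassified example always has a score at least 1 away from its label, so the squared loss is at least 1 whenever the 0/1 error is 1.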
14 Learning as an Optimization Problem
The loss must also be defined. A loss function, L(f(x), y), measures the penalty incurred by a classifier f on an example (x, y), where y represents the truth and f(x) represents the prediction. If the prediction is correct, the loss is zero. If not, the loss grows with the distance between the prediction and the truth.
Figure 3: Blown-up Feature Space
17 Expressivity
Many functions are linear and many are not. Conjunctions are linear. For example, the conjunction y = x1 ∧ x3 ∧ x5 can be expressed by y = sgn{1 · x1 + 1 · x3 + 1 · x5 − 3}. The function ‘at least m of n’ is also linear: y = at least 2 of {x1, x3, x5} can be expressed as y = sgn{1 · x1 + 1 · x3 + 1 · x5 − 2}. Probabilistic classifiers can be defined as well. Among the non-linear functions are XOR, y = (x1 ∧ ¬x2) ∨ (¬x1 ∧ x2), and most non-trivial DNFs, such as y = (x1 ∧ x2) ∨ (x3 ∧ x4). Many of these can, however, be made linear. The canonical representation is
$$f(x) = \mathrm{sgn}\{x^T \cdot w - \Theta\} = \mathrm{sgn}\Big\{\sum_{i=1}^{n} w_i x_i - \Theta\Big\}$$
where sgn{x · w − Θ} ≡ sgn{x′ · w′}, with x′ = (x, −1) and w′ = (w, Θ). Once the data is moved from an n-dimensional representation to an (n + 1)-dimensional representation, it is possible to look for hyperplanes that go through the origin.
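These threshold units can be checked exhaustively. The sketch assumes the convention sgn{z} = 1 when z ≥ 0 and 0 otherwise, which is the reading under which the thresholds 3 and 2 give a conjunction and an at-least-2 rule; the helper names are hypothetical.

```python
from itertools import product

def sgn(z):
    """Threshold convention assumed here: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def ltu(w, theta):
    """Linear threshold unit f(x) = sgn(w . x - theta)."""
    return lambda x: sgn(sum(wi * xi for wi, xi in zip(w, x)) - theta)

def ltu_through_origin(w, theta):
    """The same unit written as sgn(w' . x') with x' = (x, -1), w' = (w, theta)."""
    w_prime = list(w) + [theta]
    return lambda x: sgn(sum(wi * xi for wi, xi in zip(w_prime, list(x) + [-1])))

w = [1, 0, 1, 0, 1]          # weights on x1..x5; only x1, x3, x5 matter
conj = ltu(w, 3)             # x1 AND x3 AND x5
at_least_2 = ltu(w, 2)       # at least 2 of {x1, x3, x5}

inputs = list(product((0, 1), repeat=5))
conj_ok = all(conj(x) == (x[0] & x[2] & x[4]) for x in inputs)
m_of_n_ok = all(at_least_2(x) == int(x[0] + x[2] + x[4] >= 2) for x in inputs)
same_unit = all(ltu(w, 2)(x) == ltu_through_origin(w, 2)(x) for x in inputs)

print(conj_ok, m_of_n_ok, same_unit)
```

Folding Θ into the weight vector changes nothing about the predictions; it only moves the separating hyperplane through the origin of the larger space.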
18 LMS: An Online, Local Search Algorithm
The LMS algorithm is a search algorithm that starts with a guess and moves towards minimizing the error. First the hypothesis space must be determined, for example, Linear Threshold Units. Then the loss function, rather than a count of errors, is chosen, for example, squared loss or LMS (Least Mean Square, L2). Then a search procedure such as Gradient Descent is used.
Let $\mathbf{w}^{j}$ be the current weight vector. The prediction on the d-th example $\mathbf{x}_d$ is therefore:
$$o_d = \sum_i w_i^{j} \cdot x_{id} = \mathbf{w}^{j} \cdot \mathbf{x}_d$$
where the subscript i is the vector component, the superscript j is the time, and the subscript d is the example number. Let $t_d$ be the target value for this example (a real value; it represents u · x). Assuming that x ∈ ℝ^n and that u ∈ ℝ^n is the target weight vector, the target (or label) is $t_d$ = u · x. Noise has been added, so it is possible that no weight vector is consistent with the data.
Figure 4: Estimation with Gradient Descent
The error the current hypothesis makes on the data set is:
$$\mathrm{Err}(\mathbf{w}^{j}) = \sum_{d \in D} (t_d - o_d)^2$$
19 Gradient Descent
Gradient Descent is used to determine the weight vector that minimizes Err(w). The set D of examples is fixed, and E = Err is a function of the weight vector w. At each step, the weight vector is modified in the direction that produces the steepest descent along the error surface, as represented in the image above. In order to find the best direction in the weight space, the gradient of E with respect to each of the components of w is computed:
$$\nabla E(\mathbf{w}) \equiv \left[ \frac{\partial E}{\partial w_1}, \frac{\partial E}{\partial w_2}, \ldots, \frac{\partial E}{\partial w_n} \right]$$
This vector specifies the direction that produces the steepest increase in E. The weight vector w is therefore modified in the direction of −∇E(w).
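For the squared error of a linear unit, the gradient has a closed form that can be checked against finite differences. The sketch assumes Err(w) = Σ_d (t_d − o_d)^2 with o_d = w · x_d, which gives ∂Err/∂w_i = −2 Σ_d (t_d − o_d) x_id; the data set and step size below are made up for the example.

```python
def predict(w, x):
    """Linear unit output o = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def err(w, data):
    """Err(w) = sum over examples of (t_d - o_d)^2."""
    return sum((t - predict(w, x)) ** 2 for x, t in data)

def gradient(w, data):
    """Analytic gradient: dErr/dw_i = -2 * sum_d (t_d - o_d) * x_id."""
    g = [0.0] * len(w)
    for x, t in data:
        residual = t - predict(w, x)
        for i, xi in enumerate(x):
            g[i] -= 2.0 * residual * xi
    return g

def numeric_gradient(w, data, h=1e-6):
    """Central finite differences, used only to check the analytic form."""
    g = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        g.append((err(up, data) - err(down, data)) / (2 * h))
    return g

# Hypothetical data generated by the target t = 1*x1 + 2*x2.
data = [((1.0, 2.0), 5.0), ((2.0, -1.0), 0.0), ((0.5, 0.5), 1.5)]
w = [0.0, 0.0]

g = gradient(w, data)
g_check = numeric_gradient(w, data)
step = 0.05                                      # small fixed step size (an assumption)
w_new = [wi - step * gi for wi, gi in zip(w, g)]  # move along -gradient

print(g, err(w, data), err(w_new, data))
```

One step against the gradient lowers the error, which is the behaviour the update relies on.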
This algorithm is similar to the previous one, but updates w incrementally. The weight update rule is:
$$\Delta w_i = R\,(t_d - o_d)\,x_{id}$$
The algorithm is described below.
Gradient descent algorithm for training linear units:
Start with an initial random weight vector.
For every example d with target value $t_d$:
    Evaluate the linear unit: $o_d = \sum_i w_i \cdot x_{id} = \mathbf{w} \cdot \mathbf{x}_d$
    Update w by incrementally adding $\Delta w_i$ to each component $w_i$.
Continue until E is below some threshold.
In general, this algorithm is not guaranteed to converge to a global minimum. Decreasing R with time guarantees convergence. But incremental algorithms are sometimes advantageous.
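The incremental procedure can be sketched in a few lines. Everything specific here is an assumption made for illustration: the target vector `u`, the noise-free synthetic data, the fixed rate R = 0.1, and the stopping threshold. With noisy data a fixed rate would not settle exactly, which is why the notes suggest decreasing R.

```python
import random

def predict(w, x):
    """Evaluate the linear unit o_d = w . x_d."""
    return sum(wi * xi for wi, xi in zip(w, x))

def total_error(w, data):
    """E = sum over examples of (t_d - o_d)^2."""
    return sum((t - predict(w, x)) ** 2 for x, t in data)

def lms(data, n, rate=0.1, tol=1e-12, max_epochs=1000, seed=0):
    """Incremental gradient descent: w_i += R * (t_d - o_d) * x_id per example."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in range(n)]   # initial random weight vector
    for _ in range(max_epochs):
        for x, t in data:
            o = predict(w, x)
            w = [wi + rate * (t - o) * xi for wi, xi in zip(w, x)]
        if total_error(w, data) < tol:               # stop once E is small enough
            break
    return w

# Hypothetical target weight vector and noise-free examples t_d = u . x_d.
rng = random.Random(1)
u = [2.0, -1.0, 0.5]
data = []
for _ in range(200):
    x = [rng.uniform(-1.0, 1.0) for _ in u]
    data.append((x, predict(u, x)))

w = lms(data, n=len(u))
print([round(wi, 4) for wi in w])
```

Because the targets are generated without noise, some weight vector is consistent with all the data and the learned weights approach u.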
20 Learning Rates and Convergence
Generally, in the non-separable case, the learning rate, R, must decrease to zero to guarantee convergence. The learning rate is also called the step size. There are more sophisticated algorithms, such as Conjugate Gradient algorithms, that choose the step size automatically. These algorithms tend to converge faster. There is only one ‘basin’ for linear threshold units, so a local minimum is the global minimum. However, choosing a good starting point (i.e., an initial weight vector) can make the algorithm converge much faster.
There are still questions remaining about the ability of the algorithm to correctly determine the function. When new data is presented, will it fall on the appropriate side of the function? How well the algorithm performs on future data depends partly on how many examples it was given. Also, a function could have a small error rate with high probability. A sample can also have errors that force the algorithm to learn the wrong weights for features.
21 Computational Issues
There are also issues of computational complexity. Assuming that the data is linearly separable, the following can be said about the sample complexity. Suppose we want to ensure that our LTU has an error rate on new examples of less than ε with high probability, where high probability is at least (1 − δ). How large does m, the number of examples, have to be in order to achieve this error rate? It can be shown that for n-dimensional problems, the following holds:
$$m = O\left(\frac{1}{\varepsilon}\left[\ln\frac{1}{\delta} + (n + 1)\ln\frac{1}{\varepsilon}\right]\right)$$
What can be said, however, about the computational complexity? It can be shown that there exists a polynomial-time algorithm for finding a consistent LTU, by reduction to linear programming. On-line algorithms have inverse quadratic dependence on the margin.
There are many other solutions, each with advantages and disadvantages. For example:
Direct Computation: Set ∇J(w) = 0 and solve for w. This can be accomplished using SVD methods.
Fisher Linear Discriminant: A direct computation method (discussed in more detail below).
Probabilistic Methods (naive Bayes): Produce a stochastic classifier that can be viewed as a linear threshold unit.
Winnow: A multiplicative update algorithm with the property that it can handle large numbers of irrelevant attributes.
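As an illustration of the last item, a Winnow-style learner can be sketched as follows. The notes do not give the algorithm's details, so the variant below is an assumption: weights start at 1, the threshold is θ = n, active weights are multiplied by α = 2 after a missed positive and divided by α after a false positive. The target concept (a disjunction of 3 out of 20 attributes) and the data are made up.

```python
import random

def winnow_predict(w, theta, x):
    """Predict 1 iff the weighted sum of active attributes exceeds theta."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) > theta)

def winnow(data, n, alpha=2.0, max_epochs=100):
    """Multiplicative updates on mistakes; repeats passes until one is mistake-free."""
    w = [1.0] * n
    theta = float(n)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if winnow_predict(w, theta, x) != y:
                mistakes += 1
                # Promote on a missed positive, demote on a false positive.
                factor = alpha if y == 1 else 1.0 / alpha
                w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
        if mistakes == 0:
            break
    return w, theta

# Hypothetical target: x1 OR x2 OR x3; the remaining 17 attributes are irrelevant.
rng = random.Random(0)
n = 20
relevant = (0, 1, 2)
data = []
for _ in range(200):
    x = [rng.randint(0, 1) for _ in range(n)]
    data.append((x, int(any(x[i] for i in relevant))))

w, theta = winnow(data, n)
errors = sum(winnow_predict(w, theta, x) != y for x, y in data)
print("training errors:", errors)
```

Only weights of attributes that are active on a mistake are changed, so the irrelevant attributes are pushed down multiplicatively rather than one unit at a time.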
22 Fisher Linear Discriminant
The Fisher Linear Discriminant method is a classical method for discriminant analysis. It is based on using dimensionality reduction in order to find a better