








Prof. David C Parkes, Computer Science, POMDPs, Instance-Based Methods, Piecewise Linear Convex, Kinds of sensors, Nearest Neighbor Methods, Harvard, Lecture Notes









There are two distinct topics to cover today, but we will try to link them through discussion of the class project, in which you will design a controller for a robot in a world with poisonous and nutritious plants. We first return to the problem of planning in which we have a known probabilistic model of an uncertain environment but the state is not fully observable. This is formalized as the problem of partially observable MDPs, or POMDPs. We briefly summarize some of the techniques used to solve them; the full details of POMDPs are outside the scope of this course. Second, we will discuss instance-based methods for density estimation and classification. These are simple, non-parametric machine learning methods that have become popular with the availability of massive data sets, for example when analyzing data in Internet applications.
Return again to the planning problem, where an agent has a model of its domain. But now remove the assumption that the agent always knows the state of the environment when it has to make a decision. This is much more realistic for most situations that agents encounter. The agent is not normally told exactly what the state of the world is. Rather, it gets observations about the state of the world through its sensors, which allow it to form probabilistic beliefs about the world state. We provide some practical examples of sensors later in these notes. For this, we modify the MDP model to introduce observations. The new framework is called a partially observable Markov decision process, or POMDP as most people call it. The overall model is like a blend of HMMs and MDPs. There is a set $O$ of possible observations, and an observation model or sensor model that tells us what the agent is likely to observe in each state $s$ of the world: $P(o \mid s)$ is the probability of receiving observation $o \in O$ in state $s \in S$. We also introduce an initial distribution $P^{(0)}(s)$ into the model, to capture the agent's initial beliefs about states $s$: $P^{(0)}(s)$ is the probability that the initial state of the world is $s$. If there is a fixed, known initial state $s$, this is modeled by setting $P^{(0)}(s) = 1$.
Consider for example a plant-eating robot. The robot is confronted by a particular plant, which may be either nutritious (N) or poisonous (P). In the simple set-up that we consider here the robot may eat or destroy the plant but is in a single, fixed location. In the actual class project there is no destroy option, but the robot may move from location to location and find additional plants. We can describe the whole problem through a state transition diagram, as shown in Figure 1. The robot may test (or “observe”) a plant to try to determine if it is nutritious. We can model this situation as a POMDP. The states determine the possible observations, so we must include enough information in the state space to specify the observation model. We will use two variables to describe the state space. The first, a Boolean variable, indicates whether the plant is nutritious or poisonous. The second, a three-valued variable, records the action the agent has just taken: either the agent has taken no action, has tested the plant, or has eaten the plant. We also add a terminal state, to indicate that the game is over (which can be reached after eating or destroying the plant). This state is simply represented by state ”
needs to perform an action that is not optimal in any (physical) state, just to gather the information it needs to make decisions in the future. Such actions are called information-gathering actions. In our example, the test action is an information-gathering action. It is not optimal in any state. If the plant is nutritious, the optimal action is to eat it. If the plant is poisonous, the agent should destroy the plant rather than eat it. In short, if the agent knows what the state is, it should take the action appropriate to that state rather than waste energy on tests. However, if the agent is sufficiently uncertain about the state, it should perform the information-gathering action in order to find out. Because of these issues, solving POMDPs to obtain optimal policies is extremely hard. In the finite horizon case, the cost is exponential in the horizon. In the infinite horizon case, the problem of whether or not there exists a policy that achieves a certain value is undecidable! [Madani, Hanks & Condon, AAAI99] Nevertheless, we need to worry about solving POMDPs because the real world is a POMDP! In what follows, we first describe an optimal approach that rests on an interesting equivalence with a fully observable belief-state MDP, and then briefly survey some approximate methods for solving POMDPs.
At any point in time, the agent can maintain a probability distribution over the current state of the world. This distribution is called its belief state, which is a very useful notion in thinking about POMDPs. The belief state allows us to formulate an agent’s policy directly as an MDP on its belief state! Instead of implementing a policy as a function from histories or internal states to actions, a policy is implemented by dividing the space of belief states into regions, and specifying an action in each region. A belief state policy in our example might be as follows:
If P(Nutritious) > 0.9, eat
If 0.2 < P(Nutritious) ≤ 0.9, test
If P(Nutritious) ≤ 0.2, destroy
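The threshold policy above is just a function from a belief to an action. A minimal sketch, assuming the belief is summarized by the single number P(Nutritious); the function name `belief_policy` is hypothetical and used only for illustration:

```python
def belief_policy(p_nutritious: float) -> str:
    """Map the belief P(Nutritious) to an action using the example thresholds."""
    if not 0.0 <= p_nutritious <= 1.0:
        raise ValueError("belief must be a probability in [0, 1]")
    if p_nutritious > 0.9:
        return "eat"        # confident the plant is nutritious
    if p_nutritious > 0.2:
        return "test"       # too uncertain: gather information
    return "destroy"        # confident the plant is poisonous


if __name__ == "__main__":
    for p in (0.95, 0.5, 0.1):
        print(p, belief_policy(p))
```

Note that the policy never looks at the true state, only at the belief, which is why the agent has to keep the belief up to date.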
An agent that implements a belief state policy needs to maintain its current belief state. At any point in time, the agent must know the current distribution over possible states, given its history of observations and actions. That is, after receiving observation $o_t$ it must compute $b(S_t) = P(S_t \mid o_1, \ldots, o_t, a_1, \ldots, a_{t-1})$, where $b$ denotes the belief state (the probability the agent assigns to each possible world state). To see how the idea of filtering from HMMs can be used to track the belief state, we note that
\[
\begin{aligned}
b(S_t) &= P(S_t \mid o_1, \ldots, o_t, a_1, \ldots, a_{t-1}) \\
&\propto P(o_t \mid o_1, \ldots, o_{t-1}, a_1, \ldots, a_{t-1}, S_t)\, P(S_t \mid o_1, \ldots, o_{t-1}, a_1, \ldots, a_{t-1}) \\
&= P(o_t \mid S_t) \sum_{s_{t-1}} P(S_t, s_{t-1} \mid o_1, \ldots, o_{t-1}, a_1, \ldots, a_{t-1}) \\
&= P(o_t \mid S_t) \sum_{s_{t-1}} P(S_t \mid s_{t-1}, a_{t-1})\, P(s_{t-1} \mid o_1, \ldots, o_{t-1}, a_1, \ldots, a_{t-1}) \\
&= P(o_t \mid S_t) \sum_{s_{t-1}} P(S_t \mid s_{t-1}, a_{t-1})\, b(s_{t-1})
\end{aligned}
\]
So we can maintain the belief state from the observation $o_t$ and the previous belief state, using the models $P(S_t \mid s_{t-1}, a_{t-1})$ and $P(o_t \mid S_t)$. This is our transition function for the equivalent belief state MDP! Finally, we can construct the reward function

\[
\hat{R}(b) = \sum_{s} b(S = s)\, R(s)
\]
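The belief update and the belief-state reward can be sketched in a few lines. This is an illustration under stated assumptions, not the course's implementation: states and observations are integer indices, `T[a]` is a matrix with `T[a][s_prev, s_next] = P(s_next | s_prev, a)`, `O[s, o] = P(o | s)`, and all numbers in the usage example (the 0.8 / 0.3 sensor accuracies and the rewards) are made up.

```python
import numpy as np


def belief_update(b_prev, action, obs, T, O):
    """One filtering step: b(s) is proportional to
    P(obs | s) * sum_{s_prev} P(s | s_prev, action) * b_prev(s_prev)."""
    b_prev = np.asarray(b_prev, dtype=float)
    predicted = b_prev @ T[action]          # sum over the previous state
    unnormalized = O[:, obs] * predicted    # weight by the observation model
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return unnormalized / total             # normalize (the 'proportional to' step)


def expected_reward(b, R):
    """R_hat(b) = sum_s b(s) * R(s)."""
    return float(np.dot(np.asarray(b, dtype=float), np.asarray(R, dtype=float)))


# Hypothetical two-state example: state 0 = nutritious, state 1 = poisonous.
# Testing does not change the plant, so the transition matrix is the identity.
T = {"test": np.eye(2)}
# Observation 0 = "looks nutritious", observation 1 = "looks poisonous".
O = np.array([[0.8, 0.2],
              [0.3, 0.7]])

b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, "test", 0, T, O)
print(b1, expected_reward(b1, [10.0, -10.0]))
```

Repeated tests that agree push the belief toward one of the thresholds of a belief state policy, at which point a non-test action becomes preferable.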
Taken together, we have $\hat{R}$ and $P(b' \mid b, a)$ defining a fully observable MDP on the space of belief states. The optimal policy for this fully observable MDP is also an optimal policy for the original POMDP. This is great except that the belief state is (usually) high dimensional and (always) continuous! We cannot just use the same value iteration method from before without a way to represent value functions in this high dimensional space. Three kinds of approaches are used to solve POMDPs:
Piecewise Linear Convex The value function after any finite number of Bellman iterations can be represented by a finite collection of linear functions on belief space, where the value function is the maximum over this set of functions (and is therefore piecewise linear and convex). With this observation, along with iterative pruning to remove the functions for subpolicies that are dominated, exact policies can be found for problems with tens of underlying physical states.
Point approximation The value function after each successive Bellman iteration is approximated via a set of point belief estimates, with interpolation methods used to evaluate the value function for any other belief.
Function approximation The value function is represented as a linear weighted sum over a set of “basis functions”, which are feature vectors for the set of possible sequences of observations.
Since finding an optimal policy is too hard for all but the most trivial of domains, it is typical to fall back on approximate approaches.
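To make the piecewise linear convex representation concrete, the sketch below evaluates a value function stored as a set of linear functions over belief space (each row of `alpha_vectors` holds one coefficient per state) and removes functions that are pointwise dominated by another single function. The names and numbers are hypothetical, and the pointwise check is a deliberately simple stand-in: exact pruning methods typically use linear programs to detect functions dominated by a combination of others.

```python
import numpy as np


def pwlc_value(belief, alpha_vectors):
    """Return (value, index of maximizing linear function) at a belief."""
    values = np.asarray(alpha_vectors, dtype=float) @ np.asarray(belief, dtype=float)
    best = int(np.argmax(values))
    return float(values[best]), best


def prune_pointwise_dominated(alpha_vectors):
    """Drop every vector that is <= some other vector in all states
    and strictly smaller in at least one state."""
    vectors = np.asarray(alpha_vectors, dtype=float)
    keep = []
    for i, a in enumerate(vectors):
        dominated = any(
            j != i and np.all(other >= a) and np.any(other > a)
            for j, other in enumerate(vectors)
        )
        if not dominated:
            keep.append(a)
    return np.array(keep)


# Hypothetical values over states (nutritious, poisonous).
alphas = np.array([[10.0, -10.0],   # e.g. a plan that eats
                   [-1.0, 1.0],     # e.g. a plan that destroys
                   [-2.0, 0.0]])    # never better than the second row

value, best = pwlc_value([0.7, 0.3], alphas)
pruned = prune_pointwise_dominated(alphas)
print(value, best, len(pruned))
```

The index of the maximizing function identifies which subpolicy is best at that belief, so the same structure yields both the value and the action to take.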
Touch Many robots come with touch sensors, so that they know when they are in contact with an obstacle or a wall. Robots that manipulate objects need touch sensors to know just where and what force to apply to an object. Surprisingly, touch is also useful for robot navigation. For example, it is sometimes quite hard to control a robot’s motion precisely enough so that it can head for a door and go through it without mishap. A simple solution is for the robot to aim for the wall somewhere near the door, and then slide along the wall until it reaches the door. This sliding kind of motion is called compliant motion.
Sonar Sonar is a common sensor for mobile robots. It works by emitting a sound pulse in a certain direction, and timing how long it takes for the sound to be reflected from an object. A mobile robot will typically be equipped with a circular array of eight sonars. The sonar information will tell it the distance to the nearest object in each of eight directions. Sonar only gives a robot a very rough idea of its surroundings. Furthermore, it is quite a noisy sensor. One problem is that surfaces that are not smooth do not reflect sound back directly to the robot. Nevertheless, sonar is very useful for obstacle avoidance.
Doppler radar A Doppler radar is similar to sonar in that it emits radio waves and studies how they are reflected. In addition, it measures the change in frequency between the emitted waves and the reflected waves. By the Doppler effect, if a wave is reflected by an object moving away from the radar, its frequency will decrease, while if the object is moving closer the frequency will increase. A Doppler radar is therefore able to detect motion as well as position of objects. Nevertheless, it is still limited: it can detect radial motion, toward and away from the radar, but not lateral motion. An AI problem that uses a Doppler radar is as follows: given a profile of vehicle motions over time, try to determine which vehicle went where, and perhaps what kinds of vehicles they are.
Blood Test Another “sensor” in our wider sense is a blood test. Here, the true state is the actual condition of the patient. The blood test is a noisy sensor providing information about the patient. An AI problem that uses a blood test as a sensor is automated medical diagnosis.
2 Instance-Based Methods (Supervised Learning)
Part of the challenge in your class project will be to learn classifiers to determine whether a plant is nutritious or poisonous based on the observation of noisy images. In addition to the other techniques that we have explored in the course (probabilistic models, decision trees, neural nets, SVMs, clustering algorithms and so forth), you could also consider using instance-based methods for estimating the probability that a plant is nutritious or poisonous, based on comparing it with data stored about the images of previous plants that were in each of these two classes.
In the problem of density estimation we are given a set of training examples $\{x_1, \ldots, x_n\}$, where each $x_i \in X$ is an m-dimensional attribute vector over attributes $X_1, \ldots, X_m$. Based on this, we would like to estimate the probability density at some point $x' \in X$. For classification, we are given as training data labeled pairs $(x_i, y_i)$ with $y_i \in Y$ as the class label. As is standard, we will be interested in classifying a new example $x' \in X$. What is novel is that we will consider a nonparametric method to learn a model for these purposes. The meaning of “nonparametric” here is that the complexity of the model is not constrained a priori but rather depends on the data itself. In addition, the learned hypothesis is not represented by values assigned to a fixed set of parameters. Instead, the hypothesis is represented indirectly in terms of the training data itself. The basic idea of instance-based methods, which are nonparametric in this sense, is to find examples in the training data that are similar to some $x'$ and use them to estimate either the probability density at $x'$ or a class label for $x'$. We will discuss three kinds of non-parametric methods: histograms, kernel methods and nearest-neighbor methods.
Histograms are a very simple and crude learning method. Nevertheless, they can sometimes be surprisingly effective. To start with, let’s consider a one-dimensional density estimator. There is a single continuous attribute $X$, whose range is the closed interval [0, 1], and we want to estimate the probability density function $P(X)$ from a set of training samples $x_1, \ldots, x_n$. One very intuitive way to do this is to divide the domain of $X$ into a set of bins, and to count the proportion of the training samples falling into each bin. The proportion of samples in a bin is an estimate for the probability that a point falls in the bin. Formally, if $n$ is the total number of training instances, $N_x$ is the number of training instances in the bin containing $x$, and $V_x$ is the volume of the bin containing $x$, an estimate for the density at any point in the bin is given by

\[
P(x) = \frac{N_x}{n V_x} \qquad (1)
\]
This simple formula provides the basis for all the non-parametric density estimators. The volume of a bin is the fraction of range [0,1] that is covered by the bin. The histogram method can also be used as a classifier. One simply divides the training instances according to the value of the class variable Y. One then
Let’s look at an example, with 3 bin sizes, from Bishop’s book on neural networks.
In the example, with very small bins the density is too rough, because there are many local maxima. With large bins the density is too smooth because the bimodal nature of the distribution has been lost. The middle-sized bins get just the right effect. Choosing the right bin size for a given distribution and training set is an empirical issue.

Thus far, we have considered one-dimensional histograms. What happens when we increase the number of dimensions? Suppose each of $m$ variables is divided into $d$ ranges; how many bins are there? $d^m$! Thus the number of bins grows exponentially with the number of dimensions. When there are exponentially many bins, then unless there is a huge amount of data, almost all bins will be empty. The resulting density estimate will be extremely rough, and almost useless for generalization. This is the case even with only two bins per dimension, which is the minimum for a dimension to be relevant to the density function. This is the dreaded curse of dimensionality, which is a problem for instance-based methods in general and especially those with “hard bin” approaches such as histograms.

Because of the curse of dimensionality, histograms are inappropriate in more than a few dimensions. However, for low-dimensional problems, they can be very effective because they are very simple. They are also very easy for humans to understand, which is why we so often use histograms and similar graphs to present information. Another advantage of histograms is that they allow for easy online updates of the model: when a new instance is seen, the count for its bin is simply incremented. Note that histograms implicitly make an assumption about “locality”: that examples close to each other have similar density.
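The one-dimensional histogram estimator, density = N_x / (n V_x), can be sketched as follows. This is a minimal illustration assuming equal-width bins on [0, 1]; the function name, the sample values and the choice of five bins are all hypothetical.

```python
import numpy as np


def histogram_density(samples, x, num_bins):
    """Estimate the density at x in [0, 1] as N_x / (n * V_x),
    using num_bins equal-width bins."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    width = 1.0 / num_bins                      # V_x: fraction of [0, 1] covered

    def bin_of(v):
        # Bin index of a value; the right edge 1.0 is folded into the last bin.
        return np.minimum((np.asarray(v, dtype=float) * num_bins).astype(int),
                          num_bins - 1)

    n_x = np.sum(bin_of(samples) == bin_of(x))  # N_x: samples sharing x's bin
    return n_x / (n * width)


samples = [0.05, 0.12, 0.18, 0.55, 0.95]
print(histogram_density(samples, 0.1, num_bins=5))
print(histogram_density(samples, 0.3, num_bins=5))
```

With these samples, three of the five points fall in the first bin, so the estimate there is high, while the empty second bin gets density zero. That empty-bin behaviour is exactly what becomes pervasive as the number of dimensions grows.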
Histograms are quite crude. They suffer from two disadvantages, in addition to the curse of dimensionality.
An alternative idea, which solves both of these problems, is to have each instance create its own bin! For each point $x$, consider a bin centered at the point. We can use Formula 1 to estimate the density at the point by $\frac{N_x}{n V_x}$. A standard way to express the bin around an example is via a “kernel function” (not to be confused with the use of kernels in SVMs). A kernel function is a function that summarizes the data around a point. One simple way to construct a kernel function is to use bins which are hypercubes with sides of size $w$, as follows. Consider the kernel function

\[
H(x) = \left|\left\{ x' \in D : \|x' - x\| \le \frac{w}{2} \right\}\right|
\]
where $D$ is the set of training examples and $\| \cdot \|$ is the max norm. This kernel function describes an m-dimensional hypercube with sides of size $w$. The volume $V_x$ of the bin associated with this kernel function is $w^m$. Using this kernel function, we can build a density estimator. Given a new example $x$, we compute $H(x)$ and estimate

\[
P(X = x) = \frac{H(x)}{n \cdot V_x}
\]
as the probability density at x. This approach is called a kernel-based method, or the Parzen windows method, after one of its pioneers. We can also use a kernel-based method for classification in a similar manner to histograms. We
One problem with kernel methods is that they impose high time and space complexity, because all the examples are used for every prediction. (On the other hand, training is very simple: there is nothing to do!) When using a hard bin, data structures such as “kd-trees” and “quad-trees” can help. When using a soft bin, tricks from coding theory that approximately preserve distances (e.g., Johnson-Lindenstrauss transforms) can be useful.
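The hard-bin (hypercube) kernel estimator described above can be sketched directly from its definition: count the training points within max-norm distance w/2 of the query and divide by n times the hypercube volume w^m. The function name and the data are hypothetical, and the sketch uses the naive scan over all examples rather than a kd-tree.

```python
import numpy as np


def parzen_density(data, x, w):
    """Hypercube Parzen-window density estimate H(x) / (n * w**m).

    data: array of shape (n, m), one training example per row.
    x:    query point of length m.
    w:    side length of the hypercube centered at x.
    """
    data = np.asarray(data, dtype=float)
    x = np.asarray(x, dtype=float)
    n, m = data.shape
    # Max norm of each difference; a point is in the bin iff it is <= w / 2.
    max_norm = np.max(np.abs(data - x), axis=1)
    h = np.sum(max_norm <= w / 2.0)      # H(x): number of points in the bin
    volume = w ** m                      # V_x
    return h / (n * volume)


data = np.array([[0.0, 0.0],
                 [0.1, 0.1],
                 [1.0, 1.0],
                 [0.4, -0.4]])
print(parzen_density(data, [0.0, 0.0], w=1.0))
print(parzen_density(data, [0.0, 0.0], w=0.5))
```

Every call touches all n training points, which illustrates the time cost noted above. The width w plays the same smoothing role as the bin size in a histogram.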
The histogram method fixes bins, and counts how many examples fall in the bin within which an example is located. Kernel-based methods are more flexible, using hard or soft bins centered about a test example. Still, one shortcoming is that the volume of a kernel is effectively fixed, even though very few training examples may fall within it for a particular test example, for instance when a hard bin kernel is used. The k-nearest neighbor approach is in a sense the opposite of a kernel (hard-bin) approach. One fixes a desired number of neighboring examples, and determines the volume $V_x$ that is required to encompass that many examples in a hard bin. The same formula $P(x) \approx \frac{N_x}{n \cdot V_x}$ is used to estimate the density. The difference is that whereas before $V_x$ was fixed and we computed $N_x$, now $N_x$ is fixed to a particular constant (e.g., $k$) and we compute $V_x$. As a density estimator, the k-nearest neighbor approach is somewhat suspect and should NOT be used. In fact, if one integrates the estimated density function across the whole space, the integral diverges, so the estimate is not a proper probability density: its integral can be greater than one! On the other hand, k-nearest neighbors provides a particularly simple and natural classifier. Simply take the k neighbors closest to the point to be classified, and choose the majority classification among them. For this reason, the method is quite popular. As with previous methods, there is a parameter to fix: what value of k should be used, i.e., how many neighbors should be looked at? As before, k serves as a smoothing parameter. If k is small, particularly if k = 1, the method has a tendency to overfit. A single noisy training example will affect all points that are closer to it than to any other example. As k grows larger, the effect of individual examples in the training data is lessened and the model is smoothed. The effect of k as a smoothing parameter is shown in the following figure (again from Bishop):
The same remarks as for kernel-based methods apply to the implementation of nearest neighbor methods. They are expensive in space, because all training instances ever seen have to be stored. Classification is also expensive, because the k nearest examples to a test example have to be identified. The naive approach of searching through all examples is prohibitive for large training sets. A variety of algorithms and data structures can be used to mitigate this problem (but we don’t go into them here). Note: Both kernel methods and nearest neighbor methods suffer from the curse of dimensionality. The reason is that in high dimensions, an example is unlikely to have any close neighbors.
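A k-nearest neighbor classifier with majority vote can be sketched as below. The notes do not fix a distance measure, so Euclidean distance is an assumption here, as are the function name, the tie-breaking behaviour (ties go to the class of the closest neighbor among those tied) and the toy data, which includes one deliberately mislabeled-looking point to show the effect of k.

```python
from collections import Counter

import numpy as np


def knn_classify(train_x, train_y, x, k):
    """Classify x by majority vote among its k nearest training examples."""
    train_x = np.asarray(train_x, dtype=float)
    x = np.asarray(x, dtype=float)
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of examples")
    # Naive search: distance from x to every stored example (Euclidean, assumed).
    dists = np.linalg.norm(train_x - x, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    # Votes are counted from closest to farthest, so on a tie the class seen
    # first (the closer one) is returned by most_common.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: "N" = nutritious, "P" = poisonous; the last point is a noisy
# "P" example sitting inside the "N" cluster.
train_x = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [1.1, 1.0], [0.05, 0.05]]
train_y = ["N", "N", "N", "P", "P", "P"]

print(knn_classify(train_x, train_y, [0.04, 0.04], k=1))
print(knn_classify(train_x, train_y, [0.04, 0.04], k=3))
```

With k = 1 the query inherits the label of the single noisy neighbor, while with k = 3 the surrounding examples outvote it, which is the smoothing effect of larger k described above.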