




























































Loss-based learning (the practice exams place an emphasis on Naive Bayes instead). ... If it is an unlucky one (L=¬l), it pays 10 with probability 1/5.





























































(i) Standard A∗
(ii) A∗, but we apply the goal test before enqueuing nodes rather than after dequeuing
(iii) A∗, but prioritize n by g(n) only (ignoring h(n))
(iv) A∗, but prioritize n by h(n) only (ignoring g(n))
(v) A∗, but prioritize n by g(n) + h(n′)
(vi) A∗, but prioritize n by g(n′) + h(n)
(a) (3 points) Which of the above variants are complete, assuming all heuristics are admissible?
(b) (3 points) Which of the above variants are optimal, again assuming all heuristics are admissible?
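Variants (i), (iii), and (iv) differ only in the priority used to order the queue. The following is an illustrative sketch, not part of the exam: the function name `best_first_search`, its `priority` parameter, and the toy graph and heuristic values are all assumptions made for this example. The goal test is applied when a node is dequeued, as in standard A∗.

```python
import heapq
from itertools import count

def best_first_search(start, goal, edges, h, priority):
    """Generic best-first search; priority(g, h) decides the queue order.

    edges: dict mapping node -> list of (neighbor, step_cost)
    h: dict mapping node -> heuristic estimate
    Returns (path_cost, path), or None if the goal is unreachable.
    """
    tie = count()  # tie-breaker so heapq never has to compare paths
    frontier = [(priority(0, h[start]), next(tie), 0, start, [start])]
    expanded = {}  # cheapest g at which each node has been expanded
    while frontier:
        _, _, g, node, path = heapq.heappop(frontier)
        if node == goal:  # goal test after dequeuing
            return g, path
        if node in expanded and expanded[node] <= g:
            continue
        expanded[node] = g
        for nxt, cost in edges.get(node, []):
            g2 = g + cost
            heapq.heappush(
                frontier,
                (priority(g2, h[nxt]), next(tie), g2, nxt, path + [nxt]),
            )
    return None

# Hypothetical toy graph with an admissible heuristic (illustration only).
EDGES = {"S": [("A", 1), ("B", 4)], "A": [("G", 5)], "B": [("G", 1)]}
H = {"S": 2, "A": 0, "B": 1, "G": 0}

a_star = best_first_search("S", "G", EDGES, H, lambda g, h: g + h)  # variant (i)
ucs = best_first_search("S", "G", EDGES, H, lambda g, h: g)         # variant (iii)
greedy = best_first_search("S", "G", EDGES, H, lambda g, h: h)      # variant (iv)
print(a_star, ucs, greedy)
```

On this particular toy graph, ordering by g + h or by g alone returns the cost-5 path through B, while ordering by h alone returns the cost-6 path through A. A single example like this illustrates the behavior but does not by itself settle parts (a) and (b).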
Upper Bounds: A∗ exploits lower bounds h on the true completion cost h∗. Suppose now that we also have an upper bound k(n) on the best completion cost (i.e. ∀n, k(n) ≥ h∗(n)). We will now consider A∗ variants which still use g + h as the queue priority, but save some work by using k as well. Consider the point at which you are inserting a node n into the queue (fringe).
(c) (3 points) Assume you are required to preserve optimality. In response to n’s insertion, can you ever delete any nodes m currently on the queue? If yes, state a general condition under which nodes m can be discarded; if not, state why not. Your answer should involve various path quantities (g, h, k) for both the newly inserted node n and other nodes m on the queue.
In a satisficing search, you are only required to find some solution of cost less than some threshold t (if one exists). You need not be optimal.
(d) (3 points) In the satisficing case, in response to n’s insertion, can you ever delete any nodes m currently on the queue? If yes, state a general condition; if not, state why not. Your answer should involve various path quantities (g, h, k) for both the newly inserted node n and other nodes m on the queue.
Consider the general less-than chain CSP below. Each of the N variables X_i has the domain {1, ..., M}. The constraints between adjacent variables X_i and X_{i+1} require that X_i < X_{i+1}.
For now, assume N = M = 5. (a) (1 point) How many solutions does the CSP have?
(b) (1 point) What will the domain of X_1 be after enforcing the consistency of only the arc X_1 → X_2?
(c) (2 points) What will the domain of X_1 be after enforcing the consistency of only the arcs X_2 → X_3 then X_1 → X_2?
(d) (2 points) What will the domain of X_1 be after fully enforcing arc consistency?
Now consider the general case for arbitrary N and M. (e) (3 points) What is the minimum number of arcs (big-O is ok) which must be processed by AC-3 (the algorithm which enforces arc consistency) on this graph before arc consistency is established?
(f) (4 points) Imagine you wish to construct a similar family of CSPs which forces one of the two following types of solutions: either all values must be ascending or all values must be descending, from left to right. For example, if M = N = 3, there would be exactly two solutions: {1, 2, 3} and {3, 2, 1}. Explain how to formulate this variant. Your answer should include a constraint graph and precise statements of variables and constraints.
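The less-than chain is small enough to explore mechanically. The sketch below is an illustration rather than an official solution: it counts solutions by brute force and runs a plain AC-3 loop on the chain. The function names (`count_solutions`, `revise`, `ac3_chain`) and the 0-based variable indexing are assumptions of this example.

```python
from itertools import product

def count_solutions(n, m):
    """Brute-force count of assignments with X1 < X2 < ... < Xn, values in 1..m."""
    return sum(
        all(a < b for a, b in zip(xs, xs[1:]))
        for xs in product(range(1, m + 1), repeat=n)
    )

def revise(domains, i, j, ok):
    """Make arc Xi -> Xj consistent: keep values of Xi that have a support in Xj."""
    kept = {x for x in domains[i] if any(ok(x, y) for y in domains[j])}
    changed = kept != domains[i]
    domains[i] = kept
    return changed

def ac3_chain(n, m):
    """Run AC-3 on the chain; variables are indexed 0..n-1 in this sketch."""
    domains = [set(range(1, m + 1)) for _ in range(n)]

    def constraint(i, j):
        # Adjacent variables only: the lower-indexed one must be smaller.
        return (lambda x, y: x < y) if i < j else (lambda x, y: x > y)

    queue = [(i, j) for i in range(n) for j in (i - 1, i + 1) if 0 <= j < n]
    processed = 0
    while queue:
        i, j = queue.pop(0)
        processed += 1
        if revise(domains, i, j, constraint(i, j)):
            # Re-check arcs pointing into Xi, except the one coming from Xj.
            queue.extend((k, i) for k in (i - 1, i + 1) if 0 <= k < n and k != j)
    return domains, processed

domains, processed = ac3_chain(5, 5)
print(count_solutions(5, 5), domains, processed)
```

The `processed` counter depends on the queue order chosen here, so it is only one data point for part (e), not a lower bound.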
Note that if you pull lever B, the resulting payoff should change your beliefs about what kind of lever B is, and therefore what future payoffs from B might be. For example, if you get the 10 reward, your belief that B is lucky should increase. (d) (2 points) If you are in state (0, 1) and select action B, list the states you might land in and the probability you will land in them.
(e) (2 points) If you are in state (1, 0) and select action B, list the states you might land in and the probability you will land in them.
(f) (3 points) On the computation tree in (c), clearly mark the probabilities on each branch of any chance nodes.
(g) (3 points) Again in this two-round setting, what is the MEU from the start state, and which first action(s) (A or B or both) give it?
(h) (2 points) If the number of plays N is large enough, the optimal first action will eventually be to pull lever B. Explain why this makes sense using concepts from reinforcement learning.
Assume there are two types of conditions: (S)inus congestion and (F)lu. Sinus congestion is caused by (A)llergy or the flu. There are three observed symptoms for these conditions: (H)eadache, (R)unny nose, and fe(V)er. Runny nose and headaches are directly caused by sinus congestion (only), while fever comes from having the flu (only). For example, allergies only cause runny noses indirectly. Assume each variable is boolean.
[Figure: four candidate Bayes nets, labeled (i) through (iv), each over the nodes A, S, F, R, H, V.]
(a) (2 points) Consider the four Bayes Nets shown. Circle the one which models the domain (as described above) best.
(b) (3 points) For each network, if it models the domain exactly as above, write correct. If it has too many conditional independence properties, write extra independence and state one that it has but should not have. If it has too few conditional independence properties, write missing independence and state one that it should have but does not have.
(c) (3 points) Assume we wanted to remove the Sinus congestion (S) node. Draw the minimal Bayes Net over the remaining variables which can encode the original model’s marginal distribution over the remaining variables.
You have been put in charge of a Jabberwock for your friend Lewis. The Jabberwock is kept in a large tugley wood which is conveniently divided into an N × N grid. It wanders freely around the N^2 possible cells. At each time step t = 1, 2, 3, ..., the Jabberwock is in some cell X_t ∈ {1, ..., N}^2, and it moves to cell X_{t+1} randomly as follows: with probability 1 − ε, it chooses one of the (up to 4) valid neighboring cells uniformly at random; with probability ε, it uses its magical powers to teleport to a cell chosen uniformly at random among the N^2 possibilities (it might teleport to the same cell). Suppose ε = 1/2, N = 10 and that the Jabberwock always starts in X_1 = (1, 1).
(a) (2 points) Compute the probability that the Jabberwock will be in X_2 = (2, 1) at time step 2. What about P(X_2 = (4, 4))?
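The movement rule can be written out directly as a transition distribution. The sketch below is illustrative only: the function name `transition` is hypothetical, and the teleport probability is passed as a parameter `eps`, here set to 1/2 with N = 10, which is the reading of the setup assumed in this example.

```python
from fractions import Fraction

def transition(cell, n, eps):
    """Distribution over the next cell given the current cell (1-indexed rows/cols)."""
    r, c = cell
    neighbors = [
        (r + dr, c + dc)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
        if 1 <= r + dr <= n and 1 <= c + dc <= n
    ]
    dist = {}
    for nb in neighbors:  # ordinary move, total probability 1 - eps
        dist[nb] = dist.get(nb, 0) + (1 - eps) / len(neighbors)
    for i in range(1, n + 1):  # teleport, total probability eps
        for j in range(1, n + 1):
            dist[(i, j)] = dist.get((i, j), 0) + eps / (n * n)
    return dist

# Assumed values: eps = 1/2 and N = 10, starting from the corner (1, 1).
dist = transition((1, 1), n=10, eps=Fraction(1, 2))
print(dist[(2, 1)], dist[(4, 4)])
```

Exact fractions are used so that the two probabilities can be compared against a hand calculation without rounding.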
At each time step t, you don’t see X_t but see E_t, which is the row that the Jabberwock is in; that is, if X_t = (r, c), then E_t = r. You still know that X_1 = (1, 1).
(b) (4 points) Suppose we see that E_1 = 1, E_2 = 2, E_3 = 10. Fill in the following table with the distribution over X_t after each time step, taking into consideration the evidence. Your answer should be concise. Hint: you should not need to do any heavy calculations.
You are a bit unsatisfied that you can’t pinpoint the Jabberwock exactly. But then you remembered Lewis told you that the Jabberwock teleports only because it is frumious on that time step, and it becomes frumious independently of anything else. Let us introduce a variable F_t ∈ {0, 1} to denote whether it will teleport at time t. We want to add these frumious variables to the HMM. Consider the two candidates:
[Figure: two candidate Bayes nets, (A) and (B), each over the variables F_1, F_2, F_3, the states X_1, X_2, X_3, ..., and the observations E_1, E_2, E_3.]
(c) (3 points) For each model, circle the conditional independence assumptions above which are true in that model. (d) (2 points) Which Bayes net is more appropriate for the problem domain here, (A) or (B)? Justify your answer.
For the following questions, your answers should be fully general for models of the structure shown above, not specific to the teleporting Jabberwock. For full credit, you should also simplify as much as possible (including pulling constants outside of sums, etc.).
(e) (2 points) For (A), express P(X_{t+1}, e_{1:t+1}, f_{1:t+1}) in terms of P(X_t, e_{1:t}, f_{1:t}) and the CPTs used to define the network. Assume the E and F nodes are all observed.
(f) (2 points) For (B), express P(X_{t+1}, e_{1:t+1}, f_{1:t+1}) in terms of P(X_t, e_{1:t}, f_{1:t}) and the CPTs used to define the network. Assume the E and F nodes are all observed.
Consider the following Naive-Bayes model for diagnosing whether your cat is (H)ungry. Signs of hunger include that the cat is (T)hin, (M)eowing, or (W)eak.
h 0. ¬h 0.
h t 0. h ¬t 0. ¬h t 0. ¬h ¬t 0.
h m 0. h ¬m 0. ¬h m 0. ¬h ¬m 0.
h w 0. h ¬w 0. ¬h w 0. ¬h ¬w 1.
(a) (3 points) If your cat is thin and meowing, but not weak, what is the probability that he is hungry?
(b) (2 points) Which of the following smoothing options might have been applied to produce the CPTs above from training data? Circle the best answer:
(i) Laplace smoothing only might have been applied (ii) Linear interpolation only might have been applied (iii) Neither could have been applied (iv) Either might have been applied
(c) (2 points) Assume that no smoothing has been applied (so these are the maximum likelihood estimates). Compute the linear interpolation smoothed estimate of Plin(w|h) using α = 0.5.
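For reference, the estimators mentioned in parts (b) and (c) can be sketched from counts. This is an illustration under stated assumptions: linear interpolation is taken to be the common form P_lin(x | y) = α P_ML(x | y) + (1 − α) P_ML(x), the function names are hypothetical, and the counts are made up rather than taken from the tables above.

```python
from fractions import Fraction

def mle(count_xy, count_y):
    """Maximum likelihood estimate of P(x | y) from counts."""
    return Fraction(count_xy, count_y)

def laplace(count_xy, count_y, k, num_values):
    """Laplace estimate with strength k; num_values is the domain size of X."""
    return Fraction(count_xy + k, count_y + k * num_values)

def linear_interpolation(p_x_given_y, p_x, alpha):
    """Blend the conditional estimate with the unconditional one."""
    return alpha * p_x_given_y + (1 - alpha) * p_x

# Hypothetical counts, not taken from the exam's tables.
c_hw, c_h = 3, 10       # C(h, w) and C(h)
c_w, c_total = 4, 20    # C(w) and the total number of observations

p_ml = mle(c_hw, c_h)
p_lap = laplace(c_hw, c_h, k=1, num_values=2)
p_lin = linear_interpolation(p_ml, mle(c_w, c_total), Fraction(1, 2))
print(p_ml, p_lap, p_lin)
```

Note that Laplace smoothing with k > 0 can never produce an estimate of exactly 0 or 1, whereas interpolating with the unconditional estimate can.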
(d) (2 points) In a single word, state why smoothing is necessary.
Imagine you cannot tell whether your cat is weak or not. (e) (2 points) Is it correct to simply skip over any unobserved evidence variables when classifying in a Naive Bayes model? That is, will you get the same answer as if you had marginalized out the missing nodes? Briefly justify why or why not.
Now return to the original probabilities, reprinted here:
h 0. ¬h 0.
h t 0. h ¬t 0. ¬h t 0. ¬h ¬t 0.
h m 0. h ¬m 0. ¬h m 0. ¬h ¬m 0.
h w 0. h ¬w 0. ¬h w 0. ¬h ¬w 1.
You can decide whether or not to give your cat a mega-feast (F) to counteract his (possible) hunger. Your resulting utilities are below:
h f 0 h ¬f - ¬h f 0 ¬h ¬f 10
(f ) (2 points) Draw the decision diagram corresponding to this decision problem.
If you do not know W, but wish to determine whether your cat is weak, you can apply the weak-o-meter test, which reveals the value of W.
(g) (3 points) In terms of high-level quantities (MEUs, EUs, conditional probabilities, or similar) and variables, give an expression for the maximum utility you should be willing to pay to apply the weak-o-meter, assuming the cat is again thin and meowing.
(h) (2 points) What is the maximum utility you should be willing to pay, as a specific real number?
Most techniques for Markov Decision Processes focus on calculating V∗(s), the maximum expected utility of state s (the expected discounted sum of rewards accumulated when starting from state s and acting optimally). This maximum expected utility V∗(s) satisfies the following recursive expression, known as the Bellman Optimality Equation:
$$V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right].$$
In this question, instead of measuring the quality of a policy by its expected utility, we will consider the worst-case utility as our measure of quality. Concretely, L^π(s) is the minimum utility it is possible to attain over all (potentially infinite) state-action sequences that can result from executing the policy π starting from state s. L∗(s) = max_π L^π(s) is the optimal worst-case utility. In words, L∗(s) is the greatest lower bound on the utility of state s: the discounted sum of rewards that an agent acting optimally is guaranteed to achieve when starting in state s.
Let C(s, a) be the set of all states that the agent has a non-zero probability of transferring to from state s using action a. Formally, C(s, a) = {s′ | T(s, a, s′) > 0}. This notation may be useful to you.
(a) [3 pts] Express L∗(s) in a recursive form similar to the Bellman Optimality Equation.
(b) [2 pts] Recall that the Bellman update for value iteration is:
$$V_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_i(s') \right]$$
Formally define a similar update for calculating L_{i+1}(s) using L_i.
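As a point of comparison for part (b), the standard expected-utility Bellman update recalled above can be sketched as follows. The function name `value_iteration_step`, the data layout for T and R, and the two-state MDP are hypothetical choices made only to exercise the update; the worst-case variant asked for in the question is deliberately not shown.

```python
def value_iteration_step(V, states, actions, T, R, gamma):
    """One synchronous Bellman update: returns V_{i+1} computed from V_i.

    T[(s, a)] is a list of (next_state, probability) pairs.
    R(s, a, s2) returns the reward for that transition.
    """
    new_V = {}
    for s in states:
        new_V[s] = max(
            sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
            for a in actions
        )
    return new_V

# Hypothetical two-state MDP (illustration only).
states = ["x", "y"]
actions = ["stay", "go"]
T = {
    ("x", "stay"): [("x", 1.0)],
    ("x", "go"): [("y", 0.5), ("x", 0.5)],
    ("y", "stay"): [("y", 1.0)],
    ("y", "go"): [("x", 1.0)],
}

def R(s, a, s2):
    # Reward 1 for landing in state "y", 0 otherwise.
    return 1.0 if s2 == "y" else 0.0

V = {s: 0.0 for s in states}
for _ in range(3):
    V = value_iteration_step(V, states, actions, T, R, gamma=0.5)
print(V)
```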
(c) [3 pts] From this point on, you can assume that R(s, a, s′) = R(s) (rewards are a function of the current state) and that R(s) ≥ 0 for all s. With these assumptions, the Bellman Optimality Equation for Q-functions is
$$Q^*(s, a) = R(s) + \sum_{s'} T(s, a, s') \, \gamma \max_{a'} Q^*(s', a')$$
Let M(s, a) be the greatest lower bound on the utility of state s when taking action a (M is to L as Q is to V). (In words, if an agent plays optimally after taking action a from state s, this is the utility the agent is guaranteed to achieve.) Formally define M∗(s, a), in a recursive form similar to how Q∗ is defined.
(d) [2 pts] Recall that the Q-learning update for maximizing expected utility is:
$$Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left[ R(s) + \gamma \max_{a'} Q(s', a') \right]$$
where α is the learning rate, (s, a, s′, R(s)) is the sample that was just experienced (“we were in state s, we took action a, we ended up in state s′, and we received a reward R(s)”). Circle the update equation below that results in M(s, a) = M∗(s, a) when run sufficiently long under a policy that visits all state-action pairs infinitely often. If more than one of the update equations below achieves this, select the one that would converge more quickly. Note that in this problem, we do not know T or C when starting to learn.
(i) C(s, a) ← {s′} ∪ C(s, a) (i.e. add s′ to C(s, a))
$$M(s, a) \leftarrow (1 - \alpha) M(s, a) + \alpha \left[ R(s) + \gamma \sum_{s' \in C(s, a)} \max_{a'} M(s', a') \right]$$
(ii) C(s, a) ← {s′} ∪ C(s, a) (i.e. add s′ to C(s, a))
$$M(s, a) \leftarrow (1 - \alpha) M(s, a) + \alpha \left[ R(s) + \gamma \min_{s' \in C(s, a)} \max_{a'} M(s', a') \right]$$
(iii) C(s, a) ← {s′} ∪ C(s, a) (i.e. add s′ to C(s, a))
$$M(s, a) \leftarrow R(s) + \gamma \min_{s' \in C(s, a)} \max_{a'} M(s', a')$$
(iv)
$$M(s, a) \leftarrow (1 - \alpha) M(s, a) + \alpha \min \left\{ M(s, a),\; R(s) + \gamma \max_{a'} M(s', a') \right\}$$
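For comparison with options (i) through (iv), the standard Q-learning update recalled in part (d) can be sketched as below. The function name `q_update`, the state and action labels, and the sample values are hypothetical; the sketch implements only the expected-utility update, not any of the candidate M updates.

```python
from collections import defaultdict

def q_update(Q, s, a, s2, reward, actions, alpha, gamma):
    """Apply one standard Q-learning update for the sample (s, a, s2, reward)."""
    target = reward + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
    return Q[(s, a)]

Q = defaultdict(float)  # unseen (state, action) pairs start at 0
actions = ["left", "right"]

# The same hypothetical sample experienced twice.
q_update(Q, "s0", "right", "s1", reward=1.0, actions=actions, alpha=0.5, gamma=1.0)
q_update(Q, "s0", "right", "s1", reward=1.0, actions=actions, alpha=0.5, gamma=1.0)
print(Q[("s0", "right")])
```

The running average moves the estimate halfway toward the sampled target on each update, which is why repeated identical samples approach the target gradually instead of jumping to it.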
(e) [1 pt] Suppose our agent selected actions to maximize L∗(s), and γ = 1. What non-MDP-related technique from this class would that resemble? (a one word answer will suffice)
(f) [1 pt] Suppose our agent selected actions to maximize L_3(s) (our estimate of L∗(s) after 3 iterations of our “value-iteration”-like backup in part (b)) and γ = 1. What non-MDP-related technique from this class would that resemble? (a brief answer will suffice)
(d) [3 pts] Specify an elimination order that is efficient for the query P(X_3 | X_5 = x_5) in the Tanb above (including X_3 in your ordering). How many variables are in the biggest factor (there may be more than one; if so, list only one of the largest) induced by variable elimination with your ordering? Which variables are they?
(e) [2 pts] Does it make sense to run Gibbs sampling to do inference in a Tanb? In two or fewer sentences, justify your answer.
(f) [2 pts] Suppose we are given a dataset of observations of Y and all the variables X_1, ..., X_6 in the Tanb above. Let C denote the total count of observations, let C(Y = y) denote the number of observations of the event Y = y, let C(Y = y, X_i = x_i) denote the number of times the event Y = y, X_i = x_i occurred, and so on. Using the C notation, write the maximum likelihood estimates for all CPTs involving the variable X_4.
(g) [2 pts] In the notation of the question above, write the Laplace smoothed estimates for all CPTs involving the variable X_4 (for amount of smoothing k).
[Figure: two Bayes nets over the nodes Y, M, and S: (Nb) on the left and (Tanb) on the right.]
(h) [2 pts] Consider the two graphs on the nodes Y (Pacbaby sees Pacman or not), M (Pacbaby sees a moustache), and S (Pacbaby sees sunglasses) above. Pacbaby observes Y = +1 and Y = −1 (Pacman or not Pacman) each 50% of the time. Given Y = +1 (Pacman), Pacbaby observes M = +m (moustache) 50% of the time and S = +s (sunglasses on) 50% of the time. When Pacbaby observes Y = −1, the frequencies of observations are identical (i.e. 50% M = ±m and 50% S = ±s). In addition, Pacbaby notices that when Y = +1, anyone with a moustache also wears sunglasses, and anyone without a moustache does not wear sunglasses. If Y = −1, the presence or absence of a moustache has no influence on sunglasses. Based on this information, fill in the CPTs below (you can assume that Pacbaby has the true probabilities of the world).
For Nb (left model):

y       P(Y = y)
1       ____
−1      ____

P(M = m | Y = y)    y = 1    y = −1
m = 1               ____     ____
m = −1              ____     ____

P(S = s | Y = y)    y = 1    y = −1
s = 1               ____     ____
s = −1              ____     ____

For Tanb (right model):

y       P(Y = y)
1       ____
−1      ____

P(M = m | Y = y)    y = 1    y = −1
m = 1               ____     ____
m = −1              ____     ____

P(S = s | Y = y, M = m)    y = 1, m = 1    y = 1, m = −1    y = −1, m = 1    y = −1, m = −1
s = 1                      ____            ____             ____             ____
s = −1                     ____            ____             ____             ____
(i) [2 pts] Pacbaby sees a character with a moustache and wearing a pair of sunglasses. What prediction does the Naive Bayes model Nb make? What probability does the Nb model assign its prediction? What prediction does the Tanb model make? What probability does the Tanb-brained Pacbaby assign this prediction? Which (if any) of the predictions assigns the correct posterior probabilities?
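The two posteriors in part (i) follow mechanically from the probabilities stated in part (h). The sketch below is an illustration, not an official solution: it encodes a moustache and sunglasses as +1, encodes their absence as −1, and uses hypothetical names (`p_s_tanb`, `posterior`) for the helper functions.

```python
from fractions import Fraction

half = Fraction(1, 2)
P_Y = {+1: half, -1: half}
P_M = {(m, y): half for m in (+1, -1) for y in (+1, -1)}      # P(M=m | Y=y)
P_S_nb = {(s, y): half for s in (+1, -1) for y in (+1, -1)}   # P(S=s | Y=y)

def p_s_tanb(s, y, m):
    """P(S=s | Y=y, M=m): S matches M when y = +1, and is a fair coin when y = -1."""
    if y == +1:
        return Fraction(1 if s == m else 0)
    return half

def posterior(joint):
    """Normalize joint(y) over y in {+1, -1}."""
    scores = {y: joint(y) for y in (+1, -1)}
    z = sum(scores.values())
    return {y: v / z for y, v in scores.items()}

m, s = +1, +1  # moustache and sunglasses observed (assumed encoding)
nb = posterior(lambda y: P_Y[y] * P_M[(m, y)] * P_S_nb[(s, y)])
tanb = posterior(lambda y: P_Y[y] * P_M[(m, y)] * p_s_tanb(s, y, m))
print(nb[+1], tanb[+1])
```

Because the Nb model cannot represent the dependence between M and S given Y, its two class scores are identical for this observation, while the Tanb model separates them.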