Markov Decision Processes, Search, and Constraint Satisfaction, Study notes of Probability and Statistics


Typology: Study notes

2022/2023

Uploaded on 05/11/2023

christin
CS221 Practice Final

Autumn 2012

1 Other Finals

The following pages are excerpts from similar classes’ finals. The content is similar to what we’ve been covering this quarter, so that it should be useful for practicing. Note that the topics and terminology differ slightly, so feel free to ignore the questions that we did not cover.

Certain topics are less emphasized in the past exams, but will be more emphasized in the final for the class. These include:

  • Weighted CSPs and Markov Nets (the practice exams place more of an emphasis on Bayes Nets).
  • Loss-based learning (the practice exams place an emphasis on Naive Bayes instead).
  • Unsupervised learning (e.g., EM)
  • Logic (covered in much greater depth in our class)

In contrast, the practice exams cover state space models fairly deeply. State space models will be less emphasized in the final for the class.

The first portion of the practice exam comes with solutions; the rest are provided as example problems, but without solutions. In terms of other miscellaneous notes:

  • Perceptron refers to a classifier using the perceptron loss (see slide 34 in the lecture on loss minimization).
  • The forward (and backward) algorithm for HMMs is just an instance of variable elimination, as you did in the first part of your Pacman projects, before implementing particle filtering. Relatedly, Viterbi is an algorithm to decode the MAP estimate in an HMM.
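To make the last note concrete, here is a minimal sketch of the forward recursion and of Viterbi decoding on a tiny HMM. The two-state weather model, its numbers, and the names `INIT`, `TRANS`, `EMIT`, `forward`, and `viterbi` are illustrative assumptions, not part of the course materials.

```python
# Hypothetical toy HMM (all numbers are made up for illustration).
INIT = {"rain": 0.5, "sun": 0.5}
TRANS = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.3, "sun": 0.7}}
EMIT = {"rain": {"u": 0.9, "n": 0.1}, "sun": {"u": 0.2, "n": 0.8}}


def forward(init, trans, emit, obs):
    """Return alpha(x) = P(X_t = x, e_{1:t}) for the last time step t.

    Each step sums out the previous state, which is variable elimination
    applied along the chain.
    """
    states = list(init)
    alpha = {x: init[x] * emit[x][obs[0]] for x in states}
    for e in obs[1:]:
        alpha = {
            x2: emit[x2][e] * sum(alpha[x1] * trans[x1][x2] for x1 in states)
            for x2 in states
        }
    return alpha


def viterbi(init, trans, emit, obs):
    """Return (probability, path) of the MAP state sequence given obs.

    Same recursion as forward(), with the sum replaced by a max.
    """
    states = list(init)
    best = {x: (init[x] * emit[x][obs[0]], [x]) for x in states}
    for e in obs[1:]:
        best = {
            x2: max(
                ((p * trans[x1][x2] * emit[x2][e], path + [x2])
                 for x1, (p, path) in best.items()),
                key=lambda pair: pair[0],
            )
            for x2 in states
        }
    return max(best.values(), key=lambda pair: pair[0])


if __name__ == "__main__":
    observations = ["u", "u", "n"]
    print(forward(INIT, TRANS, EMIT, observations))
    print(viterbi(INIT, TRANS, EMIT, observations))
```

Summing the values returned by `forward` gives the likelihood of the evidence, while `viterbi` returns a single most likely sequence together with its joint probability.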

NAME: SID#: Login: GSI:

  1. (17 points.) Search: A∗ Variants

Queuing variants: Consider the following variants of the A∗ tree search algorithm. In all cases, g is the cumulative path cost of a node n, h is a lower bound on the shortest path to a goal state, and n′ is the parent of n. Assume all costs are non-negative.

(i) Standard A∗
(ii) A∗, but we apply the goal test before enqueuing nodes rather than after dequeuing
(iii) A∗, but prioritize n by g(n) only (ignoring h(n))
(iv) A∗, but prioritize n by h(n) only (ignoring g(n))
(v) A∗, but prioritize n by g(n) + h(n′)
(vi) A∗, but prioritize n by g(n′) + h(n)
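For experimenting with these variants, the sketch below implements a tree search whose queue priority is a pluggable function of g(n), h(n), g(n′), and h(n′). The toy graph, the heuristic values, and all names (`best_first_tree_search`, `GRAPH`, `H`, `PRIORITIES`) are hypothetical. Variant (ii), which changes when the goal test happens, is not covered here.

```python
import heapq
from itertools import count


def best_first_tree_search(start, goal, successors, h, priority):
    """Tree search that always dequeues the node with the lowest priority.

    priority(g, h_n, g_parent, h_parent) returns the queue key for a node.
    Returns (cost, path) for the first goal node dequeued, or None.
    """
    tie = count()  # tie-breaker so heapq never has to compare paths
    frontier = [(priority(0, h(start), 0, h(start)), next(tie), 0, [start])]
    while frontier:
        _, _, g, path = heapq.heappop(frontier)
        node = path[-1]
        if node == goal:  # goal test after dequeuing
            return g, path
        for child, step_cost in successors(node):
            g_child = g + step_cost
            key = priority(g_child, h(child), g, h(node))
            heapq.heappush(frontier, (key, next(tie), g_child, path + [child]))
    return None


# Hypothetical acyclic toy problem: S -> A -> G costs 6, S -> B -> G costs 5.
GRAPH = {"S": [("A", 1), ("B", 4)], "A": [("G", 5)], "B": [("G", 1)], "G": []}
H = {"S": 2, "A": 0, "B": 1, "G": 0}  # admissible: never exceeds the true cost

PRIORITIES = {
    "i": lambda g, h_n, g_par, h_par: g + h_n,        # standard A*
    "iii": lambda g, h_n, g_par, h_par: g,            # g only
    "iv": lambda g, h_n, g_par, h_par: h_n,           # h only
    "v": lambda g, h_n, g_par, h_par: g + h_par,      # g(n) + h(parent)
    "vi": lambda g, h_n, g_par, h_par: g_par + h_n,   # g(parent) + h(n)
}

if __name__ == "__main__":
    for name, priority in PRIORITIES.items():
        result = best_first_tree_search("S", "G", GRAPH.__getitem__, H.__getitem__, priority)
        print(name, result)
```

The example graph is acyclic on purpose: a tree search has no visited set, so some priorities can loop forever on graphs with cycles.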

(a) (3 points) Which of the above variants are complete, assuming all heuristics are admissible?

(b) (3 points) Which of the above variants are optimal, again assuming all heuristics are admissible?

Upper Bounds: A∗ exploits lower bounds h on the true completion cost h∗. Suppose now that we also have an upper bound k(n) on the best completion cost (i.e. ∀n, k(n) ≥ h∗(n)). We will now consider A∗ variants which still use g + h as the queue priority, but save some work by using k as well. Consider the point at which you are inserting a node n into the queue (fringe).

(c) (3 points) Assume you are required to preserve optimality. In response to n’s insertion, can you ever delete any nodes m currently on the queue? If yes, state a general condition under which nodes m can be discarded; if not, state why not. Your answer should involve various path quantities (g, h, k) for both the newly inserted node n and other nodes m on the queue.

In a satisficing search, you are only required to find some solution of cost less than some threshold t (if one exists). You need not be optimal.

(d) (3 points) In the satisficing case, in response to n’s insertion, can you ever delete any nodes m currently on the queue? If yes, state a general condition; if not, state why not. Your answer should involve various path quantities (g, h, k) for both the newly inserted node n and other nodes m on the queue.


  1. (13 points.) CSPs: A Greater (or Lesser) Chain

Consider the general less-than chain CSP below. Each of the N variables X_i has the domain {1, ..., M}. The constraints between adjacent variables X_i and X_{i+1} require that X_i < X_{i+1}.

For now, assume N = M = 5.

(a) (1 point) How many solutions does the CSP have?

(b) (1 point) What will the domain of X_1 be after enforcing the consistency of only the arc X_1 → X_2?

(c) (2 points) What will the domain of X_1 be after enforcing the consistency of only the arcs X_2 → X_3 then X_1 → X_2?

(d) (2 points) What will the domain of X_1 be after fully enforcing arc consistency?

Now consider the general case for arbitrary N and M.

(e) (3 points) What is the minimum number of arcs (big-O is ok) which must be processed by AC-3 (the algorithm which enforces arc consistency) on this graph before arc consistency is established?

(f) (4 points) Imagine you wish to construct a similar family of CSPs which forces one of the two following types of solutions: either all values must be ascending or all values must be descending, from left to right. For example, if M = N = 3, there would be exactly two solutions: {1, 2, 3} and {3, 2, 1}. Explain how to formulate this variant. Your answer should include a constraint graph and precise statements of variables and constraints.
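The less-than chain is small enough to explore by machine. The sketch below counts solutions by brute force and enforces arc consistency with a simple AC-3 loop. The function names and the 0-based variable indices are assumptions made for this illustration, and the brute-force counter is only practical for small N and M.

```python
from itertools import product


def count_chain_solutions(n, m):
    """Count assignments to X_1..X_n over {1..m} with X_i < X_{i+1}."""
    return sum(
        all(a < b for a, b in zip(xs, xs[1:]))
        for xs in product(range(1, m + 1), repeat=n)
    )


def revise(domains, i, j, ok):
    """Enforce arc X_i -> X_j: drop values of X_i with no support in X_j."""
    kept = {x for x in domains[i] if any(ok(x, y) for y in domains[j])}
    changed = kept != domains[i]
    domains[i] = kept
    return changed


def ac3(domains, constraints):
    """constraints maps a directed arc (i, j) to a predicate ok(x_i, x_j)."""
    queue = list(constraints)
    while queue:
        i, j = queue.pop(0)
        if revise(domains, i, j, constraints[(i, j)]):
            # X_i shrank, so arcs pointing into X_i must be rechecked.
            queue.extend((k, l) for (k, l) in constraints if l == i and k != j)
    return domains


def chain_constraints(n):
    """Both directions of every adjacent less-than constraint."""
    arcs = {}
    for i in range(n - 1):
        arcs[(i, i + 1)] = lambda x, y: x < y  # X_i < X_{i+1}
        arcs[(i + 1, i)] = lambda x, y: x > y  # same constraint, reversed arc
    return arcs


if __name__ == "__main__":
    n, m = 3, 4
    domains = {i: set(range(1, m + 1)) for i in range(n)}
    print(count_chain_solutions(n, m))
    print(ac3(domains, chain_constraints(n)))
```

Calling `revise` on a single arc corresponds to enforcing only that arc, while `ac3` keeps re-queuing arcs until no domain changes.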

Note that if you pull lever B, the resulting payoff should change your beliefs about what kind of lever B is, and therefore what future payoffs from B might be. For example, if you get the 10 reward, your belief that B is lucky should increase.

(d) (2 points) If you are in state (0, 1) and select action B, list the states you might land in and the probability you will land in them.

(e) (2 points) If you are in state (1, 0) and select action B, list the states you might land in and the probability you will land in them.

(f) (3 points) On the computation tree in (c), clearly mark the probabilities on each branch of any chance nodes.

(g) (3 points) Again in this two-round setting, what is the MEU from the start state, and which first action(s) (A or B or both) give it?

(h) (2 points) If the number of plays N is large enough, the optimal first action will eventually be to pull lever B. Explain why this makes sense using concepts from reinforcement learning.


  1. (15 points.) Bayes Nets: Snuffles

Assume there are two types of conditions: (S)inus congestion and (F)lu. Sinus congestion is caused by (A)llergy or the flu. There are three observed symptoms for these conditions: (H)eadache, (R)unny nose, and fe(V)er. Runny nose and headaches are directly caused by sinus congestion (only), while fever comes from having the flu (only). For example, allergies only cause runny noses indirectly. Assume each variable is boolean.

[Figure: four candidate Bayes nets, labeled (i) through (iv), each over the nodes A, S, F, R, H, and V.]

(a) (2 points) Consider the four Bayes Nets shown. Circle the one which models the domain (as described above) best.

(b) (3 points) For each network, if it models the domain exactly as above, write correct. If it has too many conditional independence properties, write extra independence and state one that it has but should not have. If it has too few conditional independence properties, write missing independence and state one that it should have but does not have.

(i)

(ii)

(iii)

(iv)

(c) (3 points) Assume we wanted to remove the Sinus congestion (S) node. Draw the minimal Bayes Net over the remaining variables which can encode the original model’s marginal distribution over the remaining variables.


  1. (19 points.) HMMs: Tracking a Jabberwock

You have been put in charge of a Jabberwock for your friend Lewis. The Jabberwock is kept in a large tugley wood which is conveniently divided into an N × N grid. It wanders freely around the N² possible cells. At each time step t = 1, 2, 3, ..., the Jabberwock is in some cell X_t ∈ {1, ..., N}², and it moves to cell X_{t+1} randomly as follows: with probability 1 − ε, it chooses one of the (up to 4) valid neighboring cells uniformly at random; with probability ε, it uses its magical powers to teleport to a cell chosen uniformly at random among the N² possibilities (it might teleport to the same cell). Suppose ε = 1/2, N = 10 and that the Jabberwock always starts in X_1 = (1, 1).

(a) (2 points) Compute the probability that the Jabberwock will be in X_2 = (2, 1) at time step 2. What about P(X_2 = (4, 4))?

P(X_2 = (2, 1)) =

P(X_2 = (4, 4)) =
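The transition model described above can be written down directly. The following sketch builds the full distribution over the next cell from a given cell; the function name `transition_distribution` and the use of exact fractions are choices made for this illustration, and the teleport probability is passed in as a parameter.

```python
from fractions import Fraction


def transition_distribution(cell, n, eps):
    """Return {next_cell: probability} for one move on an n x n grid.

    With probability 1 - eps the move goes to a valid neighbor chosen
    uniformly at random; with probability eps it teleports to a cell
    chosen uniformly at random (possibly the current one).
    Cells are (row, column) pairs with 1-based coordinates.
    """
    r, c = cell
    neighbors = [
        (r + dr, c + dc)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
        if 1 <= r + dr <= n and 1 <= c + dc <= n
    ]
    # Teleport mass is spread over every cell of the grid.
    dist = {
        (i, j): eps / (n * n)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
    }
    # Walking mass is split evenly among the valid neighbors.
    for nb in neighbors:
        dist[nb] += (1 - eps) / len(neighbors)
    return dist


if __name__ == "__main__":
    dist = transition_distribution((1, 1), 10, Fraction(1, 2))
    print(dist[(2, 1)], dist[(4, 4)], sum(dist.values()))
```

A corner cell has only two valid neighbors, so each of them receives half of the walking mass in addition to its share of the teleport mass.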

At each time step t, you don’t see X_t but see E_t, which is the row that the Jabberwock is in; that is, if X_t = (r, c), then E_t = r. You still know that X_1 = (1, 1).

(b) (4 points) Suppose we see that E_1 = 1, E_2 = 2, E_3 = 10. Fill in the following table with the distribution over X_t after each time step, taking into consideration the evidence. Your answer should be concise. Hint: you should not need to do any heavy calculations.

t    P(X_t, e_{1:t−1})    P(X_t, e_{1:t})

You are a bit unsatisfied that you can’t pinpoint the Jabberwock exactly. But then you remember that Lewis told you the Jabberwock teleports only because it is frumious on that time step, and it becomes frumious independently of anything else. Let us introduce a variable F_t ∈ {0, 1} to denote whether it will teleport at time t. We want to add these frumious variables to the HMM. Consider the two candidates:

[Figure: two candidate models, (A) and (B), each with nodes F_1, F_2, F_3; X_1, X_2, X_3, ...; and E_1, E_2, E_3.]

(A)                      (B)

X_1 ⊥ X_3 | X_2          X_1 ⊥ X_3 | X_2
X_1 ⊥ E_2 | X_2          X_1 ⊥ E_2 | X_2
X_1 ⊥ F_2 | X_2          X_1 ⊥ F_2 | X_2
X_1 ⊥ E_4 | X_2          X_1 ⊥ E_4 | X_2
X_1 ⊥ F_4 | X_2          X_1 ⊥ F_4 | X_2
E_3 ⊥ F_3 | X_3          E_3 ⊥ F_3 | X_3
E_1 ⊥ F_2 | X_2          E_1 ⊥ F_2 | X_2
E_1 ⊥ F_2 | E_2          E_1 ⊥ F_2 | E_2

(c) (3 points) For each model, circle the conditional independence assumptions above which are true in that model.

(d) (2 points) Which Bayes net is more appropriate for the problem domain here, (A) or (B)? Justify your answer.

For the following questions, your answers should be fully general for models of the structure shown above, not specific to the teleporting Jabberwock. For full credit, you should also simplify as much as possible (including pulling constants outside of sums, etc.).

(e) (2 points) For (A), express P(X_{t+1}, e_{1:t+1}, f_{1:t+1}) in terms of P(X_t, e_{1:t}, f_{1:t}) and the CPTs used to define the network. Assume the E and F nodes are all observed.

(f) (2 points) For (B), express P(X_{t+1}, e_{1:t+1}, f_{1:t+1}) in terms of P(X_t, e_{1:t}, f_{1:t}) and the CPTs used to define the network. Assume the E and F nodes are all observed.

  1. (18 points.) Classification and VPI: Cat Cravings

Consider the following Naive-Bayes model for diagnosing whether your cat is (H)ungry. Signs of hunger include that the cat is (T)hin, (M)eowing, or (W)eak.

H     P(H)
h     0.
¬h    0.

H     T     P(T|H)
h     t     0.
h     ¬t    0.
¬h    t     0.
¬h    ¬t    0.

H     M     P(M|H)
h     m     0.
h     ¬m    0.
¬h    m     0.
¬h    ¬m    0.

H     W     P(W|H)
h     w     0.
h     ¬w    0.
¬h    w     0.
¬h    ¬w    1.

(a) (3 points) If your cat is thin and meowing, but not weak, what is the probability that he is hungry?

(b) (2 points) Which of the following smoothing options might have been applied to produce the CPTs above from training data? Circle the best answer:

(i) Laplace smoothing only might have been applied
(ii) Linear interpolation only might have been applied
(iii) Neither could have been applied
(iv) Either might have been applied

(c) (2 points) Assume that no smoothing has been applied (so these are the maximum likelihood estimates). Compute the linear interpolation smoothed estimate of Plin(w|h) using α = 0.5.

(d) (2 points) In a single word, state why smoothing is necessary.

Imagine you cannot tell whether your cat is weak or not.

(e) (2 points) Is it correct to simply skip over any unobserved evidence variables when classifying in a Naive Bayes model? That is, will you get the same answer as if you had marginalized out the missing nodes? Briefly justify why or why not.
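For practicing the mechanics of Naive Bayes classification, the sketch below computes a class posterior from a prior and per-feature CPTs, using only the evidence variables that are passed in. Every number in `PRIOR` and `CPTS` is a hypothetical placeholder (these are not the exam's CPT values), and the function name `posterior` is an assumption of this illustration.

```python
# Hypothetical placeholder probabilities, NOT the values from the exam.
PRIOR = {True: 0.3, False: 0.7}  # P(H = h)
CPTS = {
    "T": {True: 0.8, False: 0.4},  # P(T = true | H = h)
    "M": {True: 0.6, False: 0.5},  # P(M = true | H = h)
    "W": {True: 0.3, False: 0.1},  # P(W = true | H = h)
}


def posterior(prior, cpts, evidence):
    """Return P(H | evidence) for a Naive Bayes model with boolean variables.

    evidence maps a feature name to its observed boolean value; features
    that are not in evidence contribute no factor.
    """
    joint = {}
    for h in (True, False):
        p = prior[h]
        for var, value in evidence.items():
            p_true = cpts[var][h]
            p *= p_true if value else 1 - p_true
        joint[h] = p
    z = sum(joint.values())  # normalizing constant P(evidence)
    return {h: joint[h] / z for h in joint}


if __name__ == "__main__":
    print(posterior(PRIOR, CPTS, {"T": True, "M": True, "W": False}))
    print(posterior(PRIOR, CPTS, {"T": True, "M": True}))
```

Comparing the output of the second call against an explicit sum over both values of W is one way to check an answer to part (e) numerically.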


Now return to the original probabilities, reprinted here:

H     P(H)
h     0.
¬h    0.

H     T     P(T|H)
h     t     0.
h     ¬t    0.
¬h    t     0.
¬h    ¬t    0.

H     M     P(M|H)
h     m     0.
h     ¬m    0.
¬h    m     0.
¬h    ¬m    0.

H     W     P(W|H)
h     w     0.
h     ¬w    0.
¬h    w     0.
¬h    ¬w    1.

You can decide whether or not to give your cat a mega-feast (F) to counteract his (possible) hunger. Your resulting utilities are below:

H     F     U(H, F)
h     f     0
h     ¬f    -
¬h    f     0
¬h    ¬f    10

(f ) (2 points) Draw the decision diagram corresponding to this decision problem.

If you do not know W, but wish to determine whether your cat is weak, you can apply the weak-o-meter test, which reveals the value of W.

(g) (3 points) In terms of high-level quantities (MEUs, EUs, conditional probabilities, or similar) and variables, give an expression for the maximum utility you should be willing to pay to apply the weak-o-meter, assuming the cat is again thin and meowing.

(h) (2 points) What is the maximum utility you should be willing to pay, as a specific real number?

Q4. [12 pts] Worst-Case Markov Decision Processes

Most techniques for Markov Decision Processes focus on calculating V∗(s), the maximum expected utility of state s (the expected discounted sum of rewards accumulated when starting from state s and acting optimally). This maximum expected utility V∗(s) satisfies the following recursive expression, known as the Bellman Optimality Equation:

$$V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right].$$

In this question, instead of measuring the quality of a policy by its expected utility, we will consider the worst-case utility as our measure of quality. Concretely, L^π(s) is the minimum utility it is possible to attain over all (potentially infinite) state-action sequences that can result from executing the policy π starting from state s. L∗(s) = max_π L^π(s) is the optimal worst-case utility. In words, L∗(s) is the greatest lower bound on the utility of state s: the discounted sum of rewards that an agent acting optimally is guaranteed to achieve when starting in state s.

Let C(s, a) be the set of all states that the agent has a non-zero probability of transferring to from state s using action a. Formally, C(s, a) = {s′ | T(s, a, s′) > 0}. This notation may be useful to you.

(a) [3 pts] Express L∗(s) in a recursive form similar to the Bellman Optimality Equation.

(b) [2 pts] Recall that the Bellman update for value iteration is:

$$V_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_i(s') \right]$$

Formally define a similar update for calculating L_{i+1}(s) using L_i.
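As a reference point, the standard expected-utility Bellman update recalled in part (b) can be sketched as follows. This is the ordinary value iteration, not the worst-case update the question asks for. The two-state toy MDP and the names `value_iteration`, `T`, `R`, and `actions` are hypothetical.

```python
def value_iteration(states, actions, T, R, gamma, iters):
    """Run `iters` synchronous Bellman updates starting from V_0 = 0.

    T[(s, a)] is a list of (next_state, probability) pairs and
    R(s, a, next_state) returns the reward. States with no available
    actions are treated as terminal and keep value 0.
    """
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {
            s: max(
                (
                    sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T[(s, a)])
                    for a in actions(s)
                ),
                default=0.0,
            )
            for s in states
        }
    return V


# Hypothetical toy MDP: keep playing for reward 4 (the game ends with
# probability 1/3 each round), or quit for a one-time reward of 10.
STATES = ["in", "end"]
T = {
    ("in", "stay"): [("in", 2 / 3), ("end", 1 / 3)],
    ("in", "quit"): [("end", 1.0)],
}


def actions(s):
    return ["stay", "quit"] if s == "in" else []


def R(s, a, s2):
    return 4.0 if a == "stay" else 10.0


if __name__ == "__main__":
    print(value_iteration(STATES, actions, T, R, gamma=1.0, iters=100))
```

Each pass computes V_{i+1} from a frozen copy of V_i, which matches the synchronous form of the update written above.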

(c) [3 pts] From this point on, you can assume that R(s, a, s′) = R(s) (rewards are a function of the current state) and that R(s) ≥ 0 for all s. With these assumptions, the Bellman Optimality Equation for Q-functions is

$$Q^*(s, a) = R(s) + \sum_{s'} T(s, a, s') \left[ \gamma \max_{a'} Q^*(s', a') \right]$$

Let M(s, a) be the greatest lower bound on the utility of state s when taking action a (M is to L as Q is to V). (In words, if an agent plays optimally after taking action a from state s, this is the utility the agent is guaranteed to achieve.) Formally define M∗(s, a), in a recursive form similar to how Q∗ is defined.

(d) [2 pts] Recall that the Q-learning update for maximizing expected utility is:

$$Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ R(s) + \gamma \max_{a'} Q(s', a') \right]$$

where α is the learning rate, (s, a, s′, R(s)) is the sample that was just experienced (“we were in state s, we took action a, we ended up in state s′, and we received a reward R(s)”). Circle the update equation below that results in M(s, a) = M∗(s, a) when run sufficiently long under a policy that visits all state-action pairs infinitely often. If more than one of the update equations below achieves this, select the one that would converge more quickly. Note that in this problem, we do not know T or C when starting to learn.

(i) C(s, a) ← {s′} ∪ C(s, a) (i.e. add s′ to C(s, a))

$$M(s, a) \leftarrow (1 - \alpha)\, M(s, a) + \alpha \left[ R(s) + \gamma \sum_{s' \in C(s, a)} \max_{a'} M(s', a') \right]$$

(ii) C(s, a) ← {s′} ∪ C(s, a) (i.e. add s′ to C(s, a))

$$M(s, a) \leftarrow (1 - \alpha)\, M(s, a) + \alpha \left[ R(s) + \gamma \min_{s' \in C(s, a)} \max_{a'} M(s', a') \right]$$

(iii) C(s, a) ← {s′} ∪ C(s, a) (i.e. add s′ to C(s, a))

$$M(s, a) \leftarrow R(s) + \gamma \min_{s' \in C(s, a)} \max_{a'} M(s', a')$$

(iv)

$$M(s, a) \leftarrow (1 - \alpha)\, M(s, a) + \alpha \min \left\{ M(s, a),\; R(s) + \gamma \max_{a'} M(s', a') \right\}$$
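For comparison with the candidate updates, the standard expected-utility Q-learning update recalled at the start of part (d) can be sketched as below. It does not implement any of options (i) to (iv). The function name, the dictionary-based table, and the sample values are assumptions made for this illustration.

```python
from collections import defaultdict


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Apply one sample-based Q-learning update in place.

    Q maps (state, action) to a value, r is the observed reward, and
    actions(state) lists the actions available in a state. A state with
    no actions contributes a future value of 0.
    """
    best_next = max((Q[(s_next, a2)] for a2 in actions(s_next)), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]


if __name__ == "__main__":
    Q = defaultdict(float)  # unseen (state, action) pairs start at 0
    Q[("s1", "x")] = 2.0
    new_value = q_learning_update(
        Q, "s0", "a", r=1.0, s_next="s1",
        actions=lambda state: ["x", "y"], alpha=0.5, gamma=0.9,
    )
    print(new_value)
```

The update moves Q(s, a) a fraction α of the way toward the one-sample target, which is why it averages over observed outcomes rather than tracking the worst one.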

(e) [1 pt] Suppose our agent selected actions to maximize L∗(s), and γ = 1. What non-MDP-related technique from this class would that resemble? (a one word answer will suffice)

(f) [1 pt] Suppose our agent selected actions to maximize L_3(s) (our estimate of L∗(s) after 3 iterations of our “value-iteration”-like backup in part (b)) and γ = 1. What non-MDP-related technique from this class would that resemble? (a brief answer will suffice)

(d) [3 pts] Specify an elimination order that is efficient for the query P(X_3 | X_5 = x_5) in the Tanb above (including X_3 in your ordering). How many variables are in the biggest factor (there may be more than one; if so, list only one of the largest) induced by variable elimination with your ordering? Which variables are they?

(e) [2 pts] Does it make sense to run Gibbs sampling to do inference in a Tanb? In two or fewer sentences, justify your answer.

(f) [2 pts] Suppose we are given a dataset of observations of Y and all the variables X_1, ..., X_6 in the Tanb above. Let C denote the total count of observations, C(Y = y) denote the number of observations of the event Y = y, C(Y = y, X_i = x_i) denote the count of the times the event Y = y, X_i = x_i occurred, and so on. Using the C notation, write the maximum likelihood estimates for all CPTs involving the variable X_4.

(g) [2 pts] In the notation of the question above, write the Laplace smoothed estimates for all CPTs involving the variable X_4 (for amount of smoothing k).
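Count-based estimation of a CPT can be sketched generically. The helper below is an illustration with assumed names (`estimate_cpt`, `SAMPLES`) and a made-up dataset; it is not tied to the Tanb structure referred to above, whose parent sets are not reproduced here. Setting k = 0 gives the maximum likelihood estimate and k > 0 gives the Laplace smoothed estimate.

```python
from collections import Counter


def estimate_cpt(samples, child, parents, domain, k=0):
    """Estimate P(child | parents) from a list of dict-valued samples.

    Returns {(parent_values, child_value): probability}, computed as
    (C(parents, child) + k) / (C(parents) + k * |domain|).
    Only parent assignments that occur in the data are included, which
    avoids dividing by zero when k = 0.
    """
    joint = Counter((tuple(s[p] for p in parents), s[child]) for s in samples)
    parent_counts = Counter(tuple(s[p] for p in parents) for s in samples)
    return {
        (pa, x): (joint[(pa, x)] + k) / (parent_counts[pa] + k * len(domain))
        for pa in parent_counts
        for x in domain
    }


# Hypothetical dataset with one parent Y and one child X.
SAMPLES = [
    {"Y": 1, "X": "a"},
    {"Y": 1, "X": "a"},
    {"Y": 1, "X": "b"},
    {"Y": 0, "X": "b"},
]

if __name__ == "__main__":
    print(estimate_cpt(SAMPLES, "X", ["Y"], ["a", "b"]))       # maximum likelihood
    print(estimate_cpt(SAMPLES, "X", ["Y"], ["a", "b"], k=1))  # Laplace, k = 1
```

Passing more than one name in `parents` conditions on the joint assignment of those parents, which is the shape needed for a node with two parents.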

[Figure: two Bayes nets over the nodes Y, M, and S, labeled (Nb) on the left and (Tanb) on the right.]

(h) [2 pts] Consider the two graphs on the nodes Y (Pacbaby sees Pacman or not), M (Pacbaby sees a moustache), and S (Pacbaby sees sunglasses) above. Pacbaby observes Y = 1 and Y = −1 (Pacman or not Pacman) each 50% of the time. Given Y = 1 (Pacman), Pacbaby observes M = +m (moustache) 50% of the time and S = +s (sunglasses on) 50% of the time. When Pacbaby observes Y = −1, the observation frequencies are identical (i.e. 50% M = ±m and 50% S = ±s). In addition, Pacbaby notices that when Y = 1, anyone with a moustache also wears sunglasses, and anyone without a moustache does not wear sunglasses. If Y = −1, the presence or absence of a moustache has no influence on sunglasses. Based on this information, fill in the CPTs below (you can assume that Pacbaby has the true probabilities of the world).

For Nb (left model):

y        P(Y = y)
1
−1

P(M = m | Y = y)     y = 1     y = −1
m = 1
m = −1

P(S = s | Y = y)     y = 1     y = −1
s = 1
s = −1

For Tanb (right model):

y        P(Y = y)
1
−1

P(M = m | Y = y)     y = 1     y = −1
m = 1
m = −1

P(S = s | Y = y, M = m)     y = 1, m = 1     y = 1, m = −1     y = −1, m = 1     y = −1, m = −1
s = 1
s = −1

(i) [2 pts] Pacbaby sees a character with a moustache and wearing a pair of sunglasses. What prediction does the Naive Bayes model Nb make? What probability does the Nb model assign its prediction? What prediction does the Tanb model make? What probability does the Tanb-brained Pacbaby assign this prediction? Which (if any) of the predictions assigns the correct posterior probabilities?