Homework 7: Markov Decision Processes (CS 141, Computer Science)

Instructions and objectives for Homework 7 of CS 141 (Introduction to AI), which focuses on Markov decision processes. Students are expected to understand topics such as expected minimax, game iteration, the interpretation of discounting, and Gambler's Ruin, and to extend algorithms such as value iteration and policy iteration to games of chance. The document also includes problems to be solved using various methods.

Uploaded on 08/19/2009

CS 141 Introduction to AI Greenwald
Homework 7: Markov Decision Processes
Due: 5:00 PM, Apr 20, 2009
Contents
1 Expected Minimax
2 Game Iteration
3 Interpreting Discounting
4 Gambler’s Ruin, Revisited
5 Model-Based Learning
6 Deterministic Q-Learning
Objectives
By the end of this homework, you will understand:
1. why deterministic Q-learning converges
2. how discounting relates to dying
By the end of this homework, you will be able to:
1. avert disaster in a casino
Practice
1 Expected Minimax
The goal of this question is to extend the minimax algorithm to games of chance, like backgammon and Monopoly.
(a) Formally define games of chance by extending the definition of game trees.
(b) Extend the minimax algorithm to take as input a game of chance.
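One possible reading of this question, offered as a sketch rather than the intended solution: add chance nodes to the game tree, each carrying a probability distribution over its children, and back up an expectation at those nodes. The class names Leaf, Max, Min, and Chance and the function expectiminimax are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    utility: float          # payoff to the maximizing player


@dataclass
class Max:
    children: list          # successor nodes; the maximizer picks one


@dataclass
class Min:
    children: list          # successor nodes; the minimizer picks one


@dataclass
class Chance:
    outcomes: list          # (probability, child) pairs; probabilities sum to 1


def expectiminimax(node):
    """Back up values: max at Max nodes, min at Min nodes, expectation at Chance nodes."""
    if isinstance(node, Leaf):
        return node.utility
    if isinstance(node, Max):
        return max(expectiminimax(child) for child in node.children)
    if isinstance(node, Min):
        return min(expectiminimax(child) for child in node.children)
    if isinstance(node, Chance):
        return sum(prob * expectiminimax(child) for prob, child in node.outcomes)
    raise TypeError(f"unknown node type: {type(node).__name__}")


# A tiny assumed example: the maximizer chooses between a gamble and a sure payoff.
tree = Max([
    Chance([(0.5, Min([Leaf(3.0), Leaf(5.0)])), (0.5, Leaf(1.0))]),
    Leaf(1.5),
])
root_value = expectiminimax(tree)
```

In this example the gamble is worth 0.5 * 3 + 0.5 * 1 = 2, so the maximizer prefers it to the sure payoff of 1.5.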

2 Game Iteration

The goal of this question is to extend value and policy iteration to games of chance, like backgammon and Monopoly.

(a) Formally define games of chance by extending the definition of MDPs.

(b) Extend the value iteration algorithm to take as input a game of chance.

(c) Extend the policy iteration algorithm to take as input a game of chance.
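As a hedged illustration of part (b), the sketch below assumes one particular formalization that the handout does not prescribe: a two-player, zero-sum, turn-based game in which each non-terminal state is owned by either the maximizer or the minimizer, and each action leads to a distribution over (next state, reward) pairs. The function name game_value_iteration and the dictionary layout of owner and T are hypothetical.

```python
def game_value_iteration(owner, T, gamma=0.9, tol=1e-10, max_sweeps=10_000):
    """Value iteration where the backup is a max or a min depending on who moves.

    owner[s] is "max" or "min" for every non-terminal state s.
    T[s][a] is a list of (probability, next_state, reward) triples;
    a terminal state has an empty action dictionary.
    """
    V = {s: 0.0 for s in T}
    for _ in range(max_sweeps):
        delta = 0.0
        for s, actions in T.items():
            if not actions:              # terminal state: value stays 0
                continue
            q_values = [
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            ]
            new_value = max(q_values) if owner[s] == "max" else min(q_values)
            delta = max(delta, abs(new_value - V[s]))
            V[s] = new_value
        if delta < tol:                  # stop once a full sweep changes nothing
            break
    return V


# A tiny assumed game: "a" belongs to the maximizer, "b" to the minimizer.
T = {
    "a": {"left": [(1.0, "b", 0.0)],
          "right": [(0.5, "end", 1.0), (0.5, "end", 0.0)]},
    "b": {"x": [(1.0, "end", 2.0)],
          "y": [(1.0, "end", 0.25)]},
    "end": {},
}
owner = {"a": "max", "b": "min"}
V = game_value_iteration(owner, T, gamma=1.0)
```

Here the minimizer would pick "y" in state "b" (value 0.25), so the maximizer prefers the coin flip "right" in state "a" (value 0.5).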

Problems

3 Interpreting Discounting

(a) Show that for all Markov reward processes, the value function V is well-defined (i.e., finite-valued) if 0 ≤ γ < 1.

(b) Show that there exists a Markov reward process s.t. the value function V is not necessarily well-defined if γ = 1.

(c) Prove that any Markov reward process M_1, which for discount factor 0 ≤ γ < 1 yields value function V_1, can be transformed into another Markov reward process M_2, s.t. the undiscounted value function V_2 = V_1. [Hint: Introduce a zero-reward, absorbing state end in M_2 s.t. all state-action pairs in M_1 transition to end with probability 1 − γ.]
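The construction in the hint can be checked numerically on a small example. The sketch below is not a proof; it assumes rewards that depend only on the current state, so that V(s) = R(s) + γ Σ P(s, s') V(s'), and the function names evaluate and add_end_state, together with the example matrix, are made up for illustration.

```python
def evaluate(P, R, gamma, sweeps=2000):
    """Iterate V(s) = R[s] + gamma * sum over t of P[s][t] * V(t), starting from zero."""
    n = len(R)
    V = [0.0] * n
    for _ in range(sweeps):
        V = [R[s] + gamma * sum(P[s][t] * V[t] for t in range(n)) for s in range(n)]
    return V


def add_end_state(P, R, gamma):
    """Scale every transition by gamma and send the remaining 1 - gamma to a new end state."""
    n = len(R)
    P2 = [[gamma * P[s][t] for t in range(n)] + [1.0 - gamma] for s in range(n)]
    P2.append([0.0] * n + [1.0])   # end is absorbing
    R2 = list(R) + [0.0]           # end yields zero reward
    return P2, R2


# Assumed two-state example.
P = [[0.5, 0.5],
     [0.2, 0.8]]
R = [1.0, 2.0]
gamma = 0.9

V1 = evaluate(P, R, gamma)        # discounted values in the original process
P2, R2 = add_end_state(P, R, gamma)
V2 = evaluate(P2, R2, 1.0)        # undiscounted values in the transformed process
```

On this example the first two entries of V2 agree with V1 up to floating-point error, and the value of end is 0, which is consistent with reading γ as the per-step probability of surviving.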

4 Gambler’s Ruin, Revisited

Consider the following controlled version of Gambler’s Ruin in which the gambler places bets on the outcome of a biased coin flip. Assume the gambler’s worth is between 0 and N (i.e., S = {0, 1, ..., N} ∪ {end}). At each state s, the gambler stakes some amount n in the range A = {1, ..., min{s, N − s}}. The coin turns up heads with probability p and tails with probability 1 − p. If the coin turns up heads, the gambler wins the amount he stakes (i.e., he transitions to state s + n); otherwise, the gambler loses the amount he stakes (i.e., he transitions to state s − n). A reward of 1 is received upon transitioning to state N; all other rewards are 0. The gambler transitions from states 0 and N to the absorbing state end, deterministically. The constraints on the gambler’s range of actions ensure that (i) he stakes at least $1 but no more than his worth; (ii) he stakes no more than N − s, preventing his worth from ever exceeding N.

(a) Draw the corresponding MDP, assuming the gambler’s worth is limited to $5 (i.e., N = 5).

(b) What is the optimal policy if p < 0.5, N = 100, and γ = 1? Why? (Solve this problem by implementing value iteration or policy iteration.)

(c) What is the optimal policy if p > 0.5, N = 100, and γ = 1? Why?
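A minimal starting point for the value iteration mentioned in part (b) might look like the sketch below. It is one way to encode the MDP described above, not a reference solution: the function name gambler_value_iteration, the in-place sweep order, and the stopping tolerance are all assumptions. States 0 and N keep value 0 because they move to end with zero reward; the reward of 1 is credited on the transition into N.

```python
def gambler_value_iteration(N, p, gamma=1.0, tol=1e-12, max_sweeps=100_000):
    """Return V[0..N] for the controlled Gambler's Ruin MDP, with in-place sweeps."""
    V = [0.0] * (N + 1)            # V[0] and V[N] are never updated
    for _ in range(max_sweeps):
        delta = 0.0
        for s in range(1, N):
            best = max(
                # heads: move to s + n, collecting reward 1 if that is state N
                p * ((1.0 if s + n == N else 0.0) + gamma * V[s + n])
                # tails: move to s - n with reward 0
                + (1.0 - p) * gamma * V[s - n]
                for n in range(1, min(s, N - s) + 1)
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:            # stop once a full sweep changes nothing
            break
    return V


# Small assumed instances, useful as sanity checks before trying N = 100.
V_unfavorable = gambler_value_iteration(4, 0.4)
V_fair = gambler_value_iteration(10, 0.5)
```

With γ = 1, V[s] is the probability of reaching N from s under optimal play. For a fair coin (p = 0.5) that probability is s / N, which gives a convenient check on the implementation.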