

























Amy Greenwald, Department of Computer Science, Brown University, Box 1910, Providence, RI 02912
Amir Jafari, Mathematics Department, Duke University, Durham, NC 27708
Casey Marks, Department of Computer Science, Brown University, Box 1910, Providence, RI 02912
This paper explores a fundamental connection between computational learning theory and game theory through a property we call no-Φ-regret. Given a set of transformations Φ (i.e., mappings from actions to actions), a learning algorithm is said to exhibit no Φ-regret if an agent experiences no regret for playing the actions the algorithm prescribes, rather than playing the transformed actions prescribed by any of the elements of Φ. The existence of no-Φ-regret learning algorithms is established, for all finite Φ. Analogously, a class of game-theoretic equilibria, called $\vec{\Phi}$-equilibria, for $\vec{\Phi} = (\Phi_i)_{1 \le i \le n}$, is defined (here n is the number of agents/players). The main contribution of this paper is to show that the empirical distribution of play of no-$\Phi_i$-regret algorithms converges to the set of $\vec{\Phi}$-equilibria. The well-known result that the empirical distribution of play of no-internal-regret learning converges to the set of correlated equilibria follows as an immediate corollary of this general theorem. In addition to providing a sufficient condition, a necessary condition for convergence to the set of $\vec{\Phi}$-equilibria is also derived.
This work was originally motivated by an attempt to design a no-regret learning scheme that would converge to a tighter solution concept than the set of correlated equilibria. However, it is argued that the strongest form of no-Φ-regret learning is no-internal-regret learning. Hence, the tightest game-theoretic solution concept to which any no-Φ-regret algorithm converges is correlated equilibrium. In particular, Nash equilibrium is not a necessary outcome of learning via any no-Φ-regret algorithm.
Keywords no-regret learning algorithms, convergence to equilibrium
We analyze learning among a group of agents playing an infinitely-repeated matrix game. At each stage, each agent chooses among a set of actions. The outcome, which is jointly determined by all the agents’ choices, assigns a reward to each agent. A learning algorithm is a mapping from a history of past actions, outcomes, and rewards to a current choice of action. Our goal is to characterize the dynamics of multiple agents abiding by “no-regret” learning algorithms.
In the no-regret framework, the efficacy of learning is determined by comparing the performance of a learning algorithm to the performance of an alternative set of strategies. At each time t, we compare the action (i.e., pure strategy) $a_t$ dictated by the learning algorithm with an alternative mixed strategy $\phi(a_t)$. The function φ is called an action transformation. The agent’s regret is the difference between the rewards it would have expected to obtain had it played the transformed action $\phi(a_t)$ and the rewards it obtained by playing action $a_t$. Given a set Φ of action transformations, the Φ-regret vector (at time t) is the vector of regrets the agent experiences for not having played according to each φ ∈ Φ. By definition, no-Φ-regret learning algorithms have the property that the time-averaged Φ-regret vector approaches the negative orthant.
For example, consider the set of all constant strategies. (A constant strategy always plays action a, for some action a.) Learning algorithms that perform at least as well as this strategy set are said to exhibit no external regret [16]. As another example, consider a strategy that is identical to the strategy dictated by the learning algorithm, except that every play of action a suggested by the learning algorithm is replaced by action a′, for some a and a′. Learning algorithms that perform at least as well as all such strategies are said to exhibit no internal regret [10].
The following results are well-known (see, for example, Hart and Mas-Colell [17, 18]): In two-player, zero-sum, repeated games, if each player plays using a no-external-regret learning algorithm, then each player’s empirical distribution of play converges to his set of minimax strategies. In general-sum, repeated games, if each player plays using a no-internal-regret learning algorithm, then the empirical distribution of play converges to the set of correlated equilibria.
In this paper, we define a general class of no-regret learning algorithms, called no-Φ-regret learning algorithms, which spans the spectrum from no external regret to no internal regret, and beyond. The set Φ describes a set of strategies into which the play of a given learning algorithm is transformed. Such a learning algorithm satisfies no-Φ-regret if no regret is experienced for playing as the algorithm prescribes, rather than playing according to any of the transformations of the algorithm’s play prescribed by the elements of Φ. The existence of no-Φ-regret learning algorithms is established here (and elsewhere [3, 13, 21]), for all finite Φ.
Analogously, we define a class of game-theoretic equilibria, called $\vec{\Phi}$-equilibria, for $\vec{\Phi} = (\Phi_i)_{1 \le i \le n}$. The main contribution of this paper is to show that the empirical distribution of play of no-$\Phi_i$-regret algorithms converges to the set of $\vec{\Phi}$-equilibria. We obtain as corollaries of our theorem the aforementioned results on convergence of no-external-regret learning (no-internal-regret learning) to the set of minimax equilibria (correlated equilibria) in zero-sum (general-sum) games. Furthermore, we establish a necessary condition for convergence to the set of $\vec{\Phi}$-equilibria, namely that the time-averaged $\Phi_i$-regret experienced by each agent i approaches the negative orthant.
This work was originally motivated by an attempt to design a no-regret learning scheme that would converge to a tighter solution concept than the set of correlated equilibria (e.g., the convex hull of the set of Nash equilibria). We imagined that by comparing an agent’s play to a larger
a sequence of opposing actions $a'_1, a'_2, \ldots \in A'$, we define a probability space whose universe consists of sequences of the agent’s actions and whose measure can be defined inductively:
$$P[a_t = \alpha \mid a_\tau = \alpha_\tau, \forall \tau = 1, \ldots, t-1] = L_t((\alpha_1, a'_1), \ldots, (\alpha_{t-1}, a'_{t-1}))(\alpha) \tag{1}$$
for all $\alpha \in A$. In this probability space, we define two sequences of random variables: cumulative rewards $R_t = \sum_{\tau=1}^{t} \rho(a_\tau, a'_\tau)$ and average rewards $\bar{\rho}_t = R_t / t$. Now, following Blackwell, we define the notion of approachability as follows:
Definition 1 (Approachability) Given an infinitely-repeated vector-valued game $\Gamma^\infty$, a set U ⊆ V, and a learning algorithm L, the set U is said to be approachable by L if for all $\epsilon > 0$, there exists $t_0$ such that for any sequence of opposing actions $a'_1, a'_2, \ldots$, $P[\exists t \ge t_0 \text{ s.t. } d(U, \bar{\rho}_t) \ge \epsilon] < \epsilon$.
Hence, if a learning algorithm L approaches a set U ⊆ V, then $d(U, \bar{\rho}_t) \to 0$ almost surely. The following theorem [14, 20] gives a sufficient condition for the negative orthant, that is, the set $\mathbb{R}^d_- = \{x \in \mathbb{R}^d \mid x_i \le 0, \text{ for all } 1 \le i \le d\} \subseteq \mathbb{R}^d$, to be approachable by a learning algorithm L in an infinitely-repeated vector-valued game $(A, A', \mathbb{R}^d, \rho)^\infty$ where $d \in \mathbb{N}$ and $\rho(A \times A')$ is bounded. For $x \in \mathbb{R}^d$, define $x^+$ by $(x^+)_i = \max\{x_i, 0\}$, for all $1 \le i \le d$.
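As a small illustration of the positive part $x^+$ and of distance to the negative orthant, here is a minimal Python sketch. It assumes d is the Euclidean metric, which the text above does not specify; under that assumption the nearest point of the negative orthant to x is x − x⁺, so the distance is the norm of x⁺. The function names are hypothetical.

```python
import math

def positive_part(x):
    """Componentwise positive part: (x+)_i = max{x_i, 0}."""
    return [max(xi, 0.0) for xi in x]

def dist_to_negative_orthant(x):
    """Euclidean distance from x to the negative orthant (assumed metric).

    The closest point of the orthant is x - x+, so the distance is ||x+||.
    """
    return math.sqrt(sum(v * v for v in positive_part(x)))
```

A vector with no positive coordinate is at distance zero, which is the sense in which average regret "approaches the negative orthant".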
Theorem 2 (Jafari, 2003) Given an infinitely-repeated vector-valued game $(A, A', \mathbb{R}^d, \rho)^\infty$ with $d \in \mathbb{N}$ and $\rho(A \times A')$ bounded and a learning algorithm $L = \{L_t\}_{t=1}^{\infty}$, the negative orthant $\mathbb{R}^d_- \subseteq \mathbb{R}^d$ is approachable by L if there exists a constant $c \in \mathbb{R}$ such that for all times $t \ge 1$, for all action histories $h \in H^{t-1}$ of length $t - 1$, and for all opposing actions a′,
$$(R_{t-1}(h))^+ \cdot \rho(L_t(h), a') \le c \tag{2}$$
where $R_t(h) \equiv \sum_{\tau=1}^{t} \rho(a_\tau, a'_\tau)$ and $\rho(q, a') = \sum_{a \in A} q(a)\rho(a, a')$.
Blackwell’s seminal approachability theorem provides a sufficient condition to ensure that, in a vector-valued repeated game, a learner’s average rewards approach any closed set $U \subseteq \mathbb{R}^n$ [2, 18]. To prove existence of no-Φ-regret algorithms, we rely on Theorem 2, a close cousin of Blackwell’s theorem. On the one hand, our theorem specializes Blackwell’s theorem: it provides a sufficient condition for the negative orthant $\mathbb{R}^n_- \subseteq \mathbb{R}^n$ to be approachable, rather than an arbitrary closed subset of Euclidean space. On the other hand, our sufficient condition (Equation 2) is weaker than Blackwell’s original condition: our condition need only hold for some $c \in \mathbb{R}$, rather than precisely for c = 0. Moreover, in our framework, the opponents (i.e., not the learner) have at their disposal an arbitrary, rather than merely a finite, set of actions.
2.2 Action Transformations
An action transformation is a function φ : A → ∆(A). Let ΦALL(A) denote the set of all action transformations over the set A. Following Blum and Mansour [3], we let ΦSWAP(A) ⊆ ΦALL(A) denote the set of all action transformations that map actions to distributions with all their weight on a single action (i.e., pure strategies).
There are two well-studied subsets of ΦSWAP(A), namely, external and internal action transformations. Let $\delta_a \in \Delta(A)$ denote the distribution with all its weight on action a. An external action transformation is simply a constant transformation, so for a ∈ A,
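$$\phi^{(a)}_{\mathrm{EXT}} : x \mapsto \delta_a, \quad \text{for all } x \in A \tag{3}$$
An internal action transformation behaves like the identity, except on one particular input, so for a, b ∈ A,
$$\phi^{(ab)}_{\mathrm{INT}} : x \mapsto \begin{cases} \delta_b & \text{if } x = a \\ \delta_x & \text{otherwise} \end{cases} \tag{4}$$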
The sets of external and internal action transformations are denoted by ΦEXT(A) and ΦINT(A), respectively. Observe that |ΦEXT(A)| = |A| and |ΦINT(A)| = |A|^2 − |A| + 1: there is one internal transformation for each ordered pair of distinct actions, plus the identity. We can represent an action transformation by a stochastic matrix. Given φ ∈ ΦALL(A) and an enumeration of A, we define its matrix representation [φ] as:
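$$[\phi]_{ij} = \phi(a_i)(a_j) \tag{5}$$
where $a_k$ is the kth action in the enumeration. For example, for A = {1, 2, 3, 4}, the action transformations $\phi^{(2)}_{\mathrm{EXT}}$ and $\phi^{(23)}_{\mathrm{INT}}$ can be represented as:
$$[\phi^{(2)}_{\mathrm{EXT}}] = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix} \qquad [\phi^{(23)}_{\mathrm{INT}}] = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$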
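The matrix representation of Equation 5 can be sketched in a few lines of Python. This is only an illustration: actions are assumed to be indexed from 0 (so action 2 of the example becomes index 1), and the function names are hypothetical.

```python
def external_matrix(n, a):
    """Matrix of the external transformation to action a: every row is delta_a."""
    return [[1.0 if j == a else 0.0 for j in range(n)] for _ in range(n)]

def internal_matrix(n, a, b):
    """Matrix of the internal transformation a -> b: identity except row a is delta_b."""
    return [[1.0 if j == (b if i == a else i) else 0.0 for j in range(n)]
            for i in range(n)]

def count_internal(n):
    """Number of distinct internal transformations on n actions."""
    distinct = {tuple(map(tuple, internal_matrix(n, a, b)))
                for a in range(n) for b in range(n)}
    return len(distinct)
```

For n = 4, `external_matrix(4, 1)` and `internal_matrix(4, 1, 2)` correspond to the two transformations of the example above, and `count_internal(4)` agrees with |A|^2 − |A| + 1.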
2.3 No Φ-Regret Learning
A real-valued game (A, A′, R, r) is a vector-valued game with $R \subseteq \mathbb{R}$. Given a real-valued game Γ = (A, A′, R, r) and a set of action transformations Φ ⊆ ΦALL(A), we define the (vector-valued) Φ-regret game as $\Gamma^\Phi = (A, A', \mathbb{R}^\Phi, \rho^\Phi)$, with the vector-valued function $\rho^\Phi : A \times A' \to \mathbb{R}^\Phi$ given by:
$$\rho^\Phi(a, a') \equiv \left(\rho^\phi(a, a')\right)_{\phi \in \Phi} \tag{6}$$
where
$$\rho^\phi(a, a') = r(\phi(a), a') - r(a, a') \tag{7}$$
Here, $r(q, a') = \sum_{a \in A} q(a) r(a, a')$, for all q ∈ ∆(A). In words, the φth entry in the regret vector $\rho^\Phi(a, a')$ is the rewards the agent would have expected to obtain by playing the mixed strategy φ(a) minus the rewards it obtains by playing action a, given opposing action a′. We now define no-Φ-regret learning in Blackwell’s approachability framework:
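The regret vector of Equations 6 and 7 can be computed directly from stochastic-matrix representations of the transformations. The sketch below assumes, purely for illustration, a finite opposing action set so that rewards fit in a table `r[a][a_opp]`; the names are hypothetical.

```python
def expected_reward(r, q, a_opp):
    """r(q, a') = sum over a of q(a) * r(a, a')."""
    return sum(q[a] * r[a][a_opp] for a in range(len(q)))

def regret_vector(r, phis, a, a_opp):
    """One entry r(phi(a), a') - r(a, a') per transformation phi.

    Each phi is a row-stochastic matrix whose row a is the mixed strategy phi(a).
    """
    return [expected_reward(r, phi[a], a_opp) - r[a][a_opp] for phi in phis]
```

For instance, with `r = [[1, 0], [0, 2]]` and the two external transformations on two actions, playing action 0 against opposing action 1 yields regret 0 toward action 0 and regret 2 toward action 1.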
Definition 3 (No-Φ-Regret Learning) Given a real-valued game Γ = (A, A′, R, r) and a finite set of action transformations Φ ⊆ ΦALL(A), a no-Φ-regret learning algorithm L is one that approaches the negative orthant $\mathbb{R}^\Phi_- \subseteq \mathbb{R}^\Phi$ in the infinitely-repeated Φ-regret game: i.e., for all $\epsilon > 0$, there exists $t_0$ such that for any sequence of opposing actions $a'_1, a'_2, \ldots$,
$$P\left[\exists t \ge t_0 \text{ s.t. } d\left(\mathbb{R}^\Phi_-, \bar{\rho}^\Phi_t\right) \ge \epsilon\right] < \epsilon \tag{8}$$
In words, if an agent plays an infinitely-repeated game $\Gamma^\infty$ as prescribed by a no-Φ-regret learning algorithm, then the time-averaged Φ-regret experienced by the agent converges to the negative orthant with probability 1, regardless of the sequence of opposing actions. Moreover, this convergence is uniform.
Algorithm 1 No-Regret Learning Algorithm ((A, A′, R, r), Φ ⊆ ΦALL(A))
 1: initialize $x_0 = 0$
 2: for t = 1, 2, ... do
 3:   sample pure action $a \sim q_t$
 4:   choose opposing actions $a'_t \in A'$
 5:   observe reward vector $r_t = r(\cdot, a'_t) \in \mathbb{R}^A$
 6:   for all φ ∈ Φ do
 7:     compute instantaneous regret $y^\phi_t = r_t \cdot e_a[\phi] - r_t \cdot e_a$
 8:     update cumulative regret vector $x^\phi_t = x^\phi_{t-1} + y^\phi_t$
 9:   end for
10:   if $(x^\Phi_t)^+ = 0$ then
11:     set $q_{t+1} \in \Delta(A)$ arbitrarily
12:   else
13:     let $M_t = \sum_{\phi \in \Phi} (x^\phi_t)^+ [\phi] \,/\, \sum_{\phi \in \Phi} (x^\phi_t)^+$
14:     solve for a fixed point $q_{t+1} = q_{t+1} M_t$
15:   end if
16: end for
Algorithm 1 lists the steps in the no-Φ-regret learning algorithm derived in the proof of the existence theorem. At time t, the agent plays the mixed strategy $q_t$ by sampling a pure action a according to the distribution $q_t$, after which it observes an |A|-dimensional reward vector $r_t$, where $(r_t)_a = r(a, a'_t)$, assuming $a'_t$ is the opponents’ pure action vector at time t. Given this reward vector, the agent computes its instantaneous regret in all dimensions φ ∈ Φ: specifically, $\rho^\phi(a_t, a'_t) = r(\phi(a_t), a'_t) - r(a_t, a'_t)$, which, since r(q, a′) is an expectation, we compute via dot products in Step 7. The cumulative regret vector is then updated accordingly, after which its positive part is extracted. If this quantity is zero, then the algorithm outputs an arbitrary mixed strategy. Otherwise, a fixed point of the stochastic matrix M derived in Equation 15 is returned.
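The steps of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative sketch under several assumptions that are not part of the paper: the opposing action set is finite, so rewards fit in a table `r[a][a_opp]`; the initial strategy and the "arbitrary" strategy of step 11 are taken to be uniform; and the fixed point in step 14 is approximated by iterating the lazy chain (I + M)/2 rather than by Gaussian elimination. All function names are hypothetical.

```python
import random

def fixed_point(M, iters=2000):
    """Approximate q with q = qM by iterating the lazy chain (I + M)/2."""
    n = len(M)
    q = [1.0 / n] * n
    for _ in range(iters):
        qM = [sum(q[i] * M[i][j] for i in range(n)) for j in range(n)]
        q = [0.5 * (q[j] + qM[j]) for j in range(n)]
    return q

def next_strategy(x, phis, n):
    """Steps 10-14: build M_t from positive cumulative regrets, return q_{t+1}."""
    pos = [max(v, 0.0) for v in x]
    total = sum(pos)
    if total == 0.0:
        return [1.0 / n] * n  # step 11: arbitrary choice (uniform is an assumption)
    M = [[sum(pos[k] * phis[k][i][j] for k in range(len(phis))) / total
          for j in range(n)] for i in range(n)]
    return fixed_point(M)

def run(r, phis, opp_actions, seed=0):
    """Play against a fixed opposing sequence; return (average regrets, last q)."""
    rng = random.Random(seed)
    n = len(r)
    x = [0.0] * len(phis)              # step 1: cumulative regret per phi
    q = [1.0 / n] * n                  # assumption: q_1 is uniform
    for a_opp in opp_actions:
        a = rng.choices(range(n), weights=q)[0]   # step 3: sample a ~ q_t
        rt = [r[b][a_opp] for b in range(n)]      # step 5: reward vector
        for k, phi in enumerate(phis):            # steps 6-9: regret update
            x[k] += sum(phi[a][b] * rt[b] for b in range(n)) - rt[a]
        q = next_strategy(x, phis, n)             # steps 10-15
    t = len(opp_actions)
    return [v / t for v in x], q
```

Each `phi` is a row-stochastic matrix as in Equation 5, so the dot product in the regret update is the expectation r(φ(a), a′) of Step 7.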
Complexity Each iteration of Algorithm 1 has time complexity O(max{|Φ||A|^2, |A|^3}). Updating the cumulative regret vector in steps 6–9 takes time O(|Φ||A|), since computing instantaneous regret for each φ ∈ Φ (step 7) is an O(|A|) operation. Computing the stochastic matrix M in step 13 takes time O(|Φ||A|^2), since each matrix [φ] has dimensions |A| × |A|. Finding the fixed point of an n × n stochastic matrix (step 14), which can be accomplished, for example, via Gaussian elimination, takes O(n^3) time. If, however, Φ ⊆ ΦSWAP(A), then the time complexity reduces to O(max{|Φ||A|, |A|^3}), since in this case, (i) computing instantaneous regret for each φ ∈ Φ (step 7) takes constant time so that updating the cumulative regret vector takes time O(|Φ|); and (ii) computing the stochastic matrix M in step 13 is only an O(|Φ||A|) operation, since there are only |A| nonzero entries in each φ ∈ Φ. In particular, if Φ = ΦINT(A), then the time complexity reduces to O(|A|^3), because |ΦINT(A)| = O(|A|^2). Moreover, if Φ = ΦEXT(A), then the time complexity reduces even further to O(|A|), because matrix manipulation is not required in the special case of no-external-regret learning: the rows of M are constant, each a copy of the (normalized) positive part of the cumulative regret vector, which is precisely the fixed point of M.
The space complexity of Algorithm 1 is O(|Φ||A|^2) = O(max{|Φ||A|^2, |A|^2}) because it is necessary to store |Φ| matrices, each with dimensions |A| × |A|, and computing the fixed point of an |A| × |A| stochastic matrix (via Gaussian elimination) requires O(|A|^2) space. If, however, Φ ⊆ ΦSWAP(A), then the space complexity reduces to O(max{|Φ||A|, |A|^2}), since, in this case, there are only |A| nonzero entries in each φ ∈ Φ. In particular, if Φ = ΦINT(A) then the space complexity reduces to O(|A|^2), since it suffices to store cumulative regrets in a matrix of size |A| × |A|. Similarly, if Φ = ΦEXT(A), then the space complexity reduces to O(|A|), since it suffices to store cumulative regrets in a vector of size |A|. Our discussion of the time and space complexity of Algorithm 1 is summarized in Table 1.
             Time                        Space
Φ ⊆ ΦALL     O(max{|Φ||A|^2, |A|^3})     O(|Φ||A|^2)
Φ ⊆ ΦSWAP    O(max{|Φ||A|, |A|^3})       O(max{|Φ||A|, |A|^2})
Φ = ΦINT     O(|A|^3)                    O(|A|^2)
Φ = ΦEXT     O(|A|)                      O(|A|)

Table 1: Complexity of No-Φ-Regret Learning
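The O(|A|) entry for Φ = ΦEXT reflects the fact that, when every row of M is the same distribution p, any distribution q satisfies qM = p, so p itself is the fixed point and no linear system needs to be solved. A minimal Python sketch of this special case follows; the uniform fallback is an assumption standing in for the "arbitrary" choice, and the function names are hypothetical.

```python
def external_regret_strategy(x):
    """Next strategy for external regret: normalize the positive part of x."""
    pos = [max(v, 0.0) for v in x]
    total = sum(pos)
    if total == 0.0:
        return [1.0 / len(x)] * len(x)  # arbitrary choice; uniform is an assumption
    return [v / total for v in pos]

def times_constant_row_matrix(q, p):
    """Compute qM for the matrix M whose rows all equal p."""
    return [sum(q) * pj for pj in p]
```

Applying `times_constant_row_matrix(p, p)` to the output `p` of `external_regret_strategy` returns `p` again, confirming the fixed-point property in this case.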
In this section, we define the notion of $\vec{\Phi}$-equilibria, of which correlated, Nash, and minimax equilibria are all special cases. We show that the set of $\vec{\Phi}$-equilibria is convex, for all $\vec{\Phi}$.
In a (real-valued) n-player game $\Gamma_n = \langle (A_i, r_i)_{1 \le i \le n} \rangle$, each player i chooses an action from the finite set $A_i$, and the rewards earned by player i are determined by the function $r_i : A_1 \times \ldots \times A_n \to \mathbb{R}$. We abbreviate action profile $(a_1, \ldots, a_n)$ by $(a_i, a_{-i}) \in A_i \times \prod_{j \ne i} A_j$. Given an n-player game $\Gamma_n$, an action transformation $\phi \in \Phi_{\mathrm{ALL}}(A_i)$ can be extended to a linear map $\tilde{\phi} : \Delta(A_1 \times \ldots \times A_n) \to \Delta(A_1 \times \ldots \times A_n)$ as follows:
$$\tilde{\phi}(q)(a_i, a_{-i}) \equiv \sum_{b_i \in A_i} q(b_i, a_{-i})\, \phi(b_i)(a_i) \tag{16}$$
It is easily verified that $\tilde{\phi}(q)$ is indeed a probability distribution.
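For intuition, Equation 16 can be written out for a two-player game in which the transformation belongs to the first player. The sketch below assumes a joint distribution stored as a table `q[a][c]` over the first player's action a and the second player's action c; this layout and the function name are illustrative choices only.

```python
def extend_player1(phi, q):
    """Extension of phi to joint distributions, for the first of two players.

    result[a][c] = sum over b of q[b][c] * phi[b][a], where phi is the
    row-stochastic matrix of the action transformation.
    """
    n, m = len(q), len(q[0])
    return [[sum(q[b][c] * phi[b][a] for b in range(n)) for c in range(m)]
            for a in range(n)]
```

Because each row of `phi` sums to one, the entries of the result sum to the same total as the entries of `q`, which is the verification alluded to above.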
Definition 5 ($\vec{\Phi}$-Equilibrium) Given an n-player game $\Gamma_n = \langle (A_i, r_i)_{1 \le i \le n} \rangle$ and a vector $\vec{\Phi} = (\Phi_i)_{1 \le i \le n}$ such that $\Phi_i \subseteq \Phi_{\mathrm{ALL}}(A_i)$ for $1 \le i \le n$, an element $q \in \Delta(A_1 \times \ldots \times A_n)$ is called a $\vec{\Phi}$-equilibrium if $r_i(q) \ge r_i(\tilde{\phi}(q))$, for all players i and for all $\phi \in \Phi_i$.
If for all players i, each $\Phi_i$ is of the same type, e.g., $\Phi_i = \Phi_{\mathrm{EXT}}(A_i)$, then we refer to the $\vec{\Phi}$-equilibrium accordingly, e.g., ΦEXT-equilibrium.
4.1 Examples of $\vec{\Phi}$-Equilibria
Correlated, Nash, and minimax equilibria are all special cases of $\vec{\Phi}$-equilibria.
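Correlated Equilibrium Given an n-player game $\Gamma_n = \langle (A_i, r_i)_{1 \le i \le n} \rangle$, let $\Phi_i = \Phi_{\mathrm{INT}}(A_i)$ for all players i. For $q \in \Delta(A_1 \times \ldots \times A_n)$, the expression $r_i(\tilde{\phi}^{(\alpha\beta)}_{\mathrm{INT}}(q))$ simplifies as follows: for $\alpha, \beta \in A_i$,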
Here, $q_{-i} \in \Delta\left(\prod_{j \ne i} A_j\right)$. Thus, $q \in \Delta(A_1 \times \ldots \times A_n)$ is a ΦEXT-equilibrium if and only if
$$r_i(q) \ge r_i(\alpha, q_{-i})$$
for all players i and for all $\alpha \in A_i$, which is the definition of coarse correlated equilibrium (also called weak correlated equilibrium) [23].
Nash Equilibrium Given an n-player game $\Gamma_n = \langle (A_i, r_i)_{1 \le i \le n} \rangle$, a Nash equilibrium [24] is an independent element $q \in \Delta(A_1 \times \ldots \times A_n)$ such that $r_i(q) \ge r_i(q_1, \ldots, a_i, \ldots, q_n)$, for all players i and for all actions $a_i \in A_i$. An element $q \in \Delta(A_1 \times \ldots \times A_n)$ is called independent if it can be written as the product of n independent elements $q_i \in \Delta(A_i)$: i.e., $q = q_1 \times \ldots \times q_n$. Thus, by definition, a Nash equilibrium is an independent coarse correlated equilibrium. However, a Nash equilibrium is also an independent correlated equilibrium. Therefore, the sets of independent coarse correlated equilibria and independent correlated equilibria coincide.
In general, however, a coarse correlated equilibrium need not be a correlated equilibrium. This observation is intuitive for general-sum games, but perhaps less so for zero-sum games. In the following zero-sum game, with row as maximizer and column as minimizer, the joint distribution with half its weight on (T,L) and the other half on (B,M) is a coarse correlated equilibrium, but not a correlated equilibrium. It is a coarse correlated equilibrium because row has no incentive to deviate from its marginal distribution (half its weight on T and half on B), and column has no incentive to deviate from its marginal distribution (half its weight on L and half on M). If column were to deviate to R, it would expect to lose 1/2 instead of 0. It is not, however, a correlated equilibrium: if column is advised to play L, then row is playing T, in which case column actually prefers to play R, where it would win 1 instead of 0.
         L     M     R
   T     0     0    −1
   B     0     0     2

Figure 1: Sample Zero-Sum Game.
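The claims about the game in Figure 1 can be checked mechanically. The following Python sketch encodes the row player's rewards from the figure, takes the column player's rewards to be their negation, and tests the coarse correlated and correlated equilibrium conditions for the distribution with half its weight on (T,L) and half on (B,M). The function names and the tolerance are illustrative choices.

```python
ROW = [[0, 0, -1], [0, 0, 2]]              # row player's rewards (rows T, B; columns L, M, R)
COL = [[-v for v in row] for row in ROW]   # zero-sum: column's rewards
Q = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]     # half on (T,L), half on (B,M)

def value(r, q):
    """Expected reward r(q) under joint distribution q."""
    return sum(q[i][j] * r[i][j] for i in range(len(q)) for j in range(len(q[0])))

def is_coarse_correlated(q, tol=1e-12):
    """No player gains by ignoring advice and playing a fixed action."""
    n, m = len(q), len(q[0])
    col_marg = [sum(q[i][j] for i in range(n)) for j in range(m)]
    row_marg = [sum(q[i]) for i in range(n)]
    row_ok = all(value(ROW, q) + tol >= sum(col_marg[j] * ROW[a][j] for j in range(m))
                 for a in range(n))
    col_ok = all(value(COL, q) + tol >= sum(row_marg[i] * COL[i][b] for i in range(n))
                 for b in range(m))
    return row_ok and col_ok

def is_correlated(q, tol=1e-12):
    """No player gains by swapping one advised action a for another action b."""
    n, m = len(q), len(q[0])
    row_ok = all(sum(q[a][j] * (ROW[b][j] - ROW[a][j]) for j in range(m)) <= tol
                 for a in range(n) for b in range(n))
    col_ok = all(sum(q[i][a] * (COL[i][b] - COL[i][a]) for i in range(n)) <= tol
                 for a in range(m) for b in range(m))
    return row_ok and col_ok
```

Here `is_coarse_correlated(Q)` holds while `is_correlated(Q)` fails, the failing deviation being column's swap of L for R.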
Zero-Sum Games In the case of two-player, zero-sum games, we obtain the following results for coarse correlated equilibria (and consequently correlated equilibria):
Proposition 6 Given a two-player, zero-sum game Γ with reward function r and value v, if q is a coarse correlated equilibrium, then (i) r(q) = v and (ii) each player’s marginal distribution is an optimal strategy (i.e., optimal for the maximizing player means: guarantees he wins at least v; optimal for the minimizing player means: guarantees he loses at most v).
Proof Let $q_1$ and $q_2$ denote the marginal distributions of the maximizer and the minimizer in q, respectively. First, $r(q) \ge \max_{\alpha \in A_1} r(\alpha, q_2) \ge v$ since q is a coarse correlated equilibrium and v is the value of the game. Symmetrically, $r(q) \le \min_{\beta \in A_2} r(q_1, \beta) \le v$. Hence, r(q) = v. Second, applying the definition of coarse correlated equilibrium again together with the above result, $v = r(q) \ge \max_{\alpha \in A_1} r(\alpha, q_2)$, so by playing $q_2$, player 2 loses at most v. Symmetrically, $v = r(q) \le \min_{\beta \in A_2} r(q_1, \beta)$, so by playing $q_1$, player 1 wins at least v.
Note that the sets of coarse correlated equilibria and minimax equilibria need not coincide, since the former allows for correlations while the latter does not. For this reason, we refer to coarse correlated equilibria in two-player, zero-sum games as generalized minimax equilibria.
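4.2 Properties of $\vec{\Phi}$-Equilibrium
Next we discuss two convexity properties of the set of $\vec{\Phi}$-equilibria.
Proposition 7 Given an n-player game $\Gamma_n = \langle (A_i, r_i)_{1 \le i \le n} \rangle$ and a vector $\vec{\Phi} = (\Phi_i)_{1 \le i \le n}$ such that $\Phi_i \subseteq \Phi_{\mathrm{ALL}}(A_i)$ for $1 \le i \le n$, the set of $\vec{\Phi}$-equilibria is convex.
Proof If $q, q' \in \Delta(A_1 \times \ldots \times A_n)$ are both $\vec{\Phi}$-equilibria, then $r_i(q) \ge r_i(\tilde{\phi}(q))$ and $r_i(q') \ge r_i(\tilde{\phi}(q'))$, for all players i and for all $\phi \in \Phi_i$. Because $r_i$ and $\tilde{\phi}$ are linear, it follows that
\begin{align}
r_i(\alpha q + (1 - \alpha) q') &= \alpha r_i(q) + (1 - \alpha) r_i(q') \\
&\ge \alpha r_i(\tilde{\phi}(q)) + (1 - \alpha) r_i(\tilde{\phi}(q')) \\
&= r_i(\alpha \tilde{\phi}(q) + (1 - \alpha) \tilde{\phi}(q')) \\
&= r_i(\tilde{\phi}(\alpha q + (1 - \alpha) q'))
\end{align}
for all $\alpha \in [0, 1]$, for all players i, and for all $\phi \in \Phi_i$. Thus, $\alpha q + (1 - \alpha) q'$ is a $\vec{\Phi}$-equilibrium.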
Given a set of actions A, let I denote the identity map: i.e., $I(a) = \delta_a$ for all a ∈ A.
Definition 8 Given a set of actions A and a set of action transformations Φ ⊆ ΦALL(A), we define the super convex hull of Φ, denoted sch(Φ), as follows:
$$\mathrm{sch}(\Phi) = \left\{ \left( \sum_{j=1}^{k} \alpha_j \phi_j \right) + \beta I \;\middle|\; k \in \mathbb{N},\ \phi_j \in \Phi,\ \alpha_j \ge 0,\ \beta \in \mathbb{R},\ \text{and } \sum_{j=1}^{k} \alpha_j + \beta = 1 \right\}$$
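Proposition 9 Given an n-player game $\Gamma_n = \langle (A_i, r_i)_{1 \le i \le n} \rangle$ and a vector $\vec{\Phi} = (\Phi_i)_{1 \le i \le n}$ such that $\Phi_i \subseteq \Phi_{\mathrm{ALL}}(A_i)$ for $1 \le i \le n$, if q is a $\vec{\Phi}$-equilibrium, then q is also a $\vec{\Phi}'$-equilibrium, where $\vec{\Phi}' = (\mathrm{sch}(\Phi_i))_{1 \le i \le n}$.
Proof Let i be an arbitrary player and let $\phi^*$ be an arbitrary element of $\mathrm{sch}(\Phi_i)$. Since q is a $\vec{\Phi}$-equilibrium, $r_i(q) \ge r_i(\tilde{\phi}(q))$ for all $\phi \in \Phi_i$. Choose $k \in \mathbb{N}$, $\phi_j \in \Phi_i$ and $\alpha_j \ge 0$ for all $1 \le j \le k$, and $\beta \in \mathbb{R}$ such that $\phi^* = \sum_{j=1}^{k} \alpha_j \phi_j + \beta I$ and $\sum_{j=1}^{k} \alpha_j + \beta = 1$. Because $r_i$ and $\tilde{\phi}_j$ are linear, it follows that
\begin{align}
r_i(q) &= \sum_{j=1}^{k} \alpha_j r_i(q) + \beta r_i(q) \\
&\ge \sum_{j=1}^{k} \alpha_j r_i(\tilde{\phi}_j(q)) + \beta r_i(q) \\
&= r_i\left( \sum_{j=1}^{k} \alpha_j \tilde{\phi}_j(q) + \beta I(q) \right) \\
&= r_i\left( \tilde{\phi}^*(q) \right)
\end{align}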
Proof By Corollary 10, it suffices to show that, as t → ∞, the average $\Phi_i$-regret through time t experienced by each player i converges to the negative orthant if and only if for all players i and for all $\phi_i \in \Phi_i$, $r_i(\tilde{\phi}_i(z_t)) - r_i(z_t) \to \mathbb{R}_-$ as t → ∞. First, for arbitrary player i,
$$r_i(z_t) = \frac{1}{t} \sum_{\tau=1}^{t} r_i(a_{i,\tau}, a_{-i,\tau}) \tag{35}$$
Second, for arbitrary player i and for arbitrary $\phi_i \in \Phi_i$,
\begin{align}
r_i(\tilde{\phi}_i(z_t)) &= \sum_{a_i, a_{-i}} \tilde{\phi}_i(z_t)(a_i, a_{-i})\, r_i(a_i, a_{-i}) \tag{36} \\
&= \sum_{a_i, a_{-i}} \sum_{b_i \in A_i} z_t(b_i, a_{-i})\, \phi_i(b_i)(a_i)\, r_i(a_i, a_{-i}) \tag{37} \\
&= \sum_{b_i, a_{-i}} z_t(b_i, a_{-i})\, r_i(\phi_i(b_i), a_{-i}) \tag{38} \\
&= \frac{1}{t} \sum_{\tau=1}^{t} r_i(\phi_i(a_{i,\tau}), a_{-i,\tau}) \tag{39}
\end{align}
In Equation 36, we expand the definition of expected rewards, whereas in Equation 38 we collapse this definition. Equation 37 relies on the definition of $\tilde{\phi}_i$, the extension of $\phi_i$. Equation 39 follows from the definition of the empirical distribution $z_t$ in Equation 34. Therefore,
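\begin{align}
r_i(\tilde{\phi}_i(z_t)) - r_i(z_t) &= \frac{1}{t} \sum_{\tau=1}^{t} \left( r_i(\phi_i(a_{i,\tau}), a_{-i,\tau}) - r_i(a_{i,\tau}, a_{-i,\tau}) \right) \tag{40} \\
&= \frac{1}{t} \sum_{\tau=1}^{t} \rho^{\phi_i}(a_{i,\tau}, a_{-i,\tau}) \tag{41}
\end{align}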
From this equivalence, the conclusion follows immediately.
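The identity between the equilibrium gap of the empirical distribution and time-averaged regret (Equations 40 and 41) can be checked on a short history. The Python sketch below does so for the first player of a two-player game, using exact rational arithmetic; the reward table, transformation, and history are made-up illustrative data, and the function names are hypothetical.

```python
from fractions import Fraction

def empirical_distribution(history, n, m):
    """z_t: fraction of rounds in which each joint action (a, c) was played."""
    t = len(history)
    z = [[Fraction(0) for _ in range(m)] for _ in range(n)]
    for a, c in history:
        z[a][c] += Fraction(1, t)
    return z

def expected_reward(r, z):
    """r(z) = sum over joint actions of z(a, c) * r(a, c)."""
    return sum(z[a][c] * r[a][c] for a in range(len(z)) for c in range(len(z[0])))

def extend(phi, z):
    """Extension of phi: result[a][c] = sum over b of z[b][c] * phi[b][a]."""
    n, m = len(z), len(z[0])
    return [[sum(z[b][c] * phi[b][a] for b in range(n)) for c in range(m)]
            for a in range(n)]

def average_regret(r, phi, history):
    """(1/t) * sum over rounds of r(phi(a), c) - r(a, c)."""
    n = len(r)
    total = Fraction(0)
    for a, c in history:
        total += sum(phi[a][b] * r[b][c] for b in range(n)) - r[a][c]
    return total / len(history)

r = [[1, 0], [0, 2]]                 # hypothetical rewards for player 1
phi = [[0, 1], [0, 1]]               # maps every action to action 1
history = [(0, 0), (1, 1), (0, 1)]   # hypothetical joint play
z = empirical_distribution(history, 2, 2)
lhs = expected_reward(r, extend(phi, z)) - expected_reward(r, z)
rhs = average_regret(r, phi, history)
```

Both `lhs` and `rhs` evaluate to the same rational number, as the derivation predicts.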
By Theorem 11, if the time-averaged $\Phi_i$-regret experienced by each player i converges to the negative orthant with probability 1, then the empirical distribution of play converges to the set of $\vec{\Phi}$-equilibria with probability 1. But if each player i plays according to a no-$\Phi_i$-regret learning algorithm, then the time-averaged $\Phi_i$-regret experienced by each player i converges to the negative orthant with probability 1, regardless of the sequence of opposing actions: i.e., on any run of the game. From this discussion, we draw the following general conclusion:
Theorem 12 Given an n-player game $\Gamma_n$ and a vector of sets of action transformations $\vec{\Phi} = (\Phi_i)_{1 \le i \le n}$ such that $\Phi_i \subseteq \Phi_{\mathrm{ALL}}(A_i)$ is finite for $1 \le i \le n$, if all players i play no-$\Phi_i$-regret learning algorithms, then the empirical distribution of play converges to the set of $\vec{\Phi}$-equilibria of $\Gamma_n$ with probability 1.
Thus, we see that if all players abide by no-internal-regret algorithms, then the distribution of play converges to the set of correlated equilibria. Moreover, in two-player, zero-sum games, if all players abide by no-external-regret algorithms, then the distribution of play converges to the set of generalized minimax equilibria, that is, the set of minimax-valued joint distributions. Again, by Proposition 6, this latter result implies that each player’s empirical distribution of play converges to his set of minimax strategies, under the stated assumptions.
Perhaps surprisingly, no internal regret is the strongest form of no Φ-regret, for finite Φ. It follows from the results in Section 5 that the tightest game-theoretic solution concept to which no-Φ-regret learning algorithms converge is correlated equilibrium. In particular, Nash equilibrium is not a necessary outcome of learning via no-Φ-regret algorithms.
Theorem 13 Given a real-valued game (A, A′, R, r), if a learning algorithm L satisfies no internal regret, then L also satisfies no Φ-regret for all finite sets Φ ⊆ ΦALL(A).
The proof of this theorem follows immediately from the following lemmas.
Lemma 14 Given a real-valued game (A, A′, R, r), for all Φ ⊆ ΦALL(A) and Φ′ ⊆ sch(Φ), there exists a constant c > 0 such that $d\left(\mathbb{R}^{\Phi'}_-, \bar{\rho}^{\Phi'}_t\right) \le c\, d\left(\mathbb{R}^{\Phi}_-, \bar{\rho}^{\Phi}_t\right)$, for all t.
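Proof For each φ′ ∈ Φ′ ⊆ sch(Φ) there exist $k \in \mathbb{N}$, $\phi_j \in \Phi$ and $\alpha_j \ge 0$ for all $1 \le j \le k$, and $\beta \in \mathbb{R}$ such that $\phi' = \sum_{j=1}^{k} \alpha_j \phi_j + \beta I$ and $\sum_{j=1}^{k} \alpha_j + \beta = 1$. Now
\begin{align}
\rho^{\phi'}(a, a') &= r(\phi'(a), a') - r(a, a') \tag{42} \\
&= r\left( \left( \sum_{j=1}^{k} \alpha_j \phi_j + \beta I \right)(a), a' \right) - r(a, a') \tag{43} \\
&= \sum_{j=1}^{k} \alpha_j r(\phi_j(a), a') + (\beta - 1) r(a, a') \tag{44} \\
&= \sum_{j=1}^{k} \alpha_j \left( r(\phi_j(a), a') - r(a, a') \right) \tag{45} \\
&= \sum_{j=1}^{k} \alpha_j \rho^{\phi_j}(a, a') \tag{46}
\end{align}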
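Line 45 follows because $\sum_{j=1}^{k} \alpha_j = 1 - \beta$. Thus, we can define a linear transformation $F : \mathbb{R}^{\Phi} \to \mathbb{R}^{\Phi'}$ such that $F(\rho^{\Phi}(a, a')) = \rho^{\Phi'}(a, a')$, for all a, a′. Because the $\alpha_j$ are all non-negative, F exhibits the following property:
$$\rho^{\Phi}(a, a') \in \mathbb{R}^{\Phi}_- \Rightarrow F(\rho^{\Phi}(a, a')) = \rho^{\Phi'}(a, a') \in \mathbb{R}^{\Phi'}_- \tag{47}$$
i.e., $F(\mathbb{R}^{\Phi}_-) \subseteq \mathbb{R}^{\Phi'}_-$. Further, because F is linear,
$$d\left(\mathbb{R}^{\Phi'}_-, \bar{\rho}^{\Phi'}_t\right) = d\left(\mathbb{R}^{\Phi'}_-, F\left(\bar{\rho}^{\Phi}_t\right)\right) \le d\left(F\left(\mathbb{R}^{\Phi}_-\right), F\left(\bar{\rho}^{\Phi}_t\right)\right) \le c\, d\left(\mathbb{R}^{\Phi}_-, \bar{\rho}^{\Phi}_t\right)$$
where c > 0 is the operator norm of F.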
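The linear relation at the heart of this proof (Equations 42 through 46) is easy to confirm numerically: the regret with respect to a transformation built from non-negative weights on elements of Φ plus a multiple of the identity equals the same weighted sum of the regrets with respect to those elements. The Python sketch below checks this with exact rationals on made-up data; the reward table, weights, and function names are all illustrative assumptions.

```python
from fractions import Fraction

def regret(r, phi, a, a_opp):
    """r(phi(a), a') - r(a, a') for a row-stochastic matrix phi."""
    return sum(phi[a][b] * r[b][a_opp] for b in range(len(r))) - r[a][a_opp]

def combine(phis, alphas, beta):
    """Matrix of sum_j alpha_j * phi_j + beta * I."""
    n = len(phis[0])
    return [[sum(al * phi[i][j] for al, phi in zip(alphas, phis))
             + (beta if i == j else 0)
             for j in range(n)] for i in range(n)]

r = [[1, 0], [0, 2]]                           # hypothetical rewards
phis = [[[1, 0], [1, 0]], [[0, 1], [0, 1]]]    # the two external transformations
alphas = [Fraction(1, 2), Fraction(1, 4)]      # non-negative weights
beta = Fraction(1, 4)                          # alphas and beta sum to 1
phi_prime = combine(phis, alphas, beta)

identity_holds = all(
    regret(r, phi_prime, a, c)
    == sum(al * regret(r, phi, a, c) for al, phi in zip(alphas, phis))
    for a in range(2) for c in range(2)
)
```

Since the weights on the regrets are non-negative, a regret vector with no positive entry is mapped to one with no positive entry, which is the property stated in Equation 47.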
Lemma 15 Given a real-valued game (A, A′, R, r), if a learning algorithm L satisfies no Φ-regret for some finite set Φ ⊆ ΦALL(A), then L also satisfies no Φ′-regret, for all finite sets Φ′ ⊆ sch(Φ).
In this section, we relate our theorems on the existence of no-Φ-regret algorithms (Theorem 4), the convergence of no-Φ-regret learning to game-theoretic equilibria (Theorem 12), and the power of no-internal-regret (Theorem 13) to results published elsewhere.
7.1 On the Existence of No-Regret Algorithms
In Theorem 4, we rely on Theorem 2 to establish the existence of no-Φ-regret learning algorithms, for all finite Φ ⊆ ΦALL. The algorithms presented are, in the terminology of Greenwald et al. [15], action-regret-based learning algorithms, which means that regret at time t is computed with respect to the action $a_t$ as in Equations 6 and 7, as opposed to distribution-regret-based learning algorithms, in which regret at time t is calculated with respect to the distribution $q_t$ as follows:
$$\rho^{\Phi}(q, a') \equiv \left( \rho^{\phi}(q, a') \right)_{\phi \in \Phi} \tag{64}$$
where
$$\rho^{\phi}(q, a') = \sum_{a \in A} q(a) \left[ r(\phi(a), a') - r(a, a') \right] \tag{65}$$
Many well-known no-regret learning algorithms arise as instances, or close cousins, of action- or distribution-regret-based variants of Algorithm 1:
φt , we arrive at Freund and Schapire’s Hedge algorithm [11] when Φ = ΦEXT(A), and an instance of an algorithm discussed by Cesa-Bianchi and Lugosi [5] when Φ = ΦINT(A).
Lehrer [21] derives a “wide range no-regret theorem” analogous to our existence result. Lehrer’s approach combines “replacing schemes,” functions from H × A to A, with “activeness functions” from H × A to { 0 , 1 }. Given a replacing scheme g and an activeness function I, Lehrer’s framework compares the agent’s rewards to the rewards that could have been obtained by playing action g(ht, at), but only if I(ht, at) = 1, yielding a general form of action regret. Lehrer establishes the existence of action-regret-based no-regret algorithms whose regret with respect to any countable set of pairs of replacing schemes and activeness functions, averaged over the number of times each pair is “active,” approaches the negative orthant.
$\left(R_t^{(ij)}\right)^+$ and $(Q_t)_{ij} = \left(R_t^{(ij)}\right)^+$ for $i \ne j$. Our algorithm plays the fixed point of a matrix A which can be written as $A = I + \frac{1}{\sum_i |(Q_t)_{ii}|} Q_t$. Foster and Vohra’s algorithm plays the fixed point of a matrix A′ which can be written as $A' = I + \frac{1}{\max_i |(Q_t)_{ii}|} Q_t$. It can be shown that A and A′ have the same set of fixed points: if q is a fixed point of A, then $qA = q \Rightarrow q + \frac{1}{\sum_i |(Q_t)_{ii}|} qQ_t = q \Rightarrow qQ_t = 0 \Rightarrow qA' = qI = q$. And similarly in the other direction.
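The fixed-point argument in this note depends only on the fact that both matrices have the form I plus a nonzero multiple of the same matrix, so a vector q is fixed by either exactly when q times that matrix is zero. The Python sketch below illustrates this on a small made-up matrix `Q`; the entries of `Q`, the vector `q`, and the function names are hypothetical and are not taken from either algorithm.

```python
from fractions import Fraction

def vec_mat(q, M):
    """Row vector times matrix: (qM)_j = sum over i of q_i * M[i][j]."""
    return [sum(q[i] * M[i][j] for i in range(len(q))) for j in range(len(M[0]))]

def identity_plus_scaled(Q, scale):
    """I + Q / scale, computed with exact rationals."""
    n = len(Q)
    return [[(1 if i == j else 0) + Fraction(Q[i][j], scale) for j in range(n)]
            for i in range(n)]

Q = [[-2, 2], [1, -1]]                       # hypothetical matrix with zero row sums
diag = [abs(Q[i][i]) for i in range(len(Q))]
A = identity_plus_scaled(Q, sum(diag))       # scaled by the sum of |diagonal| entries
A_prime = identity_plus_scaled(Q, max(diag)) # scaled by the max of |diagonal| entries
q = [Fraction(1, 3), Fraction(2, 3)]         # satisfies qQ = 0
```

With these values, `vec_mat(q, Q)` is the zero vector and `q` is a fixed point of both `A` and `A_prime`.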
Because Lehrer deals with countable sets of replacing schemes, whereas we restrict attention to finite sets of transformations, it may seem that Lehrer’s theorem immediately subsumes ours. However, in Lehrer’s framework, the co-domain of each transformation is A, not ∆(A), as it is in ours. In other words, in our framework, actions are transformed into mixed, rather than pure, strategies. This turns out to yield no additional power, however. By Theorem 13, it suffices to consider the set of internal action transformations, each element of which can be expressed simply as a function from A → A. Hence, in light of Theorem 13, Lehrer’s theorem can in fact be viewed as subsuming our existence theorem.
Blum and Mansour’s [3] framework is similar to Lehrer’s, but yields results about distribution regret. Their “modification rules” are the same as replacing schemes, but instead of activeness functions, they pair modification rules with “time selection functions,” which are functions from N to the interval [0, 1]. The rewards an agent could have obtained under each modification rule are weighted according to how “awake” the rule is, as indicated by the corresponding time selection function. They present a method that, given a collection of algorithms whose external distribution regret is bounded above by f(t) at time t, generates an algorithm whose swap distribution regret (and hence, internal distribution regret) is bounded above by |A|f(t).
Cesa-Bianchi and Lugosi [5] develop a framework of “generalized” distribution regret. They rely on a notion of “experts,” which they define as functions from N to A, and following Lehrer, they pair experts $f_1, \ldots, f_N$ with activation functions $I_i : A \times \mathbb{N} \to \{0, 1\}$. At time t, for each i, if $I_i(a_t, t) = 1$, they compare the agent’s rewards to the rewards the agent could have obtained by playing $f_i(t)$. This approach is more general than our action-transformation framework in that alternatives may depend on time. At the same time, it is more limited in that it does not naturally represent swap regret (but this is not necessarily a shortcoming, in light of Theorem 13). Cesa-Bianchi and Lugosi’s calculations yield bounds on generalized distribution regret.
From the bounds derived by Blum and Mansour and Cesa-Bianchi and Lugosi, one can infer the existence of distribution-regret-based no-regret learning algorithms, by applying the Hoeffding-Azuma lemma (see, for example, the Appendix of Cesa-Bianchi and Lugosi [6]).
Finally, it has been observed that swap regret is bounded above by |A| times internal regret. Hence, no-internal-regret implies no-swap-regret, which in turn implies no-Φ-regret, for all Φ ⊆ ΦALL, because each element of any Φ can be constructed as a convex combination of the elements of ΦSWAP. It follows from this observation that every no-internal-regret algorithm, such as the algorithms of Foster and Vohra [10], Hart and Mas-Colell [18], Cesa-Bianchi and Lugosi [5], and Young [27], is a no-Φ-regret algorithm, for all Φ ⊆ ΦALL. In other words, the existence of (action- or distribution-based) no-internal-regret algorithms implies the existence of (action- or distribution-based) no-Φ-regret algorithms, for all Φ ⊆ ΦALL.
7.2 On the Connection between Learning and Games
Several authors before us have explored the connection between no-regret (and related) learning algorithms and game-theoretic equilibria. This literature originates with Foster and Vohra [9], who present a (calibrated) learning algorithm such that if all players play according to it, the empirical distribution of play converges to the set of correlated equilibria. Hart and Mas-Colell [17] exhibit a simple adaptive procedure such that if all players follow this procedure, then the time-averaged internal regret vector of each player converges to zero almost surely, and the empirical distribution of play converges to the set of correlated equilibria almost surely. Their algorithm is not no-internal-regret, however; it is not regret-minimizing against an
[5] N. Cesa-Bianchi and G. Lugosi. Potential-based algorithms in on-line prediction and game theory. Machine Learning, 51(3):239–261, 2003.
[6] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[7] A. Cournot. Recherches sur les Principes Mathématiques de la Théorie des Richesses. Hachette,
[8] D. Foster and R. Vohra. A randomization rule for selecting forecasts. Operations Research, 41(4):704–709, 1993.
[9] D. Foster and R. Vohra. Calibrated learning and correlated equilibrium. Games and Economic Behavior, 21:40–55, 1997.
[10] D. Foster and R. Vohra. Regret in the on-line decision problem. Games and Economic Behavior, 29:7–35, 1999.
[11] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Computational Learning Theory: Proceedings of the Second European Conference, pages 23–37. Springer-Verlag, 1995.
[12] D. Fudenberg and D. K. Levine. Universal consistency and cautious fictitious play. Journal of Economic Dynamics and Control, 19:1065–1090, 1995.
[13] A. Greenwald and A. Jafari. A general class of no-regret algorithms and game-theoretic equilibria. In Proceedings of the 2003 Computational Learning Theory Conference, pages 1–11, August 2003.
[14] A. Greenwald, A. Jafari, and C. Marks. Blackwell’s approachability theorem: A generalization in a special case. Technical Report CS-06-01, Brown University, Department of Computer Science, January 2006.
[15] A. Greenwald, Z. Li, and C. Marks. Bounds for regret-matching algorithms. Technical Report CS-06-10, Brown University, Department of Computer Science, June 2006.
[16] J. Hannan. Approximation to Bayes risk in repeated plays. In M. Dresher, A.W. Tucker, and P. Wolfe, editors, Contributions to the Theory of Games, volume 3, pages 97–139. Princeton University Press, 1957.
[17] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127–1150, 2000.
[18] S. Hart and A. Mas-Colell. A general class of adaptive strategies. Journal of Economic Theory, 98(1):26–54, 2001.
[19] M. Herbster and M. K. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178,
[20] A. Jafari. On the Notion of Regret in Infinitely Repeated Games. Master’s Thesis, Brown University, Providence, May 2003.
[21] E. Lehrer. A wide range no-regret theorem. Games and Economic Behavior, 42(1):101–115,
[22] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108:212–261, 1994.
[23] H. Moulin and J. P. Vial. Strategically zero-sum games. International Journal of Game Theory, 7:201–221, 1978.
[24] J. Nash. Non-cooperative games. Annals of Mathematics, 54:286–295, 1951.
[25] J. Robinson. An iterative method of solving a game. Annals of Mathematics, 54:298–301,
[26] G. Stoltz and G. Lugosi. Learning correlated equilibria in games with compact sets of strategies. Games and Economic Behavior, To Appear.
[27] P. Young. Strategic Learning and its Limits. Oxford University Press, Oxford, 2004.
Lemma 17 Let $(X, d_X)$ be a compact metric space and let $(Y, d_Y)$ be a metric space. Let $\{x_t\}$ be an $X$-valued sequence, and let $S$ be a nonempty, closed subset of $Y$. If $f : X \to Y$ is continuous and if $f^{-1}(S)$ is nonempty, then $d_X(x_t, f^{-1}(S)) \to 0$ as $t \to \infty$ if and only if $d_Y(f(x_t), S) \to 0$ as $t \to \infty$.
Proof We write $d$ for both $d_X$ and $d_Y$, since the appropriate choice of distance metric is always clear from the context. To prove the forward implication, assume $d(x_t, f^{-1}(S)) \to 0$ as $t \to \infty$, and fix $\epsilon > 0$. Because $X$ is compact, $f$ is uniformly continuous, so there exists $\delta > 0$ s.t. $d(x, x') < \delta$ implies $d(f(x), f(x')) < \epsilon$. Choose $t_0$ s.t. for all $t \ge t_0$, $d(x_t, f^{-1}(S)) < \frac{\delta}{2}$. Observe that for all $x_t$ and for all $\gamma > 0$, there exists $q_t^{(\gamma)} \in f^{-1}(S)$ s.t. $d(x_t, q_t^{(\gamma)}) < d(x_t, f^{-1}(S)) + \gamma$. Now, for all $t \ge t_0$, since $d(x_t, q_t^{(\delta/2)}) < \frac{\delta}{2} + \frac{\delta}{2} = \delta$, by the continuity of $f$, $d(f(x_t), f(q_t^{(\delta/2)})) < \epsilon$. Therefore, $d(f(x_t), S) < \epsilon$, since $f(q_t^{(\delta/2)}) \in S$. As $\epsilon > 0$ was arbitrary, $d(f(x_t), S) \to 0$.
To prove the reverse implication, assume $d(f(x_t), S) \to 0$ as $t \to \infty$. We must show that for all $\epsilon > 0$, there exists a $t_0$ s.t. for all $t \ge t_0$, $d(x_t, f^{-1}(S)) < \epsilon$. Define $T = \{x \in X \mid d(x, f^{-1}(S)) \ge \epsilon\}$. If $T = \emptyset$, the claim holds. Otherwise, observe that $T$ can be expressed as the complement of a union of open balls, so that $T$ is closed and thus compact. Define $g : X \to \mathbb{R}$ as $g(x) = d(f(x), S)$; $g$ is continuous because $f$ is. No $x \in T$ lies in $f^{-1}(S)$, so $f(x) \notin S$ for all $x \in T$. By assumption $S$ is closed; hence, $g(x) > 0$ for all $x \in T$. Because $T$ is compact, $g$ achieves some minimum value, say $L > 0$, on $T$. Choose $t_0$ s.t. $d(f(x_t), S) < L$ for all $t \ge t_0$. Thus, for all $t \ge t_0$, $g(x_t) < L \Rightarrow x_t \notin T \Rightarrow d(x_t, f^{-1}(S)) < \epsilon$.
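As a small numerical illustration of Lemma 17 (not part of the original proof), the sketch below uses an assumed toy instance: X = [−1, 1], f(x) = x², and S = [1/4, 1], so that f⁻¹(S) = [−1, −1/2] ∪ [1/2, 1]. The sequence and the helper names are hypothetical. Both distances shrink together along the sequence, as the lemma predicts.

```python
def f(x):
    # Continuous map from X = [-1, 1] into Y = [0, 1].
    return x * x

def dist_to_preimage(x):
    # Distance from x to f^{-1}(S) = [-1, -1/2] U [1/2, 1].
    return max(0.0, 0.5 - abs(x))

def dist_to_S(y):
    # Distance from y in [0, 1] to S = [1/4, 1].
    return max(0.0, 0.25 - y)

# Sequence that alternates sign and approaches the boundary points +-1/2
# of the preimage from the inside of (-1/2, 1/2).
ts = range(2, 2001)
xs = [(-1) ** t * (0.5 - 1.0 / t) for t in ts]

d_X = [dist_to_preimage(x) for x in xs]   # equals 1/t up to rounding
d_Y = [dist_to_S(f(x)) for x in xs]       # equals 1/t - 1/t**2 up to rounding

print(d_X[0], d_Y[0])
print(d_X[-1], d_Y[-1])
```

Note that the sequence itself does not converge, since it alternates between neighborhoods of −1/2 and 1/2; the lemma only concerns distances to sets.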