THE NONSTOCHASTIC MULTIARMED BANDIT PROBLEM∗

PETER AUER†, NICOLÒ CESA-BIANCHI‡, YOAV FREUND§, AND ROBERT E. SCHAPIRE¶

SIAM J. COMPUT. © 2002 Society for Industrial and Applied Mathematics
Vol. 32, No. 1, pp. 48–77

Abstract. In the multiarmed bandit problem, a gambler must decide which arm of $K$ non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines.

In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of $T$ plays, we prove that the per-round payoff of our algorithm approaches that of the best arm at the rate $O(T^{-1/2})$. We show by a matching lower bound that this is the best possible.

We also prove that our algorithm approaches the per-round payoff of any set of strategies at a similar rate: if the best strategy is chosen from a pool of $N$ strategies, then our algorithm approaches the per-round payoff of the strategy at the rate $O((\log N)^{1/2} T^{-1/2})$. Finally, we apply our results to the problem of playing an unknown repeated matrix game. We show that our algorithm approaches the minimax payoff of the unknown game at the rate $O(T^{-1/2})$.
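Read as statements about cumulative rather than per-round payoff, the rates above can be restated as follows. This is only a sketch in assumed notation: $G_A(T)$ stands for the total payoff of the algorithm over $T$ plays, $G_{\max}(T)$ for that of the best single arm, and $G_{N}(T)$ for that of the best of the $N$ strategies. The expectation assumes the algorithm is randomized, and any dependence on $K$ is hidden in the $O(\cdot)$, as in the abstract.

```latex
\[
\frac{G_{\max}(T) - \mathbb{E}[G_A(T)]}{T} = O\bigl(T^{-1/2}\bigr)
\quad\Longleftrightarrow\quad
G_{\max}(T) - \mathbb{E}[G_A(T)] = O\bigl(\sqrt{T}\bigr),
\]
\[
\frac{G_{N}(T) - \mathbb{E}[G_A(T)]}{T} = O\bigl((\log N)^{1/2}\, T^{-1/2}\bigr)
\quad\Longleftrightarrow\quad
G_{N}(T) - \mathbb{E}[G_A(T)] = O\bigl(\sqrt{T \log N}\bigr).
\]
```

In both cases the total shortfall grows sublinearly in $T$, which is why the per-round gap vanishes.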

Key words. adversarial bandit problem, unknown matrix games

AMS subject classifications. 68Q32, 68T05, 91A20

PII. S0097539701398375

1. Introduction. In the multiarmed bandit problem, originally proposed by Robbins [17], a gambler must choose which of $K$ slot machines to play. At each time step, he pulls the arm of one of the machines and receives a reward or payoff (possibly zero or negative). The gambler’s purpose is to maximize his return, i.e., the sum of the rewards he receives over a sequence of pulls. In this model, each arm is assumed to deliver rewards that are independently drawn from a fixed and unknown distribution. As reward distributions differ from arm to arm, the goal is to find the arm with the highest expected payoff as early as possible and then to keep gambling using that best arm.
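The classical stochastic model just described can be simulated in a few lines. The sketch below is illustrative only and is not the algorithm analyzed in this paper: it assumes Bernoulli (0/1) rewards, and the names `pull`, `explore_then_commit`, and `pulls_per_arm` are hypothetical. The gambler samples every arm a fixed number of times and then keeps playing the arm with the highest empirical mean.

```python
import random


def pull(means, arm, rng):
    """Draw a 0/1 reward from the chosen arm (Bernoulli rewards are an assumption)."""
    return 1.0 if rng.random() < means[arm] else 0.0


def explore_then_commit(means, horizon, pulls_per_arm, rng):
    """Sample each arm pulls_per_arm times, then commit to the empirically best arm.

    Returns (total_reward, committed_arm). A simple baseline for the
    stochastic model, not the paper's method.
    """
    k = len(means)
    totals = [0.0] * k   # summed reward observed per arm
    counts = [0] * k     # number of pulls per arm
    total_reward = 0.0
    t = 0
    # Exploration phase: visit the arms in round-robin order.
    for _ in range(pulls_per_arm):
        for arm in range(k):
            if t >= horizon:
                break
            r = pull(means, arm, rng)
            totals[arm] += r
            counts[arm] += 1
            total_reward += r
            t += 1
    # Exploitation phase: play the arm with the highest empirical mean.
    estimates = [totals[a] / counts[a] if counts[a] else 0.0 for a in range(k)]
    best = max(range(k), key=lambda a: estimates[a])
    while t < horizon:
        total_reward += pull(means, best, rng)
        t += 1
    return total_reward, best


if __name__ == "__main__":
    rng = random.Random(0)
    means = [0.2, 0.5, 0.8]  # hypothetical arm means, unknown to the gambler
    reward, arm = explore_then_commit(means, horizon=10_000, pulls_per_arm=200, rng=rng)
    print("committed arm:", arm, "average reward:", reward / 10_000)
```

The choice of `pulls_per_arm` is where the trade-off shows up: too few exploratory pulls risk committing to the wrong arm, while too many waste plays on arms already known to be poor.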

The problem is a paradigmatic example of the trade-off between exploration and exploitation. On the one hand, if the gambler plays exclusively on the machine that he thinks is best (“exploitation”), he may fail to discover that one of the other arms actually has a higher expected payoff. On the other hand, if he spends too much time

∗Received by the editors November 18, 2001; accepted for publication (in revised form) July 7, 2002; published electronically November 19, 2002. An early extended abstract of this paper appeared in Proceedings of the 36th Annual Symposium on Foundations of Computer Science, 1995, IEEE Computer Society, pp. 322–331.
http://www.siam.org/journals/sicomp/32-1/39837.html

†Institute for Theoretical Computer Science, Graz University of Technology, A-8010 Graz, Austria ([email protected]). This author gratefully acknowledges the support of ESPRIT Working Group EP 27150, Neural and Computational Learning II (NeuroCOLT II).

‡Department of Information Technology, University of Milan, I-26013 Crema, Italy (cesa-bianchi@dti.unimi.it). This author gratefully acknowledges the support of ESPRIT Working Group EP 27150, Neural and Computational Learning II (NeuroCOLT II).

§Banter Inc. and Hebrew University, Jerusalem, Israel ([email protected]).

¶AT&T Labs – Research, Shannon Laboratory, Florham Park, NJ 07932-0971 (schapire@research.att.com).
