Thompson Sampling for Bandit Problems.
Liu, Che-Yu.
Record Type:
Language materials, manuscript : Monograph/item
Title/Author:
Thompson Sampling for Bandit Problems.
Author:
Liu, Che-Yu.
Description:
1 online resource (113 pages)
Notes:
Source: Dissertation Abstracts International, Volume: 79-07(E), Section: B.
Contained By:
Dissertation Abstracts International, 79-07B(E).
Subject:
Artificial intelligence.
Online resource:
click for full text (PQDT)
ISBN:
9780355626421
LDR
:03591ntm a2200373Ki 4500
001
918837
005
20181106104112.5
006
m o u
007
cr mn||||a|a||
008
190606s2018 xx obm 000 0 eng d
020
$a
9780355626421
035
$a
(MiAaPQ)AAI10643204
035
$a
(MiAaPQ)princeton:12387
035
$a
AAI10643204
040
$a
MiAaPQ
$b
eng
$c
MiAaPQ
$d
NTU
100
1
$a
Liu, Che-Yu.
$3
1193268
245
1 0
$a
Thompson Sampling for Bandit Problems.
264
0
$c
2018
300
$a
1 online resource (113 pages)
336
$a
text
$b
txt
$2
rdacontent
337
$a
computer
$b
c
$2
rdamedia
338
$a
online resource
$b
cr
$2
rdacarrier
500
$a
Source: Dissertation Abstracts International, Volume: 79-07(E), Section: B.
500
$a
Adviser: Sebastien Bubeck.
502
$a
Thesis (Ph.D.)--Princeton University, 2018.
504
$a
Includes bibliographical references
520
$a
Bandit problems are the most basic examples of sequential decision-making problems with limited feedback and an exploitation/exploration trade-off. In these problems, an agent repeatedly selects an action out of a pool of candidates and receives a reward sampled from that action's reward distribution. The agent's goal is to maximize the sum of rewards received over time. The trade-off at each time step is between exploiting actions that have already produced high rewards and exploring poorly understood actions that may yield even higher rewards in the future. Bandit problems arise naturally in many applications, such as clinical trials, project management, and online news recommendation.
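The paragraph above describes the standard stochastic bandit protocol. As a purely illustrative aid, not drawn from the thesis or this record, the Python sketch below simulates that protocol with hypothetical Bernoulli arms and a naive epsilon-greedy agent, just to make the exploitation/exploration trade-off concrete.

# Illustrative sketch only (not from the thesis): the stochastic bandit
# protocol with hypothetical Bernoulli arms and a naive epsilon-greedy agent.
import numpy as np

rng = np.random.default_rng(0)
arm_means = [0.3, 0.5, 0.7]          # hypothetical success probabilities per action
n_rounds, epsilon = 10_000, 0.1

counts = np.zeros(len(arm_means))    # how often each action was played
sums = np.zeros(len(arm_means))      # total reward collected per action
total_reward = 0.0

for t in range(n_rounds):
    if counts.min() == 0 or rng.random() < epsilon:
        arm = int(rng.integers(len(arm_means)))    # explore a (possibly untried) action
    else:
        arm = int(np.argmax(sums / counts))        # exploit the empirically best action
    reward = float(rng.random() < arm_means[arm])  # reward drawn from the action's distribution
    counts[arm] += 1
    sums[arm] += reward
    total_reward += reward

print("average reward over", n_rounds, "rounds:", total_reward / n_rounds)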
520
$a
Thompson Sampling is a popular strategy to solve bandit problems. It selects actions using the "probability matching" principle. In the first part of this thesis, we analyze Thompson Sampling from several different angles. First, we prove a tight bound on Thompson Sampling's performance when the performance is averaged with respect to the prior distribution that Thompson Sampling uses as input. Next, we look at the more realistic non-averaged performance of Thompson Sampling. We quantify the sensitivity of Thompson Sampling's (non-averaged) performance to the choice of input prior, by providing matching upper and lower bounds. Finally, we illustrate Thompson Sampling's ability to optimally exploit prior knowledge by thoroughly analyzing its behavior in a non-trivial example.
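For concreteness, here is a minimal sketch of Thompson Sampling for Bernoulli arms with independent Beta(1, 1) priors, the textbook instance of the probability-matching principle mentioned above; the arm means are hypothetical, and this is not the prior-sensitivity analysis carried out in the thesis.

# Minimal Thompson Sampling sketch: Bernoulli arms, independent Beta(1, 1) priors.
# Arm means are hypothetical; this only illustrates probability matching.
import numpy as np

rng = np.random.default_rng(1)
arm_means = [0.3, 0.5, 0.7]          # hypothetical true success probabilities
alpha = np.ones(len(arm_means))      # posterior Beta parameters: successes + 1
beta = np.ones(len(arm_means))       # posterior Beta parameters: failures + 1

for t in range(10_000):
    theta = rng.beta(alpha, beta)                  # one posterior sample per arm
    arm = int(np.argmax(theta))                    # play the arm that looks best under the sample
    reward = float(rng.random() < arm_means[arm])
    alpha[arm] += reward                           # conjugate Bayesian update for the played arm
    beta[arm] += 1.0 - reward

print("posterior mean reward estimates:", alpha / (alpha + beta))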
520
$a
In the second part of this thesis, we switch our focus to the most-correlated-arms identification problem, where the actions' reward distributions are assumed to be jointly Gaussian and the goal is to find the actions whose rewards are most mutually correlated. In this problem, unlike in bandit problems, we focus on exploring the actions to acquire as much relevant information as possible, and only exploit the acquired information at the end, when returning the set of correlated actions. We propose two adaptive action-selection strategies and show that they can have significant advantages over the non-adaptive uniform sampling strategy. Our proposed algorithms rely on a novel correlation estimator, and the use of this accurate estimator allows us to obtain improved results for a wide range of problem instances.
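As a rough illustration of this second setting, and again not the estimator or adaptive strategies proposed in the thesis, the sketch below uses a hypothetical jointly Gaussian reward vector, spends a uniform (non-adaptive) sampling budget on each pair of actions, forms plug-in empirical correlations, and returns the most correlated pair.

# Illustrative sketch of the most-correlated-pair task under non-adaptive
# uniform sampling. The covariance matrix is hypothetical; the thesis's
# adaptive strategies and correlation estimator are not reproduced here.
import itertools
import numpy as np

rng = np.random.default_rng(2)
mean = np.zeros(4)
cov = np.array([[1.0, 0.1, 0.2, 0.1],    # hypothetical joint Gaussian over 4 actions
                [0.1, 1.0, 0.8, 0.2],
                [0.2, 0.8, 1.0, 0.3],
                [0.1, 0.2, 0.3, 1.0]])

pairs = list(itertools.combinations(range(4), 2))
samples = {p: [] for p in pairs}

for (i, j) in pairs:                     # uniform budget: same number of draws per pair
    for _ in range(500):
        x = rng.multivariate_normal(mean, cov)
        samples[(i, j)].append((x[i], x[j]))   # only the rewards of the queried pair are kept

def empirical_corr(xy):
    xs, ys = np.array(xy).T
    return float(np.corrcoef(xs, ys)[0, 1])    # plug-in correlation estimate

best = max(pairs, key=lambda p: empirical_corr(samples[p]))
print("estimated most correlated pair of actions:", best)   # (1, 2) for this covariance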
533
$a
Electronic reproduction.
$b
Ann Arbor, Mich. :
$c
ProQuest,
$d
2018
538
$a
Mode of access: World Wide Web
650
4
$a
Artificial intelligence.
$3
559380
650
4
$a
Computer science.
$3
573171
650
4
$a
Statistics.
$3
556824
655
7
$a
Electronic books.
$2
local
$3
554714
690
$a
0800
690
$a
0984
690
$a
0463
710
2
$a
ProQuest Information and Learning Co.
$3
1178819
710
2
$a
Princeton University.
$b
Operations Research and Financial Engineering.
$3
1182940
773
0
$t
Dissertation Abstracts International
$g
79-07B(E).
856
4 0
$u
http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10643204
$z
click for full text (PQDT)