The Australian National University (Australia).
Nonparametric General Reinforcement Learning.
Record Type:
Language materials, manuscript : Monograph/item
Title/Author:
Nonparametric General Reinforcement Learning./
Author:
Leike, Jan.
Description:
1 online resource (197 pages)
Notes:
Source: Dissertation Abstracts International, Volume: 75-01C.
Subject:
Computer science.
Online resource:
click for full text (PQDT)
Thesis (Ph.D.)--The Australian National University (Australia), 2016.
Includes bibliographical references
Reinforcement learning problems are often phrased in terms of Markov decision processes (MDPs). In this thesis we go beyond MDPs and consider reinforcement learning in environments that are non-Markovian, non-ergodic and only partially observable. Our focus is not on practical algorithms, but rather on the fundamental underlying problems: How do we balance exploration and exploitation? How do we explore optimally? When is an agent optimal? We follow the nonparametric realizable paradigm: we assume the data is drawn from an unknown source that belongs to a known countable class of candidates.
Electronic reproduction. Ann Arbor, Mich. : ProQuest, 2018.
Mode of access: World Wide Web
Subjects--Topical Terms:
Computer science.
Index Terms--Genre/Form:
Electronic books.
LDR    04086ntm a2200361K 4500
001    913687
005    20180622095235.5
006    m o u
007    cr mn||||a|a||
008    190606s2016 xx obm 000 0 eng d
035    $a (MiAaPQ)AAI10587503
035    $a (MiAaPQ)AustNatlU1885111080
035    $a AAI10587503
040    $a MiAaPQ $b eng $c MiAaPQ
100 1  $a Leike, Jan. $3 1186623
245 10 $a Nonparametric General Reinforcement Learning.
264  0 $c 2016
300    $a 1 online resource (197 pages)
336    $a text $b txt $2 rdacontent
337    $a computer $b c $2 rdamedia
338    $a online resource $b cr $2 rdacarrier
500    $a Source: Dissertation Abstracts International, Volume: 75-01C.
502    $a Thesis (Ph.D.)--The Australian National University (Australia), 2016.
504    $a Includes bibliographical references
520    $a Reinforcement learning problems are often phrased in terms of Markov decision processes (MDPs). In this thesis we go beyond MDPs and consider reinforcement learning in environments that are non-Markovian, non-ergodic and only partially observable. Our focus is not on practical algorithms, but rather on the fundamental underlying problems: How do we balance exploration and exploitation? How do we explore optimally? When is an agent optimal? We follow the nonparametric realizable paradigm: we assume the data is drawn from an unknown source that belongs to a known countable class of candidates.
520    $a First, we consider the passive (sequence prediction) setting, learning from data that is not independent and identically distributed. We collect results from artificial intelligence, algorithmic information theory, and game theory and put them in a reinforcement learning context: they demonstrate how an agent can learn the value of its own policy.
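The claim above, that an agent can learn the value of its own policy from its own (non-i.i.d.) experience stream, can be sketched with a toy Monte Carlo estimate. This is an invented illustration, not a construction from the thesis: the environment dynamics, the policy, and all constants here are made up, and a running average of discounted returns stands in for the far more general estimators the abstract refers to.

```python
import random

# Toy sketch: an agent following a fixed policy estimates that policy's
# expected discounted return from its own rollouts. The hidden dynamics
# below are invented for illustration.

GAMMA = 0.9

def rollout(policy, horizon=50):
    # one episode in a small hidden-state environment
    state, ret, discount = 0, 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        reward = 1.0 if state == 0 else 0.0
        noise = 1 if random.random() < 0.1 else 0
        state = (state + action + noise) % 4
        ret += discount * reward
        discount *= GAMMA
    return ret

random.seed(1)
policy = lambda s: s % 2            # the agent's own fixed policy
total, estimates = 0.0, []
for n in range(1, 2001):
    total += rollout(policy)
    estimates.append(total / n)     # running Monte Carlo value estimate
```

The running estimate stabilizes as episodes accumulate, which is the sense in which the agent "learns the value of its own policy" here; note this sketch sidesteps the non-ergodicity issues the thesis is actually concerned with.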
520    $a Next, we establish negative results on Bayesian reinforcement learning agents, in particular AIXI. We show that unlucky or adversarial choices of the prior cause the agent to misbehave drastically. Therefore Legg-Hutter intelligence and balanced Pareto optimality, which depend crucially on the choice of the prior, are entirely subjective. Moreover, in the class of all computable environments every policy is Pareto optimal. This undermines all existing optimality properties for AIXI.
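The prior-dependence complained about above can be seen already in a two-environment toy (invented for this note; the thesis's results concern far richer classes): the same candidate class under two different priors yields two different "Bayes-optimal" first actions, so optimality relative to a prior is a subjective notion.

```python
# Two one-shot candidate environments; reward[env][action].
# Environment "A" rewards action 0, environment "B" rewards action 1.
reward = {"A": [1.0, 0.0], "B": [0.0, 1.0]}

def bayes_optimal_action(prior):
    # expected reward of each action under the prior over environments
    expected = [sum(prior[e] * reward[e][a] for e in reward) for a in (0, 1)]
    return max((0, 1), key=lambda a: expected[a])

print(bayes_optimal_action({"A": 0.9, "B": 0.1}))  # prior favouring A -> 0
print(bayes_optimal_action({"A": 0.1, "B": 0.9}))  # prior favouring B -> 1
```

Nothing about the environments changed between the two calls; only the prior did, yet the "optimal" behaviour flipped.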
520    $a However, there are Bayesian approaches to general reinforcement learning that satisfy objective optimality guarantees: We prove that Thompson sampling is asymptotically optimal in stochastic environments in the sense that its value converges to the value of the optimal policy. We connect asymptotic optimality to regret given a recoverability assumption on the environment that allows the agent to recover from mistakes. Hence Thompson sampling achieves sublinear regret in these environments.
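The Thompson sampling mechanism referred to above can be sketched on a Bernoulli bandit, a crude stand-in for the general stochastic environments of the thesis (all arm means and constants here are invented): sample one plausible environment from the posterior, act optimally for that sample, observe, update, repeat.

```python
import random

# Thompson sampling sketch on a 3-armed Bernoulli bandit.
random.seed(0)
true_means = [0.2, 0.5, 0.8]            # unknown to the agent
wins = [1, 1, 1]                         # Beta(1,1) posterior parameters
losses = [1, 1, 1]

pulls_of_best = 0
for t in range(2000):
    # sample one candidate environment from the posterior
    sampled = [random.betavariate(w, l) for w, l in zip(wins, losses)]
    arm = max(range(3), key=lambda i: sampled[i])  # optimal for the sample
    reward = 1 if random.random() < true_means[arm] else 0
    wins[arm] += reward
    losses[arm] += 1 - reward
    if t >= 1000:
        pulls_of_best += (arm == 2)      # track later-stage behaviour
```

Because posterior sampling keeps exploring arms that might still be best, the agent's play concentrates on the truly best arm over time, the bandit analogue of the value convergence stated in the abstract.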
520    $a AIXI is known to be incomputable. We quantify this using the arithmetical hierarchy, and establish upper and corresponding lower bounds for incomputability. Further, we show that AIXI is not limit computable, thus cannot be approximated using finite computation. However, there are limit computable epsilon-optimal approximations to AIXI. We also derive computability bounds for knowledge-seeking agents, and give a limit computable weakly asymptotically optimal reinforcement learning agent.
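A quantity f is limit computable when f(x) = lim_n g(n, x) for some computable g. The toy below (invented; it is emphatically not AIXI, whose value involves an incomputable mixture) only illustrates that schema: an infinite-horizon discounted value is the limit of computable finite-horizon truncations.

```python
# Schema illustration: V = lim_n V_n, with each V_n computable.
GAMMA = 0.5

def truncated_value(reward_seq, n):
    # computable n-step approximation of the discounted value
    return sum(GAMMA ** t * reward_seq(t) for t in range(n))

reward = lambda t: 1.0                      # toy environment: reward 1 always
exact = 1.0 / (1.0 - GAMMA)                 # closed-form infinite-horizon value
approximations = [truncated_value(reward, n) for n in range(1, 30)]
print(abs(approximations[-1] - exact) < 1e-6)  # -> True
```

The thesis's point is sharper than this sketch: for genuinely limit computable quantities no single n certifies how close g(n, x) is to the limit, which is why epsilon-optimal limit computable approximations to AIXI are a nontrivial result.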
520    $a Finally, our results culminate in a formal solution to the grain of truth problem: A Bayesian agent acting in a multi-agent environment learns to predict the other agents' policies if its prior assigns positive probability to them (the prior contains a grain of truth). We construct a large but limit computable class containing a grain of truth and show that agents based on Thompson sampling over this class converge to play epsilon-Nash equilibria in arbitrary unknown computable multi-agent environments.
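The grain-of-truth condition itself can be illustrated with a deliberately small toy (all numbers invented; the thesis's construction is a limit computable class of reflective policies, far beyond this): when the agent's prior over candidate opponent policies assigns positive weight to the true one, its posterior prediction of the opponent's play converges to the truth, and its best response stabilizes.

```python
import random

# Grain-of-truth toy: posterior over candidate opponent policies,
# where the class contains the true policy.
random.seed(0)
candidates = [0.2, 0.5, 0.8]    # each candidate: P(opponent plays action 1)
true_policy = 0.8                # in the class: the grain of truth
weights = [1 / 3] * 3            # prior assigns it positive probability

for _ in range(300):
    move = 1 if random.random() < true_policy else 0
    likelihood = [p if move == 1 else 1 - p for p in candidates]
    unnorm = [w * l for w, l in zip(weights, likelihood)]
    z = sum(unnorm)
    weights = [u / z for u in unnorm]

predicted = sum(w * p for w, p in zip(weights, candidates))
# best-respond to the predicted opponent mixture (matching-pennies style)
best_response = 1 if predicted > 0.5 else 0
```

Without the grain of truth (true_policy outside the span of the class), the posterior can converge to a wrong prediction, which is exactly the failure mode the thesis's construction rules out.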
533    $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2018
538    $a Mode of access: World Wide Web
650  4 $a Computer science. $3 573171
650  4 $a Artificial intelligence. $3 559380
655  7 $a Electronic books. $2 local $3 554714
690    $a 0984
690    $a 0800
710 2  $a ProQuest Information and Learning Co. $3 1178819
710 2  $a The Australian National University (Australia). $3 1186624
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10587503 $z click for full text (PQDT)