Open Theoretical Questions in Reinforcement Learning
by: Richard Sutton
Abstract
Reinforcement learning (RL) concerns the problem of a learning agent interacting with its environment to achieve a goal. Instead of being given examples of desired behavior, the learning agent must discover by trial and error how to behave in order to get the most reward. The environment is a Markov decision process (MDP) with state set $$ \mathcal{S} $$ and action set $$ \mathcal{A} $$. The agent and the environment interact in a sequence of discrete steps, t = 0, 1, 2, ... The state and action at one time step, $$ s_t \in \mathcal{S} $$ and $$ a_t \in \mathcal{A} $$, determine the probability distribution for the state at the next time step, $$ s_{t+1} \in \mathcal{S} $$, and, jointly, the distribution for the next reward, $$ r_{t+1} \in \Re $$. The agent's objective is to choose each $$ a_t $$ to maximize the subsequent return: $$ R_t = \sum\limits_{k=0}^{\infty} \gamma^k r_{t+1+k}, $$ where the discount rate, $$ 0 \le \gamma \le 1 $$, determines the relative weighting of immediate and delayed rewards. In some environments, the interaction consists of a sequence of episodes, each starting in a given state and ending upon arrival in a terminal state, which terminates the series above. In other cases the interaction is continual, without interruption, and the sum may have an infinite number of terms (in which case we usually assume $$ \gamma < 1 $$). Infinite-horizon cases with $$ \gamma = 1 $$ are also possible though less common (e.g., see Mahadevan, 1996).
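
As an illustration of the return defined in the abstract, the following is a minimal Python sketch that evaluates the discounted sum for a finite (episodic) reward sequence. The function name `discounted_return` and the list representation of rewards are assumptions made for this example and do not come from the paper. The list is taken to hold the rewards that follow time step t, in order, and the episode is assumed to terminate after the last one.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence.

    rewards[k] plays the role of r_{t+1+k} in the abstract's formula;
    the episode is assumed to end after the last listed reward.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must satisfy 0 <= gamma <= 1")

    total = 0.0
    # Accumulate backwards: G_k = r_k + gamma * G_{k+1}, which avoids
    # computing explicit powers of gamma.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

With gamma = 1 the function reduces to the plain sum of rewards, which is only well defined here because the sequence is finite. The continual case discussed in the abstract would need gamma < 1 or a different formulation.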