
Reinforcement learning : an introduction

Richard S. Sutton (Author), Andrew G. Barto (Author)
"Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the field's key ideas and algorithms"--Provided by publisher
Print Book, English, 2018
Second edition
The MIT Press, Cambridge, Massachusetts, 2018
xxii, 526 pages : illustrations (some color) ; 24 cm
ISBN: 9780262039246, 0262039249
OCLC: 1043175824
Introduction. Reinforcement Learning
Examples
Elements of Reinforcement Learning
Limitations and Scope
An Extended Example: Tic-Tac-Toe
Summary
Early History of Reinforcement Learning
Tabular Solution Methods. Multi-armed Bandits. A k-armed Bandit Problem
Action-value Methods
The 10-armed Testbed
Incremental Implementation
Tracking a Nonstationary Problem
Optimistic Initial Values
Upper-Confidence-Bound Action Selection
Gradient Bandit Algorithms
Associative Search (Contextual Bandits)
Summary
Finite Markov Decision Processes
The Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Unified Notation for Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Optimal Value Functions
Optimality and Approximation
Summary
Dynamic Programming. Policy Evaluation (Prediction)
Policy Improvement
Policy Iteration
Value Iteration
Asynchronous Dynamic Programming
Generalized Policy Iteration
Efficiency of Dynamic Programming
Summary
Monte Carlo Methods. Monte Carlo Prediction
Monte Carlo Estimation of Action Values
Monte Carlo Control
Monte Carlo Control without Exploring Starts
Off-policy Prediction via Importance Sampling
Incremental Implementation
Off-policy Monte Carlo Control
*Discounting-aware Importance Sampling
*Per-decision Importance Sampling
Summary
Temporal-Difference Learning. TD Prediction
Advantages of TD Prediction Methods
Optimality of TD(0)
Sarsa: On-policy TD Control
Q-learning: Off-policy TD Control
Expected Sarsa
Maximization Bias and Double Learning
Games, Afterstates, and Other Special Cases
Summary
n-step Bootstrapping. n-step TD Prediction
n-step Sarsa
n-step Off-policy Learning
*Per-decision Methods with Control Variates
Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
*A Unifying Algorithm: n-step Q(σ)
Summary
Planning and Learning with Tabular Methods. Models and Planning
Dyna: Integrated Planning, Acting, and Learning
When the Model Is Wrong
Prioritized Sweeping
Expected vs. Sample Updates
Trajectory Sampling
Real-time Dynamic Programming
Planning at Decision Time
Heuristic Search
Rollout Algorithms
Monte Carlo Tree Search
Summary of the Chapter
Summary of Part I: Dimensions
Approximate Solution Methods. On-policy Prediction with Approximation. Value-function Approximation
The Prediction Objective (VE)
Stochastic-gradient and Semi-gradient Methods
Linear Methods
Feature Construction for Linear Methods
Polynomials
Fourier Basis
Coarse Coding
Tile Coding
Radial Basis Functions
Selecting Step-Size Parameters Manually
Nonlinear Function Approximation: Artificial Neural Networks
Least-Squares TD
Memory-based Function Approximation
Kernel-based Function Approximation
Looking Deeper at On-policy Learning: Interest and Emphasis
Summary
On-policy Control with Approximation. Episodic Semi-gradient Control
Semi-gradient n-step Sarsa
Average Reward: A New Problem Setting for Continuing Tasks
Deprecating the Discounted Setting
Differential Semi-gradient n-step Sarsa
Summary
*Off-policy Methods with Approximation. Semi-gradient Methods
Examples of Off-policy Divergence
The Deadly Triad
Linear Value-function Geometry
Gradient Descent in the Bellman Error
The Bellman Error is Not Learnable
Gradient-TD Methods
Emphatic-TD Methods
Reducing Variance
Summary
Eligibility Traces. The λ-return
TD(λ)
n-step Truncated λ-return Methods
Redoing Updates: Online λ-return Algorithm
True Online TD(λ)
*Dutch Traces in Monte Carlo Learning
Sarsa(λ)
Variable λ and γ
Off-policy Traces with Control Variates
Watkins's Q(λ) to Tree-Backup(λ)
Stable Off-policy Methods with Traces
Implementation Issues
Conclusions
Policy Gradient Methods. Policy Approximation and its Advantages
The Policy Gradient Theorem
REINFORCE: Monte Carlo Policy Gradient
REINFORCE with Baseline
Actor-Critic Methods
Policy Gradient for Continuing Problems
Policy Parameterization for Continuous Actions
Summary
Looking Deeper. Psychology. Prediction and Control
Classical Conditioning
Blocking and Higher-order Conditioning
The Rescorla-Wagner Model
The TD Model
TD Model Simulations
Instrumental Conditioning
Delayed Reinforcement
Cognitive Maps
Habitual and Goal-directed Behavior
Summary
Neuroscience
Neuroscience Basics
Reward Signals, Reinforcement Signals, Values, and Prediction Errors
The Reward Prediction Error Hypothesis
Dopamine
Experimental Support for the Reward Prediction Error Hypothesis
TD Error/Dopamine Correspondence
Neural Actor-Critic
Actor and Critic Learning Rules
Hedonistic Neurons
Collective Reinforcement Learning
Model-based Methods in the Brain
Addiction
Summary
Applications and Case Studies. TD-Gammon
Samuel's Checkers Player
Watson's Daily-Double Wagering
Optimizing Memory Control
Human-level Video Game Play
Mastering the Game of Go
AlphaGo
AlphaGo Zero
Personalized Web Services
Thermal Soaring
Frontiers. General Value Functions and Auxiliary Tasks
Temporal Abstraction via Options
Observations and State
Designing Reward Signals
Remaining Issues