
Reinforcement learning : an introduction

Richard S. Sutton (Author), Andrew G. Barto (Author)
"Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the field's key ideas and algorithms"--Provided by publisher
Print Book, English, 2018
Second edition
The MIT Press, Cambridge, Massachusetts, 2018
xxii, 526 pages : illustrations (some color) ; 24 cm
ISBN: 9780262039246, 0262039249
OCLC: 1043175824
Introduction. Reinforcement Learning
Examples
Elements of Reinforcement Learning
Limitations and Scope
An Extended Example: Tic-Tac-Toe
Summary
Early History of Reinforcement Learning
Tabular Solution Methods. Multi-armed Bandits. A k-armed Bandit Problem
Action-value Methods
The 10-armed Testbed
Incremental Implementation
Tracking a Nonstationary Problem
Optimistic Initial Values
Upper-Confidence-Bound Action Selection
Gradient Bandit Algorithms
Associative Search (Contextual Bandits)
Summary
Finite Markov Decision Processes
The Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Unified Notation for Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Optimal Value Functions
Optimality and Approximation
Summary
Dynamic Programming. Policy Evaluation (Prediction)
Policy Improvement
Policy Iteration
Value Iteration
Asynchronous Dynamic Programming
Generalized Policy Iteration
Efficiency of Dynamic Programming
Summary
Monte Carlo Methods. Monte Carlo Prediction
Monte Carlo Estimation of Action Values
Monte Carlo Control
Monte Carlo Control without Exploring Starts
Off-policy Prediction via Importance Sampling
Incremental Implementation
Off-policy Monte Carlo Control
*Discounting-aware Importance Sampling
*Per-decision Importance Sampling
Summary
Temporal-Difference Learning. TD Prediction
Advantages of TD Prediction Methods
Optimality of TD(0)
Sarsa: On-policy TD Control
Q-learning: Off-policy TD Control
Expected Sarsa
Maximization Bias and Double Learning
Games, Afterstates, and Other Special Cases
Summary
n-step Bootstrapping. n-step TD Prediction
n-step Sarsa
n-step Off-policy Learning
*Per-decision Methods with Control Variates
Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
*A Unifying Algorithm: n-step Q(σ)
Summary
Planning and Learning with Tabular Methods. Models and Planning
Dyna: Integrated Planning, Acting, and Learning
When the Model Is Wrong
Prioritized Sweeping
Expected vs. Sample Updates
Trajectory Sampling
Real-time Dynamic Programming
Planning at Decision Time
Heuristic Search
Rollout Algorithms
Monte Carlo Tree Search
Summary of the Chapter
Summary of Part I: Dimensions
Approximate Solution Methods. On-policy Prediction with Approximation. Value-function Approximation
The Prediction Objective (VE)
Stochastic-gradient and Semi-gradient Methods
Linear Methods
Feature Construction for Linear Methods
Polynomials
Fourier Basis
Coarse Coding
Tile Coding
Radial Basis Functions
Selecting Step-Size Parameters Manually
Nonlinear Function Approximation: Artificial Neural Networks
Least-Squares TD
Memory-based Function Approximation
Kernel-based Function Approximation
Looking Deeper at On-policy Learning: Interest and Emphasis
Summary
On-policy Control with Approximation. Episodic Semi-gradient Control
Semi-gradient n-step Sarsa
Average Reward: A New Problem Setting for Continuing Tasks
Deprecating the Discounted Setting
Differential Semi-gradient n-step Sarsa
Summary
*Off-policy Methods with Approximation. Semi-gradient Methods
Examples of Off-policy Divergence
The Deadly Triad
Linear Value-function Geometry
Gradient Descent in the Bellman Error
The Bellman Error is Not Learnable
Gradient-TD Methods
Emphatic-TD Methods
Reducing Variance
Summary
Eligibility Traces. The λ-return
TD(λ)
n-step Truncated λ-return Methods
Redoing Updates: Online λ-return Algorithm
True Online TD(λ)
*Dutch Traces in Monte Carlo Learning
Sarsa(λ)
Variable λ and γ
Off-policy Traces with Control Variates
Watkins's Q(λ) to Tree-Backup(λ)
Stable Off-policy Methods with Traces
Implementation Issues
Conclusions
Policy Gradient Methods. Policy Approximation and its Advantages
The Policy Gradient Theorem
REINFORCE: Monte Carlo Policy Gradient
REINFORCE with Baseline
Actor-Critic Methods
Policy Gradient for Continuing Problems
Policy Parameterization for Continuous Actions
Summary
Looking Deeper. Psychology. Prediction and Control
Classical Conditioning
Blocking and Higher-order Conditioning
The Rescorla-Wagner Model
The TD Model
TD Model Simulations
Instrumental Conditioning
Delayed Reinforcement
Cognitive Maps
Habitual and Goal-directed Behavior
Summary
Neuroscience
Neuroscience Basics
Reward Signals, Reinforcement Signals, Values, and Prediction Errors
The Reward Prediction Error Hypothesis
Dopamine
Experimental Support for the Reward Prediction Error Hypothesis
TD Error/Dopamine Correspondence
Neural Actor-Critic
Actor and Critic Learning Rules
Hedonistic Neurons
Collective Reinforcement Learning
Model-based Methods in the Brain
Addiction
Summary
Applications and Case Studies. TD-Gammon
Samuel's Checkers Player
Watson's Daily-Double Wagering
Optimizing Memory Control
Human-level Video Game Play
Mastering the Game of Go
AlphaGo
AlphaGo Zero
Personalized Web Services
Thermal Soaring
Frontiers. General Value Functions and Auxiliary Tasks
Temporal Abstraction via Options
Observations and State
Designing Reward Signals
Remaining Issues