Readings

Bertsekas = Bertsekas, Dimitri P. Dynamic Programming and Optimal Control. 2 vols. Belmont, MA: Athena Scientific, 2007. ISBN: 9781886529083.

Bertsekas and Tsitsiklis = Bertsekas, Dimitri P., and John N. Tsitsiklis. Neuro-Dynamic Programming. Belmont, MA: Athena Scientific, 1996. ISBN: 9781886529106.

LEC # TOPICS READINGS
1 Markov Decision Processes

Finite-Horizon Problems: Backwards Induction

Discounted-Cost Problems: Cost-to-Go Function, Bellman's Equation
Bertsekas Vol. 1, Chapter 1.
2 Value Iteration

Existence and Uniqueness of Bellman's Equation Solution

Gauss-Seidel Value Iteration
Bertsekas Vol. 2, Chapter 1.

Bertsekas and Tsitsiklis, Chapter 2.
3 Optimality of Policies Derived from the Cost-to-Go Function

Policy Iteration

Asynchronous Policy Iteration
Bertsekas Vol. 2, Chapter 1.

Bertsekas and Tsitsiklis, Chapter 2.
4 Average-Cost Problems

Relationship with Discounted-Cost Problems

Bellman's Equation

Blackwell Optimality
Bertsekas Vol. 2, Chapter 4.
5 Average-Cost Problems

Computational Methods
Bertsekas Vol. 2, Chapter 4.
6 Application of Value Iteration to Optimization of Multiclass Queueing Networks

Introduction to Simulation-based Methods

Real-Time Value Iteration
Chen, R. R., and S. P. Meyn. "Value Iteration and Optimization of Multiclass Queueing Networks." Queueing Systems 32 (1999): 65-97.

Bertsekas and Tsitsiklis, Chapter 5.
7 Q-Learning

Stochastic Approximations
Bertsekas and Tsitsiklis, Chapters 4 and 5.
8 Stochastic Approximations: Lyapunov Function Analysis

The ODE Method

Convergence of Q-Learning
Bertsekas and Tsitsiklis, Chapters 4 and 5.
9 Exploration versus Exploitation: The Complexity of Reinforcement Learning
Kearns, M., and S. Singh. "Near-Optimal Reinforcement Learning in Polynomial Time." Machine Learning 49, no. 2 (Nov 2002): 209-232.
10 Introduction to Value Function Approximation

Curse of Dimensionality

Approximation Architectures
Bertsekas and Tsitsiklis, Chapter 6.
11 Model Selection and Complexity
Hastie, Tibshirani, and Friedman. Chapter 7 in The Elements of Statistical Learning. New York: Springer, 2003. ISBN: 9780387952840.
12 Introduction to Value Function Approximation Algorithms

Performance Bounds
Bertsekas and Tsitsiklis, Chapter 6.
13 Temporal-Difference Learning with Value Function Approximation
Bertsekas and Tsitsiklis, Chapter 6.
14 Temporal-Difference Learning with Value Function Approximation (cont.)
Bertsekas and Tsitsiklis, Chapter 6.

de Farias, D. P., and B. Van Roy. "On the Existence of Fixed Points for Approximate Value Iteration and Temporal-Difference Learning."
15 Temporal-Difference Learning with Value Function Approximation (cont.)

Optimal Stopping Problems

General Control Problems
Bertsekas and Tsitsiklis, Chapter 6.

de Farias, D. P., and B. Van Roy. "On the Existence of Fixed Points for Approximate Value Iteration and Temporal-Difference Learning."

Bertsekas, Borkar, and Nedic. "Improved Temporal Difference Methods with Linear Function Approximation."
16 Approximate Linear Programming
de Farias, D. P., and B. Van Roy. "The Linear Programming Approach to Approximate Dynamic Programming."
17 Approximate Linear Programming (cont.)
de Farias, D. P., and B. Van Roy. "The Linear Programming Approach to Approximate Dynamic Programming."
18 Efficient Solutions for Approximate Linear Programming
de Farias, D. P., and B. Van Roy. "On Constraint Sampling in the Linear Programming Approach to Approximate Dynamic Programming."

Calafiore, and Campi. "Uncertain Convex Programs: Randomized Solutions and Confidence Levels."
19 Efficient Solutions for Approximate Linear Programming: Factored MDPs
Guestrin, et al. "Efficient Solution Algorithms for Factored MDPs."

Schuurmans, and Patrascu. "Direct Value Approximation for Factored MDPs."
20 Policy Search Methods
Marbach, and Tsitsiklis. "Simulation-Based Optimization of Markov Reward Processes."
21 Policy Search Methods (cont.)
Baxter, and Bartlett. "Infinite-Horizon Policy-Gradient Estimation."
22 Policy Search Methods for POMDPs

Application: Call Admission Control

Actor-Critic Methods
Baxter, and Bartlett. "Infinite-Horizon Policy-Gradient Estimation."

Baxter, and Bartlett. "Experiments with Infinite-Horizon Policy-Gradient Estimation."

Konda, and Tsitsiklis. "Actor-Critic Algorithms."
23 Guest Lecture: Prof. Nick Roy

Approximate POMDP Compression
Roy, and Gordon. "Exponential Family PCA for Belief Compression in POMDPs."
24 Policy Search Methods: PEGASUS

Application: Helicopter Control
Ng, and Jordan. "PEGASUS: A policy search method for large MDPs and POMDPs."

Ng, et al. "Autonomous Helicopter Flight via Reinforcement Learning."

Complementary Reading

Even-Dar, and Mansour. "Learning Rates for Q-Learning." Journal of Machine Learning Research 5 (2003): 1-25.

Barron. "Universal approximation bounds for superpositions of a sigmoidal function." IEEE Transactions on Information Theory 39 (1993): 930-944.

Tesauro. "Temporal-Difference Learning and TD-Gammon." Communications of the ACM 38, no. 3 (1995).