Reinforcement Learning and Stochastic Optimization

Warren B. Powell, Professor Emeritus, Princeton University

Citation: Warren B. Powell, Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions, John Wiley and Sons, Hoboken, 2022 (1,100 pages).

Cover of Warren B. Powell's book 'Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions,' John Wiley and Sons, 2022 Reinforcement Learning and Stochastic Optimization: A unified framework for sequential decisions is the first textbook to offer a comprehensive, unified framework of the rich field of sequential decisions under uncertainty. Up to now, this rich problem class has been fragmented into at least 15 distinct fields that have been studied under names such as dynamic programming, stochastic programming, optimal control, simulation optimization, optimal learning, and multiarmed bandit problems. Recently these have been primarily associated with “reinforcement learning” and “stochastic optimization.”

This page is organized as follows:


Purchasing

The book is available from Amazon (at a continually varying price) or Wiley. Both have their own e-book formats.

Audience

The book requires little more than a good course in probability and statistics / machine learning (and supporting linear algebra). There are occasional forays that draw on linear programming. The presentation is designed for people who want to plan sequential decision problems, with an emphasis on modeling and computation. Applications are drawn from numerous topics in engineering (electrical, civil, mechanical, chemical), physical sciences, computer science, social sciences, economics and finance, operations research and industrial engineering, business applications, and statistics / machine learning.

Tutorial

A four-part video tutorial (recorded March 13, 2022, based on my presentation at the Informs Optimization Society conference):

Major book themes and features

  1. Central to the book is that we are addressing sequential decision problems: decision, information, decision, information, … which combines making decisions over time (or iterations / experiments) under uncertainty. Every problem addressed in the dramatically growing literature on reinforcement learning can be described in this way.
  2. We use a universal modeling framework with five parts (state variables, decision variables, exogenous information, transition function, and objective function) that applies to any sequential decision problem (the framework draws heavily from the optimal control literature). Central to the framework is optimizing over policies.
  3. We introduce four (meta)classes of policies (PFAs, CFAs, VFAs, and DLAs) that we claim are universal — they include any method proposed in the literature, or used in practice. The policies are distinguished by their computational characteristics. Each of the four classes can produce an optimal policy for special problem classes, but this is rare.
  4. The book is written for people who want to write software for real-world problems, using methods that scale not just in terms of size, but also complexity. Mathematical notation is designed to be translated directly into software (there is a one-to-one relationship between software and our mathematical modeling framework).
  5. To help first-time readers, many sections are marked with * indicating they can be skipped the first time through the book. Sections marked with ** indicate more advanced mathematics (we do not use measure-theoretic terminology, but we have a section that introduces the interested reader who would like to learn it).
  6. There are over 370 exercises, organized into seven categories at the end of each chapter. These consist of review questions, modeling, computation, theory, problem solving, problems drawn from the companion volume Sequential Decision Analytics and Modeling (with supporting Python modules), and finally a “diary” problem that the reader chooses in chapter 1, and then uses to answer a question at the end of each chapter. This allows readers to work on an application familiar to them.
  7. There is a supplementary materials webpage here.

Chapter summaries

Below I briefly summarize each chapter. Individual chapters can be downloaded by clicking on the link for that chapter. These are pre-publication versions of chapter 1 (Introduction), chapter 9 (on modeling), and chapter 11 (which summarizes the four classes of policies). The page numbers do not agree with the published version.

Table of contents

A scan of the 'Applications' section of the RLSO book index, showing a long alphabetical list of applications referenced in the book Index — check out the entry “Applications.” Shown to the right is a peek at the list of applications listed in the Index. These range from sketches of models to mentions of contexts where a method might apply.


The roots of the book

This book used my 2011 book, Approximate Dynamic Programming: Solving the Curses of Dimensionality, as a starting point. Chapter 3 on online learning evolved from chapter 8 in ADP on approximating value functions. The modeling chapter 5 from ADP is now chapter 9 (with major modifications) in RLSO. Chapter 14 in RLSO is based on the old chapter 3 on Markov decision processes, but now includes a section on optimal control and examples of dynamic programs that can be solved exactly. Chapter 15 is an entirely new chapter on backward approximate dynamic programming. Chapters 16–18 are based directly on the chapters in the ADP book for approximating value functions (they are now labeled as “Forward approximate dynamic programming”). Everything else is completely new.

Warren Powell — wbpowell328@gmail.com