1. Fundamentals
    1. Statement of the Problem
    2. Starting Point – Multi-Armed Bandits
    3. Details
    4. Markov Decision Process (MDP)
      1. Optimal Policy
      2. Dynamic Programming
      3. Temporal Difference (TD) Learning
      4. Partially Observable MDP (POMDP)
  2. Advanced Topics from DLRLSS2019
    1. POMDP (Pascal)
    2. Off-Policy RL (Doina Precup)
    3. Model-Based RL (Martha White)
    4. Robust RL (Marek Petrik)
    5. Policy Search in Robotics (Jan Peters)
    6. Deep RL (Matteo Hessel)
      1. Deep RL:
        1. Generalization (Anna Harutyunyan)
  3. Playground: Robotics
    1. DLRL: Robotics (R. Mahmood)
  4. Course learning
  5. References

Fundamentals

Statement of the Problem

Starting Point – Multi-Armed Bandits

Details

  • Exploration and Exploitation (see the sketch after this list)
    • ε-greedy
    • optimistic initial values
    • Upper-Confidence Bound (UCB) – pick the arm with the highest upper confidence bound, i.e. either its estimated reward is high (exploit) or it is still very uncertain, so pulling it reduces uncertainty (explore)
  • Real-World Reinforcement Learning (regime shift)
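A minimal sketch of these exploration strategies on a toy Gaussian bandit; the arm means, ε = 0.1, and the UCB constant c = 2.0 are illustrative assumptions, not values from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.8])        # hypothetical arm means
k, steps = len(true_means), 1000

def run(select_action):
    Q = np.zeros(k)                            # estimated value of each arm
    N = np.zeros(k)                            # number of pulls per arm
    for t in range(1, steps + 1):
        a = select_action(Q, N, t)
        r = rng.normal(true_means[a], 1.0)     # noisy reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]              # incremental sample average
    return Q

# ε-greedy: explore uniformly with probability ε, otherwise exploit.
def eps_greedy(Q, N, t, eps=0.1):
    return rng.integers(k) if rng.random() < eps else int(np.argmax(Q))

# UCB: prefer arms whose estimate is high OR still very uncertain (few pulls).
def ucb(Q, N, t, c=2.0):
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / (N + 1e-8))))

print("ε-greedy estimates:", run(eps_greedy))
print("UCB estimates:     ", run(ucb))
```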

Markov Decision Process (MDP)

From Bandits to MDPs: long-term rewards and flexible, interactive environments. In an MDP, the agent maximizes total future reward (the value function) instead of immediate reward.

Employing a discount factor γ ∈ [0, 1) to take time-decaying effects into account: the return G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … = Σ_{k=0}^∞ γ^k R_{t+k+1} is upper-bounded (by R_max / (1 − γ) when rewards are bounded by R_max).

  • States: s ∈ S
  • Actions: a ∈ A
  • Value Function
    • state value function v_π(s)
    • action value function q_π(s, a)
  • Transition function: joint probability p(s′, r | s, a) (see the toy example after this list)
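A toy illustration of these objects as plain Python data; the two-state MDP, its rewards, and γ = 0.9 below are made up for illustration:

```python
# A tiny MDP: the transition model p(s', r | s, a) is written as
# {(s, a): [(prob, s_next, reward), ...]}.
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]
P = {
    ("s0", "stay"): [(1.0, "s0", 0.0)],
    ("s0", "move"): [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],  # moving can fail
    ("s1", "stay"): [(1.0, "s1", 2.0)],
    ("s1", "move"): [(1.0, "s0", 0.0)],
}

def q_from_v(v, s, a):
    """Action value q(s, a) = sum over outcomes of p * (r + GAMMA * v(s'))."""
    return sum(p * (r + GAMMA * v[s_next]) for p, s_next, r in P[(s, a)])
```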

Optimal Policy

Bellman Equation
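For reference, the standard Bellman (expectation) equation for the state-value function of a policy π:

```latex
v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_\pi(s')\bigr]
```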

Optimal Policies and Optimal Value Function

Bellman Optimality Equation
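And the corresponding Bellman optimality equations, stated here in their standard form:

```latex
v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]
\qquad
q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr]
```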

From Optimal Value Function to Optimal Policy
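Given v_* (or q_*), an optimal policy follows by acting greedily with respect to it:

```latex
\pi_*(s) \in \operatorname*{arg\,max}_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]
          = \operatorname*{arg\,max}_a q_*(s, a)
```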

  • Monte Carlo: sampling and averaging the returns of episodes
  • Dynamic Programming: bootstrapping over all states
  • Temporal Difference: combining the previous two

Dynamic Programming

Requires full knowledge of the environment, i.e. the transition model p(s′, r | s, a).

  • Policy Evaluation and Control
    • Evaluation

    In theory:

    (π, p) -> [Linear Solver] -> v_π

    In practice:

    (π, p) -> [Dynamic Programming] -> v_π

    • Control

    (p, r) -> [Dynamic Programming] -> π*

  • Iterative Policy Evaluation (see the sketch below)

Greedy policy improvement: π′(s) = argmax_a Σ_{s′, r} p(s′, r | s, a) [r + γ v_π(s′)]

  • Policy Iteration
    • Policy Improvement Theorem
    • Greedily Improve the Policy
  • Generalized Policy Iteration (GPI) – a less “greedy” version of Policy Iteration
    • Synchronous State Update -> Asynchronous State Update
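A minimal policy-iteration sketch over the toy MDP defined earlier; it reuses the hypothetical STATES, ACTIONS, GAMMA, P, and q_from_v from that snippet, and the stopping threshold θ is an arbitrary choice:

```python
def policy_evaluation(policy, theta=1e-8):
    """Iteratively back up v(s) for a fixed deterministic policy until convergence."""
    v = dict.fromkeys(STATES, 0.0)
    while True:
        delta = 0.0
        for s in STATES:
            new_v = q_from_v(v, s, policy[s])            # expected one-step lookahead
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def policy_iteration():
    """Alternate evaluation and greedy improvement until the policy is stable (GPI)."""
    policy = {s: ACTIONS[0] for s in STATES}
    while True:
        v = policy_evaluation(policy)
        stable = True
        for s in STATES:
            best_a = max(ACTIONS, key=lambda a: q_from_v(v, s, a))   # greedy step
            if best_a != policy[s]:
                policy[s], stable = best_a, False
        if stable:
            return policy, v

print(policy_iteration())
```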

Time Difference (TD) Learning

The TD method combines Monte Carlo (learning from sampled experience, without a model) and Dynamic Programming (bootstrapping from current value estimates); a minimal TD(0) sketch follows.
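A minimal tabular TD(0) prediction sketch; the env.reset()/env.step() interface (Gym-style, returning (next_state, reward, done)), the step size α = 0.1, and the episode count are illustrative assumptions:

```python
import collections

def td0_prediction(env, policy, gamma=0.9, alpha=0.1, episodes=500):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = collections.defaultdict(float)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)                 # assumed environment interface
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])               # move V(s) toward the TD target
            s = s_next
    return V
```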

Partially Observable MDP (POMDP)

Advanced Topics from DLRLSS2019

  1. POMDP (Pascal)

  2. Off-Policy RL (Doina Precup)

  3. Model-Based RL (Martha White)

  4. Robust RL (Marek Petrik)

  • Solver: Linear Programming (duality): transform the min-max problem into a single optimization problem using a Linear Program reformulation.
  • Robust MDP (see the sketch after this list)
  • Bayesian Approach
  • Ref: Robust Optimization (Ben-Tal et al.)
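A hedged sketch of a robust Bellman backup in which the worst case is taken over a small finite set of candidate transition models; the models, rewards, and set size are made up, and real robust-MDP solvers would use proper ambiguity sets together with the LP/duality reformulation noted above:

```python
import numpy as np

GAMMA = 0.9
n_states, n_actions = 3, 2
rng = np.random.default_rng(1)

# Finite ambiguity set: a few candidate transition tensors P[s, a, s'].
candidates = [rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
              for _ in range(4)]
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))     # reward r(s, a)

def robust_value_iteration(iters=200):
    """v(s) = max_a min_{P in ambiguity set} [ r(s, a) + GAMMA * P(.|s, a) @ v ]."""
    v = np.zeros(n_states)
    for _ in range(iters):
        q_worst = np.min([R + GAMMA * P @ v for P in candidates], axis=0)  # worst model per (s, a)
        v = q_worst.max(axis=1)                                            # best action per state
    return v

print(robust_value_iteration())
```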
  5. Policy Search in Robotics (Jan Peters)

  • Model-Free: using data samples {(s, a, r)} to directly update the policy (see the sketch after this list)
    • Policy gradient
    • Natural gradient
    • Expectation Maximization
    • Information Geometry Constraints
  • Model-Based: using data samples {(s, a, s′)} to build a model of the environment, and choosing actions by “planning”.
    • Greedy Updates: PILCO *
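As a concrete instance of model-free policy search, a minimal REINFORCE-style policy-gradient update; the softmax policy over discrete actions, the per-action feature function phi, and the learning rate are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, phi, gamma=0.99, lr=0.01):
    """theta <- theta + lr * G_t * grad log pi(a_t | s_t), for each step of one episode.

    episode: list of (s, a, r) samples; phi(s) returns per-action features, shape (n_actions, d).
    """
    # Discounted return-to-go G_t for every time step.
    G, returns = 0.0, []
    for _, _, r in reversed(episode):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # Policy-gradient update for a linear-softmax policy.
    for (s, a, _), G_t in zip(episode, returns):
        feats = phi(s)
        probs = softmax(feats @ theta)
        grad_logp = feats[a] - probs @ feats      # grad of log softmax policy w.r.t. theta
        theta = theta + lr * G_t * grad_logp
    return theta
```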
  6. Deep RL (Matteo Hessel)

  • learn the policy directly
  • learn a model ((s, a) -> (s′, r)) -> infer the policy by planning

Deep RL:

an OpenAI Introduction

  • Gradient Descent
  • Optimization
  • DQN (paper) – see the sketch after this list
  • parallel experience streams
    • Async RL (Mnih et al., 2016)
  • Generalization
    • Adaptive Reward Normalization
  • RL-aware DL (inductive bias to suit the reward)
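A hedged sketch of the two core DQN ingredients, experience replay and a target network, using linear function approximation instead of a deep net to stay short; the feature function phi, buffer size, and hyperparameters are illustrative assumptions rather than the paper's settings:

```python
import random
from collections import deque
import numpy as np

class TinyDQNish:
    """Q(s, a) = w[a] @ phi(s), trained from a replay buffer against periodically synced target weights."""

    def __init__(self, phi, n_actions, dim, gamma=0.99, lr=0.01, sync_every=100):
        self.phi, self.gamma, self.lr, self.sync_every = phi, gamma, lr, sync_every
        self.w = np.zeros((n_actions, dim))        # online weights
        self.w_target = self.w.copy()              # target-network weights
        self.buffer = deque(maxlen=10_000)         # experience replay buffer
        self.steps = 0

    def act(self, s, eps=0.1):
        if random.random() < eps:                  # ε-greedy behaviour policy
            return random.randrange(len(self.w))
        return int(np.argmax(self.w @ self.phi(s)))

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def train_step(self, batch_size=32):
        if len(self.buffer) < batch_size:
            return
        for s, a, r, s_next, done in random.sample(list(self.buffer), batch_size):
            target = r if done else r + self.gamma * np.max(self.w_target @ self.phi(s_next))
            td_error = target - self.w[a] @ self.phi(s)
            self.w[a] += self.lr * td_error * self.phi(s)   # semi-gradient Q-learning update
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.w_target = self.w.copy()                   # sync the target network
```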
Generalization (Anna Harutyunyan)

  • Within Tasks
    • Auxiliary Tasks
    • Distributional RL
  • Across Tasks (see the sketch after this list)
    • Successor Features
    • Universal Value Function Approximation (UVFA) and Generalized Policy Improvement (GPI)
    • USFA (ICLR 2019) – combines Successor Features, UVFA, and GPI
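A hedged sketch of how successor features and GPI combine: given successor features ψ_i(s, a) for a few previously learned policies and a new task's reward weights w, act greedily with respect to the best of them (the arrays below are random placeholders):

```python
import numpy as np

def gpi_action(psi_stack, w, s):
    """psi_stack: successor features, shape (n_policies, n_states, n_actions, d).

    For a new task with reward r(s, a) = phi(s, a) @ w, the GPI policy takes
    argmax_a max_i  psi_i(s, a) @ w.
    """
    q = psi_stack[:, s] @ w                 # shape (n_policies, n_actions)
    return int(np.argmax(q.max(axis=0)))    # best action under the best old policy

# Toy usage with placeholder successor features and task weights.
rng = np.random.default_rng(0)
psi_stack = rng.normal(size=(3, 5, 2, 4))   # 3 old policies, 5 states, 2 actions, d = 4
w_new = rng.normal(size=4)                  # new task's reward weights
print(gpi_action(psi_stack, w_new, s=0))
```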
  7. Hierarchical RL (A. Barreto)

Time scale

  8. Multi-Agent RL

  9. Frontier (R. Sutton)

Playground: Robotics

DLRL: Robotics (R. Mahmood)

  • Learning from scratch in real-time (hard! not mature yet!)

Course learning

  • Fundamentals (UA)

    Learning Objectives

    • Structuring the problem
    • MDP
    • Policy Evaluation and Policy Improvement
    • Dynamic Programming
  • Solid Course (D. Silver)

    Learning Objectives

  • Exercises (https://github.com/dennybritz/reinforcement-learning)

  • Case study: DQN (AlphaGo)

  • Case study: Robotics (UR5)

    • Basic Workflow (a sketch follows this list):
      • Define the problem:
        • State space: joint angles and velocities, vector between target and robot fingertip
        • Action space: joint torques
        • Reward:
      • Control mode
      • Env: Gym
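A hedged skeleton of how such a task could be wrapped as a Gym environment; the observation/action dimensions, placeholder dynamics, reward shaping, and the UR5ReacherSketch name are illustrative assumptions, not the actual setup used in the lecture:

```python
import numpy as np
import gym
from gym import spaces

class UR5ReacherSketch(gym.Env):
    """Toy stand-in for a UR5 reaching task: observe joint state plus the
    fingertip-to-target vector, command joint torques, reward closeness to the target."""

    def __init__(self, n_joints=6):
        self.n_joints = n_joints
        obs_dim = 2 * n_joints + 3                      # angles, velocities, target offset
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(obs_dim,), dtype=np.float32)
        self.action_space = spaces.Box(-1.0, 1.0, shape=(n_joints,), dtype=np.float32)
        self._target_offset = np.zeros(3, dtype=np.float32)

    def reset(self):
        self._target_offset = np.random.uniform(-0.5, 0.5, size=3).astype(np.float32)
        return self._observe()

    def step(self, action):
        # A real implementation would send torque commands to the robot or simulator here;
        # as a placeholder, the fingertip-to-target offset simply shrinks each step.
        self._target_offset *= 0.95
        dist = float(np.linalg.norm(self._target_offset))
        reward = -dist                                   # closer to the target is better
        done = dist < 0.05
        return self._observe(), reward, done, {}

    def _observe(self):
        joint_state = np.zeros(2 * self.n_joints, dtype=np.float32)   # placeholder joint readings
        return np.concatenate([joint_state, self._target_offset])
```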

References