Fundamentals
Statement of the Problem
Starting Point – Multi-Armed Bandits
Details
- Exploration and Exploitation
- ε-greedy
- optimistic initial values
- Upper-Confidence Bound (UCB) – pick the action with (1) the highest estimated reward or (2) the largest uncertainty (the least-sampled arm), so that trying it reduces that uncertainty (see the sketch after this list)
- Real-World Reinforcement Learning (Regime-Shift)
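A minimal sketch of the two action-selection rules above on a stationary k-armed bandit with sample-average value estimates; the epsilon, c, and reward-distribution choices are illustrative, not values from the notes:

```python
import numpy as np

def run_bandit(true_means, steps=1000, epsilon=0.1, c=2.0, rule="ucb", seed=0):
    """Sample-average value estimates with epsilon-greedy or UCB action selection."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    Q = np.zeros(k)      # estimated value of each arm
    N = np.zeros(k)      # pull counts
    rewards = []
    for t in range(1, steps + 1):
        if rule == "epsilon-greedy":
            if rng.random() < epsilon:
                a = rng.integers(k)          # explore uniformly at random
            else:
                a = int(np.argmax(Q))        # exploit current estimates
        else:  # UCB: value estimate plus an exploration bonus for rarely-pulled arms
            bonus = c * np.sqrt(np.log(t) / np.maximum(N, 1e-12))
            a = int(np.argmax(Q + bonus))
        r = rng.normal(true_means[a], 1.0)   # noisy reward from the chosen arm
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # incremental sample-average update
        rewards.append(r)
    return Q, np.mean(rewards)

print(run_bandit([0.1, 0.5, 0.9], rule="ucb"))
```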
Markov Decision Process (MDP)
From Bandits to MDPs: long-term rewards and flexible, interactive environments. In an MDP, the agent maximizes the total future reward (the value function) instead of the immediate reward,
employing a discount factor γ ∈ [0, 1) to account for time-decaying effects: the return G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … is upper-bounded (by R_max / (1 − γ)).
- States: s ∈ S
- Actions: a ∈ A
- Value Function (definitions in the block after this list)
- state value function v_π(s)
- action value function q_π(s, a)
- transition function (joint probability) p(s', r | s, a)
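The standard definitions behind the items above, in Sutton & Barto notation (written out here for completeness, not transcribed from the notes):

```latex
\begin{align*}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
      \;\le\; \frac{R_{\max}}{1-\gamma} \quad (0 \le \gamma < 1) \\
v_\pi(s) &= \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right] \\
q_\pi(s, a) &= \mathbb{E}_\pi\!\left[ G_t \mid S_t = s, A_t = a \right] \\
p(s', r \mid s, a) &= \Pr\{ S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a \}
\end{align*}
```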
Optimal Policy
Bellman Equation
Optimal Policies and Optimal Value Function
Bellman Optimality Equation
From Optimal Value Function to Optimal Policy
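The equations these headings refer to, in standard form (standard results, not transcribed from the lectures):

```latex
\begin{align*}
\text{Bellman equation:} \quad
v_\pi(s) &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_\pi(s') \bigr] \\
\text{Bellman optimality equation:} \quad
v_*(s) &= \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_*(s') \bigr] \\
\text{Optimal policy from } v_*: \quad
\pi_*(s) &\in \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_*(s') \bigr]
\end{align*}
```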
| Monte Carlo | Dynamic Programming | Temporal Difference |
| --- | --- | --- |
| sampling and averaging the returns of episodes | bootstrapping over all states | combining the previous two |
Dynamic Programming
requires a model of the environment (the dynamics p(s', r | s, a))
- Policy Evaluation and Control
- Evaluation
In theory:
(π, p) -> [Linear Solver] -> v_π
In practice:
(π, p) -> [Dynamic Programming] -> v_π
- Control
(p) -> [Dynamic Programming] -> π_*
- Iterative Policy Evaluation (see the sketch after this list)
Greedy: π'(s) ∈ argmax_a q_π(s, a)
- Policy Iteration
- Policy Improvement Theorem
- Greedily Improve the Policy
- Generalized Policy Iteration (GPI) (a less “greedy” version of Policy Iteration)
- Synchronous State Update -> Asynchronous State Update
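A minimal sketch of both evaluation routes and of policy iteration on a small tabular MDP; the array layout P[s, a, s'], R[s, a] and the toy numbers are illustrative assumptions, not course material:

```python
import numpy as np

def exact_policy_evaluation(P, R, pi, gamma=0.9):
    """'In theory' route: solve the linear system v = r_pi + gamma * P_pi v directly."""
    nS = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)     # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", pi, R)       # expected one-step reward under pi
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

def iterative_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """'In practice' route: sweep Bellman expectation backups until the change is tiny."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * np.einsum("sat,t->sa", P, v)   # q_pi(s, a) under the current v
        v_new = np.einsum("sa,sa->s", pi, q)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    nS, nA = R.shape
    pi = np.full((nS, nA), 1.0 / nA)          # start from the uniform random policy
    while True:
        v = iterative_policy_evaluation(P, R, pi, gamma)
        q = R + gamma * np.einsum("sat,t->sa", P, v)
        new_pi = np.eye(nA)[np.argmax(q, axis=1)]      # deterministic greedy policy
        if np.array_equal(new_pi, pi):
            return np.argmax(q, axis=1), v
        pi = new_pi

# Toy 2-state, 2-action MDP (made up for illustration): P[s, a, s'] and R[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_iteration(P, R))
```

Generalized Policy Iteration relaxes this scheme by interleaving partial evaluation sweeps with improvement steps instead of evaluating each policy to convergence.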
Temporal Difference (TD) Learning
The TD method combines Monte Carlo's sampling with Dynamic Programming's bootstrapping (see the TD(0) sketch below).
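A minimal TD(0) prediction sketch: transitions are sampled as in Monte Carlo, but the update target bootstraps from the current value estimate as in Dynamic Programming. The env_step/policy interfaces and the random-walk example are assumptions for illustration:

```python
import numpy as np

def td0_prediction(env_step, policy, n_states, episodes=500, alpha=0.1, gamma=1.0, seed=0):
    """Estimate v_pi with one-step TD updates: V(s) += alpha * (r + gamma*V(s') - V(s))."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = n_states // 2                      # start each episode in the middle state
        done = False
        while not done:
            a = policy(s, rng)
            s_next, r, done = env_step(s, a, rng)
            target = r if done else r + gamma * V[s_next]   # bootstrapped TD target
            V[s] += alpha * (target - V[s])    # move V(s) toward the sampled target
            s = s_next
    return V

# Illustrative 5-state random walk: actions are ignored, +1 reward at the right edge.
def random_walk_step(s, a, rng):
    s_next = s + (1 if rng.random() < 0.5 else -1)
    if s_next < 0:
        return s, 0.0, True                    # fell off the left edge, reward 0
    if s_next > 4:
        return s, 1.0, True                    # fell off the right edge, reward 1
    return s_next, 0.0, False

print(td0_prediction(random_walk_step, lambda s, rng: 0, n_states=5))
```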
Partially Observable MDP (POMDP)
Advanced Topics from DLRLSS2019
- POMDP (Pascal Poupart)
- Off-Policy RL (Doina Precup)
- Model-Based RL (Martha White)
- Robust RL (Marek Petrik)
- Solver: Linear Programming (duality): transform the min-max problem into a single optimization problem using a Linear Programming (dual) reformulation.
- Robust MDP
- Bayesian Approach
- Ref: Robust Optimization (Ben-Tal)
- Policy Search in Robotics (Jan Peters)
- Model-Free: using data samples {(s, a, r)} to directly update the policy
- Policy gradient (see the REINFORCE sketch after this list)
- Natural gradient
- Expectation Maximization
- Information Geometry Constraints
- Model-Based: using data samples {(s, a, r)} to build a model of the environment, and choosing actions by “planning”.
- Greedy Updates: PILCO *
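A minimal REINFORCE-style sketch for the model-free policy-gradient item above, assuming a softmax policy with tabular parameters; the env_step(s, a, rng) -> (s_next, reward, done) interface, the toy task, and the hyperparameters are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce(env_step, n_states, n_actions, episodes=200, alpha=0.05, gamma=0.99, seed=0):
    """Monte Carlo policy gradient: theta += alpha * G_t * grad log pi(a_t | s_t)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))        # softmax policy parameters
    for _ in range(episodes):
        # Roll out one full episode with the current policy.
        s, done, traj = 0, False, []
        while not done:
            probs = softmax(theta[s])
            a = rng.choice(n_actions, p=probs)
            s_next, r, done = env_step(s, a, rng)
            traj.append((s, a, r))
            s = s_next
        # Walk the episode backwards, accumulate returns, and apply the update
        # (the gamma^t factor on the gradient is omitted, as is common in practice).
        G = 0.0
        for s, a, r in reversed(traj):
            G = r + gamma * G
            grad_log = -softmax(theta[s])
            grad_log[a] += 1.0                     # gradient of log softmax w.r.t. theta[s]
            theta[s] += alpha * G * grad_log
    return theta

# Tiny illustrative task: two actions, action 1 pays +1, episode ends after one step.
def one_step_env(s, a, rng):
    return 0, float(a == 1), True

theta = reinforce(one_step_env, n_states=1, n_actions=2)
print(softmax(theta[0]))   # probability of action 1 should approach 1
```

The natural-gradient variant listed above preconditions this vanilla gradient with the inverse Fisher information matrix of the policy rather than changing the sampled quantity.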
- Deep RL (Matteo Hessel)
- learn policy directly
- learn a model ((s, a) -> (s', r)) -> infer the policy by planning
Deep RL:
- Gradient Descent
- Optimization
- DQN (paper) (see the target-computation sketch below)
- parallel experience streams
- Async RL (Mnih et al., 2016)
- Generalization
- Adaptive Reward Normalization
- RL-aware DL (inductive bias to suit the reward)
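A minimal sketch of the core DQN idea referenced above: one-step Q-learning targets computed with a separate, periodically-synced target network. Numpy arrays stand in for the networks' outputs, so the function names, the replay-buffer handling, and the sample values are illustrative, not the DQN paper's code:

```python
import numpy as np

def dqn_targets(rewards, dones, q_next_target, gamma=0.99):
    """One-step targets: y = r + gamma * max_a Q_target(s', a), truncated at terminals."""
    return rewards + gamma * (1.0 - dones) * q_next_target.max(axis=1)

def td_errors(q_pred, actions, targets):
    """TD errors for the actions actually taken; the loss is usually their (Huber) mean."""
    chosen = q_pred[np.arange(len(actions)), actions]
    return targets - chosen

# Illustrative mini-batch of 3 transitions with 2 actions.
rng = np.random.default_rng(0)
q_pred = rng.normal(size=(3, 2))          # Q(s, a) from the online network
q_next_target = rng.normal(size=(3, 2))   # Q(s', a) from the target network
rewards = np.array([1.0, 0.0, -1.0])
dones = np.array([0.0, 0.0, 1.0])         # last transition is terminal
actions = np.array([0, 1, 0])

y = dqn_targets(rewards, dones, q_next_target)
print(td_errors(q_pred, actions, y))
```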
Generalization (Anna Harutyunyan)
- Within Tasks
- Auxiliary Tasks
- Distributional RL
- Across Tasks
- Successor Features
- Universal Value Function Approximators (UVFA) and Generalized Policy Improvement (GPI)
- Universal Successor Features Approximators (USFA, ICLR 2019): combines the three ideas above (see the GPI block after this list)
- Hierarchical RL (A. Barreto)
Time scale
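The generalized policy improvement step referenced above, written out in the standard form used in the successor-features literature (not transcribed from the lecture): given previously learned policies π_1, …, π_n with action-values q^{π_i}, acting greedily over all of them performs at least as well as each one; with successor features ψ and a task vector w, the action-values for a new task decompose linearly.

```latex
\begin{align*}
\text{GPI:}\quad \pi(s) &\in \arg\max_{a}\; \max_{i}\; q^{\pi_i}(s, a) \\
\text{Successor features:}\quad q^{\pi_i}_{w}(s, a) &= \psi^{\pi_i}(s, a)^{\top} w,
\qquad r_w(s, a, s') = \phi(s, a, s')^{\top} w
\end{align*}
```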
- Multi-Agent RL
- Frontier (R. Sutton)
Playground: Robotics
DLRL: Robotics (R. Mahmood)
- Learning from scratch in real-time (hard! not mature yet!)
Courses
- Fundamentals (UA)
Learning Objectives
- Structuring the problem
- MDP
- Policy Evaluation and Policy Improvement
- Dynamic Programming
- Solid Course (D. Silver)
Learning Objectives *
- Exercises (\url{https://github.com/dennybritz/reinforcement-learning})
- Case Study: DQN (AlphaGo)
- Case Study: Robotics (UR5)
- Basic Workflow:
- Define the problem:
- State space: joint angles and velocities, vector between the target and the robot fingertip
- Action space: joint torques
- Reward:
- Control mode
- Env: Gym (see the interaction-loop sketch below)
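A minimal sketch of the basic workflow above as a Gym interaction loop (classic pre-0.26 Gym API); the environment id and the random policy are placeholders for the UR5 reacher setup described in the notes:

```python
import gym

# Placeholder environment id; the UR5 env from the case study is assumed to expose
# observations = joint angles/velocities + target-to-fingertip vector, actions = joint torques.
env = gym.make("Pendulum-v0")  # stand-in continuous-control task

for episode in range(3):
    obs = env.reset()                                 # classic (pre-0.26) Gym API
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()            # placeholder for a learned policy
        obs, reward, done, info = env.step(action)    # observe next state and reward
        total_reward += reward
    print(f"episode {episode}: return = {total_reward:.2f}")
env.close()
```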