Fundamentals
Statement of the Problem
Starting Point – Multi-Armed Bandits
Details
- Exploration and Exploitation
- ε-greedy
- optimistic initial values
- Upper-Confidence Bound (UCB) – pick the action with (1) the highest estimated reward or (2) the largest uncertainty (the least-sampled arm), so that trying it reduces that uncertainty (see the sketch after this list)
- Real-World Reinforcement Learning (Regime-Shift)
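A minimal sketch of the two action-selection rules above on a stationary k-armed bandit with sample-average value estimates; the epsilon, c, and reward-distribution choices are illustrative, not values from the notes:

```python
import numpy as np

def run_bandit(true_means, steps=1000, epsilon=0.1, c=2.0, rule="ucb", seed=0):
    """Sample-average value estimates with epsilon-greedy or UCB action selection."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    Q = np.zeros(k)      # estimated value of each arm
    N = np.zeros(k)      # pull counts
    rewards = []
    for t in range(1, steps + 1):
        if rule == "epsilon-greedy":
            if rng.random() < epsilon:
                a = rng.integers(k)          # explore uniformly at random
            else:
                a = int(np.argmax(Q))        # exploit current estimates
        else:  # UCB: value estimate plus an exploration bonus for rarely-pulled arms
            bonus = c * np.sqrt(np.log(t) / np.maximum(N, 1e-12))
            a = int(np.argmax(Q + bonus))
        r = rng.normal(true_means[a], 1.0)   # noisy reward from the chosen arm
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]            # incremental sample-average update
        rewards.append(r)
    return Q, np.mean(rewards)

print(run_bandit([0.1, 0.5, 0.9], rule="ucb"))
```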
Markov Decision Process (MDP)
From Bandits to MDPs: long-term rewards and flexible, interactive environments. In an MDP, the agent maximizes the total future reward (the value function) instead of the immediate reward,
employing a discount factor γ ∈ [0, 1) to account for time-decaying effects: the return G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … is upper-bounded (by R_max / (1 − γ)).
- States: s ∈ S
- Actions: a ∈ A
- Value Function (definitions in the block after this list)
- state value function v_π(s)
- action value function q_π(s, a)
- transition function (joint probability) p(s', r | s, a)
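The standard definitions behind the items above, in Sutton & Barto notation (written out here for completeness, not transcribed from the notes):

```latex
\begin{align*}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
      \;\le\; \frac{R_{\max}}{1-\gamma} \quad (0 \le \gamma < 1) \\
v_\pi(s) &= \mathbb{E}_\pi\!\left[ G_t \mid S_t = s \right] \\
q_\pi(s, a) &= \mathbb{E}_\pi\!\left[ G_t \mid S_t = s, A_t = a \right] \\
p(s', r \mid s, a) &= \Pr\{ S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a \}
\end{align*}
```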
Optimal Policy
Bellman Equation
Optimal Policies and Optimal Value Function
Bellman Optimality Equation
From Optimal Value Function to Optimal Policy
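The equations these headings refer to, in standard form (standard results, not transcribed from the lectures):

```latex
\begin{align*}
\text{Bellman equation:} \quad
v_\pi(s) &= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_\pi(s') \bigr] \\
\text{Bellman optimality equation:} \quad
v_*(s) &= \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_*(s') \bigr] \\
\text{Optimal policy from } v_*: \quad
\pi_*(s) &\in \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_*(s') \bigr]
\end{align*}
```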
| Monte Carlo | Dynamic Programming | Temporal Difference |
| --- | --- | --- |
| sampling and averaging the returns of episodes | bootstrapping over all states | combining the previous two |
Dynamic Programming
requires a model of the environment (the dynamics p(s', r | s, a))
- Policy Evaluation and Control
- Evaluation
In theory:
(π, p) -> [Linear Solver] -> v_π
In practice:
(π, p) -> [Dynamic Programming] -> v_π
- Control
(p) -> [Dynamic Programming] -> π_*
- Iterative Policy Evaluation (see the sketch after this list)
Greedy: π'(s) ∈ argmax_a q_π(s, a)
- Policy Iteration
- Policy Improvement Theorem
- Greedily Improve the Policy
- Generalized Policy Iteration (GPI) (a less “greedy” version of Policy Iteration)
- Synchronous State Update -> Asynchronous State Update
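A minimal sketch of both evaluation routes and of policy iteration on a small tabular MDP; the array layout P[s, a, s'], R[s, a] and the toy numbers are illustrative assumptions, not course material:

```python
import numpy as np

def exact_policy_evaluation(P, R, pi, gamma=0.9):
    """'In theory' route: solve the linear system v = r_pi + gamma * P_pi v directly."""
    nS = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)     # state-to-state transitions under pi
    r_pi = np.einsum("sa,sa->s", pi, R)       # expected one-step reward under pi
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

def iterative_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """'In practice' route: sweep Bellman expectation backups until the change is tiny."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * np.einsum("sat,t->sa", P, v)   # q_pi(s, a) under the current v
        v_new = np.einsum("sa,sa->s", pi, q)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new

def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    nS, nA = R.shape
    pi = np.full((nS, nA), 1.0 / nA)          # start from the uniform random policy
    while True:
        v = iterative_policy_evaluation(P, R, pi, gamma)
        q = R + gamma * np.einsum("sat,t->sa", P, v)
        new_pi = np.eye(nA)[np.argmax(q, axis=1)]      # deterministic greedy policy
        if np.array_equal(new_pi, pi):
            return np.argmax(q, axis=1), v
        pi = new_pi

# Toy 2-state, 2-action MDP (made up for illustration): P[s, a, s'] and R[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_iteration(P, R))
```

Generalized Policy Iteration relaxes this scheme by interleaving partial evaluation sweeps with improvement steps instead of evaluating each policy to convergence.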
Temporal Difference (TD) Learning
The TD method combines Monte Carlo's sampling with Dynamic Programming's bootstrapping (see the TD(0) sketch below).
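A minimal TD(0) prediction sketch: transitions are sampled as in Monte Carlo, but the update target bootstraps from the current value estimate as in Dynamic Programming. The env_step/policy interfaces and the random-walk example are assumptions for illustration:

```python
import numpy as np

def td0_prediction(env_step, policy, n_states, episodes=500, alpha=0.1, gamma=1.0, seed=0):
    """Estimate v_pi with one-step TD updates: V(s) += alpha * (r + gamma*V(s') - V(s))."""
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)
    for _ in range(episodes):
        s = n_states // 2                      # start each episode in the middle state
        done = False
        while not done:
            a = policy(s, rng)
            s_next, r, done = env_step(s, a, rng)
            target = r if done else r + gamma * V[s_next]   # bootstrapped TD target
            V[s] += alpha * (target - V[s])    # move V(s) toward the sampled target
            s = s_next
    return V

# Illustrative 5-state random walk: actions are ignored, +1 reward at the right edge.
def random_walk_step(s, a, rng):
    s_next = s + (1 if rng.random() < 0.5 else -1)
    if s_next < 0:
        return s, 0.0, True                    # fell off the left edge, reward 0
    if s_next > 4:
        return s, 1.0, True                    # fell off the right edge, reward 1
    return s_next, 0.0, False

print(td0_prediction(random_walk_step, lambda s, rng: 0, n_states=5))
```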
Partially Observable MDP (POMDP)
Advanced Topics from DLRLSS2019
- POMDP (Pascal Poupart)
- Off-Policy RL (Doina Precup)
- Model-Based RL (Martha White)
- Robust RL (Marek Petrik)
- Solver: Linear Programming (duality): transform the min-max problem into a single optimization problem using a Linear Programming (dual) reformulation.
- Robust MDP
- Bayesian Approach
- Ref: Robust Optimization (Ben-Tal)
- Policy Search in Robotics (Jan Peters)
- Model-Free: using data samples {(s, a, r)} to directly update the policy
- Policy gradient (see the REINFORCE sketch after this list)
- Natural gradient
- Expectation Maximization
- Information Geometry Constraints
- Model-Based: using data samples {(s, a, r)} to build a model of the environment, and choosing actions by “planning”.
- Greedy Updates: PILCO *
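A minimal REINFORCE-style sketch for the model-free policy-gradient item above, assuming a softmax policy with tabular parameters; the env_step(s, a, rng) -> (s_next, reward, done) interface, the toy task, and the hyperparameters are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def reinforce(env_step, n_states, n_actions, episodes=200, alpha=0.05, gamma=0.99, seed=0):
    """Monte Carlo policy gradient: theta += alpha * G_t * grad log pi(a_t | s_t)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))        # softmax policy parameters
    for _ in range(episodes):
        # Roll out one full episode with the current policy.
        s, done, traj = 0, False, []
        while not done:
            probs = softmax(theta[s])
            a = rng.choice(n_actions, p=probs)
            s_next, r, done = env_step(s, a, rng)
            traj.append((s, a, r))
            s = s_next
        # Walk the episode backwards, accumulate returns, and apply the update
        # (the gamma^t factor on the gradient is omitted, as is common in practice).
        G = 0.0
        for s, a, r in reversed(traj):
            G = r + gamma * G
            grad_log = -softmax(theta[s])
            grad_log[a] += 1.0                     # gradient of log softmax w.r.t. theta[s]
            theta[s] += alpha * G * grad_log
    return theta

# Tiny illustrative task: two actions, action 1 pays +1, episode ends after one step.
def one_step_env(s, a, rng):
    return 0, float(a == 1), True

theta = reinforce(one_step_env, n_states=1, n_actions=2)
print(softmax(theta[0]))   # probability of action 1 should approach 1
```

The natural-gradient variant listed above preconditions this vanilla gradient with the inverse Fisher information matrix of the policy rather than changing the sampled quantity.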
- Deep RL (Matteo Hessel)
- learn policy directly
- learn a model ((s, a) -> (s', r)) -> infer the policy by planning
Deep RL:
- Gradient Descent
- Optimization
- DQN (paper) (see the target-computation sketch below)
- parallel experience streams
- Async RL (Mnih et al., 2016)
- Generalization
- Adaptive Reward Normalization
- RL-aware DL (inductive bias to suit the reward)
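A minimal sketch of the core DQN idea referenced above: one-step Q-learning targets computed with a separate, periodically-synced target network. Numpy arrays stand in for the networks' outputs, so the function names, the replay-buffer handling, and the sample values are illustrative, not the DQN paper's code:

```python
import numpy as np

def dqn_targets(rewards, dones, q_next_target, gamma=0.99):
    """One-step targets: y = r + gamma * max_a Q_target(s', a), truncated at terminals."""
    return rewards + gamma * (1.0 - dones) * q_next_target.max(axis=1)

def td_errors(q_pred, actions, targets):
    """TD errors for the actions actually taken; the loss is usually their (Huber) mean."""
    chosen = q_pred[np.arange(len(actions)), actions]
    return targets - chosen

# Illustrative mini-batch of 3 transitions with 2 actions.
rng = np.random.default_rng(0)
q_pred = rng.normal(size=(3, 2))          # Q(s, a) from the online network
q_next_target = rng.normal(size=(3, 2))   # Q(s', a) from the target network
rewards = np.array([1.0, 0.0, -1.0])
dones = np.array([0.0, 0.0, 1.0])         # last transition is terminal
actions = np.array([0, 1, 0])

y = dqn_targets(rewards, dones, q_next_target)
print(td_errors(q_pred, actions, y))
```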
Generalization (Anna Harutyunyan)
- Within Tasks
- Auxiliary Tasks
- Distributional RL
- Across Tasks
- Successor Features
- Universal Value Function Approximators (UVFA) and Generalized Policy Improvement (GPI)
- Universal Successor Features Approximators (USFA, ICLR 2019): combines the three ideas above (see the GPI block after this list)
- Hierarchical RL (A. Barreto)
Time scale
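The generalized policy improvement step referenced above, written out in the standard form used in the successor-features literature (not transcribed from the lecture): given previously learned policies π_1, …, π_n with action-values q^{π_i}, acting greedily over all of them performs at least as well as each one; with successor features ψ and a task vector w, the action-values for a new task decompose linearly.

```latex
\begin{align*}
\text{GPI:}\quad \pi(s) &\in \arg\max_{a}\; \max_{i}\; q^{\pi_i}(s, a) \\
\text{Successor features:}\quad q^{\pi_i}_{w}(s, a) &= \psi^{\pi_i}(s, a)^{\top} w,
\qquad r_w(s, a, s') = \phi(s, a, s')^{\top} w
\end{align*}
```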
- Multi-Agent RL
- Frontier (R. Sutton)
Playground: Robotics
DLRL: Robotics (R. Mahmood)
- Learning from scratch in real-time (hard! not mature yet!)
Courses
- Fundamentals (UA)
Learning Objectives
- Structuring the problem
- MDP
- Policy Evaluation and Policy Improvement
- Dynamic Programming
- Solid Course (D. Silver)
Learning Objectives *
- Exercises (\url{https://github.com/dennybritz/reinforcement-learning})
- Case Study: DQN (AlphaGo)
- Case Study: Robotics (UR5)
- Basic Workflow:
- Define the problem:
- State space: joint angles and velocities, vector between the target and the robot fingertip
- Action space: joint torques
- Reward:
- Control mode
- Env: Gym (see the interaction-loop sketch below)
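A minimal sketch of the basic workflow above as a Gym interaction loop (classic pre-0.26 Gym API); the environment id and the random policy are placeholders for the UR5 reacher setup described in the notes:

```python
import gym

# Placeholder environment id; the UR5 env from the case study is assumed to expose
# observations = joint angles/velocities + target-to-fingertip vector, actions = joint torques.
env = gym.make("Pendulum-v0")  # stand-in continuous-control task

for episode in range(3):
    obs = env.reset()                                 # classic (pre-0.26) Gym API
    done, total_reward = False, 0.0
    while not done:
        action = env.action_space.sample()            # placeholder for a learned policy
        obs, reward, done, info = env.step(action)    # observe next state and reward
        total_reward += reward
    print(f"episode {episode}: return = {total_reward:.2f}")
env.close()
```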