
李宏毅 (Hung-yi Lee) RL 2018 Notes

This post contains many images and may load slowly.

I. Overview

The RL setting

The sparse reward problem

RL vs. supervised learning

Definitions:

  • state
  • action
  • reward
  • episode

Optimization objective: maximize the total reward accumulated over each episode.
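
In symbols (using the notation introduced in the Policy Gradient section below, with $r_t$ the reward at step $t$ of an episode of length $T$):

$$\max_{\theta}\ \bar{R}_{\theta}, \qquad \bar{R}_{\theta}=\mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right]$$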

Difficulties:

  • delayed rewards: the agent must take a long-term view instead of optimizing only the immediate reward
  • exploration: the agent's actions should explore beyond what it currently believes is best

Framework:

The lecture also surveys which reference materials cover which parts of this subject.

II. Policy Gradient

Learning an Actor

  1. Function

    • $\text{Action} = \pi(\text{Observation})$
    • Neural network as Actor
    • the policy is stochastic, not deterministic: given an observation it outputs a probability for each action
    • a neural network can generalize to unseen states, while a traditional lookup table cannot
  2. Loss

    • Goodness of Actor
    • $\pi_\theta(s)$
    • network parameters $\theta$
    • play one episode and obtain the total reward $R_\theta$
    • $R_\theta$ is a random variable, so use its expected value $\bar{R}_\theta$
    • suppose an episode is a trajectory
      • $\tau = \{s_1,a_1,r_1,s_2,a_2,r_2,\cdots,s_T,a_T,r_T\}$
      • $R(\tau)=\sum_{t=1}^{T}r_{t}$
      • when an actor plays the game, each trajectory $\tau$ has a sampling probability that depends on the actor's parameters $\theta$: $P(\tau|\theta)$
      • $\bar{R}_{\theta}=\sum_{\tau}R(\tau)P(\tau|\theta)$
      • in practice, use $\pi_\theta$ to play the game $N$ times and collect the sampled trajectories $\tau^1,\tau^2,\cdots,\tau^N$
      • $\bar{R}_{\theta}=\sum_{\tau}R(\tau)P(\tau|\theta)\approx\frac{1}{N}\sum_{n=1}^{N}R(\tau^n)$ [1]
  3. Optimization

    • pick the best actor, i.e. the parameters $\theta^*$ that maximize the expected reward

      $\theta^* = \arg\max_{\theta} \bar{R}_{\theta}, \quad \bar{R}_{\theta} = \sum_{\tau} R(\tau)P(\tau|\theta)$

    • Gradient ascent update rule

      $\theta^{\text{new}} \leftarrow \theta^{\text{old}} + \eta \nabla \bar{R}_{\theta^{\text{old}}}$

      • $\theta = \{w_1, w_2, \cdots, b_1, \cdots\}$ represents the neural network parameters (weights and biases)
      • $\nabla\bar{R}_\theta = \begin{bmatrix} \partial\bar{R}_\theta/\partial w_1 \\ \partial\bar{R}_\theta/\partial w_2 \\ \vdots \\ \partial\bar{R}_\theta/\partial b_1 \\ \vdots \end{bmatrix}$ is the gradient vector
    • Gradient derivation:

      $$\begin{aligned} \nabla\bar{R}_\theta &= \sum_{\tau}R(\tau)\nabla P(\tau|\theta) \\ &= \sum_{\tau}R(\tau)P(\tau|\theta)\frac{\nabla P(\tau|\theta)}{P(\tau|\theta)} \\ &= \sum_{\tau}R(\tau)P(\tau|\theta)\nabla \log P(\tau|\theta) \\ &\approx \frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\nabla \log P(\tau^{n}|\theta) \end{aligned}$$

      • Line 1: uses the fact that $R(\tau)$ does not depend on $\theta$, so the gradient acts only on $P(\tau|\theta)$
      • Line 2: multiply and divide by $P(\tau|\theta)$ to set up the log-derivative form. I have always found this a stroke of genius: it turns the product over time steps into a sum, and at the same time produces an expectation that Monte Carlo sampling can estimate. Perhaps that is exactly the underlying connection.
      • Line 3: apply the log-derivative identity $\nabla \log f(x) = \frac{\nabla f(x)}{f(x)}$
      • Line 4: Monte Carlo approximation: sample $N$ trajectories to estimate the gradient, as in Equation [1]
    • Log-gradient of the trajectory probability:

      $$\begin{aligned} P(\tau|\theta) &= p(s_1) \prod_{t=1}^{T} p(a_t|s_t,\theta)\, p(r_t,s_{t+1}|s_t,a_t) \\ \log P(\tau|\theta) &= \log p(s_1) + \sum_{t=1}^{T} \log p(a_t|s_t,\theta) + \sum_{t=1}^{T} \log p(r_t,s_{t+1}|s_t,a_t) \\ \nabla \log P(\tau|\theta) &= \sum_{t=1}^{T} \nabla \log p(a_t|s_t,\theta) \end{aligned}$$

      • trajectory $\tau = \{s_1,a_1,r_1,s_2,a_2,r_2,\cdots,s_T,a_T,r_T\}$
      • only the terms $p(a_t|s_t,\theta)$ depend on the policy parameters $\theta$, so only they survive the gradient
    • Final gradient expression:

      $$\begin{aligned} \nabla\bar{R}_{\theta} &\approx \frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\nabla \log P(\tau^{n}|\theta) \\ &= \frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\sum_{t=1}^{T_{n}}\nabla \log p(a_{t}^{n}|s_{t}^{n},\theta) \\ &= \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}R(\tau^{n})\nabla \log p(a_{t}^{n}|s_{t}^{n},\theta) \end{aligned}$$

      • $s_{t}^{n}$: the state at time step $t$ of the $n$-th sampled trajectory
      • $a_{t}^{n}$: the action at time step $t$ of the $n$-th sampled trajectory
    • Intuitive interpretation

      • If $R(\tau^{n})$ is positive → tune $\theta$ to increase $p(a_{t}^{n}|s_{t}^{n})$
      • If $R(\tau^{n})$ is negative → tune $\theta$ to decrease $p(a_{t}^{n}|s_{t}^{n})$
      • Important: the gradient weights each action by the complete trajectory reward $R(\tau^n)$, not the immediate reward $r_t^n$; a code sketch of this estimator follows after this list
  4. Why the $\log$? Dividing by the probability of the action normalizes for how often that action appears in the sampled data, so frequently sampled actions are not reinforced merely because they occur more often.

  5. On the sign of the reward: if the reward is always positive, subtract a baseline (bias) $b$ from $R(\tau^n)$ so that trajectories with below-baseline reward have their actions discouraged; see the formula and code sketch after this list.

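
A minimal worked form of the baseline idea in item 5 (standard REINFORCE-with-baseline; the constant $b$ is notation added here, not taken from the lecture slides, and is typically chosen close to the average return):

$$\nabla\bar{R}_{\theta} \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}\big(R(\tau^{n})-b\big)\nabla \log p(a_{t}^{n}|s_{t}^{n},\theta)$$

With $b$ near the average return, trajectories that do worse than average receive a negative weight, so their actions are made less likely even when every raw reward is positive.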
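
To make the final gradient expression concrete, the following is a minimal PyTorch sketch of the estimator written as a surrogate loss (minimizing the loss performs gradient ascent on $\bar{R}_\theta$). All names here (`PolicyNet`, `reinforce_loss`, the dummy trajectories) are illustrative assumptions, not code from the course.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# A small stochastic policy: observation -> probability over discrete actions.
class PolicyNet(nn.Module):
    def __init__(self, state_dim=4, n_actions=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
        )

    def forward(self, states):
        # pi_theta(a | s) for every state in the batch
        return torch.softmax(self.net(states), dim=-1)

def reinforce_loss(policy, trajectories, baseline=0.0):
    """Surrogate loss whose gradient is
    -(1/N) * sum_n (R(tau^n) - b) * sum_t grad log p(a_t^n | s_t^n, theta)."""
    losses = []
    for states, actions, rewards in trajectories:   # one tuple per sampled episode
        total_return = rewards.sum()                # R(tau^n): whole-episode reward
        log_probs = Categorical(policy(states)).log_prob(actions)  # log p(a_t | s_t, theta)
        losses.append(-(total_return - baseline) * log_probs.sum())
    return torch.stack(losses).mean()

# Usage with dummy data: 2 fake episodes of length 5, state_dim=4, 2 actions.
policy = PolicyNet()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
trajectories = [
    (torch.randn(5, 4), torch.randint(0, 2, (5,)), torch.rand(5))
    for _ in range(2)
]
loss = reinforce_loss(policy, trajectories, baseline=0.0)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In a real setup the trajectories would come from rolling out $\pi_\theta$ in an environment; the random tensors above only illustrate the expected shapes.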

III. PPO

IV. Q-Learning

V. Actor-Critic

VI. Tips

Appendix: Some Mathematical Background
