
李宏毅 (Hung-yi Lee) RL 2018 Notes

This post contains many images and may load slowly.

I. Overview

The RL setting

The sparse reward problem

RL vs. supervised learning

Definitions:

  • state
  • action
  • reward
  • episode

Optimization objective: maximize the total reward accumulated over each episode.
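
In symbols (using the notation introduced in the Policy Gradient section below, with $r_t$ the reward at step $t$ of an episode of length $T$):

$$\max_{\theta}\ \bar{R}_{\theta}, \qquad \bar{R}_{\theta}=\mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right]$$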

Difficulties:

  • delayed rewards: the agent must take a long-term view instead of optimizing only the immediate reward
  • exploration: the agent's actions should explore beyond what it currently believes is best

Framework:

The lecture also surveys which reference materials cover which parts of this subject.

II. Policy Gradient

Learning an Actor

  1. Function

    • $\text{Action} = \pi(\text{Observation})$
    • Neural network as Actor
    • the policy is stochastic, not deterministic: given an observation it outputs a probability for each action
    • a neural network can generalize to unseen states, while a traditional lookup table cannot
  2. Loss

    • Goodness of Actor
    • $\pi_\theta(s)$
    • network parameters $\theta$
    • play one episode and obtain the total reward $R_\theta$
    • $R_\theta$ is a random variable, so use its expected value $\bar{R}_\theta$
    • suppose an episode is a trajectory
      • $\tau = \{s_1,a_1,r_1,s_2,a_2,r_2,\cdots,s_T,a_T,r_T\}$
      • $R(\tau)=\sum_{t=1}^{T}r_{t}$
      • when an actor plays the game, each trajectory $\tau$ has a sampling probability that depends on the actor's parameters $\theta$: $P(\tau|\theta)$
      • $\bar{R}_{\theta}=\sum_{\tau}R(\tau)P(\tau|\theta)$
      • in practice, use $\pi_\theta$ to play the game $N$ times and collect the sampled trajectories $\tau^1,\tau^2,\cdots,\tau^N$
      • $\bar{R}_{\theta}=\sum_{\tau}R(\tau)P(\tau|\theta)\approx\frac{1}{N}\sum_{n=1}^{N}R(\tau^n)$ [1]
  3. Optimization

    • pick the best actor, i.e. the parameters $\theta^*$ that maximize the expected reward

      $\theta^* = \arg\max_{\theta} \bar{R}_{\theta}, \quad \bar{R}_{\theta} = \sum_{\tau} R(\tau)P(\tau|\theta)$

    • Gradient ascent update rule

      $\theta^{\text{new}} \leftarrow \theta^{\text{old}} + \eta \nabla \bar{R}_{\theta^{\text{old}}}$

      • $\theta = \{w_1, w_2, \cdots, b_1, \cdots\}$ represents the neural network parameters (weights and biases)
      • $\nabla\bar{R}_\theta = \begin{bmatrix} \partial\bar{R}_\theta/\partial w_1 \\ \partial\bar{R}_\theta/\partial w_2 \\ \vdots \\ \partial\bar{R}_\theta/\partial b_1 \\ \vdots \end{bmatrix}$ is the gradient vector
    • Gradient derivation:

      $$\begin{aligned} \nabla\bar{R}_\theta &= \sum_{\tau}R(\tau)\nabla P(\tau|\theta) \\ &= \sum_{\tau}R(\tau)P(\tau|\theta)\frac{\nabla P(\tau|\theta)}{P(\tau|\theta)} \\ &= \sum_{\tau}R(\tau)P(\tau|\theta)\nabla \log P(\tau|\theta) \\ &\approx \frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\nabla \log P(\tau^{n}|\theta) \end{aligned}$$

      • Line 1: uses the fact that $R(\tau)$ does not depend on $\theta$, so the gradient acts only on $P(\tau|\theta)$
      • Line 2: multiply and divide by $P(\tau|\theta)$ to set up the log-derivative form. I have always found this a stroke of genius: it turns the product over time steps into a sum, and at the same time produces an expectation that Monte Carlo sampling can estimate. Perhaps that is exactly the underlying connection.
      • Line 3: apply the log-derivative identity $\nabla \log f(x) = \frac{\nabla f(x)}{f(x)}$
      • Line 4: Monte Carlo approximation: sample $N$ trajectories to estimate the gradient, as in Equation [1]
    • Log-gradient of the trajectory probability:

      $$\begin{aligned} P(\tau|\theta) &= p(s_1) \prod_{t=1}^{T} p(a_t|s_t,\theta)\, p(r_t,s_{t+1}|s_t,a_t) \\ \log P(\tau|\theta) &= \log p(s_1) + \sum_{t=1}^{T} \log p(a_t|s_t,\theta) + \sum_{t=1}^{T} \log p(r_t,s_{t+1}|s_t,a_t) \\ \nabla \log P(\tau|\theta) &= \sum_{t=1}^{T} \nabla \log p(a_t|s_t,\theta) \end{aligned}$$

      • trajectory $\tau = \{s_1,a_1,r_1,s_2,a_2,r_2,\cdots,s_T,a_T,r_T\}$
      • only the terms $p(a_t|s_t,\theta)$ depend on the policy parameters $\theta$, so only they survive the gradient
    • Final gradient expression:

      $$\begin{aligned} \nabla\bar{R}_{\theta} &\approx \frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\nabla \log P(\tau^{n}|\theta) \\ &= \frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\sum_{t=1}^{T_{n}}\nabla \log p(a_{t}^{n}|s_{t}^{n},\theta) \\ &= \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}R(\tau^{n})\nabla \log p(a_{t}^{n}|s_{t}^{n},\theta) \end{aligned}$$

      • $s_{t}^{n}$: the state at time step $t$ of the $n$-th sampled trajectory
      • $a_{t}^{n}$: the action at time step $t$ of the $n$-th sampled trajectory
    • Intuitive interpretation

      • If $R(\tau^{n})$ is positive → tune $\theta$ to increase $p(a_{t}^{n}|s_{t}^{n})$
      • If $R(\tau^{n})$ is negative → tune $\theta$ to decrease $p(a_{t}^{n}|s_{t}^{n})$
      • Important: the gradient weights each action by the complete trajectory reward $R(\tau^n)$, not the immediate reward $r_t^n$; a code sketch of this estimator follows after this list
  4. Why the $\log$? Dividing by the probability of the action normalizes for how often that action appears in the sampled data, so frequently sampled actions are not reinforced merely because they occur more often.

  5. On the sign of the reward: if the reward is always positive, subtract a baseline (bias) $b$ from $R(\tau^n)$ so that trajectories with below-baseline reward have their actions discouraged; see the formula and code sketch after this list.

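
A minimal worked form of the baseline idea in item 5 (standard REINFORCE-with-baseline; the constant $b$ is notation added here, not taken from the lecture slides, and is typically chosen close to the average return):

$$\nabla\bar{R}_{\theta} \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}\big(R(\tau^{n})-b\big)\nabla \log p(a_{t}^{n}|s_{t}^{n},\theta)$$

With $b$ near the average return, trajectories that do worse than average receive a negative weight, so their actions are made less likely even when every raw reward is positive.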
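
To make the final gradient expression concrete, the following is a minimal PyTorch sketch of the estimator written as a surrogate loss (minimizing the loss performs gradient ascent on $\bar{R}_\theta$). All names here (`PolicyNet`, `reinforce_loss`, the dummy trajectories) are illustrative assumptions, not code from the course.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# A small stochastic policy: observation -> probability over discrete actions.
class PolicyNet(nn.Module):
    def __init__(self, state_dim=4, n_actions=2, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
        )

    def forward(self, states):
        # pi_theta(a | s) for every state in the batch
        return torch.softmax(self.net(states), dim=-1)

def reinforce_loss(policy, trajectories, baseline=0.0):
    """Surrogate loss whose gradient is
    -(1/N) * sum_n (R(tau^n) - b) * sum_t grad log p(a_t^n | s_t^n, theta)."""
    losses = []
    for states, actions, rewards in trajectories:   # one tuple per sampled episode
        total_return = rewards.sum()                # R(tau^n): whole-episode reward
        log_probs = Categorical(policy(states)).log_prob(actions)  # log p(a_t | s_t, theta)
        losses.append(-(total_return - baseline) * log_probs.sum())
    return torch.stack(losses).mean()

# Usage with dummy data: 2 fake episodes of length 5, state_dim=4, 2 actions.
policy = PolicyNet()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
trajectories = [
    (torch.randn(5, 4), torch.randint(0, 2, (5,)), torch.rand(5))
    for _ in range(2)
]
loss = reinforce_loss(policy, trajectories, baseline=0.0)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In a real setup the trajectories would come from rolling out $\pi_\theta$ in an environment; the random tensors above only illustrate the expected shapes.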

III. PPO

IV. Q-Learning

V. Actor-Critic

VI. Tips

Appendix: Some Mathematical Background
