5. Policy Optimization Basics - wolai Notes

1.Policy-based RL

1.1 Value-based RL vs Policy-based RL

Value-based RL: learn a value function; the policy is implicit, derived from the value function
Policy-based RL: no value function; learn the policy directly
Actor-critic: learn both a policy and a value function

1.2 Advantages and Disadvantages of Policy-based RL

Advantages:
  • Better convergence properties: guaranteed to converge to a local optimum (worst case) or the global optimum (best case)
  • Policy gradient is more effective in high-dimensional action spaces
  • Policy gradient can learn stochastic policies, while value functions cannot (e.g., in rock-paper-scissors, value-based RL tends to map each state to a single best action and therefore keeps playing the same move, whereas playing with some randomness is actually the better strategy)
Disadvantages:
  • In practice it typically converges only to a local optimum
  • Evaluating the policy has high variance

1.3 Types of Policies

Deterministic policy: given a state, return a specific action to take.
Stochastic policy: given a state, return a probability distribution over actions. (Similar to a classification task that outputs a probability for each class; this form is more commonly used.)

1.4 Policy Optimization Objective

Define a parameterized policy function $\pi_\theta(s, a)$ to represent the policy.
Objective: given a policy approximator $\pi_\theta(s, a)$ with parameter $\theta$, find the best $\theta$
Question: how do we measure the quality of a policy $\pi_\theta$?
In episodic environments we can use the start value
That is, in an episodic (terminating) environment, the return collected from the start state $s_1$ under $\pi_\theta(s, a)$ measures how good the policy is, so we use the start-state value as the objective.
$J_1(\theta) = V^{\pi_\theta}(s_1) = E_{\pi_\theta}[v_1]$
In continuing environments:
In a continuing (non-terminating) environment, the start value $v_1$ no longer applies, so we define $d^{\pi_\theta}$, the stationary distribution of the Markov chain induced by $\pi_\theta$.
  1. we can use the average value
    Weight the value of each state by its stationary-distribution probability to obtain the objective under the policy.
    $J_{avV}(\theta) = \sum_{s} d^{\pi_\theta}(s) V^{\pi_\theta}(s)$
  2. or the average reward per time-step
    Sum over states (outer sum), and inside each state weight the immediate reward $R(s, a)$ by the probability of taking each action under the policy.
    $J_{avR}(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(s, a) R(s, a)$
    where $d^{\pi_\theta}$ is the stationary distribution of the Markov chain for $\pi_\theta$
The value of the policy is defined as
$J(\theta) = E_{\tau \sim \pi_\theta}\left[\sum_t R(s^{\tau}_t, a^{\tau}_t)\right] \approx \frac{1}{m} \sum_m \sum_t R(s^m_t, a^m_t)$
It is the same as the value function we defined in value-based RL.
Each policy induces its own distribution over trajectories; by repeatedly sampling trajectories and averaging their returns we estimate the value of the policy.
  1. $\tau$ is a trajectory sampled from the policy function $\pi_\theta$
  2. Here we ignore the discount factor for now
The goal of policy-based RL
$\theta^* = \underset{\theta}{\arg\max}\ E_{\tau \sim \pi_\theta}\left[\sum_t R(s^{\tau}_t, a^{\tau}_t)\right]$
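As a minimal sketch of this objective (assuming a hypothetical episodic environment with `reset()` and `step(action)` returning `(state, reward, done)`, and a hypothetical `sample_action(theta, state)` that draws $a \sim \pi_\theta(\cdot \mid s)$), $J(\theta)$ can be estimated by averaging sampled trajectory returns:

```python
import numpy as np

def estimate_objective(env, sample_action, theta, num_trajectories=100, horizon=200):
    """Monte Carlo estimate of J(theta): average undiscounted return over sampled trajectories."""
    returns = []
    for _ in range(num_trajectories):
        state = env.reset()
        total_reward = 0.0
        for _ in range(horizon):
            action = sample_action(theta, state)        # a ~ pi_theta(. | s)
            state, reward, done = env.step(action)
            total_reward += reward
            if done:
                break
        returns.append(total_reward)
    # J(theta) ~= (1/m) sum_m sum_t R(s_t^m, a_t^m)
    return np.mean(returns)
```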

1.5 Forms of the Policy Function

The previous subsection defined the policy objective from several angles but did not specify a concrete functional form for the policy. Two common forms are described below.

(1) Softmax Policy

Simple policy model: weight actions using a linear combination of features $\phi(s, a)^T \theta$
The probability of an action is proportional to the exponentiated weight
$\pi_\theta(s, a) = \frac{\exp(\phi(s, a)^T \theta)}{\sum_{a'} \exp(\phi(s, a')^T \theta)}$
The score function is
$\nabla_\theta \log \pi_\theta(s, a) = \phi(s, a) - E_{\pi_\theta}[\phi(s, \cdot)]$
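A small numerical sketch of the softmax policy and its score function (NumPy; `phi` is a hypothetical per-action feature matrix standing in for $\phi(s, a)$):

```python
import numpy as np

def softmax_policy(phi, theta):
    """phi: (num_actions, d) features phi(s, a) for each action; theta: (d,) weights.
    Returns pi_theta(s, .) as a probability vector over actions."""
    logits = phi @ theta
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def softmax_score(phi, theta, action):
    """Score function: grad_theta log pi_theta(s, a) = phi(s, a) - E_pi[phi(s, .)]."""
    probs = softmax_policy(phi, theta)
    return phi[action] - probs @ phi
```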

(2) Gaussian Policy

In continuous action spaces, a Gaussian policy can be naturally defined.
The mean is a linear combination of state features: $\mu(s) = \phi(s)^T \theta$
The variance $\sigma^2$ may be fixed, or it can also be parameterized
The policy is Gaussian: the continuous action $a \sim \mathcal{N}(\mu(s), \sigma^2)$
The score function is
$\nabla_\theta \log \pi_\theta(s, a) = \frac{(a - \mu(s))\, \phi(s)}{\sigma^2}$
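The corresponding score function in code (a sketch; `phi_s` is a hypothetical state-feature vector):

```python
import numpy as np

def gaussian_score(phi_s, theta, action, sigma=1.0):
    """Score function of a Gaussian policy a ~ N(mu(s), sigma^2) with mu(s) = phi(s)^T theta:
    grad_theta log pi_theta(s, a) = (a - mu(s)) * phi(s) / sigma^2."""
    mu = phi_s @ theta
    return (action - mu) * phi_s / sigma**2
```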

1.6 Methods for Optimizing the Objective

Policy-based RL is an optimization problem: find the $\theta$ that maximizes $J(\theta)$
If $J(\theta)$ is differentiable, we can use gradient-based methods:
  1. gradient ascent
  2. conjugate gradient
  3. quasi-Newton
If $J(\theta)$ is non-differentiable or its derivative is hard to compute, we can use derivative-free black-box optimization methods:
  1. Cross-entropy method (CEM)
  2. Hill climbing
  3. Evolutionary algorithms
When the objective is non-differentiable, we can only treat it as a black box and adjust the parameters based on the observed outputs.

1.7 Computing the Policy Gradient Analytically

Assume the policy $\pi_\theta$ is differentiable whenever it is non-zero
and that we can compute the gradient $\nabla_\theta \pi_\theta(s, a)$
The likelihood-ratio trick exploits the following identity
$\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a) \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)} = \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)$
The score function is $\nabla_\theta \log \pi_\theta(s, a)$

1.8 Cross-Entropy Method (CEM)

$\theta^* = \underset{\theta}{\arg\max}\ J(\theta)$
Treat $J(\theta)$ as a black-box score function (not differentiable)
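A minimal CEM sketch (assuming `J` is a black-box function of a parameter vector; the sampling distribution here is a diagonal Gaussian, which is one common choice):

```python
import numpy as np

def cross_entropy_method(J, dim, iterations=50, population=100, elite_frac=0.2):
    """CEM: sample theta from a Gaussian, keep the elite fraction with the
    highest J(theta), and refit the Gaussian to the elites."""
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = int(population * elite_frac)
    for _ in range(iterations):
        thetas = np.random.normal(mean, std, size=(population, dim))
        scores = np.array([J(theta) for theta in thetas])
        elites = thetas[np.argsort(scores)[-n_elite:]]       # top-scoring samples
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean
```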

1.9 Finite Difference Method

To evaluate the policy gradient of $\pi_\theta(s, a)$
For each dimension $k \in [1, n]$:
  1. estimate the $k$-th partial derivative of the objective function by perturbing $\theta$ by a small amount $\epsilon$ in the $k$-th dimension
    $\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon \mu_k) - J(\theta)}{\epsilon}$
    where $\mu_k$ is the unit vector with 1 in the $k$-th component and 0 elsewhere
Uses $n$ evaluations to compute the policy gradient over $n$ dimensions
Though noisy and inefficient, it works for arbitrary policies, even when the policy is not differentiable.
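A sketch of this estimator (assuming `theta` is a NumPy parameter vector and `J` a callable that evaluates the, possibly noisy, policy objective):

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-2):
    """Estimate grad_theta J(theta) by perturbing each dimension of theta by eps."""
    grad = np.zeros_like(theta)
    base = J(theta)                    # evaluated once and reused
    for k in range(len(theta)):
        unit = np.zeros_like(theta)
        unit[k] = 1.0                  # unit vector mu_k
        grad[k] = (J(theta + eps * unit) - base) / eps   # k-th partial derivative
    return grad
```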

2.Monte-Carlo policy gradient

2.1 Policy Gradient for One-Step MDPs

After drawing the initial state, we receive the one-step reward $r = R(s, a)$ and take the gradient of its expectation.
Consider a simple class of one-step MDPs
  1. Starting in state $s \sim d(s)$
  2. Terminating after one time-step with reward $r = R(s, a)$
Use likelihood ratios to compute the policy gradient
$J(\theta) = E_{\pi_\theta}[r] = \sum_{s \in S} d(s) \sum_{a \in A} \pi_\theta(s, a)\, r$
The gradient is
$\nabla_\theta J(\theta) = \sum_{s \in S} d(s) \sum_{a \in A} \pi_\theta(s, a)\, \nabla_\theta \log \pi_\theta(s, a)\, r = E_{\pi_\theta}[r\, \nabla_\theta \log \pi_\theta(s, a)]$

2.2 Policy Gradient for Multi-Step MDPs

Unlike the one-step case, the policy now interacts with the environment to produce an entire trajectory. $R(\tau)$ is defined as the sum of the rewards received along the trajectory, and $P(\tau; \theta)$ is the probability of generating that trajectory: starting from the initial state $s_0$, the policy chooses actions and the dynamics $p(s_{t+1}|s_t, a_t)$ produce the next states.
Denote a state-action trajectory from one episode as $\tau = (s_0, a_0, r_1, \ldots, s_{T-1}, a_{T-1}, r_T, s_T) \sim (\pi_\theta, P(s_{t+1}|s_t, a_t))$
Denote $R(\tau) = \sum_{t=0}^{T-1} R(s_t, a_t)$ as the sum of rewards over a trajectory $\tau$
The policy objective is
$J(\theta) = E_{\pi_\theta}\left[\sum_{t=0}^{T-1} R(s_t, a_t)\right] = \sum_{\tau} P(\tau, \theta) R(\tau)$
where $P(\tau, \theta) = \mu(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$ denotes the probability over trajectories when executing the policy $\pi_\theta$
Then our goal is to find the policy parameter $\theta$
$\theta^* = \underset{\theta}{\arg\max}\ J(\theta) = \underset{\theta}{\arg\max} \sum_{\tau} P(\tau, \theta) R(\tau)$
Take the gradient with respect to $\theta$:
$$\begin{align} \nabla_\theta J(\theta) &= \nabla_\theta \sum_{\tau} P(\tau, \theta) R(\tau) \\ &= \sum_{\tau} \nabla_\theta P(\tau, \theta) R(\tau) \\ &= \sum_{\tau} \frac{P(\tau, \theta)}{P(\tau, \theta)} \nabla_\theta P(\tau, \theta) R(\tau) \\ &= \sum_{\tau} P(\tau, \theta) R(\tau) \frac{\nabla_\theta P(\tau, \theta)}{P(\tau, \theta)} \\ &= \sum_{\tau} P(\tau, \theta) R(\tau) \nabla_\theta \log P(\tau, \theta) \end{align}$$
Because $P(\tau, \theta)$ is a long product, we use a small trick when computing the gradient: multiply and divide by $P(\tau, \theta)$ so the gradient turns into a log, which converts the product into a sum.
Approximate with an empirical estimate over $m$ sample paths under policy $\pi_\theta$
$\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} R(\tau_i) \nabla_\theta \log P(\tau_i, \theta)$
Here Monte Carlo sampling of many trajectories is used to approximate the gradient.

2.3 Decomposing the Trajectories into States and Actions

Approximate with an empirical estimate over $m$ sample paths under policy $\pi_\theta$
$\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} R(\tau_i) \nabla_\theta \log P(\tau_i, \theta)$
Decompose $\nabla_\theta \log P(\tau, \theta)$:
$$\begin{align} \nabla_\theta \log P(\tau, \theta) &= \nabla_\theta \log \left[\mu(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t|s_t)\, p(s_{t+1}|s_t, a_t)\right] \\ &= \nabla_\theta \left[\log \mu(s_0) + \sum_{t=0}^{T-1} \left(\log \pi_\theta(a_t|s_t) + \log p(s_{t+1}|s_t, a_t)\right)\right] \\ &= \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \end{align}$$
The log turns the product into a sum; the first term and the dynamics term do not depend on $\theta$, so they vanish under the gradient, leaving only the score function.

2.4 Likelihood Ratio Policy Gradient

Then our goal is to find the policy parameter $\theta$
$\theta^* = \underset{\theta}{\arg\max}\ J(\theta) = \underset{\theta}{\arg\max} \sum_{\tau} P(\tau, \theta) R(\tau)$
Approximate with an empirical estimate over $m$ sample paths under policy $\pi_\theta$
$\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} R(\tau_i) \nabla_\theta \log P(\tau_i, \theta)$
Then we have $\nabla_\theta \log P(\tau, \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)$, so
$\nabla_\theta J(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} R(\tau_i) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^i|s_t^i)$
It does not need to know the dynamics model!
In the end, the gradient becomes a sum: for each sampled trajectory we accumulate the score function at every step and weight it by the trajectory's total reward. Nowhere in this process do we need the state transition function.

2.5 General Form of the Score Function

Consider the generic form of $\nabla_\theta E_{\tau \sim \pi_\theta}[R(\tau)]$ as
$$\begin{align} \nabla_\theta E_{p(x, \theta)}[f(x)] &= E_{p(x, \theta)}[f(x) \nabla_\theta \log p(x, \theta)] \\ &\approx \frac{1}{S} \sum_{s=1}^{S} f(x_s) \nabla_\theta \log p(x_s, \theta), \quad \text{where } x_s \sim p(x, \theta) \end{align}$$
  • This computes the gradient of an expectation of a function $f(x)$
The above gradient can be understood as:
  1. Shift the distribution $p$ through its parameter $\theta$ so that its future samples $x$ achieve higher scores as judged by $f(x)$
  2. The direction $f(x) \nabla_\theta \log p(x, \theta)$ pushes up the log-likelihood of the sample in proportion to how good it is (see the sketch below)
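A tiny numerical sketch of the score-function estimator (assuming, for illustration, that $p(x, \theta)$ is a Gaussian with mean $\theta$ and fixed variance):

```python
import numpy as np

def score_function_gradient(f, theta, num_samples=10000, sigma=1.0):
    """Estimate grad_theta E_{x ~ N(theta, sigma^2)}[f(x)] with the score-function trick.
    For this Gaussian, grad_theta log p(x, theta) = (x - theta) / sigma^2."""
    x = np.random.normal(theta, sigma, size=num_samples)
    score = (x - theta) / sigma**2
    return np.mean(f(x) * score)

# Example: f(x) = -(x - 3)^2; the true gradient at theta is 2 * (3 - theta),
# so the estimate pushes theta toward 3.
grad = score_function_gradient(lambda x: -(x - 3.0) ** 2, theta=0.0)
```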

2.6 Policy Gradient Estimator vs. Maximum Likelihood Estimator

Policy gradient estimator:
$\frac{1}{M} \sum_{m=1}^{M} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{t,m}|s_{t,m})\right) \left(\sum_{t=1}^{T} r(s_{t,m}, a_{t,m})\right)$
Maximum likelihood estimator:
$\nabla_\theta J_{ML}(\theta) \approx \frac{1}{M} \sum_{m=1}^{M} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{t,m}|s_{t,m})\right)$
The two estimators look very similar; the policy gradient estimator simply multiplies in a reward term, which can be positive or negative. It can therefore be viewed as a reward-weighted maximum likelihood estimator.

3.Reduce the variance of policy gradient

3.1 Large Variance of Policy Gradient

Computing the gradient with Monte Carlo sampling gives an unbiased estimate, but its variance is very high (noisy).
Ways to improve it:
  • Exploit temporal causality to drop unnecessary terms
  • Include a baseline

3.2 Exploiting Temporal Causality

Previously $\nabla_\theta E_{\tau}[R] = E_{\tau}\left[\left(\sum_{t=0}^{T-1} r_t\right)\left(\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\right]$
We can derive the gradient estimator for a single reward term $r_{t'}$ as
$\nabla_\theta E_{\tau}[r_{t'}] = E_{\tau}\left[r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$
Summing this formula over $t'$, we obtain
$$\begin{align} \nabla_\theta J(\theta) = \nabla_\theta E_{\tau \sim \pi_\theta}[R] &= E_{\tau}\left[\sum_{t'=0}^{T-1} r_{t'} \sum_{t=0}^{t'} \nabla_\theta \log \pi_\theta(a_t|s_t)\right] \\ &= E_{\tau}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{T-1} r_{t'}\right] \\ &= E_{\tau}\left[\sum_{t=0}^{T-1} G_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t)\right] \end{align}$$
Therefore we have
$\nabla_\theta E_{\tau \sim \pi_\theta}[R] = E_{\tau}\left[\sum_{t=0}^{T-1} G_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$
$G_t = \sum_{t'=t}^{T-1} r_{t'}$ is the return of the trajectory from step $t$ onward
Causality: the policy at time $t'$ cannot affect the reward at time $t$ when $t < t'$
Then we have the following estimated update
$\nabla_\theta E[R] \approx \frac{1}{m} \sum_{i=1}^{m} \sum_{t=0}^{T-1} G_t^{(i)} \nabla_\theta \log \pi_\theta(a_t^i|s_t^i)$
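A small helper (a sketch) computing the reward-to-go $G_t$ for a list of per-step rewards by a reverse cumulative sum:

```python
import numpy as np

def rewards_to_go(rewards):
    """G_t = sum_{t'=t}^{T-1} r_{t'} for every step t of a trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        returns[t] = running
    return returns

# rewards_to_go([1, 0, 2]) -> [3., 2., 2.]
```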

REINFORCE : A Monte-Carlo policy gradient algorithm

The algorithm simply samples multiple trajectories following the policy $\pi_\theta$ while updating $\theta$ using the estimated gradient
Classic paper: Williams (1992), "Simple statistical gradient-following algorithms for connectionist reinforcement learning", which introduces the REINFORCE algorithm
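A minimal REINFORCE sketch in PyTorch, assuming a hypothetical episodic environment whose `reset()` returns a state and whose `step(action)` returns `(state, reward, done)`, with a small softmax policy network:

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """Small policy network: state features -> action logits."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, num_actions)
        )

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

def reinforce(env, policy, episodes=1000, lr=1e-2):
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(episodes):
        log_probs, rewards = [], []
        state, done = env.reset(), False
        while not done:
            dist = policy(torch.as_tensor(state, dtype=torch.float32))
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            state, reward, done = env.step(action.item())
            rewards.append(reward)
        # reward-to-go G_t for each step of the trajectory
        returns, running = [], 0.0
        for r in reversed(rewards):
            running += r
            returns.insert(0, running)
        returns = torch.as_tensor(returns, dtype=torch.float32)
        # gradient ascent on J(theta): minimize the negative weighted log-likelihood
        loss = -(torch.stack(log_probs) * returns).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```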

3.3 Including a Baseline

The original update
$\nabla_\theta E_{\tau \sim \pi_\theta}[R] = E_{\tau}\left[\sum_{t=0}^{T-1} G_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$
$G_t = \sum_{t'=t}^{T-1} r_{t'}$ is the return of the trajectory, which may have high variance
We subtract a baseline $b(s)$ in the policy gradient to reduce variance
$\nabla_\theta E_{\tau \sim \pi_\theta}[R] = E_{\tau}\left[\sum_{t=0}^{T-1} (G_t - b(s_t)) \cdot \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$
A good baseline is the expected return
$b(s_t) = E[r_t + r_{t+1} + \ldots + r_{T-1}]$
Interpretation: increase the log-probability of action $a_t$ in proportion to how much the return $G_t$ is better than the expected return
We can show that the baseline $b(s)$ reduces variance without changing the expectation:
$$E_{\tau}[\nabla_\theta \log \pi_\theta(a_t|s_t)\, b(s_t)] = 0 \\ E_{\tau}[\nabla_\theta \log \pi_\theta(a_t|s_t)(G_t - b(s_t))] = E_{\tau}[\nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t] \\ Var_{\tau}[\nabla_\theta \log \pi_\theta(a_t|s_t)(G_t - b(s_t))] < Var_{\tau}[\nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t]$$
Thus subtracting a baseline is unbiased in expectation and reduces variance
$\nabla_\theta J(\theta) = E_{\tau}\left[\sum_{t=0}^{T-1} (G_t - b_w(s_t)) \cdot \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$
Introducing the baseline therefore does not change the actual value of the policy gradient: the expectation is unchanged while the variance is reduced.
$b_w(s)$ also has a parameter $w$ to learn, so we now have two sets of parameters, $w$ and $\theta$
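A short sketch of the standard argument for the first identity above (the baseline term has zero expectation):

```latex
\begin{align}
E_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a|s)\, b(s)\right]
  &= b(s) \sum_a \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s) \\
  &= b(s) \sum_a \nabla_\theta \pi_\theta(a|s) \\
  &= b(s)\, \nabla_\theta \sum_a \pi_\theta(a|s) \\
  &= b(s)\, \nabla_\theta 1 \\
  &= 0
\end{align}
```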

Vanilla Policy Gradient Algorithm with Baseline

We need to estimate two sets of parameters at the same time: $w$ for the baseline $b_w(s)$ and $\theta$ for the policy $\pi_\theta$.
Sutton, McAllester, Singh, Mansour (1999). Policy gradient methods for reinforcement learning with function approximation
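A rough sketch of one such update under these assumptions: a linear baseline $b_w(s)$ and hypothetical helpers `features(s)` and `score(theta, s, a)` for the chosen policy class; `trajectories` holds `(s, a, G)` tuples with returns-to-go already computed:

```python
import numpy as np

def vanilla_pg_with_baseline(trajectories, features, score, theta, w,
                             alpha_theta=1e-2, alpha_w=1e-2):
    """One update of vanilla policy gradient with a linear baseline b_w(s) = features(s)^T w."""
    grad_theta = np.zeros_like(theta)
    for traj in trajectories:
        for s, a, G in traj:
            baseline = features(s) @ w
            advantage = G - baseline
            # baseline: move b_w(s) toward the observed return (least-squares step)
            w = w + alpha_w * advantage * features(s)
            # policy: score function weighted by (G_t - b_w(s_t))
            grad_theta += advantage * score(theta, s, a)
    theta = theta + alpha_theta * grad_theta / len(trajectories)
    return theta, w
```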

3.4 Introducing a Critic

The update is
$\nabla_\theta J(\theta) = E_{\tau}\left[\sum_{t=0}^{T-1} G_t \cdot \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$
In practice, $G_t$ is a Monte Carlo sample: an unbiased but noisy estimate of $Q^{\pi_\theta}(s_t, a_t)$
Instead we can use a critic to estimate the action-value function
$Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$
Then the update becomes
$\nabla_\theta J(\theta) = E_{\tau}\left[\sum_{t=0}^{T-1} Q_w(s_t, a_t) \cdot \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$
The Q-function gives the value of taking a given action in the current state.

4.Actor-Critic

4.1 Actor-Critic Policy Gradient

$\nabla_\theta J(\theta) = E_{\pi_\theta}\left[\sum_{t=0}^{T-1} Q_w(s_t, a_t) \cdot \nabla_\theta \log \pi_\theta(a_t|s_t)\right]$
Critic: $Q_w(s_t, a_t)$, the value function that evaluates the actions, similar to policy evaluation of the current policy (it can be estimated with MC or TD methods).
Actor: $\nabla_\theta \log \pi_\theta(a_t|s_t)$, the policy that actually interacts with the environment and generates the training data.
It becomes Actor-Critic Policy Gradient
  1. Actor: the policy function used to generate the action
  2. Critic: the value function used to evaluate the reward of the actions
Actor-critic algorithms maintain two sets of parameters
  1. Actor: updates policy parameters $\theta$, in the direction suggested by the critic
  2. Critic: updates action-value function parameters $w$

4.2 Actor-Critic Algorithm

Using a linear value function approximation $Q_w(s, a) = \psi(s, a)^T w$:
  • Critic: update $w$ by linear TD(0)
  • Actor: update $\theta$ by policy gradient (a sketch of the resulting loop follows below)
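A rough QAC sketch under these choices (NumPy; `psi`, `sample_action`, and `score` are hypothetical helpers for the feature map and the parameterized policy, and `env.step` is assumed to return `(state, reward, done)`):

```python
import numpy as np

def qac(env, psi, sample_action, score, theta, feature_dim,
        alpha_w=0.01, alpha_theta=0.001, gamma=0.99, episodes=500):
    """Q Actor-Critic with a linear critic Q_w(s, a) = psi(s, a)^T w.
    Critic: TD(0) on w.  Actor: policy gradient weighted by Q_w(s, a)."""
    w = np.zeros(feature_dim)
    for _ in range(episodes):
        s = env.reset()
        a = sample_action(theta, s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = sample_action(theta, s_next)
            q = psi(s, a) @ w
            q_next = 0.0 if done else psi(s_next, a_next) @ w
            td_error = r + gamma * q_next - q
            w = w + alpha_w * td_error * psi(s, a)                    # critic: linear TD(0)
            theta = theta + alpha_theta * q * score(theta, s, a)      # actor: policy gradient step
            s, a = s_next, a_next
    return theta, w
```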

4.3 Approximating the Value Function and the Policy Function

We can use two separate function approximators for the value function and the policy, or a shared network design (the feature-extraction layers are shared and the network outputs two heads).
As discussed above, adding a baseline reduces variance. Since the V-function is the expectation of the Q-function over actions, V can be used directly as the baseline, which yields the advantage function A.
Recall the Q-function (state-action value function):
$Q^{\pi,\gamma}(s, a) = E_{\pi}[r_1 + \gamma r_2 + \ldots \mid s_1 = s, a_1 = a]$
The state value function can serve as a great baseline
$V^{\pi,\gamma}(s) = E_{\pi}[r_1 + \gamma r_2 + \ldots \mid s_1 = s] = E_{a \sim \pi}[Q^{\pi,\gamma}(s, a)]$
Advantage function: combining Q with the baseline V
$A^{\pi,\gamma}(s, a) = Q^{\pi,\gamma}(s, a) - V^{\pi,\gamma}(s)$
Then the policy gradient becomes:
$\nabla_\theta J(\theta) = E_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s, a)\, A^{\pi,\gamma}(s, a)]$
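In practice the advantage is often estimated from a learned state-value function alone: since $Q^{\pi,\gamma}(s, a) = E[r + \gamma V^{\pi,\gamma}(s')]$, the one-step TD error serves as an estimate of the advantage when $V$ is accurate. A tiny sketch (the function name and arguments are illustrative):

```python
def advantage_td_estimate(reward, value_s, value_s_next, gamma=0.99, done=False):
    """One-step advantage estimate A(s, a) ~= r + gamma * V(s') - V(s) (the TD error)."""
    target = reward + (0.0 if done else gamma * value_s_next)
    return target - value_s
```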

4.4 Extensions of Policy Gradient

When part of the computation is non-differentiable, we replace that part with samples, which produces many branches; the gradients passed back from each sample are then aggregated to approximate the gradient of the non-differentiable part.
Another interesting advantage of policy gradient is that it allows us to overcome non-differentiable computation
During training we produce several samples (the branches of the computation), and then we encourage the samples that eventually led to good outcomes (measured in this case, for example, by the loss at the end)

5.The Value-Based and Policy-Based Schools of RL

5.1 The Value-Based School

Approaches reinforcement learning through dynamic programming, building on classical RL and applications of control theory.
Representative figures: Sutton, David Silver (AlphaGo)
Leans toward game applications: as long as there are enough computational resources, run a large number of simulations.

5.2 The Policy-Based School

Mostly researchers with robotics and machine learning backgrounds, who favor policy-function-based RL methods.
Representative figures: Pieter Abbeel, Sergey Levine
They emphasize sample efficiency, getting by with fewer samples.

