4. Value Function Approximation - wolai Notes

1.Introduction

1.1 Problem Setup

How do we estimate the value function for large-scale MDPs?
Answer: for large-scale MDPs, avoid tabular representations (e.g., a Q-table); instead, use parameterized function approximation to estimate $V$, $Q$, and $\pi$.

1.2 Solution

Estimate with function approximation
$$\hat{v}(s, w) \approx v^{\pi}(s) \\ \hat{q}(s, a, w) \approx q^{\pi}(s, a) \\ \hat{\pi}(a, s, w) \approx \pi(a|s)$$
  1. Generalize from seen states to unseen states
  2. Update the parameter w using MC and TD learning

1.3 Types of value function approximation

1.4 Approximation Methods

  1. Linear combinations of features
  2. Neural networks
  3. Decision trees
  4. Nearest neighbors

1.5 Gradient Descent

  1. Consider an objective function $J(w)$ that is a differentiable function of a parameter vector $w$
  2. The goal is to find the parameter $w^*$ that minimizes $J$
  3. Define the gradient of $J(w)$ to be
    $$\nabla_w J(w) = \left( \frac{\partial J(w)}{\partial w_1}, \frac{\partial J(w)}{\partial w_2}, ..., \frac{\partial J(w)}{\partial w_n} \right)^T$$
  4. Adjust $w$ in the direction of the negative gradient, where $\alpha$ is the step size
    $$\Delta w = - \frac{1}{2} \alpha \nabla_w J(w)$$
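For concreteness, here is a minimal numerical sketch of this update on a simple quadratic objective; the objective $J(w) = \lVert w - w_{true} \rVert^2$, the value of `w_true`, and the step size are illustrative assumptions, not part of the notes.

```python
import numpy as np

# Illustrative quadratic objective J(w) = ||w - w_true||^2 with a known minimizer.
w_true = np.array([1.0, -2.0, 0.5])

def grad_J(w):
    # Gradient of J(w) = sum_i (w_i - w_true_i)^2.
    return 2.0 * (w - w_true)

w = np.zeros(3)      # initial parameter vector
alpha = 0.1          # step size

for _ in range(200):
    w += -0.5 * alpha * grad_J(w)   # Delta w = -1/2 * alpha * grad J(w)

print(w)             # converges towards w_true
```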

1.6 Value Function Approximation with an Oracle

An oracle here means the ground truth: to fit the value function as accurately as possible, we can fit $\hat{v}$ using the true values of the known policy's value function.
  1. We assume that we have an oracle that knows the true value $v^{\pi}(s)$ for any given state $s$
  2. Then the objective is to find the best approximate representation of $v^{\pi}(s)$
  3. Thus we use the mean squared error and define the loss function as
    $$J(w) = E_{\pi}[(v^{\pi}(s) - \hat{v}(s, w))^2]$$
  4. Follow gradient descent to find a local minimum
    $$\Delta w = - \frac{1}{2} \alpha \nabla_w J(w), \quad w_{t+1} = w_t + \Delta w$$

2.Prediction

2.1 When the Oracle $v^{\pi}(s)$ Is Known

Represent the state using a feature vector:
$$x(s) = (x_1(s), ..., x_n(s))^T$$

(1) The V Function Is Known to Be Linear

Stochastic gradient descent converges to the global optimum: in the linear case there is only one optimum, so a local optimum automatically converges to (or near) the global optimum.
  1. Represent the value function by a linear combination of features
    $$\hat{v}(s, w) = x(s)^T w = \sum_{j=1}^n x_j(s) w_j$$
  2. The objective function is quadratic in the parameter $w$
    $$J(w) = E_{\pi}[(v^{\pi}(s) - x(s)^T w)^2]$$
  3. Thus the update rule is as simple as
    $$\Delta w = \alpha (v^{\pi}(s) - \hat{v}(s, w)) x(s)$$
    $$\text{Update} = \text{StepSize} \times \text{PredictionError} \times \text{FeatureValue}$$
  4. Stochastic gradient descent converges to the global optimum, because in the linear case there is only one optimum, so a local optimum automatically converges to (or near) the global optimum. A minimal numerical sketch of this update follows after this list.
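The sketch below assumes a hypothetical scalar state, hand-made features `feature_fn`, and a hypothetical oracle `v_pi`; none of these names come from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_fn(s):
    # Hypothetical feature vector x(s) for a scalar state s.
    return np.array([1.0, s, s ** 2])

def v_pi(s):
    # Hypothetical oracle giving the "true" value of state s.
    return 2.0 + 0.5 * s - 0.3 * s ** 2

w = np.zeros(3)      # weights of the linear approximator v_hat(s, w) = x(s)^T w
alpha = 0.05         # step size

for _ in range(5000):
    s = rng.uniform(-1.0, 1.0)             # sample a state
    x = feature_fn(s)
    v_hat = x @ w
    w += alpha * (v_pi(s) - v_hat) * x     # Update = StepSize * PredictionError * FeatureValue

print(w)  # approximately recovers [2.0, 0.5, -0.3]
```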

(2) Table Lookup Feature

If one-hot encoding is used, we get the table-lookup feature case.
  1. Table lookup is a special case of linear value function approximation
  2. The table-lookup feature is a one-hot vector as follows
    $$x^{table}(s) = (1(s=s_1), ..., 1(s=s_n))^T$$
  3. Then we can see that each element of the parameter vector $w$ indicates the value of each individual state
    $$\hat{v}(s, w) = (1(s=s_1), ..., 1(s=s_n))(w_1, ..., w_n)^T$$
  4. Thus we have $\hat{v}(s_k, w) = w_k$
The fitted value of each state is simply the corresponding weight $w_k$; a tiny check of this reduction is given below.
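The sketch below uses an illustrative 5-state example with arbitrary weights.

```python
import numpy as np

n_states = 5
w = np.array([0.1, 0.4, -0.3, 2.0, 0.7])     # one weight per state

def one_hot(k, n):
    # x_table(s_k) = (1(s_k = s_1), ..., 1(s_k = s_n))^T
    x = np.zeros(n)
    x[k] = 1.0
    return x

for k in range(n_states):
    # v_hat(s_k, w) = x_table(s_k)^T w picks out exactly w_k
    assert one_hot(k, n_states) @ w == w[k]
```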

2.2 When the Oracle $v^{\pi}(s)$ Is Unknown

(1) Model-free Prediction with VFA

The values are computed from samples instead.
In practice, we have no access to an oracle giving the true value $v^{\pi}(s)$ for any state $s$
Recall model-free prediction
  1. The goal is to evaluate $v^{\pi}$ following a fixed policy $\pi$
  2. A lookup table is maintained to store estimates of $v^{\pi}$ or $q^{\pi}$
  3. Estimates are updated after each episode (MC method) or after each step (TD method)
Thus what we can do is include the function approximation step in the loop
Incremental VFA Prediction Algorithms
  1. We assumed that the true value function $v^{\pi}(s)$ is given by a supervisor / oracle
    $$\Delta w = \alpha (v^{\pi}(s_t) - \hat{v}(s_t, w)) \nabla_w \hat{v}(s_t, w)$$
  2. But in RL there is no supervisor, only rewards
  3. In practice, we substitute a target for $v^{\pi}(s_t)$ (a sketch of both updates follows after this list)
    1. For MC, the target is the actual return $G_t$
      $$\Delta w = \alpha (G_t - \hat{v}(s_t, w)) \nabla_w \hat{v}(s_t, w)$$
    2. For TD(0), the target is the TD target $R_{t+1} + \gamma \hat{v}(s_{t+1}, w)$
      $$\Delta w = \alpha (R_{t+1} + \gamma \hat{v}(s_{t+1}, w) - \hat{v}(s_t, w)) \nabla_w \hat{v}(s_t, w)$$
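The sketch below assumes a linear approximator $\hat{v}(s, w) = x(s)^T w$, for which $\nabla_w \hat{v}(s, w) = x(s)$; the function names `mc_update` and `td0_update` and the example numbers are illustrative.

```python
import numpy as np

def mc_update(w, x_s, G_t, alpha):
    # Delta w = alpha * (G_t - v_hat(s_t, w)) * x(s_t)
    return w + alpha * (G_t - x_s @ w) * x_s

def td0_update(w, x_s, r, x_s_next, gamma, alpha, done=False):
    # TD target = R_{t+1} + gamma * v_hat(s_{t+1}, w); a terminal next state has value 0.
    target = r if done else r + gamma * (x_s_next @ w)
    # Semi-gradient step: the target is treated as a constant with respect to w.
    return w + alpha * (target - x_s @ w) * x_s

# Usage with illustrative numbers:
w = np.zeros(3)
w = mc_update(w, x_s=np.array([1.0, 0.5, 0.2]), G_t=1.7, alpha=0.1)
w = td0_update(w, x_s=np.array([1.0, 0.5, 0.2]), r=1.0,
               x_s_next=np.array([1.0, 0.3, 0.1]), gamma=0.99, alpha=0.1)
```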

(2) Monte-Carlo Prediction with VFA

Properties of the MC return $G_t$: it is an unbiased estimate, but very noisy (many samples are needed for a good estimate), i.e., high variance.
The return $G_t$ is an unbiased, but noisy, sample of the true value $v^{\pi}(s_t)$
Why unbiased? Because $E[G_t] = v^{\pi}(s_t)$
So we have training data that can be used for supervised learning in VFA:
$$<S_1, G_1>, <S_2, G_2>, ..., <S_T, G_T>$$
Using linear Monte-Carlo policy evaluation
$$\Delta w = \alpha (G_t - \hat{v}(s_t, w)) \nabla_w \hat{v}(s_t, w) = \alpha (G_t - \hat{v}(s_t, w)) x(s_t)$$
Monte-Carlo prediction converges with both linear and non-linear value function approximation

(3) TD Prediction with VFA

Properties of the TD target: it is a biased estimate (its expectation involves the weights $w$ that are currently being optimized).
The TD target $R_{t+1} + \gamma \hat{v}(s_{t+1}, w)$ is a biased sample of the true value $v^{\pi}(s_t)$
Why biased? It is drawn from our previous estimate, rather than the true value: $E[R_{t+1} + \gamma \hat{v}(s_{t+1}, w)] \neq v^{\pi}(s_t)$
We have the training data used for supervised learning in VFA:
$$<S_1, R_2 + \gamma \hat{v}(s_2, w)>, <S_2, R_3 + \gamma \hat{v}(s_3, w)>, ..., <S_{T-1}, R_T>$$
Using linear TD(0), the stochastic gradient descent update is
$$\Delta w = \alpha (R + \gamma \hat{v}(s', w) - \hat{v}(s, w)) \nabla_w \hat{v}(s, w) = \alpha (R + \gamma \hat{v}(s', w) - \hat{v}(s, w)) x(s)$$
This is also called a semi-gradient, as we ignore the effect of changing the weight vector $w$ on the target
Linear TD(0) converges (close) to the global optimum

3.Control

Generalized policy iteration
Policy evaluation: approximate policy evaluation, $\hat{q}(\cdot, \cdot, w) \approx q^{\pi}$
Policy improvement: $\epsilon$-greedy policy improvement

3.1 Estimating the Q Function (Action-value Function)

(1) The True Q Function Is Known

The Q function can be estimated by minimizing the MSE with gradient descent:
Approximate the action-value function
$$\hat{q}(s, a, w) \approx q^{\pi}(s, a)$$
Minimize the MSE (mean squared error) between the approximate action-value and the true action-value (assuming an oracle)
$$J(w) = E_{\pi}[(q^{\pi}(s, a) - \hat{q}(s, a, w))^2]$$
Use stochastic gradient descent to find a local minimum
$$\Delta w = \alpha (q^{\pi}(s, a) - \hat{q}(s, a, w)) \nabla_w \hat{q}(s, a, w)$$

(2) If the Q Function Is Linear

Represent the state and action using a feature vector
$$x(s, a) = (x_1(s, a), ..., x_n(s, a))^T$$
Represent the action-value function by a linear combination of features
$$\hat{q}(s, a, w) = x(s, a)^T w = \sum_{j=1}^{n} x_j(s, a) w_j$$
Thus the stochastic gradient descent update is
$$\Delta w = \alpha (q^{\pi}(s, a) - \hat{q}(s, a, w)) x(s, a)$$

3.2 VFA Control without an Oracle

If the true (ground-truth) Q function is not available, substitute $G_t$ or the TD target for the oracle, and use it to update the parameters of the approximate Q function.
For MC, the target is the return $G_t$
$$\Delta w = \alpha (G_t - \hat{q}(s_t, a_t, w)) \nabla_w \hat{q}(s_t, a_t, w)$$
For Sarsa, the target is the TD target $R_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, w)$
$$\Delta w = \alpha (R_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, w) - \hat{q}(s_t, a_t, w)) \nabla_w \hat{q}(s_t, a_t, w)$$
For Q-learning, the target is the TD target $R_{t+1} + \gamma \max_a \hat{q}(s_{t+1}, a, w)$
$$\Delta w = \alpha (R_{t+1} + \gamma \max_a \hat{q}(s_{t+1}, a, w) - \hat{q}(s_t, a_t, w)) \nabla_w \hat{q}(s_t, a_t, w)$$
Sarsa for VFA control is sketched below:
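This is a minimal sketch assuming a linear $\hat{q}(s, a, w) = x(s, a)^T w$ and an $\epsilon$-greedy policy; the environment interface (`env.reset()`, `env.step()`) and `feature_fn` are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def epsilon_greedy(w, s, actions, feature_fn, epsilon, rng):
    # With probability epsilon pick a random action, otherwise the greedy one.
    if rng.random() < epsilon:
        return actions[rng.integers(len(actions))]
    q_values = [feature_fn(s, a) @ w for a in actions]
    return actions[int(np.argmax(q_values))]

def sarsa_vfa(env, actions, feature_fn, n_features,
              episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(n_features)                      # q_hat(s, a, w) = x(s, a)^T w
    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(w, s, actions, feature_fn, epsilon, rng)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            x = feature_fn(s, a)
            if done:
                td_target = r                     # no bootstrap from a terminal state
            else:
                a_next = epsilon_greedy(w, s_next, actions, feature_fn, epsilon, rng)
                td_target = r + gamma * (feature_fn(s_next, a_next) @ w)
            # Delta w = alpha * (target - q_hat(s, a, w)) * x(s, a)
            w += alpha * (td_target - x @ w) * x
            if not done:
                s, a = s_next, a_next
    return w
```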

3.3 Convergence Issues of VFA Control

TD can diverge (fail to converge) when used off-policy or with non-linear function approximation:
The gradient of the objective obtained by TD with VFA is not exact (the target itself contains the parameters being optimized)
The update involves both an approximation of the Bellman backup and an approximation of the value function
In the off-policy case, the behavior policy (which collects the data) and the target policy (which is being optimized) are not the same, so the value estimator may fail to converge

3.4 The Deadly Triad in Reinforcement Learning

Sources of uncertainty and instability in reinforcement learning:
  1. Function approximation: approximating V or Q introduces error
  2. Bootstrapping: TD estimates the current value from previous estimates, which introduces noise and can make the network overfit; MC uses the actual return and is unbiased, so it is somewhat better (with enough samples it approaches the true value)
  3. Off-policy training: data is collected with the behavior policy, but the target policy is what gets optimized, so the collected data and the function being optimized do not match

3.5 Convergence of Control Algorithms

(✓) in the convergence table means the algorithm moves around the near-optimal value function

3.6 Batch Optimization

Optimizing one sample per step, as above, is sample-inefficient; batch-based methods optimize over all samples in a batch.

3.7 Least Squares Prediction

Goal: minimize the mean squared error over the pairs in the dataset $D$.
Given the value function approximation $\hat{v}(s, w) \approx v^{\pi}(s)$
The experience $D$ consists of $<state, value>$ pairs (possibly from one episode or from many previous episodes)
$$D = \{<s_1, v_1^{\pi}>, ..., <s_T, v_T^{\pi}>\}$$
Objective: optimize the parameter $w$ that best fits all the experience $D$
Least squares algorithms are used to minimize the sum-squared error between $\hat{v}(s_t, w)$ and the target values $v_t^{\pi}$
$$w^* = \arg\min_w E_D[(v^{\pi} - \hat{v}(s, w))^2] = \arg\min_w \sum_{t=1}^{T} (v_t^{\pi} - \hat{v}(s_t, w))^2$$
If the dataset is too large, repeatedly sample a random subset of the data, compute the gradient on it, and update the parameters, iterating until convergence; a sketch is given below.
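The sketch below assumes a linear $\hat{v}(s, w) = x(s, w)^T w$ and a synthetic experience set (`X`, `v_targets`, `true_w` are illustrative); it shows both the closed-form least-squares solution and the mini-batch alternative for large datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative experience D: row t of X is the feature vector x(s_t),
# and v_targets[t] is the corresponding target value v_t^pi.
T, n_features = 1000, 4
true_w = np.array([1.0, -2.0, 0.5, 3.0])
X = rng.normal(size=(T, n_features))
v_targets = X @ true_w + 0.1 * rng.normal(size=T)

# Closed-form least squares: w* = argmin_w sum_t (v_t^pi - x(s_t)^T w)^2
w_star, *_ = np.linalg.lstsq(X, v_targets, rcond=None)

# If the dataset is too large for a closed-form solve, iterate with mini-batch SGD.
w = np.zeros(n_features)
alpha, batch_size = 0.01, 32
for _ in range(2000):
    idx = rng.integers(0, T, size=batch_size)         # random mini-batch of experience
    X_b, v_b = X[idx], v_targets[idx]
    grad = -2.0 * X_b.T @ (v_b - X_b @ w) / batch_size
    w -= alpha * grad                                 # gradient step on the batch loss

print(w_star, w)  # both should be close to true_w
```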

4.Deep Q Networks

4.1 Deep Reinforcement Learning

Frontier in machine learning and artificial intelligence
Deep neural networks are used to represent
  1. Value function
  2. Policy function (policy gradient methods, to be introduced)
  3. World model
The loss function is optimized by stochastic gradient descent (SGD)
Challenges
  1. Efficiency: too many model parameters to optimize
  2. The deadly triad: the danger of instability and divergence in training
    1. Nonlinear function approximation
    2. Bootstrapping
    3. Off-policy training

4.2 DQN for Playing Atari Games

End-to-end learning of the values $Q(s, a)$ from input pixel frames
The input state $s$ is a stack of raw pixels from the latest 4 frames
The output is $Q(s, a)$ for each of the 18 joystick / button positions
The reward is the change in score for that step
The network architecture and hyperparameters are fixed across all games

4.3 Problems in Training DQN and Their Solutions

  1. Correlation between samples: experience replay
  2. Non-stationary targets: fixed Q targets

4.4 DQNs: Experience Replay

To reduce the correlation among samples, store the transitions $(s_t, a_t, r_t, s_{t+1})$ in a replay memory $D$
To perform experience replay, repeat the following
  1. sample an experience tuple from the dataset: $(s, a, r, s') \sim D$
  2. compute the target value for the sampled tuple: $r + \gamma \max_{a'} \hat{Q}(s', a', w)$
  3. use stochastic gradient descent to update the network weights
    $$\Delta w = \alpha (r + \gamma \max_{a'} \hat{Q}(s', a', w) - \hat{Q}(s, a, w)) \nabla_w \hat{Q}(s, a, w)$$
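A minimal sketch of such a replay memory; the class name, capacity, and the terminal flag are illustrative assumptions rather than anything from the notes.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay memory D storing transitions (s_t, a_t, r_t, s_{t+1}, done)."""

    def __init__(self, capacity=10000):
        self.memory = deque(maxlen=capacity)   # the oldest transitions are dropped first

    def push(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between samples.
        return random.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)
```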

4.5 DQNs: Fixed Targets

To help improve stability, fix the target weights used in the target calculation for multiple updates
Let a separate set of parameters $w^-$ be the weights used in the target, and $w$ be the weights that are being updated
To perform experience replay with fixed targets, repeat the following
  1. sample an experience tuple from the dataset: $(s, a, r, s') \sim D$
  2. compute the target value for the sampled tuple: $r + \gamma \max_{a'} \hat{Q}(s', a', w^-)$
  3. use stochastic gradient descent to update the network weights
    $$\Delta w = \alpha (r + \gamma \max_{a'} \hat{Q}(s', a', w^-) - \hat{Q}(s, a, w)) \nabla_w \hat{Q}(s, a, w)$$
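A minimal sketch of this fixed-target update, written for a linear $\hat{Q}(s, a, w) = x(s, a)^T w$ to keep it short; `feature_fn`, `actions`, and the batch format are illustrative assumptions, not the original DQN implementation.

```python
import numpy as np

def fixed_target_update(w, w_minus, batch, feature_fn, actions, alpha, gamma):
    """One sweep over a sampled mini-batch using frozen target weights w_minus."""
    for (s, a, r, s_next, done) in batch:
        x = feature_fn(s, a)
        if done:
            target = r
        else:
            # The target is computed with w_minus, not with the weights being updated.
            target = r + gamma * max(feature_fn(s_next, a_next) @ w_minus
                                     for a_next in actions)
        w = w + alpha * (target - x @ w) * x
    return w

# Every C gradient steps, synchronize the frozen weights: w_minus = w.copy()
```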

4.6 Loss

The loss is computed as the difference between the TD target (Q_target) and the current Q value (our estimate of Q).
However, we do not actually know the true TD target; it also has to be predicted. Using the Bellman equation, the TD target is simply the reward of the current action plus the discounted maximum Q value of the next state.
The problem is that we use the same parameters (network weights) to estimate both the target and the Q value, which creates a strong correlation between the network weights and the TD target. This means that at every training step the Q value shifts, and the Q target shifts along with it.
Therefore we use a separate network with fixed parameters ($w^-$) to predict the TD target.

4.7 Results and Demo


