3. Model-free Value Function Estimation and Control - wolai notes

1.Model-free RL

Both policy iteration and value iteration assume direct access to the dynamics and rewards of the environment. But in many real-world problems, the MDP model is either unknown, or known but too big or too complex to use.
Model-free RL solves these problems through interaction with the environment
No more direct access to known transition dynamics and reward function
Trajectories / episodes are collected by the agent's interaction with the environment
Each trajectory / episode contains $\{S_1, A_1, R_1, S_2, A_2, R_2, \ldots, S_T, A_T, R_T\}$

2.Model-free prediction

Previously, policy evaluation assumed access to the model. Now we estimate the expected return of a particular policy when we do not have access to the MDP model:
  1. Monte Carlo policy evaluation
  2. Temporal Difference (TD) learning

3.Monte Carlo Policy Evaluation

3.1 Overall

The Monte Carlo method is sampling-based: the agent interacts with the environment and collects many trajectories. Each trajectory yields an actual return, and the value of each state is estimated directly from these actual returns.
Return: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots$ under policy $\pi$
$V_{\pi}(s) = \mathbb{E}_{\tau \sim \pi}[G_t \mid s_t = s]$, i.e. the expectation over trajectories $\tau$ generated by following $\pi$
MC simulation : we can simply sample a lot of trajectories, compute the actual returns for all the trajectories, then average them
MC policy evaluation uses empirical mean return instead of expected return
MC does not require MDP dynamics/rewards, does not bootstrap, and does not assume the state is Markov
It only applies to episodic MDPs (each episode terminates)

3.2 Alg

To estimate the value of a state, start from that state and count how many times it is visited and how much total return is accumulated from it; taking the empirical mean then gives the value of each state.
To evaluate the state value $v(s)$:
  1. Every time-step $t$ that state $s$ is visited in an episode
  2. Increment counter $N(s) \leftarrow N(s) + 1$
  3. Increment total return $S(s) \leftarrow S(s) + G_t$
  4. Value is estimated by the mean return $v(s) = S(s)/N(s)$
By the law of large numbers, $v(s) \rightarrow v^{\pi}(s)$ as $N(s) \rightarrow \infty$
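
A minimal Python sketch of this counting procedure (first-visit MC policy evaluation). The gym-style `env.reset()` / `env.step()` interface and the `policy` function are assumptions for illustration, not part of the notes:

```python
from collections import defaultdict

def mc_policy_evaluation(env, policy, num_episodes=1000, gamma=0.99):
    """First-visit Monte Carlo policy evaluation (assumes an old-gym-style env API)."""
    N = defaultdict(int)      # visit counts N(s)
    S = defaultdict(float)    # accumulated returns S(s)
    for _ in range(num_episodes):
        # Roll out one full episode under the given policy
        episode, state, done = [], env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, reward))
            state = next_state
        # Walk the episode backwards to compute returns G_t
        G, first_return = 0.0, {}
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_return[state] = G   # overwritten until only the earliest (first) visit remains
        for state, G in first_return.items():
            N[state] += 1
            S[state] += G
    return {s: S[s] / N[s] for s in N}  # v(s) = S(s) / N(s)
```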

3.3 Incremental Mean

Rewrite the mean as an incremental update.
The mean $u_t$ of samples $x_1, x_2, \ldots$ can be computed incrementally:
$$
\begin{aligned}
u_t &= \frac{1}{t}\sum_{j=1}^{t} x_j \\
&= \frac{1}{t}\Big(x_t + \sum_{j=1}^{t-1} x_j\Big) \\
&= \frac{1}{t}\big(x_t + (t-1)u_{t-1}\big) \\
&= u_{t-1} + \frac{1}{t}\big(x_t - u_{t-1}\big)
\end{aligned}
$$
By sampling many times we can take the average; the sum is then split into the latest sample $x_t$ plus the sum of samples up to $t-1$, and substituting the previous mean turns that remaining sum into $(t-1)$ times $u_{t-1}$. This links the mean at the previous step to the mean at the current step.
In this form, when a new sample $x_t$ arrives, we subtract the previous mean from it, treat the difference as a residual, multiply it by a learning rate ($1/t$ here), and add it to the previous mean to obtain the updated mean. This connects the previous estimate with the updated one.
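
A tiny numeric check of the incremental mean, with nothing assumed beyond the formula itself:

```python
import numpy as np

samples = [4.0, 7.0, 1.0, 8.0]       # arbitrary made-up data
mu = 0.0
for t, x in enumerate(samples, start=1):
    mu = mu + (1.0 / t) * (x - mu)   # u_t = u_{t-1} + (1/t)(x_t - u_{t-1})
print(mu, np.mean(samples))          # both print 5.0
```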

3.4 Incremental MC Updates

The Monte Carlo method can also be written as an incremental MC update.
Collect one episode $(S_1, A_1, R_1, \ldots, S_T)$
For each state $S_t$ with computed return $G_t$:
$$
N(S_t) \leftarrow N(S_t) + 1 \\
v(S_t) \leftarrow v(S_t) + \frac{1}{N(S_t)}\big(G_t - v(S_t)\big)
$$
Or use a running mean (old episodes are forgotten). Good for non-stationary problems:
$$v(S_t) \leftarrow v(S_t) + \alpha\big(G_t - v(S_t)\big)$$
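
A short sketch contrasting the two step sizes for a single fixed state; `returns` stands in for a stream of sampled returns $G_t$ and is made up for illustration:

```python
returns = [10.0, 12.0, 8.0, 11.0]        # hypothetical sampled returns for one state

v_mean, n = 0.0, 0                       # exact empirical mean via the 1/N(s) step size
v_run, alpha = 0.0, 0.1                  # running mean with a constant learning rate
for G in returns:
    n += 1
    v_mean += (1.0 / n) * (G - v_mean)   # converges to the empirical mean
    v_run  += alpha * (G - v_run)        # weights recent episodes more heavily
print(v_mean, v_run)
```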

3.5 Difference between DP and MC for policy evaluation

Dynamic Programming (DP) computes $v_i$ by bootstrapping the rest of the expected return with the value estimate $v_{i-1}$
Iteration on Bellman expectation backup:
$$v_i(s) \leftarrow \sum_{a \in A} \pi(a \mid s)\Big(R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, v_{i-1}(s')\Big)$$
MC updates the empirical mean return with one sampled episode
$$v(S_t) \leftarrow v(S_t) + \alpha\big(G_{i,t} - v(S_t)\big)$$

3.6 Advantages of MC over DP

  1. MC works when the environment (transition function) is unknown.
  2. When the environment is complex, the transition probabilities are hard to compute; MC is based on sampling and does not need to compute them.
  3. Computational advantage: DP must consider all possible state transitions to perform one update, whereas the cost of an MC estimate for a single state is independent of the total number of states, so MC can be more efficient.

4.Temporal-Difference (TD) Learning

TD is a compromise between DP and MC: through bootstrapping it can learn from incomplete episodes. It is also based on sampling, so it can be used in environments with an unknown MDP (transitions/rewards).

4.1 Alg : TD(0)

TD(0) means looking one step ahead.
Objective: learn vπv_{\pi} online from experience under policy π\pi
Simplest TD algorithm : TD(0)
Update $v(S_t)$ toward the estimated return $R_{t+1} + \gamma v(S_{t+1})$:
$$v(S_t) \leftarrow v(S_t) + \alpha\big(R_{t+1} + \gamma v(S_{t+1}) - v(S_t)\big)$$
$R_{t+1} + \gamma v(S_{t+1})$ is called the TD target
$\delta_t = R_{t+1} + \gamma v(S_{t+1}) - v(S_t)$ is called the TD error
Comparison : Incremental Monte-Carlo
Update $v(S_t)$ toward the actual return $G_t$ given an episode $i$:
$$v(S_t) \leftarrow v(S_t) + \alpha\big(G_{i,t} - v(S_t)\big)$$
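
A minimal sketch of tabular TD(0) prediction implementing the update above; the gym-style `env` and `policy` are the same assumed interfaces as in the MC sketch:

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0): move v(S_t) toward R_{t+1} + gamma * v(S_{t+1}) after every step."""
    v = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            td_target = reward + gamma * v[next_state] * (not done)  # no bootstrap at terminal
            v[state] += alpha * (td_target - v[state])               # alpha * TD error
            state = next_state
    return v
```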

4.2 Advantages of TD over MC

  1. TD can update after every time step, while MC must wait until an episode ends before it can update.
  2. TD can learn from incomplete sequences; MC requires complete sequences.
  3. TD can be used in continuing environments with no terminal state; MC cannot.
  4. TD exploits the Markov property (the future is independent of past states given the current state); MC does not — it samples and uses whole trajectories, which makes MC more effective in non-Markov environments.

4.3 n-step TD

n-step TD methods generalize both one-step TD and MC
We can shift from one to the other smoothly as needed to meet the demands of a particular task.

4.4 n-step TD prediction

Consider the following n-step returns for $n = 1, 2, \ldots, \infty$:
$$
\begin{aligned}
n=1 \ (\text{TD}) &\quad G_t^{(1)} = R_{t+1} + \gamma v(S_{t+1}) \\
n=2 &\quad G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 v(S_{t+2}) \\
&\quad \vdots \\
n=\infty \ (\text{MC}) &\quad G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-t-1} R_T
\end{aligned}
$$
Thus the n-step return is defined as
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n v(S_{t+n})$$
n-step TD update:
$$v(S_t) \leftarrow v(S_t) + \alpha\big(G_t^{(n)} - v(S_t)\big)$$
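
A sketch of computing $G_t^{(n)}$ for one time step; `states` and `rewards` are assumed to come from one recorded episode (with `rewards[k]` holding $R_{k+1}$) and `v` is the current value table:

```python
def n_step_return(rewards, states, v, t, n, gamma):
    """G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n v(S_{t+n})."""
    T = len(rewards)
    G, discount = 0.0, 1.0
    for k in range(t, min(t + n, T)):
        G += discount * rewards[k]
        discount *= gamma
    if t + n < T:                        # bootstrap only if S_{t+n} is not terminal
        G += discount * v[states[t + n]]
    return G

# n-step TD update for time step t (alpha is the learning rate):
# v[states[t]] += alpha * (n_step_return(rewards, states, v, t, n, gamma) - v[states[t]])
```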

4.5 Bootstrapping and Sampling for DP, MC, and TD

Bootstrapping: update involves an estimate
  1. MC does not bootstrap
  2. DP bootstraps
  3. TD bootstraps
Sampling: update samples an expectation
  1. MC samples
  2. DP does not sample
  3. TD samples
(Figures: Dynamic Programming backup, Monte-Carlo backup, Temporal-Difference backup, and the unified view of reinforcement learning)

5.Model-free Control for MDP

Model-free control: optimize the value function of an unknown MDP
Generalized Policy Iteration (GPI) with MC and TD

5.1 Policy Iteration

  1. Evaluate the policy $\pi$ (computing $v$ given the current $\pi$)
  2. Improve the policy by acting greedily with respect to $V^{\pi}$
Compute the state-action value of a policy $\pi$:
$$q^{\pi_i}(s, a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi_i}(s')$$
But when $R$ and $P$ are unknown, how can we obtain $R(s,a)$ and $P(s' \mid s,a)$ to make decisions?

5.2 Generalized Policy Iteration with MC

Use the Monte Carlo (MC) method in place of DP to estimate the V/Q function.
Use the $\epsilon$-greedy method in policy improvement to obtain $\pi'$.

(1)Policy Evaluation

The original (model-based) policy evaluation:
$$q^{\pi_i}(s, a) = R(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi_i}(s')$$
Now, MC-based policy evaluation estimates $Q(s,a)$ directly as the empirical mean of the returns from sampled episodes.

(2)Policy Improvement

Consider the trade-off between exploration and exploitation (we will talk about this in a later lecture).
$\epsilon$-Greedy Exploration: ensuring continual exploration
  1. All actions are tried with non-zero probability
  2. With probability $1 - \epsilon$ choose the greedy action
  3. With probability $\epsilon$ choose an action at random
$$
\pi(a \mid s) =
\begin{cases}
\epsilon/|A| + 1 - \epsilon, & \text{if } a = a^* = \arg\max_{a \in A} Q(s,a) \\
\epsilon/|A|, & \text{otherwise}
\end{cases}
$$
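
A sketch of this $\epsilon$-greedy rule, assuming a tabular `Q` indexed by `(state, action)` and `num_actions` discrete actions:

```python
import numpy as np

def epsilon_greedy(Q, state, num_actions, epsilon=0.1):
    """Greedy action w.p. 1 - eps + eps/|A|; every other action w.p. eps/|A|."""
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)              # explore: uniform over all actions
    q_values = [Q[(state, a)] for a in range(num_actions)]
    return int(np.argmax(q_values))                        # exploit: greedy action
```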

5.3 Generalized Policy Iteration with TD

(1)MC vs TD

Advantages of TD over MC: lower variance, online, and able to handle incomplete sequences.
TD can replace DP/MC for estimating the V function, again with $\epsilon$-greedy policy improvement to obtain $\pi'$. But how do we obtain $Q(s,a)$ with TD?

(2)On-Policy vs Off-Policy

On-policy: during learning we only use the policy currently being learned, $\pi$; the same policy $\pi$ is used both to collect trajectories and to be optimized.
Off-policy: keep two policies, a target policy $\pi$ to be optimized and a behavior policy $\mu$ for exploration. The behavior policy $\mu$ can be more aggressive; data is collected with $\mu$ and used to train $\pi$.
Off-Policy benefits:
  1. Learn about optimal policy while following exploratory policy
  2. Learn from observing humans or other agents
  3. Re-use experience generated from old policies $\pi_1, \pi_2, \ldots, \pi_{t-1}$

(3)On-Policy TD control:Sarsa

(The name comes from looping over the sequence S, A, R, S, A.)
An episode consists of an alternating sequence of states and state-action pairs:
Take one step with the $\epsilon$-greedy policy, then bootstrap the action-value function:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big]$$
The update is done after every transition from a nonterminal state $S_t$
The TD target is $R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$
Sarsa algorithm
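
A minimal sketch of the Sarsa loop under the same assumed gym-style interface, reusing the hypothetical `epsilon_greedy` helper from above:

```python
from collections import defaultdict

def sarsa(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control: the eps-greedy policy both selects actions and is improved."""
    Q = defaultdict(float)                      # Q[(state, action)]
    nA = env.action_space.n
    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = epsilon_greedy(Q, state, nA, epsilon)
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(Q, next_state, nA, epsilon)
            td_target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```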

(4)n-step Sarsa

Consider the following n-step returns for $n = 1, 2, \ldots, \infty$:
$$
\begin{aligned}
n=1 \ (\text{Sarsa}) &\quad q_t^{(1)} = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) \\
n=2 &\quad q_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2}, A_{t+2}) \\
&\quad \vdots \\
n=\infty \ (\text{MC}) &\quad q_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-t-1} R_T
\end{aligned}
$$
Thus the n-step return is defined as
$$q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$$
n-step Sarsa update:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big(q_t^{(n)} - Q(S_t, A_t)\big)$$

(5)Off-Policy TD control:Q-learning

We allow both behavior and target policies to improve
The target policy $\pi$ is greedy on $Q(s, a)$:
$$\pi(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a')$$
The behavior policy $\mu$ could be totally random, but we let it improve by following $\epsilon$-greedy on $Q(s, a)$.
Thus Q-learning target:
$$
\begin{aligned}
R_{t+1} + \gamma Q(S_{t+1}, A')
&= R_{t+1} + \gamma Q\big(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a')\big) \\
&= R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')
\end{aligned}
$$
Thus the Q-Learning update
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\big]$$

Q-learning algorithm
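
A minimal sketch of tabular Q-learning under the same assumptions; the behavior action comes from $\epsilon$-greedy, while the bootstrap target takes the max over actions:

```python
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy TD control: act eps-greedily, bootstrap with the greedy (max) action."""
    Q = defaultdict(float)
    nA = env.action_space.n
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, nA, epsilon)       # behavior policy
            next_state, reward, done, _ = env.step(action)
            best_next = max(Q[(next_state, a)] for a in range(nA))
            td_target = reward + gamma * best_next * (not done)  # target policy is greedy
            Q[(state, action)] += alpha * (td_target - Q[(state, action)])
            state = next_state
    return Q
```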

(6)Differences between Sarsa and Q-learning

1. Sarsa uses only one policy $\pi$, while Q-learning uses two policies, $\pi$ and $\mu$

Sarsa : On-Policy TD control
  1. Choose action $A_t$ from $S_t$ using a policy derived from $Q$ ($\epsilon$-greedy)
  2. Take action $A_t$, observe $R_{t+1}$ and $S_{t+1}$
  3. Choose action $A_{t+1}$ from $S_{t+1}$ using the policy derived from $Q$ ($\epsilon$-greedy)
  4. $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\big]$
Q-Learning: Off-Policy TD control
  1. Choose action $A_t$ from $S_t$ using a policy derived from $Q$ ($\epsilon$-greedy)
  2. Take action $A_t$, observe $R_{t+1}$ and $S_{t+1}$
  3. Then "imagine" $A_{t+1}$ as $\arg\max_{a'} Q(S_{t+1}, a')$ in the update target
  4. $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t)\big]$

2. The backup diagrams differ

Backup diagram for Sarsa and Q-learning
In Sarsa, $A$ and $A'$ are sampled from the same policy, so it is on-policy
In Q-learning, $A$ and $A'$ come from different policies, with $A$ being more exploratory and $A'$ determined directly by the max operator

3. Sarsa is more conservative, while Q-learning is more aggressive and risk-seeking

5.4 DP vs TD


