2. Markov Decision Processes

1. Markov

1.1 Markov Chain

A Markov decision process (MDP) is a basic framework of reinforcement learning. In an MDP the environment is fully observable; partially observable problems can also be converted into MDP problems.

(1)Markov Property

A state-transition sequence satisfies the Markov property (the defining property of a Markov chain) if the next state depends only on the current state and is independent of all earlier states.
History of states: $h_t = \{s_1, s_2, s_3, \ldots, s_t\}$
If the Markov property holds, the next state depends only on the current state:
$$p(s_{t+1} \mid s_t) = p(s_{t+1} \mid h_t)$$
$$p(s_{t+1} \mid s_t, a_t) = p(s_{t+1} \mid h_t, a_t)$$
The transition from $s_t$ to $s_{t+1}$ conditioned on the whole history equals the transition conditioned on the current state alone: future state transitions are independent of the past given the present.

(2)Markov Chain

The transition probabilities between states can be described by a state-transition matrix.
$$p(s_{t+1} = s' \mid s_t = s)$$
denotes the probability that the agent, in state $s_t = s$, moves to the next state $s'$.
$$P = \begin{bmatrix} p(s_1|s_1) & p(s_2|s_1) & \cdots & p(s_N|s_1) \\ p(s_1|s_2) & p(s_2|s_2) & \cdots & p(s_N|s_2) \\ \vdots & \vdots & & \vdots \\ p(s_1|s_N) & p(s_2|s_N) & \cdots & p(s_N|s_N) \end{bmatrix}$$
Example:
Starting from $s_3$, many trajectories can be generated, for example:
  • $s_3, s_4, s_5, s_6, s_6$
  • $s_3, s_2, s_3, s_2, s_1$
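As a concrete illustration, the sketch below samples trajectories from a small Markov chain; the 7-state transition matrix is made up for this example and is not the one from the original figure.

```python
import numpy as np

# Hypothetical 7-state Markov chain (s1..s7); each row must sum to 1.
P = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.0, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.0, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.4, 0.0, 0.6, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.4, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])

def sample_trajectory(P, start, length, rng):
    """Roll out one trajectory by repeatedly sampling the next state."""
    states = [start]
    for _ in range(length - 1):
        states.append(rng.choice(len(P), p=P[states[-1]]))
    return states

rng = np.random.default_rng(0)
for _ in range(3):
    traj = sample_trajectory(P, start=2, length=5, rng=rng)  # start from s3 (index 2)
    print([f"s{s + 1}" for s in traj])
```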

1.2 Markov Reward Process (MRP)

Markov Reward Process = Markov chain + reward function $R$

(1)Markov Reward Process (MRP)

  1. $S$ is a (finite) set of states ($s \in S$)
  2. $P$ is the dynamics/transition model that specifies $P(s_{t+1} = s' \mid s_t = s)$
  3. $R$ is a reward function, $R(s_t = s) = \mathbb{E}[r_t \mid s_t = s]$
  4. Discount factor $\gamma \in [0, 1]$

(2) Basic concepts

Horizon

The maximum number of steps in each episode (i.e., the maximum length of a trajectory).

Return

The discounted sum of the rewards obtained from step $t$ onward:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T$$
We prefer immediate rewards; future rewards are discounted by $\gamma$.
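A tiny worked example of the return: with a hypothetical reward sequence and $\gamma = 0.5$, $G_t$ can be computed either directly from the definition or by discounting from the back.

```python
# Hypothetical reward sequence R_{t+1}, R_{t+2}, ... and discount factor.
rewards = [1.0, 0.0, 2.0, 4.0]
gamma = 0.5

# Direct definition: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...
g_direct = sum(gamma**k * r for k, r in enumerate(rewards))

# Equivalent backward recursion: G_t = R_{t+1} + γ G_{t+1}
g_backward = 0.0
for r in reversed(rewards):
    g_backward = r + gamma * g_backward

print(g_direct, g_backward)  # both give 1 + 0 + 0.25*2 + 0.125*4 = 2.0
```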

Value Function for MRP

The value function gives the expected value of the future rewards:
$$V_t(s) = \mathbb{E}[G_t \mid s_t = s] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T \mid s_t = s]$$
It measures how much value can be obtained starting from this state, i.e., the present value of the rewards that may be collected in the future.

(3) Why do we need a discount factor?

  • Avoids infinite returns when the Markov process contains cycles
  • Expresses uncertainty about the future; we prefer rewards sooner rather than later
  • Makes rewards obtained quickly worth more than rewards obtained later
  • $\gamma = 0$: only the immediate reward matters; $\gamma = 1$: future rewards carry the same weight as the immediate reward

(4) Computing the value function

It is computed with the Bellman equation:
$$V(s) = R(s) + \gamma \sum_{s' \in S} P(s'|s) V(s')$$
where:
  • $R(s)$: the immediate reward
  • $\gamma \sum_{s' \in S} P(s'|s) V(s')$: the discounted value of future rewards
  • $P(s'|s)$: the probability of transitioning from the current state to a future state

(5)Bellman equation

The Bellman equation describes the recursive (transition) relation between state values.
It can be written in matrix form:
$$\begin{bmatrix} V(s_1) \\ V(s_2) \\ \vdots \\ V(s_N) \end{bmatrix} = \begin{bmatrix} R(s_1) \\ R(s_2) \\ \vdots \\ R(s_N) \end{bmatrix} + \gamma \begin{bmatrix} P(s_1|s_1) & P(s_2|s_1) & \cdots & P(s_N|s_1) \\ P(s_1|s_2) & P(s_2|s_2) & \cdots & P(s_N|s_2) \\ \vdots & \vdots & & \vdots \\ P(s_1|s_N) & P(s_2|s_N) & \cdots & P(s_N|s_N) \end{bmatrix} \begin{bmatrix} V(s_1) \\ V(s_2) \\ \vdots \\ V(s_N) \end{bmatrix}$$
$$V = R + \gamma P V$$
Matrix inversion gives the value function directly, $V = (I - \gamma P)^{-1} R$, but the cost of the inversion is too high for large state spaces.
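A minimal numpy sketch of the closed-form solution $V = (I - \gamma P)^{-1} R$ for a small MRP; the transition matrix and rewards below are illustrative, not taken from the note.

```python
import numpy as np

# Hypothetical 3-state MRP.
P = np.array([[0.6, 0.4, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.3, 0.7]])
R = np.array([1.0, 0.0, -1.0])
gamma = 0.9

# Solve (I - γP) V = R instead of forming the inverse explicitly.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(V)
```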

(6) Iterative methods for computing the value of an MRP

  • Dynamic programming
  • Monte-Carlo evaluation
  • Temporal-Difference learning (a combination of dynamic programming and Monte-Carlo)

Monte-Carlo Method

Accumulate the returns of many sampled trajectories; once enough trajectories have been collected, divide the accumulated return $G$ by the number of trajectories to obtain the value.
This is essentially a sampling method: given an MRP, start from some state and let a small boat drift with the current, producing a trajectory (episode); this yields a return, which is accumulated. After accumulating enough episodes, divide by the number of trajectories to obtain the value.
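A minimal Monte-Carlo sketch that estimates an MRP state value by averaging sampled returns; the chain, rewards, episode length, and episode count are illustrative assumptions.

```python
import numpy as np

def mc_evaluate(P, R, gamma, start, n_episodes=5000, horizon=50, seed=0):
    """Estimate V(start) by averaging the returns of sampled episodes."""
    rng = np.random.default_rng(seed)
    total_return = 0.0
    for _ in range(n_episodes):
        s, g, discount = start, 0.0, 1.0
        for _ in range(horizon):
            g += discount * R[s]            # collect the reward of the current state
            discount *= gamma
            s = rng.choice(len(P), p=P[s])  # drift to the next state
        total_return += g
    return total_return / n_episodes        # average return = value estimate

P = np.array([[0.6, 0.4, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.3, 0.7]])
R = np.array([1.0, 0.0, -1.0])
print(mc_evaluate(P, R, gamma=0.9, start=0))
```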

Dynamic Programming

Keep iterating the Bellman equation; when the updated values differ little from the previous ones (i.e., they have converged), stop and output the latest $V'(s)$ as the value.
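A minimal sketch of this dynamic-programming iteration, $V(s) \leftarrow R(s) + \gamma \sum_{s'} P(s'|s) V(s')$, stopping once the update changes little; the example MRP and tolerance are illustrative.

```python
import numpy as np

def dp_evaluate(P, R, gamma, tol=1e-8):
    """Iterate the Bellman equation until the value vector stops changing."""
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V                # vectorized Bellman backup
        if np.max(np.abs(V_new - V)) < tol:      # converged
            return V_new
        V = V_new

P = np.array([[0.6, 0.4, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.3, 0.7]])
R = np.array([1.0, 0.0, -1.0])
print(dp_evaluate(P, R, gamma=0.9))  # should match the matrix-inversion result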

Temporal-Difference
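Temporal-difference learning updates the value estimate after every transition, using the sampled reward plus the current estimate of the next state's value (bootstrapping). A minimal TD(0) sketch; the step size $\alpha$, step count, and example MRP are illustrative assumptions.

```python
import numpy as np

def td0_evaluate(P, R, gamma, alpha=0.05, n_steps=200_000, seed=0):
    """TD(0): after each transition, move V(s) toward the bootstrapped target."""
    rng = np.random.default_rng(seed)
    V = np.zeros(len(R))
    s = 0
    for _ in range(n_steps):
        s_next = rng.choice(len(P), p=P[s])
        target = R[s] + gamma * V[s_next]   # sampled reward + bootstrapped estimate
        V[s] += alpha * (target - V[s])     # TD(0) update
        s = s_next
    return V

P = np.array([[0.6, 0.4, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.3, 0.7]])
R = np.array([1.0, 0.0, -1.0])
print(td0_evaluate(P, R, gamma=0.9))
```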

1.3 Markov Decision Process (MDP)

Markov Decision Process
A Markov decision process (MDP) is a Markov reward process (MRP) extended with a decision component (actions).

(1)Markov Decision Process

  1. $S$ is a finite set of states
  2. $A$ is a finite set of actions
  3. $P^a$ is the dynamics/transition model for each action: $P(s_{t+1} = s' \mid s_t = s, a_t = a)$
    Taking a particular action changes where we end up: the future state depends not only on the current state but also on the action taken there, which shapes the future value.
  4. $R$ is a reward function, $R(s_t = s, a_t = a) = \mathbb{E}[r_t \mid s_t = s, a_t = a]$
  5. Discount factor $\gamma \in [0, 1]$
An MDP is represented by the tuple $(S, A, P, R, \gamma)$.

(2)Policy in MDP

  1. A policy specifies what action to take in each state.
  2. Given a state, it specifies a distribution over actions:
    $$\pi(a|s) = P(a_t = a \mid s_t = s)$$
  3. Policies are stationary (time-independent):
    $A_t \sim \pi(a|s)$ for any $t > 0$
The policy tells us which action to take in state $s_t$. If the policy is known, i.e., $\pi(a|s)$ is known, then the probability of each action being taken in each state is known.

(3)MDP → MRP

When an MDP and a policy $\pi$ are both known, the MDP can be converted into an MRP:
  1. Given an MDP $(S, A, P, R, \gamma)$ and a policy $\pi$,
  2. the state sequence $S_1, S_2, \ldots$ is a Markov process $(S, P^{\pi})$,
  3. and the state and reward sequence $S_1, R_2, S_2, R_3, \ldots$ is a Markov reward process $(S, P^{\pi}, R^{\pi}, \gamma)$, where:
$$P^{\pi}(s'|s) = \sum_{a \in A} \pi(a|s) P(s'|s, a), \qquad R^{\pi}(s) = \sum_{a \in A} \pi(a|s) R(s, a)$$
With the policy known, i.e., $\pi(a|s)$ given, the MDP problem reduces to an MRP problem: where the MRP required $P(s'|s)$, we now need $P(s'|s, a)$, the probability of transitioning to the next state when taking a particular action in the current state.
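A minimal sketch of this conversion, assuming the MDP dynamics are stored as an array `P[a, s, s']`, rewards as `R[s, a]`, and the policy as `pi[s, a]`; the shapes and values are illustrative.

```python
import numpy as np

def mdp_to_mrp(P, R, pi):
    """Marginalize out the actions under policy pi.

    P:  shape (A, S, S), P[a, s, s'] = P(s' | s, a)
    R:  shape (S, A),    R[s, a]
    pi: shape (S, A),    pi[s, a] = π(a | s)
    """
    # P^π(s'|s) = Σ_a π(a|s) P(s'|s, a)
    P_pi = np.einsum('sa,ast->st', pi, P)
    # R^π(s) = Σ_a π(a|s) R(s, a)
    R_pi = np.sum(pi * R, axis=1)
    return P_pi, R_pi

# Tiny 2-state, 2-action example with a uniform random policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # action 0
              [[0.5, 0.5], [0.6, 0.4]]])  # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
P_pi, R_pi = mdp_to_mrp(P, R, pi)
print(P_pi, R_pi)
```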

(4)MP/MRPMDP的直观区别

An MDP adds a layer of actions, and with it some extra uncertainty; the action is chosen by the agent.
In an MDP, the agent first decides on an action and reaches the black (action) node; from the black node, the environment determines the probability of reaching each successor state. The transition from the current state to the future state therefore contains an extra layer of decision making.

(5) Value functions of an MDP

The state-value function $V^{\pi}(s)$ of an MDP is the expected return starting from state $s$ and following policy $\pi$:
$$V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid s_t = s]$$
The action-value function $q^{\pi}(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:
$$q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid s_t = s, A_t = a]$$
The relation between $V^{\pi}(s)$ and $q^{\pi}(s, a)$ is:
$$V^{\pi}(s) = \sum_{a \in A} \pi(a|s)\, q^{\pi}(s, a)$$

(6) Bellman expectation equation

The Bellman expectation equation describes how the value of the current state relates to the values of future states.
The state-value function can be decomposed into the immediate reward plus the discounted value of the successor state:
$$V^{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s]$$
The action-value function can be decomposed similarly:
$$q^{\pi}(s, a) = \mathbb{E}_{\pi}[R_{t+1} + \gamma q^{\pi}(s_{t+1}, A_{t+1}) \mid s_t = s, A_t = a]$$
We know the relation between the state value and the Q-function:
$$V^{\pi}(s) = \sum_{a \in A} \pi(a|s)\, q^{\pi}(s, a)$$
$$q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^{\pi}(s')$$
Substituting each equation into the other gives:
the relation between the current state value and future state values,
$$V^{\pi}(s) = \sum_{a \in A} \pi(a|s) \Big( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^{\pi}(s') \Big)$$
and the relation between the current Q-function and future Q-functions,
$$q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \sum_{a' \in A} \pi(a'|s')\, q^{\pi}(s', a')$$
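A small numerical check of these relations: it evaluates $V^{\pi}$ exactly through the induced MRP, computes $q^{\pi}$ from $V^{\pi}$, and verifies that $V^{\pi}(s) = \sum_a \pi(a|s)\, q^{\pi}(s, a)$; the tiny MDP and policy are illustrative.

```python
import numpy as np

# Illustrative 2-state, 2-action MDP and a uniform random policy.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[a, s, s']
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],                 # R[s, a]
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)
gamma = 0.9

# Evaluate V^π exactly via the induced MRP: V = (I - γ P^π)^{-1} R^π.
P_pi = np.einsum('sa,ast->st', pi, P)
R_pi = np.sum(pi * R, axis=1)
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# q^π(s,a) = R(s,a) + γ Σ_{s'} P(s'|s,a) V^π(s')
q = R + gamma * np.einsum('ast,t->sa', P, V)

# Consistency check: V^π(s) = Σ_a π(a|s) q^π(s,a)
print(np.allclose(V, np.sum(pi * q, axis=1)))  # True
```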

(7) Backup diagrams

There are two layers of summation: the inner sum backs up the values from the leaf nodes to their parent nodes, and the outer sum backs up to the root node of the diagram (the same holds for both equations).

Backup for $V^{\pi}$

For any state, its value is linearly related to the values of the future states.
$$V^{\pi}(s) = \sum_{a \in A} \pi(a|s) \Big( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^{\pi}(s') \Big)$$
There are two layers of summation here: the first sum backs up to the black (action) nodes, and the second goes one level further up to the root node, i.e., the current state.
The backup diagram defines the relation between a future state and the state that precedes it.

Backup for $q^{\pi}$

The root is a q-function node (a black node); the q-functions at the next time step are the leaf nodes. One sum backs the q-values up to the white state nodes $s'$, and a second sum backs them up to the current q-function.
This relates the future q-function to the current q-function.
$$q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \sum_{a' \in A} \pi(a'|s')\, q^{\pi}(s', a')$$

2. Policy evaluation in MDP

Evaluate the value of each state under a given policy $\pi$, i.e., compute $V^{\pi}(s)$.
This is also called (value) prediction.
Given the MDP and the policy $\pi$, computing the value function tells us how much reward the policy will obtain. It is called prediction because it predicts how much value the current policy will yield in the future.
Input: MDP $\langle S, A, P, R, \gamma \rangle$ and policy $\pi$, or MRP $\langle S, P^{\pi}, R^{\pi}, \gamma \rangle$. Output: value function $V^{\pi}$.

3. Control in MDP

3.1 MDP decision problems

(1)Prediction

Evaluate a given policy.
Input: MDP $\langle S, A, P, R, \gamma \rangle$ and policy $\pi$, or MRP $\langle S, P^{\pi}, R^{\pi}, \gamma \rangle$
Output: value function $V^{\pi}$

(2)Control

Search for the optimal policy.
Input: MDP $\langle S, A, P, R, \gamma \rangle$
Output: optimal value function $V^{*}$ and optimal policy $\pi^{*}$

(3)Dynamic Programming

Prediction and Control in MDP can be solved by dynamic programming.
First find the optimal solutions of the subproblems, then combine them step by step to obtain the optimal solution of the original problem. (In an MDP, the Bellman equation provides the recursive decomposition, and the value function stores and reuses the sub-solutions.)

3.2 Policy evaluation in MDP

(1) Algorithm

  1. Objective: evaluate a given policy $\pi$ for an MDP
  2. Output: the value function under the policy, $V^{\pi}$
  3. Solution: iterate the Bellman expectation backup
  4. Algorithm: synchronous backup
    At each iteration $t+1$, update $V_{t+1}(s)$ from $V_t(s')$ for all states $s \in S$, where $s'$ is a successor state of $s$:
    $$V_{t+1}(s) = \sum_{a \in A} \pi(a|s) \Big( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_t(s') \Big)$$
  5. Convergence: $v_1 \to v_2 \to \cdots \to v^{\pi}$

(2) Idea

Use the policy to convert the MDP into an MRP and iterate until $V$ converges (a code sketch follows this block). MC/DP/TD methods can also be used.
$$V_{t+1}(s) = \sum_{a \in A} \pi(a|s) \Big( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_t(s') \Big)$$
We know:
$$P^{\pi}(s'|s) = \sum_{a \in A} \pi(a|s) P(s'|s, a), \qquad R^{\pi}(s) = \sum_{a \in A} \pi(a|s) R(s, a)$$
Thus:
$$V_{t+1}(s) = R^{\pi}(s) + \gamma \sum_{s' \in S} P^{\pi}(s'|s) V_t(s')$$
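A minimal sketch of synchronous iterative policy evaluation using the MDP form of the backup; the MDP, policy, and tolerance are illustrative.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterate V_{t+1}(s) = Σ_a π(a|s)(R(s,a) + γ Σ_{s'} P(s'|s,a) V_t(s'))."""
    V = np.zeros(R.shape[0])
    while True:
        # q(s,a) under the current value estimate, then average over π(a|s).
        q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = np.sum(pi * q, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[a, s, s']
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)                 # uniform random policy
print(policy_evaluation(P, R, pi, gamma=0.9))
```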

3.3 Optimal Value Function

The optimal state-value function $V^{*}(s)$ is the maximum value function over all policies.
The optimal V-function is the largest value attainable under any policy $\pi$:
$$V^{*}(s) = \max_{\pi} v^{\pi}(s)$$
The optimal policy:
the optimal policy is the one that attains this optimal value function,
$$\pi^{*}(s) = \arg\max_{\pi} v^{\pi}(s)$$
An MDP is "solved" when we know the optimal value function.
There exists a unique optimal value function, but there can be multiple optimal policies.

3.4 Finding Optimal Policy

Every MDP has a deterministic optimal policy. If we know $q^{*}(s, a)$, the optimal policy follows directly (a small extraction sketch follows this list).
1. An optimal policy can be found by maximizing over $q^{*}(s, a)$:
$$\pi^{*}(a|s) = \begin{cases} 1, & \text{if } a = \arg\max_{a \in A} q^{*}(s, a) \\ 0, & \text{otherwise} \end{cases}$$
2. There is always a deterministic optimal policy for any MDP.
3. If we know $q^{*}(s, a)$, we immediately have the optimal policy.
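A small sketch of reading the deterministic optimal policy off a Q-table by taking the arg max per state; the Q-values below are illustrative.

```python
import numpy as np

# Illustrative optimal Q-table: rows are states, columns are actions.
q_star = np.array([[1.0, 3.0],
                   [2.5, 2.0],
                   [0.0, 4.0]])

# π*(s) = argmax_a q*(s, a), written here both as an action index
# and as a one-hot distribution π*(a|s).
greedy_actions = np.argmax(q_star, axis=1)
pi_star = np.zeros_like(q_star)
pi_star[np.arange(len(q_star)), greedy_actions] = 1.0
print(greedy_actions)  # [1 0 1]
print(pi_star)
```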

3.5 Policy Search

Exhaustive policy search has $|A|^{|S|}$ possibilities, which is far too expensive. Common alternatives:
  • policy iteration
  • value iteration
MDP control means deriving the optimal policy from the value function.
The optimal policy of an MDP is deterministic, stationary, and not necessarily unique.

3.6 MDP Control

Compute the optimal policy:
$$\pi^{*}(s) = \arg\max_{\pi} v^{\pi}(s)$$
The optimal policy for an MDP in an infinite-horizon problem (the agent acts forever) is:
  • Deterministic
  • Stationary (does not depend on the time step)
  • Unique? Not necessarily; there may be state-action pairs with identical optimal values

3.7 Policy Iteration

Iterate between two steps:
  • Evaluate the policy $\pi$ (compute $v^{\pi}$ given the current $\pi$)
  • Improve the policy by acting greedily with respect to $v^{\pi}$:
    $$\pi' = \text{greedy}(v^{\pi})$$

(1) Policy Improvement

Compute the state-action value of the current policy $\pi_i$:
$$q^{\pi_i}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a)\, v^{\pi_i}(s')$$
Then compute the new policy $\pi_{i+1}$ for all $s \in S$ as follows.
The q-function can be viewed as a Q-table whose rows are states and whose columns are the possible actions; once the q-function is known, we have the Q-table for every state.
In each state, take the action that maximizes Q:
$$\pi_{i+1}(s) = \arg\max_{a} q^{\pi_i}(s, a)$$

(2)Monotonic Improvement in Policy

Idea: acting greedily (taking the arg max) produces a policy that is at least as good as the old one; it never makes the value function worse.
Derivation:
If improvements stop,
$$q^{\pi}(s, \pi'(s)) = \max_{a \in A} q^{\pi}(s, a) = q^{\pi}(s, \pi(s)) = v^{\pi}(s)$$
Thus the Bellman optimality equation has been satisfied:
$$v^{\pi}(s) = \max_{a \in A} q^{\pi}(s, a)$$
Therefore $v^{\pi}(s) = v^{*}(s)$ for all $s \in S$, so $\pi$ is an optimal policy.
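A minimal sketch of the full policy-iteration loop (evaluate the current policy exactly, then improve it greedily, until the policy stops changing); the example MDP is illustrative.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation and greedy policy improvement."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)             # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: solve V = R^π + γ P^π V for the current policy.
        P_pi = P[policy, np.arange(S), :]       # P(s'|s, π(s)), shape (S, S)
        R_pi = R[np.arange(S), policy]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to q^π.
        q = R + gamma * np.einsum('ast,t->sa', P, V)
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):  # stable policy => optimal
            return policy, V
        policy = new_policy

P = np.array([[[0.8, 0.2], [0.1, 0.9]],         # P[a, s, s']
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(policy_iteration(P, R, gamma=0.9))
```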

3.8 Value Iteration

(1) Bellman Optimality Equation

The optimal value functions satisfy the Bellman optimality equations:
$$V^{*}(s) = \max_{a} q^{*}(s, a)$$
$$q^{*}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V^{*}(s')$$
Thus:
$$V^{*}(s) = \max_{a} \Big( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \max_{a'} q^{*}(s', a') \Big)$$
$$q^{*}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \max_{a'} q^{*}(s', a')$$
These equations hold only when the value function is optimal in every state.

(2) Value Iteration

Value iteration turns the Bellman optimality equation into an update rule.
Idea: to obtain the optimal $v^{*}(s)$, treat the Bellman optimality equation as an update rule and iterate it repeatedly; the values gradually approach those of the optimal policy.
1. If we know the solution to the subproblem $V^{*}(s')$, which is optimal,
2. then the optimal $V^{*}(s)$ can be found by iterating the following Bellman optimality backup rule:
$$V(s) \leftarrow \max_{a \in A} \Big( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V(s') \Big)$$
3. The idea of value iteration is to apply these updates iteratively.

(3) Algorithm

Objective: find the optimal policy $\pi$
Solution: iterate the Bellman optimality backup
Value iteration algorithm (a code sketch follows this list):
  1. Initialize $k = 1$ and $V_0(s) = 0$ for all states $s$
  2. For $k = 1 : H$:
    1. for each state $s$:
      $$q_{k+1}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_{k}(s')$$
      $$V_{k+1}(s) = \max_{a} q_{k+1}(s, a)$$
    2. $k \leftarrow k + 1$
  3. To retrieve the optimal policy after value iteration:
    $$\pi(s) = \arg\max_{a} \Big( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) V_{k+1}(s') \Big)$$
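A minimal sketch of this procedure: a fixed number of Bellman optimality sweeps $H$, followed by policy extraction; the MDP and sweep count are illustrative.

```python
import numpy as np

def value_iteration(P, R, gamma, H=1000):
    """Apply the Bellman optimality backup for H sweeps, then extract the policy."""
    S, A = R.shape
    V = np.zeros(S)
    for _ in range(H):
        q = R + gamma * np.einsum('ast,t->sa', P, V)  # q_{k+1}(s, a)
        V = np.max(q, axis=1)                         # V_{k+1}(s) = max_a q_{k+1}(s, a)
    # Policy extraction: π(s) = argmax_a [R(s,a) + γ Σ_{s'} P(s'|s,a) V(s')]
    q = R + gamma * np.einsum('ast,t->sa', P, V)
    return np.argmax(q, axis=1), V

P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[a, s, s']
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(value_iteration(P, R, gamma=0.9))
```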

3.9 Policy Iteration vs Value Iteration

  1. Policy Iterationpolicy evaluationpolicy improvement两部分组成
  2. Value Iteration则是直接利用Bellman Optimality Equation迭代求解,最后再有一个policy extraction,由动作值获得最终策略
(Figure: Policy Iteration pseudocode)
(Figure: Value Iteration pseudocode)



| Problem | Bellman Equation | Algorithm |
| --- | --- | --- |
| Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |
| Control | Bellman Expectation Equation | Policy Iteration |
| Control | Bellman Optimality Equation | Value Iteration |
MDP prediction: solved by directly iterating the Bellman expectation equation.
MDP control: solved either via the Bellman expectation equation (policy iteration) or by value iteration on the Bellman optimality equation.

