Types of MDPs and their solutions
Stochastic dynamic programming with factored representations
Craig Boutilier et al., Artificial Intelligence
Equivalence notions and model minimization in Markov decision processes
Robert Givan et al., Artificial Intelligence 2003
-
Traditional approach: compress (reduce/aggregate) the state space of the original MDP by grouping equivalent states. The paper applies this idea to factored MDPs to shrink the state space: equivalent blocks of states are described in factored form, reducing the number of states so that the reduced model can be solved by conventional methods.
-
The paper introduces a new notion of equivalence to construct the reduced MDP. An optimal policy for the reduced model can be quickly translated into an optimal policy for the original model.
-
Finite sequential machines (FSM):
A tuple ⟨Q, Σ, O, T, λ⟩ where Q is a finite set of states, Σ is a finite set of inputs (actions), O is a set of possible outputs, T is the transition function (a subset of Q × Σ × Q), and λ is a mapping from Q to O.
-
What bisimulation and bisimilarity are (Section 4)
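The equivalence-based aggregation behind the paper can be illustrated by partition refinement on a small machine (a toy sketch with made-up states and transitions, not the paper's algorithm): start from blocks of states with equal output, then split any block whose states transition into different blocks under some input.

```python
def bisimulation_partition(states, inputs, delta, output):
    """Coarsest bisimulation partition: states in one block share an
    output and, for every input, transition into the same block."""
    # Initial partition: group states by their output.
    by_output = {}
    for s in states:
        by_output.setdefault(output[s], []).append(s)
    partition = [frozenset(b) for b in by_output.values()]

    changed = True
    while changed:
        changed = False
        block_of = {s: b for b in partition for s in b}
        refined = []
        for b in partition:
            # Signature of a state: which block each input leads to.
            groups = {}
            for s in b:
                key = tuple(block_of[delta[(s, a)]] for a in inputs)
                groups.setdefault(key, []).append(s)
            if len(groups) > 1:
                changed = True
            refined.extend(frozenset(g) for g in groups.values())
        partition = refined
    return partition

# Hypothetical 4-state machine in which s0 and s1 are bisimilar.
states = ["s0", "s1", "s2", "s3"]
inputs = ["a"]
delta = {("s0", "a"): "s2", ("s1", "a"): "s3",
         ("s2", "a"): "s2", ("s3", "a"): "s3"}
output = {"s0": 0, "s1": 0, "s2": 1, "s3": 1}
print(sorted(len(b) for b in bisimulation_partition(states, inputs, delta, output)))  # → [2, 2]
```

Each block of the final partition is one state of the reduced model, which is why an optimal policy transfers back to the original model.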
-
What a factored MDP is
Discovering hidden structure in factored MDPs
Andrey Kolobov et al., AI 2012
-
Basic functions: a state guarantees that a certain trajectory of action outcomes has a positive probability of reaching the goal; the state is followed by a path that must be traversed.
-
Nogoods: a state from which no trajectory can reach the goal; everything after it is a dead end.
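On an explicit transition graph (a toy example, not the paper's factored method), nogoods are exactly the states from which the goal is unreachable, computable by backward reachability from the goal:

```python
from collections import deque

def find_nogoods(states, goal, successors):
    """Nogood: no trajectory from the state can reach the goal.
    Compute the states that CAN reach the goal via backward BFS,
    then take the complement."""
    # Invert the successor relation.
    preds = {s: set() for s in states}
    for s in states:
        for t in successors.get(s, ()):
            preds[t].add(s)
    # Backward BFS from the goal.
    can_reach = {goal}
    frontier = deque([goal])
    while frontier:
        t = frontier.popleft()
        for s in preds[t]:
            if s not in can_reach:
                can_reach.add(s)
                frontier.append(s)
    return set(states) - can_reach

# Toy graph: s3 only loops on itself, so it is a nogood.
succ = {"s0": ["s1", "s3"], "s1": ["goal"], "s3": ["s3"], "goal": []}
print(find_nogoods(["s0", "s1", "s3", "goal"], "goal", succ))  # → {'s3'}
```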
-
FF-Replan: S. Yoon, A. Fern, R. Givan, FF-Replan: A baseline for probabilistic planning, in: ICAPS'07, 2007, pp. 352–359
Planning and Learning with Stochastic Action Sets
Craig Boutilier et al., IJCAI 2018
-
Problem: in an MDP, whether each of the agent's actions is available follows a probability distribution.
-
Two main pieces of work to solve the problem:
- A compressed representation of SAS-MDPs (embedded MDP vs. compressed MDP), optimizing the MDP to reduce its dimensionality
- How to apply policy iteration, value iteration, linear programming, Q-learning, etc. on this representation to compute optimal policies
-
Stochastic action sets (SAS) MDP
- The embedded MDP: the obvious solution; embed the set of available actions into the state itself, so each state separately records its available actions. Pro: easy to analyze. Con: hard to keep tractable / to optimize (the search space becomes too large).
- The compressed MDP: the proposed solution; handle the Q-values of available actions separately from unavailable ones, partitioning the action sets into subsets (analogous to the power set, but without enumerating all subsets), with each state associated with one subset. Con: clearly increases the action dimensionality. Pro: the state space stays small even when an action is only available with some probability.
- Example: in the embedded MDP, one state with two possible action sets becomes two different states, and control first selects a state and then the corresponding action as the policy; in the compressed MDP there is one state with two action (sets), and the policy selects an action directly (if Up does not exist, it is replaced by an available option).
- Computational increase: the embedded MDP enlarges the state space; the compressed MDP enlarges the decision space.
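A back-of-the-envelope illustration of this trade-off (all numbers hypothetical):

```python
import math

# Hypothetical sizes: n base states, each of which can present one of
# k distinct available-action subsets of m base actions.
n_states, k_sets, m_actions = 100, 8, 4

# Embedded MDP: each (base state, action subset) pair is its own state.
embedded_states = n_states * k_sets      # state space multiplied by k
# Compressed MDP: the base state space is kept as-is...
compressed_states = n_states
# ...but a policy must now rank fallbacks for when its first choice is
# unavailable, so the per-state decision space grows instead
# (up to m! orderings in the decision-list case).
decision_lists_per_state = math.factorial(m_actions)

print(embedded_states, compressed_states, decision_lists_per_state)  # → 800 100 24
```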
- Execution in the compressed MDP:
- The agent selects an action.
- The environment determines which action set the current state is in (e.g. one set in which both Up and Down are available, or another in which only Down is available).
- If the selected action is in the current state's action set, it is executed; otherwise an action is taken from the action set instead.
- E.g., the agent selects Up: if the current action set contains Up, Up is executed; if only Down is available, Down is executed. If the agent selects Down, Down is executed in either case.
- If the selected action is unavailable in the current action set and the available actions in that set are not unique, the agent's next action is chosen using a decision list.
-
Intermediate constructs: DL, PDA and ADS
- Decision list (DL): an ordered list (a permutation of actions); when executing, always take the first available action in the list.
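A decision list fits in a few lines (names are illustrative, not from the paper):

```python
def decision_list_action(preference, available):
    """Return the first action in the fixed preference order that is
    available in the current action set."""
    for action in preference:
        if action in available:
            return action
    raise ValueError("no action in the decision list is available")

print(decision_list_action(["Up", "Down", "Left"], {"Down", "Left"}))  # → Down
```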
- Methods for computing action-availability probabilities:
- The product distribution assumption (PDA):
- Assume all availability probabilities are independent (an approximation).
- Given a base action set B, each action k ∈ B is available with probability p_k.
- The probability that the available set is exactly A ⊆ B is then ∏_{k∈A} p_k · ∏_{k∈B∖A} (1 − p_k): the product over the actions that should be in A, times the product over the actions that should not be in A.
- Advantage: this probabilistic approximation avoids enumerating all available-action subsets during Bellman backups.
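The product formula, as a sketch (the p_k values here are hypothetical per-action availability probabilities):

```python
import math

def prob_available_set(p, A):
    """P(available set == A) under the product distribution assumption:
    product of p_k for k in A, times product of (1 - p_k) for k not in A."""
    return math.prod(p[k] for k in A) * \
           math.prod(1 - p[k] for k in p if k not in A)

p = {"Up": 0.9, "Down": 1.0, "Left": 0.5}   # hypothetical availabilities
print(prob_available_set(p, {"Up", "Down"}))  # 0.9 * 1.0 * (1 - 0.5) = 0.45
```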
- Arbitrary distributions with sampling (ADS):
- Obtain the availability probability of each action in each state via sampling.
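The sampling idea can be approximated with a toy Monte Carlo estimate (the environment sampler here is hypothetical):

```python
import random

def estimate_availability(sample_action_set, actions, n_samples=10_000):
    """Estimate each action's availability probability in one state
    by repeatedly drawing action sets from the environment."""
    counts = {a: 0 for a in actions}
    for _ in range(n_samples):
        for a in sample_action_set():
            counts[a] += 1
    return {a: counts[a] / n_samples for a in actions}

# Hypothetical state: Down always available, Up only 70% of the time.
random.seed(0)
sampler = lambda: {"Down", "Up"} if random.random() < 0.7 else {"Down"}
est = estimate_availability(sampler, ["Up", "Down"])
print(est)
```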
- The paper mainly focuses on computing action-availability probabilities with PDA; in most scenarios PDA and ADS are equivalent.
-
Related work
- Sleeping bandits
- Canadian Traveller Problem
-
Solver: Q-learning, value iteration, policy iteration
Reinforcement Learning When All Actions are Not Always Available
Yash Chandak et al., AAAI 2020
- Problem addressed: solution methods for SAS-MDPs
- Q-learning: may deteriorate when using function approximators, with no convergence guarantee (the lack of convergence guarantees of Q-learning with function approximators in the MDP setting can get exacerbated in the SAS-MDP setting).
- Proposes policy gradient and natural policy gradient algorithms adapted to SAS-MDPs
Topological value iteration algorithms
Peng Dai et al., Journal Of Artificial Intelligence Research 2011
- Value iteration is inefficient. Two algorithms are proposed: TVI and FTVI (focused topological value iteration).
- TVI: first decompose the MDP into strongly-connected components (SCCs), then solve the components in sequence.
- FTVI: restricts its attention to the connected components that are relevant for solving the MDP, pruning via heuristic search.
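The TVI idea can be sketched as follows (a toy MDP with made-up names, not the paper's implementation): decompose the graph into SCCs, then run plain value iteration inside each component, goal-side components first, since a component's values depend only on itself and downstream components.

```python
def sccs(nodes, succ):
    """Kosaraju's algorithm: SCCs in topological order of the
    component DAG (components with no incoming edges first)."""
    order, seen = [], set()
    def dfs(u):
        seen.add(u)
        for v in succ.get(u, ()):
            if v not in seen:
                dfs(v)
        order.append(u)
    for u in nodes:
        if u not in seen:
            dfs(u)
    pred = {u: [] for u in nodes}
    for u in nodes:
        for v in succ.get(u, ()):
            pred[v].append(u)
    comps, seen = [], set()
    for u in reversed(order):
        if u in seen:
            continue
        comp, stack = set(), [u]
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                comp.add(x)
                stack.extend(pred[x])
        comps.append(comp)
    return comps

def tvi(states, actions, trans, cost, goal, eps=1e-8):
    """Topological VI sketch: solve each SCC separately, goal-side
    components first."""
    succ = {s: {s2 for a in actions[s] for s2, _ in trans[(s, a)]}
            for s in states}
    V = {s: 0.0 for s in states}
    for comp in reversed(sccs(states, succ)):
        if goal in comp:
            continue  # the goal's value stays 0
        while True:  # plain VI restricted to this component
            delta = 0.0
            for s in comp:
                best = min(cost[(s, a)] +
                           sum(p * V[s2] for s2, p in trans[(s, a)])
                           for a in actions[s])
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < eps:
                break
    return V

# Toy 3-state chain: s0 -> s1 -> goal, unit costs, deterministic.
actions = {"s0": ["a"], "s1": ["a"], "goal": []}
trans = {("s0", "a"): [("s1", 1.0)], ("s1", "a"): [("goal", 1.0)]}
cost = {("s0", "a"): 1.0, ("s1", "a"): 1.0}
V = tvi(["s0", "s1", "goal"], actions, trans, cost, "goal")
print(V["s0"], V["s1"])  # → 2.0 1.0
```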
Linearly-solvable Markov decision problems
Emanuel Todorov et al., NIPS06
- Proposes Z-learning, identifying a class of MDPs whose structural properties allow an approximate solution to be computed linearly (by solving a linear system rather than nonlinear Bellman equations).
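The linear structure can be sketched on a toy first-exit problem (costs and transitions made up): the desirability function z = exp(−v) satisfies a linear fixed-point equation, so no nonlinear max/min over actions is needed.

```python
import math

# Toy first-exit problem: states 0 and 1 are internal, state 2 is
# terminal with zero cost.
q = [0.5, 0.5, 0.0]                  # state costs
P = [[0.0, 0.5, 0.5],                # "passive" transition dynamics
     [0.5, 0.0, 0.5],
     [0.0, 0.0, 1.0]]

# Desirability z = exp(-v) satisfies the LINEAR equation
#   z[i] = exp(-q[i]) * sum_j P[i][j] * z[j]   for internal i,
# with z pinned to exp(-q) at terminal states.
z = [1.0, 1.0, math.exp(-q[2])]
for _ in range(200):                 # simple fixed-point iteration
    z = [math.exp(-q[i]) * sum(P[i][j] * z[j] for j in range(3))
         for i in range(2)] + [math.exp(-q[2])]

v = [-math.log(zi) for zi in z]      # optimal cost-to-go v = -log z
print([round(x, 3) for x in v])
```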