Types of MDPs and their solutions
Stochastic dynamic programming with factored representations
Craig Boutilier et al., Artificial Intelligence
Equivalence notions and model minimization in Markov decision processes
Robert Givan et al., Artificial Intelligence 2003
-
Traditional approach: compress (reduce/aggregate) the state space of the original MDP by grouping equivalent states. The paper applies this idea to factored MDPs to shrink the state space: equivalent blocks of states are described in factored form, reducing the number of states so that the reduced model can be solved by conventional methods.
-
The paper introduces a new notion of equivalence to construct the reduced MDP. An optimal policy for the reduced model can be quickly translated into an optimal policy for the original model.
-
Finite sequential machines (FSM):
A tuple ⟨Q, Σ, O, T, λ⟩ where Q is a finite set of states, Σ is a finite set of inputs (actions), O is a set of possible outputs, T is the transition function (a subset of Q × Σ × Q), and λ is a mapping from Q to O.
-
What bisimulation and bisimilarity are (Section 4)
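The equivalence-based aggregation behind the paper can be illustrated by partition refinement on a small machine (a toy sketch with made-up states and transitions, not the paper's algorithm): start from blocks of states with equal output, then split any block whose states transition into different blocks under some input.

```python
def bisimulation_partition(states, inputs, delta, output):
    """Coarsest bisimulation partition: states in one block share an
    output and, for every input, transition into the same block."""
    # Initial partition: group states by their output.
    by_output = {}
    for s in states:
        by_output.setdefault(output[s], []).append(s)
    partition = [frozenset(b) for b in by_output.values()]

    changed = True
    while changed:
        changed = False
        block_of = {s: b for b in partition for s in b}
        refined = []
        for b in partition:
            # Signature of a state: which block each input leads to.
            groups = {}
            for s in b:
                key = tuple(block_of[delta[(s, a)]] for a in inputs)
                groups.setdefault(key, []).append(s)
            if len(groups) > 1:
                changed = True
            refined.extend(frozenset(g) for g in groups.values())
        partition = refined
    return partition

# Hypothetical 4-state machine in which s0 and s1 are bisimilar.
states = ["s0", "s1", "s2", "s3"]
inputs = ["a"]
delta = {("s0", "a"): "s2", ("s1", "a"): "s3",
         ("s2", "a"): "s2", ("s3", "a"): "s3"}
output = {"s0": 0, "s1": 0, "s2": 1, "s3": 1}
print(sorted(len(b) for b in bisimulation_partition(states, inputs, delta, output)))  # → [2, 2]
```

Each block of the final partition is one state of the reduced model, which is why an optimal policy transfers back to the original model.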
-
What a factored MDP is
Discovering hidden structure in factored MDPs
Andrey Kolobov et al., AI 2012
-
Basic functions: a state guarantees that a certain trajectory of action outcomes has a positive probability of reaching the goal; the state is followed by a path that must be traversed.
-
Nogoods: a state from which no trajectory can reach the goal; everything after it is a dead end.
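On an explicit transition graph (a toy example, not the paper's factored method), nogoods are exactly the states from which the goal is unreachable, computable by backward reachability from the goal:

```python
from collections import deque

def find_nogoods(states, goal, successors):
    """Nogood: no trajectory from the state can reach the goal.
    Compute the states that CAN reach the goal via backward BFS,
    then take the complement."""
    # Invert the successor relation.
    preds = {s: set() for s in states}
    for s in states:
        for t in successors.get(s, ()):
            preds[t].add(s)
    # Backward BFS from the goal.
    can_reach = {goal}
    frontier = deque([goal])
    while frontier:
        t = frontier.popleft()
        for s in preds[t]:
            if s not in can_reach:
                can_reach.add(s)
                frontier.append(s)
    return set(states) - can_reach

# Toy graph: s3 only loops on itself, so it is a nogood.
succ = {"s0": ["s1", "s3"], "s1": ["goal"], "s3": ["s3"], "goal": []}
print(find_nogoods(["s0", "s1", "s3", "goal"], "goal", succ))  # → {'s3'}
```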
-
FF-Replan: S. Yoon, A. Fern, R. Givan, FF-Replan: A baseline for probabilistic planning, in: ICAPS'07, 2007, pp. 352–359
Planning and Learning with Stochastic Action Sets
Craig Boutilier et al., IJCAI 2018
-
Problem: in an MDP, whether each of the agent's actions is available follows a probability distribution.
-
Two main pieces of work to solve the problem:
- A compressed representation of SAS-MDPs (embedded MDP vs. compressed MDP), optimizing the MDP to reduce its dimensionality
- How to apply policy iteration, value iteration, linear programming, Q-learning, etc. on this representation to compute optimal policies
-
Stochastic action sets (SAS) MDP
- The embedded MDP: the obvious solution; embed the set of available actions into the state itself, so each state separately records its available actions. Pro: easy to analyze. Con: hard to keep tractable / to optimize (the search space becomes too large).
- The compressed MDP: the proposed solution; handle the Q-values of available actions separately from unavailable ones, partitioning the action sets into subsets (analogous to the power set, but without enumerating all subsets), with each state associated with one subset. Con: clearly increases the action dimensionality. Pro: the state space stays small even when an action is only available with some probability.
- Example: in the embedded MDP, one state with two possible action sets becomes two different states, and control first selects a state and then the corresponding action as the policy; in the compressed MDP there is one state with two action (sets), and the policy selects an action directly (if Up does not exist, it is replaced by an available option).
- Computational increase: the embedded MDP enlarges the state space; the compressed MDP enlarges the decision space.
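A back-of-the-envelope illustration of this trade-off (all numbers hypothetical):

```python
import math

# Hypothetical sizes: n base states, each of which can present one of
# k distinct available-action subsets of m base actions.
n_states, k_sets, m_actions = 100, 8, 4

# Embedded MDP: each (base state, action subset) pair is its own state.
embedded_states = n_states * k_sets      # state space multiplied by k
# Compressed MDP: the base state space is kept as-is...
compressed_states = n_states
# ...but a policy must now rank fallbacks for when its first choice is
# unavailable, so the per-state decision space grows instead
# (up to m! orderings in the decision-list case).
decision_lists_per_state = math.factorial(m_actions)

print(embedded_states, compressed_states, decision_lists_per_state)  # → 800 100 24
```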
- Execution in the compressed MDP:
- The agent selects an action.
- The environment determines which action set the current state is in (e.g. one set in which both Up and Down are available, or another in which only Down is available).
- If the selected action is in the current state's action set, it is executed; otherwise an action is taken from the action set instead.
- E.g., the agent selects Up: if the current action set contains Up, Up is executed; if only Down is available, Down is executed. If the agent selects Down, Down is executed in either case.
- If the selected action is unavailable in the current action set and the available actions in that set are not unique, the agent's next action is chosen using a decision list.
-
Intermediate constructs: DL, PDA and ADS
- Decision list (DL): an ordered list (a permutation of actions); when executing, always take the first available action in the list.
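A decision list fits in a few lines (names are illustrative, not from the paper):

```python
def decision_list_action(preference, available):
    """Return the first action in the fixed preference order that is
    available in the current action set."""
    for action in preference:
        if action in available:
            return action
    raise ValueError("no action in the decision list is available")

print(decision_list_action(["Up", "Down", "Left"], {"Down", "Left"}))  # → Down
```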
- Methods for computing action-availability probabilities:
- The product distribution assumption (PDA):
- Assume all availability probabilities are independent (an approximation).
- Given a base action set B, each action k ∈ B is available with probability p_k.
- The probability that the available set is exactly A ⊆ B is then ∏_{k∈A} p_k · ∏_{k∈B∖A} (1 − p_k): the product over the actions that should be in A, times the product over the actions that should not be in A.
- Advantage: this probabilistic approximation avoids enumerating all available-action subsets during Bellman backups.
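The product formula, as a sketch (the p_k values here are hypothetical per-action availability probabilities):

```python
import math

def prob_available_set(p, A):
    """P(available set == A) under the product distribution assumption:
    product of p_k for k in A, times product of (1 - p_k) for k not in A."""
    return math.prod(p[k] for k in A) * \
           math.prod(1 - p[k] for k in p if k not in A)

p = {"Up": 0.9, "Down": 1.0, "Left": 0.5}   # hypothetical availabilities
print(prob_available_set(p, {"Up", "Down"}))  # 0.9 * 1.0 * (1 - 0.5) = 0.45
```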
- Arbitrary distributions with sampling (ADS):
- Obtain the availability probability of each action in each state via sampling.
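The sampling idea can be approximated with a toy Monte Carlo estimate (the environment sampler here is hypothetical):

```python
import random

def estimate_availability(sample_action_set, actions, n_samples=10_000):
    """Estimate each action's availability probability in one state
    by repeatedly drawing action sets from the environment."""
    counts = {a: 0 for a in actions}
    for _ in range(n_samples):
        for a in sample_action_set():
            counts[a] += 1
    return {a: counts[a] / n_samples for a in actions}

# Hypothetical state: Down always available, Up only 70% of the time.
random.seed(0)
sampler = lambda: {"Down", "Up"} if random.random() < 0.7 else {"Down"}
est = estimate_availability(sampler, ["Up", "Down"])
print(est)
```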
- The paper mainly focuses on computing action-availability probabilities with PDA; in most scenarios PDA and ADS are equivalent.
-
Related work
- Sleeping bandits
- Canadian Traveller Problem
-
Solver: Q-learning, value iteration, policy iteration
Reinforcement Learning When All Actions are Not Always Available
Yash Chandak et al., AAAI 2020
- Problem addressed: solution methods for SAS-MDPs
- Q-learning: may deteriorate when using function approximators, with no convergence guarantee (the lack of convergence guarantees of Q-learning with function approximators in the MDP setting can get exacerbated in the SAS-MDP setting).
- Proposes policy gradient and natural policy gradient algorithms adapted to SAS-MDPs
Topological value iteration algorithms
Peng Dai et al., Journal Of Artificial Intelligence Research 2011
- Value iteration is inefficient. Two algorithms are proposed: TVI and FTVI (focused topological value iteration).
- TVI: first decompose the MDP into strongly-connected components (SCCs), then solve the components in sequence.
- FTVI: restricts its attention to the connected components that are relevant for solving the MDP, pruning via heuristic search.
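The TVI idea can be sketched as follows (a toy MDP with made-up names, not the paper's implementation): decompose the graph into SCCs, then run plain value iteration inside each component, goal-side components first, since a component's values depend only on itself and downstream components.

```python
def sccs(nodes, succ):
    """Kosaraju's algorithm: SCCs in topological order of the
    component DAG (components with no incoming edges first)."""
    order, seen = [], set()
    def dfs(u):
        seen.add(u)
        for v in succ.get(u, ()):
            if v not in seen:
                dfs(v)
        order.append(u)
    for u in nodes:
        if u not in seen:
            dfs(u)
    pred = {u: [] for u in nodes}
    for u in nodes:
        for v in succ.get(u, ()):
            pred[v].append(u)
    comps, seen = [], set()
    for u in reversed(order):
        if u in seen:
            continue
        comp, stack = set(), [u]
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                comp.add(x)
                stack.extend(pred[x])
        comps.append(comp)
    return comps

def tvi(states, actions, trans, cost, goal, eps=1e-8):
    """Topological VI sketch: solve each SCC separately, goal-side
    components first."""
    succ = {s: {s2 for a in actions[s] for s2, _ in trans[(s, a)]}
            for s in states}
    V = {s: 0.0 for s in states}
    for comp in reversed(sccs(states, succ)):
        if goal in comp:
            continue  # the goal's value stays 0
        while True:  # plain VI restricted to this component
            delta = 0.0
            for s in comp:
                best = min(cost[(s, a)] +
                           sum(p * V[s2] for s2, p in trans[(s, a)])
                           for a in actions[s])
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < eps:
                break
    return V

# Toy 3-state chain: s0 -> s1 -> goal, unit costs, deterministic.
actions = {"s0": ["a"], "s1": ["a"], "goal": []}
trans = {("s0", "a"): [("s1", 1.0)], ("s1", "a"): [("goal", 1.0)]}
cost = {("s0", "a"): 1.0, ("s1", "a"): 1.0}
V = tvi(["s0", "s1", "goal"], actions, trans, cost, "goal")
print(V["s0"], V["s1"])  # → 2.0 1.0
```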
Linearly-solvable Markov decision problems
Emanuel Todorov et al., NIPS06
- Proposes Z-learning, identifying a class of MDPs whose structural properties allow an approximate solution to be computed linearly (by solving a linear system rather than nonlinear Bellman equations).
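The linear structure can be sketched on a toy first-exit problem (costs and transitions made up): the desirability function z = exp(−v) satisfies a linear fixed-point equation, so no nonlinear max/min over actions is needed.

```python
import math

# Toy first-exit problem: states 0 and 1 are internal, state 2 is
# terminal with zero cost.
q = [0.5, 0.5, 0.0]                  # state costs
P = [[0.0, 0.5, 0.5],                # "passive" transition dynamics
     [0.5, 0.0, 0.5],
     [0.0, 0.0, 1.0]]

# Desirability z = exp(-v) satisfies the LINEAR equation
#   z[i] = exp(-q[i]) * sum_j P[i][j] * z[j]   for internal i,
# with z pinned to exp(-q) at terminal states.
z = [1.0, 1.0, math.exp(-q[2])]
for _ in range(200):                 # simple fixed-point iteration
    z = [math.exp(-q[i]) * sum(P[i][j] * z[j] for j in range(3))
         for i in range(2)] + [math.exp(-q[2])]

v = [-math.log(zi) for zi in z]      # optimal cost-to-go v = -log z
print([round(x, 3) for x in v])
```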