The policy iteration algorithm for average reward Markov decision processes with general state space

被引:76
作者
Meyn, SP [1 ]
机构
[1] UNIV ILLINOIS,URBANA,IL 61801
基金
美国国家科学基金会;
关键词
Howard's algorithm; Markov decision processes; multiclass queueing networks; Poisson equation; policy iteration algorithm;
D O I
10.1109/9.650016
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
The average cost optimal control problem is addressed for Markov decision processes with unbounded cost, It is found that the policy iteration algorithm generates a sequence of policies which are c-regular (a strong stability condition), where c is the cost function under consideration, This result only requires the existence of an initial c-regular policy and an irreducibility condition on the state space, Furthermore, under these conditions the sequence of relative value functions generated by the algorithm is bounded from below and ''nearly'' decreasing, from which it follows that the algorithm is always convergent. Under further conditions, it is shown that the algorithm does compute a solution to the optimality equations and hence an optimal average cost policy, These results provide elementary criteria for the existence of optimal policies for Markov decision processes with unbounded cost and recover known results for the standard linear-quadratic-Gaussian problem. When these results are specialized to specific applications they reveal new structure for optimal policies, In particular, in the control of multiclass queueing networks, it is found that there is a close connection between optimization of the network and optimal control of a far simpler fluid network model.
引用
收藏
页码:1663 / 1680
页数:18
相关论文
共 50 条
[21]   RECURRENCE CONDITIONS FOR MARKOV DECISION PROCESSES WITH BOREL STATE SPACE: A SURVEY [J].
Hernandez-Lerma, Onesimo ;
Montes-De-Oca, Raul ;
Cavazos-Cadena, Rolando .
ANNALS OF OPERATIONS RESEARCH, 1991, 28 (01) :29-46
[22]  
HERNANDEZLERMA O, 1996, DISCRETE TIME MARKOV, V1
[23]  
HERNANDEZLERMA O, 1995, IN PRESS ACTA APPL M
[24]   ON THE CONVERGENCE OF POLICY ITERATION IN FINITE STATE UNDISCOUNTED MARKOV DECISION-PROCESSES - THE UNICHAIN CASE [J].
HORDIJK, A ;
PUTERMAN, ML .
MATHEMATICS OF OPERATIONS RESEARCH, 1987, 12 (01) :163-176
[25]  
HORDIJK A, 1995, UNIFORM STABILITY CO
[26]  
HORDIJK A, 1977, DYNAMIC PROGRAMMING
[27]  
Howard R., 1960, DYNAMIC PROGRAMMING
[28]  
HUMPHREY J, 1996, P 13 IFAC WORLD C, VB, P19
[29]  
KONDO VR, 1996, LEARNING ALGORITHMS
[30]   Duality and linear programs for stability and performance analysis of queuing networks and scheduling policies [J].
Kumar, PR ;
Meyn, SP .
IEEE TRANSACTIONS ON AUTOMATIC CONTROL, 1996, 41 (01) :4-17