Temporal-difference search in computer Go

Cited by: 59
Authors
Silver, David [1 ]
Sutton, Richard S. [2 ]
Mueller, Martin [2 ]
Affiliations
[1] UCL, London, England
[2] Univ Alberta, Edmonton, AB, Canada
Keywords
Reinforcement learning; Temporal-difference learning; Monte-Carlo search; Simulation-based search; Computer Go
DOI
10.1007/s10994-012-5280-0
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Temporal-difference learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. The key idea is to update a value function from episodes of real experience, by bootstrapping from future value estimates, and using value function approximation to generalise between related states. Monte-Carlo tree search is a recent algorithm for high-performance search, which has been used to achieve master-level play in Go. The key idea is to use the mean outcome of simulated episodes of experience to evaluate each state in a search tree. We introduce a new approach to high-performance search in Markov decision processes and two-player games. Our method, temporal-difference search, combines temporal-difference learning with simulation-based search. As in Monte-Carlo tree search, the value function is updated from simulated experience; but as in temporal-difference learning, it uses value function approximation and bootstrapping to generalise efficiently between related states. We apply temporal-difference search to the game of 9×9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed an unenhanced Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9×9 Computer Go Server.
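The mechanism the abstract describes, simulating episodes from the current state and updating a linear value function by bootstrapping, is concrete enough to sketch. The block below is a minimal illustration under stated assumptions, not the paper's implementation: a toy one-hot feature map stands in for the million binary pattern features, the simulator interface (step, actions, phi) is hypothetical, and a plain TD(0) update stands in for the paper's TD search update.

```python
import random

# Minimal sketch of temporal-difference search, assuming a generic episodic
# simulator. All names here (td_search, step, phi, ...) are illustrative,
# not the authors' code.

def td_search(root, step, actions, phi, theta,
              n_sims=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Simulate episodes from `root`, updating a linear value function
    V(s) = theta . phi(s) with the TD(0) bootstrap rule; return the
    greedy action at the root."""

    def value(s):
        return sum(theta[i] for i in phi(s))      # sparse binary features

    def lookahead(s, a):                          # r + gamma * V(s')
        s2, r, done = step(s, a)
        return r + (0.0 if done else gamma * value(s2))

    def greedy(s):                                # break ties at random
        return max(actions(s), key=lambda a: (lookahead(s, a), random.random()))

    for _ in range(n_sims):
        s, done = root, False
        while not done:
            a = random.choice(actions(s)) if random.random() < epsilon else greedy(s)
            s2, r, done = step(s, a)
            # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')
            delta = r + (0.0 if done else gamma * value(s2)) - value(s)
            for i in phi(s):                      # gradient of a linear V
                theta[i] += alpha * delta
            s = s2
    return greedy(root)

# Toy demo: a corridor of six states; reaching the right end pays 1.
N = 6

def step(s, a):
    s2 = max(0, s + a)
    return s2, (1.0 if s2 == N - 1 else 0.0), s2 == N - 1

def actions(s):
    return [-1, +1]

def phi(s):
    return [s]                                    # one active feature per state

theta = [0.0] * N
print("root action:", td_search(0, step, actions, phi, theta))
print("learned values:", [round(t, 2) for t in theta])
```

Run as a script, the toy demo learns values that increase toward the rewarding end of the corridor, so the greedy root action is to move right; the paper applies the same kind of bootstrapped update to simulated Go positions with pattern features instead.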
Pages: 183-219
Page count: 37