Temporal-difference search in computer Go

Cited by: 59
Authors
Silver, David [1]
Sutton, Richard S. [2]
Mueller, Martin [2]
Affiliations
[1] UCL, London, England
[2] Univ Alberta, Edmonton, AB, Canada
Keywords
Reinforcement learning; Temporal-difference learning; Monte-Carlo search; Simulation-based search; Computer Go
DOI
10.1007/s10994-012-5280-0
CLC classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Temporal-difference learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. The key idea is to update a value function from episodes of real experience, by bootstrapping from future value estimates, and using value function approximation to generalise between related states. Monte-Carlo tree search is a recent algorithm for high-performance search, which has been used to achieve master-level play in Go. The key idea is to use the mean outcome of simulated episodes of experience to evaluate each state in a search tree. We introduce a new approach to high-performance search in Markov decision processes and two-player games. Our method, temporal-difference search, combines temporal-difference learning with simulation-based search. Like Monte-Carlo tree search, the value function is updated from simulated experience; but like temporal-difference learning, it uses value function approximation and bootstrapping to efficiently generalise between related states. We apply temporal-difference search to the game of 9x9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed an unenhanced Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9x9 Computer Go Server.
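The abstract describes temporal-difference search as updating a value function from simulated self-play experience by bootstrapping, with linear function approximation over binary features rather than an explicit search tree. As a rough illustration only, the Python sketch below shows a TD(0) variant of that idea with a logistic linear value function and an epsilon-greedy simulation policy; the simulator, features, and legal_moves interfaces are hypothetical placeholders, and details such as the feature set, trace parameter, and update schedule are assumptions, not the paper's exact algorithm.

import random
import numpy as np

def td_search(root_state, simulator, features, legal_moves, n_features,
              num_episodes=1000, alpha=0.1, epsilon=0.1):
    """Self-play TD(0) search: simulate episodes from root_state and update
    a linear-logistic value function by bootstrapping, with no search tree."""
    theta = np.zeros(n_features)

    def value(state):
        # Logistic value estimate from the sum of active binary feature weights.
        return 1.0 / (1.0 + np.exp(-theta[features(state)].sum()))

    for _ in range(num_episodes):
        state = root_state
        while not simulator.is_terminal(state):
            moves = legal_moves(state)
            if random.random() < epsilon:
                move = random.choice(moves)  # occasional exploratory move
            else:
                move = max(moves, key=lambda m: value(simulator.step(state, m)))
            next_state = simulator.step(state, move)

            # TD(0) target: final outcome at terminal states, otherwise the
            # bootstrapped estimate of the successor position.
            if simulator.is_terminal(next_state):
                target = simulator.outcome(next_state)  # e.g. 1 = win, 0 = loss
            else:
                target = value(next_state)

            v = value(state)
            delta = target - v
            # Semi-gradient update; v * (1 - v) is the logistic derivative.
            theta[features(state)] += alpha * delta * v * (1.0 - v)
            state = next_state

    # Choose the root move greedily under the learned value function.
    return max(legal_moves(root_state),
               key=lambda m: value(simulator.step(root_state, m)))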
Pages: 183-219
Page count: 37