Temporal-difference search in computer Go

Cited by: 59
Authors
Silver, David [1]
Sutton, Richard S. [2]
Mueller, Martin [2]
Affiliations
[1] UCL, London, England
[2] Univ Alberta, Edmonton, AB, Canada
Keywords
Reinforcement learning; Temporal-difference learning; Monte-Carlo search; Simulation-based search; Computer Go
DOI
10.1007/s10994-012-5280-0
CLC classification
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Temporal-difference learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. The key idea is to update a value function from episodes of real experience, by bootstrapping from future value estimates, and using value function approximation to generalise between related states. Monte-Carlo tree search is a recent algorithm for high-performance search, which has been used to achieve master-level play in Go. The key idea is to use the mean outcome of simulated episodes of experience to evaluate each state in a search tree. We introduce a new approach to high-performance search in Markov decision processes and two-player games. Our method, temporal-difference search, combines temporal-difference learning with simulation-based search. Like Monte-Carlo tree search, the value function is updated from simulated experience; but like temporal-difference learning, it uses value function approximation and bootstrapping to efficiently generalise between related states. We apply temporal-difference search to the game of 9x9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed an unenhanced Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9x9 Computer Go Server.
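The abstract describes temporal-difference search as updating a value function from simulated self-play experience by bootstrapping, with linear function approximation over binary features rather than an explicit search tree. As a rough illustration only, the Python sketch below shows a TD(0) variant of that idea with a logistic linear value function and an epsilon-greedy simulation policy; the simulator, features, and legal_moves interfaces are hypothetical placeholders, and details such as the feature set, trace parameter, and update schedule are assumptions, not the paper's exact algorithm.

import random
import numpy as np

def td_search(root_state, simulator, features, legal_moves, n_features,
              num_episodes=1000, alpha=0.1, epsilon=0.1):
    """Self-play TD(0) search: simulate episodes from root_state and update
    a linear-logistic value function by bootstrapping, with no search tree."""
    theta = np.zeros(n_features)

    def value(state):
        # Logistic value estimate from the sum of active binary feature weights.
        return 1.0 / (1.0 + np.exp(-theta[features(state)].sum()))

    for _ in range(num_episodes):
        state = root_state
        while not simulator.is_terminal(state):
            moves = legal_moves(state)
            if random.random() < epsilon:
                move = random.choice(moves)  # occasional exploratory move
            else:
                move = max(moves, key=lambda m: value(simulator.step(state, m)))
            next_state = simulator.step(state, move)

            # TD(0) target: final outcome at terminal states, otherwise the
            # bootstrapped estimate of the successor position.
            if simulator.is_terminal(next_state):
                target = simulator.outcome(next_state)  # e.g. 1 = win, 0 = loss
            else:
                target = value(next_state)

            v = value(state)
            delta = target - v
            # Semi-gradient update; v * (1 - v) is the logistic derivative.
            theta[features(state)] += alpha * delta * v * (1.0 - v)
            state = next_state

    # Choose the root move greedily under the learned value function.
    return max(legal_moves(root_state),
               key=lambda m: value(simulator.step(root_state, m)))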
Pages: 183-219
Page count: 37