Temporal-difference search in computer Go

Cited by: 59
Authors
Silver, David [1 ]
Sutton, Richard S. [2 ]
Mueller, Martin [2 ]
Affiliations
[1] UCL, London, England
[2] Univ Alberta, Edmonton, AB, Canada
Keywords
Reinforcement learning; Temporal-difference learning; Monte-Carlo search; Simulation-based search; Computer Go
DOI
10.1007/s10994-012-5280-0
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Temporal-difference learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. The key idea is to update a value function from episodes of real experience, by bootstrapping from future value estimates, and using value function approximation to generalise between related states. Monte-Carlo tree search is a recent algorithm for high-performance search, which has been used to achieve master-level play in Go. The key idea is to use the mean outcome of simulated episodes of experience to evaluate each state in a search tree. We introduce a new approach to high-performance search in Markov decision processes and two-player games. Our method, temporal-difference search, combines temporal-difference learning with simulation-based search. As in Monte-Carlo tree search, the value function is updated from simulated experience; but as in temporal-difference learning, it uses value function approximation and bootstrapping to generalise efficiently between related states. We apply temporal-difference search to the game of 9×9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed an unenhanced Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9×9 Computer Go Server.
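The mechanism the abstract describes, simulating episodes from the current state and updating a linear value function by bootstrapping, is concrete enough to sketch. The block below is a minimal illustration under stated assumptions, not the paper's implementation: a toy one-hot feature map stands in for the million binary pattern features, the simulator interface (step, actions, phi) is hypothetical, and a plain TD(0) update stands in for the paper's TD search update.

```python
import random

# Minimal sketch of temporal-difference search, assuming a generic episodic
# simulator. All names here (td_search, step, phi, ...) are illustrative,
# not the authors' code.

def td_search(root, step, actions, phi, theta,
              n_sims=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Simulate episodes from `root`, updating a linear value function
    V(s) = theta . phi(s) with the TD(0) bootstrap rule; return the
    greedy action at the root."""

    def value(s):
        return sum(theta[i] for i in phi(s))      # sparse binary features

    def lookahead(s, a):                          # r + gamma * V(s')
        s2, r, done = step(s, a)
        return r + (0.0 if done else gamma * value(s2))

    def greedy(s):                                # break ties at random
        return max(actions(s), key=lambda a: (lookahead(s, a), random.random()))

    for _ in range(n_sims):
        s, done = root, False
        while not done:
            a = random.choice(actions(s)) if random.random() < epsilon else greedy(s)
            s2, r, done = step(s, a)
            # TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')
            delta = r + (0.0 if done else gamma * value(s2)) - value(s)
            for i in phi(s):                      # gradient of a linear V
                theta[i] += alpha * delta
            s = s2
    return greedy(root)

# Toy demo: a corridor of six states; reaching the right end pays 1.
N = 6

def step(s, a):
    s2 = max(0, s + a)
    return s2, (1.0 if s2 == N - 1 else 0.0), s2 == N - 1

def actions(s):
    return [-1, +1]

def phi(s):
    return [s]                                    # one active feature per state

theta = [0.0] * N
print("root action:", td_search(0, step, actions, phi, theta))
print("learned values:", [round(t, 2) for t in theta])
```

Run as a script, the toy demo learns values that increase toward the rewarding end of the corridor, so the greedy root action is to move right; the paper applies the same kind of bootstrapped update to simulated Go positions with pattern features instead.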
Pages: 183-219
Page count: 37