Tarantula: A vector extension to the Alpha architecture

Cited by: 36
Authors
Espasa, R [1]
Ardanaz, F [1]
Emer, J [1]
Felix, S [1]
Gago, J [1]
Gramunt, R [1]
Hernandez, I [1]
Juan, T [1]
Lowney, G [1]
Mattina, M [1]
Seznec, A [1]
Affiliation
[1] Univ Politecn Cataluna, Compaq, UPC Microprocessor Lab, Barcelona, Spain
Source
29TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, PROCEEDINGS | 2002
DOI
10.1109/ISCA.2002.1003586
Chinese Library Classification
TP3 [Computing technology, computer technology];
Discipline classification code
0812;
Abstract
Tarantula is an aggressive floating point machine targeted at technical, scientific and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor [6, 5]. Tarantula adds to the EV8 core a vector unit capable of 32 double-precision flops per cycle. The vector unit fetches data directly from a 16 MByte second level cache with a peak bandwidth of sixty-four 64-bit values per cycle. The whole chip is backed by a memory controller capable of delivering over 64 GBytes/s of raw bandwidth. Tarantula extends the Alpha ISA with new vector instructions that operate on new architectural state. Salient features of the architecture and implementation are: (1) it fully integrates into a virtual-memory cache-coherent system without changes to its coherency protocol, (2) provides high bandwidth for non-unit stride memory accesses, (3) supports gather/scatter instructions efficiently, (4) fully integrates with the EV8 core with a narrow, streamlined interface, rather than acting as a co-processor, (5) can achieve a peak of 104 operations per cycle, and (6) achieves excellent "real-computation" per transistor and per watt ratios. Our detailed simulations show that Tarantula achieves an average speedup of 5X over EV8, out of a peak speedup in terms of flops of 8X. Furthermore, performance on gather/scatter intensive benchmarks such as Radix Sort is also remarkable: a speedup of almost 3X over EV8 and 15 sustained operations per cycle. Several benchmarks exceed 20 operations per cycle.
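The figures quoted in the abstract can be related to one another with a little arithmetic. The following is a minimal sketch, not part of the paper: the constant and function names are hypothetical, and the "implied EV8 peak" is only an inference that assumes the 8X peak speedup compares the 32 flops per cycle vector unit against EV8's peak flop rate.

```python
# Figures taken directly from the abstract.
VECTOR_FLOPS_PER_CYCLE = 32   # double-precision flops per cycle (vector unit)
L2_VALUES_PER_CYCLE = 64      # 64-bit values per cycle from the L2 cache
BYTES_PER_VALUE = 8           # one 64-bit value
PEAK_FLOP_SPEEDUP = 8         # peak speedup over EV8 in terms of flops
AVERAGE_SPEEDUP = 5           # average simulated speedup over EV8


def l2_bytes_per_cycle():
    """Peak L2 cache bandwidth expressed in bytes per cycle."""
    return L2_VALUES_PER_CYCLE * BYTES_PER_VALUE


def l2_values_per_flop():
    """64-bit values the L2 cache can supply per peak vector flop."""
    return L2_VALUES_PER_CYCLE / VECTOR_FLOPS_PER_CYCLE


def implied_ev8_flops_per_cycle():
    """ASSUMPTION: the 8X peak ratio is 32 flops/cycle versus EV8's peak."""
    return VECTOR_FLOPS_PER_CYCLE / PEAK_FLOP_SPEEDUP


def fraction_of_peak_speedup():
    """Share of the peak flop speedup realized on average in simulation."""
    return AVERAGE_SPEEDUP / PEAK_FLOP_SPEEDUP


if __name__ == "__main__":
    print("L2 bytes per cycle:", l2_bytes_per_cycle())
    print("L2 values per peak flop:", l2_values_per_flop())
    print("Implied EV8 flops per cycle:", implied_ev8_flops_per_cycle())
    print("Fraction of peak speedup achieved:", fraction_of_peak_speedup())
```

Under these assumptions the cache can supply two 64-bit values for every peak flop, and the reported average speedup corresponds to a bit under two thirds of the peak flop ratio.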
Pages: 281-292
Page count: 12
Related papers
22 in total
[1] [Anonymous], 1995, IEEE TCCA NEWSLETTER
[2] Asanovic K., 1995, HOT CHIPS, V7, P187
[3] BANNON P, 2001, MICROPROCESSOR F OCT
[4] BANNON P, 1998, MICROPROCESSOR F OCT
[5] *CRAY INC, 2001, CRAY SV1
[6] DIEFENDORFF K, 1999, MICROPROCESSOR REPOR, V13, P5
[7] Asim: A performance model framework [J]. Emer, J; Ahuja, P; Borch, E; Klauser, A; Luk, CK; Manne, S; Mukherjee, SS; Patil, H; Wallace, S; Binkert, N; Espasa, R; Juan, T. COMPUTER, 2002, 35 (02): 68-+
[8] EMER J, 1999, MICROPROCESSOR F OCT
[9] ESPASA R, 1997, IEEE MICRO SEP, P20
[10] Gwennap L., 1998, MICROPROCESSOR REPOR, V12, P12