Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA

Cited by: 245

Authors:

Ma, Yufei [1]

Cao, Yu [1]

Vrudhula, Sarma [2]

Seo, Jae-sun [1]

Affiliations:

[1] Arizona State Univ, Sch Elect Comp & Energy Engn, Tempe, AZ 85287 USA

[2] Arizona State Univ, Sch Comp Informat Decis Syst Engn, Tempe, AZ 85287 USA

Source:

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS | 2018 / Vol. 26 / No. 07

Funding:

U.S. National Science Foundation;

关键词：

Accelerator architectures; convolutional neural networks (CNNs); field-programmable gate array (FPGA); neural network hardware;

DOI:

10.1109/TVLSI.2018.2815603

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

As convolution contributes most of the operations in a convolutional neural network (CNN), the convolution acceleration scheme significantly affects the efficiency and performance of a hardware CNN accelerator. Convolution involves multiply and accumulate operations with four levels of loops, which results in a large design space. Prior works either employ limited loop optimization techniques, e.g., loop unrolling, tiling, and interchange, or only tune some of the design variables after the accelerator architecture and dataflow are already fixed. Without fully studying the convolution loop optimization before the hardware design phase, the resulting accelerator can hardly exploit data reuse or manage data movement efficiently. This paper overcomes these barriers by quantitatively analyzing and optimizing the design objectives (e.g., memory access) of the CNN accelerator based on multiple design variables. Then, we propose a specific dataflow of hardware CNN acceleration to minimize the data communication while maximizing the resource utilization to achieve high performance. The proposed CNN acceleration scheme and architecture are demonstrated by implementing end-to-end CNNs including NiN, VGG-16, and ResNet-50/ResNet-152 for inference. For the VGG-16 CNN, the overall throughput reaches 348 GOPS and 715 GOPS on Intel Stratix V and Arria 10 FPGAs, respectively.
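The "four levels of loops" mentioned in the abstract can be illustrated with a plain software loop nest. The sketch below is not taken from the paper; the function name `conv_layer`, the nested-list data layout, and the absence of padding and bias are all assumptions made for illustration. It only shows where the four loop levels sit (kernel window, input feature maps, spatial scan, output feature maps), which are the loops that unrolling, tiling, and interchange would then act on.

```python
def conv_layer(inputs, weights, stride=1):
    """Naive convolution layer (hypothetical helper, no padding, no bias).

    inputs[i][y][x]       : input feature maps
    weights[o][i][ky][kx] : one kernel per (output map, input map) pair
    returns outputs[o][oy][ox]
    """
    n_in = len(inputs)
    in_h, in_w = len(inputs[0]), len(inputs[0][0])
    n_out = len(weights)
    k_h, k_w = len(weights[0][0]), len(weights[0][0][0])
    out_h = (in_h - k_h) // stride + 1
    out_w = (in_w - k_w) // stride + 1
    outputs = [[[0.0] * out_w for _ in range(out_h)] for _ in range(n_out)]

    for o in range(n_out):                      # level 4: across output feature maps
        for oy in range(out_h):                 # level 3: scan over one feature map
            for ox in range(out_w):
                acc = 0.0
                for i in range(n_in):           # level 2: across input feature maps
                    for ky in range(k_h):       # level 1: MACs inside the kernel window
                        for kx in range(k_w):
                            pixel = inputs[i][oy * stride + ky][ox * stride + kx]
                            acc += pixel * weights[o][i][ky][kx]
                outputs[o][oy][ox] = acc
    return outputs


# Usage: one 3x3 input map convolved with one 2x2 kernel of ones.
example_in = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
example_w = [[[[1, 1], [1, 1]]]]
example_out = conv_layer(example_in, example_w)
```

In this software form the loop order is arbitrary; a hardware design has to decide which of these loops to unroll into parallel multipliers and which to tile into on-chip buffers, which is the design space the abstract refers to.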

References

Pages: 1354-1367

Page count: 14

24 entries in total

[1] [Anonymous], PROC CVPR IEEE
[2] [Anonymous], 2015, 3 INT C LEARN REPR I
[3] [Anonymous], 2017, FPGA 17 P 2017 ACMSI, DOI 10.1145/3020078.3021738
[4] [Anonymous], NETWORK IN NETWORK
[5] Bacon, DF; Graham, SL; Sharp, OJ. COMPILER TRANSFORMATIONS FOR HIGH-PERFORMANCE COMPUTING [J]. ACM COMPUTING SURVEYS, 1994, 26 (04): 345-420
[6] Bosi, B; Bois, G; Savaria, Y. Reconfigurable pipelined 2-D convolvers for fast digital signal processing [J]. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, 1999, 7 (03): 299-308
[7] Chen, Yu-Hsin; Krishna, Tushar; Emer, Joel S.; Sze, Vivienne. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks [J]. IEEE JOURNAL OF SOLID-STATE CIRCUITS, 2017, 52 (01): 127-138
[8] Chen, Yu-Hsin; Emer, Joel; Sze, Vivienne. Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks [J]. 2016 ACM/IEEE 43RD ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE (ISCA), 2016: 367-379
[9] Du, Li; Du, Yuan; Li, Yilei; Su, Junjie; Kuan, Yen-Cheng; Liu, Chun-Chen; Chang, Mau-Chung Frank. A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for Internet of Things [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I-REGULAR PAPERS, 2018, 65 (01): 198-208
[10] Guan, Yijin; Liang, Hao; Xu, Ningyi; Wang, Wenqiang; Shi, Shaoshuai; Chen, Xi; Sun, Guangyu; Zhang, Wei; Cong, Jason. FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates [J]. 2017 IEEE 25TH ANNUAL INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES (FCCM 2017), 2017: 152-159
