Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Cited by: 4452

Authors:

He, Kaiming [1]

Zhang, Xiangyu [2]

Ren, Shaoqing [3]

Sun, Jian [1]

Affiliations:

[1] Visual Comp Grp, Microsoft Res, Beijing 100080, Peoples R China

[2] Xi An Jiao Tong Univ, Dept Elect Engn, Xian 710049, Peoples R China

[3] Univ Sci & Technol China, Dept Elect Sci & Technol, Hefei 230026, Peoples R China

Source:

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE | 2015 / Vol. 37 / No. 09

Keywords:

Convolutional neural networks; spatial pyramid pooling; image classification; object detection;

DOI:

10.1109/TPAMI.2015.2389824

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224x224) input image. This requirement is "artificial" and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, "spatial pyramid pooling", to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning. The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102x faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007. In ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvement made for this competition.
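The fixed-length property described in the abstract can be illustrated with a minimal sketch. It assumes a single-channel feature map stored as a list of lists, max pooling in each bin, and a hypothetical three-level pyramid of 1x1, 2x2, and 4x4 bins; the function name `spp_max_pool` and the floor/ceil bin boundaries are illustrative choices, not the authors' exact implementation.

```python
import math


def spp_max_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool one channel into n x n bins for every pyramid level n.

    The output length is sum(n * n for n in levels), whatever the
    height and width of feature_map are.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Assumed bin edges: floor for the start, ceil for the end,
            # so the bins always cover the whole map and are never empty.
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(
                    max(feature_map[r][c]
                        for r in range(r0, r1)
                        for c in range(c0, c1))
                )
    return pooled


# Two feature maps of different sizes yield vectors of the same length.
small = [[r * 13 + c for c in range(13)] for r in range(13)]
large = [[r * 20 + c for c in range(20)] for r in range(17)]
print(len(spp_max_pool(small)), len(spp_max_pool(large)))  # 21 21

# Detection-style use: pool an arbitrary window of an already computed map.
region = [row[2:9] for row in large[3:15]]
print(len(spp_max_pool(region)))  # 21
```

In this sketch the feature map is computed once and any rectangular window of it can be pooled to the same 21-value vector, which mirrors the idea of avoiding repeated convolutional computation per region.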

Citation

Pages: 1904-1916

Number of pages: 13

