Deep visual tracking: Review and experimental comparison

Cited by: 414
Authors
Li, Peixia [1 ]
Wang, Dong [1 ]
Wang, Lijun [1 ]
Lu, Huchuan [1 ]
Affiliations
[1] Dalian Univ Technol, Fac Elect Informat & Elect Engn, Sch Informat & Commun Engn, Dalian, Peoples R China
Keywords
Visual tracking; Deep learning; CNN; RNN; Pre-training; Online learning; Object tracking; Neural networks
DOI
10.1016/j.patcog.2017.11.007
CLC Classification
TP18 [Theory of artificial intelligence];
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Recently, deep learning has achieved great success in visual tracking. The goal of this paper is to review the state-of-the-art tracking methods based on deep learning. First, we introduce the background of deep visual tracking, including the fundamental concepts of visual tracking and related deep learning algorithms. Second, we categorize the existing deep-learning-based trackers into three classes according to network structure, network function, and network training; for each class, we explain its rationale from the network perspective and analyze the papers that fall into it. Then, we conduct extensive experiments to compare the representative methods on the popular OTB-100, TC-128, and VOT2015 benchmarks. Based on our observations, we conclude that: (1) using a convolutional neural network (CNN) model can significantly improve tracking performance; (2) trackers that use a CNN model to distinguish the tracked object from its surrounding background tend to be more accurate, while those that use the CNN model for template matching are usually faster; (3) trackers with deep features perform much better than those with low-level hand-crafted features; (4) deep features from different convolutional layers have different characteristics, and an effective combination of them usually yields a more robust tracker; (5) deep visual trackers using end-to-end networks usually perform better than those merely using feature extraction networks; and (6) for visual tracking, the most suitable network training scheme is to pre-train networks with video information and fine-tune them online with subsequent observations. Finally, we summarize the paper, highlight our insights, and point out future trends for deep visual tracking. (C) 2017 Elsevier Ltd. All rights reserved.
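To make findings (2), (4), and (6) concrete, the sketch below is a minimal, hypothetical PyTorch example, not the implementation of any tracker reviewed in the paper. It fuses deep features from three convolutional layers of a pre-trained VGG-16 (finding (4)) and fine-tunes a small target-versus-background head online (findings (2) and (6); note that finding (6) actually favours pre-training with video information, which is replaced here by ImageNet weights for simplicity). The layer indices, fusion weights, and the fused_features/online_update helpers are illustrative assumptions.

# Minimal sketch: multi-layer feature fusion + online discriminative update.
# Requires torch and torchvision (>= 0.13 for the weights="DEFAULT" API).
import torch
import torch.nn.functional as F
import torchvision

# ImageNet pre-trained VGG-16 backbone, used as a fixed feature extractor.
backbone = torchvision.models.vgg16(weights="DEFAULT").features.eval()

# Post-ReLU outputs of conv3_3, conv4_3, conv5_3 (illustrative choice).
LAYERS = {15: "conv3_3", 22: "conv4_3", 29: "conv5_3"}
feats = {}

def _hook(name):
    def fn(module, inp, out):
        feats[name] = out.detach()
    return fn

for idx, name in LAYERS.items():
    backbone[idx].register_forward_hook(_hook(name))

def fused_features(patch, size=(28, 28), weights=(0.25, 0.5, 1.0)):
    # Finding (4): shallow layers keep spatial detail, deep layers carry
    # semantics; resize all maps to one grid and weight deeper layers more.
    feats.clear()
    with torch.no_grad():
        backbone(patch)
    maps = [w * F.interpolate(feats[n], size=size, mode="bilinear",
                              align_corners=False)
            for w, n in zip(weights, LAYERS.values())]
    return torch.cat(maps, dim=1)          # (N, 256+512+512, 28, 28)

# Findings (2)/(6): a tiny discriminative head, fine-tuned online on
# target-vs-background crops taken from the frames seen so far.
head = torch.nn.Sequential(
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
    torch.nn.Linear(1280, 2))              # 2 classes: target / background
opt = torch.optim.SGD(head.parameters(), lr=1e-3, momentum=0.9)

def online_update(patches, labels, steps=5):
    # patches: (N, 3, 224, 224) crops; labels: (N,) with 1 = target.
    x = fused_features(patches)            # backbone stays frozen
    for _ in range(steps):                 # a few quick gradient steps
        loss = F.cross_entropy(head(x), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

# Toy usage with random stand-in crops.
patches = torch.rand(8, 3, 224, 224)
labels = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0])
online_update(patches, labels)

In a real tracker, the positive and negative crops would be resampled around the estimated target in each new frame, and the head's response over a dense set of candidate locations would determine the next position.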
Pages: 323-338
Number of pages: 16
Related References
105 entries in total
[31]   Deep Relative Tracking [J].
Gao, Junyu ;
Zhang, Tianzhu ;
Yang, Xiaoshan ;
Xu, Changsheng .
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2017, 26 (04) :1845-1858
[32]  
Girshick R., 2014, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, DOI 10.1109/CVPR.2014.81
[33]   Fast R-CNN [J].
Girshick, Ross .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :1440-1448
[34]   Region-Based Convolutional Networks for Accurate Object Detection and Segmentation [J].
Girshick, Ross ;
Donahue, Jeff ;
Darrell, Trevor ;
Malik, Jitendra .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2016, 38 (01) :142-158
[35]  
Gordon D., 2017, arXiv preprint arXiv:1705.06368
[36]  
Hahn M., 2015, CLIN ORTHOPAEDICS RE
[37]   Local Sparse Structure Denoising for Low-Light-Level Image [J].
Han, Jing ;
Yue, Jiang ;
Zhang, Yi ;
Bai, Lianfa .
IEEE TRANSACTIONS ON IMAGE PROCESSING, 2015, 24 (12) :5177-5192
[38]   Struck: Structured Output Tracking with Kernels [J].
Hare, Sam ;
Golodetz, Stuart ;
Saffari, Amir ;
Vineet, Vibhav ;
Cheng, Ming-Ming ;
Hicks, Stephen L. ;
Torr, Philip H. S. .
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2016, 38 (10) :2096-2109
[39]   Mask R-CNN [J].
He, Kaiming ;
Gkioxari, Georgia ;
Dollar, Piotr ;
Girshick, Ross .
2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2017, :2980-2988
[40]   Deep Residual Learning for Image Recognition [J].
He, Kaiming ;
Zhang, Xiangyu ;
Ren, Shaoqing ;
Sun, Jian .
2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2016, :770-778