Recognizing 50 human action categories of web videos

Cited by: 410
Authors
Reddy, Kishore K.
Shah, Mubarak
Affiliation
[1] University of Central Florida, 4000 Central Florida Blvd, Orlando
Keywords
Action recognition; Web videos; Fusion;
DOI
10.1007/s00138-012-0450-4
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Action recognition on large categories of unconstrained videos taken from the web is a very challenging problem compared to datasets like KTH (6 actions), IXMAS (13 actions), and Weizmann (10 actions). Challenges like camera motion, different viewpoints, large interclass variations, cluttered background, occlusions, bad illumination conditions, and poor quality of web videos cause the majority of the state-of-the-art action recognition approaches to fail. Also, an increased number of categories and the inclusion of actions with high confusion add to the challenges. In this paper, we propose using the scene context information obtained from moving and stationary pixels in the key frames, in conjunction with motion features, to solve the action recognition problem on a large (50 actions) dataset with videos from the web. We perform a combination of early and late fusion on multiple features to handle the very large number of categories. We demonstrate that scene context is a very important feature to perform action recognition on very large datasets. The proposed method does not require any kind of video stabilization, person detection, or tracking and pruning of features. Our approach gives good performance on a large number of action categories; it has been tested on the UCF50 dataset with 50 action categories, which is an extension of the UCF YouTube Action (UCF11) dataset containing 11 action categories. We also tested our approach on the KTH and HMDB51 datasets for comparison.
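As a rough illustration of the fusion strategy mentioned in the abstract, the sketch below contrasts early fusion (concatenating per-channel feature histograms before training a single classifier) with late fusion (training one classifier per channel and averaging their class probabilities). It is a hypothetical Python example, not the authors' implementation: the feature dimensions, the SVM classifier, and the fusion weight are placeholder assumptions.

    # Hypothetical sketch of early vs. late fusion of two feature channels
    # (a motion descriptor and a scene-context descriptor). All names,
    # dimensions, and weights are placeholders, not taken from the paper.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Placeholder bag-of-words histograms per video clip (random stand-ins).
    n_classes, clips_per_class = 50, 10
    n_clips = n_classes * clips_per_class
    motion_feats = rng.random((n_clips, 1000))  # e.g. motion-descriptor codebook
    scene_feats = rng.random((n_clips, 500))    # e.g. scene-context codebook
    labels = np.repeat(np.arange(n_classes), clips_per_class)

    # Early fusion: concatenate the per-channel histograms, train one classifier.
    early_clf = SVC(kernel="rbf", probability=True)
    early_clf.fit(np.hstack([motion_feats, scene_feats]), labels)

    # Late fusion: train one classifier per channel, then combine their
    # class-probability outputs (here a simple weighted average).
    motion_clf = SVC(kernel="rbf", probability=True).fit(motion_feats, labels)
    scene_clf = SVC(kernel="rbf", probability=True).fit(scene_feats, labels)

    def late_fusion_predict(motion_x, scene_x, w=0.5):
        """Predict labels from a weighted average of per-channel probabilities."""
        fused = (w * motion_clf.predict_proba(motion_x)
                 + (1 - w) * scene_clf.predict_proba(scene_x))
        # Both classifiers were fit on the same label set, so their classes_
        # arrays share the same ordering and the probabilities align column-wise.
        return motion_clf.classes_[fused.argmax(axis=1)]

    print(early_clf.predict(np.hstack([motion_feats, scene_feats]))[:5])
    print(late_fusion_predict(motion_feats, scene_feats)[:5])

In practice the fusion weight and the per-channel descriptors would be chosen by validation; the random histograms here merely stand in for real motion and scene-context features.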
Pages: 971-981
Page count: 11