Selection and Context for Action Recognition

Cited by: 43
Authors
Han, Dong [1]
Bo, Liefeng [2]
Sminchisescu, Cristian [1]
Affiliations
[1] Univ Bonn, D-5300 Bonn, Germany
[2] TTI, Chicago, IL, USA
Source
2009 IEEE 12th International Conference on Computer Vision (ICCV) | 2009
DOI
10.1109/ICCV.2009.5459427
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Recognizing human actions in non-instrumented video is challenging not only because of the variability produced by general scene factors such as illumination, background, occlusion, and intra-class variation, but also because of the subtle behavioral patterns among interacting people, or between people and objects, in the images. To improve recognition, a system may need to use not only low-level spatio-temporal video correlations but also relational descriptors between the people and objects in the scene. In this paper we present contextual scene descriptors and Bayesian multiple kernel learning methods for recognizing human actions in complex non-instrumented video. Our contribution is threefold: (1) we introduce bag-of-detector scene descriptors that encode the presence or absence of object parts and the structural relations between them; (2) we derive a novel Bayesian classification method based on Gaussian processes with multiple kernel covariance functions (MKGPC), which automatically selects and weights multiple features, both low-level and high-level, from a large collection in a principled way; and (3) we perform a large-scale evaluation using a variety of features on the KTH dataset and on a recently introduced, challenging Hollywood movie dataset. On the KTH dataset we obtain 94.1% accuracy, the best result reported to date. On the Hollywood dataset we obtain promising results in several action classes while using fewer descriptors, with an improvement of about 9.1% over a previous benchmark.
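The abstract does not state the MKGPC equations, so the following is only a minimal sketch of the general idea of a multiple-kernel covariance function: a non-negative weighted sum of per-feature kernels, K = w_1 K_1 + w_2 K_2 + noise * I, with the weights chosen by maximizing a Gaussian-process log marginal likelihood. It rests on loudly stated assumptions that are not the authors' method: the +1/-1 labels are treated as regression targets (GP regression swapped in for the paper's Bayesian classifier), there are only two scalar feature channels with an RBF kernel, the weights are searched on a coarse grid and constrained to sum to one, and all function names (`rbf`, `gram`, `combine`, `select_weights`) and the toy data are hypothetical.

```python
import math

def rbf(x, z, ell):
    """Squared-exponential (RBF) kernel on scalar features (assumed kernel choice)."""
    return math.exp(-((x - z) ** 2) / (2.0 * ell ** 2))

def gram(xs, ell):
    """Gram matrix of one feature channel."""
    return [[rbf(a, b, ell) for b in xs] for a in xs]

def combine(weights, grams, noise):
    """Multiple-kernel covariance: sum_m w_m * K_m plus noise on the diagonal."""
    n = len(grams[0])
    return [[sum(w * g[i][j] for w, g in zip(weights, grams)) + (noise if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def cholesky(a):
    """Lower-triangular L with L L^T = a (a must be symmetric positive definite)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def log_marginal_likelihood(K, y):
    """GP regression evidence: -1/2 y^T K^-1 y - 1/2 log|K| - n/2 log(2 pi)."""
    n = len(y)
    L = cholesky(K)
    # Forward substitution solves L z = y, so y^T K^-1 y = ||z||^2.
    z = [0.0] * n
    for i in range(n):
        z[i] = (y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    data_fit = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * data_fit - 0.5 * log_det - 0.5 * n * math.log(2.0 * math.pi)

def select_weights(grams, y, noise=0.1, steps=4):
    """Grid search over (w, 1 - w) for two channels; keep the highest evidence."""
    best = None
    for k in range(steps + 1):
        w = k / steps
        weights = (w, 1.0 - w)
        lml = log_marginal_likelihood(combine(weights, grams, noise), y)
        if best is None or lml > best[1]:
            best = (weights, lml)
    return best

# Hypothetical toy data: 6 clips with labels +1 / -1.
labels = [1, 1, 1, -1, -1, -1]
channel_a = [0.0, 0.1, 0.2, 3.0, 3.1, 3.2]     # separates the two classes
channel_b = [0.0, 1.0, 2.0, 0.05, 1.05, 2.05]  # places opposite labels side by side
grams = [gram(channel_a, 0.5), gram(channel_b, 0.5)]
weights, lml = select_weights(grams, labels)
print("selected weights:", weights, "log marginal likelihood:", round(lml, 2))
```

In this toy setup the evidence penalizes the uninformative channel, so the search moves weight toward `channel_a`; a real system would optimize continuous weights over many channels and use a proper classification likelihood.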
Pages: 1933-1940
Page count: 8
Related papers (34 in total)
  • [21] Laptev, I.; Lindeberg, T. Space-time interest points. Ninth IEEE International Conference on Computer Vision, Vols. I and II, Proceedings, 2003: 432-439.
  • [22] Laptev, I. CVPR, June 2008.
  • [23] Lowe, D. G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 2004, 60(2): 91-110.
  • [24] Mann, R.; Jepson, A.; Siskind, J. M. The computational perception of scene dynamics. Computer Vision and Image Understanding, 1997, 65(2): 113-128.
  • [25] Marszalek, M. CVPR, June 2009.
  • [26] Niebles, J. C.; Wang, H.; Fei-Fei, L. Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 2008, 79(3): 299-318.
  • [27] Rasmussen, C. E. Adaptive Computation and Machine Learning, 2005, p. 1.
  • [28] Ryoo, M. SEM LEARN WORKSH CVP, 2007.
  • [29] Savarese, S. WMVC, 2008, p. 1.
  • [30] Scovanner, P. ACM Multimedia, 2007.