A multiresolution manifold distance for invariant image similarity

Cited by: 36
Authors
Vasconcelos, N [1]
Lippman, A
Affiliations
[1] Univ Calif San Diego, Dept Elect & Comp Engn, La Jolla, CA 92093 USA
[2] MIT, Media Lab, Cambridge, MA 02139 USA
Keywords
Affine transformations; face recognition; image similarity; invariance; manifold distance; multiresolution; robust estimators; semantic movie classification; tangent distance
DOI
10.1109/TMM.2004.840596
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Accounting for spatial image transformations is a requirement for multimedia problems such as video classification and retrieval, face/object recognition, or the creation of image mosaics from video sequences. We analyze a transformation-invariant metric recently proposed in the machine learning literature to measure the distance between image manifolds - the tangent distance (TD) - and show that it is closely related to alignment techniques from the motion analysis literature. Exposing these relationships results in benefits for the two domains. On one hand, it allows leveraging the knowledge acquired in the alignment literature to build better classifiers. On the other, it provides a new interpretation of alignment techniques as one component of a decomposition that has interesting properties for the classification of video. In particular, we embed the TD into a multiresolution framework that makes it significantly less prone to local minima. The new metric - multiresolution tangent distance (MRTD) - can be easily combined with robust estimation procedures, and exhibits significantly higher invariance to image transformations than the TD and the Euclidean distance (ED). For classification, this translates into significant improvements in face recognition accuracy. For video characterization, it leads to a decomposition of image dissimilarity into "differences due to camera motion" plus "differences due to scene activity" that is useful for classification. Experimental results on a movie database indicate that the distance could be used as a basis for the extraction of semantic primitives such as action and romance.
Pages: 127-142
Number of pages: 16