Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

Cited by: 4033
Authors
Krishna, Ranjay [1 ]
Zhu, Yuke [1 ]
Groth, Oliver [2 ]
Johnson, Justin [1 ]
Hata, Kenji [1 ]
Kravitz, Joshua [1 ]
Chen, Stephanie [1 ]
Kalantidis, Yannis [3 ]
Li, Li-Jia [4 ]
Shamma, David A. [5 ]
Bernstein, Michael S. [1 ]
Li Fei-Fei [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
[2] Tech Univ Dresden, Dresden, Germany
[3] Yahoo Inc, San Francisco, CA USA
[4] Snapchat Inc, Los Angeles, CA USA
[5] Ctr Wiskunde & Informat, Amsterdam, Netherlands
Keywords
Computer vision; Dataset; Image; Scene graph; Question answering; Objects; Attributes; Relationships; Knowledge; Language; Crowdsourcing; Database; Knowledge; WordNet; Object
DOI
10.1007/s11263-016-0981-7
CLC classification number
TP18 [Artificial Intelligence Theory]
Discipline classification number
140502 [Artificial Intelligence]
Abstract
Despite progress in perceptual tasks such as image classification, computers still perform poorly on cognitive tasks such as image description and question answering. Cognition is core to tasks that involve not just recognizing, but reasoning about, our visual world. However, models used to tackle the rich content in images for cognitive tasks are still being trained on the same datasets designed for perceptual tasks. To succeed at cognitive tasks, models need to understand the interactions and relationships between objects in an image. When asked "What vehicle is the person riding?", computers need to identify the objects in an image as well as the relationships riding(man, carriage) and pulling(horse, carriage) to answer correctly that "the person is riding a horse-drawn carriage." In this paper, we present the Visual Genome dataset to enable the modeling of such relationships. We collect dense annotations of objects, attributes, and relationships within each image to learn these models. Specifically, our dataset contains over 108K images where each image has an average of 35 objects, 26 attributes, and 21 pairwise relationships between objects. We canonicalize the objects, attributes, relationships, and noun phrases in region descriptions and question-answer pairs to WordNet synsets. Together, these annotations represent the densest and largest dataset of image descriptions, objects, attributes, relationships, and question-answer pairs.
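The abstract's scene-graph representation can be illustrated with a short sketch. This is not the official Visual Genome API; the class names, fields, and the WordNet synset strings are illustrative assumptions. It encodes the abstract's own example, riding(man, carriage) and pulling(horse, carriage), and shows how the question "What vehicle is the person riding?" reduces to a lookup over relationship triples.

```python
# Minimal sketch (hypothetical structures, not the Visual Genome API):
# a scene graph as objects with attributes plus relationship triples,
# each object canonicalized to a WordNet synset as in the abstract.
from dataclasses import dataclass, field


@dataclass
class SceneObject:
    name: str                  # surface name, e.g. "man"
    synset: str                # canonicalized WordNet synset, e.g. "man.n.01"
    attributes: list = field(default_factory=list)


@dataclass
class Relationship:
    predicate: str             # e.g. "riding"
    subject: SceneObject
    object: SceneObject


# Encode the abstract's example: riding(man, carriage), pulling(horse, carriage)
man = SceneObject("man", "man.n.01")
horse = SceneObject("horse", "horse.n.01", attributes=["brown"])
carriage = SceneObject("carriage", "carriage.n.02")

graph = [
    Relationship("riding", man, carriage),
    Relationship("pulling", horse, carriage),
]

# "What vehicle is the person riding?" becomes a graph query over triples.
answer = next(r.object.name for r in graph
              if r.predicate == "riding" and r.subject.name == "man")
print(answer)  # -> carriage
```

Attributes attach to objects rather than to the image as a whole, which is what lets the dataset support relationship-level reasoning instead of whole-image labels.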
Pages: 32-73
Number of pages: 42
Related papers
111 records in total
[71]
Ask Your Neurons: A Neural-based Approach to Answering Questions about Images [J].
Malinowski, Mateusz ;
Rohrbach, Marcus ;
Fritz, Mario .
2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, :1-9
[72]
Malisiewicz, Tomasz, 2008, CVPR, P1
[73]
The Stanford CoreNLP Natural Language Processing Toolkit [J].
Manning, Christopher D. ;
Surdeanu, Mihai ;
Bauer, John ;
Finkel, Jenny ;
Bethard, Steven J. ;
McClosky, David .
PROCEEDINGS OF 52ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: SYSTEM DEMONSTRATIONS, 2014, :55-60
[74]
Mao, Junhua, 2014, Explain Images with Multimodal Recurrent Neural Networks
[75]
Mikolov, T., 2013, PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS (ICLR), DOI 10.48550/ARXIV.1301.3781
[76]
WordNet: A Lexical Database for English [J].
Miller, G. A.
COMMUNICATIONS OF THE ACM, 1995, 38 (11) :39-41
[77]
Elementary: Large-Scale Knowledge-Base Construction via Machine Learning and Statistical Inference [J].
Niu, Feng ;
Zhang, Ce ;
Re, Christopher ;
Shavlik, Jude .
INTERNATIONAL JOURNAL ON SEMANTIC WEB AND INFORMATION SYSTEMS, 2012, 8 (03) :42-73
[78]
Ordonez, V., 2011, ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS
[79]
BLEU: a method for automatic evaluation of machine translation [J].
Papineni, K ;
Roukos, S ;
Ward, T ;
Zhu, WJ .
40TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE CONFERENCE, 2002, :311-318
[80]
The SUN Attribute Database: Beyond Categories for Deeper Scene Understanding [J].
Patterson, Genevieve ;
Xu, Chen ;
Su, Hang ;
Hays, James .
INTERNATIONAL JOURNAL OF COMPUTER VISION, 2014, 108 (1-2) :59-81