VQA: Visual Question Answering

Cited by: 218
Authors
Agrawal, Aishwarya [1]
Lu, Jiasen [1]
Antol, Stanislaw [1]
Mitchell, Margaret [2]
Zitnick, C. Lawrence [3]
Parikh, Devi [4]
Batra, Dhruv [4]
Affiliations
[1] Virginia Tech, Blacksburg, VA 24061 USA
[2] Microsoft Res, Redmond, WA USA
[3] Facebook AI Res, Menlo Pk, CA USA
[4] Georgia Inst Technol, Atlanta, GA USA
Funding
National Science Foundation (USA);
Keywords
Visual Question Answering;
DOI
10.1007/s11263-016-0966-6
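Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
140502 [Artificial intelligence];
Abstract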
We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing ~0.25M images, ~0.76M questions, and ~10M answers (www.visualqa.org) and discuss the information it provides. Numerous baselines and methods for VQA are provided and compared with human performance. Our VQA demo is available on CloudCV (http://cloudcv.org/vqa).
Pages: 4-31
Page count: 28