ImageNet Large Scale Visual Recognition Challenge

Published on Dec 1, 2015 in International Journal of Computer Vision (6.071)
DOI: 10.1007/s11263-015-0816-y
Olga Russakovsky (Stanford University), Estimated H-index: 17
Jia Deng (UM: University of Michigan), Estimated H-index: 31
+ 9 Authors
Li Fei-Fei (Stanford University), Estimated H-index: 84
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.
References (99)
#1 Mark Everingham (University of Leeds), H-Index: 26
#2 S. M. Eslami (Microsoft), H-Index: 1
Last of 6 authors: Andrew Zisserman (University of Oxford), H-Index: 133
The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008–2012. The paper is intended for two audiences: algorithm designers, researchers who wa...
1,419 Citations
#1 Qiang Chen (IBM), H-Index: 29
#2 Zheng Song (NUS: National University of Singapore), H-Index: 14
Last of 6 authors: Shuicheng Yan (NUS: National University of Singapore), H-Index: 87
We investigate how to iteratively and mutually boost object classification and detection performance by taking the outputs from one task as the context of the other one. While context models have been quite popular, previous works mainly concentrate on co-occurrence relationship within classes and few of them focus on contextualization from a top-down perspective, i.e. high-level task context. In this paper, our system adopts a new method for adaptive context modeling and iterative boosting. Fir...
121 Citations
#1 Harsh Agrawal (VT: Virginia Tech), H-Index: 8
#2 Clint Solomon Mathialagan (VT: Virginia Tech), H-Index: 3
Last of 8 authors: Dhruv Batra (VT: Virginia Tech), H-Index: 38
We are witnessing a proliferation of massive visual data. Unfortunately, scaling existing computer vision algorithms to large datasets leaves researchers repeatedly solving the same algorithmic, logistical, and infrastructural problems. Our goal is to democratize computer vision; one should not have to be a computer vision, big data, and distributed computing expert to have access to state-of-the-art distributed computer vision algorithms. We present CloudCV, a comprehensive system to provide ac...
26 Citations
Jan 1, 2014 in NeurIPS (Neural Information Processing Systems)
#1 Bolei Zhou (MIT: Massachusetts Institute of Technology), H-Index: 31
#2 Agata Lapedriza (MIT: Massachusetts Institute of Technology), H-Index: 15
Last of 5 authors: Aude Oliva (MIT: Massachusetts Institute of Technology), H-Index: 61
Scene recognition is one of the hallmark tasks of computer vision, allowing definition of a context for object recognition. Whereas the tremendous recent progress in object recognition tasks is due to the availability of large datasets like ImageNet and the rise of Convolutional Neural Networks (CNNs) for learning high-level features, performance at scene recognition has not attained the same level of success. This may be because current deep features trained from ImageNet are not competitive en...
1,578 Citations
Nov 3, 2014 in MM (ACM Multimedia)
#1 Yangqing Jia (Google), H-Index: 29
#2 Evan Shelhamer (University of California, Berkeley), H-Index: 15
Last of 8 authors: Trevor Darrell (University of California, Berkeley), H-Index: 103
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a ...
6,725 Citations
#1 Wanli Ouyang, H-Index: 39
#2 Ping Luo, H-Index: 33
Last of 15 authors: Xiaoou Tang, H-Index: 99
In this paper, we propose multi-stage and deformable deep convolutional neural networks for object detection. This new deep learning object detection diagram has innovations in multiple aspects. In the proposed new deep architecture, a new deformation constrained pooling (def-pooling) layer models the deformation of object parts with geometric constraint and penalty. With the proposed multi-stage training strategy, multiple classifiers are jointly optimized to process samples at different diffic...
95 Citations
Sep 6, 2014 in ECCV (European Conference on Computer Vision)
#1 Tsung-Yi Lin (Cornell University), H-Index: 16
#2 Michael Maire (California Institute of Technology), H-Index: 23
Last of 8 authors: C. Lawrence Zitnick (Microsoft), H-Index: 49
3,559 Citations
Sep 6, 2014 in ECCV (European Conference on Computer Vision)
#1 Kaiming He (Microsoft), H-Index: 57
#2 Xiangyu Zhang (Xi'an Jiaotong University), H-Index: 25
Last of 4 authors: Jian Sun (Microsoft), H-Index: 88
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing th...
1,494 Citations
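The SPP-net abstract above describes pooling that yields a fixed-length representation regardless of input size. A minimal NumPy sketch of that idea follows. The function name `spatial_pyramid_pool`, the pyramid levels `(1, 2, 4)`, and the floor/ceil bin-edge rule are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map into a fixed-length vector.

    For each pyramid level n, the map is divided into an n x n grid of
    bins and the per-channel maximum of each bin is kept. The output
    length is C * sum(n * n for n in levels), independent of H and W.
    """
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin rows span floor(i*H/n) .. ceil((i+1)*H/n), so bins
            # cover the whole map and are never empty.
            r0 = (i * height) // n
            r1 = -(-((i + 1) * height) // n)
            for j in range(n):
                c0 = (j * width) // n
                c1 = -(-((j + 1) * width) // n)
                bin_max = feature_map[:, r0:r1, c0:c1].max(axis=(1, 2))
                pooled.append(bin_max)
    return np.concatenate(pooled)


# Two inputs with different spatial sizes map to the same output length:
# 8 channels * (1 + 4 + 16) bins = 168 values.
small = spatial_pyramid_pool(np.random.rand(8, 13, 17))
large = spatial_pyramid_pool(np.random.rand(8, 30, 22))
print(small.shape, large.shape)  # (168,) (168,)
```

Because the number of bins is fixed by the pyramid levels rather than by the input, a downstream fully connected layer can accept the pooled vector without resizing or cropping the image first.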
17.2k Citations
Jun 23, 2014 in CVPR (Computer Vision and Pattern Recognition)
#1 Ross Girshick (University of California, Berkeley), H-Index: 65
#2 Jeff Donahue (University of California, Berkeley), H-Index: 33
Last of 4 authors: Jitendra Malik (University of California, Berkeley), H-Index: 109
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights:...
7,521 Citations
Cited By (9236)
#1 Zilong Hu (MTU: Michigan Technological University), H-Index: 2
#2 Jinshan Tang (MTU: Michigan Technological University), H-Index: 20
Last of 4 authors: Jingfeng Jiang (MTU: Michigan Technological University)
Automated bruised apple detection is an important application in the fruit industry. In this paper, we investigated convolutional neural network-based predictive models for the identification of bruised apples based on shape information (in a form of three-dimensional [3D] surface meshes) acquired from a 3D infrared imaging system. There are often irregularities on bruised apple surfaces, which can be used to differentiate those bruised apples from unbruised ones. In this study, we adop...
#1 Peng Liu (HIT: Harbin Institute of Technology), H-Index: 7
#2 Ting Xiao (HIT: Harbin Institute of Technology), H-Index: 1
Last of 6 authors: Hongwei Liu (HIT: Harbin Institute of Technology), H-Index: 1
In the construction of expert and intelligent systems, annotating and curating large datasets is very expensive; hence, there is a need to transfer the knowledge from existing annotated datasets to unlabeled data. However, data that are relevant for a specific application usually differ from publicly available datasets because they are sampled from a different domain. Domain adaptation (DA) has emerged as an efficient technique to compensate for such a domain shift. Recent studies have ...
#1 Amin Nasiri, H-Index: 1
#2 Mahmoud Omid, H-Index: 36
Last of 3 authors: Amin Taheri-Garavand, H-Index: 6
Egg quality and safety are significant concerns of consumers and modern food industries. This study proposes a novel and precise assessment of egg sorting using a deep convolutional neural network (CNN), which is a state-of-the-art computer vision method to perform classification tasks. To classify unwashed egg images, VGG16 architecture was modified by a global average pooling layer, dense layers, a batch normalization layer, and a dropout layer. The modified model was trained based on...
#1 Jiajie Liu (SYSU: Sun Yat-sen University)
#2 Han Li (SYSU: Sun Yat-sen University)
Last of 6 authors: Long Chen (SYSU: Sun Yat-sen University), H-Index: 15
Recently, computer vision has achieved remarkable accomplishments in many domains under the thriving of deep learning. Scene flow estimation turns from the classical manual feature construction to the deep convolutional neural network (DCNN) approaches. In this paper, we review recent works about scene flow, mainly focusing on DCNN methods. We present some milestones of scene flow in recent years, and categorize these methods into supervised and unsupervised based methods. Meanwhile, we...
Sep 21, 2020 in MOBICOM (ACM/IEEE International Conference on Mobile Computing and Networking)
#1 Roshan Ayyalasomayajula (UCSD: University of California, San Diego), H-Index: 1
#2 Aditya Arun (UCSD: University of California, San Diego)
Last of 7 authors: Dinesh Bharadia (UCSD: University of California, San Diego), H-Index: 14
#1 Lukasz Romaszko (Edin.: University of Edinburgh), H-Index: 6
Last of 3 authors: John Winn (Microsoft), H-Index: 32
We develop a Learning Direct Optimization (LiDO) method for the refinement of a latent variable model that describes input image x. Our goal is to explain a single image x with an interpretable 3D computer graphics model having scene graph latent variables z (such as object appearance, camera position). Given a current estimate of z we can render a prediction of the image g(z), which can be compared to the image x. The standard way to proceed is then to measure the error E(x, g(z)) betwe...
#2 Amin Nasiri (UT: University of Tehran), H-Index: 1
Last of 4 authors: Yudong Zhang (University of Leicester), H-Index: 46
Assessment and intelligent monitoring of fish freshness are of the utmost importance in yield and trade of fishery products. Rapid and precise assessment of fish freshness using conventional methods considering the great volume of industrial production is challenging. In this study, instead of feature-engineering-based methods, a novel and accurate fish freshness detection is proposed based on the images obtained from common carp and by applying a deep convolutional neural network (CNN)...
2 Citations
#1 Qiaoning Yang (Huada: Beijing University of Chemical Technology)
#2 Weimin Shi (Huada: Beijing University of Chemical Technology)
Last of 4 authors: Weiguo Lin (Huada: Beijing University of Chemical Technology)
Crack detection is critical to guaranteeing safety of bridges, highway and other infrastructures. The deep convolution neural network (DCNN) makes it possible to efficiently and accurately implement image classification, and the accumulated knowledge of DCNN in other domains can be reused for crack detection. In this paper, we propose a transfer learning method based on DCNN to detect cracks. The proposed method models the knowledge learned by DCNN and transfers three kinds of knowledge...
#1 Yun Zhou (Hunan University), H-Index: 3
#2 Yilin Pei (Hunan University), H-Index: 1
Last of 6 authors: Wei-Jian Yi (Hunan University), H-Index: 8
Accurate information regarding the weight of vehicle loads plays a significant role in maintaining the structural health of bridges. However, the only method currently available for ascertaining the weight of loads is the bridge weigh-in-motion (BWIM) system, which is not widely used because of the high cost of the large device involved. There is therefore a need to develop an effective, low-cost technology to ascertain vehicle loads in relation to spatiotemporal load distribution on lo...
Benefiting from Fully Convolutional Neural Networks (FCNs), saliency detection methods have achieved promising results. However, it is still challenging to learn effective features for detecting salient objects in complicated scenarios, in which i) non-salient regions may have "salient-like" appearance; ii) the salient objects may have different-looking regions. To handle these complex scenarios, we propose a Feature Guide Network which exploits the nature of low-level and high-level features to...