ILSVRC2015 team results and methods: Large Scale Visual Recognition Challenge 2015

Time: 2021-01-28 23:00:18

Large Scale Visual Recognition Challenge 2015 (ILSVRC2015)

Legend:
Yellow background = winner in this task according to this metric; authors are willing to reveal the method

White background = authors are willing to reveal the method

Grey background = authors chose not to reveal the method

Italics = authors requested entry not participate in competition

Object detection (DET)

Task 1a: Object detection with provided training data

Ordered by number of categories won

Team name Entry description Number of object categories won mean AP
MSRA An ensemble for detection. 194 0.620741
Qualcomm Research NeoNet ensemble with bounding box regression. Validation mAP is 54.6 4 0.535745
CUImage Combined multiple models with the region proposals of cascaded RPN, 57.3% mAP on Val2. 2 0.527113
The University of Adelaide 9 models 0 0.514434
MCG-ICT-CAS 2 models on 2 proposals without category information: {[SS+EB]+(SS)}+{[RPN] +(RPN)} 0 0.453622
MCG-ICT-CAS Category aggregation + Co-occurrence refinement with 2 models on 2 proposals: {[SS+EB]+(SS)}+{[RPN] +(RPN)} 0 0.453606
MCG-ICT-CAS Category aggregation with 2 models on 2 proposals: {[SS+EB]+(SS)}+{[RPN] +(RPN)} 0 0.451932
DROPLET-CASIA A combination of 6 models with selective regression 0 0.448598
Trimps-Soushen no extra data 0 0.44635
MCG-ICT-CAS Category aggregation + Co-occurrence refinement with single model on single proposal: [SS+EB] +(SS) 0 0.444896
DROPLET-CASIA A combination of 6 models without regression 0 0.442696
MCG-ICT-CAS Category aggregation + Co-occurrence refinement with single model on single proposal: [RPN] +(RPN) 0 0.440436
Yunxiao A single model, Fast R-CNN baseline 0 0.429423
Trimps-Fudan-HUST Best single model 0 0.421353
HiVision Single detection model 0 0.418474
SYSU_Vision vgg16+edgebox 0 0.402298
SYSU_Vision vgg16+selective search+spl 0 0.394901
FACEALL-BUPT a simple strategy to merge the three results, 38.7% MAP on validation 0 0.386014
FACEALL-BUPT googlenet, fast-rcnn, pretrained on the 1000 classes, selective search, add ROIPooling after inception_5, input size 786(max 1280), pooled size 4x4, the mean AP on validation is 37.8% 0 0.378955
DEEPimagine Fusion step 1 and 2 hybrid 0 0.371965
SYSU_Vision vgg16+selective search 0 0.371149
DEEPimagine Category adaptive multi-model fusion stage 3 0 0.370756
DEEPimagine Category adaptive multi-model fusion stage 2 0 0.369696
ITLab - Inha average ensemble of 2 detection models 0 0.366185
ITLab - Inha single detection model(B) with hierarchical BB 0 0.365423
ITLab - Inha single detection model(B) 0 0.359213
ITLab - Inha weighted ensemble of 2 detection models with hierarchical BB 0 0.359109
DEEPimagine Category adaptive multi-model fusion stage 1 0 0.357192
DEEPimagine No fusion. Single deep depth model. 0 0.351264
FACEALL-BUPT alexnet, fast-rcnn, pretrained on the 1000 classes, selective search, the mean AP on validation is 34.9% 0 0.34843
ITLab - Inha single detection model(A) with hierarchical BB 0 0.346269
FACEALL-BUPT googlenet, fast-rcnn, relu layers replaced by prelu layers, pretrained on the 1000 classes, selective search, add ROIPooling after inception_4d, the mean AP on validation is 34.0% 0 0.343141
JHL Fast-RCNN with Selective Search 0 0.311042
USTC_UBRI --- 0 0.177112
hustvision yolo-balance 0 0.143461
darkensemble A single model but ensembled pipeline trained on part of data due to time limit. 0 0.096197
ESOGU MLCV independent test results 0 0.093048
ESOGU MLCV results after mild suppression 0 0.092512
N-ODG Detection results. 0 0.03008
MSRA A single model for detection. --- 0.588451
Qualcomm Research NeoNet ensemble without bounding box regression. Validation mAP is 53.6 --- 0.531957
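
The two orderings used for the detection tables can give different rankings, because an entry can win many categories by a small margin while another has the higher average. The sketch below is illustrative only: the entry names and per-category AP values are made up, and it assumes a category is "won" by the entry with the highest AP on that category (ties are not treated specially).

```python
from collections import Counter


def mean_ap(per_category_ap):
    """Average the per-category AP values of one entry."""
    return sum(per_category_ap.values()) / len(per_category_ap)


def categories_won(entries):
    """Count, for each entry, the categories on which it has the highest AP.

    entries maps entry name -> {category: AP}; all entries are assumed
    to cover the same categories.
    """
    wins = Counter({name: 0 for name in entries})
    categories = next(iter(entries.values())).keys()
    for category in categories:
        best = max(entries, key=lambda name: entries[name][category])
        wins[best] += 1
    return wins


# Hypothetical scores, not taken from the challenge results.
entries = {
    "entry_a": {"cat": 0.90, "dog": 0.50, "car": 0.60},
    "entry_b": {"cat": 0.70, "dog": 0.95, "car": 0.50},
}

wins = categories_won(entries)
by_wins = sorted(entries, key=lambda name: wins[name], reverse=True)
by_map = sorted(entries, key=lambda name: mean_ap(entries[name]), reverse=True)
```

With these made-up numbers, `entry_a` wins two of three categories while `entry_b` has the higher mean AP, so the two lists come out in opposite order.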

Ordered by mean average precision

Team name Entry description mean AP Number of object categories won
MSRA An ensemble for detection. 0.620741 194
MSRA A single model for detection. 0.588451 ---
Qualcomm Research NeoNet ensemble with bounding box regression. Validation mAP is 54.6 0.535745 4
Qualcomm Research NeoNet ensemble without bounding box regression. Validation mAP is 53.6 0.531957 ---
CUImage Combined multiple models with the region proposals of cascaded RPN, 57.3% mAP on Val2. 0.527113 2
The University of Adelaide 9 models 0.514434 0
MCG-ICT-CAS 2 models on 2 proposals without category information: {[SS+EB]+(SS)}+{[RPN] +(RPN)} 0.453622 0
MCG-ICT-CAS Category aggregation + Co-occurrence refinement with 2 models on 2 proposals: {[SS+EB]+(SS)}+{[RPN] +(RPN)} 0.453606 0
MCG-ICT-CAS Category aggregation with 2 models on 2 proposals: {[SS+EB]+(SS)}+{[RPN] +(RPN)} 0.451932 0
DROPLET-CASIA A combination of 6 models with selective regression 0.448598 0
Trimps-Soushen no extra data 0.44635 0
MCG-ICT-CAS Category aggregation + Co-occurrence refinement with single model on single proposal: [SS+EB] +(SS) 0.444896 0
DROPLET-CASIA A combination of 6 models without regression 0.442696 0
MCG-ICT-CAS Category aggregation + Co-occurrence refinement with single model on single proposal: [RPN] +(RPN) 0.440436 0
Yunxiao A single model, Fast R-CNN baseline 0.429423 0
Trimps-Fudan-HUST Best single model 0.421353 0
HiVision Single detection model 0.418474 0
SYSU_Vision vgg16+edgebox 0.402298 0
SYSU_Vision vgg16+selective search+spl 0.394901 0
FACEALL-BUPT a simple strategy to merge the three results, 38.7% MAP on validation 0.386014 0
FACEALL-BUPT googlenet, fast-rcnn, pretrained on the 1000 classes, selective search, add ROIPooling after inception_5, input size 786(max 1280), pooled size 4x4, the mean AP on validation is 37.8% 0.378955 0
DEEPimagine Fusion step 1 and 2 hybrid 0.371965 0
SYSU_Vision vgg16+selective search 0.371149 0
DEEPimagine Category adaptive multi-model fusion stage 3 0.370756 0
DEEPimagine Category adaptive multi-model fusion stage 2 0.369696 0
ITLab - Inha average ensemble of 2 detection models 0.366185 0
ITLab - Inha single detection model(B) with hierarchical BB 0.365423 0
ITLab - Inha single detection model(B) 0.359213 0
ITLab - Inha weighted ensemble of 2 detection models with hierarchical BB 0.359109 0
DEEPimagine Category adaptive multi-model fusion stage 1 0.357192 0
DEEPimagine No fusion. Single deep depth model. 0.351264 0
FACEALL-BUPT alexnet, fast-rcnn, pretrained on the 1000 classes, selective search, the mean AP on validation is 34.9% 0.34843 0
ITLab - Inha single detection model(A) with hierarchical BB 0.346269 0
FACEALL-BUPT googlenet, fast-rcnn, relu layers replaced by prelu layers, pretrained on the 1000 classes, selective search, add ROIPooling after inception_4d, the mean AP on validation is 34.0% 0.343141 0
JHL Fast-RCNN with Selective Search 0.311042 0
USTC_UBRI --- 0.177112 0
hustvision yolo-balance 0.143461 0
darkensemble A single model but ensembled pipeline trained on part of data due to time limit. 0.096197 0
ESOGU MLCV independent test results 0.093048 0
ESOGU MLCV results after mild suppression 0.092512 0
N-ODG Detection results. 0.03008 0

Task 1b: Object detection with additional training data

Ordered by number of categories won

Team name Entry description Description of outside data used Number of object categories won mean AP
Amax remove threshold compared to entry1 pre-trained model from classification task;add training examples for class number <1000 165 0.57848
CUImage Combined models with region proposals of cascaded RPN, edgebox and selective search 3000-class classification images from ImageNet are used to pre-train CNN 30 0.522833
MIL-UT ensemble of 4 models (by averaging) VGG-16, BVLC GoogLeNet, Multibox and CLS/LOC data 2 0.469762
Amax Cascade region regression pre-trained model from classification task;add training examples for class number <1000 1 0.577374
MIL-UT ensemble of 4 models (using the weights learned separately for each class by Bayesian optimization on the val2 dataset) VGG-16, BVLC GoogLeNet, Multibox and CLS/LOC data 1 0.467786
Trimps-Soushen 2 times merge COCO 1 0.448106
Trimps-Soushen model 29 COCO 0 0.450794
Trimps-Fudan-HUST top5 region merge COCO 0 0.448739
Futurecrew Ensemble 1 of multiple models without contextual modeling. Validation is 44.1% mAP The CNN was pre-trained on the ILSVRC 2014 CLS dataset 0 0.416015
Futurecrew Ensemble 2 of multiple models without contextual modeling. Validation is 44.0% mAP The CNN was pre-trained on the ILSVRC 2014 CLS dataset 0 0.414619
Futurecrew Faster R-CNN based single detection model. Validation is 41.7% mAP The CNN was pre-trained on the ILSVRC 2014 CLS dataset 0 0.39862
1-HKUST run 2 HKUST-object-100 0 0.239854
1-HKUST baseline run 1 HKUST-object-100 0 0.205873
CUImage Single model 3000-class classification images from ImageNet are used to pre-train CNN --- 0.542021
CUImage Combined multiple models with the region proposals of cascaded RPN 3000-class classification images from ImageNet are used to pre-train CNN --- 0.531459
CUImage Combined models with region proposals of edgebox and selective search 3000-class classification images from ImageNet are used to pre-train CNN --- 0.137037

Ordered by mean average precision

Team name Entry description Description of outside data used mean AP Number of object categories won
Amax remove threshold compared to entry1 pre-trained model from classification task;add training examples for class number <1000 0.57848 165
Amax Cascade region regression pre-trained model from classification task;add training examples for class number <1000 0.577374 1
CUImage Single model 3000-class classification images from ImageNet are used to pre-train CNN 0.542021 ---
CUImage Combined multiple models with the region proposals of cascaded RPN 3000-class classification images from ImageNet are used to pre-train CNN 0.531459 ---
CUImage Combined models with region proposals of cascaded RPN, edgebox and selective search 3000-class classification images from ImageNet are used to pre-train CNN 0.522833 30
MIL-UT ensemble of 4 models (by averaging) VGG-16, BVLC GoogLeNet, Multibox and CLS/LOC data 0.469762 2
MIL-UT ensemble of 4 models (using the weights learned separately for each class by Bayesian optimization on the val2 dataset) VGG-16, BVLC GoogLeNet, Multibox and CLS/LOC data 0.467786 1
Trimps-Soushen model 29 COCO 0.450794 0
Trimps-Fudan-HUST top5 region merge COCO 0.448739 0
Trimps-Soushen 2 times merge COCO 0.448106 1
Futurecrew Ensemble 1 of multiple models without contextual modeling. Validation is 44.1% mAP The CNN was pre-trained on the ILSVRC 2014 CLS dataset 0.416015 0
Futurecrew Ensemble 2 of multiple models without contextual modeling. Validation is 44.0% mAP The CNN was pre-trained on the ILSVRC 2014 CLS dataset 0.414619 0
Futurecrew Faster R-CNN based single detection model. Validation is 41.7% mAP The CNN was pre-trained on the ILSVRC 2014 CLS dataset 0.39862 0
1-HKUST run 2 HKUST-object-100 0.239854 0
1-HKUST baseline run 1 HKUST-object-100 0.205873 0
CUImage Combined models with region proposals of edgebox and selective search 3000-class classification images from ImageNet are used to pre-train CNN 0.137037 ---

Object localization (LOC)

Task 2a: Classification+localization with provided training data

Ordered by localization error

Team name Entry description Localization error Classification error
MSRA Ensemble A for classification and localization. 0.090178 0.03567
MSRA Ensemble B for classification and localization. 0.090801 0.03567
MSRA Ensemble C for classification and localization. 0.092108 0.0369
Trimps-Soushen combined 12 models 0.122907 0.04649
Qualcomm Research Ensemble of 9 NeoNets with bounding box regression. Weighted fusion of classification models with 3 NeoNets used to slightly improve the classification accuracy. Validation top 5 error rate is 4.84% (classification). 0.125542 0.04873
Qualcomm Research Ensemble of 9 NeoNets with bounding box regression. Weighted fusion of classification models. Validation top 5 error rate is 4.86% (classification) and 14.54% (localization) 0.125926 0.04913
Qualcomm Research Ensemble of 9 NeoNets with bounding box regression. Validation top 5 error rate is 5.06% (classification) and 14.76% (localization) 0.128353 0.05068
Trimps-Soushen combined according category, 13 models 0.1291 0.04581
Trimps-Soushen combined 8 models 0.130449 0.04866
Trimps-Soushen single model 0.133685 0.04649
MCG-ICT-CAS [GoogleNet+VGG+SPCNN+SEL]OWA-[EB200RPN+SSEB1000EB] 0.14686 0.06314
MCG-ICT-CAS [GoogleNet+VGG+SPCNN]-[EB200RPN+SSEB1000EB] 0.147316 0.06407
Lunit-KAIST Class-agnostic AttentionNet, double-pass (original+flipped, fusion weights learnt on the entire validation set). 0.147337 0.07923
MCG-ICT-CAS [GoogleNet+VGG+SPCNN+SEL]AVG-[EB200EB+SSEB1000EB] 0.148157 0.06321
MCG-ICT-CAS [GoogleNet+VGG+SPCNN]-[EB200RPN+SSEB1000RPN] 0.149526 0.06407
Lunit-KAIST Class-agnostic AttentionNet, double-pass (original+flipped, fusion weights learnt on the half of the validation set). 0.149982 0.07451
Lunit-KAIST Class-agnostic AttentionNet, double-pass (original+flipped). 0.150688 0.07333
Tencent-Bestimage Tune GBDT on validation set 0.15548 0.10786
Qualcomm Research Ensemble of 4 NeoNets, no bounding box regression. 0.155875 0.05068
Tencent-Bestimage Model ensemble of objectness 0.156009 0.10826
MCG-ICT-CAS Baseline: [GoogleNet+VGG]-[EB200EB+EB1000EB] 0.158219 0.06698
Lunit-KAIST Class-agnostic AttentionNet, single-pass. 0.159754 0.07245
Tencent-Bestimage Change SVM to GBDT 0.163198 0.0978
Tencent-Bestimage Given a lot of box proposals (e.g. from selective search), the localization task becomes (a) how to find real boxes that contain ground truth, (b) what category lies in the box. Inspired by Fast R-CNN, we propose a classification & localization framework, which combines the global context information and local box information by selecting proper (class, box) pairs. 0.193552 0.1401
ReCeption --- 0.195792 0.03581
CUImage Average multiple models. Validation accuracy is 78.28%. 0.212141 0.05858
CIL Ensemble of multiple regression model: all model 0.231945 0.07446
CIL Ensemble of multiple regression models: base model- simple regression 0.232432 0.07444
CIL Ensemble of multiple regression models:base model-middle feature model 0.232526 0.07453
VUNO A combination of CNN and CLSTM (averaging) 0.252951 0.05034
VUNO A combination of two CNNs (averaging) 0.254922 0.05034
CIL Single regression model: middle and last layer feature are combined 0.271406 0.05477
CIL Single regression model with multi-scale prediction 0.273294 0.05477
KAISTNIA_ETRI CNN with recursive localization (further tuned in validation set) I 0.284944 0.07262
KAISTNIA_ETRI CNN with recursive localization (further tuned in validation set) IV 0.2854 0.07384
KAISTNIA_ETRI CNN with recursive localization (further tuned in validation set) III 0.285452 0.07384
KAISTNIA_ETRI CNN with recursive localization (further tuned in validation set) II 0.285608 0.07384
KAISTNIA_ETRI CNN with recursive localization 0.287475 0.073
HiVision Multiple models for classification, single model for localization 0.289653 0.06482
FACEALL-BUPT merge two models to classify the proposals. the top-5 classification error on the validation set is 7.03%, the top-5 localization error on validation is 31.53% 0.314581 0.07502
MIPAL_SNU Localization network finetuned for each class 0.324675 0.08534
FACEALL-BUPT use a multi-scale googlenet model to classify the proposals. the top-5 classification error on the validation set is 7.03%, the top-5 localization error on validation is 32.76% 0.327745 0.07502
FACEALL-BUPT three models, selective search, classify each proposal, merge top-3, the top-5 classification error on the validation set is 7.03%, the top-5 localization error on validation is 32.9% 0.32844 0.07502
MIPAL_SNU Single Localization network 0.355101 0.08534
PCL Bangalore Joint localization and classification 0.422177 0.1294
bioinf@JKU Classification and "split" bboxes 0.455186 0.0918
bioinf@JKU Classification and localization bboxes 0.46808 0.0918
Szokcka Research Group CNN-ensemble, bounding box is fixed + VGG-style network 0.484004 0.06828
Szokcka Research Group CNN-ensemble, bounding box is fixed. 0.484191 0.07338
ITU Hierarchical Google Net Classification + Localization 0.589338 0.11433
Miletos Hierarchical Google Net Classification + Localization 0.589338 0.11433
JHL 5 averaged CNN model (top5) 0.602865 0.0712
APL Multi-CNN Best Network Classifier (Random Forest via Top 5 Guesses) 0.613363 0.13288
Deep Punx Inception6 model trained on image-level annotations (~640K iterations were conducted). 0.614338 0.11104
bioinf@JKU Classification only, no bounding boxes 0.619007 0.0918
bioinf@JKU Classification and bboxes regression 0.62327 0.0918
DIT_TITECH Ensemble of 6 DNN models with 11-16 convolutional and 4-5 pooling layers. 0.627409 0.13893
Deep Punx Inception6 model trained on object-level annotations (~640K iterations were conducted). 0.639183 0.17594
JHL 5 averaged CNN model (top1) 0.668728 0.23944
Deep Punx Model, inspired by Inception7a [2] (~700K iterations were conducted). 0.757796 0.14666
APL Multi-CNN Best Label Classifier (Random Forest via 1000 Class Scores) 0.819291 0.60759
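
As a reading aid for the two error columns, the sketch below scores a single image. It assumes the usual ILSVRC rule: each entry submits up to five (label, box) guesses per image; the image is classified correctly if any guessed label matches a ground-truth label, and localized correctly if a guess with the right label also overlaps a ground-truth box of that label with IoU above 0.5. The function names, box format and sample values are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def image_errors(predictions, ground_truth, iou_threshold=0.5):
    """Return (classification_error, localization_error) as 0 or 1.

    predictions: list of (label, box), only the first five are used.
    ground_truth: list of (label, box) for the image.
    """
    top5 = predictions[:5]
    gt_labels = {label for label, _ in ground_truth}
    cls_ok = any(label in gt_labels for label, _ in top5)
    loc_ok = any(
        p_label == g_label and iou(p_box, g_box) > iou_threshold
        for p_label, p_box in top5
        for g_label, g_box in ground_truth
    )
    return int(not cls_ok), int(not loc_ok)


# Hypothetical image with one annotated object.
ground_truth = [("dog", (0, 0, 10, 10))]
loose = [("cat", (0, 0, 10, 10)), ("dog", (5, 5, 15, 15))]
tight = [("dog", (1, 1, 10, 10))]

loose_result = image_errors(loose, ground_truth)
tight_result = image_errors(tight, ground_truth)
```

Under this rule a localization hit implies a classification hit, which is consistent with every row above having a localization error at least as large as its classification error. The reported figures would then be these per-image errors averaged over the test set.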

Ordered by classification error

Team name Entry description Classification error Localization error
MSRA Ensemble A for classification and localization. 0.03567 0.090178
MSRA Ensemble B for classification and localization. 0.03567 0.090801
ReCeption --- 0.03581 0.195792
MSRA Ensemble C for classification and localization. 0.0369 0.092108
Trimps-Soushen combined according category, 13 models 0.04581 0.1291
Trimps-Soushen single model 0.04649 0.133685
Trimps-Soushen combined 12 models 0.04649 0.122907
Trimps-Soushen combined 8 models 0.04866 0.130449
Qualcomm Research Ensemble of 9 NeoNets with bounding box regression. Weighted fusion of classification models with 3 NeoNets used to slightly improve the classification accuracy. Validation top 5 error rate is 4.84% (classification). 0.04873 0.125542
Qualcomm Research Ensemble of 9 NeoNets with bounding box regression. Weighted fusion of classification models. Validation top 5 error rate is 4.86% (classification) and 14.54% (localization) 0.04913 0.125926
VUNO A combination of CNN and CLSTM (averaging) 0.05034 0.252951
VUNO A combination of two CNNs (averaging) 0.05034 0.254922
Qualcomm Research Ensemble of 9 NeoNets with bounding box regression. Validation top 5 error rate is 5.06% (classification) and 14.76% (localization) 0.05068 0.128353
Qualcomm Research Ensemble of 4 NeoNets, no bounding box regression. 0.05068 0.155875
CIL Single regression model with multi-scale prediction 0.05477 0.273294
CIL Single regression model: middle and last layer feature are combined 0.05477 0.271406
CUImage Average multiple models. Validation accuracy is 78.28%. 0.05858 0.212141
MCG-ICT-CAS [GoogleNet+VGG+SPCNN+SEL]OWA-[EB200RPN+SSEB1000EB] 0.06314 0.14686
MCG-ICT-CAS [GoogleNet+VGG+SPCNN+SEL]AVG-[EB200EB+SSEB1000EB] 0.06321 0.148157
MCG-ICT-CAS [GoogleNet+VGG+SPCNN]-[EB200RPN+SSEB1000EB] 0.06407 0.147316
MCG-ICT-CAS [GoogleNet+VGG+SPCNN]-[EB200RPN+SSEB1000RPN] 0.06407 0.149526
HiVision Multiple models for classification, single model for localization 0.06482 0.289653
MCG-ICT-CAS Baseline: [GoogleNet+VGG]-[EB200EB+EB1000EB] 0.06698 0.158219
Szokcka Research Group CNN-ensemble, bounding box is fixed + VGG-style network 0.06828 0.484004
JHL 5 averaged CNN model (top5) 0.0712 0.602865
Lunit-KAIST Class-agnostic AttentionNet, single-pass. 0.07245 0.159754
KAISTNIA_ETRI CNN with recursive localization (further tuned in validation set) I 0.07262 0.284944
KAISTNIA_ETRI CNN with recursive localization 0.073 0.287475
Lunit-KAIST Class-agnostic AttentionNet, double-pass (original+flipped). 0.07333 0.150688
Szokcka Research Group CNN-ensemble, bounding box is fixed. 0.07338 0.484191
KAISTNIA_ETRI CNN with recursive localization (further tuned in validation set) II 0.07384 0.285608
KAISTNIA_ETRI CNN with recursive localization (further tuned in validation set) III 0.07384 0.285452
KAISTNIA_ETRI CNN with recursive localization (further tuned in validation set) IV 0.07384 0.2854
CIL Ensemble of multiple regression models: base model- simple regression 0.07444 0.232432
CIL Ensemble of multiple regression model: all model 0.07446 0.231945
Lunit-KAIST Class-agnostic AttentionNet, double-pass (original+flipped, fusion weights learnt on the half of the validation set). 0.07451 0.149982
CIL Ensemble of multiple regression models:base model-middle feature model 0.07453 0.232526
FACEALL-BUPT three models, selective search, classify each proposal, merge top-3, the top-5 classification error on the validation set is 7.03%, the top-5 localization error on validation is 32.9% 0.07502 0.32844
FACEALL-BUPT use a multi-scale googlenet model to classify the proposals. the top-5 classification error on the validation set is 7.03%, the top-5 localization error on validation is 32.76% 0.07502 0.327745
FACEALL-BUPT merge two models to classify the proposals. the top-5 classification error on the validation set is 7.03%, the top-5 localization error on validation is 31.53% 0.07502 0.314581
Lunit-KAIST Class-agnostic AttentionNet, double-pass (original+flipped, fusion weights learnt on the entire validation set). 0.07923 0.147337
MIPAL_SNU Single Localization network 0.08534 0.355101
MIPAL_SNU Localization network finetuned for each class 0.08534 0.324675
bioinf@JKU Classification only, no bounding boxes 0.0918 0.619007
bioinf@JKU Classification and "split" bboxes 0.0918 0.455186
bioinf@JKU Classification and localization bboxes 0.0918 0.46808
bioinf@JKU Classification and bboxes regression 0.0918 0.62327
Tencent-Bestimage Change SVM to GBDT 0.0978 0.163198
Tencent-Bestimage Tune GBDT on validation set 0.10786 0.15548
Tencent-Bestimage Model ensemble of objectness 0.10826 0.156009
Deep Punx Inception6 model trained on image-level annotations (~640K iterations were conducted). 0.11104 0.614338
ITU Hierarchical Google Net Classification + Localization 0.11433 0.589338
Miletos Hierarchical Google Net Classification + Localization 0.11433 0.589338
PCL Bangalore Joint localization and classification 0.1294 0.422177
APL Multi-CNN Best Network Classifier (Random Forest via Top 5 Guesses) 0.13288 0.613363
DIT_TITECH Ensemble of 6 DNN models with 11-16 convolutional and 4-5 pooling layers. 0.13893 0.627409
Tencent-Bestimage Given a lot of box proposals (e.g. from selective search), the localization task becomes (a) how to find real boxes that contain ground truth, (b) what category lies in the box. Inspired by Fast R-CNN, we propose a classification & localization framework, which combines the global context information and local box information by selecting proper (class, box) pairs. 0.1401 0.193552
Deep Punx Model, inspired by Inception7a [2] (~700K iterations were conducted). 0.14666 0.757796
Deep Punx Inception6 model trained on object-level annotations (~640K iterations were conducted). 0.17594 0.639183
JHL 5 averaged CNN model (top1) 0.23944 0.668728
APL Multi-CNN Best Label Classifier (Random Forest via 1000 Class Scores) 0.60759 0.819291

Task 2b: Classification+localization with additional training data

Ordered by localization error

Team name Entry description Description of outside data used Localization error Classification error
Trimps-Soushen extra annotations collected by ourselves extra annotations collected by ourselves 0.122285 0.04581
Amax Validate the classification model we used in DET entry1 share proposal procedure with DET for convenience 0.14574 0.04354
CUImage Average multiple models. Validation accuracy is 79.78%. 3000-class classification images from ImageNet are used to pre-train CNN 0.198272 0.05858
CUImage Combine 6 models 3000-class classification images from ImageNet are used to pre-train CNN 0.19905 0.05858

Ordered by classification error

Team name Entry description Description of outside data used Classification error Localization error
Amax Validate the classification model we used in DET entry1 share proposal procedure with DET for convenience 0.04354 0.14574
Trimps-Soushen extra annotations collected by ourselves extra annotations collected by ourselves 0.04581 0.122285
CUImage Average multiple models. Validation accuracy is 79.78%. 3000-class classification images from ImageNet are used to pre-train CNN 0.05858 0.198272
CUImage Combine 6 models 3000-class classification images from ImageNet are used to pre-train CNN 0.05858 0.19905

Object detection from video (VID)

Task 3a: Object detection from video with provided training data

Ordered by number of categories won

Team name Entry description Number of object categories won mean AP
CUVideo Average of models, no outside training data, mAP 73.8 on validation data 28 0.678216
RUC_BDAI We combine RCNN and video segmentation to get the final result. 2 0.359668
ITLab VID - Inha 2 model ensemble with PLS and POMDP v2 0 0.515045
ITLab VID - Inha 2 model ensemble 0 0.513368
ITLab VID - Inha 2 model ensemble with PLS and POMDP 0 0.511743
UIUC-IFP Faster-RCNN + SEQ-NMS-AVG 0 0.487232
UIUC-IFP Faster-RCNN + SEQ-NMS-MAX 0 0.487232
UIUC-IFP Faster-RCNN + SEQ-NMS-MIX 0 0.487232
Trimps-Soushen Best single model 0 0.461155
Trimps-Soushen Single model with main object constraint 0 0.4577
UIUC-IFP Faster-RCNN + Single-Frame-NMS 0 0.433511
1-HKUST YX_submission2_merge479 0 0.421108
1-HKUST YX_submission1_merge475 0 0.417104
1-HKUST YX_submission1_tracker 0 0.415265
HiVision Detection + multi-object tracking 0 0.375203
1-HKUST RH_test6 0 0.367896
1-HKUST RH_test3_tracker 0 0.366048
NICAL proposals+VGG16 0 0.229714
FACEALL-BUPT merge the two results, 25.3% map on validation 0 0.222359
FACEALL-BUPT object detection based on fast rcnn with GoogLeNet, object tracking based on TLD, 24.6% map on validation 0 0.218363
FACEALL-BUPT object detection based on fast rcnn with Alexnet, object tracking based on TLD, 19.1% map on validation 0 0.162115
MPG_UT We sequentially predict bounding boxes in every frame, and predict object categories for given regions. 0 0.128381
ART Vision Based On Global 0 1e-06
CUVideo Single model, no outside training data, mAP 72.5 on validation data --- 0.664121
HiVision Detection only --- 0.372892

Ordered by mean average precision

Team name Entry description mean AP Number of object categories won
CUVideo Average of models, no outside training data, mAP 73.8 on validation data 0.678216 28
CUVideo Single model, no outside training data, mAP 72.5 on validation data 0.664121 ---
ITLab VID - Inha 2 model ensemble with PLS and POMDP v2 0.515045 0
ITLab VID - Inha 2 model ensemble 0.513368 0
ITLab VID - Inha 2 model ensemble with PLS and POMDP 0.511743 0
UIUC-IFP Faster-RCNN + SEQ-NMS-AVG 0.487232 0
UIUC-IFP Faster-RCNN + SEQ-NMS-MAX 0.487232 0
UIUC-IFP Faster-RCNN + SEQ-NMS-MIX 0.487232 0
Trimps-Soushen Best single model 0.461155 0
Trimps-Soushen Single model with main object constraint 0.4577 0
UIUC-IFP Faster-RCNN + Single-Frame-NMS 0.433511 0
1-HKUST YX_submission2_merge479 0.421108 0
1-HKUST YX_submission1_merge475 0.417104 0
1-HKUST YX_submission1_tracker 0.415265 0
HiVision Detection + multi-object tracking 0.375203 0
HiVision Detection only 0.372892 ---
1-HKUST RH_test6 0.367896 0
1-HKUST RH_test3_tracker 0.366048 0
RUC_BDAI We combine RCNN and video segmentation to get the final result. 0.359668 2
NICAL proposals+VGG16 0.229714 0
FACEALL-BUPT merge the two results, 25.3% map on validation 0.222359 0
FACEALL-BUPT object detection based on fast rcnn with GoogLeNet, object tracking based on TLD, 24.6% map on validation 0.218363 0
FACEALL-BUPT object detection based on fast rcnn with Alexnet, object tracking based on TLD, 19.1% map on validation 0.162115 0
MPG_UT We sequentially predict bounding boxes in every frame, and predict object categories for given regions. 0.128381 0
ART Vision Based On Global 1e-06 0

Task 3b: Object detection from video with additional training data

Ordered by number of categories won

Team name Entry description Description of outside data used Number of object categories won mean AP
Amax only half of the videos are tracked due to deadline limits, others are only detected by Faster RCNN (VGG16) without temporal smoothing. --- 18 0.730746
CUVideo Outside training data (ImageNet 3000-class data) to pre-train the detection model, mAP 77.0 on validation data ImageNet 3000-class data to pre-train the model 11 0.696607
Trimps-Soushen Models combine with main object constraint COCO 1 0.480542
Trimps-Soushen Combine several models COCO 0 0.487495
BAD VID2015_trace_merge ILSVRC DET 0 0.4343
BAD combined_VIDtrainval_threshold0.1 ILSVRC DET 0 0.390301
BAD VID2015_merge_test_threshold0.1 ILSVRC DET 0 0.383204
BAD combined_test_DET_threshold0.1 ILSVRC DET 0 0.350286
BAD VID2015_VID_test_threshold0.1 ILSVRC DET 0 0.339721
NECLAMA Faster-RCNN as described above. The model relies on a CNN that is pre-trained with the ImageNet-CLS-2012 data for the classification task. 0 0.326489

Ordered by mean average precision

Team name Entry description Description of outside data used mean AP Number of object categories won
Amax only half of the videos are tracked due to deadline limits, others are only detected by Faster RCNN (VGG16) without temporal smoothing. --- 0.730746 18
CUVideo Outside training data (ImageNet 3000-class data) to pre-train the detection model, mAP 77.0 on validation data ImageNet 3000-class data to pre-train the model 0.696607 11
Trimps-Soushen Combine several models COCO 0.487495 0
Trimps-Soushen Models combine with main object constraint COCO 0.480542 1
BAD VID2015_trace_merge ILSVRC DET 0.4343 0
BAD combined_VIDtrainval_threshold0.1 ILSVRC DET 0.390301 0
BAD VID2015_merge_test_threshold0.1 ILSVRC DET 0.383204 0
BAD combined_test_DET_threshold0.1 ILSVRC DET 0.350286 0
BAD VID2015_VID_test_threshold0.1 ILSVRC DET 0.339721 0
NECLAMA Faster-RCNN as described above. The model relies on a CNN that is pre-trained with the ImageNet-CLS-2012 data for the classification task. 0.326489 0

Scene Classification (Scene)

Task 4a: Scene classification with provided training data

Team name Entry description Classification error
WM Fusion with product strategy 0.168715
WM Fusion with learnt weights 0.168747
WM Fusion with average strategy 0.168909
WM A single model (model B) 0.172876
WM A single model (model A) 0.173527
SIAT_MMLAB 9 models 0.173605
SIAT_MMLAB 13 models 0.174645
SIAT_MMLAB more models 0.174795
SIAT_MMLAB 13 models 0.175417
SIAT_MMLAB 2 models 0.175868
Qualcomm Research Weighted fusion of two models. Top 5 validation error is 16.45%. 0.175978
Qualcomm Research Ensemble of two models. Top 5 validation error is 16.53%. 0.176559
Qualcomm Research Ensemble of seven models. Top 5 validation error is 16.68% 0.176766
Trimps-Soushen score combine with 5 models 0.179824
Trimps-Soushen score combine with 8 models 0.179997
Trimps-Soushen top10 to top5, label combine with 9 models 0.180714
Trimps-Soushen top10 to top5, label combine with 7 models 0.180984
Trimps-Soushen single model, bn07 0.182357
ntu_rose test_4 0.193367
ntu_rose test_2 0.193645
ntu_rose test_5 0.19397
ntu_rose test_3 0.194262
Mitsubishi Electric Research Laboratories average of VGG16 trained with the standard cross entropy loss and VGG16 trained with weighted cross entropy loss. 0.194346
Mitsubishi Electric Research Laboratories VGG16 trained with weighted cross entropy loss. 0.199268
HiVision Single model with 5 scales 0.199777
DeepSEU Just one CNN model 0.200572
Qualcomm Research Ensemble of two models, trained with dense augmentation. Top 5 validation error is 19.20% 0.20111
HiVision Single model with 3 scales 0.201796
GatorVision modified VGG16 network 0.20268
SamExynos A Combination of Multiple ConvNets (7 Nets) 0.204197
SamExynos A Combination of Multiple ConvNets (6 Nets) 0.205457
UIUCMSR VGG-16 model trained using the entire training data 0.206851
SamExynos A Single ConvNet 0.207594
UIUCMSR Using filter panorama in the very bottom convolutional layer in CNNs 0.207925
UIUCMSR Using filter panorama in the top convolutional layer in CNNs 0.208972
ntu_rose test_1 0.211503
DeeperScene A single deep CNN model tuned on the validation set 0.241738
THU-UTSA-MSRA run4 0.253109
THU-UTSA-MSRA run5 0.254369
THU-UTSA-MSRA run1 0.256104
SIIT_KAIST-TECHWIN averaging three models 0.261284
SIIT_KAIST-TECHWIN averaging two models 0.266788
SIIT_KAIST-ETRI Modified GoogLeNet and test augmentation. 0.269862
THU-UTSA-MSRA run2 0.271185
NECTEC-MOONGDO Alexnet with retrain 2 0.275558
NECTEC-MOONGDO Alexnet with retrain 1 0.27564
SIIT_KAIST-TECHWIN single model 0.280223
HanGil deep ISA network for Places2 recognition 0.282688
FACEALL-BUPT Fine-tune Model 1 for another 1 epoch and correct the output vector size from 400 to 401; 10 crops, top-5 error 31.42% on validation 0.32725
FACEALL-BUPT GoogLeNet, with input resized to 128*128, removed Inception_5, 10 crops, top-5 error 37.19% on validation 0.38872
FACEALL-BUPT GoogLeNet, with input resized to 128*128 and reduced kernel numbers and sizes, 10 crops, top-5 error 38.99% on validation 0.407011
Henry Machine No deep learning. Traditional practice: feature engineering and classifier design. 0.417073
THU-UTSA-MSRA run3 0.987563

Task 4b: Scene classification with additional training data

Team name Entry description Description of outside data used Classification error
NEIOP We pretrained a VGG16 model on Places205 database and then finetuned the model on Places2 database. Places205 database 0.203539
Isia_ICT Combination of different models Places1 0.220239
Isia_ICT Combination of different models Places1 0.22074
Isia_ICT Combination of different models Places1 0.22074
ZeroHero Zero-shot scene recognition with 15K object categories 15K object categories from ImageNet, textual data from YFCC100M. 0.572784

Team information

Team name Team members Abstract
1-HKUST Yongyi Lu (The * University of Science and Technology)

Hao Chen (The Chinese University of *)

Qifeng Chen (Stanford University)

Yao Xiao (The * University of Science and Technology)

Law Hei (University of Michigan)

Chi-Keung Tang(The * University of Science and Technology)
Our system detects large- and small-resolution objects using
different schemes. The threshold between large and small resolution is
100 x 100. For large-resolution objects, we average the scores of
4 models (Caffe, NIN, Vgg16, Vgg19) within the selective search
bounding boxes. For small-resolution objects, scores are generated by Fast RCNN
on selective search (quality mode) proposals. The final result is the
combination of the outputs for large- and small-resolution objects after
applying NMS. In training, we augment the data with our annotated
HKUST-object-100 dataset, which consists of 219174 images.
HKUST-object-100 will be published after the 2015 competition to benefit
the research communities.
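As a rough illustration of the final merging step, the sketch below pools the detections from the large-object and small-object branches and applies greedy NMS. The box format (x1, y1, x2, y2, score), the IoU threshold of 0.5 and the function names are assumptions made for this example; the abstract does not specify them.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_with_nms(large_dets, small_dets, iou_threshold=0.5):
    """Pool both detection lists and greedily keep the best non-overlapping boxes.

    Each detection is (x1, y1, x2, y2, score); the threshold is an assumed value.
    """
    pooled = sorted(large_dets + small_dets, key=lambda d: d[4], reverse=True)
    kept = []
    for det in pooled:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(det, k) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


large = [(0, 0, 200, 200, 0.9)]
small = [(10, 10, 210, 210, 0.8), (300, 300, 350, 350, 0.7)]
merged = merge_with_nms(large, small)
```

In this toy input the second box overlaps the first heavily and is suppressed, while the distant third box survives.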
1-HKUST *Rui Peng (HKUST),

*Hengyuan Hu (HKUST),

Yuxiang Wu (HKUST),

Yongyi Lu (HKUST),

Yu-Wing Tai (SenseTime Group Limited),

Chi Keung Tang (HKUST)

(* equal contribution)

We adapted two image object detection architectures, namely
Fast-RCNN[1] and Faster-RCNN[2], for the task of object detection from
video. We used Edge Boxes[3] as proposal generation algorithm for
Fast-RCNN in our pipeline since we found that it outperformed other
methods in blurred, low resolution video data. To exploit temporal
information of video, we tried to aggregate proposals from multiple
frames to provide better proposals for each single frame. In addition,
we also devised a simple post-processing program, with CMT[4] tracker
involved, to rectify the predictions.

[1] R. Girshick. "Fast R-CNN." arXiv preprint arXiv:1504.08083

[2] S. Ren, K. He, R. Girshick and J. Sun. "Faster r-cnn: Towards
real-time object detection with region proposal networks." arXiv
preprint arXiv:1506.01497

[3] C. L. Zitnick and P. Dollár. "Edge boxes: Locating object proposals from edges." ECCV 2014

[4] G. Nebehay and R. Pflugfelder. "Clustering of Static-Adaptive Correspondences for Deformable Object Tracking." CVPR 2015

AK47 Na Li

Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences

Hongxiang Hu

Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Our algorithm is based on fast-rcnn.

We fine-tuned the fast-rcnn network using data picked from the
ILSVRC2015 training set. After that, each test frame was fed
to the network to obtain the predictions.

We also tried several ways of exploiting the similarity
of neighboring frames. At the beginning, we compared object proposals created
by different methods (selective search, rigor, edgeboxes, mop, gop) and
finally chose edgeboxes.

We tried adding the object proposals of the following frame and of the
preceding frame to the middle one to exploit their correlation. Our experiments showed that this works.

We also tried several algorithms for tracking objects, such as
optical flow and streamline; unfortunately, we have not applied any of
these algorithms to our model.

In any case, we have learned a lot from this competition. Thank you for organizing it!

We will come back!
Amax DET:Jiankang Deng(1,2), Hui Shuai(2), Zongguang Lu(2), Jing
Yang(2), Shaoli Huang(1), Yali Du(1), Yi Wu(2), Qingshan Liu(2), Dacheng
Tao (1)

CLS-LOC:Jing Yang(2), Shaoli Huang(1), Zhengbo Yu(2), Qiang Ma(2), Jiankang Deng(1,2)

VID: Jiankang Deng(1,2), Jing Yang(2), Shaoli Huang(1), Hui Shuai(2),Yi Wu(2), Qingshan Liu(2), Dacheng Tao(1)

(1)University of Technology, Sydney
(2)Nanjing University of Information Science & Technology

Cascade region regression

1. DET: Spatial cascade region regression

We first set up Faster RCNN [3] as our baseline (mAP 45.6% for VGG-16; mAP 47.2% for GoogLeNet).

Object detection is to answer "where" and "what".

We use a cascade region regression model to gradually refine the location of the object, which is helpful for answering "what".

Solid tricks include:

Negative examples (discriminative features are inhomogeneous on the
objects; response maps are helpful for choosing reasonable positive and
negative examples, instead of only using IoU).

Multi-scale (image, joint feature map, inception layer; answer "where"
from the earlier CNN, and "what" from the later one with high-capacity models).

Learn to combine (NMS inter-class, NMS intra-class; exclusive in space).

Learn to rank (hypothesis: the distribution of the number of samples in the
validation set and the test set is similar, so we can choose class-specific
parameters).

Add training samples for classes with little training data.

Design class-specific models for hard classes.

Rank low-resolution and dense predictions later.

Model ensemble with multi-view learning.

2. VID: Temporal cascade region regression

An objectness-based tracker is designed to track the objects in videos.

First, we train Faster-RCNN [3] (VGG-16) with the provided training data (sampled from frames).

The network provides features for tracking.

Second, the tracker uses the roi_pooling features from the last
conv layer and temporal information, which can be seen as a temporal Fast
RCNN [2].

(It takes the location-indexed features from the current frame to predict the bounding box of the object in the next frame.)

Temporal information and scene clusters (different videos from one
scene) are greatly helpful for deciding the classes in videos with
high confidence.

[1]Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies
for accurate object detection and semantic segmentation[C]//Computer
Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on. IEEE,
2014: 580-587.

[2]Girshick R. Fast R-CNN[J]. arXiv preprint arXiv:1504.08083, 2015.

[3]Ren S, He K, Girshick R, et al. Faster r-cnn: Towards real-time
object detection with region proposal networks[J]. arXiv preprint
arXiv:1506.01497, 2015.

APL Christopher M. Gifford (JHU/APL)

Pedro A. Rodriguez (JHU/APL)

Ryan J. Amundsen (JHU/APL)

Stefan Awad (JHU/APL)

Brant W. Chee (JHU/APL)

Clare W. Lau (JHU/APL)

Ajay M. Patrikar (JHU/APL)
Our submissions leverage multiple pre-trained CNNs and a second
stage Random Forest classifier to choose which label or CNN to use for
top 5 guesses. The second stage classifier is trained using the
validation data set based on the 1000 class scores from each individual
network, or based on which network(s) selected the correct label faster
(i.e., closer to the top guess). The primary pre-trained CNNs leveraged
are VGG VeryDeep-19, VGG VeryDeep-16, and VGG S. The second-stage Random
Forest classifier is trained using 1000 trees.

References:

[1] "MatConvNet - Convolutional Neural Networks for MATLAB", A. Vedaldi and K. Lenc, arXiv:1412.4564, 2014

[2] "Very Deep Convolutional Networks for Large-Scale Image
Recognition", Karen Simonyan and Andrew Zisserman, arXiv technical
report, 2014

[3] "Return of the Devil in the Details: Delving Deep into
Convolutional Networks", Ken Chatfield, Karen Simonyan, Andrea Vedaldi,
and Andrew Zisserman, BMVC 2014

ART Vision Rami Hagege

Ilya Bogomolny

Erez Farhan

Arkadi Musheyev

Adam Kelder

Ziv Yavo

Elad Meir

Roee Francos
The problem of classification and segmentation of objects in videos
is one of the biggest challenges in computer vision, demanding
simultaneous solutions of several fundamental problems. Most of these
fundamental problems are yet to be solved separately. Perhaps the most
challenging task in this context is the task of object detection and
classification. In this work, we utilized the feature-extraction
capabilities of deep neural networks to construct robust object
classifiers and accurately localize the objects in the scene. On top of that,
we use time and space analysis in order to capture the tracklets of each
detected object in time. The results show that our system is able to
localize multiple objects in different scenes, while maintaining track
stability over time.
BAD Shaowei Liu, Honghua Dong, Qizheng He @ Tsinghua University Used fast-rcnn framework and some tracking method.
bioinf@JKU Djork-Arne Clevert (Institute of Bioinformatics, Johannes Kepler University Linz)

Thomas Unterthiner (Institute of Bioinformatics, Johannes Kepler University Linz)

Günter Klambauer (Institute of Bioinformatics, Johannes Kepler University Linz)

Andreas Mayr (Institute of Bioinformatics, Johannes Kepler University Linz)

Martin Heusel (Institute of Bioinformatics, Johannes Kepler University Linz)

Karin Schwarzbauer (Institute of Bioinformatics, Johannes Kepler University Linz)

Sepp Hochreiter (Institute of Bioinformatics, Johannes Kepler University Linz)
We trained CNNs with a new activation function, called "exponential
linear unit" (ELU) [1], which speeds up learning in deep neural
networks.

Like rectified linear units (ReLUs) [2, 3], leaky ReLUs (LReLUs) and
parametrized ReLUs (PReLUs), ELUs also avoid a vanishing gradient via
the identity for positive values. However, ELUs have improved learning
characteristics compared to the other activation functions. In contrast
to ReLUs, ELUs have negative values, which allows them to push mean unit
activations closer to zero. Zero means speed up learning because they
bring the gradient closer to the unit natural gradient.

The unit natural gradient differs from the normal gradient by a bias
shift term, which is proportional to the mean activation of incoming
units. Like batch normalization, ELUs push the mean towards zero, but
with a significantly smaller computational footprint. While other
activation functions like LReLUs and PReLUs also have negative values,
they do not ensure a noise-robust deactivation state. ELUs saturate to a
negative value with smaller inputs and thereby decrease the propagated
variation and information. Therefore ELUs code the degree of presence of
particular phenomena in the input, while they do not quantitatively
model the degree of their absence. Consequently dependencies between
ELU units are much easier to model and distinct concepts are less
likely to interfere.

In this challenge ELU networks considerably speed up learning
compared to a ReLU network with similar classification performance.
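
To make the contrast with ReLUs concrete, here is a minimal sketch of the ELU as defined in [1]: the identity for positive inputs and alpha * (exp(x) - 1) otherwise. The value alpha = 1.0, the helper names and the toy inputs are assumptions chosen for illustration; they are not taken from the team's networks.

```python
import math


def relu(x):
    """Rectified linear unit: identity for positive inputs, zero otherwise."""
    return x if x > 0 else 0.0


def elu(x, alpha=1.0):
    """Exponential linear unit; saturates towards -alpha for very negative inputs."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def mean_activation(fn, inputs):
    """Average activation of fn over a list of pre-activations."""
    return sum(fn(x) for x in inputs) / len(inputs)


# Symmetric toy pre-activations: ELU keeps the mean activation closer to zero.
inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]
relu_mean = mean_activation(relu, inputs)
elu_mean = mean_activation(elu, inputs)
```

On these symmetric inputs the ELU mean is smaller in magnitude than the ReLU mean, which is the effect the abstract describes as pushing mean activations towards zero.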

[1] Clevert, Djork-Arné et al, Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), arxiv 2015

[2] Clevert, Djork-Arné et al, Rectified Factor Networks, NIPS 2015

[3] Mayr, Andreas et al, DeepTox: Toxicity Prediction using Deep Learning, Frontiers in Environmental Science 2015

CIL Heungwoo Han

Seongmin Kang

Seonghoon Kim

Kibum Bae

Vitaly Lavrukhin
BoundingBox regression models based on Inception [1] and NIN [4].

The network for classification, pre-trained on the ILSVRC 2014
classification dataset, is modified for bounding box regression. Then the
regressor that predicts a bounding box is fine-tuned using the features of
the middle and last convolutional layers [2]. At inference time, the modified
greedy-merge technique [3] for multi-scale prediction is applied on
each scale, and the optimal scale is chosen to determine the final
predicted box.

[1] Szegedy et al., Going Deeper with Convolutions, CVPR, 2015.

[2] Long & Shelhamer et al., Fully Convolutional Networks for Semantic Segmentation, CVPR, 2015.

[3] Sermanet et al., OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks, ICLR, 2014.

[4] Lin et al., Network In Network, ICLR, 2014.

CUImage Wanli Ouyang^1, Junjie Yan^2, Xingyu Zeng^1, Hongyang Li^1, Kai
Kang^1, Bin Yang^2, Xuanyi Dong^2, Cong Zhang^1, Tong Xiao^1, Zhe
Wang^1, Yubin Deng^1, Buyu Li^2, Sihe Wang^1, Ruohui Wang^1, Hongsheng
Li^1, Xiaogang Wang^1

1. The Chinese University of *

2. SenseTime Group Limited

For the object detection challenge, our submission is based on the
combination of two types of models, i.e. DeepID-Net [a] in ILSVRC 2014 and
Faster RCNN [b].

Compared with DeepID-Net in ILSVRC 2014, the new components are as follows.

(1) GoogleNet with batch normalization and VGG are pre-trained on
1000-class ImageNet classification/location data (for the entries of
using official data only) and 3000-class ImageNet classification data
(for the entries of using extra data).

(2) A new cascade method is introduced to generate region proposals. It has higher recall rate with fewer region proposals.

(3) The models are fine-tuned on 200 detection classes with multi-context and multi-crop.

(4) The 200 classes are clustered in a hierarchical way based on
their visual similarity. Instead of fine-tuning on the 200 classes
once, the models are fine-tuned multiple times as the 200 classes
are iteratively divided into smaller clusters. Different clusters share
different feature representations. Feature representations gradually
adapt to individual classes in this way.

Compared with Faster RCNN, the new components are

(1) We cascade the RPN, where the proposals generated by the RPN are fed
into an object/background Fast-RCNN. This leads to a 93% recall rate with
about 126 proposals per image on val2.

(2) We cascade the Fast-RCNN, where a Fast-RCNN with category-wise
softmax loss is used in the cascade step for hard negative mining. It
leads to about 2% improvement in AP.

Deep-ID models and Faster RCNN models are combined with model averaging.

For the localization task, class labels are predicted with VGG. For
each image and each predicted class, candidate regions are proposed by
employing a learned saliency map and edge boxes [c] with a high recall
rate. Candidate regions are scored using VGG or
GoogLeNet with BN fine-tuned on candidate regions. The 1000 classes are
grouped into multiple clusters in a hierarchical way. VGG and GoogLeNet
are fine-tuned multiple times to adapt to different clusters.

The fastest publicly available multi-GPU Caffe code (requiring only 6
seconds for 20 iterations with a mini-batch size of 256 using GoogLeNet
on 4 TitanX GPUs) is our strong support [d].

[a] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S.
Yang, Z. Wang, C. Loy, X. Tang, “DeepID-Net: Deformable Deep
Convolutional Neural Networks for Object Detection,” CVPR 2015.

[b] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, “Faster
R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks,” arXiv:1506.01497.

[c] C. Lawrence Zitnick, Piotr Dollár, “Edge Boxes: Locating Object Proposals from Edges”, ECCV2014

[d] https://github.com/yjxiong/caffe

CUVideo Wanli Ouyang^1, Kai Kang^1, Junjie Yan^2, Xingyu Zeng^1, Hongsheng
Li^1, Bin Yang^2, Tong Xiao^1, Cong Zhang^1, Zhe Wang^1, Ruohui Wang^1,
Xiaogang Wang^1

1. The Chinese University of *

2. SenseTime Group Limited

For object detection in video, we first employ CNN-based
detectors to detect and classify candidate regions on individual frames.
The detectors are based on the combination of two types of models, i.e.
DeepID-Net [a] in ILSVRC 2014 and Faster RCNN [b]. Temporal
information is exploited by propagating detection scores. Score propagation
is based on optical flow estimation and CNN-based tracking [c]. Spatial
and temporal pooling along tracks is employed. Video context is also used
to rescore candidate regions.

[a] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S.
Yang, Z. Wang, C. Loy, X. Tang, “DeepID-Net: Deformable Deep
Convolutional Neural Networks for Object Detection,” CVPR 2015.

[b] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, “Faster
R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks,” arXiv:1506.01497.

[c] L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual Tracking with Fully Convolutional Networks,” ICCV 2015.
darkensemble Hu Yaoquan, Peking University First, around 4000 candidate object proposals are generated from
selective search and structure edge. Then we extract 12 different
regions' CNN features for each proposal, and concatenate them as part of
final object representation as the method in [1]. In detail, region CNN
is a 16-layer VGG-version SPPnet modified with some random
initialization, a one-level pyramid, very leaky ReLU and a hand-designed
two-level label tree for structurally sharing knowledge to defeat class
imbalance. Single model but another three deformation layers are also
fused for capturing repeated patterns in just 3 regions each proposal.
Semantic segmentation-aware CNN extension in [1] is also used and here
segmentation model is a mixed model of deconvnet and CRF .

Second, we use RPN [2] with convolution layers initialized by the
pre-trained RCNN above to obtain at least 500 proposals per instance.
Next, we retrain a new model as above to make better use of the RPN proposals and
to exclude trained patterns, although only part of the data was used due to the time limit.
Apart from the model above, we add unsupervised segmented features using
CPMC, and encode them with shape and fixation maps, to learn a latent
SVM.

Third, we use the enlarged candidate box for bounding box
regression, iterative localization, and bounding box voting as in [1] to perform
object localization. Finally, thanks to the competition organisers at
ImageNet and for the GPU resources from NVIDIA and IBM Cloud.

[1] Spyros Gidaris, Nikos Komodakis, "Object detection via a
multi-region & semantic segmentation-aware CNN model",
arXiv:1505.01749

[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun," Faster
R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks",NIPS, 2015

Deep Punx Evgeny Smirnov,

Denis Timoshenko,

Rasim Akhunzyanov
For the Classification+Localization task we trained two neural nets
with architectures inspired by "Inception-6" [1] (but without batch
normalization), and one with an architecture inspired by "Inception-7a" [2]
(with batch normalization and 3x1 + 1x3 filters instead of 3x3 filters
for some layers).

Our modifications (for Inception-7a-like model) include:

1) Using MSRA weight initialization scheme [3].

2) Using Randomized ReLU units [4].

3) More aggressive data augmentations: random rotations, random
skewing, random stretching, brightness, contrast and color
augmentations.

4) Test-time data augmentation: we applied 30 different random data
augmentations to 30 copies of each test image, passed them through the net
and averaged the predictions.
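
The test-time augmentation in item 4 can be sketched roughly as follows. This is only an illustration of averaging predictions over randomly augmented copies: `model` and `augment` are hypothetical callables standing in for the trained network and the augmentation pipeline, and the fixed seed is an assumption added for reproducibility.

```python
import random


def predict_with_tta(image, model, augment, n_copies=30, seed=0):
    """Average class probabilities over n_copies randomly augmented copies.

    model(image) -> list of class probabilities (hypothetical interface).
    augment(image, rng) -> augmented image (hypothetical interface).
    """
    rng = random.Random(seed)
    total = None
    for _ in range(n_copies):
        probs = model(augment(image, rng))
        if total is None:
            total = [0.0] * len(probs)
        total = [t + p for t, p in zip(total, probs)]
    return [t / n_copies for t in total]


# Toy stand-ins: the "image" is a list of pixel values, the augmentation
# jitters brightness, and the "model" maps mean brightness to two classes.
def toy_augment(image, rng):
    shift = rng.uniform(-0.1, 0.1)
    return [min(1.0, max(0.0, p + shift)) for p in image]


def toy_model(image):
    bright = sum(image) / len(image)
    return [bright, 1.0 - bright]


averaged = predict_with_tta([0.2, 0.4, 0.6], toy_model, toy_augment)
```

Averaging probabilities (rather than, say, voting on labels) is one common choice; the abstract only states that predictions were averaged.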

Also we trained one of our Inception-6-like models on object-level
images (we cut objects from full images by their localization boxes and
used them to train the net).

This is still a work in progress: the networks have not finished
training yet, and we have not predicted any localization boxes yet.

[1] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,

Sergey Ioffe, Christian Szegedy http://arxiv.org/abs/1502.03167

[2] Scene Classification with Inception-7, Christian Szegedy with
Julian Ibarz and Vincent Vanhoucke
http://lsun.cs.princeton.edu/slides/Christian.pdf

[3] Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun http://arxiv.org/abs/1502.01852

[4] Empirical Evaluation of Rectified Activations in Convolutional
Network, Bing Xu, Naiyan Wang, Tianqi Chen, Mu Li,
http://arxiv.org/abs/1505.00853

DeeperScene Qinchuan Zhang, Xinyun Chen, Junwen Bai, Junru Shao, Shanghai Jiao Tong University, China For the scene classification task, our model is based on a
convolutional neural network framework implemented in Caffe. We use the
parameters of the vgg_19 model trained on the ILSVRC classification task as
the initialization of our model [1]. Since the deep features currently learnt by
convolutional neural networks trained on ImageNet
are not competitive enough for the scene classification task, because
ImageNet is an object-centric dataset [3], we further train
our model on Places2 [4]. Moreover, according to our experiments, “msra”
initialization of filter weights for rectifiers is a more robust method
of training extremely deep rectifier networks [2], so we use this method
to initialize some fully-connected layers.

[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.

[2] He K, Zhang X, Ren S, et al. Delving deep into rectifiers:
Surpassing human-level performance on imagenet classification.
arXiv:1502.01852, 2015.

[3] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva.
Learning Deep Features for Scene Recognition using Places Database.
Advances in Neural Information Processing Systems 27 (NIPS), 2014.

[4] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba and A. Oliva.
Places2: A Large-Scale Database for Scene Understanding. Arxiv, 2015.

DEEPimagine Sung-soo Park(leader).

Hyoung-jin Moon.

DEEPimagine Co. Ltd. of South Korea

1. Bases

We basically used the Fast and Faster RCNN object detection frameworks.

Our training models are based on the VGG and GoogLeNet models.

2. Enhancements

- Detection framework

: Tuning the location of ROI(Region Of Interest) projection, Adding the fixed region proposals.

- Models

: More depth models, More inception models.

We focused on the efficiency of the pooling layers and the fully connected inner product layers.

So we converted the pooling layers and fully connected inner product layers to our own custom layers.

- Fusion

: Category adaptive multi-model fusion.

3. References

[1] Ross Girshick. "Fast R-CNN: Fast Region-based Convolutional Networks for object detection", ICCV 2015

[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. "Faster
R-CNN: Towards Real-Time Object Detection with Region Proposal
Networks", NIPS 2015

DeepSEU Johnny S. Yang, Southeast University

Yongjie Chu, Southeast University
A VGG-like model has been trained for this scene classification
task. We only use the resized 256x256 image data to train this model. In
the training phase, random crops at multiple scales are used for data
augmentation. The procedure generally follows the VGG paper; for
example, the batch size was set to 128, and the learning rate was
initially set to 0.01. The only difference is that we do not use the Gaussian
method for weight initialization. We propose a new weight initialization
method, which converges slightly faster than the MSRA
weight filler. In the test phase, we convert the fully connected layers into
convolutional layers, and this fully convolutional network is then
applied over the whole image. Images at multiple scales are used to evaluate
dense predictions. Finally, the top-5 classification accuracy we obtained on the
validation set is 80.0%.
DIT_TITECH Ikuro Sato, Hideki Niihara, Ryutaro Watanabe (DENSO IT LABORATORY);
Hiroki Nishimura (DENSO CORPORATION); Akihiro Nomura, Satoshi Matsuoka
(Tokyo Institute of Technology)
We used an ensemble of 6 deep neural network models, consisting of
11-16 convolutional and 4-5 maximum pooling layers. No 1x1 convolution
is involved, meaning no fully-connected layers are used.

On-line, random image deformation is adopted during training. Test
samples are classified under the augmented pattern classification rule
according to I. Sato, et al., arXiv:1505.03229.

Models are trained by machine-distributed deep learning software
that we developed from scratch on the large-scale GPU supercomputer TSUBAME
at Tokyo Institute of Technology, taking advantage of the 1 year trial
use period and the TSUBAME Grand Challenge in Fall 2015.

Up to 96 GPUs are used to speed up the training.
DROPLET-CASIA Jingyu Liu (CASIA)

Junran Peng (CASIA)

Yongzhen Huang (CASIA)

Liang Wang (CASIA)

Tieniu Tan (CASIA)
Our framework is mainly based on RCNN [1], and we make the following improvements:

1. Region proposals come from two sources: selective search and a region proposal network [3] trained on ILSVRC.

2. The initial models of several GoogLeNets are pretrained on images or bounding boxes following [4].

3. Inspired by [5], an adapted multi-region model is used.

4. Inspired by [2], we train a separate regression network to correct the detection positions.

5. Model averaging on the SVM scores of all the used models.

[1]R. Girshick, J. Donahue, T. Darrell, J. Malik, "Rich feature
hierarchies for accurate object detection and semantic segmentation",
CVPR 2014.

[2]R. Girshick, "Fast R-CNN", ICCV 2015.

[3]S. Ren, K. He, R. Girshick, J. Sun, "Faster R-CNN: Towards
Real-Time Object Detection with Region Proposal Networks", NIPS 2015.

[4]W. Ouyang et al., "DeepID-Net: Deformable Deep Convolutional Neural Networks for Object detection", CVPR 2015.

[5]S. Gidaris, N. Komodakis, "Object detection via a multi-region & semantic segmentation-aware CNN model", ICCV 2015.

ESOGU MLCV Hakan Cevikalp

Halil Saglamlar
Here we use a short cascade of classifiers for object detection.
The first stage includes our novel polyhedral conic classifier (PCC)
whereas the second classifier is the kernelized SVM. PCC classifiers can
return polyhedral acceptance regions for positive classes with a simple
linear dot product, thus they are better suited for object detection
tasks compared to linear SVMs. We used LBP+HOG descriptors for image
representation and sliding window approach is used to scan images. Our
first submission includes independent detector outputs for each class
and we apply a non-maximum suppression algorithm between classes for the
second submission.
FACEALL-BUPT Yue WU, BUPT, CHINA

Kun HU, BUPT, CHINA

Yuxuan LIU, BUPT, CHINA

Xuankun HUANG, BUPT, CHINA

Jiangqi ZHANG, BUPT, CHINA

Hongliang BAI, Beijing Faceall co., LTD

Wenjian FENG, Beijing Faceall co., LTD

Tao FU, Beijing Faceall co., LTD

Yuan DONG, BUPT, CHINA
This is the third time that we have participated in ILSVRC. This year,
we start with the GoogLeNet [1] model and apply it to all four tasks.
Details are given below.

Task 1 Object Classification/Localization

===============

We use GoogLeNet with batch normalization and PReLU for
object classification. Three models are trained. The first one uses the
original GoogLeNet architecture, but all ReLU layers are replaced with
PReLU layers. The second model is the one described in Ref. [2]. The
third model is fine-tuned from the second model with multi-scale
training. Multi-scale testing and a model ensemble are used to
generate the final classification result. 144 crops of an image [1] are
used to evaluate one network. We also tried the 150-crop method
described in Ref. [3]; the performance is almost the
same, and merging the results of 144 crops and 150 crops does not bring
much improvement. The top-5 errors of the three models on
validation are 8.89%, 8.00% and 7.79%, respectively. Using an
ensemble of the three models, we decreased the top-5 error rate to
7.03%. An ensemble with more models could improve the
performance further, but we did not have enough time or GPUs to do that.

To generate a bounding box for each label of an image, we first
fine-tune the second classification model with object-level annotations
of 1,000 classes from the ImageNet CLS-LOC training data. Moreover, a
background class is added to the network. Then test images are
segmented into ~300 regions by selective search, and these regions are
classified by the fine-tuned model into one of 1,001 classes. We select
the top-3 regions with the highest class probabilities generated by the
classification model. A new bounding box is generated by finding the
minimal bounding rectangle of the three regions. The localization error is
about 32.9% on validation. We also tried the third classification model,
whose localization error on validation is 32.76%. After merging the
two aforementioned results, the localization error decreases to 31.53%.
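
The box-generation step (take the top-3 regions, then their minimal bounding rectangle) can be illustrated with the short sketch below. The region format (x1, y1, x2, y2, score) and the function name are assumptions for this example; how the team scored regions per class is not detailed in the abstract.

```python
def enclosing_box(regions, top_k=3):
    """Minimal axis-aligned rectangle enclosing the top_k highest-scoring regions.

    Each region is (x1, y1, x2, y2, score), where score is the classifier's
    probability for the target class (assumed format).
    """
    if not regions:
        raise ValueError("at least one region is required")
    best = sorted(regions, key=lambda r: r[4], reverse=True)[:top_k]
    return (
        min(r[0] for r in best),
        min(r[1] for r in best),
        max(r[2] for r in best),
        max(r[3] for r in best),
    )


regions = [
    (10, 20, 50, 60, 0.9),
    (30, 10, 70, 40, 0.8),
    (0, 0, 5, 5, 0.1),      # low score, ignored
    (40, 30, 90, 80, 0.7),
]
box = enclosing_box(regions)
```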

Task 2 Object Detection

===============

We employ the well-known Fast-RCNN framework [4]. First, we tried
the AlexNet model pre-trained on the CLS-LOC dataset with
image-level annotations. When training on the object detection dataset, we
run SGD for 12 epochs, then lower the learning rate from 0.001 to
0.0001 and train for another 4 epochs. The other settings are the same
as in the original Fast-RCNN method. This approach achieves 34.9% mAP on
validation. Then we apply GoogLeNet within the Fast-RCNN framework. The pooling
layer after the inception 4 layers is replaced by an ROI pooling layer. This
trial achieves about 34% mAP on validation. In another trial, we
move the ROI pooling layer from pool4 to pool5 and enlarge the input
size from 600 (max 1000) to 786 (max 1280). The pooled width and height
are set to 4x4 instead of 1x1. The mAP is about 37.8% on validation. It
is worth noting that the last model needs about 6 GB of GPU memory to train
and 1.5 GB of GPU memory to test. It has nearly the same test speed as
AlexNet but achieves better performance. We employ a simple strategy to
merge the three results and obtain 38.7% mAP.

Task 3 Scene Classification

===============

The Places2 training set has about 8 million images. We reduce the
input size of images from 256x256 to 128x128 and try small network
architectures to accelerate the training procedure. However, due to the
large number of images, we still could not finish the training before the
submission deadline. We trained two models whose architectures are
modified from the original GoogLeNet. The first one only removes the
inception 5 layers. We trained this model for only about 10 epochs. The
top-5 error on validation is about 37.19%. The second model enlarges the
stride of the conv1 layer from 2 to 4 and reduces the kernel number of
the conv2 layer from 192 to 96. For the remaining inception layers,
every kernel number is set to half of its original number. This
model is trained for about 12 epochs and achieves 38.99% top-5 error on
validation. Unfortunately, about one week ago, we found that the size of the final
output vector was set to 400 instead of 401 due to an oversight. We
corrected the error and fine-tuned the first model for about 1 epoch. The
top-5 error of this model on validation is about 31.42%.

Task 4 Object Detection from Video

===============

A simple method for this task is to perform object detection in all
frames. But it does not utilize the spatio-temporal constraints or
context information between consecutive frames in a video. Thus, we
employ object detection and object tracking for this track. First, key
frames are selected from a video and objects are detected in them using
Fast R-CNN. There are about 1.3 million frames in the training set. Due
to temporal continuity, we select one frame every 25 frames to train an
object detection model. 52,922 frames are used to train the model.
Similar to the approach in the Object Detection track, we run SGD for 12
epochs and then lower the learning rate from 0.001 to 0.0001 and train
for another 8 epochs. The training procedure takes 1 day and 3 days on a
single K40 for AlexNet and GoogLeNet, respectively. More details about
the object detection can be found in the description of Task 2.
During testing, if a video has fewer than 50 frames, we choose two
frames in the middle of the video. If a video has more than 50 frames,
we choose one frame every 25 frames. This results in 6,861 of 176,126
frames on validation and 12,329 of 315,176 frames on test. We run object
detection on these frames and keep the objects whose confidence scores
are larger than 0.2 (AlexNet) and 0.4 (GoogLeNet). Then, the detected
objects are tracked to generate the results for the other frames. The
tracking method we use is TLD [5]. After tracking, we generate the final
results for evaluation. It is worth noting that we set most of the
parameters empirically because we had no time to validate them. Three
entries are provided. The first one uses AlexNet and achieves 19.1% mAP
on validation. The second one uses GoogLeNet and achieves 24.6% mAP on
validation. The third entry merges the two results with a simple
strategy and achieves 25.3% mAP on validation.
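
The test-time key-frame rule described above can be sketched as follows. This is only an illustration: the function name, the 0-based indexing, the exact placement of the two middle frames, and the handling of a video with exactly 50 frames are assumptions, since the description does not specify them.

```python
def select_key_frames(num_frames, step=25, short_video_limit=50):
    """Return 0-based indices of the frames on which the detector is run."""
    if num_frames <= 0:
        return []
    if num_frames < short_video_limit:
        # Short video: two frames around the middle (placement is assumed).
        mid = num_frames // 2
        return sorted({max(mid - 1, 0), mid})
    # Longer video: one frame every `step` frames, starting at frame 0
    # (the starting offset is assumed; exactly 50 frames is treated as long).
    return list(range(0, num_frames, step))


if __name__ == "__main__":
    print(select_key_frames(40))
    print(select_key_frames(100))
```

Detections on the selected frames would then be handed to the tracker to fill in the remaining frames.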

[1] Szegedy C, Liu W, Jia Y, et al. Going Deeper With
Convolutions[C]//Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition. 2015: 1-9.

[2] Ioffe S, Szegedy C. Batch normalization: Accelerating deep
network training by reducing internal covariate shift[J]. arXiv preprint
arXiv:1502.03167, 2015.

[3] Simonyan K, Zisserman A. Very deep convolutional networks for
large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.

[4] Girshick R. Fast R-CNN[J]. arXiv preprint arXiv:1504.08083, 2015.

[5] Kalal, Z.; Mikolajczyk, K.; Matas, J.,
"Tracking-Learning-Detection," in Pattern Analysis and Machine
Intelligence, IEEE Transactions on , vol.34, no.7, pp.1409-1422, July
2012

Futurecrew Daejoong Kim

futurecrew@gmail.com
Our model is based on R-CNN[1], Fast R-CNN[2] and Faster R-CNN[3].

We used pre-trained ImageNet 2014 classification models (VGG16, VGG19) to train detection models.

For the ILSVRC 2015 detection dataset, we trained Fast R-CNN and Faster
R-CNN multiple times, and the trained models are used in the ensemble.

Three algorithms are used for region proposal: Selective
search [4], Region proposal network step_1 [3] and Region proposal
network step_3 [3]. The three kinds of proposals are fed to the
detection models (VGG16, VGG19), which are then combined in the ensemble.

Detection results are 41.7% mAP for a single model and 44.1% mAP for the ensemble of multiple models.

References:

[1] Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

R. Girshick, J. Donahue, T. Darrell, J. Malik

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014

[2] Fast R-CNN

Ross Girshick

IEEE International Conference on Computer Vision (ICCV), 2015

[3] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun

Neural Information Processing Systems (NIPS), 2015

[4] Selective Search for Object Recognition

Jasper R. R. Uijlings, Koen E. A. van de Sande, Theo Gevers, Arnold W. M. Smeulders

International Journal of Computer Vision, Volume 104 (2), page 154-171, 2013

GatorVision Dave Ojika, University of Florida

Liu Chujia, University of Florida

Rishab Goel, University of Florida

Vivek Viswanath, University of Florida

Arpita Tugave, University of Florida

Shruti Sivakumar, University of Florida

Dapeng Wu, University of Florida
We implement a Caffe-based convolutional neural network using the
Places2 dataset for a large-scale visual recognition environment. We
trained a network based on the VGG ConvNet with 13 weight layers and 3
by 3 kernels, with 3 fully connected layers. All convolutional layers
are followed by a ReLU layer. Due to the very long time
required to train the model with deeper layers, we deployed Caffe on a
multi-GPU cluster and leveraged cuDNN libraries to
reduce training time.

[1] Chen Z, Lam O, Jacobson A, et al. Convolutional neural
network-based place recognition[J]. arXiv preprint arXiv:1411.1509,
2014.

[2] Zhou B, Khosla A, Lapedriza A, et al. Object detectors emerge in deep scene cnns[J]. arXiv preprint arXiv:1412.6856, 2014.

[3] Simonyan K, Zisserman A. Very deep convolutional networks for
large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.

HanGil Gil-Jin Jang, School of Electronics Engineering, Kyungpook National University, Daegu, Republic of Korea

Han-Gyu Kim, School of Computing, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea
A novel deep network architecture is proposed based on
independent subspace analysis (ISA). We extract 4096-dimensional
features with the baseline AlexNet trained on the Places2 database, and
the proposed architecture is applied on top of the feature extraction
network. Every 4 nodes of the 4096 feature nodes are grouped as a
single subspace, resulting in 1024 individual subspaces. The output of
each subspace is the square root of the sum of the
squares of its components, and the architecture is repeated 3 times to
generate 256 nodes before connecting to the final network output of 401
categories.
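
The subspace pooling step described above can be sketched as follows. This is a minimal illustration that assumes each subspace consists of 4 consecutive feature nodes; any learned projection between repeated stages is not described in the abstract and is omitted here.

```python
import math


def isa_pool(features, group_size=4):
    """Pool each group of `group_size` consecutive values into one output:
    the square root of the sum of squares of the group."""
    if len(features) % group_size != 0:
        raise ValueError("feature length must be a multiple of group_size")
    return [
        math.sqrt(sum(x * x for x in features[i:i + group_size]))
        for i in range(0, len(features), group_size)
    ]


if __name__ == "__main__":
    fc_features = [1.0] * 4096          # stand-in for AlexNet features
    pooled = isa_pool(fc_features)      # 4096 nodes -> 1024 subspaces
    print(len(pooled), pooled[0])
```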
Henry Machine Henry Shu (Home)

Jerry Shu (Home)
Fundamentally different from deep
learning/ConvNet/ANN/representation learning, Henry machine was trained
using the traditional feature engineering --> classifier
design --> prediction paradigm. The intent is to encourage continued
research interest in many traditional methods in spite of the current
popularity of deep learning.

The most recent (as of Nov 14, 2015) top-5 and top-1 accuracies of
Henry machine for the Scene401 validation dataset are 68.53% and 36.15%,
respectively.

Here are some characteristics of Henry machine.

- The features used are our own modified versions of a selection of
features in the literature. These are engineered features, not learned
features. The feature extraction was done for all the 8.1M training,
380K test, and 20K validation images on their original, high-resolution
version with the original aspect ratio left unchanged, i.e., no image
resizing or rescaling was applied. The entire feature extraction step
using our own implementation took 7 days to complete on a
home-brewed CPU cluster (listed below), which consists of 12 low-end to
mid-range computers borrowed from family and friends. Five of the twelve
computers are laptops.

- We did not have time to study and implement many strong features
in the literature. The accuracy of Henry machine is expected to increase
once we include more such features. We believe that the craftsmanship
of feature engineering encodes the expertise and ingenuity of the human
designer behind it, and cannot be dispensed with (at least not yet) in
spite of the current popularity of deep learning and representation
learning-based approaches. Our humble goal with Henry machine is to
encourage continued research interest in the craftsmanship of feature
engineering.

- The training of Henry machine for Scene401 was also done using the
home-brewed CPU cluster, and took 21 days to complete (not counting
algorithm design/development/debugging time).

- While Henry machine was trained using traditional classification
methodology, the classifier itself was devised by us from scratch
instead of using conventionally available methods such as SVM. We did
not use SVM because it suffers from several drawbacks. For example, SVM
is fundamentally a binary classifier, and applying its maximum-margin
formulation in a multiclass setting, especially with such a large number
of classes (401 and 1000), could be ad hoc. Also, the output of SVM is
inherently non-probabilistic, which makes it less convenient to use in a
probabilistic setting. The training algorithm behind Henry machine is
our attempt to address these and several other issues (not mentioned
here), while at the same time making it efficient to train, with a small
memory footprint, on mom-and-pop computers at home, using only CPUs.
However, Henry machine is still very far from perfect, and our training
algorithm still needs a lot of improvement. More details of the training
and prediction algorithm will be available in our publication.

- As Nov 13 was fast approaching, we were pressed for time. The delay
was mainly due to hardware heat stress from many
triple-digit-temperature days in Sep and Oct here in California. At the
end of Oct, we bought two GTX 970 graphics cards and implemented
high-performance CUDA code (driver version 7.5) to help us finish the
final prediction phase in time.

- We will also release the performance report of Henry machine on the ImageNet1000 CLS-LOC validation dataset.

- The source code for building Henry machine, including the feature
extraction part, was primarily written in Octave (version 4.0.0) and C++
(gcc 4.8.4) on Linux machines (Ubuntu x86_64 14.04).

Here is the list of the CPUs of the home-brewed cluster:

* Pentium D 2.8GHz, 3G DDR (2005 Desktop)

* Pentium D 2.8GHz, 3.26G DDR (2005 Desktop)

* Pentium 4 3.0GHz, 3G DDR (2005 Desktop)

* Core 2 Duo T5500 1.66GHz, 3G DDR2 (2006 Laptop)

* Celeron E3200 2.4GHz, 3G DDR2 (2009 Desktop)

* Core i7-720QM 1.6GHz, 20G DDR3 (2009 Laptop)

* Xeon E5620 2.4GHz (x2), 16G DDR3 (2010 Server)

* Core i5-2300 2.8GHz, 6G DDR3 (2011 Desktop)

* Core i7-3610QM 2.3GHz, 16G DDR3 (2012 Laptop)

* Pending info (2012 Laptop)

* Core i7-4770K 3.5GHz, 32G DDR3 (2013 Desktop)

* Core i7-4500U 1.8GHz, 8G DDR3 (2013 Laptop)

HiVision Q.Y. Zhong, S.C. Yang, H.M. Sun, G. Zheng, Y. Zhang, D. Xie and S.L. Pu
[DET] We follow the Fast R-CNN [1] framework for detection.
EdgeBoxes is used for generating object proposals. A detection model is
fine-tuned from a VGG16 [3] model pre-trained on the ILSVRC2012 CLS
dataset. During testing, predictions on test images and their flipped
versions are combined by non-maximum suppression. Validation mAP is
42.3%.
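
Combining predictions from an image and its horizontally flipped version by non-maximum suppression might look like the sketch below. It is not the team's code: the box format `(x1, y1, x2, y2, score)`, the continuous-coordinate flip `x -> width - x`, the greedy NMS variant and the IoU threshold of 0.3 are all assumptions made for illustration, and the detections are assumed to belong to a single class.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2, ...) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(dets, iou_thresh=0.3):
    """Greedy NMS: keep a box only if it does not overlap a kept,
    higher-scoring box by more than `iou_thresh`."""
    kept = []
    for det in sorted(dets, key=lambda d: d[4], reverse=True):
        if all(iou(det, k) <= iou_thresh for k in kept):
            kept.append(det)
    return kept


def unflip(det, image_width):
    """Map a box detected on the flipped image back to the original image."""
    x1, y1, x2, y2, score = det
    return (image_width - x2, y1, image_width - x1, y2, score)


def merge_with_flipped(dets, flipped_dets, image_width, iou_thresh=0.3):
    """Pool detections from both passes and suppress duplicates."""
    restored = [unflip(d, image_width) for d in flipped_dets]
    return nms(list(dets) + restored, iou_thresh)


if __name__ == "__main__":
    original = [(10, 10, 50, 50, 0.9), (60, 60, 90, 90, 0.6)]
    flipped = [(50, 10, 90, 50, 0.8)]   # same object seen in the mirror image
    print(merge_with_flipped(original, flipped, image_width=100))
```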

[CLS-LOC] We train different models for classification and
localization separately, i.e. GoogLeNet [4,5] for classification and
VGG16 for bounding box regression. The final models achieve 28.9% top-5
cls-loc error and 6.62% cls error on the validation set.

[Scene] Due to limited time and GPUs, we trained only one
CNN model for the scene classification task, namely VGG19, on the
image datasets resized to 256x256. The top-5 accuracy on the validation
set with a single center crop is 79.9%. In the test phase, a multi-scale
dense evaluation is adopted for the prediction, whose accuracy on the
validation set is 80.7%.

[VID] First we apply Fast R-CNN with RPN proposals [2] to detect
objects frame by frame. Then a Multi-Object Tracking (MOT) [6] method is
utilized to associate the detections for each snippet. Validation mAP
is 43.1%.

References:

[1] R. Girshick. Fast R-CNN. arXiv 1504.08083, 2015.

[2] Ren S, He K, Girshick R, et al. Faster r-cnn: Towards real-time
object detection with region proposal networks. arXiv 1506.01497, 2015.

[3] K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 1409.1556, 2014.

[4] C. Szegedy, W. Liu, Y. Jia, et al. Going Deeper with Convolutions. arXiv 1409.4842, 2014.

[5] S. Ioffe, C. Szegedy. Batch Normalization: Accelerating Deep
Network Training by Reducing Internal Covariate Shift. arXiv 1502.03167,
2015.

[6] Bae S H, Yoon K J. Robust online multi-object tracking based on
tracklet confidence and online discriminative appearance learning. 2014
IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE,
2014: 1218-1225.

hustvision Xinggang Wang, Huazhong Univ. of Science and Technology
The submitted results were produced by a detection algorithm based
on the YOLO detection method [1]. Different from the original YOLO
detection, I made several changes:

1. Downsize the training/testing image to 224*224 for faster training and testing;

2. Reduce “pooling” layers to improve the detection performance of small objects;

3. Add a weight balancing method to deal with the unbalanced number of objects in training images.

[1] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi,
You Only Look Once: Unified, Real-Time Object Detection,
arXiv:1506.02640
Isia_ICT Xiangyang Li

Xinhang Song

Luis Herranz

Shuqiang Jiang
As the number of training images per category is non-uniform,
we sample 4020 images for each class from the training dataset. We use
this uniformly distributed subset to train our convolutional neural
networks. In order to reuse the semantic information in the 205-category
Places dataset [1], we also use models trained on this dataset to
extract visual features for the classification task. Although the
mid-level representations in convolutional neural networks are rich,
their geometric invariance properties are poor [2], so we use
multi-scale features. Precisely, we convert all the layers in the
convolutional neural network to convolution layers and use the fully
convolutional network to extract features with different input sizes. We
use max pooling to pool the features to the same size as that obtained
with the fixed input size of 227 used to train the network. Finally, we
combine features extracted from models that not only have different
architectures, but are also pre-trained on different datasets. We use
the concatenated features to classify the scene images. For efficiency,
we use a logistic regression classifier composed of two fully-connected
layers of 4096 units and 401 units respectively and a softmax layer,
trained on the sampled training examples.
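
The class-balanced sampling mentioned at the start of this description can be sketched as follows. The abstract does not say how classes with fewer than 4020 images are handled; topping them up by sampling with replacement is an assumption here, as are the function name and the dictionary layout.

```python
import random


def sample_balanced(images_by_class, per_class=4020, seed=0):
    """Return a dict with exactly `per_class` images for every class."""
    rng = random.Random(seed)
    subset = {}
    for label, images in images_by_class.items():
        if len(images) >= per_class:
            # Enough images: sample without replacement.
            subset[label] = rng.sample(images, per_class)
        else:
            # Assumption: keep all images and top up with repeated draws.
            extra = rng.choices(images, k=per_class - len(images))
            subset[label] = list(images) + extra
    return subset


if __name__ == "__main__":
    toy = {"kitchen": list(range(10)), "cliff": list(range(3))}
    balanced = sample_balanced(toy, per_class=5)
    print({label: len(items) for label, items in balanced.items()})
```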

[1] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, And A. Oliva.
“Learning deep features for scene recognition using places database”. In
NIPS 2014.

[2] D. Yoo, S. Park, J. Lee and I. Kweon. “Multi-scale pyramid
pooling for deep convolutional representation”. In CVPR Workshop 2015.

ITLab - Inha Byungjae Lee

Enkhbayar Erdenee

Yoonyoung Kim

Sungyul Kim

Phill Kyu Rhee

Inha University, South Korea

A hierarchical data-driven object detection framework is
presented that considers the deep feature hierarchy of object
appearances. We are motivated by the observations that many object
detectors are degraded in performance due to inter-class ambiguities and
intra-class variations in appearance, and that deep features extracted
from visual objects show a strong hierarchical clustering property. We
partition the deep features into unsupervised super-categories at the
inter-class level and augmented categories at the object level to
discover deep-feature-driven knowledge. We build a hierarchical feature
model using the Latent Dirichlet Allocation (LDA) [6] algorithm and
construct a hierarchical classification ensemble.

Our method is mainly based on the Fast R-CNN framework [1]. The region
proposal algorithm EdgeBoxes [2] is employed to generate regions of
interest from an image, and features are generated using a 16-layer CNN
[3] which is pre-trained on the ILSVRC 2013 CLS dataset and
fine-tuned on the detection dataset. We make the final decision on
object localization using hierarchical ridge regression and the
extended bounding box similar to [5], and weighted non-maximum
suppression similar to [1, 4].

[1] R. Girshick. Fast R-CNN. In CoRR, abs/1504.08083, 2015.

[2] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, pages 391–405, 2014.

[3] K. Simonyan and A. Zisserman. Very deep convolutional networks
for large-scale image recognition. In CoRR, abs/1409.1556, 2014.

[4] S. Gidaris and N. Komodakis. Object detection via a multiregion
& semantic segmentation-aware cnn model. In arXiv:1505.01749, 2015.

[5] Y. Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. segDeepM:
Exploiting segmentation and context in deep neural networks for object
detection. In CVPR, 2015.

[6] D. M. Blei, A. Y. Ng, M. I. Jordan. Latent dirichlet allocation. In JMLR, pages 993-1022, 2003.

ITLab VID - Inha Byungjae Lee
Enkhbayar Erdenee
Songguo Jin
Sungyul Kim
Phill Kyu Rhee
We propose an object-detection-based tracking algorithm. Our method
is mainly based on the Fast R-CNN framework [1]. The region proposal
algorithm EdgeBoxes [2] is employed to generate regions of interest from
a frame, and features are generated using a 16-layer CNN [3]
which is pre-trained on the ILSVRC 2013 CLS dataset and fine-tuned on
the ILSVRC 2015 video dataset. We implemented a tracking algorithm
similar to Partial Least Squares Analysis for generating a
low-dimensional discriminative subspace in video [4]. For parameter
optimization, we adopt the POMDP-based parameter learning approach
described in our previous work [5]. We make the final decision on
object localization using bounding box ridge regression and
weighted non-maximum suppression similar to [1].

[1] R. Girshick. Fast R-CNN. In CoRR, abs/1504.08083, 2015.

[2] C. L. Zitnick, and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, pages 391-405, 2014.

[3] K. Simonyan, and A. Zisserman. Very deep convolutional networks
for large-scale image recognition. In CoRR, abs/1409.1556, 2014.

[4] Q. Wang, F. Chen, W. Xu, and M.-H. Yang. Object tracking via
partial least squares analysis. IEEE Transactions on Image Processing,
pages 4454–4465, 2012.

[5] S. Khim, S. Hong, Y. Kim and P. Rhee. Adaptive visual tracking
using the prioritized Q-learning algorithm: MDP-based parameter learning
approach. Image and Vision Computing, pages 1090-1101, 2014.

ITU Ahmet Levent Subaşı Google Net Classification + Localization
JHL Jaehyun Lim (JHL)

John Hyeon Lee (JHL)
Our baseline algorithm is a convolutional neural network for both
the detection and classification/localization entries. For the detection
task, our method is based on the Fast R-CNN [1] framework. We trained a
VGG16 network [2] on regions proposed by Selective Search [3]. Fast R-CNN
uses ROI pooling on top of the convolutional feature maps. For the
classification and localization task, we trained GoogLeNet [4]
with batch normalization [5]. The submitted model averages 5
models trained on multiple random crops and tested on a single center
crop without any further data augmentation during training.
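
Averaging several models at test time can be sketched as below. The description only says that 5 models were averaged; averaging their per-class probabilities with equal weight, and the function names, are assumptions for illustration.

```python
def average_predictions(prob_lists):
    """Average per-class probabilities over models (one list per model)."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    if any(len(p) != n_classes for p in prob_lists):
        raise ValueError("all models must score the same classes")
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]


def top_k(probs, k=5):
    """Indices of the k highest-scoring classes, best first."""
    return sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)[:k]


if __name__ == "__main__":
    model_a = [0.5, 0.25, 0.25]
    model_b = [0.125, 0.75, 0.125]
    fused = average_predictions([model_a, model_b])
    print(fused, top_k(fused, k=2))
```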

[1] R. Girshick, Fast R-CNN, in Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[2] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015

[3] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.

[4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich, Going Deeper with
Convolutions, Cvpr, 2015.

[5] S. Ioffe and C. Szegedy, Batch Normalization: Accelerating Deep
Network Training by Reducing Internal Covariate Shift, Arxiv, 2015.

KAISTNIA_ETRI Hyungwon Choi*(KAIST)

Yunhun Jang*(KAIST)

Keun Dong Lee*(ETRI)

Seungjae Lee*(ETRI)

Jinwoo Shin*(KAIST)

(* indicates equal contribution, in alphabetical order)
In this work, we use a variant of GoogLeNet [1] for the localization
task. We further use VGG classification models [2] to boost the
performance of the GoogLeNet-based network. The overall training of the
baseline localization network follows a procedure similar to [2]. Our
CNN model is based on GoogLeNet [1]. We first trained our network to
minimize the classification loss using batch normalization [3]. Based
on this pre-trained classification network, we further fine-tuned it to
minimize the localization loss. Then, we performed recursive
localizations to adjust the localization outputs using the outputs of a
VGG-based classification network. For the VGG-based network,
we used pre-trained VGG-16 and VGG-19 models with multiple crops on a
regular grid, selective crops based on objectness scores using a method
similar to BING [4], and different image sizes. It is further tuned on
the validation set.

[1] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott
Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew
Rabinovich, “Going deeper with convolutions,” in Proc. CVPR, 2015.

[2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. ICLR, 2015.

[3] Sergey Ioffe and Christian Szegedy, “Batch normalization:
Accelerating deep network training by reducing internal covariate
shift,” in Proc. ICML, 2015.

[4] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. H. S. Torr, “BING:
Binarized normed gradients for objectness estimation at 300fps,” in
Proc. CVPR, 2014.

Lunit-KAIST Donggeun Yoo*, KAIST,

Kyunghyun Paeng*, KAIST,

Sunggyun Park*, KAIST,

Sangheum Hwang, Lunit Inc.,

Hyo-Eun Kim, Lunit Inc.,

Jungin Lee, Lunit Inc.,

Minhong Jang, Lunit Inc.,

Anthony S. Paek, Lunit Inc.,

Kyoung-Kuk Kim, KAIST,

Seong Dae Kim, KAIST,

In So Kweon, KAIST.

(* indicates equal contribution)
Given the top-5 object classes provided by multiple networks including
GoogLeNet [1] and VGG-16 [2], we localize each object with a single
class-agnostic AttentionNet, which is a multi-class extension of [3]. In
order to improve the localization accuracy, we significantly increased
the network depth from 8 layers [3] to 22 layers. In addition, 1,000
class-wise direction layers and a classification layer are stacked on
top of the network, sharing the convolutional layers. Starting from an
initial bounding box, AttentionNet predicts quantized weak directions
for the top-left and bottom-right corners pointing to a target object,
and aggregates the predictions iteratively to guide the bounding box to
an accurate object boundary. Since AttentionNet is a unified framework
for localization, no independent pre/post-processing technique such as
hand-engineered object proposals or bounding-box regression is
used in this submission.
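
The iterative corner-guidance loop described above can be illustrated with a toy sketch. The network is replaced by a hypothetical `predict_directions` callable, and the direction encoding (a unit step per axis, with `(0, 0)` meaning "stop"), the fixed step size and the iteration cap are simplifying assumptions rather than details of the submission.

```python
def refine_box(box, predict_directions, step=1.0, max_iters=100):
    """Move the top-left and bottom-right corners by small quantized steps
    until the predictor signals 'stop' for both corners."""
    x1, y1, x2, y2 = box
    for _ in range(max_iters):
        tl, br = predict_directions((x1, y1, x2, y2))
        if tl == (0, 0) and br == (0, 0):
            break  # both corners are judged to be on the object boundary
        x1 += step * tl[0]
        y1 += step * tl[1]
        x2 += step * br[0]
        y2 += step * br[1]
    return (x1, y1, x2, y2)


def toy_predictor(target):
    """Hypothetical stand-in for the network: points each corner towards
    the matching corner of a known target box."""
    def sign(v):
        return (v > 0) - (v < 0)

    def predict(box):
        tl = (sign(target[0] - box[0]), sign(target[1] - box[1]))
        br = (sign(target[2] - box[2]), sign(target[3] - box[3]))
        return tl, br

    return predict


if __name__ == "__main__":
    start = (0, 0, 100, 100)            # initial box, e.g. the whole image
    print(refine_box(start, toy_predictor((10, 20, 80, 90))))
```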

[1] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with
convolutions. In CVPR, 2015.

[2] K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR 2015.

[3] D. Yoo, S. Park, J.-Y. Lee, A. S. Paek, and I. S. Kweon.
Attentionnet: Aggregating weak directions for accurate object detection.
In ICCV, 2015.
MCG-ICT-CAS Tang Sheng, Corresponding leading member: ts@ict.ac.cn, Multimedia
Computing Group, Institute of Computing Technology, Chinese Academy of
Sciences;

Zhang Yong-Dong, Multimedia Computing Group, Institute of Computing Technology, Chinese Academy of Sciences;

Zhang Rui, Multimedia Computing Group, Institute of Computing Technology, Chinese Academy of Sciences;

Li Ling-Hui, Multimedia Computing Group, Institute of Computing Technology, Chinese Academy of Sciences;

Wang Bin, Multimedia Computing Group, Institute of Computing Technology, Chinese Academy of Sciences;

Li Yu, Multimedia Computing Group, Institute of Computing Technology, Chinese Academy of Sciences;

Deng Li-Xi, Multimedia Computing Group, Institute of Computing Technology, Chinese Academy of Sciences;

Xiao Jun-Bin, Multimedia Computing Group, Institute of Computing Technology, Chinese Academy of Sciences;

Cao Zhi, Multimedia Computing Group, Institute of Computing Technology, Chinese Academy of Sciences;

Li Jin-Tao, Multimedia Computing Group, Institute of Computing Technology, Chinese Academy of Sciences
Title:

Investigation of Model Sparsity and Category Information on Object Classification, Localization and Detection at ILSVRC 2015

Abstract:

In the ILSVRC 2015 challenge, we (team MCG-ICT-CAS) participate in two
tasks: object classification/localization (CLS-LOC) and object detection
(DET), without using any outside images or annotations.

CLS-LOC: we perform object classification and localization sequentially, so we describe them separately as follows.

For object classification, we use a fusion of VGG [1] and GoogleNet
[2] models as our baseline. Although the top-5 accuracy of the GoogleNet
we use, which was released on the Caffe platform [3], is 90.73% on the
validation dataset, much lower than that of the initial unreleased
GoogleNet by Google, we can still get a top-5 accuracy of 93.28% after
fusion with VGG (92.55%). After further fusion with our own 2 models, we
improve the final accuracy to 93.72%. Our unique contributions are
three-fold:

(1) Sparse CNN model (SPCNN): In early April of this year, we
proposed to learn a compact CNN model named SPCNN to reduce
computation and memory cost. Since the 4K*1K (4M) connections
between the 7th and 8th layers are among the densest connections between
CNN layers, we focus on how to remove most of the small connections
between them, because small connections are often unstable and may cause
noise. In fact, a given category (a neuron in the 8th layer) is
naturally related to only a very small number of category bases
(neurons of the 7th layer), i.e., the former can be modelled as a sparse
linear combination of a very small number of the latter. Hence, we
propose to select only a very small number of connections with large
weights between a given category and the category bases for retraining,
and set the other small unstable weights to constant zero without
retraining. Experiments on the validation dataset show that keeping and
retraining only a very small percentage (about 9.12%) of the initial 4M
connections, after removing the 90.88% small unstable connections, still
gains a 0.32% improvement in top-5 accuracy (92.87%)
compared with that (92.55%) of the initial VGG model.

(2) Sparse Ensemble Learning (SEL): A large-scale training dataset
poses a great challenge to the efficiency of both training and testing.
To overcome the low efficiency and poor scalability of classical
methods based on global classification such as SVM, we proposed SEL for
visual concept detection in [4]. It leverages sparse codes of image
features both to partition a large-scale training set into small
localities for efficient training and to coordinate the individual
classifiers in each locality for final classification. The sparsity
ensures that only a small number of complementary classifiers in the
ensemble will fire on a test sample, which not only gives better AP
than other fusion methods such as average fusion and AdaBoost, but also
allows highly efficient online detection. In this year’s object
classification task, we use 4K-dim CNN features as the image
features instead of the traditional bag of words (BoW). After expanding
the training dataset from 1.2 million samples to about 6 million samples
through data augmentation, we use 8K-dim sparse codes of the CNN
features to partition the training dataset into 8K small
localities for efficient training, and use the sparse codes of the test
sample for fusion. Experiments on the validation dataset show that,
compared with CaffeNet, SEL with CNN features improves top-1 and top-5
accuracies by 3.6% and 1.0%, respectively.

(3) Additionally, to compare with the widely used average fusion
(AVG) method, we try ordered weighted averaging (OWA) [5] in terms of
top-5 accuracy for fusion of all the models including GoogleNet, VGG,
SPCNN and SEL. Experiments on the validation dataset show that OWA is
better than average fusion.
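
Two of the ideas above, the weight selection in SPCNN and OWA fusion, can be sketched in a few lines. This is an illustration only: selecting connections per category by absolute magnitude, the rounding of the kept count, and the function names are assumptions; the retraining of the kept weights is not shown. The OWA operator follows Yager's definition [5]: the scores are sorted in descending order and the i-th weight is applied to the i-th largest score.

```python
def prune_row(weights, keep_ratio=0.0912):
    """Keep the largest-magnitude connections of one category (one row of
    the fc7 -> fc8 matrix) and zero the rest. Returns (pruned, mask); the
    mask marks the connections that would be retrained."""
    k = max(1, round(len(weights) * keep_ratio))
    order = sorted(range(len(weights)),
                   key=lambda i: abs(weights[i]), reverse=True)
    keep = set(order[:k])
    mask = [1 if i in keep else 0 for i in range(len(weights))]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


def owa(values, weights):
    """Ordered weighted averaging: weights are tied to rank, not to the
    model that produced each score."""
    if len(values) != len(weights):
        raise ValueError("need one weight per value")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("OWA weights must sum to 1")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


if __name__ == "__main__":
    pruned, mask = prune_row([0.5, -0.01, 0.02, -0.9], keep_ratio=0.5)
    print(pruned, mask)
    # Scores of one class from three models, fused with rank weights.
    print(owa([0.25, 1.0, 0.5], [0.5, 0.25, 0.25]))
```

With equal weights OWA reduces to average fusion, which is why the two can be compared directly.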

For object localization, we mainly focus on the following three aspects:

(1) We apply the Fast R-CNN [6] framework to the object
localization task. Instead of the Non-Maximum Suppression (NMS)
filtering of overlapping detection proposals in standard Fast R-CNN, we
propose to fuse the most probable detection proposals with the top-N
(N=10) scores to get the final localization result.

(2) In order to get more confident proposals, we combine three
different region proposals, Selective Search (SS) [7], EdgeBox
(EB) [8] and Region Proposal Network (RPN) [9], for both training and
localization of objects.

(3) We try clustering-based object localization to get more positive
training samples in each individual clustering subset, in addition to
reducing the complexity of both training and localization.

With the above measures, we improve our localization accuracy
from 83.2% (baseline) to around 84.9% on the validation dataset.
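
The top-N fusion in item (1) above could take several forms. The sketch below assumes a score-weighted average of the N highest-scoring boxes; the abstract does not state the fusion rule, so this choice, the box format `(x1, y1, x2, y2, score)` and the function name are illustrative only.

```python
def fuse_top_n(dets, n=10):
    """Fuse the n highest-scoring boxes of one class into a single box by
    averaging their coordinates, weighted by score."""
    top = sorted(dets, key=lambda d: d[4], reverse=True)[:n]
    total = sum(d[4] for d in top)
    if total <= 0:
        raise ValueError("need at least one box with a positive score")
    return tuple(sum(d[i] * d[4] for d in top) / total for i in range(4))


if __name__ == "__main__":
    proposals = [
        (0, 0, 10, 10, 0.5),
        (2, 2, 12, 12, 0.5),
        (100, 100, 200, 200, 0.25),   # low-scoring outlier, dropped for n=2
    ]
    print(fuse_top_n(proposals, n=2))
```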

DET: The overwhelming majority of existing object detection methods
focus on how to reduce the number of region proposals while
keeping high object recall, without considering category
information. This may lead to many false positives due to
interference between categories, especially when the number of
categories is very large. To eliminate such interference, in this
year’s DET task we propose a novel category aggregation among region
proposals, based on our observation that categories detected more
frequently around an object have higher probabilities of being
present in the image. After further exploiting the co-occurrence
relationships between categories, we can determine the most probable
categories for an image in advance. Thus, many false-positive region
proposals can be filtered out before the subsequent classification
process. Our experiments on the validation dataset verified the
effectiveness of both the proposed category aggregation and
co-occurrence refinement approaches.

References:

[1] Karen Simonyan, Andrew Zisserman,“Very Deep Convolutional
Networks for Large-Scale Image Recognition”, CoRR abs/1409.1556 (2014)

[2] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott
Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew
Rabinovich, “Going deeper with convolutions”, CVPR 2015.

[3] http://caffe.berkeleyvision.org/model_zoo.html

[4] Sheng Tang, Yan-Tao Zheng, Yu Wang, Tat-Seng Chua, “Sparse
Ensemble Learning for Concept Detection”, IEEE Transactions on
Multimedia, 14 (1): 43-54, February 2012.

[5] Ronald R. Yager, “On ordered weighted averaging aggregation
operators in multicriteria decisionmaking”, IEEE Transactions on
Systems, Man, and Cybernetics 18(1):183-190 (1988).

[6] R. Girshick, “Fast R-CNN”, In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[7] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W.
M. Smeulders, “Selective search for object recognition”, International
Journal of Computer Vision, 104(2):154-171, 2013.

[8] C. L. Zitnick and P. Dollar, “Edge boxes: Locating object
proposals from edges”, In Computer Vision–ECCV 2014, pages 391-405.
Springer, 2014.

[9] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards
real-time object detection with region proposal networks”, In Neural
Information Processing Systems (NIPS), 2015.

MIL-UT Masataka Yamaguchi (The Univ. of Tokyo)

Qishen Ha (The Univ. of Tokyo)

Katsunori Ohnishi (The Univ. of Tokyo)

Masatoshi Hidaka (The Univ. of Tokyo)

Yusuke Mukuta (The Univ. of Tokyo)

Tatsuya Harada (The Univ. of Tokyo)
We use Fast-RCNN[1] as the base detection system.

Before we train models using the Fast-RCNN framework, we retrain
VGG-16[2] with object-level annotations from CLS/LOC and DET data as
in [3]. We initialize two models with the original VGG-16 and the other
two models with the retrained one.

For all models, we concatenate the whole image features extracted by
CNN with the fc7-layer output and use it as the input to the inner
product layer before Softmax, while we use original fc7-layer output as
the input to the bounding box regressors. We multiply the whole image
features by a constant to make the norm of the whole image features
smaller compared to that of the fc7-layer output.

In one of the models initialized from the original VGG-16 and in one of
the models retrained on annotated objects, we replace the pool5 layer
with an RoI Pooling layer. In the other two models we replace the pool4
layer with an RoI Pooling layer instead. We then train all of them on the
training dataset and the val1 dataset (see [4]) using the Fast-RCNN framework.

During testing, we use object region proposals obtained from
Selective Search[5] and Multibox[6], and we test not only original
images but also horizontally-flipped ones and combine them.
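
Combining detections from an image and its horizontally flipped copy requires mapping the flipped boxes back into the original coordinate frame first. The following is a minimal sketch of that step, assuming boxes are `(x1, y1, x2, y2)` tuples in continuous pixel coordinates. The names `unflip_box` and `pool_detections` are illustrative, and the description does not say how the two sets of detections are actually combined, so simple pooling is an assumption.

```python
def unflip_box(box, image_width):
    """Map a box detected on a horizontally flipped image back to the original image."""
    x1, y1, x2, y2 = box
    # Mirroring swaps the roles of the left and right edges.
    return (image_width - x2, y1, image_width - x1, y2)


def pool_detections(original_dets, flipped_dets, image_width):
    """Pool (box, score) detections from both passes in original coordinates.

    Assumption: detections are simply concatenated; a later step such as
    non-maximum suppression would be needed to merge duplicates.
    """
    restored = [(unflip_box(box, image_width), score) for box, score in flipped_dets]
    return list(original_dets) + restored
```

Applying `unflip_box` twice with the same width returns the original box, which is a quick sanity check for the coordinate convention in use.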

In our experiments, replacing the pool4 layer rather than the pool5 layer
with an RoI Pooling layer improved mAP on the val2 dataset from 42.9% to
44.2%, and retraining the model with object-level annotations improved
mAP from 44.2% to 45.6%.

We submitted two results. One is obtained by model fusion using the
same weights for all models and the other is obtained by model fusion
using the weights learned separately for each class by Bayesian
optimization on the val2 dataset.
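
The two fusion variants differ only in where the weights come from. The sketch below fuses per-class scores from several models, with equal weights by default or with one weight vector per class. The function name `fuse_scores` and the data layout are assumptions made for illustration, and the per-class weights are taken as given: the Bayesian optimization that would learn them on val2 is not shown.

```python
def fuse_scores(model_scores, class_weights=None):
    """Fuse per-class scores from several models.

    model_scores: one list per model, each holding one score per class.
    class_weights: optional, one list per class holding one weight per model.
                   If omitted, every model gets the same weight.
    """
    num_models = len(model_scores)
    num_classes = len(model_scores[0])
    fused = []
    for c in range(num_classes):
        if class_weights is None:
            weights = [1.0 / num_models] * num_models
        else:
            total = sum(class_weights[c])
            # Normalize so the weights for this class sum to 1.
            weights = [w / total for w in class_weights[c]]
        fused.append(sum(weights[m] * model_scores[m][c] for m in range(num_models)))
    return fused
```

Normalizing the weights per class keeps the fused scores on the same scale as the inputs, so the two variants stay comparable.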

[1] Girshick, Ross. "Fast R-CNN." arXiv preprint arXiv:1504.08083 (2015).

[2] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional
networks for large-scale image recognition." arXiv preprint
arXiv:1409.1556 (2014).

[3] Ouyang, Wanli, et al. "Deepid-net: Deformable deep convolutional
neural networks for object detection." Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 2015.

[4] Girshick, Ross, et al. "Rich feature hierarchies for accurate
object detection and semantic segmentation." Computer Vision and Pattern
Recognition (CVPR), 2014 IEEE Conference on. IEEE, 2014.

[5] Uijlings, Jasper RR, et al. "Selective search for object
recognition." International journal of computer vision 104.2 (2013):
154-171.

[6] Szegedy, Christian, et al. "Scalable, high-quality object detection." arXiv preprint arXiv:1412.1441 (2014).

Miletos 1- Azmi Can Özgen / Miletos Inc - Istanbul Technical University - Department of Electrical and Electronics Engineering

2- Berkin Malkoç / Miletos Inc - Istanbul Technical University - Department of Physics

3- Mustafa Can Uslu / Miletos Inc - Istanbul Technical University - Department of Physics

4- Onur Kaplan / Miletos Inc - Istanbul Technical University - Department of Electrical and Electronics Engineering

# Hierarchical GoogleNet Classification + Localization

* We used GoogleNet architecture for Classification and
Localization. We also used Simple Linear Iterative Clustering for
Localization.

* For each ground truth label, we selected 3 different nodes based
upon WordNet tree structure, hierarchically. Also, we trained the model
from scratch with these selected three nodes, one for each output level
of the model.

* We tried two different schemes to train the GoogleNet
architecture. In the first scheme, we separated the architecture into three
individual parts such that each part has one of the output layers of
the GoogleNet architecture, and we trained these parts separately. Later, we
reconstructed the original GoogleNet architecture from these parts and
fine-tuned it using all three output layers.

In the second scheme, we trained the GoogleNet architecture with all three output layers without separating them.

* For each image, we used Simple Linear Iterative Clustering to select
multiple crops that are likely to contain objects. Then we
selected five of the crops (one for each Top-5 prediction) with the
GoogleNet classification model.

References :

1- Going Deeper With Convolutions - arXiv:1409.4842

MIPAL_SNU Sungheon Park, Myunggi Lee, Jihye Hwang, John Yang, Jieun Lee, Nojun Kwak

Machine Intelligence and Pattern Analysis Lab,

Seoul National University, Korea

Our method learns bounding box information from the outputs of a
classifier rather than from CNN features. The classification score
before softmax is used as a feature. The input image is divided into
grids of various sizes and locations, and the classifier is applied to the
resulting crops. Only the classifier score of the ground truth
class is selected from each crop, and these scores are stacked to generate
a feature vector. We used 140 crops for this competition. The localization
network is trained with the feature vector as input and bounding box
coordinates as output, using a Euclidean loss. Classification is performed
by GoogLeNet[1] trained using the quick solver in [2]. Once the class is
determined, the feature vector for localization is extracted and the
bounding box is determined by the localization network. We used a single
network for bounding box estimation, so an ensemble of multiple models may
improve performance.

[1] C Szegedy et al, "Going Deeper with Convolutions"

[2] https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet

Mitsubishi Electric Research Laboratories Ming-Yu Liu, Mitsubishi Electric Research Laboratories

Teng-Yok Lee, Mitsubishi Electric Research Laboratories
The submitted result is computed using the VGG16 network, which
contains 13 convolutional layers and 3 fully connected layers. After the
network is trained for several epochs using the training procedure
described in the original paper, we fine-tune the network by using a
weighted cross entropy loss, where the weights are determined per class
and are based on their fitting errors. At test time, we use
multi-resolution testing: the images are resized to three different
resolutions, 10 crops are extracted at each resolution, and the final
score is the average of the scores of the 30 crops.
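
A class-weighted cross entropy loss can be sketched as follows. This is an illustration only: the description does not give the formula that turns per-class fitting errors into weights, so `class_weights` is simply passed in, and the function names are hypothetical.

```python
import math


def weighted_cross_entropy(probs, label, class_weights):
    """Cross entropy for one sample, scaled by the weight of its true class."""
    return -class_weights[label] * math.log(probs[label])


def mean_weighted_loss(batch_probs, batch_labels, class_weights):
    """Average the weighted loss over a mini-batch."""
    losses = [
        weighted_cross_entropy(p, y, class_weights)
        for p, y in zip(batch_probs, batch_labels)
    ]
    return sum(losses) / len(losses)
```

Giving a poorly fitted class a weight above 1 makes its mistakes count more during fine-tuning, which is the effect the per-class weights are meant to have.
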
MPG_UT Noriki Nishida,

Jan Zdenek,

Hideki Nakayama

Machine Perception Group

Graduate School of Information Science and Technology

The University of Tokyo

Our proposed system consists of two components:

the first component is a deep neural network that predicts bounding boxes in every frame while utilizing contextual information,

and the second component is a convolutional neural network that predicts class categories for given regions.

The first component uses a recurrent neural network ("inner" RNN) to sequentially detect multiple objects in every frame.

Moreover, the first component uses an encoding bidirectional RNN ("outer" RNN) to extract temporal dynamics.

To facilitate the learning of the encoding RNN, we develop decoding RNNs to reconstruct the input sequences.

We also use curriculum learning for training the "inner" RNN.

We use the Oxford VGG-Net with 16 layers in Caffe to initialize our ConvNets.
MSRA Kaiming He

Xiangyu Zhang

Shaoqing Ren

Jian Sun
We train neural networks with a depth of over 150 layers. We propose a
"deep residual learning" framework [a] that eases the optimization and
convergence of extremely deep networks. Our "deep residual nets" enjoy
accuracy gains when the networks are substantially deeper than those
used previously. Such accuracy gains are not witnessed for many common
networks when going deeper.
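
In the formulation of [a], a block learns a residual function F(x) and adds it to its input, so the block outputs F(x) + x. The toy sketch below shows that shortcut on plain Python lists. The names `residual_block` and `zero_residual` are illustrative, and real residual nets use stacks of convolutional layers for F.

```python
def residual_block(x, residual_fn):
    """Return F(x) + x, where residual_fn plays the role of the learned F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("the residual branch must preserve the input size")
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_residual(x):
    """A branch that contributes nothing: F(x) = 0."""
    return [0.0 for _ in x]
```

With a zero residual branch the block reduces to the identity, so adding such blocks leaves the signal unchanged. That gives some intuition for why very deep stacks of these blocks remain easy to optimize.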

Our localization and detection systems are based on deep residual
nets and the "Faster R-CNN" system in our NIPS paper [b]. The extremely
deep representations generalize well, and greatly improve the results of
the Faster R-CNN system. Furthermore, we show that the region proposal
network (RPN) in [b] is a generic framework and performs excellently for
localization.

We only use the ImageNet main competition data. We do not use the Scene/VID data.

The details will be disclosed in a later technical report of [a].

[a] "Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Tech Report 2015.

[b] "Faster R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks", Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun.
NIPS 2015.

N-ODG Qi Zheng,

Wenhua Fang,

Da Xiang,

Xiao Wang,

Cheng Tong,

National Engineering Research Center for Multimedia Software(NERCMS), Wuhan University, Wuhan, China
Object detection is a very challenging task due to the variety of
scales, poses, illuminations, partial occlusions and truncation,
especially in large-scale datasets [1]. Under these conditions, traditional
approaches based on shallow features do not work well. To address this
problem, deep convolutional neural networks (CNNs)[2] are applied to
detect objects from image patches in a way that resembles the hierarchical
visual cortex, but their computational complexity is very high. To balance
effectiveness and efficiency, we present a novel object detection
method that uses more effective proposals based on selective search[4]. Inspired
by R-CNN[3], we exploit a novel neural network structure so that a high
detection rate and low computational complexity can be
achieved simultaneously. Experimental results demonstrate that the
proposed method produces high quality detection results both
quantitatively and perceptually.

[1] Russakovsky O, Deng J, Su H, et al. ImageNet Large Scale Visual
Recognition Challenge[J]. International Journal of Computer Vision,
2014:1-42.

[2] Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification
with Deep Convolutional Neural Networks[J]. Advances in Neural
Information Processing Systems, 2012, 25:2012.

[3] Girshick R, Donahue J, Darrell T, et al. Rich Feature
Hierarchies for Accurate Object Detection and Semantic Segmentation[C]//
Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference
on. IEEE, 2014:580-587.

[4] Uijlings J R R, Sande K E A V D, Gevers T, et al. Selective
Search for Object Recognition[J]. International Journal of Computer
Vision, 2013, 104(2):154-171.
NECLAMA Wongun Choi (NEC-Labs)

Samuel Schulter (NEC-Labs)
This is a baseline entry relying on the current state-of-the-art in standard object detection.

We use the Faster-RCNN framework [1] and finetune the network for the 30 classes with the provided training data.

The model is essentially the same as in [1], except that the training data has changed.

We did not experiment much with the hyper-parameters, but we expect
better results with more training iterations and proper sampling of the
video frames.

This model does not exploit temporal information at all but is a
good object detector on static images, which is why this entry should
serve as a baseline.

[1] Ren, He, Girshick, Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. ArXiv 2015.

NECTEC-MOONGDO Sanparith Marukatat, Ithipan Methasate, Nattachai Watcharapinchai, Sitapa Rujikietgumjorn

IMG Lab, National Electronics and Computer Technology Center, Thailand

We first built an AlexNet[NIPS2012_4824] model using Caffe[jia2014caffe] on ImageNet’s object classification dataset.

Our AlexNet reaches 55.76% accuracy on object classification.

Then we replaced the classification layers (fc6 and fc7) with larger ones.

Based on a comparison of the training data sizes of the object classification
dataset and the Places2 dataset, we doubled the number of hidden nodes
in these two layers.

We connected this structure to a new output layer with 401 nodes for the 401 classes in the Places2 dataset.

The Places2 training dataset was split into 2 parts.

The first part was used to adjust the weights of the new layers.

On the first part, the model was trained for 1,000,000 iterations in total.

The learning rate was initialized to 0.001 and then divided by 10 every 200,000 iterations.

This yielded 43.18% accuracy on the validation set.

Then we retrained the whole convolutional network using the second part.

We set the learning rate of the new layers to be 100 times higher than that of the lower layers.

This raised the validation accuracy to 43.31%.

We also trained a model on the Places2 training dataset from the beginning, but the method described above achieved a better result.
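
The learning rate schedule used for the first part can be written as a step function of the iteration count. A minimal sketch, assuming the schedule follows the usual "step" form `base_lr * gamma ** floor(iteration / step_size)`; the function name is illustrative, and the defaults are the values quoted above.

```python
def step_learning_rate(iteration, base_lr=0.001, gamma=0.1, step_size=200_000):
    """Learning rate after `iteration` updates under a step decay schedule."""
    # The rate is multiplied by gamma once per completed step of step_size iterations.
    return base_lr * gamma ** (iteration // step_size)
```

Over the 1,000,000 training iterations this schedule lowers the rate four times, at iterations 200,000, 400,000, 600,000 and 800,000.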

@article{jia2014caffe,

Author = {Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and
Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama,
Sergio and Darrell, Trevor},

Journal = {arXiv preprint arXiv:1408.5093},

Title = {Caffe: Convolutional Architecture for Fast Feature Embedding},

Year = {2014}

}

@incollection{NIPS2012_4824,

title = {ImageNet Classification with Deep Convolutional Neural Networks},

author = {Alex Krizhevsky and Sutskever, Ilya and Geoffrey E. Hinton},

booktitle = {Advances in Neural Information Processing Systems 25},

editor = {F. Pereira and C.J.C. Burges and L. Bottou and K.Q. Weinberger},

pages = {1097--1105},

year = {2012},

publisher = {Curran Associates, Inc.},

url = {http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf}

}

NEIOP NEIOPs We pretrained a VGG16 model on Places205 database and then
finetuned the model on Places2 database. All images are resized to 224
by N. Multi-scale & multi-crop are used at the testing stage.
NICAL Jiang Chunhui, USTC I am a student at USTC; my major is computer vision and deep learning.
ntu_rose wang xingxing(NTU ROSE),wang zhenhua(NTU ROSE), yin jianxiong(NTU
ROSE), gu jiuxiang(NTU ROSE), wang gang(NTU ROSE), Alex Kot(NTU ROSE),
Jenny Chen(Tencent)
For the scene task, we first train VGG-16[5], VGG-19[5] and
Inception-BN[1] models. After this step, we use a CNN Tree to learn
fine-grained features[6]. Finally, we combine the CNN Tree model with the
VGG-16, VGG-19 and Inception-BN models to produce the final prediction.

[1]Sergey Ioffe, Christian Szegedy. Batch Normalization:
Accelerating Deep Network Training by Reducing Internal Covariate Shift.
arXiv preprint arXiv:1502.03167

[2]Ren Wu, Shengen Yan, Yi Shan, Qingqing Dang, Gang Sun. Deep
Image: Scaling up Image Recognition. arXiv preprint arXiv:1501.02876

[3]Andrew G. Howard. Some Improvements on Deep Convolutional Neural
Network Based Image Classification. arXiv preprint arXiv:1312.5402

[4]Karen Simonyan, Andrew Zisserman. Very Deep Convolutional
Networks for Large-Scale Image Recognition. arXiv preprint
arXiv:1409.1556

[5]Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao. Towards Good
Practices for Very Deep Two-Stream ConvNets. arXiv preprint
arXiv:1507.02159

[6]Zhenhua Wang, Xingxing Wang, Gang Wang. Learning Fine-grained
Features via a CNN Tree for Large-scale Classification. arXiv preprint

PCL Bangalore Dipankar Das; Intel Labs

Sasikanth Avancha; Intel Labs

Dheevatsa Mudigere; Intel Labs

Nataraj Jammalamadaka; Intel Labs

Karthikeyan Vaidyanathan; Intel Labs
We jointly train image classification and object localization on a single CNN
using cross entropy loss and L2 regression loss respectively. The network
predicts both the location of the object and a corresponding confidence score.
We use a variant of the network topology (VGG-A) proposed by [1]. This network
is initialized using the weights of a classification-only network. It
is used to identify bounding boxes for the objects, while a 144-crop classification
is used to classify the image.

The network has been trained on Intel Parallel Computing Lab’s deep learning library
(PCL-DNN) and all the experiments were performed on 32-node Xeon E5 clusters. A network
of this size typically takes about 30 hrs for training on our deep learning framework.
Multiple experiments for fine-tuning were performed in parallel on NERSC’s Edison and
Cori clusters, as well as Intel’s Endeavor cluster.
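
The joint objective can be sketched as a sum of a cross entropy term and an L2 box regression term. This is a simplified illustration for a single sample: the balancing factor `reg_weight` and the function name are assumptions, since the description does not say how the two losses are weighted.

```python
import math


def joint_loss(class_probs, label, pred_box, true_box, reg_weight=1.0):
    """Cross entropy on the class plus squared L2 error on the box coordinates."""
    classification = -math.log(class_probs[label])
    regression = sum((p - t) ** 2 for p, t in zip(pred_box, true_box))
    # reg_weight is a hypothetical knob for balancing the two terms.
    return classification + reg_weight * regression
```

Because both terms are computed from the outputs of one network, a single backward pass updates the shared layers for classification and localization at once.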

[1] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.

Qualcomm Research Daniel Fontijne

Koen van de Sande

Eren Gölge

Blythe Towal

Anthony Sarah

Cees Snoek
We present NeoNet, an inception-style [1] deep convolutional neural
network ensemble that forms the basis for our work on object detection,
object localization and scene classification. Where traditional deep
nets in the ImageNet challenge are image-centric, NeoNet is
object-centric. We emphasize the notion of objects during
pseudo-positive mining, in the improved box proposals [2], in the
augmentations, during batch-normalized pre-training of features, and via
bounding box regression at run time [3].

[1] S. Ioffe & C. Szegedy. Batch Normalization: Accelerating
Deep Network Training by Reducing Internal Covariate Shift. In ICML
2015.

[2] K.E.A. van de Sande et al. Segmentation as Selective Search for Object Recognition. In ICCV 2011

[3] R. Girshick. Fast R-CNN. In ICCV 2015.

ReCeption Christian Szegedy, Drago Anguelov, Pierre Sermanet, Vincent Vanhoucke, Sergey Ioffe, Jianmin Chen (Google) Next generation of inception architecture (ensemble of 4 models),
combined with a simple one-shot class agnostic multi-scale multibox.
RUC_BDAI Peng Han, Renmin University of China

Wenwu Yuan, Renmin University of China

Zhiwu Lu, Renmin University of China

Jirong Wen, Renmin University of China
The main components of our algorithm are the R-CNN model [1] and
the video segmentation algorithm [2]. Given a video, we use the well
trained R-CNN model [1] to extract the potential bounding boxes and
their object categories for each keyframe. Considering that R-CNN has
ignored the temporal context across all the keyframes of the video, we
further utilize the results (with temporal context) of the video
segmentation algorithm [2] to refine the results of R-CNN. In addition,
we also define several local refinement rules using the spatial and
temporal context to obtain better object detection results.

[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature
hierarchies for accurate object detection and semantic segmentation. In
CVPR, 2014.

[2] A. Papazoglou and V. Ferrari, Fast object segmentation in unconstrained video. In ICCV, 2013.

SamExynos Qian Zhang(Beijing Samsung Telecom R&D Center)

Peng Liu(Beijing Samsung Telecom R&D Center)

Wei Zheng(Beijing Samsung Telecom R&D Center)

Zhixuan Li(Beijing Samsung Telecom R&D Center)

Junjun Xiong(Beijing Samsung Telecom R&D Center)
Our submissions are trained with modified versions of [1] and [2]. We
use the structure of [1] but remove the batch normalization layers, and
ReLU is replaced by PReLU[3]. In addition, a modified version of latent
semantic representation learning is integrated into the structure of
[1].

[1]Sergey Ioffe, Christian Szegedy, Batch Normalization:
Accelerating Deep Network Training by Reducing Internal Covariate Shift.
ICML 2015.

[2]Xin Li, Yuhong Guo, Latent Semantic Representation Learning for Scene Classification. ICML 2014.

[3]Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Delving
Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet
Classification. ICCV 2015

[4]Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, Going deeper with convolutions. CVPR 2015.

SIAT_MMLAB Limin Wang, Sheng Guo, Weilin Huang, Yu Qiao

Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

We propose a new scene recognition system with deep convolutional
models. Specifically, we address this problem from four aspects:

(i) Multi-scale CNNs: we utilize Inception2 architecture as our main
exploration network structure due to its performance and efficiency. We
propose a multi-scale CNN framework, where we train CNNs from image
patches of two resolutions (224 cropped from 256, and 336 cropped from
384). For CNN at low resolution, we use the same network as Inception2.
For CNN at high resolution, we design a deeper network based on
Inception2.

(ii) Handling Label Ambiguity: As the scene labels are not mutually
exclusive and some categories are easily confused, we
propose two methods to handle this problem. First, according to the
confusion matrix on the validation dataset, we merge some scene
categories into a single super-category. Second, we apply the
Places205 scene model to the images and use its soft output as another
target to guide the training of the CNN.

(iii) Better CNN Optimization: We use a large batch size (1024) to train
the CNNs. Meanwhile, we try decreasing the learning rate
exponentially. Moreover, we design a locally-supervised learning
method to learn the weights of the CNNs.

(iv) Combining CNNs of different architectures: considering the
complementarity of networks with different architectures, we also fuse
the prediction results of several networks: VGGNet13, VGGNet16, VGGNet19 and
MSRANet-B.
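
The second idea under (ii), using the soft output of a Places205 model as an extra target, can be sketched as a two-term loss. Everything beyond that idea is an assumption here: the sketch supposes an auxiliary output `aux_probs` over the Places205 classes next to the main output `main_probs`, and a mixing factor `alpha`, none of which are specified in the description.

```python
import math


def cross_entropy(target_dist, predicted_probs):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, predicted_probs) if t > 0)


def loss_with_soft_targets(main_probs, hard_label, aux_probs, soft_targets, alpha=0.5):
    """Hard-label loss on the main output plus a soft-target loss on the auxiliary output."""
    hard_loss = -math.log(main_probs[hard_label])
    # soft_targets would come from the pre-trained Places205 model.
    soft_loss = cross_entropy(soft_targets, aux_probs)
    return hard_loss + alpha * soft_loss
```

The soft targets carry information about which scene categories look alike, which is what makes them useful when the hard labels are ambiguous.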

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet
classification with deep convolutional neural networks. In NIPS, 2012.

[2] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep
network training by reducing internal covariate shift. CoRR,
abs/1502.03167, 2015.

[3] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[4] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into
rectifiers: Surpassing human-level performance on imagenet
classification. CoRR, abs/1502.01852, 2015.

[5] Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the
knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[6] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao. Towards good practices
for very deep two-stream convnets. CoRR, abs/1507.02159, 2015.

SIIT_KAIST-ETRI Youngsoo Kim(KAIST), Heechul Jung(KAIST), Jeongwoo Ju(KAIST),
Byungju Kim(KAIST), Yeakang Lee(KAIST), Junmo Kim(KAIST), Joongwon
Hwang(ETRI), Young-Suk Yoon(ETRI), Yuseok Bae(ETRI)
For this work, we use a modified GoogLeNet and test-time augmentation such as cropping, scaling, rotation and projective transformation.

The network model was pre-trained on the localization dataset.
SIIT_KAIST-TECHWIN Youngsoo Kim(KAIST), Heechul Jung(KAIST), Jeongwoo Ju(KAIST),
Byungju Kim(KAIST), Sihyeon Seong(KAIST), Junho Yim(KAIST), Gayoung
Lee(KAIST), Yeakang Lee(KAIST), Minju Jung(KAIST), Junmo Kim(KAIST),
Soonmin Bae(Hanwha Techwin), Jayeong Ku(Hanwha Techwin), Seokmin
Yoon(Hanwha Techwin), Hwalsuk Lee(Hanwha Techwin), Jaeho Jang(Hanwha
Techwin)
Our method for scene classification is based on deep convolutional neural networks.

We took networks pre-trained on the ILSVRC2015 localization dataset and retrained them on the Places2 dataset at 256x256 resolution.

At test time, we used ten-crop data augmentation and combined four slightly different models.

This work was supported by Hanwha Techwin.
SYSU_Vision Liang Lin, Sun Yat-sen University

Wenxi Wu, Sun Yat-sen University

Zhouxia Wang, Sun Yat-sen University

Depeng Liang, Sun Yat-sen University

Tianshui Chen, Sun Yat-sen University

Xian Wu, Sun Yat-sen University

Keze Wang, Sun Yat-sen University

Lingbo Liu, Sun Yat-sen University
We design our detection model based on Fast R-CNN [1] and improve
it in the following two aspects. First, we utilize a CNN-based method
to re-rank and refine the proposals generated by proposal methods.
Specifically, we re-rank the proposals to reject the low-confidence ones
and refine the proposals to obtain more accurate locations.
Second, we incorporate self-paced learning (SPL)
in our optimization stage. We first initialize a detector with all the
annotated training samples and assign a confidence to each candidate
with this detector. We then fine-tune the detector using the
candidates with high confidence.

[1] Ross Girshick, Fast R-CNN, In ICCV 2015

Szokcka Research Group Dmytro Mishkin Ensemble of cheap (2x2 + 2x2) and (3x1 + 1x3) models, inspired by
[1], [2] in Inception-style Convnet[3], trained with Layer-sequential
unit-variance orthogonal initialization[4].

[1]Xudong Cao. A practical theory for designing very deep convolutional neural networks, 2014. (unpublished)

[2]Ben Graham. Sparse 3D convolutional neural networks, BMVC 2015

[3]Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott
Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew
Rabinovich. Going Deeper with Convolutions, CVPR 2015

[4] D.Mishkin, J. Matas. All you need is a good init, arXiv:1511.06422.

Tencent-Bestimage Hangyu Yan, Tencent

Xiaowei Guo, Tencent

Ruixin Zhang, Tencent

Hao Ye, Tencent
The localization task is composed of two parts: indicating the category
and pointing out where the object is. Given the great progress in image
classification[1][2], indicating the category no longer seems to be
the bottleneck of the localization task. In previous competitions, dense
box regression[3][2] was used to generalize boundary prediction.
However, the image crops used for dense prediction are bound to certain
positions and scales, and it is hard to regress to a single box when more
than one object appears in the same image crop.

Given a large number of box proposals (e.g. from selective search[5], about
1500 boxes per image), the localization task becomes (a) how to find the
boxes that contain ground truth objects, and (b) which category lies in each
box. Inspired by Fast R-CNN[4], we propose a classification
and localization framework that combines global context
information and local box information by selecting proper (class, box)
pairs: first we pre-train a GoogLeNet for classification; then we apply two
kinds of transformations to the network to obtain
objectness and the local category within the box; and finally, an SVM
classifier is applied to select proper (class, box) pairs.

1. Fine-tuning for Objectness

Similar to Fast R-CNN, we train a network to indicate the
"objectness" of a box. The probability of objectness is computed by a
two-class softmax layer over a fully-connected layer. A box proposal is
positive if its intersection over union (IoU) with a ground truth box is
at least 0.7, and negative if IoU<0.3. No information about the 1000
categories is used at this stage. A box offset regression layer is also
jointly trained with objectness and is ignored when the box is negative,
as is standard practice in[4].
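
The labelling rule for objectness training can be sketched as follows. The thresholds 0.7 and 0.3 come from the text. Treating proposals that fall between the two thresholds as ignored is an assumption, as are the function names and the `(x1, y1, x2, y2)` box layout.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def objectness_label(proposal, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (object), 0 (background) or None (assumed to be ignored)."""
    best = max((iou(proposal, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_thresh:
        return 1
    if best < neg_thresh:
        return 0
    return None
```

Leaving a gap between the two thresholds keeps ambiguous proposals out of both classes, which is a common way to make a two-class objectness target cleaner.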

At test time, the objectness of every box proposal is evaluated by
the network. Then we perform non-maximum suppression and keep the boxes
with the highest objectness.

2. Fine-tuning for Local Classification

We obtain the global classification by averaging the results of multiple
crops (e.g. 144 crops per image[1]) using the pre-trained GoogLeNet model.
In addition, we fine-tune the GoogLeNet to predict the 1000
categories within a box, which we call "local classification". The
training samples are taken around the ground truth boxes and resized to
the network input size.

At test time, we crop images at the regressed boundaries of the
boxes with the highest objectness and feed them into the local classification
network. Classes with high probability from local and global classification
are retained.

3. Combine Information and Pair Select

So far, we have obtained objectness and offset regression for some
boxes, and classification results at both the local and global levels. For each
possible (class, box) pair, all this information forms a feature vector,
on which an SVM is trained.

At test time, the pairs with the top-5 SVM confidence become our final result.

It may be surprising that this simple strategy can be more accurate
even without model ensembles. We believe that the object boundary is
related to low-level vision cues such as corners, edges and shapes. As
the network goes deeper, the features become more and more abstract. It
is questionable to train a box regression while retaining the information
of 1000 categories, especially when there are only 500 images per
category on average. Our solution avoids seeking the box directly and
instead takes full advantage of the classification capability of deep networks.

We train our network models using modified cxxnet[6].

[1]Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep
Network Training by Reducing Internal Covariate Shift. In JMLR, 2015.

[2]Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

[3]Sermanet P, Eigen D, Zhang X, et al. Overfeat: Integrated
recognition, localization and detection using convolutional networks. In
ICLR, 2014.

[4]Girshick R. Fast R-CNN. In ICCV, 2015.

[5]Uijlings J R R, van de Sande K E A, Gevers T, et al. Selective
search for object recognition. International journal of computer vision,
2013, 104(2): 154-171.

[6]cxxnet: https://github.com/dmlc/cxxnet

The University of Adelaide Zifeng Wu, the University of Adelaide

Chunhua Shen, the University of Adelaide

Anton van den Hengel, the University of Adelaide
Our method builds largely on Fast R-CNN. Multiple VGG16 and
VGG19 networks are involved, which were pre-trained with the ImageNet
CLS-LOC dataset. Each of the models is initialized from a different
pre-trained model and/or tuned with a different data augmentation strategy.
Furthermore, we observe that feature maps obtained by applying the 'hole convolution
algorithm' are beneficial. Selective-search proposals are filtered by a
pre-trained two-way (object or non-object) classifier. The outputs of
each network for the original and flipped images at multiple scales are
averaged to obtain the predictions.
THU-UTSA-MSRA Liang Zheng, University of Texas at San Antonio

Shengjin Wang, Tsinghua University

Qi Tian, University of Texas at San Antonio

Jingdong Wang, Microsoft Research
Our team submits results on the scene classification task using the Places2 dataset [4].

We trained three CNNs on the Places2 dataset using the GoogleNet
architecture [1]. Specifically, Caffe [2] is used for training. Among the three
models, the first is trained by fine-tuning the GoogleNet model
trained on the Places dataset [3]. The classification accuracy of this
model is: top-1 accuracy = 42.96%, top-5 accuracy = 75.35%. This model
is obtained after 3,320,000 mini-batches with a batch size of 32.
The learning rate is set to 0.001 and gamma to 0.1, with a step size
of 750,000.

The second and third models are trained using the "quick" and
the default solvers provided with GoogleNet, respectively, and both models
are trained from scratch. Specifically, for the second model, we use a
base learning rate of 0.01, gamma = 0.7, and step size = 320,000. We run
this model for 4,500,000 mini-batches, and each mini-batch is of size
32. For this model, our result on the validation set is: top-1 accuracy =
43.41%, top-5 accuracy = 75.37%. The only change we make to the
GoogleNet architecture is to have 401 outputs in the last fully connected
layer. Averaging is done after the softmax calculation.

For submission, we submit results of each model as the first three
runs (run 1, run 2, and run 3). Then, run 4 is the averaged result of
fine-tuned GoogleNet + quick GoogleNet. Run 5 is the averaged result of
all three models.
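
Averaging after the softmax, as used for runs 4 and 5, means each model's scores are first turned into probabilities and only then averaged. A minimal sketch with illustrative function names, taking each model's raw class scores as input:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_after_softmax(logits_per_model):
    """Convert each model's scores to probabilities, then average per class."""
    probs = [softmax(logits) for logits in logits_per_model]
    num_models = len(probs)
    num_classes = len(probs[0])
    return [sum(p[c] for p in probs) / num_models for c in range(num_classes)]
```

Averaging probabilities rather than raw scores keeps the result a valid distribution and prevents a model with larger raw score magnitudes from dominating the others.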

[1] Szegedy, Christian, et al. "Going deeper with convolutions." arXiv preprint arXiv:1409.4842 (2014).

[2] Jia, Yangqing, et al. "Caffe: Convolutional architecture for
fast feature embedding." Proceedings of the ACM International Conference
on Multimedia. ACM, 2014.

[3] Zhou, Bolei, et al. "Learning deep features for scene
recognition using places database." Advances in Neural Information
Processing Systems. 2014.

[4] Places2: A Large-Scale Database for Scene Understanding. B.
Zhou, A. Khosla, A. Lapedriza, A. Torralba and A. Oliva, Arxiv, 2015

Trimps-Fudan-HUST Jianying Zhou(1), Jie Shao(1), Lin Mei(1), Chuanping Hu(1), Xiangyang Xue(2), Zheng Zhang(3), Xiang Bai(4)

(1)The Third Research Institute of the Ministry of Public Security, China

(2)Fudan university

(3)New York University Shanghai

(4)Huazhong University of Science & Technology

Object detection:

Our models were trained based on Fast R-CNN and Faster R-CNN. 1)
More training signals were added, including negative classes and
objectness. 2) Pooling layers were replaced with strided convolutional layers
for more accurate localization. 3) Extra data from Microsoft COCO and more
anchors were used. 4) Various models were combined with weighted NMS.
Trimps-Soushen Jie Shao*, Xiaoteng Zhang*, Jianying Zhou*, Zhengyan Ding*, Wenfei Wang, Lin Mei, Chuanping Hu (* indicates equal contribution)

(The Third Research Institute of the Ministry of Public Security, P.R. China.)

Object detection:

Our models were trained based on Fast R-CNN and Faster R-CNN. 1)
More training signals were added, including negative classes and
objectness. Some models were trained on 489 subcategories first and then
fine-tuned on the 200 categories. 2) Pooling layers were replaced with strided
convolutional layers for more accurate localization. 3) Extra data from
Microsoft COCO and more anchors were used. 4) An iterative scheme alternates
between scoring the proposals and refining their localizations with
bounding box regression. 5) Various models were combined with weighted
NMS.

Object localization:

Different data augmentation methods were used, including random
crops, multiple scales, and contrast and color jittering. Some models
were trained while maintaining the aspect ratio of the input images,
while others were not. In the test phase, whole uncropped images are
densely processed at various scales. The fused classification result is
then generated from the scores and labels jointly. On the localization
side, we follow the Fast R-CNN framework and use an iterative scheme
like the one in the detection task. We then select the top-k regions
and average their coordinates as the output. Results from multiple
models are fused in different ways, using the model accuracy as weights.
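
The top-k averaging step can be sketched as below. This is only an illustration: boxes are assumed to be (x1, y1, x2, y2) tuples with one confidence score each, the function name is hypothetical, and the plain unweighted mean is an assumption, since the description does not say whether the average is score-weighted.

```python
def average_top_k_boxes(boxes, scores, k):
    """Average the coordinates of the k highest-scoring boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)[:k]
    return tuple(sum(boxes[i][d] for i in order) / len(order) for d in range(4))

# Toy candidate regions for one image (hypothetical values).
boxes = [(10, 10, 50, 50), (12, 8, 54, 52), (0, 0, 100, 100)]
scores = [0.9, 0.8, 0.1]
final_box = average_top_k_boxes(boxes, scores, k=2)
```

With k=2 the low-scoring full-image box is ignored and the output is the mean of the two confident, nearly coincident regions.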

Object detection from video:

We use the same models as in the object detection task. Some of these
models were fine-tuned using VID data. We also tried a main-object
constraint that considers the whole snippet rather than a single frame.

Scene classification:

Our models are based on both MSRA-net and BN-GoogLeNet, plus several
improvements: 1) A subset of the data is chosen stochastically at each
epoch so that each class has roughly the same number of images. This
both accelerates training and increases model diversity. 2) To use the
whole image and object-part information simultaneously, three patches
of different sizes (the whole image at 224x224, a crop of 160x160, and a
crop of 112x112) were fed into the network and concatenated at the last
convolution layer. 3) The MSRA-net was enlarged to 25 layers, and the
input of some BN-nets was changed from 224x224 to 270x270. 4) Dense
sampling and multi-crop (50x3) were used for testing.
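
Point 1) can be illustrated with a small sampler that draws a fresh, class-balanced subset for every epoch. This is a sketch under stated assumptions: the function name and the `per_class` parameter are hypothetical, and sampling small classes with replacement is one possible choice that the description does not specify.

```python
import random
from collections import defaultdict

def sample_balanced_epoch(labels, per_class, rng):
    """Return shuffled sample indices with exactly `per_class` draws per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    epoch = []
    for idxs in by_class.values():
        if len(idxs) >= per_class:
            # Large class: draw a random subset without replacement.
            epoch.extend(rng.sample(idxs, per_class))
        else:
            # Small class (assumption): draw with replacement to reach the quota.
            epoch.extend(rng.choices(idxs, k=per_class))
    rng.shuffle(epoch)
    return epoch

# Toy, heavily imbalanced label list (hypothetical class names).
labels = ["beach"] * 100 + ["kitchen"] * 10 + ["attic"] * 3
rng = random.Random(0)
epoch_indices = sample_balanced_epoch(labels, per_class=5, rng=rng)
counts = defaultdict(int)
for i in epoch_indices:
    counts[labels[i]] += 1
```

Because a new subset is drawn each epoch, large classes are still covered over the course of training while each individual epoch stays short.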

UIUC-IFP Pooya Khorrami - UIUC (*)

Tom Le Paine - UIUC (*)

Wei Han - UIUC

Prajit Ramachandran - UIUC

Mohammad Babaeizadeh - UIUC

Honghui Shi - UIUC

Thomas S. Huang - UIUC

* - equal contribution

Our system uses deep convolutional neural networks (CNNs) for
object detection and can be broken up into three distinct phases: (i)
object proposal generation, (ii) object classification, and (iii)
post-processing using non-maximum suppression (NMS). The first two
phases are based on the Faster R-CNN framework presented in [1].

First, we find image regions that may contain objects via bounding
boxes (i.e. proposals) using the fully convolutional region-proposal
network (RPN) of [1]. The network’s topology is similar to the
Zeiler-Fergus (ZF) network [2]. Once the RPN has been trained, we train
another neural network (RCNN) [1] that tries to classify the object in a
given proposal. For this phase, we use a VGG16-style [3] network that
was pre-trained on the ImageNet Classification and Localization Data
(CLS) and only fine-tune the last fully-connected layer. Instead of
trying to classify 200 objects, the layer has been altered to classify a
proposal as being one of 30 classes. Both the RPN and RCNN are trained
using the initial training release.

During testing, we pass an image to the RPN which extracts 500
proposals. We then apply the RCNN to each proposal and extract 30
confidence scores and 30 refined bounding boxes. We then consider one of
two possible post-processing algorithms: (i) NMS on each frame or (ii)
sequence-NMS (SEQ-NMS) on each video snippet.

Regular frame-wise NMS operates on a single frame by iteratively
selecting a class’s most confident detection in the frame and removing
detections in the vicinity that have sufficient overlap. SEQ-NMS, on the
other hand, operates on each video snippet. It iteratively selects the
sequence of boxes over time that has maximum score and then suppresses
detections that overlap with any of the members in the selected
sequence. We accomplish this by constructing a graph over the snippet’s
frames and doing dynamic programming. We score each sequence using
either the average (NMS-AVG) or the max score (NMS-MAX) of the boxes. We
then re-score each box in the kept sequence with the overall sequence
score. One of our models selects whether to score using the average or
the max for each class depending on each method's performance on the
initial validation set.
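
The sequence selection and re-scoring loop described above can be sketched for a single class as follows. This is an illustrative reconstruction, not the team's implementation: the linking rule (IoU between boxes in adjacent frames), both thresholds, the use of the summed score to pick the best sequence, and all function names are assumptions. Scores are assumed to be positive.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def best_sequence(frames, link_iou):
    """Dynamic programming over frames; returns [(frame, index), ...] of the best path."""
    best = []  # best[t][i] = (cumulative score, index of predecessor in frame t-1)
    for t, dets in enumerate(frames):
        row = []
        for box, score in dets:
            cum, prev = score, None
            if t > 0:
                for j, (pbox, _) in enumerate(frames[t - 1]):
                    cand = best[t - 1][j][0] + score
                    if iou(box, pbox) >= link_iou and cand > cum:
                        cum, prev = cand, j
            row.append((cum, prev))
        best.append(row)
    ends = [(t, i) for t, row in enumerate(best) for i in range(len(row))]
    if not ends:
        return []
    t, i = max(ends, key=lambda ti: best[ti[0]][ti[1]][0])
    path = []
    while i is not None:  # backtrack through the stored predecessors
        path.append((t, i))
        i = best[t][i][1]
        t -= 1
    return path[::-1]

def seq_nms(frames, link_iou=0.5, nms_iou=0.3, rescore="avg"):
    """frames[t] is a list of (box, score) for one class; returns kept detections."""
    frames = [list(dets) for dets in frames]
    kept = [[] for _ in frames]
    while any(frames):
        path = best_sequence(frames, link_iou)
        scores = [frames[t][i][1] for t, i in path]
        new_score = max(scores) if rescore == "max" else sum(scores) / len(scores)
        for t, i in path:
            box = frames[t][i][0]
            kept[t].append((box, new_score))
            # Drop the selected box and everything overlapping it in that frame.
            frames[t] = [d for k, d in enumerate(frames[t])
                         if k != i and iou(d[0], box) < nms_iou]
    return kept

# Toy snippet: one object tracked over three frames plus a distractor in frame 1.
snippet = [
    [((10, 10, 50, 50), 0.9)],
    [((12, 10, 52, 50), 0.3), ((200, 200, 240, 240), 0.5)],
    [((14, 10, 54, 50), 0.8)],
]
kept_avg = seq_nms(snippet, rescore="avg")
kept_max = seq_nms(snippet, rescore="max")
```

In the toy snippet the weak middle detection (0.3) is lifted by its neighbors, which is the effect the re-scoring step is meant to provide.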

References

[1] – Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun.
"Faster r-cnn: Towards real-time object detection with region proposal
networks." arXiv preprint arXiv:1506.01497 (2015).

[2] - Zeiler, Matthew D., and Rob Fergus. "Visualizing and
understanding convolutional networks." In Computer Vision–ECCV 2014, pp.
818-833. Springer International Publishing, 2014.

[3] - Simonyan, Karen, and Andrew Zisserman. "Very deep
convolutional networks for large-scale image recognition." arXiv
preprint arXiv:1409.1556 (2014).

UIUCMSR Yingzhen Yang (UIUC), Wei Han (UIUC), Nebojsa Jojic (Microsoft
Research), Jianchao Yang, Honghui Shi (UIUC), Shiyu Chang (UIUC), Thomas
S. Huang (UIUC)
Abstract:

We develop a new architecture for deep Convolutional Neural
Networks (CNNs), named Filter Panorama Convolutional Neural Networks
(FPCNNs), for this scene classification competition. Convolutional layers
are essential parts of CNNs and each layer is comprised of a set of
trainable filters. To enhance the representation capability of the
convolutional layers with more filters while maintaining almost the same
parameter size, the filters of one convolutional layer (or possibly
several convolutional layers) of FPCNNs are replaced by a filter
panorama, wherein each window of the filter panorama serves as a filter.
With the densely extracted overlapping windows from the filter
panorama, a significantly larger filter set is obtained without the risk
of overfitting since the parameter size of the filter panorama is the
same as that of the original filters in CNNs.

The idea of the filter panorama is inspired by epitome [1], which was
developed in the computer vision and machine learning literature for
learning a condensed version of Gaussian Mixture Models (GMMs). In
epitome, the Gaussian means are represented by a two dimensional matrix
wherein each window in this matrix contains parameters of the Gaussian
means for a Gaussian component. The same structure is adopted for
representing the Gaussian covariances. With almost the same parameter
space as GMMs, the epitome possesses significantly more Gaussian
components than its GMM counterpart since many more Gaussian means and
covariances can be extracted densely from the mean and covariance
matrices of the epitome. Therefore, the generalization and
representation capability of epitome outshines GMMs with almost the same
parameter space, while circumventing the potential overfitting.

The above characteristics of epitome encourage us to arrange filters
in a way similar to epitome in the FPCNNs. More precisely, we construct
a three dimensional matrix named filter panorama for the convolutional
layer of FPCNNs, wherein each window of the filter panorama plays the same
role as the filter in the convolutional layer of CNNs. The filter
panorama is designed such that the number of non-overlapping windows in
the filter panorama is almost equal to the number of filters in the
corresponding convolutional layer of CNNs. By densely extracting
overlapping windows from the filter panorama, there are many more
filters in the filter panorama that vary smoothly in the spatial domain,
and neighboring filters share weights in their overlapping region.
These smoothly varying filters tend to activate on more types of
features that also exhibit small variations in the input volume,
increasing the chance of extracting more robust deformation invariant
features by the subsequent max-pooling layer [2].
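
The relationship between parameter count and filter count can be made concrete with a small sketch. The layout `panorama[channel][y][x]`, the window size, the lack of wrap-around at the borders, and the function name are all assumptions made for illustration; the description does not give these details.

```python
def extract_windows(panorama, k, stride=1):
    """Cut k x k windows (across all channels) out of panorama[c][y][x]."""
    depth = len(panorama)
    height, width = len(panorama[0]), len(panorama[0][0])
    filters = []
    for y in range(0, height - k + 1, stride):
        for x in range(0, width - k + 1, stride):
            filters.append(
                [[row[x:x + k] for row in panorama[c][y:y + k]] for c in range(depth)]
            )
    return filters

# Toy 3-channel, 12 x 12 panorama with distinct values per position.
panorama = [[[c * 1000 + y * 12 + x for x in range(12)] for y in range(12)]
            for c in range(3)]
n_params = 3 * 12 * 12

tiled = extract_windows(panorama, k=3, stride=3)  # non-overlapping windows
dense = extract_windows(panorama, k=3, stride=1)  # densely overlapping windows
```

Here the panorama holds as many parameters as 16 ordinary 3x3x3 filters, yet dense extraction yields 100 filters, and horizontally adjacent filters share two of their three columns.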

In addition to the superior representation capability, filter
panorama inherently enables a better visualization of the filters by
forming a “panorama” of the filters such that adjacent filters change
smoothly across the spatial domain, and similar filters group together.
This feature would benefit the design of the networks and make it easier
to observe the characteristics of the filters learnt in the
convolutional layer.

References:

[1] N. Jojic, B. J. Frey, and A. Kannan. Epitomic Analysis of
Appearance and Shape. In Proceedings of IEEE International Conference
on Computer Vision (ICCV), pp. 34-43, 2003.

[2] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is
the best multi-stage architecture for object recognition? In Proc.
ICCV, pages 2146–2153, 2009.

USTC_UBRI Guiying Li, USTC

Junlong Liu, USTC
We are graduate students from USTC, trying to do some research in computer vision and deep learning.
VUNO Yeha Lee (VUNO Inc.)

Kyu-Hwan Jung (VUNO Inc.)

Hyun-Jun Kim (VUNO Inc.)

Sangki Kim (VUNO Inc.)
Our localization models are based on deep convolutional neural networks.

For the classification model, which predicts the confidence score of
each object class in the viewing window, we ensembled four deep neural
networks, two of which are similar to [1] (CNN) and two of which are
their variants. The variant models (CLSTM) replace the last two
convolution layers of [1] with a 2D-LSTM [2] layer, which provides
robustness to object warping and contextual information.

For the localization model, which predicts the location of the bounding
box, we used the per-class regression (PCR) approach [3], which replaces
the last 1000-D softmax layer with a 4000-D regression layer.
Additionally, we replaced the SPP layer of CNN and CLSTM with a
max-pooling layer. Results from two combinations of the models were
submitted: CNN-CNN and CNN-CLSTM.
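
To illustrate how a 4000-D per-class regression output is read, the sketch below picks the four box coordinates belonging to one class. The class-major memory layout (four consecutive values per class) and the function name are assumptions; the description only states the output dimensionality.

```python
def box_for_class(regression_output, class_index, n_classes=1000):
    """Return the 4 box values that the PCR layer predicts for one class."""
    if len(regression_output) != 4 * n_classes:
        raise ValueError("expected 4 values per class")
    start = 4 * class_index  # assumed layout: [class 0 box, class 1 box, ...]
    return tuple(regression_output[start:start + 4])

# Toy output vector: all zeros except the box of hypothetical class 283.
output = [0.0] * 4000
output[4 * 283:4 * 283 + 4] = [0.1, 0.2, 0.8, 0.9]
predicted_box = box_for_class(output, class_index=283)
```

The class index would typically come from the classification model, so each class gets its own box regressor while sharing the rest of the network.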

For training, we used the scale jittering strategy in [3], where the
scale is sampled from a fixed set of scales. For testing, we used
multi-scale testing and merged all bounding boxes obtained from each
scale.

All experiments were performed using our own deep learning library called VunoNet on a GPU server with 4 NVIDIA Titan X GPUs.

[1] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Delving
Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet
Classification." arXiv preprint arXiv:1502.01852 (2015).

[2] Alex Graves, Santiago Fernandez and Jurgen Schmidhuber.
Multidimensional recurrent neural networks. In Proceedings of the 2007
International Conference on Artificial Neural Networks, Porto,
Portugal, September (2007).

[3] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional
networks for large-scale image recognition." arXiv preprint
arXiv:1409.1556 (2014).

WM Li Shen (University of Chinese Academy of Sciences)

Zhouchen Lin (Peking University)
We exploit a partially overlapping optimization strategy to improve
convolutional neural networks, alleviating the optimization
difficulty at lower layers and favoring better discrimination at higher
layers. We have verified its effectiveness on VGG-like architectures
[1]. We also apply two modifications to the network architecture. Model
A has 22 weight layers in total, adding three 3x3 convolutional layers
to VGG-19 [1] and replacing the last max-pooling layer with an SPP layer
[2]. Model B integrates multi-scale information combination. Moreover,
we apply a balanced sampling strategy during training to tackle the
non-uniform distribution of class samples. The algorithm and
architecture details will be described in our arXiv paper (available
online shortly).

In this competition, we submit five entries. The first is a single
model (model A), which achieved 16.33% top-5 error on the validation
dataset. The second is a single model (model B), which achieved 16.36%
top-5 error on the validation dataset. The third is a combination of
multiple CNN models with the averaging strategy. The fourth is the
combination of these CNN models with a product strategy. The fifth is
the combination of multiple CNN models with learnt weights.
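
The product and weighted combinations can be sketched as below. This is an illustration only: the renormalized product, the normalization of the weights, and the function names are assumptions, and the weights shown are made up rather than learnt. Plain averaging corresponds to the weighted form with equal weights.

```python
import math

def fuse_weighted(per_model_probs, weights):
    """Weighted mean of class probabilities; equal weights give plain averaging."""
    total = sum(weights)
    n_classes = len(per_model_probs[0])
    return [sum(w * p[c] for w, p in zip(weights, per_model_probs)) / total
            for c in range(n_classes)]

def fuse_product(per_model_probs):
    """Multiply class probabilities across models, then renormalize to sum to 1."""
    n_classes = len(per_model_probs[0])
    prod = [math.prod(p[c] for p in per_model_probs) for c in range(n_classes)]
    z = sum(prod)
    return [x / z for x in prod] if z > 0 else prod

# Toy probabilities from two hypothetical models over two classes.
probs = [[0.6, 0.4], [0.2, 0.8]]
averaged = fuse_weighted(probs, [1.0, 1.0])
product = fuse_product(probs)
weighted = fuse_weighted(probs, [3.0, 1.0])  # hypothetical weights
```

The product rule penalizes a class strongly when any single model assigns it a low probability, whereas the weighted mean lets a trusted model dominate.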

[1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR 2015

[2] K. He, X. Zhang, S. Ren and J. Sun. Spatial pyramid pooling in
deep convolutional networks for visual recognition. In ECCV 2014.

Yunxiao --- The model is trained based on the Fast R-CNN framework. The
selective search method is applied for object proposal generation. A
VGG16 model pre-trained on the image-level classification task is
used for initialization. A balanced fine-tuning dataset constructed from
the training and validation datasets for the object detection task is
used for fine-tuning the model. No other data augmentation or model
combination is applied.
ZeroHero Svetlana Kordumova, UvA; Thomas Mensink, UvA; Cees Snoek, UvA
ZeroHero recognizes scenes without using any scene images as
training data. Instead of using attributes for the zero-shot
recognition, we recognize a scene using a semantic word embedding that
is spanned by a skip-gram model of thousands of object categories [1].
We subsample 15K object categories from the 22K ImageNet dataset, for
which more than 200 training examples are available. Using those, we
train an inception-style convolutional neural network [2]. An unseen
test image is represented as the sparsified set of prediction scores
of the last network layer with softmax normalization. For the embedding
space, we learn a 500-dimensional word2vec model [3], which is trained
on the title, description, and tag text from the 100M Flickr photos in the
YFCC100M dataset [4]. The similarity between object and scene
affinities in the semantic space is computed with cosine similarity over
their word2vec representations, pooled with Fisher word vectors [1].
For each test image we predict the five highest scoring scenes.
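
The zero-shot scoring idea can be sketched with toy word vectors. This is a simplified illustration: it weights the cosine similarity between each kept object and each scene by the object's prediction score, and it omits the Fisher word vector pooling mentioned above. The 3-dimensional vectors, the category names, and the function names are all made up for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def sparsify(scores, top_t):
    """Keep only the top_t highest-scoring object categories."""
    kept = sorted(scores, key=scores.get, reverse=True)[:top_t]
    return {name: scores[name] for name in kept}

def rank_scenes(object_scores, object_vecs, scene_vecs, top_t=2, top_k=5):
    """Rank scene labels by score-weighted similarity to the detected objects."""
    kept = sparsify(object_scores, top_t)
    scene_scores = {
        scene: sum(p * cosine(object_vecs[obj], vec) for obj, p in kept.items())
        for scene, vec in scene_vecs.items()
    }
    return sorted(scene_scores, key=scene_scores.get, reverse=True)[:top_k]

# Toy embeddings and object predictions (hypothetical values).
object_vecs = {"bed": [1.0, 0.0, 0.0], "pillow": [0.9, 0.1, 0.0], "oven": [0.0, 1.0, 0.0]}
scene_vecs = {"bedroom": [1.0, 0.05, 0.0], "kitchen": [0.0, 1.0, 0.1]}
object_scores = {"bed": 0.6, "pillow": 0.3, "oven": 0.1}
ranking = rank_scenes(object_scores, object_vecs, scene_vecs)
```

No scene image is needed at any point: the scene labels enter only through their word vectors.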

References:

[1] M. Jain, J.C. van Gemert, T. Mensink, and C.G.M. Snoek.
Objects2action: Classifying and localizing actions without any video
example. In ICCV, 2015.

[2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, A. Rabinovich. Going Deeper with Convolutions.
CVPR, 2015.

[3] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean.
Distributed representations of words and phrases and their
compositionality. In NIPS, 2013.

[4] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D.
Poland, D. Borth, L.-J. Li. The New Data and New Challenges in
Multimedia Research. arXiv:1503.01817, 2015.

from: http://image-net.org/challenges/LSVRC/2015/results