A Caffe Reproduction of the GoogLeNet Deep Learning Model

Going deeper with convolutions
Christian Szegedy (Google Inc.), Wei Liu (University of North Carolina, Chapel Hill), Yangqing Jia (Google Inc.), Pierre Sermanet (Google Inc.), Scott Reed (University of Michigan), Dragomir Anguelov (Google Inc.), Dumitru Erhan (Google Inc.), Vincent Vanhoucke (Google Inc.), Andrew Rabinovich (Google Inc.)
Abstract
We propose a deep convolutional neural network architecture codenamed Inception, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
1 Introduction
In the last three years, mainly due to the advances of deep learning, more concretely convolutional networks [10], the quality of image recognition and object detection has been progressing at a dramatic pace. One encouraging piece of news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. No new data sources were used, for example, by the top entries in the ILSVRC 2014 competition besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12× fewer parameters than the winning architecture of Krizhevsky et al. [9] from two years ago, while being significantly more accurate. The biggest gains in object detection have not come from the utilization of deep networks alone or bigger models, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al. [6].
Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms, especially their power and memory use, gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up as a purely academic curiosity, but can be put to real world use, even on large datasets, at a reasonable cost.
arXiv:1409.4842v1 [cs.CV] 17 Sep 2014
In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in Network paper by Lin et al. [12] in conjunction with the famous "we need to go deeper" internet meme [1]. In our case, the word "deep" is used in two different meanings: first of all, in the sense that we introduce a new level of organization in the form of the "Inception module" and also in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of [12] while taking inspiration and guidance from the theoretical work by Arora et al. [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, on which it significantly outperforms the current state of the art.
2 Related Work
Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard structure: stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully-connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to date on MNIST, CIFAR and most notably on the ImageNet classification challenge [9, 21]. For larger datasets such as ImageNet, the recent trend has been to increase the number of layers [12] and layer size [21, 14], while using dropout [7] to address the problem of overfitting.
Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as [9] has also been successfully employed for localization [9, 14], object detection [6, 14, 18, 5] and human pose estimation [19]. Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] use a series of fixed Gabor filters of different sizes in order to handle multiple scales, similarly to the Inception model. However, contrary to the fixed 2-layer deep model of [15], all filters in the Inception model are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet model.
Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks. When applied to convolutional layers, the method could be viewed as additional 1×1 convolutional layers followed typically by the rectified linear activation [9]. This enables it to be easily integrated in the current CNN pipelines. We use this approach heavily in our architecture. However, in our setting, 1×1 convolutions have a dual purpose: most critically, they are used mainly as dimension reduction modules to remove computational bottlenecks that would otherwise limit the size of our networks. This allows for not just increasing the depth, but also the width of our networks without significant performance penalty.
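To make the dimension-reduction role of 1×1 convolutions concrete, here is a minimal sketch (PyTorch is assumed purely for illustration and the channel counts are hypothetical, not taken from the paper) comparing a direct 5×5 convolution with a 1×1 reduction followed by the same 5×5 convolution:

```python
# Minimal sketch of 1x1 dimension reduction (PyTorch assumed; channel counts are
# illustrative, not from the paper).
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)  # a feature map with 256 channels on a 28x28 grid

# Direct 5x5 convolution over all 256 input channels.
direct = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=5, padding=2),
    nn.ReLU(inplace=True),
)

# 1x1 reduction to 32 channels first, then the 5x5 convolution. The 1x1 layer is
# followed by a ReLU, so it adds a non-linearity in addition to reducing dimensions.
reduced = nn.Sequential(
    nn.Conv2d(256, 32, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=5, padding=2),
    nn.ReLU(inplace=True),
)

print(direct(x).shape, reduced(x).shape)  # both torch.Size([1, 64, 28, 28])

# Multiply-adds per output position:
#   direct : 5*5*256*64             = 409,600
#   reduced: 1*1*256*32 + 5*5*32*64 =  59,392   (roughly 7x cheaper)
```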
The current leading approach for object detection is the Regions with Convolutional Neural Networks (R-CNN) proposed by Girshick et al. [6]. R-CNN decomposes the overall detection problem into two subproblems: to first utilize low-level cues such as color and superpixel consistency for potential object proposals in a category-agnostic fashion, and to then use CNN classifiers to identify object categories at those locations. Such a two-stage approach leverages the accuracy of bounding box segmentation with low-level cues, as well as the highly powerful classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages, such as multi-box [5] prediction for higher object bounding box recall, and ensemble approaches for better categorization of bounding box proposals.
3 Motivation and High Level Considerations
The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes both increasing the depth (the number of levels) of the network and its width (the number of units at each level). This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However, this simple solution comes with two major drawbacks.
Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This can become a major bottleneck, since the creation of high-quality training sets can be tricky and expensive, especially if expert human raters are necessary to distinguish between fine-grained visual categories like those in ImageNet (even in the 1000-class ILSVRC subset), as demonstrated by Figure 1.

Figure 1: Two distinct classes from the 1000 classes of the ILSVRC 2014 classification challenge. (a) Siberian husky. (b) Eskimo dog.
Another drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation. If the added capacity is used inefficiently (for example, if most weights end up to be close to zero), then a lot of computation is wasted. Since in practice the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of results.
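A quick back-of-the-envelope check of this quadratic growth (plain Python; the grid size and filter counts are made-up illustrative values, not taken from the paper):

```python
# Multiply-adds of a convolutional layer scale as H * W * k * k * C_in * C_out.
def conv_madds(h, w, k, c_in, c_out):
    return h * w * k * k * c_in * c_out

def chained_pair(filters):
    # Two chained 3x3 layers on a 56x56 grid: 64 -> filters, then filters -> filters.
    return conv_madds(56, 56, 3, 64, filters) + conv_madds(56, 56, 3, filters, filters)

for f in (64, 128, 256):
    print(f, f"{chained_pair(f):,}")
# Doubling the filter count of both layers roughly quadruples the cost of the second
# layer, since both its input and output widths grow with the filter count.
```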
The fundamental way of solving both issues would be by ultimately moving from fully connected to sparsely connected architectures, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al. [2]. Their main result states that if the probability distribution of the data-set is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well known Hebbian principle (neurons that fire together, wire together) suggests that the underlying idea is applicable even under less strict conditions, in practice.
On the downside, today's computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses is so dominant that switching to sparse matrices would not pay off. The gap is widened even further by the use of steadily improving, highly tuned, numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [16, 9]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure. Most current vision oriented machine learning systems utilize sparsity in the spatial domain just by the virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches in the earlier layer. ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning; the trend changed back to full connections with [9] in order to better optimize parallel computing. The uniformity of the structure and a large number of filters and greater batch size allow for utilizing efficient dense computation.
This raises the question of whether there is any hope for a next, intermediate step: an architecture that makes use of the extra sparsity, even at filter level, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give state of the art practical performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep-learning architectures in the near future.
The Inception architecture started out as a case study of the first author for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components. Despite being a highly speculative undertaking, only after two iterations on the exact choice of topology, we could already see modest gains against the reference architecture based on [12]. After further tuning of learning rate, hyperparameters and improved training methodology, we established that the resulting Inception architecture was especially useful in the context of localization and object detection as the base network for [6] and [5]. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly, they turned out to be at least locally optimal.
One must be cautious though: although the proposed architecture has become a success for computer vision, it is still questionable whether its quality can be attributed to the guiding principles that have led to its construction. Making sure would require much more thorough analysis and verification: for example, checking whether automated tools based on the principles described below would find a similar, but better, topology for the vision networks. The most convincing proof would be if an automated system would create network topologies resulting in similar gains in other domains using the same algorithm but with a very differently looking global architecture. At the very least, the initial success of the Inception architecture yields firm motivation for exciting future work in this direction.
4 Architectural Details
The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal local construction and to repeat it spatially. Arora et al. [2] suggest a layer-by-layer construction in which one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation. These clusters form the units of the next layer and are connected to the units in the previous layer. We assume that each unit from the earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In the lower layers (the ones close to the input) correlated units would concentrate in local regions. This means we would end up with a lot of clusters concentrated in a single region, and they can be covered by a layer of 1×1 convolutions in the next layer, as suggested in [12]. However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5; however, this decision was based more on convenience than necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage. Additionally, since pooling operations have been essential for the success of current state of the art convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have an additional beneficial effect, too (see Figure 2(a)).
As these "Inception modules" are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease, suggesting that the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers.
One big problem with the above modules, at least in this naïve form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: their number of output filters equals the number of filters in the previous stage. The merging of the output of the pooling layer with the outputs of the convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. Even while this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow-up within a few stages.

Figure 2: Inception module. (a) Inception module, naïve version. (b) Inception module with dimension reductions.
This leads to the second idea of the proposed architecture: judiciously applying dimension reductions and projections wherever the computational requirements would increase too much otherwise. This is based on the success of embeddings: even low dimensional embeddings might contain a lot of information about a relatively large image patch. However, embeddings represent information in a dense, compressed form and compressed information is harder to model. We would like to keep our representation sparse at most places (as required by the conditions of [2]) and compress the signals only whenever they have to be aggregated en masse. That is, 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation, which makes them dual-purpose. The final result is depicted in Figure 2(b).
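A sketch of the resulting module, with the four parallel branches of Figure 2(b) and their filter banks concatenated along the channel dimension (PyTorch is assumed purely for illustration, not the paper's original implementation; the argument names follow the Table 1 column headings):

```python
# Sketch of an Inception module with dimension reductions (Figure 2(b)).
# PyTorch is assumed purely for illustration; argument names follow the Table 1 columns.
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, n1x1, n3x3red, n3x3, n5x5red, n5x5, pool_proj):
        super().__init__()
        # Branch 1: 1x1 convolution.
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, n1x1, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 reduction followed by 3x3 convolution.
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, n3x3red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(n3x3red, n3x3, 3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: 1x1 reduction followed by 5x5 convolution.
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, n5x5red, 1), nn.ReLU(inplace=True),
                                nn.Conv2d(n5x5red, n5x5, 5, padding=2), nn.ReLU(inplace=True))
        # Branch 4: 3x3 max pooling followed by 1x1 projection.
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # All branches preserve the spatial resolution; their output filter banks are
        # concatenated into a single output tensor along the channel dimension.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# inception (3a) from Table 1: 192 input channels -> 64 + 128 + 32 + 32 = 256 channels.
module = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(module(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```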
In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.
One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage from the next layer, first reducing their dimension before convolving over them with a large patch size. Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.
The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties. Another way to utilize the Inception architecture is to create slightly inferior, but computationally cheaper versions of it. We have found that all the included knobs and levers allow for a controlled balancing of computational resources that can result in networks that are 2–3× faster than similarly performing networks with non-Inception architecture; however, this requires careful manual design at this point.

5 GoogLeNet
We chose GoogLeNet as our team name in the ILSVRC14 competition. This name is an homage to Yann LeCun's pioneering LeNet-5 network [10]. We also use GoogLeNet to refer to the particular incarnation of the Inception architecture used in our submission for the competition. We have also used a deeper and wider Inception network, the quality of which was slightly inferior, but adding it to the ensemble seemed to improve the results marginally. We omit the details of that network, since our experiments have shown that the influence of the exact architectural parameters is relatively minor.
type | patch size/stride | output size | depth | #1×1 | #3×3 reduce | #3×3 | #5×5 reduce | #5×5 | pool proj | params | ops
convolution | 7×7/2 | 112×112×64 | 1 | | | | | | | 2.7K | 34M
max pool | 3×3/2 | 56×56×64 | 0 | | | | | | | |
convolution | 3×3/1 | 56×56×192 | 2 | | 64 | 192 | | | | 112K | 360M
max pool | 3×3/2 | 28×28×192 | 0 | | | | | | | |
inception (3a) | | 28×28×256 | 2 | 64 | 96 | 128 | 16 | 32 | 32 | 159K | 128M
inception (3b) | | 28×28×480 | 2 | 128 | 128 | 192 | 32 | 96 | 64 | 380K | 304M
max pool | 3×3/2 | 14×14×480 | 0 | | | | | | | |
inception (4a) | | 14×14×512 | 2 | 192 | 96 | 208 | 16 | 48 | 64 | 364K | 73M
inception (4b) | | 14×14×512 | 2 | 160 | 112 | 224 | 24 | 64 | 64 | 437K | 88M
inception (4c) | | 14×14×512 | 2 | 128 | 128 | 256 | 24 | 64 | 64 | 463K | 100M
inception (4d) | | 14×14×528 | 2 | 112 | 144 | 288 | 32 | 64 | 64 | 580K | 119M
inception (4e) | | 14×14×832 | 2 | 256 | 160 | 320 | 32 | 128 | 128 | 840K | 170M
max pool | 3×3/2 | 7×7×832 | 0 | | | | | | | |
inception (5a) | | 7×7×832 | 2 | 256 | 160 | 320 | 32 | 128 | 128 | 1072K | 54M
inception (5b) | | 7×7×1024 | 2 | 384 | 192 | 384 | 48 | 128 | 128 | 1388K | 71M
avg pool | 7×7/1 | 1×1×1024 | 0 | | | | | | | |
dropout (40%) | | 1×1×1024 | 0 | | | | | | | |
linear | | 1×1×1000 | 1 | | | | | | | 1000K | 1M
softmax | | 1×1×1000 | 0 | | | | | | | |

Table 1: GoogLeNet incarnation of the Inception architecture
Here, the most successful particular instance (named GoogLeNet) is described in Table 1 for demonstrational purposes. The exact same topology (trained with different sampling methods) was used for 6 out of the 7 models in our ensemble.
All the convolutions, including those inside the Inception modules, use rectified linear activation. The size of the receptive field in our network is 224×224, taking RGB color channels with mean subtraction. "#3×3 reduce" and "#5×5 reduce" stands for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions. One can see the number of 1×1 filters in the projection layer after the built-in max-pooling in the pool proj column. All these reduction/projection layers use rectified linear activation as well.
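As a small sanity check of how the table columns compose (plain arithmetic, independent of any framework), the output depth of each Inception stage is the sum of the #1×1, #3×3, #5×5 and pool proj columns; the two "reduce" columns only size the internal 1×1 bottlenecks and do not appear in the output:

```python
# Output depth of an Inception stage = #1x1 + #3x3 + #5x5 + pool proj (values from Table 1).
rows = {
    "inception (3a)": (64, 128, 32, 32),     # expected output 28x28x256
    "inception (3b)": (128, 192, 96, 64),    # expected output 28x28x480
    "inception (4e)": (256, 320, 128, 128),  # expected output 14x14x832
}
for name, (n1x1, n3x3, n5x5, pool_proj) in rows.items():
    print(name, n1x1 + n3x3 + n5x5 + pool_proj)  # 256, 480, 832
```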
The network was designed with computational efficiency and practicality in mind, so that inference can be run on individual devices including even those with limited computational resources, especially with a low memory footprint. The network is 22 layers deep when counting only layers with parameters (or 27 layers if we also count pooling). The overall number of layers (independent building blocks) used for the construction of the network is about 100. However, this number depends on the machine learning infrastructure system used. The use of average pooling before the classifier is based on [12], although our implementation differs in that we use an extra linear layer. This enables adapting and fine-tuning our networks for other label sets easily, but it is mostly convenience and we do not expect it to have a major effect. It was found that a move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%; however, the use of dropout remained essential even after removing the fully connected layers.
Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. One interesting insight is that the strong performance of relatively shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages in the classifier, increase the gradient signal that gets propagated back, and provide additional regularization. These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.
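A hedged sketch of how those losses might be combined during training (PyTorch-style; the function name and interface are hypothetical, but the 0.3 discount weight and the discard-at-inference behaviour follow the text above):

```python
# Hypothetical training-time loss combination for the two auxiliary classifiers
# attached to inception (4a) and (4d). Only the 0.3 weight comes from the paper.
import torch.nn.functional as F

AUX_WEIGHT = 0.3

def googlenet_training_loss(main_logits, aux_logits_4a, aux_logits_4d, targets):
    loss = F.cross_entropy(main_logits, targets)
    loss = loss + AUX_WEIGHT * F.cross_entropy(aux_logits_4a, targets)
    loss = loss + AUX_WEIGHT * F.cross_entropy(aux_logits_4d, targets)
    return loss

# At inference time the auxiliary heads are simply not evaluated, so only
# main_logits contributes to the prediction.
```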
The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:
• An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the (4a), and 4×4×528 for the (4d) stage.
