In-depth Analysis of Xavier Initialization (with Source Code)

The paper is Understanding the difficulty of training deep feedforward neural networks.
A Chinese translation I found quite good is 【Deep Learning】笔记: Understanding the difficulty of training deep feedforward neural networks, and there are also some good commentary articles on the paper.
This paper is a real classic. Note the first author's name, Xavier Glorot: Xavier initialization is the work of this very person.
The source code referenced here is the TensorFlow version; the API is variance_scaling_initializer and the implementation is in initializers.py.
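To make the variance-scaling idea concrete, here is a minimal NumPy sketch, not the TensorFlow source itself, of what variance_scaling_initializer computes in its Xavier configuration (assuming factor=1.0, mode='FAN_AVG', uniform=True); the function name xavier_uniform is my own.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng, factor=1.0):
    """Mirrors the Xavier case of variance_scaling_initializer
    (factor=1.0, mode='FAN_AVG', uniform=True): sample W uniformly so that
    Var(W) = factor * 2 / (fan_in + fan_out)."""
    n = (fan_in + fan_out) / 2.0        # the FAN_AVG fan size
    limit = np.sqrt(3.0 * factor / n)   # Var(U[-limit, limit]) = limit**2 / 3
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = xavier_uniform(784, 256, rng)
print(W.var(), 2.0 / (784 + 256))       # empirical variance ≈ 2 / (fan_in + fan_out)
```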
0 Abstract
The paper attributes the recent success of deep learning to two things: parameter initialization and training mechanisms. That initialization could be elevated to such importance genuinely surprised me:
All these experimental results were obtained with new initialization or training mechanisms.
However, the existing approach, i.e. plain random initialization, does not perform well. The main goal of the paper is to demonstrate this and, based on the analysis, to point towards better algorithms:
Our objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.
The paper first examines the influence of the non-linear activation function and finds that the sigmoid activation, because of its mean value, is unsuited for deep networks with random initialization, since it can drive the top hidden layers into saturation:
We find that the logistic sigmoid activation is unsuited for deep networks with random initialization because of its mean value, which can drive especially the top hidden layer into saturation.
The paper finds that, surprisingly, saturated units can move out of saturation on their own:
Surprisingly, we find that saturated units can move out of saturation by themselves, albeit slowly, and explaining the plateaus sometimes seen when training neural networks.
The paper finds that an activation function that saturates less is often beneficial:
We find that a new non-linearity that saturates less can often be beneficial.
Finally, the paper studies how activations and gradients vary across layers and during training, the idea being that training becomes difficult when the singular values of each layer's Jacobian are far from 1:
Finally, we study how activations and gradients vary across layers and during training, with the idea that training may be more difficult when the singular values of the Jacobian associated with each layer are far from 1.
Based on this, the paper proposes a new initialization scheme.
1 Deep Neural Networks
Deep learning methods aim to build higher-level features out of compositions of lower-level ones:
Deep learning methods aim at learning feature hierarchies with features from higher levels of the hierarchy formed by the composition of lower level features.
The paper does not focus on what unsupervised pre-training or semi-supervised criteria bring to deep architectures; instead it analyzes what can go wrong in plain (but deep) multi-layer neural networks:
So here instead of focusing on what unsupervised pre-training or semi-supervised criteria bring to deep architectures, we focus on analyzing what may be going wrong with good old (but deep) multi-layer neural networks.
The analysis mainly tracks how activation values and gradients evolve across layers and over training iterations:
Our analysis is driven by investigative experiments to monitor activations (watching for saturation of hidden units) and gradients, across layers and across training iterations.
It also evaluates the effect of the choice of activation function and of the initialization procedure:
We also evaluate the effects on these of choices of activation function (with the idea that it might affect saturation) and initialization procedure (since unsupervised pre-training is a particular form of initialization and it has a drastic impact).
2 Experiment Setting and Datasets
2.1 Online Learning on an Infinite Dataset: Shapeset-3*2
The paper explains why the online setting is interesting:
The online setting is also interesting because it focuses on the optimization issues rather than on the small-sample regularization effects.
Online learning focuses on the optimization problem itself rather than on the regularization effects you get with a small sample. My own way of picturing it: the small-dataset case is like being handed a pile of 2-D points and asked for the straight line with the smallest distance to them (this can in principle be solved directly, but here we solve it by training). You start from some initial line, feed in one point, adjust the line so the distance between line and point shrinks, and repeat until the points of the small dataset are used up. Online learning keeps repeating the same process, except the points never run out; an endless supply of points can get you arbitrarily close to the true underlying relationship. The difference between the two is whether the network ever gets to see the entire dataset.
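A toy sketch of this contrast under my own line-fitting analogy (all names and constants below are illustrative): in the online setting each SGD step consumes a fresh point from an effectively infinite stream.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = 0.0, 0.0                      # initial line y = w*x + b
lr = 0.01

# Online setting: points are drawn from an endless stream instead of a fixed small set.
for step in range(100_000):
    x = rng.uniform(-1, 1)
    y = 3.0 * x + 0.5 + rng.normal(scale=0.1)   # the "true" process generating the data
    err = (w * x + b) - y
    w -= lr * err * x                # one SGD step per fresh sample
    b -= lr * err

print(w, b)   # approaches (3.0, 0.5) as more of the infinite stream is consumed
```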
It then introduces the Shapeset-3*2 dataset, described as follows:
Shapeset-3*2 contains images of 1 or 2 two-dimensional objects, each taken from 3 shape categories (triangle, parallelogram, ellipse), and placed with random shape parameters (relative lengths and/or angles), scaling, rotation, translation and grey-scale.
The top part of Figure 1 shows some example images.
Here is why the setup in Figure 1 leads to nine possible classes:
The task is to predict the objects present (e.g. triangle+ellipse, parallelogram+parallelogram, triangle alone, etc.) without having to distinguish between the foreground shape and the background shape when they overlap. This therefore defines nine configuration classes.
The nine classes are:
1. A single object: 3 classes: triangle alone, parallelogram alone, ellipse alone
2. Two different shapes out of the three: C(3,2) = 3 classes: triangle+parallelogram, triangle+ellipse, parallelogram+ellipse
3. Two objects of the same shape: 3 classes: triangle+triangle, parallelogram+parallelogram, ellipse+ellipse
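A quick, purely illustrative check of this count:

```python
from itertools import combinations_with_replacement

shapes = ["triangle", "parallelogram", "ellipse"]

# 1 object: 3 classes; 2 objects (unordered, repetition allowed): C(3,2) + 3 = 6 classes.
classes = [(s,) for s in shapes] + list(combinations_with_replacement(shapes, 2))
print(len(classes))                     # 9
for c in classes:
    print(" + ".join(c))
```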
2.2 Finite Datasets
Three datasets are used: MNIST, CIFAR-10, and Small-ImageNet.
2.3 Experimental Setting
This section gives the basic network settings and introduces the three activation functions studied: the sigmoid, the hyperbolic tangent, and the softsign, where softsign(x) = x / (1 + |x|).
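For reference, a minimal NumPy sketch of the three activations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def softsign(x):
    return x / (1.0 + np.abs(x))

x = np.linspace(-6.0, 6.0, 7)
print(sigmoid(x))    # saturates at 0 and 1, so its mean sits around 0.5
print(tanh(x))       # saturates at -1 and 1, zero-centered
print(softsign(x))   # same range as tanh, but approaches its asymptotes more slowly
```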
3 Effect of Activation Functions and Saturation During Training
This section essentially runs experiments with the three activation functions. The paper judges good or bad behavior along two axes:
1. excessive saturation of activation functions (then gradients will not propagate well)
2. overly linear units (they will not compute something interesting)
Let's look at the first point. Backpropagation involves the derivative of the activation function: if the activation is saturated, its derivative is close to 0 and the gradients vanish. Following the derivation in 反向传播算法的公式推导 - HappyRocking的专栏 - CSDN博客, the partial derivative of the cost with respect to a weight is
$\frac{\partial C}{\partial w^{l}_{jk}} = a^{l-1}_{k}\,\delta^{l}_{j}$
and, in addition, the error term itself is
$\delta^{l}_{j} = \Big(\sum_{k} w^{l+1}_{kj}\,\delta^{l+1}_{k}\Big)\,\sigma'(z^{l}_{j})$
These two formulas show that the gradients obtained by backpropagation depend on the derivative of the activation function; a saturated activation means that derivative is close to 0, which is harmful.
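A tiny numerical illustration of why saturation hurts: the backpropagated error carries a factor of $\sigma'(z)$, which collapses towards 0 once $|z|$ is large (the values below are just examples).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The backpropagated error of a layer is multiplied by sigma'(z);
# once z sits in the saturated region that factor is almost zero.
for z in [0.0, 2.0, 5.0, 10.0]:
    print(f"z = {z:5.1f}   sigma'(z) = {sigmoid_prime(z):.6f}")
# z =   0.0   sigma'(z) = 0.250000
# z =  10.0   sigma'(z) = 0.000045   -> the gradient essentially vanishes
```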
As for the second point, I happened to study ResNet recently; see my article 深入解读残差网络ResNet V1(附源码). Section 4.2 there explains that without activation functions a two-layer network is equivalent to a single-layer one, because a composition of linear functions is still linear. Neural networks are meant to fit non-linear functions (the non-linearity comes from the activation), so having too many units that behave linearly is of little use.
3.1 Experiments with the sigmoid
The sigmoid activation has a non-zero mean, and that mean induces important singular values in the Hessian (yet another rather old paper that I haven't had time to read), which slows training down:
The sigmoid non-linearity has been already shown to slow down learning because of its non-zero mean that induces important singular values in the Hessian.
Next, an explanation of Figure 2. The paper defines the activation values as:
activation values: output of the sigmoid
The sigmoid is a function whose output varies with its input $z$. The paper says that during training a fixed monitoring set of 300 examples is repeatedly passed through the network; $z$ is the pre-activation value at a given node for such a test example, so the activation value is $f(z)$, where $f$ is the sigmoid.
Each layer has a thousand hidden units, each with its own activation value, so across units and the 300 examples one can compute a mean and a standard deviation per layer; that is exactly what Figure 2 is meant to show.
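A rough sketch of how such per-layer statistics could be collected; the toy network, weight scales, and monitoring batch below are all made up for illustration, not the paper's actual setup.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_activation_stats(x_monitor, weights, biases):
    """Mean and std of sigmoid activations per layer, over a fixed monitoring set
    (the paper uses 300 held-out examples). `weights`/`biases` are illustrative."""
    h = x_monitor
    stats = []
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)                  # activation values: output of the sigmoid
        stats.append((h.mean(), h.std()))       # aggregated over units and examples
    return stats

# toy network: 4 hidden layers of 1000 units, random monitoring batch of 300 inputs
rng = np.random.default_rng(0)
sizes = [784, 1000, 1000, 1000, 1000]
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
x = rng.normal(size=(300, 784))
for i, (m, s) in enumerate(layer_activation_stats(x, weights, biases), 1):
    print(f"layer {i}: mean={m:.3f} std={s:.3f}")
```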
One thing I found confusing is which layer the figure's "top hidden layer" refers to. From the description it is Layer 4, but I first thought the top hidden layer should be Layer 1, so I had my doubts. I now lean towards Layer 4, because the main text contains this sentence:
We see that very quickly at the beginning, all the sigmoid activation values of the last hidden layer are pushed to their lower saturation value of 0.
In short, "the last hidden layer" and the "top hidden layer" both refer to Layer 4.
The paper then analyzes Figure 2: for quite a long stretch of training, the mean activation value of the first three layers stays around 0.5, while Layer 4 stays around 0, i.e. in the saturated region; and when Layer 4 starts to climb out of saturation, the first three layers begin to saturate and then stabilize.
The explanation given in the paper: with random initialization, the output softmax layer softmax(b + Wh) relies more on its bias b at the start of training than on the output h of the top hidden layer (i.e. Layer 4), so the gradient updates tend to push Wh towards 0, which in turn is achieved by pushing h towards 0:
The logistic layer output softmax(b+Wh) might initially rely more on its biases b (which are learned very quickly) than on the top hidden activations h derived from the input image (because h would vary in ways that are not predictive of y, maybe correlated mostly with other and possibly more dominant variations of x).
Thus the error gradient would tend to push Wh towards 0, which can be achieved by pushing h towards 0.
My understanding: in the backpropagated gradient, W picks up a factor of h, whereas the factor for b is 1; so if h is very small, the gradient of W is very small too, which makes W learn much more slowly than b.
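A small numerical check of this intuition, with made-up numbers: for a cross-entropy loss the gradient with respect to W is the outer product of the output-layer error and h, so it shrinks together with h, while the bias gradient does not.

```python
import numpy as np

# For the output layer a = softmax(b + W h) with cross-entropy loss L, the gradient
# w.r.t. the pre-softmax logits is delta = a - y. Then:
#   dL/dW = outer(delta, h)   -- scaled by the hidden activations h
#   dL/db = delta             -- scaled by 1
# If h is pushed towards 0, dL/dW shrinks with it while dL/db keeps its size.
rng = np.random.default_rng(0)
delta = rng.normal(size=5)                  # hypothetical output-layer error
for scale in [1.0, 0.1, 0.01]:
    h = scale * rng.normal(size=10)         # top hidden activations of different magnitude
    dW = np.outer(delta, h)
    db = delta
    print(f"|h|~{scale:5.2f}  mean|dL/dW|={np.abs(dW).mean():.4f}  mean|dL/db|={np.abs(db).mean():.4f}")
```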
