Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury

Digital Object Identifier 10.1109/MSP.2012.2205597
Date of publication: 15 October 2012

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

INTRODUCTION
New machine learning algorithms can lead to significant advances in automatic speech recognition (ASR).
The biggest single advance occurred nearly four decades ago with the introduction of the expectation-maximization (EM) algorithm for training HMMs (see [1] and [2] for informative historical reviews of the introduction of HMMs). With the EM algorithm, it became possible to develop speech recognition systems for real-world tasks using the richness of GMMs [3] to represent the relationship between HMM states and the acoustic input. In these systems the acoustic input is typically represented by concatenating Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) [4] computed from the raw waveform and their first- and second-order temporal differences [5]. This nonadaptive but highly engineered preprocessing of the waveform is designed to discard the large amount of information in waveforms that is considered to be irrelevant for discrimination and to express the remaining information in a form that facilitates discrimination with GMM-HMMs.
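As an illustration of this input representation, the following minimal sketch takes a matrix of static cepstral coefficients (one row per frame; random stand-in values here, not real speech data) and appends first- and second-order temporal differences, producing the kind of 39-dimensional feature vector referred to throughout this article. The regression-based delta computation used in standard toolkits is approximated by a plain central difference.

```python
import numpy as np

def add_deltas(static):
    """Append first- and second-order temporal differences to static features.

    static: array of shape (n_frames, n_coeffs), e.g., 13 MFCCs per frame.
    A simple central difference stands in for the short regression window
    used by standard feature-extraction toolkits.
    """
    delta = np.gradient(static, axis=0)    # first-order temporal difference
    delta2 = np.gradient(delta, axis=0)    # second-order temporal difference
    return np.concatenate([static, delta, delta2], axis=1)

mfcc = np.random.randn(200, 13)            # stand-in for 200 frames of 13 MFCCs
features = add_deltas(mfcc)                # shape (200, 39)
```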
GMMs have a number of advantages that make them suitable for modeling the probability distributions over vectors of input features that are associated with each state of an HMM. With enough components, they can model probability distributions to any required level of accuracy, and they are fairly easy to fit to data using the EM algorithm. A huge amount of research has gone into finding ways of constraining GMMs to increase their evaluation speed and to optimize the tradeoff between their flexibility and the amount of training data required to avoid serious overfitting [6].

The recognition accuracy of a GMM-HMM system can be further improved if it is discriminatively fine-tuned after it has been generatively trained to maximize its probability of generating the observed data, especially if the discriminative objective function used for training is closely related to the error rate on phones, words, or sentences [7]. The accuracy can also be improved by augmenting (or concatenating) the input features (e.g., MFCCs) with "tandem" or bottleneck features generated using neural networks [8], [69]. GMMs are so successful that it is difficult for any new method to outperform them for acoustic modeling.
Despite all their advantages, GMMs have a serious shortcoming: they are statistically inefficient for modeling data that lie on or near a nonlinear manifold in the data space. For example, modeling the set of points that lie very close to the surface of a sphere only requires a few parameters using an appropriate model class, but it requires a very large number of diagonal Gaussians or a fairly large number of full-covariance Gaussians. Speech is produced by modulating a relatively small number of parameters of a dynamical system [10], [11], and this implies that its true underlying structure is much lower-dimensional than is immediately apparent in a window that contains hundreds of coefficients. We believe, therefore, that other types of model may work better than GMMs for acoustic modeling if they can more effectively exploit information embedded in a large window of frames.

Artificial neural networks trained by backpropagating error derivatives have the potential to learn much better models of data that lie on or near a nonlinear manifold. In fact, two decades ago, researchers achieved some success using artificial neural networks with a single layer of nonlinear hidden units to predict HMM states from windows of acoustic coefficients [9]. At that time, however, neither the hardware nor the learning algorithms were adequate for training neural networks with many hidden layers on large amounts of data, and the performance benefits of using neural networks with a single hidden layer were not sufficiently large to seriously challenge GMMs. As a result, the main practical contribution of neural networks at that time was to provide extra features in tandem or bottleneck systems.
Over the last few years, advances in both machine learning algorithms and computer hardware have led to more efficient methods for training DNNs that contain many layers of nonlinear hidden units and a very large output layer. The large output layer is required to accommodate the large number of HMM states that arise when each phone is modeled by a number of different "triphone" HMMs that take into account the phones on either side. Even when many of the states of these triphone HMMs are tied together, there can be thousands of tied states. Using the new learning methods, several different research groups have shown that DNNs can outperform GMMs at acoustic modeling for speech recognition on a variety of data sets including large data sets with large vocabularies.
This review article aims to represent the shared views of research groups at the University of Toronto, Microsoft Research (MSR), Google, and IBM Research, who have all had recent successes in using DNNs for acoustic modeling. The article starts by describing the two-stage training procedure that is used for fitting the DNNs. In the first stage, layers of feature detectors are initialized, one layer at a time, by fitting a stack of generative models, each of which has one layer of latent variables. These generative models are trained without using any information about the HMM states that the acoustic model will need to discriminate. In the second stage, each generative model in the stack is used to initialize one layer of hidden units in a DNN and the whole network is then discriminatively fine-tuned to predict the target HMM states. These targets are obtained by using a baseline GMM-HMM system to produce a forced alignment.

In this article, we review exploratory experiments on the TIMIT database [12], [13] that were used to demonstrate the power of this two-stage training procedure for acoustic modeling. The DNNs that worked well on TIMIT were then applied to five different large-vocabulary continuous speech recognition (LVCSR) tasks by three different research groups whose
results we also summarize. The DNNs worked well on all of these tasks when compared with highly tuned GMM-HMM systems, and on some of the tasks they outperformed the state of the art by a large margin. We also describe some other uses of DNNs for acoustic modeling and some variations on the training procedure.

TRAINING DEEP NEURAL NETWORKS
A DNN is a feed-forward, artificial neural network that has more than one layer of hidden units between its inputs and its outputs. Each hidden unit, $j$, typically uses the logistic function (the closely related hyperbolic tangent is also often used, and any function with a well-behaved derivative can be used) to map its total input from the layer below, $x_j$, to the scalar state, $y_j$, that it sends to the layer above:

$$y_j = \mathrm{logistic}(x_j) = \frac{1}{1 + e^{-x_j}}, \qquad x_j = b_j + \sum_i y_i w_{ij}, \qquad (1)$$

where $b_j$ is the bias of unit $j$, $i$ is an index over units in the layer below, and $w_{ij}$ is the weight on a connection to unit $j$ from unit $i$ in the layer below. For multiclass classification, output unit $j$ converts its total input, $x_j$, into a class probability, $p_j$, by using the "softmax" nonlinearity

$$p_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)}, \qquad (2)$$

where $k$ is an index over all classes.
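To make the forward pass in (1) and (2) concrete, the following sketch (plain NumPy; the layer sizes, weight initialization, and function names are illustrative assumptions, not details from the article) computes logistic hidden activations and softmax class probabilities for a single input vector.

```python
import numpy as np

def logistic(x):
    # Equation (1): elementwise logistic nonlinearity.
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Equation (2): subtracting the max adds numerical stability and does not
    # change the result, since softmax is invariant to adding a constant.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dnn_forward(v, weights, biases):
    """Forward pass through a DNN with logistic hidden layers and a softmax output.

    weights: list of matrices W[l] of shape (units_below, units_above)
    biases:  list of vectors b[l], one per layer above the input
    """
    y = v
    for W, b in zip(weights[:-1], biases[:-1]):
        y = logistic(y @ W + b)                 # hidden layers, equation (1)
    x_out = y @ weights[-1] + biases[-1]        # total input to the output units
    return softmax(x_out)                       # class probabilities, equation (2)

# Illustrative sizes: a 39-dimensional acoustic frame, two hidden layers, 10 classes.
rng = np.random.default_rng(0)
sizes = [39, 512, 512, 10]
weights = [0.01 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
p = dnn_forward(rng.standard_normal(39), weights, biases)
print(p.sum())  # probabilities sum to 1
```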
DNNs can be discriminatively trained (DT) by backpropagating derivatives of a cost function that measures the discrepancy between the target outputs and the actual outputs produced for each training case [14]. When using the softmax output function, the natural cost function $C$ is the cross entropy between the target probabilities $d$ and the outputs of the softmax, $p$:

$$C = -\sum_j d_j \log p_j, \qquad (3)$$

where the target probabilities, typically taking values of one or zero, are the supervised information provided to train the DNN classifier.
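A convenient property of pairing the softmax output (2) with the cross-entropy cost (3) is that the derivative of $C$ with respect to the total input $x_j$ of output unit $j$ simplifies to $p_j - d_j$, which is the error signal that backpropagation starts from. The sketch below (hypothetical helper names, not from the article) computes the cost and this gradient for one training case.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy_and_delta(x_out, d):
    """Cost (3) and its gradient w.r.t. the softmax inputs for one training case.

    x_out: total inputs to the output units
    d:     target probabilities (typically a one-hot vector)
    """
    p = softmax(x_out)
    C = -np.sum(d * np.log(p + 1e-12))   # small epsilon guards against log(0)
    delta = p - d                         # dC/dx_j = p_j - d_j for softmax + cross entropy
    return C, delta

d = np.zeros(10); d[3] = 1.0              # one-hot target for class 3
C, delta = cross_entropy_and_delta(np.random.randn(10), d)
```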
For large training sets, it is typically more efficient to compute the derivatives on a small, random "minibatch" of training cases, rather than the whole training set, before updating the weights in proportion to the gradient. This stochastic gradient descent method can be further improved by using a "momentum" coefficient, $0 < \alpha < 1$, that smooths the gradient computed for minibatch $t$, thereby damping oscillations across ravines and speeding progress down ravines:

$$\Delta w_{ij}(t) = \alpha\,\Delta w_{ij}(t-1) - \varepsilon\,\frac{\partial C}{\partial w_{ij}(t)}. \qquad (4)$$

The update rule for biases can be derived by treating them as weights on connections coming from units that always have a state of one.
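A minimal sketch of the minibatch update in (4), assuming the per-minibatch gradient has already been computed by backpropagation; the function and parameter names are illustrative, and the gradients below are random stand-ins.

```python
import numpy as np

def momentum_update(W, velocity, grad, lr=0.1, momentum=0.9):
    """One step of stochastic gradient descent with momentum, as in (4).

    velocity holds the previous weight change Delta w(t-1) and is updated in place.
    """
    velocity[...] = momentum * velocity - lr * grad   # Delta w(t)
    W += velocity                                      # apply the smoothed update
    return W, velocity

W = np.zeros((512, 512))
velocity = np.zeros_like(W)
for minibatch_grad in (np.random.randn(512, 512) for _ in range(5)):  # stand-in gradients
    W, velocity = momentum_update(W, velocity, minibatch_grad)
```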
To reduce overfitting, large weights can be penalized in proportion to their squared magnitude, or the learning can simply be terminated at the point at which performance on a held-out validation set starts getting worse [9]. In DNNs with full connectivity between adjacent layers, the initial weights are given small random values to prevent all of the hidden units in a layer from getting exactly the same gradient.

DNNs with many hidden layers are hard to optimize. Gradient descent from a random starting point near the origin is not the best way to find a good set of weights, and unless the initial scales of the weights are carefully chosen [15], the backpropagated gradients will have very different magnitudes in different layers. In addition to the optimization issues, DNNs may generalize poorly to held-out test data. DNNs with many hidden layers and many units per layer are very flexible models with a very large number of parameters. This makes them capable of modeling very complex and highly nonlinear relationships between inputs and outputs. This ability is important for high-quality acoustic modeling, but it also allows them to model spurious regularities that are an accidental property of the particular examples in the training set, which can lead to severe overfitting. Weight penalties or early stopping can reduce the overfitting but only by removing much of the modeling power.

Very large training sets [16] can reduce overfitting while preserving modeling power, but only by making training very computationally expensive. What we need is a better method of using the information in the training set to build multiple layers of nonlinear feature detectors.

GENERATIVE PRETRAINING
Instead of designing feature detectors to be good for discriminating between classes, we can start by designing them to be good at modeling the structure in the input data. The idea is to learn one layer of feature detectors at a time with the states of the feature detectors in one layer acting as the data for training the next layer. After this generative "pretraining," the multiple layers of feature detectors can be used as a much better starting point for a discriminative "fine-tuning" phase during which backpropagation through the DNN slightly adjusts the weights found in pretraining [17]. Some of the high-level features created by the generative pretraining will be of little use for discrimination, but others will be far more useful than the raw inputs. The generative pretraining finds a region of the weight-space that allows the discriminative fine-tuning to make rapid progress, and it also significantly reduces overfitting [18].

A single layer of feature detectors can be learned by fitting a generative model with one layer of latent variables to the input data. There are two broad classes of generative model to choose from.
A directed model generates data by first choosing the states of the latent variables from a prior distribution and then choosing the states of the observable variables from their conditional distributions given the latent states. Examples of directed models with one layer of latent variables are factor analysis, in which the latent variables are drawn from an isotropic Gaussian, and GMMs, in which they are drawn from a discrete distribution. An undirected model has a very different way of generating data. Instead of using one set of parameters to define a prior distribution over the latent variables and a separate set of parameters to define the conditional distributions of the observable variables given the values of the latent variables, an undirected model uses a single set of parameters, $W$, to define the joint probability of a vector of values of the observable variables, $\mathbf{v}$, and a vector of values of the latent variables, $\mathbf{h}$, via an energy function, $E$:

$$p(\mathbf{v},\mathbf{h};W) = \frac{1}{Z}\,e^{-E(\mathbf{v},\mathbf{h};W)}, \qquad Z = \sum_{\mathbf{v}',\mathbf{h}'} e^{-E(\mathbf{v}',\mathbf{h}';W)}, \qquad (5)$$

where $Z$ is called the partition function.
If many different latent variables interact nonlinearly to generate each data vector, it is difficult to infer the states of the latent variables from the observed data in a directed model because of a phenomenon known as "explaining away" [19]. In undirected models, however, inference is easy provided the latent variables do not have edges linking them. Such a restricted class of undirected models is ideal for layerwise pretraining because each layer will have an easy inference procedure.

We start by describing an approximate learning algorithm for a restricted Boltzmann machine (RBM), which consists of a layer of stochastic binary "visible" units that represent binary input data connected to a layer of stochastic binary hidden units that learn to model significant nonindependencies between the visible units [20]. There are undirected connections between visible and hidden units but no visible-visible or hidden-hidden connections. An RBM is a type of Markov random field (MRF) but differs from most MRFs in several ways: it has a bipartite connectivity graph, it does not usually share weights between different units, and a subset of the variables are unobserved, even during training.

AN EFFICIENT LEARNING PROCEDURE FOR RBMs
A joint configuration, $(\mathbf{v},\mathbf{h})$, of the visible and hidden units of an RBM has an energy given by

$$E(\mathbf{v},\mathbf{h}) = -\sum_{i\in\text{visible}} a_i v_i \;-\; \sum_{j\in\text{hidden}} b_j h_j \;-\; \sum_{i,j} v_i h_j w_{ij}, \qquad (6)$$
where $v_i$, $h_j$ are the binary states of visible unit $i$ and hidden unit $j$, $a_i$, $b_j$ are their biases, and $w_{ij}$ is the weight between them. The network assigns a probability to every possible pair of a visible and a hidden vector via this energy function as in (5), and the probability that the network assigns to a visible vector, $\mathbf{v}$, is given by summing over all possible hidden vectors:

$$p(\mathbf{v}) = \frac{1}{Z}\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}. \qquad (7)$$

The derivative of the log probability of a training set with respect to a weight is surprisingly simple:
$$\frac{1}{N}\sum_{n=1}^{N}\frac{\partial \log p(\mathbf{v}^{n})}{\partial w_{ij}} = \langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\text{model}}, \qquad (8)$$

where $N$ is the size of the training set and the angle brackets are used to denote expectations under the distribution specified by the subscript that follows. The simple derivative in (8) leads to a very simple learning rule for performing stochastic steepest ascent in the log probability of the training data:

$$\Delta w_{ij} = \varepsilon\left(\langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\text{model}}\right), \qquad (9)$$

where $\varepsilon$ is a learning rate.
The absence of direct connections between hidden units in an RBM makes it very easy to get an unbiased sample of $\langle v_i h_j\rangle_{\text{data}}$. Given a randomly selected training case, $\mathbf{v}$, the binary state, $h_j$, of each hidden unit, $j$, is set to one with probability

$$p(h_j = 1\,|\,\mathbf{v}) = \mathrm{logistic}\Big(b_j + \sum_i v_i w_{ij}\Big), \qquad (10)$$

and $v_i h_j$ is then an unbiased sample. The absence of direct connections between visible units in an RBM makes it very easy to get an unbiased sample of the state of a visible unit, given a hidden vector:

$$p(v_i = 1\,|\,\mathbf{h}) = \mathrm{logistic}\Big(a_i + \sum_j h_j w_{ij}\Big). \qquad (11)$$
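As a sketch of (10) and (11), both conditionals can be computed for whole layers at once with a matrix product. The function names below are illustrative, and the example assumes binary units throughout.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs(v, W, b_hid):
    # Equation (10): p(h_j = 1 | v) for every hidden unit j at once.
    return logistic(b_hid + v @ W)

def visible_probs(h, W, b_vis):
    # Equation (11): p(v_i = 1 | h) for every visible unit i at once.
    return logistic(b_vis + h @ W.T)

def sample_bernoulli(probs, rng):
    # Turn probabilities into stochastic binary states.
    return (rng.random(probs.shape) < probs).astype(float)

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((39, 512)); b_vis = np.zeros(39); b_hid = np.zeros(512)
v = sample_bernoulli(np.full(39, 0.5), rng)            # a random binary visible vector
h = sample_bernoulli(hidden_probs(v, W, b_hid), rng)   # stochastic hidden states via (10)
```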
Getting an unbiased sample of $\langle v_i h_j\rangle_{\text{model}}$, however, is much more difficult. It can be done by starting at any random state of the visible units and performing alternating Gibbs sampling for a very long time. Alternating Gibbs sampling consists of updating all of the hidden units in parallel using (10) followed by updating all of the visible units in parallel using (11).

A much faster learning procedure called contrastive divergence (CD) was proposed in [20]. This starts by setting the states of the visible units to a training vector. Then the binary states of the hidden units are all computed in parallel using (10). Once binary states have been chosen for the hidden units, a "reconstruction" is produced by setting each $v_i$ to one with a probability given by (11). Finally, the states of the hidden units are updated again. The change in a weight is then given by

$$\Delta w_{ij} = \varepsilon\left(\langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\text{recon}}\right). \qquad (12)$$
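Putting (10)-(12) together, one CD1 update on a minibatch might look like the sketch below, which follows the recommendations given later in this section: sampled binary hidden states for the data term and real-valued probabilities for the reconstruction. All names and the learning rate are illustrative assumptions, not settings from the article.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, b_vis, b_hid, lr=0.05, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD1) step on a minibatch V of binary row vectors."""
    n = V.shape[0]
    # Positive phase: sample binary hidden states from p(h|v), equation (10).
    h_prob = logistic(b_hid + V @ W)
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Reconstruction: real-valued probabilities from (11), then hidden probabilities again.
    v_recon = logistic(b_vis + h_samp @ W.T)
    h_recon = logistic(b_hid + v_recon @ W)
    # Weight update, equation (12); biases use unit states instead of pairwise products.
    W += lr * (V.T @ h_samp - v_recon.T @ h_recon) / n
    b_vis += lr * (V - v_recon).mean(axis=0)
    b_hid += lr * (h_samp - h_recon).mean(axis=0)
    return W, b_vis, b_hid

V = (np.random.default_rng(1).random((32, 39)) < 0.5).astype(float)  # stand-in minibatch
W = 0.01 * np.random.default_rng(2).standard_normal((39, 512))
W, b_vis, b_hid = cd1_update(V, W, np.zeros(39), np.zeros(512))
```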
A simplified version of the same learning rule that uses the states of individual units instead of pairwise products is used for the biases.

CD works well even though it is only crudely approximating the gradient of the log probability of the training data [20]. RBMs learn better generative models if more steps of alternating Gibbs sampling are used before collecting the statistics for the second term in the learning rule, but for the purposes of pretraining feature detectors, more alternations are generally of little value and all the results reviewed here were obtained using CD1, which does a single full step of alternating Gibbs sampling after the initial update of the hidden units. To suppress noise in the learning, the real-valued probabilities rather than binary samples are generally used for the reconstructions and the subsequent states of the hidden units, but it is important to use sampled binary values for the first computation of the hidden states because the sampling noise acts as a very effective regularizer that prevents overfitting [21].

MODELING REAL-VALUED DATA
Real-valued data, such as MFCCs, are more naturally modeled by linear variables with Gaussian noise, and the RBM energy function can be modified to accommodate such variables, giving a Gaussian–Bernoulli RBM (GRBM):

$$E(\mathbf{v},\mathbf{h}) = \sum_{i\in\text{vis}}\frac{(v_i - a_i)^2}{2\sigma_i^2} \;-\; \sum_{j\in\text{hid}} b_j h_j \;-\; \sum_{i,j}\frac{v_i}{\sigma_i}\, h_j w_{ij}, \qquad (13)$$

where $\sigma_i$ is the standard deviation of the Gaussian noise for visible unit $i$.
The two conditional distributions required for CD1 learning are

$$p(h_j = 1\,|\,\mathbf{v}) = \mathrm{logistic}\Big(b_j + \sum_i \frac{v_i}{\sigma_i}\, w_{ij}\Big), \qquad (14)$$

$$p(v_i\,|\,\mathbf{h}) = \mathcal{N}\Big(a_i + \sigma_i\sum_j h_j w_{ij},\ \sigma_i^2\Big), \qquad (15)$$

where $\mathcal{N}(\mu,\sigma^2)$ is a Gaussian. Learning the standard deviations of a GRBM is problematic for reasons described in [21], so for pretraining using CD1, the data are normalized so that each coefficient has zero mean and unit variance, the standard deviations are set to one when computing $p(\mathbf{v}\,|\,\mathbf{h})$, and no noise is added to the reconstructions. This avoids the issue of deciding the right noise level.
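With unit variances and mean-and-variance-normalized data, a CD1 step for a GRBM differs from the binary case only in the reconstruction, which becomes the linear mean in (15) with no added noise. A sketch under exactly those assumptions (again with illustrative names and learning rate):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def grbm_cd1_update(V, W, b_vis, b_hid, lr=0.01, rng=np.random.default_rng(0)):
    """CD1 for a Gaussian-Bernoulli RBM with sigma_i = 1 and normalized data V."""
    n = V.shape[0]
    h_prob = logistic(b_hid + V @ W)                 # equation (14) with sigma_i = 1
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
    v_recon = b_vis + h_samp @ W.T                   # mean of (15); no noise added
    h_recon = logistic(b_hid + v_recon @ W)
    W += lr * (V.T @ h_samp - v_recon.T @ h_recon) / n
    b_vis += lr * (V - v_recon).mean(axis=0)
    b_hid += lr * (h_samp - h_recon).mean(axis=0)
    return W, b_vis, b_hid
```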
STACKING RBMs TO MAKE A DEEP BELIEF NETWORK
After training an RBM on the data, the inferred states of the hidden units can be used as data for training another RBM that learns to model the significant dependencies between the hidden units of the first RBM. This can be repeated as many times as desired to produce many layers of nonlinear feature detectors that represent progressively more complex statistical structure in the data. The RBMs in a stack can be combined in a surprising way to produce [22] a single, multilayer generative model called a deep belief net (DBN) (not to be confused with a dynamic Bayesian net, which is a type of directed model of temporal data that unfortunately has the same acronym). Even though each RBM is an undirected model, the DBN formed by the whole stack is a hybrid generative model whose top two layers are undirected (they are the final RBM in the stack) but whose lower layers have top-down, directed connections (see Figure 1).

To understand how RBMs are composed into a DBN, it is helpful to rewrite (7) and to make explicit the dependence on $W$:

$$p(\mathbf{v};W) = \sum_{\mathbf{h}} p(\mathbf{h};W)\, p(\mathbf{v}\,|\,\mathbf{h};W), \qquad (16)$$
where $p(\mathbf{h};W)$ is defined as in (7) but with the roles of the visible and hidden units reversed. Now it is clear that the model can be improved by holding $p(\mathbf{v}\,|\,\mathbf{h};W)$ fixed after training the RBM, but replacing the prior over hidden vectors $p(\mathbf{h};W)$ by a better prior, i.e., a prior that is closer to the aggregated posterior over hidden vectors that can be sampled by first picking a training case and then inferring a hidden vector using (14). This aggregated posterior is exactly what the next RBM in the stack is trained to model.

As shown in [22], there is a series of variational bounds on the log probability of the training data, and furthermore, each time a new RBM is added to the stack, the variational bound on the new and deeper DBN is better than the previous variational bound, provided the new RBM is initialized and learned in the right way. While the existence of a bound that keeps improving is mathematically reassuring, it does not answer the practical issue, addressed in this article, of whether the learned feature detectors are useful for discrimination on a task that is unknown while training the DBN. Nor does it guarantee that anything improves when we use efficient shortcuts such as CD1 training of the RBMs.
One very nice property of a DBN that distinguishes it from other multilayer, directed, nonlinear generative models is that it is possible to infer the states of the layers of hidden units in a single forward pass. This inference, which is used in deriving the variational bound, is not exactly correct but is fairly accurate. So after learning a DBN by training a stack of RBMs, we can jettison the whole probabilistic framework and simply use the generative weights in the reverse direction as a way of initializing all the feature-detecting layers of a deterministic feed-forward DNN. We then just add a final softmax layer and train the whole DNN discriminatively. Unfortunately, a DNN that is pretrained generatively as a DBN is often still called a DBN in the literature. For clarity, we call it a DBN-DNN.
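To summarize the whole pipeline in one place, the sketch below greedily trains a GRBM on (normalized) input data and binary RBMs on the hidden activities of the layer below, then reuses the learned weights and hidden biases to initialize a feed-forward DNN, to which a randomly initialized softmax output layer is added before discriminative fine-tuning. All layer sizes, epoch counts, and data here are illustrative stand-ins, not the configurations used in the experiments reviewed in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_rbm(data, n_hidden, epochs=5, lr=0.05, gaussian_visible=False):
    """Greedy CD1 pretraining of one RBM layer; returns (W, b_hid, hidden_probs)."""
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hidden)
    n = data.shape[0]
    for _ in range(epochs):
        h_prob = logistic(b_hid + data @ W)
        h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
        if gaussian_visible:                      # GRBM: reconstruction is the linear mean, no noise
            v_recon = b_vis + h_samp @ W.T
        else:                                     # binary RBM: reconstruction uses probabilities
            v_recon = logistic(b_vis + h_samp @ W.T)
        h_recon = logistic(b_hid + v_recon @ W)
        W += lr * (data.T @ h_samp - v_recon.T @ h_recon) / n
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_samp - h_recon).mean(axis=0)
    return W, b_hid, logistic(b_hid + data @ W)   # hidden probs become data for the next RBM

# Stack: GRBM on the normalized acoustic input, then binary RBMs above it.
X = rng.standard_normal((1000, 39))               # stand-in for normalized acoustic feature frames
layer_sizes = [512, 512, 512]
weights, biases, data = [], [], X
for i, n_hidden in enumerate(layer_sizes):
    W, b_hid, data = pretrain_rbm(data, n_hidden, gaussian_visible=(i == 0))
    weights.append(W); biases.append(b_hid)

# Initialize the DBN-DNN: pretrained layers plus a randomly initialized softmax output layer,
# then fine-tune the whole network discriminatively with backpropagation (not shown here).
n_states = 10                                     # stand-in for the number of tied HMM states
weights.append(0.01 * rng.standard_normal((layer_sizes[-1], n_states)))
biases.append(np.zeros(n_states))
```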
