Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups

Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, and Brian Kingsbury

Digital Object Identifier 10.1109/MSP.2012.2205597
Date of publication: 15 October 2012

Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represents the acoustic input. An alternative way to evaluate the fit is to use a feed-forward neural network that takes several frames of coefficients as input and produces posterior probabilities over HMM states as output. Deep neural networks (DNNs) that have many hidden layers and are trained using new methods have been shown to outperform GMMs on a variety of speech recognition benchmarks, sometimes by a large margin. This article provides an overview of this progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.

INTRODUCTION
New machine learning algorithms can lead to significant advances in automatic speech recognition (ASR).
The biggest single advance occurred nearly four decades ago with the introduction of the expectation-maximization (EM) algorithm for training HMMs (see [1] and [2] for informative historical reviews of the introduction of HMMs). With the EM algorithm, it became possible to develop speech recognition systems for real-world tasks using the richness of GMMs [3] to represent the relationship between HMM states and the acoustic input. In these systems the acoustic input is typically represented by concatenating Mel-frequency cepstral coefficients (MFCCs) or perceptual linear predictive coefficients (PLPs) [4] computed from the raw waveform and their first- and second-order temporal differences [5]. This nonadaptive but highly engineered preprocessing of the waveform is designed to discard the large amount of information in waveforms that is considered to be irrelevant for discrimination and to express the remaining information in a form that facilitates discrimination with GMM-HMMs.
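As an illustration of this input representation, the following minimal sketch takes a matrix of static cepstral coefficients (one row per frame; random stand-in values here, not real speech data) and appends first- and second-order temporal differences, producing the kind of 39-dimensional feature vector referred to throughout this article. The regression-based delta computation used in standard toolkits is approximated by a plain central difference.

```python
import numpy as np

def add_deltas(static):
    """Append first- and second-order temporal differences to static features.

    static: array of shape (n_frames, n_coeffs), e.g., 13 MFCCs per frame.
    A simple central difference stands in for the short regression window
    used by standard feature-extraction toolkits.
    """
    delta = np.gradient(static, axis=0)    # first-order temporal difference
    delta2 = np.gradient(delta, axis=0)    # second-order temporal difference
    return np.concatenate([static, delta, delta2], axis=1)

mfcc = np.random.randn(200, 13)            # stand-in for 200 frames of 13 MFCCs
features = add_deltas(mfcc)                # shape (200, 39)
```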
GMMs have a number of advantages that make them suitable for modeling the probability distributions over vectors of input features that are associated with each state of an HMM. With enough components, they can model probability distributions to any required level of accuracy, and they are fairly easy to fit to data using the EM algorithm. A huge amount of research has gone into finding ways of constraining GMMs to increase their evaluation speed and to optimize the tradeoff between their flexibility and the amount of training data required to avoid serious overfitting [6].

The recognition accuracy of a GMM-HMM system can be further improved if it is discriminatively fine-tuned after it has been generatively trained to maximize its probability of generating the observed data, especially if the discriminative objective function used for training is closely related to the error rate on phones, words, or sentences [7]. The accuracy can also be improved by augmenting (or concatenating) the input features (e.g., MFCCs) with "tandem" or bottleneck features generated using neural networks [8], [69]. GMMs are so successful that it is difficult for any new method to outperform them for acoustic modeling.
Despite all their advantages, GMMs have a serious shortcoming: they are statistically inefficient for modeling data that lie on or near a nonlinear manifold in the data space. For example, modeling the set of points that lie very close to the surface of a sphere only requires a few parameters using an appropriate model class, but it requires a very large number of diagonal Gaussians or a fairly large number of full-covariance Gaussians. Speech is produced by modulating a relatively small number of parameters of a dynamical system [10], [11], and this implies that its true underlying structure is much lower-dimensional than is immediately apparent in a window that contains hundreds of coefficients. We believe, therefore, that other types of model may work better than GMMs for acoustic modeling if they can more effectively exploit information embedded in a large window of frames.

Artificial neural networks trained by backpropagating error derivatives have the potential to learn much better models of data that lie on or near a nonlinear manifold. In fact, two decades ago, researchers achieved some success using artificial neural networks with a single layer of nonlinear hidden units to predict HMM states from windows of acoustic coefficients [9]. At that time, however, neither the hardware nor the learning algorithms were adequate for training neural networks with many hidden layers on large amounts of data, and the performance benefits of using neural networks with a single hidden layer were not sufficiently large to seriously challenge GMMs. As a result, the main practical contribution of neural networks at that time was to provide extra features in tandem or bottleneck systems.
Over the last few years, advances in both machine learning algorithms and computer hardware have led to more efficient methods for training DNNs that contain many layers of nonlinear hidden units and a very large output layer. The large output layer is required to accommodate the large number of HMM states that arise when each phone is modeled by a number of different "triphone" HMMs that take into account the phones on either side. Even when many of the states of these triphone HMMs are tied together, there can be thousands of tied states. Using the new learning methods, several different research groups have shown that DNNs can outperform GMMs at acoustic modeling for speech recognition on a variety of data sets including large data sets with large vocabularies.
This review article aims to represent the shared views of research groups at the University of Toronto, Microsoft Research (MSR), Google, and IBM Research, who have all had recent successes in using DNNs for acoustic modeling. The article starts by describing the two-stage training procedure that is used for fitting the DNNs. In the first stage, layers of feature detectors are initialized, one layer at a time, by fitting a stack of generative models, each of which has one layer of latent variables. These generative models are trained without using any information about the HMM states that the acoustic model will need to discriminate. In the second stage, each generative model in the stack is used to initialize one layer of hidden units in a DNN and the whole network is then discriminatively fine-tuned to predict the target HMM states. These targets are obtained by using a baseline GMM-HMM system to produce a forced alignment.

In this article, we review exploratory experiments on the TIMIT database [12], [13] that were used to demonstrate the power of this two-stage training procedure for acoustic modeling. The DNNs that worked well on TIMIT were then applied to five different large-vocabulary continuous speech recognition (LVCSR) tasks by three different research groups whose
results we also summarize. The DNNs worked well on all of these tasks when compared with highly tuned GMM-HMM systems, and on some of the tasks they outperformed the state of the art by a large margin. We also describe some other uses of DNNs for acoustic modeling and some variations on the training procedure.

TRAINING DEEP NEURAL NETWORKS
A DNN is a feed-forward, artificial neural network that has more than one layer of hidden units between its inputs and its outputs. Each hidden unit, $j$, typically uses the logistic function (the closely related hyperbolic tangent is also often used, and any function with a well-behaved derivative can be used) to map its total input from the layer below, $x_j$, to the scalar state, $y_j$, that it sends to the layer above:

$$y_j = \mathrm{logistic}(x_j) = \frac{1}{1 + e^{-x_j}}, \qquad x_j = b_j + \sum_i y_i w_{ij}, \qquad (1)$$

where $b_j$ is the bias of unit $j$, $i$ is an index over units in the layer below, and $w_{ij}$ is the weight on a connection to unit $j$ from unit $i$ in the layer below. For multiclass classification, output unit $j$ converts its total input, $x_j$, into a class probability, $p_j$, by using the "softmax" nonlinearity

$$p_j = \frac{\exp(x_j)}{\sum_k \exp(x_k)}, \qquad (2)$$

where $k$ is an index over all classes.
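To make the forward pass in (1) and (2) concrete, the following sketch (plain NumPy; the layer sizes, weight initialization, and function names are illustrative assumptions, not details from the article) computes logistic hidden activations and softmax class probabilities for a single input vector.

```python
import numpy as np

def logistic(x):
    # Equation (1): elementwise logistic nonlinearity.
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Equation (2): subtracting the max adds numerical stability and does not
    # change the result, since softmax is invariant to adding a constant.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dnn_forward(v, weights, biases):
    """Forward pass through a DNN with logistic hidden layers and a softmax output.

    weights: list of matrices W[l] of shape (units_below, units_above)
    biases:  list of vectors b[l], one per layer above the input
    """
    y = v
    for W, b in zip(weights[:-1], biases[:-1]):
        y = logistic(y @ W + b)                 # hidden layers, equation (1)
    x_out = y @ weights[-1] + biases[-1]        # total input to the output units
    return softmax(x_out)                       # class probabilities, equation (2)

# Illustrative sizes: a 39-dimensional acoustic frame, two hidden layers, 10 classes.
rng = np.random.default_rng(0)
sizes = [39, 512, 512, 10]
weights = [0.01 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
p = dnn_forward(rng.standard_normal(39), weights, biases)
print(p.sum())  # probabilities sum to 1
```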
DNNs can be discriminatively trained (DT) by backpropagating derivatives of a cost function that measures the discrepancy between the target outputs and the actual outputs produced for each training case [14]. When using the softmax output function, the natural cost function $C$ is the cross entropy between the target probabilities $d$ and the outputs of the softmax, $p$:

$$C = -\sum_j d_j \log p_j, \qquad (3)$$

where the target probabilities, typically taking values of one or zero, are the supervised information provided to train the DNN classifier.
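A convenient property of pairing the softmax output (2) with the cross-entropy cost (3) is that the derivative of $C$ with respect to the total input $x_j$ of output unit $j$ simplifies to $p_j - d_j$, which is the error signal that backpropagation starts from. The sketch below (hypothetical helper names, not from the article) computes the cost and this gradient for one training case.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy_and_delta(x_out, d):
    """Cost (3) and its gradient w.r.t. the softmax inputs for one training case.

    x_out: total inputs to the output units
    d:     target probabilities (typically a one-hot vector)
    """
    p = softmax(x_out)
    C = -np.sum(d * np.log(p + 1e-12))   # small epsilon guards against log(0)
    delta = p - d                         # dC/dx_j = p_j - d_j for softmax + cross entropy
    return C, delta

d = np.zeros(10); d[3] = 1.0              # one-hot target for class 3
C, delta = cross_entropy_and_delta(np.random.randn(10), d)
```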
For large training sets, it is typically more efficient to compute the derivatives on a small, random "minibatch" of training cases, rather than the whole training set, before updating the weights in proportion to the gradient. This stochastic gradient descent method can be further improved by using a "momentum" coefficient, $0 < \alpha < 1$, that smooths the gradient computed for minibatch $t$, thereby damping oscillations across ravines and speeding progress down ravines:

$$\Delta w_{ij}(t) = \alpha\,\Delta w_{ij}(t-1) - \varepsilon\,\frac{\partial C}{\partial w_{ij}(t)}. \qquad (4)$$

The update rule for biases can be derived by treating them as weights on connections coming from units that always have a state of one.
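A minimal sketch of the minibatch update in (4), assuming the per-minibatch gradient has already been computed by backpropagation; the function and parameter names are illustrative, and the gradients below are random stand-ins.

```python
import numpy as np

def momentum_update(W, velocity, grad, lr=0.1, momentum=0.9):
    """One step of stochastic gradient descent with momentum, as in (4).

    velocity holds the previous weight change Delta w(t-1) and is updated in place.
    """
    velocity[...] = momentum * velocity - lr * grad   # Delta w(t)
    W += velocity                                      # apply the smoothed update
    return W, velocity

W = np.zeros((512, 512))
velocity = np.zeros_like(W)
for minibatch_grad in (np.random.randn(512, 512) for _ in range(5)):  # stand-in gradients
    W, velocity = momentum_update(W, velocity, minibatch_grad)
```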
To reduce overfitting, large weights can be penalized in proportion to their squared magnitude, or the learning can simply be terminated at the point at which performance on a held-out validation set starts getting worse [9]. In DNNs with full connectivity between adjacent layers, the initial weights are given small random values to prevent all of the hidden units in a layer from getting exactly the same gradient.

DNNs with many hidden layers are hard to optimize. Gradient descent from a random starting point near the origin is not the best way to find a good set of weights, and unless the initial scales of the weights are carefully chosen [15], the backpropagated gradients will have very different magnitudes in different layers. In addition to the optimization issues, DNNs may generalize poorly to held-out test data. DNNs with many hidden layers and many units per layer are very flexible models with a very large number of parameters. This makes them capable of modeling very complex and highly nonlinear relationships between inputs and outputs. This ability is important for high-quality acoustic modeling, but it also allows them to model spurious regularities that are an accidental property of the particular examples in the training set, which can lead to severe overfitting. Weight penalties or early stopping can reduce the overfitting but only by removing much of the modeling power.

Very large training sets [16] can reduce overfitting while preserving modeling power, but only by making training very computationally expensive. What we need is a better method of using the information in the training set to build multiple layers of nonlinear feature detectors.

GENERATIVE PRETRAINING
Instead of designing feature detectors to be good for discriminating between classes, we can start by designing them to be good at modeling the structure in the input data. The idea is to learn one layer of feature detectors at a time with the states of the feature detectors in one layer acting as the data for training the next layer. After this generative "pretraining," the multiple layers of feature detectors can be used as a much better starting point for a discriminative "fine-tuning" phase during which backpropagation through the DNN slightly adjusts the weights found in pretraining [17]. Some of the high-level features created by the generative pretraining will be of little use for discrimination, but others will be far more useful than the raw inputs. The generative pretraining finds a region of the weight-space that allows the discriminative fine-tuning to make rapid progress, and it also significantly reduces overfitting [18].

A single layer of feature detectors can be learned by fitting a generative model with one layer of latent variables to the input data. There are two broad classes of generative model to choose from.
A directed model generates data by first choosing the states of the latent variables from a prior distribution and then choosing the states of the observable variables from their conditional distributions given the latent states. Examples of directed models with one layer of latent variables are factor analysis, in which the latent variables are drawn from an isotropic Gaussian, and GMMs, in which they are drawn from a discrete distribution. An undirected model has a very different way of generating data. Instead of using one set of parameters to define a prior distribution over the latent variables and a separate set of parameters to define the conditional distributions of the observable variables given the values of the latent variables, an undirected model uses a single set of parameters, $W$, to define the joint probability of a vector of values of the observable variables, $\mathbf{v}$, and a vector of values of the latent variables, $\mathbf{h}$, via an energy function, $E$:

$$p(\mathbf{v},\mathbf{h};W) = \frac{1}{Z}\,e^{-E(\mathbf{v},\mathbf{h};W)}, \qquad Z = \sum_{\mathbf{v}',\mathbf{h}'} e^{-E(\mathbf{v}',\mathbf{h}';W)}, \qquad (5)$$

where $Z$ is called the partition function.
If many different latent variables interact nonlinearly to generate each data vector, it is difficult to infer the states of the latent variables from the observed data in a directed model because of a phenomenon known as "explaining away" [19]. In undirected models, however, inference is easy provided the latent variables do not have edges linking them. Such a restricted class of undirected models is ideal for layerwise pretraining because each layer will have an easy inference procedure.

We start by describing an approximate learning algorithm for a restricted Boltzmann machine (RBM), which consists of a layer of stochastic binary "visible" units that represent binary input data connected to a layer of stochastic binary hidden units that learn to model significant nonindependencies between the visible units [20]. There are undirected connections between visible and hidden units but no visible-visible or hidden-hidden connections. An RBM is a type of Markov random field (MRF) but differs from most MRFs in several ways: it has a bipartite connectivity graph, it does not usually share weights between different units, and a subset of the variables are unobserved, even during training.

AN EFFICIENT LEARNING PROCEDURE FOR RBMs
A joint configuration, $(\mathbf{v},\mathbf{h})$, of the visible and hidden units of an RBM has an energy given by

$$E(\mathbf{v},\mathbf{h}) = -\sum_{i\in\text{visible}} a_i v_i \;-\; \sum_{j\in\text{hidden}} b_j h_j \;-\; \sum_{i,j} v_i h_j w_{ij}, \qquad (6)$$
where $v_i$, $h_j$ are the binary states of visible unit $i$ and hidden unit $j$, $a_i$, $b_j$ are their biases, and $w_{ij}$ is the weight between them. The network assigns a probability to every possible pair of a visible and a hidden vector via this energy function as in (5), and the probability that the network assigns to a visible vector, $\mathbf{v}$, is given by summing over all possible hidden vectors:

$$p(\mathbf{v}) = \frac{1}{Z}\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}. \qquad (7)$$

The derivative of the log probability of a training set with respect to a weight is surprisingly simple:
$$\frac{1}{N}\sum_{n=1}^{N}\frac{\partial \log p(\mathbf{v}^{n})}{\partial w_{ij}} = \langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\text{model}}, \qquad (8)$$

where $N$ is the size of the training set and the angle brackets are used to denote expectations under the distribution specified by the subscript that follows. The simple derivative in (8) leads to a very simple learning rule for performing stochastic steepest ascent in the log probability of the training data:

$$\Delta w_{ij} = \varepsilon\left(\langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\text{model}}\right), \qquad (9)$$

where $\varepsilon$ is a learning rate.
The absence of direct connections between hidden units in an RBM makes it very easy to get an unbiased sample of $\langle v_i h_j\rangle_{\text{data}}$. Given a randomly selected training case, $\mathbf{v}$, the binary state, $h_j$, of each hidden unit, $j$, is set to one with probability

$$p(h_j = 1\,|\,\mathbf{v}) = \mathrm{logistic}\Big(b_j + \sum_i v_i w_{ij}\Big), \qquad (10)$$

and $v_i h_j$ is then an unbiased sample. The absence of direct connections between visible units in an RBM makes it very easy to get an unbiased sample of the state of a visible unit, given a hidden vector:

$$p(v_i = 1\,|\,\mathbf{h}) = \mathrm{logistic}\Big(a_i + \sum_j h_j w_{ij}\Big). \qquad (11)$$
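As a sketch of (10) and (11), both conditionals can be computed for whole layers at once with a matrix product. The function names below are illustrative, and the example assumes binary units throughout.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs(v, W, b_hid):
    # Equation (10): p(h_j = 1 | v) for every hidden unit j at once.
    return logistic(b_hid + v @ W)

def visible_probs(h, W, b_vis):
    # Equation (11): p(v_i = 1 | h) for every visible unit i at once.
    return logistic(b_vis + h @ W.T)

def sample_bernoulli(probs, rng):
    # Turn probabilities into stochastic binary states.
    return (rng.random(probs.shape) < probs).astype(float)

rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((39, 512)); b_vis = np.zeros(39); b_hid = np.zeros(512)
v = sample_bernoulli(np.full(39, 0.5), rng)            # a random binary visible vector
h = sample_bernoulli(hidden_probs(v, W, b_hid), rng)   # stochastic hidden states via (10)
```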
Getting an unbiased sample of $\langle v_i h_j\rangle_{\text{model}}$, however, is much more difficult. It can be done by starting at any random state of the visible units and performing alternating Gibbs sampling for a very long time. Alternating Gibbs sampling consists of updating all of the hidden units in parallel using (10) followed by updating all of the visible units in parallel using (11).

A much faster learning procedure called contrastive divergence (CD) was proposed in [20]. This starts by setting the states of the visible units to a training vector. Then the binary states of the hidden units are all computed in parallel using (10). Once binary states have been chosen for the hidden units, a "reconstruction" is produced by setting each $v_i$ to one with a probability given by (11). Finally, the states of the hidden units are updated again. The change in a weight is then given by

$$\Delta w_{ij} = \varepsilon\left(\langle v_i h_j\rangle_{\text{data}} - \langle v_i h_j\rangle_{\text{recon}}\right). \qquad (12)$$
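Putting (10)-(12) together, one CD1 update on a minibatch might look like the sketch below, which follows the recommendations given later in this section: sampled binary hidden states for the data term and real-valued probabilities for the reconstruction. All names and the learning rate are illustrative assumptions, not settings from the article.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, b_vis, b_hid, lr=0.05, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD1) step on a minibatch V of binary row vectors."""
    n = V.shape[0]
    # Positive phase: sample binary hidden states from p(h|v), equation (10).
    h_prob = logistic(b_hid + V @ W)
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Reconstruction: real-valued probabilities from (11), then hidden probabilities again.
    v_recon = logistic(b_vis + h_samp @ W.T)
    h_recon = logistic(b_hid + v_recon @ W)
    # Weight update, equation (12); biases use unit states instead of pairwise products.
    W += lr * (V.T @ h_samp - v_recon.T @ h_recon) / n
    b_vis += lr * (V - v_recon).mean(axis=0)
    b_hid += lr * (h_samp - h_recon).mean(axis=0)
    return W, b_vis, b_hid

V = (np.random.default_rng(1).random((32, 39)) < 0.5).astype(float)  # stand-in minibatch
W = 0.01 * np.random.default_rng(2).standard_normal((39, 512))
W, b_vis, b_hid = cd1_update(V, W, np.zeros(39), np.zeros(512))
```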
A simplified version of the same learning rule that uses the states of individual units instead of pairwise products is used for the biases.

CD works well even though it is only crudely approximating the gradient of the log probability of the training data [20]. RBMs learn better generative models if more steps of alternating Gibbs sampling are used before collecting the statistics for the second term in the learning rule, but for the purposes of pretraining feature detectors, more alternations are generally of little value and all the results reviewed here were obtained using CD1, which does a single full step of alternating Gibbs sampling after the initial update of the hidden units. To suppress noise in the learning, the real-valued probabilities rather than binary samples are generally used for the reconstructions and the subsequent states of the hidden units, but it is important to use sampled binary values for the first computation of the hidden states because the sampling noise acts as a very effective regularizer that prevents overfitting [21].

MODELING REAL-VALUED DATA
Real-valued data, such as MFCCs, are more naturally modeled by linear variables with Gaussian noise, and the RBM energy function can be modified to accommodate such variables, giving a Gaussian–Bernoulli RBM (GRBM):

$$E(\mathbf{v},\mathbf{h}) = \sum_{i\in\text{vis}}\frac{(v_i - a_i)^2}{2\sigma_i^2} \;-\; \sum_{j\in\text{hid}} b_j h_j \;-\; \sum_{i,j}\frac{v_i}{\sigma_i}\, h_j w_{ij}, \qquad (13)$$

where $\sigma_i$ is the standard deviation of the Gaussian noise for visible unit $i$.
The two conditional distributions required for CD1 learning are

$$p(h_j = 1\,|\,\mathbf{v}) = \mathrm{logistic}\Big(b_j + \sum_i \frac{v_i}{\sigma_i}\, w_{ij}\Big), \qquad (14)$$

$$p(v_i\,|\,\mathbf{h}) = \mathcal{N}\Big(a_i + \sigma_i\sum_j h_j w_{ij},\ \sigma_i^2\Big), \qquad (15)$$

where $\mathcal{N}(\mu,\sigma^2)$ is a Gaussian. Learning the standard deviations of a GRBM is problematic for reasons described in [21], so for pretraining using CD1, the data are normalized so that each coefficient has zero mean and unit variance, the standard deviations are set to one when computing $p(\mathbf{v}\,|\,\mathbf{h})$, and no noise is added to the reconstructions. This avoids the issue of deciding the right noise level.
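With unit variances and mean-and-variance-normalized data, a CD1 step for a GRBM differs from the binary case only in the reconstruction, which becomes the linear mean in (15) with no added noise. A sketch under exactly those assumptions (again with illustrative names and learning rate):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def grbm_cd1_update(V, W, b_vis, b_hid, lr=0.01, rng=np.random.default_rng(0)):
    """CD1 for a Gaussian-Bernoulli RBM with sigma_i = 1 and normalized data V."""
    n = V.shape[0]
    h_prob = logistic(b_hid + V @ W)                 # equation (14) with sigma_i = 1
    h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
    v_recon = b_vis + h_samp @ W.T                   # mean of (15); no noise added
    h_recon = logistic(b_hid + v_recon @ W)
    W += lr * (V.T @ h_samp - v_recon.T @ h_recon) / n
    b_vis += lr * (V - v_recon).mean(axis=0)
    b_hid += lr * (h_samp - h_recon).mean(axis=0)
    return W, b_vis, b_hid
```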
STACKING RBMs TO MAKE A DEEP BELIEF NETWORK
After training an RBM on the data, the inferred states of the hidden units can be used as data for training another RBM that learns to model the significant dependencies between the hidden units of the first RBM. This can be repeated as many times as desired to produce many layers of nonlinear feature detectors that represent progressively more complex statistical structure in the data. The RBMs in a stack can be combined in a surprising way to produce [22] a single, multilayer generative model called a deep belief net (DBN) (not to be confused with a dynamic Bayesian net, which is a type of directed model of temporal data that unfortunately has the same acronym). Even though each RBM is an undirected model, the DBN formed by the whole stack is a hybrid generative model whose top two layers are undirected (they are the final RBM in the stack) but whose lower layers have top-down, directed connections (see Figure 1).

To understand how RBMs are composed into a DBN, it is helpful to rewrite (7) and to make explicit the dependence on $W$:

$$p(\mathbf{v};W) = \sum_{\mathbf{h}} p(\mathbf{h};W)\, p(\mathbf{v}\,|\,\mathbf{h};W), \qquad (16)$$
where $p(\mathbf{h};W)$ is defined as in (7) but with the roles of the visible and hidden units reversed. Now it is clear that the model can be improved by holding $p(\mathbf{v}\,|\,\mathbf{h};W)$ fixed after training the RBM, but replacing the prior over hidden vectors $p(\mathbf{h};W)$ by a better prior, i.e., a prior that is closer to the aggregated posterior over hidden vectors that can be sampled by first picking a training case and then inferring a hidden vector using (14). This aggregated posterior is exactly what the next RBM in the stack is trained to model.

As shown in [22], there is a series of variational bounds on the log probability of the training data, and furthermore, each time a new RBM is added to the stack, the variational bound on the new and deeper DBN is better than the previous variational bound, provided the new RBM is initialized and learned in the right way. While the existence of a bound that keeps improving is mathematically reassuring, it does not answer the practical issue, addressed in this article, of whether the learned feature detectors are useful for discrimination on a task that is unknown while training the DBN. Nor does it guarantee that anything improves when we use efficient shortcuts such as CD1 training of the RBMs.
One very nice property of a DBN that distinguishes it from other multilayer, directed, nonlinear generative models is that it is possible to infer the states of the layers of hidden units in a single forward pass. This inference, which is used in deriving the variational bound, is not exactly correct but is fairly accurate. So after learning a DBN by training a stack of RBMs, we can jettison the whole probabilistic framework and simply use the generative weights in the reverse direction as a way of initializing all the feature-detecting layers of a deterministic feed-forward DNN. We then just add a final softmax layer and train the whole DNN discriminatively. Unfortunately, a DNN that is pretrained generatively as a DBN is often still called a DBN in the literature. For clarity, we call it a DBN-DNN.
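To summarize the whole pipeline in one place, the sketch below greedily trains a GRBM on (normalized) input data and binary RBMs on the hidden activities of the layer below, then reuses the learned weights and hidden biases to initialize a feed-forward DNN, to which a randomly initialized softmax output layer is added before discriminative fine-tuning. All layer sizes, epoch counts, and data here are illustrative stand-ins, not the configurations used in the experiments reviewed in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_rbm(data, n_hidden, epochs=5, lr=0.05, gaussian_visible=False):
    """Greedy CD1 pretraining of one RBM layer; returns (W, b_hid, hidden_probs)."""
    n_vis = data.shape[1]
    W = 0.01 * rng.standard_normal((n_vis, n_hidden))
    b_vis, b_hid = np.zeros(n_vis), np.zeros(n_hidden)
    n = data.shape[0]
    for _ in range(epochs):
        h_prob = logistic(b_hid + data @ W)
        h_samp = (rng.random(h_prob.shape) < h_prob).astype(float)
        if gaussian_visible:                      # GRBM: reconstruction is the linear mean, no noise
            v_recon = b_vis + h_samp @ W.T
        else:                                     # binary RBM: reconstruction uses probabilities
            v_recon = logistic(b_vis + h_samp @ W.T)
        h_recon = logistic(b_hid + v_recon @ W)
        W += lr * (data.T @ h_samp - v_recon.T @ h_recon) / n
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_samp - h_recon).mean(axis=0)
    return W, b_hid, logistic(b_hid + data @ W)   # hidden probs become data for the next RBM

# Stack: GRBM on the normalized acoustic input, then binary RBMs above it.
X = rng.standard_normal((1000, 39))               # stand-in for normalized acoustic feature frames
layer_sizes = [512, 512, 512]
weights, biases, data = [], [], X
for i, n_hidden in enumerate(layer_sizes):
    W, b_hid, data = pretrain_rbm(data, n_hidden, gaussian_visible=(i == 0))
    weights.append(W); biases.append(b_hid)

# Initialize the DBN-DNN: pretrained layers plus a randomly initialized softmax output layer,
# then fine-tune the whole network discriminatively with backpropagation (not shown here).
n_states = 10                                     # stand-in for the number of tied HMM states
weights.append(0.01 * rng.standard_normal((layer_sizes[-1], n_states)))
biases.append(np.zeros(n_states))
```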
