
Structural Topic Model for Latent Topical Structure Analysis

Hongning Wang, Duo Zhang, ChengXiang Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801, USA
{wang296, dzhang22, czhai}@cs.uiuc.edu
Abstract
Topic models have been successfully applied to many document analysis tasks to discover topics embedded in text. However, existing topic models generally cannot capture the latent topical structures in documents. Since languages are intrinsically cohesive and coherent, modeling and discovering latent topical transition structures within documents would be beneficial for many text analysis tasks. In this work, we propose a new topic model, Structural Topic Model, which simultaneously discovers topics and reveals the latent topical structures in text through explicitly modeling topical transitions with a latent first-order Markov chain. Experiment results show that the proposed Structural Topic Model can effectively discover topical structures in text, and the identified structures significantly improve the performance of tasks such as sentence annotation and sentence ordering.
1 Introduction
A great amount of effort has recently been made in applying statistical topic models (Hofmann, 1999; Blei et al., 2003) to explore word co-occurrence patterns, i.e., topics, embedded in documents. Topic models have become important building blocks of many interesting applications (e.g., (Blei and Jordan, 2003; Blei and Lafferty, 2007; Mei et al., 2007; Lu and Zhai, 2008)).
In general, topic models can discover word clustering patterns in documents and project each document to a latent topic space formed by such word clusters. However, the topical structure in a document, i.e., the internal dependency between the topics, is generally not captured due to the exchangeability assumption (Blei et al., 2003), i.e., the document generation probabilities are invariant to content permutation. In reality, natural language text rarely consists of isolated, unrelated sentences, but rather collocated, structured and coherent groups of sentences (Hovy, 1993). Ignoring such latent topical structures inside documents means wasting valuable clues about topics and thus leads to non-optimal topic modeling.
Taking apartment rental advertisements as an example, when people write advertisements for their apartments, it is natural to first introduce "size" and "address" of the apartment, and then "rent" and "contact". Few people would talk about "restriction" first. If this kind of topical structure is captured by a topic model, it would not only improve the topic mining results, but, more importantly, also help many other document analysis tasks, such as sentence annotation and sentence ordering.

Nevertheless, very few existing topic models have attempted to model such structural dependency among topics. The Aspect HMM model introduced in (Blei and Moreno, 2001) combines pLSA (Hofmann, 1999) with HMM (Rabiner, 1989) to perform document segmentation over text streams. However, Aspect HMM separately estimates the topics in the training set and depends on heuristics to infer the transitional relations between topics. The Hidden Topic Markov Model (HTMM) proposed by (Gruber et al., 2007) extends the traditional topic models by assuming words in each sentence share the same topic assignment, and topics transit between adjacent sentences. However, the transitional structures among topics, i.e., how likely one topic would follow another topic, are not captured in this model.
In this paper, we propose a new topic model, named Structural Topic Model (strTM), to model and analyze both latent topics and topical structures in text documents. To do so, strTM assumes: 1) words in a document are either drawn from a content topic or a functional (i.e., background) topic; 2) words in the same sentence share the same content topic; and 3) content topics in adjacent sentences follow a topic transition that satisfies the first-order Markov property. The first assumption distinguishes the semantics of the occurrence of each word in the document, the second requirement confines the unrealistic "bag-of-words" assumption to a tighter unit, and the third assumption exploits the connection between adjacent sentences.
To evaluate the usefulness of the topical structures identified by strTM, we applied strTM to the tasks of sentence annotation and sentence ordering, where correctly modeling the document structure is crucial. On the corpus of 8,031 apartment advertisements from craigslist (Grenager et al., 2005) and 1,991 movie reviews from IMDB (Zhuang et al., 2006), strTM achieved encouraging improvements in both tasks compared with baseline methods that do not explicitly model the topical structure. The results confirm the necessity of modeling the latent topical structures inside documents, and also demonstrate the advantages of the proposed strTM over existing topic models.
2 Related Work
Topic models have been successfully applied to many problems, e.g., sentiment analysis (Mei et al., 2007), document summarization (Lu and Zhai, 2008) and image annotation (Blei and Jordan, 2003). However, in most existing work, the dependency among the topics is loosely governed by the prior topic distribution, e.g., a Dirichlet distribution.

Some work has attempted to capture the interrelationship among the latent topics. Correlated Topic Model (Blei and Lafferty, 2007) replaces the Dirichlet prior with a logistic Normal prior for the topic distribution in each document in order to capture the correlation between the topics. HMM-LDA (Griffiths et al., 2005) distinguishes the short-range syntactic dependencies from long-range semantic dependencies among the words in each document. But in HMM-LDA, only the latent variables for the syntactic classes are treated as a locally dependent sequence, while latent topics are treated the same as in other topic models. Chen et al. introduced the generalized Mallows model to constrain the latent topic assignments (Chen et al., 2009). In their model, they assume there exists a canonical order among the topics in the collection of related documents, and the same topics are forced not to appear in disconnected portions of the topic sequence in one document (sampling without replacement). Our method relaxes this assumption by only postulating transitional dependency between topics in adjacent sentences (sampling with replacement) and thus potentially allows a topic to appear multiple times in disconnected segments. As discussed in the previous section, HTMM (Gruber et al., 2007) is the model most similar to ours. HTMM models the document structure by assuming words in the same sentence share the same topic assignment and successive sentences are more likely to share the same topic. However, HTMM only loosely models the transition between topics as a binary relation: the same as the previous sentence's assignment, or a new one drawn with a certain probability. This simplified, coarse modeling of dependency cannot fully capture the complex structure across different documents. In contrast, our strTM model explicitly captures the regular topic transitions by postulating the first-order Markov property over the topics.

Another line of related work is discourse analysis in natural language processing: discourse segmentation (Sun et al., 2007; Galley et al., 2003) splits a document into a linear sequence of multi-paragraph passages, where lexical cohesion is used to link together the textual units; discourse parsing (Soricut and Marcu, 2003; Marcu, 1998) tries to uncover a more sophisticated hierarchical coherence structure from text to represent the entire discourse. One work in this line that shares a similar goal as ours is the content models (Barzilay and Lee, 2004), where an HMM is defined over text spans to perform information ordering and extractive summarization. A deficiency of the content models is that the identification of clusters of text spans is done separately from transition modeling. Our strTM addresses this deficiency by defining a generative process to simultaneously capture the topics and the transitional relationship among topics, allowing topic modeling and transition modeling to reinforce each other in a principled framework.
3 Structural Topic Model
In this section, we formally define the Structural Topic Model (strTM) and discuss how it captures the latent topics and topical structures within documents simultaneously. From the theory of linguistic analysis (Kamp, 1981), we know that a document exhibits internal structures, where structural segments encapsulate semantic units that are closely related. In strTM, we treat a sentence as the basic structural unit, and assume all the words in a sentence share the same topical aspect. Besides, two adjacent segments are assumed to be highly related (capturing cohesion in text); specifically, in strTM we pose a strong transitional dependency assumption among the topics: the choice of topic for each sentence directly depends on the previous sentence's topic assignment, i.e., the first-order Markov property. Moreover, taking the insight from HMM-LDA that not all the words are content conveying (some of them may just be a result of syntactic requirements), we introduce a dummy functional topic z_B for every sentence in the document. We use this functional topic to capture the document-independent word distribution, i.e., the corpus background (Zhai et al., 2004). As a result, in strTM, every sentence is treated as a mixture of content and functional topics.
Formally, we assume a corpus consists of D documents with a vocabulary of size V, and there are k content topics embedded in the corpus. In a given document d, there are m sentences and each sentence i has N_i words. We assume the topic transition probability p(z|z′) is drawn from a Multinomial distribution Mul(α_z′), and the word emission probability under each topic p(w|z) is drawn from a Multinomial distribution Mul(β_z).

To get a unified description of the generation process, we add another dummy topic T-START in strTM, which is the initial topic with position "-1" for every document but does not emit any words. In addition, since our functional topic is assumed to occur in all the sentences, we don't need to model its transition with the other content topics. We use a Binomial variable π to control the proportion between content and functional topics in each sentence. Therefore, there are k+1 topic transition distributions, one for T-START and k for the content topics; and k emission probabilities for the content topics, with an additional one for the functional topic z_B (in total k+1 emission probability distributions). Conditioned on the model parameters Θ = (α, β, π), the generative process of a document in strTM can be described as follows:
1. For each sentence s_i in document d:

   (a) Draw topic z_i from a Multinomial distribution conditioned on the previous sentence s_{i-1}'s topic assignment z_{i-1}:

       $z_i \sim Mul(\alpha_{z_{i-1}})$

   (b) Draw each word w_ij in sentence s_i from the mixture of content topic z_i and functional topic z_B:

       $w_{ij} \sim \pi\, p(w_{ij} \mid \beta, z_i) + (1-\pi)\, p(w_{ij} \mid \beta, z_B)$

The joint probability of sentences and topics in one document defined by strTM is thus given by:

$$p(S_0, S_1, \ldots, S_m, z \mid \alpha, \beta, \pi) = \prod_{i=1}^{m} p(z_i \mid \alpha, z_{i-1})\, p(S_i \mid z_i) \quad (1)$$

where the topic-to-sentence emission probability is defined as:

$$p(S_i \mid z_i) = \prod_{j=0}^{N_i} \big[ \pi\, p(w_{ij} \mid \beta, z_i) + (1-\pi)\, p(w_{ij} \mid \beta, z_B) \big] \quad (2)$$

This process is graphically illustrated in Figure 1.
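To make this generative story concrete, the following is a minimal sketch (not the authors' code) of sampling one document under strTM; the topic count k, the transition matrix alpha, the emission matrix beta, and the mixing weight pi below are hypothetical placeholders rather than estimated parameters.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model parameters (placeholders, not learned from data):
k, V = 4, 1000                                   # number of content topics, vocabulary size
T_START, Z_B = k, k                              # row index of T-START in alpha, of the functional topic in beta
alpha = rng.dirichlet(np.ones(k), size=k + 1)    # rows 0..k-1: content topics, row k: T-START; each row is p(next topic | previous topic)
beta = rng.dirichlet(np.ones(V), size=k + 1)     # rows 0..k-1: content topics, row k: functional topic z_B; each row is p(word | topic)
pi = 0.7                                         # proportion of content-topic words per sentence

def generate_document(num_sentences=5, words_per_sentence=10):
    """Sample one document following the strTM generative process (Section 3)."""
    sentences, topics = [], []
    prev_topic = T_START
    for _ in range(num_sentences):
        # (a) draw the sentence topic from the transition distribution of the previous topic
        z = rng.choice(k, p=alpha[prev_topic])
        # (b) draw each word from the mixture of content topic z and functional topic z_B
        words = [rng.choice(V, p=beta[z if rng.random() < pi else Z_B])
                 for _ in range(words_per_sentence)]
        sentences.append(words)
        topics.append(z)
        prev_topic = z
    return sentences, topics

doc_words, doc_topics = generate_document()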
[Figure 1: Graphical representation of strTM.]

From the definition of strTM, we can see that the document structure is characterized by a document-specific topic chain, and forcing the words in one sentence to share the same content topic ensures semantic cohesion of the mined topics. Although we do not directly model the topic mixture for each document as traditional topic models do, the word co-occurrence patterns within the same document are captured by topic propagation through the transitions. This can be easily understood when we write down the posterior probability of the topic assignment for a particular sentence:
$$
\begin{aligned}
p(z_i \mid S_0, S_1, \ldots, S_m, \Theta)
&= \frac{p(S_0, S_1, \ldots, S_m \mid z_i, \Theta)\, p(z_i)}{p(S_0, S_1, \ldots, S_m)} \\
&\propto p(S_0, \ldots, S_i, z_i) \times p(S_{i+1}, S_{i+2}, \ldots, S_m \mid z_i) \\
&= \Big[\sum_{z_{i-1}} p(S_0, \ldots, S_{i-1}, z_{i-1})\, p(z_i \mid z_{i-1})\, p(S_i \mid z_i)\Big] \times \Big[\sum_{z_{i+1}} p(S_{i+1}, \ldots, S_m \mid z_{i+1})\, p(z_{i+1} \mid z_i)\Big] \quad (3)
\end{aligned}
$$
The first part of Eq (3) describes the recursive influence on the choice of topic for the i-th sentence from its preceding sentences, while the second part captures how the succeeding sentences affect the current topic assignment. Intuitively, when we need to decide a sentence's topic, we will look "backward" and "forward" over all the sentences in the document to determine a "suitable" one. In addition, because of the first-order Markov property, the local topical dependency receives more emphasis, i.e., adjacent sentences interact directly through the transition probabilities p(z_i|z_{i-1}) and p(z_{i+1}|z_i), and such interaction between sentences farther away gets damped by the multiplication of these probabilities. This result is reasonable, especially in a long document, since neighboring sentences are more likely to cover similar topics than two sentences far apart.
4 Posterior Inference and Parameter Estimation
The chain structure in strTM enables us to perform exact inference: the posterior distribution can be efficiently calculated by the forward-backward algorithm, the optimal topic sequence can be inferred using the Viterbi algorithm, and parameter estimation can be solved by the Expectation Maximization (EM) algorithm. More technical details can be found in (Rabiner, 1989). In this section, we only discuss strTM-specific procedures.
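As a rough illustration of how the chain structure admits exact inference, the following is a minimal forward-backward sketch over sentence-level emission likelihoods; the inputs (a per-sentence emission matrix emit[i, z] = p(S_i | z) computed from Eq (2), a transition matrix, and a start distribution) are assumed to be precomputed, and the function name is hypothetical.

import numpy as np

def forward_backward(emit, trans, start):
    """Posterior p(z_i = z | S_0, ..., S_m) for one document's sentence chain.

    emit[i, z]  : p(S_i | z), the sentence emission probability of Eq (2)
    trans[u, v] : p(z = v | z' = u), the topic transition probability
    start[z]    : p(z | T-START), the initial topic distribution
    """
    m, k = emit.shape
    fwd = np.zeros((m, k))                         # fwd[i, z] ~ p(S_0..S_i, z_i = z)
    bwd = np.ones((m, k))                          # bwd[i, z] ~ p(S_{i+1}..S_m | z_i = z)
    fwd[0] = start * emit[0]
    for i in range(1, m):                          # forward recursion
        fwd[i] = (fwd[i - 1] @ trans) * emit[i]
    for i in range(m - 2, -1, -1):                 # backward recursion
        bwd[i] = trans @ (emit[i + 1] * bwd[i + 1])
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)  # normalize per sentence

In practice the forward and backward messages would be rescaled (or kept in log space) to avoid numerical underflow on long documents, as described in (Rabiner, 1989).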
In the E-Step of the EM algorithm, we need to collect the expected count of a sequential topic pair (z, z′) and a topic-word pair (z, w) to update the model parameters α and β in the M-Step. In strTM, E[c(z, z′)] can be easily calculated by the forward-backward algorithm. But we have to go one step further to fetch the required sufficient statistics for E[c(z, w)], because our emission probabilities are defined over sentences.
Through the forward-backward algorithm, we can get the posterior probability p(s_i, z|d, Θ). In strTM, words in one sentence are independently drawn from either a specific content topic z or the functional topic z_B according to the mixture weight π. Therefore, we can accumulate the expected count of (z, w) over all the sentences by:
$$E[c(z, w)] = \sum_{d}\sum_{s \in d} \frac{\pi\, p(w \mid z)\, p(s, z \mid d, \Theta)\, c(w, s)}{\pi\, p(w \mid z) + (1-\pi)\, p(w \mid z_B)} \quad (4)$$

where c(w, s) indicates the frequency of word w in sentence s.
Eq (4) can be explained as follows. Since we already observe topic z and sentence s co-occurring with probability p(s, z|d, Θ), each word w in s should share the same probability of being observed with content topic z. Thus the expected count of c(z, w) contributed by this sentence would be p(s, z|d, Θ) c(w, s). However, since each sentence is also associated with the functional topic z_B, the word w may also be drawn from z_B. By applying Bayes' rule, we can properly reallocate the expected count of c(z, w) as in Eq (4). The same strategy can be applied to obtain E[c(z_B, w)].
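The reallocation in Eq (4) can be sketched as follows; here sentence_topic_post stands for the p(s, z|d, Θ) values obtained from the forward-backward pass, and the data layout and function name are hypothetical.

import numpy as np

def expected_word_counts(docs, sentence_topic_post, p_w_given_z, p_w_given_zB, pi):
    """Accumulate E[c(z, w)] of Eq (4) and, by the same reallocation, E[c(z_B, w)].

    docs[d][s]               : list of (word_id, frequency) pairs for sentence s of document d
    sentence_topic_post[d][s]: length-k array with p(s, z | d, Theta) from forward-backward
    p_w_given_z[z, w]        : current content-topic emission probabilities
    p_w_given_zB[w]          : current functional-topic emission probabilities
    """
    k, V = p_w_given_z.shape
    ec_zw = np.zeros((k, V))                       # expected counts for the content topics
    ec_zBw = np.zeros(V)                           # expected counts for the functional topic
    for d, doc in enumerate(docs):
        for s, sentence in enumerate(doc):
            post = sentence_topic_post[d][s]       # p(s, z | d, Theta) for every content topic z
            for w, c_ws in sentence:
                num = pi * p_w_given_z[:, w]                        # pi * p(w|z) for every z
                share = num / (num + (1.0 - pi) * p_w_given_zB[w])  # Bayes reallocation of Eq (4)
                ec_zw[:, w] += share * post * c_ws
                ec_zBw[w] += np.sum((1.0 - share) * post) * c_ws
    return ec_zw, ec_zBw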
As discussed in (Johnson, 2007), to avoid the problem that the EM algorithm tends to assign a uniform word/state distribution to each hidden state, which deviates from the heavily skewed word/state distributions observed empirically, we can apply a Bayesian estimation approach for strTM. Thus we introduce prior distributions over the topic transition Mul(α_z′) and emission probabilities Mul(β_z), and use the Variational Bayesian (VB) estimator (Jordan et al., 1999) to obtain a model with more skewed word/state distributions.
Since both the topic transition and emission probabilities are Multinomial distributions in strTM, the conjugate Dirichlet distribution is the natural choice for imposing a prior on them (Diaconis and Ylvisaker, 1979). Thus, we further assume:

$$\alpha_z \sim Dir(\eta) \quad (5)$$
$$\beta_z \sim Dir(\gamma) \quad (6)$$

where we use exchangeable Dirichlet distributions to control the sparsity of α_z and β_z. As η and γ approach zero, the prior strongly favors models in which each hidden state emits as few words/states as possible. In our experiments, we empirically tuned η and γ on each training corpus to optimize the log-likelihood.
The resulting VB estimation only requires a minor modification to the M-Step of the original EM algorithm:

$$\bar{\alpha}_z = \frac{\Phi(E[c(z', z)] + \eta)}{\Phi(E[c(z)] + k\eta)} \quad (7)$$

$$\bar{\beta}_z = \frac{\Phi(E[c(w, z)] + \gamma)}{\Phi(E[c(z)] + V\gamma)} \quad (8)$$

where Φ(x) is the exponential of the first derivative of the log-gamma function, i.e., Φ(x) = exp(ψ(x)) with ψ the digamma function.
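A minimal sketch of this VB M-Step follows, assuming the expected counts from the E-Step have already been accumulated; digamma comes from scipy.special, and the count-matrix layout (transitions normalized over the counts of the conditioning topic) is an assumption rather than a detail stated in the paper.

import numpy as np
from scipy.special import digamma

def vb_m_step(ec_trans, ec_zw, eta, gamma):
    """Variational Bayesian updates of Eqs (7)-(8).

    ec_trans[z_prev, z]: expected transition counts E[c(z', z)] (rows: T-START and content topics)
    ec_zw[z, w]        : expected emission counts  E[c(z, w)]
    """
    k = ec_trans.shape[1]
    V = ec_zw.shape[1]

    def phi(x):
        # Phi(x) = exp(psi(x)), the exponentiated digamma function
        return np.exp(digamma(x))

    alpha = phi(ec_trans + eta) / phi(ec_trans.sum(axis=1, keepdims=True) + k * eta)
    beta = phi(ec_zw + gamma) / phi(ec_zw.sum(axis=1, keepdims=True) + V * gamma)
    return alpha, beta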
The optimal setting of π for the proportion of content topics in the documents is empirically tuned by cross-validation over the training corpus to maximize the log-likelihood.
5 Experimental Results
In this section, we demonstrate the effectiveness of strTM in identifying latent topical structures from documents, and quantitatively evaluate how the mined topic transitions can help the tasks of sentence annotation and sentence ordering.
5.1 Data Set
We used two different data sets for evaluation: apartment advertisements (Ads) from (Grenager et al., 2005) and movie reviews (Review) from (Zhuang et al., 2006).
The Ads data consists of 8,767 advertisements for apartment rentals crawled from the Craigslist website. 302 of them have been labeled with 11 fields, including size, feature, address, etc., at the sentence level.
The Review data contains 2,000 movie reviews discussing 11 different movies from IMDB. These reviews are manually labeled with 12 movie feature labels, e.g., VP (vision effects), MS (music and sound effects), etc., also at the sentence level (we did not use the additional opinion annotations in this data set). However, the annotations in the Review data set are much sparser than those in the Ads data set (see Table 1). The sentence-level annotations make it possible to quantitatively evaluate the discovered topic structures.
We performed simple preprocessing on these two data sets: 1) removed a standard list of stop words and terms occurring in fewer than 2 documents; 2) discarded documents with fewer than 2 sentences; 3) aggregated the sentence-level annotations into document-level labels (a binary vector) for each document. Table 1 gives a brief summary of the two data sets after this processing.
                        Ads       Review
Document Size           8,031     1,991
Vocabulary Size         21,993    14,507
Avg Stn/Doc             8.0       13.9
Avg Labeled Stn/Doc     7.1*      5.1
Avg Token/Stn           14.1      20.0
*Only in the 302 labeled ads

Table 1: Summary of the evaluation data sets
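As a rough illustration, the sketch below applies the three preprocessing steps described above to a toy document collection; the tokenization, stop-word list, and label-aggregation details are assumptions rather than the authors' exact pipeline.

from collections import Counter

def preprocess(docs, stop_words, min_doc_freq=2, min_sentences=2):
    """docs: list of documents; each document is a list of (sentence_tokens, sentence_label) pairs."""
    # 1) document frequency of each term, used to drop stop words and rare terms
    df = Counter()
    for doc in docs:
        df.update({t for sent, _ in doc for t in sent})

    def keep(t):
        return t not in stop_words and df[t] >= min_doc_freq

    processed = []
    for doc in docs:
        sentences = [([t for t in sent if keep(t)], label) for sent, label in doc]
        sentences = [(sent, label) for sent, label in sentences if sent]
        # 2) discard documents with fewer than min_sentences sentences
        if len(sentences) < min_sentences:
            continue
        # 3) aggregate sentence-level annotations into a document-level label set (binary vector)
        doc_labels = sorted({label for _, label in sentences if label is not None})
        processed.append({"sentences": sentences, "labels": doc_labels})
    return processed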
5.2 Topic Transition Modeling
First, we qualitatively demonstrate the topical structure identified by strTM from the Ads data¹. We trained strTM with 11 content topics on the Ads data set, used the word distribution under each class (estimated by the maximum likelihood estimator on document-level labels) as priors to initialize the emission probabilities Mul(β_z) in Eq (6), and treated the document-level labels as the prior for the transition from T-START in each document, so that the mined topics can be aligned with the predefined class labels. Figure 2 shows the identified topics and the transitions among them. To get a clearer view, we discarded the transitions below a threshold of 0.1 and removed all the isolated nodes.
From Figure 2, we can find some interesting topical structures. For example, people usually start with "size", "features" and "address", and end with "contact" information when they post an apart-

¹ Due to the page limit, we only show the result in the Ads data set.
