
Structural Topic Model for Latent Topical Structure Analysis

Hongning Wang, Duo Zhang, ChengXiang Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801, USA
{wang296, dzhang22, czhai}@cs.uiuc.edu
Abstract
Topic models have been successfully applied to many document analysis tasks to discover topics embedded in text. However, existing topic models generally cannot capture the latent topical structures in documents. Since languages are intrinsically cohesive and coherent, modeling and discovering latent topical transition structures within documents would be beneficial for many text analysis tasks. In this work, we propose a new topic model, Structural Topic Model, which simultaneously discovers topics and reveals the latent topical structures in text through explicitly modeling topical transitions with a latent first-order Markov chain. Experiment results show that the proposed Structural Topic Model can effectively discover topical structures in text, and the identified structures significantly improve the performance of tasks such as sentence annotation and sentence ordering.
1 Introduction
A great amount of effort has recently been made in applying statistical topic models (Hofmann, 1999; Blei et al., 2003) to explore word co-occurrence patterns, i.e., topics, embedded in documents. Topic models have become important building blocks of many interesting applications (e.g., (Blei and Jordan, 2003; Blei and Lafferty, 2007; Mei et al., 2007; Lu and Zhai, 2008)).
In general, topic models can discover word clustering patterns in documents and project each document to a latent topic space formed by such word clusters. However, the topical structure in a document, i.e., the internal dependency between the topics, is generally not captured due to the exchangeability assumption (Blei et al., 2003), i.e., the document generation probabilities are invariant to content permutation. In reality, natural language text rarely consists of isolated, unrelated sentences, but rather collocated, structured and coherent groups of sentences (Hovy, 1993). Ignoring such latent topical structures inside documents means wasting valuable clues about topics and thus leads to non-optimal topic modeling.
Taking apartment rental advertisements as an example, when people write advertisements for their apartments, it is natural to first introduce "size" and "address" of the apartment, and then "rent" and "contact". Few people would talk about "restriction" first. If this kind of topical structure is captured by a topic model, it would not only improve the topic mining results, but, more importantly, also help many other document analysis tasks, such as sentence annotation and sentence ordering.

Nevertheless, very few existing topic models have attempted to model such structural dependency among topics. The Aspect HMM model introduced in (Blei and Moreno, 2001) combines pLSA (Hofmann, 1999) with HMM (Rabiner, 1989) to perform document segmentation over text streams. However, Aspect HMM separately estimates the topics in the training set and depends on heuristics to infer the transitional relations between topics. The Hidden Topic Markov Model (HTMM) proposed by (Gruber et al., 2007) extends the traditional topic models by assuming words in each sentence share the same topic assignment, and topics transit between adjacent sentences. However, the transitional structures among topics, i.e., how likely one topic would follow another topic, are not captured in this model.
In this paper, we propose a new topic model, named Structural Topic Model (strTM), to model and analyze both latent topics and topical structures in text documents. To do so, strTM assumes: 1) words in a document are either drawn from a content topic or a functional (i.e., background) topic; 2) words in the same sentence share the same content topic; and 3) content topics in adjacent sentences follow a topic transition that satisfies the first-order Markov property. The first assumption distinguishes the semantics of the occurrence of each word in the document, the second requirement confines the unrealistic "bag-of-words" assumption to a tighter unit, and the third assumption exploits the connection between adjacent sentences.
To evaluate the usefulness of the topical structures identified by strTM, we applied strTM to the tasks of sentence annotation and sentence ordering, where correctly modeling the document structure is crucial. On the corpus of 8,031 apartment advertisements from craigslist (Grenager et al., 2005) and 1,991 movie reviews from IMDB (Zhuang et al., 2006), strTM achieved encouraging improvements in both tasks compared with baseline methods that do not explicitly model the topical structure. The results confirm the necessity of modeling the latent topical structures inside documents, and also demonstrate the advantages of the proposed strTM over existing topic models.
2 Related Work
Topic models have been successfully applied to many problems, e.g., sentiment analysis (Mei et al., 2007), document summarization (Lu and Zhai, 2008) and image annotation (Blei and Jordan, 2003). However, in most existing work, the dependency among the topics is loosely governed by the prior topic distribution, e.g., a Dirichlet distribution.

Some work has attempted to capture the interrelationship among the latent topics. Correlated Topic Model (Blei and Lafferty, 2007) replaces the Dirichlet prior with a logistic Normal prior for the topic distribution in each document in order to capture the correlation between the topics. HMM-LDA (Griffiths et al., 2005) distinguishes the short-range syntactic dependencies from long-range semantic dependencies among the words in each document. But in HMM-LDA, only the latent variables for the syntactic classes are treated as a locally dependent sequence, while latent topics are treated the same as in other topic models. Chen et al. introduced the generalized Mallows model to constrain the latent topic assignments (Chen et al., 2009). In their model, they assume there exists a canonical order among the topics in the collection of related documents, and the same topics are forced not to appear in disconnected portions of the topic sequence in one document (sampling without replacement). Our method relaxes this assumption by only postulating transitional dependency between topics in adjacent sentences (sampling with replacement) and thus potentially allows a topic to appear multiple times in disconnected segments. As discussed in the previous section, HTMM (Gruber et al., 2007) is the model most similar to ours. HTMM models the document structure by assuming words in the same sentence share the same topic assignment and successive sentences are more likely to share the same topic. However, HTMM only loosely models the transition between topics as a binary relation: the same as the previous sentence's assignment, or a new one drawn with a certain probability. This simplified, coarse modeling of dependency cannot fully capture the complex structure across different documents. In contrast, our strTM model explicitly captures the regular topic transitions by postulating the first-order Markov property over the topics.

Another line of related work is discourse analysis in natural language processing: discourse segmentation (Sun et al., 2007; Galley et al., 2003) splits a document into a linear sequence of multi-paragraph passages, where lexical cohesion is used to link together the textual units; discourse parsing (Soricut and Marcu, 2003; Marcu, 1998) tries to uncover a more sophisticated hierarchical coherence structure from text to represent the entire discourse. One work in this line that shares a similar goal as ours is the content models (Barzilay and Lee, 2004), where an HMM is defined over text spans to perform information ordering and extractive summarization. A deficiency of the content models is that the identification of clusters of text spans is done separately from transition modeling. Our strTM addresses this deficiency by defining a generative process to simultaneously capture the topics and the transitional relationship among topics, allowing topic modeling and transition modeling to reinforce each other in a principled framework.
3 Structural Topic Model
In this section, we formally define the Structural Topic Model (strTM) and discuss how it captures the latent topics and topical structures within documents simultaneously. From the theory of linguistic analysis (Kamp, 1981), we know that a document exhibits internal structures, where structural segments encapsulate semantic units that are closely related. In strTM, we treat a sentence as the basic structural unit, and assume all the words in a sentence share the same topical aspect. Besides, two adjacent segments are assumed to be highly related (capturing cohesion in text); specifically, in strTM we pose a strong transitional dependency assumption among the topics: the choice of topic for each sentence directly depends on the previous sentence's topic assignment, i.e., the first-order Markov property. Moreover, taking the insight from HMM-LDA that not all the words are content conveying (some of them may just be a result of syntactic requirements), we introduce a dummy functional topic z_B for every sentence in the document. We use this functional topic to capture the document-independent word distribution, i.e., the corpus background (Zhai et al., 2004). As a result, in strTM, every sentence is treated as a mixture of content and functional topics.
Formally, we assume a corpus consists of D documents with a vocabulary of size V, and there are k content topics embedded in the corpus. In a given document d, there are m sentences and each sentence i has N_i words. We assume the topic transition probability p(z|z′) is drawn from a Multinomial distribution Mul(α_z′), and the word emission probability under each topic p(w|z) is drawn from a Multinomial distribution Mul(β_z).

To get a unified description of the generation process, we add another dummy topic T-START in strTM, which is the initial topic with position "-1" for every document but does not emit any words. In addition, since our functional topic is assumed to occur in all the sentences, we don't need to model its transition with the other content topics. We use a Binomial variable π to control the proportion between content and functional topics in each sentence. Therefore, there are k+1 topic transition distributions, one for T-START and k for the content topics; and k emission probabilities for the content topics, with an additional one for the functional topic z_B (in total k+1 emission probability distributions). Conditioned on the model parameters Θ = (α, β, π), the generative process of a document in strTM can be described as follows:
1. For each sentence s_i in document d:

   (a) Draw topic z_i from a Multinomial distribution conditioned on the previous sentence s_{i-1}'s topic assignment z_{i-1}:

       $z_i \sim Mul(\alpha_{z_{i-1}})$

   (b) Draw each word w_ij in sentence s_i from the mixture of content topic z_i and functional topic z_B:

       $w_{ij} \sim \pi\, p(w_{ij} \mid \beta, z_i) + (1-\pi)\, p(w_{ij} \mid \beta, z_B)$

The joint probability of sentences and topics in one document defined by strTM is thus given by:

$$p(S_0, S_1, \ldots, S_m, z \mid \alpha, \beta, \pi) = \prod_{i=1}^{m} p(z_i \mid \alpha, z_{i-1})\, p(S_i \mid z_i) \quad (1)$$

where the topic-to-sentence emission probability is defined as:

$$p(S_i \mid z_i) = \prod_{j=0}^{N_i} \big[ \pi\, p(w_{ij} \mid \beta, z_i) + (1-\pi)\, p(w_{ij} \mid \beta, z_B) \big] \quad (2)$$

This process is graphically illustrated in Figure 1.
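To make this generative story concrete, the following is a minimal sketch (not the authors' code) of sampling one document under strTM; the topic count k, the transition matrix alpha, the emission matrix beta, and the mixing weight pi below are hypothetical placeholders rather than estimated parameters.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model parameters (placeholders, not learned from data):
k, V = 4, 1000                                   # number of content topics, vocabulary size
T_START, Z_B = k, k                              # row index of T-START in alpha, of the functional topic in beta
alpha = rng.dirichlet(np.ones(k), size=k + 1)    # rows 0..k-1: content topics, row k: T-START; each row is p(next topic | previous topic)
beta = rng.dirichlet(np.ones(V), size=k + 1)     # rows 0..k-1: content topics, row k: functional topic z_B; each row is p(word | topic)
pi = 0.7                                         # proportion of content-topic words per sentence

def generate_document(num_sentences=5, words_per_sentence=10):
    """Sample one document following the strTM generative process (Section 3)."""
    sentences, topics = [], []
    prev_topic = T_START
    for _ in range(num_sentences):
        # (a) draw the sentence topic from the transition distribution of the previous topic
        z = rng.choice(k, p=alpha[prev_topic])
        # (b) draw each word from the mixture of content topic z and functional topic z_B
        words = [rng.choice(V, p=beta[z if rng.random() < pi else Z_B])
                 for _ in range(words_per_sentence)]
        sentences.append(words)
        topics.append(z)
        prev_topic = z
    return sentences, topics

doc_words, doc_topics = generate_document()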
[Figure 1: Graphical representation of strTM.]

From the definition of strTM, we can see that the document structure is characterized by a document-specific topic chain, and forcing the words in one sentence to share the same content topic ensures semantic cohesion of the mined topics. Although we do not directly model the topic mixture for each document as traditional topic models do, the word co-occurrence patterns within the same document are captured by topic propagation through the transitions. This can be easily understood when we write down the posterior probability of the topic assignment for a particular sentence:
$$
\begin{aligned}
p(z_i \mid S_0, S_1, \ldots, S_m, \Theta)
&= \frac{p(S_0, S_1, \ldots, S_m \mid z_i, \Theta)\, p(z_i)}{p(S_0, S_1, \ldots, S_m)} \\
&\propto p(S_0, \ldots, S_i, z_i) \times p(S_{i+1}, S_{i+2}, \ldots, S_m \mid z_i) \\
&= \Big[\sum_{z_{i-1}} p(S_0, \ldots, S_{i-1}, z_{i-1})\, p(z_i \mid z_{i-1})\, p(S_i \mid z_i)\Big] \times \Big[\sum_{z_{i+1}} p(S_{i+1}, \ldots, S_m \mid z_{i+1})\, p(z_{i+1} \mid z_i)\Big] \quad (3)
\end{aligned}
$$
The first part of Eq (3) describes the recursive influence on the choice of topic for the i-th sentence from its preceding sentences, while the second part captures how the succeeding sentences affect the current topic assignment. Intuitively, when we need to decide a sentence's topic, we will look "backward" and "forward" over all the sentences in the document to determine a "suitable" one. In addition, because of the first-order Markov property, the local topical dependency receives more emphasis, i.e., adjacent sentences interact directly through the transition probabilities p(z_i|z_{i-1}) and p(z_{i+1}|z_i), and such interaction between sentences farther away gets damped by the multiplication of these probabilities. This result is reasonable, especially in a long document, since neighboring sentences are more likely to cover similar topics than two sentences far apart.
4 Posterior Inference and Parameter Estimation
The chain structure in strTM enables us to perform exact inference: the posterior distribution can be efficiently calculated by the forward-backward algorithm, the optimal topic sequence can be inferred using the Viterbi algorithm, and parameter estimation can be solved by the Expectation Maximization (EM) algorithm. More technical details can be found in (Rabiner, 1989). In this section, we only discuss strTM-specific procedures.
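As a rough illustration of how the chain structure admits exact inference, the following is a minimal forward-backward sketch over sentence-level emission likelihoods; the inputs (a per-sentence emission matrix emit[i, z] = p(S_i | z) computed from Eq (2), a transition matrix, and a start distribution) are assumed to be precomputed, and the function name is hypothetical.

import numpy as np

def forward_backward(emit, trans, start):
    """Posterior p(z_i = z | S_0, ..., S_m) for one document's sentence chain.

    emit[i, z]  : p(S_i | z), the sentence emission probability of Eq (2)
    trans[u, v] : p(z = v | z' = u), the topic transition probability
    start[z]    : p(z | T-START), the initial topic distribution
    """
    m, k = emit.shape
    fwd = np.zeros((m, k))                         # fwd[i, z] ~ p(S_0..S_i, z_i = z)
    bwd = np.ones((m, k))                          # bwd[i, z] ~ p(S_{i+1}..S_m | z_i = z)
    fwd[0] = start * emit[0]
    for i in range(1, m):                          # forward recursion
        fwd[i] = (fwd[i - 1] @ trans) * emit[i]
    for i in range(m - 2, -1, -1):                 # backward recursion
        bwd[i] = trans @ (emit[i + 1] * bwd[i + 1])
    post = fwd * bwd
    return post / post.sum(axis=1, keepdims=True)  # normalize per sentence

In practice the forward and backward messages would be rescaled (or kept in log space) to avoid numerical underflow on long documents, as described in (Rabiner, 1989).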
In the E-Step of the EM algorithm, we need to collect the expected count of a sequential topic pair (z, z′) and a topic-word pair (z, w) to update the model parameters α and β in the M-Step. In strTM, E[c(z, z′)] can be easily calculated by the forward-backward algorithm. But we have to go one step further to fetch the required sufficient statistics for E[c(z, w)], because our emission probabilities are defined over sentences.
Through the forward-backward algorithm, we can get the posterior probability p(s_i, z|d, Θ). In strTM, words in one sentence are independently drawn from either a specific content topic z or the functional topic z_B according to the mixture weight π. Therefore, we can accumulate the expected count of (z, w) over all the sentences by:
$$E[c(z, w)] = \sum_{d}\sum_{s \in d} \frac{\pi\, p(w \mid z)\, p(s, z \mid d, \Theta)\, c(w, s)}{\pi\, p(w \mid z) + (1-\pi)\, p(w \mid z_B)} \quad (4)$$

where c(w, s) indicates the frequency of word w in sentence s.
Eq (4) can be explained as follows. Since we already observe topic z and sentence s co-occurring with probability p(s, z|d, Θ), each word w in s should share the same probability of being observed with content topic z. Thus the expected count of c(z, w) contributed by this sentence would be p(s, z|d, Θ) c(w, s). However, since each sentence is also associated with the functional topic z_B, the word w may also be drawn from z_B. By applying Bayes' rule, we can properly reallocate the expected count of c(z, w) as in Eq (4). The same strategy can be applied to obtain E[c(z_B, w)].
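The reallocation in Eq (4) can be sketched as follows; here sentence_topic_post stands for the p(s, z|d, Θ) values obtained from the forward-backward pass, and the data layout and function name are hypothetical.

import numpy as np

def expected_word_counts(docs, sentence_topic_post, p_w_given_z, p_w_given_zB, pi):
    """Accumulate E[c(z, w)] of Eq (4) and, by the same reallocation, E[c(z_B, w)].

    docs[d][s]               : list of (word_id, frequency) pairs for sentence s of document d
    sentence_topic_post[d][s]: length-k array with p(s, z | d, Theta) from forward-backward
    p_w_given_z[z, w]        : current content-topic emission probabilities
    p_w_given_zB[w]          : current functional-topic emission probabilities
    """
    k, V = p_w_given_z.shape
    ec_zw = np.zeros((k, V))                       # expected counts for the content topics
    ec_zBw = np.zeros(V)                           # expected counts for the functional topic
    for d, doc in enumerate(docs):
        for s, sentence in enumerate(doc):
            post = sentence_topic_post[d][s]       # p(s, z | d, Theta) for every content topic z
            for w, c_ws in sentence:
                num = pi * p_w_given_z[:, w]                        # pi * p(w|z) for every z
                share = num / (num + (1.0 - pi) * p_w_given_zB[w])  # Bayes reallocation of Eq (4)
                ec_zw[:, w] += share * post * c_ws
                ec_zBw[w] += np.sum((1.0 - share) * post) * c_ws
    return ec_zw, ec_zBw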
As discussed in (Johnson, 2007), to avoid the problem that the EM algorithm tends to assign a uniform word/state distribution to each hidden state, which deviates from the heavily skewed word/state distributions observed empirically, we can apply a Bayesian estimation approach for strTM. Thus we introduce prior distributions over the topic transition Mul(α_z′) and emission probabilities Mul(β_z), and use the Variational Bayesian (VB) estimator (Jordan et al., 1999) to obtain a model with more skewed word/state distributions.
Since both the topic transition and emission probabilities are Multinomial distributions in strTM, the conjugate Dirichlet distribution is the natural choice for imposing a prior on them (Diaconis and Ylvisaker, 1979). Thus, we further assume:

$$\alpha_z \sim Dir(\eta) \quad (5)$$
$$\beta_z \sim Dir(\gamma) \quad (6)$$

where we use exchangeable Dirichlet distributions to control the sparsity of α_z and β_z. As η and γ approach zero, the prior strongly favors models in which each hidden state emits as few words/states as possible. In our experiments, we empirically tuned η and γ on each training corpus to optimize the log-likelihood.
The resulting VB estimation only requires a minor modification to the M-Step of the original EM algorithm:

$$\bar{\alpha}_z = \frac{\Phi(E[c(z', z)] + \eta)}{\Phi(E[c(z)] + k\eta)} \quad (7)$$

$$\bar{\beta}_z = \frac{\Phi(E[c(w, z)] + \gamma)}{\Phi(E[c(z)] + V\gamma)} \quad (8)$$

where Φ(x) is the exponential of the first derivative of the log-gamma function, i.e., Φ(x) = exp(ψ(x)) with ψ the digamma function.
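A minimal sketch of this VB M-Step follows, assuming the expected counts from the E-Step have already been accumulated; digamma comes from scipy.special, and the count-matrix layout (transitions normalized over the counts of the conditioning topic) is an assumption rather than a detail stated in the paper.

import numpy as np
from scipy.special import digamma

def vb_m_step(ec_trans, ec_zw, eta, gamma):
    """Variational Bayesian updates of Eqs (7)-(8).

    ec_trans[z_prev, z]: expected transition counts E[c(z', z)] (rows: T-START and content topics)
    ec_zw[z, w]        : expected emission counts  E[c(z, w)]
    """
    k = ec_trans.shape[1]
    V = ec_zw.shape[1]

    def phi(x):
        # Phi(x) = exp(psi(x)), the exponentiated digamma function
        return np.exp(digamma(x))

    alpha = phi(ec_trans + eta) / phi(ec_trans.sum(axis=1, keepdims=True) + k * eta)
    beta = phi(ec_zw + gamma) / phi(ec_zw.sum(axis=1, keepdims=True) + V * gamma)
    return alpha, beta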
The optimal setting of π for the proportion of content topics in the documents is empirically tuned by cross-validation over the training corpus to maximize the log-likelihood.
5 Experimental Results
In this section, we demonstrate the effectiveness of strTM in identifying latent topical structures from documents, and quantitatively evaluate how the mined topic transitions can help the tasks of sentence annotation and sentence ordering.
5.1 Data Set
We used two different data sets for evaluation: apartment advertisements (Ads) from (Grenager et al., 2005) and movie reviews (Review) from (Zhuang et al., 2006).
The Ads data consists of 8,767 advertisements for apartment rentals crawled from the Craigslist website. 302 of them have been labeled with 11 fields, including size, feature, address, etc., at the sentence level.
The Review data contains 2,000 movie reviews discussing 11 different movies from IMDB. These reviews are manually labeled with 12 movie feature labels, e.g., VP (vision effects), MS (music and sound effects), etc., also at the sentence level (we did not use the additional opinion annotations in this data set). However, the annotations in the Review data set are much sparser than those in the Ads data set (see Table 1). The sentence-level annotations make it possible to quantitatively evaluate the discovered topic structures.
We performed simple preprocessing on these two data sets: 1) removed a standard list of stop words and terms occurring in fewer than 2 documents; 2) discarded documents with fewer than 2 sentences; 3) aggregated the sentence-level annotations into document-level labels (a binary vector) for each document. Table 1 gives a brief summary of the two data sets after this processing.
                        Ads       Review
Document Size           8,031     1,991
Vocabulary Size         21,993    14,507
Avg Stn/Doc             8.0       13.9
Avg Labeled Stn/Doc     7.1*      5.1
Avg Token/Stn           14.1      20.0
*Only in the 302 labeled ads

Table 1: Summary of the evaluation data sets
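As a rough illustration, the sketch below applies the three preprocessing steps described above to a toy document collection; the tokenization, stop-word list, and label-aggregation details are assumptions rather than the authors' exact pipeline.

from collections import Counter

def preprocess(docs, stop_words, min_doc_freq=2, min_sentences=2):
    """docs: list of documents; each document is a list of (sentence_tokens, sentence_label) pairs."""
    # 1) document frequency of each term, used to drop stop words and rare terms
    df = Counter()
    for doc in docs:
        df.update({t for sent, _ in doc for t in sent})

    def keep(t):
        return t not in stop_words and df[t] >= min_doc_freq

    processed = []
    for doc in docs:
        sentences = [([t for t in sent if keep(t)], label) for sent, label in doc]
        sentences = [(sent, label) for sent, label in sentences if sent]
        # 2) discard documents with fewer than min_sentences sentences
        if len(sentences) < min_sentences:
            continue
        # 3) aggregate sentence-level annotations into a document-level label set (binary vector)
        doc_labels = sorted({label for _, label in sentences if label is not None})
        processed.append({"sentences": sentences, "labels": doc_labels})
    return processed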
5.2 Topic Transition Modeling
First, we qualitatively demonstrate the topical structure identified by strTM from the Ads data¹. We trained strTM with 11 content topics on the Ads data set, used the word distribution under each class (estimated by the maximum likelihood estimator on document-level labels) as priors to initialize the emission probabilities Mul(β_z) in Eq (6), and treated the document-level labels as the prior for the transition from T-START in each document, so that the mined topics can be aligned with the predefined class labels. Figure 2 shows the identified topics and the transitions among them. To get a clearer view, we discarded the transitions below a threshold of 0.1 and removed all the isolated nodes.
From Figure 2, we can find some interesting topical structures. For example, people usually start with "size", "features" and "address", and end with "contact" information when they post an apart-

¹ Due to the page limit, we only show the result in the Ads data set.
