Innovation Training (10): Extractive Text Summarization with BERT Clustering

1. Approach
BERT serves as the pretrained model, and the embeddings it produces feed the downstream task. In the underlying paper, k-means computes the centroids of the embedding distribution, and the sentences closest to those centroids become the candidate summary sentences, so the whole method can be viewed as a form of clustering.
2. Code Analysis
The library is built on the PyTorch-based Transformers framework. It uses a pretrained BERT model (or another pretrained model) to generate embeddings, then clusters them with k-means or an expectation-maximization (Gaussian mixture) algorithm.
2.1 Basic Usage
First, let's test the example given in the README:
from summarizer import Summarizer

body = 'Text body that you want to summarize with BERT'
body2 = 'Something else you want to summarize with BERT'
model = Summarizer()
model(body)
model(body2)
Swapping in a longer text, the results are reasonably good.
Test text:
The Chrysler Building, the famous art deco New York skyscraper, will be sold for a small fraction of its previous sales price.
The deal, first reported by The Real Deal, was for $150 million, according to a source familiar with the deal.
Mubadala, an Abu Dhabi investment fund, purchased 90% of the building for $800 million in 2008.
Real estate firm Tishman Speyer had owned the other 10%.
The buyer is RFR Holding, a New York real estate company.
Officials with Tishman and RFR did not immediately respond to a request for comments.
It’s unclear when the deal will close.
The building sold fairly quickly after being publicly placed on the market only two months ago.
The sale was handled by CBRE Group.
The incentive to sell the building at such a huge loss was due to the soaring rent the owners pay to Cooper Union, a New York college, for the land under the building.
The rent is rising from $7.75 million last year to $32.5 million this year to $41 million in 2028.
Meantime, rents in the building itself are not rising nearly that fast.
While the building is an iconic landmark in the New York skyline, it is competing against newer office towers with large floor-to-ceiling windows and all the modern amenities.
Still the building is among the best known in the city, even to people who have never been to New York.
It is famous for its triangle-shaped, vaulted windows worked into the stylized crown, along with its distinctive eagle gargoyles near the top.
It has been featured prominently in many films, including Men in Black 3, Spider-Man, Armageddon, Two Weeks Notice and Independence Day.
The previous sale took place just before the 2008 financial meltdown led to a plunge in real estate prices.
Still there have been a number of high profile skyscrapers purchased for top dollar in recent years, including the Waldorf Astoria hotel, which Chinese firm Anbang Insurance purchased in 2016 for nearly $2 billion, and the Willis Tower in Chicago, which was formerly known as Sears Tower, once the world’s tallest.
Blackstone Group (BX) bought it for $1.3 billion in 2015.
The Chrysler Building was the headquarters of the American automaker until 1953, but it was named for and owned by Chrysler chief Walter Chrysler, not the company itself.
Walter Chrysler had set out to build the tallest building in the world, a competition at that time with another Manhattan skyscraper under construction at 40 Wall Street at the south end of Manhattan. He kept secret the plans for the spire that would grace the top of the building, building it inside the structure and out of view of the public until 40 Wall Street was complete.
Once the competitor could rise no higher, the spire of the Chrysler building was raised into view, giving it the title.
Result:
The Chrysler Building, the famous art deco New York skyscraper, will be sold for a small fraction of its previous sales price. The deal, first reported by The Real Deal, was for $150 million, according to a source familiar with the deal. The building sold fairly quickly after being publicly placed on the market only two months ago. The incentive to sell the building at such a huge loss was due to the soaring rent the owners pay to Cooper Union, a New York college, for the land under the building.
2.2 Analysis
Let's walk through the implementation.
2.2.1 The Summarizer class
First, the Summarizer class:
class Summarizer(SingleModel):

    def __init__(
        self,
        model: str = 'bert-large-uncased',
        custom_model: PreTrainedModel = None,
        custom_tokenizer: PreTrainedTokenizer = None,
        hidden: int = -2,
        reduce_option: str = 'mean',
        sentence_handler: SentenceHandler = SentenceHandler(),
        random_state: int = 12345
    ):
        """
        This is the main Bert Summarizer class.

        :param model: This parameter is associated with the inherit string parameters from the transformers library.
        :param custom_model: If you have a pre-trained model, you can add the model class here.
        :param custom_tokenizer: If you have a custom tokenizer, you can add the tokenizer here.
        :param hidden: This signifies which layer of the BERT model you would like to use as embeddings.
        :param reduce_option: Given the output of the bert model, this param determines how you want to reduce results.
        :param greedyness: associated with the neuralcoref library. Determines how greedy coref should be.
        :param language: Which language to use for training.
        :param random_state: The random state to reproduce summarizations.
        """
        super(Summarizer, self).__init__(
            model, custom_model, custom_tokenizer, hidden, reduce_option, sentence_handler, random_state
        )
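The `hidden` and `reduce_option` parameters control which BERT layer supplies the token vectors and how those vectors are collapsed into a single sentence embedding. A minimal numpy sketch of that reduction step (the `[num_tokens, hidden_size]` shape and the `reduce_tokens` helper are my own illustration, not the library's code; real BERT-large uses hidden_size=1024):

```python
import numpy as np

# Fake hidden states for one sentence: 6 tokens, 8-dim vectors
# (8 dimensions keep the sketch readable).
rng = np.random.default_rng(0)
token_vectors = rng.normal(size=(6, 8))

def reduce_tokens(hidden_states: np.ndarray, reduce_option: str = "mean") -> np.ndarray:
    """Collapse per-token vectors into one fixed-size sentence embedding."""
    if reduce_option == "max":
        return hidden_states.max(axis=0)
    if reduce_option == "median":
        return np.median(hidden_states, axis=0)
    return hidden_states.mean(axis=0)  # default: 'mean'

sentence_embedding = reduce_tokens(token_vectors, "mean")
print(sentence_embedding.shape)  # (8,)
```

Whatever the reduction, every sentence ends up as one vector of the same dimension, which is what the clustering step below operates on.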
It inherits from the SingleModel class:
2.2.2 The SingleModel class
class SingleModel(ModelProcessor):
    """
    Deprecated for naming sake.
    """

    def __init__(
        self,
        model='bert-large-uncased',
        custom_model: PreTrainedModel = None,
        custom_tokenizer: PreTrainedTokenizer = None,
        hidden: int = -2,
        reduce_option: str = 'mean',
        sentence_handler: SentenceHandler = SentenceHandler(),
        random_state: int = 12345
    ):
        super(SingleModel, self).__init__(
            model=model, custom_model=custom_model, custom_tokenizer=custom_tokenizer,
            hidden=hidden, reduce_option=reduce_option,
            sentence_handler=sentence_handler, random_state=random_state
        )

    def run_clusters(self, content: List[str], ratio=0.2, algorithm='kmeans', use_first: bool = True) -> List[str]:
        hidden = self.model(content, self.hidden, self.reduce_option)
        hidden_args = ClusterFeatures(hidden, algorithm, random_state=self.random_state).cluster(ratio)

        if use_first:
            if hidden_args[0] != 0:
                hidden_args.insert(0, 0)

        return [content[j] for j in hidden_args]
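The `use_first` flag forces the document's first sentence into the summary: if index 0 is not already among the chosen cluster representatives, it is prepended. A tiny standalone illustration of that bookkeeping (the `ensure_first` name is my own; it mirrors the logic of `run_clusters`, with an extra guard for an empty list):

```python
def ensure_first(hidden_args: list, use_first: bool = True) -> list:
    """Prepend sentence index 0 when use_first is set, mirroring run_clusters."""
    if use_first and (not hidden_args or hidden_args[0] != 0):
        hidden_args.insert(0, 0)
    return hidden_args

print(ensure_first([2, 5]))                  # [0, 2, 5]
print(ensure_first([0, 3]))                  # [0, 3]
print(ensure_first([4], use_first=False))    # [4]
```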
The SingleModel class inherits from ModelProcessor and implements the run_clusters method, which in turn delegates the clustering to the ClusterFeatures class.
2.2.3 The ClusterFeatures class
class ClusterFeatures(object):
    """
    Basic handling of clustering features.
    """

    def __init__(
        self,
        features: ndarray,
        algorithm: str = 'kmeans',
        pca_k: int = None,
        random_state: int = 12345
    ):
        """
        :param features: the embedding matrix created by bert parent
        :param algorithm: Which clustering algorithm to use
        :param pca_k: If you want the features to be ran through pca, this is the components number
        :param random_state: Random state
        """
        if pca_k:
            self.features = PCA(n_components=pca_k).fit_transform(features)
        else:
            self.features = features

        self.algorithm = algorithm
        self.pca_k = pca_k
        self.random_state = random_state

    def __get_model(self, k: int):
        """
        Retrieve clustering model

        :param k: amount of clusters
        :return: Clustering model
        """
        if self.algorithm == 'gmm':
            return GaussianMixture(n_components=k, random_state=self.random_state)
        return KMeans(n_clusters=k, random_state=self.random_state)

    def __get_centroids(self, model):
        """
        Retrieve centroids of model

        :param model: Clustering model
        :return: Centroids
        """
        if self.algorithm == 'gmm':
            return model.means_
        return model.cluster_centers_

    def __find_closest_args(self, centroids: np.ndarray):
        """
        Find the closest arguments to centroid

        :param centroids: Centroids to find closest
        :return: Closest arguments
        """
        centroid_min = 1e10
        cur_arg = -1
        args = {}
        used_idx = []

        for j, centroid in enumerate(centroids):
            for i, feature in enumerate(self.features):
                value = np.linalg.norm(feature - centroid)

                if value < centroid_min and i not in used_idx:
                    cur_arg = i
                    centroid_min = value

            used_idx.append(cur_arg)
            args[j] = cur_arg
            centroid_min = 1e10
            cur_arg = -1

        return args

    def cluster(self, ratio: float = 0.1) -> List[int]:
        """
        Clusters sentences based on the ratio

        :param ratio: Ratio to use for clustering
        :return: Sentences index that qualify for summary
        """
        k = 1 if ratio * len(self.features) < 1 else int(len(self.features) * ratio)
        model = self.__get_model(k).fit(self.features)
        centroids = self.__get_centroids(model)
        cluster_args = self.__find_closest_args(centroids)

        sorted_values = sorted(cluster_args.values())
        return sorted_values

    def __call__(self, ratio: float = 0.1) -> List[int]:
        return self.cluster(ratio)
The main logic lives in the cluster() method: optionally reduce the features with PCA (only when pca_k is set), cluster them with k-means or a GMM, pick the sentence closest to each centroid, and return the chosen sentence indices sorted into document order.
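The whole selection step can be reproduced on toy data with scikit-learn. A sketch under the assumption that 10 random 12-dimensional vectors stand in for real BERT sentence embeddings (the variable names are mine, but the k computation and closest-to-centroid rule follow the code above):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
features = rng.normal(size=(10, 12))  # 10 "sentences", 12-dim embeddings

# k = max(1, ratio * number of sentences), as in cluster()
ratio = 0.2
k = 1 if ratio * len(features) < 1 else int(len(features) * ratio)  # k = 2

model = KMeans(n_clusters=k, random_state=12345, n_init=10).fit(features)

# For each centroid, pick the index of the closest not-yet-used sentence.
chosen = []
for centroid in model.cluster_centers_:
    distances = np.linalg.norm(features - centroid, axis=1)
    distances[chosen] = np.inf  # don't reuse an already-selected sentence
    chosen.append(int(distances.argmin()))

summary_indices = sorted(chosen)
print(summary_indices)  # two sentence indices, in document order
```

With ratio=0.2 and 10 sentences this selects 2 representative sentences; the real library then maps these indices back to the sentence strings.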
3. Adapting It for Chinese
The model looks effective, though I suspect most of the credit goes to BERT itself. Since it works well for English, can it be applied to Chinese?
In the GitHub issues I found someone with the same idea:
The author replied that swapping in a BERT model and tokenizer that support Chinese should be enough, so I tried the Chinese model bert-base-chinese. The result: no output at all.
After some debugging, I found the sentence handling assumed English; after switching to Chinese segmentation with jieba, it worked.
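The fix boils down to replacing the English sentence handler with one that understands Chinese punctuation. A minimal regex-based splitter along those lines (the function name, the punctuation set, and the length threshold are my own choices, not the library's API):

```python
import re

def split_chinese_sentences(text: str, min_length: int = 4) -> list:
    """Split on Chinese sentence-ending punctuation, keeping the delimiter."""
    parts = re.split(r'(?<=[。!?])', text)
    return [p.strip() for p in parts if len(p.strip()) >= min_length]

sample = "国家安全立法属于国家立法权力。这不是人权问题!不应在人权理事会讨论。"
for sentence in split_chinese_sentences(sample):
    print(sentence)  # prints each of the three sentences on its own line
```

A handler like this (or jieba-based segmentation) can then feed the sentence list into the clustering pipeline in place of the default English SentenceHandler.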
Below are the test results:
Test text (kept in Chinese, since Chinese input is the point of the test; it is a Xinhua news report on a UN Human Rights Council session):
新华社⽇内⽡6⽉30⽇电 6⽉30⽇,联合国⼈权理事会第44次会议在⽇内⽡举⾏。在当天的会议上,古巴代表53个国家作共同发⾔,⽀持中国⾹港特区维护国家安全⽴法。
古巴表⽰,不⼲涉主权国家内部事务是《联合国宪章》重要原则和国际关系基本准则。国家安全⽴法属于国家⽴法权⼒,这对世界上任何国家都是如此。这不是⼈权问题,不应在⼈权理事会讨论。
古巴强调,我们认为各国都有权通过⽴法维护国家安全,赞赏基于该⽬的采取的举措。我们欢迎中国⽴法机关通过《中华⼈民共和国⾹港特别⾏政区维护国家安全法》,并重申坚持“⼀国两制”⽅针。我们认为,这⼀举措有利于“⼀国两制”⾏稳致远,有利于⾹港长期繁荣稳定,⾹港⼴⼤居民的合法权利和⾃由也可在安全环境下得到更好⾏使。
古巴表⽰,我们重申,⾹港特别⾏政区是中国不可分割的⼀部分,⾹港事务是中国内政,外界不应⼲涉。我们敦促有关⽅⾯停⽌利⽤涉港问题⼲涉中国内政。
Result:
我们欢迎中国⽴法机关通过《中华⼈民共和国⾹港特别⾏政区维护国家安全法》,并重申坚持“⼀国两制”⽅针。
The summary is decent.
[Screenshot of the GitHub issue]

Published 2024-09-21 03:25:02.

Link: https://www.17tex.com/xueshu/279176.html
