[Deep Learning] Sentiment Analysis with a Bidirectional LSTM in PyTorch

I. Introduction
Sentiment analysis, also called sentiment classification, is a subtask of natural language processing (NLP) that rose to prominence with the growth of the internet. In most cases the task decides whether a text is positive, negative, or neutral; some work goes further and grades the strength of the sentiment within each polarity.
Text sentiment analysis is a common NLP application and an interesting basic task, especially classification aimed at distilling the emotional content of a text. It is the process of analyzing, processing, summarizing, and reasoning over subjective text that carries emotional color.
This article focuses on sentiment polarity analysis: judging whether a text is positive, negative, or neutral. In most application scenarios only two classes are used; the words "love" and "hate", for example, carry opposite polarities.
We will use an LSTM model to train a classifier that labels a text as positive or negative.
Because an RNN exploits the order of the words in a sequence, it is usually more accurate than a plain feedforward network.
Network structure:
First, the words are fed into an embedding layer. We use one because the vocabulary is far too large for one-hot encoding; dense embedding vectors represent words much more efficiently. This works like word2vec, and conveniently we only need to add the embedding layer itself: the network learns the embedding matrix on its own.
See the figure below.
From the embedding layer, the new word representations are passed into LSTM cells. These form a recurrent network, so the sequence information of the words is carried through it. Finally, the LSTM cells feed a sigmoid output layer with a single unit, whose sigmoid activation predicts whether the text's sentiment is positive or negative.
We only need the sigmoid output at the last time step; the loss is computed only between that final output and the label.
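The architecture described above (embedding → bidirectional LSTM → single-unit sigmoid output, reading only the last time step) can be sketched in PyTorch as follows. This is a minimal illustration, not the post's actual training code: the class name `SentimentLSTM`, the layer sizes, and the demo vocabulary size are all assumptions.

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """Illustrative sketch: embedding -> bidirectional LSTM -> sigmoid unit."""
    def __init__(self, vocab_size, embed_dim=400, hidden_dim=256,
                 n_layers=2, bidirectional=True):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, n_layers,
                            batch_first=True, bidirectional=bidirectional)
        num_directions = 2 if bidirectional else 1
        # single output unit, as described in the text
        self.fc = nn.Linear(hidden_dim * num_directions, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        embeds = self.embedding(x)           # (batch, seq_len, embed_dim)
        lstm_out, _ = self.lstm(embeds)      # (batch, seq_len, hidden*dirs)
        last = lstm_out[:, -1, :]            # keep only the last time step
        return self.sigmoid(self.fc(last)).squeeze(1)  # (batch,)

# tiny smoke test with a made-up vocabulary of 1000 words
model = SentimentLSTM(vocab_size=1000)
batch = torch.randint(1, 1000, (4, 200))     # 4 fake reviews, length 200
probs = model(batch)
print(probs.shape)                           # torch.Size([4])
```

Each element of `probs` is a probability in [0, 1]; values above 0.5 would be read as positive.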
File description:
(1) The raw text file: 25,000 English movie reviews, one review per line.
(2) The label file: 25,000 labels, one per line, either positive or negative.
II. Model Training and Prediction
1. Data Preprocessing
The first step in building any model is always data cleaning. Because we use an embedding layer, every word must be encoded as an integer.
We remove punctuation. Reviews are separated by newline characters (\n), so we first split on \n to get the individual reviews, then join all reviews back into one large text.
import numpy as np
# read data from text files
with open('./', 'r') as f:
    reviews = f.read()
with open('./', 'r') as f:
    labels = f.read()
print(reviews[:1000])
print()
print(labels[:20])
from string import punctuation
# get rid of punctuation
reviews = reviews.lower() # lowercase, standardize
all_text = ''.join([c for c in reviews if c not in punctuation])
# split by new lines and spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)
# create a list of words
words = all_text.split()
2. Encoding the Words
The embedding lookup requires integer inputs. The simplest approach is to build a dictionary mapping {word: integer}, then convert every review word by word into integers before feeding it to the network.
# feel free to use this import
from collections import Counter
## Build a dictionary that maps words to integers
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}
## use the dict to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])
# stats about vocabulary
print('Unique words: ', len(vocab_to_int))  # should be ~74000+
print()
# print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])
A note on the enumerate function:
Passing an integer as the second argument makes the iteration indices start from that value.
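A tiny example of this, using a made-up three-word vocabulary; starting at 1 is why the most frequent word maps to 1 above, leaving 0 free for padding:

```python
# enumerate with a start value: indices begin at the given integer
vocab = ['the', 'and', 'movie']
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}
print(vocab_to_int)  # {'the': 1, 'and': 2, 'movie': 3}
```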
3. Encoding the Labels
Convert the labels "positive" / "negative" to numeric values.
# 1=positive, 0=negative label conversion
labels_split = labels.split('\n')
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])
# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))
Remove zero-length reviews:
print('Number of reviews before removing outliers: ', len(reviews_ints))
## remove any reviews/labels with zero length from the reviews_ints list.
# get indices of any reviews with length 0
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
# remove 0-length reviews and their labels
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])
print('Number of reviews after removing outliers: ', len(reviews_ints))
4. Padding Sequences
Pad every review to a uniform length of 200 words:
1. Reviews shorter than 200 words are left-padded with 0s.
2. Reviews longer than 200 words are truncated to their first 200 words.
# fix each review's length at 200
seq_len = 200
from tensorflow.keras import preprocessing
# pad_sequences left-pads with 0 by default; truncating='post' keeps the
# first seq_len words, matching the rule stated above
features = preprocessing.sequence.pad_sequences(reviews_ints, maxlen=seq_len,
                                                truncating='post')
features.shape
features.shape
Alternatively:
def pad_features(reviews_ints, seq_length):
    ''' Return features of review_ints, where each review is padded with 0's
        or truncated to the input seq_length.
    '''
    # getting the correct rows x cols shape
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)
    # for each review, left-pad (or truncate) it into its row
    for i, row in enumerate(reviews_ints):
        features[i, -len(row):] = np.array(row)[:seq_length]
    return features
# Test your implementation!
seq_length = 200
features = pad_features(reviews_ints, seq_length=seq_length)
## test statements - do not change - ##
assert len(features)==len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."
# print first 10 values of the first 30 reviews
print(features[:30, :10])
5. Train / Validation / Test Split
split_frac = 0.8
## split data into training, validation, and test data (features and labels, x and y)
split_idx = int(len(features)*split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]
test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]
## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape),
"\nValidation set: \t{}".format(val_x.shape),
"\nTest set: \t\t{}".format(test_x.shape))
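From here, a natural next step (not shown in this excerpt) is to wrap the split arrays in PyTorch `TensorDataset`s and `DataLoader`s so the model can be trained in batches. The arrays below are small fakes standing in for `train_x`/`train_y`; the batch size is an arbitrary choice for illustration.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# fake stand-ins for the real train_x / train_y from the split above
train_x = np.random.randint(0, 100, (40, 200))
train_y = np.random.randint(0, 2, 40)

# wrap the numpy arrays as tensors and batch them
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, shuffle=True, batch_size=10)

# one batch: padded word-id sequences and their 0/1 labels
inputs, labels = next(iter(train_loader))
print(inputs.shape, labels.shape)  # torch.Size([10, 200]) torch.Size([10])
```

The same wrapping would be repeated for the validation and test splits, typically without shuffling.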

Published: 2024-09-22 03:38:20