[Deep Learning] Sentiment Analysis with a Bidirectional LSTM in PyTorch

I. Introduction
Sentiment analysis, also called sentiment classification, is a subtask of natural language processing (NLP) that rose to prominence with the growth of the internet. In most cases the task decides whether a text is positive, negative, or neutral; some work goes further and grades the strength of the sentiment within each polarity.
Text sentiment analysis is a common NLP application and an interesting basic task, especially classification aimed at distilling the emotional content of a text. It is the process of analyzing, processing, summarizing, and reasoning over subjective text that carries emotional color.
This article focuses on sentiment polarity analysis: judging whether a text is positive, negative, or neutral. In most application scenarios only two classes are used; the words "love" and "hate", for example, carry opposite polarities.
We will use an LSTM model to train a classifier that labels a text as positive or negative.
Because an RNN exploits the order of the words in a sequence, it is usually more accurate than a plain feedforward network.
Network structure:
First, the words are fed into an embedding layer. We use one because the vocabulary is far too large for one-hot encoding; dense embedding vectors represent words much more efficiently. This works like word2vec, and conveniently we only need to add the embedding layer itself: the network learns the embedding matrix on its own.
See the figure below.
From the embedding layer, the new word representations are passed into LSTM cells. These form a recurrent network, so the sequence information of the words is carried through it. Finally, the LSTM cells feed a sigmoid output layer with a single unit, whose sigmoid activation predicts whether the text's sentiment is positive or negative.
We only need the sigmoid output at the last time step; the loss is computed only between that final output and the label.
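The architecture described above (embedding → bidirectional LSTM → single-unit sigmoid output, reading only the last time step) can be sketched in PyTorch as follows. This is a minimal illustration, not the post's actual training code: the class name `SentimentLSTM`, the layer sizes, and the demo vocabulary size are all assumptions.

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """Illustrative sketch: embedding -> bidirectional LSTM -> sigmoid unit."""
    def __init__(self, vocab_size, embed_dim=400, hidden_dim=256,
                 n_layers=2, bidirectional=True):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, n_layers,
                            batch_first=True, bidirectional=bidirectional)
        num_directions = 2 if bidirectional else 1
        # single output unit, as described in the text
        self.fc = nn.Linear(hidden_dim * num_directions, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        embeds = self.embedding(x)           # (batch, seq_len, embed_dim)
        lstm_out, _ = self.lstm(embeds)      # (batch, seq_len, hidden*dirs)
        last = lstm_out[:, -1, :]            # keep only the last time step
        return self.sigmoid(self.fc(last)).squeeze(1)  # (batch,)

# tiny smoke test with a made-up vocabulary of 1000 words
model = SentimentLSTM(vocab_size=1000)
batch = torch.randint(1, 1000, (4, 200))     # 4 fake reviews, length 200
probs = model(batch)
print(probs.shape)                           # torch.Size([4])
```

Each element of `probs` is a probability in [0, 1]; values above 0.5 would be read as positive.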
File description:
(1) The raw text file: 25,000 English movie reviews, one review per line.
(2) The label file: 25,000 labels, one per line, either positive or negative.
II. Model Training and Prediction
1. Data Preprocessing
The first step in building any model is always data cleaning. Because we use an embedding layer, every word must be encoded as an integer.
We remove punctuation. Reviews are separated by newline characters (\n), so we first split on \n to get the individual reviews, then join all reviews back into one large text.
import numpy as np
# read data from text files
with open('./', 'r') as f:
    reviews = f.read()
with open('./', 'r') as f:
    labels = f.read()
print(reviews[:1000])
print()
print(labels[:20])
from string import punctuation
# get rid of punctuation
reviews = reviews.lower() # lowercase, standardize
all_text = ''.join([c for c in reviews if c not in punctuation])
# split by new lines and spaces
reviews_split = all_text.split('\n')
all_text = ' '.join(reviews_split)
# create a list of words
words = all_text.split()
2. Encoding the Words
The embedding lookup requires integer inputs. The simplest approach is to build a dictionary mapping {word: integer}, then convert every review word by word into integers before feeding it to the network.
# feel free to use this import
from collections import Counter
## Build a dictionary that maps words to integers
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}
## use the dict to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_split:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])
# stats about vocabulary
print('Unique words: ', len(vocab_to_int))  # should be ~74000+
print()
# print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])
A note on the enumerate function:
Passing an integer as the second argument makes the iteration indices start from that value.
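A tiny example of this, using a made-up three-word vocabulary; starting at 1 is why the most frequent word maps to 1 above, leaving 0 free for padding:

```python
# enumerate with a start value: indices begin at the given integer
vocab = ['the', 'and', 'movie']
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}
print(vocab_to_int)  # {'the': 1, 'and': 2, 'movie': 3}
```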
3. Encoding the Labels
Convert the labels "positive" / "negative" to numeric values.
# 1=positive, 0=negative label conversion
labels_split = labels.split('\n')
encoded_labels = np.array([1 if label == 'positive' else 0 for label in labels_split])
# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))
Remove zero-length reviews:
print('Number of reviews before removing outliers: ', len(reviews_ints))
## remove any reviews/labels with zero length from the reviews_ints list.
# get indices of any reviews with length 0
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]
# remove 0-length reviews and their labels
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])
print('Number of reviews after removing outliers: ', len(reviews_ints))
4. Padding Sequences
Pad every review to a uniform length of 200 words:
1. Reviews shorter than 200 words are left-padded with 0s.
2. Reviews longer than 200 words are truncated to their first 200 words.
# fix each review's length at 200
seq_len = 200
from tensorflow.keras import preprocessing
# pad_sequences left-pads with 0 by default; truncating='post' keeps the
# first seq_len words, matching the rule stated above
features = preprocessing.sequence.pad_sequences(reviews_ints, maxlen=seq_len,
                                                truncating='post')
features.shape
features.shape
Alternatively:
def pad_features(reviews_ints, seq_length):
    ''' Return features of review_ints, where each review is padded with 0's
        or truncated to the input seq_length.
    '''
    # getting the correct rows x cols shape
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)
    # for each review, left-pad (or truncate) it into its row
    for i, row in enumerate(reviews_ints):
        features[i, -len(row):] = np.array(row)[:seq_length]
    return features
# Test your implementation!
seq_length = 200
features = pad_features(reviews_ints, seq_length=seq_length)
## test statements - do not change - ##
assert len(features)==len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0])==seq_length, "Each feature row should contain seq_length values."
# print first 10 values of the first 30 reviews
print(features[:30, :10])
5. Train / Validation / Test Split
split_frac = 0.8
## split data into training, validation, and test data (features and labels, x and y)
split_idx = int(len(features)*split_frac)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]
test_idx = int(len(remaining_x)*0.5)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]
## print out the shapes of your resultant feature data
print("\t\t\tFeature Shapes:")
print("Train set: \t\t{}".format(train_x.shape),
"\nValidation set: \t{}".format(val_x.shape),
"\nTest set: \t\t{}".format(test_x.shape))
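From here, a natural next step (not shown in this excerpt) is to wrap the split arrays in PyTorch `TensorDataset`s and `DataLoader`s so the model can be trained in batches. The arrays below are small fakes standing in for `train_x`/`train_y`; the batch size is an arbitrary choice for illustration.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# fake stand-ins for the real train_x / train_y from the split above
train_x = np.random.randint(0, 100, (40, 200))
train_y = np.random.randint(0, 2, 40)

# wrap the numpy arrays as tensors and batch them
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, shuffle=True, batch_size=10)

# one batch: padded word-id sequences and their 0/1 labels
inputs, labels = next(iter(train_loader))
print(inputs.shape, labels.shape)  # torch.Size([10, 200]) torch.Size([10])
```

The same wrapping would be repeated for the validation and test splits, typically without shuffling.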

Published: 2024-09-22 03:38:20