E2E-CER: An End-to-End Conversational Emotion Recognition Classification Model

Journal of Chinese Computer Systems (小型微型计算机系统)

Vol. 42, No. 2, February 2021


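SUN Peng, PENG Dun-lu

(School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)

E-mail: pengdl@usst.edu.cn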

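Abstract: Emotion recognition in human-machine dialogue is important for improving the efficiency of human-machine interaction. At present, emotion recognition in human-machine dialogue systems is mainly carried out in two steps: feature extraction and regression. These two steps are usually independent of each other and do not share a common objective, which makes it difficult to judge whether the extracted features are suitable emotional features. Moreover, for feature fusion, traditional methods simply concatenate the features of different modalities and ignore how much each modality influences the classification result. To address these problems, this paper proposes E2E-CER, an end-to-end conversational emotion recognition model that integrates the emotion recognition process into a single unified system. In addition, a multimodal fusion method based on the attention mechanism is introduced, which improves the model's ability to learn from context and improves dynamic feature fusion. Finally, emotion classification experiments are conducted on the public IEMOCAP dataset. The results show that, compared with conversational emotion recognition baselines, the proposed model performs clearly above the average level, indicating its effectiveness for emotion recognition.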

Key words: end-to-end; multimodal fusion; emotion recognition; memory network; attention mechanism

CLC number: TP391    Document code: A    Article ID: 1000-1220(2021)02-0235-06

E2E-CER: End-to-End Conversational Emotion Recognition Classification Model

SUN Peng, PENG Dun-lu

(School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China)

Abstract: Emotion recognition in human-machine dialogue is of great significance for improving the efficiency of human-machine interaction. Currently, emotion recognition in human-machine dialogue systems is mainly completed in two steps, feature extraction and regression. However, these two steps are usually independent of each other and their objectives are not consistent, so it is difficult to judge whether the extracted features are appropriate emotional features. In addition, for feature fusion, traditional methods simply concatenate the features of different modalities, ignoring how strongly each modality influences the classification result. To address these problems, this paper proposes E2E-CER, an end-to-end dialogue emotion recognition model that integrates the emotion recognition process into a unified system. Furthermore, we introduce a multimodal fusion method based on the attention mechanism, which improves the ability to learn context. Finally, emotion classification and recognition experiments were conducted on the public IEMOCAP dataset. The experimental results show that, compared with conversational emotion recognition baselines, the proposed model performs significantly above the average level, indicating its effectiveness in emotion recognition.

Key words: end-to-end; multimodal fusion; emotion recognition; memory network; attention mechanism

1 Introduction

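Developing chatbots with emotional intelligence has been a long-standing goal of artificial intelligence [1]. Over the past decade, the field of emotion recognition has focused on understanding the acoustic manifestations of emotion and on achieving more robust recognition of speech content [2]. However, as such systems have become widespread on mobile devices, and especially as real-time conversational software such as voice applications has come into common use, users' expectations of these systems have also risen. One important sign of this is that people expect machines to understand the emotions and intentions carried in a conversation and to respond with a degree of empathy, thereby improving the overall human-computer interaction experience.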

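However, tracking the emotional dynamics of a conversation is a considerable challenge, because the emotions of the interlocutors influence each other and complex dependencies exist between them. According to the research of Morris and Keltner, emotional dynamics in a conversation are mainly driven by two factors: self-dependency and other-dependency [3]. Self-dependency, also called emotional inertia, refers to the emotional influence that a speaker exerts on himself or herself during the conversation. Other-dependency refers to the fact that other people's emotional states also cause changes in one's own emotional state. Consequently, during a conversation both parties tend to take the other's emotional expression into account in order to build a more harmonious conversational context. A dialogue excerpt from the dataset, shown in Fig. 1, illustrates well how self-dependency and other-dependency affect emotional dynamics. However, most existing dialogue systems consider only self-dependency. Examples include the context-independent system proposed by Better, which infers emotion from the current utterance, and the approach proposed by Poria, which models the context with a long short-term memory (LSTM) network [4].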

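The E2E-CER model proposed in this paper takes both of these emotional dependencies into account. The contributions of this paper can be summarized as follows: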

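1) This paper proposes E2E-CER, an end-to-end conversational emotion recognition model that takes raw data as input and fully considers the influence of self-dependency and other-dependency on emotion detection;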

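2) For the multimodal fusion of speech and text, this paper proposes a fusion method based on the attention mechanism, using the contribution of each modality's data to the classification result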

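Received: 2020-01-24. Revised: 2020-02-20. Funding: supported by the National Natural Science Foundation of China (61772342, 61703278). About the authors: SUN Peng, male, born in 1993, master's student, research interest: natural language processing; PENG Dun-lu, male, born in 1974, Ph.D., professor, CCF member, research interests: big data management, Web data management, trajectory data compression, and natural language processing.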