Spoken language, the primary means of human communication, conveys not only verbal content but also emotional information through characteristics such as intonation, pitch, and the surrounding environment. Many current studies focus on recognizing speech and emotion from voice for human-computer interaction, developing deep neural network models that extract various frequency characteristics of speech. Representative speech-based deep learning tasks include speech recognition, namely speech-to-text or automatic speech recognition, and speech emotion recognition. Although both tasks have a long history of development, multi-output models that handle them simultaneously in parallel are rare. Because simultaneously understanding language and emotion, the two most critical kinds of information in a human voice, should significantly benefit human-computer interaction, this paper introduces a multi-output model that recognizes both speech and emotion from a single voice input. The model performs comparably to separately trained speech and emotion recognition models, achieving a word error rate of 6.59% on the speech recognition task and an average accuracy of 79.67% on the emotion recognition task.
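To make the multi-output idea concrete, the sketch below shows one plausible way to share an acoustic encoder between a frame-level speech recognition head (suitable for CTC decoding) and an utterance-level emotion classification head. The layer sizes, vocabulary size, and number of emotion classes are illustrative assumptions for this sketch, not the authors' exact architecture.

```python
# Minimal sketch of a multi-output speech model: one shared encoder, two heads.
# All dimensions here are assumptions chosen for illustration only.
import torch
import torch.nn as nn

class MultiOutputSpeechModel(nn.Module):
    def __init__(self, n_mels=80, vocab_size=30, n_emotions=7, hidden=256):
        super().__init__()
        # Shared encoder over log-mel spectrogram frames
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # Speech recognition head: per-frame character logits (+1 for the CTC blank)
        self.asr_head = nn.Linear(2 * hidden, vocab_size + 1)
        # Emotion head: one utterance-level prediction from time-pooled features
        self.emotion_head = nn.Linear(2 * hidden, n_emotions)

    def forward(self, features):                      # features: (batch, time, n_mels)
        encoded, _ = self.encoder(features)           # (batch, time, 2*hidden)
        asr_logits = self.asr_head(encoded)           # (batch, time, vocab_size + 1)
        pooled = encoded.mean(dim=1)                  # average over the time axis
        emotion_logits = self.emotion_head(pooled)    # (batch, n_emotions)
        return asr_logits, emotion_logits

# Toy forward pass on random spectrogram-like input
model = MultiOutputSpeechModel()
x = torch.randn(4, 200, 80)                           # 4 utterances, 200 frames each
asr_logits, emotion_logits = model(x)
print(asr_logits.shape, emotion_logits.shape)         # (4, 200, 31) and (4, 7)
```

In a setup like this, the two heads would typically be trained jointly with a weighted sum of a CTC loss for the transcription output and a cross-entropy loss for the emotion output, so that both tasks shape the shared encoder.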