Multimodal Sensing and Data Processing for Speaker and Emotion Recognition Using Deep Learning Models with Audio, Video and Biomedical Sensors.
Record type: Bibliographic - Language material, manuscript : Monograph/item
Title/Author: Multimodal Sensing and Data Processing for Speaker and Emotion Recognition Using Deep Learning Models with Audio, Video and Biomedical Sensors.
Author: Abtahi, Farnaz.
Description: 1 online resource (111 pages)
Notes: Source: Dissertation Abstracts International, Volume: 79-07(E), Section: B.
Subject: Computer science.
Electronic resource: click for full text (PQDT)
ISBN: 9780355670059
Abtahi, Farnaz.
Multimodal Sensing and Data Processing for Speaker and Emotion Recognition Using Deep Learning Models with Audio, Video and Biomedical Sensors. - 1 online resource (111 pages)
Source: Dissertation Abstracts International, Volume: 79-07(E), Section: B.
Thesis (Ph.D.)--City University of New York, 2018.
Includes bibliographical references
The focus of the thesis is on Deep Learning methods and their applications on multimodal data, with a potential to explore the associations between modalities and replace missing and corrupt ones if necessary. We have chosen two important real-world applications that need to deal with multimodal data: 1) Speaker recognition and identification; 2) Facial expression recognition and emotion detection.
Electronic reproduction. Ann Arbor, Mich. : ProQuest, 2018
Mode of access: World Wide Web
ISBN: 9780355670059
Subjects--Topical Terms: Computer science.
Index Terms--Genre/Form: Electronic books.
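The abstract above describes fusing audio, video, and EMG features in a single deep model for speaker recognition, with the option of substituting a missing or corrupt modality. The dissertation's actual architecture is not given in this record; the following is a minimal PyTorch sketch of feature-level fusion with a per-modality mask, where the feature dimensions, layer sizes, and masking scheme are illustrative assumptions.

```python
# Minimal sketch of multimodal feature fusion for speaker identification.
# Feature dimensions, layer sizes, and the masking scheme are assumptions,
# not the architecture used in the dissertation.
import torch
import torch.nn as nn

class FusionSpeakerNet(nn.Module):
    def __init__(self, audio_dim=39, video_dim=128, emg_dim=64,
                 hidden_dim=256, num_speakers=23):
        super().__init__()
        # One small encoder per modality.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden_dim), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden_dim), nn.ReLU())
        self.emg_enc = nn.Sequential(nn.Linear(emg_dim, hidden_dim), nn.ReLU())
        # Classifier over the concatenated modality embeddings.
        self.classifier = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_speakers),
        )

    def forward(self, audio, video, emg, mask=(1.0, 1.0, 1.0)):
        # `mask` zeroes out a missing or corrupt modality, e.g. (1, 0, 1)
        # when the video stream is unavailable.
        a = self.audio_enc(audio) * mask[0]
        v = self.video_enc(video) * mask[1]
        e = self.emg_enc(emg) * mask[2]
        return self.classifier(torch.cat([a, v, e], dim=-1))

# Example: a batch of 8 utterances with the video modality missing.
model = FusionSpeakerNet()
audio = torch.randn(8, 39)   # e.g. MFCC-style audio features (assumed size)
video = torch.zeros(8, 128)  # placeholder for the missing modality
emg = torch.randn(8, 64)     # surface-EMG features (assumed size)
logits = model(audio, video, emg, mask=(1.0, 0.0, 1.0))
print(logits.shape)  # torch.Size([8, 23]) -- one score per speaker
```

The number of speakers (23) follows the dataset described in the abstract; everything else is a placeholder to show how masking lets one modality stand in for another at inference time.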
LDR    03999ntm a2200349K 4500
001    912199
005    20180608102941.5
006    m o u
007    cr mn||||a|a||
008    190606s2018 xx obm 000 0 eng d
020    $a 9780355670059
035    $a (MiAaPQ)AAI10745436
035    $a (MiAaPQ)minarees:14928
035    $a AAI10745436
040    $a MiAaPQ $b eng $c MiAaPQ
100 1  $a Abtahi, Farnaz. $3 1184449
245 10 $a Multimodal Sensing and Data Processing for Speaker and Emotion Recognition Using Deep Learning Models with Audio, Video and Biomedical Sensors.
264  0 $c 2018
300    $a 1 online resource (111 pages)
336    $a text $b txt $2 rdacontent
337    $a computer $b c $2 rdamedia
338    $a online resource $b cr $2 rdacarrier
500    $a Source: Dissertation Abstracts International, Volume: 79-07(E), Section: B.
500    $a Advisers: Zhigang Zhu; Tony Ro.
502    $a Thesis (Ph.D.)--City University of New York, 2018.
504    $a Includes bibliographical references
520    $a The focus of the thesis is on Deep Learning methods and their applications on multimodal data, with a potential to explore the associations between modalities and replace missing and corrupt ones if necessary. We have chosen two important real-world applications that need to deal with multimodal data: 1) Speaker recognition and identification; 2) Facial expression recognition and emotion detection.
520    $a The first part of our work assesses the effectiveness of speech-related sensory data modalities and their combinations in speaker recognition using deep learning models. First, the role of electromyography (EMG) is highlighted as a unique biometric sensor in improving audio-visual speaker recognition or as a substitute in noisy or poorly-lit environments. Secondly, the effectiveness of deep learning is empirically confirmed through its higher robustness to all types of features in comparison to a number of commonly used baseline classifiers. Not only do deep models outperform the baseline methods, their power increases when they integrate multiple modalities, as different modalities contain information on different aspects of the data, especially between EMG and audio. Interestingly, our deep learning approach is word-independent. Plus, the EMG, audio, and visual parts of the samples from each speaker do not need to match. This increases the flexibility of our method in using multimodal data, particularly if one or more modalities are missing. With a dataset of 23 individuals speaking 22 words five times, we show that EMG can replace the audio/visual modalities, and when combined, significantly improve the accuracy of speaker recognition.
520    $a The second part describes a study on automated emotion recognition using four different modalities---audio, video, electromyography (EMG), and electroencephalography (EEG). We collected a dataset by recording the 4 modalities as 12 human subjects expressed six different emotions or maintained a neutral expression. Three different aspects of emotion recognition were investigated: model selection, feature selection, and data selection. Both generative models (DBNs) and discriminative models (LSTMs) were applied to the four modalities, and from these analyses we conclude that LSTM is better for audio and video together with their corresponding sophisticated feature extractors (MFCC and CNN), whereas DBN is better for both EMG and EEG. By examining these signals at different stages (pre-speech, during-speech, and post-speech) of the current and following trials, we have found that the most effective stages for emotion recognition from EEG occur after the emotion has been expressed, suggesting that the neural signals conveying an emotion are long-lasting.
533    $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2018
538    $a Mode of access: World Wide Web
650  4 $a Computer science. $3 573171
650  4 $a Artificial intelligence. $3 559380
655  7 $a Electronic books. $2 local $3 554714
690    $a 0984
690    $a 0800
710 2  $a ProQuest Information and Learning Co. $3 1178819
710 2  $a City University of New York. $b Computer Science. $3 1184450
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10745436 $z click for full text (PQDT)
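The 520 fields above note that an LSTM over MFCC audio features (and CNN features for video) worked best for audio/video emotion recognition, while DBNs suited EMG and EEG. As an illustration of that audio branch only, here is a minimal sketch: MFCC extraction with librosa followed by an LSTM classifier over seven classes (six emotions plus neutral). The hyperparameters are assumptions, not those used in the thesis, and the synthetic waveform stands in for a real recording.

```python
# Minimal sketch of the audio branch described in the abstract:
# MFCC features fed to an LSTM emotion classifier.
# Hyperparameters (n_mfcc, hidden size, etc.) are illustrative assumptions.
import librosa
import numpy as np
import torch
import torch.nn as nn

NUM_CLASSES = 7  # six emotions + neutral, per the abstract

class AudioEmotionLSTM(nn.Module):
    def __init__(self, n_mfcc=13, hidden_size=64, num_classes=NUM_CLASSES):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden_size,
                            batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x: (batch, time, n_mfcc)
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])  # classify from the final hidden state

# Synthetic one-second waveform stands in for a recorded utterance.
sr = 16000
waveform = np.random.randn(sr).astype(np.float32)
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13)  # (13, frames)
features = torch.from_numpy(mfcc.T).unsqueeze(0)           # (1, frames, 13)

model = AudioEmotionLSTM()
logits = model(features)
print(logits.shape)  # torch.Size([1, 7]) -- one score per emotion class
```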