National Formosa University

考量雜訊的雙模式視訊及音訊的身份識別之深度學習方法設計及應用 = Designs of deep learning methods and applications on dual-mode video and audio identity recognition with considerations of noises

Record type:	Bibliographic - language material, printed : Monograph/item
Title/Author:	考量雜訊的雙模式視訊及音訊的身份識別之深度學習方法設計及應用 / 莫學霖.
Other title:	Designs of deep learning methods and applications on dual-mode video and audio identity recognition with considerations of noises.
Author:	莫學霖
Publisher:	Yunlin County : National Formosa University, ROC 113.07 [2024.07].
Description:	[15], 73 p. : ill., tables ; 30 cm.
Notes:	Advisors: 丁英智, 顏義和.
Subject:	noise.
Electronic resource:	https://handle.ncl.edu.tw/11296/qhgd42


考量雜訊的雙模式視訊及音訊的身份識別之深度學習方法設計及應用 = Designs of deep learning methods and applications on dual-mode video and audio identity recognition with considerations of noises / 莫學霖. - 1st ed. - Yunlin County : National Formosa University, ROC 113.07 [2024.07]. - [15], 73 p. : ill., tables ; 30 cm.

Advisors: 丁英智, 顏義和.

Master's thesis--National Formosa University, Master's Program, Department of Electrical Engineering.

Includes bibliographical references.

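Identity recognition has long been commercially available. Traditional systems are command-based: the machine identifies a person who actively cooperates, as in airport fast-clearance lanes that combine face and fingerprint recognition. Future systems may instead be perceptual, identifying people without their awareness, particularly in indoor public spaces such as offices or convenience stores, where differences in walking posture (gait) offer a good basis for recognition. Using RGB images of gait, however, raises privacy concerns because RGB images carry rich color information, so noise-added RGB images or point cloud data are a more privacy-preserving and acceptable alternative. Perceptual identity recognition can also combine several sensing modes, and a dual-mode design combining video and audio is feasible. Audio-mode recognition classifies identity from the speaker's voiceprint; because speech captured in such applications usually contains environmental noise, removing that noise is important for recognition performance. The dual-mode identity recognition proposed in this thesis combines gait and speech.

The main work of this thesis is to use an RGB-D sensor to record gait RGB images, gait point cloud data, and speech data for noise-aware dual-mode video and audio identity recognition. Single-mode recognition using noise-added RGB images, point cloud data, or noise-removed speech serves as the baseline for comparison with the dual-mode results. To improve privacy, the RGB images are either processed with K-means or degraded with Gaussian blur, which is treated as added noise. The point cloud data use only the XYZ coordinates of each point, without RGB color values, and therefore raise no privacy concerns. For speech, five specific types of environmental noise are added to clean recordings (made in a noise-free environment) to simulate noisy real-world speech.

The video mode comprises three recognition channels: RGB images without noise, noise-added RGB images, and point cloud data. Both RGB channels are recognized with a CNN-LSTM deep learning method that combines a convolutional neural network (CNN) with a long short-term memory (LSTM) network; the point cloud channel uses an LSTM model. The audio mode also comprises three channels: clean speech, speech with a specific environmental noise, and noise-removed speech, all recognized with CNN-LSTM. In the noise-removed channel, a speech enhancement generative adversarial network (SEGAN) removes the environmental noise before recognition. The thesis proposes three decision fusion methods for noise-aware dual-mode recognition, each fusing the decisions of three channels in two stages. The first method fuses the noise-added RGB image and point cloud channels, then fuses the result with the noise-removed speech channel. The second fuses the noise-added RGB image and noise-removed speech channels, then fuses the result with the point cloud channel. The third fuses the point cloud and noise-removed speech channels, then fuses the result with the noise-added RGB image channel.

The experiments show that, for single-mode recognition, the audio mode outperforms the video mode: its best channel exceeds the best video channel by 0.67 percentage points. Among the video channels, RGB images perform best (91.66%), followed by RGB images with Gaussian blur (83.33%), with point cloud data lowest (75%). Among the audio channels (first noise environment), clean speech performs best (92.33%), followed by noise-removed speech (88.33%), with noisy speech lowest (38.33%). For noise-aware dual-mode recognition (first noise environment), RGB images with Gaussian blur consistently yield higher recognition rates than K-means-processed RGB images; the second fusion method performs best (99%), followed by the third (98.67%), with the first lowest (92.33%). The experiments indicate that the different decision fusion methods improve the recognition rate over single-mode video or audio to varying degrees.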
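The two-stage decision fusion described in the abstract can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the weighted averaging of per-class scores used here is an assumption, as are the function names `fuse` and `two_stage_fusion`, the equal weights, and the made-up score vectors; none of this is taken from the thesis.

```python
import numpy as np


def fuse(p, q, w=0.5):
    """Fuse two per-class score vectors by weighted averaging.

    Assumed rule for illustration only; the thesis may use a different one.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return w * p + (1.0 - w) * q


def two_stage_fusion(first, second, third, w1=0.5, w2=0.5):
    """Fuse `first` with `second`, then fuse that result with `third`."""
    stage1 = fuse(first, second, w1)
    return fuse(stage1, third, w2)


# Hypothetical class scores for three enrolled identities (made-up numbers).
rgb_blur = np.array([0.5, 0.3, 0.2])     # noise-added RGB image channel
point_cloud = np.array([0.2, 0.5, 0.3])  # point cloud channel
speech = np.array([0.6, 0.3, 0.1])       # noise-removed speech channel

# The three fusion orders named in the abstract.
method1 = two_stage_fusion(rgb_blur, point_cloud, speech)
method2 = two_stage_fusion(rgb_blur, speech, point_cloud)
method3 = two_stage_fusion(point_cloud, speech, rgb_blur)

for name, scores in [("method 1", method1),
                     ("method 2", method2),
                     ("method 3", method3)]:
    print(name, scores, "-> identity index", int(np.argmax(scores)))
```

With equal weights at both stages, the channel added last carries half of the final score and the first two carry a quarter each, so the three orderings can reach different decisions from the same channel outputs. This is one plausible reason the order of fusion would matter, offered under the averaging assumption above.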

(Paperback)

Subjects--Topical Terms:

1091824
noise.

LDR:10914cam a2200241 i 4500
001 1129994
008 241015s2024 ch ak erm 000 0 chi d
035 $a (THES)112NYPI0441019
040 $a NFU $b chi $c NFU $e CCR
041 0 # $a chi $b chi $b eng
084 $a 008.165M $b 4471 113 $2 ncsclt
100 1 $a 莫學霖 $3 1449029
245 1 0 $a 考量雜訊的雙模式視訊及音訊的身份識別之深度學習方法設計及應用 = $b Designs of deep learning methods and applications on dual-mode video and audio identity recognition with considerations of noises / $c 莫學霖.
246 1 1 $a Designs of deep learning methods and applications on dual-mode video and audio identity recognition with considerations of noises.
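250 $a 1st ed.
260 # $a Yunlin County : $b National Formosa University, $c ROC 113.07 [2024.07].
300 $a [15], 73 p. : $b ill., tables ; $c 30 cm.
500 $a Advisors: 丁英智, 顏義和.
500 $a Academic year: 112.
502 $a Master's thesis--National Formosa University, Master's Program, Department of Electrical Engineering.
504 $a Includes bibliographical references.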
520 3 $a Identity recognition has long been commercially available. Traditionally, it relies on command-based methods, in which the system identifies individuals through their active participation, such as face recognition and fingerprint scanning for fast clearance at airports. Identity recognition can instead take a perceptual form, in which the system identifies individuals without their awareness, especially in indoor environments such as offices or convenience stores. Gait recognition from RGB images is a promising approach for such spaces. However, RGB images raise privacy concerns because of their rich color information. Using noise-added RGB images or point cloud data for perceptual identity recognition is therefore a more privacy-friendly and acceptable approach. In perceptual identity recognition, a multi-sensor design that combines video and audio modes is feasible. Audio-based identity recognition classifies individuals by their voiceprints. The speech data captured in such applications usually contain background noise, so removing that noise is crucial to system performance. This thesis proposes a dual-mode identity recognition system that combines gait and voiceprint recognition. The main work of the thesis is to use an RGB-D sensor to record gait RGB images, point cloud data, and speech data for dual-mode identity recognition with consideration of noise. Single-mode identity recognition using noise-added RGB images, point cloud data, and noise-removed speech data is used for performance comparison against the dual-mode results. Although RGB images have rich color information and support accurate recognition, they can easily invade privacy. This thesis therefore applies K-means processing or Gaussian blur to the original RGB images as added noise to enhance privacy.
The point cloud data use only XYZ spatial information without RGB color values, so they raise no privacy concerns. For speech data, this thesis adds five specific types of environmental noise to clean speech data (recorded in a noise-free environment) to simulate noisy speech in real-world environments. The video mode includes three recognition channels: noise-free RGB images, noise-added RGB images, and point cloud data. Both the noise-free and the noise-added RGB image channels use a CNN-LSTM deep learning method that combines a convolutional neural network (CNN) with a long short-term memory (LSTM) network, while the point cloud data channel uses an LSTM deep learning model. The audio mode also includes three recognition channels: clean speech, noisy speech, and noise-removed speech. All three channels use the CNN-LSTM deep learning method. In the noise-removed speech channel, the speech data are first processed with a speech enhancement generative adversarial network (SEGAN) deep learning model to remove environmental noise before recognition. This thesis proposes three decision fusion methods for dual-mode identity recognition with consideration of noise in the video and audio modes. Each method fuses the decisions of three different recognition channels. The first fusion method fuses the decisions of the noise-added RGB image and point cloud data channels, and then fuses the result with the noise-removed speech channel. The second fusion method fuses the decisions of the noise-added RGB image and noise-removed speech channels, and then fuses the result with the point cloud data channel. The third fusion method fuses the decisions of the point cloud data and noise-removed speech channels, and then fuses the result with the noise-added RGB image channel.
The experimental results show that, for single-mode identity recognition, the audio mode performs better than the video mode: the best audio channel exceeds the best video channel by 0.67 percentage points. Among the video channels, RGB images perform best (91.66%), followed by RGB images with Gaussian blur (83.33%), and point cloud data perform worst (75%). Among the audio channels (under the first noise environment), clean speech performs best (92.33%), followed by noise-removed speech (88.33%), and noisy speech performs worst (38.33%). For dual-mode identity recognition with consideration of noise (under the first noise environment), RGB images with Gaussian blur consistently achieve higher recognition rates than those processed with K-means. The second fusion method performs best (99%), followed by the third (98.67%), and the first performs worst (92.33%). The experiments indicate that the different decision fusion methods improve the recognition rate over single-mode video or audio recognition to varying degrees.
563 $a (Paperback)
650 # 4 $a noise. $3 1091824
650 # 4 $a audio data. $3 1451883
650 # 4 $a point cloud data. $3 1451882
650 # 4 $a RGB images. $3 1451881
650 # 4 $a deep learning. $3 1218309
650 # 4 $a identity recognition. $3 1451880
650 # 4 $a 雜訊. $3 1084083
650 # 4 $a 語音資料. $3 1451879
650 # 4 $a 點雲資料. $3 1451878
650 # 4 $a RGB圖像. $3 1451877
650 # 4 $a 深度學習. $3 1127425
650 # 4 $a 身份識別. $3 1451876
856 7 # $u https://handle.ncl.edu.tw/11296/qhgd42 $z Electronic resource $2 http