National Formosa University

Machine Learning for Detecting Trends and Topics From Research Papers and Proceedings.

Record type:	Bibliographic - language material, manuscript : Monograph/item
Title/Author:	Machine Learning for Detecting Trends and Topics From Research Papers and Proceedings.
Author:	Dixon, Jose.
Physical description:	1 online resource (112 pages)
Note:	Source: Masters Abstracts International, Volume: 84-11.
Contained By:	Masters Abstracts International 84-11.
Subject:	Computer science.
Electronic resource:	click for full text (PQDT)
ISBN:	9798379533373

Thesis (M.S.)--Morgan State University, 2023.

Includes bibliographical references

1,000 Portable Document Format (PDF) files from the World Health Organization COVID-19 Research Downloadable Articles and PubMed Central databases are divided into five labels for positive and negative papers. The PDF files are converted into unstructured raw text files. After punctuation is removed, tokenization and lemmatization are performed with the Natural Language Toolkit (NLTK) library. Training size and subsampling are varied experimentally to determine their effect on the performance measures. Supervised classification is performed using the Scikit-learn library and the following classifiers: Random Forest, Naive Bayes, Decision Tree, XGBoost, and Logistic Regression. Imbalanced sampling techniques are implemented using the Imbalanced-learn library, namely Synthetic Minority Oversampling Technique (SMOTE), Random Oversampling (ROS), Random Undersampling, TomekLinks, and NearMiss, to address the uneven distribution of positive and negative samples. R and the tidyverse are used to conduct statistical and exploratory data analysis on the performance metrics. The machine learning classifiers achieve an average precision score of 78% and recall score of 77%, while the sampling techniques achieve higher average precision and recall scores of 80% and 81%, respectively. Correcting for imbalanced sampling yielded significant p-values for NearMiss, ROS, and SMOTE on precision and recall scores. This work shows that training size variation, subsampling, and imbalanced sampling techniques combined with machine learning algorithms can improve precision, recall, accuracy, and area under the curve scores, as examined through analysis of variance.
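Illustrative note: the abstract names Random Oversampling, Random Undersampling, precision, and recall but does not show how they work. The following is a minimal sketch of those ideas in plain Python, not the thesis's code. The thesis reports using Imbalanced-learn and Scikit-learn. The function names (`random_oversample`, `random_undersample`, `precision_recall`) and the toy data are assumptions made here for illustration only.

```python
import random
from collections import Counter


def _group_by_label(samples, labels):
    """Group samples into a dict keyed by class label (hypothetical helper)."""
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return groups


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until every
    class is as large as the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(xs) for xs in groups.values())
    out_x, out_y = [], []
    for y, xs in groups.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class so every class is as small as
    the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(xs) for xs in groups.values())
    out_x, out_y = [], []
    for y, xs in groups.items():
        out_x.extend(rng.sample(xs, target))
        out_y.extend([y] * target)
    return out_x, out_y


def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).
    Returns 0.0 for a metric whose denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (made up): eight negative papers and two positive papers.
toy_docs = [f"doc{i}" for i in range(10)]
toy_labels = [0] * 8 + [1] * 2
_, balanced_labels = random_oversample(toy_docs, toy_labels)
print(Counter(balanced_labels))
```

Resampling of this kind is normally applied to the training split only, so that duplicated or discarded samples do not distort the evaluation on held-out data.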

Electronic reproduction. Ann Arbor, Mich. : ProQuest, 2024

Mode of access: World Wide Web

ISBN: 9798379533373

Subjects--Topical Terms:
573171 Computer science.

Subjects--Index Terms:
Imbalanced sampling

Index Terms--Genre/Form:
554714 Electronic books.

Machine Learning for Detecting Trends and Topics From Research Papers and Proceedings.
LDR:03123ntm a22004097 4500
001 1142565
005 20240422071019.5
006 m o d
007 cr mn ---uuuuu
008 250605s2023 xx obm 000 0 eng d
020 $a 9798379533373
035 $a (MiAaPQ)AAI30313079
035 $a AAI30313079
040 $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1 $a Dixon, Jose. $3 1466941
245 1 0 $a Machine Learning for Detecting Trends and Topics From Research Papers and Proceedings.
264 0 $c 2023
300 $a 1 online resource (112 pages)
336 $a text $b txt $2 rdacontent
337 $a computer $b c $2 rdamedia
338 $a online resource $b cr $2 rdacarrier
500 $a Source: Masters Abstracts International, Volume: 84-11.
500 $a Includes supplementary digital materials.
500 $a Advisor: Rahman, Md Mahmudur.
502 $a Thesis (M.S.)--Morgan State University, 2023.
504 $a Includes bibliographical references
520 $a 1,000 Portable Document Format (PDF) files from the World Health Organization COVID-19 Research Downloadable Articles and PubMed Central databases are divided into five labels for positive and negative papers. The PDF files are converted into unstructured raw text files. After punctuation is removed, tokenization and lemmatization are performed with the Natural Language Toolkit (NLTK) library. Training size and subsampling are varied experimentally to determine their effect on the performance measures. Supervised classification is performed using the Scikit-learn library and the following classifiers: Random Forest, Naive Bayes, Decision Tree, XGBoost, and Logistic Regression. Imbalanced sampling techniques are implemented using the Imbalanced-learn library, namely Synthetic Minority Oversampling Technique (SMOTE), Random Oversampling (ROS), Random Undersampling, TomekLinks, and NearMiss, to address the uneven distribution of positive and negative samples. R and the tidyverse are used to conduct statistical and exploratory data analysis on the performance metrics. The machine learning classifiers achieve an average precision score of 78% and recall score of 77%, while the sampling techniques achieve higher average precision and recall scores of 80% and 81%, respectively. Correcting for imbalanced sampling yielded significant p-values for NearMiss, ROS, and SMOTE on precision and recall scores. This work shows that training size variation, subsampling, and imbalanced sampling techniques combined with machine learning algorithms can improve precision, recall, accuracy, and area under the curve scores, as examined through analysis of variance.
533 $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2024
538 $a Mode of access: World Wide Web
650 4 $a Computer science. $3 573171
650 4 $a Information science. $3 561178
653 $a Imbalanced sampling
653 $a Machine learning
653 $a Statistical analysis
653 $a Subsampling
653 $a Text classification
653 $a Text retrieval
655 7 $a Electronic books. $2 local $3 554714
690 $a 0984
690 $a 0723
710 2 $a ProQuest Information and Learning Co. $3 1178819
710 2 $a Morgan State University. $b Computer Science and Bioinformatics Program. $3 1466942
773 0 $t Masters Abstracts International $g 84-11.
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30313079 $z click for full text (PQDT)