National Formosa University (國立虎尾科技大學)

Clinical Information Extraction from Unstructured Free-Texts.

Record type:	Bibliographic - language material, printed : Monograph/item
Title/Author:	Clinical Information Extraction from Unstructured Free-Texts./
Author:	Tao, Mingzhe.
Publisher:	Ann Arbor : ProQuest Dissertations & Theses, 2018
Pagination:	142 p.
Notes:	Source: Dissertation Abstracts International, Volume: 79-12(E), Section: A.
Contained By:	Dissertation Abstracts International 79-12A(E).
Subject:	Information science.
Electronic resource:	http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10842594
ISBN:	9780438255500

Thesis (Ph.D.)--State University of New York at Albany, 2018.

Information extraction (IE) is a fundamental component of natural language processing (NLP) that provides a deeper understanding of the texts. In the clinical domain, documents prepared by medical experts (e.g., discharge summaries, drug labels, medical history records) contain a significant amount of clinically-relevant information that is crucial to the overall well-being of patients. Unfortunately, in many cases, clinically-relevant information is presented in an unstructured format, predominantly consisting of free-texts, making it inaccessible to computerized methods. Automatic extraction of this information can improve accessibility. However, the presence of synonymous expressions, medical acronyms, misspellings, negated phrases, and ambiguous terminologies makes automatic extraction difficult. The lack of annotated data, sometimes in less well-represented information categories, sometimes in all categories altogether, further complicates this task.

Subjects--Topical Terms:

561178
Information science.

Clinical Information Extraction from Unstructured Free-Texts.
LDR:05700nam a2200385 4500
001 931688
005 20190716101635.5
008 190815s2018 ||||||||||||||||| ||eng d
020 $a 9780438255500
035 $a (MiAaPQ)AAI10842594
035 $a (MiAaPQ)sunyalb:12438
035 $a AAI10842594
040 $a MiAaPQ $c MiAaPQ
100 1 $a Tao, Mingzhe. $3 1213895
245 1 0 $a Clinical Information Extraction from Unstructured Free-Texts.
260 1 $a Ann Arbor : $b ProQuest Dissertations & Theses, $c 2018
300 $a 142 p.
500 $a Source: Dissertation Abstracts International, Volume: 79-12(E), Section: A.
500 $a Advisers: Ozlem Uzuner; Kevin Knuth.
502 $a Thesis (Ph.D.)--State University of New York at Albany, 2018.
520 $a Information extraction (IE) is a fundamental component of natural language processing (NLP) that provides a deeper understanding of the texts. In the clinical domain, documents prepared by medical experts (e.g., discharge summaries, drug labels, medical history records) contain a significant amount of clinically-relevant information that is crucial to the overall well-being of patients. Unfortunately, in many cases, clinically-relevant information is presented in an unstructured format, predominantly consisting of free-texts, making it inaccessible to computerized methods. Automatic extraction of this information can improve accessibility. However, the presence of synonymous expressions, medical acronyms, misspellings, negated phrases, and ambiguous terminologies makes automatic extraction difficult. The lack of annotated data, sometimes in less well-represented information categories, sometimes in all categories altogether, further complicates this task.
520 $a In the past decade, there have been many efforts focused on extraction of clinical information, i.e., clinical IE. In this dissertation, we present novel extensions to IE methods for automatically identifying clinically-relevant information from documents such as discharge summaries and drug labels. We develop our methods on two sets of clinical IE tasks. In the first set of tasks, we identify medication names, dosages, modes, frequencies, durations, and reasons. We also link these entities in a relation extraction task to generate prescription entries (e.g., link dosages, modes, frequencies, durations, and reasons with their associated medication names). In the second set of tasks, we study a broader set of concepts such as adverse drug reactions (ADR), medical problems, and lab tests. When tackling these tasks, we aim to create novel knowledge representations that improve semantic interpretation of clinical free-texts. We combine these knowledge representations with new methods that help address annotation and data sparsity. Finally, we formulate a new way of looking at the problem of relation extraction. Through empirical evaluations, we show that:
520 $a (1) Knowledge representations that utilize real-valued word embeddings outperform their categorical counterparts. Categorical embeddings eliminate word-to-word distances in the high-dimensional space when converting words into discrete labels. Real-valued word embeddings do not have this problem. Therefore, they assist classifiers in extracting entities from the categories that have high morphological variance and result in higher overall performance.
520 $a (2) Introducing pseudo-sequences from unannotated data can improve extraction of entity categories that are sparsely represented in the training data. We use a supervised model trained on annotated data to predict pseudo-sequences from unannotated data with confidence rates. We show that confidence thresholds can guide us in selectively including pseudo-sequences that can boost performance, especially for the entity categories that are sparsely represented.
520 $a (3) We can address lack of available annotated data through pseudo-data generation. We experiment with three different methods of pseudo-data generation. The first method is based on professional gazetteers. It replaces entities in the annotated data with entries in professional gazetteers. The second method takes advantage of the Euclidean distance between the word embeddings. It replaces entity tokens in the annotated data with their top three nearest neighbors in the word vector set. Our results and manual analyses suggest that the method based on Euclidean distance is suitable for entity categories with high morphological variance but short textual spans. On the other hand, the gazetteer-based pseudo-datasets are effective for entity categories with longer textual spans, when the gazetteers offer a good representation of the entities within each category. Given these findings, we propose a third method that hybridizes the two pseudo-datasets, outperforming both.
520 $a (4) A sequence labeling approach to relation extraction can benefit this task. Sequence labeling can identify textual excerpts that contain entities and enables subsequent extraction of sequences of related entities from these excerpts.
520 $a Cross-validated results across multiple clinical IE tasks show overall significant performance improvement from the knowledge representations, pseudo-sequences, pseudo-data, and relation extraction models we proposed in our study. The generalized findings from this dissertation are applicable to current and future IE studies in the clinical domain.
590 $a School code: 0668.
650 4 $a Information science. $3 561178
650 4 $a Computer science. $3 573171
650 4 $a Bioinformatics. $3 583857
690 $a 0723
690 $a 0984
690 $a 0715
710 2 $a State University of New York at Albany. $b Information Science. $3 1213896
773 0 $t Dissertation Abstracts International $g 79-12A(E).
790 $a 0668
791 $a Ph.D.
792 $a 2018
793 $a English
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10842594
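
The abstract (item 3) describes a pseudo-data method that replaces entity tokens in the annotated data with their top three nearest neighbors by Euclidean distance between word embeddings. The following is a minimal Python sketch of that idea only, not the author's implementation. The toy 2-D vectors and the names `EMBEDDINGS`, `nearest_neighbors`, `generate_pseudo_sentences`, and the `MEDICATION` label are all hypothetical and chosen for illustration; real word embeddings would have far more dimensions.

```python
import math

# Hypothetical toy 2-D word vectors (assumption: not from the dissertation).
EMBEDDINGS = {
    "aspirin": (1.0, 0.0),
    "ibuprofen": (1.1, 0.1),
    "naproxen": (1.2, 0.0),
    "acetaminophen": (0.8, 0.3),
    "daily": (5.0, 5.0),
    "take": (9.0, 1.0),
}


def nearest_neighbors(word, embeddings, k=3):
    """Return the k vocabulary words closest to `word` by Euclidean distance."""
    target = embeddings[word]
    others = [w for w in embeddings if w != word]
    others.sort(key=lambda w: math.dist(target, embeddings[w]))
    return others[:k]


def generate_pseudo_sentences(tokens, labels, embeddings, k=3):
    """Build pseudo-sentences by swapping each entity token for a neighbor.

    Tokens labeled "O" (outside any entity) are left untouched, and the
    original label sequence is reused unchanged for every pseudo-sentence.
    """
    pseudo = []
    for i, (token, label) in enumerate(zip(tokens, labels)):
        if label == "O" or token not in embeddings:
            continue
        for neighbor in nearest_neighbors(token, embeddings, k):
            new_tokens = tokens[:i] + [neighbor] + tokens[i + 1:]
            pseudo.append((new_tokens, list(labels)))
    return pseudo


if __name__ == "__main__":
    tokens = ["take", "aspirin", "daily"]
    labels = ["O", "MEDICATION", "O"]
    for new_tokens, new_labels in generate_pseudo_sentences(tokens, labels, EMBEDDINGS):
        print(new_tokens, new_labels)
```

In this sketch each annotated sentence yields up to three pseudo-sentences per entity token. The abstract reports that this kind of replacement suits entity categories with high morphological variance but short textual spans, which is why the sketch swaps single tokens and does not handle multi-token entities.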