National Formosa University

Improving Predictive Modeling in High Dimensional, Heterogeneous and Sparse Health Care Data.

Record type:	Bibliographic - language material, manuscript : Monograph/item
Title proper/Author:	Improving Predictive Modeling in High Dimensional, Heterogeneous and Sparse Health Care Data./
Author:	Sarkar, Chandrima.
Extent:	1 online resource (140 pages)
Note:	Source: Dissertation Abstracts International, Volume: 77-03(E), Section: B.
Contained By:	Dissertation Abstracts International 77-03B(E).
Subject:	Computer science. -
Electronic resource:	click for full text (PQDT)
ISBN:	9781339136318

Thesis (Ph.D.)--University of Minnesota, 2015.

Includes bibliographical references

In the past few decades, predictive modeling has emerged as an important tool for exploratory data analysis and decision making in health care. Predictive modeling is a commonly used statistical and data mining technique that analyzes historical and current data and generates a model to help predict future outcomes. It makes it possible to discover hidden relationships in large volumes of data and to use those insights to confidently predict the outcome of future events and interactions. In health care, complex models can be created that combine patient information, such as demographic and clinical information from care providers, in order to predict outcomes and improve model accuracy. Predictive modeling in health care seeks out subtle data patterns to enhance decision making; for example, care providers can recommend prescription drugs and services based on a patient's profile.

Electronic reproduction. Ann Arbor, Mich. : ProQuest, 2018

Mode of access: World Wide Web

ISBN: 9781339136318

Subjects--Topical Terms:
573171 Computer science.

Index Terms--Genre/Form:
554714 Electronic books.

Improving Predictive Modeling in High Dimensional, Heterogeneous and Sparse Health Care Data.
LDR:08203ntm a2200421Ki 4500
001 918752
005 20181030085013.5
006 m o u
007 cr mn||||a|a||
008 190606s2015 xx obm 000 0 eng d
020 $a 9781339136318
035 $a (MiAaPQ)AAI3728239
035 $a (MiAaPQ)umn:16425
035 $a AAI3728239
040 $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1 $a Sarkar, Chandrima. $3 1193162
245 1 0 $a Improving Predictive Modeling in High Dimensional, Heterogeneous and Sparse Health Care Data.
264 0 $c 2015
300 $a 1 online resource (140 pages)
336 $a text $b txt $2 rdacontent
337 $a computer $b c $2 rdamedia
338 $a online resource $b cr $2 rdacarrier
500 $a Source: Dissertation Abstracts International, Volume: 77-03(E), Section: B.
500 $a Advisers: Jaideep Srivastava; Sarah Cooley.
502 $a Thesis (Ph.D.)--University of Minnesota, 2015.
504 $a Includes bibliographical references
520 $a In the past few decades, predictive modeling has emerged as an important tool for exploratory data analysis and decision making in health care. Predictive modeling is a commonly used statistical and data mining technique that analyzes historical and current data and generates a model to help predict future outcomes. It makes it possible to discover hidden relationships in large volumes of data and to use those insights to confidently predict the outcome of future events and interactions. In health care, complex models can be created that combine patient information, such as demographic and clinical information from care providers, in order to predict outcomes and improve model accuracy. Predictive modeling in health care seeks out subtle data patterns to enhance decision making; for example, care providers can recommend prescription drugs and services based on a patient's profile.
520 $a Although all predictive techniques have different strengths and weaknesses, model accuracy depends mostly on the raw input data and the features used to train a predictive model. Model building often requires data pre-processing to reduce the impact of skew or outliers in the data, which can significantly improve performance. From hundreds of available raw data fields, a subset is selected and the fields are pre-processed before being presented to a predictive modeling technique. For example, there can be thousands of variables consisting of genetic, clinical and demographic information for different groups of patients. Detecting the significant variables for a particular group of patients can therefore enhance model accuracy. Hence, a good predictive model often depends more on good pre-processing than on the technique used to train the model.
520 $a While the above responsibilities of an effective and efficient data pre-processing mechanism and its usage with predictive modeling in health care data are better understood, three key challenges were identified that face this data pre-processing task. These are: 1) High dimensionality: The challenge of high dimensionality arises in diverse fields, ranging from health care and computational biology to financial engineering and risk management. This work identifies that no single feature selection strategy is robust across different families of classification or prediction algorithms. The existing feature selection techniques produce different results with different predictive models. This can be a problem when deciding on the best predictive model to use with real high-dimensional health care data, especially without domain experts.
520 $a 2) Heterogeneity in the data and data redundancy: Most real-world data is heterogeneous in nature, i.e. the population consists of overlapping homogeneous groups. In health care, Electronic Health Records (EHR) data consists of diverse groups of patients with a wide range of health conditions. This thesis identifies that predictive modeling with a single learning model over heterogeneous data can lead to inconclusive results and an ineffective explanation of an outcome. Therefore, this thesis proposes that there is a need for a data segmentation/co-clustering technique that extracts groups from the data while removing insignificant features and extraneous rows, resulting in improved predictive modeling with a learning model.
520 $a 3) Data sparseness: When a row is created, storage is allocated for every column, irrespective of whether a value exists for a given field. This gives rise to sparse data, in which a relatively high percentage of a variable's cells are missing actual data. In health care, not all patients undergo every possible medical diagnostic test, and lab results are equally sparse. Such sparse information or missing values cause predictive models to produce inconclusive results. One primitive technique is manual imputation of missing values by domain experts. Today, this is almost impossible because the data is huge and high dimensional. A variety of statistical and machine learning based missing value estimation techniques exist, which estimate missing values by statistical analysis of the available data set. However, most of these techniques do not consider the importance of a domain expert's opinion in estimating missing data. This thesis proposes that techniques that use statistical information from the data as well as the opinion of experts can estimate missing values more effectively. This imputation procedure can result in non-sparse data that is closer to the ground truth and that improves predictive modeling.
520 $a In this thesis, the following computational approaches have been proposed for handling the challenges described above, for effective and improved predictive modeling -- 1) For handling high-dimensional data, a novel robust rank aggregation-based feature selection technique has been developed using exclusive rank aggregation strategies by Borda (1781) and Kemeny (1959). The concept of robustness of a feature selection algorithm has been introduced, defined as the property that characterizes the stability of a ranked feature set toward achieving similar classification accuracy across a wide range of classifiers. This concept has been quantified with an evaluation measure, namely the robustness index (RI). The concept of inter-rater agreement for improving the quality of the rank aggregation approach for feature selection has also been proposed in this thesis.
520 $a 2) The concept of co-clustering dedicated to improving predictive modeling has been proposed. The novel idea of Learning based Co-Clustering (LCC) has been developed as an optimization problem for more effective and improved predictive analysis. An important property of this algorithm is that there is no need to specify the number of co-clusters. A separate model testing framework has also been proposed in this work, for reducing model over-fitting and obtaining more accurate results. The methodology has been evaluated on health care data as a case study as well as on several other publicly available data sets.
520 $a 3) A missing value imputation technique based on domain experts' knowledge and statistical analysis of the available data has been proposed in this thesis. The medical domain of HSCT has been chosen for the case study, and the domain experts' knowledge consists of the opinions of a group of stem cell transplant physicians. The machine learning approach developed can be described as rule mining with expert knowledge and similarity scoring based missing value imputation. This technique has been developed and validated using a real-world medical data set. The results demonstrate the effectiveness and utility of this technique in practice.
533 $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2018
538 $a Mode of access: World Wide Web
650 4 $a Computer science. $3 573171
650 4 $a Health care management. $3 1148454
655 7 $a Electronic books. $2 local $3 554714
690 $a 0984
690 $a 0769
710 2 $a ProQuest Information and Learning Co. $3 1178819
710 2 $a University of Minnesota. $b Computer Science. $3 1180176
773 0 $t Dissertation Abstracts International $g 77-03B(E).
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=3728239 $z click for full text (PQDT)
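
Illustrative note on the abstract: the summary mentions rank aggregation-based feature selection using the Borda strategy. The following is only a minimal sketch of a plain Borda count over several feature rankings, not the thesis's actual method (which also involves Kemeny aggregation, a robustness index and inter-rater agreement, none of which are specified in this record). The function name `borda_aggregate` and the feature names are hypothetical.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Combine several best-first feature rankings with a Borda count.

    Every ranking must list the same features exactly once. A feature at
    0-based position p in a ranking of n features earns n - 1 - p points.
    """
    if not rankings:
        return []
    n = len(rankings[0])
    expected = set(rankings[0])
    scores = defaultdict(int)
    for ranking in rankings:
        if len(ranking) != n or set(ranking) != expected:
            raise ValueError("all rankings must order the same features")
        for position, feature in enumerate(ranking):
            scores[feature] += n - 1 - position
    # Highest total score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))


# Hypothetical rankings produced by three different feature-scoring methods.
rankings = [
    ["age", "bmi", "glucose", "smoker"],
    ["glucose", "age", "bmi", "smoker"],
    ["age", "glucose", "smoker", "bmi"],
]
aggregated = borda_aggregate(rankings)
print(aggregated)
```

With these made-up inputs, "age" collects the most points because it is ranked first or second by every method, so it heads the aggregated list.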