Pu, Xiao.
Topics in Clustering : Feature Selection and Semiparametric Modeling.
Record Type: Bibliographic - Language material, manuscript : Monograph/item
Title/Author: Topics in Clustering : / Pu, Xiao.
Other Title: Feature Selection and Semiparametric Modeling.
Author: Pu, Xiao.
Description: 1 online resource (107 pages)
Notes: Source: Dissertation Abstracts International, Volume: 78-12(E), Section: B.
Contained By: Dissertation Abstracts International, 78-12B(E).
Subject: Statistics.
Electronic Resource: click for full text (PQDT)
ISBN: 9780355081947
MARC record:
LDR    04038ntm a2200349Ki 4500
001    911200
005    20180529081859.5
006    m o u
007    cr mn||||a|a||
008    190606s2017 xx obm 000 0 eng d
020    $a 9780355081947
035    $a (MiAaPQ)AAI10268721
035    $a (MiAaPQ)ucsd:16387
035    $a AAI10268721
040    $a MiAaPQ $b eng $c MiAaPQ
099    $a TUL $f hyy $c available through World Wide Web
100 1  $a Pu, Xiao. $3 1182884
245 10 $a Topics in Clustering : $b Feature Selection and Semiparametric Modeling.
264  0 $c 2017
300    $a 1 online resource (107 pages)
336    $a text $b txt $2 rdacontent
337    $a computer $b c $2 rdamedia
338    $a online resource $b cr $2 rdacarrier
500    $a Source: Dissertation Abstracts International, Volume: 78-12(E), Section: B.
500    $a Adviser: Ery Arias-Castro.
502    $a Thesis (Ph.D.) $c University of California, San Diego $d 2017.
504    $a Includes bibliographical references
520    $a The first part of this thesis is concerned with sparse clustering, which assumes that a potentially large set of features is available for clustering the observations but that the true underlying clusters differ only with respect to some of those features. We propose two approaches for this purpose, both of which group the observations using only a carefully chosen subset of the features. The first approach assumes that the data are generated from a high-dimensional Gaussian mixture model in which the difference between the mean vectors of the Gaussian components is sparse. Motivated by the connection between sparse principal component analysis (SPCA) and sparse clustering, we adapt multiple estimation strategies from SPCA to perform sparse clustering. We provide a theoretical guarantee for the aggregated estimator and develop an iterative algorithm to uncover the important feature set. The second approach is a hill-climbing procedure that alternates between selecting the s most important features (those with the s smallest within-cluster dissimilarities) and clustering the observations based on the selected feature subset; a sketch of this scheme follows the MARC record below. This approach has been shown to be competitive with existing methods from the literature on simulated and real-world datasets.
520    $a In the second part of the thesis, we consider a semiparametric approach to clustering and develop the related theory. We first consider the problem of fitting a mixture model under the assumption that the mixture components are symmetric and log-concave. We study the nonparametric maximum likelihood estimation (NPMLE) of a monotone and log-concave probability density (which our algorithm requires as a subroutine) and derive results on the existence, uniqueness, and uniform consistency of the MLE. To fit the mixture model, we propose a semiparametric EM (SEM) algorithm, which can be adapted to other semiparametric mixture models. We then consider mixture modeling in high dimensions using radial (or elliptical) distributions. In working on this problem, we uncovered a difficulty in estimating the densities: i.i.d. d-dimensional data points sampled from a rotationally invariant distribution F with density f(x) = g(||x||) are highly concentrated near the bounding sphere of a d-dimensional ball as d → ∞. This extends the well-known behavior of the normal distribution, whose mass concentrates around the sphere of radius equal to the square root of the dimension, to other radial densities; a small numerical illustration follows the MARC record below. We establish a form of concentration of measure and, under additional assumptions, even a convergence in distribution. We draw some possible consequences for statistical modeling in high dimensions, including a possible universality property of Gaussian mixtures.
533    $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2018
538    $a Mode of access: World Wide Web
650  4 $a Statistics. $3 556824
655  7 $a Electronic books. $2 local $3 554714
690    $a 0463
710 2  $a ProQuest Information and Learning Co. $3 1178819
710 2  $a University of California, San Diego. $b Mathematics with a Specialization in Statistics. $3 1182885
773 0  $t Dissertation Abstracts International $g 78-12B(E).
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10268721 $z click for full text (PQDT)
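
The first 520 abstract above describes a hill-climbing procedure that alternates between selecting the s features with the smallest within-cluster dissimilarities and clustering on that subset. The following is a minimal Python sketch of that idea, not the author's implementation: the function name, the use of k-means as the base clusterer, and the per-feature within-cluster sum of squares as the dissimilarity measure are illustrative assumptions.

# Minimal sketch (assumed details, not the thesis code) of the hill-climbing
# sparse-clustering idea: alternate between clustering on a feature subset and
# re-selecting the s features with the smallest within-cluster dissimilarity.
import numpy as np
from sklearn.cluster import KMeans

def hill_climb_sparse_clustering(X, k, s, n_iter=20, seed=0):
    """Alternate feature selection and clustering until the selected set stabilizes."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    selected = rng.choice(p, size=s, replace=False)  # random initial feature subset
    labels = None
    for _ in range(n_iter):
        # (1) cluster the observations using only the currently selected features
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[:, selected])
        # (2) per-feature within-cluster sum of squares; smaller is treated as
        # "smaller within-cluster dissimilarity" in the sense of the abstract
        wcss = np.zeros(p)
        for c in range(k):
            Xc = X[labels == c]
            wcss += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
        new_selected = np.argsort(wcss)[:s]
        if set(new_selected) == set(selected):
            break
        selected = new_selected
    return labels, np.sort(selected)

A call such as hill_climb_sparse_clustering(X, k=3, s=10) would return cluster labels together with the indices of the selected features; the thesis may use a different base clusterer or selection criterion.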
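
The second 520 abstract refers to the concentration of high-dimensional radial distributions near a sphere. The snippet below is a small numerical illustration, assumed rather than taken from the thesis, using the standard Gaussian (one rotationally invariant distribution) to show that the norm of a d-dimensional sample concentrates around sqrt(d), so its relative spread shrinks as d grows.

# Numerical illustration of norm concentration for standard Gaussian vectors in R^d:
# mean ||x|| is close to sqrt(d) and the relative spread of ||x|| shrinks with d.
import numpy as np

rng = np.random.default_rng(0)
for d in (10, 100, 1000, 10000):
    x = rng.standard_normal((2000, d))      # 2000 i.i.d. N(0, I_d) samples
    norms = np.linalg.norm(x, axis=1)
    print(f"d={d:>6}  mean norm={norms.mean():9.3f}  sqrt(d)={np.sqrt(d):9.3f}  "
          f"relative sd={norms.std() / norms.mean():.4f}")

The relative standard deviation in the last column decreases roughly like 1/sqrt(d), which is the concentration behavior the abstract extends to other radial densities under its own assumptions.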