National Formosa University (國立虎尾科技大學)

Metadata Analysis in Unstructured Documents Using Classical and Deep Learning Methods.

Record type:	Bibliographic - language material, manuscript : Monograph/item
Title/Author:	Metadata Analysis in Unstructured Documents Using Classical and Deep Learning Methods./
Author:	Nair, Rathin Radhakrishnan.
Extent:	1 online resource (147 pages)
Note:	Source: Dissertation Abstracts International, Volume: 79-03(E), Section: B.
Subject:	Computer science.
Electronic resource:	click for full text (PQDT)
ISBN:	9780355311020

Thesis (Ph.D.)--State University of New York at Buffalo, 2017.

Includes bibliographical references

Metadata, by definition, is any set of data that describes and provides information about other data. Specifically, document metadata is any information that better represents a document or improves the understanding of it. The most common document metadata include title, author, and edit time, which are auto-generated at the time of file creation. Content-based metadata, such as information from graphics and author-specific characteristics, also exists but is often overlooked. In this thesis, we study approaches to extracting and understanding such implicit content-based document metadata in machine-printed and handwritten documents. The two key contributions of this thesis are (a) graphics metadata, where we offer new approaches to extract and understand information graphics, and (b) handwritten text metadata, where we seek to capture author-specific feature representations.

Electronic reproduction.
Ann Arbor, Mich. :
ProQuest,
2018

Mode of access: World Wide Web

ISBN: 9780355311020

Subjects--Topical Terms:
573171 Computer science.

Index Terms--Genre/Form:
554714 Electronic books.

Metadata Analysis in Unstructured Documents Using Classical and Deep Learning Methods.
LDR 04658ntm a2200385K 4500
001 912172
005 20180608102941.5
006 m o u
007 cr mn||||a|a||
008 190606s2017 xx obm 000 0 eng d
020 $a 9780355311020
035 $a (MiAaPQ)AAI10622555
035 $a (MiAaPQ)buffalo:15483
035 $a AAI10622555
040 $a MiAaPQ $b eng $c MiAaPQ
100 1 $a Nair, Rathin Radhakrishnan. $3 1184410
245 1 0 $a Metadata Analysis in Unstructured Documents Using Classical and Deep Learning Methods.
264 0 $c 2017
300 $a 1 online resource (147 pages)
336 $a text $b txt $2 rdacontent
337 $a computer $b c $2 rdamedia
338 $a online resource $b cr $2 rdacarrier
500 $a Source: Dissertation Abstracts International, Volume: 79-03(E), Section: B.
500 $a Adviser: Venugopal Govindaraju.
502 $a Thesis (Ph.D.)--State University of New York at Buffalo, 2017.
504 $a Includes bibliographical references
520 $a Metadata, by definition, is any set of data that describes and provides information about other data. Specifically, document metadata is any information that better represents a document or improves the understanding of it. The most common document metadata include title, author, and edit time, which are auto-generated at the time of file creation. Content-based metadata, such as information from graphics and author-specific characteristics, also exists but is often overlooked. In this thesis, we study approaches to extracting and understanding such implicit content-based document metadata in machine-printed and handwritten documents. The two key contributions of this thesis are (a) graphics metadata, where we offer new approaches to extract and understand information graphics, and (b) handwritten text metadata, where we seek to capture author-specific feature representations.
520 $a The many publicly available collections of scanned handwritten documents are unstructured, but current approaches such as OCR assume that the document under consideration has a uniform structure. As a result, they struggle with non-uniform documents that mix text and non-text data; for example, in a line plot with text content, an OCR system would overlook the text. This calls for an automated technique to process and digitize these types of documents.
520 $a In the first part of the thesis, we study a class of deep learning architectures for segmenting the different parts of a document image. Specifically, to facilitate segmentation, we discuss a novel approach that uses convolutional neural networks (CNNs) to learn a feature representation for different types of data, such as machine-printed text, handwritten text, and graphics.
520 $a The second part of the thesis addresses obtaining information from non-text data, which opens an unexplored avenue of metadata and thus extends existing techniques for understanding text data. We discuss novel methods to extract text and non-text data from information graphics, such as line plots and phase diagrams, and to infer a representational message using Bayesian networks.
520 $a Finally, we discuss a neural network model that performs adaptive handwriting recognition and works with limited labeled data. Long Short-Term Memory (LSTM) networks are a class of recurrent neural networks that have been used in handwriting recognition for years. We postulate that each author follows a unique style in both handwriting and sentence formulation, and we therefore developed an adaptive LSTM-based handwriting recognition model. We exploit these user-specific features by adapting neural networks to better recognize handwritten text.
520 $a In summary, this thesis discusses an end-to-end system for converting a collection of documents into a digital archive that can be indexed and searched. We implement a CNN-based network to spot the different sections in individual pages of the collection. On the identified text sections, we propose an LSTM and a neural-network-based language model for recognition and transcription. Finally, we discuss approaches to handling non-text data; because understanding graphics requires well-defined goals, we focus specifically on information graphics such as line plots and phase diagrams.
533 $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2018
538 $a Mode of access: World Wide Web
650 4 $a Computer science. $3 573171
650 4 $a Artificial intelligence. $3 559380
655 7 $a Electronic books. $2 local $3 554714
690 $a 0984
690 $a 0800
710 2 $a ProQuest Information and Learning Co. $3 1178819
710 2 $a State University of New York at Buffalo. $b Computer Science and Engineering. $3 1180201
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10622555 $z click for full text (PQDT)