National Formosa University

Investigating Supplemental Context for Word Sense Disambiguation.

Record type:	Bibliographic - language material, manuscript : Monograph/item
Title/Author:	Investigating Supplemental Context for Word Sense Disambiguation./
Author:	Black, Alan E.
Physical description:	1 online resource (185 pages)
Notes:	Source: Dissertation Abstracts International, Volume: 79-12(E), Section: A.
Contained By:	Dissertation Abstracts International 79-12A(E).
Subject:	Information science. -
Electronic resource:	click for full text (PQDT)
ISBN:	9780438155190

Thesis (Ph.D.)--Drexel University, 2018.

Includes bibliographical references

The key to word sense disambiguation is context. Because most words have multiple meanings (i.e., senses), ambiguity arises when a word is interpreted in isolation. Additional information is required to resolve the ambiguity, and this additional information is referred to as context. Microtext (e.g., tweets) presents a special case in which messages are limited to a small number of characters, severely limiting the context available in the text itself. This work is motivated by the importance of Twitter as a unique data source and by the difficulty of precise data collection given the daunting volume of messages that flow through the system on a regular basis. Twitter has become a valuable source of information for academic researchers, industry analysts, marketing organizations, and others. The ultimate goal is to develop a tool that helps users collect relevant Twitter data from the vast sea of messages in the Twitter search index, or from streaming data sources, by leveraging additional context opportunities to aid in word sense disambiguation of a search term.

While the language used in tweets presents problems, the Twitter platform offers opportunities to help make sense of users' messages. Data retrieval mechanisms made available through a variety of APIs (application programming interfaces) allow developers to request additional information that may be used to form a supplemented context within which to better understand a message. In particular, this combination of message text and supplemental context may be employed to disambiguate among the meanings (i.e., senses) of a search word. We investigate two sources of supplemental context: previous tweets from a message's author (i.e., a Twitter timeline) and tweets within a temporal window relative to the creation timestamp of the tweet under investigation. Contexts for the former were collected on demand from a RESTful Twitter API, while context for the latter was collected in bulk using a streaming API, resulting in an experimental pool of over 10 million tweets. We propose a simple heuristic that can aid in the automated collection of supplemental context. The results of the heuristic's application are explored using a standard approach to word sense disambiguation combined with a variety of underlying concept-to-concept similarity measures.

In addition, we develop a Blue Standard approach to generating sense-tagged test data. This work is motivated by issues and limitations associated with employing human coders. The performance of systems that process human (natural) language has always been evaluated against a "gold standard" that is human derived. Engaging people to create tagged corpora for use as gold standards in natural language processing research is a time-consuming and expensive proposition. Reuse of existing data is often the only viable option, even if the domain of discourse, writing style (e.g., formal vs. informal), or other significant attributes of the text or coding are not well suited to answering a particular research question. We propose a semi-supervised instance tagging methodology designed to produce sense-tagged Twitter data for the purpose of studying word sense disambiguation in social media. The proposed approach leverages the foundational work of Yarowsky (1993), who demonstrated that collocations are strongly indicative of word sense. We use the Blue Standard approach to build a test set of over 380,000 sense-tagged tweets for use in our WSD experiments.
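
Illustrative note (not part of the catalog record or the dissertation): the abstract's observation that collocations are strongly indicative of word sense can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the author's Blue Standard procedure. The target word "apple", the sense labels, the seed collocate lists, the three-token window, and the function name `tag_sense` are all hypothetical choices made for illustration.

```python
import re
from typing import Optional

# Hypothetical seed collocates for the ambiguous search word "apple".
# These lists are illustrative assumptions, not data from the dissertation.
SEED_COLLOCATIONS = {
    "apple%company": {"iphone", "ipad", "macbook", "ios"},
    "apple%fruit": {"pie", "juice", "orchard", "cider"},
}

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tag_sense(text: str, target: str = "apple", window: int = 3) -> Optional[str]:
    """Tag a message with a sense only when the evidence is unambiguous.

    A sense is assigned if collocates of exactly one sense occur within
    `window` tokens of the target word; otherwise the message is left
    untagged (None) rather than guessed.
    """
    tokens = TOKEN_RE.findall(text.lower())
    matched = set()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        # Tokens immediately before and after this occurrence of the target.
        nearby = set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
        for sense, collocates in SEED_COLLOCATIONS.items():
            if nearby & collocates:
                matched.add(sense)
    return matched.pop() if len(matched) == 1 else None


if __name__ == "__main__":
    for msg in ["Baking an apple pie tonight", "New Apple iPhone announced", "I like apple"]:
        print(msg, "->", tag_sense(msg))
```

Leaving conflicting or evidence-free messages untagged trades coverage for precision, which seems the safer default when the tagged output is meant to serve as test data.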

Electronic reproduction. Ann Arbor, Mich. : ProQuest, 2018

Mode of access: World Wide Web

ISBN: 9780438155190

Subjects--Topical Terms:
561178 Information science.

Index Terms--Genre/Form:
554714 Electronic books.

Investigating Supplemental Context for Word Sense Disambiguation.
LDR:04732ntm a2200337Ki 4500
001 920926
005 20181227095853.5
006 m o u
007 cr mn||||a|a||
008 190606s2018 xx obm 000 0 eng d
020 $a 9780438155190
035 $a (MiAaPQ)AAI10838841
035 $a (MiAaPQ)drexel:11578
035 $a AAI10838841
040 $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1 $a Black, Alan E. $3 1195869
245 1 0 $a Investigating Supplemental Context for Word Sense Disambiguation.
264 0 $c 2018
300 $a 1 online resource (185 pages)
336 $a text $b txt $2 rdacontent
337 $a computer $b c $2 rdamedia
338 $a online resource $b cr $2 rdacarrier
500 $a Source: Dissertation Abstracts International, Volume: 79-12(E), Section: A.
500 $a Adviser: Rosina O. Weber.
502 $a Thesis (Ph.D.)--Drexel University, 2018.
504 $a Includes bibliographical references
533 $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2018
538 $a Mode of access: World Wide Web
650 4 $a Information science. $3 561178
650 4 $a Computer science. $3 573171
655 7 $a Electronic books. $2 local $3 554714
690 $a 0723
690 $a 0984
710 2 $a ProQuest Information and Learning Co. $3 1178819
710 2 $a Drexel University. $b Information Studies. $3 1184437
773 0 $t Dissertation Abstracts International $g 79-12A(E).
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10838841 $z click for full text (PQDT)