Fusing Multimodal Knowledge in Language Models.
Record type:
Bibliographic - Language material, manuscript : Monograph/item
Title/Author:
Fusing Multimodal Knowledge in Language Models.
Author:
Michihiro Yasunaga.
Physical description:
1 online resource (226 pages)
Notes:
Source: Dissertations Abstracts International, Volume: 85-11, Section: B.
Contained By:
Dissertations Abstracts International, 85-11B.
Subject:
Computer science.
Electronic resource:
click for full text (PQDT)
ISBN:
9798382230481
MARC record:
LDR    03897ntm a22003857 4500
001    1146332
005    20240812064411.5
006    m o d
007    cr bn ---uuuuu
008    250605s2024 xx obm 000 0 eng d
020    $a 9798382230481
035    $a (MiAaPQ)AAI31255788
035    $a (MiAaPQ)dz688yd5162
035    $a AAI31255788
040    $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1  $a Michihiro Yasunaga. $3 1471706
245 10 $a Fusing Multimodal Knowledge in Language Models.
264  0 $c 2024
300    $a 1 online resource (226 pages)
336    $a text $b txt $2 rdacontent
337    $a computer $b c $2 rdamedia
338    $a online resource $b cr $2 rdacarrier
500    $a Source: Dissertations Abstracts International, Volume: 85-11, Section: B.
500    $a Advisor: Jure Leskovec; Percy Liang.
502    $a Thesis (Ph.D.)--Stanford University, 2024.
504    $a Includes bibliographical references
520    $a Language models, such as GPT-4, have the capability to generate textual responses to user queries. They are used across various tasks, including question answering, translation, summarization, and personal assistance. However, to create more versatile AI assistants, these models need to handle more diverse and complex tasks involving domain or visual knowledge, such as answering medical questions and explaining or generating images. This necessity motivates the development of models that can access and leverage diverse knowledge sources beyond text, such as databases and images. In this thesis, we aim to develop language models capable of using multimodal knowledge, encompassing text, knowledge graphs, and images, to address various user queries. Text provides broad and contextually rich knowledge, knowledge graphs often supply structured domain knowledge, and images facilitate various visual applications. This thesis consists of five chapters. The first chapter introduces methods for language models to efficiently learn knowledge from textual data. Specifically, we train language models on a sequence of multiple related documents, encouraging them to learn and reason about knowledge with long-range dependencies. This approach yields strong performance on complex long-context and multi-step reasoning tasks. In the second chapter, we introduce methods that enable language models to harness knowledge graph information. Specifically, we develop a new model architecture, a hybrid of language models and graph neural networks, along with a training objective that fuses text and knowledge graph representations. This method demonstrates strong performance on tasks involving domain knowledge, such as medical question answering. In the third chapter, to empower language models to use and generate visual content alongside textual information, we design unified multimodal models capable of encoding, retrieving, and decoding interleaved sequences of text and images. The model employs a retriever to fetch textual or visual knowledge and integrates it into a multimodal Transformer that encodes and decodes both text and images using token representations. Finally, in the fourth and fifth chapters, we demonstrate the application of textual, structured, and visual knowledge fusion techniques to solve practical healthcare tasks, including clinical trial outcome prediction and multimodal medical question answering. In summary, this thesis builds models capable of comprehending and generating multimodal content, spanning text, knowledge graphs, and images.
533    $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2024
538    $a Mode of access: World Wide Web
650  4 $a Computer science. $3 573171
653    $a Database
653    $a Language models
653    $a Summarization
653    $a AI assistants
655  7 $a Electronic books. $2 local $3 554714
690    $a 0984
690    $a 0800
710 2  $a ProQuest Information and Learning Co. $3 1178819
710 2  $a Stanford University. $3 1184533
773 0  $t Dissertations Abstracts International $g 85-11B.
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=31255788 $z click for full text (PQDT)
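
Note: the 520 abstract above mentions a hybrid of language models and graph neural networks trained with an objective that fuses text and knowledge-graph representations. The sketch below is only a rough, generic illustration of what such text-KG fusion can look like; it is not the thesis's actual architecture, and every name, dimension, and design choice in it (the projection layers, the cross-attention step, mean pooling, a scoring head) is an assumption made for illustration only.

# Minimal, illustrative sketch (NOT the thesis's method) of fusing language-model
# text representations with knowledge-graph node representations via cross-attention.
import torch
import torch.nn as nn


class TextKGFusion(nn.Module):
    """Fuse a sequence of text token states with a set of KG node states."""

    def __init__(self, text_dim: int = 768, kg_dim: int = 200, fused_dim: int = 256):
        super().__init__()
        # Project both modalities into a shared space before attention.
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.kg_proj = nn.Linear(kg_dim, fused_dim)
        # Text tokens attend over KG nodes (cross-attention).
        self.cross_attn = nn.MultiheadAttention(fused_dim, num_heads=4, batch_first=True)
        # Scoring head, e.g. for answer scoring in knowledge-intensive QA.
        self.score = nn.Linear(fused_dim, 1)

    def forward(self, text_states: torch.Tensor, kg_states: torch.Tensor) -> torch.Tensor:
        # text_states: (batch, num_tokens, text_dim), e.g. LM hidden states
        # kg_states:   (batch, num_nodes,  kg_dim),  e.g. GNN node embeddings
        q = self.text_proj(text_states)
        kv = self.kg_proj(kg_states)
        fused, _ = self.cross_attn(q, kv, kv)   # tokens enriched with KG information
        pooled = fused.mean(dim=1)              # simple mean pooling over tokens
        return self.score(pooled).squeeze(-1)   # one score per example


if __name__ == "__main__":
    model = TextKGFusion()
    text = torch.randn(2, 16, 768)   # stand-in LM hidden states
    kg = torch.randn(2, 32, 200)     # stand-in GNN node embeddings
    print(model(text, kg).shape)     # torch.Size([2])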