End-to-End Fine-Grained Vision & Language Understanding.
Record type:
Bibliographic - Language material, manuscript : Monograph/item
Title proper/Author:
End-to-End Fine-Grained Vision & Language Understanding./
Author:
Kamath, Aishwarya.
Physical description:
1 online resource (211 pages)
Notes:
Source: Dissertations Abstracts International, Volume: 85-08, Section: B.
Contained By:
Dissertations Abstracts International, 85-08B.
Subject:
Computer science.
Electronic resource:
click for full text (PQDT)
ISBN:
9798381730746
LDR  03973ntm a22004097 4500
001  1149675
005  20241022112612.5
006  m o d
007  cr bn ---uuuuu
008  250605s2024 xx obm 000 0 eng d
020  $a 9798381730746
035  $a (MiAaPQ)AAI30811218
035  $a AAI30811218
040  $a MiAaPQ $b eng $c MiAaPQ $d NTU
100  1  $a Kamath, Aishwarya. $3 1476001
245  1 0  $a End-to-End Fine-Grained Vision & Language Understanding.
264  0  $c 2024
300  $a 1 online resource (211 pages)
336  $a text $b txt $2 rdacontent
337  $a computer $b c $2 rdamedia
338  $a online resource $b cr $2 rdacarrier
500  $a Source: Dissertations Abstracts International, Volume: 85-08, Section: B.
500  $a Advisor: LeCun, Yann; Cho, Kyunghyun.
502  $a Thesis (Ph.D.)--New York University, 2024.
504  $a Includes bibliographical references
520  $a This thesis focuses on the field of vision and language understanding, which aims to develop systems that can understand and reason about data from multiple modalities, such as images, videos and text. We start by discussing the recent history of computer vision, focusing on how advancements in natural language processing have accelerated progress in vision and language understanding. Nevertheless, visual features commonly employed were still restricted to a fixed vocabulary of predefined classes, limiting the scope of these systems. We propose an alternative approach to visual feature extraction that allows for detection of any object described in free-form text, resolving difficulties associated with pre-trained fixed vocabulary detectors. This approach, which we call "modulated detection", is end-to-end trainable, allowing the model to fine-tune the feature extraction to the task of interest. We demonstrate significant gains in performance across a variety of tasks while also providing interpretable predictions. While the idea of modulated detection opened up the doors to truly open-world image understanding, it still required expensive annotated data in the form of densely annotated image-text pairs, having bounding box annotations for noun phrases mentioned in the associated text. To combat this, we propose a novel coarse-to-fine grained pre-training strategy, coupled with a model architecture that makes it possible to scale up fine-grained pre-training using web-scale image captioning data. Further, we find that existing fine-grained vision and language benchmarks do not truly test the models' ability to understand context. This blind spot causes the benchmarks to significantly overestimate the capabilities of the models. To counteract this, we propose a novel task called Contextual Phrase Detection (CPD), along with a human annotated evaluation dataset called TRICD that provides a more accurate lens into the performance of current state-of-the-art models. Finally, we explore an application of vision and language understanding within the context of embodied agents, specifically focusing on Vision and Language Navigation (VLN). Our model achieves a new state-of-the-art on the multilingual VLN benchmark RxR by leveraging in-domain data augmentation. Remarkably, this success is attained solely through offline training without agent-environment interaction. This approach opens the doors to integrating embodied tasks like VLN into standard vision and language multi-task training.
533  $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2024
538  $a Mode of access: World Wide Web
650  4  $a Computer science. $3 573171
650  4  $a Computer engineering. $3 569006
653  $a Computer vision
653  $a Contextual phrase detection
653  $a Multimodal understanding
653  $a Natural language processing
653  $a Open-vocabulary detection
653  $a Vision and language navigation
655  7  $a Electronic books. $2 local $3 554714
690  $a 0984
690  $a 0464
690  $a 0800
710  2  $a ProQuest Information and Learning Co. $3 1178819
710  2  $a New York University. $b Center for Data Science. $3 1468521
773  0  $t Dissertations Abstracts International $g 85-08B.
856  4 0  $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30811218 $z click for full text (PQDT)
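For anyone who needs this record in machine-readable form, the sketch below shows one way the MARC display lines above could be read back into a simple Python structure. It is a minimal sketch that assumes the display layout used here (a three-character tag or LDR, optional indicators, then $-prefixed subfields separated by spaces); the function names are illustrative, and this is not a full ISO 2709 or MARCXML parser.

import re
from collections import defaultdict

# Minimal sketch: turn MARC display lines such as
#   650  4  $a Computer science. $3 573171
# into {tag: [{"indicators": ..., "subfields": {...}}, ...]}.
# Assumes the display layout shown above; not an ISO 2709 parser.

FIELD_RE = re.compile(r"^(LDR|[0-9]{3})\s+(.*)$")

def parse_display_line(line):
    """Split one display line into (tag, indicators, subfields)."""
    match = FIELD_RE.match(line.strip())
    if not match:
        return None
    tag, rest = match.group(1), match.group(2)
    if tag == "LDR" or tag < "010":
        # Leader and control fields carry a single value, no subfields.
        return tag, "", {"": rest.strip()}
    indicators, _, data = rest.partition("$")
    subfields = {}
    for chunk in ("$" + data).split(" $"):
        chunk = chunk.lstrip("$").strip()
        if chunk:
            code, _, value = chunk.partition(" ")
            subfields[code] = value.strip()
    return tag, indicators.strip(), subfields

def parse_record(lines):
    """Collect parsed lines into a dict keyed by MARC tag."""
    record = defaultdict(list)
    for line in lines:
        parsed = parse_display_line(line)
        if parsed:
            tag, indicators, subfields = parsed
            record[tag].append({"indicators": indicators, "subfields": subfields})
    return dict(record)

if __name__ == "__main__":
    sample = [
        "245  1 0  $a End-to-End Fine-Grained Vision & Language Understanding.",
        "650  4  $a Computer science. $3 573171",
    ]
    record = parse_record(sample)
    print(record["245"][0]["subfields"]["a"])   # title proper

Run over the field lines above, this yields the title from 245 $a, the subject headings from the repeated 650 and 653 fields, and the full-text URL from 856 $u.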