National Formosa University

Language Supervision for Computer Vision.

Record type:	Bibliographic - language material, manuscript : Monograph/item
Title/Author:	Language Supervision for Computer Vision.
Author:	Desai, Karan P.
Extent:	1 online resource (179 pages)
Note:	Source: Dissertations Abstracts International, Volume: 85-12, Section: B.
Contained By:	Dissertations Abstracts International 85-12B.
Subject:	Computer science.
Electronic resource:	click for full text (PQDT)
ISBN:	9798382739380

Thesis (Ph.D.)--University of Michigan, 2024.

Includes bibliographical references

Representation learning lies at the core of modern Artificial Intelligence. In computer vision, labeled image datasets like ImageNet have been the standard choice for representation learning. Despite being empirically successful, this approach is expensive to scale due to labeling costs. Moreover, the representation quality is limited by the size and diversity of datasets and their associated label ontologies.

My research explores using natural language supervision for computer vision. Using natural language allows us to go beyond fixed label ontologies and scale up to more general sources such as internet data. Toward this goal, my dissertation explores four problems: (1) Learning representations: I propose one of the first methods for language-supervised visual learning that uses image captioning as the training objective, showing its efficacy compared to ImageNet-trained methods on downstream tasks like object detection and segmentation. (2) Scaling data: I explore social media as a rich source of high-quality image descriptions and curate a dataset of 12 million image-text pairs while ensuring responsible curation practices. (3) Understanding data: It is difficult to comprehend the diversity of visual concepts present in millions of image-text pairs. I posit that images and text naturally organize into a tree-like hierarchy and propose an approach for learning representations that capture this hierarchy using tools from hyperbolic geometry. (4) Transfer to downstream tasks: Large vision-language models show impressive zero-shot transfer capabilities on image-level tasks like classification and retrieval. However, their transferability to pixel-level tasks like object detection and segmentation has relied on expensive labeled mask annotations. I propose an object detector to efficiently transfer pre-trained vision models to segment and classify visual objects without any fine-tuning, unlike existing detectors that train using orders of magnitude more labeled masks to achieve high performance.

In summary, my research affirms that using language supervision can drive the next leap of progress in computer vision and has immense utility in practical applications.

Electronic reproduction. Ann Arbor, Mich. : ProQuest, 2024

Mode of access: World Wide Web

ISBN: 9798382739380

Subjects--Topical Terms:

573171
Computer science.

Subjects--Index Terms:

Computer vision

Index Terms--Genre/Form:

554714
Electronic books.

LDR:03560ntm a22003977 4500
001 1149757
005 20241022112631.5
006 m o d
007 cr bn ---uuuuu
008 250605s2024 xx obm 000 0 eng d
020 $a 9798382739380
035 $a (MiAaPQ)AAI31348980
035 $a (MiAaPQ)umichrackham005447
035 $a AAI31348980
040 $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1 $a Desai, Karan P. $3 1476091
245 1 0 $a Language Supervision for Computer Vision.
264 0 $c 2024
300 $a 1 online resource (179 pages)
336 $a text $b txt $2 rdacontent
337 $a computer $b c $2 rdamedia
338 $a online resource $b cr $2 rdacarrier
500 $a Source: Dissertations Abstracts International, Volume: 85-12, Section: B.
500 $a Advisor: Johnson, Justin C.
502 $a Thesis (Ph.D.)--University of Michigan, 2024.
504 $a Includes bibliographical references
520 $a Representation learning lies at the core of modern Artificial Intelligence. In computer vision, labeled image datasets like ImageNet have been the standard choice for representation learning. Despite being empirically successful, this approach is expensive to scale due to labeling costs. Moreover, the representation quality is limited by the size and diversity of datasets and their associated label ontologies. My research explores using natural language supervision for computer vision. Using natural language allows us to go beyond fixed label ontologies and scale up to more general sources such as internet data. Toward this goal, my dissertation explores four problems: (1) Learning representations: I propose one of the first methods for language-supervised visual learning that uses image captioning as the training objective, showing its efficacy compared to ImageNet-trained methods on downstream tasks like object detection and segmentation. (2) Scaling data: I explore social media as a rich source of high-quality image descriptions and curate a dataset of 12 million image-text pairs while ensuring responsible curation practices. (3) Understanding data: It is difficult to comprehend the diversity of visual concepts present in millions of image-text pairs. I posit that images and text naturally organize into a tree-like hierarchy and propose an approach for learning representations that capture this hierarchy using tools from hyperbolic geometry. (4) Transfer to downstream tasks: Large vision-language models show impressive zero-shot transfer capabilities on image-level tasks like classification and retrieval. However, their transferability to pixel-level tasks like object detection and segmentation has relied on expensive labeled mask annotations. I propose an object detector to efficiently transfer pre-trained vision models to segment and classify visual objects without any fine-tuning, unlike existing detectors that train using orders of magnitude more labeled masks to achieve high performance. In summary, my research affirms that using language supervision can drive the next leap of progress in computer vision and has immense utility in practical applications.
533 $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2024
538 $a Mode of access: World Wide Web
650 4 $a Computer science. $3 573171
650 4 $a Computer engineering. $3 569006
653 $a Computer vision
653 $a Representation learning
653 $a Hyperbolic geometry
653 $a Natural language
655 7 $a Electronic books. $2 local $3 554714
690 $a 0984
690 $a 0800
690 $a 0464
710 2 $a ProQuest Information and Learning Co. $3 1178819
710 2 $a University of Michigan. $b Computer Science & Engineering. $3 1181967
773 0 $t Dissertations Abstracts International $g 85-12B.
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=31348980 $z click for full text (PQDT)