Tackling GPU Memory Size Limitations.
Record Type: Bibliographic - Language material, manuscript : Monograph/item
Title: Tackling GPU Memory Size Limitations.
Author: Geil, Afton Noelle.
Description: 1 online resource (125 pages)
Notes: Source: Dissertations Abstracts International, Volume: 85-01, Section: B.
Contained By: Dissertations Abstracts International, 85-01B.
Subject: Computer science. - Computer engineering.
Electronic Resource: click for full text (PQDT)
ISBN: 9798379775056
Thesis (Ph.D.)--University of California, Davis, 2023.
Includes bibliographical references.
Abstract:
GPUs are now in widespread use for many non-graphics applications, such as machine learning, scientific computation, and computer vision, but many challenges remain before they reach their full potential. Some algorithms and data structure operations, originally developed with sequential CPU architectures in mind, appear inherently serial and require new methods to take advantage of the many-core GPU architecture. This dissertation describes methods for using this massive parallelism to solve problems on large datasets while also grappling with the limitations of GPU memory size.

First, we present an approach to maximum clique enumeration (finding all maximum cliques in a graph) on the GPU via an iterative breadth-first traversal of the search tree. To achieve high performance on the GPU, our implementation aims to maximize available parallelism and minimize divergence between threads. The biggest challenge for a breadth-first implementation is the memory required to store all of the intermediate clique candidates. To mitigate this issue, we employ a variety of strategies to prune away non-maximum candidates, and we present a thorough examination of the performance and memory benefits of each of these options. We also explore a windowing strategy as a middle ground between breadth-first and depth-first approaches, and investigate the resulting trade-off between parallel efficiency and memory usage. Our results demonstrate that when we are able to manage the memory requirements, our approach achieves high throughput on large graphs, indicating that it is a good choice for GPU performance. We demonstrate an average speedup of 1.9x over previous parallel work, and obtain our best performance on graphs with low average degree.

Finally, we present our GPU implementation of the quotient filter, a compact data structure designed to answer approximate membership queries. The quotient filter is similar to the better-known Bloom filter; however, in addition to set insertion and membership queries, it also supports deletions and merging of filters without rehashing the data set. Furthermore, the quotient filter can be extended to include counters without increasing its memory footprint. We implement two types of quotient filters on the GPU: the standard quotient filter and the rank-and-select-based quotient filter. We describe the parallelization of all filter operations, including a comparison of the four different methods we devised for parallelizing quotient filter construction. In solving this problem, we found that we needed an operation similar to a parallel scan, but for non-associative operators; one outcome of this work is a set of methods for computing scan-type operations over a non-associative operator in parallel.

For membership queries, we achieve a throughput of up to 1.13 billion items/second with the rank-and-select-based quotient filter, a 3x speedup over the BloomGPU filter. Our fastest filter build method achieves a speedup of 2.1-3.1x over BloomGPU, with a peak throughput of 621 million items/second and a rate of 516 million items/second for a 70% full filter. However, our filters do not perform incremental updates as fast as the BloomGPU filter: for a batch of 2 million items, we perform incremental inserts at 81 million items/second, a 2.5x slowdown compared to BloomGPU's 201 million items/second. The quotient filter's memory footprint is comparable to that of a Bloom filter.
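The breadth-first clique-enumeration strategy described in the abstract can be illustrated with a small sequential sketch. Everything below is a hypothetical CPU-side illustration, not the dissertation's CUDA implementation: the real system expands each frontier level in parallel across GPU threads, while this version only mirrors the level-by-level expansion, the memory pressure from the frontier, and a prune-by-bound rule.

# Minimal CPU sketch of breadth-first maximum clique enumeration.
# Each frontier entry is (clique, candidates): a clique built so far and the
# vertices adjacent to every vertex in it. The frontier is the memory cost
# the abstract refers to; pruning shrinks it.

def max_cliques_bfs(adj):
    """adj: dict mapping vertex -> set of neighbours (undirected graph)."""
    best_size, best = 0, []
    # Start from single-vertex cliques; restricting candidates to
    # higher-numbered vertices generates each clique exactly once.
    frontier = [([v], {u for u in adj[v] if u > v}) for v in adj]
    while frontier:
        next_frontier = []
        for clique, cands in frontier:
            if len(clique) > best_size:
                best_size, best = len(clique), [clique]
            elif len(clique) == best_size:
                best.append(clique)
            for v in cands:
                new_cands = {u for u in cands if u in adj[v] and u > v}
                # Prune: even taking every remaining candidate cannot
                # beat (or tie) the best clique found so far.
                if len(clique) + 1 + len(new_cands) >= best_size:
                    next_frontier.append((clique + [v], new_cands))
        frontier = next_frontier  # advance one BFS level of the search tree
    return best

adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}  # a triangle plus a pendant
print(max_cliques_bfs(adj))  # [[0, 1, 2]]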
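A rough sketch of the quotienting idea behind the quotient filter, under simplifying assumptions: per-slot Python lists stand in for the compact bit-packed array with occupied/continuation/shifted metadata bits that the real structure uses, and the class and parameter names here are invented for illustration. It shows why deletions and merges work without rehashing, which a plain Bloom filter cannot offer: every filter stores the same (quotient, remainder) split of each item's hash.

import hashlib

class ToyQuotientFilter:
    """Simplified quotient filter: hash h splits into a quotient
    (h >> r, the home slot) and an r-bit remainder (the fingerprint)."""

    def __init__(self, q_bits=16, r_bits=8):
        self.q_bits, self.r_bits = q_bits, r_bits
        self.slots = [[] for _ in range(1 << q_bits)]

    def _qr(self, item):
        h = int.from_bytes(hashlib.blake2b(item.encode()).digest()[:8], "big")
        h &= (1 << (self.q_bits + self.r_bits)) - 1
        return h >> self.r_bits, h & ((1 << self.r_bits) - 1)

    def insert(self, item):
        q, r = self._qr(item)
        if r not in self.slots[q]:
            self.slots[q].append(r)

    def query(self, item):   # may give false positives, never false negatives
        q, r = self._qr(item)
        return r in self.slots[q]

    def delete(self, item):  # supported, unlike a plain Bloom filter
        q, r = self._qr(item)
        if r in self.slots[q]:
            self.slots[q].remove(r)

    def merge(self, other):  # no rehashing: fingerprints already align
        for q, rs in enumerate(other.slots):
            for r in rs:
                if r not in self.slots[q]:
                    self.slots[q].append(r)

f = ToyQuotientFilter()
f.insert("gpu")
print(f.query("gpu"), f.query("cpu"))  # True False (barring a false positive)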
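On the scan-for-non-associative-operators point: one common way to recover a parallel scan, shown below as a generic illustration rather than as any of the dissertation's actual methods, is to lift each element to a function whose composition is associative and then scan the compositions. The affine recurrence x_i = a_i*x_{i-1} + b_i is the textbook case.

# The recurrence step x -> a*x + b is not associative as written, but
# composing affine maps (a, b) is: applying (a2, b2) after (a1, b1) gives
# a2*(a1*x + b1) + b2 = (a2*a1)*x + (a2*b1 + b2). An ordinary scan over
# these pairs therefore evaluates the whole recurrence.

def combine(f, g):
    """Composition 'g after f' of affine maps f=(a1,b1), g=(a2,b2)."""
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)

def scan(items, op):
    """Inclusive scan, written sequentially for clarity; associativity of
    `op` is what lets a GPU implementation run it in O(log n) steps."""
    out = [items[0]]
    for x in items[1:]:
        out.append(op(out[-1], x))
    return out

# x_i = a_i * x_{i-1} + b_i with x_0 = 0:
steps = [(2, 1), (3, 0), (1, 5), (2, 2)]
prefix_maps = scan(steps, combine)
xs = [a * 0 + b for a, b in prefix_maps]  # apply each composed map to x_0
print(xs)  # [1, 3, 8, 18]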
Electronic reproduction. Ann Arbor, Mich. : ProQuest, 2024.
Mode of access: World Wide Web.
ISBN: 9798379775056
Subjects--Topical Terms: Computer science.
Subjects--Index Terms: Approximate membership query.
Index Terms--Genre/Form: Electronic books.
MARC Record:
LDR  04957ntm a22003977 4500
001  1143554
005  20240517104558.5
006  m o d
007  cr mn ---uuuuu
008  250605s2023 xx obm 000 0 eng d
020  $a 9798379775056
035  $a (MiAaPQ)AAI30241909
035  $a AAI30241909
040  $a MiAaPQ $b eng $c MiAaPQ $d NTU
100  1  $a Geil, Afton Noelle. $3 1468276
245  10 $a Tackling GPU Memory Size Limitations.
264  0  $c 2023
300  $a 1 online resource (125 pages)
336  $a text $b txt $2 rdacontent
337  $a computer $b c $2 rdamedia
338  $a online resource $b cr $2 rdacarrier
500  $a Source: Dissertations Abstracts International, Volume: 85-01, Section: B.
500  $a Advisor: Owens, John D.
502  $a Thesis (Ph.D.)--University of California, Davis, 2023.
504  $a Includes bibliographical references
533  $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2024
538  $a Mode of access: World Wide Web
650  4  $a Computer science. $3 573171
650  4  $a Computer engineering. $3 569006
653  $a Approximate membership query
653  $a Graph algorithms
653  $a Graphics processing units
653  $a Maximum clique
653  $a Parallel data structures
653  $a Quotient filter
655  7  $a Electronic books. $2 local $3 554714
690  $a 0464
690  $a 0984
710  2  $a University of California, Davis. $b Electrical and Computer Engineering. $3 1178925
710  2  $a ProQuest Information and Learning Co. $3 1178819
773  0  $t Dissertations Abstracts International $g 85-01B.
856  40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30241909 $z click for full text (PQDT)