GPU Memory Architecture Optimization.
Dai, Hongwen.
Record type:
Bibliographic - language material, manuscript : Monograph/item
Title/Author:
GPU Memory Architecture Optimization.
Author:
Dai, Hongwen.
Description:
1 online resource (110 pages)
Notes:
Source: Dissertation Abstracts International, Volume: 79-04(E), Section: B.
Contained By:
Dissertation Abstracts International, 79-04B(E).
Subject:
Computer engineering.
Electronic resource:
click for full text (PQDT)
ISBN:
9780355458282
LDR    05089ntm a2200373Ki 4500
001    920690
005    20181203094032.5
006    m o u
007    cr mn||||a|a||
008    190606s2017 xx obm 000 0 eng d
020    $a 9780355458282
035    $a (MiAaPQ)AAI10708343
035    $a AAI10708343
040    $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1  $a Dai, Hongwen. $3 1195559
245 10 $a GPU Memory Architecture Optimization.
264  0 $c 2017
300    $a 1 online resource (110 pages)
336    $a text $b txt $2 rdacontent
337    $a computer $b c $2 rdamedia
338    $a online resource $b cr $2 rdacarrier
500    $a Source: Dissertation Abstracts International, Volume: 79-04(E), Section: B.
500    $a Adviser: Huiyang Zhou.
502    $a Thesis (Ph.D.)--North Carolina State University, 2017.
504    $a Includes bibliographical references
520    $a General-purpose computation on graphics processing units (GPGPU) has become prevalent in high performance computing. Besides massive multithreading, GPUs have adopted multi-level cache hierarchies to mitigate the long off-chip memory access latency. However, the massive multithreading also causes significant cache thrashing and severe memory pipeline stalls. In this dissertation, we study the inefficiencies of memory request handling in GPUs and propose architectural designs that improve performance for both single-kernel and concurrent-kernel execution.
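The cache-thrashing effect described in the abstract can be sketched with a toy replay model (illustrative only; the cache size, warp count, and access pattern below are invented, not taken from the dissertation). A fully associative LRU cache serves round-robin streams from many warps: once the combined working set exceeds cache capacity, cross-round reuse stops hitting.

```python
from collections import OrderedDict

def hit_rate(num_warps, lines_per_warp, cache_lines, rounds=4):
    """Replay round-robin warp accesses through a fully associative LRU cache."""
    cache = OrderedDict()  # line address -> None, ordered oldest-first
    hits = accesses = 0
    for _ in range(rounds):
        for line in range(lines_per_warp):
            for warp in range(num_warps):
                addr = (warp, line)
                accesses += 1
                if addr in cache:
                    hits += 1
                    cache.move_to_end(addr)  # refresh recency
                else:
                    cache[addr] = None
                    if len(cache) > cache_lines:
                        cache.popitem(last=False)  # evict LRU line
    return hits / accesses

# Few warps: the combined working set (4 x 8 = 32 lines) fits in 64 lines,
# so every round after the first hits.
print(hit_rate(num_warps=4, lines_per_warp=8, cache_lines=64))   # 0.75
# Many warps: 32 x 8 = 256 lines overflow the cache and every access misses.
print(hit_rate(num_warps=32, lines_per_warp=8, cache_lines=64))  # 0.0
```

The point of the sketch is that thrashing is a function of thread count, not of any single warp's locality, which is why the dissertation attacks it at the request-handling level rather than per-warp.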
520    $a First, I present our model-driven approach to GPU cache bypassing. We propose a simple yet effective performance model that estimates the number of cache hits, and of reservation failures due to cache-miss-related resource congestion, as a function of the number of warps/thread-blocks that access or bypass the cache. Based on the model, we design a cost-effective hardware scheme that identifies the optimal number of warps/thread-blocks to bypass the L1 D-cache. The key difference from prior work on GPU cache bypassing is that we do not rely on accurate prediction of hot cache lines. Unlike warp throttling, we do not limit the number of active warps, so we can exploit the available thread-level parallelism (TLP) and the otherwise underutilized NoC (network-on-chip) and off-chip memory bandwidth. Furthermore, our scheme is simple to implement and does not alter the existing cache organization.
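The model-driven selection can be sketched as a toy cost model (the formulas and all parameters below are invented for illustration and are not the dissertation's model): warps that access the cache share its capacity and its MSHRs, while bypassing warps never hit but also never trigger reservation failures. Sweeping the bypass count and picking the best score mimics the hardware search.

```python
def model(total_warps, bypass, cache_lines, warp_lines, mshrs):
    """Toy estimate of (hits, reservation failures) for a given bypass count."""
    cached = total_warps - bypass
    if cached == 0:
        return 0.0, 0.0
    footprint = cached * warp_lines
    # Fraction of cached lines that survive until reuse (capacity sharing).
    hit_frac = min(1.0, cache_lines / footprint)
    hits = footprint * hit_frac
    misses = footprint - hits
    # Misses beyond MSHR capacity retry and count as reservation failures.
    failures = max(0.0, misses - mshrs)
    return hits, failures

def best_bypass(total_warps, cache_lines=256, warp_lines=64, mshrs=32):
    """Sweep bypass counts; reward hits, penalize reservation failures."""
    def utility(b):
        hits, failures = model(total_warps, b, cache_lines, warp_lines, mshrs)
        return hits - failures
    return max(range(total_warps + 1), key=utility)

# With 16 warps of 64 lines each and a 256-line cache, the toy model keeps
# 4 warps in the cache (their footprint exactly fits) and bypasses 12.
print(best_bypass(16))  # 12
# When everything already fits, the model bypasses nothing.
print(best_bypass(4))   # 0
```

Note the key property the dissertation describes: all warps stay active, so the bypassed warps still consume NoC and DRAM bandwidth instead of being throttled.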
520    $a Second, I present our study on establishing a sound baseline for GPU memory architecture research. We thoroughly investigate the performance impact of different design choices and suggest a set of sound configurations for future studies. First, we show that advanced cache indexing functions greatly reduce conflict misses and improve cache performance on GPUs. Second, we demonstrate that of the two cache line allocation policies, allocate-on-fill delivers better performance than allocate-on-miss. Third, we show that the number of MSHRs (miss status holding registers) is an important design factor to explore. Fourth, we demonstrate that modulo mapping of memory partitions may cause severe partition camping, underutilizing both DRAM bandwidth and the capacity of the banked L2 cache, whereas XOR mapping greatly mitigates the problem. Finally, we show that the limit on how many in-flight bypassed requests can realistically be supported should be taken into account in GPU cache-bypassing studies to obtain reliable results and conclusions.
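The partition-camping contrast between modulo and XOR mapping can be demonstrated with a small address sketch (the mapping functions below are generic textbook forms chosen for illustration, not the exact hash the dissertation evaluates). With a power-of-two stride, modulo mapping sends every request to one partition; XOR-folding higher address bits into the index spreads the same stream evenly.

```python
from collections import Counter

def partition_modulo(addr, num_parts=8):
    # Partition index = low bits of the line address.
    return addr % num_parts

def partition_xor(addr, num_parts=8):
    # XOR a higher-order bit group into the low bits before indexing.
    return (addr ^ (addr // num_parts)) % num_parts

# A strided stream whose stride equals the partition count: the classic
# partition-camping pattern.
addrs = [i * 8 for i in range(64)]
print(Counter(partition_modulo(a) for a in addrs))  # all 64 camp on partition 0
print(Counter(partition_xor(a) for a in addrs))     # 8 requests per partition
```

Camping serializes the stream behind one DRAM channel and one L2 slice; the XOR hash recovers the full banked bandwidth for the same addresses, which is why the baseline study flags the mapping choice as a first-order design decision.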
520    $a Third, I present our schemes to accelerate concurrent kernel execution (CKE) on GPUs. We show that although intra-SM sharing has the potential to better utilize resources within an SM, negative interference among kernels may undermine overall performance. Specifically, because concurrent kernels share the memory subsystem, one kernel, even a compute-intensive one, may starve because it cannot issue memory requests in time. Moreover, severe L1 D-cache thrashing and memory pipeline stalls caused by one kernel, especially a memory-intensive one, impact the other kernels, further hurting overall performance. In this study, we investigate various schemes to overcome these problems in intra-SM sharing. We first show that cache partitioning techniques proposed for CPUs are not effective on GPUs. We then propose two schemes to reduce memory pipeline stalls: the first balances memory accesses from individual kernels; the second limits the number of in-flight memory instructions issued by each kernel. Our proposed schemes significantly improve the system throughput and fairness of two state-of-the-art intra-SM sharing schemes, Warped-Slicer and SMK, with lightweight hardware overhead.
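The second scheme, limiting each kernel's in-flight memory instructions, amounts to a per-kernel credit counter. A minimal sketch follows (the class, its name, and the limit value are hypothetical, chosen only to illustrate the mechanism): a kernel that exhausts its credits stalls its own memory instructions without blocking co-resident kernels.

```python
class InflightLimiter:
    """Hypothetical per-kernel credit counter: a kernel may issue a memory
    instruction only while it holds fewer than `limit` requests in flight."""
    def __init__(self, limit):
        self.limit = limit
        self.inflight = {}  # kernel id -> outstanding request count

    def try_issue(self, kernel):
        n = self.inflight.get(kernel, 0)
        if n >= self.limit:
            return False  # stall this kernel's instruction, not the pipeline
        self.inflight[kernel] = n + 1
        return True

    def complete(self, kernel):
        # A response returned: release one credit.
        self.inflight[kernel] -= 1

lim = InflightLimiter(limit=2)
issued = [lim.try_issue("memK") for _ in range(4)]  # memory-intensive kernel
print(issued)                 # [True, True, False, False]
print(lim.try_issue("cmpK"))  # True: the co-running kernel is not starved
```

This is the property the dissertation targets: the memory-intensive kernel's backlog no longer monopolizes the shared memory pipeline, so the compute-intensive co-runner keeps issuing in time.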
533    $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2018
538    $a Mode of access: World Wide Web
650  4 $a Computer engineering. $3 569006
650  4 $a Computer science. $3 573171
650  4 $a Electrical engineering. $3 596380
655  7 $a Electronic books. $2 local $3 554714
690    $a 0464
690    $a 0984
690    $a 0544
710 2  $a ProQuest Information and Learning Co. $3 1178819
710 2  $a North Carolina State University. $b Computer Engineering. $3 1195560
773 0  $t Dissertation Abstracts International $g 79-04B(E).
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10708343 $z click for full text (PQDT)