National Formosa University

Optimizing Memory Efficiency For Many-Core Architecture.

Record type:	Bibliographic - language material, manuscript : Monograph/item
Title/Author:	Optimizing Memory Efficiency For Many-Core Architecture./
Author:	Li, Chao.
Extent:	1 online resource (158 pages)
Notes:	Source: Dissertation Abstracts International, Volume: 78-08(E), Section: B.
Contained By:	Dissertation Abstracts International 78-08B(E).
Subject:	Computer engineering. -
Electronic resource:	click for full text (PQDT)
ISBN:	9781369621365

Optimizing Memory Efficiency For Many-Core Architecture. - 1 online resource (158 pages)

Source: Dissertation Abstracts International, Volume: 78-08(E), Section: B.

Thesis (Ph.D.)--North Carolina State University, 2016.

Includes bibliographical references

Massively parallel, throughput-oriented processors such as graphics processing units (GPUs) leverage high thread-level parallelism to overlap long-latency memory accesses with computation. On-chip memory resources, including register files, shared memory, and data caches, which were designed to provide high-bandwidth, low-latency data accesses, remain critical to application performance.

Electronic reproduction. Ann Arbor, Mich. : ProQuest, 2018

Mode of access: World Wide Web

ISBN: 9781369621365

Subjects--Topical Terms:
569006 Computer engineering.

Index Terms--Genre/Form:
554714 Electronic books.

Optimizing Memory Efficiency For Many-Core Architecture.
LDR:04736ntm a2200385Ki 4500
001 919582
005 20181129115238.5
006 m o u
007 cr mn||||a|a||
008 190606s2016 xx obm 000 0 eng d
020 $a 9781369621365
035 $a (MiAaPQ)AAI10583457
035 $a AAI10583457
040 $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1 $a Li, Chao. $3 795910
245 1 0 $a Optimizing Memory Efficiency For Many-Core Architecture.
264 0 $c 2016
300 $a 1 online resource (158 pages)
336 $a text $b txt $2 rdacontent
337 $a computer $b c $2 rdamedia
338 $a online resource $b cr $2 rdacarrier
500 $a Source: Dissertation Abstracts International, Volume: 78-08(E), Section: B.
500 $a Adviser: Huiyang Zhou.
502 $a Thesis (Ph.D.)--North Carolina State University, 2016.
504 $a Includes bibliographical references
520 $a Massively parallel, throughput-oriented processors such as graphics processing units (GPUs) leverage high thread-level parallelism to overlap long-latency memory accesses with computation. On-chip memory resources, including register files, shared memory, and data caches, which were designed to provide high-bandwidth, low-latency data accesses, remain critical to application performance.
520 $a In this dissertation, we study these memory resources and propose optimizations to improve memory efficiency for GPU architectures across the whole stack, from hardware architecture and compilers to application-level algorithms. First, I will present our work on understanding the tradeoffs between software-managed and hardware-managed caches in GPUs. On-chip caches can be managed by either software-managed or hardware-managed schemes. State-of-the-art accelerators, such as the NVIDIA Fermi or Kepler GPUs, support both software-managed caches, also known as shared memory, and hardware-managed L1 data caches (D-caches). Shared memory and the L1 D-cache on a GPU use the same physical storage, and their capacities can be configured at runtime. In this work, we present an in-depth study that reveals interesting and sometimes unexpected tradeoffs between shared memory and the hardware-managed L1 D-caches in GPU architecture.
520 $a Secondly, I will present our novel compiler scheme for the judicious utilization of on-chip memory resources on GPUs. Managing these intricate on-chip memory resources is nontrivial for application developers. Moreover, the varying on-chip resources across different GPU generations make performance portability a daunting challenge. In this study, we propose a compiler-driven automatic data placement scheme that refines GPU programs by altering data placement among different on-chip resources to achieve both performance enhancement and performance portability.
520 $a Thirdly, I will present our study on novel cache optimization for GPU architecture. On-chip L1 D-caches are critical resources for providing high-bandwidth and low-latency data accesses. We observe that the memory access streams to L1 D-caches for many applications contain a significant number of requests with low reuse, which greatly reduces cache efficacy. We propose an efficient locality monitoring mechanism to dynamically filter the access stream on cache insertion such that only the data with high reuse and short reuse distances are stored in the L1 D-cache. We propose a design that integrates the locality filtering functionality into the decoupled tag store of the current L1 D-cache through simple and cost-effective hardware extensions.
520 $a Finally, I will present our study on algorithm-level memory efficiency optimization for GPUs. We optimize the memory efficiency for accelerating deep Convolutional Neural Networks (CNNs), the state-of-the-art machine learning algorithm, on GPUs. Existing work on GPU-accelerated deep CNNs mainly focuses on the computational efficiency of CNNs, while the memory efficiency of CNNs has been largely overlooked. In this study, we look into the memory efficiency of various CNN layers on GPUs and reveal the performance implications of both data layouts and memory access patterns. Then we propose a set of methods to optimize memory efficiency for accelerating CNNs on GPUs. The experimental results demonstrate the effectiveness of our memory optimizations and their universal effects on different types of layers and various complete networks.
533 $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2018
538 $a Mode of access: World Wide Web
650 4 $a Computer engineering. $3 569006
650 4 $a Computer science. $3 573171
650 4 $a Electrical engineering. $3 596380
655 7 $a Electronic books. $2 local $3 554714
690 $a 0464
690 $a 0984
690 $a 0544
710 2 $a ProQuest Information and Learning Co. $3 1178819
710 2 $a North Carolina State University. $3 845424
773 0 $t Dissertation Abstracts International $g 78-08B(E).
856 4 0 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=10583457 $z click for full text (PQDT)
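
Note on the abstract: the third study describes filtering the access stream on cache insertion so that only data with high reuse and short reuse distances enter the L1 D-cache. The sketch below is only a rough illustration of that general idea, not the dissertation's design. It assumes a fully associative LRU cache and a small table of recently missed addresses (the hypothetical `monitor`); a line is inserted only on its second miss while it is still in that table. All class names, function names, and sizes here are assumptions made for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative LRU cache that inserts every missing line."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # address -> True, oldest first
        self.hits = 0

    def _touch(self, addr):
        """Return True and refresh recency if addr is cached."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            self.hits += 1
            return True
        return False

    def _insert(self, addr):
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[addr] = True

    def access(self, addr):
        if not self._touch(addr):
            self._insert(addr)


class FilteredCache(LRUCache):
    """LRU cache that inserts a line only after a recent earlier miss.

    The monitor is an assumed stand-in for a locality monitor: it
    remembers the last `monitor_size` missed addresses.
    """

    def __init__(self, capacity, monitor_size):
        super().__init__(capacity)
        self.monitor_size = monitor_size
        self.monitor = OrderedDict()

    def access(self, addr):
        if self._touch(addr):
            return
        if addr in self.monitor:
            # Second miss within the monitor window: treat as short reuse.
            del self.monitor[addr]
            self._insert(addr)
        else:
            # First miss: remember the address but bypass the cache.
            if len(self.monitor) >= self.monitor_size:
                self.monitor.popitem(last=False)
            self.monitor[addr] = True


def run(cache, rounds):
    """Replay a stream mixing 4 hot lines with never-reused lines."""
    stream_addr = 1000
    for _ in range(rounds):
        for hot in range(4):
            cache.access(hot)
            for _ in range(2):  # two streaming accesses per hot access
                cache.access(stream_addr)
                stream_addr += 1
    return cache.hits


if __name__ == "__main__":
    print("plain LRU hits:", run(LRUCache(4), 10))
    print("filtered hits: ", run(FilteredCache(4, 16), 10))
```

On this synthetic stream, the streaming lines keep evicting the hot lines from the plain LRU cache, while the filtered cache keeps the hot lines resident once each has missed twice. Real GPU access streams and the hardware tag-store integration described in the abstract are not modeled here.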