Compressing Language Models Using Low-Rank Decomposition and Characterizing the Accuracy - Efficiency Trade-Offs.
Record type:
Bibliographic - Language material, manuscript : Monograph/item
Title/Author:
Compressing Language Models Using Low-Rank Decomposition and Characterizing the Accuracy - Efficiency Trade-Offs.
Author:
Moar, Chakshu.
Extent:
1 online resource (53 pages)
Notes:
Source: Masters Abstracts International, Volume: 85-11.
Contained By:
Masters Abstracts International, 85-11.
Subject:
Computer engineering. - Electrical engineering.
Electronic resource:
click for full text (PQDT)
ISBN:
9798382745343
Moar, Chakshu. Compressing Language Models Using Low-Rank Decomposition and Characterizing the Accuracy - Efficiency Trade-Offs. - 1 online resource (53 pages)
Source: Masters Abstracts International, Volume: 85-11.
Thesis (M.S.)--University of California, Irvine, 2024.
Includes bibliographical references
Large language models (LLMs) have emerged and demonstrated general problem-solving capabilities with a single model. However, the model size has increased dramatically, to billions of parameters, to enable such broad problem-solving capabilities. In addition, due to the dominance of matrix-matrix and matrix-vector multiplications in LLMs, the compute-to-model-size ratio is significantly lower than that of convolutional neural networks (CNNs). This shift pushes LLMs from a computation-bound regime to a memory-bound regime. Therefore, optimizing the memory footprint and traffic is an important optimization direction for LLMs today.

Model compression methods such as quantization and parameter pruning have been actively explored to reduce the memory footprint and traffic. However, the accuracy-efficiency trade-off of rank pruning (i.e., low-rank decomposition) for LLMs is not yet well understood. Therefore, in this work, we characterize the accuracy-efficiency trade-off of a low-rank decomposition method, Tucker decomposition, on recent language models, including an open-source LLM, Llama 2.

We formalize the low-rank decomposition design space and show that it is huge (e.g., O(2^37) for Llama2-7B). To navigate such a huge design space, we characterize it and prune ineffective regions using the insights from our characterization results (e.g., we can reduce the pruned ranks to 1 without a noticeable model accuracy drop). On the pruned design space, we perform thorough case studies of accuracy-efficiency trade-offs using six widely used LLM benchmarks on BERT and Llama 2 models. Our results show that we can achieve a 9% model size reduction with minimal accuracy drops, ranging from 4%p to 10%p depending on the difficulty of the benchmark, without any retraining to recover accuracy after decomposition. These results show that low-rank decomposition can be a promising direction for LLM-based applications that require real-time service at scale (e.g., AI agent assistants and real-time coding assistants), where latency is as important as model accuracy.
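For readers unfamiliar with the technique the abstract describes, here is a minimal illustrative sketch (not taken from the thesis) of low-rank compression of a single weight matrix via truncated SVD, which is the two-dimensional special case of the Tucker decomposition studied in the thesis. The matrix shape and target rank below are arbitrary assumptions chosen only to show the parameter-count arithmetic.

```python
# Illustrative sketch only (not the thesis's code): compress one weight matrix
# with a rank-r truncated SVD, the 2-D special case of Tucker decomposition.
import numpy as np

def low_rank_compress(W: np.ndarray, rank: int):
    """Return factors (A, B) such that W is approximated by A @ B."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (m, rank); singular values folded into the left factor
    B = Vt[:rank, :]             # (rank, n)
    return A, B

# Hypothetical shape, similar in size to one 4096x4096 projection matrix in Llama2-7B.
m, n, rank = 4096, 4096, 512
W = np.random.randn(m, n).astype(np.float32)

A, B = low_rank_compress(W, rank)
orig_params = m * n               # parameters stored before compression
comp_params = A.size + B.size     # parameters stored after compression
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"parameters: {orig_params} -> {comp_params} ({comp_params / orig_params:.1%})")
print(f"relative reconstruction error: {rel_err:.3f}")
```

Roughly speaking, the per-layer choice of which weights to decompose and to what rank is what creates the large design space (O(2^37) for Llama2-7B) that the abstract refers to.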
Electronic reproduction. Ann Arbor, Mich. : ProQuest, 2024.
Mode of access: World Wide Web
ISBN: 9798382745343
Subjects--Topical Terms: Computer engineering.
Subjects--Index Terms: Deep neural networks
Index Terms--Genre/Form: Electronic books.
LDR    03563ntm a22003977 4500
001    1146468
005    20240812064623.5
006    m o d
007    cr bn ---uuuuu
008    250605s2024 xx obm 000 0 eng d
020    $a 9798382745343
035    $a (MiAaPQ)AAI30997552
035    $a AAI30997552
040    $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1  $a Moar, Chakshu. $3 1471857
245 10 $a Compressing Language Models Using Low-Rank Decomposition and Characterizing the Accuracy - Efficiency Trade-Offs.
264  0 $c 2024
300    $a 1 online resource (53 pages)
336    $a text $b txt $2 rdacontent
337    $a computer $b c $2 rdamedia
338    $a online resource $b cr $2 rdacarrier
500    $a Source: Masters Abstracts International, Volume: 85-11.
500    $a Advisor: Kwon, Hyoukjun.
502    $a Thesis (M.S.)--University of California, Irvine, 2024.
504    $a Includes bibliographical references
520    $a Large language models (LLMs) have emerged and demonstrated general problem-solving capabilities with a single model. However, the model size has increased dramatically, to billions of parameters, to enable such broad problem-solving capabilities. In addition, due to the dominance of matrix-matrix and matrix-vector multiplications in LLMs, the compute-to-model-size ratio is significantly lower than that of convolutional neural networks (CNNs). This shift pushes LLMs from a computation-bound regime to a memory-bound regime. Therefore, optimizing the memory footprint and traffic is an important optimization direction for LLMs today. Model compression methods such as quantization and parameter pruning have been actively explored to reduce the memory footprint and traffic. However, the accuracy-efficiency trade-off of rank pruning (i.e., low-rank decomposition) for LLMs is not yet well understood. Therefore, in this work, we characterize the accuracy-efficiency trade-off of a low-rank decomposition method, Tucker decomposition, on recent language models, including an open-source LLM, Llama 2. We formalize the low-rank decomposition design space and show that it is huge (e.g., O(2^37) for Llama2-7B). To navigate such a huge design space, we characterize it and prune ineffective regions using the insights from our characterization results (e.g., we can reduce the pruned ranks to 1 without a noticeable model accuracy drop). On the pruned design space, we perform thorough case studies of accuracy-efficiency trade-offs using six widely used LLM benchmarks on BERT and Llama 2 models. Our results show that we can achieve a 9% model size reduction with minimal accuracy drops, ranging from 4%p to 10%p depending on the difficulty of the benchmark, without any retraining to recover accuracy after decomposition. These results show that low-rank decomposition can be a promising direction for LLM-based applications that require real-time service at scale (e.g., AI agent assistants and real-time coding assistants), where latency is as important as model accuracy.
533    $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2024
538    $a Mode of access: World Wide Web
650  4 $a Computer engineering. $3 569006
650  4 $a Electrical engineering. $3 596380
653    $a Deep neural networks
653    $a Large language models
653    $a Machine learning
653    $a Model compression
653    $a Optimization
653    $a Tensor decomposition
655  7 $a Electronic books. $2 local $3 554714
690    $a 0464
690    $a 0544
710 2  $a ProQuest Information and Learning Co. $3 1178819
710 2  $a University of California, Irvine. $b Electrical and Computer Engineering. $3 1192604
773 0  $t Masters Abstracts International $g 85-11.
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30997552 $z click for full text (PQDT)