Efficient Training Methods for Transformer Networks.
Record type: Bibliographic - Language material, manuscript : Monograph/item
Title/Author: Efficient Training Methods for Transformer Networks.
Author: Lialin, Vladislav.
Description: 1 online resource (194 pages)
Notes: Source: Dissertations Abstracts International, Volume: 85-11, Section: B.
Contained By: Dissertations Abstracts International, 85-11B.
Subject: Computer science.
Electronic resource: click for full text (PQDT)
ISBN: 9798382733890
LDR    04130ntm a22004217 4500
001    1150412
005    20241028051441.5
006    m o d
007    cr bn ---uuuuu
008    250605s2024 xx obm 000 0 eng d
020    $a 9798382733890
035    $a (MiAaPQ)AAI30696415
035    $a AAI30696415
040    $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1  $a Lialin, Vladislav. $3 1476889
245 10 $a Efficient Training Methods for Transformer Networks.
264  0 $c 2024
300    $a 1 online resource (194 pages)
336    $a text $b txt $2 rdacontent
337    $a computer $b c $2 rdamedia
338    $a online resource $b cr $2 rdacarrier
500    $a Source: Dissertations Abstracts International, Volume: 85-11, Section: B.
500    $a Advisor: Rumshisky, Anna.
502    $a Thesis (Ph.D.)--University of Massachusetts Lowell, 2024.
504    $a Includes bibliographical references
520    $a Over the past five years, deep learning methods have changed the landscape of natural language processing (NLP) and machine learning. Novel architectures, such as Transformers, in combination with old training techniques, such as language modeling, have become a universal way to approach any NLP task. Moreover, scaling laws have revealed that model performance can be reliably and predictably improved by increasing pre-training data or model size. Soon, training costs grew from using just one GPU for a day to using thousands of GPUs over multiple months. Unlike in the 2012-2018 period, neural network architecture is no longer the key factor; good scaling properties and training effectiveness are. In this thesis, we focus on computational efficiency in contemporary NLP and on what models learn during training. We first present our early work on a practical application of Continual Learning (CL) for neural semantic parsing. Then, we study at scale how different pre-training objectives, amounts of pre-training data, and architecture variations affect the model's linguistic capabilities. To develop novel efficient training methods, we then establish, categorize, and survey the state-of-the-art methods for parameter-efficient fine-tuning. We introduce ReLoRA, the first-of-its-kind method that utilizes low-rank updates to train high-rank networks. ReLoRA consistently outperforms LoRA in both fine-tuning and pre-training large transformer models up to 1.3B parameters, and it becomes more effective as the model size grows. Our largest experiment demonstrates a RAM usage reduction of 5 GB per GPU and up to a 40% wall-clock time reduction, depending on the hardware setup. Further, our results show performance similar to regular training, making ReLoRA a promising candidate for improving the efficiency of large model training. We conclude that our novel approach, ReLoRA, serves as a significant advancement in reducing the computational and memory costs associated with training large-scale NLP models. Limitations of this study include its focus on larger models, where model weights and optimizer states take up a significant share of GPU memory, and the extent to which neural network training can be approximated by a sequence of low-rank updates. However, our research demonstrates promising results for models up to 1B parameters, and we expect them to perform even better at larger scale, where ReLoRA can bring even larger improvements. Overall, the methods and insights presented in this thesis have the potential to significantly influence future developments in NLP and machine learning, pushing the boundaries of what is computationally feasible.
533    $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2024
538    $a Mode of access: World Wide Web
650  4 $a Computer science. $3 573171
650  4 $a Linguistics. $3 557829
650  4 $a Language. $3 571568
653    $a Computational efficiency
653    $a Deep learning
653    $a Natural language processing
653    $a Parameter-efficient fine-tuning
653    $a Scaling laws
653    $a Transformers
655  7 $a Electronic books. $2 local $3 554714
690    $a 0800
690    $a 0679
690    $a 0984
690    $a 0290
710 2  $a University of Massachusetts Lowell. $b Computer Science. $3 1437794
710 2  $a ProQuest Information and Learning Co. $3 1178819
773 0  $t Dissertations Abstracts International $g 85-11B.
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30696415 $z click for full text (PQDT)
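Note: the abstract above describes ReLoRA as training a high-rank network through a sequence of low-rank updates. The following is a minimal, illustrative PyTorch sketch of that general idea, not the author's implementation; the module name, rank, merge interval, and reset details are assumptions, and the published method also involves optimizer-state resets and a jagged learning-rate schedule that are only approximated here.

# Illustrative sketch only: a linear layer whose total update is accumulated
# through repeated low-rank "bursts" that are merged into a frozen base weight.
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)
        self.weight.requires_grad_(False)      # base weight is frozen between merges
        self.A = nn.Parameter(torch.zeros(rank, in_features))
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.normal_(self.A, std=0.02)      # B stays zero, so the initial update B @ A is zero

    def forward(self, x):
        return x @ (self.weight + self.B @ self.A).t()

    @torch.no_grad()
    def merge_and_reset(self):
        # Fold the current low-rank update into the base weight and restart the
        # factors; repeating this lets the accumulated weight change become
        # high-rank even though each individual update is rank-limited.
        self.weight += self.B @ self.A
        nn.init.normal_(self.A, std=0.02)
        self.B.zero_()

# Toy loop on random data (hypothetical sizes and schedule): merge every 100 steps
# and rebuild the optimizer so stale factor statistics do not carry over.
layer = LowRankLinear(64, 64, rank=4)
opt = torch.optim.AdamW([layer.A, layer.B], lr=1e-3)
for step in range(300):
    x = torch.randn(16, 64)
    loss = layer(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if (step + 1) % 100 == 0:
        layer.merge_and_reset()
        opt = torch.optim.AdamW([layer.A, layer.B], lr=1e-3)

Because each merge folds B @ A into the frozen base weight before restarting the factors, only the small factors and their optimizer state need gradients at any time, which is the memory and wall-clock saving the abstract reports.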