Efficient Training Methods for Transformer Networks.
Record type: Bibliographic - Language material, manuscript : Monograph/item
Title/Author: Efficient Training Methods for Transformer Networks.
Author: Lialin, Vladislav.
Description: 1 online resource (194 pages)
Notes: Source: Dissertations Abstracts International, Volume: 85-11, Section: B.
Contained By: Dissertations Abstracts International, 85-11B.
Subject: Computer science.
Electronic resource: click for full text (PQDT)
ISBN: 9798382733890
LDR    04130ntm a22004217 4500
001    1150412
005    20241028051441.5
006    m o d
007    cr bn ---uuuuu
008    250605s2024 xx obm 000 0 eng d
020    $a 9798382733890
035    $a (MiAaPQ)AAI30696415
035    $a AAI30696415
040    $a MiAaPQ $b eng $c MiAaPQ $d NTU
100 1  $a Lialin, Vladislav. $3 1476889
245 10 $a Efficient Training Methods for Transformer Networks.
264  0 $c 2024
300    $a 1 online resource (194 pages)
336    $a text $b txt $2 rdacontent
337    $a computer $b c $2 rdamedia
338    $a online resource $b cr $2 rdacarrier
500    $a Source: Dissertations Abstracts International, Volume: 85-11, Section: B.
500    $a Advisor: Rumshisky, Anna.
502    $a Thesis (Ph.D.)--University of Massachusetts Lowell, 2024.
504    $a Includes bibliographical references
520    $a Over the past five years, deep learning methods have changed the landscape of natural language processing (NLP) and machine learning. Novel architectures, such as Transformers, in combination with old training techniques, such as language modeling, have become a universal way to approach any NLP task. Moreover, scaling laws have revealed that model performance can be reliably and predictably improved by increasing pre-training data or model size. Soon, training costs grew from using just one GPU for a day to using thousands of GPUs over multiple months. Unlike in the 2012-2018 period, neural network architecture is no longer the key factor; good scaling properties and training effectiveness are. In this thesis, we focus on computational efficiency in contemporary NLP and on what models learn during training. We first present our early work on a practical application of Continual Learning (CL) for neural semantic parsing. Then, we study at scale how different pre-training objectives, amounts of pre-training data, and architecture variations affect the model's linguistic capabilities. To develop novel efficient training methods, we then establish, categorize, and survey the state-of-the-art methods for parameter-efficient fine-tuning. We introduce ReLoRA, the first-of-its-kind method that utilizes low-rank updates to train high-rank networks. ReLoRA consistently outperforms LoRA in both fine-tuning and pre-training large transformer models up to 1.3B parameters, and it becomes more effective as the model size grows. Our largest experiment demonstrates a RAM usage reduction of 5 GB per GPU and up to a 40% wall-clock time reduction, depending on the hardware setup. Further, our results show performance similar to regular training, making ReLoRA a promising candidate for improving the efficiency of large model training. We conclude that our novel approach, ReLoRA, serves as a significant advancement in reducing the computational and memory costs associated with training large-scale NLP models. Limitations of this study include its focus on larger models, where model weights and optimizer states take up a significant share of GPU memory, and the extent to which neural network training can be approximated by a sequence of low-rank updates. However, our research demonstrates promising results for models up to 1B parameters, and we expect them to perform even better at larger scale, where ReLoRA can bring even larger improvements. Overall, the methods and insights presented in this thesis have the potential to significantly influence future developments in NLP and machine learning, pushing the boundaries of what is computationally feasible.
533    $a Electronic reproduction. $b Ann Arbor, Mich. : $c ProQuest, $d 2024
538    $a Mode of access: World Wide Web
650  4 $a Computer science. $3 573171
650  4 $a Linguistics. $3 557829
650  4 $a Language. $3 571568
653    $a Computational efficiency
653    $a Deep learning
653    $a Natural language processing
653    $a Parameter-efficient fine-tuning
653    $a Scaling laws
653    $a Transformers
655  7 $a Electronic books. $2 local $3 554714
690    $a 0800
690    $a 0679
690    $a 0984
690    $a 0290
710 2  $a University of Massachusetts Lowell. $b Computer Science. $3 1437794
710 2  $a ProQuest Information and Learning Co. $3 1178819
773 0  $t Dissertations Abstracts International $g 85-11B.
856 40 $u http://pqdd.sinica.edu.tw/twdaoapp/servlet/advanced?query=30696415 $z click for full text (PQDT)
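Note: the abstract above describes ReLoRA as training a high-rank network through a sequence of low-rank updates. The following is a minimal, illustrative PyTorch sketch of that general idea, not the author's implementation; the module name, rank, merge interval, and reset details are assumptions, and the published method also involves optimizer-state resets and a jagged learning-rate schedule that are only approximated here.

# Illustrative sketch only: a linear layer whose total update is accumulated
# through repeated low-rank "bursts" that are merged into a frozen base weight.
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight)
        self.weight.requires_grad_(False)      # base weight is frozen between merges
        self.A = nn.Parameter(torch.zeros(rank, in_features))
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.normal_(self.A, std=0.02)      # B stays zero, so the initial update B @ A is zero

    def forward(self, x):
        return x @ (self.weight + self.B @ self.A).t()

    @torch.no_grad()
    def merge_and_reset(self):
        # Fold the current low-rank update into the base weight and restart the
        # factors; repeating this lets the accumulated weight change become
        # high-rank even though each individual update is rank-limited.
        self.weight += self.B @ self.A
        nn.init.normal_(self.A, std=0.02)
        self.B.zero_()

# Toy loop on random data (hypothetical sizes and schedule): merge every 100 steps
# and rebuild the optimizer so stale factor statistics do not carry over.
layer = LowRankLinear(64, 64, rank=4)
opt = torch.optim.AdamW([layer.A, layer.B], lr=1e-3)
for step in range(300):
    x = torch.randn(16, 64)
    loss = layer(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if (step + 1) % 100 == 0:
        layer.merge_and_reset()
        opt = torch.optim.AdamW([layer.A, layer.B], lr=1e-3)

Because each merge folds B @ A into the frozen base weight before restarting the factors, only the small factors and their optimizer state need gradients at any time, which is the memory and wall-clock saving the abstract reports.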