Parameter-efficient fine-tuning of large-scale pre-trained language models

Cited by: 215
Authors
Ding, Ning [1 ,2 ]
Qin, Yujia [1 ,2 ]
Yang, Guang [1 ]
Wei, Fuchao [1 ]
Yang, Zonghan [1 ]
Su, Yusheng [1 ,2 ]
Hu, Shengding [1 ,2 ]
Chen, Yulin [3 ]
Chan, Chi-Min [1 ]
Chen, Weize [1 ,2 ]
Yi, Jing [1 ,2 ]
Zhao, Weilin [1 ,2 ]
Wang, Xiaozhi [1 ]
Liu, Zhiyuan [1 ,2 ]
Zheng, Hai-Tao [3 ]
Chen, Jianfei [1 ]
Liu, Yang [1 ]
Tang, Jie [1 ,2 ]
Li, Juanzi [1 ]
Sun, Maosong [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[2] Beijing Acad Artificial Intelligence, Beijing, Peoples R China
[3] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
All Open Access; Hybrid Gold;
DOI
10.1038/s42256-023-00626-4
Chinese Library Classification
TP18 [Theory of artificial intelligence];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the prevalence of pre-trained language models (PLMs) and the pre-training-fine-tuning paradigm, it has been continuously shown that larger models tend to yield better performance. However, as PLMs scale up, fine-tuning and storing all the parameters is prohibitively costly and eventually becomes practically infeasible. This necessitates a new branch of research focusing on the parameter-efficient adaptation of PLMs, which optimizes a small portion of the model parameters while keeping the rest fixed, drastically cutting down computation and storage costs. In general, this line of work demonstrates that large-scale models can be effectively stimulated by optimizing only a few parameters. Despite the various designs, here we discuss and analyse the approaches under a more consistent and accessible term, 'delta-tuning', where 'delta', a mathematical notation often used to denote changes, is borrowed to refer to the portion of parameters that are 'changed' during training. We formally describe the problem and propose a unified categorization criterion for existing delta-tuning methods to explore their correlations and differences. We also discuss the theoretical principles underlying the effectiveness of delta-tuning and interpret them from the perspectives of optimization and optimal control. Furthermore, we provide a holistic empirical study on over 100 natural language processing tasks and investigate various aspects of delta-tuning. With comprehensive study and analysis, our research demonstrates the theoretical and practical properties of delta-tuning in the adaptation of PLMs.

Training a deep neural network can be costly, but training time is reduced when a pre-trained network can be adapted to different use cases. Ideally, only a small number of parameters needs to be changed in this process of fine-tuning, and the resulting updates can then be distributed more easily. In this Analysis, different methods of fine-tuning with only a small number of parameters are compared on a large set of natural language processing tasks.
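The core idea summarized above, optimizing only a small 'delta' of parameters while the pre-trained weights stay frozen, can be illustrated with a minimal PyTorch-style sketch. The low-rank update used here is just one of the delta-tuning families analysed in the paper, and the module names, sizes and toy objective below are hypothetical, not taken from the paper's code.

import torch
import torch.nn as nn

class LowRankDelta(nn.Module):
    """Wrap a frozen pre-trained linear layer and add a small trainable low-rank update."""
    def __init__(self, frozen_linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.frozen = frozen_linear
        for p in self.frozen.parameters():
            p.requires_grad = False          # keep the pre-trained weights fixed
        d_in, d_out = frozen_linear.in_features, frozen_linear.out_features
        # The "delta": two small matrices whose product perturbs the frozen output.
        self.down = nn.Linear(d_in, rank, bias=False)
        self.up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.up.weight)       # start exactly at the unmodified model

    def forward(self, x):
        return self.frozen(x) + self.up(self.down(x))

# Toy usage: only the delta parameters receive gradients and go to the optimizer.
pretrained = nn.Linear(768, 768)             # stand-in for one pre-trained weight matrix
adapted = LowRankDelta(pretrained, rank=8)
trainable = [p for p in adapted.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

x = torch.randn(4, 768)
loss = adapted(x).pow(2).mean()              # placeholder objective for illustration
loss.backward()
optimizer.step()
print(sum(p.numel() for p in trainable), "trainable parameters out of",
      sum(p.numel() for p in adapted.parameters()))

In this toy setting roughly 12 thousand of the approximately 600 thousand parameters are updated (about 2%), which mirrors the computation and storage savings the abstract describes.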
Pages: 220 / +
Number of pages: 25