Parameter-efficient fine-tuning of large-scale pre-trained language models

Cited by: 215
Authors
Ding, Ning [1 ,2 ]
Qin, Yujia [1 ,2 ]
Yang, Guang [1 ]
Wei, Fuchao [1 ]
Yang, Zonghan [1 ]
Su, Yusheng [1 ,2 ]
Hu, Shengding [1 ,2 ]
Chen, Yulin [3 ]
Chan, Chi-Min [1 ]
Chen, Weize [1 ,2 ]
Yi, Jing [1 ,2 ]
Zhao, Weilin [1 ,2 ]
Wang, Xiaozhi [1 ]
Liu, Zhiyuan [1 ,2 ]
Zheng, Hai-Tao [3 ]
Chen, Jianfei [1 ]
Liu, Yang [1 ]
Tang, Jie [1 ,2 ]
Li, Juanzi [1 ]
Sun, Maosong [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[2] Beijing Acad Artificial Intelligence, Beijing, Peoples R China
[3] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
All Open Access; Hybrid Gold;
DOI
10.1038/s42256-023-00626-4
Chinese Library Classification (CLC) number
TP18 [Theory of Artificial Intelligence];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the prevalence of pre-trained language models (PLMs) and the pre-training-fine-tuning paradigm, it has been continuously shown that larger models tend to yield better performance. However, as PLMs scale up, fine-tuning and storing all the parameters is prohibitively costly and eventually becomes practically infeasible. This necessitates a new branch of research focusing on the parameter-efficient adaptation of PLMs, which optimizes a small portion of the model parameters while keeping the rest fixed, drastically cutting down computation and storage costs. In general, this line of work demonstrates that large-scale models can be effectively stimulated by the optimization of a few parameters. Despite the various designs, here we discuss and analyse the approaches under a more consistent and accessible term, 'delta-tuning', where 'delta', a mathematical notation often used to denote changes, is borrowed to refer to the portion of parameters that are 'changed' during training. We formally describe the problem and propose a unified categorization criterion for existing delta-tuning methods to explore their correlations and differences. We also discuss the theoretical principles underlying the effectiveness of delta-tuning and interpret them from the perspectives of optimization and optimal control. Furthermore, we provide a holistic empirical study on over 100 natural language processing tasks and investigate various aspects of delta-tuning. With comprehensive study and analysis, our research demonstrates the theoretical and practical properties of delta-tuning in the adaptation of PLMs.

Training a deep neural network can be costly, but training time is reduced when a pre-trained network can be adapted to different use cases. Ideally, only a small number of parameters need to be changed in this process of fine-tuning, which can then be more easily distributed. In this Analysis, different methods of fine-tuning with only a small number of parameters are compared on a large set of natural language processing tasks.
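A minimal sketch of the core delta-tuning idea described in the abstract, assuming a PyTorch-style model: freeze all pre-trained weights and optimize only a small 'delta' subset of parameters (here, the bias vectors, in the spirit of BitFit-style tuning). This is an illustration of the general principle, not the paper's specific method; the model and variable names are placeholders.

```python
# Illustrative delta-tuning sketch: freeze the pre-trained backbone and
# optimize only a tiny subset of parameters (the bias vectors here).
import torch
from torch import nn

# Stand-in for a pre-trained language model; any nn.Module works the same way.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)

# 1. Freeze every parameter of the pre-trained backbone.
for param in model.parameters():
    param.requires_grad = False

# 2. Re-enable gradients only for the chosen 'delta' parameters (biases here).
delta_params = []
for name, param in model.named_parameters():
    if name.endswith("bias"):
        param.requires_grad = True
        delta_params.append(param)

# 3. The optimizer sees only the delta parameters, so adaptation cost
#    scales with the size of the delta, not with the full model.
optimizer = torch.optim.AdamW(delta_params, lr=1e-3)

total = sum(p.numel() for p in model.parameters())
tuned = sum(p.numel() for p in delta_params)
print(f"tuning {tuned}/{total} parameters ({100 * tuned / total:.2f}%)")
```

Other delta-tuning designs surveyed in the paper (for example, adapters, prefix-tuning or low-rank reparameterizations) follow the same pattern but introduce small new modules as the delta instead of selecting existing parameters.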
Pages: 220 / +
Page count: 25