Parameter-efficient fine-tuning of large-scale pre-trained language models

Cited by: 215
Authors
Ding, Ning [1 ,2 ]
Qin, Yujia [1 ,2 ]
Yang, Guang [1 ]
Wei, Fuchao [1 ]
Yang, Zonghan [1 ]
Su, Yusheng [1 ,2 ]
Hu, Shengding [1 ,2 ]
Chen, Yulin [3 ]
Chan, Chi-Min [1 ]
Chen, Weize [1 ,2 ]
Yi, Jing [1 ,2 ]
Zhao, Weilin [1 ,2 ]
Wang, Xiaozhi [1 ]
Liu, Zhiyuan [1 ,2 ]
Zheng, Hai-Tao [3 ]
Chen, Jianfei [1 ]
Liu, Yang [1 ]
Tang, Jie [1 ,2 ]
Li, Juanzi [1 ]
Sun, Maosong [1 ,2 ]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[2] Beijing Acad Artificial Intelligence, Beijing, Peoples R China
[3] Tsinghua Univ, Tsinghua Shenzhen Int Grad Sch, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
All Open Access; Hybrid Gold;
DOI
10.1038/s42256-023-00626-4
Chinese Library Classification
TP18 [Theory of artificial intelligence];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the prevalence of pre-trained language models (PLMs) and the pre-training-fine-tuning paradigm, it has been continuously shown that larger models tend to yield better performance. However, as PLMs scale up, fine-tuning and storing all the parameters is prohibitively costly and eventually becomes practically infeasible. This necessitates a new branch of research focusing on the parameter-efficient adaptation of PLMs, which optimizes a small portion of the model parameters while keeping the rest fixed, drastically cutting down computation and storage costs. In general, this line of work demonstrates that large-scale models can be effectively stimulated by optimizing only a few parameters. Despite the various designs, here we discuss and analyse the approaches under a more consistent and accessible term, 'delta-tuning', where 'delta', a mathematical notation often used to denote changes, is borrowed to refer to the portion of parameters that are 'changed' during training. We formally describe the problem and propose a unified categorization criterion for existing delta-tuning methods to explore their correlations and differences. We also discuss the theoretical principles underlying the effectiveness of delta-tuning and interpret them from the perspectives of optimization and optimal control. Furthermore, we provide a holistic empirical study on over 100 natural language processing tasks and investigate various aspects of delta-tuning. With comprehensive study and analysis, our research demonstrates the theoretical and practical properties of delta-tuning in the adaptation of PLMs.

Training a deep neural network can be costly, but training time is reduced when a pre-trained network can be adapted to different use cases. Ideally, only a small number of parameters needs to be changed in this process of fine-tuning, and the resulting updates can then be distributed more easily. In this Analysis, different methods of fine-tuning with only a small number of parameters are compared on a large set of natural language processing tasks.
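The core idea summarized above, optimizing only a small 'delta' of parameters while the pre-trained weights stay frozen, can be illustrated with a minimal PyTorch-style sketch. The low-rank update used here is just one of the delta-tuning families analysed in the paper, and the module names, sizes and toy objective below are hypothetical, not taken from the paper's code.

import torch
import torch.nn as nn

class LowRankDelta(nn.Module):
    """Wrap a frozen pre-trained linear layer and add a small trainable low-rank update."""
    def __init__(self, frozen_linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.frozen = frozen_linear
        for p in self.frozen.parameters():
            p.requires_grad = False          # keep the pre-trained weights fixed
        d_in, d_out = frozen_linear.in_features, frozen_linear.out_features
        # The "delta": two small matrices whose product perturbs the frozen output.
        self.down = nn.Linear(d_in, rank, bias=False)
        self.up = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.up.weight)       # start exactly at the unmodified model

    def forward(self, x):
        return self.frozen(x) + self.up(self.down(x))

# Toy usage: only the delta parameters receive gradients and go to the optimizer.
pretrained = nn.Linear(768, 768)             # stand-in for one pre-trained weight matrix
adapted = LowRankDelta(pretrained, rank=8)
trainable = [p for p in adapted.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)

x = torch.randn(4, 768)
loss = adapted(x).pow(2).mean()              # placeholder objective for illustration
loss.backward()
optimizer.step()
print(sum(p.numel() for p in trainable), "trainable parameters out of",
      sum(p.numel() for p in adapted.parameters()))

In this toy setting roughly 12 thousand of the approximately 600 thousand parameters are updated (about 2%), which mirrors the computation and storage savings the abstract describes.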
Pages: 220 / +
Number of pages: 25