Democratizing protein language models with parameter-efficient fine-tuning

Publication type:
Article
Authors:
Sledzieski, Samuel; Kshirsagar, Meghana; Baek, Minkyung; Dodhia, Rahul; Ferres, Juan Lavista; Berger, Bonnie
Affiliations:
Microsoft; Massachusetts Institute of Technology (MIT); Seoul National University (SNU); Massachusetts Institute of Technology (MIT)
Journal:
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
ISSN/ISBN:
0027-8424
DOI:
10.1073/pnas.2405840121
Publication date:
2024-06-25
Keywords:
prediction
Abstract:
Proteomics has been revolutionized by large protein language models (PLMs), which learn unsupervised representations from large corpora of sequences. These models are typically fine-tuned in a supervised setting to adapt them to specific downstream tasks. However, the computational and memory footprint of fine-tuning (FT) large PLMs presents a barrier for many research groups with limited computational resources. Natural language processing has seen a similar explosion in model size, where these challenges have been addressed by methods for parameter-efficient fine-tuning (PEFT). In this work, we introduce this paradigm to proteomics by leveraging the parameter-efficient method LoRA and training new models for two important tasks: predicting protein-protein interactions (PPIs) and predicting the symmetry of homooligomer quaternary structures. We show that these approaches are competitive with traditional FT while requiring reduced memory and substantially fewer parameters. We additionally show that, for the PPI prediction task, training only the classification head also remains competitive with full FT while using five orders of magnitude fewer parameters, and that each of these methods outperforms state-of-the-art PPI prediction methods with substantially reduced compute. We further perform a comprehensive evaluation of the hyperparameter space, demonstrate that PEFT of PLMs is robust to variations in these hyperparameters, and elucidate where best practices for PEFT in proteomics differ from those in natural language processing. All of our model adaptation and evaluation code is available open source at https://github.com/microsoft/peft_proteomics. Thus, we provide a blueprint to democratize the power of PLM adaptation for groups with limited computational resources.
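
The two adaptation strategies discussed in the abstract (LoRA adaptation of a pretrained PLM, and training only the classification head) can be sketched roughly as follows using the Hugging Face transformers and peft libraries. This is a minimal illustration under assumed settings, not the authors' released implementation (which is in the repository linked above); the ESM-2 checkpoint name, LoRA rank, and target modules are assumptions chosen for illustration.

```python
# Minimal sketch of PEFT for a PLM classification task, assuming Hugging Face
# `transformers` + `peft`. Checkpoint, rank, and target modules are illustrative
# assumptions; see https://github.com/microsoft/peft_proteomics for the paper's code.
from transformers import AutoTokenizer, EsmForSequenceClassification
from peft import LoraConfig, get_peft_model

checkpoint = "facebook/esm2_t12_35M_UR50D"  # assumed small ESM-2 model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = EsmForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Strategy 1: LoRA -- inject low-rank adapters into the attention projections
# and train only those (plus the task head); the backbone stays frozen.
lora_config = LoraConfig(
    r=8,                                # low-rank dimension (assumption)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["query", "value"],  # ESM self-attention projections
    modules_to_save=["classifier"],     # keep the classification head trainable
)
lora_model = get_peft_model(model, lora_config)
lora_model.print_trainable_parameters()  # small fraction of total parameters

# Strategy 2: head-only training -- freeze the PLM entirely and update just
# the classification head.
head_model = EsmForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
for name, param in head_model.named_parameters():
    param.requires_grad = name.startswith("classifier")
```

Either model can then be trained with a standard supervised loop (or the transformers Trainer); the memory and parameter savings come from the fact that gradients and optimizer states are kept only for the adapter or head parameters.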