Explaining neural scaling laws
Publication type:
Article
Authors:
Bahri, Yasaman; Dyer, Ethan; Kaplan, Jared; Lee, Jaehoon; Sharma, Utkarsh
Affiliations:
Alphabet Inc.; DeepMind; Johns Hopkins University
Journal:
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
ISSN/ISBN:
0027-8424
DOI:
10.1073/pnas.2311878121
Publication date:
2024-07-02
Keywords:
integral-operators
eigenvalues
Abstract:
The population loss of trained deep neural networks often follows precise power-law scaling relations with either the size of the training dataset or the number of parameters in the network. We propose a theory that explains the origins of and connects these scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold. In the large width limit, this can be equivalently obtained from the spectrum of certain kernels, and we present evidence that large width and large dataset resolution-limited scaling exponents are related by a duality. We exhibit all four scaling regimes in the controlled setting of large random feature and pretrained models and test the predictions empirically on a range of standard architectures and datasets. We also observe several empirical relationships between datasets and scaling exponents under modifications of task and architecture aspect ratio. Our work provides a taxonomy for classifying different scaling regimes, underscores that there can be different mechanisms driving improvements in loss, and lends insight into the microscopic origin and relationships between scaling exponents.
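As a hedged illustration of the power-law scaling relations summarized in the abstract, the minimal Python sketch below fits L(D) ≈ c · D^(−α_D) to loss-versus-dataset-size measurements in log-log space. The dataset sizes, loss values, and the exponent used to generate them are hypothetical placeholders, not results from the paper; a real study would substitute measured test losses.

```python
# Minimal sketch (assumed setup, not the paper's code): estimate a
# dataset-size scaling exponent alpha_D from a power-law fit
#   L(D) ~ c * D**(-alpha_D)
# by linear regression in log-log coordinates.
import numpy as np

# Hypothetical measurements: training-set sizes and corresponding test losses,
# generated here from an assumed exponent of 0.35 plus small noise.
rng = np.random.default_rng(0)
dataset_sizes = np.array([1e3, 3e3, 1e4, 3e4, 1e5])
test_losses = 2.5 * dataset_sizes ** -0.35 * np.exp(rng.normal(0.0, 0.02, size=5))

# In log-log space the power law becomes a line: log L = log c - alpha_D * log D.
slope, intercept = np.polyfit(np.log(dataset_sizes), np.log(test_losses), 1)
alpha_D = -slope
print(f"estimated scaling exponent alpha_D ~ {alpha_D:.3f}")
```

The same log-log fit can be applied with the number of model parameters on the horizontal axis to estimate a model-size exponent; the abstract's variance-limited versus resolution-limited distinction concerns which mechanism sets the value of such exponents, not the fitting procedure itself.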