Fast and scalable ensemble learning method for versatile polygenic risk prediction

成果类型:
Article
署名作者:
Chen, Tony; Zhang, Haoyu; Mazumder, Rahul; Lin, Xihong
署名单位:
Harvard University; Harvard T.H. Chan School of Public Health; National Institutes of Health (NIH) - USA; NIH National Cancer Institute (NCI); NIH National Cancer Institute- Division of Cancer Epidemiology & Genetics; Massachusetts Institute of Technology (MIT); Harvard University
刊物名称:
PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA
ISSN/ISSBN:
0027-12083
DOI:
10.1073/pnas.2403210121
发表日期:
2024-08-13
关键词:
linkage disequilibrium selection regression accuracy disease models regularization association insights common
摘要:
Polygenic risk scores (PRS) enhance population risk stratification and advance personalized medicine, but existing methods face several limitations, encompassing issues related to computational burden, predictive accuracy, and adaptability to a wide range of genetic architectures. To address these issues, we propose Aggregated L0Learn using Summary- level data (ALL- Sum), a fast and scalable ensemble learning method for computing PRS using summary statistics from genome-wide association studies (GWAS). ALL- Sum leverages a L0L2 penalized regression and ensemble learning across tuning parameters to flexibly model traits with diverse genetic architectures. In extensive large- scale simulations across a wide range of polygenicity and GWAS sample sizes, ALL- Sum consistently outperformed popular alternative methods in terms of prediction accuracy, runtime, and memory usage by 10%, 20- fold, and threefold, respectively, and demonstrated robustness to diverse genetic architectures. We validated the performance of ALL- Sum in real data analysis of 11 complex traits using GWAS summary statistics from nine data sources, including the Global Lipids Genetics Consortium, Breast Cancer Association Consortium, and FinnGen Biobank, with validation in the UK Biobank. Our results show that on average, ALL- Sum obtained PRS with 25% higher accuracy on average, with 15 times faster computation and half the memory than the current state- of- the- art methods, and had robust performance across a wide range of traits and diseases. Furthermore, our method demonstrates stable prediction when using linkage disequilibrium computed from different data sources. ALL- Sum is available as a user- friendly R software package with publicly available reference data for streamlined analysis.