Stochastic Bregman Proximal Gradient Method Revisited: Kernel Conditioning and Painless Variance Reduction

Publication Type:
Article; Early Access
Author(s):
Zhang, Junyu
Affiliation:
National University of Singapore
Journal:
MATHEMATICAL PROGRAMMING
ISSN/ISBN:
0025-5610
DOI:
10.1007/s10107-025-02285-2
Publication Date:
2025
Keywords:
First-order methods; continuity
Abstract:
We investigate stochastic Bregman proximal gradient (SBPG) methods for minimizing a finite-sum nonconvex function $\Psi(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x) + \phi(x)$, where $\phi$ is convex and nonsmooth, while each $f_i$, instead of having a globally Lipschitz continuous gradient, satisfies a smooth-adaptability condition with respect to some kernel $h$. Standard acceleration techniques for stochastic algorithms (momentum, shuffling, variance reduction) depend on bounding stochastic errors by gradient differences that are in turn controlled via the Lipschitz property. Lacking this, existing SBPG results are mostly limited to vanilla stochastic approximation schemes that cannot attain the optimal $O(\sqrt{n})$ complexity dependence on $n$.
Moreover, existing works report complexities under various nonstandard stationarity measures that largely deviate from the standard minimal limiting Fréchet subdifferential measure $\textrm{dist}(0, \partial \Psi(\cdot))$. Our analysis reveals that these popular nonstandard stationarity measures are often much smaller than $\textrm{dist}(0, \partial \Psi(\cdot))$ by a large or even unbounded instance-dependent mismatch factor, leading to overstated solution quality and possibly non-stationary output. This also implies that current complexities based on nonstandard measures are actually asymptotic and instance-dependent when translated to $\textrm{dist}(0, \partial \Psi(\cdot))$.
To resolve these issues, we design a new gradient mapping $\mathcal{D}_{\phi,h}^{\lambda}(\cdot)$ based on BPG residuals in the dual space, together with a new kernel-conditioning (KC) regularity, under which the mismatch between $\Vert \mathcal{D}_{\phi,h}^{\lambda}(\cdot)\Vert$ and $\textrm{dist}(0, \partial \Psi(\cdot))$ is provably $O(1)$ and instance-free. Moreover, KC-regularity guarantees Lipschitz-like bounds for gradient differences, providing general analysis tools for momentum, shuffling, and variance reduction under smooth-adaptability.
We illustrate this point on variance-reduced SBPG methods and establish an $O(\sqrt{n})$ complexity dependence for $\Vert \mathcal{D}_{\phi,h}^{\lambda}(\cdot)\Vert$, yielding an instance-free (worst-case) complexity under $\textrm{dist}(0, \partial \Psi(\cdot))$.
Source URL: