Nonconvex optimization: Escaping saddle points and finding local minima
Overview
Finding the global minimum of nonconvex optimization problems is challenging because local algorithms get stuck in local minima. Often, there are many local minima, each separated from the others by large energy barriers. Accordingly, instead of finding the global minimum, one may settle for finding a local minimum: local minima can often still be used effectively in situations such as training machine learning models. An effective approach to finding a local minimum is gradient descent, but gradient descent can run into the problem of getting stuck near saddle points, which are not local minima but nonetheless have a vanishing gradient. Efficiently finding local minima thus requires methods for escaping saddle points. Limited work in this area suggests a potential polynomial quantum speedup [1] in the dimension dependence for finding local minima, using subroutines for Hamiltonian simulation and quantum gradient estimation.
Actual end-to-end problem(s) solved
Suppose we have a classical algorithm \(\mathcal{A}_f\) for (approximately) computing a function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\), which requires \(C_f\) gates to perform with a reversible classical circuit. The amount of error tolerable is specified later. Following [1], suppose further that \(f\) is \(\ell\)-smooth and \(\rho\)-Hessian Lipschitz, that is,
\[
\lVert \nabla f(x_1) - \nabla f(x_2) \rVert \leq \ell \lVert x_1 - x_2 \rVert \quad \text{and} \quad \lVert \nabla^2 f(x_1) - \nabla^2 f(x_2) \rVert \leq \rho \lVert x_1 - x_2 \rVert \qquad \forall\, x_1, x_2 \in \mathbb{R}^n,
\]
where \(\nabla f\) denotes the gradient of \(f\) (a vector), \(\nabla^2 f\) denotes the Hessian of \(f\) (a matrix), and \(\lVert \cdot \rVert\) denotes the standard Euclidean norm for vector arguments, and the spectral norm for matrix arguments.
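As a concrete sanity check, these two Lipschitz constants can be estimated numerically by sampling pairs of points and maximizing the relevant difference quotients. The sketch below (our own illustration, not from [1]) does this for \(f(x, y) = x^2 - y^2\), which is \(\ell\)-smooth with \(\ell = 2\) and \(\rho\)-Hessian Lipschitz with \(\rho = 0\), since its Hessian is constant.

```python
import numpy as np

# Empirically estimate the smoothness constant ell and the Hessian
# Lipschitz constant rho of f(x, y) = x^2 - y^2 by maximizing the
# difference quotients over random pairs of points.  (Illustration only:
# for this f the Hessian is constant, so any rho >= 0 works.)

rng = np.random.default_rng(0)

grad = lambda v: np.array([2 * v[0], -2 * v[1]])
hess = lambda v: np.array([[2.0, 0.0], [0.0, -2.0]])

ell_est, rho_est = 0.0, 0.0
for _ in range(1000):
    x1, x2 = rng.normal(size=2), rng.normal(size=2)
    d = np.linalg.norm(x1 - x2)
    ell_est = max(ell_est, np.linalg.norm(grad(x1) - grad(x2)) / d)
    # ord=2 gives the spectral norm for matrix arguments, as in the text
    rho_est = max(rho_est, np.linalg.norm(hess(x1) - hess(x2), ord=2) / d)

print(f"ell ~ {ell_est:.3f}, rho ~ {rho_est:.3f}")   # ell ~ 2.000, rho ~ 0.000
```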
The end-to-end problem solved is to take as input a specification of the function \(f\), an initial point \(x_0\), and an error parameter \(\epsilon > 0\), and to output an \(\epsilon\)-approximate second-order stationary point (i.e., approximate local minimum) \(x\), defined as satisfying
\[
\lVert \nabla f(x) \rVert \leq \epsilon \quad \text{and} \quad \lambda_{\min}\left( \nabla^2 f(x) \right) \geq -\sqrt{\rho \epsilon},
\]
where \(\lambda_{\min}(\cdot)\) denotes the minimum eigenvalue of its argument. In other words, the gradient should be nearly zero, and the Hessian should be close to a positive-semidefinite matrix.
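The condition is easy to test when the gradient and Hessian are available explicitly. The following sketch (the function name and interface are our own, not from [1]) checks it for the saddle of \(f(x, y) = x^2 - y^2\) and for the minimum of \(f(x, y) = x^2 + y^2\).

```python
import numpy as np

# Check whether x is an epsilon-approximate second-order stationary
# point:  ||grad f(x)|| <= eps  and  lambda_min(Hess f(x)) >= -sqrt(rho*eps).
def is_approx_sosp(grad, hess, x, eps, rho):
    g = np.asarray(grad(x), dtype=float)
    H = np.asarray(hess(x), dtype=float)
    lam_min = np.linalg.eigvalsh(H).min()   # Hessian is symmetric
    return bool(np.linalg.norm(g) <= eps and lam_min >= -np.sqrt(rho * eps))

# f(x, y) = x^2 - y^2 has a saddle at the origin: zero gradient, but
# lambda_min = -2 violates the curvature condition.
grad = lambda v: np.array([2 * v[0], -2 * v[1]])
hess = lambda v: np.array([[2.0, 0.0], [0.0, -2.0]])
print(is_approx_sosp(grad, hess, np.zeros(2), eps=0.1, rho=1.0))   # False

# f(x, y) = x^2 + y^2 has a genuine minimum at the origin.
print(is_approx_sosp(lambda v: 2 * v, lambda v: 2 * np.eye(2),
                     np.zeros(2), eps=0.1, rho=1.0))               # True
```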
Dominant resource cost/complexity
Reference [1] gives a quantum algorithm that performs the \(C_f\)-gate quantum circuit for coherently computing \(f\) a number of times scaling as
\[
\widetilde{\mathcal{O}}\left( \frac{(f(x_0) - f^*)\, \rho^{1/2} \log(n)}{\epsilon^{1.75}} \right),
\]
where \(x_0\) is the initial point and \(f^*\) is the global minimum of \(f\). The evaluation of \(f\) must be correct up to precision \(\mathcal{O}\left( \epsilon^2/n^4 \right)\). Note that the work of [1] initially showed a \(\log^2(n)\) dependence, which was later improved to \(\log(n)\) using the improved simulation method of [2]. Gate overhead beyond these calls to the evaluation circuit is not quoted in [1].
The idea is to run normal gradient descent, which has gradient query cost independent of \(n\), until reaching an approximate saddle point. Classical algorithms typically apply random perturbations to detect a direction of negative curvature and continue the gradient descent. Instead, the quantum algorithm constructs a Gaussian wavepacket localized at the saddle point and evolves it according to the Schrödinger equation
\begin{equation}
i \frac{\partial \psi(x,t)}{\partial t} = \left( -\frac{1}{2} \Delta + f(x) \right) \psi(x,t), \label{eq:nonconvex_Ham}
\end{equation}
where \(\Delta\) denotes the Laplacian operator. The intuition is that, in the directions of positive curvature, the particle stays localized (as in a harmonic potential), while in the directions of negative curvature, the particle quickly disperses. Thus, when the position of the particle is measured, it is likely to have escaped the saddle point in a direction of negative curvature, and gradient descent can be continued. The other technical ingredient is the quantum gradient estimation algorithm, which uses a constant number of (coherent) queries to the function \(f\) to estimate \(\nabla f\).
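This localization-versus-dispersion intuition can be checked by direct classical simulation (our own sketch, not a component of the algorithm in [1]): evolve a Gaussian wavepacket under the Schrödinger equation for a one-dimensional quadratic potential of positive curvature, \(f(x) = x^2/2\), and of negative curvature, \(f(x) = -x^2/2\), using a standard split-step Fourier integrator, and compare how the packet spreads.

```python
import numpy as np

# Split-step Fourier simulation of  i dpsi/dt = (-(1/2) d^2/dx^2 + f(x)) psi
# in 1D.  Positive curvature f = +x^2/2 keeps the packet localized;
# negative curvature f = -x^2/2 makes it disperse (variance ~ cosh(2t)/2).

N, L = 1024, 80.0                          # grid points and box size
x = np.linspace(-L / 2, L / 2, N, endpoint=False)
k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)  # momentum grid

def evolve_variance(f, t_final=2.0, dt=1e-3):
    """Return Var[x] of the evolved wavepacket at time t_final."""
    psi = np.exp(-x**2 / 2).astype(complex)          # Gaussian at the saddle
    psi /= np.sqrt(np.sum(np.abs(psi)**2) * (L / N)) # normalize
    half_v = np.exp(-0.5j * dt * f)                  # half-step in potential
    kin = np.exp(-0.5j * dt * k**2)                  # full kinetic step
    for _ in range(int(round(t_final / dt))):        # Strang splitting
        psi = half_v * np.fft.ifft(kin * np.fft.fft(half_v * psi))
    prob = np.abs(psi)**2 * (L / N)
    mean = np.sum(x * prob)
    return np.sum((x - mean)**2 * prob)

var_pos = evolve_variance(+x**2 / 2)   # stays ~0.5 (harmonic ground state)
var_neg = evolve_variance(-x**2 / 2)   # grows to ~cosh(4)/2, roughly 13.7
print(f"Var at t=2: positive curvature {var_pos:.2f}, "
      f"negative curvature {var_neg:.2f}")
```

The contrast between the two variances is the signal the quantum algorithm exploits: measuring the position after a short evolution time yields, with high probability, a displacement along a negative-curvature direction.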
A similar approach was taken in [3] for analyzing the complexity of escaping a saddle point when one has access to noisy queries to the value of the function \(f\). Additionally, lower bounds on the \(\epsilon\)-dependence of quantum algorithms for this problem are given in [4].
Existing error corrected resource estimates
This problem has received relatively little attention, and no resource estimates have been performed.
Caveats
Reference [1] gives the query complexity of the quantum algorithm but does not perform a full end-to-end resource analysis. (However, it does numerically study the performance of the quantum algorithm in a couple of toy examples.) Additionally, many practical scenarios are said to obey a "cheap gradient principle" [5, 6], which says that computing the gradient is almost as easy as computing the function itself, and in these scenarios, no significant quantum speedup is available. Finally, in the setting of variational quantum algorithms, this does not avoid the issue of barren plateaus, which refers to the situation where a large portion of the parameter space has a gradient (and Hessian) that vanishes exponentially with \(n\). These regions would be characterized as \(\epsilon\)-approximate local minima unless \(\epsilon\) is made exponentially small in \(n\).
Comparable classical complexity and challenging instance sizes
The best classical algorithm [7] for this problem makes
\[
\widetilde{\mathcal{O}}\left( \frac{(f(x_0) - f^*)\, \rho^{1/2} \log(n)}{\epsilon^{1.75}} \right)
\]
queries to the gradient of \(f\). Note that \(\text{poly}(n)\) queries to the value of \(f\) would be needed to construct a query to the gradient. (When the quantum algorithm in [1] was first discovered, the best classical algorithm required \(\mathcal{O}\left( \log^6(n) \right)\) gradient queries [8, Theorem 3]; this was later improved.)
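The \(\text{poly}(n)\) overhead of simulating a gradient query with value queries is visible already in the simplest construction, forward finite differences, which spends \(n+1\) function evaluations per gradient (a sketch for illustration; more careful schemes change constants, not the linear-in-\(n\) scaling):

```python
import numpy as np

def fd_gradient(f, x, h=1e-6):
    """Forward-difference gradient: n + 1 evaluations of f."""
    fx = f(x)
    g = np.empty_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = h
        g[i] = (f(x + e) - fx) / h
    return g

# Count evaluations for f(x) = ||x||^2, whose gradient is 2x.
calls = {"n": 0}
def f(v):
    calls["n"] += 1
    return float(np.dot(v, v))

x = np.arange(4, dtype=float)
g = fd_gradient(f, x)
print(calls["n"], g)   # 5 evaluations for n = 4; g is close to 2x
```

A classical gradient oracle built this way costs \(n+1\) value queries per call, whereas the quantum gradient estimation algorithm of [1] extracts \(\nabla f\) from a constant number of coherent value queries.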
Speedup
The quantum algorithm in [1] has the same query complexity as the classical algorithm in [7]; the difference is that the quantum algorithm makes (coherent) queries to an evaluation oracle, while the classical algorithm requires access to a gradient oracle. Thus, if classical gradient queries are available, there is no speedup, and if no gradient query is available, the speedup can be exponential.
Outlook
It is unclear whether the algorithm for finding local minima could lead to a practical speedup, as it depends highly on the (non)availability of an efficient classical procedure for implementing gradient oracles; a quantum speedup is possible only when such oracles are difficult to implement classically. However, the algorithm represents a useful end-to-end problem where the quantum gradient estimation primitive can be applied. It is also notable that the quantum algorithm employs Hamiltonian simulation, a primitive not used in most other approaches to continuous optimization. Relatedly, [9] proposes a quantum subroutine called "quantum Hamiltonian descent" which is a genuinely quantum counterpart to classical gradient descent, via Hamiltonian simulation of an equation similar to Eq. \(\eqref{eq:nonconvex_Ham}\). Unlike classical gradient descent, it can exploit quantum tunneling to avoid getting stuck in local minima; thus, it can potentially find global minima of nonconvex functions. Establishing concrete end-to-end problems where quantum approaches based on Hamiltonian simulation yield an advantage in nonconvex optimization is an interesting direction for future work.
Bibliography
- Chenyi Zhang, Jiaqi Leng, and Tongyang Li. Quantum algorithms for escaping from saddle points. Quantum, 5:529, 8 2021. arXiv: https://arxiv.org/abs/2007.10253. URL: https://doi.org/10.22331/q-2021-08-20-529, doi:10.22331/q-2021-08-20-529.
- Andrew M. Childs, Jiaqi Leng, Tongyang Li, Jin-Peng Liu, and Chenyi Zhang. Quantum simulation of real-space dynamics. Quantum, 6:860, 11 2022. arXiv: https://arxiv.org/abs/2203.17006. URL: https://doi.org/10.22331/q-2022-11-17-860, doi:10.22331/q-2022-11-17-860.
- Weiyuan Gong, Chenyi Zhang, and Tongyang Li. Robustness of quantum algorithms for nonconvex optimization. arXiv: https://arxiv.org/abs/2212.02548, 2022.
- Chenyi Zhang and Tongyang Li. Quantum lower bounds for finding stationary points of nonconvex functions. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning (ICML), volume 202 of Proceedings of Machine Learning Research, 41268–41299. PMLR, 7 2023. arXiv: https://arxiv.org/abs/2212.03906. URL: https://proceedings.mlr.press/v202/zhang23u.html.
- Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, 2008. doi:10.1137/1.9780898717761.
- Jérôme Bolte, Ryan Boustany, Edouard Pauwels, and Béatrice Pesquet-Popescu. Nonsmooth automatic differentiation: a cheap gradient principle and other complexity results. arXiv: https://arxiv.org/abs/2206.01730, 2022.
- Chenyi Zhang and Tongyang Li. Escape saddle points by a simple gradient-descent based algorithm. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34 (NeurIPS), 8545–8556. Curran Associates, Inc., 2021. arXiv: https://arxiv.org/abs/2111.14069. URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/47bd8ac1becf213f155a82244b4a696a-Paper.pdf.
- Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory (COLT), volume 75 of Proceedings of Machine Learning Research, 1042–1085. PMLR, 7 2018. arXiv: https://arxiv.org/abs/1711.10456. URL: https://proceedings.mlr.press/v75/jin18a.html.
- Jiaqi Leng, Ethan Hickman, Joseph Li, and Xiaodi Wu. Quantum Hamiltonian descent. arXiv: https://arxiv.org/abs/2303.01471, 2023.