
Quantum machine learning via quantum linear algebra

Overview

Linear algebra in high dimensional spaces with tensor product structure is the workhorse of quantum computation as well as of much of machine learning (ML). It is therefore unsurprising that efforts have been made to find quantum algorithms for various learning tasks, including but not restricted to: cluster-finding [1], principal component analysis [2], least-squares fitting [3, 4], recommendation systems [5], binary classification [6], and Gaussian process regression [7]. One of the main computational bottlenecks in all of these tasks is the manipulation of large matrices. Significant speedup for this class of problems has been argued for via quantum linear algebra, as exemplified by the quantum linear system solver (QLSS). The main question marks for viability are: (i) can quantum linear algebra be fully dequantized [8] for ML tasks, (ii) can the classical training data be loaded efficiently into a quantum random access memory (QRAM), and (iii) do the quantum ML algorithms that avoid the above mentioned pitfalls address relevant machine learning problems? Our current understanding suggests that significant quantum advantage would require an exceptional confluence of (i)-(iii) that has not yet been found in the specific applications analyzed to date, though modest speedups are plausible.

ML applications

The structure of this section differs from other sections in this survey, due to the one-off nature of many of the quantum machine learning proposals and the fact that they are often heuristic. Rather than cover every proposal, we explore three specific applications. Each example explains which end-to-end problem is being solved and roughly how the proposed quantum algorithm solves that problem, arriving at its dominant complexity. In each case, the quantum algorithm assumes access to fast coherent data access (log-depth QRAM) and leverages quantum primitives for solving linear systems (and linear algebra more generally). Under certain conditions, these primitives can be exponentially faster than classical methods that manipulate all the entries of vectors in the exponentially large vector space. However, for these examples, it is crucial to carefully define the end-to-end problem, as exponential advantages can be lost at the readout step, where the answer to a machine learning question must be retrieved from the quantum state encoding the solution to the linear algebra problem. In the three examples below, this is accomplished with some form of amplitude or overlap estimation.

Furthermore, we note that, even if these quantum algorithms are exponentially faster than classical algorithms that manipulate the full state vector, in some cases this speedup has been "dequantized" via algorithms that merely sample from the entries of the vector. Specifically, for some end-to-end problems, there exist classical "quantum-inspired" algorithms [8, 9, 10] that solve the problem in time only polynomially slower than the quantum algorithm. The assumption that the quantum algorithm has fast QRAM access to the classical data is analogous to the assumption that the classical algorithm has fast sample-and-query (SQ) access to the data. We do not cover these techniques in detail, but we note that most of the machine learning tasks based on linear algebra for which quantum algorithms have been proposed have also been dequantized in some capacity [9]. However, in some cases it remains possible that there could be an exponential quantum advantage if the quantum algorithm is able to exploit additional structure in the matrices involved, such as sparsity, that the classical algorithm is not. The three examples below roughly illustrate the spectrum of possibilities: some tasks are fully dequantized, whereas others, to the best of our current knowledge, could still support exponential advantages if certain conditions are met.

Example 1: Gaussian process regression

Actual end-to-end problem:

Gaussian process regression (GPR) is a nonparametric, Bayesian method for regression. GPR is closely related to kernel methods [11], as well as to other regression models, including linear regression [12]. Our presentation of the problem follows that of [12, Chapter 2] and [13]. Given training data \(\{x_j,y_j\}_{j=1}^M\), with inputs \(x_j \in \mathbb{R}^N\) and noisy outputs \(y_j\in\mathbb{R}\), the goal is to model the underlying function \(f(x)\) generating the output \(y\)

\[\begin{equation} y=f(x)+\epsilon_{\rm noise}, \end{equation}\]

where \(\epsilon_{\rm noise}\) is drawn from i.i.d. Gaussian noise with variance \(\sigma^2\). Modeling \(f(x)\) as a Gaussian process means that for inputs \(\{x_j\}_{j=1}^M\), the outputs \(\{f(x_j)\}_{j=1}^M\) are treated as random variables with a joint multivariate Gaussian distribution, in such a way that any subset of these values is jointly normally distributed in a manner consistent with the global distribution. While this multivariate Gaussian distribution governing \(\{f(x_j)\}_{j=1}^M\) will generally be correlated for different \(j\), the additional additive error \(\epsilon_{\rm noise}\) on our observations \(y_j\) is independent of the choice of \(f(x_j)\) and uncorrelated from point to point. The Gaussian process is specified by the distribution \(\mathcal{N}\left( m, K \right)\) where \(m\) is the length-\(M\) vector obtained by evaluating a "mean function" \(m(x)\) at the points \(\{x_j\}_{j=1}^M\), and \(K\) is an \(M \times M\) covariance kernel matrix obtained by evaluating a covariance kernel function \(k(x,x')\) at \(x,x' \in \{x_j\}_{j=1}^M\). The functional forms of the mean and covariance kernel are specified by the user and determine the properties of the Gaussian process, such as its smoothness.1 These functions typically contain a small number of hyperparameters, which can be optimized using the training data. A commonly used covariance kernel function is the squared exponential covariance function \(k(x,x') = \exp{\left(-\frac{1}{2\ell^2} (x-x')^2\right)}\) where \(\ell\) is a hyperparameter controlling the length scale of the Gaussian process.
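To make the setup concrete, here is a minimal numpy sketch (ours, not taken from [7] or [12]) that assembles the covariance kernel matrix \(K\) from the squared exponential covariance function for a small synthetic data set; the helper names, data, and hyperparameter value are illustrative assumptions.

```python
import numpy as np

def sq_exp_kernel(x, xp, ell=1.0):
    """Squared-exponential covariance k(x, x') = exp(-|x - x'|^2 / (2 ell^2))."""
    diff = np.asarray(x) - np.asarray(xp)
    return np.exp(-np.dot(diff, diff) / (2.0 * ell**2))

def kernel_matrix(X, ell=1.0):
    """M x M covariance kernel matrix K with K_ij = k(x_i, x_j) for rows x_i of X."""
    M = X.shape[0]
    K = np.empty((M, M))
    for i in range(M):
        for j in range(M):
            K[i, j] = sq_exp_kernel(X[i], X[j], ell)
    return K

# Illustrative data: M = 5 random points in R^2 (values are arbitrary).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
K = kernel_matrix(X, ell=0.7)
print(K.shape, np.allclose(K, K.T))  # (5, 5) True -- symmetric by construction
```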

Given choices for \(m(x)\) and \(k(x,x')\) and the observed data \(\{x_j,y_j\}_{j=1}^M\), our task is to predict the value \(f(x_*)\) of a new test point \(x_*\). Because the Gaussian process assumes that all \(M+1\) values \(\{f(x_1),\ldots,f(x_M), f(x_*)\}\) have a jointly Gaussian distribution, it is possible to condition upon the observed data to obtain the distribution for \(f(x_*)\) which is \(p(f_* | x_*, \{x_j,y_j\}) \sim \mathcal{N}\left(\bar{f}_* , \mathbb{V}[f_*] \right)\). Our goal is to compute \(\bar{f}_*\), the mean (linear predictor) of the distribution for \(f(x_*)\), as well as the variance \(\mathbb{V}[f_*]\), which gives uncertainty on the prediction. Computing the underlying multivariate Gaussian distribution can be bypassed by exploiting the closure of Gaussians under linear operations, in particular, conditioning. This re-expresses the problem as linear algebra with the kernel matrix. Assuming the common choice of \(m(x) =0\), and defining the length-\(M\) vector \(k_* \in \mathbb{R}^M\) to have its \(j\)th entry given by \(k(x_*, x_j)\), we obtain

\[\begin{align} \bar{f}_* & = k_*^\intercal [K + \sigma^2 I]^{-1} y \\ \mathbb{V}[f_*] & = k(x_*, x_*) - k_*^\intercal [K + \sigma^2 I]^{-1} k_* \end{align}\]

which characterize the prediction for the test point. The advantages of GPR are a small number of hyperparameters, model interpretability, and that it naturally returns uncertainty estimates for the predictions. Its main disadvantage is the computational cost.
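The following sketch illustrates the classical \(\mathcal{O}(M^3)\) baseline: it evaluates the two formulas above via a Cholesky factorization of \(K + \sigma^2 I\) on synthetic data. The data, kernel helper, and hyperparameter values are assumptions made only for illustration.

```python
import numpy as np

def sq_exp_K(A, B, ell=1.0):
    """Kernel matrix with entries exp(-|a - b|^2 / (2 ell^2)) for rows a of A, b of B."""
    d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=-1)
    return np.exp(-d2 / (2.0 * ell**2))

# Hypothetical training data (x_j, y_j) and a test point x_*.
rng = np.random.default_rng(1)
M, N, sigma, ell = 50, 3, 0.1, 0.8
X = rng.normal(size=(M, N))
y = np.sin(X[:, 0]) + sigma * rng.normal(size=M)   # noisy scalar outputs
x_star = rng.normal(size=(1, N))

K = sq_exp_K(X, X, ell)                  # M x M covariance kernel matrix
k_star = sq_exp_K(x_star, X, ell)[0]     # length-M vector k_* with entries k(x_*, x_j)

# Classical O(M^3) route: Cholesky-factor K + sigma^2 I and solve the linear systems.
L = np.linalg.cholesky(K + sigma**2 * np.eye(M))
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # [K + sigma^2 I]^{-1} y
v = np.linalg.solve(L, k_star)

f_bar = k_star @ alpha                                # mean prediction f_bar_*
var_f = sq_exp_K(x_star, x_star, ell)[0, 0] - v @ v   # predictive variance V[f_*]
print(f_bar, var_f)
```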

Dominant resource cost:

In classical implementations, the cost is dominated by performing the inversion \([K+ \sigma^2 I]^{-1}\), typically via a Cholesky decomposition, resulting in a complexity of \(\mathcal{O}\left( M^3 \right)\) (see [12, Chapter 8] and [14] for approximations used to reduce the classical cost). In [7], a quantum algorithm was proposed that leverages the quantum linear system solver (QLSS) to perform this inversion more efficiently. The quantum computer uses the classical data to infer the linear predictor and variance for a test point \(x_*\), and this process must be repeated for the computation of each new test point output. We analyze the complexity of computing \(\bar{f}_*\), with a simple extension for \(\mathbb{V}[f_*]\). Given classically observed/precomputed values of \(y\) and \(k_*\), the quantum algorithm uses state preparation from classical data (based on QRAM) to prepare quantum states representing \(\frac{1}{\nrm{y}} \ket{y}\) and \(\frac{1}{\nrm{k_*}} \ket{k_*}\),2 each with a gate depth of \(\mathcal{O}\left( \log(M) \right)\) (though using \(\mathcal{O}\left( M \right)\) gates overall). The algorithm also uses a block-encoding of classical data (also using QRAM) for \(A:=[K + \sigma^2 I]\), with a normalization factor of \(\alpha = \nrm{K + \sigma^2 I}_F\) (Frobenius norm).3 The state-of-the-art QLSS has complexity \(\mathcal{O}\left( \frac{\alpha \kappa}{\nrm{A}} \log(1/\epsilon) \right)\) calls to an \(\alpha\)-normalized block-encoding of matrix \(A\) with condition number \(\kappa\). In this case, the minimum singular value of \(A\) is at least \(\sigma^2\), so \(\kappa/\nrm{A}\leq \sigma^{-2}\). The QLSS produces the normalized state \(\ket{A^{-1}y}\), and a similar approach yields an estimate for the norm \(\lVert A^{-1}y \rVert\) to relative error \(\epsilon\) at cost \(\widetilde{\mathcal{O}}\left( \alpha \kappa/(\nrm{A}\epsilon) \right)\). Given unitary circuits performing these tasks, we can estimate the quantity \(\bar{f}_* = \braket{k_*}{A^{-1}y} \cdot \nrm{k_*}\nrm{A^{-1}y}\) to precision \(\epsilon\) using overlap estimation with gate depth upper bounded by

\[\begin{equation} \widetilde{\mathcal{O}}\left( \log(M) \cdot \nrm{K+\sigma^2 I}_F \sigma^{-2} \cdot \frac{\nrm{k_*} \left\lVert[K + \sigma^2 I]^{-1}y\right\rVert }{\epsilon} \right)\,, \end{equation}\]

where the three factors come from QRAM, QLSS, and overlap estimation, respectively. Using QRAM as described above would use \(\mathcal{O}\left( M^2 \right)\) ancilla qubits. Note that classical "quantum-inspired" methods for solving linear systems, based on sample-and-query (SQ) access, also have \(\mathrm{poly}(\nrm{A}_F,\kappa,\epsilon^{-1},\log(M))\) complexity [9, 15, 10], and thus the quantum algorithm as stated above offers at most a polynomial speedup in the case of dense matrices.
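As a sanity check on the structure of this estimate, the following classical sketch verifies the factorization \(\bar{f}_* = \braket{k_*}{A^{-1}y}\cdot\nrm{k_*}\nrm{A^{-1}y}\) that the quantum algorithm estimates (an overlap of two normalized vectors times two norms) and evaluates the factors \(\nrm{A}_F\) and \(\kappa/\nrm{A}\leq\sigma^{-2}\) appearing in the bound. The random positive semidefinite matrix used as a stand-in kernel is an assumption, and no quantum primitive is simulated.

```python
import numpy as np

# Classical check of the decomposition the quantum algorithm estimates:
#   f_bar_* = <k_* | A^{-1} y> * ||k_*|| * ||A^{-1} y||,
# where |k_*> and |A^{-1} y> denote the unit vectors along k_* and A^{-1} y.
rng = np.random.default_rng(2)
M, sigma = 40, 0.2
B = rng.normal(size=(M, M))
K = B @ B.T / M                          # stand-in positive semidefinite kernel matrix
A = K + sigma**2 * np.eye(M)
y = rng.normal(size=M)
k_star = rng.normal(size=M)

u = np.linalg.solve(A, y)                # A^{-1} y
overlap = (k_star / np.linalg.norm(k_star)) @ (u / np.linalg.norm(u))

f_direct = k_star @ u
f_factored = overlap * np.linalg.norm(k_star) * np.linalg.norm(u)
print(np.isclose(f_direct, f_factored))  # True

# The factors appearing in the stated gate-depth bound:
frob = np.linalg.norm(A, 'fro')          # ||K + sigma^2 I||_F (block-encoding normalization)
kappa_over_norm = 1.0 / np.linalg.svd(A, compute_uv=False).min()   # kappa/||A||
print(frob, kappa_over_norm, sigma**-2)  # kappa/||A|| is at most sigma^{-2}
```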

On the other hand, [7] considers the case where the vectors and kernels are sparse4 and uses this to reduce the cost of the quantum algorithm and of QRAM. In this case, using block-encodings of sparse matrices, the factor \(\nrm{A}_F\) in the complexity expression is replaced by a factor \(s \nrm{A}_{\max}\), where \(s\) is the sparsity of the matrix \(A\) and \(\nrm{A}_{\max}\) is the maximum magnitude of any entry of \(A\)—log-depth QRAM with \(\Omega(M)\) ancilla qubits would still be necessary to implement the sparse-access oracle to the \(sM\) arbitrary nonzero entries of \(A\) in depth \(\mathcal{O}\left( \log(M) \right)\). The upshot is that in the sparse case, because the algorithm assumes the kernel is not low rank, this algorithm is not dequantized by SQ access [16] and may still offer an exponential speedup over quantum-inspired methods. However, we note that the assumption of sparsity in \([K + \sigma^2 I]\) may also enable the use of more efficient classical algorithms for computing the inverse (see QLSS). Moreover, we must include the classical precomputation of evaluating the entries of this matrix. A related, and similarly efficient, quantum algorithm is proposed in [13] for optimizing the hyperparameters of the GP kernel by maximizing the marginal likelihood of the observed data given the model.

Example 2: Support vector machines

Actual end-to-end problem:

The task for the support vector machine (SVM) is to classify an \(N\)-dimensional vector \(x_*\) into one of two classes (\(y_* = \pm 1\)), given \(M\) labeled data points of the form \(\{(x_j,y_j): x_j\in \mathbb{R}^N, y_j=\pm 1\}_{j=1,...,M}\) used for training. The training phase solves a continuous optimization problem to find a maximum-margin hyperplane, described by normal vector \(w \in \mathbb{R}^N\) and offset \(b \in \mathbb{R}\), which separates the training data. That is, data points with \(y_j=1\) lie on one side of the plane, and data points with \(y_j=-1\) lie on the other side. Once trained, the classification of \(x_*\) is inferred via the formula

\[\begin{equation} \label{eq:SVM_classify} y_*={\rm sign}\left(b+\langle w, x_*\rangle\right)\,. \end{equation}\]

In the "hard-margin" version of the problem where all training points must be classified correctly (assuming it is possible to do so, i.e. the data is linearly separable), the solution \((w,b)\) is given by

\[\begin{equation} \label{eq:SVM_hardmargin} \argmin_{(w,b)} \; \nrm{w}^2, \qquad \text{subject to:} \qquad y_j \cdot (\langle w,x_j \rangle + b) \geq 1 \qquad \forall j \end{equation}\]

where \(\nrm{\cdot}\) denotes the standard Euclidean vector norm.

In the "soft-margin" version of the problem, the hyperplane need not correctly classify all training points. The relation \(y_j \cdot (\langle w, x_j\rangle+b) \geq 1\) is relaxed to \(y_j \cdot (\langle w, x_j \rangle +b) \geq 1-\xi_j\), with \(\xi_j \geq 0\). Now, \((w,b)\) are determined by

\[\begin{equation} \label{eq:SVM_softmargin} \argmin_{(w,b, \xi)} \; \nrm{w}^2 + \gamma \nrm{\xi}_1, \qquad \text{subject to:} \qquad y_j \cdot (\langle w,x_j \rangle + b) \geq 1-\xi_j \qquad \forall j\,, \end{equation}\]

where \(\nrm{\cdot}_1\) denotes the vector 1-norm, and \(\gamma\) is a user-specified parameter related to how much to penalize points that lie within the margin. Both Eqs. \(\eqref{eq:SVM_hardmargin}\) and \(\eqref{eq:SVM_softmargin}\) are convex programs, in particular, quadratic programs, which can also be rewritten as second-order cone programs [17]. Another feature of these formulations is that the solution vectors \(w\) and \(\xi\) are usually sparse; the \(j\)th entry is only nonzero for values of \(j\) where \(x_j\) lies on or within the margin near the hyperplane—these \(x_j\) are called the "support vectors."
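For concreteness, here is a minimal sketch of the soft-margin program above, solved classically with a generic constrained optimizer (scipy's SLSQP) on synthetic, roughly separable data. In practice a dedicated QP or SOCP solver would be used; the data-generation choices and parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Soft-margin SVM:  min ||w||^2 + gamma ||xi||_1  s.t.  y_j(<w, x_j> + b) >= 1 - xi_j, xi_j >= 0.
rng = np.random.default_rng(3)
M, N, gamma = 40, 2, 1.0
X = rng.normal(size=(M, N)) + np.outer(rng.choice([-2, 2], size=M), np.ones(N)) * 0.5
y = np.sign(X[:, 0] + X[:, 1])             # hypothetical labels, roughly linearly separable

def unpack(z):
    return z[:N], z[N], z[N + 1:]           # w, b, xi

def objective(z):
    w, b, xi = unpack(z)
    return w @ w + gamma * np.sum(xi)       # ||w||^2 + gamma ||xi||_1  (xi >= 0 enforced below)

constraints = [
    # margin constraints  y_j (<w, x_j> + b) >= 1 - xi_j
    {'type': 'ineq', 'fun': lambda z: y * (X @ unpack(z)[0] + unpack(z)[1]) - 1 + unpack(z)[2]},
    # slack nonnegativity  xi_j >= 0
    {'type': 'ineq', 'fun': lambda z: unpack(z)[2]},
]

z0 = np.zeros(N + 1 + M)
res = minimize(objective, z0, constraints=constraints, method='SLSQP')
w, b, xi = unpack(res.x)
x_test = np.array([1.0, 1.0])
print(np.sign(b + w @ x_test))              # classification of a test point
```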

In [18], a "least-squares" version of the SVM problem was proposed, which has no inequality constraints:5

\[\begin{equation} \label{eq:SVM_LS} \argmin_{(w,b,\xi)} \; \nrm{w}^2 + \frac{\gamma}{M} \nrm{\xi}^2, \qquad \text{subject to:} \qquad y_j \cdot (\langle w,x_j \rangle + b) = 1-\xi_j \qquad \forall j\,. \end{equation}\]

This is an equality-constrained least-squares problem, which is simpler than a quadratic program and can be solved using Lagrange multipliers and inverting a linear system. Specifically, one introduces vector \(\beta \in \mathbb{R}^M\) and solves the \((M+1) \times (M+1)\) linear system \(Au = v\), where

\[\begin{equation} A=\begin{pmatrix} 0 & \mathbf{1}^\intercal/\sqrt{M} \\ \mathbf{1}/\sqrt{M} & K/M + \gamma^{-1}I \end{pmatrix}, \qquad u = \begin{pmatrix} b \\ \beta \end{pmatrix}, \qquad v =\frac{1}{\sqrt{M}}\begin{pmatrix} 0 \\ y \end{pmatrix} \end{equation}\]

with \(K\) the kernel matrix for which \(K_{ij} = \langle x_i, x_j \rangle\), \(\mathbf{1}\) the all-ones vector, and \(I\) the identity matrix. The vector \(w\) is inferred from \(\beta\) via the formula \(w = \sum_j \beta_j x_j/\sqrt{M}\).

However, unlike the first two formulations, the least-squares formulation does not generally have sparse solution vectors \((w,b)\) (see [19]). Additionally, its solution can be qualitatively different, due to the fact that correctly classified data points can lead to negative \(\xi_j\) that apply penalties to the objective function through the appearance of \(\nrm{\xi}^2\).
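A minimal classical sketch of the least-squares formulation follows: it assembles the \((M+1)\times(M+1)\) system \(Au = v\) defined above for synthetic data, solves it directly, recovers \((w, b)\), and classifies a test point via Eq. \(\eqref{eq:SVM_classify}\). The data and parameter values are illustrative assumptions.

```python
import numpy as np

# Least-squares SVM: build the (M+1) x (M+1) system A u = v, solve, then classify.
rng = np.random.default_rng(4)
M, N, gamma = 30, 4, 2.0
X = rng.normal(size=(M, N))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))   # enforce ||x_j|| <= 1
y = np.sign(X @ rng.normal(size=N) + 0.1)                        # hypothetical labels

K = X @ X.T                                       # kernel matrix K_ij = <x_i, x_j>
ones = np.ones(M)

A = np.zeros((M + 1, M + 1))
A[0, 1:] = ones / np.sqrt(M)
A[1:, 0] = ones / np.sqrt(M)
A[1:, 1:] = K / M + np.eye(M) / gamma
v = np.concatenate(([0.0], y)) / np.sqrt(M)

u = np.linalg.solve(A, v)                         # u = (b, beta)
b, beta = u[0], u[1:]
w = X.T @ beta / np.sqrt(M)                       # w = sum_j beta_j x_j / sqrt(M)

x_star = rng.normal(size=N)
x_star /= max(1.0, np.linalg.norm(x_star))
print(np.sign(b + w @ x_star))                    # classification rule of Eq. (SVM_classify)
```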

Dominant resource cost:

The hard-margin and soft-margin formulations of SVM are quadratic programs, which can be mapped to second-order cone programs and solved with quantum interior point methods (QIPMs). This solution was proposed in [17], and, assuming access to log-depth QRAM, it can find \(\epsilon\)-accurate estimates for the solution \((w,b)\) in time scaling as \(\widetilde{\mathcal{O}}\left( M^{0.5}(M+N)\kappa_{\rm IPM}\zeta\log(1/\epsilon)/\xi' \right)\), where \(\kappa_{\rm IPM}\), \(\zeta\), and \(\xi'\) are instance-specific parameters related to the QIPM. This compares to \(\mathcal{O}\left( M^{0.5}(M+N)^{3}\log(1/\epsilon) \right)\) for naively implemented classical interior point methods. In [17], numerical simulations on random SVM instances were performed to compute these instance-specific parameters, and the results were consistent with a small polynomial speedup. However, the resource estimate of [20] for a related problem suggests a practical advantage may be difficult to realize with this approach.

The least-squares formulation can be solved directly with the quantum linear system solver (QLSS), as pursued in [6]. This can be compared to classically solving the linear system via Gaussian elimination, with cost \(\mathcal{O}\left( M^3 \right)\). The QLSS requires the ability to prepare the state \(\ket{v}\), which can be accomplished in \(\mathcal{O}\left( \log(M) \right)\) depth through methods for preparation of states from classical data, although requiring \(\mathcal{O}\left( M \right)\) total gates and ancilla qubits. One also needs a block-encoding of the matrix \(A\). One method is through block-encodings from classical data, which requires classical precomputation of the \(\mathcal{O}\left( M^2 \right)\) entries of \(K\) (incurring classical cost \(\mathcal{O}\left( M^2N \right)\)) and producing a block-encoding with normalization factor \(\alpha = \nrm{A}_F\) (Frobenius norm). Henceforth we assume that \(\nrm{x_j} \leq 1\) for all \(j\), which can always be achieved by scaling down the training data (inducing a scaling up of \(w\) and \(\sqrt{\gamma}\) by an equal factor). This implies \(\nrm{K/M}_F \leq 1\) and hence \(\nrm{A}_F \leq \sqrt{2}+1+\sqrt{M}\gamma^{-1}\). A better block-encoding can be obtained by block-encoding \(K/M\) via the method for Gram matrices6 and \(\gamma^{-1}I\) via the trivial method, and then combining these with the rest of \(A\) via linear combination of block-encodings. This avoids the need to classically calculate the inner products \(\langle x_i, x_j \rangle\), and has a better normalization \(\alpha \leq \sqrt{2} + 1+\gamma^{-1}\).
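The two normalization factors discussed above can be compared numerically. The sketch below (with synthetic data satisfying \(\nrm{x_j}\leq 1\)) computes \(\nrm{A}_F\), its bound \(\sqrt{2}+1+\sqrt{M}\gamma^{-1}\), and the smaller normalization \(\sqrt{2}+1+\gamma^{-1}\) available to the Gram-matrix construction; the specific values of \(M\), \(N\), and \(\gamma\) are arbitrary choices for illustration.

```python
import numpy as np

# Compare the Frobenius-norm block-encoding normalization ||A||_F with the smaller
# normalization available when K/M is block-encoded as a Gram matrix and gamma^{-1} I
# is added via a linear combination of block-encodings.
rng = np.random.default_rng(5)
M, N, gamma = 200, 8, 1.5
X = rng.normal(size=(M, N))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))   # ||x_j|| <= 1

K = X @ X.T
A = np.zeros((M + 1, M + 1))
A[0, 1:] = A[1:, 0] = 1.0 / np.sqrt(M)
A[1:, 1:] = K / M + np.eye(M) / gamma

frob = np.linalg.norm(A, 'fro')
print(frob, np.sqrt(2) + 1 + np.sqrt(M) / gamma)   # ||A||_F and its bound sqrt(2)+1+sqrt(M)/gamma
print(np.sqrt(2) + 1 + 1 / gamma)                  # alpha for the Gram-matrix + LCU construction
```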

Given these constructions, the QLSS outputs the state \(\ket{u} = (b\ket{0} + \sum_{j=1}^M \beta_j \ket{j})/\sqrt{b^2+\nrm{\beta}^2}\); the cost is \(\smash{\widetilde{\mathcal{O}}\left( \alpha \kappa_A/ \nrm{A} \right)}\), where \(\kappa_A\) is the condition number of \(A\). We may assert that \(\nrm{A} \geq 1\). This follows by noting that the lower right block of \(A\) is positive semidefinite, and that 1 is an eigenvalue of \(A\) when the lower-right block is set to zero. The condition number should be upper bounded by an \(M\)-independent function of \(\gamma\) due to the appearance of the regularizing \(\gamma^{-1}I\).

Reading out all \(M+1\) entries of \(\ket{u}\) via tomography would multiply the cost by \(\Omega(M)\). However, in [6], it was observed that to classify a test point \(x_*\) via Eq. \(\eqref{eq:SVM_classify}\), one can use overlap estimation rather than classically learning the solution vector. In our notation and normalization, this can be carried out as follows. Let \(\ket{x_j}:=\sum_{i=1}^M x_{ji} \ket{i}/\nrm{x_j}\), with \(x_{ji}\) denoting the \(i\)th entry of the vector \(x_j\). Starting with \(\ket{u}\), we prepare \(\ket{x_j}\) into an ancilla register, using methods for controlled state preparation from classical data, forming

\[\begin{equation} \ket{\tilde{u}} = \frac{b\ket{0}\ket{0} + \sum_{j=1}^M \beta_j \ket{j}\left(\nrm{x_j}\ket{x_j} + \sqrt{1-\nrm{x_j}^2}\ket{M+1}\right)}{\sqrt{b^2+\nrm{\beta}^2}}\,. \end{equation}\]

One also creates a reference state \(\ket{\tilde{x}_*}\) encoding \(x_*\), defined as

\[\begin{equation} \ket{\tilde{x}_*} = \frac{1}{\sqrt{2}}\ket{0}\ket{0} + \frac{1}{\sqrt{2M}}\sum_{j=1}^M \ket{j}\left(\nrm{x_*}\ket{x_*} + \sqrt{1-\nrm{x_*}^2 }\ket{M+2}\right)\,. \end{equation}\]

The right-hand side of Eq. \(\eqref{eq:SVM_classify}\) is then given by \(\sqrt{2}\sqrt{b^2+\nrm{\beta}^2}\braket{\tilde{u}}{\tilde{x}_*}\). Thus, the overlap \(\braket{\tilde{u}}{\tilde{x}_*}\) must be estimated to precision \(\epsilon = 1/\sqrt{2(b^2+\nrm{\beta}^2)}\) in order to distinguish \(\pm 1\) and classify \(x_*\). Additionally, the norm \(\nrm{u} = \sqrt{b^2 + \nrm{\beta}^2}\) must be calculated; this can separately be done to relative error \(\epsilon'\) at cost \(\widetilde{\mathcal{O}}\left( \alpha \kappa_A/\epsilon' \right)\) (see QLSS). We may also note that as \(u = A^{-1}v\) and \(\nrm{v}=1\), we have \(\nrm{u} \leq \kappa_A/\nrm{A}\). Thus, the overall circuit depth required to classify \(x_*\) is

\[\begin{equation} \widetilde{\mathcal{O}}\left( \frac{\alpha \kappa_A^2}{ \nrm{A}^2} \right) \,. \end{equation}\]

There is no explicit \(\mathrm{poly}(N,M)\) dependence. However, for certain data sets and parameter choices, such dependence could be hidden in \(\kappa_A\) or \(\alpha\), making an apples-to-apples comparison with Gaussian elimination less clear.
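As a consistency check on this readout strategy, the following classical sketch builds \(\ket{\tilde{u}}\) and \(\ket{\tilde{x}_*}\) as explicit vectors, representing the flag directions \(\ket{M+1}\) and \(\ket{M+2}\) by two extra basis states of the second register, and verifies that \(\sqrt{2}\sqrt{b^2+\nrm{\beta}^2}\braket{\tilde{u}}{\tilde{x}_*} = b + \langle w, x_*\rangle\). The random \(\beta\), \(b\), and data are assumptions used only for the check; no quantum subroutine is simulated.

```python
import numpy as np

# Verify:  sqrt(2) * sqrt(b^2 + ||beta||^2) * <u_tilde | x_tilde_*>  =  b + <w, x_*>.
rng = np.random.default_rng(6)
M, N = 12, 5
X = rng.normal(size=(M, N))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))   # ||x_j|| <= 1
beta, b = rng.normal(size=M), rng.normal()
x_star = rng.normal(size=N)
x_star /= max(1.0, np.linalg.norm(x_star))
w = X.T @ beta / np.sqrt(M)

D = N + 2                                   # second register: N data slots + 2 flag states
def embed(vec, flag):                       # ||v|| |v> + sqrt(1 - ||v||^2) |flag>
    out = np.zeros(D)
    out[:N] = vec
    out[N + flag] = np.sqrt(max(0.0, 1.0 - vec @ vec))
    return out

u_tilde = np.zeros((M + 1, D))
u_tilde[0, 0] = b                           # the b |0>|0> term
for j in range(M):
    u_tilde[j + 1] = beta[j] * embed(X[j], 0)
u_tilde = u_tilde.ravel() / np.sqrt(b**2 + beta @ beta)

x_tilde = np.zeros((M + 1, D))
x_tilde[0, 0] = 1.0 / np.sqrt(2)
for j in range(M):
    x_tilde[j + 1] = embed(x_star, 1) / np.sqrt(2 * M)
x_tilde = x_tilde.ravel()

lhs = np.sqrt(2) * np.sqrt(b**2 + beta @ beta) * (u_tilde @ x_tilde)
print(np.isclose(lhs, b + w @ x_star))      # True
```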

Furthermore, this task has been dequantized under the assumption of SQ access [21, 9, 10]. In time scaling as \(\mathrm{poly}(\nrm{A}_F, \epsilon^{-1}, \log(NM))\), one can classically sample from the solution vector \(\ket{u}\) to error \(\epsilon\), and furthermore, given sample access, one can estimate inner products \(\braket{\tilde{u}}{\tilde{v}}\) in time \(\mathcal{O}\left( 1/\epsilon^2 \right)\) [22]. However, the cost can be reduced through a trick that is analogous to how the quantum algorithm can block-encode the \(\gamma^{-1}I\) part of \(A\) separately to avoid the dependence on a large \(\nrm{A}_F\). In particular, [9, Corollary 6.18] gives a classical complexity that would be polynomially related to the quantum complexity above under appropriate matching of parameters, but the power of this polynomial speedup could still be significant. In any case, such a speedup crucially requires log-depth QRAM access to the training data, which requires total gate complexity \(\Omega(NM)\) and \(\mathcal{O}\left( NM \right)\) ancilla qubits.
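To illustrate what SQ access buys classically, here is a minimal sketch of the length-square-sampling inner-product estimator underlying these dequantized results: sampling an index \(i\) with probability \(u_i^2/\nrm{u}^2\) and averaging \(\nrm{u}^2 v_i/u_i\) gives an unbiased estimate of \(\langle u,v \rangle\) with \(\mathcal{O}(1/\epsilon^2)\) samples. The vectors and sample counts here are synthetic assumptions.

```python
import numpy as np

# Sample-and-query (SQ) style inner-product estimation by length-square sampling of u.
rng = np.random.default_rng(7)
n = 10_000
u, v = rng.normal(size=n), rng.normal(size=n)

probs = u**2 / (u @ u)                          # length-square distribution of u
probs = probs / probs.sum()                     # renormalize against floating-point drift
samples = rng.choice(n, size=50_000, p=probs)   # "sample" indices of u, then "query" entries
estimates = (u @ u) * v[samples] / u[samples]   # unbiased: E[...] = <u, v>
print(estimates.mean(), u @ v)                  # estimator vs. exact inner product
```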

Example 3: Supervised cluster assignment

Actual end-to-end problem:

Suppose we are given access to a vector \(x\in\mathbb{C}^N\) and a set of \(M\) samples \(\{y_j\in \mathbb{C}^N\}_{j=1,\ldots,M}\). We want to estimate the distance between \(x\) and the centroid of the set \(\{y_j\}\) to judge whether \(x\) was drawn from the same set as \(\{ y_j\}\). If we have multiple sets \(\{y_j\}\), we can infer that \(x\) belongs to the one for which the distance is shortest; as a result, this is also called the "nearest-centroid problem." Specifically, the computational task is to estimate \(\lVert x-\frac{1}{M}Y \mathbf{1} \rVert\) to additive constant error \(\epsilon\) with probability \(1-\delta\), where \(Y\in \mathbb{C}^{N\times M}\) is the matrix whose columns are \(y_j\), and \(\mathbf{1}\) is the vector of \(M\) ones—the vector \(Y\mathbf{1}/M\) is the centroid of the set.

Dominant resource cost:

Naively computing the centroid incurs classical cost \(\mathcal{O}\left( NM \right)\). In [1], a quantum solution to this problem was proposed. Let \(\bar{x}=x/\lVert x\rVert\) and let \(\bar{Y}\) be normalized so that all columns have unit norm. Define \(N \times (M+1)\) matrix \(R\) and length-\((M+1)\) vector \(w\) as follows:

\[\begin{align} R=\begin{pmatrix}\bar{x} & \bar{Y}/\sqrt{M}\end{pmatrix}, \qquad w = \begin{pmatrix} \lVert x \rVert \\ -1_Y/\sqrt{M}\end{pmatrix}\,, \end{align}\]

where \(1_Y\) is the length-\(M\) vector containing the norms of the columns of \(Y\), defined such that \(\bar{Y}1_Y=Y\mathbf{1}\). Then, \(Rw=x-\frac{1}{M} Y\mathbf{1}\). Using methods for block-encoding and state preparation from classical data, one constructs \(\mathcal{O}\left( \log(NM) \right)\)-depth circuits that block-encode \(R\) (with normalization factor \(\nrm{R}_F = \sqrt{2}\)) and prepare the state \(\ket{w}\). If we apply the block-encoding of \(R\) to \(\ket{w}\) and measure the block-encoding ancillas, the probability that we obtain \(\ket{0}\) is precisely \((\nrm{Rw}/(\sqrt{2}\nrm{w}))^2\). Thus, using amplitude estimation, one learns \(\nrm{Rw}\) to precision \(\epsilon\) with probability at least \(1-\delta\) at cost \(\mathcal{O}\left( \nrm{w}\log(1/\delta)/\epsilon \right)\) calls to the log-depth block-encoding and state preparation routines.
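The linear-algebraic identities underlying this construction can be checked directly. The sketch below, with synthetic \(x\) and \(Y\), verifies that \(Rw = x - \frac{1}{M}Y\mathbf{1}\) and \(\nrm{R}_F = \sqrt{2}\), and evaluates the success probability \((\nrm{Rw}/(\sqrt{2}\nrm{w}))^2\) targeted by amplitude estimation; the dimensions are arbitrary.

```python
import numpy as np

# Check R w = x - Y 1 / M and ||R||_F = sqrt(2) for the nearest-centroid construction.
rng = np.random.default_rng(8)
N, M = 6, 20
x = rng.normal(size=N)
Y = rng.normal(size=(N, M))

col_norms = np.linalg.norm(Y, axis=0)            # the vector 1_Y
Y_bar = Y / col_norms                            # columns normalized to unit norm
R = np.column_stack([x / np.linalg.norm(x), Y_bar / np.sqrt(M)])
w = np.concatenate([[np.linalg.norm(x)], -col_norms / np.sqrt(M)])

centroid = Y @ np.ones(M) / M
print(np.allclose(R @ w, x - centroid))          # True
print(np.linalg.norm(R, 'fro'), np.sqrt(2))      # Frobenius norm equals sqrt(2)
prob_zero = (np.linalg.norm(R @ w) / (np.sqrt(2) * np.linalg.norm(w)))**2
print(prob_zero)                                 # amplitude-estimation target
```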

The advantage over naive classical methods essentially boils down to the assumption of efficient classical data loading for a specific data set. Subsequently, this quantum algorithm was dequantized, and it was understood that a similar feat is possible classically in the SQ access model [8]. Specifically, the classical algorithm runs in time \(\widetilde{\mathcal{O}}\left( \nrm{w}^2\log(1/\delta)/\epsilon^2 \right)\), reducing the exponential speedup to merely quadratic.

Caveats

The overwhelming caveat in these and other proposals is access to the classical data in quantum superposition. These quantum machine learning algorithms assume that we can load a vector of \(N\) entries or a matrix of \(N^2\) entries in \(\mathrm{polylog}(N)\) time. While efficient quantum data structures, i.e. QRAM, have been proposed for this task, they introduce a number of caveats. In order to coherently load \(N\) pieces of data in \(\log(N)\) time, QRAM uses a number of ancilla qubits, arranged in a tree structure. To load data of size \(N\), the QRAM data structure requires \(\mathcal{O}\left( N \right)\) qubits, which is exponentially larger than the \(\mathcal{O}\left( \log(N) \right)\) data qubits used in the algorithms above. This spatial complexity does not yet include the overheads of quantum error correction and fault-tolerant computation, in particular the large spatial resources required to distill magic states in parallel. As such, we do not yet know if it is possible to build a QRAM that can load the data sufficiently quickly, while maintaining moderate spatial resources.

In addition, achieving speedups by efficiently representing the data as a quantum state may suggest that methods based on tensor networks could achieve similar performance, in some settings. Taking this line of reasoning to the extreme, a number of efficient classical algorithms have been developed by "dequantizing" the quantum algorithms. That is, by assuming an analogous access model (the SQ access model) to the training data, as well as some assumptions on sparsity and/or rank of the inputs, there exist approximate classical sampling algorithms with polynomial overhead as compared to the quantum algorithms [8, 22]. This means that any apparent exponential speedup must be an artifact of the data loading/data access assumptions.

A further caveat is inherited from the QLSS subroutine, which is that the complexity is large when the matrices involved are ill conditioned. This caveat is somewhat mitigated in the Gaussian process regression and support vector machine examples above, where the matrix to be inverted is regularized by adding the identity matrix.

End-to-end resource analysis

To the best of our knowledge, full end-to-end resource estimation has not been performed for any specific quantum machine learning tasks.

Outlook

Much of the promise of quantum speedup for classical machine learning based on linear algebra hinges on the extent to which quantum algorithms can be dequantized. At present, the results of [8] seem to prohibit an exponential speedup for many of the problems proposed, but there is still the possibility of a large polynomial speedup. The most recent asymptotic scaling analysis [16] for dequantization methods still allows for a power-4 speedup in the Frobenius norm of the "data matrix" and a power-9 speedup in the polynomial approximation degree (see [23] for more details). However, the classical algorithms are steadily improving, and their scaling might be further reduced.

It is also worth noting that the classical probabilistic algorithms based on the SQ access model are not currently used in practice. This could be due to a number of reasons, including the poor polynomial scaling, the fact that the access model might not be well suited to many practical scenarios, or simply because the method is new and has not been tested in practice (see [24, 25] for some work in this direction).

On the other hand, some machine learning tasks based on quantum linear algebra are not known to be dequantized, such as Gaussian process regression under the assumption that the kernel matrix is sparse. In particular, avoiding dequantization and achieving an exponential quantum speedup appears to require that the matrices involved are simultaneously sparse, well conditioned, and have a large Frobenius norm. In this situation, the quantum algorithm can leverage block-encodings for which the normalization factor is equal to the sparsity, rather than general block-encodings of classical data for which the normalization factor is the Frobenius norm. Quantum-inspired classical algorithms based on SQ access will still scale polynomially with the Frobenius norm, although other classical algorithms may be able to exploit the sparsity more directly. Perhaps unsurprisingly, the prototypical matrices that satisfy these criteria are sparse unitary matrices (such as those naturally implemented by a local quantum gate). For unitary matrices, the condition number is 1, and the Frobenius norm is equal to the square root of the Hilbert space dimension—exponentially large in the system size. A central question is whether situations like this occur in interesting end-to-end machine learning problems. Even if they do, an exponential speedup is not guaranteed. An additional hurdle arises in the quantum readout step, which incurs a cost scaling inversely with the target precision. To avoid exponential overhead, the end-to-end problem must not require exponentially small precision.

Further reading

For further reading, we recommend the following review articles and references therein: Machine learning with quantum computers [26], Quantum machine learning [27].

Bibliography

  1. Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum algorithms for supervised and unsupervised machine learning. arXiv: https://arxiv.org/abs/1307.0411, 2013.

  2. Seth Lloyd, Masoud Mohseni, and Patrick Rebentrost. Quantum principal component analysis. Nature Physics, 10:631–633, 2014. arXiv: https://arxiv.org/abs/1307.0401. doi:10.1038/nphys3029.

  3. Maria Schuld, Ilya Sinayskiy, and Francesco Petruccione. Prediction by linear regression on a quantum computer. Physical Review A, 94(2):022342, 2016. arXiv: https://arxiv.org/abs/1601.07823. doi:10.1103/PhysRevA.94.022342.

  4. Iordanis Kerenidis and Anupam Prakash. Quantum gradient descent for linear systems and least squares. Physical Review A, 101(2):022316, 2020. arXiv: https://arxiv.org/abs/1704.04992. doi:10.1103/PhysRevA.101.022316.

  5. Iordanis Kerenidis and Anupam Prakash. Quantum recommendation systems. In Proceedings of the 8th Innovations in Theoretical Computer Science Conference (ITCS), 49:1–49:21. 2017. arXiv: https://arxiv.org/abs/1603.08675. doi:10.4230/LIPIcs.ITCS.2017.49.

  6. Patrick Rebentrost, Masoud Mohseni, and Seth Lloyd. Quantum support vector machine for big data classification. Physical Review Letters, 113(13):130503, 2014. arXiv: https://arxiv.org/abs/1307.0471. doi:10.1103/PhysRevLett.113.130503.

  7. Zhikuan Zhao, Jack K. Fitzsimons, and Joseph F. Fitzsimons. Quantum-assisted gaussian process regression. Physical Review A, 99(5):052331, 2019. arXiv: https://arxiv.org/abs/1512.03929. doi:10.1103/PhysRevA.99.052331.

  8. Ewin Tang. Quantum principal component analysis only achieves an exponential speedup because of its state preparation assumptions. Physical Review Letters, 127(6):060503, 2021. arXiv: https://arxiv.org/abs/1811.00414. doi:10.1103/PhysRevLett.127.060503.

  9. Nai-Hui Chia, András Gilyén, Tongyang Li, Han-Hsuan Lin, Ewin Tang, and Chunhao Wang. Sampling-based sublinear low-rank matrix arithmetic framework for dequantizing quantum machine learning. In Proceedings of the 52nd ACM Symposium on the Theory of Computing (STOC), 387–400. 2020. arXiv: https://arxiv.org/abs/1910.06151. doi:10.1145/3357713.3384314.

  10. Changpeng Shao and Ashley Montanaro. Faster quantum-inspired algorithms for solving linear systems. ACM Transactions on Quantum Computing, 2022. arXiv: https://arxiv.org/abs/2103.10309. doi:10.1145/3520141.

  11. Motonobu Kanagawa, Philipp Hennig, Dino Sejdinovic, and Bharath K Sriperumbudur. Gaussian processes and kernel methods: a review on connections and equivalences. arXiv: https://arxiv.org/abs/1807.02582, 2018.

  12. Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 11 2005. ISBN 9780262256834. URL: https://doi.org/10.7551/mitpress/3206.001.0001, doi:10.7551/mitpress/3206.001.0001.

  13. Zhikuan Zhao, Jack K. Fitzsimons, Michael A. Osborne, Stephen J. Roberts, and Joseph F. Fitzsimons. Quantum algorithms for training gaussian processes. Physical Review A, 100:012304, 7 2019. arXiv: https://arxiv.org/abs/1803.10520. URL: https://link.aps.org/doi/10.1103/PhysRevA.100.012304, doi:10.1103/PhysRevA.100.012304.

  14. Haitao Liu, Yew-Soon Ong, Xiaobo Shen, and Jianfei Cai. When gaussian process meets big data: a review of scalable gps. IEEE Transactions on Neural Networks and Learning Systems, 31(11):4405–4423, 2020. arXiv: https://arxiv.org/abs/1807.01065. doi:https://doi.org/10.1109/TNNLS.2019.2957109.

  15. András Gilyén, Zhao Song, and Ewin Tang. An improved quantum-inspired algorithm for linear regression. Quantum, 6:754, 2022. arXiv: https://arxiv.org/abs/2009.07268. doi:10.22331/q-2022-06-30-754.

  16. Nai-Hui Chia, András Pal Gilyén, Tongyang Li, Han-Hsuan Lin, Ewin Tang, and Chunhao Wang. Sampling-based sublinear low-rank matrix arithmetic framework for dequantizing quantum machine learning. Journal of the ACM, 69(5):1–72, 2022. Earlier version in STOC'20, arXiv: https://arxiv.org/abs/1910.06151. doi:10.1145/3549524.

  17. Iordanis Kerenidis, Anupam Prakash, and Dániel Szilágyi. Quantum algorithms for second-order cone programming and support vector machines. Quantum, 5:427, 2021. arXiv: https://arxiv.org/abs/1908.06720. doi:10.22331/q-2021-04-08-427.

  18. J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300, 1999. URL: https://doi.org/10.1023/A:1018628609742, doi:10.1023/A:1018628609742.

  19. J.A.K. Suykens, J. De Brabanter, L. Lukas, and J. Vandewalle. Weighted least squares support vector machines: robustness and sparse approximation. Neurocomputing, 48(1):85–105, 2002. URL: https://www.sciencedirect.com/science/article/pii/S0925231201006440, doi:https://doi.org/10.1016/S0925-2312(01)00644-0.

  20. Alexander M Dalzell, B David Clader, Grant Salton, Mario Berta, Cedric Yen-Yu Lin, David A Bader, Nikitas Stamatopoulos, Martin J A Schuetz, Fernando G S L Brandão, Helmut G Katzgraber, and others. End-to-end resource analysis for quantum interior point methods and portfolio optimization. PRX Quantum, pages to appear, 2023. arXiv: https://arxiv.org/abs/2211.12489.

  21. Chen Ding, Tian-Yi Bao, and He-Liang Huang. Quantum-inspired support vector machine. IEEE Transactions on Neural Networks and Learning Systems, 33(12):7210–7222, 2022. arXiv: https://arxiv.org/abs/1906.08902. doi:10.1109/TNNLS.2021.3084467.

  22. Ewin Tang. A quantum-inspired classical algorithm for recommendation systems. In Proceedings of the 51st ACM Symposium on the Theory of Computing (STOC), 217–228. 2019. arXiv: https://arxiv.org/abs/1807.04271. doi:10.1145/3313276.3316310.

  23. Ainesh Bakshi and Ewin Tang. An improved classical singular value transformation for quantum machine learning. arXiv: https://arxiv.org/abs/2303.01492, 2023.

  24. Juan Miguel Arrazola, Alain Delgado, Bhaskar Roy Bardhan, and Seth Lloyd. Quantum-inspired algorithms in practice. Quantum, 4:307, 2020. arXiv: https://arxiv.org/abs/1905.10415. doi:10.22331/q-2020-08-13-307.

  25. Nadiia Chepurko, Kenneth Clarkson, Lior Horesh, Honghao Lin, and David Woodruff. Quantum-inspired algorithms from randomized numerical linear algebra. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning (ICML), volume 162 of Proceedings of Machine Learning Research, 3879–3900. PMLR, 7 2022. arXiv: https://arxiv.org/abs/2011.04125. URL: https://proceedings.mlr.press/v162/chepurko22a.html.

  26. Maria Schuld and Francesco Petruccione. Machine learning with quantum computers. Springer, 2021. doi:10.1007/978-3-030-83098-4.

  27. Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe, and Seth Lloyd. Quantum machine learning. Nature, 549:195–202, 2017. arXiv: https://arxiv.org/abs/1611.09347. doi:10.1038/nature23474.

  28. Meng-Han Chen, Chao-Hua Yu, Jian-Liang Gao, Kai Yu, Song Lin, Gong-De Guo, and Jing Li. Quantum algorithm for gaussian process regression. Physical Review A, 106:012406, 7 2022. arXiv: https://arxiv.org/abs/2106.06701. URL: https://link.aps.org/doi/10.1103/PhysRevA.106.012406, doi:10.1103/PhysRevA.106.012406.

  29. András Gilyén, Yuan Su, Guang Hao Low, and Nathan Wiebe. Quantum singular value transformation and beyond: exponential improvements for quantum matrix arithmetics. In Proceedings of the 51st ACM Symposium on the Theory of Computing (STOC), 193–204. 2019. arXiv: https://arxiv.org/abs/1806.01838. doi:10.1145/3313276.3316366.


  1. This can be visualized by sampling a function from the distribution, which means sampling a value of \(f(x_j)\) from the distribution for each \(x_j\), and plotting the values of \(f(x_j)\) as a curve. 

  2. For any vector \(v\), the notation \(\ket{v}\) denotes the normalized quantum state whose amplitudes in the computational basis are proportional to the entries of \(v\).

  3. It may be more efficient to load in the \(\{x_j\}\) values and then coherently evaluate the kernel entries using quantum arithmetic. Some ideas in this direction are explored in [28]. One might also consider block-encoding \(K\) and \(\sigma^2 I\) separately and combining them with a linear combination of unitaries.

  4. For the squared exponential covariance function mentioned above, the kernel matrix will not be sparse, but [7] notes several applications of GPR where sparsity is well justified. 

  5. Our definition of the least-squares SVM is equivalent to the normal presentation found in [18, 6]; however, we choose slightly different conventions for normalization of certain parameters, such as \(\gamma\), with respect to \(M\). The goal of our choices is to make the final complexity expression free of any explicit \(M\) dependence. 

  6. We sketch a possible instantiation of this method here. Define \(\ket{x_i} = \nrm{x_i}^{-1} \sum_{k=1}^M x_{ik}\ket{k}\) where \(x_{ik}\) is the \(k\)th entry of \(x_i\). Suppose \(M=2^m\) is a power of 2. Following the setup in block-encodings and [29, Lemma 47], we must define sets of \(M\) orthonormal states \(\{\ket{\psi_i}\}\) and \(\{\ket{\phi_j}\}\). We choose \(\ket{\psi_i} = (\nrm{x_i}\ket{x_i} + \sqrt{1-\nrm{x_i}^2}\ket{M+1})(H^{\otimes m} \ket{i})\ket{0^m}\), where \(H\) denotes the Hadamard transform. We choose \(\ket{\phi_j} = (\nrm{x_j}\ket{x_j} + \sqrt{1-\nrm{x_j}^2}\ket{M+2})\ket{0^m}(H^{\otimes m}\ket{j})\). These states can be prepared in \(\mathcal{O}\left( \log(M) \right)\) depth using \(\mathcal{O}\left( M \right)\) total gates and ancilla qubits with methods for controlled state preparation from classical data. It can be verified that these sets are orthonormal, and that \(\braket{\psi_i}{\phi_j} = \langle x_i, x_j \rangle/M\). Hence, the Gram matrix construction yields a block-encoding of \(K/M\) with normalization factor 1.