Quantum neural networks and quantum kernel methods
Overview
In this article we discuss two collections of proposals to use a quantum computer as a machine learning model, often known as quantum neural networks and quantum kernel methods. Many early ideas were motivated by the constraints of near-term "NISQ" [1] devices. That said, not all subsequent proposals are necessarily implementable on NISQ devices, and the proposals need not be restricted to NISQ devices either: they could also be run on devices with explicit quantum error correction. For simplicity, we present concrete examples based on supervised machine learning tasks. However, outside of these examples we keep our discussion more general, and note that the techniques are also applicable to other settings, such as unsupervised learning.
Given access to some data, our goal is to obtain a function or distribution that emulates certain properties of the data, which we will call a model. This is obtained by first defining a model family or hypothesis set, and using a learning algorithm to select a model from this set. For example, in supervised learning, we have data \(x_i \in X\) that have respective labels \(y_i \in Y\). The goal is then to find a model function \(h: X \rightarrow Y\) which correctly labels previously unseen data with high probability. Note that we have left the exact descriptions of the sets \(X\) and \(Y\) ambiguous. They could, for instance, correspond to sets of numbers or vectors. More generally, this description encompasses the possibility of operating on quantum data such that each \(x_i\) corresponds to a quantum state.
Quantum neural networks and quantum kernel methods use a quantum computer to assist in constructing the model, in place of a classical model such as a neural network. Specifically, the model is constructed by preparing some quantum state(s) encoding the data and measuring some observable(s) to obtain model predictions. We now elaborate on quantum neural networks and quantum kernel methods in turn.
Quantum neural networks
Actual end-to-end problem(s) solved. Given data \(x\), we consider a model constructed from a parameterized quantum circuit:
\[\begin{equation} \label{eq:qnn} h_{\boldsymbol{\theta}}(x) = \operatorname{Tr}[O\,\rho(x, \boldsymbol{\theta})]\,, \end{equation}\]
where \(\rho(x, \boldsymbol{\theta} )\) is a quantum state that encodes both the data \(x\) as well as a set of adjustable parameters \(\boldsymbol{\theta}\), and \(O\) is some chosen measurement observable. For instance, if \(x\) corresponds to a classical vector, \(\rho(x, \boldsymbol{\theta} )\) could correspond to initializing in the \(\ketbra{0}{0}\) state and applying some data-encoding gates \(U(x)\) followed by parameterized gates \(V(\boldsymbol{\theta})\). Alternatively, the data itself could be a quantum state, and a more general operation in the form of a parameterized channel \(\mathcal{V}(\boldsymbol{\theta})\) could be applied. The model is optimized via a learning algorithm which aims to find the optimal parameters \(\boldsymbol{\theta}^*\) by minimizing a loss function, which assesses the quality of the model. For instance, in supervised learning, given some labelled training data set \(T=\{(x_i, y_i)\}\), a suitable choice of loss should compare how close each \(h_{\boldsymbol{\theta}}(x_i)\) is to the true label \(y_i\) for all data in \(T\). The quality of the model can then be assessed on a set of previously unseen data outside of \(T\).
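To make Eq. \(\eqref{eq:qnn}\) concrete, the following is a minimal single-qubit sketch, assuming an angle encoding \(U(x)=R_Y(x)\), a single trainable gate \(V(\boldsymbol{\theta})=R_Y(\theta)\), and the observable \(O=Z\); these choices are purely illustrative and not prescribed by the text.

```python
import numpy as np

# Pauli-Z observable and the |0> state (single qubit)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
ket0 = np.array([1, 0], dtype=complex)

def ry(angle):
    """Single-qubit Y-rotation gate RY(angle)."""
    c, s = np.cos(angle / 2), np.sin(angle / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def qnn_model(x, theta):
    """h_theta(x) = Tr[O rho(x, theta)], with rho prepared as V(theta) U(x) |0><0| U(x)^† V(theta)^†."""
    psi = ry(theta) @ ry(x) @ ket0          # data-encoding gate U(x), then trainable gate V(theta)
    return np.real(np.conj(psi) @ Z @ psi)  # expectation value of the observable O = Z

print(qnn_model(x=0.3, theta=0.1))  # a single model prediction in [-1, 1]
```

For these choices the model reduces analytically to \(h_{\theta}(x)=\cos(x+\theta)\), which the training sketch further below reuses.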
We remark that this setting has substantial overlap with that of variational quantum algorithms (VQAs); indeed, quantum neural networks can be thought of as VQAs that incorporate data, and so the same challenges and considerations that apply to VQAs also apply here. There are additional considerations due to the role of the data.
Dominant resource cost/complexity. The encoding of data \(x\) and parameters \(\boldsymbol{\theta}\) in Eq. \(\eqref{eq:qnn}\) should be sufficiently expressive that it (1) leads to good performance on data and (2) is (at minimum) not efficiently simulable classically, if one is to seek quantum advantage. These criteria can be used to derive lower bounds on the circuit depth, in some settings.
The learning algorithm to find the optimal parameters is usually a classical heuristic, such as gradient descent, and can have significant time overhead, requiring evaluation of Eq. \(\eqref{eq:qnn}\) at many parameter values (see variational quantum algorithms for more details).
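As an illustration of such a heuristic training loop, the sketch below performs plain gradient descent on a mean-squared-error loss using the parameter-shift rule, applied to the analytic single-qubit model \(h_{\theta}(x)=\cos(x+\theta)\) from the sketch above; the toy dataset, learning rate, and number of steps are arbitrary choices.

```python
import numpy as np

# Analytic expectation of the one-qubit model sketched above: h_theta(x) = cos(x + theta).
def model(x, theta):
    return np.cos(x + theta)

def loss(theta, xs, ys):
    """Mean-squared-error loss over a toy labelled training set."""
    return np.mean((model(xs, theta) - ys) ** 2)

def parameter_shift_grad(theta, xs, ys, shift=np.pi / 2):
    """Gradient via the parameter-shift rule, chained through the MSE loss."""
    dh = (model(xs, theta + shift) - model(xs, theta - shift)) / 2.0  # dh/dtheta for each x
    return np.mean(2.0 * (model(xs, theta) - ys) * dh)

# Toy data labelled by a "target" parameter theta* = 0.7
rng = np.random.default_rng(0)
xs = rng.uniform(0, np.pi, size=20)
ys = np.cos(xs + 0.7)

theta, eta = 0.0, 0.2
for step in range(500):                     # plain gradient descent
    theta -= eta * parameter_shift_grad(theta, xs, ys)
print(theta, loss(theta, xs, ys))           # theta approaches 0.7, loss approaches 0
```

In practice each evaluation of the model (and each shifted evaluation) corresponds to estimating an expectation value on the quantum device, which is where the time overhead arises.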
The size of the training dataset required can also have direct implications for the runtime, with a larger amount of training data typically taking longer to process. Reference [2] proves that good generalization can be achieved with the size of the training data \(|T|\) growing in tandem with the number of adjustable parameters \(M\). Specifically, it is shown that the deviation between training error (performance on the training data set) and test error (performance on previously unseen data) with high probability scales as \(\mathcal{O}\left( \sqrt{M\log(M) / |T|} \right)\). Thus, only a mild amount of data is required for good generalization. We stress that this alone does not say anything about the ability of quantum neural networks to achieve low training error.
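As a rough numerical illustration of this scaling (with all constants suppressed), one can solve \(\sqrt{M\log(M)/|T|} \le \epsilon\) for the training-set size:

```python
import numpy as np

def generalization_gap_scaling(M, T_size):
    """Scaling of the train/test error gap from Ref. [2], with constants suppressed."""
    return np.sqrt(M * np.log(M) / T_size)

# Training data needed so that the scaling factor drops below a target epsilon:
# |T| ~ M log(M) / epsilon^2, i.e. roughly linear (up to a log factor) in the parameter count M.
M, epsilon = 200, 0.1
T_needed = M * np.log(M) / epsilon ** 2
print(T_needed, generalization_gap_scaling(M, T_needed))  # the second value equals epsilon by construction
```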
Scope for advantage. Quantum neural networks could achieve advantage in a number of ways, including improved runtime or needing less training data. In supervised learning settings, generalization performance is a separate consideration, and an additional domain for possible quantum advantage. Machine learning with quantum neural networks has yielded promising empirical performance, and encouraging theoretical guarantees exist for certain stages of the full pipeline in restricted settings [3, 4, 2, 5, 6]. Nevertheless, there are currently no practical use cases with full end-to-end performance guarantees.
Quantum kernel methods
Actual end-to-end problem(s) solved. Quantum kernel methods are a generalization of classical kernel methods, of which support vector machines are a prominent example. Given a dataset \(T=\{x_i\}\subset X\), the model can be written
\[\begin{equation} \label{eq:kernel} h_{\boldsymbol{\alpha}}(x) = \sum_{i} \alpha_i\, \kappa(x, x_i)\,, \end{equation}\]
where \(\boldsymbol{\alpha}=(\alpha_1, \alpha_2, \ldots)\) is a vector of parameters to be optimized, and \(\kappa(x,x'): X \times X \rightarrow \mathbb{R}\) is a measure of similarity known as the kernel function. This model has several theoretical motivations (a minimal classical sketch follows the list below):
- If the matrix with entries \(K_{ij}=\kappa(x_i,x_j)\) is symmetric positive semi-definite for any \(\{x_1,...,x_m\}\subseteq X\), then \(\kappa(x_i,x_j)\) can be interpreted as an inner product of feature vectors \(\phi(x_i), \phi(x_j)\) which embed the data \(x_i\) and \(x_j\) in a (potentially high dimensional) Hilbert space. Due to the so-called kernel trick, linear statistical methods can be used to learn a linear function in this high dimensional space using only the inner products \(\kappa(x_i,x_j)\), without ever explicitly evaluating \(\phi(x_i)\) and \(\phi(x_j)\).
- Concretely, the Representer Theorem [7] states that the optimal model over the dataset \(T\) can be expressed as a linear combination of kernel values evaluated over \(T\)—that is, the optimal model exactly takes the form in Eq. \(\eqref{eq:kernel}\).
- Further, if the loss function is convex, then the dual optimization program to find the optimal parameters \(\boldsymbol{\alpha}^*\) is also convex [8].
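As a purely classical illustration of the model form in Eq. \(\eqref{eq:kernel}\) and of the convex optimization over \(\boldsymbol{\alpha}\), the sketch below fits a kernel ridge regression model with a classical RBF kernel standing in for the kernel function; a quantum kernel method would simply replace `rbf_kernel` with kernel values estimated on a quantum computer. The toy data, regularization strength, and hyperparameters are arbitrary.

```python
import numpy as np

def rbf_kernel(x, xp, gamma=1.0):
    """Classical RBF kernel as a placeholder; a quantum kernel would replace this function."""
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def gram_matrix(kernel, X1, X2):
    """Matrix of kernel values K[i, j] = kernel(X1[i], X2[j])."""
    return np.array([[kernel(a, b) for b in X2] for a in X1])

# Toy 1D regression data
rng = np.random.default_rng(1)
X_train = rng.uniform(-2, 2, size=(30, 1))
y_train = np.sin(2 * X_train[:, 0]) + 0.05 * rng.normal(size=30)

# Kernel ridge regression: alpha = (K + lambda I)^{-1} y, a convex problem with a closed form.
K = gram_matrix(rbf_kernel, X_train, X_train)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(X_train)), y_train)

# Predictions take exactly the form of Eq. (kernel): h(x) = sum_i alpha_i * kappa(x, x_i).
X_test = np.linspace(-2, 2, 5).reshape(-1, 1)
preds = gram_matrix(rbf_kernel, X_test, X_train) @ alpha
print(preds)
```

A support vector machine would instead solve the convex dual program over \(\boldsymbol{\alpha}\) subject to constraints; the ridge-regression closed form is used here only to keep the sketch short.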
A key question that remains is then how to choose a kernel function. Quantum kernel methods embed data in quantum states, and thus evaluate \(\kappa(x_i,x_j)\) on a quantum computer. Similar to quantum neural networks or any other quantum model, the quantum kernel should be hard to simulate classically. As an example, we present two common choices of quantum kernel.
- The fidelity quantum kernel
\[\begin{equation} \label{eq:fidelity-kernel} \kappa_F(x,x') = \operatorname{Tr}[\rho(x)\rho(x')]\,, \end{equation}\]
which can be evaluated either with a SWAP test or, for classical data with unitary embeddings, with the overlap circuit \(|\bra{0}U(x')^{\dag}U(x)\ket{0}|^2\).
- The fidelity kernel can run into issues for high dimensional systems, as the inner product in Eq. \(\eqref{eq:fidelity-kernel}\) can be very small for \(x\neq x'\). This motivated the proposal of a family of projected quantum kernels [9], of which one example is the Gaussian projected quantum kernel
\[\begin{equation} \kappa_P(x,x') =\exp \left(-\gamma \sum_{k=1}^n\left\|\rho_k(x)-\rho_k\left(x'\right)\right\|_2^2\right) \end{equation}\]
where \(\rho_k(x)\) is the reduced state of the \(n\)-qubit state \(\rho(x)\) on qubit \(k\), and \(\gamma\) is a hyperparameter. Both kernels are illustrated in the sketch below.
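The sketch below evaluates both kernels by statevector simulation for a toy two-qubit product-state embedding \(\rho(x)=U(x)\ketbra{0}{0}U(x)^{\dag}\) with \(U(x)=R_Y(x_1)\otimes R_Y(x_2)\); this embedding is an illustrative assumption (and is classically simulable, unlike the encodings one would use to seek advantage).

```python
import numpy as np

def ry(a):
    c, s = np.cos(a / 2), np.sin(a / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def encode(x):
    """Two-qubit product-state embedding |psi(x)> = RY(x_1)|0> (x) RY(x_2)|0> (illustrative choice)."""
    return np.kron(ry(x[0]) @ [1, 0], ry(x[1]) @ [1, 0])

def fidelity_kernel(x, xp):
    """kappa_F(x, x') = |<psi(x')|psi(x)>|^2 = Tr[rho(x) rho(x')] for pure states."""
    return np.abs(np.vdot(encode(xp), encode(x))) ** 2

def reduced_state(psi, qubit):
    """Reduced density matrix of one qubit of a two-qubit pure state."""
    psi = psi.reshape(2, 2)
    if qubit == 0:
        return psi @ psi.conj().T
    return psi.T @ psi.conj()

def projected_kernel(x, xp, gamma=1.0):
    """Gaussian projected kernel built from single-qubit reduced states."""
    psi, psip = encode(x), encode(xp)
    d = sum(np.linalg.norm(reduced_state(psi, k) - reduced_state(psip, k), 'fro') ** 2
            for k in range(2))
    return np.exp(-gamma * d)

x, xp = np.array([0.3, 1.2]), np.array([0.5, 0.9])
print(fidelity_kernel(x, xp), projected_kernel(x, xp))
```

Here both kernels can be computed exactly because the toy embedding is classically simulable; in practice they would instead be estimated from measurement outcomes on a quantum device.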
Dominant resource cost/complexity. During the optimization of the dual program to find the optimal parameters \(\boldsymbol{\alpha}^*\), \(\mathcal{O}\left( |T|^2 \right)\) expectation values corresponding to the kernel values in Eq. \(\eqref{eq:kernel}\) need to be accurately evaluated; further kernel evaluations are needed when computing \(h_{\boldsymbol{\alpha}^*}(x)\) for a new data point \(x\) with the optimized model. This can lead to a significant overhead in applications with large datasets. Alternatively, the primal optimization problem has reduced complexity in the dataset size, but a greatly exacerbated dependence on the error [10]. The gate complexity is wholly dependent on the choice of data encoding leading to the kernel function. As the kernel function should not be classically simulable, this gives intuition that there should be some minimum requirements in terms of gate complexity. However, in the absence of standardized techniques for data encoding it is hard to make more precise statements.
Scope for advantage. In Ref. [11] the authors demonstrate that, using a specially constructed dataset and data embedding, concrete quantum advantage can be obtained for a machine learning problem based on the discrete logarithm problem. The original work was based on the fidelity kernel, but a similar advantage can also be obtained more simply with the projected quantum kernel [9]. This can also be adapted beyond kernel methods to the reinforcement learning setting [12]. While great strides have been made in understanding the complexity of quantum kernel methods [13, 9], at present there do not yet exist examples of end-to-end theoretical guarantees of advantage for more physically relevant classical data.
Caveats
One consideration we have not discussed so far is how to encode classical data into a quantum circuit, which is a significant aspect of constructing the quantum model. There are many possibilities, such as amplitude encoding or encoding data into the rotation angles of single-qubit rotations (e.g., see [14, 15, 16, 17]). While certain strategies are popular, in general it is unclear what the best choice is for a given problem, and thus selecting the data-encoding strategy can itself be a heuristic process. The same question extends to the choice of quantum neural network or quantum kernel. While certain choices may perform well in specific problem instances, there is at present a lack of strong evidence why such approaches may be advantageous over their classical counterparts in general.
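For concreteness, the sketch below contrasts two such encodings at the level of the resulting statevectors: amplitude encoding, which uses a number of qubits logarithmic in the feature dimension, and single-qubit angle encoding, which uses one qubit per feature. The specific conventions (normalization, \(R_Y\) rotations) are illustrative assumptions.

```python
import numpy as np

def amplitude_encoding(x):
    """Encode a length-2^n real vector into the amplitudes of an n-qubit state (up to normalization)."""
    x = np.asarray(x, dtype=complex)
    return x / np.linalg.norm(x)

def angle_encoding(x):
    """Encode each feature into a single-qubit RY rotation angle, giving a product state."""
    state = np.array([1.0], dtype=complex)
    for xi in x:
        qubit = np.array([np.cos(xi / 2), np.sin(xi / 2)], dtype=complex)  # RY(xi)|0>
        state = np.kron(state, qubit)
    return state

x = np.array([0.1, 0.4, 0.2, 0.9])
print(amplitude_encoding(x))   # 2 qubits: 4 amplitudes
print(angle_encoding(x))       # 4 qubits: product of 4 single-qubit states (16 amplitudes)
```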
While optimization of parameterized quantum circuits is predominantly a concern for quantum neural networks, the search for good quantum kernels has also motivated proposals of trainable kernels [16, 18, 19] where a parameterized quantum circuit is used to construct the quantum kernel (note, this is distinct from the "classical" optimization of \(\boldsymbol{\alpha}\) in Eq. \(\eqref{eq:kernel}\)). In the case that the parameter optimization process is performed using heuristics, it is subject to the same challenges and considerations that arise with VQAs (see variational quantum algorithms for more details).
Finite statistics is an important consideration for both settings. Where there is optimization of parameterized quantum circuits, one must take care to avoid the barren plateau phenomenon [20, 21, 22, 23, 24] (again see variational quantum algorithms for more details). Analogous effects can also occur in the kernel setting [25], which can arise even purely due to the data-encoding circuit [9, 26].
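As a simple numerical illustration of such concentration, the sketch below estimates typical fidelity-kernel values between Haar-random states, used here as a stand-in for highly expressive data-encoding circuits: the values concentrate around \(2^{-n}\), so exponentially many measurement shots would be needed to resolve them against shot noise.

```python
import numpy as np

def haar_random_state(n_qubits, rng):
    """Haar-random pure state via a normalized complex Gaussian vector."""
    dim = 2 ** n_qubits
    v = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return v / np.linalg.norm(v)

rng = np.random.default_rng(42)
for n in (2, 4, 6, 8):
    overlaps = [np.abs(np.vdot(haar_random_state(n, rng), haar_random_state(n, rng))) ** 2
                for _ in range(200)]
    # Typical fidelity-kernel values shrink as ~ 2^{-n}, illustrating exponential concentration.
    print(n, np.mean(overlaps), 2.0 ** -n)
```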
Outlook
The use of classical machine learning models is often highly heuristic, and guided by empirical evidence or physical intuition. Despite this, they have found remarkable success in solving many computational problems. The quantum techniques outlined in this section also broadly follow this approach (though theoretical progress has also been substantial in certain areas), and there is no a priori reason why they cannot also be useful. Nevertheless, it is challenging to make concrete predictions for quantum advantage, particularly on classical data. This is exacerbated by our limited analytic understanding of end-to-end applications, even in the fully classical setting. Indeed, it may ultimately be challenging to have the same complete end-to-end theoretical analysis that other quantum algorithms enjoy, aside from a few select examples [27]. Within the realm of quantum data, there appears to be ripe potential for concrete provable advantage [28, 29, 30]; however, this is beyond the scope of this article.
Further reading
Refs. [8, 16] provide pedagogical expositions of quantum kernel methods, Refs. [31, 32] are comprehensive reviews of quantum neural networks, and Ref. [33] is a review of quantum machine learning models at large, including an exposition of machine learning with quantum data.
Bibliography
-
John Preskill. Quantum computing in the NISQ era and beyond. Quantum, 2:79, 2018. arXiv: https://arxiv.org/abs/1801.00862. doi:10.22331/q-2018-08-06-79.
-
Matthias C Caro, Hsin-Yuan Huang, Marco Cerezo, Kunal Sharma, Andrew Sornborger, Lukasz Cincio, and Patrick J Coles. Generalization in quantum machine learning from few training data. Nature Communications, 13(1):4919, 2022. arXiv: https://arxiv.org/abs/2111.05292. URL: https://www.nature.com/articles/s41467-022-32550-3, doi:https://doi.org/10.1038/s41467-022-32550-3.
-
Louis Schatzki, Martin Larocca, Frederic Sauvage, and Marco Cerezo. Theoretical guarantees for permutation-equivariant quantum neural networks. arXiv: https://arxiv.org/abs/2210.09974, 2022.
-
Matthias C Caro, Elies Gil-Fuster, Johannes Jakob Meyer, Jens Eisert, and Ryan Sweke. Encoding-dependent generalization bounds for parametrized quantum circuits. Quantum, 5:582, 2021. arXiv: https://arxiv.org/abs/2106.03880. doi:https://doi.org/10.22331/q-2021-11-17-582.
-
Junyu Liu, Khadijeh Najafi, Kunal Sharma, Francesco Tacchino, Liang Jiang, and Antonio Mezzacapo. Analytic theory for the dynamics of wide quantum neural networks. Physical Review Letters, 130:150601, 4 2023. arXiv: https://arxiv.org/abs/2203.16711. URL: https://link.aps.org/doi/10.1103/PhysRevLett.130.150601, doi:10.1103/PhysRevLett.130.150601.
-
Xuchen You, Shouvanik Chakrabarti, Boyang Chen, and Xiaodi Wu. Analyzing convergence in quantum neural networks: deviations from neural tangent kernels. arXiv: https://arxiv.org/abs/2303.14844, 2023. URL: https://arxiv.org/abs/2303.14844.
-
Bernhard Schölkopf, Ralf Herbrich, and Alex J. Smola. A generalized representer theorem. In Proceedings of the 14th Conference On Learning Theory (COLT), 416–426. Springer Berlin Heidelberg, 2001. doi:https://doi.org/10.1007/3-540-44581-1\_27.
-
Maria Schuld. Supervised quantum machine learning models are kernel methods. arXiv: https://arxiv.org/abs/2101.11020, 2021.
-
Hsin-Yuan Huang, Michael Broughton, Masoud Mohseni, Ryan Babbush, Sergio Boixo, Hartmut Neven, and Jarrod R McClean. Power of data in quantum machine learning. Nature Communications, 12(1):2631, 2021. arXiv: https://arxiv.org/abs/2011.01938. doi:https://doi.org/10.1038/s41467-021-22539-9.
-
Gian Gentinetta, Arne Thomsen, David Sutter, and Stefan Woerner. The complexity of quantum support vector machines. arXiv: https://arxiv.org/abs/2203.00031, 2022.
-
Yunchao Liu, Srinivasan Arunachalam, and Kristan Temme. A rigorous and robust quantum speed-up in supervised machine learning. Nature Physics, 17(9):1013–1017, 2021. arXiv: https://arxiv.org/abs/2010.02174. URL: https://www.nature.com/articles/s41567-021-01287-z, doi:https://doi.org/10.1038/s41567-021-01287-z.
-
Sofiene Jerbi, Casper Gyurik, Simon Marshall, Hans Briegel, and Vedran Dunjko. Parametrized quantum policies for reinforcement learning. In Advances in Neural Information Processing Systems 34 (NIPS), 28362–28375. 2021. arXiv: https://arxiv.org/abs/2103.05577. URL: https://proceedings.neurips.cc/paper/2021/hash/eec96a7f788e88184c0e713456026f3f-Abstract.html.
-
Leonardo Banchi, Jason Pereira, and Stefano Pirandola. Generalization in quantum machine learning: a quantum information standpoint. PRX Quantum, 2:040321, 11 2021. arXiv: https://arxiv.org/abs/2102.08991. URL: https://link.aps.org/doi/10.1103/PRXQuantum.2.040321, doi:10.1103/PRXQuantum.2.040321.
-
Seth Lloyd, Maria Schuld, Aroosa Ijaz, Josh Izaac, and Nathan Killoran. Quantum embeddings for machine learning. arXiv: https://arxiv.org/abs/2001.03622, 2020. URL: https://arxiv.org/abs/2001.03622.
-
Vojtěch Havlíček, Antonio D Córcoles, Kristan Temme, Aram W Harrow, Abhinav Kandala, Jerry M Chow, and Jay M Gambetta. Supervised learning with quantum-enhanced feature spaces. Nature, 567(7747):209–212, 2019. arXiv: https://arxiv.org/abs/1804.11326. URL: https://www.nature.com/articles/s41586-019-0980-2, doi:10.1038/s41586-019-0980-2.
-
Thomas Hubregtsen, David Wierichs, Elies Gil-Fuster, Peter-Jan H. S. Derks, Paul K. Faehrmann, and Johannes Jakob Meyer. Training quantum embedding kernels on near-term quantum computers. Physical Review A, 106:042431, 10 2022. arXiv: https://arxiv.org/abs/2105.02276. URL: https://link.aps.org/doi/10.1103/PhysRevA.106.042431, doi:10.1103/PhysRevA.106.042431.
-
Ryan LaRose and Brian Coyle. Robust data encodings for quantum classifiers. Physical Review A, 102:032420, 9 2020. arXiv: https://arxiv.org/abs/2003.01695. URL: https://link.aps.org/doi/10.1103/PhysRevA.102.032420, doi:10.1103/PhysRevA.102.032420.
-
Gian Gentinetta, David Sutter, Christa Zoufal, Bryce Fuller, and Stefan Woerner. Quantum kernel alignment with stochastic gradient descent. arXiv: https://arxiv.org/abs/2304.09899, 2023.
-
Jennifer R Glick, Tanvi P Gujarati, Antonio D Corcoles, Youngseok Kim, Abhinav Kandala, Jay M Gambetta, and Kristan Temme. Covariant quantum kernels for data with group structure. arXiv: https://arxiv.org/abs/2105.03406, 2021.
-
Jarrod R McClean, Sergio Boixo, Vadim N Smelyanskiy, Ryan Babbush, and Hartmut Neven. Barren plateaus in quantum neural network training landscapes. Nature Communications, 9(1):1–6, 2018. arXiv: https://arxiv.org/abs/1803.11173. URL: https://www.nature.com/articles/s41467-018-07090-4, doi:10.1038/s41467-018-07090-4.
-
M. Cerezo, Akira Sone, Tyler Volkoff, Lukasz Cincio, and Patrick J Coles. Cost function dependent barren plateaus in shallow parametrized quantum circuits. Nature Communications, 12(1):1–12, 2021. arXiv: https://arxiv.org/abs/2001.00550. URL: https://www.nature.com/articles/s41467-021-21728-w, doi:10.1038/s41467-021-21728-w.
-
Zoë Holmes, Kunal Sharma, M. Cerezo, and Patrick J Coles. Connecting ansatz expressibility to gradient magnitudes and barren plateaus. PRX Quantum, 3:010313, 1 2022. arXiv: https://arxiv.org/abs/2101.02138. URL: https://link.aps.org/doi/10.1103/PRXQuantum.3.010313, doi:10.1103/PRXQuantum.3.010313.
-
Carlos Ortiz Marrero, Mária Kieferová, and Nathan Wiebe. Entanglement-induced barren plateaus. PRX Quantum, 2(4):040316, 2021. arXiv: https://arxiv.org/abs/2010.15968. URL: https://journals.aps.org/prxquantum/abstract/10.1103/PRXQuantum.2.040316, doi:10.1103/PRXQuantum.2.040316.
-
Kunal Sharma, M. Cerezo, Lukasz Cincio, and Patrick J Coles. Trainability of dissipative perceptron-based quantum neural networks. Physical Review Letters, 128(18):180505, 2022. arXiv: https://arxiv.org/abs/2005.12458. doi:10.1103/PhysRevLett.128.180505.
-
Jonas Kübler, Simon Buchholz, and Bernhard Schölkopf. The inductive bias of quantum kernels. In Advances in Neural Information Processing Systems 34 (NIPS), 12661–12673. 2021. arXiv: https://arxiv.org/abs/2106.03747. URL: https://proceedings.neurips.cc/paper/2021/hash/69adc1e107f7f7d035d7baf04342e1ca-Abstract.html.
-
Supanut Thanasilp, Samson Wang, Marco Cerezo, and Zoë Holmes. Exponential concentration and untrainability in quantum kernel methods. arXiv: https://arxiv.org/abs/2208.11060, 2022. URL: https://arxiv.org/abs/2208.11060.
-
Maria Schuld and Nathan Killoran. Is quantum advantage the right goal for quantum machine learning? PRX Quantum, 3(3):030101, 2022. arXiv: https://arxiv.org/abs/2203.01340. URL: https://link.aps.org/doi/10.1103/PRXQuantum.3.030101, doi:10.1103/PRXQuantum.3.030101.
-
Hsin-Yuan Huang, Michael Broughton, Jordan Cotler, Sitan Chen, Jerry Li, Masoud Mohseni, Hartmut Neven, Ryan Babbush, Richard Kueng, John Preskill, and others. Quantum advantage in learning from experiments. Science, 376(6598):1182–1186, 2022. arXiv: https://arxiv.org/abs/2112.00778. doi:10.1126/science.abn7293.
-
Sitan Chen, Jordan Cotler, Hsin-Yuan Huang, and Jerry Li. Exponential separations between learning with and without quantum memory. In Proceedings of the 62nd IEEE Symposium on Foundations of Computer Science (FOCS), 574–585. IEEE, 2022. arXiv: https://arxiv.org/abs/2111.05881. doi:10.1109/FOCS52979.2021.00063.
-
Matthias C. Caro, Hsin-Yuan Huang, Nicholas Ezzell, Joe Gibbs, Andrew T. Sornborger, Lukasz Cincio, Patrick J. Coles, and Zoë Holmes. Out-of-distribution generalization for learning quantum dynamics. Nature Communications, 14(1):3751, 2023. arXiv: https://arxiv.org/abs/2204.10268. URL: https://doi.org/10.1038/s41467-023-39381-w, doi:10.1038/s41467-023-39381-w.
-
Marcello Benedetti, Erika Lloyd, Stefan Sack, and Mattia Fiorentini. Parameterized quantum circuits as machine learning models. Quantum Science and Technology, 4(4):043001, 2019. arXiv: https://arxiv.org/abs/1906.07682. URL: https://iopscience.iop.org/article/10.1088/2058-9565/ab4eb5, doi:10.1088/2058-9565/ab4eb5.
-
M. Cerezo, Andrew Arrasmith, Ryan Babbush, Simon C Benjamin, Suguru Endo, Keisuke Fujii, Jarrod R McClean, Kosuke Mitarai, Xiao Yuan, Lukasz Cincio, and Patrick J. Coles. Variational quantum algorithms. Nature Reviews Physics, pages 625–644, 2021. arXiv: https://arxiv.org/abs/2012.09265. doi:10.1038/s42254-021-00348-9.
-
M Cerezo, Guillaume Verdon, Hsin-Yuan Huang, Lukasz Cincio, and Patrick J Coles. Challenges and opportunities in quantum machine learning. Nature Computational Science, 2(9):567–576, 2022. arXiv: https://arxiv.org/abs/2303.09491. doi:https://doi.org/10.1038/s43588-022-00311-3.