
Quantum machine learning via energy-based models

Overview

Energy-based models are an important class of machine learning models, heavily inspired by statistical mechanics. The goal of an energy-based model is to train a physical model (i.e., tune the interaction strengths between a set of particles) such that the model closely matches the training set when it is in thermal equilibrium (made more precise below). Energy-based models are an example of generative models: once trained, they can be used to generate new examples that resemble the training set by sampling from the model's thermal distribution.

Due to their deep connection to physics, energy-based models are prime candidates for various forms of quantization. However, one challenge faced by quantum approaches is that the statistical mechanical nature of the learning problem often also lends itself to efficient, approximate classical methods. As a result, the best quantum algorithms may also be heuristic in nature, which prevents an end-to-end complexity analysis. While energy-based models are less widely used than deep neural networks today, they were an important conceptual development in machine learning [1] and continue to foster interest due to their sound theoretical basis and their connection to statistical mechanics.

There are a number of proposals for generalizing energy-based models to quantum machine learning. The starting point is a graph where the vertices are divided into visible \(\{v\}\) and hidden \(\{h\}\) nodes. When each node is assigned a value in some discrete or continuous set, this constitutes a "configuration" \((h,v)\) of the model. A training set \(\mathcal{D}\) is provided as input, containing a list of configurations of the visible vertices. The hidden nodes are not part of the training set, but including them is essential for the model to be able to capture latent variables in the data.

A graphical model is then built on the vertices—each vertex is a physical system (such as a spin-\(1/2\) particle) and edges between vertices represent physical interactions. The model is described by an energy functional \(H(h,v)\), which assigns an energy value to each possible configuration \((h,v)\) of the vertices. For example, in Boltzmann machines (BMs), the vertices are assigned binary variables, and the interactions are Ising interactions. The model can be used to generate samples (e.g., via Markov chain Monte Carlo methods) from the thermal distribution (also known as the Boltzmann distribution or the Gibbs distribution) at unit temperature, that is, the distribution where each configuration \((h,v)\) is sampled with probability proportional to \(e^{-H(h,v)}\). In unsupervised learning tasks, provided a set of training samples of configurations of the visible units \(v\), the goal is to tune the interaction weights of the model such that the model's thermal distribution best matches the distribution that generated the training set.
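
As a concrete classical illustration of this sampling step, the following minimal Python sketch draws configurations of a small, fully connected Boltzmann machine from its thermal distribution at unit temperature using single-spin-flip Metropolis updates. It uses the quadratic Ising energy functional written out formally below; the graph size, random weights, and chain length are arbitrary illustrative choices rather than values taken from the literature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model: n binary (+/-1) variables with random biases and couplings.
n = 6
b = rng.normal(scale=0.5, size=n)                    # biases b_i
J = np.triu(rng.normal(scale=0.5, size=(n, n)), 1)   # couplings w_ij on pairs i < j
J = J + J.T                                          # symmetric coupling matrix, zero diagonal

def metropolis_chain(n_steps=2_000):
    """Single-spin-flip Metropolis chain targeting p(z) proportional to exp(-H(z)),
    where H(z) = sum_i b_i z_i + sum_{i<j} w_ij z_i z_j."""
    z = rng.choice([-1, 1], size=n)
    for _ in range(n_steps):
        i = rng.integers(n)
        # Energy change from flipping spin i: dH = -2 z_i (b_i + sum_j J_ij z_j).
        dH = -2.0 * z[i] * (b[i] + J[i] @ z)
        if dH <= 0 or rng.random() < np.exp(-dH):
            z[i] = -z[i]
    return z

# Run independent chains and estimate a two-point correlator from the final states.
samples = np.array([metropolis_chain() for _ in range(200)])
print("Estimated <z_0 z_1> :", (samples[:, 0] * samples[:, 1]).mean())
```

How long each chain must run for its final state to be a faithful thermal sample is governed by the mixing time, which is precisely the quantity the quantum approaches discussed later aim to improve upon.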

Quantum algorithms can potentially be helpful for training classical graphical models. One can also generalize the model itself by allowing the physical systems on each vertex to be quantum, and interactions between systems to be noncommuting.

Actual end-to-end problem(s) solved

Classical graphical models.

Let \(G = (V,E)\) denote a graph with vertices \(V\) and edges \(E\). For classical models, each vertex \(j\) is assigned a binary variable \(z_j = \pm 1\). The variables are split into visible and hidden nodes, \(z \in \{v\} \cup \{h\}\). For classical BMs, the energy functional is often taken to be quadratic¹ with weights \(\{b_i,w_{ij}\}\):

\[\begin{equation} \label{eq:EBM_energy} H(z)= \sum_{i\in V} b_i z_i+\sum_{(i,j)\in E}w_{ij} z_iz_j. \end{equation}\]

Note that interactions can occur between any pair of nodes (hidden or visible). In the special case of a restricted Boltzmann machine (RBM), each edge must pair up a hidden node with a visible node (i.e., the graph is bipartite), which simplifies certain training approaches because the conditional distributions of the hidden nodes given the visible nodes (and vice versa) factorize.
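
Concretely, this is a standard consequence of the bipartite structure: since each configuration carries weight \(e^{-H(h,v)}\) and no two hidden nodes interact, the conditional distribution of the hidden nodes given the visible nodes factorizes as

\[\begin{equation} p(h \mid v) = \prod_{j \in \{h\}} p(h_j \mid v), \qquad p(h_j = +1 \mid v) = \frac{1}{1 + e^{\,2\left(b_j + \sum_{i : (i,j) \in E} w_{ij} v_i\right)}}, \end{equation}\]

and analogously for the visible nodes conditioned on the hidden nodes. This is what makes block Gibbs sampling, and the contrastive divergence method discussed below, cheap per step.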

The thermal distribution corresponding to the energy functional (at unit temperature) associates each configuration \(v\) of visible nodes with a probability \(p(v)\) such that

\[\begin{equation} p(v) = \sum_h p(h,v), \qquad p(h,v) = \frac{e^{-H(h,v)}}{Z}, \qquad Z = \sum_{h,v} e^{-H(h,v)} \end{equation}\]

where \(Z\), the partition function, is the normalization ensuring that probabilities sum to 1. Even though the hidden nodes are integrated out in the calculation of \(p(v)\), they still influence the distribution \(p(v)\) through their interactions with the visible nodes.

Given a training set \(\mathcal{D} = \{v_1,v_2,\ldots,v_{\lvert \mathcal{D} \rvert}\}\) of sample configurations of the visible nodes, the goal of the training phase is to modify the weights \(\theta \in \{b_i\} \cup \{w_{ij}\}\) such that samples from the thermal distribution of the model most closely match the training samples. Ideally, this is done by finding the set of weights that maximizes the likelihood of observing the samples, i.e. \(\prod_{v\in \mathcal{D}} p(v)\), or, equivalently, minimizes the (normalized) negative log-likelihood loss function, defined as

\[\begin{equation} L(b,w)=-\frac{1}{\lvert \mathcal{D} \rvert }\sum_{v\in\mathcal{D}} \log(p(v)) \,. \end{equation}\]

The loss function can be minimized using some variant of gradient descent, which requires the evaluation of the derivatives \(\partial_{\theta}L\) for \(\theta \in \{b_i\} \cup \{w_{ij}\}\). For the energy functional above, these derivatives can be readily calculated from ensemble averages (see, e.g., [2]). For example,

\[\begin{equation} \label{eq:EBM_gradient} \frac{\partial L}{\partial w_{ij}} = \langle z_iz_j \rangle_{v \in \mathcal{D}} - \langle z_iz_j \rangle \end{equation}\]

where \(\langle \cdot \rangle\) denotes an average over samples from the thermal distribution \(p(h,v)\), while \(\langle \cdot \rangle_{v \in \mathcal{D}}\) denotes an average where \(v\) is drawn at random from the training set \(\mathcal{D}\), and \(h\) is sampled from the thermal distribution conditioned on that choice of \(v\). Without any further restrictions, the gradients will typically be difficult to evaluate exactly or to estimate accurately: an exact computation requires summing over exponentially many configurations of the vertices.
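
To make the cost of exact evaluation concrete, the following sketch computes both terms of Eq. \eqref{eq:EBM_gradient} for a toy model by brute-force enumeration of all configurations. The model parameters, the stand-in training set, and the chosen edge are illustrative placeholders; the enumeration over \(2^{|V|}\) configurations is only feasible for a handful of nodes.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

n_vis, n_hid = 4, 2                          # illustrative sizes; cost grows as 2^(n_vis + n_hid)
n = n_vis + n_hid
b = rng.normal(scale=0.3, size=n)
w = np.triu(rng.normal(scale=0.3, size=(n, n)), 1)

def energy(z):
    """H(z) = sum_i b_i z_i + sum_{i<j} w_ij z_i z_j."""
    return b @ z + z @ w @ z

# Enumerate every configuration; the visible nodes are the first n_vis entries of z.
configs = np.array(list(itertools.product([-1, 1], repeat=n)))
boltz = np.exp(-np.array([energy(z) for z in configs]))
p = boltz / boltz.sum()                       # exact thermal distribution p(h, v)

data = rng.choice([-1, 1], size=(10, n_vis))  # illustrative stand-in for a training set

def clamped_expectation(v, i, j):
    """<z_i z_j> with the visible nodes clamped to v and the hidden nodes thermal."""
    mask = np.all(configs[:, :n_vis] == v, axis=1)
    cond = boltz[mask] / boltz[mask].sum()
    return np.sum(cond * configs[mask, i] * configs[mask, j])

i, j = 0, n_vis                               # weight coupling visible node 0 to the first hidden node
data_term = np.mean([clamped_expectation(v, i, j) for v in data])
model_term = np.sum(p * configs[:, i] * configs[:, j])
print("dL/dw_ij =", data_term - model_term)
```

Already at \(|V| = 30\) the enumeration runs over more than \(10^9\) configurations, which is why sampling-based estimates are used instead.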

In some cases, good estimates of the gradients can be obtained by repeatedly drawing samples from the thermal distribution and computing averages. Samples can be generated with Markov chain Monte Carlo (MCMC) methods such as Metropolis sampling or simulated annealing; however, the time required to sample from a distribution close to the thermal distribution depends on the mixing time of the Markov chain, which is generally unknown and can also be exponential in the graph size. Additionally, many samples need to be generated to produce a robust average, with precision \(\epsilon\) requiring \(\mathcal{O}\left( 1/\epsilon^2 \right)\) samples. Approximate classical methods, such as contrastive divergence [3], avoid this issue by initializing the Markov chain at one of the training samples and deliberately taking a small number of steps—this does not exactly correspond to optimizing the log-likelihood but in some cases has empirical success [4]. Once the model has been trained, new samples can also be generated via the same MCMC methods. The end-to-end tasks are (i) training the model, and then, (ii) generating samples from the trained model to accomplish some larger machine learning goal.
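
For comparison, here is a minimal sketch of CD-\(k\) updates for an RBM with \(\pm 1\) units, in the spirit of [3]. The factorized sigmoid conditionals follow from the bipartite structure noted earlier, the sign convention matches Eq. \eqref{eq:EBM_gradient}, and all sizes, initializations, and hyperparameters are illustrative rather than recommended values.

```python
import numpy as np

rng = np.random.default_rng(2)

n_vis, n_hid = 8, 4                              # illustrative sizes
W = 0.01 * rng.normal(size=(n_vis, n_hid))       # visible-hidden couplings w_ij
b_v = np.zeros(n_vis)                            # visible biases (held fixed here for brevity)
b_h = np.zeros(n_hid)                            # hidden biases (held fixed here for brevity)

def sample_given(x, biases, W_eff):
    """Sample +/-1 units from the factorized conditional p(y_j = +1 | x) = 1 / (1 + exp(2 a_j))."""
    a = biases + x @ W_eff
    return np.where(rng.random(a.shape) < 1.0 / (1.0 + np.exp(2.0 * a)), 1, -1)

def cd_k_step(W, v_data, k=1, lr=0.05):
    """One CD-k update of the couplings W for a single training configuration v_data."""
    h_data = sample_given(v_data, b_h, W)        # hidden units sampled with visibles clamped to data
    v, h = v_data, h_data
    for _ in range(k):                           # short Gibbs chain started at the training sample
        v = sample_given(h, b_v, W.T)
        h = sample_given(v, b_h, W)
    # Approximate gradient <z_i z_j>_data - <z_i z_j>_model, with the model term
    # replaced by the state of the short chain rather than a fully mixed sample.
    grad = np.outer(v_data, h_data) - np.outer(v, h)
    return W - lr * grad                         # gradient-descent step on the couplings

batch = rng.choice([-1, 1], size=(32, n_vis))    # illustrative stand-in for training data
for v_sample in batch:
    W = cd_k_step(W, v_sample, k=1)
```

Starting the chain at a training sample and taking only \(k\) steps is exactly the shortcut that makes contrastive divergence cheap but biased relative to the true log-likelihood gradient.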

Quantum graphical models.

A separate end-to-end problem is found by generalizing the model itself to be quantum. For example, one can start with a classical BM and promote the binary variables to qubits. The energy functional is promoted to a quantum Hamiltonian and augmented with a transverse field, which does not commute with the Ising interactions. The result is a quantum Boltzmann machine (QBM), described by a transverse-field Ising (TFI) Hamiltonian [5]:

\[\begin{equation} \label{eq:QBM_energy} H_{\rm QBM}= -\sum_{i\in V}(\kappa_i X_i +b_i Z_i) -\sum_{(i,j)\in E} w_{ij} Z_i Z_j, \end{equation}\]

where \(X_i\) and \(Z_i\) are the Pauli-\(X\) and Pauli-\(Z\) operators on qubit \(i\), and \(b_i, \kappa_i, w_{ij}\) are real variational parameters of the model. The ground or Gibbs state of \(H_{\rm QBM}\) can be prepared in a variety of ways, including the adiabatic algorithm, Hamiltonian simulation, Gibbs sampling routines, or variational quantum algorithms. These states can be measured (in the \(Z\) basis or in the \(X\) basis), yielding samples of the variables \(v,h\) drawn from different distributions than the thermal distribution for the classical BM. As in the classical case, the training phase for a QBM consists of varying the weights via gradient descent to maximize a likelihood function. However, the noncommutativity of the Hamiltonian leads to complications: the gradients of the loss function are no longer directly given by sample expectation values, although workarounds have been proposed [5, 6, 7, 8, 9]. The end-to-end problem is to train these models and generate samples.
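
For very small instances, the quantum Gibbs state of Eq. \eqref{eq:QBM_energy} can be written down by exact diagonalization, which is useful as a classical cross-check even though it scales exponentially and is not how a quantum device would prepare the state. A minimal sketch, with all parameter values chosen as placeholders:

```python
import numpy as np
from functools import reduce
from scipy.linalg import expm

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])

def op_on(op, site, n):
    """Embed a single-qubit operator acting on `site` into an n-qubit register."""
    return reduce(np.kron, [op if k == site else I2 for k in range(n)])

# Illustrative 3-qubit QBM on a path graph.
n = 3
kappa = [0.5, 0.5, 0.5]                  # transverse fields kappa_i
b = [0.2, -0.1, 0.3]                     # longitudinal fields b_i
edges = {(0, 1): 0.7, (1, 2): -0.4}      # Ising couplings w_ij

H = sum(-(kappa[i] * op_on(X, i, n) + b[i] * op_on(Z, i, n)) for i in range(n))
H = H + sum(-wij * op_on(Z, i, n) @ op_on(Z, j, n) for (i, j), wij in edges.items())

rho = expm(-H)
rho = rho / np.trace(rho)                # Gibbs state rho proportional to exp(-H_QBM) at unit temperature

# A Z-basis measurement of all qubits returns bit strings with the diagonal probabilities of rho.
print("Z-basis outcome probabilities:", np.round(np.diag(rho).real, 3))
```

Beyond a dozen or so qubits this direct construction becomes infeasible, which is where the quantum Gibbs sampling methods referenced above come in.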

Dominant resource cost/complexity

Complexity of classical graphical models.

Recall that for classical BMs, one wishes to produce samples from the thermal distribution corresponding to the energy functional in Eq. \(\eqref{eq:EBM_energy}\), i.e. Gibbs sampling (of diagonal Hamiltonians), either to assist in training the model or, if it has already been trained, to make inferences or generate new data. Specifically, given \(H(h,v)\), one wishes to draw samples of \((h,v)\) with probability proportional to \(e^{-H(h,v)}\), either with \(v\) free or with \(v\) fixed (sometimes referred to as "clamped") to a particular value from the training set \(\mathcal{D}\). Classically, one approach is simulated annealing or other MCMC algorithms. Quantumly, one can take one of several analogous approaches, including "quantum simulated annealing" [10] and quantum annealing, discussed as follows.

For quantum simulated annealing, one prepares the coherent Gibbs state \(\sum_{v,h} \sqrt{p(h,v)}\ket{v,h}\), and a quadratic speedup is obtained over classical simulated annealing. The method is to construct a Hamiltonian whose ground state is the coherent Gibbs state at temperature \(T\) (for which probabilities are proportional to \(e^{-H(h,v)/T}\)), and to follow an adiabatic path from \(T=\infty\) to \(T=1\). Following the path is accomplished by repeatedly performing quantum phase estimation (QPE) to project onto the ground state of the Hamiltonian at a given temperature. As is typical for the adiabatic algorithm, the cost of this procedure is dominated by the inverse of the spectral gap—this is the precision required for QPE to succeed. Specifically, for a graphical model with \(|V|\) vertices, the runtime will be \(\mathrm{poly}(|V|)/\Delta\), where \(\Delta\) is the minimum spectral gap. Importantly, \(\Delta\) can be related to the maximum mixing time \(t_{\rm mix}\) of the simulated annealing Markov chain, as \(1/\Delta = \mathcal{O}\left( \sqrt{t_{\rm mix}} \right)\), which leads to the quadratic speedup, although it is possible that \(\Delta\) is exponentially small in \(|V|\).

An alternative method for preparing (and sampling from) the coherent Gibbs state was proposed in [2]. There, one begins in an easy-to-prepare coherent mean-field state approximating the coherent Gibbs state. Then, one performs rejection sampling with amplitude amplification to gain a quadratic speedup over the analogous classical method. Additionally, it was proposed to use amplitude estimation to gain a quadratic improvement in the number of samples needed to achieve precision \(\epsilon\), from \(\mathcal{O}\left( 1/\epsilon^2 \right)\) to \(\mathcal{O}\left( 1/\epsilon \right)\), mirroring later analyses that work for general Monte Carlo methods [11]. If these \(\mathcal{O}\left( 1/\epsilon \right)\) quantum samples are each for the same training sample \(v \in \mathcal{D}\), this is straightforward; however, if the samples are drawn randomly from \(v \in \mathcal{D}\), achieving the quadratic speedup from amplitude estimation requires accessing the data in \(\mathcal{D}\) coherently and quickly. Such data access is provided by the quantum random access memory (QRAM) primitive, for which the circuit depth can be logarithmic in the size of the training data, at the expense of a number of ancilla qubits (and total gates) that is linear in the size of the training data.
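
The classical baseline being sped up here is ordinary rejection sampling from a mean-field product distribution, sketched below for a tiny model. The mean-field proposal and the bound \(M\) are computed in a deliberately naive way (by ignoring the couplings and by brute-force enumeration, respectively) purely for illustration; reference [2] instead derives an analytic mean-field bound. All parameters are placeholders.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

n = 5
b = rng.normal(scale=0.3, size=n)
w = np.triu(rng.normal(scale=0.3, size=(n, n)), 1)

def energy(z):
    return b @ z + z @ w @ z

# Crude mean-field product proposal q(z): each spin biased only by its local field b_i.
p_plus = 1.0 / (1.0 + np.exp(2.0 * b))           # q(z_i = +1)

def q_prob(z):
    return np.prod(np.where(z == 1, p_plus, 1.0 - p_plus))

# Bound M >= exp(-H(z)) / q(z), found here by brute force (illustration only).
configs = np.array(list(itertools.product([-1, 1], repeat=n)))
M = max(np.exp(-energy(z)) / q_prob(z) for z in configs)

def rejection_sample():
    """Propose z ~ q and accept with probability exp(-H(z)) / (M q(z)); repeat until accepted."""
    while True:
        z = np.where(rng.random(n) < p_plus, 1, -1)
        if rng.random() < np.exp(-energy(z)) / (M * q_prob(z)):
            return z

samples = np.array([rejection_sample() for _ in range(200)])
print("Estimated <z_0 z_1> :", (samples[:, 0] * samples[:, 1]).mean())
```

The quantum version replaces the repeat-until-accept loop with amplitude amplification, so the expected number of proposal preparations scales as the square root of the classical value.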

For quantum annealing, the idea is to add a uniform transverse field, as in the QBM of Eq. \(\eqref{eq:QBM_energy}\) with \(\kappa_i=\kappa_j\) for all \(i,j\). The transverse field is initially strong, and slowly turned off. This is similar to the adiabatic algorithm, but differs in that it is specifically carried out at finite ambient temperature. Thus, the system-bath interaction of the device naturally drives the state to the Gibbs state, which coincides with the classical thermal distribution once the transverse field is turned off. This is a heuristic method; it is efficient but there are few success guarantees. The hope is that the inclusion of an initial transverse field induces nonclassical fluctuations that help the system avoid becoming trapped in local minima as the transverse field is turned off.

Overall, computing the gradient of the loss function with respect to one parameter, up to precision \(\epsilon\), will require complexity \(\mathcal{O}\left( S/\epsilon \right)\), where \(S\) is the complexity of sampling from the Gibbs state. The above assumes log-depth QRAM to be able to estimate the \(\langle z_i z_j \rangle_{v \in \mathcal{D}}\) term of Eq. \(\eqref{eq:EBM_gradient}\). The complexity of \(S\) will be \(\mathrm{poly}(|V|)\sqrt{t_{\rm mix}}\) if a quantum simulated annealing approach is used, or some hard-to-analyze quantity if the quantum annealing approach is used. If the number of training samples is small, one can also sequentially compute the sum over \(v \in \mathcal{D}\) and avoid the assumption of log-depth QRAM, leading to complexity \(\mathcal{O}\left( S\lvert \mathcal{D} \rvert/\epsilon' \right)\) (where \(\epsilon' \geq \epsilon\) may be order-1). This must be carried out for all \(|E|+|V|\) weights in the model, although these could be simultaneously estimated to precision \(\epsilon\) at cost \(\tilde{\mathcal{O}}(\sqrt{|E|+|V|}/\epsilon)\) samples, using methods from [12], which leverage the quantum gradient estimation primitive. It is not clear what value of \(\epsilon\) is required in practice. Reference [2] takes \(\epsilon \sim 1/\sqrt{\lvert \mathcal{D} \rvert}\), to match the natural uncertainty coming from a finite number of training samples. In this case, the overall complexity is dominated by

\[\begin{equation} \widetilde{\mathcal{O}}\left( S \cdot \sqrt{|V|+|E|} \cdot \sqrt{|\mathcal{D}|} \right) \end{equation}\]

assuming log-depth QRAM, and

\[\begin{equation} \widetilde{\mathcal{O}}\left( S \cdot \sqrt{|V|+|E|} \cdot |\mathcal{D}| \right) \end{equation}\]

without log-depth QRAM (the precision for each training sample can be taken \(\epsilon' = \mathcal{O}\left( 1 \right)\)).
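
To spell out where the first of these scalings comes from (under the same assumptions stated above): estimating all \(|V|+|E|\) gradient components to precision \(\epsilon\) with the multi-observable method of [12] uses \(\tilde{\mathcal{O}}(\sqrt{|V|+|E|}/\epsilon)\) Gibbs samples, each costing \(S\), and choosing \(\epsilon \sim 1/\sqrt{\lvert \mathcal{D} \rvert}\) gives, per full gradient evaluation,

\[\begin{equation} S \cdot \frac{\sqrt{|V|+|E|}}{\epsilon} \sim S \cdot \sqrt{|V|+|E|} \cdot \sqrt{|\mathcal{D}|}. \end{equation}\]

Without log-depth QRAM, the data term is instead accumulated one training sample at a time at constant precision \(\epsilon' = \mathcal{O}\left( 1 \right)\), which replaces the factor \(\sqrt{|\mathcal{D}|}\) by \(|\mathcal{D}|\), giving the second expression.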

Complexity of quantum graphical models.

For QBMs, the dominant cost is producing samples from the quantum Gibbs state of Eq. \(\eqref{eq:QBM_energy}\), i.e. the state \(\rho \propto e^{-H_{\rm QBM}}\), which can be accomplished through methods for Gibbs sampling. Rigorous methods for Gibbs sampling may scale exponentially in the size of the graph, without further assumptions. Such scaling would likely not be tolerable in practice. However, Monte Carlo–style methods for Gibbs sampling, which follow a similar approach as MCMC, but in an inherently quantum way, may be more effective in this case. These could have \(\mathrm{poly}(|V|)\) scaling for some parameter settings, but must also have exponential scaling in the worst case, as sampling low-energy Ising-model configurations is known to be NP-hard.

One can also heuristically apply quantum annealing, beginning from a large transverse field and reducing its strength slowly to some final nonzero value. However, some hardware platforms may only admit global control over the transverse field, preventing one from tuning the transverse field strengths \(\kappa_i\) individually. In any of these approaches, it is difficult to make any rigorous statements about the Gibbs sampling complexity.

Existing error corrected resource estimates

There are no error-corrected estimates for annealing. However, [13, 14] discuss in detail how to embed the fully connected architecture of an RBM into the 2D lattice architecture available on planar quantum annealers.

Reference [14] reports an embedding ratio scaling which is roughly quadratic—that is, a graphical model with \(|V|\) vertices requires \(\mathcal{O}\left( |V|^2 \right)\) qubits to accommodate the architectural limitations of the device. A proper resource estimation has not been performed for the fault-tolerant algorithm of [2].

Caveats

There are two main caveats to quantum approaches to training classical models, which apply both to the annealing setting and to the fault-tolerant setting. (i) Classical heuristic algorithms, such as greedy methods or contrastive divergence, often perform well in practice and are the method of choice for existing classical analyses. These methods are also often highly parallelizable. A quantum speedup over a slower, exact classical method may not be relevant if the faster approximate classical methods are already sufficient. (ii) The situations where one might hope for the heuristic quantum annealing approach to perform better, for instance highly regular lattice-based problems, might not be relevant machine learning problems.

A caveat of the QBM is that the gradients of the loss function are not exactly related to sample averages, and imperfect workarounds, such as those proposed in [5], must be pursued. Like many other situations in machine learning, the resulting end-to-end solution is heuristic and evidence of its efficacy requires empirical demonstration.

Comparable classical complexity and challenging instance sizes

For classical models, an exact computation of the gradients would scale exponentially in the size of the graph, as \(\mathcal{O}\left( |\mathcal{D}|2^{|V|} \right)\) for the gradient of a single parameter. Approximate methods based on simulated annealing or other MCMC methods would scale as \(\mathcal{O}\left( S_c/\epsilon^2 \right)\), where \(S_c\) is the classical sample time, scaling as \(S_c = \mathrm{poly}(|V|)t_{\rm mix}\). On the other hand, these methods can also be implemented heuristically at reduced cost (e.g., contrastive divergence, where one does not wait for the chain to mix), and they can also be implemented on parallel architectures. For instance, in [15], an architecture was proposed to train deep BMs efficiently. Experiments demonstrated that heuristic training methods could be carried out for graphs of size 1 million in 100 seconds on field-programmable gate arrays available in 2010. Much larger sizes would be accessible to a scaled-up version of the same architecture on modern hardware. It is unlikely that any exact method, quantum or classical, could match this efficiency.

For the quantum models, the classical complexity of sampling from the Gibbs state of the model would be exponential in the graph size \(|V|\). Thus, training these models would likely not be pursued classically.

Speedup

For the classical models, the speedup can be quadratic in most of the parameters: producing a sample can in some cases be sped up quadratically, and the number of samples required to achieve a certain precision also enjoys a quadratic speedup (e.g., \(t_{\rm mix}\) to \(\sqrt{t_{\rm mix}}\) and \(\mathcal{O}\left( 1/\epsilon^2 \right)\) to \(\mathcal{O}\left( 1/\epsilon \right)\)). The methods that give these provable quadratic speedups are based on primitives such as amplitude amplification, where superquadratic speedups are not possible without exploiting additional structure. Larger superpolynomial speedups are only possible under optimistic assumptions about the success of heuristic quantum annealing approaches at producing samples faster than classical approaches.

For the quantum models, the speedup is technically exponential, assuming that for the models considered, quantum algorithms for Gibbs sampling scale efficiently while approximate classical methods (e.g., tensor networks) scale exponentially. Nevertheless, it has yet to be demonstrated that there are specific tasks where these models are superior to classical machine learning models that can be trained and operated more efficiently classically.

Outlook

While energy-based models naturally come in a form that can readily be extended to the quantum domain, decisive evidence of quantum advantage for a specific end-to-end classical machine learning problem is still lacking. The outlook for these approaches remains uncertain due to the centrality of heuristic quantum methods. One may hold out hope that these heuristics could outperform classical heuristics in some specific settings, but the success of classical heuristics and the effectiveness of approximate classical approaches present a formidable barrier to achieving any quantum advantage in this area.

Further reading

We refer the reader to [4] for more information on quantum approaches to energy-based models.

Bibliography

  1. Ruslan Salakhutdinov and Hugo Larochelle. Efficient learning of deep Boltzmann machines. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 693–700. JMLR Workshop and Conference Proceedings, 2010. URL: https://proceedings.mlr.press/v9/salakhutdinov10a.html.

  2. Nathan Wiebe, Ashish Kapoor, and Krysta M. Svore. Quantum deep learning. Quantum Information and Computation, 16(7–8):541–587, 5 2016. arXiv: https://arxiv.org/abs/1412.3489. doi:10.26421/QIC16.7-8-1.

  3. Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002. doi:10.1162/089976602760128018.

  4. Maria Schuld and Francesco Petruccione. Machine learning with quantum computers. Springer, 2021. doi:10.1007/978-3-030-83098-4.

  5. Mohammad H. Amin, Evgeny Andriyash, Jason Rolfe, Bohdan Kulchytskyy, and Roger Melko. Quantum Boltzmann machine. Physical Review X, 8(2):021050, 2018. arXiv: https://arxiv.org/abs/1601.02036. doi:10.1103/PhysRevX.8.021050.

  6. Mária Kieferová and Nathan Wiebe. Tomography and generative training with quantum Boltzmann machines. Physical Review A, 96(6):062327, 2017. arXiv: https://arxiv.org/abs/1612.05204. doi:10.1103/PhysRevA.96.062327.

  7. Nathan Wiebe and Leonard Wossnig. Generative training of quantum Boltzmann machines with hidden units. arXiv: https://arxiv.org/abs/1905.09902, 2019.

  8. Eric R Anschuetz and Yudong Cao. Realizing quantum Boltzmann machines through eigenstate thermalization. arXiv: https://arxiv.org/abs/1903.01359, 2019.

  9. Christa Zoufal, Aurélien Lucchi, and Stefan Woerner. Variational quantum Boltzmann machines. Quantum Machine Intelligence, 3(1):7, 2021. arXiv: https://arxiv.org/abs/2006.06004. URL: https://doi.org/10.1007/s42484-020-00033-7, doi:10.1007/s42484-020-00033-7.

  10. Rolando Somma, Sergio Boixo, and Howard Barnum. Quantum simulated annealing. arXiv: https://arxiv.org/abs/0712.1008, 2007.

  11. Ashley Montanaro. Quantum speedup of Monte Carlo methods. Proceedings of the Royal Society A, 2015. arXiv: https://arxiv.org/abs/1504.06987. doi:10.1098/rspa.2015.0301.

  12. William J. Huggins, Kianna Wan, Jarrod McClean, Thomas E. O'Brien, Nathan Wiebe, and Ryan Babbush. Nearly optimal quantum algorithm for estimating multiple expectation values. Physical Review Letters, 129:240501, 12 2022. arXiv: https://arxiv.org/abs/2111.09283. URL: https://link.aps.org/doi/10.1103/PhysRevLett.129.240501, doi:10.1103/PhysRevLett.129.240501.

  13. Steven H Adachi and Maxwell P Henderson. Application of quantum annealing to training of deep neural networks. arXiv: https://arxiv.org/abs/1510.06356, 2015.

  14. Marcello Benedetti, John Realpe-Gómez, Rupak Biswas, and Alejandro Perdomo-Ortiz. Quantum-assisted learning of hardware-embedded probabilistic graphical models. Physical Review X, 7(4):041052, 2017. arXiv: https://arxiv.org/abs/1609.02542. doi:10.1103/PhysRevX.7.041052.

  15. Sang Kyun Kim, Peter Leonard McMahon, and Kunle Olukotun. A large-scale architecture for restricted Boltzmann machines. In 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 201–208. 2010. doi:10.1109/FCCM.2010.38.

  16. David Sherrington and Scott Kirkpatrick. Solvable model of a spin-glass. Physical Review Letters, 35(26):1792, 1975. doi:10.1103/PhysRevLett.35.1792.


  1. This quadratic energy functional is related to the Sherrington–Kirkpatrick (SK) model [16] with an external field, which is a model for spin glasses in the statistical mechanics literature. For the SK model, the couplings \(w_{ij}\) are chosen randomly for each pair of nodes, and it is typically computationally hard to find configurations with optimal energy (see the section on beyond quadratic speedups in combinatorial optimization for some additional information).