
Sample complexity of learning parametric quantum circuits


Published 1 March 2022 © 2022 The Author(s). Published by IOP Publishing Ltd
Citation: Haoyuan Cai et al 2022 Quantum Sci. Technol. 7 025014. DOI: 10.1088/2058-9565/ac4f30


Abstract

Quantum computers hold unprecedented potential for machine learning applications. Here, we prove that physical quantum circuits are probably approximately correct (PAC) learnable on a quantum computer via empirical risk minimization: to learn a parametric quantum circuit with at most n^c gates, each gate acting on a constant number of qubits, the sample complexity is bounded by $\tilde{O}({n}^{c+1})$. In particular, we explicitly construct a family of variational quantum circuits with O(n^{c+1}) elementary gates arranged in a fixed pattern, which can represent all physical quantum circuits consisting of at most n^c elementary gates. Our results provide a valuable guide for quantum machine learning in both theory and practice.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Over the past few decades, machine learning, especially deep learning, has made dramatic progress [1, 2] in a wide range of tasks, such as playing the game of Go [3, 4], protein structure prediction [5], and computer vision [6]. More recently, the interplay between machine learning and quantum physics has attracted tremendous interest [7–11], giving birth to an emergent research frontier of quantum machine learning. A number of notable quantum algorithms, such as the Harrow–Hassidim–Lloyd (HHL) algorithm [12], quantum generative models [13], and the quantum support vector machine [14], have been designed to enhance, speed up, or innovate machine learning with quantum devices. These algorithms bear the intriguing potential of exhibiting exponential advantages compared to their classical counterparts, although subtle caveats do exist and require careful examination in practice [15].

In 1984, Valiant introduced the probably approximately correct (PAC) learning model [16], which gives a complexity-theoretical foundation and a mathematically rigorous framework for studying machine learning. Since then, the PAC learning model has been extensively studied in various machine learning scenarios to understand why and when efficient learning is possible or not [17, 18]. With the rapid progress in quantum computing [19–21], practical applications of quantum machine learning have become more and more realistic [22–26]. A natural problem is then to generalize the PAC learning model to quantum learning scenarios. Indeed, notable progress has been made along this direction [27–37]. For example, in reference [28] Chung and Lin have studied the sample complexity of learning quantum channels and demonstrated that we can PAC-learn a polynomial-size quantum circuit with a polynomial number of samples. In addition, in reference [33] Bu et al investigated the Rademacher complexity of quantum circuits in the framework of quantum resource theories [38]. They introduced a resource measure of magic for quantum channels based on the (p, q) group norm and found useful bounds for how the statistical complexity scales with resources in the quantum circuits. Yet, this fledgling research direction is still in its rapidly growing early phase and many important issues remain barely explored.

In this paper, we study the problem of the sample complexity for learning parametric quantum circuits. We focus on the supervised learning scenario and prove that all unitary physical quantum circuits are PAC learnable on a quantum computer via empirical risk minimization. More concretely, we prove the following two theorems: (1) any physical n-qubit quantum circuit consisting of at most n^c unitary gates, with each gate acting on a constant number of qubits, can be represented exactly by a family of variational quantum circuits with O(n^{c+1}) elementary gates arranged in a fixed uniform pattern; (2) this family of variational quantum circuits is PAC learnable. Since most quantum circuits that can be efficiently implemented on a quantum computer, such as the circuits for Shor's algorithm [39] or the HHL algorithm [12], contain at most a polynomial number of gates, our results imply that they are all PAC learnable with a quantum computer.

2. Results

2.1. Notations and the general setting

We define the concept class $\mathcal{C}$ as the collection of all the n-qubit parametric quantum circuits with at most n^c unitary gates, each gate acting on at most b qubits (b, c are constant numbers independent of n). We note that $\mathcal{C}$ is general enough to include most quantum circuits in practical applications. Here, we study the learnability of the quantum circuits in $\mathcal{C}$ under the PAC learning framework [18]. Let $C\in \mathcal{C}$ be any n-qubit circuit in this concept class. When we input an n-qubit pure state |ψ_in⟩ to C, we will get an output n-qubit pure state |ψ_out⟩ = C|ψ_in⟩. Therefore, C can be viewed as a function ${f}_{C}:\mathcal{X}\to \mathcal{Y}$, where its domain $\mathcal{X}$ and range $\mathcal{Y}$ are both the set of all n-qubit pure states. In this work, we write $x\in \mathcal{X}$ as an abbreviation of the n-qubit quantum state |ψ(x)⟩, and similarly for $y\in \mathcal{Y}$. With these notations, we sometimes write y = fC(x) to denote |ψ(y)⟩ = C|ψ(x)⟩ for simplicity.

We consider the supervised learning scenario [18] and denote the training set of size m as S = {(x1, y1), (x2, y2), ..., (xm, ym)}. Under the PAC learning framework, to learn the unknown circuit C, we assume we have m independent n-qubit input samples $\left\{{x}_{1},{x}_{2},\dots ,{x}_{m}\right\}\in {\mathcal{X}}^{m}$; inputting them into C yields the output states $\left\{{y}_{1},{y}_{2},\dots ,{y}_{m}\right\}\in {\mathcal{Y}}^{m}$. The essential task of supervised learning is then to learn from S a hypothesis function (here a quantum circuit F) that approximates the target function fC(x). This can be accomplished by minimizing a suitable loss function over a set of variational model parameters. More concretely, we construct a variational quantum circuit F consisting of multiple gates, some of which have tunable parameters. By tuning these parameters, we can use F to represent different functions ${f}_{h}:\mathcal{X}\to \mathcal{Y}$, and we define our hypothesis space $\mathcal{F}$ as the collection of all the functions fh that F can represent. Given m independent samples and a tunable quantum circuit F, we can use the following process to make F a good approximation of C. We tune the parameters of F according to the training set S, so that when we feed the state xi (i = 1, 2, ..., m) into the input of F, the output of F is a good approximation of yi. By PAC learning theory, assuming that $\mathcal{F}$ has good generalization power, fh's decent performance on the training set implies its good performance over the whole sample space.

The effectiveness of the above process is based on two assumptions. First, the space of $\mathcal{F}$ should be large enough, so that given any set of samples $S={({x}_{i},{y}_{i})}_{i=1,2,\dots ,m}$, we can always find a function ${f}_{h}\in \mathcal{F}$ such that fh(xi) approximates yi with a small error for all i = 1, 2, ..., m. Second, the space of $\mathcal{F}$ should not be too large or complex, so that $\mathcal{F}$ has favorable generalization power to generalize its performance from the training set S to the true probability distribution that S is sampled from. This is a reflection of Occam's razor [18]. Therefore, we need to design a variational quantum circuit class $\mathcal{F}$ that meets the following two requirements simultaneously in order to learn fC:

  • R1: For any $C\in \mathcal{C}$, there exists a hypothesis function ${f}_{h}\in \mathcal{F}$, such that fh (x) = fC (x) for all $x\in \mathcal{X}$.
  • R2: The hypothesis space $\mathcal{F}$ satisfies the PAC learnability.

We note that the first requirement R1 is stronger than the first assumption, because the function fh in R1 is identical to fC. Hence, the training error of fh on the sample set S is necessarily zero, whereas the first assumption only requires that fh has a small training error on S. In supervised learning, obtaining high-quality training samples is usually resource-demanding in practice. Thus, studying the sample complexity becomes crucial. In the following, we will study the sample complexity of learning parametric quantum circuits and rigorously prove that any physical quantum circuit is PAC learnable.

2.2. A family of universal variational circuits

To meet R1, we should ensure that F has representation power for all the n-qubit parametric quantum circuits C in $\mathcal{C}$. We observe that any quantum circuit C can be decomposed as a sequence of O(n^c) H gates, R_x(θ) gates, and CNOT gates (see proposition 1 in appendix A). Thus, in our construction of F, we also use these three kinds of gates and arrange them into a uniform pattern, so that its scaling to quantum circuits with more qubits is clear. The construction of F is illustrated in figure 1, which is based on block assembling, i.e., assembling some relatively small gadgets to form a more complicated block. By convention, we denote the three Pauli matrices by X, Y, and Z. A well-known result about quantum circuits states that any single-qubit unitary gate can be expressed as e^{iα} R_z(β)R_x(γ)R_z(δ), where α, β, γ, and δ are four real numbers, and R_z(θ) = e^{−iθZ/2} and R_x(θ) = e^{−iθX/2} are the rotation operators along the z-axis and x-axis on the Bloch sphere, respectively [40]. Inspired by this, we define a set of basic gadgets (we call them level-1 blocks) to be L = {HR_x(β)HR_x(γ)HR_x(δ)H | β, γ, δ ∈ [0, 2π)}. Since HXH = Z implies HR_x(θ)H = R_z(θ), a level-1 block implements exactly R_z(β)R_x(γ)R_z(δ). In this way, we can tune the parameters of a level-1 block so that it can represent all the single-qubit unitary gates up to an irrelevant global phase factor e^{iα}. In addition, when we set β = γ = δ = 0, the level-1 block reduces to the identity gate.
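The identity behind the level-1 block can be checked numerically. The following minimal sketch (assuming the standard conventions R_x(t) = e^{−itX/2} and R_z(t) = e^{−itZ/2}; all function names are illustrative) verifies that HR_x(β)HR_x(γ)HR_x(δ)H equals R_z(β)R_x(γ)R_z(δ), and that the all-zero setting gives the identity:

```python
import numpy as np

# Gate definitions (standard conventions).
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])

def Rx(t):
    return np.cos(t / 2) * np.eye(2) - 1j * np.sin(t / 2) * X

def Rz(t):
    return np.cos(t / 2) * np.eye(2) - 1j * np.sin(t / 2) * Z

def level1(beta, gamma, delta):
    # Level-1 block: H Rx(beta) H Rx(gamma) H Rx(delta) H (leftmost factor acts last).
    return H @ Rx(beta) @ H @ Rx(gamma) @ H @ Rx(delta) @ H

rng = np.random.default_rng(0)
beta, gamma, delta = rng.uniform(0, 2 * np.pi, size=3)

# H X H = Z implies H Rx(t) H = Rz(t), so the block equals Rz(beta) Rx(gamma) Rz(delta).
assert np.allclose(level1(beta, gamma, delta), Rz(beta) @ Rx(gamma) @ Rz(delta))

# All-zero parameters reduce the block to the identity (H^4 = I).
assert np.allclose(level1(0, 0, 0), np.eye(2))
```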


Figure 1. Pictorial illustration for the construction of the hypothesis quantum circuits. (a) The elementary (level-1) block L. This block contains four Hadamard gates and three single-qubit rotations along the x direction with rotation angles parameterized by δ, γ, and β, respectively. (b) The level-2 block B_i constructed from four level-1 blocks and two CNOT gates with the first qubit being the control qubit and the ith qubit being the target one. (c) The constructed hypothesis quantum circuit, which consists of Mn^c repeated layers of B_n B_{n−1}...B_2.


Using level-1 blocks we can construct level-2 blocks B_i (i = 2, 3, ..., n). First we put two level-1 blocks (denoted as L) at qubit 1 and qubit i, and then insert two CNOT_{1i} gates, as shown in figure 1(b). Here, we use CNOT_{1i} to denote the controlled-NOT gate between the first and ith qubits, with the first qubit being the control qubit and the ith one being the target one. With this, the desired hypothesis quantum circuit F can be constructed as:

$F={\left({B}_{n}{B}_{n-1}\dots {B}_{2}\right)}^{M{n}^{c}},$  (1)

where M is a constant independent of the number of qubits n. We mention that the hypothesis circuit F has a uniform structure for arbitrary system sizes. In addition, only the x-rotations contain variational parameters, and it is straightforward to obtain that the total number of parameters used for defining F scales as O(n^{c+1}).
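As a concrete illustration, the sketch below assembles F as an explicit matrix for a small number of qubits. The internal ordering of the level-1 pairs and CNOTs inside B_i is read off figure 1(b) as B_i = CNOT_{1i}(L ⊗ L)CNOT_{1i}(L ⊗ L); this ordering is an assumption of the sketch, and all helper names are illustrative:

```python
import numpy as np

I2 = np.eye(2)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])

def Rx(t):
    return np.cos(t / 2) * I2 - 1j * np.sin(t / 2) * X

def level1(beta, gamma, delta):
    # Level-1 block H Rx(beta) H Rx(gamma) H Rx(delta) H.
    return H @ Rx(beta) @ H @ Rx(gamma) @ H @ Rx(delta) @ H

def op_on(u, j, n):
    # Embed a single-qubit operator u on qubit j (0-indexed) of an n-qubit register.
    out = np.array([[1.0]])
    for k in range(n):
        out = np.kron(out, u if k == j else I2)
    return out

def cnot(control, target, n):
    # CNOT as a permutation matrix on n qubits.
    dim = 2 ** n
    m = np.zeros((dim, dim))
    for b in range(dim):
        bits = [(b >> (n - 1 - k)) & 1 for k in range(n)]
        if bits[control]:
            bits[target] ^= 1
        m[sum(bit << (n - 1 - k) for k, bit in enumerate(bits)), b] = 1
    return m

def level2(i, p, n):
    # p holds four (beta, gamma, delta) triples: one L pair before and one after a CNOT.
    (a1, ai), (b1, bi) = p
    pair_a = op_on(level1(*a1), 0, n) @ op_on(level1(*ai), i, n)
    pair_b = op_on(level1(*b1), 0, n) @ op_on(level1(*bi), i, n)
    c = cnot(0, i, n)
    return c @ pair_a @ c @ pair_b  # assumed ordering from figure 1(b)

def layer(params, n):
    # One layer B_n B_{n-1} ... B_2 (B_n leftmost, i.e. applied last).
    out = np.eye(2 ** n)
    for i in range(n - 1, 0, -1):
        out = out @ level2(i, params[i], n)
    return out

# Sanity check: with all angles zero every level-1 block is the identity, the
# two CNOTs in each level-2 block cancel, and one layer is the identity.
n = 3
zero = (0.0, 0.0, 0.0)
params = {i: [(zero, zero), (zero, zero)] for i in range(1, n)}
assert np.allclose(layer(params, n), np.eye(2 ** n))
```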

For an arbitrary quantum circuit U, we say that it can be represented by the variational circuit F if there exists a solution to the parameters (denoted collectively as θ ) in F such that F = U up to an irrelevant global phase. Now, we are ready to give the first theorem stating that for any $C\in \mathcal{C}$, we can use F to represent it:

Theorem 1. For any $C\in \mathcal{C}$, there exists a hypothesis function ${f}_{h}\in \mathcal{F}$, such that fh (x) = fC (x) for all $x\in \mathcal{X}$.

Proof. We first give a high-level intuition for the proof. We note that any gate acting on a constant number of qubits can be decomposed into a quantum circuit with a constant number of elementary gates, namely CNOT gates, H gates, and R_x(θ) gates. Thus, any $C\in \mathcal{C}$ can be decomposed into a quantum circuit with O(n^c) elementary gates. In addition, as our hypothesis quantum circuit F consists of Mn^c layers and every two layers can represent one arbitrary elementary gate acting on any pair of qubits through tuning the parameters properly, we can prove that C can be represented by F in an exact fashion.

The complete proof is as follows. First, we note that C consists of O(n^c) elementary gates by the definition of $\mathcal{C}$. By proposition 1 in appendix A, the circuit C can be written in the product form U_l U_{l−1}...U_1, where l = O(n^c), and each U_i is either CNOT_{1j} or a single-qubit unitary gate V_j on qubit j.

Denoting B_n B_{n−1}...B_2 as one layer of level-2 blocks, we define a block string of d layers as follows:

${\left({B}_{n}{B}_{n-1}\dots {B}_{2}\right)}^{d}.$
We define T as the minimal number so that a block string of T layers can represent U_l U_{l−1}...U_1. As our hypothesis circuit F contains Mn^c layers in total, we will show that to represent U_l U_{l−1}...U_1, the minimal number of layers needed in the block string is no greater than Mn^c. Then, when the first T layers in F have represented C exactly, by the first part of proposition 2 in appendix A, we can set the remaining (Mn^c − T) layers to be the identity gate on the n qubits. Therefore, we can show that F can represent C in an exact fashion and complete the proof of theorem 1.

To show that T ⩽ Mn^c, we will prove that for each U_i, i = 1, 2, ..., l, we only need two layers to represent U_i exactly. Then putting them together, we can prove that we need only 2l layers to represent U_l U_{l−1}...U_1, which yields T ⩽ 2l.

We fix any i ∈ {1, 2, ..., l}. By proposition 2, one layer B_n B_{n−1}...B_2 can represent any single-qubit unitary gate V_j acting on any qubit j = 1, 2, ..., n up to a global phase e^{iα}. Moreover, two layers ${({B}_{n}{B}_{n-1}\dots {B}_{2})}^{2}$ can represent CNOT_{1j} for any j = 2, 3, ..., n. We note that U_i is either a single-qubit gate V_j acting on some qubit j, or a two-qubit gate CNOT_{1j}. Therefore, we need at most two layers to represent U_i exactly.

As we have shown that T ⩽ 2l = O(n^c), by choosing a large enough constant M, we can prove that T ⩽ Mn^c and complete the proof of the theorem.□

Theorem 1 shows that given any quantum circuit $C\in \mathcal{C}$, there always exists a solution to the parameters such that the circuit F can simulate the quantum circuit C acting on the n qubits with zero error. Therefore, given any training set $S={({x}_{i},{y}_{i})}_{i=1,2,\dots ,m}$ sampled independently from some distribution P over $\mathcal{X}\times \mathcal{Y}$, if we have yi = fC(xi) for all i = 1, 2, ..., m, we can find an instance ${f}_{h}\in \mathcal{F}$ with zero training error. In fact, this theorem has a wider range of applications. When a quantum circuit consists of fewer than n^c gates, we can add some identity gates after it, so our theorem covers all the quantum circuits containing no more than n^c gates. In other words, we can use only O(n^{c+1}) gates, arranged in a uniform pattern, to represent all the circuits with n^c gates or fewer. We remark that the number of gates in many famous quantum circuits, such as the quantum support vector machine [14], the HHL algorithm [12], and the quantum Fourier transform [41], scales polynomially with the number of qubits. Therefore, all of these circuits can be represented exactly by our circuit F.

2.3. PAC learnability of $\mathcal{F}$

In the PAC setting, we usually assume that the input samples are randomly generated from a certain unknown probability distribution. As a result, when the hypothesis space covers the underlying distribution and the training dataset is large enough, both the training and generalization errors should be small. In this paper, our hypothesis space $\mathcal{F}$ has been proved in theorem 1 to be able to cover all the parametric quantum circuits in $\mathcal{C}$. Now, we study the sample complexity for training a circuit $F\in \mathcal{F}$ to represent $C\in \mathcal{C}$. To this end, we define a measure of the distance between two pure states in $\mathcal{Y}$, which is used as the loss function $\mathcal{L}:\mathcal{Y}\times \mathcal{Y}\to [0,1]$. Specifically, we define the loss function $\mathcal{L}({y}_{1},{y}_{2})$ to be the trace distance of two quantum states |ψ(y1)⟩ and |ψ(y2)⟩:

$\mathcal{L}({y}_{1},{y}_{2})=\frac{1}{2}{\left\Vert \vert \psi ({y}_{1})\rangle \langle \psi ({y}_{1})\vert -\vert \psi ({y}_{2})\rangle \langle \psi ({y}_{2})\vert \right\Vert }_{1},$

where ||⋅||1 denotes the trace norm of a matrix. Given a hypothesis function ${f}_{h}\in \mathcal{F}$ and the training set $S={({x}_{i},{y}_{i})}_{i=1,2,\dots ,m}$ sampled independently from some distribution P, we can define the empirical risk of fh, which is also known as the in-sample error:

$\hat{R}({f}_{h})=\frac{1}{m}\sum _{i=1}^{m}\mathcal{L}({y}_{i},{f}_{h}({x}_{i})).$

The risk of a hypothesis function ${f}_{h}\in \mathcal{F}$ is then defined as the average loss of fh over the probability distribution P:

$R({f}_{h})={\mathbb{E}}_{(x,y)\sim P}\left[\mathcal{L}(y,{f}_{h}(x))\right].$
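For pure states the trace distance reduces to $\sqrt{1-\vert \langle {\psi }_{1}\vert {\psi }_{2}\rangle \vert ^{2}}$ (see proposition 4 in appendix B), which makes the loss and the empirical risk easy to evaluate numerically. A minimal sketch with illustrative names only:

```python
import numpy as np

def loss(psi1, psi2):
    # Trace distance between two pure states: sqrt(1 - |<psi1|psi2>|^2).
    return np.sqrt(max(0.0, 1.0 - abs(np.vdot(psi1, psi2)) ** 2))

def empirical_risk(F, samples):
    # F: unitary matrix of the hypothesis circuit; samples: list of (x, y) state vectors.
    return np.mean([loss(y, F @ x) for x, y in samples])

# Toy check: a hypothesis identical to the target circuit C has zero empirical risk.
rng = np.random.default_rng(1)
C = np.linalg.qr(rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)))[0]
states = []
for _ in range(5):
    v = rng.normal(size=4) + 1j * rng.normal(size=4)
    states.append(v / np.linalg.norm(v))
samples = [(x, C @ x) for x in states]
assert np.isclose(empirical_risk(C, samples), 0.0)
```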

Our goal is to find a hypothesis ${f}_{h}\in \mathcal{F}$ that minimizes the risk R(fh). As the parametric quantum circuit C is a black box in our setting, we do not know the probability distribution P. In the learning process, we use the training set S and find an empirical risk minimizer $\hat{h}\in \mathcal{F}$ over S. For convenience, given the training set S and the probability distribution P, we define $\hat{h}=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace \hat{R}({h}^{\prime })$ to be the empirical risk minimizer, and $h=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace R({h}^{\prime })$ to be the risk minimizer. Now we formally introduce the definition of PAC learnability for completeness [42]:

Definition 1 (PAC learnability). A hypothesis space $\mathcal{F}$ is PAC learnable, if there exists a function $\nu :{(0,1)}^{2}\to \mathbb{N}$, such that for all (ε, δ) ∈ (0, 1)^2 and all probability measures P over $\mathcal{X}\times \mathcal{Y}$, when the size of the training set |S| ⩾ ν(ε, δ), we have

${\mathbb{P}}_{S}\left(R(\hat{h})-R(h)\leqslant {\epsilon}\right)\geqslant 1-\delta .$  (2)

Here ${\mathbb{P}}_{S}(A)$ denotes the probability that event A happens over repeated sampling of the training set S.

We note that when we randomly select a state $x\in \mathcal{X}$, input it into the circuit C, and get an output state $y={f}_{C}(x)\in \mathcal{Y}$, the resulting probability distribution P of state pairs (x, y) will satisfy R(h) = 0, because we can find an instance ${f}_{h}\in \mathcal{F}$ equal to fC by theorem 1. Therefore, if our $\mathcal{F}$ is PAC learnable, after we prepare the training samples S and get an empirical minimizer $\hat{h}$, with probability 1 − δ the average loss of $\hat{h}$ over P will be no larger than ε, i.e., $R(\hat{h})\leqslant {\epsilon}$. Now we are going to prove that $\mathcal{F}$ is PAC learnable, with a sample complexity ν polynomial in n, 1/ε, and $\mathrm{ln}\enspace \frac{1}{\delta }$.

Theorem 2. The hypothesis space $\mathcal{F}$ satisfies the PAC learnability, with sample complexity $\nu ({\epsilon},\delta )=O(\frac{1}{{{\epsilon}}^{2}}({n}^{c+1}\enspace \mathrm{ln}\enspace \frac{n}{{\epsilon}}+\mathrm{ln}\enspace \frac{1}{\delta }))$.

Proof. The essential idea for the proof relies on the discretization of $\mathcal{F}$. First, we construct a finite set of hypothesis functions ${\mathcal{F}}^{\prime }$, such that for each function ${f}_{h}\in \mathcal{F}$, we can find a function ${f}_{h}^{\prime }\in {\mathcal{F}}^{\prime }$ close enough to fh . Then we use lemma 2 in appendix B to show that ${\mathcal{F}}^{\prime }$ is PAC learnable. Finally, for any ${f}_{h}\in \mathcal{F}$ and its corresponding ${f}_{h}^{\prime }$, as fh and ${f}_{h}^{\prime }$ are close enough, we can prove that their risk and empirical risk are close as well. Therefore, we obtain that $\mathcal{F}$ is PAC learnable.

For clarity, we denote l as the total number of R_x(θ) gates in the circuit F, and observe that l ⩽ 12Mn^{c+1}. We recall that $\mathcal{F}$ is defined as the collection of all the functions ${f}_{h}:\mathcal{X}\to \mathcal{Y}$ that the circuit F can represent by tuning the value of the parameters θ = (θ1, θ2, ..., θl), where θi ∈ [0, 2π) is the variational parameter characterizing the ith x-rotation. Now we define a finite set ${\mathcal{F}}^{\prime }\subseteq \mathcal{F}$ in this way: ${\mathcal{F}}^{\prime }$ is the collection of all the functions ${f}_{h}^{\prime }:\mathcal{X}\to \mathcal{Y}$ that circuit F can represent by tuning the value of all the θi in {0, e, 2e, ..., Ne}, where $e=\frac{{\epsilon}}{6K{n}^{c+1}}$, $N=\lfloor \frac{2\pi }{e}\rfloor $, and K is a large enough constant. As there are l = O(n^{c+1}) rotational gates in circuit F in total, we have

$\vert {\mathcal{F}}^{\prime }\vert ={(N+1)}^{l}={\left\lceil \frac{12\pi K{n}^{c+1}}{{\epsilon}}\right\rceil }^{O({n}^{c+1})},$

which is finite. As a result, we can plug ${\mathcal{F}}^{\prime }$ and ε′ = ε/6 into lemma 2 to obtain that when $\vert S\vert \geqslant \frac{18}{{{\epsilon}}^{2}}(\mathrm{ln}\vert {\mathcal{F}}^{\prime }\vert +\mathrm{ln}\enspace \frac{2}{\delta })$,

${\mathbb{P}}_{S}\left(\forall \enspace {f}_{h}^{\prime }\in {\mathcal{F}}^{\prime }:\vert \hat{R}({f}_{h}^{\prime })-R({f}_{h}^{\prime })\vert \leqslant \frac{{\epsilon}}{6}\right)\geqslant 1-\delta .$  (3)
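The rounding map that sends a hypothesis in $\mathcal{F}$ to its discretized neighbor in ${\mathcal{F}}^{\prime }$ is straightforward; a small sketch (the value of K here is a placeholder, purely for illustration):

```python
import numpy as np

def discretize(thetas, eps, n, c, K=10.0):
    # Round every angle to the grid {0, e, 2e, ..., Ne} with e = eps / (6 K n^(c+1)).
    e = eps / (6 * K * n ** (c + 1))
    N = int(np.floor(2 * np.pi / e))
    return np.clip(np.round(np.asarray(thetas) / e), 0, N) * e

rng = np.random.default_rng(4)
thetas = rng.uniform(0, 2 * np.pi, size=5)
rounded = discretize(thetas, eps=0.1, n=4, c=1)

# Each rounded angle is within one grid step e of the original.
e = 0.1 / (6 * 10.0 * 4 ** 2)
assert np.max(np.abs(thetas - rounded)) <= e
```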

Fixing all the parameters θ in circuit F yields an arbitrary hypothesis function ${f}_{h}\in \mathcal{F}$. We can then round all the parameters θ of circuit F to their nearest multiples of e in {0, e, 2e, ..., Ne}, obtaining a new hypothesis function $\tilde{{f}_{h}}\in {\mathcal{F}}^{\prime }$. By proposition 5, we obtain that for any ${f}_{h}\in \mathcal{F}$,

$\vert R({f}_{h})-R(\tilde{{f}_{h}})\vert \leqslant \frac{{\epsilon}}{6},$  (4)

$\vert \hat{R}({f}_{h})-\hat{R}(\tilde{{f}_{h}})\vert \leqslant \frac{{\epsilon}}{6}.$  (5)

Combining the three inequalities (3)–(5) together, we arrive at

${\mathbb{P}}_{S}\left(\forall \enspace {f}_{h}\in \mathcal{F}:\vert \hat{R}({f}_{h})-R({f}_{h})\vert \leqslant \frac{{\epsilon}}{2}\right)\geqslant 1-\delta ,$  (6)

when $\vert S\vert \geqslant \frac{18}{{{\epsilon}}^{2}}(\mathrm{ln}\vert {\mathcal{F}}^{\prime }\vert +\mathrm{ln}\enspace \frac{2}{\delta })$. To prove that $\mathcal{F}$ is PAC learnable, we recall our notations that $h=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace R({h}^{\prime })$, and $\hat{h}=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace \hat{R}({h}^{\prime })$. Combining the inequality (6) and proposition 6, we obtain that

${\mathbb{P}}_{S}\left(R(\hat{h})-R(h)\leqslant {\epsilon}\right)\geqslant 1-\delta ,$

when $\vert S\vert \geqslant \frac{18}{{{\epsilon}}^{2}}(\mathrm{ln}\vert {\mathcal{F}}^{\prime }\vert +\mathrm{ln}\enspace \frac{2}{\delta })$. Plugging in $\vert {\mathcal{F}}^{\prime }\vert ={\lceil \frac{12\pi K{n}^{c+1}}{{\epsilon}}\rceil }^{O({n}^{c+1})}$, we can prove that the hypothesis space $\mathcal{F}$ is PAC learnable with sample complexity

$\nu ({\epsilon},\delta )=O\left(\frac{1}{{{\epsilon}}^{2}}\left({n}^{c+1}\enspace \mathrm{ln}\enspace \frac{n}{{\epsilon}}+\mathrm{ln}\enspace \frac{1}{\delta }\right)\right).$

This completes the proof of theorem 2.□
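To get a feel for the bound, the sketch below evaluates the explicit sample-size requirement $\vert S\vert \geqslant \frac{18}{{\epsilon}^{2}}(\mathrm{ln}\vert {\mathcal{F}}^{\prime }\vert +\mathrm{ln}\enspace \frac{2}{\delta })$ from the proof; the constants K and M are placeholders chosen purely for illustration:

```python
import math

def sample_bound(n, c, eps, delta, K=10.0, M=1.0):
    # Number of Rx gates in F (at most 12 M n^(c+1)) and grid points per angle.
    l = 12 * M * n ** (c + 1)
    grid = math.ceil(12 * math.pi * K * n ** (c + 1) / eps)
    ln_F_prime = l * math.log(grid)  # ln|F'|
    return math.ceil(18 / eps ** 2 * (ln_F_prime + math.log(2 / delta)))

# The bound grows polynomially in n and 1/eps, and only logarithmically in 1/delta.
print(sample_bound(n=10, c=1, eps=0.1, delta=0.01))
```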

We denote $\mathcal{E}$ as the collection of all the functions ${f}_{C}:\mathcal{X}\to \mathcal{Y}$, where $C\in \mathcal{C}$. In fact, we can prove that $\mathcal{E}$ is PAC learnable as well. By theorem 1, our hypothesis space $\mathcal{F}$ can cover all the quantum circuits in $\mathcal{C}$. Thus we can obtain that $\mathcal{E}\subseteq \mathcal{F}$. Using the inequality (6), we will arrive at

${\mathbb{P}}_{S}\left(\forall \enspace {f}_{C}\in \mathcal{E}:\vert \hat{R}({f}_{C})-R({f}_{C})\vert \leqslant \frac{{\epsilon}}{2}\right)\geqslant 1-\delta ,$  (7)

when $\vert S\vert \geqslant \frac{18}{{{\epsilon}}^{2}}(\mathrm{ln}\vert {\mathcal{F}}^{\prime }\vert +\mathrm{ln}\enspace \frac{2}{\delta })$. Following the same method as in theorem 2, we combine the inequality (7) with proposition 6, and we can prove that $\mathcal{E}$ is PAC learnable with sample complexity $\nu ({\epsilon},\delta )=O(\frac{1}{{{\epsilon}}^{2}}({n}^{c+1}\enspace \mathrm{ln}\enspace \frac{n}{{\epsilon}}+\mathrm{ln}\enspace \frac{1}{\delta }))$ as well.

We stress the differences between our results and the previous works [28, 33] in the literature. First, in reference [28] Chung and Lin focused on a finite set of discretized quantum channels, and their algorithm is based on random orthogonal measurements. In contrast, we focus on a set of unitary quantum circuits with continuous variational parameters, so the size of our concept class is infinite. Moreover, our proof is based on a family of variational quantum neural networks with an explicit uniform structure, which would be useful in practical applications. Second, in reference [33] Bu et al considered a more general class of quantum channels, and their bounds on the sample complexity grow exponentially with the number of qubits n. In contrast, our focus here is variational quantum circuits, and the sample complexity we obtained scales only polynomially with the system size. In other words, while reference [33]'s setting is more general, the sample complexity bounds obtained in this work are exponentially tighter. Our work and references [28, 33] are complementary to each other.

It is also worthwhile to clarify that, although we have proved that the sample complexity for learning any physical quantum circuit is low (namely, it only scales polynomially with the number of qubits involved), this does not mean that these circuits can be learned efficiently, since the time complexity to learn an unknown circuit can still be exponentially high. In fact, it has been proved recently that training a variational quantum circuit, even for logarithmically many qubits and free fermionic systems, is NP-hard [43]. This implies that although we know for sure that our hypothesis space $\mathcal{F}$ can cover all physical quantum circuits and only a polynomial number of samples are needed to train a variational circuit $F\in \mathcal{F}$, how to efficiently solve the optimization problem of minimizing the empirical risk remains unclear and might be an exponentially hard problem in practice.

3. Discussion

We mention that the family of hypothesis quantum variational circuits constructed in this paper is of independent interest due to its use of only O(n^{c+1}) variational parameters while maintaining notable representation power. These circuits might be used as variational ansatz for implementing quantum classifiers [23, 25, 44–48], variational quantum eigensolvers [49–52], or quantum generative adversarial networks [53–55], etc. On the other hand, we also remark that, similarly to many other variational quantum circuits constructed in the literature, this family of variational circuits may suffer from the barren plateau (i.e., vanishing gradient) problem [56, 57] as well. In addition, our construction is appealing in that the family of circuits is obtained without optimizing the structure or the number of parameters. In the future, it would be interesting to explore alternative structures with smaller depths and fewer parameters. Another interesting problem worth further investigation is the scenario where we do not have perfect knowledge about the training data, namely that the training dataset may not be fully labeled. How to extend our results to this scenario remains unknown.

We note that in our proof, the use of PAC learning theory is in fact independent of the learning model, i.e., it can deal with both classical and quantum objectives. In our setting, the objects to be learned are parametric quantum circuits, but we can still use standard classical techniques of PAC learning theory (such as discretization) to obtain the sample complexity bound.

In summary, we have proved that unitary physical quantum circuits are PAC learnable on a quantum computer via empirical risk minimization. In particular, we proved that to learn a unitary quantum circuit with at most n^c local gates, the sample complexity is bounded by $\tilde{O}({n}^{c+1})$. Our results are generally applicable to all unitary quantum circuits of practical interest. There are many notable quantum circuits (algorithms or kernels, such as Shor's factorization algorithm [39], the HHL algorithm [12], the quantum support vector machine [14], and quantum classification based on discrete logarithms [58]) that hold the intriguing potential of exponential quantum speedup. Our results imply that a polynomial number of samples are enough to learn these quantum circuits. In reference [59], Bang et al proposed a method for learning quantum algorithms assisted by machine learning, which shows a learning speedup in designing quantum circuits for solving the Deutsch–Jozsa problem; our results imply that the quantum circuits they used are PAC learnable as well.

Acknowledgments

We thank Wenjie Jiang, Peixin Shen, and Xun Gao in particular for their helpful discussions. This work is supported by the start-up fund from Tsinghua University (Grant No. 53330300320), the National Natural Science Foundation of China (Grant No. 12075128), and the Shanghai Qi Zhi Institute.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).

Appendix A.: The universality of $\mathcal{F}$

In this paper, all the constants such as b, c, K, and M are independent of n, ε, and δ. Also, we recall that $\mathcal{C}$ is the set of all the n-qubit quantum circuits with at most n^c unitary gates, with each gate acting on at most b qubits.

In proving theorem 1 in the main text, we used one lemma and two propositions, which are appended in the following. Lemma 1 is proved in reference [40] and recapped here for completeness. Propositions 1 and 2 are proved in this paper.

Lemma 1 ([40], section 4.5.2). An arbitrary unitary operation on b qubits can be implemented using a circuit containing at most c_0 b^2 4^b single-qubit unitary gates and CNOT gates, where c_0 is a constant.

Proposition 1. For any $C\in \mathcal{C}$, there exist l = O(n^c) unitary gates U_1, U_2, ..., U_l, such that C = U_l U_{l−1}...U_1, and each gate U_i is either a single-qubit unitary gate V_j acting on qubit j, or a CNOT_{1j} gate with the first qubit being the control qubit and the jth qubit being the target one.

Proof. We first prove that C can be decomposed into O(n^c) elementary gates, including CNOT gates and single-qubit unitary gates. By lemma 1, C can be implemented by at most c_0 b^2 4^b n^c = O(n^c) unitary gates, and each gate is either a single-qubit unitary gate or a CNOT_{ij} gate with control qubit i and target qubit j.

To prove that C can be decomposed as the product of l = O(n^c) single-qubit unitary gates and CNOT_{1j} gates, we need only prove that when i ≠ 1 and i ≠ j, CNOT_{ij} can be decomposed into CNOT_{1i}, CNOT_{1j}, and H gates.

When i > 1, j = 1, we can write CNOT_{i1} in this way:

${\text{CNOT}}_{i1}=({H}_{1}\otimes {H}_{i})\,{\text{CNOT}}_{1i}\,({H}_{1}\otimes {H}_{i}),$

where H_1 and H_i denote Hadamard gates on the first and ith qubits, respectively. Meanwhile, when i, j > 1 and i ≠ j, we can decompose CNOT_{ij} into CNOT_{1j} and CNOT_{i1} in this way:

${\text{CNOT}}_{ij}={\text{CNOT}}_{1j}\,{\text{CNOT}}_{i1}\,{\text{CNOT}}_{1j}\,{\text{CNOT}}_{i1},$

and we have shown that CNOT_{i1} can be decomposed into CNOT_{1i} and H gates.

As each decomposition uses only O(1) gates, we can obtain that C can be decomposed as the product of l = O(n^c) single-qubit unitary gates and CNOT_{1j} gates, and the proof is completed.□
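Both decompositions can be verified directly on three qubits; a minimal numerical sketch (helper names are illustrative):

```python
import numpy as np

def cnot(control, target, n):
    # CNOT as a permutation matrix on n qubits (qubit 0 plays the role of "qubit 1").
    dim = 2 ** n
    m = np.zeros((dim, dim))
    for b in range(dim):
        bits = [(b >> (n - 1 - k)) & 1 for k in range(n)]
        if bits[control]:
            bits[target] ^= 1
        m[sum(bit << (n - 1 - k) for k, bit in enumerate(bits)), b] = 1
    return m

def h_on(qubits, n):
    # Hadamard on the listed qubits, identity elsewhere.
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    out = np.array([[1.0]])
    for k in range(n):
        out = np.kron(out, H if k in qubits else np.eye(2))
    return out

n, i, j = 3, 1, 2  # 0-indexed: qubit 0 is "qubit 1" of the text

# CNOT_{i1} = (H_1 H_i) CNOT_{1i} (H_1 H_i)
assert np.allclose(cnot(i, 0, n),
                   h_on({0, i}, n) @ cnot(0, i, n) @ h_on({0, i}, n))

# CNOT_{ij} = CNOT_{1j} CNOT_{i1} CNOT_{1j} CNOT_{i1}
assert np.allclose(cnot(i, j, n),
                   cnot(0, j, n) @ cnot(i, 0, n) @ cnot(0, j, n) @ cnot(i, 0, n))
```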

In a level-2 block B_j, there are two level-1 blocks on qubit 1 and two on qubit j. Each level-1 block has three parameters β, γ, δ, and by the Z–X decomposition [40], we can tune these three parameters to enable a level-1 block L_j on qubit j to represent any single-qubit unitary gate acting on qubit j up to an irrelevant global phase. Also, by setting the three parameters to zero, a level-1 block L_j can represent the identity gate. We will prove that by tuning the parameters of the level-2 blocks, one layer B_n B_{n−1}...B_2 can represent any single-qubit unitary gate acting on any qubit j, and that CNOT_{1j} can be represented by two layers.

Proposition 2. (1) One layer B_n B_{n−1}...B_2 can represent any single-qubit unitary gate V_j acting on any qubit j up to an irrelevant global phase.

(2) Two layers ${({B}_{n}{B}_{n-1}\dots {B}_{2})}^{2}$ can represent CNOT_{1j} up to an irrelevant global phase.

Proof. To prove this proposition, we will set most of the level-2 blocks in the layers to be the identity gate and use at most two blocks to represent the gates we need.

We prove part one first. We separate the claim into two cases, j = 1 and j ≠ 1. When j = 1, we can let B_n B_{n−1}...B_3 represent the identity gate by tuning all their parameters to zero. For clarity, we denote L_j as a level-1 block acting on the jth qubit. Given any unitary gate V_1 on qubit 1, a level-1 block L_1 can represent V_1, and both level-1 blocks L_1 and L_2 can represent the identity gate. As a level-2 block B_2 consists of four level-1 blocks and two CNOT gates, we can tune the parameters of the four level-1 blocks in the following way so that B_2 can represent V_1:

${B}_{2}={\text{CNOT}}_{12}\,({I}_{1}\otimes {I}_{2})\,{\text{CNOT}}_{12}\,({V}_{1}\otimes {I}_{2})={V}_{1}\otimes {I}_{2},$

where the pair of level-1 blocks between the two CNOT gates is set to the identity, so that the two CNOT_{12} gates cancel (all up to the irrelevant global phase e^{iα} inherited from the level-1 representation). Similarly, when j ≠ 1, we can let B_n B_{n−1}...B_{j+1} and B_{j−1}...B_3 B_2 represent the identity gate. Then we need only let B_j represent the unitary gate V_j. Given any unitary gate V_j on qubit j, we can tune the parameters of the four level-1 blocks in the following way so that B_j can represent V_j:

${B}_{j}={\text{CNOT}}_{1j}\,({I}_{1}\otimes {I}_{j})\,{\text{CNOT}}_{1j}\,({I}_{1}\otimes {V}_{j})={I}_{1}\otimes {V}_{j}.$
Therefore, the proof of part one is completed. Now we will prove part two. We set all the parameters in the two layers ${({B}_{n}{B}_{n-1}\dots {B}_{2})}^{2}$ to be zero except those in the two B_j blocks. Then we will use the two B_j blocks to represent CNOT_{1j}. We decompose CNOT_{1j} up to an irrelevant global phase factor e^{−iπ/4} in the following way:

${\text{CNOT}}_{1j}={\mathrm{e}}^{-\mathrm{i}\pi /4}\left({W}_{4}\otimes {W}_{3}\right){\text{CNOT}}_{1j}\left({I}_{1}\otimes {W}_{2}\right){\text{CNOT}}_{1j}\left({I}_{1}\otimes {W}_{1}\right),$
where we set ${W}_{1}={R}_{z}\left(\frac{\pi }{2}\right),{W}_{2}={R}_{y}\left(\frac{\pi }{2}\right),{W}_{3}={R}_{z}\left(-\frac{\pi }{2}\right){R}_{y}\left(-\frac{\pi }{2}\right)$, and ${W}_{4}={R}_{z}\left(-\frac{\pi }{2}\right)$. Here we denote Rz (θ) = e−iθZ/2 and Ry (θ) = e−iθY/2 as the rotation operators along the z-axis and y-axis on the Bloch sphere, respectively. In addition, W4 and the identity gate I1 act on the first qubit, and W1, W2, and W3 act on the jth qubit.

Hence, we use the B_j block in the first layer to represent ${\text{CNOT}}_{1j}\left({I}_{1}\otimes {W}_{2}\right){\text{CNOT}}_{1j}\left({I}_{1}\otimes {W}_{1}\right)$ in this way:

${B}_{j}={\text{CNOT}}_{1j}\,({I}_{1}\otimes {W}_{2})\,{\text{CNOT}}_{1j}\,({I}_{1}\otimes {W}_{1}),$

where the level-1 blocks on the first qubit are tuned to the identity, and the two level-1 blocks on the jth qubit are tuned to represent W_2 and W_1, respectively.
Finally, we use the second level-2 block B_j to represent W_4 ⊗ W_3, where W_4 acts on the first qubit and W_3 acts on the jth qubit. Therefore, two layers ${({B}_{n}{B}_{n-1}\dots {B}_{2})}^{2}$ can represent CNOT_{1j} up to an irrelevant global phase, and this completes the proof of part two.□
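The decomposition of CNOT_{1j} with the stated W_1, ..., W_4 can be checked numerically; a small sketch (assuming the standard conventions R_z(t) = e^{−itZ/2} and R_y(t) = e^{−itY/2}):

```python
import numpy as np

I2 = np.eye(2)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]])

def Rz(t):
    return np.cos(t / 2) * I2 - 1j * np.sin(t / 2) * Z

def Ry(t):
    return np.cos(t / 2) * I2 - 1j * np.sin(t / 2) * Y

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

W1 = Rz(np.pi / 2)
W2 = Ry(np.pi / 2)
W3 = Rz(-np.pi / 2) @ Ry(-np.pi / 2)
W4 = Rz(-np.pi / 2)

# (W4 x W3) CNOT (I x W2) CNOT (I x W1) equals CNOT up to the phase e^{i pi/4}.
rhs = np.kron(W4, W3) @ CNOT @ np.kron(I2, W2) @ CNOT @ np.kron(I2, W1)
assert np.allclose(np.exp(-1j * np.pi / 4) * rhs, CNOT)
```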

Appendix B.: PAC learnability of $\mathcal{F}$

The following lemma shows that any finite hypothesis space ${\mathcal{F}}^{\prime }$ is PAC learnable.

Lemma 2 ([42], corollary 1.2). Assume that the hypothesis space ${\mathcal{F}}^{\prime }$ is finite, δ ∈ (0, 1], ε > 0, and the range of the loss function is in an interval of length c ⩾ 0. Then if the size of the training set $\vert S\vert \geqslant \frac{{c}^{2}}{2{{\epsilon}}^{2}}\left(\mathrm{ln}\vert {\mathcal{F}}^{\prime }\vert +\mathrm{ln}\enspace \frac{2}{\delta }\right)$, the event $\forall \enspace {f}_{h}\in {\mathcal{F}}^{\prime }:\vert \hat{R}({f}_{h})-R({f}_{h})\vert \leqslant {\epsilon}$ holds with probability at least 1 − δ over repeated sampling of the training set S.

Our circuit F consists of R_x(θ), H, and CNOT gates. By assigning two different sets of values to the variational parameters θ, we get two distinct circuits F_1 and F_2, and their corresponding hypothesis functions f_1 and f_2 are different. We note that although F_1 and F_2 differ in the values of their variational parameters θ, the ordering of their gates (R_x(θ), H, and CNOT gates) is the same. We will show that when all the variational parameters in circuits F_1 and F_2 are close enough, the risk and empirical risk of f_1 and f_2 will be close. To prove this, we first define the distance between two unitary matrices U_1, ${U}_{2}\in {\mathbb{C}}^{{2}^{n}\times {2}^{n}}$ as the 2-norm (spectral norm) of the matrix U_1 − U_2:

$E({U}_{1},{U}_{2})={\left\Vert {U}_{1}-{U}_{2}\right\Vert }_{2}.$
Now we introduce the following proposition about the function E(U_1, U_2).

Proposition 3. The function E(U1, U2) satisfies the following properties:

  • (a)  
    Let U, V be the R_x(θ), R_x(θ + ε) gates acting on the jth qubit, respectively, where ε ∈ (0, 1), θ ∈ [0, 2π), j = 1, 2, ..., n. Then E(U, V) ⩽ ε.
  • (b)  
    $E\left({U}_{l}{U}_{l-1}\dots {U}_{1},{V}_{l}{V}_{l-1}\dots {V}_{1}\right)\leqslant \sum _{j=1}^{l}E\left({U}_{j},{V}_{j}\right)$, where U1, U2, ..., Ul , V1, V2, ..., Vl are unitary matrices.

Proof. The second property is shown in [40], section 4.5.3. We need only prove the first property. By the unitary invariance of the spectral norm,

$E(U,V)={\left\Vert {R}_{x}(\theta )-{R}_{x}(\theta +{\epsilon})\right\Vert }_{2}={\left\Vert I-{R}_{x}({\epsilon})\right\Vert }_{2}={\left\Vert I-{\mathrm{e}}^{-\mathrm{i}{\epsilon}X/2}\right\Vert }_{2}\overset{(\mathrm{i})}{\leqslant }\sum _{k=1}^{\infty }\frac{{({\epsilon}/2)}^{k}}{k!}{\left\Vert X\right\Vert }_{2}^{k}={\mathrm{e}}^{{\epsilon}/2}-1\overset{(\mathrm{ii})}{\leqslant }{\epsilon},$

where (i) uses Taylor's expansion of the operator R_x(ε) = e^{−iεX/2}, and (ii) uses Taylor's expansion of exp(ε/2) and that ||X||_2 = ||I||_2 = 1.□
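A quick numerical sanity check of property (a) is straightforward; the exact distance is 2 sin(ε/4), which indeed stays below ε:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]])

def Rx(t):
    return np.cos(t / 2) * np.eye(2) - 1j * np.sin(t / 2) * X

rng = np.random.default_rng(2)
for _ in range(1000):
    theta = rng.uniform(0, 2 * np.pi)
    eps = rng.uniform(0, 1)
    dist = np.linalg.norm(Rx(theta) - Rx(theta + eps), ord=2)  # spectral norm
    assert dist <= eps + 1e-12
    assert np.isclose(dist, 2 * np.sin(eps / 4))  # the exact value
```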

We recall that $\mathcal{L}({y}_{1},{y}_{2})$ is the trace distance of two pure states |ψ(y1)⟩ and |ψ(y2)⟩. Then we introduce the following properties of $\mathcal{L}({y}_{1},{y}_{2})$.

Proposition 4. The function $\mathcal{L}:\mathcal{Y}\times \mathcal{Y}\to [0,1]$ satisfies the following two properties:

  • (a)  
    For any ${y}_{1},{y}_{2},{y}_{3}\in \mathcal{Y}$, we have $\mathcal{L}({y}_{1},{y}_{3})-\mathcal{L}({y}_{2},{y}_{3})\leqslant \mathcal{L}({y}_{1},{y}_{2})$.
  • (b)  
    For any ${y}_{1},{y}_{2}\in \mathcal{Y}$, we have $\mathcal{L}({y}_{1},{y}_{2})\leqslant {\Vert}\vert \psi ({y}_{1})\rangle -\vert \psi ({y}_{2})\rangle {{\Vert}}_{2}$.

Proof. The first part of this proposition is the triangle inequality, which is proved in [40], section 9.2.1. Here, we only prove the second property. We denote $F(\vert \psi ({y}_{1})\rangle ,\vert \psi ({y}_{2})\rangle )=\left\vert \langle \psi ({y}_{1})\vert \psi ({y}_{2})\rangle \right\vert $ as the fidelity between the two states |ψ(y1)⟩ and |ψ(y2)⟩. Then we will arrive at

$\mathcal{L}({y}_{1},{y}_{2})\overset{(\mathrm{iii})}{=}\sqrt{1-F{\left(\vert \psi ({y}_{1})\rangle ,\vert \psi ({y}_{2})\rangle \right)}^{2}}=\sqrt{1-{\left\vert \langle \psi ({y}_{1})\vert \psi ({y}_{2})\rangle \right\vert }^{2}},$

where the proof of equation (iii) is given in [40], section 9.2.3.

In addition, we note that for any complex number $z\in \mathbb{C}$ and its complex conjugate ${z}^{\ast }\in \mathbb{C}$, as (|z| − 1)^2 ⩾ 0, we have 2 − 2|z| ⩾ 1 − |z|^2. Hence, since z + z* ⩽ 2|z|, we get 2 − z − z* ⩾ 2 − 2|z| ⩾ 1 − |z|^2. Letting z = ⟨ψ(y1)|ψ(y2)⟩, we obtain that

$\mathcal{L}({y}_{1},{y}_{2})=\sqrt{1-{\vert z\vert }^{2}}\leqslant \sqrt{2-z-{z}^{\ast }}={\left\Vert \vert \psi ({y}_{1})\rangle -\vert \psi ({y}_{2})\rangle \right\Vert }_{2}.$

This completes the proof.□
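Property (b) can also be checked on random states; a short sketch:

```python
import numpy as np

rng = np.random.default_rng(3)

def rand_state(d=8):
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

for _ in range(1000):
    psi1, psi2 = rand_state(), rand_state()
    trace_dist = np.sqrt(1 - abs(np.vdot(psi1, psi2)) ** 2)  # pure-state trace distance
    assert trace_dist <= np.linalg.norm(psi1 - psi2) + 1e-12
```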

Now we will use the properties of E(U_1, U_2) and $\mathcal{L}({y}_{1},{y}_{2})$ to show that the differences of both the risk and the empirical risk between f_1 and f_2 are bounded by ε, where the hypothesis functions f_1 and f_2 correspond to the variational circuits F_1 and F_2, respectively.

Proposition 5. We denote ${\boldsymbol{\theta }}^{{F}_{1}}=({\theta }_{1}^{{F}_{1}},{\theta }_{2}^{{F}_{1}},\dots ,{\theta }_{l}^{{F}_{1}})$ as a vector containing all the variational parameters in F1, where l is the number of Rx (θ) gates in circuit F, and ${\theta }_{i}^{{F}_{1}}$ is the value of the variational parameter characterizing the ith x-rotation of F1. Similarly, we denote ${\boldsymbol{\theta }}^{{F}_{2}}=({\theta }_{1}^{{F}_{2}},{\theta }_{2}^{{F}_{2}},\dots ,{\theta }_{l}^{{F}_{2}})$ as a vector containing all the variational parameters in F2.

Let ${f}_{1},{f}_{2}\in \mathcal{F}$ be the corresponding hypothesis functions of F_1, F_2, respectively. Then given any probability distribution P over $\mathcal{X}\times \mathcal{Y}$ and training set $S={({x}_{i},{y}_{i})}_{i=1,2,\dots ,m}$, the following two inequalities hold if ${\Vert}{\boldsymbol{\theta }}^{{F}_{1}}-{\boldsymbol{\theta }}^{{F}_{2}}{{\Vert}}_{\infty }\leqslant \frac{{\epsilon}}{K{n}^{c+1}}$ (K is a large enough constant):

$\vert R({f}_{1})-R({f}_{2})\vert \leqslant {\epsilon},\qquad \vert \hat{R}({f}_{1})-\hat{R}({f}_{2})\vert \leqslant {\epsilon}.$
Proof. First, we will prove that E(F1, F2) ⩽ epsilon when ${\Vert}{\boldsymbol{\theta }}^{{F}_{1}}-{\boldsymbol{\theta }}^{{F}_{2}}{{\Vert}}_{\infty }\leqslant \frac{{\epsilon}}{K{n}^{c+1}}$. Then we will use it to show the risk and empirical risk of f1 and f2 are close.

As F is composed of H gates, Rx (θ) gates and CNOT gates, we can write F1 = Ul Ul−1...U1 and F2 = Vl Vl−1...V1, where Ui is the ith gate in F1, and Vi is the ith gate in F2. As Ui and Vi are of the same type of gates, we can prove that $E({U}_{i},{V}_{i})\leqslant \frac{{\epsilon}}{K{n}^{c+1}}$ by separating different cases on the types of Ui and Vi :

Case I: If Ui and Vi are both H gates or both CNOT gates, as there is no variational parameter in H or CNOT, we have Ui = Vi , and we obtain that E(Ui , Vi ) = 0.

Case II: If Ui and Vi are both Rx (θ) gates, as the difference of ${\theta }_{i}^{{F}_{1}}$ and ${\theta }_{i}^{{F}_{2}}$ is at most $\frac{{\epsilon}}{K{n}^{c+1}}$, by the first property of proposition 3, we have $E({U}_{i},{V}_{i})\leqslant \frac{{\epsilon}}{K{n}^{c+1}}$.

We note that l = O(n^{c+1}) by our construction of F. By the second property of proposition 3 and choosing a large enough constant K such that l ⩽ Kn^{c+1}, we can get

$E({F}_{1},{F}_{2})\leqslant \sum _{i=1}^{l}E({U}_{i},{V}_{i})\leqslant l\cdot \frac{{\epsilon}}{K{n}^{c+1}}\leqslant {\epsilon}.$
Now we can bound the differences of the risk and the empirical risk between the two hypothesis functions f_1 and f_2, respectively. For convenience, we define D(f_1, f_2) as $\underset{x\in \mathcal{X},y\in \mathcal{Y}}{\mathrm{sup}}\left\vert \mathcal{L}(y,{f}_{1}(x))-\mathcal{L}(y,{f}_{2}(x))\right\vert $. We observe that both |R(f_1) − R(f_2)| and $\vert \hat{R}({f}_{1})-\hat{R}({f}_{2})\vert $ can be bounded by D(f_1, f_2). Hence, we will prove that D(f_1, f_2) ⩽ ε, and we can obtain the two inequalities |R(f_1) − R(f_2)| ⩽ ε and $\vert \hat{R}({f}_{1})-\hat{R}({f}_{2})\vert \leqslant {\epsilon}$. Indeed,

$D({f}_{1},{f}_{2})\overset{(\mathrm{iv})}{\leqslant }\underset{x\in \mathcal{X}}{\mathrm{sup}}\enspace \mathcal{L}({f}_{1}(x),{f}_{2}(x))\overset{(\mathrm{v})}{\leqslant }\underset{x\in \mathcal{X}}{\mathrm{sup}}{\left\Vert ({F}_{1}-{F}_{2})\vert \psi (x)\rangle \right\Vert }_{2}\leqslant E({F}_{1},{F}_{2})\leqslant {\epsilon},$

where (iv) uses the first property of function $\mathcal{L}$ in proposition 4, and (v) uses the second property of function $\mathcal{L}$ in proposition 4. This completes the proof of proposition 5.□
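The chain of bounds in proposition 5 can be exercised numerically on a toy circuit; the sketch below (a single-qubit product of x-rotations, purely for illustration) checks that a small parameter perturbation keeps both the operator distance and the loss gap small:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]])

def Rx(t):
    return np.cos(t / 2) * np.eye(2) - 1j * np.sin(t / 2) * X

rng = np.random.default_rng(5)
theta = rng.uniform(0, 2 * np.pi, size=4)
delta = rng.uniform(-1e-3, 1e-3, size=4)

# Toy "circuits": products of four x-rotations with slightly different angles.
F1 = np.eye(2, dtype=complex)
F2 = np.eye(2, dtype=complex)
for t, d in zip(theta, delta):
    F1 = F1 @ Rx(t)
    F2 = F2 @ Rx(t + d)

E = np.linalg.norm(F1 - F2, ord=2)
assert E <= np.abs(delta).sum() + 1e-12  # propositions 3(a) and 3(b)

v = rng.normal(size=2) + 1j * rng.normal(size=2)
v /= np.linalg.norm(v)
y1, y2 = F1 @ v, F2 @ v
loss_gap = np.sqrt(max(0.0, 1 - abs(np.vdot(y1, y2)) ** 2))  # L(f1(x), f2(x))
assert loss_gap <= np.linalg.norm(y1 - y2) + 1e-12  # proposition 4(b)
assert np.linalg.norm(y1 - y2) <= E + 1e-12         # spectral-norm bound
```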

We note that in our proof of theorem 2, we used lemma 2 and proposition 5 to show that $\forall \enspace {f}_{h}\in \mathcal{F}:\left\vert \hat{R}\left({f}_{h}\right)-R\left({f}_{h}\right)\right\vert \leqslant \frac{{\epsilon}}{2}$ holds with probability 1 − δ. To prove that $\mathcal{F}$ is PAC learnable, we introduce the following technical lemma.

Proposition 6. Assume $\forall \enspace {f}_{h}\in \mathcal{F}:\left\vert \hat{R}\left({f}_{h}\right)-R\left({f}_{h}\right)\right\vert \leqslant \frac{{\epsilon}}{2}$ holds. We denote $h=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace R({h}^{\prime })$, and $\hat{h}=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace \hat{R}({h}^{\prime })$. Then we have

$R(\hat{h})-R(h)\leqslant {\epsilon}.$
Proof. The proof of this inequality is given in [42], section 1.2. We give the proof here for completeness. To bound $R(\hat{h})-R(h)$, we observe that it can be expressed as the sum of $R(\hat{h})-\hat{R}(\hat{h})$ and $\hat{R}(\hat{h})-R(h)$. Then we can use $\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{sup}}\left\vert \hat{R}({h}^{\prime })-R({h}^{\prime })\right\vert $ to bound $R(\hat{h})-\hat{R}(\hat{h})$ and $\hat{R}(\hat{h})-R(h)$, respectively. Concretely,

$0\overset{(\mathrm{vi})}{\leqslant }R(\hat{h})-R(h)=\left[R(\hat{h})-\hat{R}(\hat{h})\right]+\left[\hat{R}(\hat{h})-R(h)\right]\overset{(\mathrm{vii})}{\leqslant }\left[R(\hat{h})-\hat{R}(\hat{h})\right]+\left[\hat{R}(h)-R(h)\right]\leqslant 2\,\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{sup}}\left\vert \hat{R}({h}^{\prime })-R({h}^{\prime })\right\vert \leqslant {\epsilon},$

where (vi) uses that $h=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace R({h}^{\prime })$, and (vii) uses that $\hat{h}=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace \hat{R}({h}^{\prime })$.□
