
Sample complexity of learning parametric quantum circuits


Published 1 March 2022 © 2022 The Author(s). Published by IOP Publishing Ltd
Citation: Haoyuan Cai et al 2022 Quantum Sci. Technol. 7 025014. DOI: 10.1088/2058-9565/ac4f30


Abstract

Quantum computers hold unprecedented potential for machine learning applications. Here, we prove that physical quantum circuits are probably approximately correct (PAC) learnable on a quantum computer via empirical risk minimization: to learn a parametric quantum circuit with at most n^c gates, each gate acting on a constant number of qubits, the sample complexity is bounded by $\tilde{O}({n}^{c+1})$. In particular, we explicitly construct a family of variational quantum circuits with O(n^{c+1}) elementary gates arranged in a fixed pattern, which can represent all physical quantum circuits consisting of at most n^c elementary gates. Our results provide a valuable guide for quantum machine learning in both theory and practice.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Over the past few decades, machine learning, especially deep learning, has made dramatic progress [1, 2] in a wide range of tasks, such as playing the game of Go [3, 4], protein structure prediction [5], and computer vision [6]. More recently, the interplay between machine learning and quantum physics has attracted tremendous interest [7–11], giving birth to an emergent research frontier of quantum machine learning. A number of notable quantum algorithms, such as the Harrow–Hassidim–Lloyd (HHL) algorithm [12], quantum generative models [13], and the quantum support vector machine [14], have been designed to enhance, speed up, or innovate machine learning with quantum devices. These algorithms bear the intriguing potential of exhibiting exponential advantages compared to their classical counterparts, although subtle caveats do exist and require careful examination in practice [15].

In 1984, Valiant introduced the probably approximately correct (PAC) learning model [16], which gives a complexity-theoretical foundation and a mathematically rigorous framework for studying machine learning. Since then, the PAC learning model has been extensively studied in various machine learning scenarios to understand why and when efficient learning is possible or not [17, 18]. With the rapid progress in quantum computing [19–21], practical applications of quantum machine learning have become more and more realistic [22–26]. A natural problem is then to generalize the PAC learning model to quantum learning scenarios. Indeed, notable progress has been made along this direction [27–37]. For example, in reference [28] Chung and Lin have studied the sample complexity of learning quantum channels and demonstrated that we can PAC-learn a polynomial-size quantum circuit with a polynomial number of samples. In addition, in reference [33] Bu et al investigated the Rademacher complexity of quantum circuits in the framework of quantum resource theories [38]. They introduced a resource measure of magic for quantum channels based on the (p, q) group norm and found useful bounds for how the statistical complexity scales with resources in the quantum circuits. Yet, this fledgling research direction is still in its rapidly growing early phase and many important issues remain barely explored.

In this paper, we study the problem of the sample complexity for learning parametric quantum circuits. We focus on the supervised learning scenario and prove that all unitary physical quantum circuits are PAC learnable on a quantum computer via empirical risk minimization. More concretely, we prove the following two theorems: (1) any physical n-qubit quantum circuit consisting of at most n^c unitary gates, with each gate acting on a constant number of qubits, can be represented exactly by a family of variational quantum circuits with O(n^{c+1}) elementary gates arranged in a fixed uniform pattern; (2) this family of variational quantum circuits is PAC learnable. Since most quantum circuits that can be efficiently implemented on a quantum computer, such as the circuits for Shor's algorithm [39] or the HHL algorithm [12], contain at most a polynomial number of gates, our results imply that they are all PAC learnable with a quantum computer.

2. Results

2.1. Notations and the general setting

We define the concept class $\mathcal{C}$ as the collection of all the n-qubit parametric quantum circuits with at most n^c unitary gates, each gate acting on at most b qubits (b, c are constant numbers independent of n). We note that $\mathcal{C}$ is general enough to include most quantum circuits in practical applications. Here, we study the learnability of the quantum circuits in $\mathcal{C}$ under the PAC learning framework [18]. Let $C\in \mathcal{C}$ be any n-qubit circuit in this concept class. When we input an n-qubit pure state |ψ_in⟩ to C, we will get an output n-qubit pure state |ψ_out⟩ = C|ψ_in⟩. Therefore, C can be viewed as a function ${f}_{C}:\mathcal{X}\to \mathcal{Y}$, where its domain $\mathcal{X}$ and range $\mathcal{Y}$ are both the set of all n-qubit pure states. In this work, we write $x\in \mathcal{X}$ as an abbreviation of the n-qubit quantum state |ψ(x)⟩, and similarly for $y\in \mathcal{Y}$. With these notations, we sometimes write y = fC(x) to denote |ψ(y)⟩ = C|ψ(x)⟩ for simplicity.

We consider the supervised learning scenario [18] and denote the training set of size m as S = {(x1, y1), (x2, y2), ..., (xm, ym)}. Under the PAC learning framework, to learn the unknown circuit C, we assume we have m independent n-qubit input samples $\left\{{x}_{1},{x}_{2},\dots ,{x}_{m}\right\}\in {\mathcal{X}}^{m}$; inputting them into C yields the output states $\left\{{y}_{1},{y}_{2},\dots ,{y}_{m}\right\}\in {\mathcal{Y}}^{m}$. The essential task of supervised learning is then to learn from S a hypothesis function (here a quantum circuit F) that approximates the target function fC(x). This can be accomplished by minimizing a suitable loss function over a set of variational model parameters. More concretely, we construct a variational quantum circuit F consisting of multiple gates, some of which have tunable parameters. By tuning these parameters, we can use F to represent different functions ${f}_{h}:\mathcal{X}\to \mathcal{Y}$, and we define our hypothesis space $\mathcal{F}$ as the collection of all the functions fh that F can represent. Given m independent samples and a tunable quantum circuit F, we can use the following process to make F a good approximation of C. We tune the parameters of F according to the training set S, so that when we feed the state xi (i = 1, 2, ..., m) into the input of F, the output of F is a good approximation of yi. By PAC learning theory, assuming that $\mathcal{F}$ has good generalization power, fh's decent performance on the training set implies its good performance over the whole sample space.

The effectiveness of the above process is based on two assumptions. First, the space of $\mathcal{F}$ should be large enough, so that given any set of samples $S={({x}_{i},{y}_{i})}_{i=1,2,\dots ,m}$, we can always find a function ${f}_{h}\in \mathcal{F}$ such that fh(xi) approximates yi with a small error for all i = 1, 2, ..., m. Second, the space of $\mathcal{F}$ should not be too large or complex, so that $\mathcal{F}$ has favorable generalization power to generalize its performance from the training set S to the true probability distribution that S is sampled from. This is a reflection of Occam's razor [18]. Therefore, we need to design a variational quantum circuit class $\mathcal{F}$ that meets the following two requirements simultaneously in order to learn fC:

  • R1: For any $C\in \mathcal{C}$, there exists a hypothesis function ${f}_{h}\in \mathcal{F}$, such that fh (x) = fC (x) for all $x\in \mathcal{X}$.
  • R2: The hypothesis space $\mathcal{F}$ satisfies the PAC learnability.

We note that the first requirement R1 is stronger than the first assumption, because the function fh in R1 is identical to fC. Hence, the training error of fh on the sample set S is necessarily zero, whereas the first assumption only requires that fh has a small training error on S. In supervised learning, obtaining high-quality training samples is usually resource-demanding in practice. Thus, studying the sample complexity becomes crucial. In the following, we will study the sample complexity of learning parametric quantum circuits and rigorously prove that any physical quantum circuit is PAC learnable.

2.2. A family of universal variational circuits

To meet R1, we should ensure that F has representation power for all the n-qubit parametric quantum circuits C in $\mathcal{C}$. We observe that any quantum circuit C can be decomposed as a sequence of O(n^c) H gates, R_x(θ) gates, and CNOT gates (see proposition 1 in appendix A). Thus, in our construction of F, we also use these three kinds of gates and arrange them into a uniform pattern, so that its scaling to quantum circuits with more qubits is clear. The construction of F is illustrated in figure 1, which is based on block assembling, i.e., assembling some relatively small gadgets to form a more complicated block. By convention, we denote the three Pauli matrices by X, Y, and Z. A well-known result about quantum circuits states that any single-qubit unitary gate can be expressed as e^{iα} R_z(β)R_x(γ)R_z(δ), where α, β, γ, and δ are four real numbers, and R_z(θ) = e^{−iθZ/2} and R_x(θ) = e^{−iθX/2} are the rotation operators along the z-axis and x-axis on the Bloch sphere, respectively [40]. Inspired by this, we define a set of basic gadgets (we call them level-1 blocks) to be L = {HR_x(β)HR_x(γ)HR_x(δ)H | β, γ, δ ∈ [0, 2π)}. Since HXH = Z implies HR_x(θ)H = R_z(θ), a level-1 block implements exactly R_z(β)R_x(γ)R_z(δ). In this way, we can tune the parameters of a level-1 block so that it can represent all the single-qubit unitary gates up to an irrelevant global phase factor e^{iα}. In addition, when we set β = γ = δ = 0, the level-1 block reduces to the identity gate.
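The identity behind the level-1 block can be checked numerically. The following minimal sketch (assuming the standard conventions R_x(t) = e^{−itX/2} and R_z(t) = e^{−itZ/2}; all function names are illustrative) verifies that HR_x(β)HR_x(γ)HR_x(δ)H equals R_z(β)R_x(γ)R_z(δ), and that the all-zero setting gives the identity:

```python
import numpy as np

# Gate definitions (standard conventions).
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])

def Rx(t):
    return np.cos(t / 2) * np.eye(2) - 1j * np.sin(t / 2) * X

def Rz(t):
    return np.cos(t / 2) * np.eye(2) - 1j * np.sin(t / 2) * Z

def level1(beta, gamma, delta):
    # Level-1 block: H Rx(beta) H Rx(gamma) H Rx(delta) H (leftmost factor acts last).
    return H @ Rx(beta) @ H @ Rx(gamma) @ H @ Rx(delta) @ H

rng = np.random.default_rng(0)
beta, gamma, delta = rng.uniform(0, 2 * np.pi, size=3)

# H X H = Z implies H Rx(t) H = Rz(t), so the block equals Rz(beta) Rx(gamma) Rz(delta).
assert np.allclose(level1(beta, gamma, delta), Rz(beta) @ Rx(gamma) @ Rz(delta))

# All-zero parameters reduce the block to the identity (H^4 = I).
assert np.allclose(level1(0, 0, 0), np.eye(2))
```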


Figure 1. Pictorial illustration for the construction of the hypothesis quantum circuits. (a) The elementary (level-1) block L. This block contains four Hadamard gates and three single-qubit rotations along the x direction with rotation angles parameterized by δ, γ, and β, respectively. (b) The level-2 block B_i constructed from four level-1 blocks and two CNOT gates with the first qubit being the control qubit and the ith qubit being the target one. (c) The constructed hypothesis quantum circuit, which consists of Mn^c repeated layers of B_n B_{n−1}...B_2.


Using level-1 blocks we can construct level-2 blocks B_i (i = 2, 3, ..., n). First we put two level-1 blocks (denoted as L) at qubit 1 and qubit i, and then insert two CNOT_{1i} gates, as shown in figure 1(b). Here, we use CNOT_{1i} to denote the controlled-NOT gate between the first and ith qubits, with the first qubit being the control qubit and the ith one being the target one. With this, the desired hypothesis quantum circuit F can be constructed as:

$F={\left({B}_{n}{B}_{n-1}\dots {B}_{2}\right)}^{M{n}^{c}},$  (1)

where M is a constant independent of the number of qubits n. We mention that the hypothesis circuit F has a uniform structure for arbitrary system sizes. In addition, only the x-rotations contain variational parameters, and it is straightforward to obtain that the total number of parameters used for defining F scales as O(n^{c+1}).
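As a concrete illustration, the sketch below assembles F as an explicit matrix for a small number of qubits. The internal ordering of the level-1 pairs and CNOTs inside B_i is read off figure 1(b) as B_i = CNOT_{1i}(L ⊗ L)CNOT_{1i}(L ⊗ L); this ordering is an assumption of the sketch, and all helper names are illustrative:

```python
import numpy as np

I2 = np.eye(2)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]])

def Rx(t):
    return np.cos(t / 2) * I2 - 1j * np.sin(t / 2) * X

def level1(beta, gamma, delta):
    # Level-1 block H Rx(beta) H Rx(gamma) H Rx(delta) H.
    return H @ Rx(beta) @ H @ Rx(gamma) @ H @ Rx(delta) @ H

def op_on(u, j, n):
    # Embed a single-qubit operator u on qubit j (0-indexed) of an n-qubit register.
    out = np.array([[1.0]])
    for k in range(n):
        out = np.kron(out, u if k == j else I2)
    return out

def cnot(control, target, n):
    # CNOT as a permutation matrix on n qubits.
    dim = 2 ** n
    m = np.zeros((dim, dim))
    for b in range(dim):
        bits = [(b >> (n - 1 - k)) & 1 for k in range(n)]
        if bits[control]:
            bits[target] ^= 1
        m[sum(bit << (n - 1 - k) for k, bit in enumerate(bits)), b] = 1
    return m

def level2(i, p, n):
    # p holds four (beta, gamma, delta) triples: one L pair before and one after a CNOT.
    (a1, ai), (b1, bi) = p
    pair_a = op_on(level1(*a1), 0, n) @ op_on(level1(*ai), i, n)
    pair_b = op_on(level1(*b1), 0, n) @ op_on(level1(*bi), i, n)
    c = cnot(0, i, n)
    return c @ pair_a @ c @ pair_b  # assumed ordering from figure 1(b)

def layer(params, n):
    # One layer B_n B_{n-1} ... B_2 (B_n leftmost, i.e. applied last).
    out = np.eye(2 ** n)
    for i in range(n - 1, 0, -1):
        out = out @ level2(i, params[i], n)
    return out

# Sanity check: with all angles zero every level-1 block is the identity, the
# two CNOTs in each level-2 block cancel, and one layer is the identity.
n = 3
zero = (0.0, 0.0, 0.0)
params = {i: [(zero, zero), (zero, zero)] for i in range(1, n)}
assert np.allclose(layer(params, n), np.eye(2 ** n))
```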

For an arbitrary quantum circuit U, we say that it can be represented by the variational circuit F if there exists a solution to the parameters (denoted collectively as θ ) in F such that F = U up to an irrelevant global phase. Now, we are ready to give the first theorem stating that for any $C\in \mathcal{C}$, we can use F to represent it:

Theorem 1. For any $C\in \mathcal{C}$, there exists a hypothesis function ${f}_{h}\in \mathcal{F}$, such that fh (x) = fC (x) for all $x\in \mathcal{X}$.

Proof. We first give a high-level intuition for the proof. We note that any gate acting on a constant number of qubits can be decomposed into a quantum circuit with a constant number of elementary gates, namely CNOT gates, H gates, and R_x(θ) gates. Thus, any $C\in \mathcal{C}$ can be decomposed into a quantum circuit with O(n^c) elementary gates. In addition, as our hypothesis quantum circuit F consists of Mn^c layers and every two layers can represent one arbitrary elementary gate acting on any pair of qubits through tuning the parameters properly, we can prove that C can be represented by F in an exact fashion.

The complete proof is as follows. First, we note that C consists of O(n^c) elementary gates by the definition of $\mathcal{C}$. By proposition 1 in appendix A, the circuit C can be written in the product form U_l U_{l−1}...U_1, where l = O(n^c), and each U_i is either CNOT_{1j} or a single-qubit unitary gate V_j on qubit j.

Denoting B_n B_{n−1}...B_2 as one layer of level-2 blocks, we define a block string of d layers as follows:

${\left({B}_{n}{B}_{n-1}\dots {B}_{2}\right)}^{d}.$
We define T as the minimal number so that a block string of T layers can represent U_l U_{l−1}...U_1. As our hypothesis circuit F contains Mn^c layers in total, we will show that to represent U_l U_{l−1}...U_1, the minimal number of layers needed in the block string is no greater than Mn^c. Then, when the first T layers in F have represented C exactly, by the first part of proposition 2 in appendix A, we can set the remaining (Mn^c − T) layers to be the identity gate on the n qubits. Therefore, we can show that F can represent C in an exact fashion and complete the proof of theorem 1.

To show that T ⩽ Mn^c, we will prove that for each U_i, i = 1, 2, ..., l, we only need two layers to represent U_i exactly. Then putting them together, we can prove that we need only 2l layers to represent U_l U_{l−1}...U_1, which yields T ⩽ 2l.

We fix any i ∈ {1, 2, ..., l}. By proposition 2, one layer B_n B_{n−1}...B_2 can represent any single-qubit unitary gate V_j acting on any qubit j = 1, 2, ..., n up to a global phase e^{iα}. Moreover, two layers ${({B}_{n}{B}_{n-1}\dots {B}_{2})}^{2}$ can represent CNOT_{1j} for any j = 2, 3, ..., n. We note that U_i is either a single-qubit gate V_j acting on some qubit j, or a two-qubit gate CNOT_{1j}. Therefore, we need at most two layers to represent U_i exactly.

As we have shown that T ⩽ 2l = O(n^c), by choosing a large enough constant M, we can prove that T ⩽ Mn^c and complete the proof of the theorem.□

Theorem 1 shows that given any quantum circuit $C\in \mathcal{C}$, there always exists a solution to the parameters such that the circuit F can simulate the quantum circuit C acting on the n qubits with zero error. Therefore, given any training set $S={({x}_{i},{y}_{i})}_{i=1,2,\dots ,m}$ sampled independently from some distribution P over $\mathcal{X}\times \mathcal{Y}$, if we have yi = fC(xi) for all i = 1, 2, ..., m, we can find an instance ${f}_{h}\in \mathcal{F}$ with zero training error. In fact, this theorem has a wider range of applications. When a quantum circuit consists of fewer than n^c gates, we can add some identity gates after it, so our theorem covers all the quantum circuits containing no more than n^c gates. In other words, we can use only O(n^{c+1}) gates, arranged in a uniform pattern, to represent all the circuits with n^c gates or fewer. We remark that the number of gates in many famous quantum circuits, such as the quantum support vector machine [14], the HHL algorithm [12], and the quantum Fourier transform [41], scales polynomially with the number of qubits. Therefore, all of these circuits can be represented exactly by our circuit F.

2.3. PAC learnability of $\mathcal{F}$

In the PAC setting, we usually assume that the input samples are randomly generated from a certain unknown probability distribution. As a result, when the hypothesis space covers the underlying distribution and the training dataset is large enough, both the training and generalization errors should be small. In this paper, our hypothesis space $\mathcal{F}$ has been proved in theorem 1 to be able to cover all the parametric quantum circuits in $\mathcal{C}$. Now, we study the sample complexity for training a circuit $F\in \mathcal{F}$ to represent $C\in \mathcal{C}$. To this end, we define a measure of the distance between two pure states in $\mathcal{Y}$, which is used as the loss function $\mathcal{L}:\mathcal{Y}\times \mathcal{Y}\to [0,1]$. Specifically, we define the loss function $\mathcal{L}({y}_{1},{y}_{2})$ to be the trace distance of two quantum states |ψ(y1)⟩ and |ψ(y2)⟩:

$\mathcal{L}({y}_{1},{y}_{2})=\frac{1}{2}{\left\Vert \vert \psi ({y}_{1})\rangle \langle \psi ({y}_{1})\vert -\vert \psi ({y}_{2})\rangle \langle \psi ({y}_{2})\vert \right\Vert }_{1},$

where ||⋅||1 denotes the trace norm of a matrix. Given a hypothesis function ${f}_{h}\in \mathcal{F}$ and the training set $S={({x}_{i},{y}_{i})}_{i=1,2,\dots ,m}$ sampled independently from some distribution P, we can define the empirical risk of fh, which is also known as the in-sample error:

$\hat{R}({f}_{h})=\frac{1}{m}\sum _{i=1}^{m}\mathcal{L}({y}_{i},{f}_{h}({x}_{i})).$

The risk of a hypothesis function ${f}_{h}\in \mathcal{F}$ is then defined as the average loss of fh over the probability distribution P:

$R({f}_{h})={\mathbb{E}}_{(x,y)\sim P}\left[\mathcal{L}(y,{f}_{h}(x))\right].$
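For pure states the trace distance reduces to $\sqrt{1-\vert \langle {\psi }_{1}\vert {\psi }_{2}\rangle \vert ^{2}}$ (see proposition 4 in appendix B), which makes the loss and the empirical risk easy to evaluate numerically. A minimal sketch with illustrative names only:

```python
import numpy as np

def loss(psi1, psi2):
    # Trace distance between two pure states: sqrt(1 - |<psi1|psi2>|^2).
    return np.sqrt(max(0.0, 1.0 - abs(np.vdot(psi1, psi2)) ** 2))

def empirical_risk(F, samples):
    # F: unitary matrix of the hypothesis circuit; samples: list of (x, y) state vectors.
    return np.mean([loss(y, F @ x) for x, y in samples])

# Toy check: a hypothesis identical to the target circuit C has zero empirical risk.
rng = np.random.default_rng(1)
C = np.linalg.qr(rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)))[0]
states = []
for _ in range(5):
    v = rng.normal(size=4) + 1j * rng.normal(size=4)
    states.append(v / np.linalg.norm(v))
samples = [(x, C @ x) for x in states]
assert np.isclose(empirical_risk(C, samples), 0.0)
```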

Our goal is to find a hypothesis ${f}_{h}\in \mathcal{F}$ that minimizes the risk R(fh). As the parametric quantum circuit C is a black box in our setting, we do not know the probability distribution P. In the learning process, we use the training set S and find an empirical risk minimizer $\hat{h}\in \mathcal{F}$ over S. For convenience, given the training set S and the probability distribution P, we define $\hat{h}=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace \hat{R}({h}^{\prime })$ to be the empirical risk minimizer, and $h=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace R({h}^{\prime })$ to be the risk minimizer. Now we formally introduce the definition of PAC learnability for completeness [42]:

Definition 1 (PAC learnability). A hypothesis space $\mathcal{F}$ is PAC learnable, if there exists a function $\nu :{(0,1)}^{2}\to \mathbb{N}$, such that for all (ε, δ) ∈ (0, 1)^2 and all probability measures P over $\mathcal{X}\times \mathcal{Y}$, when the size of the training set |S| ⩾ ν(ε, δ), we have

${\mathbb{P}}_{S}\left(R(\hat{h})-R(h)\leqslant {\epsilon}\right)\geqslant 1-\delta .$  (2)

Here ${\mathbb{P}}_{S}(A)$ denotes the probability that event A happens over repeated sampling of the training set S.

We note that when we randomly select a state $x\in \mathcal{X}$, input it into the circuit C, and get an output state $y={f}_{C}(x)\in \mathcal{Y}$, the resulting probability distribution P of state pairs (x, y) will satisfy R(h) = 0, because we can find an instance ${f}_{h}\in \mathcal{F}$ equal to fC by theorem 1. Therefore, if our $\mathcal{F}$ is PAC learnable, after we prepare the training samples S and get an empirical minimizer $\hat{h}$, with probability 1 − δ the average loss of $\hat{h}$ over P will be no larger than ε, i.e., $R(\hat{h})\leqslant {\epsilon}$. Now we are going to prove that $\mathcal{F}$ is PAC learnable, with a sample complexity ν polynomial in n, 1/ε, and $\mathrm{ln}\enspace \frac{1}{\delta }$.

Theorem 2. The hypothesis space $\mathcal{F}$ satisfies the PAC learnability, with sample complexity $\nu ({\epsilon},\delta )=O(\frac{1}{{{\epsilon}}^{2}}({n}^{c+1}\enspace \mathrm{ln}\enspace \frac{n}{{\epsilon}}+\mathrm{ln}\enspace \frac{1}{\delta }))$.

Proof. The essential idea for the proof relies on the discretization of $\mathcal{F}$. First, we construct a finite set of hypothesis functions ${\mathcal{F}}^{\prime }$, such that for each function ${f}_{h}\in \mathcal{F}$, we can find a function ${f}_{h}^{\prime }\in {\mathcal{F}}^{\prime }$ close enough to fh . Then we use lemma 2 in appendix B to show that ${\mathcal{F}}^{\prime }$ is PAC learnable. Finally, for any ${f}_{h}\in \mathcal{F}$ and its corresponding ${f}_{h}^{\prime }$, as fh and ${f}_{h}^{\prime }$ are close enough, we can prove that their risk and empirical risk are close as well. Therefore, we obtain that $\mathcal{F}$ is PAC learnable.

For clarity, we denote l as the total number of R_x(θ) gates in the circuit F, and observe that l ⩽ 12Mn^{c+1}. We recall that $\mathcal{F}$ is defined as the collection of all the functions ${f}_{h}:\mathcal{X}\to \mathcal{Y}$ that the circuit F can represent by tuning the value of the parameters θ = (θ1, θ2, ..., θl), where θi ∈ [0, 2π) is the variational parameter characterizing the ith x-rotation. Now we define a finite set ${\mathcal{F}}^{\prime }\subseteq \mathcal{F}$ in this way: ${\mathcal{F}}^{\prime }$ is the collection of all the functions ${f}_{h}^{\prime }:\mathcal{X}\to \mathcal{Y}$ that circuit F can represent by tuning the value of all the θi in {0, e, 2e, ..., Ne}, where $e=\frac{{\epsilon}}{6K{n}^{c+1}}$, $N=\lfloor \frac{2\pi }{e}\rfloor $, and K is a large enough constant. As there are l = O(n^{c+1}) rotational gates in circuit F in total, we have

$\vert {\mathcal{F}}^{\prime }\vert ={(N+1)}^{l}={\left\lceil \frac{12\pi K{n}^{c+1}}{{\epsilon}}\right\rceil }^{O({n}^{c+1})},$

which is finite. As a result, we can plug ${\mathcal{F}}^{\prime }$ and ε′ = ε/6 into lemma 2 to obtain that when $\vert S\vert \geqslant \frac{18}{{{\epsilon}}^{2}}(\mathrm{ln}\vert {\mathcal{F}}^{\prime }\vert +\mathrm{ln}\enspace \frac{2}{\delta })$,

${\mathbb{P}}_{S}\left(\forall \enspace {f}_{h}^{\prime }\in {\mathcal{F}}^{\prime }:\vert \hat{R}({f}_{h}^{\prime })-R({f}_{h}^{\prime })\vert \leqslant \frac{{\epsilon}}{6}\right)\geqslant 1-\delta .$  (3)
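The rounding map that sends a hypothesis in $\mathcal{F}$ to its discretized neighbor in ${\mathcal{F}}^{\prime }$ is straightforward; a small sketch (the value of K here is a placeholder, purely for illustration):

```python
import numpy as np

def discretize(thetas, eps, n, c, K=10.0):
    # Round every angle to the grid {0, e, 2e, ..., Ne} with e = eps / (6 K n^(c+1)).
    e = eps / (6 * K * n ** (c + 1))
    N = int(np.floor(2 * np.pi / e))
    return np.clip(np.round(np.asarray(thetas) / e), 0, N) * e

rng = np.random.default_rng(4)
thetas = rng.uniform(0, 2 * np.pi, size=5)
rounded = discretize(thetas, eps=0.1, n=4, c=1)

# Each rounded angle is within one grid step e of the original.
e = 0.1 / (6 * 10.0 * 4 ** 2)
assert np.max(np.abs(thetas - rounded)) <= e
```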

Fixing all the parameters θ in circuit F yields an arbitrary hypothesis function ${f}_{h}\in \mathcal{F}$. We can then round all the parameters θ of circuit F to their nearest multiples of e in {0, e, 2e, ..., Ne}, obtaining a new hypothesis function $\tilde{{f}_{h}}\in {\mathcal{F}}^{\prime }$. By proposition 5, we obtain that for any ${f}_{h}\in \mathcal{F}$,

$\vert R({f}_{h})-R(\tilde{{f}_{h}})\vert \leqslant \frac{{\epsilon}}{6},$  (4)

$\vert \hat{R}({f}_{h})-\hat{R}(\tilde{{f}_{h}})\vert \leqslant \frac{{\epsilon}}{6}.$  (5)

Combining the three inequalities (3)–(5) together, we arrive at

${\mathbb{P}}_{S}\left(\forall \enspace {f}_{h}\in \mathcal{F}:\vert \hat{R}({f}_{h})-R({f}_{h})\vert \leqslant \frac{{\epsilon}}{2}\right)\geqslant 1-\delta ,$  (6)

when $\vert S\vert \geqslant \frac{18}{{{\epsilon}}^{2}}(\mathrm{ln}\vert {\mathcal{F}}^{\prime }\vert +\mathrm{ln}\enspace \frac{2}{\delta })$. To prove that $\mathcal{F}$ is PAC learnable, we recall our notations that $h=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace R({h}^{\prime })$, and $\hat{h}=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace \hat{R}({h}^{\prime })$. Combining the inequality (6) and proposition 6, we obtain that

${\mathbb{P}}_{S}\left(R(\hat{h})-R(h)\leqslant {\epsilon}\right)\geqslant 1-\delta ,$

when $\vert S\vert \geqslant \frac{18}{{{\epsilon}}^{2}}(\mathrm{ln}\vert {\mathcal{F}}^{\prime }\vert +\mathrm{ln}\enspace \frac{2}{\delta })$. Plugging in $\vert {\mathcal{F}}^{\prime }\vert ={\lceil \frac{12\pi K{n}^{c+1}}{{\epsilon}}\rceil }^{O({n}^{c+1})}$, we can prove that the hypothesis space $\mathcal{F}$ is PAC learnable with sample complexity

$\nu ({\epsilon},\delta )=O\left(\frac{1}{{{\epsilon}}^{2}}\left({n}^{c+1}\enspace \mathrm{ln}\enspace \frac{n}{{\epsilon}}+\mathrm{ln}\enspace \frac{1}{\delta }\right)\right).$

This completes the proof of theorem 2.□
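To get a feel for the bound, the sketch below evaluates the explicit sample-size requirement $\vert S\vert \geqslant \frac{18}{{\epsilon}^{2}}(\mathrm{ln}\vert {\mathcal{F}}^{\prime }\vert +\mathrm{ln}\enspace \frac{2}{\delta })$ from the proof; the constants K and M are placeholders chosen purely for illustration:

```python
import math

def sample_bound(n, c, eps, delta, K=10.0, M=1.0):
    # Number of Rx gates in F (at most 12 M n^(c+1)) and grid points per angle.
    l = 12 * M * n ** (c + 1)
    grid = math.ceil(12 * math.pi * K * n ** (c + 1) / eps)
    ln_F_prime = l * math.log(grid)  # ln|F'|
    return math.ceil(18 / eps ** 2 * (ln_F_prime + math.log(2 / delta)))

# The bound grows polynomially in n and 1/eps, and only logarithmically in 1/delta.
print(sample_bound(n=10, c=1, eps=0.1, delta=0.01))
```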

We denote $\mathcal{E}$ as the collection of all the functions ${f}_{C}:\mathcal{X}\to \mathcal{Y}$, where $C\in \mathcal{C}$. In fact, we can prove that $\mathcal{E}$ is PAC learnable as well. By theorem 1, our hypothesis space $\mathcal{F}$ can cover all the quantum circuits in $\mathcal{C}$. Thus we can obtain that $\mathcal{E}\subseteq \mathcal{F}$. Using the inequality (6), we will arrive at

${\mathbb{P}}_{S}\left(\forall \enspace {f}_{C}\in \mathcal{E}:\vert \hat{R}({f}_{C})-R({f}_{C})\vert \leqslant \frac{{\epsilon}}{2}\right)\geqslant 1-\delta ,$  (7)

when $\vert S\vert \geqslant \frac{18}{{{\epsilon}}^{2}}(\mathrm{ln}\vert {\mathcal{F}}^{\prime }\vert +\mathrm{ln}\enspace \frac{2}{\delta })$. Following the same method as in theorem 2, we combine the inequality (7) with proposition 6, and we can prove that $\mathcal{E}$ is PAC learnable with sample complexity $\nu ({\epsilon},\delta )=O(\frac{1}{{{\epsilon}}^{2}}({n}^{c+1}\enspace \mathrm{ln}\enspace \frac{n}{{\epsilon}}+\mathrm{ln}\enspace \frac{1}{\delta }))$ as well.

We stress the differences between our results and the previous works [28, 33] in the literature. First, in reference [28] Chung and Lin focused on a finite set of discretized quantum channels, and their algorithm is based on random orthogonal measurements. In contrast, we focus on a set of unitary quantum circuits with continuous variational parameters, so the size of our concept class is infinite. Moreover, our proof is based on a family of variational quantum neural networks with an explicit uniform structure, which would be useful in practical applications. Second, in reference [33] Bu et al considered a more general class of quantum channels, and their bounds on the sample complexity grow exponentially with the number of qubits n. In contrast, our focus here is variational quantum circuits, and the sample complexity we obtained scales only polynomially with the system size. In other words, while reference [33]'s setting is more general, the sample complexity bounds obtained in this work are exponentially tighter. Our work and references [28, 33] are complementary to each other.

It is also worthwhile to clarify that, although we have proved that the sample complexity for learning any physical quantum circuit is low (namely, it only scales polynomially with the number of qubits involved), this does not mean that these circuits can be learned efficiently, since the time complexity to learn an unknown circuit can still be exponentially high. In fact, it has been proved recently that training a variational quantum circuit, even for logarithmically many qubits and free fermionic systems, is NP-hard [43]. This implies that although we know for sure that our hypothesis space $\mathcal{F}$ can cover all physical quantum circuits and only a polynomial number of samples are needed to train a variational circuit $F\in \mathcal{F}$, how to efficiently solve the optimization problem of minimizing the empirical risk remains unclear and might be an exponentially hard problem in practice.

3. Discussion

We mention that the family of hypothesis quantum variational circuits constructed in this paper is of independent interest due to its use of only O(n^{c+1}) variational parameters while maintaining notable representation power. These circuits might be used as variational ansatz for implementing quantum classifiers [23, 25, 44–48], variational quantum eigensolvers [49–52], or quantum generative adversarial networks [53–55], etc. On the other hand, we also remark that, similarly to many other variational quantum circuits constructed in the literature, this family of variational circuits may suffer from the barren plateau (i.e., vanishing gradient) problem [56, 57] as well. In addition, our construction is appealing in that the family of circuits is obtained without optimizing the structure or the number of parameters. In the future, it would be interesting to explore alternative structures with smaller depths and fewer parameters. Another interesting problem worth further investigation is the scenario where we do not have perfect knowledge about the training data, namely that the training dataset may not be fully labeled. How to extend our results to this scenario remains unknown.

We note that in our proof, the use of PAC learning theory is in fact independent of the learning model, i.e., it can deal with both classical and quantum objectives. In our setting, the objects to be learned are parametric quantum circuits, but we can still use standard classical techniques of PAC learning theory (such as discretization) to obtain the sample complexity bound.

In summary, we have proved that unitary physical quantum circuits are PAC learnable on a quantum computer via empirical risk minimization. In particular, we proved that to learn a unitary quantum circuit with at most n^c local gates, the sample complexity is bounded by $\tilde{O}({n}^{c+1})$. Our results are generally applicable to all unitary quantum circuits of practical interest. There are many notable quantum circuits (algorithms or kernels, such as Shor's factorization algorithm [39], the HHL algorithm [12], the quantum support vector machine [14], and quantum classification based on discrete logarithms [58]) that hold the intriguing potential of exponential quantum speedup. Our results imply that a polynomial number of samples are enough to learn these quantum circuits. In reference [59], Bang et al proposed a method for learning quantum algorithms assisted by machine learning, which shows a learning speedup in designing quantum circuits for solving the Deutsch–Jozsa problem; our results imply that the quantum circuits they used are PAC learnable as well.

Acknowledgments

We thank Wenjie Jiang, Peixin Shen, and Xun Gao in particular for their helpful discussions. This work is supported by the start-up fund from Tsinghua University (Grant No. 53330300320), the National Natural Science Foundation of China (Grant No. 12075128), and the Shanghai Qi Zhi Institute.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).

Appendix A.: The universality of $\mathcal{F}$

In this paper, all the constants such as b, c, K, and M are independent of n, ε, and δ. Also, we recall that $\mathcal{C}$ is the set of all the n-qubit quantum circuits with at most n^c unitary gates, with each gate acting on at most b qubits.

In proving theorem 1 in the main text, we used one lemma and two propositions, which are appended in the following. Lemma 1 is proved in reference [40] and recapped here for completeness. Propositions 1 and 2 are proved in this paper.

Lemma 1 ([40], section 4.5.2). An arbitrary unitary operation on b qubits can be implemented using a circuit containing at most c_0 b^2 4^b single-qubit unitary gates and CNOT gates, where c_0 is a constant.

Proposition 1. For any $C\in \mathcal{C}$, there exist l = O(n^c) unitary gates U_1, U_2, ..., U_l, such that C = U_l U_{l−1}...U_1, and each gate U_i is either a single-qubit unitary gate V_j acting on qubit j, or a CNOT_{1j} gate with the first qubit being the control qubit and the jth qubit being the target one.

Proof. We first prove that C can be decomposed into O(n^c) elementary gates, including CNOT gates and single-qubit unitary gates. By lemma 1, C can be implemented by at most c_0 b^2 4^b n^c = O(n^c) unitary gates, and each gate is either a single-qubit unitary gate or a CNOT_{ij} gate with control qubit i and target qubit j.

To prove that C can be decomposed as the product of l = O(n^c) single-qubit unitary gates and CNOT_{1j} gates, we need only prove that when i ≠ 1 and i ≠ j, CNOT_{ij} can be decomposed into CNOT_{1i}, CNOT_{1j}, and H gates.

When i > 1, j = 1, we can write CNOT_{i1} in this way:

${\text{CNOT}}_{i1}=({H}_{1}\otimes {H}_{i})\,{\text{CNOT}}_{1i}\,({H}_{1}\otimes {H}_{i}),$

where H_1 and H_i denote Hadamard gates on the first and ith qubits, respectively. Meanwhile, when i, j > 1 and i ≠ j, we can decompose CNOT_{ij} into CNOT_{1j} and CNOT_{i1} in this way:

${\text{CNOT}}_{ij}={\text{CNOT}}_{1j}\,{\text{CNOT}}_{i1}\,{\text{CNOT}}_{1j}\,{\text{CNOT}}_{i1},$

and we have shown that CNOT_{i1} can be decomposed into CNOT_{1i} and H gates.

As each decomposition uses only O(1) gates, we can obtain that C can be decomposed as the product of l = O(n^c) single-qubit unitary gates and CNOT_{1j} gates, and the proof is completed.□
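Both decompositions can be verified directly on three qubits; a minimal numerical sketch (helper names are illustrative):

```python
import numpy as np

def cnot(control, target, n):
    # CNOT as a permutation matrix on n qubits (qubit 0 plays the role of "qubit 1").
    dim = 2 ** n
    m = np.zeros((dim, dim))
    for b in range(dim):
        bits = [(b >> (n - 1 - k)) & 1 for k in range(n)]
        if bits[control]:
            bits[target] ^= 1
        m[sum(bit << (n - 1 - k) for k, bit in enumerate(bits)), b] = 1
    return m

def h_on(qubits, n):
    # Hadamard on the listed qubits, identity elsewhere.
    H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    out = np.array([[1.0]])
    for k in range(n):
        out = np.kron(out, H if k in qubits else np.eye(2))
    return out

n, i, j = 3, 1, 2  # 0-indexed: qubit 0 is "qubit 1" of the text

# CNOT_{i1} = (H_1 H_i) CNOT_{1i} (H_1 H_i)
assert np.allclose(cnot(i, 0, n),
                   h_on({0, i}, n) @ cnot(0, i, n) @ h_on({0, i}, n))

# CNOT_{ij} = CNOT_{1j} CNOT_{i1} CNOT_{1j} CNOT_{i1}
assert np.allclose(cnot(i, j, n),
                   cnot(0, j, n) @ cnot(i, 0, n) @ cnot(0, j, n) @ cnot(i, 0, n))
```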

In a level-2 block B_j, there are two level-1 blocks on qubit 1 and two on qubit j. Each level-1 block has three parameters β, γ, δ, and by the Z–X decomposition [40], we can tune these three parameters to enable a level-1 block L_j on qubit j to represent any single-qubit unitary gate acting on qubit j up to an irrelevant global phase. Also, by setting the three parameters to zero, a level-1 block L_j can represent the identity gate. We will prove that by tuning the parameters of the level-2 blocks, one layer B_n B_{n−1}...B_2 can represent any single-qubit unitary gate acting on any qubit j, and that CNOT_{1j} can be represented by two layers.

Proposition 2. (1) One layer B_n B_{n−1}...B_2 can represent any single-qubit unitary gate V_j acting on any qubit j up to an irrelevant global phase.

(2) Two layers ${({B}_{n}{B}_{n-1}\dots {B}_{2})}^{2}$ can represent CNOT_{1j} up to an irrelevant global phase.

Proof. To prove this proposition, we will set most of the level-2 blocks in the layers to be the identity gate and use at most two blocks to represent the gates we need.

We prove part one first. We separate the claim into two cases, j = 1 and j ≠ 1. When j = 1, we can let B_n B_{n−1}...B_3 represent the identity gate by tuning all their parameters to zero. For clarity, we denote L_j as a level-1 block acting on the jth qubit. Given any unitary gate V_1 on qubit 1, a level-1 block L_1 can represent V_1, and both level-1 blocks L_1 and L_2 can represent the identity gate. As a level-2 block B_2 consists of four level-1 blocks and two CNOT gates, we can tune the parameters of the four level-1 blocks in the following way so that B_2 can represent V_1:

${B}_{2}={\text{CNOT}}_{12}\,({I}_{1}\otimes {I}_{2})\,{\text{CNOT}}_{12}\,({V}_{1}\otimes {I}_{2})={V}_{1}\otimes {I}_{2},$

where the pair of level-1 blocks between the two CNOT gates is set to the identity, so that the two CNOT_{12} gates cancel (all up to the irrelevant global phase e^{iα} inherited from the level-1 representation). Similarly, when j ≠ 1, we can let B_n B_{n−1}...B_{j+1} and B_{j−1}...B_3 B_2 represent the identity gate. Then we need only let B_j represent the unitary gate V_j. Given any unitary gate V_j on qubit j, we can tune the parameters of the four level-1 blocks in the following way so that B_j can represent V_j:

${B}_{j}={\text{CNOT}}_{1j}\,({I}_{1}\otimes {I}_{j})\,{\text{CNOT}}_{1j}\,({I}_{1}\otimes {V}_{j})={I}_{1}\otimes {V}_{j}.$
Therefore, the proof of part one is completed. Now we will prove part two. We set all the parameters in the two layers ${({B}_{n}{B}_{n-1}\dots {B}_{2})}^{2}$ to be zero except those in the two B_j blocks. Then we will use the two B_j blocks to represent CNOT_{1j}. We decompose CNOT_{1j} up to an irrelevant global phase factor e^{−iπ/4} in the following way:

${\text{CNOT}}_{1j}={\mathrm{e}}^{-\mathrm{i}\pi /4}\left({W}_{4}\otimes {W}_{3}\right){\text{CNOT}}_{1j}\left({I}_{1}\otimes {W}_{2}\right){\text{CNOT}}_{1j}\left({I}_{1}\otimes {W}_{1}\right),$
where we set ${W}_{1}={R}_{z}\left(\frac{\pi }{2}\right),{W}_{2}={R}_{y}\left(\frac{\pi }{2}\right),{W}_{3}={R}_{z}\left(-\frac{\pi }{2}\right){R}_{y}\left(-\frac{\pi }{2}\right)$, and ${W}_{4}={R}_{z}\left(-\frac{\pi }{2}\right)$. Here we denote Rz (θ) = e−iθZ/2 and Ry (θ) = e−iθY/2 as the rotation operators along the z-axis and y-axis on the Bloch sphere, respectively. In addition, W4 and the identity gate I1 act on the first qubit, and W1, W2, and W3 act on the jth qubit.

Hence, we use the B_j block in the first layer to represent ${\text{CNOT}}_{1j}\left({I}_{1}\otimes {W}_{2}\right){\text{CNOT}}_{1j}\left({I}_{1}\otimes {W}_{1}\right)$ in this way:

${B}_{j}={\text{CNOT}}_{1j}\,({I}_{1}\otimes {W}_{2})\,{\text{CNOT}}_{1j}\,({I}_{1}\otimes {W}_{1}),$

where the level-1 blocks on the first qubit are tuned to the identity, and the two level-1 blocks on the jth qubit are tuned to represent W_2 and W_1, respectively.
Finally, we use the second level-2 block B_j to represent W_4 ⊗ W_3, where W_4 acts on the first qubit and W_3 acts on the jth qubit. Therefore, two layers ${({B}_{n}{B}_{n-1}\dots {B}_{2})}^{2}$ can represent CNOT_{1j} up to an irrelevant global phase, and this completes the proof of part two.□
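The decomposition of CNOT_{1j} with the stated W_1, ..., W_4 can be checked numerically; a small sketch (assuming the standard conventions R_z(t) = e^{−itZ/2} and R_y(t) = e^{−itY/2}):

```python
import numpy as np

I2 = np.eye(2)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]])

def Rz(t):
    return np.cos(t / 2) * I2 - 1j * np.sin(t / 2) * Z

def Ry(t):
    return np.cos(t / 2) * I2 - 1j * np.sin(t / 2) * Y

CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

W1 = Rz(np.pi / 2)
W2 = Ry(np.pi / 2)
W3 = Rz(-np.pi / 2) @ Ry(-np.pi / 2)
W4 = Rz(-np.pi / 2)

# (W4 x W3) CNOT (I x W2) CNOT (I x W1) equals CNOT up to the phase e^{i pi/4}.
rhs = np.kron(W4, W3) @ CNOT @ np.kron(I2, W2) @ CNOT @ np.kron(I2, W1)
assert np.allclose(np.exp(-1j * np.pi / 4) * rhs, CNOT)
```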

Appendix B.: PAC learnability of $\mathcal{F}$

The following lemma shows that any finite hypothesis space ${\mathcal{F}}^{\prime }$ is PAC learnable.

Lemma 2 ([42], corollary 1.2). Assume that the hypothesis space ${\mathcal{F}}^{\prime }$ is finite, δ ∈ (0, 1], ε > 0, and the range of the loss function is in an interval of length c ⩾ 0. Then if the size of the training set $\vert S\vert \geqslant \frac{{c}^{2}}{2{{\epsilon}}^{2}}\left(\mathrm{ln}\vert {\mathcal{F}}^{\prime }\vert +\mathrm{ln}\enspace \frac{2}{\delta }\right)$, the event $\forall \enspace {f}_{h}\in {\mathcal{F}}^{\prime }:\vert \hat{R}({f}_{h})-R({f}_{h})\vert \leqslant {\epsilon}$ holds with probability at least 1 − δ over repeated sampling of the training set S.

Our circuit F consists of R_x(θ), H, and CNOT gates. By assigning two different sets of values to the variational parameters θ, we get two distinct circuits F_1 and F_2, and their corresponding hypothesis functions f_1 and f_2 are different. We note that although F_1 and F_2 differ in the values of their variational parameters θ, the ordering of their gates (R_x(θ), H, and CNOT gates) is the same. We will show that when all the variational parameters in circuits F_1 and F_2 are close enough, the risk and empirical risk of f_1 and f_2 will be close. To prove this, we first define the distance between two unitary matrices U_1, ${U}_{2}\in {\mathbb{C}}^{{2}^{n}\times {2}^{n}}$ as the 2-norm (spectral norm) of the matrix U_1 − U_2:

$E({U}_{1},{U}_{2})={\left\Vert {U}_{1}-{U}_{2}\right\Vert }_{2}.$
Now we introduce the following proposition about the function E(U_1, U_2).

Proposition 3. The function E(U1, U2) satisfies the following properties:

  • (a)  
    Let U, V be the R_x(θ), R_x(θ + ε) gates acting on the jth qubit, respectively, where ε ∈ (0, 1), θ ∈ [0, 2π), j = 1, 2, ..., n. Then E(U, V) ⩽ ε.
  • (b)  
    $E\left({U}_{l}{U}_{l-1}\dots {U}_{1},{V}_{l}{V}_{l-1}\dots {V}_{1}\right)\leqslant \sum _{j=1}^{l}E\left({U}_{j},{V}_{j}\right)$, where U1, U2, ..., Ul , V1, V2, ..., Vl are unitary matrices.

Proof. The second property is shown in [40], section 4.5.3. We need only prove the first property. By the unitary invariance of the spectral norm,

$E(U,V)={\left\Vert {R}_{x}(\theta )-{R}_{x}(\theta +{\epsilon})\right\Vert }_{2}={\left\Vert I-{R}_{x}({\epsilon})\right\Vert }_{2}={\left\Vert I-{\mathrm{e}}^{-\mathrm{i}{\epsilon}X/2}\right\Vert }_{2}\overset{(\mathrm{i})}{\leqslant }\sum _{k=1}^{\infty }\frac{{({\epsilon}/2)}^{k}}{k!}{\left\Vert X\right\Vert }_{2}^{k}={\mathrm{e}}^{{\epsilon}/2}-1\overset{(\mathrm{ii})}{\leqslant }{\epsilon},$

where (i) uses Taylor's expansion of the operator R_x(ε) = e^{−iεX/2}, and (ii) uses Taylor's expansion of exp(ε/2) and that ||X||_2 = ||I||_2 = 1.□
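A quick numerical sanity check of property (a) is straightforward; the exact distance is 2 sin(ε/4), which indeed stays below ε:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]])

def Rx(t):
    return np.cos(t / 2) * np.eye(2) - 1j * np.sin(t / 2) * X

rng = np.random.default_rng(2)
for _ in range(1000):
    theta = rng.uniform(0, 2 * np.pi)
    eps = rng.uniform(0, 1)
    dist = np.linalg.norm(Rx(theta) - Rx(theta + eps), ord=2)  # spectral norm
    assert dist <= eps + 1e-12
    assert np.isclose(dist, 2 * np.sin(eps / 4))  # the exact value
```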

We recall that $\mathcal{L}({y}_{1},{y}_{2})$ is the trace distance of two pure states |ψ(y1)⟩ and |ψ(y2)⟩. Then we introduce the following properties of $\mathcal{L}({y}_{1},{y}_{2})$.

Proposition 4. The function $\mathcal{L}:\mathcal{Y}\times \mathcal{Y}\to [0,1]$ satisfies the following two properties:

  • (a)  
    For any ${y}_{1},{y}_{2},{y}_{3}\in \mathcal{Y}$, we have $\mathcal{L}({y}_{1},{y}_{3})-\mathcal{L}({y}_{2},{y}_{3})\leqslant \mathcal{L}({y}_{1},{y}_{2})$.
  • (b)  
    For any ${y}_{1},{y}_{2}\in \mathcal{Y}$, we have $\mathcal{L}({y}_{1},{y}_{2})\leqslant {\Vert}\vert \psi ({y}_{1})\rangle -\vert \psi ({y}_{2})\rangle {{\Vert}}_{2}$.

Proof. The first part of this proposition is the triangle inequality, which is proved in [40], section 9.2.1. Here, we only prove the second property. We denote $F(\vert \psi ({y}_{1})\rangle ,\vert \psi ({y}_{2})\rangle )=\left\vert \langle \psi ({y}_{1})\vert \psi ({y}_{2})\rangle \right\vert $ as the fidelity between the two states |ψ(y1)⟩ and |ψ(y2)⟩. Then we will arrive at

$\mathcal{L}({y}_{1},{y}_{2})\overset{(\mathrm{iii})}{=}\sqrt{1-F{\left(\vert \psi ({y}_{1})\rangle ,\vert \psi ({y}_{2})\rangle \right)}^{2}}=\sqrt{1-{\left\vert \langle \psi ({y}_{1})\vert \psi ({y}_{2})\rangle \right\vert }^{2}},$

where the proof of equation (iii) is given in [40], section 9.2.3.

In addition, we note that for any complex number $z\in \mathbb{C}$ and its complex conjugate ${z}^{\ast }\in \mathbb{C}$, as (|z| − 1)^2 ⩾ 0, we have 2 − 2|z| ⩾ 1 − |z|^2. Hence, since z + z* ⩽ 2|z|, we get 2 − z − z* ⩾ 2 − 2|z| ⩾ 1 − |z|^2. Letting z = ⟨ψ(y1)|ψ(y2)⟩, we obtain that

$\mathcal{L}({y}_{1},{y}_{2})=\sqrt{1-{\vert z\vert }^{2}}\leqslant \sqrt{2-z-{z}^{\ast }}={\left\Vert \vert \psi ({y}_{1})\rangle -\vert \psi ({y}_{2})\rangle \right\Vert }_{2}.$

This completes the proof.□
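Property (b) can also be checked on random states; a short sketch:

```python
import numpy as np

rng = np.random.default_rng(3)

def rand_state(d=8):
    v = rng.normal(size=d) + 1j * rng.normal(size=d)
    return v / np.linalg.norm(v)

for _ in range(1000):
    psi1, psi2 = rand_state(), rand_state()
    trace_dist = np.sqrt(1 - abs(np.vdot(psi1, psi2)) ** 2)  # pure-state trace distance
    assert trace_dist <= np.linalg.norm(psi1 - psi2) + 1e-12
```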

Now we will use the properties of E(U_1, U_2) and $\mathcal{L}({y}_{1},{y}_{2})$ to show that the differences of both the risk and the empirical risk between f_1 and f_2 are bounded by ε, where the hypothesis functions f_1 and f_2 correspond to the variational circuits F_1 and F_2, respectively.

Proposition 5. We denote ${\boldsymbol{\theta }}^{{F}_{1}}=({\theta }_{1}^{{F}_{1}},{\theta }_{2}^{{F}_{1}},\dots ,{\theta }_{l}^{{F}_{1}})$ as a vector containing all the variational parameters in F1, where l is the number of Rx (θ) gates in circuit F, and ${\theta }_{i}^{{F}_{1}}$ is the value of the variational parameter characterizing the ith x-rotation of F1. Similarly, we denote ${\boldsymbol{\theta }}^{{F}_{2}}=({\theta }_{1}^{{F}_{2}},{\theta }_{2}^{{F}_{2}},\dots ,{\theta }_{l}^{{F}_{2}})$ as a vector containing all the variational parameters in F2.

Let ${f}_{1},{f}_{2}\in \mathcal{F}$ be the corresponding hypothesis functions of F_1, F_2, respectively. Then given any probability distribution P over $\mathcal{X}\times \mathcal{Y}$ and training set $S={({x}_{i},{y}_{i})}_{i=1,2,\dots ,m}$, the following two inequalities hold if ${\Vert}{\boldsymbol{\theta }}^{{F}_{1}}-{\boldsymbol{\theta }}^{{F}_{2}}{{\Vert}}_{\infty }\leqslant \frac{{\epsilon}}{K{n}^{c+1}}$ (K is a large enough constant):

$\vert R({f}_{1})-R({f}_{2})\vert \leqslant {\epsilon},\qquad \vert \hat{R}({f}_{1})-\hat{R}({f}_{2})\vert \leqslant {\epsilon}.$
Proof. First, we will prove that E(F1, F2) ⩽ epsilon when ${\Vert}{\boldsymbol{\theta }}^{{F}_{1}}-{\boldsymbol{\theta }}^{{F}_{2}}{{\Vert}}_{\infty }\leqslant \frac{{\epsilon}}{K{n}^{c+1}}$. Then we will use it to show the risk and empirical risk of f1 and f2 are close.

As F is composed of H gates, Rx (θ) gates and CNOT gates, we can write F1 = Ul Ul−1...U1 and F2 = Vl Vl−1...V1, where Ui is the ith gate in F1, and Vi is the ith gate in F2. As Ui and Vi are of the same type of gates, we can prove that $E({U}_{i},{V}_{i})\leqslant \frac{{\epsilon}}{K{n}^{c+1}}$ by separating different cases on the types of Ui and Vi :

Case I: If Ui and Vi are both H gates or both CNOT gates, as there is no variational parameter in H or CNOT, we have Ui = Vi , and we obtain that E(Ui , Vi ) = 0.

Case II: If Ui and Vi are both Rx (θ) gates, as the difference of ${\theta }_{i}^{{F}_{1}}$ and ${\theta }_{i}^{{F}_{2}}$ is at most $\frac{{\epsilon}}{K{n}^{c+1}}$, by the first property of proposition 3, we have $E({U}_{i},{V}_{i})\leqslant \frac{{\epsilon}}{K{n}^{c+1}}$.

We note that l = O(n^{c+1}) by our construction of F. By the second property of proposition 3 and choosing a large enough constant K such that l ⩽ Kn^{c+1}, we can get

$E({F}_{1},{F}_{2})\leqslant \sum _{i=1}^{l}E({U}_{i},{V}_{i})\leqslant l\cdot \frac{{\epsilon}}{K{n}^{c+1}}\leqslant {\epsilon}.$
Now we can bound the differences of the risk and the empirical risk between the two hypothesis functions f_1 and f_2, respectively. For convenience, we define D(f_1, f_2) as $\underset{x\in \mathcal{X},y\in \mathcal{Y}}{\mathrm{sup}}\left\vert \mathcal{L}(y,{f}_{1}(x))-\mathcal{L}(y,{f}_{2}(x))\right\vert $. We observe that both |R(f_1) − R(f_2)| and $\vert \hat{R}({f}_{1})-\hat{R}({f}_{2})\vert $ can be bounded by D(f_1, f_2). Hence, we will prove that D(f_1, f_2) ⩽ ε, and we can obtain the two inequalities |R(f_1) − R(f_2)| ⩽ ε and $\vert \hat{R}({f}_{1})-\hat{R}({f}_{2})\vert \leqslant {\epsilon}$. Indeed,

$D({f}_{1},{f}_{2})\overset{(\mathrm{iv})}{\leqslant }\underset{x\in \mathcal{X}}{\mathrm{sup}}\enspace \mathcal{L}({f}_{1}(x),{f}_{2}(x))\overset{(\mathrm{v})}{\leqslant }\underset{x\in \mathcal{X}}{\mathrm{sup}}{\left\Vert ({F}_{1}-{F}_{2})\vert \psi (x)\rangle \right\Vert }_{2}\leqslant E({F}_{1},{F}_{2})\leqslant {\epsilon},$

where (iv) uses the first property of function $\mathcal{L}$ in proposition 4, and (v) uses the second property of function $\mathcal{L}$ in proposition 4. This completes the proof of proposition 5.□
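The chain of bounds in proposition 5 can be exercised numerically on a toy circuit; the sketch below (a single-qubit product of x-rotations, purely for illustration) checks that a small parameter perturbation keeps both the operator distance and the loss gap small:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]])

def Rx(t):
    return np.cos(t / 2) * np.eye(2) - 1j * np.sin(t / 2) * X

rng = np.random.default_rng(5)
theta = rng.uniform(0, 2 * np.pi, size=4)
delta = rng.uniform(-1e-3, 1e-3, size=4)

# Toy "circuits": products of four x-rotations with slightly different angles.
F1 = np.eye(2, dtype=complex)
F2 = np.eye(2, dtype=complex)
for t, d in zip(theta, delta):
    F1 = F1 @ Rx(t)
    F2 = F2 @ Rx(t + d)

E = np.linalg.norm(F1 - F2, ord=2)
assert E <= np.abs(delta).sum() + 1e-12  # propositions 3(a) and 3(b)

v = rng.normal(size=2) + 1j * rng.normal(size=2)
v /= np.linalg.norm(v)
y1, y2 = F1 @ v, F2 @ v
loss_gap = np.sqrt(max(0.0, 1 - abs(np.vdot(y1, y2)) ** 2))  # L(f1(x), f2(x))
assert loss_gap <= np.linalg.norm(y1 - y2) + 1e-12  # proposition 4(b)
assert np.linalg.norm(y1 - y2) <= E + 1e-12         # spectral-norm bound
```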

We note that in our proof of theorem 2, we used lemma 2 and proposition 5 to show that $\forall \enspace {f}_{h}\in \mathcal{F}:\left\vert \hat{R}\left({f}_{h}\right)-R\left({f}_{h}\right)\right\vert \leqslant \frac{{\epsilon}}{2}$ holds with probability 1 − δ. To prove that $\mathcal{F}$ is PAC learnable, we introduce the following technical lemma.

Proposition 6. Assume $\forall \enspace {f}_{h}\in \mathcal{F}:\left\vert \hat{R}\left({f}_{h}\right)-R\left({f}_{h}\right)\right\vert \leqslant \frac{{\epsilon}}{2}$ holds. We denote $h=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace R({h}^{\prime })$, and $\hat{h}=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace \hat{R}({h}^{\prime })$. Then we have

$R(\hat{h})-R(h)\leqslant {\epsilon}.$
Proof. The proof of this inequality is given in [42], section 1.2. We give the proof here for completeness. To bound $R(\hat{h})-R(h)$, we observe that it can be expressed as the sum of $R(\hat{h})-\hat{R}(\hat{h})$ and $\hat{R}(\hat{h})-R(h)$. Then we can use $\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{sup}}\left\vert \hat{R}({h}^{\prime })-R({h}^{\prime })\right\vert $ to bound $R(\hat{h})-\hat{R}(\hat{h})$ and $\hat{R}(\hat{h})-R(h)$, respectively. Concretely,

$0\overset{(\mathrm{vi})}{\leqslant }R(\hat{h})-R(h)=\left[R(\hat{h})-\hat{R}(\hat{h})\right]+\left[\hat{R}(\hat{h})-R(h)\right]\overset{(\mathrm{vii})}{\leqslant }\left[R(\hat{h})-\hat{R}(\hat{h})\right]+\left[\hat{R}(h)-R(h)\right]\leqslant 2\,\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{sup}}\left\vert \hat{R}({h}^{\prime })-R({h}^{\prime })\right\vert \leqslant {\epsilon},$

where (vi) uses that $h=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace R({h}^{\prime })$, and (vii) uses that $\hat{h}=\mathrm{arg}\underset{{h}^{\prime }\in \mathcal{F}}{\mathrm{min}}\enspace \hat{R}({h}^{\prime })$.□
