Paper

Learning cell for superconducting neural networks


Published 27 November 2020 © 2020 IOP Publishing Ltd
Citation: Andrey Schegolev et al 2021 Supercond. Sci. Technol. 34 015006. DOI: 10.1088/1361-6668/abc569


Abstract

An energy-efficient adiabatic learning neuro cell is proposed. The cell can be used for on-chip learning of adiabatic superconducting artificial neural networks. The static and dynamic characteristics of the proposed learning cell have been investigated. Optimization of the learning cell parameters was performed within simulations of the multi-layer neural network supervised learning with the resilient propagation method.


1. Introduction

Artificial neural networks (ANNs) have already taken a worthy place in almost all areas of modern research, from neurophysiology and biology to cybernetics and physics. ANNs are able to solve diverse tasks such as image recognition, classification, and optimization problems.

The emergence of ever more complex challenges forces the science and engineering community to develop specialized neuromorphic processors in which the elemental base and architecture are designed to execute neural network computing [1–7]. Among the various options for implementing such computing systems, superconducting neural networks (SNNs) attract special attention due to their energy efficiency and high clock frequencies [8–23]. An artificial neuron in a superconducting network can be implemented using just one or two Josephson junctions. This is an order of magnitude smaller than the corresponding number of transistors in a CMOS implementation. The solution of a model problem (letter recognition) was demonstrated by means of a simple three-layer SNN with given weights [10]. Attempts to combine the capabilities of optics, featuring high fan-out and short delays, with superconducting electronics for computations in ANN design seem very promising [24–27]. The development of this direction is hampered by problems with SNN learning, and in this article we propose an approach to eliminate this difficulty.

An ANN has two modes of operation: learning and data recognition. The learning process is much more complicated than recognition, especially in hardware implementations of ANNs, because of the need to compute the derivative of the activation function and to modify the synapse weights.

The learning procedure for SNNs has not been sufficiently developed yet. In general, the learning processes of a hardware implementation of an ANN can be divided into three types: off-chip learning, chip-in-the-loop learning, and on-chip learning [6]. Currently, for CMOS/MOFOS ANNs the first or second type is most often used, since these are much easier to implement [28, 29]. However, in the case of SNNs the last method, on-chip learning, may be preferred. Alternatives involving data exchange between cryogenic and room-temperature electronics turn out to be extremely energy-consuming and slow, thus negating the advantages of superconducting electronics.

In the analysis of the SNN training procedure, we developed ideas presented in our previous works [16–20], where physical concepts of a superconducting neuron and synapse were proposed. In our investigations, we seek a complete hardware implementation of the SNN in order to take full advantage of the benefits provided by superconductivity.

In this article, we propose a learning circuit architecture that implements feedback for the correction of synapse weights. Furthermore, we suggest a superconducting adiabatic learning cell for a one-shot calculation of the derivative of a nonlinear activation function. We have studied its static and dynamic characteristics in the frame of the resistively shunted junction (RSJ) model, and then optimized the parameters of the proposed learning cell (LC) in simulations of the learning process.

2. Learning circuit architecture

A typical ANN transforms its input (the features of a recognizable object) into an output that represents the recognition result [30]. A simple example of this transformation is shown in figure 1 in the green rectangle inset. Here (x1, x2) is a column vector of features corresponding to a recognizable object. Each neuron I1, I2 (input neurons) and O1 (output neuron) performs a nonlinear transformation of its input described by the so-called activation function [29]. It is commonly chosen to be the sigmoid function f(x) = 1/(1 + exp(−x)). The input neurons I1 and I2 transfer x1 and x2 in the forms f(x1) and f(x2), respectively, to the output neuron O1 through synapses with weights w1 and w2. The signals from the input neurons are summed, xin = f(x1) · w1 + f(x2) · w2, and transferred to the output neuron, which gives f(xin) at the output. Hence the output can be expressed as a function O1 = O1(x1, x2, w1, w2). This means that all of the ANN's influence on the recognition result is contained in the synapse weights w1 and w2. Thus the aim of ANN learning is to make the output O1 correct via modification of w1 and w2 [31].
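
As a minimal illustration of this forward pass (not part of the original circuit description; the function and variable names are ours), the transformation O1(x1, x2, w1, w2) can be written in a few lines:

```python
import math

def sigmoid(x):
    """Activation function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(x1, x2, w1, w2):
    """Forward pass of the two-input, one-output network from figure 1."""
    # Input neurons I1, I2 apply the activation function to the raw features.
    a1, a2 = sigmoid(x1), sigmoid(x2)
    # Synaptic weights scale the signals, which are summed at the output neuron O1.
    x_in = a1 * w1 + a2 * w2
    # The output neuron applies the activation function to the weighted sum.
    return sigmoid(x_in)

# The recognition result depends only on the features and the weights w1, w2.
print(forward(0.5, -1.0, w1=0.8, w2=-0.3))
```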

Figure 1. The learning scheme for a simple SNN, whose elements are presented in the inset in the upper right corner. Circles correspond to the main elements and arrows correspond to data transfers between elements. The signs '+' and '×' indicate the operations of addition and multiplication, respectively. Boxes 1 and 2 in the scheme indicate a digital comparison circuit and a specialized unit implementing the RPROP weight-correction method, respectively.


The loss function of a multilayer ANN, $E\left( \vec w \right)$, can be written as

$E\left( \vec w \right) = \frac{1}{2}\sum\limits_{i = 1}^{N} \left( O_i^{\,d} - O_i \right)^2,$

where N is the number of neurons in the output layer, i is the neuron index, $O_i^{\,d}$ is the ideal output of the ith neuron, $O_i$ is the actual output of the ith neuron, and $\vec w$ is a vector containing all the synapse weights in the ANN. The loss function can thus be interpreted as half the squared Euclidean distance between the desired and actual outputs of the ANN. It is clear that the loss function depends on the synapse weights $\vec w$, so the goal of the learning procedure is to obtain the weights ${\vec w_{{\text{opt}}}}$ that minimize it, $E\left( {{{\vec w}_{{\text{opt}}}}} \right) = {E_{{\text{min}}}}$. The most practical way to find ${\vec w_{{\text{opt}}}}$ is gradient descent in the space of synapse weights. One of the most commonly used and efficient methods of neural network learning is the backpropagation algorithm [31], which implements gradient descent on the loss function by exploiting the chain rule. The backpropagation algorithm allows one to modify the synapse weights between each pair of neurons, requiring only the ANN outputs, the desired outputs $O_i^{\,d}$, and the form of the neurons' activation functions, f, and their derivatives, f'.
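
For the toy network in figure 1 the chain rule takes a particularly compact form; the following short derivation (our notation, with δ = O1 − O1d as introduced below) anticipates the quantities used in steps B and C:

```latex
% Gradient of E = (1/2)(O_1 - O_1^d)^2 with respect to the weight w_j (j = 1, 2),
% where O_1 = f(x_in) and x_in = f(x_1) w_1 + f(x_2) w_2.
\frac{\partial E}{\partial w_j}
  = \underbrace{\left(O_1 - O_1^{d}\right)}_{\delta}\,
    f^{\prime}(x_{\mathrm{in}})\,
    \frac{\partial x_{\mathrm{in}}}{\partial w_j}
  = \delta\, f^{\prime}(x_{\mathrm{in}})\, f(x_j).
```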

The schematic describing the classic backpropagation learning [31] of the ANN is depicted in the main part of figure 1, where the circles correspond to the main elements and the arrows correspond to data transfer between elements. For convenience, three stages of the process are highlighted: step A, step B, and step C. Oid is the ideal (desired) output for each recognizable class of objects, provided during the learning process so that the loss function (1/2)(Oid − O1)² can be calculated. For correctness of the learning process, a parameter epsilon is introduced: if the error δ = f(xin) − Oid is smaller in magnitude than epsilon, the learning process stops; otherwise it continues, see 'Step A' in figure 1. In the next step the learning cell should calculate the derivative of the O1 neuron's activation function, f'(xin), and multiply it by δ, see 'Step B' in figure 1. Further, in order to obtain the weight modifications for w1 and w2, one needs the following quantities: f'(xin) × δ × f(x1) and f'(xin) × δ × f(x2) ('Step C').
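
A minimal software sketch of steps A–C for this toy network is given below (our own illustration; in the hardware implementation the derivative f'(xin) would be produced by the learning cell rather than computed numerically):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def backprop_step(x1, x2, w1, w2, O_id, epsilon=1e-3):
    """One pass through steps A-C of figure 1 for the two-weight network."""
    a1, a2 = sigmoid(x1), sigmoid(x2)
    x_in = a1 * w1 + a2 * w2

    # Step A: compare the actual output with the desired one.
    delta = sigmoid(x_in) - O_id
    if abs(delta) < epsilon:
        return None                      # learning stops for this sample

    # Step B: derivative of the output neuron's activation function times delta.
    f_prime = sigmoid(x_in) * (1.0 - sigmoid(x_in))
    common = f_prime * delta

    # Step C: quantities used to modify the weights w1 and w2.
    return common * a1, common * a2      # f'(x_in)*delta*f(x1), f'(x_in)*delta*f(x2)
```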

The weight modification rule is chosen to be RPROP [32, 33] because of its proven performance in a variety of tasks. There are two parameters of the RPROP method: η+ and η−. We suppose that for the RPROP realisation the weights (from the current and previous iterations) and η± should be kept in a superconducting digital 'storage' for processing and updating the weight values. If one prefers classic gradient descent, the conjugate gradient method, or the Levenberg–Marquardt algorithm, the only difference lies in the weight modification rule; all the rest of the logic remains unaltered, so the proposed approach is not limited to the RPROP algorithm.
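
As an illustration of the weight modification rule, a sketch of one commonly used RPROP variant with η+ = 1.2 and η− = 0.5 is shown below (a simplified formulation without weight backtracking; the names, bounds, and initial step sizes are our assumptions, not values fixed by [32, 33]):

```python
import numpy as np

ETA_PLUS, ETA_MINUS = 1.2, 0.5        # increasing / decreasing coefficients
STEP_MIN, STEP_MAX = 1e-6, 50.0       # usual bounds on the per-weight step size

def rprop_update(w, grad, grad_prev, step):
    """One RPROP step for a flat array of synapse weights w."""
    sign_change = grad * grad_prev
    # Same gradient sign: grow the individual step; opposite sign: shrink it.
    step = np.where(sign_change > 0, np.minimum(step * ETA_PLUS, STEP_MAX), step)
    step = np.where(sign_change < 0, np.maximum(step * ETA_MINUS, STEP_MIN), step)
    # After a sign change the gradient is suppressed for this update.
    grad = np.where(sign_change < 0, 0.0, grad)
    # Each weight moves against the sign of its own gradient by its own step.
    return w - np.sign(grad) * step, grad, step
```

In a hardware realisation the arrays w, grad, and step would correspond to the superconducting digital 'storage' mentioned above.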

There is a well-known trick in ANNs: the derivative of the activation function can be expressed analytically through the activation function itself. This trick removes the need to calculate the derivative by iterative or numerical methods, which leads to a drastic improvement in learning speed.

The derivative of the sigmoid (logistic) function, $f\left( x \right) = {\left( {1 + \exp \left( { - x} \right)} \right)^{ - 1}}$, is easily expressed via the activation function itself: $f^{\prime}\left( x \right) = f\left( x \right)\left( {1 - f\left( x \right)} \right)$. This is straightforward to implement in software, but until now there has been no way to calculate the derivative of the activation function in superconducting hardware, especially in an analog implementation.
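
For completeness, the identity follows in a few steps from the definition of the logistic function:

```latex
f^{\prime}(x) = \frac{\mathrm{d}}{\mathrm{d}x}\left(1 + e^{-x}\right)^{-1}
              = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}}
              = \frac{1}{1 + e^{-x}}\left(1 - \frac{1}{1 + e^{-x}}\right)
              = f(x)\bigl(1 - f(x)\bigr).
```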

An element based on Josephson junctions that converts the input magnetic flux into an output current according to a sigmoidal law has already been proposed [16]. The goal of this paper is therefore to propose an element that produces the derivative of this function.

3. Learning cell

In figure 2(a) we present the concept of the learning cell, whose transfer function approximates the derivative of the sigmoid activation function implemented by the Sigma-cell [9]. The input flux, Φin, penetrates the cell through the inductances l; the output flux, Φout, is induced by the current in the inductance lout. The considered circuit contains Josephson junctions (JJs) acting as nonlinear inductances, which provide the desired bell-shaped transfer characteristic, Φout(Φin). An additional tiny flux shift, Φshift, is required to determine the direction of the output flux, which is zero in the absence of Φshift due to the symmetry of the circuit. Hereinafter, the fluxes are normalized to the reduced magnetic flux quantum, Φ0/2π, and the main processes are described in the Josephson phase (ϕ) space. We also use the RSJ model for the currents through the overdamped Josephson heterostructures:

Figure 2. Schematic view (a) and equivalent circuit (b) of the proposed learning cell. To produce the proposed nanostructure, three thin superconducting layers are needed (layers 1, 2, and 3), separated by dielectric spacers.


$I_{1,2} = I_{\text{C}} \sin \phi_{1,2} + \frac{\Phi_0}{2\pi R}\,\dot \phi_{1,2},$

where IC1 = IC2 = IC are their critical currents, R1 = R2 = R are the normal resistances, and ϕ1,2 are the Josephson phases.
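
To make the normalization explicit, a minimal sketch of a single overdamped junction in the RSJ model is given below (our own illustration, not a simulation of the full two-junction cell); in units of IC and of the characteristic time tC = Φ0/(2πICR), the equation above reduces to dϕ/dτ = i − sin ϕ:

```python
import numpy as np

def rsj_phase(i_bias, tau_max=50.0, dtau=0.01):
    """Euler integration of one overdamped junction: d(phi)/d(tau) = i - sin(phi).

    The bias current i is in units of I_C and the time tau is in units of
    t_C = Phi_0 / (2*pi*I_C*R).
    """
    n = int(tau_max / dtau)
    phi = np.zeros(n)
    for k in range(1, n):
        phi[k] = phi[k - 1] + dtau * (i_bias - np.sin(phi[k - 1]))
    return phi

# Below the critical current (i < 1) the phase settles to arcsin(i),
# i.e. the junction stays in the superconducting (non-generating) state.
print(rsj_phase(0.5)[-1], np.arcsin(0.5))
```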

It is convenient to analyze the characteristics of a circuit containing a pair of JJs in terms of the sum, θ, and difference, ψ, of their Josephson phases. Equations describing the dynamics of the learning cell are as follows:

$\dot \psi = \frac{\varphi_{\text{in}}/2 + \psi}{l} - \sin \psi \cos \theta$, $\qquad \varphi_{\text{out}} = \frac{2 l_{\text{out}}}{l + 2 l_{\text{out}}}\left( \theta - \varphi_{\text{shift}} \right)$.

The input and output inductances, L and Lout, are normalized to the characteristic value determined by the critical current of the Josephson junctions, Φ0/2πIC. Illustrations of the time-resolved processes occurring in the learning cell are shown in figure 3. One can see that the junctions in the cell do not switch to the Josephson generation mode.
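
As a quick consistency check (our own arithmetic, using the IC ≈ 0.1 mA estimate quoted in section 5), the normalization constant and the dimensional inductances corresponding to l = 0.25 and lout = 0.16 lie on the picohenry scale:

```python
import math

PHI_0 = 2.067833848e-15          # magnetic flux quantum, Wb
I_C = 0.1e-3                     # critical current estimate from section 5, A

L_UNIT = PHI_0 / (2 * math.pi * I_C)      # normalization inductance Phi_0 / (2*pi*I_C)
print(L_UNIT * 1e12)                      # ~3.3 pH
print(0.25 * L_UNIT * 1e12, 0.16 * L_UNIT * 1e12)   # L ~ 0.82 pH, L_out ~ 0.53 pH
```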

Figure 3. Illustrations of the processes occurring in the learning cell: phases, currents and voltages at Josephson junctions versus time under the action of a magnetic field pulse. Normalized inductances are as follows: l = 0.25, lout = 0.16; ϕshift = 0.1π.


We should note that a Sigma-cell with a sigmoidal activation function can be obtained from the LC by replacing one of the JJs with a conventional 'linear' inductance, 1 + l.

Modern superconducting technology allows one to create structures schematically depicted in figure 2(a) with sufficiently small planar dimensions. The area of the Josephson junction itself is about 0.2 μm². The remaining parts of the cell (the resistive shunts of the Josephson junctions, inductors, and transformers) occupy more than 90% of the area on a chip [34]. Thus, the entire learning cell can be inscribed in a square of 10 × 10 microns.

To analyze the ability of the LC to approximate the derivative of the sigmoid activation function, we have calculated a family of transfer characteristics, ϕout(ϕin), see figures 4(a) and (b). The expected bell-shaped dependence was obtained for a fairly wide range of parameters. Figure 4(a) shows especially clearly that the transfer functions of both the Sigma-cell [16–19] and the proposed LC are well approximated by the sigmoid function and its derivative, respectively. An additional constant flux in the input signal line allows one to set the desired position of the center of the activation function and its derivative.

Figure 4. (a) The activation function of the Sigma-cell [16–19], transfer functions of the LC, and their approximations by the sigmoid function and its derivative. (b) Family of transfer characteristics of the learning cell for different values of lout. (c) Range of parameters (l, lout) for the standard deviations at ϕshift = 0.1π.


Figure 4(b) demonstrates the flexibility in selecting the derivative of the activation function with different slope coefficients. A quantitative measure, the standard deviation (SD) of the obtained dependence from the derivative of the sigmoid function, is presented in figure 4(c) as a function of the two most important parameters of the LC, (l, lout). The standard deviation has been calculated in the form

$\text{SD} = \sqrt{\frac{1}{N}\sum\limits_{k = 1}^{N} \left( \varphi_{\text{out}}\left( \varphi_{\text{in},k} \right) - f^{\,\prime}\left( \varphi_{\text{in},k} \right) \right)^2},$

where the sum runs over the N sampled values of the input flux and f′ denotes the sigmoid-derivative approximation.

It is seen that the best approximation, with the minimal value of the standard deviation, corresponds to the LC parameters lout = 0.16 and l = 0.25. It should be noted that, before the calculation of the standard deviation, all characteristics were shifted to the zero level of the ϕout axis for better comparison with the ideal derivative of the activation function mentioned above.
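
A sketch of how such a deviation can be evaluated for a simulated transfer curve is shown below (our own illustration; the array names and the fitted amplitude, slope, and center of the sigmoid derivative are assumptions):

```python
import numpy as np

def sigmoid_derivative(x, a, k, x0):
    """Scaled derivative of the logistic function, a * f'(k*(x - x0))."""
    f = 1.0 / (1.0 + np.exp(-k * (x - x0)))
    return a * f * (1.0 - f)

def transfer_sd(phi_in, phi_out, a, k, x0):
    """Standard deviation of the LC transfer curve from the fitted sigmoid derivative.

    phi_out is assumed to be already shifted to the zero level of the
    phi_out axis, as described in the text.
    """
    model = sigmoid_derivative(np.asarray(phi_in), a, k, x0)
    return np.sqrt(np.mean((np.asarray(phi_out) - model) ** 2))
```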

4. Learning processes

Although the LC's transfer function closely follows the mathematical derivative of the sigmoid function, the correspondence is not exact, so it is important to check how the obtained characteristics affect the operation and learning of the neural network. The applicability of the obtained characteristics is analyzed in this part of the article.

In this study we simulated the learning process of a multi-layer SNN using the proposed activation function derivative, in order to assess the stability and performance of a large neural network structure that utilizes our LCs in every neuron except those of the input layer.

Unlike the mathematical forms of the activation function and its derivative, which are defined on the whole number line, the transfer characteristics of the superconducting neural cells are defined on a limited range due to the periodic behavior of Josephson circuits. This necessitates input data normalization and weight regularization [32] during the learning procedure, as sketched below.
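
A sketch of the kind of preprocessing implied here is given below (our own illustration; the target window and the regularization coefficient are assumptions chosen only to keep signals within the cells' limited operating range):

```python
import numpy as np

def normalize_features(X, lo=-2.0, hi=2.0):
    """Min-max scaling of each input feature into the window [lo, hi]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return lo + (X - x_min) * (hi - lo) / (x_max - x_min)

def regularized_gradient(grad, w, lam=1e-4):
    """Add an L2 weight-decay term that discourages weights from drifting
    outside the range where the transfer characteristics are well defined."""
    return grad + lam * w
```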

In our SNN model, we used supervised learning with the backpropagation method, modified to achieve higher learning rates by Riedmiller and Braun [32] (the resilient propagation method, RPROP). The values of the correction coefficients were chosen in accordance with the recommendations of [32, 33]: η− = 0.5 (decreasing coefficient) and η+ = 1.2 (increasing coefficient). The network learned to solve a classical task, recognition of Fisher's Iris data set [35]. The network had three layers, with four neurons in the input layer, nine neurons in the hidden layer, and three output neurons. In accordance with the model, learning cells were attached to the three output and nine hidden neurons. Below are the simulation results for the learning processes with different parameters of the superconducting cells. Each learning simulation took 5000 epochs. In figure 5 we show the learning curves for different parameters of the superconducting cells over the first 1100 epochs, compared with the learning process in a 'software-based' network (mathematical curves). Each of these curves has been averaged over 30 runs with different random initial weights for each synapse. As one can see, the suggested SNN has shown fairly good results, reducing the recognition error to almost zero and, more importantly, demonstrating stable behavior.
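
For reference, the simulation setup described in this paragraph can be summarized as follows (a paraphrase in our own notation; `train_once` is a hypothetical routine standing in for one full 5000-epoch training run):

```python
import numpy as np

CONFIG = {
    "layers": (4, 9, 3),       # input / hidden / output neurons
    "learning_cells": 9 + 3,   # LCs attached to the hidden and output neurons
    "rule": "RPROP",
    "eta_plus": 1.2,
    "eta_minus": 0.5,
    "epochs": 5000,
    "runs": 30,                # learning curves averaged over 30 random initializations
    "dataset": "Fisher's Iris",
}

def averaged_learning_curve(train_once, config=CONFIG):
    """Average the per-epoch loss over several runs with random initial weights."""
    curves = [train_once(config) for _ in range(config["runs"])]
    return np.mean(curves, axis=0)
```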

Figure 5. Loss functions for the learning process in the SNN with the following LC parameters: l = 0.25 and lout = 0.16 for different values of ϕshift (a), and ϕshift = 0.1π for different l, lout (b). The insets show parts of the learning curves.


Thus, a change in the magnitude of the auxiliary flux by 50% compared with its optimal value does not preclude effective recognition of the sample patterns. On the basis of the simulation results we have also found the area of the LC's physical parameters for which the functioning of the SNN is possible (the green circle in figure 4(c) around the line of red points).

The analysis of the obtained data allows us to conclude that the form of the activation function derivative has a significant influence on the neural network's learning ability and on the dynamics of its learning. The closer the superconducting learning cell's transfer function is to the exact mathematical derivative, the better the SNN's learning performance.

5. Dynamics of the learning process

In order to estimate the energy efficiency and performance of the considered neural network, we have to analyse the dynamics of the learning cell. We paid special attention to the case where the cell parameters allow efficient learning of the SNN. Since the main idea is to create a full-fledged separate neural network unit with on-chip learning, we should know the behavior of the 'dynamic' transfer function of the learning cell. Figure 6(a) shows such dynamic transfer functions for different parameters and rise/fall times (tRF) of the applied signal. It is seen that as the duration of one epoch of the learning process decreases, the shape of this curve may become distorted. The calculation results in figure 6(b) show that, for the parameters used in the simulation of learning, these dynamic distortions can be neglected.

Figure 6. Dynamic transfer characteristics of the learning cell at lout = 0.46, l = 0.1 (a) and lout = 0.16, l = 0.25 (b) for different rise/fall times of the applied signal.


For the learning cell shown in figure 2(a), one can propose the following estimates for the critical current and characteristic voltage: IC ≈ 0.1 mA, VC ≈ 0.7 mV. This gives the following values for the characteristic time of the Josephson processes, tC, and for the LC dynamics within one epoch of learning: tC ≈ 0.5 ps and tRF ≈ 50...5000 ps. Consequently, a single learning operation in the LC takes about 0.2...5 ns. The number of epochs required depends on the recognition problem being solved. For the classic Fisher's Iris data set [35], the number of learning epochs required to obtain 99% correctly recognized patterns is about 300.
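
The quoted characteristic time follows directly from these estimates (our arithmetic, using tC = Φ0/(2πVC) with VC = ICR):

```python
import math

PHI_0 = 2.067833848e-15   # magnetic flux quantum, Wb
V_C = 0.7e-3              # characteristic voltage estimate, V

t_C = PHI_0 / (2 * math.pi * V_C)   # characteristic Josephson time, s
print(t_C * 1e12)                   # ~0.47 ps, i.e. t_C ~ 0.5 ps as quoted above
```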

We also calculated the rate of change of the Josephson junctions' phases in the frame of the RSJ model. This provided information on the energy dissipation in the LC per operation (DpO). The calculation results as a function of the cell parameters and the characteristic duration of the operation are presented in figures 7(a) and (b). It is seen that, for the LC parameters used in the simulation, adiabatic operating modes are possible, in which the energy dissipation per operation decreases exponentially with increasing duration. In this energy-efficient adiabatic mode, the DpO also does not depend significantly on the cell inductances. It can be seen from the above estimates that, in the adiabatic regime, the energy dissipation in the cell during one epoch lies in the range from 4 zJ to 60 zJ.
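
A sketch of how the dissipation per operation can be accumulated from the simulated phase rates is shown below (our own illustration within the RSJ model; the phase-rate arrays, time step, and shunt resistance are assumed inputs from such a simulation):

```python
import numpy as np

PHI_0 = 2.067833848e-15   # magnetic flux quantum, Wb

def dissipated_energy(phi_dot_1, phi_dot_2, dt, R):
    """Energy dissipated in the two resistive shunts during one operation.

    phi_dot_1, phi_dot_2: Josephson phase rates (rad/s) on a uniform time grid
    with step dt (s); R: shunt resistance (Ohm). Each junction voltage is
    V = (Phi_0 / (2*pi)) * d(phi)/dt, and the dissipated energy is the time
    integral of (V1^2 + V2^2) / R.
    """
    v1 = (PHI_0 / (2 * np.pi)) * np.asarray(phi_dot_1)
    v2 = (PHI_0 / (2 * np.pi)) * np.asarray(phi_dot_2)
    return np.trapz((v1 ** 2 + v2 ** 2) / R, dx=dt)
```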

Figure 7. Energy dissipation versus rise/fall time at lout = 0.46 and l = 0.1 (a), and energy dissipation versus lout (b). The inset presents the input signal and the dissipation dynamics. For the structure in figure 2(a) we take IC ≈ 0.1 mA and VC ≈ 0.7 mV.


6. Conclusion

In this paper, we considered the architecture of a circuit providing on-chip learning of an adiabatic superconducting ANN. Since learning is a power- and time-consuming procedure, it is important to carefully optimize the physical implementation of the learning steps.

One of the most complicated operations in a cycle of synapse weight updates [36–39] is the computation of the derivative of a neuron's activation function. Here we proposed an adiabatic superconducting cell allowing a one-shot calculation of the activation function derivative. The cell contains just two Josephson junctions, and its planar dimensions are bounded from above by a square of 10 × 10 microns. The proposed learning cell operates within one epoch in about 1 ns with an energy dissipation of about 10 zJ. The presented concept of adiabatic on-chip learning promises to be the most compact and energy-efficient solution for ANNs of the considered type. The development of this concept will allow us to implement fast and energy-efficient superconducting neuromorphic processors with intrinsic learning ability.

Acknowledgments

This work was supported in part by Grant No. 18-72-10118 of the Russian Science Foundation (the learning cell topology), by the Council of the President of the Russian Federation for State Support of Young Scientists and Leading Scientific Schools (MD-186.2020.8, software technique for modeling the learning process), and by RFBR, project number 19-37-90020 (dynamic process analysis). AS also acknowledges the stipend of the Foundation for the Development of Theoretical Physics and Mathematics 'BASIS'.
