1 Introduction

Electricity is crucial to build the social economy, where its consumption and generation can affect the overall policy of one country [1]. In the past decade, thanks to the development of smart grid and intelligent meters, the power systems became more intellectualized and robust. Accordingly, the efficiency of energy utilization got rise and tactics of power scheduling was more reasonable. However, due to the increase in electricity consumption and the rise in electricity prices, the electricity charge brought considerable economic burden to the populace and enterprises [2]. Thereby, the problem of electricity stealing [3] became more severe which may lead to a great economic loss for one country. To solve this issue, amounts of research on effective detection methods have been studied and, among them, two types of ideas are popular, i.e., to monitor the physical characteristics of meters, or to analyze features of electricity data.

For the first type thought, the data of the smart meters are tampered illegally and the detection methods are supposed to find the abnormal data. According to types of electrical parameters, the electricity stealing methods can be divided into five categories [4], i.e., under-voltage stealing, under-current stealing, phase shift stealing, spread stealing, no-meter stealing. For the first three kinds of methods, the common approach is to improve the structure of intelligent meters whose threshold value of each electrical parameter is set, e.g., [5, 6]. As for the latter two methods, the common approach is to monitor the changes of electricity consumptions with networks, e.g., [7, 8]. However, because the electricity data are varied and redundant, the mentioned approaches are hard to achieve synchronous transmission and efficient management in energy systems, and meanwhile, their ability of abnormal data detection is limited.

To handle these issues and improve the capability of detection, the second type thought attracts wide attention. The core idea of these methods is that the electricity data are collected in energy systems and its features are analyzed with various algorithms. In the beginning, the energy systems mainly calculate the line loss rate (LLR) [9] to determine whether users steal electricity. However, because of the massive data, this method brings heavy workload to energy systems, which makes detection efficiency become low. To handle this issue, some new techniques, e.g., grey model (GM) [10, 11] and neural network (NN) [12, 13], are adopted to predict LLR automatically, but the detection models are still limited by the downside of LLR. It means that only long-term electricity stealing can be identified, while the short-term electricity stealing cannot be detected. Subsequently, various studies concern more about the features of electricity data rather than LLR. Specifically, paper [14, 15] utilizes support vector machines (SVM) to seek for the hyperplanes which can distinguish ordinary users and electricity stealing users. Paper [16, 17] adopts long and short-term memory (LSTM) network to study the time characteristics of electricity data and achieves anti-electricity theft load. Paper [18, 19] leverages convolutional neural network (CNN) to deeply learn the features of electricity data, and the electricity stealing users can be identified accurately with these features. Although the mentioned methods can recognize electricity stealing users with kinds of features gotten from training process, when the total amount of stolen electricity is small or its change does not have strong regularity on time, the performance of these methods drops dramatically.

From the above description, we know that kinds of detection models have been proposed and the ability to detect electricity stealing has been improved. However, for the small scale of electricity stealing or the stealing behaviors is not strongly related to the time, the existing methods do not work well on the electricity stealing detection. Thus, in this paper, we put forward a timing shift-based bi-residual network (TS-BiResNet) to consider the features of electricity data on three aspects, i.e., shallow features, deep features and time factor. The main contribution of this paper is fourfold as follows:

  • Owing to the fact that electricity stealing data are difficult to acquire, we first analyze the distribution of a real electricity dataset without any electricity stealing behaviors. Then, we generate specific electricity stealing data based on the distribution of a real electricity dataset and the principle of electricity stealing. After preprocessing, the obtained hybrid dataset, composed of real electricity data and specific electricity stealing data, can be used for the training and testing of our detection model.

  • In order to consider the deep features of electricity stealing data, we redesign the residual network (ResNet) detection model for the hourly electricity data within one week. The model leverages shortcut connections to preserve shallow features so that the issue about vanishing gradient can be avoided during deep learning process and this model is fit for the detection of small-scale electricity stealing.

  • To make the model applicable to the electricity stealing users with varying temporal correlation, we propose timing shift (TS) preprocessing and improve ResNet to a two-layer structure, i.e., Bi-ResNet. For the former, the electricity data in different periods can be selected by the same convolution kernel. From which, the time factor is no longer restricted by the time interval. For the latter, the convolution kernels of different sizes are used in the same layer, so that Bi-ResNet model can extract time features with different durations.

  • To further analyze the performance of TS-BiResNet model, we perform numerical evaluations via simulations. The simulation results show that the accuracy of our proposed model is higher than that of existing work. Meanwhile, this model can also detect the electricity stealing users whose scale of stolen electricity is small or time periods of electricity stealing are not regular.

The rest of this paper is organized as follows. In Sect. 2, we first analyze the distribution of real electricity dataset based on hourly electricity data and time interval and then generate specific electricity stealing data. In Sect. 3, the Bi-ResNet model is put forward and its input data are preprocessed by the TS preprocessing in Sect. 4. The performance of our proposed model is verified by simulation results in Sect. 5. Finally, we draw conclusions in Sect. 6.

2 Problem formulation: electricity stealing data

It is known that electricity data are often confidential due to the state secrecy provisions [20], which means that the real data are hard to be obtained by the populace. For this case, we take the open-source dataset provided by National Renewable Energy Laboratory [21] as the original data S, Denoted as D1. This dataset contains the hourly electricity data of 936 users, i.e., S = \(\{{{\varvec{s}}_1}, \dots ,{{\varvec{s}}_{936}}\}\). For each user \(k,\forall k \in {\{1,\dots ,936\}}\), it records the electricity data within 1 year, i.e., \({\varvec{s}}_{k}=\{{{\varvec{s}}_{k,1}}, \dots ,{{\varvec{s}}_{k,365}}\}\), and for any day \(d,\forall d \in {\{1,\dots ,365\}}\), there are 24 hourly electricity data within it, i.e., \({{\varvec{s}}_{k,d}}=\{{s_{d,1},\dots ,s_{d,24}}\}\). However, in this dataset, no electricity stealing users are available there. To make the data suitable for the detection models, we first analyze the real electricity dataset and then provide a universal method to generate electricity stealing data. Thus, we randomly select the hourly electricity data of 400 users in half a year, i.e., S\(_{\mathrm{selected}} \in {\mathbb {C}}^{{168 \times 26 \times 400 }}\). From which, 168 is the quantity of hourly electricity data within 1 week, 26 denotes the amounts of weeks within half a year, and 400 presents the number of all selected users. For convenience, we abbreviate S\(_{\mathrm{selected}}\) as S. The statistical results of S are shown as Figs. 1 and 2.

Figure 1 presents the statistics of hourly electricity data for all users within half a year. It indicates that, although the range of hourly electricity data is wide, i.e., from 0.4 to 6, its values are mainly concentrated between 0.4 and 2.3, especially around 1. It means that when the amount of stolen electricity account for a tiny proportion of the electricity consumption during period of time, e.g., 10% or 20%, the value of electricity detected by energy systems is still within the normal range and it does not slump rapidly. For this case, the stealing users are hard to be detected so that deep features are desired.

Fig. 1
figure 1

The statistics of hourly electricity data

Besides, we further analyze the distribution of hourly electricity data on the basis of time period. Figure 2 shows that the electricity behaviors of all users are correlated with time. Specifically, in the wee hours, i.e., from 10:00 p.m. to 6:00 a.m., the distribution of electricity consumption is concentrated and its value is relatively small. For this case, the range of electricity stealing is limited and behaviors are easy to be detected, where the traditional detection methods and neural network-based methods can detect the electricity stealing users successfully. In the morning and at noon, i.e., from 6:00 a.m. to 4:00 p.m., the amount of electricity consumption is rising and the variance becomes large. For this case, electricity stealing is relatively easy to be detected but some traditional detection methods will fail. As for the last case, i.e., from 4:00 p.m. to 10:00 p.m., the electricity consumption of all users reaches its peak, and its distribution is the most scattered. At this moment, the electricity stealing is the most difficult to be found. Herein, we focus on the latter two cases, i.e., electricity stealing behavior is relatively difficult to be detected.

Fig. 2
figure 2

The distribution of hourly electricity data

Based on above analysis, we randomly preprocess 150 users out of selected 400 users as the electricity stealing users. From which, the original data are divided into two parts, i.e., \({\mathbf{S }} = \{ {{\mathbf{S} }_{\mathrm{real}}},{{\mathbf{S }}_{\mathrm{stl}}}\}\) , where \({{\mathbf{S }}_{\mathrm{real}}} \in {\mathbb {C}}^{{168 \times 26 \times 250}}\) is the real electricity data and \({{\mathbf{S }}_{\mathrm{stl}}} \in {\mathbb {C}}^{{168 \times 26 \times 150}}\) is the specific electricity stealing data. According to the [4], the electricity stealing methods can be equivalent to the product of electricity consumption and a stealing weighting factor. Thus, we define the weighting factor as electricity stealing coefficient (ESC) and stipulate that the value of ESC ranges from 0.2 to 1, i.e., \({\mathrm{ESC}}\in (0.2,1)\). Meanwhile, we further define the duration of electricity stealing as electricity stealing period (ESP). Considering the possibility of stealing electricity shown in Fig. 2, the ESP is expressed as

$$\begin{aligned} \varvec{ESP}=r(\{6,22\}), \end{aligned}$$
(1)

where \(r(\cdot )\) is the random function which returns an array. Herein, the electricity stealing function is designed as

$$\begin{aligned} f({s_{k,i}}) = \left\{ {\begin{array}{l} {{\mathrm{ESC }}\times {s_{k,i}},(182 - \left\lfloor {i/7} \right\rfloor \times 7) \in ESP}\\ {{s_{k,i}},(182 - \left\lfloor {i/7} \right\rfloor \times 7) \notin ESP} \end{array}} \right. , \end{aligned}$$
(2)

where \(\left\lfloor * \right\rfloor\) is the round down operation. Herein, we can get the hybrid dataset \({\bar{\mathbf{S }}} = \{ {{\mathbf{S }}_{\mathrm{real}}},{{\bar{\mathbf{S }}}_{\mathrm{stl}}}\}\) which is composed of real electricity data \({{\mathbf{S }}_{\mathrm{real}}}\) and specific electricity stealing data \({\bar{\mathbf{S }}}_{\mathrm{stl}}\), as shown in Table 1.

Table 1 Hybrid dataset composed of real electricity data and specific electricity stealing data

Notice that, electricity stealing users with different ESC come from the same users in original dataset. For each experiment, only ordinary users and electricity stealing users with one kind of ESC are selected. This operation can reflect how the ESC and ESP affect the accuracy of electricity stealing detection.

3 Method: Bi-ResNet model

After getting the hybrid dataset composed of real electricity data and specific electricity stealing data, various detection models can be adopted [9,10,11,12,13,14,15,16,17,18,19]. However, due to two features of \({\bar{\mathbf{S }}}\), i.e., small scale of electricity stealing and random coherence with time, the performance of these models is not optimal and deep features are needed. In the process of seeking for deep features, the value of electricity consumption data is small and its change trend is usually gentle. For this case, if we take neural networks with multiple layers, the vanishing gradient [22] will affect the performance of detection models. Aiming at this issue, ResNet model [23] attracts wide attention. It adds some shortcut connections to preserve shallow features so that the problem of vanishing gradient can be solved. However, the performance of ResNet is still not optimal, because every time when the convolution kernels extract features of electricity data, the obtained information is limited. Therefore, in this paper, we introduce feature-map attention [24] and then propose Bi-ResNet.

3.1 Unit block of Bi-ResNet

The unit block of Bi-ResNet consists of two kinds of basic components, i.e., convolution modules and shortcut connections, and two kinds of logical operations, i.e., copy and vector addition. Its structure is shown in Fig. 3. For convenient analysis, we further divide the unit block into three steps.

Fig. 3
figure 3

The unit block of Bi-ResNet

For the step 1, the obtained \({\bar{\mathbf{S }}}\) is first mapped from three dimension to two dimensions for the adaption of the input. For the user k on the d-th day, the input data \({\bar{\mathbf{S }}}_{k,d}\) are changed from the \({{\varvec{{s}}}_{k,d}}\) as follows

$$\begin{aligned} {\bar{\mathbf{S }}}_{k,d} = \left[ {\begin{array}{*{20}{c}} {{s_{d,1}}}&{}\quad {{s_{d,2}}}&{}\quad \cdots &{}\quad {{s_{d,167}}}&{}\quad {{s_{d,168}}}\\ {{s_{d,2}}}&{}\quad {{s_{d,3}}}&{}\quad \cdots &{}\quad {{s_{d,168}}}&{}\quad {{s_{d,1}}}\\ \vdots &{}\quad \vdots &{}\quad \ddots &{}\quad \vdots &{}\quad \vdots \\ {{s_{d,167}}}&{}\quad {{s_{d,168}}}&{}\quad \cdots &{}\quad {{s_{d,165}}}&{}\quad {{s_{d,166}}}\\ {{s_{d,168}}}&{}\quad {{s_{d,1}}}&{}\quad \cdots &{}\quad {{s_{d,166}}}&{}\quad {{s_{d,167}}} \end{array}} \right] \end{aligned}$$
(3)

When \({\bar{\mathbf{S }}}_{k,d}\) is put into the network, the copy function \({\mathrm{cp}}(\cdot )\) is taken, which duplicates the input data into two same copies

$$\begin{aligned} \{ {\mathbf{X }}_1,{\mathbf{X }}_2\} = {\mathrm{cp}}({\bar{\mathbf{S }}_{k,d}}). \end{aligned}$$
(4)

Then, two convolution modules with different size of convolution kernels are taken. Specifically, each convolution module function \(i, \forall i \in \{1,2\}\) consists of three parts, i.e., convolution function \({\mathrm{conv}}( \cdot )\), batch normalization \({\mathrm{BN}}( \cdot )\) and rectified linear unit function \({\mathrm{relu}}( \cdot )\). We use \({H}_{i,{\mathrm{size}}}(\cdot )\) function to represent the i-th convolution model with size \(\times\) size convolution kernel, thereby the i-th pseudo-output \({\mathbf{Y}}_{i}\) is calculated by

$$\begin{aligned} {\mathbf{Y }}_{i} = {H_{i,{\mathrm{size}}}}({\mathbf{X }}_i). \end{aligned}$$
(5)

Notice that, when two convolution kernels with different sizes are used for the same input data, the features can be extracted more thoroughly, that is essential for deep features.

For the step 2, we fuse the same feature under different scopes, i.e., convolution modules with \(3 \times 3\) convolution kernel and \(5\times 5\) convolution kernel and then use one convolution module with \(1\times 1\) convolution kernel to extract the deep feature based on the fused feature. Specifically, we first duplicate the pseudo-output \({{\mathbf{Y }}_{i}}, i \in \{1,2\}\) as

$$\begin{aligned} \{ {\mathbf{X }}_{i\times 2+1},{\mathbf{X }}_{(i+1)\times 2}\} = {\mathrm{cp}}({{\mathbf{Y }}_{i}}). \end{aligned}$$
(6)

After duplicating, \({\mathbf{X }}_{i\times 2+1}\) are connected by a shortcut connection that can maintain the shallow feature. As for \({\mathbf{X }}_{(i+1)\times 2}\), they are fused by the \({\mathrm{va}}(\cdot )\) function as

$$\begin{aligned} {\mathrm{va}}({\mathbf{X}} _{{4}},{\mathbf{X}} _6) = {1 /2} \times {\mathbf{X}} _{{4}} + {1 /2} \times {\mathbf{X}} _{{6}}, \end{aligned}$$
(7)

and the deep feature is extracted by the 3-rd convolution model with \(1 \times 1\) convolution kernel. Accordingly, the 3-rd pseudo-output \({{\mathbf{Y }}_{3}}\) is calculated by

$$\begin{aligned} {{\mathbf{Y }}_{3}} = {H_{3,1}}({\mathrm{va}}({\mathbf{X}} _{{4}},{\mathbf{X}} _6)). \end{aligned}$$
(8)

To make the pseudo-output contain the deep feature and the shallow feature, we duplicate the \({{\mathbf{Y }}_{3}}\) as

$$\begin{aligned} \{ {\mathbf{X }}_{7},{\mathbf{X }}_{8}\} = {\mathrm{cp}}({{\mathbf{Y }}_{3}}), \end{aligned}$$
(9)

and fuse them with the \({\mathbf{X} }_{i\times 2+1},i \in \{1,2\}\) which come from shortcut connections, calculated by

$$\begin{aligned} {\mathrm{va}}({\mathbf{X}} _{i\times 2+1},{\mathbf{X}} _{i\times 3+1}) = {1 /2} \times {\mathbf{X}} _{i\times 2+1} + {1 /2} \times {\mathbf{X}} _{i\times 3+1}. \end{aligned}$$
(10)

Herein, we get two set of excellent pseudo-output with hybrid features. One is the feature that maintains the shallow feature gotten by \(3 \times 3\) convolution kernel and meanwhile contains more attentive and deep feature gotten by \(5\times 5\) convolution kernel. On the contrary, the other one maintains the shallow feature gotten by \(5\times 5\) convolution kernel and contains more attentive and deep feature gotten by \(3\times 3\) convolution kernel.

For the step 3, the features of pseudo-output \({\mathrm{va}}({{\mathbf{X }}_{{{i}} \times {{2}} + {{1}}}},\) \({{\mathbf{X }}_{{{i}} \times {{3}} + {{1}}}}),i \in \{ 1,2\}\) are extracted by two convolution modules with \(5\times 5\) convolution kernel and \(3\times 3\) convolution kernel. The pseudo-output \({{\mathbf{Y }}_{i+3}}, i \in \{1,2\}\) is calculated by

$$\begin{aligned} {\mathbf{Y }}_{i+3} = {H_{i+3,{\mathrm{size}}}}({\mathrm{va}}({{\mathbf{X}} _{i\times 2+1}},{\mathbf{X}} _{i\times 3+1})). \end{aligned}$$
(11)

Then, the \({{\mathbf{Y} }_{i+3}}, i \in \{1,2\}\) are merged again with \({\mathrm{va}}(\cdot )\) function and 6-th convolution model with \(1\times 1\) convolution kernel, expressed as

$$\begin{aligned} {\mathbf{Y }}_{6} = {H_{6,1}}({\mathrm{va}}({{\mathbf{Y}} _4},{\mathbf{Y}} _5)). \end{aligned}$$
(12)

Notice that, \({\mathbf{Y }}_{6}\) is the final output of the Bi-ResNet’s unit block and, compared with the input data \({{\bar{\mathbf{S }}}_{k,d}}\), its dimension is reduced by 6, i.e., from \(168 \times 168\) to \(162 \times 162\).

3.2 Framework of Bi-ResNet

After a series of mixed and cross-learning, the final hybrid feature contains various deep features and shallow features with different levels of attention, thereby the vanishing gradient disappears and deep features are obtained. Herein, the framework of Bi-ResNet is developed, as shown in Fig. 4.

Fig. 4
figure 4

The framework of Bi-ResNet

Considering that the input data of detection model are the hourly electricity data within one week, we design Bi-ResNet which contains the \(3\times 3\) convolution kernel, 5 Bi-ResNet blocks, 4 maxpooling layers with stride 2, 1 maxpooling layer with stride 1, and some traditional components of CNN [25]. Specifically, due to the fact that shallow features are important but relatively easy to extract, we first pick a convolution kernel with smaller size, i.e., \(3\times 3\), and Bi-ResNet block is unnecessary for this step. Then, we use Bi-ResNet blocks and maxpooling layers to extract deep features. Notice that, when the input data are processed by the Bi-ResNet blocks, maxpooling layers with stride 2 and maxpooling layers with stride 1, the size of output data is reduced by 4, halved, and reduced by 1, respectively. Therefore, a hourly electricity data with the size of \(168\times 168\) are put into the Bi-ResNet and the output data with the size of \(6\times 6\) can be obtained after maxpooling 5. Subsequently, the type of hourly electricity data will be determined by the fully connected layer, and classification layer distinguishes whether the user steals electricity.

4 Method improvement: timing Shift Preprocessing

For the hourly electricity data, our proposed Bi-ReNet model can take both the deep features of electricity data and time factor into account, so the prediction results related to the electricity stealing users are accurate. However, the learning of time factor is not sufficient. Taking a \(10\times 10\) input data and \(3\times 3\) convolution kernel for instance, the convolution process is shown as Fig. 5.

Fig. 5
figure 5

The process of convolution

Due to the fact that the size of convolution kernel is fixed, the width of each convolution is the same and the vertical interval will not change. This means that neural networks can only learn the features of time factor with respect to the same time period, e.g., 1, 2, 3 corresponding to 3 hours and time interval, e.g., 1, 2, 3 1, 10, 19 corresponding to per 9 h. To handle this issue, we propose timing shift algorithm and its specific process is shown in Table 2.

Table 2 Timing shift algorithm

Notice that, the value of stride st is randomly selected from 0 to 23 due to the features of electricity stealing behaviors, i.e., the time period for electricity stealing will change based on date. For this case, we should extract the features of different dates and time periods, so that the combination of different time periods is considered.

5 Results and discussion

In this section, we present numerical results on three aspects, i.e., the training process of TS-BiResNet model, the accuracy of detection, and the further discussion about generalization ability. The first part verifies that TS-BiResNet model is fit for the detection of electricity stealing and its training process is completed successfully. For the second part, it evaluates the performance of the proposed model by comparing it with four benchmark models, i.e., LSTM model [16], Bi-ResNet model [26], gated recurrent unit (GRU) model and combined convolutional neural network and LSTM (CNN-LSTM) model [27, 28]. The simulation results are presented as Figs. 6 and 7, respectively. Then, the issue about generalization ability is discussed and analyzed with two additional datasets, whose simulation results are presented in Table 3.

Fig. 6
figure 6

The training process of TS-BiResNet model with different value of ESC. a ESC = 0.2, b ESC = 0.3, c ESC = 0.4, d ESC = 0.5, e ESC = 0.6, f ESC = 0.7, g ESC = 0.8, h ESC = 0.9

Fig. 7
figure 7

The prediction accuracy of five models

Table 3 Comparison of prediction accuracy for three datasets

5.1 Training process

Figure 6 is the training process of the TS-BiResNet model with different values of ESC. We take the situation that ESC = 0.8 as an example to analyze the training process of the model. Since the curves, i.e., the accuracy and the loss function, share the same variation trend, we only analyze the curve of accuracy. The figure reflects that shallow features and deep features of the hourly electricity data coexist and both of them can be learned by TS-BiResNet model. Specifically, the accuracy of detection soars into 65%, while the accuracy fluctuates around this value for several epochs. This is because we take \(3\times 3\) convolution kernel to find the shallow features but this kind of preliminary detection is ineffective due to the fact that the value of ESC is large, so that deep features are desired. After long training, the accuracy of detection increases sharply at the time integration = 450, reaching 80%. Then, its value rises gradually and reaches the peak around 98%. It indicates that deep features have great impact on the accuracy of detection and these features can be found by the TS-BiResNet model. Overall, the two curves are convergent, which means that the proposed model is suitable for detecting electricity stealing users in the dataset D1, and there is no problem about over-fitting and under-fitting during prediction process.

5.2 Prediction accuracy

Figure 7 is the performance comparison between the TS-BiResNet model and four benchmark models in terms of prediction accuracy. According to the mentioned analysis, we set the value of ESC ranges from 0.2 to 0.9. It verifies the TS-BiResNet model has capability to detect electricity users for extreme situations, i.e., the total amount of stolen electricity is small or its change does not have strong regularity on time. In general, the accuracy of all models is declined with the increase in ESC, this comes from the fact that the value of electricity stealing data becomes closer to the value of ordinary data, so that the features of electricity stealing users are difficult to be found. Specifically, when the value of ESC ranges from 0.2 to 0.5, the features of electricity stealing users are relatively obvious. Thereby, the accuracy of TS-BiResNet model, BiResNet model and CNN-LSTM model is very high, more than 98.0%, which exceeds the accuracy of other two models more than 5%. The reason for this phenomenon is that these three models learn features of electricity stealing while the other does not. Notice that, when the value of ESC reaches 0.5, the accuracy of BiResNet model and CNN-LSTM model decrease dramatically. This is because BiResNet model and CNN-LSTM need to extract the shallow features of electricity stealing for prediction, while the shallow features are difficult to be found and deep features are desired. When ESC ranges from 0.6 to 0.7, although the performance of BiResNet model deteriorates due to the inconspicuous shallow features, it is still superior to the performance of other four models. Besides, when ESC ranges from 0.7 to 0.8, the accuracy of TS-BiResNet model and BiResNet model experiences a great decline. This is because the shallow features disappear due to the distribution of electricity consumption data, so that only deep features and time factors can be used for the detection. It is worth noting that the performance of GRU model is greater than that of Bi-ResNet model when ESC = 0.8. The reason for this phenomenon is that GRU model can utilize the temporal correlation of data rather than make use of shallow features, so that its performance is close to the GRU model and LSTM model. At this time, the performance of CNN-LSTM model, GRU model and LSTM model is greater than that of Bi-ResNET but worse than BiResNet model. Last but not least, the TS-BiResNet model is able to detect electricity stealing users even when the value of ESC is 0.8 and 0.9, reaching 90.8% and 81.9% accuracy, respectively. It shows that the TS-ResNet can learn deep features and time factors for users with small scale of electricity stealing and varying temporal correlation, and its performance is better than that of four benchmark models. (When ESC = 0.9, the performance of GRU model deteriorates because of its model structure)

5.3 Further analysis

Based on the above analysis, we get the conclusion that our proposed TS-Bi-ResNet model can achieve detection for the users with small scale of electricity stealing and varying temporal correlation. However, it is worth noting that the convergence of model is a necessary condition for overfitting problem, and the generalization ability should be further proved. Thus, we further introduce two different datasets [29] and [30], denoted as D2 and D3, respectively.

In specific, D2 is a dataset consisting of the hourly electricity data of 32 users within half a year. Compared with the dataset D1, we can verify that whether our proposed model is suitable for small dataset. This is because D1 and D2 have some similar characters, i.e., they are, respectively, taken from the residential users within a limited region, which means that the difference among the maximum value of each user is small. Besides, for each user, the habit with respect to the electricity consumption is the same as the others, which means that the distribution of electricity consumption is similar. However, the scale of D1 is 10 times larger than D2, so that D2 is a small dataset.

As for D3, it is a dataset containing the hourly electricity data of 913 companies within half a year. Compared with the dataset D1, we can analyze that whether our proposed model is fit for the dataset with various individual differences. This is because, although the data of companies are taken from the same region, the scale and habit of electricity consumption are varied, e.g., for company A, the scale of electricity consumption may be 10 times larger than company B, and meanwhile, company A’s peak electricity consumption is in the morning, while Company B’s peak electricity consumption is in the afternoon. In general, compared with D1, D3 is a dataset with large individual differences.

For these three types of datasets, we carried out simulations, respectively, and the simulation results are shown in Table 3. It indicates that, for the small dataset D2, the performance of all models decreases sharply. It means that recurrent neural network (RNN)-based prediction model and deep neural networks (DNN)-based prediction model are not fit for small sample classification or prediction, especially for the former. This is because LSTM and GRU require temporal depth and regularity, that is to say the scale of the data should be large and the data should be temporally correlated. In comparison, although the prediction effect of the TS-BiResNet and Bi-ResNet model is not good enough, when the value of ESC is low, e.g., 0.6 or 0.7, they can still achieve a certain accuracy of detection, around \(80\%\). As for the D3 dataset with large individual differences, the performance of different models varies greatly. In specific, when the value of ESC is low, e.g., 0.6 or 0.7, the performance of the models is similar to the case in D1, except for the CNN-LSTM model. The reason why the performance of CNN-LSTM model drops dramatically is that it tries to get the correlation of the same moment in different weeks, rather than finding the relationship between the current moment and the next moment. However, it seems that there is no regulation can be found so that CNN-LSTM model does not perform well. Besides, when the value of ESC is 0.8, the performance of TS-BiResNet and Bi-ResNet model maintains good performance, reaching \(90.65\%\) and \(81.27\%\), respectively. It proves that our proposed model has capability to detect electricity stealing users even if they have significant individual differences, but other models cannot. However, this advantage is not kept when ESC is 0.9, in other words, the detection of all models fails. This is because the individual difference is so huge that the models cannot determine whether it is individual difference or electricity stealing behaviors. In general, for large-scale datasets, our proposed model can effectively detect electricity stealing users, proving its generalization, while the performance of the detection regarding to small-scale datasets should be improved.

6 Conclusions

For electricity stealing detection, we first analyze the distribution of the real electricity dataset and behaviors of electricity stealing. Based on this, we generate be the hybrid dataset composed of real electricity data and specific electricity stealing data. To detect electricity stealing users, we design Bi-ResNet model for learning deep features of electricity data. Then, we propose the TS-Bi-ResNet model which takes time factor into consideration. Simulation results show that our proposed model can achieve detection for the users with small scale of electricity stealing and varying temporal correlation. Meanwhile, its detection accuracy is roundly superior to the four benchmark models. Its generalization ability is further discussed and simulated with two additional datasets.