Introduction

The proliferation of electrical loads in buildings has resulted in high energy demand within these buildings. As the number of users and loads grows, there is a need to manage the energy consumed within buildings. The main aim of mains disaggregation and load recognition is to achieve automatic energy management, mainly in residential and commercial buildings, as these are the major consumers of electrical energy. However, different users have varied expectations of how their electrical equipment is used. The electrical mains supply signal on which the energy management system can be developed is a complex fusion of the various electrical equipment load signals within a building. Through the non-intrusive load monitoring (NILM) [1,2,3,4] method we are able to extract each load signal from this composite, thereby establishing the equipment's exact operational status. The contemporary NILM mains power series signal disaggregation and load recognition approach focuses on deep learning (DL) algorithms that are modeled on speech recognition and natural language processing systems. Some examples of NILM power series based recognition systems include: (1) “sequence-to-point learning”, where the input is a window of the raw aggregate signal and the output is a single point of the target appliance signal, (2) one-dimensional convolutional differential input systems, and (3) stacked denoising autoencoders (sdAEs) with the ability to reconstruct a good signal from a composite of noise and signal [4,5,6,7,8,9]. The NILM method has traditionally been based on the power series format of the equipment signal [6, 7, 10] in labeled or unlabeled form, often with a detailed incorporation of an event detection mechanism [11, 12].
The appliance features that are used in NILM systems broadly fall in the following categories of steady state (power change, time and frequency domain voltage-current (V–I), V–I trajectory), transient state (transient power, start-up current waveforms, voltage noise), combined steady and transient states features, and features obtained or inferred from the behavior of the appliance [13].

In NILM recognition systems, power series spanning long time periods are often required to provide sufficient features for model training, since power series methods suffer from poor signal feature localization and normally require involved signal formatting and pre-processing. A shapelets learning method that can benefit the NILM power series based recognition scheme is proposed in [14] to improve the recognition of general time series with very limited data samples. These shapelets represent tendencies in the signal, thereby placing the signal in a certain class. However, the shapelets method is still power series based. In this paper, to improve recognition by exploiting powerful computer vision models, we change the power series feature space to image space. The image equivalent of the power series contains a rich set of localized signal features. We transform the power series into an image through the use of Gramian angular summation fields (GASF). However, it is also possible to encode the power series as an image using Gramian angular difference fields (GADF) and Markov transition fields (MTF). The main advantage of Gramian angular fields (GAF) over other time series visualization methods is that we can readily reconstruct the power series from the image parameters [15]. Some researchers [16, 17, 22] have proposed image-based NILM recognition systems with varying degrees of success. However, the image-based approach was mainly implemented in the classification stage rather than across the entire NILM recognition process, which includes the disaggregation.

Having reviewed the related literature, we propose the development and improvement of the image-based NILM recognition system. In this paper, we introduce an improved image-based feature extraction approach that performs both disaggregation and classification of power systems load signals via less complex deep learning model configurations with reduced computation times. The developed system is fully evaluated in a laboratory setup. We then propose, as a practical implementation, the installation of the designed NILM recognition system at the mains power entry point of the building housing the appliances. The appliance classification is achieved through the Oxford Visual Geometry Group (VGG) convolutional neural network (VGG–CNN) [16] owing to its very high image classification performance. The load signal image disaggregation is achieved through the powerful stacked denoising autoencoder noise extraction network applied to images. In this study, we generated our own dataset from three mains lamps, a refrigerator and a microwave oven. In future, we can extend the image-based NILM recognition strategy to the recognition of multi-state appliances.

In the literature, the traditional two-dimensional (2-D) convolutional neural network (CNN or ConvNet) is a common feature in most, if not all, NILM image-based recognition systems. To improve the performance of NILM image-based designs it is necessary to modify the basic CNN structure [17,18,19,20,21,22]. Based on a twenty-one appliance dataset, the authors of [17] proposed a 2-D CNN composed of a residual model and a Batch Normalization layer to correct the vanishing gradient issue during training. For the transformation of the time series signals to 2-D images in their NILM recognition system, the authors of [17] recommended GADF over GASF or MTF images, as GADF images capture and represent more signal event/timing information than the other two. However, there is a need to improve on the NILM recognition performance of [17], which hovered at 97.2%.

The authors in [18] proposed a GASF image-based CNN NILM disaggregation strategy for the standard Dataport dataset. The model in [18] was able to achieve reasonable disaggregation with image pixel sizes of \(30 \times 30\) for the microwave and \(100 \times 100\) for the air conditioner. However, the disaggregation relative error in total energy varied from 14% for the microwave to 32% for the air conditioner. Clearly, there is a need to improve the NILM disaggregation performance in [18]. The event-driven NILM recognition method proposed in [19] captures event-based information that includes the signal’s zero-crossing point, the similarity between current signals, a threshold measure and the points at which an event starts and stops. All these event current characteristics are converted to gray-scale images as the input of a VGG-16 CNN model. The method in [19] achieved high NILM image-based recognition performance for a considerably reduced signal dataset when the number of appliances is small. However, there is a need to further improve the method in [19], as the NILM recognition accuracy degrades with an increase in the number of appliances. Furthermore, image-based event algorithms increase the complexity of the NILM design.

Voltage–current (VI) trajectories constitute a form of 2-D NILM signature recognition scheme [20]. The VI trajectories provide a characteristic image for each appliance. The image is then recognized through the hierarchical clustering classifier [20]. The authors of [4, 5] developed a much more robust NILM vision-based VI trajectories recognition method based on the convolutional and Siamese neural networks. The Siamese neural network is composed of two similar CNN networks in parallel feeding one output label. The inputs of these networks are single identical images. Siamese neural networks can also be successfully implemented in one-shot learning [5]. The aim is to find the similarity between two inputs, for example that of the ground truth signal and the disaggregated signal. The contrastive loss function gives a quantitative measure of the relationship between the Siamese network inputs. A clustering algorithm known as density-based spatial clustering of applications with noise (DBSCAN) [5] is then used to place the inputs into their classes. However, in [4, 5] the F1-measure was low for appliances with similar signatures and requires improvement. The authors in [21] proposed an image-based NILM recognition approach premised on the vector projection classification (VPC) technique that was formerly applied to human face recognition. In this case, appliance data images are projected onto a 2-D vector surface and their similarity noted; the closer the images are to each other, the more probable it is that they belong to the same class and hence are recognized.

In [22], the authors proposed to represent event-based NILM VI appliance features in image form using weighted recurrence graphs (WRGs). Traditional VI models are capable of representing only the phase relationships, to the exclusion of the signal magnitude of the appliances. According to [22], the traditional VI trajectories approach is incapable of extracting adequate VI trajectory information from a purely resistive load. However, the WRGs approach is capable of combining and representing both the signal magnitude and the VI trajectories in a single image, which is then processed through an image-based CNN. In this way, the problem of extracting adequate VI trajectory information from purely resistive loads can be addressed. Although the method in [22] is capable of very high NILM recognition performance, there are some appliances that it wrongly identifies.

We have shown the diversity of NILM image recognition methods that often achieve high performance. The continued development of NILM image recognition systems has been made possible by technological advancements in computing that have allowed for the development of deep machine learning algorithms with computer vision capable of outperforming human biological vision. The deep-learning image detection algorithm that stands out from the rest, and is used in most image recognition systems, is the CNN together with its variants. The CNN has enhanced image feature extraction capabilities that allow it to achieve advanced levels of image recognition [23,24,25]. In this paper, we propose an improved NILM image disaggregation framework that is based on the stacked denoising autoencoder (sdAE) using CNN layers. The effectiveness of an image recognition system lies in its ability to obtain a clean image from a poor and noisy representation of the image. Although a number of image cleaning techniques have been proposed [26], with the deep image denoising concept pioneered in 2015 [27], the CNN based sdAE has achieved very high image cleaning performance [26]. Hence, we aim to exploit this property of the sdAE to obtain a clean appliance signal image from the mains supply signal composite image. Furthermore, we aim to address some of the image-based NILM deficiencies [17,18,19,20,21,22]. Like the authors in [18], we propose GASF generated images for the disaggregation strategy; however, we use an in-house generated dataset and go a step further to include the image-based equipment classification part, which was not done in [18], where a standard Dataport dataset was used. In the final analysis, we compare the performance of our proposed image recognition system to that of a power series one, also based on the convolutional neural network.
The procedure involves measuring the current, real power and power factor load parameters for the aggregate and each appliance power series based signal, and converting each parameter power series into an image representation. We performed rigorous NILM recognition experimentation with the images generated from all the three signal parameters. The current and power signals do, in fact, individually provide all the features required to provide unique signal identity. However, it is also possible to provide signal identification by considering the PF. We get a boost on the signal recognition performance if we consider an increased dataset that includes current and power factor or active power and power factor. We then train the proposed sdAE disaggregation and VGG–CNN classification networks. Finally, we perform image-based disaggregation and recognition of each appliance based on the power series images. We make the following contributions in our study:

  • Development and improvement of the NILM recognition scheme by basing it on a powerful computer vision appliance signal disaggregation and classification technique.

  • Comparison of the performance of our proposed image-based NILM recognition scheme to that of the power series signal system.

The remaining sections are structured as follows: “Methodology” details the design of the proposed image-based non-intrusive load monitoring system. The disaggregation is achieved through a number of trained stacked denoising autoencoders (sdAE) equal to the number of target mains loads and the classification through a single multiclass trained deep convolutional neural network (DCNN). We also show the intended application of the designed image-based NILM recognition system. We detail the experimental setup in relation to the creation of our in-house laboratory dataset from power series to image form, performance measure, proposed method pseudo code, model training/testing approach framework and procedure. A breakdown of the model architectures in terms of the deep learning network layers, and the comparison or relationship between the encoding, decoding, ConvNet classification and power series classification are also given here. “Discussion of experimental results” gives an in-depth presentation and analysis of the results. “Conclusion” gives a conclusion of the developed system. We also give an insight into future work related to the outcomes of this paper.

Methodology

Proposed overall topology

The proposed topology in Fig. 1 is made up of two parts, the disaggregation and the classification. The disaggregation is made up of five sdAE networks, whilst the classification is made up of one ConvNet VGG network. The aggregate signal image is input into five trained sdAE networks, each capable of disaggregating only one target appliance signal image. The output of each sdAE is a clean target appliance signal image. This image is then input into the trained VGG classifier for recognition.

Fig. 1

Proposed complete NILM recognition system showing five dAE networks and one ConvNet VGG classifier. The following are identified a aggregate input, b FR, c MW, d L1, e L2, and f L3 target images, with g the load identity output

In the classification part, we train the model to recognize and classify only the ground truth signature images of the appliances. In this case, we consider only the refrigerator and microwave oven input images. However, we can generate relevant signature images for the other appliances in the entire experiment. Both the disaggregation and classification networks are built around the CNN. The CNN can extract detailed image features and reduce the overall dimension of an image while preserving the image identity through the linear convolution of the input image. The CNN employs a number of filters, whose dimensions are much smaller than the image, to scan the entire image at intervals known as strides, thereby obtaining a representative mapping of these scanned areas. Nonlinearity is introduced into the convolution result through the application of a Rectified Linear Unit (ReLU) operation, which effectively removes all negatives in the result. The resulting CNN + ReLU feature image is then passed through a pooling layer to reduce the dimensionality of the convolution result while maintaining the essential parts of the input information. For image recognition, the CNN uses the max or sum pooling method [4, 23,24,25]. To increase the number of detected image features the CNN requires an increase in the number of filters connected in parallel, with each filter detecting a specific image feature. The CNN can be made deeper by adding successive convolution layers and pooling layers to extract as much information as possible from the data. The pooling layers introduce blurring of the image, hence deeper networks are needed to extract as much relevant information as possible. Figure 2 shows a 28 × 28 input image, the convolved output image and finally the image from a 2 × 2 max pooling based on 16 filters.

Fig. 2

Convolution on refrigerator GADF image. a Input image, b convolution + ReLU image, and c blurred first max pooling resultant image
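The convolution, ReLU and max pooling steps described above can be sketched in NumPy as follows. This is only an illustrative sketch: the random stand-in image, the 3 × 3 filter size, the unit stride and the function names `conv2d_relu` and `max_pool_2x2` are assumptions made for the example and not the configuration used in the paper; only the 28 × 28 input size, the 16 filters and the 2 × 2 pooling follow Fig. 2.

```python
import numpy as np

def conv2d_relu(image, filters, stride=1):
    """Slide each k x k filter over the image (no padding), then apply ReLU."""
    k = filters.shape[1]
    out = (image.shape[0] - k) // stride + 1
    maps = np.zeros((filters.shape[0], out, out))
    for f, kernel in enumerate(filters):
        for r in range(out):
            for c in range(out):
                patch = image[r * stride:r * stride + k, c * stride:c * stride + k]
                maps[f, r, c] = np.sum(patch * kernel)
    # ReLU removes all negative values from the convolution result
    return np.maximum(maps, 0.0)

def max_pool_2x2(maps):
    """Keep the largest value in every non-overlapping 2 x 2 block."""
    n, h, w = maps.shape
    h2, w2 = h // 2, w // 2
    trimmed = maps[:, :h2 * 2, :w2 * 2]
    return trimmed.reshape(n, h2, 2, w2, 2).max(axis=(2, 4))

rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(28, 28))   # stand-in for a 28 x 28 GAF image
filters = rng.normal(size=(16, 3, 3))           # 16 hypothetical 3 x 3 filters

features = conv2d_relu(image, filters)          # 16 feature maps of 26 x 26
pooled = max_pool_2x2(features)                 # 16 pooled maps of 13 x 13
```

Each pooled map has half the side length of its feature map, which illustrates the dimensionality reduction, and the accompanying loss of detail, attributed to the pooling layer.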

In the power series-based NILM recognition the network input is the aggregated power series signal and the target is the specific appliance signature. In the proposed system, the number of disaggregation networks is equal to the number of appliances under test, where the input power series is equal to the entire appliance activation. In disaggregation, the output power series length is also equal to the entire appliance activation. In our proposed method, the disaggregation output is the image equivalent of the power series output. Instead of power series based partial disaggregated signals (that are combined through some reconstruction filter or through addition and finding the mean) defined by the number of sliding windows, our proposed system outputs an image representing the entire ground truth activation characteristics of the disaggregated image equivalent to the signature. To improve gradient convergence and avoid instability it is necessary to normalize all power series data and then apply standardization [zero mean (µ) and unit standard deviation (σ)] to the data. The disaggregated signals are then fed into a trained multi-class power series classification network. Long-short-term-memory (LSTM) recurrent networks that find wide application in speech recognition and language processing are highly adaptable to time-series disaggregation. Furthermore, whilst ConvNets are highly adapted to spatial based recognition they can also be used in one-dimensional (1D) power series univariate and multivariate based NILM disaggregation and classification systems [6] with acceptable performance. Power series deep learning NILM recognition systems are often based on (1) combined convolutional and recurrent neural network (CNN–RNN), and (2) autoencoder (AE) [7, 10, 13] that are well adapted to complex feature extraction, sequence prediction and signal reconstruction, all requirements that are crucial in signal disaggregation. 
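The normalization followed by standardization mentioned above can be sketched as below. This is a minimal sketch under stated assumptions: min-max scaling is assumed for the normalization step, and the function name `normalize_then_standardize` and the sample power readings are hypothetical.

```python
import numpy as np

def normalize_then_standardize(series):
    """Min-max normalize to [0, 1], then shift to zero mean and unit standard deviation."""
    x = np.asarray(series, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        raise ValueError("constant series cannot be rescaled")
    scaled = (x - x.min()) / span            # assumed min-max normalization
    return (scaled - scaled.mean()) / scaled.std()

# Hypothetical active power readings in watts
power = np.array([0.0, 85.0, 90.0, 88.0, 1200.0, 1190.0, 87.0, 0.0])
z = normalize_then_standardize(power)
```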
In applications where we need to detect feature trends within the data (fixed sequence length) without worrying about the specific location of that feature we use one dimensional (1-D) CNN. RNN that is based on memory cells can direct the output predictions to be in the order determined by the position of the input signal elements. However, due to the vanishing gradient problem of the RNN, an enhanced form of the RNN known as the long-short-term-memory (LSTM) network is used instead [28]. In the backpropagation, weights are updated according to the gradient descent where a vanishing gradient deprives layers close to the input of error signal making these layers less effective in training, whilst an exploding gradient error signal causes instability in the same layers [29]. In the multi-layer perceptron (MLP) as the hidden layers go wider and deeper, a number of issues arise. Wider implies more weights, hence strenuous computations. Deeper implies a vanishing and exploding gradient. The autoencoder (AE), which is made up of the same number of input neurons as output neurons and having a significantly reduced deep layer count that form an extension of the input, can address the pitfalls of the deep MLP. Disaggregation by ‘denoising’ the unwanted parts of the aggregate signal is an effective way of extracting the required load signal [6]. Our proposed system uses 2-D CNN based classification as opposed to 1-D CNN classification in the power series system.

The proposed system samples data from the common mains power cable supplying three mains lamps, a refrigerator and a microwave oven in the house. The hardware/software components include digital signal processing processors, main system processor, embedded system development board/platform, IoT module and python with Keras deep learning library. There is also a need to convert the high-level language to low-level format or machine code for loading the NILM program into the embedded system. The power supply for the whole recognition unit is tapped from the mains power cable. In implementation we consider both manual and online capture of signal information for training and disaggregation, respectively.

Disaggregation framework

Inspired by the ability of autoencoders to reconstruct a good signal from a composite of noise and signal, an NILM disaggregation system based on the stacked image denoising autoencoders (dAEs) [9] is proposed. The dAE will effectively disaggregate the required load image from a noisy environment due to other (aggregate) loads from the aggregate image. We obtain the full benefits of the dAE by developing stacked dAEs which are basically deep dAE structures. By implementing stacked dAEs we obtain a better generalization of the recognition system. Our proposed stacked dAE recognition system is given in Fig. 3 with a number of hidden layers.

Fig. 3

Image denoising autoencoder for the NILM disaggregation. a Aggregate image. b encoder, c encoding hidden layer, d decoder, e target image, f decoder hidden layer, g encoder hidden layer, and h Gaussian added noise equivalent due to loads other than the target load

The aggregate signal \(x\left( t \right)\) can be represented in terms of the appliance \(j\) signature \(z_{j} \left( t \right)\) and an overall noise term \(v_{j} \left( t \right)\), which comprises the signatures \(z_{i} \left( t \right)\) of the other appliances and a spurious noise term \(e\left( t \right)\), as [30]

$$ x\left( t \right) = z_{j} \left( t \right) + v_{j} \left( t \right)\quad {\text{for}}\;j = 1,2, \ldots N, $$
(1)

where

$$ v_{j} \left( t \right) = \sum\limits_{\substack{i = 1 \\ i \ne j}}^{N} z_{i} \left( t \right) + e\left( t \right). $$
(2)
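Equations (1) and (2) can be checked numerically with a small sketch. The three appliance signatures, their values and the helper name `noise_term` are hypothetical and serve only to show that the same aggregate is split into a different target and noise pair for each appliance.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 6

# Hypothetical appliance signatures z_i(t), e.g. RMS current in amperes
signatures = {
    "FR": np.full(n_samples, 0.9),
    "MW": np.array([0.0, 0.0, 5.2, 5.2, 0.0, 0.0]),
    "L1": np.full(n_samples, 0.1),
}
e = rng.normal(0.0, 0.01, n_samples)      # spurious noise term e(t)

# Aggregate mains signal x(t): sum of all signatures plus spurious noise
x = sum(signatures.values()) + e

def noise_term(target):
    """v_j(t) of Eq. (2): every signature except the target, plus e(t)."""
    return sum(z for name, z in signatures.items() if name != target) + e

# Eq. (1) holds for every choice of target appliance j
for name, z_j in signatures.items():
    assert np.allclose(x, z_j + noise_term(name))
```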

The dAE removes the \(v_{j} \left( t \right)\) term from the aggregate signal so that only the appliance j signature \(z_{j} \left( t \right)\) remains. The dAE comprises an aggregate input \(x_{i}\) followed by an encoder, which gives an internal representation of the input to an encoding hidden layer \(y_{h}\), and then a decoder, which maps this internal representation to the target output \(z_{i}\), where \(i > h\) and \(i\) and \(h\) denote the number of neurons in the respective layers. These are effectively two full networks connected back to back, where the first, CNN-based network incorporates max pooling and the second incorporates up-sampling [9, 31,32,33]. Normally, during training of the network, Gaussian or salt-and-pepper noise is added to the input to give a noisy input \(x^{\prime}\). The nonlinear encoding layer y [9] is then given in Eq. 3 as

$$ y = f_{\theta } \left( {x^{\prime}} \right) = \sigma \left( {Wx^{\prime} + b} \right), $$
(3)

where b is the encoding layer bias, \(W\) is the \(i \times h\) weight matrix, and \(\sigma\) is the ReLU activation function. The mapping to z from the output y is

$$ z = g_{{\theta^{\prime}}} \left( y \right) = s\left( {W^{\prime}y + b^{\prime}} \right), $$
(4)

where \(b^{\prime}\) is the decoding layer bias, \(W^{\prime}\) is the \(h \times i\) weight matrix, which is the transpose \(W^{T}\) of the encoding weight matrix, and s is the softplus activation function. \(\theta = \left\{ {W,b} \right\}\) for the encoding layer \(y_{h}\) and \(\theta^{\prime} = \left\{ {W^{\prime}, b^{\prime} } \right\}\) for the decoding layer \(z_{i}\). For training and optimizing the parameters \(\Theta = \left\{ {W,b,b^{\prime}} \right\}\) we apply the objective loss function

$$ L\left( \Theta \right) = \sum\limits_{i} {\left\| {v_{i} - g_{{\theta ^{\prime } }} \left( {f_{\theta } \left( {x^{\prime } } \right)} \right)_{i} } \right\|_{2}^{2} } , $$
(5)

where \(v_{i}\) denotes the clean target signal (not to be confused with the noise term \(v_{j} \left( t \right)\) of Eq. 2).
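A single dense encoder and decoder pass following Eqs. (3) to (5) can be sketched in NumPy as follows. This is an illustrative sketch, not the convolutional sdAE of the paper: the layer sizes, the random weights and the names `dae_forward` and `reconstruction_loss` are assumptions, and the weight matrix is stored with shape (h, i) so that the product in Eq. (3) is defined, with the decoder using its transpose.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def softplus(a):
    # log(1 + exp(a)) computed in a numerically stable way
    return np.logaddexp(0.0, a)

def dae_forward(x_noisy, W, b, b_prime):
    """Eq. (3): y = ReLU(W x' + b); Eq. (4): z = softplus(W^T y + b')."""
    y = relu(W @ x_noisy + b)
    z = softplus(W.T @ y + b_prime)
    return y, z

def reconstruction_loss(target, z):
    """Eq. (5): squared L2 distance between the clean target and the reconstruction."""
    return float(np.sum((target - z) ** 2))

rng = np.random.default_rng(0)
i_dim, h_dim = 8, 3                                    # assumed layer sizes with i > h
W = rng.normal(scale=0.1, size=(h_dim, i_dim))         # hypothetical initial weights
b = np.zeros(h_dim)
b_prime = np.zeros(i_dim)

clean = rng.uniform(0.0, 1.0, size=i_dim)              # stand-in clean target signal
x_noisy = clean + rng.normal(scale=0.05, size=i_dim)   # Gaussian-corrupted input x'

y, z = dae_forward(x_noisy, W, b, b_prime)
loss = reconstruction_loss(clean, z)
```

Training would adjust `W`, `b` and `b_prime` by gradient descent to reduce `loss`; that step is omitted here.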

Classification framework

The classification model was premised on a simplified three-section Oxford Visual Geometry Group (ConvNet VGG) in [16]. This is a multilayer very deep CNN structure that has achieved a very high level of multiclass recognition and good generalization of a large varying image dataset count. The ConvNet VGG model is a good benchmark classification model. Figure 4 shows our proposed VGG classification model.

Fig. 4

Proposed CNN classification network. a Disaggregated input images, b convolution + ReLU, c feature maps, d max pooling, e fully connected network, f class outputs, and g backpropagation network path

Dataset creation

A number of public datasets exist for experimenting on developing NILM recognition systems. However, most of the load equipment in these datasets is either obsolete or advancement in technology has altered slightly their signature characteristics. This in itself is not a major issue for developing the models, but how to apply and validate these models becomes a problem. Also some datasets have varied data acquisition sampling time and are defined for activation periods of days and even months. This would generate enormous data which is beyond the scope of the CPU computations platform that we are using in this paper. To this end, we propose a simpler dataset, however, not necessarily less important for our experiment.

The data used in the experiments were obtained in a laboratory setup using the following appliances:

  • Hisense refrigerator (RF),

  • SMW20E Salton microwave oven (MW),

  • Philips 5 W (60 W) LED lamp (L2),

  • Radiant 12 W (100 mA) CFL lamp (L1) and

  • Radiant 14 W (110 mA) CFL lamp (L3).

The data were acquired at a sampling rate of 1 Hz by using a Tektronix PA1000 Power Analyzer [34]. The programming environment was based on 64-bit Python 3.6.3 with the keras 2.2.4, tensorflow 1.5.0 backend, numpy 1.17.0, pandas 0.20.3, pyts 0.8.0, scipy 1.3.1, and scikit-learn 0.20.1 packages, running on an HP ProBook 450 G3 laptop with a 64-bit Intel® 2.30 GHz CPU and 4.00 GB of RAM. Figure 5 shows the experiment setup for acquiring the data using the PA1000 Power Analyzer.

Fig. 5

Photo of experiment setup for RF, MW, L1, L2 and L3 parameter measurement. a RF, and b MW

The appliances are connected in parallel as per PA1000 instructions to an alternating-current (A.C) mains power source. Each appliance is connected to the mains power cable through switched mains power extension cables incorporating lamp holders in the case of lamps. For measurement reproducibility, each appliance and plug point is assigned a specific label. USB data logging, for datasets creation is at a frequency of 1 Hz. This sampling frequency determines whether we implement high transient, slow transient, event detection or non-event detection based, feature extraction algorithms. However, in the event of designing a data acquisition signal processing hardware we can make use of much higher sampling frequency since provision for buffer storage can be incorporated.

Power series signals

Figure 6 shows the current (I_rms) aggregate appliances signal for a laboratory experiment whose objective was to disaggregate and classify each appliance specified in this diagram using deep learning method.

Fig. 6

Aggregate current (I_rms) profile showing the appliance activations in the laboratory experiment. The load operational status is as follows: at point a FR “ON”, intervals b FR + MW on idle, c MW boils 100 ml of water, d FR + MW, point e MW “OFF”, intervals f FR + L1, g FR + L1 + L2, h FR + L1 + L2 + L3, i FR “OFF” + L1 + L2 + L3, j L2 + L3, k L2, and at l all loads “OFF”

The diagram gives the aggregate profile of a refrigerator (FR), a microwave oven (MW), and mains 12 W (L1), 5 W (L2) and 14 W (L3) lamps. The appliances under consideration have various activation periods. In the experiment, the refrigerator’s activation period before the next appliance (microwave oven) is switched ON spans 1170 s. However, from about 28 s to 1170 s its response is fairly constant. The microwave oven is switched on to operate at idle for 120 s after the 1170 s of refrigerator operation. Figure 7 shows the point at which we switch on the microwave oven idle status into the circuit.

Fig. 7

Refrigerator and microwave oven activation points. a Refrigerator switch “ON”, b microwave oven switch “ON”, and c refrigerator and microwave oven “ON” but operating at idle

After 120 s the microwave oven is timed for a period of another 120 s to bring 100 ml of water to the boil and then switched off. The refrigerator continues to run and after operating for 120 s L1 is switched on to operate with the refrigerator for a period of 360 s before L2 is added into the circuit and the combination operates for 300 s. We then add another lamp L3 into the circuit comprising RF, L1, and L2 to operate for an additional 300 s before the refrigerator cuts off automatically leaving L1, L2 and L3 switched on. The remaining L1, L2 and L3 combination operates for 120 s after which we switch off L1. Now we remain with L2 and L3 in the circuit for 240 s after which we disconnect L3. L2 operates for a short period before we finally remove it from the supply to remain with no connected appliances.
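The switching sequence described above can be tabulated with a short sketch that accumulates the stated interval durations into approximate switching times. The interval labels and the function name `start_times` are hypothetical. It is also assumed that the refrigerator runs alone for 120 s after the microwave oven is switched off, and the final L2-only interval is left out because its duration is not stated.

```python
# (interval label, duration in seconds), in the order given in the text
intervals = [
    ("FR on (alone)", 1170),
    ("MW on, idle", 120),
    ("MW boiling 100 ml of water", 120),
    ("MW off, FR alone", 120),
    ("L1 added", 360),
    ("L2 added", 300),
    ("L3 added", 300),
    ("FR off, L1 + L2 + L3", 120),
    ("L1 off, L2 + L3", 240),
]

def start_times(intervals):
    """Return (label, start time) pairs and the time at which the last interval ends."""
    t = 0
    schedule = []
    for label, duration in intervals:
        schedule.append((label, t))
        t += duration
    return schedule, t

schedule, l3_off_time = start_times(intervals)
for label, start in schedule:
    print(f"{start:5d} s  {label}")
print(f"{l3_off_time:5d} s  L3 off, L2 alone")
```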

The data acquisition unit automatically samples additional load parameters that include voltage, frequency, active power (Watts), and power factor (PF) for both the aggregate and ground truth signals. The active power load profile was similar in shape to the current load profile. Figure 8 shows the PF aggregate profile for the experiment in this paper. The PF, which is the ratio of real power to apparent power \(\left( {{\text{PF}} = \frac{{{\text{watts}}}}{{{\text{voltage}} \times {\text{current}}}}} \right)\), normally measures the energy efficiency of the appliance. As can be seen from the preceding expression, the PF in appliance steady-state operation is a consequence of the active power, current and voltage. We performed rigorous NILM recognition experimentation with the images generated from all three signal parameters. The PF energy efficiency characteristics can in principle be used to provide recognition of an appliance, since the PF varies with the active power, which is directly related to the current. However, the I_rms and Watts signals give a direct representation of the operational features of the appliance and are therefore more appropriate parameters for the recognition. Hence, the I_rms and Watts parameters do, in fact, individually provide all the features required to give a unique signal identity. We get a boost in signal recognition performance if we consider an increased dataset that includes current and power factor or active power and power factor.

Fig. 8

Aggregate PF appliances profile for the laboratory experiment. Operating points and intervals are defined as in Fig. 6 since the time scale is the same
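The PF expression given above amounts to a single division per sample, as the following sketch shows. The function name `power_factor` and the example readings are hypothetical and are not measurements from the experiment.

```python
def power_factor(watts, voltage, current):
    """PF = active power / apparent power, with apparent power = voltage x current."""
    apparent = voltage * current
    if apparent <= 0:
        raise ValueError("apparent power must be positive")
    return watts / apparent

# Hypothetical steady-state readings: 103.5 W at 230 V and 0.5 A
pf = power_factor(watts=103.5, voltage=230.0, current=0.5)
```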

The power series signals are taken as raw data spanning the entire aggregate signal sample length and target load activation windows. The pyts package in python facilitates the generation of the signal images from power series representation. We consider gramian angular fields (GAFs) to transform the power series into image equivalent form for input into our image-based NILM recognition system.

Gramian angular fields

The Gram matrix (Gramian or metric matrix) [35] is the basis for encoding an appliance power series signal as an image. The appliance signals are encoded as GAFs using the procedure in [15, 36], which begins by rescaling the signals \(X = \left\{ {x_{1} ,x_{2} , \ldots ,x_{n} } \right\}\) to fit in the range − 1 to 1 or 0 to 1, as given in Eqs. (6) and (7), respectively.

$$ \tilde{X}_{ - 1}^{i} = \frac{{(x_{i} - \max \left( X \right)) + \left( {x_{i} - \min \left( X \right)} \right)}}{\max \left( X \right) - \min \left( X \right)}, $$
(6)
$$ \tilde{X}_{0}^{i} = \frac{{x_{i} - \min \left( X \right)}}{\max \left( X \right) - \min \left( X \right)}. $$
(7)
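As a minimal sketch of the rescaling in Eqs. (6) and (7), the following pure-Python function maps a short series onto either range. The function name `rescale` and the sample values are illustrative assumptions and are not taken from the experiment code.

```python
def rescale(series, to_range=(-1.0, 1.0)):
    """Min-max rescale a series to [-1, 1] (Eq. 6) or [0, 1] (Eq. 7)."""
    lo, hi = min(series), max(series)
    span = hi - lo
    if span == 0:
        raise ValueError("a constant series cannot be min-max rescaled")
    if to_range == (-1.0, 1.0):
        # Eq. (6): ((x - max) + (x - min)) / (max - min)
        return [((x - hi) + (x - lo)) / span for x in series]
    if to_range == (0.0, 1.0):
        # Eq. (7): (x - min) / (max - min)
        return [(x - lo) / span for x in series]
    raise ValueError("to_range must be (-1.0, 1.0) or (0.0, 1.0)")


if __name__ == "__main__":
    sample = [2.0, 4.0, 6.0]  # hypothetical readings
    print(rescale(sample))               # range [-1, 1]
    print(rescale(sample, (0.0, 1.0)))   # range [0, 1]
```

The minimum of the series always maps to the lower bound of the chosen range and the maximum to the upper bound.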

After rescaling, the time series is converted to polar coordinates as given in Eq. (8), where the rescaled value is encoded as the angular cosine and the time stamp \(\left( {t_{i} } \right)\) as the radius \(r\); \(\emptyset\) is the polar coordinate angle and \(N\) is the regularization constant factor for the span of the polar coordinate system [15, 20]. On the polar plot, the radius grows with advancing time (the concentric circles), while the series values warp among the various angular points. The angular limit for the scale \(\left[ {0, 1} \right]\) is \(\left[ {0, \frac{\pi }{2}} \right]\), and for \(\left[ { - 1, 1} \right]\) is \( \left[ {0, \pi } \right]\) [15].

$$ \left\{ {\begin{array}{*{20}l} {\emptyset = \arccos \left( {\tilde{x}_{i} } \right),} \hfill & {\quad - 1 \le \tilde{x}_{i} \le 1,\;\;\tilde{x}_{i} \in \tilde{X}} \hfill \\ {r = \frac{{t_{i} }}{N},} \hfill & {\quad t_{i} \in {\mathbb{N}}.} \hfill \\ \end{array} } \right. $$
(8)
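The polar encoding of Eq. (8) can be sketched as follows. The sketch assumes time stamps \(t_{i} = 1, 2, \ldots, n\) and, by default, \(N = n\); both choices, and the name `to_polar`, are assumptions made for illustration only.

```python
import math


def to_polar(rescaled, n_const=None):
    """Map a rescaled series to (angles, radii) following Eq. (8).

    Assumes time stamps 1..n and, unless given, N = n (illustrative choice).
    """
    n = n_const if n_const is not None else len(rescaled)
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
    angles = [math.acos(max(-1.0, min(1.0, x))) for x in rescaled]
    radii = [t / n for t in range(1, len(rescaled) + 1)]
    return angles, radii


if __name__ == "__main__":
    angles, radii = to_polar([-1.0, 0.0, 1.0])
    print(angles)  # largest value maps to angle 0, smallest to the largest angle
    print(radii)
```

Because the arccosine is monotonically decreasing, larger signal values correspond to smaller angles, while the radius simply records the position in time.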

A Gramian matrix [15] is realized from the polar coordinate vectors. Two image forms of the matrix are possible, the Gramian Angular Summation Field (GASF) and the Gramian Angular Difference Field (GADF), as defined in Eqs. (9)–(12) [15, 20],

$$ {\text{GASF}} = \left[ {\cos \left( {\emptyset_{i} + \emptyset_{j} } \right)} \right], $$
(9)
$$ = \tilde{X}^{^{\prime}} \, \cdot \,\tilde{X} - \sqrt {I - \tilde{X}^{2} }^{\prime } \, \cdot \,\sqrt {I - \tilde{X}^{2} } , $$
(10)
$$ {\text{GADF}} = \left[ {\sin \left( {\emptyset_{i} - \emptyset_{j} } \right)} \right], $$
(11)
$$ = \sqrt {I - \tilde{X}^{2} }^{\prime } \, \cdot \,\tilde{X} - \tilde{X}^{\prime } \, \cdot \,\sqrt {I - \tilde{X}^{2} } , $$
(12)

where I is the unit row vector \(\left[ {1,1, \ldots ,1} \right]\).
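A minimal sketch of the two fields, assuming a series that has already been rescaled to \(\left[ { - 1, 1} \right]\). The GASF entry is computed as \(\cos \left( {\emptyset_{i} + \emptyset_{j} } \right)\) as in Eq. (9), and the GADF entry in the sine form \(\sin \left( {\emptyset_{i} - \emptyset_{j} } \right)\), which is the form that expands to the right-hand side of Eq. (12). The function name is hypothetical.

```python
import math


def gramian_angular_fields(rescaled):
    """Return (GASF, GADF) as nested lists for a series rescaled to [-1, 1]."""
    phi = [math.acos(max(-1.0, min(1.0, x))) for x in rescaled]
    # GASF[i][j] = cos(phi_i + phi_j) = x_i*x_j - sqrt(1-x_i^2)*sqrt(1-x_j^2)
    gasf = [[math.cos(a + b) for b in phi] for a in phi]
    # GADF[i][j] = sin(phi_i - phi_j) = sqrt(1-x_i^2)*x_j - x_i*sqrt(1-x_j^2)
    gadf = [[math.sin(a - b) for b in phi] for a in phi]
    return gasf, gadf


if __name__ == "__main__":
    gasf, gadf = gramian_angular_fields([-1.0, 0.0, 1.0])
    for row in gasf:
        print([round(v, 3) for v in row])
    for row in gadf:
        print([round(v, 3) for v in row])
```

The GASF is symmetric, whereas the GADF is antisymmetric with a zero main diagonal, which is why only the GASF diagonal carries the information needed to recover the rescaled series.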

Equation (13) shows how the time series can be accurately reconstructed from the GASF main diagonal [20].

$$ \cos \left( \emptyset \right) = \sqrt {\frac{{\cos \left( {2\emptyset } \right) + 1}}{2}} \quad \emptyset \in \left[ {0, \frac{\pi }{2}} \right]. $$
(13)
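Eq. (13) can be checked numerically with a short sketch. It assumes the series was rescaled to \(\left[ {0, 1} \right]\), so that every angle lies in \(\left[ {0, \frac{\pi }{2}} \right]\) and the square root is unambiguous; the helper names are hypothetical.

```python
import math


def gasf_diagonal(rescaled01):
    """Main diagonal of the GASF: cos(2*phi_i) = 2*x_i**2 - 1."""
    return [2.0 * x * x - 1.0 for x in rescaled01]


def reconstruct_from_diagonal(diagonal):
    """Invert the diagonal with Eq. (13): x_i = sqrt((cos(2*phi_i) + 1) / 2)."""
    # max(0.0, ...) guards against tiny negative values caused by rounding.
    return [math.sqrt(max(0.0, (d + 1.0) / 2.0)) for d in diagonal]


if __name__ == "__main__":
    series = [0.0, 0.25, 0.5, 1.0]  # hypothetical rescaled values
    print(reconstruct_from_diagonal(gasf_diagonal(series)))
```

The reconstruction returns the rescaled series; recovering the original amplitudes additionally requires the minimum and maximum used in Eq. (7).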

In Fig. 9, we show a typical polar plot and the respective 28 × 28 Gramian angular field images generated from the experiment PF composite signal.

Fig. 9
figure 9

PF aggregate signal transformed to image. We show a polar plot, b GASF, and c GADF generated images

The disaggregation algorithm is evaluated through training and validation/testing. To this end, we create a database composed of aggregate training and validation/testing images. We created the aggregate training dataset by adding synthetic power series data to the real validation data in the ratio 50:50. The validation data remains as only real data. Figure 10 shows aggregate training and validation images for the current parameter.

Fig. 10
figure 10

Disaggregation training and validation aggregate I_rms based images. a Training GASF, b Training GADF, c validation GASF, and d validation GADF images

The appearance of poorly outlined transformed images can be improved by applying the logarithmic transform [37] to the image values using the expression in Eq. (14).

$$ \tilde{I}_{{{\text{GASF}}}} = \log_{10} \left( {1 + I_{{{\text{GASF}}}} } \right), $$
(14)

where \(I_{{{\text{GASF}}}}\) is the GASF image after the time series polar plot transformation, and \(\tilde{I}_{{{\text{GASF}}}}\) is the log transformed image. A method of improving the image contrast is suggested in [37].
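The transform in Eq. (14) can be sketched as below. The sketch assumes the image values have already been mapped to non-negative pixel intensities, since a raw GASF entry of exactly − 1 would make the argument of the logarithm zero; that preprocessing step and the function name are assumptions, not details given in the text.

```python
import math


def log_transform(image):
    """Apply Eq. (14), log10(1 + I), element-wise to a 2-D list of values."""
    out = []
    for row in image:
        new_row = []
        for value in row:
            if value <= -1.0:
                raise ValueError("log10(1 + I) is undefined for I <= -1")
            new_row.append(math.log10(1.0 + value))
        out.append(new_row)
    return out


if __name__ == "__main__":
    # Hypothetical 2 x 2 image with intensities in [0, 1].
    print(log_transform([[0.0, 0.1], [0.5, 1.0]]))
```

The logarithm expands differences among low intensities relative to high ones, which is the usual reason for applying it to faint image detail.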

Markov transition fields

The Markov transition field (MTF) encodes the time series into quantile bins. A Markov transition matrix is produced and the MTF result is given as in [15]. The MTF captures time-series dynamics well, whereas the GAF is good at static time series transformations. However, unlike the GAF, the MTF has poor capability to reconstruct the time series from the image. This calls for a future holistic approach to our neural network image dataset creation that includes both the GAF and the MTF, as done in [15].
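For orientation, a minimal MTF sketch is given below. It is one common formulation and not necessarily the exact variant in [15]: the bin edges are taken at empirical quantiles, the transition matrix is row-normalised over consecutive samples, and entry (i, j) of the field is the transition probability from the bin of sample i to the bin of sample j. All names are hypothetical.

```python
from bisect import bisect_right


def markov_transition_field(series, n_bins=4):
    """Return (bins, transition_matrix, field) for a 1-D series."""
    ordered = sorted(series)
    n = len(series)
    # Interior bin edges at the empirical quantiles (assumed binning rule).
    edges = [ordered[(k * n) // n_bins] for k in range(1, n_bins)]
    bins = [bisect_right(edges, x) for x in series]

    # Count transitions between the bins of consecutive samples.
    counts = [[0] * n_bins for _ in range(n_bins)]
    for a, b in zip(bins, bins[1:]):
        counts[a][b] += 1

    # Row-normalise the counts into transition probabilities.
    trans = []
    for row in counts:
        total = sum(row)
        trans.append([c / total if total else 0.0 for c in row])

    # Spread the probabilities over every pair of time positions.
    field = [[trans[bi][bj] for bj in bins] for bi in bins]
    return bins, trans, field


if __name__ == "__main__":
    bins, trans, field = markov_transition_field([1, 2, 3, 4, 5, 6, 7, 8])
    print(bins)
    print(trans)
```

Since many different series share the same bin sequence, the original values cannot be recovered from the field, which illustrates the limited reconstruction capability mentioned above.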

Ground truth signals

The appliances' various activation periods define the number of images that can be produced. The steady-state operation of the refrigerator is defined between 28 and 1170 s. Figure 11 shows the refrigerator switch ON characteristics and the active microwave oven activation period.

Fig. 11
figure 11

Close up refrigerator and microwave oven operating characteristics: a refrigerator at switching “ON” and b microwave oven high power activation period

The refrigerator dataset images are created from Fig. 11 and from various data lengths up to 1170 s. Likewise, the microwave oven dataset is created from the respective microwave oven power series. In this experiment, the refrigerator plus the microwave oven on idle are ON for 120 s, whilst in the active mode, heating 100 ml of water, the microwave oven is ON for 120 s. Hence, with an image set 30 s long (approximated from the 28 s above), we can realize a minimum of four microwave ON images. These images are the training labels in the supervised learning case. Figure 12 shows typical refrigerator and microwave oven I_rms and PF GASF and GADF ground truth images.

Fig. 12
figure 12

Refrigerator and microwave oven ground truth signal images. a Refrigerator PF GASF, b refrigerator PF GADF, c microwave oven PF GASF, d microwave oven PF GADF, e refrigerator I_rms GASF, f refrigerator I_rms GADF, g microwave oven I_rms GASF, and h microwave oven I_rms GADF

Performance metrics

We use the receiver operating characteristic (ROC) curve to evaluate our classification performance. The area under the curve (AUC) of the ROC gives the probability of correctly discriminating between two entities. The AUC in Fig. 13 is interpreted as [38, 39]

  • 0.5–0.6 failed,

  • 0.6–0.7 poor,

  • 0.7–0.8 fair,

  • 0.8–0.9 good, and

  • 0.9–1 excellent discrimination.

Fig. 13
figure 13

Information from two ROC curve representations with true positive rate (P (TP)) and false positive rate (P(FP)) axis. a Comparing ROC Curves [39], and b Diagonal line, empirical and Gaussian (solid line) curves on the ROC [38]

Multiclass classification can also be achieved through the ROC curve [40].
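The probabilistic reading of the AUC and the interpretation bands listed above can be sketched as follows. The AUC is computed here as the fraction of (positive, negative) score pairs that are ranked correctly, with ties counted as one half. Assigning a boundary value such as 0.8 to the higher band is an assumption, as are the function names and scores.

```python
def auc_from_scores(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos_scores) * len(neg_scores))


def interpret_auc(auc):
    """Map an AUC value to the qualitative bands used in the text."""
    bands = [(0.9, "excellent"), (0.8, "good"), (0.7, "fair"),
             (0.6, "poor"), (0.5, "failed")]
    for lower, label in bands:
        if auc >= lower:
            return label
    return "below chance"


if __name__ == "__main__":
    auc = auc_from_scores([0.9, 0.4], [0.5, 0.1])  # hypothetical scores
    print(auc, interpret_auc(auc))
```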

We complement the ROC classification metrics with accuracy, precision, recall, F-measure [13], and the confusion matrix. These metrics are defined as,

$$ {\text{Recall}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}}, $$
(15)
$$ {\text{Precision}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}}, $$
(16)
$$ {\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FP}} + {\text{TN}} + {\text{FN}}}}, $$
(17)
$$ F{\text{ - measure}}\left( {F_{1} } \right) = \frac{{2 \times {\text{Precision}} \times {\text{Recall}}}}{{{\text{Precision}} + {\text{Recall}}}}, $$
(18)

where TP is true positives, FP is false positives, FN is false negatives and TN is true negatives.

Accuracy is the proportion of correctly predicted outcomes in relation to the total outcomes in a sample. Say we have 480 samples in a negative class and 20 samples in a positive class, and a model that assigns every sample to the negative class. Then 480 correct outcomes over 500 total outcomes give an accuracy of 96%. This translates to TN = 480, FN = 20, TP = 0 and FP = 0. A classification model trained on such unbalanced data may give a high accuracy in favor of the higher sample count; hence accuracy on its own does not provide a good measure of the model's performance. Precision and recall determine how well the true positives are identified by the model. From Eqs. (15) and (16), the preferred values of precision and recall are unity. Hence, precision and recall are the preferred classification metrics for obtaining the classification outcomes we want. This takes us to the F-measure in Eq. (18). The F1 score, which combines precision and recall, gives a better representation of the performance of the model in terms of providing the right classification. The preferred value of the F1 score is unity. The confusion matrix gives a summary of the expected against the predicted outcomes.
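The unbalanced example above can be reproduced with a short sketch of Eqs. (15)–(18). Returning `None` when a denominator is zero is a choice made here for illustration; the function name is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute recall, precision, accuracy and F1 from confusion counts."""
    def ratio(num, den):
        return num / den if den else None  # undefined when den == 0

    recall = ratio(tp, tp + fn)                      # Eq. (15)
    precision = ratio(tp, tp + fp)                   # Eq. (16)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)     # Eq. (17)
    if precision is None or recall is None or precision + recall == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)  # Eq. (18)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}


if __name__ == "__main__":
    # Counts from the unbalanced example: every sample predicted negative.
    print(classification_metrics(tp=0, fp=0, tn=480, fn=20))
```

For these counts the accuracy is 0.96 while the recall is zero and the precision and F1 are undefined, which makes the weakness of accuracy on unbalanced data explicit.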

To evaluate the disaggregation performance we use the binary cross entropy (BCE) loss [41]. The cross-entropy loss (CE) is given as

$$ {\text{CE}} = - \mathop \sum \limits_{i = 1}^{w} z_{i} \log \left( {q_{i} } \right), $$
(19)

where w is the number of classes, \(q_{i}\) is the predicted probability of class i and \(z_{i}\) is the true probability of class i. The CE can be interpreted as the negative log-likelihood of \(z_{i}\) given a function \( q_{i}\). The BCE for two classes is then given in Eq. (20).

$$ {\text{BCE}} = - \mathop \sum \limits_{i = 1}^{w = 2} z_{i} \log \left( {q_{i} } \right) = - z_{1} \log \left( {q_{1} } \right) - \left( {1 - z_{1} } \right)\log \left( {1 - q_{1} } \right). $$
(20)
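A minimal sketch of Eq. (20), together with its average over the pixels of an image pair, is given below. The natural logarithm and the small clipping constant `eps` are assumptions (the text does not state the base of the logarithm), and the function names are hypothetical.

```python
import math


def binary_cross_entropy(z1, q1, eps=1e-12):
    """Eq. (20) for one value: true probability z1, predicted probability q1."""
    q1 = min(max(q1, eps), 1.0 - eps)  # keep log() away from 0
    return -z1 * math.log(q1) - (1.0 - z1) * math.log(1.0 - q1)


def mean_bce(target, predicted):
    """Average BCE over two equally long flat lists of pixel values."""
    losses = [binary_cross_entropy(z, q) for z, q in zip(target, predicted)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    print(binary_cross_entropy(1.0, 0.9))  # confident and correct: small loss
    print(binary_cross_entropy(1.0, 0.1))  # confident and wrong: large loss
    print(mean_bce([1.0, 0.0], [0.8, 0.3]))
```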

The kappa index, which represents the level of agreement between two raters, is defined in the range [− 1, 1]. A value of − 1 indicates no agreement at all, 0 indicates chance agreement and 1 indicates perfect agreement. The code based on the confusion matrix in [42] was taken as the basis for formulating the kappa index calculations, which have been included in our Results.
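For illustration, Cohen's kappa can be computed from a confusion matrix as sketched below. This is a generic formulation and not the code from [42]; the function name and the example matrices are hypothetical.

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows true, columns predicted)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the main diagonal.
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / (total * total)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    print(cohen_kappa([[4, 0], [0, 4]]))  # perfect agreement
    print(cohen_kappa([[2, 2], [2, 2]]))  # chance-level agreement
    print(cohen_kappa([[0, 4], [4, 0]]))  # complete disagreement
```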

Description of proposed method

Pseudo-code for proposed method

In the proposed method we verify the performance of an image-based NILM disaggregation and classification scheme of five appliances from the aggregate mains supply. We use an image-based denoising autoencoder for the disaggregation and a ConvNet VGG architecture for the classification of the denoised appliance signature images. In our method, we identify five image-based disaggregation networks and one image-based multi-class classification network. This translates to two main algorithms in our method, given by Pseudocode 1 and Pseudocode 2 for the disaggregation and classification respectively. We also compare the classification performance of our system to that of a power-series-based system using the same ConvNet VGG model.

figure a
figure b

Model architectures

Encoding

The encoder model is made up of three 2-D CNN layers, each with filters of dimensions 3 × 3 and each followed by a ReLU activation function and a 2-D max pooling operator of dimensions 2 × 2. The first CNN layer accepts the aggregate image input of shape \(28 \times 28 \times 3\) and has 64 filters. The second CNN layer accepts the 2-D max pooling output of the first CNN layer and has 32 filters. The third CNN layer accepts the 2-D max pooling output of the second CNN layer and has 16 filters; its 2-D max pooling output is the encoded output.


Decoding

The decoder model is made up of three 2-D CNN layers, each with filters of dimensions 3 × 3 and each followed by a ReLU activation function and a 2-D up sampling operator of dimensions 2 × 2. The first CNN layer accepts the encoded input from the encoder and has 16 filters. The second CNN layer accepts the 2-D up sampling output of the first CNN layer and has 32 filters. The third CNN layer accepts the 2-D up sampling output of the second CNN layer and has 64 filters. The output is a CNN layer that accepts the 2-D up sampling output of the third CNN layer and has three filters, each of dimensions 3 × 3, acted upon by a sigmoid activation function.

The encoding and decoding models used the Adam optimizer and the binary_crossentropy loss function with early stopping based on the minimum validation loss.
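The spatial sizes implied by three pooling stages followed by three up sampling stages can be traced with a small sketch. The text does not state the padding used, so the sketch assumes that every 3 × 3 convolution preserves the spatial size and shows both common pooling conventions; these assumptions and the function names are illustrative only.

```python
import math


def pool(size, same_padding):
    """Output size of a 2 x 2 max pooling stage with stride 2."""
    return math.ceil(size / 2) if same_padding else size // 2


def trace_autoencoder(size=28, same_padding=True, n_stages=3):
    """Spatial size after each pooling stage and each 2 x 2 up sampling stage.

    Assumes size-preserving convolutions (an assumption, not stated in the text).
    """
    trace = [size]
    for _ in range(n_stages):          # encoder: conv + ReLU + max pooling
        size = pool(size, same_padding)
        trace.append(size)
    for _ in range(n_stages):          # decoder: conv + ReLU + up sampling
        size *= 2
        trace.append(size)
    return trace


if __name__ == "__main__":
    print(trace_autoencoder(28, same_padding=True))
    print(trace_autoencoder(28, same_padding=False))
```

Under these assumptions the decoder ends at 32 or 24 pixels per side rather than 28, so some further adjustment is needed to match the input size. One possibility is to leave one decoder convolution unpadded, which takes 16 to 14 before the final up sampling; whether this was done here is not stated.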


ConvNet classification

The ConvNet classification model is made up of three 2-D CNN layers, followed by a flatten operation, and finally by two fully connected layers. Each CNN layer has filters of dimensions 3 × 3 and is followed by a ReLU activation function. The first CNN layer accepts the disaggregated appliance image inputs of shape \(28 \times 28 \times 3\) and has 8 filters; its ReLU output is operated on by a 2-D max pooling operator of dimensions 2 × 2. The second CNN layer accepts the 2-D max pooling output of the first CNN layer and has 16 filters; its ReLU output is likewise operated on by a 2-D max pooling operator of dimensions 2 × 2. The third CNN layer accepts the 2-D max pooling output of the second CNN layer and has 64 filters. A Flatten layer then operates on the 2-D max pooling output of the third CNN layer. The flattened output is input into a fully connected dense layer with 16 neurons, followed by a ReLU activation function and a dropout factor of 0.25. The output of this dense layer is fed into another fully connected layer with N output neurons (N = 2 for binary classification and N = 4 for the classification of four appliances), followed by a softmax activation function for the class output. Alternatively, the sigmoid activation function, which outputs probability values instead of class values, can be used for the classification. The classification model used the RMSprop optimizer with a learning rate of 0.001.

(The Adam optimizer can be used with the sigmoid function).


Power series classification model for comparison

The 1-D CNN classification model is made up of five 1-D CNN layers, followed by a single output dense layer. Every CNN layer has filters of dimension 1 and is followed by a ReLU activation function. The first CNN layer accepts the disaggregated appliance power series inputs of shape (4, 1) and has eight filters. The second CNN layer accepts the output of the first CNN layer and has 16 filters; its ReLU output is operated on by a 1-D max pooling operator of dimension 1. The third and fourth CNN layers each have 16 filters, and the ReLU output of each is operated on by a 1-D max pooling operator of dimension 1, followed by a dropout factor of 0.5. The fifth CNN layer has 16 filters, and its ReLU output is operated on by a 1-D GlobalAveragePooling layer, followed by a dropout factor of 0.25. The final output layer is a dense layer with 1 neuron and a sigmoid activation. The power series model used the RMSprop optimizer with the hyperparameter settings lr = 0.001, ρ = 0.9, ε = None, and decay = 0.0.

A comparison between the encoding, decoding, ConvNet classification and power series models is shown in Table 1.

Table 1 Comparison between the encoding, decoding, ConvNet classification and power series models

Training framework and procedure

With reference to the architecture of the CNN, we ran additional experiments in which we changed the learning rate, the type of optimizer, the number of epochs, the number of layer neurons and the batch size. In the sdAE, we started with three CNN layers in the encoder having 1024, 512, and 256 filters (neurons), respectively. In the decoder, we had three CNN layers with 256, 512, and 1024 filters (neurons), respectively. Increasing the number of CNN layers above three for the encoder and decoder did not provide any noticeable improvement in the performance of the model. However, a decrease below three layers did adversely affect the performance of the model. We trained the sdAE so that the binary cross-entropy (BCE) loss function was minimized between the disaggregated output image and the ground truth appliance image. The BCE loss gives the error used in the weight and bias updates. To address overfitting we used the Adam algorithm with early stopping and a learning rate that varied from 1e−5 to 0.01. We initially started with thirty epochs but the model showed no noticeable convergence. We gradually increased the epoch count to 350 and incorporated early stopping based on the minimum validation loss with a patience of 10. The model achieved acceptable disaggregation results for learning rates of 0.01 and 0.001 with a training batch size of one and early stopping at 200 epochs. We experimented with the Adadelta, Adamax, Adam and Adagrad optimizers, and the Adam optimizer provided the best convergence results. When we gradually increased the number of CNN layer neurons above the ones that we had initially specified, we obtained overfitting with increased program running time.
Consequently, we gradually reduced the number of CNN layer neurons until we obtained the best results when the encoder CNN layers had 64, 32, and 16 filters (neurons), respectively. In the decoder, the best disaggregation results were obtained when the CNN layers had 16, 32, and 64 filters (neurons), respectively. With this new CNN filter count, the microwave disaggregation achieved its best results at a learning rate of 0.01. However, the refrigerator and lamps disaggregation achieved their best results at a learning rate of 0.001. For a batch size of 1 and 120 epochs the program running time was also considerably reduced. Increasing the batch size reduced the performance of the model. Also, to speed up learning (faster execution time) and lower the usage of system memory, we set our initial batch size to 1 (online learning). In this case, the network weights are updated after each training instance. In all our experimentation, to account for non-linearity, we introduced the ‘relu’ non-linear function into the convolution process.
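The early stopping rule with a patience of 10 described above can be sketched independently of any deep learning framework. The function below scans a list of per-epoch validation losses and reports the epoch at which training would stop and the epoch with the minimum validation loss; the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once `patience` consecutive epochs fail to improve on the
    lowest validation loss seen so far.
    """
    best = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    # Patience never exhausted: training runs through the last epoch.
    return len(val_losses) - 1, best_epoch


if __name__ == "__main__":
    losses = [1.0, 0.8, 0.7] + [0.75] * 20  # hypothetical validation losses
    print(early_stopping_epoch(losses, patience=10))
```

In practice the weights from the best epoch are usually the ones kept, since the epochs run during the patience window did not improve the validation loss.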

Our classification model follows the set structure of the ConvNet VGG high rate image classifier with defined \(3 \times 3\) filter dimensions. However, we had to limit the number of layers to three and reduce the number of CNN neurons to 16, 32, and 64 for the respective layers. Too many CNN layer neurons resulted in overfitting and too few resulted in underfitting. The dropout was varied from 0.25 to 0.5. The output activation function was set to the multi-class ‘softmax’ function. We experimented with various optimizers, including RMSprop, Adam and stochastic gradient descent (SGD), for learning rates that varied from 0.001 to 0.00001. To realize the 1D CNN power series model, we experimented with filter numbers in the range 4 to 128 and found the best results for five CNN layers with 8, 16, 16, 16, and 16 filters (neurons) per layer from the input, respectively. We experimented with dropouts from 0.25 to 0.5. The third and fourth CNN layers from the input were each followed by a regularization dropout factor of 0.5, and the fifth CNN layer by a dropout factor of 0.25. The output activation function of the 1D CNN model was set to the ‘sigmoid’, which can be used for both regression and classification analysis. The power series model used the RMSprop optimizer with the hyperparameter settings lr = 0.001, ρ = 0.9, ε = None, and decay = 0.0. The image datasets for all the models were produced as \(400 \times 400\) images that were then resized to \(28 \times 28\) normalised images before input into the CNN.

Discussion of experimental results

The models are first trained using the I_rms image Dataset A and then trained using the PF Dataset B. Both datasets are split into train and validation data in the ratio 3:1. However, the test data varies from as little as one image to a total of five images.

Dataset A

The disaggregation was simulated using the autoencoder image to image regression code idea in [43]. In this section, we present the results of the disaggregation based on Fig. 3 that uses the denoising autoencoder. We present the disaggregated microwave, refrigerator and L2 lamp target load signals. We are also able to extend to the disaggregation of the other appliances namely L1 and L3, when their respective power series signals are transformed to image. In Fig. 14, we show the target refrigerator ground truth and disaggregated images and their respective power series representation. During training, a learning rate of 0.01 produced a blank predicted image, but the result was satisfactory for a learning rate of 0.001.

Fig. 14
figure 14

Disaggregation of the refrigerator I_rms parameter. a Ground truth signal image, b disaggregated image, c ground truth signal, and d disaggregated signal

The results in Fig. 14 show that we are able to successfully disaggregate (predict) the refrigerator I_rms from the complex aggregate mains signal. The image features rather than the color define the signal. In Fig. 2 we have shown the convolution and feature extraction where the colour is represented by varying shades of white to grey to black. So as far as the classification is concerned this predicted image is classified as a refrigerator. Figure 15 gives the BCE train loss characteristics for the refrigerator I_rms disaggregation model.

Fig. 15
figure 15

BCE loss plot for the I_rms disaggregation

Dataset B

In Fig. 16 we show the target microwave oven image, the aggregate image for all the appliance activations and the predicted (disaggregated) microwave oven image for an Adam learning rate of 0.001.

Fig. 16
figure 16

Disaggregation of MW load signal from aggregate for an adam learning rate of 0.001. a Target MW load image without noise, b aggregate image with noise, and c disaggregated microwave oven signal image

The results in Fig. 16 show that we are able to successfully disaggregate (predict) the microwave oven from the complex aggregate mains signal. Unlike in the I_rms dataset, in this dataset there is an improvement in the disaggregation output as the learning rate approaches 0.01, as shown in Fig. 17. Figure 17 shows that the disaggregated image is identical to the target image for that load equipment.

Fig. 17
figure 17

Disaggregation of MW load signal from aggregate for an adam learning rate of 0.01. a Target MW load image without noise, b aggregate image with noise, and c disaggregated MW signal image

It was observed that as the learning rate is decreased to 1e−5 the disaggregation performance also decreased significantly until there was no recognition at all. In Fig. 18, we obtain a further decrease in the binary cross-entropy loss function as the learning rate is increased from 0.001 to 0.01. An increase in learning rate means that the loss function decreases faster towards a minimum. However, due to the erratic behaviour of the parameter updates, local minima might not be achieved. Very low learning rates cause the loss function to stagnate, whilst very high learning rates can cause divergence (increase) in the loss function.
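The three learning-rate regimes described above can be illustrated on a toy problem. The sketch below runs plain gradient descent on \(f(w) = w^{2}\); it is not the autoencoder or the Adam optimizer used in the experiments, and the learning rates are chosen only to expose the behaviour on this toy function.

```python
def gradient_descent(lr, steps=50, w0=1.0):
    """Minimise f(w) = w**2 (gradient 2*w) and return the final loss."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * w
    return w * w


if __name__ == "__main__":
    for lr in (1e-5, 0.1, 1.1):
        print(lr, gradient_descent(lr))
```

With the smallest rate the loss barely moves from its starting value of 1, with the moderate rate it falls close to zero, and with the largest rate each update overshoots the minimum so that the loss grows.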

Fig. 18
figure 18

MW Binary cross entropy loss for adam learning rate of 0.01

We then evaluated the autoencoder model on disaggregating the second load which is the refrigerator. Figure 19 shows that our developed model is able to disaggregate the second appliance from the same aggregate image as the previous load appliance with very high accuracy. The cross-entropy plot in Fig. 20 consolidates the high disaggregation capability of the network on the second load appliance.

Fig. 19
figure 19

Disaggregation of RF load signal from aggregate for an adam learning rate of 0.01. a Target RF load image without noise, b aggregate image with noise, and c disaggregated RF signal image

Fig. 20
figure 20

RF binary cross entropy loss for adam learning rate of 0.01

In the third case, we evaluated the autoencoder model on disaggregating the LED mains lamp (L2) load. Once again, Fig. 21 shows that our developed model is able to disaggregate the third appliance from the same aggregate image as the previous load appliances with high accuracy. The diagram in Fig. 21 shows switching bars around the image. However, there is a slight loss in detail at the upper right-hand corner of the predicted image. Nonetheless, the predicted image is a true representation of the target image, as attested by the stable cross-entropy plot in Fig. 22.

Fig. 21
figure 21

Disaggregation of L2 load signal from aggregate for an adam learning rate of 0.01. a Target L2 load image without noise, b aggregate image with noise, and c disaggregated L2 signal image

Fig. 22
figure 22

L2 Binary cross entropy loss for adam learning rate of 0.01

Dataset B recognition performance

The initial model development entry point is based on 30 training images, 8 validation images and 8 test images belonging to the two classes of refrigerator and microwave oven. Based on only one input channel of PF, the model achieved a 100% model evaluation capability and was able to accurately classify the eight test images that had not been seen before, where class (0) is fridge and class (1) is microwave oven. The ROC plot is shown in Fig. 23.

Fig. 23
figure 23

ROC curve for PF RF and MW appliance classification

The corresponding confusion matrix for the ROC plot above is shown in Fig. 24. The confusion matrix shows that all the eight test samples are accurately classified. The precision, recall and F1 score values are all equal to unity implying a perfect classifier.

Fig. 24
figure 24

Confusion matrix for PF RF and MW appliance classification

We compare the proposed image-based model to a one-dimensional power series convolutional neural network (Conv1D) model based on 144 training samples (with a validation split of 0.2) and 40 test samples belonging to the two classes of the refrigerator and microwave oven. Based on only one input channel of PF, the model achieved a 100% model evaluation capability and was able to accurately classify the forty samples that were seen before where class (0) is fridge and class (1) is microwave oven. The ROC plot is shown in Fig. 25. Again here the precision, recall and F1 score values are all equal to unity showing that this is also a good classification model.

Fig. 25
figure 25

ROC curve for PF RF and MW appliance classification in power series recognition for data

The respective confusion matrix related to the ROC plot information in Fig. 25 is shown below in Fig. 26.

Fig. 26
figure 26

Confusion matrix for PF RF and MW appliance seen power series data classification

The situation changes when we test the power series model on data unseen during training. We obtain the performance ROC plot in Fig. 27 and the confusion matrix in Fig. 28, with the resulting precision, recall and F1 score values all equal to eighty per cent (0.8), which is an average classification result. The results show that our proposed image-based model achieves higher performance than the univariate power series model, which achieves eighty per cent recognition for unseen test data. By implementing a multivariate power series recognition, whether by fusion techniques or otherwise, we can investigate whether the performance of the power series method improves. If it does, we would have used more data points than in our proposed method. Power series redundancies can contribute to the lower performance of the recognition network. Redundancies are not a major factor in the image recognition system, as the redundant images overlap into one during the power series to image transformation.

Fig. 27
figure 27

ROC curve for PF RF and MW appliance classification in power series recognition for unseen data

Fig. 28
figure 28

Confusion matrix for PF RF and MW appliance unseen power series data classification

Comparative evaluation of our model based on the parameter type

A NILM recognition system can be developed using various approaches to inputting data into the model. Primarily, our data is acquired as individual univariate power series. We can then feed the data into the neural network as a stream of one power series or as parallel data in what is commonly termed the multivariate approach. The parallel data can also be fused to produce one composite data stream into the neural network. The multivariate or fusion approach has the advantage of providing more recognition features at the expense of more data handling and more data storage capacity. The univariate approach is simpler and has lower memory requirements, but has the disadvantage of providing fewer recognition features for the deep learning algorithm. Nonetheless, in this paper, to conserve memory and reduce data handling, we used the univariate approach. Hence, we generate the required images from a single power series at a time and use this image in the designed recognition system. It is necessary to assess the developed model's response to each signal image parameter. To this end, we evaluate the performance of the recognition model on the different parameters. The performance of a particular model on specific data can be improved by considering such aspects as transfer and ensemble learning. However, we do not consider these approaches here. Although we achieved excellent signature disaggregation, the binary classification model results in Table 2 show a need to improve the classification model design, as explained in the last part of this results section.

Table 2 Comparative evaluation of model performance on different signal parameters generated images

The results in Table 2 show that all three parameters can successfully be used in the image-based NILM recognition system developed here. The recognition based on the power signal parameter although somewhat less than that for PF produces acceptable average performance. In general, the current and power-based parameters provide a more interpretable outcome, since it is easier to tell that the magnitude is high or low. On the other hand, PF is a more abstract energy efficiency measure parameter. The recognition model was trained with an RMSprop optimizer having a learning rate of 0.00001 arrived at through experimentation. The current accuracy and loss plots were noisy as shown in Fig. 29. In Fig. 30 we show the confusion matrix and the ROC plot for the Watt parameter.

Fig. 29
figure 29

I_rms performance plots. a Model accuracy, and b categorical crossentropy loss

Fig. 30
figure 30

Watt performance diagrams. a ROC, and b confusion matrix

By testing unseen data, the classification results show that our proposed model outperforms the power series based model with a dataset (Dataset B) of fewer image inputs as given by the kappa index in Table 3. The agreement between the raters is higher in the image-based system than for the power series system.

Table 3 AUC and Kappa index (k) values for the image and time series models

By carrying out more simulations and comparisons we were able to improve the results in Table 2 to those shown in Table 4 for I_rms based classification of four appliances.

Table 4 Classification report for the recognition of the four appliances

Fig. 31 shows the training and validation characteristics of the designed classification model for the refrigerator (RF), microwave oven (MW), 12 W CFL mains lamp (L1), and 5 W LED mains lamp (L2).

Fig. 31

Categorical crossentropy loss for the recognition of four appliances based on image disaggregation

We obtained 100% classification for three of the four appliances' disaggregated images, as shown in the confusion matrix of Fig. 32. The poor recognition of the LED lamp (L2) could be attributed to a system design that does not yet account for the low signal level of this appliance, which may be treated as noise by the system.

Fig. 32

Confusion matrix for the recognition of four appliances based on image disaggregation
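The way per-class and overall figures are read off a confusion matrix can be sketched as follows. The counts are hypothetical and are not the values in Fig. 32. They were chosen only to show how perfect recognition of three balanced classes and a missed fourth class combine into an overall rate of 75%. The assumption that every L2 sample is assigned to L1 is likewise illustrative.

```python
import numpy as np

labels = ["RF", "MW", "L1", "L2"]

# Rows: ground truth, columns: prediction. Hypothetical balanced counts.
cm = np.array([
    [10,  0,  0,  0],   # RF
    [ 0, 10,  0,  0],   # MW
    [ 0,  0, 10,  0],   # L1
    [ 0,  0, 10,  0],   # L2: every sample assigned to L1 (assumed)
])

true_pos = np.diag(cm).astype(float)
recall = true_pos / cm.sum(axis=1)

# Guard against classes that are never predicted (zero column sum).
predicted = cm.sum(axis=0)
precision = np.divide(true_pos, predicted,
                      out=np.zeros_like(true_pos), where=predicted > 0)

denom = precision + recall
f_measure = np.divide(2 * precision * recall, denom,
                      out=np.zeros_like(true_pos), where=denom > 0)

accuracy = np.trace(cm) / cm.sum()

for name, p, r, f in zip(labels, precision, recall, f_measure):
    print(f"{name}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
print(f"overall accuracy={accuracy:.2f}")
```

In this illustration the missed class also lowers the precision of the class that absorbs its samples, which is why precision and F-measure are reported alongside recall.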

The results show that an entirely image-based NILM mains load status recognition system can be implemented successfully with acceptable results. This considerably reduces the model dataset and the pre-processing of raw data needed before input into the neural network. In the classification model, we achieved acceptable values of accuracy, recall, precision and F-measure, and an overall appliance recognition rate of 75%. This is a good recognition rate given that our simulation platform did not allow substantially deeper models to be simulated. The disaggregation performance plots show the stability of the autoencoder model: the training and validation losses decrease together, as expected of a good model. In the classification, we used balanced data to obtain a stable recognition model for the four appliances. To assess the quality of the disaggregation, we reconstructed the refrigerator I_rms disaggregated signal from the diagonal of the Gramian image matrix and found that it closely resembles the refrigerator I_rms ground truth signal, as shown in Fig. 14.
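The reconstruction from the image diagonal can be sketched as below. The sketch assumes the summation form of the Gramian Angular Field with the series rescaled to [0, 1], which this section does not state explicitly, and it uses a synthetic sine signal in place of the refrigerator data. The function names are hypothetical.

```python
import numpy as np


def rescale_unit(x):
    """Min-max rescale a series to [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())


def gasf(x_scaled):
    """Gramian Angular Summation Field: G[i, j] = cos(phi_i + phi_j)."""
    phi = np.arccos(np.clip(x_scaled, -1.0, 1.0))
    return np.cos(phi[:, None] + phi[None, :])


def reconstruct_from_diagonal(image):
    """Invert the diagonal G[i, i] = cos(2*phi_i) = 2*x_i**2 - 1.

    The square root is unambiguous only because x lies in [0, 1].
    """
    return np.sqrt(np.clip((np.diag(image) + 1.0) / 2.0, 0.0, 1.0))


# Synthetic stand-in for an I_rms series (assumed, not measured data).
signal = 1.5 + 0.5 * np.sin(np.linspace(0.0, 4.0 * np.pi, 32))

image = gasf(rescale_unit(signal))
recovered_scaled = reconstruct_from_diagonal(image)

# Undo the min-max scaling; this needs the original minimum and maximum.
span = signal.max() - signal.min()
recovered = recovered_scaled * span + signal.min()

print(image.shape, float(np.max(np.abs(recovered - signal))))
```

The diagonal recovers only the rescaled series, so the original minimum and maximum must be stored alongside the image if the amplitude is to be restored.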

Conclusion

The research objective of designing an image-based NILM recognition system has been achieved in this paper. First, we have simplified the extraction of appliance signal features by adopting an image-based deep learning method that extracts features on its own. Second, by basing the recognition system on a computer vision approach with a large input receptive field, we increase the breadth and depth of the features that can be extracted without much need for data preprocessing. The immediate outcomes of this approach are that the direct power series method is dispensed with, the dataset is reduced, and the resulting high-performance system is easier to handle. On the constrained CPU platform on which we ran our simulations, we show that all the appliance parameters are capable of achieving acceptable NILM appliance recognition performance. With a more detailed model design, however, it is possible to achieve higher recognition performance.

In this paper, we obtain a rich set of localized aggregate and load features for more accurate NILM recognition by transforming the power series into an image using the Gramian Angular Fields technique. For recognition, we used a two-dimensional convolutional neural network, which is well adapted to computer vision applications and has very strong image feature extraction and detection capabilities. The deep convolutional neural network is configured for both image classification and disaggregation. The disaggregation takes the form of an image-based denoising autoencoder model, which lets us harness the powerful denoising capabilities of the autoencoder to arrive at an effective disaggregation method within the NILM recognition scheme. Our models perform excellent image-based disaggregation and acceptable classification. Finally, we compare the performance of our proposed recognition system with that of a one-dimensional power series convolutional neural network recognition system. The results show that our proposed method achieves acceptable performance.

In future, to reduce power series noise and improve recognition, we will consider various sensor (information) fusion techniques, including the Kalman filter and fuzzy fusion, as well as image fusion of the various signal parameter images. We also need to investigate few-shot learning as a means of increasing the performance and reducing the dataset of the NILM image-based recognition system.