Introduction

In recent years, advances in information technology (computers, the Internet, software) and in sensor technology have made it possible for almost everything to be digitized, and big data has become a hot research topic across many industries [1]. Big data is generated in fields such as business applications (e.g. marketing), social networking, science, and smartphone applications. As a result, greater computing power is increasingly necessary in the area of data mining [2].

Image comparison is widely used in different fields, such as shape matching [3] and registration [4]. However, comparing binary images that represent the same content is not easy, because an image can undergo transformations such as rotation, translation and resolution change. In data mining, there are many frequent pattern (FP) mining algorithms, e.g. FP-growth [5], CFP-growth [6], MISFP-growth (Multiple Item Support Frequent Patterns) [7] and Apriori [8]. A comparison of FP mining algorithms is given in [9]. Apriori, proposed by R. Agrawal and R. Srikant in 1994, is the best-known basic algorithm and one of the most important algorithms for retrieving frequent itemsets. For large databases, there exist the algorithms FTWeightedHashT [10] and MISFP-growth [9]. Apriori essentially requires two parameters: minimum support and minimum confidence [11, 12]. Most commonly used association rule discovery algorithms rely on the frequent itemset strategy exemplified by the Apriori algorithm [13]. The Apriori algorithm searches for large itemsets during an initial pass over the database. The result of this pass is then used as the basis for discovering further large itemsets in subsequent passes. Itemsets whose support is above the minimum threshold are called large itemsets, and the others are called small itemsets [14]. The algorithm relies on the large itemset property, which states that every subset of a large itemset is large, and that if an itemset is not large then none of its supersets is large [8].
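To make the level-wise search and the large itemset property concrete, the following is a minimal Python sketch of Apriori. It is an illustration only, not the implementation used in this paper: the function name `apriori`, the use of an absolute support count for `min_support`, and the toy transactions are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    min_support is an absolute count (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Support = number of transactions that contain the whole candidate.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Pass 1: frequent itemsets of size 1.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: s for c, s in count(items).items() if s >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = {c: s for c, s in count(candidates).items() if s >= min_support}
        result.update(frequent)
        k += 1
    return result


# Toy example: four transactions over the items 1, 2, 3.
found = apriori([{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3}], min_support=3)
```

In this toy run the candidate {1, 2, 3} is discarded in the prune step without being counted, because its subset {1, 3} is not frequent.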

In this paper, we propose a novel method to compare binary images based on the Apriori algorithm. We transform the image into a set of items and transactions (see Fig. 1); that is, we regard the image as a set of transactions and items. The Apriori algorithm is applied along the rows and along the columns of the image:

  1. Apriori algorithm along rows: the rows of the image are considered as transactions and the columns of the image are considered as items.

  2. Apriori algorithm along columns: the rows of the image are considered as items and the columns of the image are considered as transactions.
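The two views of the image can be sketched as follows. This is a minimal illustration, not the code used in this paper; the function name `image_to_transactions`, the `along` parameter, and the convention that a non-zero pixel means "item present" are assumptions.

```python
import numpy as np


def image_to_transactions(image, along="row"):
    """Turn a binary image into a list of transactions (sets of item indices)."""
    img = np.asarray(image) != 0  # assumption: non-zero pixel = item present
    if along == "column":
        # Columns become transactions and rows become items.
        img = img.T
    elif along != "row":
        raise ValueError("along must be 'row' or 'column'")
    # Each transaction lists the indices of the items whose pixel is set.
    return [set(np.flatnonzero(line).tolist()) for line in img]


image = [[1, 0, 1],
         [0, 0, 1]]
by_rows = image_to_transactions(image, along="row")
by_columns = image_to_transactions(image, along="column")
```

For this 2 × 3 example, the row view gives two transactions over three items, and the column view gives three transactions over two items.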

The rest of the paper is organized as follows. The “Related works” section discusses related work. In the “Methodology” section, we explain the proposed approach. The “Results and discussions” section presents the experimental results and their discussion. Finally, the conclusion is drawn and future work is described, followed by the references.

Fig. 1 An example of Apriori execution for a binary image. a Binary image, b Apriori algorithm along rows, c Apriori algorithm along columns

Related works

Different types of techniques are currently used in the domains of big data analytics and image processing [15,16,17,18,19]. The integration and interaction of these two broad fields need more attention in order to better explore and exploit the benefits of both. Vahini et al. [20] presented a state-of-the-art review of image processing and big data analytics; their study focuses on recent research progress in the two fields to highlight the importance of their interaction and integration. A novel workflow based on big data analytics for biomedical image classification is proposed in [21]. It is composed of two structures: the first is based on Spark and the second on the Hadoop framework. In [22], the authors present an efficient, fast-response content-based image retrieval (CBIR) framework based on Hadoop that consists of a set of modules working in two layers. Creating a content-based image retrieval system for big data that operates efficiently on real-time data is regarded as a critical challenge.

There are many methods for image comparison. Recent papers review the available tools and techniques for image processing and comparison [23,24,25]. Combined with big data, these tools can provide user-friendly solutions to industrial problems [26]. Several approaches to comparing binary images have been proposed in the literature [27]. Some are based on the comparison of image descriptors and others on the direct comparison of images. In direct approaches, the images are compared directly at the pixel level. Direct comparisons between binary images are difficult to implement because, unlike in a color image, the pixels of a binary image carry little information: their value is either 0 or 1. Direct comparisons can be used to assess noise and segmentation errors [28] and to compare images of face contours in order to evaluate their dissimilarity [29,30,31]. Most are based on an aggregate measure and provide a scalar measure of similarity (a positive real number); others are based on a local study (as for grayscale images) [32] and provide a two-dimensional similarity index. Indirect methods rely on image processing to represent the image in a new space. Two-dimensional geometric moments, which provide descriptors for recognition, have many applications in image processing. In robotics, they are used for motion search and for guidance in registration; they are also used in image processing to solve problems such as scene matching [33] and character recognition [34]. Boundary-based methods, such as Fourier descriptors, space decomposition, contour curvature scale and principal component decomposition of the outline, have many applications, especially in shape processing [35]. Boundary-based methods give a robust descriptor for a shape, but they are not suitable for a complex image that contains many details.

Transforms of the entire image: global transforms (Fourier and wavelet transforms) are unsuitable for the processing of digital images, as these are composed of discontinuities. Non-linear filtering methods are suited to binary images [36]. In the case of linear filtering, however, the transformed images can be compared after selecting the most significant coefficients (by comparing their norms), which is much less discriminating in the case of binary coefficients.

In this section, we have presented direct and indirect methods of image comparison. Indirect methods are poorly suited to comparing binary images other than shapes. Direct methods, in contrast, are interesting for measuring a distance between sets of points and not only pixel to pixel. However, they generally do not give access to local dissimilarity information, or do so only when adapted to the type of dissimilarity to be assessed. In this work, we apply the Apriori algorithm, which is an indirect method, to find frequent itemsets.

Methodology

In this section, we first briefly describe Hu’s moments method [37]. Then, we present our proposed method.

Moments method

Moment invariants were introduced by Hu. In [37], Hu derived six absolute orthogonal invariants and one skew orthogonal invariant based upon algebraic invariants, which are independent not only of position, size and orientation but also of parallel projection. Moment invariants have proved to be adequate measures for tracing image patterns under translation, scaling and rotation, under the assumption that the images are continuous functions and noise-free. Moment invariants have been extensively applied to image registration [38] and image pattern recognition [38,39,40,41]. Well-known moments include geometric moments [37], Zernike moments [42, 43], rotational moments [44] and complex moments [39].

In his work, Hu proved that each irradiance function f(x, y) (or image) has a one-to-one correspondence with a unique set of moments, and vice versa. The \((p+q)^{th}\)-order geometric moment of f(x, y) is given by:

$$\begin{aligned} m_{pq}=\int _{-{\infty }}^{+{\infty }}\int _{-{\infty }}^{+{\infty }}x^{p}y^{q}f(x,y)\,dx\,dy \end{aligned}$$
(1)

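The seven Hu moment invariants are given in Eq. (4), with the central moments

$$\begin{aligned} \mu _{ij}=\int _{-{\infty }}^{+{\infty }}\int _{-{\infty }}^{+{\infty }}(x-\tilde{x})^{i}(y-\tilde{y})^{j}f(x,y)\,dx\,dy \end{aligned}$$
(2)

for i, j = 0, 1, 2, ..., where \(\tilde{x}=m_{10}/m_{00}\) and \(\tilde{y}=m_{01}/m_{00}\) are the coordinates of the image centroid,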

and

$$\begin{aligned} \theta _{ij}=\frac{\mu _{ij}}{\mu _{00}^{\left(1+ \frac{i+j}{2} \right)}} \end{aligned}$$
(3)

where \(i+j \ge 2\). Then

$$\begin{aligned} \left\{ \begin{array}{l} \phi _{1}=\theta _{20}+\theta _{02},\\ \phi _{2}=(\theta _{20}-\theta _{02})^2+4\theta _{11}^2,\\ \phi _{3}=(\theta _{30}-3\theta _{12})^2+(3\theta _{21}-\theta _{03})^2,\\ \phi _{4}=(\theta _{30}+\theta _{12})^2+(\theta _{21}+\theta _{03})^2,\\ \phi _{5}=(\theta _{30}-3\theta _{12})(\theta _{30}+\theta _{12}) \left[(\theta _{30}+\theta _{12})^2-3(\theta _{21}+\theta _{03})^2\right]\\ \quad \quad +(3\theta _{21}-\theta _{03})(\theta _{21}+\theta _{03})\left[3(\theta _{30}+\theta _{12})^2-(\theta _{21}+\theta _{03})^2 \right],\\ \phi _{6}=(\theta _{20}-\theta _{02})\left[(\theta _{30}+\theta _{12})^2-(\theta _{21}+\theta _{03})^2\right]\\ \quad \quad +4\theta _{11}(\theta _{30}+\theta _{12})(\theta _{21}+\theta _{03}),\\ \phi _{7}=(3\theta _{21}-\theta _{03})(\theta _{30}+\theta _{12})\left[(\theta _{30}+\theta _{12})^2-3(\theta _{21}+\theta _{03})^2\right]\\ \quad \quad -(\theta _{30}-3\theta _{12})(\theta _{21}+\theta _{03})\left[3(\theta _{30}+\theta _{12})^2-(\theta _{21}+\theta _{03})^2\right] \end{array} \right. \end{aligned}$$
(4)
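For a digital image, the integrals are replaced by sums over the pixels. The following Python sketch computes the seven invariants in their standard form, with moments normalised by \(\mu_{00}^{1+(i+j)/2}\). It is an illustration under stated assumptions, not the code used in the experiments: the function name `hu_moments` and the choice of pixel indices as the coordinates x (column) and y (row) are ours.

```python
import numpy as np


def hu_moments(image):
    """Seven Hu invariants of a 2-D array, using discrete sums for the integrals."""
    f = np.asarray(image, dtype=float)
    # Assumption: x is the column index and y is the row index of each pixel.
    y, x = np.mgrid[: f.shape[0], : f.shape[1]]
    m00 = f.sum()  # equals mu_00
    xc, yc = (x * f).sum() / m00, (y * f).sum() / m00  # centroid

    def theta(i, j):
        # Central moment mu_ij normalised by mu_00^(1 + (i + j) / 2).
        mu = ((x - xc) ** i * (y - yc) ** j * f).sum()
        return mu / m00 ** (1 + (i + j) / 2)

    t20, t02, t11 = theta(2, 0), theta(0, 2), theta(1, 1)
    t30, t03, t21, t12 = theta(3, 0), theta(0, 3), theta(2, 1), theta(1, 2)
    a, b = t30 + t12, t21 + t03          # recurring sums
    c, d = t30 - 3 * t12, 3 * t21 - t03  # recurring differences
    return np.array([
        t20 + t02,
        (t20 - t02) ** 2 + 4 * t11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (t20 - t02) * (a ** 2 - b ** 2) + 4 * t11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ])


# An "L"-shaped test pattern and a translated copy of it.
shape = np.zeros((12, 12))
shape[2:7, 3] = 1
shape[6, 3:6] = 1
moved = np.roll(shape, (3, 4), axis=(0, 1))
same = np.allclose(hu_moments(shape), hu_moments(moved))
```

Because the moments are taken about the centroid, the translated copy yields the same seven values up to floating-point error.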

Proposed technique

In this section, we present our approach, which is illustrated in Fig. 2. The figure shows an example image of size \(8 \times 10\). If we take min_sup = 2, then the triplet ((2,3), 2, 5) contains three frequent items of length 1, two frequent items of length 2 and one frequent item of length 3 (see Fig. 2b, c).

In general, we can rewrite a frequent pattern using the tuple (Position_First_Pixel, Start_row, End_row, Width) or (Position_First_Pixel, Start_column, End_column, Width),

where

  • Position_First_Pixel: indicates the position of the first pixel inside a frequent pattern,

  • Start_row: denotes the number of the starting row,

  • End_row: denotes the number of the ending row,

  • Width: indicates the width of the frequent pattern.

Fig. 2 Example illustrating the proposed approach for a binary image. a Binary image, b three frequent items with length 1, c two frequent items with length 2

The principle of the proposed approach is that the image is a set of transactions and items. We apply the Apriori algorithm along the rows and columns of an image, and the image is then represented by all of its frequent items. Algorithm 1 presents the Apriori algorithm for a binary image. In the row case, the rows of the image are considered as transactions and the columns of the image are considered as items. In the column case, the rows of the image are considered as items and the columns of the image are considered as transactions.

Algorithm 1 Apriori algorithm for a binary image

Apriori algorithm for color image

The main idea is that we segment the image, and each segmented region is considered as a binary image. Segmentation [45] is the first step in analyzing an image. It separates the image elements into homogeneous regions that share the same property. These regions can be characterized by their borders; in this case, they are characterized by the pixels that compose those borders.

In general, image segmentation methods fall into one of the following four categories:

  1. Segmentation based on regions,

  2. Segmentation based on contours,

  3. Segmentation based on the classification of pixels as a function of intensity,

  4. Segmentation based on the cooperation of these three methods.

For simplicity, in our study we use the first category, which is based on dividing the image into small regions [46].

Results and discussions

We conducted experiments to measure the performance of the proposed method. We first present the tools used to evaluate our approach. Then we compare the proposed approach with the Hu moments approach and test its robustness to noise. Finally, we apply our approach to a color image and present an application to the MPEG7 database.

We developed a Java application that implements our approach, with a simple interface that lets the user import a binary image for the Apriori algorithm (see Fig. 3). The user imports the binary image, and the same interface displays both the image and the result of the Apriori algorithm. The development environment used in this study is the Eclipse IDE, an open-source integrated development environment with an open, plugin-based architecture. It is one of the IDEs most widely used by Java developers. Our application runs on the Windows platform, on a 2.20 GHz processor with 1024 MB of memory.

Fig. 3 Java interface implementing the Apriori algorithm

Fig. 4 Images for experiments

Fig. 5 An example of noising and scaling operations on the letter “F.” Left: the original image. Middle: noised image. Right: scaled image

In this study, we extract the seven moment invariants from our test image and from the noisy images. In Fig. 4, image (a) is the original image, image (b) is a noisy image obtained by applying the randn() function to image (a), and images (c) and (d) are two noisy images obtained by adding noise manually.

Our technique consists of:

Apriori algorithm along the rows. In this case, the rows of the image are considered as transactions and the columns of the image are considered as items. We take minsup row = 50%; this value of minsup is set arbitrarily.

Apriori algorithm along the columns. The procedure is the same as along the rows, except that the Apriori algorithm is applied to the columns: the rows of the image are considered as items and the columns of the image are considered as transactions.

Figure 5 shows a binary image of the English letter “F.” Figure 5a is the original image, Fig. 5b is the noised image, and Fig. 5c is the scaled image. Different kinds of imaging systems may introduce different kinds of noise. In this study, we used Gaussian noise [47].
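As an illustration of how a binary image might be corrupted with Gaussian noise, consider the following Python sketch. The noise parameters used in the experiments are not stated here, so the function name `add_gaussian_noise` and the values of `sigma`, `threshold` and `seed` are hypothetical choices made for the example.

```python
import numpy as np


def add_gaussian_noise(binary_image, sigma=0.2, threshold=0.5, seed=0):
    """Add zero-mean Gaussian noise to a 0/1 image and re-binarise it."""
    rng = np.random.default_rng(seed)  # fixed seed so the sketch is repeatable
    img = np.asarray(binary_image, dtype=float)
    noisy = img + rng.normal(0.0, sigma, size=img.shape)
    # Thresholding maps the noisy grey levels back to {0, 1}.
    return (noisy > threshold).astype(np.uint8)


letter = np.zeros((8, 8), dtype=np.uint8)
letter[1:7, 2] = 1      # vertical stroke
letter[1, 2:6] = 1      # top bar
noisy_letter = add_gaussian_noise(letter, sigma=0.3)
```

With a larger `sigma`, more pixels cross the threshold and flip, which gives a simple way to vary the noise level when testing robustness.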

In the case of color images, the image is represented in RGB space and is first segmented. We apply our technique using the following steps:

  1. Segmentation of the image into regions using k-means,

  2. Each segmented region is considered as a binary image to which we apply the Apriori algorithm,

  3. Extraction of the frequent items for each image.
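The first two steps above can be sketched in Python as follows. This is a minimal illustration that clusters pixel colors with a basic k-means written in NumPy; it is not the segmentation code used in this study, and the function name `kmeans_masks` and the parameters `k`, `iterations` and `seed` are assumptions.

```python
import numpy as np


def kmeans_masks(rgb_image, k=3, iterations=10, seed=0):
    """Cluster pixel colours with a basic k-means; return one binary mask per cluster."""
    img = np.asarray(rgb_image, dtype=float)
    pixels = img.reshape(-1, 3)
    rng = np.random.default_rng(seed)
    # Start from k distinct colours of the image (it must contain at least k).
    colours = np.unique(pixels, axis=0)
    centres = colours[rng.choice(len(colours), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each pixel to its nearest centre (squared Euclidean distance).
        dist = ((pixels[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Move each centre to the mean colour of its pixels.
        for c in range(k):
            members = pixels[labels == c]
            if len(members):
                centres[c] = members.mean(axis=0)
    # One 0/1 mask per cluster, with the same height and width as the image.
    return [(labels == c).reshape(img.shape[:2]).astype(np.uint8) for c in range(k)]


# Toy image: left half red, right half blue.
toy = np.zeros((4, 4, 3))
toy[:, :2] = [255, 0, 0]
toy[:, 2:] = [0, 0, 255]
masks = kmeans_masks(toy, k=2)
```

Each returned mask is a binary image in its own right, so the row-wise and column-wise analysis described for binary images can be applied to it unchanged.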

In this study, we test our technique on a color image, as shown in Fig. 6.

Fig. 6 a Original image, b blue image, c green image, d red image

Table 1 shows the values of the seven moment invariants for images (a–d) of Fig. 4. We observe that the moment values differ between the original image and the corresponding noisy images.

Table 1 The value of Hu moment invariants

Tables 2 and 3 show the obtained results when we use our method. We note that the results are the same for the three images, so there is no influence of the noise.

Table 2 shows only the frequent items of maximum length, although in our experiments the program displays all levels of frequent items. Rows 8, 9, 10, 11, 12, 13 and 14 constitute a frequent itemset of length 7 for each image. In addition, Table 4 gives the experimental results for the binary image with scaling and with added noise. Scaling is handled by normalizing the image to a fixed size.
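One simple way to normalize an image to a fixed size is nearest-neighbour resampling, sketched below in Python. The resampling method and target size used in the experiments are not specified, so the function name `normalize_size` and the default `size` are assumptions made for illustration.

```python
import numpy as np


def normalize_size(binary_image, size=(32, 32)):
    """Resample a binary image to a fixed size with nearest-neighbour sampling."""
    img = np.asarray(binary_image)
    # For every output row/column, pick the source row/column it falls into.
    rows = np.arange(size[0]) * img.shape[0] // size[0]
    cols = np.arange(size[1]) * img.shape[1] // size[1]
    return img[np.ix_(rows, cols)]


small = np.array([[1, 0],
                  [0, 1]])
enlarged = normalize_size(small, size=(4, 4))
restored = normalize_size(enlarged, size=(2, 2))
```

Nearest-neighbour sampling copies existing pixel values, so the output of a binary image stays binary and no re-thresholding is needed.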

Note that Hu moment invariants are affected by noise. With the proposed approach, in contrast, the two images have the same frequent columns or rows despite the presence of noise.

From these results, it can be observed that the proposed approach is more robust to noise than Hu moment invariants.

Table 2 Results found by our approach along rows
Table 3 Results found by our approach along columns
Table 4 Results found by our approach along rows and columns for binary images with scaling and Gaussian noise

The images in Fig. 7 are used to show that our approach can compare binary images of different levels. The results obtained with the Apriori algorithm along rows for the images in Fig. 7 are shown in Table 5. Image (a) contains four frequent items of different lengths, from the longest to the shortest. Image (b) has one frequent item of maximum length. If we consider only items of maximum length, images (a) and (b) are similar.

Fig. 7 Two line images used in the experiments

Table 5 Results found by our approach along rows for binary images

The results obtained with the Apriori algorithm along rows for the binary images in Fig. 6b–d are shown in Table 6.

Table 6 Results found by our approach along rows for segmented binary image

Application on MPEG7 database

In order to test our approach, we apply the Apriori algorithm to the MPEG7 database [48]. This database contains 70 classes of shapes, each with 20 members, as shown in Fig. 9. We extract the frequent items from an image by counting the frequency of non-black pixels. An MPEG7 image of size \(256 \times 256\) is presented in Fig. 8.

Fig. 8 ‘Apple’ MPEG7 image

Fig. 9 Example of the MPEG7 image database

We have found that it is easy to extract frequent items from an image. Figures 10 and 11 show the frequency of non-black pixels in each row and in each column of the image, respectively. As can be seen, these graphs largely reproduce the shape of the ‘Apple’ image. The frequencies of non-black pixels for each row and for each column are given in Tables 7 and 8, respectively. For example, since row ‘141’ contains ‘222’ white pixels, it is considered frequent if min-sup is fixed to 10.
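The counting step described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code behind Tables 7 and 8; the function name `nonblack_frequencies`, the convention that black pixels have value 0, and the 0-based row and column indices are assumptions.

```python
import numpy as np


def nonblack_frequencies(image, min_sup):
    """Count non-black pixels per row and per column; report the frequent ones."""
    nonblack = np.asarray(image) != 0  # assumption: black pixels are stored as 0
    row_freq = nonblack.sum(axis=1)
    col_freq = nonblack.sum(axis=0)
    # A row or column is treated as frequent when its count reaches min_sup.
    frequent_rows = np.flatnonzero(row_freq >= min_sup).tolist()
    frequent_cols = np.flatnonzero(col_freq >= min_sup).tolist()
    return row_freq, col_freq, frequent_rows, frequent_cols


toy = [[0, 1, 1],
       [0, 0, 0],
       [1, 1, 1]]
row_freq, col_freq, frequent_rows, frequent_cols = nonblack_frequencies(toy, min_sup=2)
```

Plotting `row_freq` and `col_freq` against their indices gives profiles of the kind shown for the ‘Apple’ image.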

Fig. 10 Frequency of non-black pixels in each row of the image

Fig. 11 Frequency of non-black pixels in each column of the image

Table 7 Frequency of non-black pixels for each row
Table 8 Frequency of non-black pixels for each column

Conclusion and future work

In this work, we have proposed a technique to compare binary images. We consider an image as a set of transactions and items. The proposed technique relies on the Apriori algorithm, which we apply along the rows or columns of an image. In addition, we apply our technique to color images. The results show that the proposed approach is satisfactory and, in particular, robust to noise.

In the future, we will extend this technique to handle the rotation problem. We will then integrate methods such as FP-growth, FTWeightedHashT Apriori and MISFP-growth to handle large data (large binary images).