1 Introduction

In a broad variety of domains, data are being gathered and stored at an intense pace due to the Internet and the widespread use of databases [7]. This ongoing rapid growth of data, which led to the term ‘big data’, has created an immense need for a new generation of computational techniques, theories and approaches to extract useful information, i.e., knowledge, from these voluminous data. These theories and approaches are the key elements of the emerging domain of knowledge discovery in databases (KDD) [21]. More precisely, big data raise many challenges, such as in clustering [35], classification [34] and mining [38], but mainly in dimensionality reduction and more precisely in feature selection, as this is usually a source of potential data loss [13]. This motivated researchers to build an efficient and automated knowledge discovery process with a special focus on its third step, namely data reduction.

Data reduction is an important point of interest as many real-world applications may have a very large number of features (attributes) [41]. For instance, among the most practically relevant and high-impact applications are biochemistry, genetics and molecular biology. In these biological sciences, the collected data, e.g., gene expression data, may easily have more than 10,000 attributes [1]. However, not all of these attributes are crucial and needed, since many of them can be redundant in the context of some other features or even completely irrelevant and insignificant to the task being handled. Therefore, several important issues arise when learning in such a situation; among these are the problems of over-fitting to insignificant aspects of the given data, as well as the computational burden of processing several similar attributes that provide redundant information [17]. These problems may decrease the performance of any learning technique, e.g., a classification algorithm. Hence, to solve these problems, it is an important and significant research direction to automatically look for and select only a small subset of relevant attributes from the initial large set of attributes; that is, to perform feature selection. In fact, by removing the irrelevant and redundant attributes, feature selection is capable of reducing the dimensionality of the input data while speeding up the learning process, simplifying the learned model and increasing performance [9, 17].

At an abstract level, to reduce the high dimensionality of data sets, suitable techniques can be applied with respect to the requirements of the subsequent KDD process. The taxonomy of these techniques falls into two main groups, namely feature selection techniques and feature extraction techniques [17]. The main difference between the two approaches is that feature selection techniques select a subset of the initial features, while feature extraction techniques create new attributes from the initial feature set. More precisely, feature extraction techniques transform the underlying semantics (meaning) of the attributes, while feature selection techniques preserve the data set semantics in the process of reduction. In knowledge discovery, feature selection techniques are notably desirable as they ease the interpretability of the output knowledge. In this paper, we mainly focus on the feature selection category for big data pre-processing.

Technically, feature selection is a challenging process due to the very large search space that reflects the combinatorially large number of all possible feature combinations to select from. This task is becoming more difficult as the total number of attributes increases in many big data application domains, combined with the increased complexity of those problems. Therefore, to cope with the vast amount of given data, most state-of-the-art techniques employ some degree of reduction, and thus an effective feature reduction technique is needed.

As one of the data analysis techniques, rough set theory (RST) [27]-based approaches have been successfully and widely applied in data mining and knowledge discovery [16], and particularly in feature selection [31]. Nonetheless, despite their power as feature selectors, most of the classical rough set-based algorithms are sequential, computationally expensive and can only handle non-large data sets. Their computational cost and their incapacity to handle high-dimensional data are explained by the need to first generate all possible combinations of attributes at once and then process these in turn to finally select the most pertinent and relevant set of attributes.

Nevertheless, as previously mentioned, since the number of attributes is becoming very large, this task becomes more critical and challenging, and at this point the RST-based approaches reach their limits. More precisely, it is infeasible to generate all the possible attribute combinations at once because of both hardware and memory constraints.

This leads us to advance this field and broaden the application of rough set theory in the domain of data mining and knowledge discovery for big data. This paper proposes a scalable and effective algorithm based on rough sets for large-scale data pre-processing, and specifically for big data feature selection. Based on a distributed implementation design using both Scala and the Apache Spark framework [36], our proposed distributed algorithm copes with the computational inefficiencies of RST and with its restriction to non-large data sets. To deeply analyze the proposed distributed approach, experiments on big data sets with up to 10,000 features are carried out for feature selection and classification. The results demonstrate that our proposed solution achieves a good speedup and performs its feature selection task well without sacrificing performance, making it relevant to big data.

The rest of this paper is structured as follows. Section 2 presents the related literature. Section 3 reviews the fundamentals of rough set theory for feature selection. Section 4 describes parallel computing frameworks and the MapReduce programming model. Section 5 formalizes the motivation of this work and introduces our novel distributed algorithm based on rough sets for large-scale data pre-processing. Section 6 introduces the experimental setup and evaluation metrics. The results of the performance analysis and the conclusion are given in the remaining sections.

2 Literature review

Feature selection is defined as the process that selects a subset of the most relevant and pertinent attributes from a large input set of original attributes. For example, feature selection is the task of finding key genes (i.e., biomarkers) from the very huge number of candidate genes in biological and biomedical problems [3]. It is also the task of discovering core indicators (i.e., attributes) to describe the dynamic business environment [25], or to select key terms (e.g., words or phrases) in text mining [2] or to construct essential visual contents (e.g., pixel, color, texture, or shape) in image analysis [15].

In many real-world data mining and machine learning problems, feature selection has become a crucial and highly important data pre-processing step due to the abundance of noisy, irrelevant and/or misleading features present in big data. To cope with this, the usefulness of a feature can be measured by its relevancy as well as its redundancy. In fact, a feature is considered relevant if it can predict the decision feature(s); otherwise, it is said to be irrelevant as it provides no useful information in any context. On the other hand, a feature is considered redundant if it provides the same piece of information as the currently selected features, which means that it is highly correlated with them. Hence, to provide beneficial results from big data, feature selection should detect those attributes that present a high correlation with the decision feature(s) but are at the same time uncorrelated with each other.

In the literature, feature selection techniques can be broadly grouped into two main approaches, namely filter approaches and wrapper approaches [9, 17]. The key difference between the two is that wrapper approaches involve a specific learning algorithm, e.g., a classification algorithm, when it comes to evaluating the attribute subset. The applied learning algorithm is mainly used as a black box by the wrapper approach to evaluate the quality (i.e., the classification performance) of the selected attribute set. Technically, when an algorithm performs feature selection independently of any learning algorithm, the approach is defined as a filter, where the set of irrelevant features is filtered out before the induction process. Filter approaches tend to be applicable to most real-world domains since they are independent of any specific induction algorithm. On the other hand, if the evaluation task depends on the learning algorithm, then the feature selection approach is a wrapper technique. This approach searches through the space of attribute subsets using the training (or validation) accuracy of a specific induction algorithm as the measure of utility for a candidate subset. Therefore, these approaches may generate subsets that are overly specific to the used learning algorithm, and hence, any modification of the learning model might render the attribute set suboptimal.

Each of these two feature selection categories has its advantages and shortcomings, where the main distinguishing aspects are the computational speed and the possibility of over-fitting. Overall, in terms of computational speed, filter algorithms are usually less expensive and more general than wrapper techniques. Wrappers are computationally expensive and can easily break down when dealing with a very large number of attributes, due to the adoption of a learning algorithm in the evaluation of subsets [24, 26]. In terms of over-fitting, wrapper techniques have a higher learning capability and are thus more likely to overfit than filter techniques. It is important to mention that in the literature, some researchers classify feature selection techniques into three separate categories, namely the wrapper, the embedded and the filter techniques [24]. The embedded approaches tend to fuse feature selection and the learning approach into a single process. For large-scale data sets with a large number of features, filter methods are usually a good option. This category is the main focus of this paper.

Meanwhile, in the context of big data, it is worth mentioning that a detailed study was conducted in [6], where the authors performed a deep analysis of the scalability of the state-of-the-art feature selection techniques belonging to the filter, embedded and wrapper categories. In [6], it was demonstrated that the state-of-the-art feature selection techniques have scalability issues when dealing with big data. The authors showed that the existing techniques are inadequate for handling a high number of attributes in terms of training time and/or effectiveness in selecting the relevant set of features. Thus, the adaptation of feature selection techniques to big data problems seems essential, and it may require the redesign of these algorithms and their incorporation into parallel and distributed environments/frameworks. Among the possible alternatives is the MapReduce paradigm [10], which was introduced by Google and which offers a robust and efficient framework to deal with big data analysis. Several recent works have concentrated on parallelizing and distributing machine learning techniques using the MapReduce paradigm [40, 43, 44]. Recently, a set of new and more flexible paradigms has been proposed aiming at extending the standard MapReduce approach, mainly Apache Spark [36], which has been applied with success to a number of data mining and machine learning real-world problems [36]. Further details and descriptions of such distributed processing frameworks will be given in Sect. 4.1.

With the aim of choosing the most relevant and pertinent subset of features, a variety of feature reduction techniques have been proposed within the Apache Spark framework to deal with big data in a distributed way. Among these are several feature extraction methods such as n-gram, principal component analysis, discrete cosine transform, tokenizer, PolynomialExpansion, ElementwiseProduct, etc., and very few feature selection techniques, namely VectorSlicer, RFormula and ChiSqSelector. To further expand this restricted line of research, i.e., the development of parallel feature selection methods, some other feature selection techniques based on evolutionary algorithms have lately been proposed in the literature [30]. Specifically, the evolutionary algorithms were implemented based on the MapReduce paradigm to obtain subsets of features from big data sets. These include a generic implementation of greedy information-theoretic feature selection methods based on the common theoretic framework presented in [29], and an improved implementation of the classical minimum Redundancy Maximum Relevance feature selection method [29]. This implementation includes several optimizations such as caching marginal probabilities, accumulating redundancy (greedy approach) and column-wise data access. Nevertheless, most of these techniques suffer from some shortcomings. For instance, they usually require the user or expert to deal with the algorithms' parameterization or to specify noise levels, while some other techniques simply rank the attribute set and let the user choose his/her own subset. There are also feature selection techniques that require the user to indicate how many attributes should be selected, or to give a threshold that determines when the algorithm should end, which are all significant drawbacks. All of these require users to make a decision based on their own (possibly subjective) perception. To overcome the shortcomings of the state-of-the-art techniques, it seems crucial to look for a filter approach that does not require any external or supplementary information to function properly. Rough set theory (RST) can be used as such a technique [39].

The use of rough set theory in data mining and knowledge discovery, specifically for feature selection, has proved to be very successful in many application domains such as classification [22], clustering [23] and supply chain [5]. This success is explained by several aspects of the theory in dealing with data. For example, the theory is able to analyze the facts hidden in data, does not need any supplementary information about the given data such as thresholds or expert knowledge on a particular domain, and is also capable of finding a minimal knowledge representation [11]. This is achieved by making use of the granularity structure of the provided data only.

Although algorithms based on rough sets have been widely used as efficient filter feature selectors, most of the classical rough set algorithms are sequential, computationally expensive and can only deal with non-large data sets. The prohibitive complexity of these algorithms comes from the search for an optimal attribute subset through the computation of an exponential number of candidate subsets. Such an exhaustive search is quite impractical for most data sets, and specifically for big data, as it becomes clearly unmanageable to build the set of all possible combinations of features.

In order to overcome these weaknesses, a set of parallel and distributed rough set methods has been proposed in the literature to perform feature selection, but in different contexts. For example, some of these distributed methods adopt evolutionary algorithms, such as the work proposed in [12], where the authors defined a hierarchical MapReduce implementation of a parallel genetic algorithm for determining the minimum rough set reduct, i.e., the set of selected features. Within another context, that of limited labeled big data, the authors in [32] introduced a theoretic framework called local rough set and developed a series of corresponding concept approximation and attribute reduction algorithms with linear time complexity, which can efficiently and effectively work on limited labeled big data. In the context of distributed decision information systems, i.e., several separate data sets dealing with different contents/topics but concerning the same data items, the authors in [19] proposed a distributed definition of rough sets to deal with the reduction of these information systems.

In this paper, and in contrast to the state-of-the-art methods, we mainly focus on the formalization of rough set theory in a distributed manner by using its granular concepts only, and without making use of any heuristics, e.g., evolutionary algorithms. We also focus on a single information system, i.e., a single big data set, which covers a single content/topic and which is characterized by fully labeled data. Within this focus, a first attempt presenting a parallel rough set model was given in [8]. The main idea in [8] is to split the given big data set into several partitions, each with a smaller number of features, which are all then processed in a parallel way. This minimizes the computational effort of the RST computations, particularly when dealing with a very large number of features. However, it is important to mention that the scalability of [8] was only validated in terms of sizeup and scaleup, with a change in the standard metric definitions (the standard definitions are given in Sect. 6.2). Actually, the definition used for these two metrics was based on the number of features per partition instead of the standard definition, where the evaluation has to be based on the total number of features in the data set used.

In this paper, we propose a redesign of rough set theory for feature selection by giving a better definition of the work presented in [8], specifically when it comes to the validation of the method (Sect. 6). Our work, which is an extension of [8], is based on a distributed partitioning procedure, within a Spark/MapReduce paradigm, that makes our proposed solution scalable and effective in dealing with big data. For the validation of our method, and in contrast to [8], we believe that using the overall number of attributes is a much more natural setup as it will give insights into the performance depending on the input data set rather than the partitions.

3 Rough sets for feature selection

Rough set theory (RST) [27, 28] is a formal approximation of the conventional set theory that supports approximations in decision making. This approach can extract knowledge from a problem domain in a concise way and retain the information content while reducing the involved amount of data [39]. This section focuses mainly on highlighting the fundamentals of RST for feature selection.

3.1 Preliminaries

In rough set theory, the training data set is called an information table or an information system. It is represented by a table where rows represent objects or instances and columns represent attributes or features. The information table can be defined as a tuple \(S = (U, A)\), where \(U = \{u_1, u_2, \ldots , u_N\}\) is a non-empty finite set of N instances (or objects), called universe, and A is a non-empty set of \((n + k)\) attributes. The feature set \(A = C \cup D\) can be partitioned into two subsets, namely the conditional feature set \(C = \{a_1, a_2, \ldots , a_n\}\) consisting of n conditional attributes or predictors and the decision attribute \(D = \{d_1, d_2, \ldots , d_k\}\) consisting of k decision attributes or output variables. Each feature \(a \in A \) is described with a set of possible values \(V_a\) named the domain of a.

For each non-empty subset of attributes \(P \subset C\), a binary relation called P-indiscernibility relation, which is the central concept of rough set theory, is defined as follows:

$$\begin{aligned} IND (P) = \left\{ (u_1, u_2) \in U \times U{:}\,\forall a \in P, a(u_1) = a(u_2)\right\} . \end{aligned}$$
(1)

where \(a(u_i)\) refers to the value of attribute a for the instance \(u_i\). This means if \((u_1, u_2) \in IND (P)\), then \(u_1\) is indistinguishable (indiscernible) from \(u_2\) by the attributes P. This relation is reflexive, symmetric and transitive.

The equivalence class of an instance \(u \in U\) under \( IND (P)\) is denoted as \([u]_P\), and the induced set of equivalence classes partitions U into different blocks denoted as U/P.

The rough set approximates a concept or a target set of objects \(X \subseteq U\) using the equivalence classes induced using P as follows:

$$\begin{aligned} {\underline{P}}(X)= & {} \{u{:}\,[u]_P \subseteq X \}. \end{aligned}$$
(2)
$$\begin{aligned} {\overline{P}}(X)= & {} \{u{:}\,[u]_P \cap X \ne \emptyset \}. \end{aligned}$$
(3)

where \({\underline{P}}(X)\) and \({\overline{P}}(X)\) denote the P-lower (certainly classified as members of X) and P-upper (possibly classified as members of X) approximations of X, respectively. The notation \(\cap \) denotes the intersection operation.

The concept that defines the set of instances that are not certainly, but can possibly be classified in a specific way is named the boundary region and is defined as the difference between the two approximations. X is a crisp set if the boundary region is an empty set, i.e., accurate approximation, \({\overline{P}}(X) = {\underline{P}}(X)\); otherwise, it is a rough set.

To compare subsets of attributes, a dependency measure is defined. For instance, the dependency measure of an attribute subset Q on another attribute subset P is given as:

$$\begin{aligned} \gamma _P(Q) = \frac{| POS _P(Q)|}{|U|}. \end{aligned}$$
(4)

where \(0 \le \gamma _P(Q) \le 1\), | | denotes the set cardinality, \(\bigcup \) denotes the union operation (used in Eq. 5), and \( POS _P(Q)\) is defined as:

$$\begin{aligned} POS _P(Q) = \bigcup _{X \in U/Q} {\underline{P}}(X). \end{aligned}$$
(5)

\({POS}_P(Q)\) is the positive region of Q with respect to P and is the set of all elements of U that can be uniquely classified to blocks of the partition U/Q by means of P. The closer \(\gamma _P(Q)\) is to 1, the more Q depends on P.
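
To make these notions concrete, the following minimal, non-distributed Scala sketch computes the equivalence classes of \( IND (P)\), the lower approximation of Eq. (2) and the dependency degree \(\gamma _P(Q)\) of Eqs. (4) and (5) for a small in-memory table. The representation of instances as maps from attribute names to values, as well as the names RoughSetBasics, indClasses, lower and gamma, are illustrative assumptions and not part of the original formulation.

  // Minimal sketch, assuming instances are maps from attribute name to value.
  object RoughSetBasics {
    type Instance = Map[String, String]

    // Equivalence classes of IND(P): instance indices grouped by their values on P.
    def indClasses(u: Seq[Instance], p: Set[String]): Seq[Set[Int]] =
      u.indices.groupBy(i => p.toSeq.sorted.map(a => u(i)(a)))
        .values.map(_.toSet).toSeq

    // P-lower approximation of a target set X of instance indices (Eq. 2).
    def lower(u: Seq[Instance], p: Set[String], x: Set[Int]): Set[Int] =
      indClasses(u, p).filter(_.subsetOf(x)).flatten.toSet

    // Dependency degree gamma_P(Q) = |POS_P(Q)| / |U| (Eqs. 4 and 5).
    def gamma(u: Seq[Instance], p: Set[String], q: Set[String]): Double = {
      val pos = indClasses(u, q).flatMap(x => lower(u, p, x)).toSet
      pos.size.toDouble / u.size
    }
  }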

Based on these basics, RST defines two important concepts for feature selection which are the Core and the Reduct.

3.2 Reduction process

The theory of rough sets aims at finding the smallest subset of the conditional attribute set such that the resulting reduced database remains consistent with respect to the decision attribute. A database is considered consistent if, for every set of objects having identical conditional feature values, the corresponding decision features are the same. To achieve this, the theory defines the Reduct concept and the Core concept.

Formally, in an information table, the unnecessary attributes can be categorized into either irrelevant or redundant features. The point is to define a heuristic measure to evaluate the necessity of a feature. Nevertheless, it is not easy to define such a heuristic based on these qualitative definitions of irrelevance and redundancy. Therefore, the authors in [20] defined the strong relevance and weak relevance of an attribute based on the probability of occurrence of the target concept given this attribute. The set of strongly relevant attributes represents the indispensable features, in the sense that they cannot be removed from the information table without causing a loss of prediction accuracy. On the other hand, the set of weakly relevant features can in some cases contribute to the prediction accuracy. Based on these definitions, both the strong and the weak relevance concepts provide a good basis upon which the importance of each feature can be described. In rough set terminology, the set of strongly relevant attributes can be mapped to the Core concept, while the Reduct concept defines a mixture of all strongly relevant attributes and some weakly relevant attributes.

To define these key concepts, RST sets the following formalization: a subset \(R \subseteq C\) is said to be a reduct of C in the case where

$$\begin{aligned} \gamma _R(D) = \gamma _C(D) \end{aligned}$$
(6)

and there is no \(R' \subset R\) such that \( \gamma _{R^{'}}(D) = \gamma _R(D)\). Based on this formula, the Reduct can be defined as the minimal set of selected features that preserve the same dependency degree as the whole set of features.

In practice, from the given information table, it is possible that the theory generates a set of reducts: \( RED ^{F}_{C}(D)\). In this situation, any reduct in \( RED ^{F}_{C}(D)\) can be selected to describe the original information table.

The theory also defines the Core concept which is the set of features that are enclosed in all reducts. The Core concept is defined as

$$\begin{aligned} CORE _{C}(D) = \bigcap RED ^{F}_{C}(D). \end{aligned}$$
(7)

More precisely, the Core is defined as the set of features that cannot be omitted from the information table without inducing a collapse of the equivalence class structure. Thus, the Core is the most important subset of attributes, since none of its elements can be removed without affecting the classification power of attributes. This means that all the features which are in the Core are indispensable.
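
As a small illustration of Eq. (7), the following one-method Scala sketch (with illustrative names) derives the Core from a given family of reducts, each represented as a set of attribute indices.

  // Minimal sketch: the Core is the intersection of all reducts (Eq. 7).
  def core(reducts: Seq[Set[Int]]): Set[Int] =
    reducts.reduce(_ intersect _)

  // e.g., core(Seq(Set(1, 3), Set(1, 4))) == Set(1)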

4 Parallel computing frameworks and the MapReduce programming model

In this section, we highlight the main solutions for big data processing. We also give a description of the MapReduce paradigm.

4.1 Parallel computing frameworks

With the dramatic increase in the amount of data, it has become crucial to implement a new set of technologies and tools that permit improved decision making and insight discovery. In this context, different techniques [33] have been developed to handle high-dimensional data sets, where most of the proposed tools are based on distributed processing, e.g., the Message Passing Interface (MPI) programming paradigm [37].

The challenges encountered in this regard are essentially linked to the access to the given big data, to the transparency of the software development process with respect to its prerequisites, as well as to the available programming paradigms [14]. For example, standard techniques require that all the given data be loaded into the main memory of a single machine. This presents a technical issue for big data since the input data are usually stored in different locations, causing intensive network communication as well as supplementary input and output costs. Even when this is affordable, an extremely large main memory would be required to retain all the pre-loaded data for computing and processing purposes.

To overcome these serious limitations, a new set of highly efficient and fault-tolerant parallel frameworks has been developed and brought to the market. These distributed frameworks can be categorized with respect to the nature or type of the data they are able to process. Some frameworks can only process batch data: the parallel processing system operates over a high-dimensional and static data set and returns the output result(s) once the whole computation has successfully completed. Among the well-known open-source distributed processing frameworks dedicated to batch processing, we mention Hadoop. Hadoop is based on simple programming paradigms that allow a highly scalable and reliable parallel processing of high-dimensional data sets. The framework offers a cost-effective solution to store and process different types of data, such as structured, semi-structured and unstructured data, without any specific format specifications. Technically, Hadoop works on top of the Hadoop distributed file system (HDFS), which duplicates the input data files across various storage machines (nodes). In this manner, the framework facilitates a fast transfer rate of the data among the nodes of the cluster and allows the system to operate without interruption if one or more nodes fail. MapReduce is the core of the Hadoop framework. This paradigm offers intensive scalability over a large number of nodes within a Hadoop cluster. The programming details of MapReduce as well as its basic concepts will be given in Sect. 4.2.

On the other hand, there are some other distributed frameworks that can only deal with streaming data. In these frameworks' design, the distributed calculations are performed on each individual data item at the time it enters the parallel framework. Apache Storm and Apache Samza are among the most popular stream processing frameworks.

A third category of distributed frameworks can be highlighted, which is considered as hybrid systems, since these frameworks are capable of processing not only batch data but also stream data. In these frameworks' designs, similar or linked components can be used for both types of data, which makes the diverse processing requirements of hybrid systems much easier and simpler to handle. Among the well-known hybrid processing parallel frameworks, we mention Apache Spark and Apache Flink.

In this research, we focus on Apache Spark. This distributed open-source framework was initially developed in the UC Berkeley AMPLab for big data processing. Apache Spark is characterized by its capability of improving the system's effectiveness, which is achieved via intensive use of memory, by its efficiency, and by its high transparency for users. These characteristics allow users to perform parallel processing in diverse application domains in a simple and easy way. More precisely, and in comparison to Hadoop, in Hadoop MapReduce multiple jobs would be chained together to build a data pipeline, and at every stage of that pipeline MapReduce has to read the data from the disk and then write it back to the disk again. This process is obviously ineffective as all the data have to be read from and written back to the disk at each stage of the process. To deal with this issue, Apache Spark comes into play. Based on the same MapReduce paradigm, the Spark framework can offer an immediate tenfold increase in performance, which is explained by the fact that the data do not need to be stored back to the disk at every stage of the process, as all activities remain in memory [36]. Spark thus processes data much faster than transferring it through unnecessary Hadoop MapReduce mechanisms. In addition, the key concept that Spark offers is the resilient distributed data set (RDD), which is a set of elements distributed across the nodes of the cluster that can be operated on in parallel. Indeed, Spark has a number of high-level libraries for stream processing, machine learning and graph processing, e.g., MLlib [18]. The choice of this specific framework to design our proposed rough set-based algorithm for big data feature selection is based on several reasons: (1) it offers a general solution based on a hybrid parallel framework; (2) Apache Spark provides high-speed benefits with a trade-off in high memory usage; (3) Spark is one of the well-known and certified distributed frameworks and a mature hybrid system, especially when compared to other frameworks in the market, which are considered more niche in terms of their usage and, more importantly, are still in their initial periods of adoption.
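
As a brief illustration of the RDD abstraction mentioned above, the following Scala sketch distributes a local collection across the cluster and operates on it in parallel; the application name and the number of slices are arbitrary illustrative choices rather than settings used in this work.

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf().setAppName("rdd-demo")   // illustrative application name
  val sc = new SparkContext(conf)
  // A local collection becomes a resilient distributed data set (RDD),
  // partitioned across the nodes of the cluster.
  val rdd = sc.parallelize(1 to 1000000, numSlices = 8)
  // Transformations and actions are then executed in parallel on the partitions.
  val sumOfSquares = rdd.map(x => x.toLong * x).reduce(_ + _)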

4.2 The MapReduce paradigm

MapReduce [10] is one of the most popular processing techniques and program models for distributed computing to deal with big data. It was proposed by Google in 2004 and designed to easily scale data processing over multiple computing nodes. The MapReduce paradigm is composed of two main tasks/phases, namely the map phase and the reduce phase. At an abstract level, the map process takes as input a set of data and transforms it into a different set where each element is represented in the form of a tuple key/value pair, producing some intermediate results. Then, the reduce process collects the output from the map task as an input and combines these given key/value tuples into a smaller set of pairs to generate the final output. A representation of the MapReduce framework is given in Fig. 1.

Fig. 1 The process of the MapReduce framework

Technically, the MapReduce paradigm is based on a specific data structure which is the (key, value) pair. More precisely, during the map phase, on each split of the data the map function gets a unique (key, value) tuple as an input and generates a set of intermediate (key\(^{\prime }\), value\(^{\prime }\)) pairs as output. This is represented as follows:

$$\begin{aligned} map(key, value) \rightarrow \{(key^{\prime }, value^{\prime }), \ldots \}. \end{aligned}$$
(8)

After that, the MapReduce paradigm assembles all the intermediate (key\(^{\prime }\), value\(^{\prime }\)) pairs by key via the shuffling phase. Finally, the reduce function takes the aggregated (key\(^{\prime }\), value\(^{\prime }\)) pairs and generates a new (key\(^{\prime }\), value\(^{\prime \prime }\)) pair as output. This is defined as:

$$\begin{aligned} reduce(key^{\prime }, \{value^{\prime }, \ldots \}) \rightarrow (key^{\prime }, value^{\prime \prime }). \end{aligned}$$
(9)
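
To make the (key, value) flow of Eqs. (8) and (9) concrete, the following Scala sketch on Spark uses the classical word-count illustration; the input path is hypothetical and the SparkContext sc is assumed to be already available.

  // Map phase (Eq. 8): each line is split into words emitted as (word, 1) pairs.
  val pairs = sc.textFile("hdfs:///path/to/input.txt")   // hypothetical input path
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
  // Shuffle and reduce phase (Eq. 9): all values sharing the same key are aggregated.
  val counts = pairs.reduceByKey(_ + _)
  counts.take(10).foreach(println)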

As discussed, a variety of open-source parallel computing frameworks are available in the market, and in this section, we have highlighted the well-known ones. However, it is important to mention that choosing a particular distributed framework always depends on the type of data that the system will process. The choice also depends on how time-bound the users' requirements are and on the types of output results that users are looking for. In this paper, we mainly focus on the use of Apache Spark.

5 The rough set distributed algorithm for big data feature selection

In this section, we will introduce our developed parallel rough set-based algorithm, which we name ‘Sp-RST,’ for big data pre-processing and specifically for feature selection. Sp-RST has a distributed architecture based on Apache Spark for distributed and in-memory computation. First, we will highlight the main motivation for developing the distributed Sp-RST algorithm by identifying the computational inefficiencies of classical rough set theory which limit its application to small data sets only. Second, we will elucidate our Sp-RST solution as an efficient approach capable of performing big data feature selection without sacrificing performance.

5.1 Motivation and problem statement

Rough set theory for feature selection is an exhaustive search as the theory needs to compute every possible combination of attributes. The number of possible attribute subsets with m attributes from a set of N total attributes is \(\left( {\begin{array}{c}N\\ m\end{array}}\right) = \frac{N!}{m!(N - m)!}\) [17]. Thus, the total number of feature subsets to generate is \(\sum _{i=1}^N{\left( {\begin{array}{c}N\\ i\end{array}}\right) } = 2^N-1\). For example, for \(N=30\) we have roughly 1 billion combinations. This constraint prevents us from using high-dimensional data sets, as the number of feature subsets grows exponentially in the total number of features N. Moreover, hardware constraints, specifically memory consumption, do not allow us to store a high number of entries. This is because the system has to store the entire training data set in memory, together with all the supplementary data computations as well as the generated results. All of this data can be so big that its size easily exceeds the available RAM. These are the main motivations for our proposed Sp-RST solution, which makes use of parallelization.

5.2 The proposed solution

To overcome the standard RST inadequacy to perform feature selection in the context of big data, we propose our distributed Sp-RST solution. Technically, to handle a large set of data it is crucial to store all the given data set in a parallel framework and perform computations in a distributed way. Based on these requirements, we first partition the overall rough set feature selection process into a set of smaller and basic tasks that each can be processed independently. After that, we combine the generated intermediate outputs to finally build the sought result, i.e., the reduct set.

5.2.1 General model formalization

For feature selection, our learning problem aims to select a set of highly discriminating attributes from the initial large-scale input data set. The input base refers to the data stored in the distributed file system (DFS). To perform distributed tasks on the given DFS, a resilient distributed data set (RDD) is built. The latter can be formalized as a given information table that we name \(T_{ RDD }\). \(T_{ RDD }\) is defined via a universe \(U = \{x_1, x_2, \ldots , x_N\}\), which refers to the set of data instances (items), a large conditional feature set \(C = \{c_1, c_2, \ldots , c_V\}\) that includes all the features of the \(T_{ RDD }\) information table and finally via a decision feature D of the given learning problem. D refers to the label (also called class) of each \(T_{ RDD }\) data item and is defined as follows: \(D =\{d_1, d_2, \ldots , d_W\}\). C presents the conditional attribute pool from where the most significant attributes will be selected.

As explained in Sect. 5.1, the classical RST cannot deal with a very large number of features, which corresponds to C in the \(T_{ RDD }\) information table. Thus, to ensure the scalability of our proposed algorithm when dealing with a large number of attributes, Sp-RST first partitions the input \(T_{ RDD }\) information table (the big data set) into a set of m data blocks based on splits of the conditional feature set C, i.e., m smaller data sets with fewer features instead of a single data block (\(T_{ RDD }\)) with an unmanageable number of features C, which we note as \(T_{ RDD }(C)\). The key idea is to generate m smaller data sets that we name \(T_{ RDD _{(i)}}\), where \(i \in \{1, \ldots , m\}\), from the big \(T_{ RDD }\) data set, where each \(T_{ RDD _{(i)}}\) is defined via a manageable number of features r, with \(r \ll V\), \(C = \{c_1, c_2, \ldots , c_V\}\) and \(r \in \{1, \ldots , V\}\). The definition of the parameter r will be further explained in what follows. We note the resulting data block as \(T_{ RDD _{(i)}}(C_r)\). This leads to the following formalization: \(T_{ RDD } = \bigcup _{i=1}^{m}T_{ RDD _{(i)}}(C_r)\), where \(r \in \{1, \ldots , V\}\). As mentioned above, r defines the number of attributes that will be considered to build every \(T_{ RDD _{(i)}}\) data block. Based on this, every \(T_{ RDD _{(i)}}\) is built using r random attributes selected from C. Each \(T_{ RDD _{(i)}}\) is constructed based on r distinct features, and there are no common attributes between the built \(T_{ RDD _{(i)}}\). This leads to the following formalization: \(\forall T_{ RDD _{(i)}}{:}\,\not \exists \{c_r\} = \bigcap _{i=1}^{m} T_{ RDD _{(i)}}\). Figure 2 presents this data partitioning phase.

Fig. 2 The process of data partitioning
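
Under simplifying assumptions, the partitioning of the feature space described above can be sketched in Scala as follows: the V conditional attribute indices are shuffled and split into pairwise disjoint blocks of roughly r features each. The function name partitionFeatures and the seed handling are illustrative and not part of the original algorithm specification.

  import scala.util.Random

  // Split the V attribute indices into (at most) m pairwise disjoint blocks.
  def partitionFeatures(numFeatures: Int, m: Int, seed: Long): Seq[Seq[Int]] = {
    val shuffled = new Random(seed).shuffle((0 until numFeatures).toVector)
    val r = math.ceil(numFeatures.toDouble / m).toInt   // features per block
    shuffled.grouped(r).toSeq
  }

  // Each block i then defines the columns kept in T_RDD(i)(C_r), e.g.:
  //   val tRddI = tRdd.map { case (id, features, label) =>
  //     (id, blocks(i).map(j => features(j)), label) }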

With respect to the parallel implementation design, the distributed Sp-RST algorithm will be applied to every \(T_{ RDD _{(i)}}(C_r)\) while gathering all the intermediate results from the distinct m created partitions; rather than being applied to the complete \(T_{ RDD }\) that encloses the whole set C of conditional features. Based on this design, we can ensure that the algorithm can perform its feature selection task on a computable number of attributes and therefore overcome the standard rough set computational inefficiencies. The pseudocode of our proposed distributed Sp-RST solution is highlighted in Algorithm 1.

To further guarantee the Sp-RST feature selection performance while avoiding any critical information loss, and to refine the algorithm, Sp-RST runs over N iterations on the m \(T_{ RDD }\) data blocks, i.e., N iterations on all the m built \(T_{ RDD _{(i)}}(C_r)\). In each of these N iterations, Sp-RST first randomly builds the m distinct \(T_{ RDD _{(i)}}(C_r)\) as explained above. Once this is achieved, and for each partition, the algorithm's distributed tasks defined in Algorithm 1 (lines 5–10) are performed. Note that line 1 of Algorithm 1, which defines the initial Sp-RST parallel job, is performed outside the iteration loop. This job calculates the indiscernibility relation \( IND (D)\) of the decision class D. The main reason for this design is that this computation is totally separate from the m created partitions, since its output depends only on the labels of the data instances and not on the attribute set.

Algorithm 1 Pseudocode of the proposed distributed Sp-RST solution

At the output of the iteration loop (line 12), the outcome of each created partition can be either a single reduct \( RED _{i_{(D)}}(C_r)\) or a set (a family) of reducts \( RED _{i_{(D)}}^{F}(C_r)\). As previously highlighted in Sect. 3, any reduct among the \( RED _{i_{(D)}}^{F}(C_r)\) reducts can be selected to describe the \(T_{ RDD _{(i)}}(C_r)\) information table. Therefore, in the case where Sp-RST generates a single reduct for a specific \(T_{ RDD _{(i)}}(C_r)\) partition, the final output of this attribute selection phase is the set of features defined in \( RED _{i_{(D)}}(C_r)\). These attributes represent the most informative features among the \(C_r\) features and generate a new reduced \(T_{ RDD _{(i)}}\) defined as \(T_{ RDD _{(i)}}(RED)\). This reduced base guarantees nearly the same data quality as its corresponding \(T_{ RDD _{(i)}}(C_r)\), which is based on the full attribute set \(C_r\). In the other case, where Sp-RST generates multiple reducts, the algorithm performs a random selection of a single reduct among the generated family of reducts \( RED _{i_{(D)}}^{F}(C_r)\) to describe the corresponding \(T_{ RDD _{(i)}}(C_r)\). This random selection is supported by the RST fundamentals and is explained by the equal importance of all the reducts in \( RED _{i_{(D)}}^{F}(C_r)\): any reduct included in the family \( RED _{i_{(D)}}^{F}(C_r)\) can be selected to replace the \(T_{ RDD _{(i)}}(C_r)\) attributes.

Fig. 3 The global functioning of Sp-RST

At this level, the output of every data block i is \( RED _{i_{(D)}}(C_r)\), which refers to the selected set of features. Nevertheless, since every \(T_{ RDD _{(i)}}\) is described using r distinct attributes and with respect to \(T_{ RDD } = \bigcup _{i=1}^{m} T_{ RDD _{(i)}}(C_r)\), a union operator on the generated selected attributes is needed to represent the original \(T_{ RDD }\). This is defined as \( Reduct _m = \bigcup _{i=1}^{m} RED _{i_{(D)}}(C_r)\) (Algorithm 1, lines 12–14). As previously highlighted, Sp-RST performs its distributed tasks over the N iterations, generating N \( Reduct _m\) sets. Therefore, finally, an intersection operator applied to all the obtained \( Reduct _m\) is required. This is defined as \(Reduct = \bigcap _{n=1}^{N} Reduct _m\). Sp-RST thereby reduces the dimensionality of the original data set from \(T_{ RDD }(C)\) to \(T_{ RDD }(Reduct)\) by removing irrelevant and redundant features at each computation level. Sp-RST can also simplify the learned model, speed up the overall learning process, and increase the performance of a learning algorithm, e.g., a classification algorithm, as will be discussed in the experimental setup section (Sect. 6). Figure 3 illustrates the global functioning of Sp-RST. In what follows, we will elucidate the different Sp-RST elementary distributed tasks.
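
The aggregation just described can be sketched in a few lines of plain Scala, assuming perPartitionReducts(n) holds, for iteration n, the m per-partition feature sets \( RED _{i_{(D)}}(C_r)\); the function and variable names are illustrative.

  // Union over the m partitions of each iteration (Reduct_m),
  // followed by the intersection over the N iterations (Reduct).
  def aggregateReducts(perPartitionReducts: Seq[Seq[Set[Int]]]): Set[Int] = {
    val perIteration = perPartitionReducts.map(_.reduce(_ union _))
    perIteration.reduce(_ intersect _)
  }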

5.2.2 Algorithmic details

As previously highlighted, the elementary Sp-RST distributed tasks will be executed on every \(T_{ RDD _{(i)}}\) partition defined by its \(C_r\) features (\(T_{ RDD _{(i)}}(C_r)\)), except for the first step, Algorithm 1—line 1, which deals with the calculation of the indiscernibility relation for the decision class D: \( IND (D)\). Sp-RST performs seven main distributed jobs to generate the final output, i.e., Reduct.

Sp-RST starts by computing the indiscernibility relation for the decision class \(D = \{d_1, d_2, \ldots , d_W\}\). We define the indiscernibility relation as \( IND (D)\): \( IND (d_i)\), where \(i \in \{1, 2, \ldots , W\}\). Sp-RST calculates \( IND (D)\) for each decision class \(d_i\) by grouping the \(T_{ RDD }\) data items (instances), expressed in the universe \(U = \{x_1, \ldots , x_N\}\), that belong to the same decision class \(d_i\).

To achieve this task, Sp-RST processes a first map transformation operation taking the data in its format of (\(id_i\) of \(x_i\), List of the features of \(x_i\), Class \(d_i\) of \(x_i\)) and transforming it into a \(\langle key, value \rangle \) pair: \(\langle \)Class \(d_i\) of \(x_i\), List of \(id_i\) of \(x_i\rangle \). Based on this transformation, the decision class \(d_i\) defines the key of the generated output and the data item identifiers \(id_i\) of \(x_i\) of the \(T_{ RDD }\) define the values. After that, the foldByKey() transformation operation is applied to merge all values of each key in the transformed RDD output, which represents the sought \( IND (D)\): \( IND (d_i)\). The pseudocode related to this distributed job is highlighted in Algorithm 2.

Algorithm 2 Computation of the indiscernibility relation \( IND (D)\)
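
A minimal Spark (Scala) sketch of this first distributed job is given below, assuming each record of the input RDD has the form (id, features, label); the variable names and the placeholder input are illustrative.

  import org.apache.spark.rdd.RDD

  val data: RDD[(Long, Array[String], String)] = ???   // (id, features, label), loaded elsewhere
  // Map phase: emit (label, List(id)); foldByKey then merges the identifier lists
  // of all instances sharing the same decision class, yielding IND(D).
  val indD: RDD[(String, List[Long])] =
    data.map { case (id, _, label) => (label, List(id)) }
        .foldByKey(List.empty[Long])(_ ++ _)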

After that, and within a specific partition i, where \(i \in \{1, 2, \ldots , m\}\) and m is the number of partitions, the algorithm generates the \( AllComb _{(C_r)}\) RDD which reflects all the possible combinations of the \(C_r\) set of attributes. This is based on transforming the \(C_r\) RDD into the \( AllComb _{(C_r)}\) RDD using the flatMap() transformation operation together with the combinations() operation. This is shown in Algorithm 3.

Algorithm 3 Generation of all attribute combinations \( AllComb _{(C_r)}\)

In its third distributed job, Sp-RST calculates the indiscernibility relation \( IND ( AllComb _{(C_r)})\) for every created combination, i.e., the indiscernibility relation of every element in the output of Algorithm 3, which we name \( AllComb _{{(C_r)}_i}\). In this task, and as described in Algorithm 4, the algorithm aims at collecting all the identifiers \(id_i\) of the data items \(x_i\) that have identical values for the combination of attributes extracted from \( AllComb _{(C_r)}\). To do so, a first map operation is applied taking the data in its format of (\(id_i\) of \(x_i\), List of the features of \(x_i\), Class \(d_i\) of \(x_i\)) and transforming it into a \(\langle key, value \rangle \) pair: \(\langle ( AllComb _{{(C_r)}_i}\), List of the features of \(x_i)\), List of \(id_i\) of \(x_i\rangle \). Based on this transformation, the combination of features and the corresponding vector of feature values define the key, and the identifiers \(id_i\) of the data items \(x_i\) define the value. After that, the foldByKey() operation is applied to merge all values of each key in the transformed RDD output, i.e., all the identifiers \(id_i\) of the data items \(x_i\) that have the same combination of features with their corresponding vector of feature values \(( AllComb _{{(C_r)}_i}\), List of the features of \(x_i)\). This represents the sought \( IND ( AllComb _{(C_r)})\). In this third step, Sp-RST prepares the set of features that will be selected in the coming steps.

Algorithm 4 Computation of the indiscernibility relations \( IND ( AllComb _{(C_r)})\)
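
The two jobs of Algorithms 3 and 4 can be sketched together in Spark (Scala) as follows; for brevity the combinations are enumerated on the driver, whereas Sp-RST builds them as a distributed RDD, and the names attrs and data are illustrative assumptions.

  import org.apache.spark.rdd.RDD

  val attrs: Seq[Int] = ???                            // the r attribute indices C_r of partition i
  val data: RDD[(Long, Array[String], String)] = ???   // (id, features, label) restricted to C_r
  // Algorithm 3: all non-empty attribute combinations of C_r.
  val allComb: Seq[Seq[Int]] =
    (1 to attrs.size).flatMap(k => attrs.combinations(k))
  // Algorithm 4: group instance ids by their values on each combination.
  val indAllComb: RDD[((Seq[Int], List[String]), List[Long])] =
    data.flatMap { case (id, features, _) =>
      allComb.map(comb => ((comb, comb.map(j => features(j)).toList), List(id)))
    }.foldByKey(List.empty[Long])(_ ++ _)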

In a next stage, Sp-RST computes the dependency degrees \(\gamma ( AllComb _{(C_r)})\) of each attribute combination as described in Algorithm 5. For this task, the distributed job requires three input parameters which are the calculated indiscernibility relations \( IND (D)\), the \( IND ( AllComb _{(C_r)})\) and the set of all attribute combinations \( AllComb _{(C_r)}\).

For every element \( AllComb _{{(C_r)}_i}\) in \( AllComb _{(C_r)}\), and using the intersection() transformation, the job first tests whether the intersection of every \( IND (d_i)\) of \( IND (D)\) with each element \( IND ( AllComb _{{(C_r)}_i})\) in \( IND ( AllComb _{(C_r)})\) contains all the elements of the latter, i.e., whether the equivalence class is fully included in the decision class. This process refers to the calculation of the lower approximation as detailed in Sect. 3. We name the length of the resulting intersection LengthIntersect. If the condition is satisfied, then a score equal to the length of the resulting intersection, i.e., LengthIntersect, is assigned; otherwise, a value of 0 is given.

After that, a reduce function is applied over the different \( IND (D)\) elements together with a sum() function applied to the calculated scores of the elements having the same \( IND (d_i)\). This operation is followed by a second reduce function applied over the different \( IND ( AllComb _{(C_r)})\) elements together with a sum() function applied to the previously calculated results, which are based on the elements having the same \( AllComb _{{(C_r)}_i}\).

The latter output refers to the dependency degrees: \(\gamma ( AllComb _{(C_r)})\). This distributed job generates two outputs namely the set of dependency degrees \(\gamma ( AllComb _{(C_r)})\) of the attribute combinations \( AllComb _{(C_r)}\) as well as their associated sizes \( Size _{( AllComb _{(C_r)})}\).

Algorithm 5 Computation of the dependency degrees \(\gamma ( AllComb _{(C_r)})\)
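
The core of Algorithm 5, i.e., the lower-approximation test and the accumulation of the dependency degrees, can be sketched in plain Scala as follows, assuming the small per-partition IND structures have been collected; the function dependencyDegrees and the map-based representation are illustrative.

  // For each combination, an equivalence class contributes its size to the
  // dependency degree if it is contained in some decision class of IND(D).
  def dependencyDegrees(
      indD: Map[String, Set[Long]],              // IND(D): class label -> instance ids
      indComb: Map[Seq[Int], Seq[Set[Long]]]     // IND(AllComb): combination -> equivalence classes
  ): Map[Seq[Int], Int] =
    indComb.map { case (comb, classes) =>
      val pos = classes.filter(cls => indD.values.exists(d => cls.subsetOf(d)))
      comb -> pos.map(_.size).sum                // |POS| of the combination
    }

On the toy partition \(m = 1\) of Sect. 5.3, this computation reproduces the degrees 1, 4 and 4 reported for Muscle-pain, Temperature and their combination.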

Once all the dependencies are calculated, in Algorithm 6, Sp-RST looks for the maximum dependency value among all the computed \(\gamma ( AllComb _{(C_r)})\) using the max() function applied to the given RDD input, which is referred to as RDD[\( AllComb _{(C_r)}\), \( Size _{( AllComb _{(C_r)})}\), \(\gamma ( AllComb _{(C_r)})\)]. Specifically, the max() function is applied to the third argument of the given RDD, i.e., \(\gamma ( AllComb _{(C_r)})\).

Algorithm 6 Selection of the maximum dependency degree MaxDependency

Let us recall that, based on the RST preliminaries (Sect. 3), the maximum dependency refers not only to the dependency of the whole attribute set \((C_r)\) describing \(T_{ RDD _i}(C_r)\) but also to the dependency of all the possible attribute combinations satisfying the following constraint: \(\gamma ( AllComb _{(C_r)})= \gamma (C_r)\). The maximum dependency MaxDependency reflects the baseline value for the feature selection task.

In a next step, Sp-RST performs a filtering process using the filter() function to keep only the combinations whose dependency degree equals the previously selected baseline value (MaxDependency), i.e., \(\gamma ( AllComb _{(C_r)}) = MaxDependency\). This is described in Algorithm 7. In fact, through these computations, the algorithm removes at each level the unnecessary attributes that may negatively influence the performance of any learning algorithm.

Algorithm 7 Filtering of the attribute combinations reaching MaxDependency

At a final stage, and using the results generated from the previous step, which constitute the input of Algorithm 8, Sp-RST first applies the min() operator to look for the minimum number of features among all the \( Size _{( AllComb _{(C_r)})}\); specifically, the min() operator is applied to the second argument of the given RDD. Once this value, which we name minNbF, is determined, the algorithm applies a filter() method to keep only the combinations having minNbF features. This satisfies the full reduct constraints highlighted in Sect. 3: \( \gamma ( AllComb _{(C_r)}) = \gamma (C_r)\) while there is no \( AllComb _{(C_r)}^{'} \subset AllComb _{(C_r)}\) such that \(\gamma ( AllComb _{(C_r)}^{'}) = \gamma ( AllComb _{(C_r)})\). Every combination that satisfies this constraint is evaluated as a possible minimum reduct set. The features defining the reduct set describe all concepts in the initial \(T_{ RDD _i}(C_r)\) training data set.

Algorithm 8 Selection of the minimum-size combinations defining the reduct candidates
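
Finally, the selection steps of Algorithms 6–8 amount to a max, a filter, a min and a second filter, as sketched below in plain Scala with illustrative names; each surviving combination is a candidate reduct of \(C_r\).

  def selectReducts(deps: Map[Seq[Int], Int]): Seq[Seq[Int]] = {
    val maxDependency = deps.values.max                           // Algorithm 6
    val best = deps.filter { case (_, g) => g == maxDependency }  // Algorithm 7
    val minNbF = best.keys.map(_.size).min                        // Algorithm 8: minimum size
    best.keys.filter(_.size == minNbF).toSeq                      // Algorithm 8: reduct candidates
  }

Chaining the dependencyDegrees sketch above with selectReducts on the toy partition \(m = 1\) of Sect. 5.3 would return the single candidate reduct {Temperature}, in line with the worked example.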

5.3 Sp-RST: a working example

We apply Sp-RST to an example of an information table, \(T_{ RDD }(C)\), which is presented in Table 1. Assuming that the considered \(T_{ RDD }(C)\) is a big data set, the information table is defined via a universe \(U = \{x_0, x_1, \ldots , x_5\}\) which refers to the set of data instances (items), a conditional feature set C = {Headache, Muscle-pain, Temperature} that includes all the features of the \(T_{ RDD }(C)\) information table, and finally via a decision feature Flu of the given learning problem. Flu refers to the label (or class) of each \(T_{ RDD }(C)\) data item and is defined as follows: \(Flu =\{yes, no\}\). C presents the conditional attribute pool from which the most significant attributes will be selected.

Table 1 Toy data set

Independently of the set of conditional features C, Sp-RST starts by computing the indiscernibility relation for the decision class Flu. We define the indiscernibility relation as \( IND (Flu)\): \( IND (Flu_i)\). Sp-RST calculates \( IND (Flu)\) for each decision class \(Flu_i\) by grouping the \(T_{ RDD }(C)\) data items (instances), expressed in the universe U, that belong to the same decision class \(Flu_i\). Based on the Apache Spark framework and by applying Algorithm 2, line 1, we get the following outputs from the different Apache Spark data splits, which are presented in Tables 2 and 3:

  • From Split 1:

    • \(\langle \)yes, \(x_0\rangle \)

    • \(\langle \)yes, \(x_1\rangle \)

    • \(\langle \)no, \(x_2\rangle \)

  • From Split 2:

    • \(\langle \)no, \(x_3\rangle \)

    • \(\langle \)yes, \(x_4\rangle \)

    • \(\langle \)yes, \(x_5\rangle \)

Table 2 Toy data set—split 1
Table 3 Toy data set—split 2

After that, and by applying Algorithm 2, line 2, we get the following output which refers to the indiscernibility relation of the class \( IND (Flu)\):

  • \(yes, \{x_0, x_1, x_4, x_5\} \)

  • \(no, \{x_2, x_3\}\)

In this example, we assume that we have two partitions (\(m = 2\)). For the first partition, \(m = 1\), a random number \(r = 2\) is selected to build \(T_{ RDD _{i=1}}(C_r)\). For the second partition, \(m = 2\), a random number \(r = 1\) is selected to build \(T_{ RDD _{i=2}}(C_r)\). Based on these assumptions, the following partitions and splits based on Apache Spark are obtained (Tables 4, 5, 6, 7).

Table 4 Partition \(m = 1\)—split 1
Table 5 Partition \(m = 1\)—split 2
Table 6 Partition \(m = 2\)—split 1
Table 7 Partition \(m = 2\)—split 2

Based on the first partition \(m = 1\), and by applying Algorithm 3, which aims to generate all the \( AllComb _{(C_r)}\) possible combinations of the \(C_r\) set of attributes, the output from both Apache Spark splits is the following:

  • Muscle-pain

  • Temperature

  • Muscle-pain, Temperature

In its third distributed job, Sp-RST calculates the indiscernibility relation \( IND ( AllComb _{(C_r)})\) for every created combination, i.e., the indiscernibility relation of every element in the output of the previous step (Algorithm 3). By applying Algorithm 4 and based on both Apache Spark splits, the output is the following:

  • From \(m = 1\), Split 1:

    • Muscle-pain, \(\{x_0\}, \{x_1, x_2\}\)

    • Temperature, \(\{x_0\}, \{x_1, x_2\}\)

    • Muscle-pain, Temperature, \(\{x_0\}, \{x_1, x_2\} \)

  • From \(m = 1\), Split 2:

    • Muscle-pain, \(\{x_3, x_4, x_5\}\)

    • Temperature, \(\{x_3\}, \{x_4\}, \{x_5\}\)

    • Muscle-pain, Temperature, \(\{x_3\}, \{x_4\}, \{x_5\}\)

In a next stage, and by using the previous output as well as \( IND (Flu)\), Sp-RST computes the dependency degrees \(\gamma ( AllComb _{(C_r)})\) of each attribute combination as described in Algorithm 5. This distributed job generates two outputs namely the set of dependency degrees \(\gamma ( AllComb _{(C_r)})\) of the attribute combinations \( AllComb _{(C_r)}\) as well as their associated sizes \( Size _{( AllComb _{(C_r)})}\). The output from both splits for \(m = 1\) is the following:

  • Muscle-pain, 1, 1

  • Temperature, 4, 1

  • Muscle-pain, Temperature, 4, 2

Once all the dependencies are calculated, in Algorithm 6, Sp-RST looks for the maximum value of the dependency among all the computed \(\gamma ( AllComb _{(C_r)})\). The maximum dependency reflects the baseline value for the feature selection task. The output is the following:

  • 4

In the next step, Sp-RST performs a filtering process to keep only those combinations whose dependency degree equals the previously selected baseline value (\(MaxDependency = 4\)), i.e., \(\gamma ( AllComb _{(C_r)}) = MaxDependency = 4\). By applying Algorithm 7, the following output is obtained:

  • Temperature, 4, 1

  • Muscle-pain, Temperature, 4, 2

Through these computations, the algorithm removes, at each level, the unnecessary attributes that may negatively influence the performance of any learning algorithm.

In the final stage, using the results generated in the previous step, Sp-RST applies Algorithm 8 to find the minimum number of features among all the \( Size _{( AllComb _{(C_r)})}\). Once this minimum is determined (\(minNbF = 1\)), the algorithm keeps only the combinations whose size equals minNbF. The selected features define the reduct set and describe all concepts of the initial \(T_{ RDD _i}(C_r)\) training data set. The output of Algorithm 8, which presents the reduct for \(m = 1\), is the following:

  • Temperature

Based on these calculations, for \(m = 1\), Sp-RST reduces \(T_{ RDD _{i=1}}(C_{r=2})\) to \( Reduct _{m=1} = \{Temperature\}\). A compact sketch of these last three steps (Algorithms 6–8) is given below.
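The following driver-side sketch (with hypothetical names) summarizes Algorithms 6–8 on the (combination, dependency degree, size) triples of one partition: keep the combinations reaching the maximum dependency and, among those, the ones with the minimum number of features; these form the reduct(s) of the partition.

def reductsOfPartition(combs: Seq[(Seq[String], Int, Int)]): Seq[Seq[String]] = {
  val maxDependency = combs.map(_._2).max                  // Algorithm 6: baseline dependency
  val candidates    = combs.filter(_._2 == maxDependency)  // Algorithm 7: filter by baseline
  val minNbF        = candidates.map(_._3).min             // Algorithm 8: minimum feature count
  candidates.filter(_._3 == minNbF).map(_._1)
}

// reductsOfPartition(Seq(
//   (Seq("Muscle-pain"), 1, 1),
//   (Seq("Temperature"), 4, 1),
//   (Seq("Muscle-pain", "Temperature"), 4, 2)))  // -> Seq(Seq(Temperature))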

The same calculations are applied to \(m = 2\), and the output is \( Reduct _{m=2} = \{Headache\}\) (as this partition comprises a single feature).

At this stage, different reducts are generated from the different m partitions. With respect to Algorithm 1, lines 12–14, a union of the obtained results is required to represent the initial big information table \(T_{ RDD }(C)\), i.e., Table 1. The final output is \(Reduct = \{Headache, Temperature\}\).

In this example, we presented a single iteration of Sp-RST, i.e., \(N = 1\). Therefore, line 16 of Algorithm 1 is not covered in this example.

Sp-RST thus reduces the big information table presented in Table 1 from \(T_{ RDD }(C)\) to \(T_{ RDD }(Reduct)\). The output is presented in Table 8.

Table 8 Reduct

6 Experimental setup

6.1 Benchmark

To validate the effectiveness of Sp-RST, we require a data set with a large number of attributes that is also defined by a large number of data instances. The Amazon Commerce reviews data set from the UCI machine learning repository [4] fulfills this requirement. The Amazon data set was initially built from several customer reviews on the Amazon commerce Web site and was constructed by identifying the most active users, with the aim of performing authorship identification. The data set comprises a total of 1500 instances, which are described by 10,000 features (linguistic style features such as punctuation, length of words, length of sentences, etc.) and 50 distinct classes (referring to authors). The Amazon data items are uniformly distributed across the classes, i.e., there are 30 items per class.

We demonstrate the scalability of our approach by considering subsets of this data set in terms of attributes. To be more precise, we have created five additional data sets by randomly choosing 1000, 2000, 4000, 6000, and 8000 out of the original 10,000 attributes. We use these sets to evaluate our proposed method as discussed in Sect. 6.2 and refer to them as Amazon1000, Amazon2000, ..., Amazon10,000 in the following.

6.2 Evaluation metrics

To evaluate the scalability of the parallel Sp-RST, we consider the standard metrics from the literature [42], namely the speedup, the sizeup, and the scaleup. These are defined as follows (a small computational sketch is given after the list):

  • For the speedup, we keep the size of the data set constant (where size is measured by the number of features, i.e., we use the original data set with 10,000 features) and increase the number of nodes. For a system with m nodes, the speedup is defined as:

    $$\begin{aligned} \text {Speedup(m)} = \frac{\text {runtime on one node}}{\text {runtime on } m \text { nodes}} \end{aligned}$$

    An ideal parallel algorithm has linear speedup: The algorithm using m nodes solves the problem in the order of m times faster than the same algorithm using a single node. However, this is difficult to achieve in practice due to startup and communication cost as well as interference and skew [42] which may lead to a sub-linear speedup.

  • The sizeup keeps the number of nodes constant and measures how much the runtime increases as the data set is increased by a factor of m:

    $$\begin{aligned} \text {Sizeup(m)} = \frac{\text {runtime for data set of size } m \cdot s}{\text {runtime for baseline data set of size } s} \end{aligned}$$

    To measure the sizeup, we use the smaller databases described in Sect. 6.1. We use 1000 features as a baseline and consider 2000, 4000, 6000, 8000, and 10,000 features, respectively. A parallel algorithm with a linear sizeup has a very good sizeup performance: Considering a problem that is m times larger than a baseline problem, the algorithm requires in the order of m times more runtime for the larger problem.

  • The scaleup evaluates the ability to increase the number of nodes and the size of the data set simultaneously:

    $$\begin{aligned} \text {Scaleup(m)} = \frac{\text {runtime for data set of size } s \text { on 1 node}}{\text {runtime for data set of size } s \cdot m \text { on } m \text { nodes}} \end{aligned}$$

    Again, we use the sub-data set with 1000 features as a baseline. Here, a scaleup of 1 implies ‘linear’ scaleup, which similarly to linear speedup is difficult to achieve.
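For illustration only, the three metrics reduce to simple ratios of measured runtimes; the following sketch shows how they can be computed (the runtime values in the usage comment are purely hypothetical, not measured data).

def speedup(runtimeOneNode: Double, runtimeMNodes: Double): Double =
  runtimeOneNode / runtimeMNodes            // runtime on 1 node / runtime on m nodes

def sizeup(runtimeSizeMS: Double, runtimeBaselineS: Double): Double =
  runtimeSizeMS / runtimeBaselineS          // runtime for size m*s / runtime for baseline size s

def scaleup(runtimeBaseOnOneNode: Double, runtimeMTimesOnMNodes: Double): Double =
  runtimeBaseOnOneNode / runtimeMTimesOnMNodes  // size s on 1 node / size m*s on m nodes

// e.g., with illustrative runtimes (in seconds):
// speedup(runtimeOneNode = 3600.0, runtimeMNodes = 950.0)  // ~3.8 on 4 nodes, i.e., sub-linear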

As previously highlighted in Sect. 2, a preliminary version of our proposed solution was introduced in [8] as an attempt to deal with feature selection in the big data context. Recall, however, that in [8] both the sizeup and the scaleup were measured based on the number of features per partition and therefore rely on a modified definition of the standard metrics detailed above. We consider using the overall number of attributes a much more natural setup, as it gives insights into the performance depending on the input data set rather than on the partitions; hence, the proper definitions of the metrics are adopted in this paper.

To demonstrate that our distributed Sp-RST solution performs its feature selection task well without sacrificing performance, we perform model evaluation using a Naive Bayes and a random forest classifier. For this evaluation, we use the standard measures, namely precision, recall, accuracy and F1 score, as well as the runtime (measured in seconds), to compare the quality of the feature set selected by Sp-RST with other feature selection methods as described in Sect. 6.3. The metric definitions are as follows, where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively (a short sketch computing these measures follows the list):

  • Precision: measures the ratio of correctly predicted positive observations to the total number of predicted positive observations, and is defined as:

    $$\begin{aligned} \text {Precision} = \frac{TP}{TP + FP} \end{aligned}$$
  • Recall: measures the ratio of correctly predicted positive observations to all observations in the actual positive class, and is defined as:

    $$\begin{aligned} \text {Recall} = \frac{TP}{TP + FN} \end{aligned}$$
  • Accuracy: measures the ratio of correctly predicted observations to the total number of observations, and is defined as:

    $$\begin{aligned} \text {Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
  • F1 score: is the harmonic mean of Precision and Recall, and is defined as:

    $$\begin{aligned} \text {F1 score} = 2 \cdot \frac{\text {Recall} \cdot \text {Precision}}{\text {Recall} + \text {Precision}} \end{aligned}$$
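For completeness, the four measures can be computed from the entries of a confusion matrix as in the following small sketch (a hypothetical helper; the counts are assumed to be available from the classifier evaluation).

case class Confusion(tp: Double, tn: Double, fp: Double, fn: Double) {
  val precision: Double = tp / (tp + fp)
  val recall:    Double = tp / (tp + fn)
  val accuracy:  Double = (tp + tn) / (tp + tn + fp + fn)
  val f1:        Double = 2 * recall * precision / (recall + precision)
}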

We remark that Sp-RST is a stochastic algorithm: randomization is applied twice, first when partitioning the data set into m data blocks, and second when selecting one reduct among the generated set (family) of reducts. To diminish the effect of the first randomization process, we perform several iterations of the main part of the algorithm (Algorithm 1) and only keep the set of attributes that are selected in all iterations (a one-line sketch of this intersection is given below). The second randomization process is justified and supported by the fundamentals of rough set theory as presented in Sect. 3. We therefore conduct a deep analysis of the stability of the attribute sets selected over several runs of Algorithm 1 and report averages and standard deviations whenever appropriate.
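Assuming the reduct of each iteration is available as a set of feature names (a hypothetical representation), keeping only the attributes selected in every iteration amounts to a set intersection:

// Assumes at least one iteration has been performed.
def stableFeatures(reductsPerIteration: Seq[Set[String]]): Set[String] =
  reductsPerIteration.reduce(_ intersect _)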

To investigate the significance of any observed variation in the classification performance when random forest and Naive Bayes are applied to the initial data set and to the reduced sets generated by Sp-RST and other feature selection techniques, we perform Wilcoxon signed rank tests with Bonferroni correction.

6.3 Experimental environment

In the following, we conduct a detailed study of various parameters of Sp-RST with the aim of analyzing how they affect the system’s runtime as well as the stability of the attribute selection task. We then apply a Naive Bayes and a random forest classifier to the original data set and to the reduced data sets produced by Sp-RST and other feature selection techniques. We use the scikit-learn random forest implementation with the following parameters: \(\text {n}\_\text {estimators}=1000\), \(\text {n}\_\text {jobs}=-1\), and \(\text {oob}\_\text {score}=\text {True}\). A stratified 10-fold cross-validator is used for all our experiments. Moreover, we use the Naive Bayes implementation from Weka 3.8.2.

The Sp-RST algorithm is implemented in Scala 2.11 within the Spark 2.1.1 framework. Our experiments for Sp-RST are performed on Grid5000, a large-scale testbed for experiment-driven research. Within this testbed, we used nodes with dual 8-core Intel Xeon E5-2630v3 CPUs and 128 GB of memory. Since the study does not require a scalable version of the two classifiers, these experiments are run on a standard laptop with an Intel(R) Core(TM) i7-7500U CPU, 16 GB RAM, and 64-bit Windows 10.

Preliminary results revealed that 10 features per partition is the maximum that Sp-RST can process. We therefore perform experiments using 4, 5, 8, and 10 features per partition in Algorithm 1. We run all settings on 1, 2, 4, 8, 16, and 32 nodes on Grid5000. When considering scalability, we set the number of iterations in Algorithm 1 to 10 (based on preliminary experiments). An additional analysis of the feature selection process across different numbers of iterations is given in Sect. 7.1.

To ensure a fair comparison, we restrict our comparison with other feature selection methods to filter techniques. These include both attribute and subset evaluation methods. For subset evaluation, we use a ‘Best First’ greedy search method. For attribute evaluation, we need to provide either a threshold or a number of features to be selected. We set the number of features to be selected to values comparable with Sp-RST, i.e., the average number of features selected for each parameter setting considered, and additionally use 0 as a threshold. We determine the sets of features selected by these methods and then perform model evaluation with a Naive Bayes and a random forest classifier as discussed previously.

Subset selection:

  • CfsSubsetEval considers the individual predictive ability of each feature along with the degree of redundancy between them

  • ConsistencySubsetEval considers the level of consistency in the class values when the training instances are projected onto the subset of attributes

Attribute selection:

  • Sum squares ratio, which measures the ratio of between-groups to within-groups sum of squares.

  • Chi squared, which computes the value of the chi-squared statistic with respect to the class.

  • Gain ratio, which measures the gain ratio with respect to the class.

  • Information gain, which measures the information gain with respect to the class.

  • Correlation, which measures the (Pearson) correlation between each attribute and the class.

  • CV, which first creates a ranking of the attributes based on their variation value, then divides them into two groups, and finally uses a verification method to select the best group.

  • ReliefF, which repeatedly samples an instance and considers the value of the given attribute for the nearest instances of the same and of a different class.

  • Significance, which computes the probabilistic significance as a two-way function (attribute-classes and classes-attribute association).

  • Symmetrical uncertainty, which evaluates the symmetrical uncertainty with respect to the class.

For the sum squares ratio, we have used the version implemented in Smile, while for the remaining techniques, we have used the implementations provided in Weka 3.8.2.

7 Discussion of results

We first examine the feature selection process over several iterations, as the number of iterations is a crucial parameter of Sp-RST (Sect. 7.1). Afterwards, we analyze the stability (Sect. 7.2) and scalability (Sect. 7.3) of our proposed feature selection approach. Finally, we compare its performance with other state-of-the-art feature selection techniques (Sect. 7.4).

7.1 Number of iterations in Sp-RST

We first have a closer look at one of the parameters of Sp-RST, the number of iterations (N in Algorithm 1). We perform four independent runs (repetitions) of Sp-RST with 1, 2, ..., 20 iterations and record, for each iteration, the selected features as well as the elapsed time. We plot the average and standard deviation of the number of remaining features after each iteration over these four runs and for different parameter settings in Fig. 4.

Fig. 4 Number of features selected depending on the number of iterations executed in a run of Sp-RST (average and standard deviation over 4 runs)

We also perform runs of Sp-RST with 1, 2, ..., 20 iterations on different numbers of nodes, i.e., 1, 4, 8, and 16 nodes. The corresponding runtimes, split by the number of nodes, are shown in Fig. 5. Here, the average and standard deviation are taken across all runs with the same number of iterations.

Fig. 5 Runtime of Sp-RST depending on the number of iterations

From Fig. 4, we observe that, independently of the number of iterations, there is a clear ordering with respect to the number of selected features: the smaller the number of features per partition, the fewer features are selected by Sp-RST. The very small standard deviation (\(< 40\)) is hardly visible in the graphs and clearly demonstrates the stability of Sp-RST with respect to the number of features selected in an iteration. Recall that Sp-RST returns the intersection of the reducts from all iterations performed. Thus, the number of selected features decreases monotonically with the number of iterations.

As discussed before, the runtime of the rough set component of our method grows exponentially with the number of features per partition. Thus, the runtime behavior in Fig. 5 is not surprising and clearly demonstrates that the number of features per partition should not grow too large. We also see that the runtime per iteration is quite stable, so that the overall runtime grows linearly with the number of iterations. Based on these experiments, we have decided to use a medium number of iterations, namely 10, for the remainder of our analysis.

7.2 Stability of feature selection

To validate the stability of the feature selection of Sp-RST, we have a closer look at the concrete features selected. We perform two sets of experiments. First, we look at the features selected by Sp-RST with a single iteration (\(N=1\)). Second, we consider our standard parameter setting of 10 iterations (\(N=10\)) and perform several independent runs of Algorithm 1. We particularly look at two extreme cases: the number of features that are always selected and the number of features that are never selected over a given number of these runs or iterations.

7.2.1 Iterations

We first consider the features selected during a single iteration (\(N=1\)). We perform 10, 20, 40, and 80 independent runs of Sp-RST with a single iteration and record, for each run, the set of selected features. For each feature, we count the number of times it has been selected and depict the results in Fig. 6. Table 9 additionally shows the mean, standard deviation (SD), minimum and maximum of the number of features selected in a single iteration (over the 80 independently performed runs).

Fig. 6 Number of features selected i times in i iterations

From Fig. 6 and Table 9, we see that the number of features selected in a single iteration is very stable. However, depending on the number of features per partition, only about 45% (4 and 5 features), 55% (8 features), and 70% (10 features) of these features are selected in all 10 runs, and already for 20 runs all features will have been selected at least once. The latter also holds for 40 and 80 runs. This demonstrates that the set of ‘core’ features (defined in Sect. 3.2) that are reliably selected is much smaller than the set of features selected in a single iteration. This observation is the main motivation for using several iterations and only returning the features that are always selected as the result of Sp-RST.

Table 9 Number of features selected in a single iteration

7.2.2 Complete algorithm

To confirm that using the intersection of several iterations improves the stability of the feature selection, we now consider the complete algorithm with 10 iterations. We run 6 independent repetitions of Sp-RST with 10 iterations and plot the number of features selected 0, 1, ..., 6 times in these 6 runs in Fig. 7.

Fig. 7 Number of features selected 0, 1, ..., 6 times in six independent runs of Sp-RST

From Fig. 7, it is obvious that most features are either always or never selected. This demonstrates that Sp-RST reliably selects the same features, or in other words is able to identify the most relevant (always selected) and least relevant (never selected) features. As before, we provide the number of features selected in a run of Sp-RST in Table 10 and observe that the number of features selected is again very stable.

Table 10 Number of features selected in a run of Sp-RST

We remark that it is not a weakness of our proposed method that the actual features selected differ from one run to another. As discussed in Sect. 3.2, there can be more than one reduct, i.e., a family of reducts, and selecting an arbitrary one among these is appropriate. Only core features appear in all reducts of the generated family. The results presented in Fig. 7 support that our method is able to identify core features as well as features that do not appear in any of the resulting reducts.

7.3 Scalability

We measure the scalability of Sp-RST based on its speedup (Fig. 8), sizeup (Fig. 10) and scaleup (Fig. 12) as discussed in Sect. 6.2, and additionally plot the measured runtimes by the number of nodes and the different data sets (Figs. 9, 11). We see that the runtime increases considerably with the number of features per partition and decreases with the number of nodes used (Fig. 9). In the latter case, however, increasing the number of nodes from 1 to 2 or 4 has a much larger effect than increasing it further (Fig. 9). Moreover, the runtime increases with the size of the database (Fig. 11), where size is measured in terms of the number of features.

Fig. 8 Speedup for the six data sets discussed in Sect. 6.1

Fig. 9 Runtimes (in seconds) for the six data sets discussed in Sect. 6.1

In terms of speedup (Fig. 8), we see that the speedup for our smallest data set (Amazon1000) with 8 or 10 features per partition is approximately linear. However, in general, and in particular for larger databases, the speedup is sub-linear. It is better for settings with more features per partition (with 8 and 10 features always being the best parameter settings). This implies that having more partitions is not beneficial with respect to the parallel running time, even though the runtime for a single partition grows exponentially in its number of features. It indicates that settings with 4 and 5 features per partition generate sub-databases that are too small and cause unnecessarily large overhead due to high communication cost.

Figure 10 shows that Sp-RST has a very good sizeup performance; however, the more nodes are used, the more important the parameterisation becomes. We see that 4 or 5 features per partition generally yield a better sizeup than 8 or 10 features, while 4 and 5 features, as well as 8 and 10 features, show very similar sizeup performance among themselves. For the latter, the sizeup deteriorates for \(m \ge 6\). This is in stark contrast to the speedup results discussed earlier, where we have observed better speedups for more features per partition. Thus, depending on the size of the database in terms of features and the number of available nodes, different parameter settings will be more appropriate.

Fig. 10 Sizeup for different numbers of nodes

Fig. 11 Runtimes for different numbers of nodes

Figure 12 shows the scaleup for 1, 2, 4, and 8 nodes and the corresponding data sets Amazon1000, Amazon2000, Amazon4000, and Amazon8000. For up to 4 nodes, the scaleup is close to 1 for most parameter settings; however, it drops below 1 for 8 nodes. Therefore, we can conclude that using a very large number of nodes does not necessarily yield much better runtimes while we obtain large improvements with moderate parallelization. This is in line with the speedup results in Fig. 8 where the speedup was close to linear for most parameter settings for up to 4 or 8 nodes and only deteriorates quickly with more nodes.

Fig. 12 Scaleup

7.4 Comparison with other feature selection techniques

To demonstrate that our method is suitable with respect to classification, i.e., that Sp-RST performs its feature selection task well without sacrificing performance, we perform model evaluation and investigate the influence of Sp-RST on the classification performance of a Naive Bayes (Sect. 7.4.1) and a random forest (Sect. 7.4.2) classifier. We compare the results with those obtained on the original data set with 10,000 features and with other feature selection techniques (see Sect. 6.3).

To account for the stochasticity of Sp-RST, we obtain six different feature sets for each parameter setting from independent runs of Sp-RST. For all other methods, we obtain a single feature set. We run the two classifiers on all resulting data sets and combine the results of data sets based on the same Sp-RST parameter setting. We report the average and standard deviation of 10 independent runs for each considered data set and visualize the results as boxplots.

7.4.1 Naive Bayes

From Fig. 13 and Table 11 in “Appendix A”, we can see that the classification results strictly improve when the number of features per partition in Sp-RST is increased. Moreover, they only slightly worsen in comparison with the original data set. For example, the accuracy is 0.5634 for 4 features, 0.5934 for 10 features, and 0.6099 for the original data set.

Fig. 13 Naive Bayes classification results for different feature selection techniques

For the other feature selection techniques, the results differ widely depending on the parameterisation, as shown in Figs. 14 and 15 and Table 11 (“Appendix A”). This demonstrates that the other tested techniques are much more sensitive to suitable parameter settings than Sp-RST. As noted before, parameterisation can be a very difficult task to perform, which demonstrates a clear strength of our method.

Fig. 14 Naive Bayes classification results for other feature selection techniques (Part 1)

Fig. 15 Naive Bayes classification results for other feature selection techniques (Part 2)

Recall that the parameter determining the number of features to be selected was set so as to mirror the number of features selected by Sp-RST, namely the average number of features selected in the four different parameter settings (see Table 10). Additionally, we used a standard threshold value of 0. Looking more closely into the accuracy results for the best parameter setting of Sp-RST, i.e., 10 features, we see that it outperforms all parameterisations of CV and Consistency, but is outperformed by all parameterisations of Cfs, Correlation, ReliefF and Sum Squares Ratio. For the remaining methods, only the parameterisation with the largest number of features (6171) outperforms Sp-RST (with the exception of Significance, which also obtains good results for a threshold of 0). It should be noted that the overall best accuracy is obtained by Sum Squares Ratio with 4117 features (0.6556), while the worst accuracy is obtained by CV with 3205 features (0.4877). Recall that the accuracy of Sp-RST ranges from 0.5634 to 0.5934.

To further validate these conclusions, pairwise Wilcoxon rank sum tests with Bonferroni correction were executed (see Tables 14, 15, 16 and 17 in “Appendix B”). We observe that the majority of observed differences are statistically significant at a confidence level of 0.05, but there are some notable exceptions where statistical significance could not be confirmed by the test. These include the comparison of Sp-RST with 8 and with 10 features, as well as Sp-RST with 10 features in comparison with Chi-Squared, Gain Ratio, Info Gain, Significance and Symmetrical Uncertainty with parameters 5566 and 6171. We remark that these are exactly the methods that only outperformed Sp-RST (10) for the parameter setting of 6171, but were worse for the other settings of the number of features to be selected.

While we have only provided a detailed discussion for the accuracy metric, identical observations can be made for the other three metrics (precision, recall, and F1 score).

7.4.2 Random forest

The overall classification performance of the random forest classifier is much better than that of the Naive Bayes classifier. To be more precise, the accuracy ranges from 0.6483 (Consistency) to 0.8121 (Significance with 6171 features). The accuracy of Sp-RST lies between 0.7733 and 0.7938, again strictly increasing with the number of features per partition, and is thus comparable to the best performing methods. Detailed results can be found in Figs. 16, 17 and 18 and Table 12 in “Appendix A”. We see that the overall ranking of the tested methods is very similar to the Naive Bayes results and that Sp-RST outperforms, or is outperformed by, roughly the same methods and parameterisations. The main exception is that, for the methods where only the parameter setting 6171 performed better in the case of Naive Bayes, now also the setting 5566 produces slightly better results. However, the overall differences are much smaller than in the case of the Naive Bayes classifier.

Fig. 16 Random forest classification results for different feature selection techniques

Fig. 17 Random forest classification results for other feature selection techniques (Part 1)

Fig. 18 Random forest classification results for other feature selection techniques (Part 2)

Fig. 19 Runtime of the Naive Bayes classifier for the results of different feature selection techniques

Fig. 20 Runtime of the random forest classifier for the results of different feature selection techniques

Again, we have executed pairwise Wilcoxon rank sum tests with Bonferroni correction to validate our results (see Tables 18, 19, 20 and 21 in “Appendix B”). While most results for 4, 5 and 8 features per partition are statistically significant at confidence level 0.05, for quite a few comparisons with Sp-RST (10) no statistical significance could be observed. This particularly holds for medium parameter settings such as 4117 and 5566 features, while in the case of the random forest classifier the parameter setting 6171 produces statistically different results in terms of accuracy. Recall that this was different for the Naive Bayes classifier.

While we have only discussed the accuracy results in detail, identical observations can be made for the other three metrics shown in the graphs and tables.

7.4.3 Runtime comparison

We have provided a detailed analysis of the runtime for Sp-RST in Sect. 7.3. It should be noted that all other methods have negligible runtime for the feature selection part, and thus, there is a clear trade-off between runtime and feature selection quality. Based on the comparable classification results, we argue that the additional runtime is worthwhile, particularly if the reduced feature set has the potential to be used repeatedly in different applications. We summarize the runtime for the classification in Table 13 and depict boxplots in Figs. 19 and 20.

Besides these results, it is important to recall that, as shown in [6], the existing methods are inadequate for coping with a larger number of features, as they suffer from scalability problems. Hence, based on our detailed analysis and on the results reported in [6], we conclude that our proposed Sp-RST solution is a scalable and effective method for large-scale data pre-processing and is thus relevant to big data.

8 Conclusion and future work

In this paper, we have introduced a parallelized filter feature selection technique based on rough set theory, called Sp-RST. We have presented a comprehensive experimental study to investigate parameter settings and to demonstrate the stability and scalability of our proposed method. Moreover, we have compared Sp-RST with other commonly used filter techniques. We have performed model evaluation with a Naive Bayes and a random forest classifier to assess the quality of the feature sets selected by Sp-RST and the other considered methods.

Our experiments show that the proposed method effectively performs feature selection in a well-founded way without sacrificing performance. In terms of scalability, using a moderate number of nodes yields a considerable improvement of the runtime and good speedup performance. However, the improvement quickly stagnates if 8 or more nodes are used. Thus, improving the speedup, sizeup and scaleup performance for larger numbers of nodes is subject to future work.

Our results show that Sp-RST is competitive with or better than the other methods on the Amazon data set and only induces a very small information loss: for the best Sp-RST parameter setting, the classification accuracy for Naive Bayes is 6.22% smaller than for the best comparator (Sum Squares Ratio), while for random forest it is only 1.83% smaller (Significance). In comparison with the original data set, we lose only 1.65% for Naive Bayes and 0.88% for random forest. Moreover, we have demonstrated that Sp-RST is able to reliably identify the most and least important features in the data set, which can be an important aspect when interpreting the feature selection results from an application perspective. Improving the overall classification ratio of Sp-RST is subject to future work. In particular, we plan to investigate the performance and behavior of Sp-RST on other, larger data sets to further our understanding of its working principles.