1 Introduction

Nowadays, digital applications are experiencing an unprecedented growth in the amount of data to be processed and, at the same time, require high efficiency in terms of computational time, power consumption, and so on. The inadequacy of classical computing system design approaches to tackle these requirements has already been identified as one of the main Big Data challenges.

Scientific literature is mainly focusing on pattern recognition, machine learning and data classification systems since, on the one hand, the Big Data domain gives access to huge and heterogeneous datasets, enabling new applications, while, on the other hand, researchers investigate technological innovations and new methodologies to overcome performance issues, in terms of both model accuracy and timing (computing throughput and latency). As for the former, accuracy plays a fundamental role since even a small percentage of classification inaccuracy affects a very large number of processed data samples and, for this reason, new learning algorithms are being proposed, such as classifier systems based on the combination of multiple models [1,2,3].

As for the latter, Van Essen et al. [4] investigated the possibility of using FPGA technology to accelerate random forest-based classifiers, proving that this technology performs significantly better than multi-core processors or GP-GPUs; thus, a considerable effort has been made to exploit hardware accelerators to obtain better performance compared to software implementations.

For instance, this is the case of hardware acceleration of decision tree (DT) classification systems based on field-programmable gate arrays (FPGAs) [5,6,7,8,9].

Hardware design of multiple classification systems (MCSs) makes it possible to combine both advantages, namely high accuracy and better computational performance, as already demonstrated by the authors of [10]. Many applications can benefit from the performance of hardware accelerators for classification. An example is data stream mining [11] in which, unlike traditional data mining applications where the data are static and can be read repeatedly, constraints such as bounded memory, single-pass processing and real-time response have to be satisfied. There are also many Fog Computing applications that deploy machine learning models on the peripheral nodes of the network. In [12], for example, a fog computing topology for delivering real-time embedded machine learning features is proposed.

However, the hardware design of MCSs hides some design issues, mainly related to the large silicon area required by the logic circuits that implement the whole classifier. Indeed, while a single classification system can somehow be adapted to fit onto a feasible hardware device, an MCS requires adapting not only a set of single models, but also the combiner needed to merge the individual outcomes into the proper classification result.

In order to deal with this issue, in this paper we propose a hardware synthesis methodology that exploits the approximate computing (AxC) design paradigm: by renouncing some classification accuracy, AxC is able to reduce the hardware overhead with respect to the fully accurate system.

In particular, starting from a trained DT model—encoded in the Predictive Model Markup Language (PMML) [13]—we manipulate it in order to produce hardware accelerators. We then prove that DT-based MCSs can be successfully approximated by the precision-scaling technique and that the exploration of approximate DT variants can be suitably cast as a heuristic multi-objective optimization problem (MOP). Experimental evidence, gathered over a significant dataset, highlights the efficacy of the approach: approximate DT variants are explored by means of a genetic algorithm (GA) and hardware accelerators are synthesized on a Xilinx Zynq 7020 FPGA device. Additionally, we compare our experimental results against the exhaustive branch-and-bound approach proposed in [14], demonstrating a reduction in area occupancy of about 10%.

The remainder of the paper is structured as follows: Sect. 2 gives a brief overview of AxC and of the scientific contributions on hardware implementation of DT-based MCSs, while Sect. 3 details the hardware architecture that realizes the DT-based MCS. In particular, the proposed approach for MCS hardware implementation is discussed, highlighting how the AxC design paradigm can reduce the requirements of such an implementation in terms of both silicon area and power consumption. Section 4 formalizes the approximate DT variants exploration problem as a MOP and, in Sect. 5, we report the experimental results. Finally, Sect. 6 draws the conclusions.

2 Scientific background

The scientific literature has demonstrated that imprecise calculations can be selectively exploited to enhance computing system performance, which defines the AxC paradigm [15]. Indeed, due to the redundancy of inner calculations, some applications are characterized by an inherent resiliency to errors. Basically, by relaxing the functional requirements of a computing system, AxC enables trading output accuracy for performance, such as calculation speed, throughput and, for integrated circuits (ICs), occupied area.

Since a naive approximation approach, such as uniform approximation, is unlikely to be efficient, different AxC techniques have been proposed [16]. Some examples are precision-scaling, loop perforation [17], memoization [18], functional approximation, and so forth. In particular, precision-scaling for input data and intermediate operands has been proposed to improve efficiency for floating-point computation in many scientific applications [19,20,21,22,23].

Leveraging the full potential of AxC, however, requires addressing several challenges. First of all, output quality (i.e., the introduced error) must be carefully monitored: it is mandatory to ensure that application requirements are met [24]; hence, quality metrics must be properly selected for error assessment. The selection of appropriate metrics for error estimation is not a trivial task, as they are application- and workload-dependent [16]. Moreover, the measurement of the error is typically done by running the whole approximate application, or by simulations, which may require significant effort. A different approach, based on a priori estimation using Bayesian inference, has been proposed by Traiola et al. [25, 26].

Selecting a particular approximate configuration of a given application, generated by a given technique, is a major challenge. AxC techniques, in fact, may generate many different approximate versions of the same application. For instance, let us consider loop perforation [17]: the amount of skipped iterations—i.e., the approximate configuration of the target application—must be properly chosen in order to find a good trade-off between the introduced error and the performance gains. In addition, among all the realizable approximate versions, only those characterized by an error that falls below a user-defined error threshold must be taken into consideration.

2.1 Decision tree-based hardware classifiers

Classification systems are among the applications characterized by inherent error resiliency, as models are retrieved by means of iterative training algorithms exploiting large datasets [24]. This holds even more for multi-classification systems (MCSs), such as the random forest classifier, a well-known machine learning technique in which an ensemble of decision trees is used to assign a label (or classification) to an input sample [27].

Let us now briefly introduce how DTs are employed as classification systems. DTs are tree-like predictive models in which each internal node specifies a test on a given variable, namely a model feature, each branch represents a possible test outcome, and each leaf carries the decision outcome, namely the predicted class. Models in which the target variable can take a discrete set of values are called classification trees, while regression trees associate each leaf with a probability distribution.

Algorithms for constructing DTs usually work top-down, choosing at each step the variable that best splits the set of samples of a training set [28]. One of the most widely adopted training algorithms for DTs is C4.5, proposed in [29]. It constructs, by means of an inductive approach, classification models from training databases, following a top-down, divide-and-conquer paradigm. At each step of the training algorithm, the dataset is split based on conditions defined upon a chosen feature. The feature selection involves an entropy test that establishes which feature induces the best partitioning of the dataset. The construction continues recursively until a leaf is reached. As for testing, the DT prediction algorithm performs a recursion too, as explained later in Sect. 4.
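
To make the visit concrete, the following minimal Python sketch implements the recursive prediction just described; the node layout and the example tree are illustrative assumptions of ours, not the paper's actual data structures (the flow described later operates on PMML models).

```python
# Illustrative sketch of recursive DT prediction (hypothetical node layout).

class Node:
    """Internal nodes test `feature < threshold`; leaves carry a class label."""
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, label=None):
        self.feature = feature      # index into the input feature vector
        self.threshold = threshold  # split value chosen during training
        self.left = left            # visited when the test holds
        self.right = right          # visited otherwise
        self.label = label          # predicted class (leaves only)

def predict(node, x):
    """Descend from the root until a leaf is reached."""
    if node.label is not None:
        return node.label
    if x[node.feature] < node.threshold:
        return predict(node.left, x)
    return predict(node.right, x)

# A toy two-class tree: predicts "a" when f0 < 4, "b" otherwise
tree = Node(feature=0, threshold=4.0,
            left=Node(label="a"), right=Node(label="b"))
print(predict(tree, [3.2]))  # -> a
```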

So far, the scientific literature has devoted a huge effort to researching new design methodologies to improve both the rate and the accuracy of classification systems. Indeed, as for the former, new classification architectures have been proposed, mainly based on custom hardware acceleration.

The authors of [5, 6] proposed an automatic methodology to generate hardware implementations of DT-based classifiers. The proposed methodology consists of three phases: (i) structuring the data coming from different sources into a single schema; (ii) using the schema to model a predictor by exploiting the C4.5 algorithm [29]; (iii) automatically converting the DT model into a VHDL hardware accelerator. They demonstrated that the approach performs dramatically better than a pure software solution, guaranteeing a significantly higher classification throughput by implementing the DT prediction on an FPGA. As for the latter, novel training techniques, such as multi-classification systems, have been devised to deal with accuracy. Van Essen et al. [4] quantified the performance, power, and cost of DT-based MCSs, trained by means of the random forest algorithm, implemented on CPUs, GP-GPUs and FPGAs. The FPGA implementation outperforms the other two solutions, in accordance with the results of [5], both in terms of classification rate and power dissipation, measured in classifications per second per watt.

The authors of [7, 8, 10] proposed a novel approach for an efficient hardware implementation of MCSs: parallel classification entities—for instance, DTs—execute the classification in parallel, and then a hardware combiner merges the outputs of the different DT classifiers in order to make the final decision. DT nodes work in parallel and each one is implemented as a binary comparator: once it receives the feature value, it returns a Boolean value that, in turn, is fed to a Boolean network in order to compute which leaf of the tree has been reached.

2.2 Approximate classifiers

The adoption of hardware DT MCSs is actually hindered by scalability issues, as reported in [4], and AxC techniques have mainly been devoted to other classification systems.

A technique to improve the energy efficiency of machine-learning classifiers has been proposed by Venkataramani et al. [30]. Having noticed that, typically, only a part of the data of a given dataset really needs the full computational power of a classifier, they dynamically configure the classifier to be more or less accurate, according to the difficulty of classifying the inputs. During the training phase, instead of building one complex decision model, a cascade, i.e., a series of models with progressively increasing complexity, is constructed. During the testing phase, the number of decision models applied to a given input varies depending on the difficulty of the considered input instance. Inputs are processed by the classifiers sequentially, starting from the least accurate one. In order to estimate the difficulty of a certain input, a confidence level is computed for each classification. If the estimated confidence exceeds a certain threshold, the classification process is terminated; otherwise, a more accurate classifier is used. The experimental results they present show a significant reduction in energy consumption.
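
The control flow of this scheme can be summarized by the following Python sketch; the classifier objects, the `predict_with_confidence` method and the thresholds are hypothetical placeholders used only to illustrate the idea, not the interface of [30].

```python
# Illustrative control flow of a cascaded classification scheme.

def cascade_predict(classifiers, thresholds, x):
    """Try classifiers from least to most accurate; stop once confident enough.

    `classifiers` is assumed ordered by increasing complexity; each is assumed
    to expose a (label, confidence) prediction. Easy inputs terminate early,
    saving energy; only hard ones reach the most accurate (and costly) model.
    """
    label = None
    for clf, threshold in zip(classifiers, thresholds):
        label, confidence = clf.predict_with_confidence(x)
        if confidence >= threshold:
            return label
    return label  # fall back to the last (most accurate) classifier's output
```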

In Nepal et al. [31], the precision-scaling technique has been used to reduce the power consumption of a perceptron classifier [32, 33]. Starting from a behavioral description of the classifier to be approximated, several different hardware configurations have been generated by means of the Automated Behavioral Synthesis of Approximate Computing Systems (ABACUS) tool. Then, the Pareto front, which consists of the configurations providing the optimal trade-off between accuracy and gains, is computed using an iterative stochastic greedy algorithm. The authors claim a 33% reduction in energy consumption, with an accuracy of 83%.

A very practical approach has been adopted by the authors of [34] and [35]. They replaced the multipliers and adders of a support vector machine (SVM) classifier [36] to introduce approximation in it. In Van Leussen et al. [34], the exact Karnaugh multiplier needed by the classifier has been replaced by an inaccurate Karnaugh multiplier (IKM) [37], which shows a uniform error distribution over the entire input range. The classifier has then been synthesized on a 28 nm CMOS technology, in order to estimate its energy requirements and accuracy. The authors claim savings of 14% in silicon area and 61% in power consumption, while maintaining the same classification accuracy. In Zhou et al. [35], a new approximate adder and a new approximate multiplier are proposed. To show the full potential of the proposed arithmetic units, both the exact adder and the exact multiplier needed by an SVM classifier are replaced by their approximate versions. The resulting classifier, synthesized on a 90 nm CMOS technology, exhibited an area reduction of 18%, and simulations showed an energy consumption reduction of 32%, while keeping an accuracy of 95%.

3 Hardware architecture of DT-based MCS

In order to implement MCSs, our approach replicates the one proposed by the authors of [10]. Multiple DTs are used simultaneously to achieve greater classification accuracy. The outcomes of all the DTs are evaluated by a majority voter: the final outcome is the class predicted by the majority of the DTs. Section 3.1 details the hardware implementation of the DT visiting algorithm, while Sect. 3.2 discusses the concepts behind the hardware implementation of the majority voter used to choose the winning class.

3.1 Implementing DTs on hardware

In order to speed up the DT visit, in this paper we adopt the speculative approach proposed by the authors of [6], which takes advantage of the inherent parallelism of the hardware. The speculative approach consists in flattening the DT so that the visit is performed over every possible path. In a classical visit, each DT node contains a condition that establishes whether the visit has to continue on the left or on the right sub-tree, until a leaf is reached. In the speculative approach, instead, predicates are evaluated concurrently, regardless of the position and depth at which nodes are located: a Boolean decision variable, which indicates whether a condition is fulfilled, is produced for each of the evaluated predicates. In order to determine which leaf of the DT is reached, i.e., which class the input belongs to, a Boolean function, called assertion, is defined for each class. Since a path leading to a specific leaf is obtained by computing the logic AND of the Boolean decision variables along that path, and since it is possible to compute the logic OR of the conditions related to different paths leading to leaves of the same class, assertions can be defined as sum-of-products Boolean functions. For the sake of clarity, let us consider the DT depicted in Fig. 1, which evaluates two features in order to assess which of three classes the inputs belong to. Starting from the root node, descending the DT and visiting nodes from left to right, the Boolean decision variables involved in the classification process are Q1, which is produced at the root node, Q2, produced at the \(f_2 < 10.9\) node, and so on. Let us consider the \(\alpha \) class: an input vector belongs to it if \(f_1 \ge 4\)—Q1 is false—and \(f_2 \ge 10.9\)—Q2 is false—or if \(f_1 < 4\)—Q1 is true—and \(f_2 \ge 27.5\)—Q3 is false—and \(f_1 < 17\)—Q4 is true. In Eq. 1, we report the Boolean assertions for all the classes.

$$\begin{aligned} \alpha &= (\overline{Q1} \wedge \overline{Q2}) \vee (Q1 \wedge \overline{Q3} \wedge Q4) \\ \beta &= (\overline{Q1} \wedge Q2) \vee (Q1 \wedge Q3 \wedge Q5) \vee (Q1 \wedge \overline{Q3} \wedge \overline{Q4}) \\ \gamma &= Q1 \wedge Q3 \wedge \overline{Q5} \end{aligned}$$
(1)

Predicates are evaluated using decision boxes (DBs), i.e., comparators, while the visiting algorithm is realized as a multi-output Boolean function. A comprehensive block schema is depicted in Fig. 2.

Fig. 1: An example of decision tree

Fig. 2: Hardware implementation of a decision tree
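
As a software analogue of this scheme, the sketch below evaluates all the decision variables of the DT of Fig. 1 at once and then applies the assertions of Eq. 1. The thresholds for Q1–Q4 follow the text; the predicate behind Q5 is not given in the excerpt, so it is a made-up placeholder.

```python
# Speculative visit of the DT of Fig. 1: all comparators fire concurrently,
# then the sum-of-products assertions of Eq. 1 pick the class.

def classify(f1, f2):
    # Decision boxes: in hardware these comparisons run in parallel
    Q1 = f1 < 4
    Q2 = f2 < 10.9
    Q3 = f2 < 27.5
    Q4 = f1 < 17
    Q5 = f2 < 50.0  # placeholder predicate: Q5's test is not given in the text

    # Assertions (Eq. 1): exactly one of them holds for any input
    alpha = (not Q1 and not Q2) or (Q1 and not Q3 and Q4)
    beta = (not Q1 and Q2) or (Q1 and Q3 and Q5) or (Q1 and not Q3 and not Q4)
    gamma = Q1 and Q3 and not Q5
    return {"alpha": alpha, "beta": beta, "gamma": gamma}

print(classify(f1=3.0, f2=30.0))  # Q1 true, Q3 false, Q4 true -> alpha
```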

As proposed by Amato et al. [5, 6], the hardware circuit can be automatically synthesized starting from the training dataset. The predictor model, coded in PMML [13], is obtained from a labeled dataset by means of the KNIME [38] tool. Then, in order to perform the FPGA synthesis, the model is translated into VHDL using the PMML2VHDL tool [5, 6]. This tool generates a DB for each different predicate to be evaluated. Moreover, the tool makes use of the Berkeley SIS tool to produce an optimized version of the assertion functions.

The scalability of this approach has been formally demonstrated in [8]. In particular, the number of literals in each assertion is always less than or equal to twice the size of the feature set.

3.2 Hardware combiner for class selection

The outcomes of the assertion functions belonging to the same class but computed by different DTs are arranged in an array of N elements, N being the number of DTs. A majority voter is used to state which class is the winner.

Let \(d_{i, j}\) be the preference expressed by the i-th DT for the j-th class, i.e., \(d_{i, j}\) is a Boolean variable equal to 1 if and only if the classifier input has been recognized by the i-th DT as belonging to the j-th class; the following matrix can be defined:

$$\begin{aligned} \mathbb {D} = \begin{bmatrix} d_{0, 0} & d_{0, 1} & \cdots & d_{0, M-1}\\ d_{1, 0} & d_{1, 1} & \cdots & d_{1, M-1}\\ \vdots & \vdots & \ddots & \vdots \\ d_{N-1, 0} & d_{N-1, 1} & \cdots & d_{N-1, M-1} \end{bmatrix} \end{aligned}$$
(2)

We define \(p_j = \sum _{i = 0}^{N-1} d_{i, j}, \; 0 \le j < M\). Since each DT expresses exactly one preference (i.e., \(\sum _{j = 0}^{M-1} d_{i, j} = 1, \; 0 \le i < N\)), it follows that the class w is the most voted if and only if \(p_w > p_j, \; \forall j \ne w\), while we get a draw if and only if \(\exists \{i, j\} \text { s.t. } p_i = p_j = \max _{0 \le k < M} \{p_k\}\).

Rather than using binary adders to state which class gets the highest score, the majority voter sorts each column of the matrix \(\mathbb {D}\) using a parallel sorting algorithm, much like bubble-sort, by shifting all the high bits to the beginning of each column. This process is performed by a Boolean circuit called a sorting network, whose depth is equal to N.

Let us consider an array of two bits: Table 1 reports the truth table of a sorting network for two one-bit values. It is easy to recognize that \(y_0 = x_0 \vee x_1\) and \(y_1 = x_0 \wedge x_1\). Conversely, directly defining an N-bit sorting network is cumbersome. However, such a network can be built using multiple two-bit sorting networks arranged in an N-stage pipeline, with even stages consisting of N/2 two-bit sorters—each of which compares array elements starting from even positions—and odd stages consisting of \(N/2-1\) two-bit sorters—each of which compares array elements starting from odd positions [10]. The sorting network needs at least N/2 clock cycles to provide the sorted array. An example of such a network is provided in Fig. 3.

Table 1 Truth table of a one-bit sorting network
Fig. 3: Four-bit sorting network
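
A behavioral Python model of this odd-even scheme is sketched below; this is an illustration of ours, while the actual circuit is combinational logic pipelined over N stages.

```python
# Behavioral model of the odd-even transposition network: each stage applies
# two-bit sorters (y0 = x0 OR x1, y1 = x0 AND x1) to adjacent pairs, shifting
# the high bits toward the front of the column.

def sort_column(bits):
    b = list(bits)
    n = len(b)
    for stage in range(n):                  # n pipeline stages
        start = 0 if stage % 2 == 0 else 1  # even/odd stages use offset pairs
        for i in range(start, n - 1, 2):
            # two-bit sorter: OR goes to the front, AND to the back
            b[i], b[i + 1] = b[i] | b[i + 1], b[i] & b[i + 1]
    return b

print(sort_column([0, 1, 0, 1]))  # -> [1, 1, 0, 0]
```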

Once the votes are sorted, the score received by each class needs to be checked. Let us define a threshold indicator as follows:

$$\begin{aligned} \tau _{i, j} = {\left\{ \begin{array}{ll} 1 &\quad i \le p_j \\ 0 &\quad i > p_j \end{array}\right. }, \quad 2 \le i \le N/2 \end{aligned}$$
(3)

Hence, to detect the most voted class we need to find the \(\tau _{i, j} = 1\) with the highest i: if it is unique, we have a winning class; otherwise, we have a draw. Exploiting the same kind of sorting network used before, we can easily detect these two conditions.
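
In software terms, the winner/draw detection amounts to the following sketch, a functional model of ours rather than the hardware netlist:

```python
# Functional model of the combiner: D[i][j] = 1 iff DT i voted for class j.

def vote(D):
    """Return the index of the most voted class, or None on a draw."""
    p = [sum(col) for col in zip(*D)]  # per-class scores p_j
    best = max(p)
    winners = [j for j, score in enumerate(p) if score == best]
    return winners[0] if len(winners) == 1 else None

# Three DTs, three classes: two votes for class 0, one for class 2 -> winner 0
print(vote([[1, 0, 0], [1, 0, 0], [0, 0, 1]]))
```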

Figure 4 shows an example of such a module for a three-class classifier.

Fig. 4: Detailed block schema of the rejection module

4 Towards approximate DTs

The speculative hardware implementation of DT-based classifiers discussed in the previous section offers several opportunities for approximation.

Assertion functions, for instance, are good candidates, but they typically involve only a few literals, so the achievable area savings may be negligible. Conversely, acting on the comparators can lead to significant gains.

Concerning the approximate computing techniques, precision-scaling can be used to reduce the number of bits required to represent the model features: neglecting the least significant bits of the model features, while keeping the weight of the retained bits unaltered, reduces the size of the circuits, since part of the logic needed by the comparisons is removed.
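
The effect on a single comparator can be sketched as follows; the operand width and the values are illustrative assumptions.

```python
# Precision scaling on a comparator: the k least significant bits of both
# operands are ignored, while the retained bits keep their original weights.

def approx_less_than(a, b, neglected_bits, width=8):
    """Compare two unsigned `width`-bit operands ignoring the low bits."""
    mask = ((1 << width) - 1) & ~((1 << neglected_bits) - 1)
    return (a & mask) < (b & mask)  # fewer bits -> smaller comparator logic

# The exact comparison 83 < 86 is True, but with 4 neglected bits the operands
# become indistinguishable, illustrating the introduced error.
print(approx_less_than(0b01010011, 0b01010110, neglected_bits=4))  # -> False
```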

The impact of the approximation on the classification accuracy can be assessed only through simulations and, in addition, among all the approximate configurations of a given classifier, only those providing a certain accuracy level can be taken into account for further consideration. Moreover, those configurations should be evaluated in terms of silicon area, in order to state which of them provides the best trade-off between performance and gains. This is a typical instance of a multi-objective optimization problem (MOP), because accuracy and area savings are conflicting objectives.

Although a full discussion on MOPs falls beyond the scope of this paper, a brief introduction is reported in the following section.

4.1 Multi-objective optimization formalization

Multi-objective optimization (MO) is an area of operational research (OR) concerning mathematical optimization problems that involve more than one objective function to be simultaneously optimized.

MO has a wide application field, including scientific, engineering, logistic and financial applications in which a compromise between two or more conflicting objectives needs to be found.

Basically, a MOP consists of a set

$$\begin{aligned} \gamma \left( \cdot \right) = \{ \gamma _1\left( \cdot \right) , \dots , \gamma _k\left( \cdot \right) \} \end{aligned}$$

of k different objective functions, or fitness functions, to be minimized. Typically, a set

$$\begin{aligned} \psi \left( \cdot \right) = \{ \psi _1\left( \cdot \right) , \dots , \psi _j\left( \cdot \right) \} \end{aligned}$$

of j constraint functions defines the set of feasible solutions X, called the solution space: a vector \(x \in X\), i.e., one that does not violate any constraint function, is called a feasible solution, and the goal is to find feasible solutions that minimize the fitness functions. For a non-trivial MOP, \(|X|>1\), where \(|\cdot |\) denotes the size of a set, i.e., the number of elements it contains. Considering two different solutions, \(x,y \in X\), the solution x is said to Pareto-dominate y if and only if

$$\begin{aligned} \gamma _i \left( x \right) \le \gamma _i \left( y \right) \quad \forall i \in \left[ 1, k\right] \end{aligned}$$

and

$$\begin{aligned} \exists \, j \in \left[ 1, k\right] \; : \; \gamma _j \left( x \right) < \gamma _j \left( y \right) \end{aligned}$$

If a solution is not dominated by any other solution belonging to the same space, it is called a Pareto-optimal solution. All Pareto-optimal solutions are considered equally good.
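
For minimization problems, the dominance test and the extraction of the non-dominated set translate directly into Python, as in the following sketch of ours:

```python
# Direct transcription of the Pareto-dominance definition (minimization).

def dominates(gx, gy):
    """x dominates y iff x is no worse on all objectives and better on one."""
    return (all(a <= b for a, b in zip(gx, gy))
            and any(a < b for a, b in zip(gx, gy)))

def pareto_front(solutions):
    """Keep the fitness vectors not dominated by any other."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

# (accuracy loss, area) pairs, both to be minimized: (0.02, 150) is dominated
print(pareto_front([(0.01, 120), (0.03, 90), (0.02, 150)]))
```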

Since there are multiple Pareto optimal solutions, solving a MOP is not as straightforward as solving a single-objective optimization problem (SOP). Therefore, several methods exist to solve a MOP. Many of them convert a MOP into multiple different SOPs; this kind of approach is known as scalarization. Generally speaking, there are two main categories of space exploration algorithms: exact and heuristic. Exact methods, such as linear optimization or branch&bound, search for a global optimum; therefore, they may not be suitable for applications with a large solution space. On the other hand, heuristic methods aim at producing a representative set of Pareto optimal solutions by searching a subset of the whole solution space. Evolutionary algorithms (EAs), such as the Non-dominated Sorting Genetic Algorithm-II (NSGA-II), are popular approaches to generate Pareto optimal solutions to a MOP. Their use in MOP solving has been extensively researched and their efficiency demonstrated [39,40,41]. The authors of [42] reported a complete state of the art about EAs. The main advantage they offer when applied to MOPs is that they generate sets of candidate solutions, allowing the computation of an approximation of the entire Pareto front. On the other hand, there is no upper bound on the computational time required to find such a representative set of the Pareto front.

4.2 Approximate hardware implementation of DTs

In order to state the amount of error introduced by the approximation, all the combinations of precision-scaled model features have to be considered. Figure 5 sketches a detailed schematic of the proposed flow. Starting from the dataset, we exploit KNIME [38] to obtain trained DT models. They are described in PMML, an XML-based predictive model interchange format. Since the KNIME tool makes use of the IEEE 754 double-precision floating-point representation for features and thresholds, the size of the solution space is \(52^{|F|}\), where 52 is the number of bits representing the mantissa and |F| is the size of the feature set F. For the Spambase dataset considered in Sect. 5, for instance, |F| = 57 yields about \(10^{98}\) configurations. Therefore, an exhaustive exploration of the solution space may be infeasible.

Fig. 5: Experimental flow to get the FPGA bitstream from the considered dataset

To overcome this issue, the evaluation of the configurations is performed using the NSGA-II algorithm. An implementation of this algorithm is provided by the ParadisEO framework [43]. In order to suitably configure the MOP, we take into account two different fitness functions: (i) the amount of neglected bits and (ii) the accuracy of the model. Indeed, the fewer bits to compare, the lower the hardware accelerator overhead, even though the resulting accuracy of the approximate model has to remain acceptable for the application. In NSGA-II terminology, each approximate configuration of the model is a chromosome having as many genes as comparisons. The value of each gene states the amount of neglected bits at the matching comparator.
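
The encoding can be sketched as follows; the number of comparisons and the accuracy-evaluation routine are hypothetical placeholders, since both depend on the trained model.

```python
# Sketch of the chromosome encoding: one gene per comparison, each gene holding
# the number of neglected mantissa bits for the matching comparator.

import random

NUM_COMPARISONS = 40  # model-dependent placeholder

def random_chromosome():
    return [random.randint(0, 51) for _ in range(NUM_COMPARISONS)]

def fitness(chromosome, evaluate_accuracy):
    """Two objectives: total neglected bits (reward) and model accuracy.

    `evaluate_accuracy` is assumed to simulate the approximate model over a
    test set, as done via FLAP in the actual flow (see below).
    """
    reward = sum(chromosome)  # proxy for area/power savings
    return reward, evaluate_accuracy(chromosome)
```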

A better choice would take into account more significant hardware measurements, such as area, power consumption or maximum clock speed, rather than the number of neglected bits. Unfortunately, the estimation of such parameters is not suitable, as it does not come from an immediate evaluation but requires running synthesis tools, which would make the exploration infeasible in terms of computational time. Conversely, the number of neglected bits is directly expressed by the chromosomes, and it is directly related to the area, timing and power consumption of the target circuit; therefore, the considered fitness function is effective and almost immediate to evaluate.

In order to evaluate the choices concerning the reward fitness function and, in particular, the correspondence between neglected bits and area savings, a classifier system has been trained on purpose. This classifier, consisting of 100 DBs, has been synthesized on a Xilinx Zynq-7020 FPGA, varying the number of neglected bits without taking accuracy into account. Figure 6 shows this preliminary result. As expected, the amount of saved area increases as the number of discarded mantissa bits grows. This is because the DBs, which are bit-string comparators, have fewer bits to compare; therefore, they need less combinational logic.

Fig. 6: Area requirements (LUTs) at increasing degree of approximation

As for the accuracy, the fitness function is evaluated by following the same approach exploited in KNIME; hence, it requires simulating a test set on the approximate model. In order to simulate reduced-mantissa floating point in software, we resort to the FLexible Arithmetic Library (FLAP), previously introduced in [44].
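
As a simplified stand-in for what FLAP provides, zeroing the least significant mantissa bits of an IEEE 754 double can be done by bit masking, as sketched below.

```python
# Emulating a reduced-mantissa double by masking its low mantissa bits.

import struct

def truncate_mantissa(x, neglected_bits):
    """Zero the `neglected_bits` least significant bits of an IEEE 754 double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    bits &= ~((1 << neglected_bits) - 1)  # the mantissa occupies the low 52 bits
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

# With only 12 mantissa bits retained, 10.9 snaps to a coarser grid
print(truncate_mantissa(10.9, 40))  # -> 10.8984375
```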

As a result, the approximate DT models are provided in PMML as well. In order to get the VHDL code to synthesize the approximate hardware DT accelerator, a modified PMML2VHDL tool is involved in the flow [10].

The last step of the flow in Fig. 5 involves the synthesis of the VHDL. Since in this paper we exploit FPGA devices as configurable hardware technology, in particular a Xilinx Zynq 7020, we employ the Xilinx Vivado tool.

5 Experimental result and case study

In this section, we present the experimental results of the proposed approach. In particular, we show two different experimental campaigns. The first takes into account 50 different classification problems in order to evaluate, at different workloads, the robustness of our approximate computing methodology. Then, we give details about a case study based on the public SPAM classification dataset. Each experiment has been executed by means of the previously illustrated flow, reported in Fig. 5. It is worth noticing that the NSGA-II algorithm involved in our proposed approach has some parameters that do not depend on the particular classification model, though they affect the result quality and the computational time of the experiments. Among the many, the parameters that most affect the quality of the results and the execution time are the size of the initial population, the number of iterations, and the mutation and crossover probabilities. Nevertheless, since there are only a few parameters, we reached a good configuration by successive attempts.

During the experimental phase, several campaigns were conducted, during which the configuration parameters of the GA were modified several times, aiming at Pareto frontiers that were sufficiently diversified and populous. As foreseeable, we realized that, to obtain a populous Pareto frontier and avoid local sub-optima, it is necessary to increase the size of the initial population as much as possible. Then, in order to avoid long explorations around local sub-optima, mutations have to take place frequently. In addition, we configured the GA to discard solutions with a significantly high accuracy loss w.r.t. the accurate classifier. Hence, we set our GA parameters as follows: initial population of 2000 individuals, mutation and crossover probabilities set to 0.7 and 0.9, respectively, and accuracy loss threshold set to 4%.
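
The exploration itself runs on the C++ ParadisEO framework [43]; purely as an illustration, a roughly equivalent NSGA-II setup with the stated parameters might look as follows using the Python DEAP library (an assumption of ours, not the actual implementation).

```python
# Hypothetical NSGA-II setup mirroring the parameters reported above.

import random
from deap import base, creator, tools

creator.create("Fitness", base.Fitness, weights=(1.0, 1.0))  # reward, accuracy
creator.create("Individual", list, fitness=creator.Fitness)

NUM_GENES = 40  # one gene per comparison (model-dependent placeholder)
toolbox = base.Toolbox()
toolbox.register("gene", random.randint, 0, 51)
toolbox.register("individual", tools.initRepeat, creator.Individual,
                 toolbox.gene, n=NUM_GENES)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutUniformInt, low=0, up=51, indpb=0.1)
toolbox.register("select", tools.selNSGA2)

POP_SIZE, CX_PROB, MUT_PROB = 2000, 0.9, 0.7
population = toolbox.population(n=POP_SIZE)
# Main loop (omitted): evaluate individuals, vary them with CX_PROB/MUT_PROB,
# drop those exceeding the 4% accuracy-loss threshold, select with selNSGA2.
```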

Fig. 7: Amount of resource gain and accuracy loss for 50 different classification problems, for the maximum area overhead reduction approximate solutions (the scale on the left differs from the one on the right)

5.1 Approximate MCSs

To prove the robustness of the proposed methodology, we exploited the PMMLGen tool [8] to provide different workloads, in terms of models, to be approximated.

In particular, we collected 51 different datasets varying in the number of features (from 1 to 50) and in the number of classes (from 2 to 20). Then, for each dataset, we trained a single-DT classification model and random forest classification models with different numbers of DTs, namely 5, 10, 15 and 20. For each of the 255 trained classifiers, we found several approximate solutions by means of the approximate exploration. Then, we synthesized the solutions lying at the bounds of the Pareto frontier, i.e., those characterized by the best reward value (meaning the maximum reduction of area overhead) and those affected by the minimum accuracy loss. We report the amount of resource occupation gain (in terms of FPGA LUTs and registers) and the accuracy loss, evaluated as percentages w.r.t. the original synthesized model, in Figs. 7 and 8, for the maximum area overhead reduction and the minimum accuracy loss synthesized solutions, respectively. Note that, although both are expressed as percentages, the scale for the overhead gains differs from the one for the accuracy loss.

Fig. 8: Amount of resource gain and accuracy loss for 50 different classification problems, for the minimum accuracy loss approximate solutions (the scale on the left differs from the one on the right)

For both graphs, we can state that the accuracy loss decreases with the number of trees involved in the classification system. This observation confirms that random forest models are characterized by an inherent resiliency property: the greater the number of trees involved in the model, the lower the error introduced by the approximation. As for the reward, a significant overhead reduction can be observed for random forests with 5, 10 and 15 trees, while the single-DT model and the 20-tree random forest exhibit lower area reduction values. Indeed, single-DT models cannot be approximated as conveniently as random forest models, due to the absence of a combiner that could mitigate the effect of the introduced approximation. On the other hand, the area overhead contribution of the combiner circuit for random forest models with a large number of trees makes the approximate computing technique less effective. As for the solutions characterized by the minimum accuracy loss, we can see that even a small percentage of accuracy loss corresponds to a significant resource gain: for some experiments we observe an area reduction of more than 50% against only about 0.2% of accuracy loss.

5.2 Case study: SPAM detection

Since recognizing emails as SPAM or non-SPAM involves the classification of a large amount of information, a spam-detector case study is used to evaluate the approach introduced in this paper. The dataset used for this case study is Spambase [45], which contains 4601 emails, 1813 of which are SPAM. This dataset is freely available and makes use of 57 different features, expressed in floating-point notation, to characterize the elements of the dataset. Each feature specifies how often a word or a character appears in each element of the dataset, i.e., in an email.

During the training phase, conducted using the KNIME tool, 40 different random forest classifiers, with a number of DTs ranging from 1 to 40, were trained.

The AxC exploration phase found, for each of the 40 classifiers, a certain number of approximate configurations on the Pareto frontier, but for each of them only the configuration with the minimum error and the one requiring the least silicon area are reported.

Figure 9 shows the area requirements in terms of LUTs as the number of DTs used by the classifier increases. Area requirements show an increasing trend as the number of trees grows; the growth, however, is clearly sub-linear. In addition, it can be seen that the difference between the requirements of the exact classifier and those of the approximate one increases as the number of trees grows. This is because, even though the complexity of the single DTs—i.e., the number of nodes they consist of and their height—decreases significantly as the number of trees used by the classifier increases, the total number of nodes increases, providing more approximation opportunities. This behavior can also be observed for the amount of FPGA slices and registers, both when considering the solutions providing the minimum error and those requiring the minimum silicon area. Furthermore, it can be noted that the difference in terms of area requirements between the minimum error and the minimum area solutions always remains negligible.

Figure 10 compares the classification accuracy, as the number of trees used by the classifier increases, provided by the precise version—without approximation—and by the approximate version with the minimum area requirements. It is evident from the graph that there is only a small difference in accuracy between the two configurations. Moreover, it remains very small as the number of trees used for classification varies. On the other hand, increasing the number of DTs used in the classification process makes a smaller and smaller contribution as the number of DTs grows. This asymptotic behavior can be seen in both the exact and the approximate classifiers, and it is due to the fact that, by increasing the number of models, the datasets involved in training turn out to be simpler and the corresponding DTs less branched, which leads to a saturation of the accuracy level provided by the classifier model.

Fig. 9: Area requirements (LUTs)

Fig. 10: Accuracy

5.3 Comparison with previous approaches

In [14], an approach similar to the one presented in this paper was adopted, but instead of exploring the solution space with heuristics, the use of an exact algorithm, namely Branch & Bound (B&B), was proposed. While, on the one hand, the use of an exact algorithm for the solution of a MOP allows reaching a globally optimal solution, on the other hand, its use becomes prohibitive with large solution spaces. Despite numerous improvements made to the B&B algorithm, such as pre-pruning of the tree and grouping features by information gain, the authors managed to evaluate only a few classifiers and, for each of them, only a few approximate configurations. This greatly limited the quality of the obtained solutions. Table 2 shows the classification error and the hardware requirements in terms of LUTs for both approaches. As can be observed, the solutions provided by the B&B approach are worse than those obtained using the GA. The difference in quality does not depend on the search algorithm itself, but on the amount of approximate configurations taken into account during the space exploration phase.

Table 2 Comparison of results obtained from previous approaches

6 Conclusion

This paper addresses the design of DT-based MCSs. Leveraging the AxC design paradigm, classification accuracy is traded off for a reduction in the silicon area requirements of hardware-implemented MCSs. Referring to automatic approaches for MCS hardware implementation proposed in the scientific literature, the approximation has been introduced directly in the MCS models, acting on the number of bits used to represent the value of each model feature.

To prove the validity of the proposed approach, a spam-detector case study is provided. Several classifiers, with a number of trees ranging between 1 and 40, have been trained. Then, the optimal number of bits to be used to represent each feature of the model is searched for by means of the NSGA-II genetic algorithm. Among all the Pareto-optimal hardware configurations, the one providing the minimum classification error and the one requiring the minimum amount of silicon area were taken into account for further consideration. Experimental results show a significant reduction in area requirements, for both the minimum error and the minimum area configurations. Since the classifier is very resilient to error, these configurations are very similar both in terms of area requirements and classification error.