Abstract

The use of ionic liquids (ILs) for biomass pretreatment to produce cellulose-rich materials (CRMs) is well established. In this work, owing to the broad applicability and ease of use of artificial intelligence techniques, a model based on the stochastic gradient boosting (SGB) decision tree algorithm is proposed to estimate the properties of CRMs. To this end, a dataset of experimental output values was gathered and randomly split into training and testing subsets. The results show that the developed model is an effective forecasting tool for calculating the properties of CRMs. The accuracy of the model against the databank of CRM target values has also been examined. In addition, the influence of the model's input variables on the outputs has been investigated through a sensitivity analysis, which reveals the most influential variables in determining the properties of CRMs. This research therefore provides an accurate tool for predicting the properties of CRMs, together with a sensitivity analysis of the effective parameters, to help investigators develop an optimized process.

1. Introduction

The increasing global consumption of fossil fuels is the source of many environmental issues, including rising greenhouse gas emissions, global warming, and air pollution [1–3]. Biomass, a renewable, carbon-neutral, and low-emission fuel source, has therefore been considered a suitable alternative to fossil fuels in recent years [4–6]. Lignocellulosic biomass consists of cellulose, hemicellulose, and lignin at roughly 40–60% w/w, 15–30% w/w, and 10–25% w/w, respectively. These three components weave together to form a strong cell wall, which makes the biomass recalcitrant. The main skeleton of the biomass is formed by the accumulation of rigid cellulose microfibers, the interstitial space is filled with amorphous hemicellulose and lignin, and all the components are joined together by covalent and hydrogen bonds [7]. Studies have shown that lignin is the main barrier preventing the deconstruction of lignocellulosic biomass into biochemicals and/or biofuels. When lignin is extracted and hemicellulose is reduced along with it, the accessible surface area of the cellulose increases, creating sites favorable to reactions such as thermal conversion, biological digestion, and hydrolysis [8, 9].

Ionic liquid solvents are commonly used in pretreatment processes to extract lignin from a wide range of lignocellulosic biomass types [10]. The solid material obtained from this process is a cellulose-rich material (CRM) that can be upgraded into high-quality biochemicals or fuels through processes such as fermentation [11], torrefaction [12], and pyrolysis [13].

The ionic liquid solvents mentioned above comprise three different solvent systems: ionic liquids (ILs), deep eutectic solvents (DESs), and IL-containing solvents [14]. An IL is a pure salt with a low melting point, so its physical and chemical properties can be tuned for the intended purpose by small changes in its anion and cation composition [15]. Owing to these characteristics, IL solvents exhibit various favorable properties, such as low vapor pressure, nonflammability, and thermal and chemical stability [16].

Moreover, recent studies have revealed that adding organic solvents to ILs can reduce the viscosity of the IL by promoting the dissociation of its anion and cation, leading to greater solubility of lignocellulosic biomass [17, 18]. A DES is composed of chemical components acting as hydrogen bond acceptors (HBAs) and hydrogen bond donors (HBDs), and its melting point is significantly lower than those of the HBA and HBD components that form it [19]. The strong hydrogen bonding also facilitates overcoming biomass recalcitrance [20]. The number of papers published on ILs for biomass pretreatment has increased in recent years [21].

For understanding the complex and multidimensional relationships in biomass pretreatment processes using ILs, artificial intelligence methods offer advantages over regression and traditional modeling, and machine learning (ML) is particularly prominent among them [10, 22, 23].

One advantage of ML is that it can automatically learn and detect patterns from a large dataset without explicit programming. ML also makes it straightforward to evaluate the importance of all input features, which helps in understanding the system and supports model-interpretation-based design of experiments. Recent studies have applied ML to upgrading and conversion processes including hydrothermal treatment, torrefaction, pyrolysis, and fermentation [24–30]. Gradient boosting (GB), support vector machines (SVMs), and random forest (RF) are among the algorithms widely used in the literature. These algorithms performed well in prediction, delivering desirable values of two important indicators, R2 (coefficient of determination) and RMSE (root mean square error), with R2 values of around 0.90 reported. Nevertheless, the number of features in those studies was fewer than 15, which differs from the biomass pretreatment system with ILs. One of the challenges in our system was describing features specifically for ILs. Xu et al. used an explicit inner relationship between the main variables of DESs to describe the characteristics of each DES, which showed a clear effect on the pretreatment of lignocellulosic biomass [31]. Physical and chemical parameters related to hydrogen bonds in DESs were also analyzed by partial least squares (PLS) and principal component analysis (PCA). Cellulose content and solid recovery in CRMs are among the most important properties in the biomass pretreatment process with ILs; however, the application of ML to evaluating and predicting feature importance for these properties has not yet been reported. Therefore, in this study, we use machine learning to predict the cellulose enrichment factor and solid recovery of CRMs resulting from biomass pretreatment with ILs. The dataset used in our research includes 23 features describing raw biomass characteristics, IL identities, IL treatment process conditions, and catalyst loading.

The SGB decision tree approach was adopted for modeling in this work. Visualization and interpretation of the high-importance features, by plotting their impact patterns in the form of statistical analyses, led to a better understanding of the multidimensional relationships in the biomass pretreatment system with ILs. Finally, different sets of input features were examined to explore their effect on the predictive power of the model.

2. An Overview of the Stochastic Gradient Boosting (SGB) Decision Tree

Boosting is an algorithmic technique that enhances the accuracy of a predictive function. It begins by repeatedly fitting the function and then merges each function's output with weights so as to reduce the prediction error. Boosting, which applies to both regression and classification problems, was developed over the last decade and is therefore considered among the most robust and recent learning algorithms [32].

Friedman's SGB technique [33] is regarded as an advanced functional and statistical variation of boosting. The algorithm boosts regression trees by building a sequence of simple trees, where the residuals of the predictions from the previous tree are used to construct the subsequent one. Tree complexity is kept low: a single split consists of one root node and two child nodes. The data are then partitioned incrementally using the SGB approach. For each partition, the difference between the measured values and the predictions (the residuals) is determined; the residuals are then fitted at the tree nodes to create a new partition, which reduces the residual variance of the data along the tree sequence described above. Finally, all the trees produced throughout the process are accumulated, and each observation is assigned the most common outcome, which reduces the SGB's sensitivity to imbalanced datasets, deficient training sets, and anomalies.

Additionally, ensemble learning techniques such as bagging, boosting, and related methods aggregate the projections derived from several base models; these techniques are highly effective in artificial intelligence and data mining [34]. The ensemble takes the general form

$$F(\mathbf{x}) = \sum_{m=1}^{M} \beta_m f_m(\mathbf{x}),$$

where $M$ and $F$ denote the number of base learners and the ensemble, respectively, and the $f_m$ are functions of $\mathbf{x}$, the input variables originating from the training set. The ensemble estimate is thus a linear combination of the projections of all the base learners, and the parameters specifying this linear combination are the coefficients $\beta_m$.

In boosting, the prediction is obtained from the preceding form through the additive expansion

$$F(\mathbf{x}) = \sum_{m=1}^{M} \beta_m h(\mathbf{x}; \mathbf{a}_m),$$

where the $h(\mathbf{x}; \mathbf{a}_m)$ are simple parameterized functions of $\mathbf{x}$. A forward stagewise procedure is used to fit the parameters $\mathbf{a}_m$ and the expansion coefficients $\beta_m$ to the training data. This procedure begins with the calculation of an initial estimate $F_0(\mathbf{x})$ and, for $m = 1, 2, \ldots, M$, proceeds by solving

$$(\beta_m, \mathbf{a}_m) = \arg\min_{\beta, \mathbf{a}} \sum_{i=1}^{N} L\bigl(y_i, F_{m-1}(\mathbf{x}_i) + \beta\, h(\mathbf{x}_i; \mathbf{a})\bigr) \qquad (1)$$

and applying the update $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \beta_m h(\mathbf{x}; \mathbf{a}_m)$.

For arbitrary differentiable loss functions, gradient boosting solves equation (1) in two steps. First, the base learner $h(\mathbf{x}; \mathbf{a})$ is fitted by the least-squares criterion to the current pseudoresiduals,

$$\tilde{y}_{im} = -\left[\frac{\partial L\bigl(y_i, F(\mathbf{x}_i)\bigr)}{\partial F(\mathbf{x}_i)}\right]_{F = F_{m-1}}, \qquad \mathbf{a}_m = \arg\min_{\mathbf{a}, \rho} \sum_{i=1}^{N} \bigl[\tilde{y}_{im} - \rho\, h(\mathbf{x}_i; \mathbf{a})\bigr]^2. \qquad (2)$$

The optimal coefficient value is then determined as

$$\beta_m = \arg\min_{\beta} \sum_{i=1}^{N} L\bigl(y_i, F_{m-1}(\mathbf{x}_i) + \beta\, h(\mathbf{x}_i; \mathbf{a}_m)\bigr). \qquad (3)$$

As a result, the difficult function-optimization problem of equation (1) is replaced by the least-squares fit of equation (2), followed by the single-parameter optimization of equation (3) based on the loss criterion L. When performing the boosting operation, SGB concentrates on correcting observations that lie close to the decision boundaries determined by the structure of the data [33]. During boosting, observations from a decision tree that lie close to other classes are therefore more likely to be identified and corrected [35].
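To make the procedure above concrete, the following is a minimal sketch of a stochastic gradient boosting loop; it is not the authors' implementation (which was built in MATLAB). It assumes squared-error loss, for which the pseudoresiduals of equation (2) reduce to ordinary residuals and the line search of equation (3) is absorbed into the tree's leaf values with a shrinkage factor; all variable names and hyperparameter values are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sgb_fit(X, y, n_stages=200, learning_rate=0.1, subsample=0.75, max_depth=3, seed=0):
    """Stochastic gradient boosting with squared-error loss (illustrative sketch)."""
    X = np.asarray(X, dtype=float)               # shape (n_samples, n_features)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    f0 = float(y.mean())                         # initial estimate F0(x)
    F = np.full(len(y), f0)
    trees = []
    for _ in range(n_stages):
        residuals = y - F                        # pseudoresiduals of eq. (2) for squared loss
        idx = rng.choice(len(y), size=int(subsample * len(y)), replace=False)  # random subsample
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X[idx], residuals[idx])         # least-squares fit of the base learner h(x; a_m)
        F = F + learning_rate * tree.predict(X)  # additive update F_m = F_{m-1} + shrinkage * h
        trees.append(tree)
    return f0, trees

def sgb_predict(X, f0, trees, learning_rate=0.1):
    """Accumulate the contributions of all trees in the ensemble."""
    X = np.asarray(X, dtype=float)
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)
```

The random subsample drawn at each stage is what distinguishes the stochastic variant from plain gradient boosting and is the main source of its robustness to anomalies noted above.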

3. Methodology

Developing an SGB tree model with machine learning techniques involves the following six steps: (1) preparation of the dataset, (2) calculation of the descriptors, (3) feature selection and model training, (4) validation of the model, (5) recognition of the applicability domain, and (6) interpretation of the model. The following subsections describe these steps in more detail.

3.1. The Dataset upon Which the Model Was Built

A large set of 514 experimental data points on CRM properties, including solid recovery (SR) and cellulose enrichment factor (CEF), expressed as a function of 23 effective variables, was collected from the literature. Further details on these parameters and data are provided elsewhere [10]. Three-quarters of the data were randomly assigned to the training phase, and the rest were retained for testing and evaluating the accuracy of the model.
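As an illustration of this split, and of fitting an off-the-shelf SGB model to the resulting subsets, the sketch below uses scikit-learn; the file name crm_dataset.csv and the column names SR and CEF are hypothetical placeholders rather than the actual data source.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor

# hypothetical CSV with 514 rows: 23 input features plus the SR and CEF targets
data = pd.read_csv("crm_dataset.csv")
X = data.drop(columns=["SR", "CEF"])
y = data["CEF"]                    # one model per target; CEF shown here, SR handled the same way

# three-quarters of the data for training, the remaining quarter held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# subsample < 1 makes the gradient boosting "stochastic"
model = GradientBoostingRegressor(subsample=0.75, random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R2 on the held-out test set
```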

3.2. Sensitivity Analysis

Sensitivity studies were performed to ascertain the effect of each input component on the studied variables. To quantitatively measure the effect of each parameter, the following relevancy factor was defined [36–38]:

$$r = \frac{\sum_{i=1}^{n} (X_{k,i} - \bar{X}_k)(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_{k,i} - \bar{X}_k)^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$

In this formula, $n$ denotes the total number of data points, $X_{k,i}$ denotes the value of the $k$-th input for data point $i$, and $Y_i$ denotes the output value; $\bar{X}_k$ and $\bar{Y}$ denote the mean values of the input and output parameters, respectively. The relevancy factor takes values between −1 and +1, with larger absolute values indicating that the corresponding criterion has a more significant effect on the studied variable. A positive or negative relevancy factor indicates a direct or opposite effect of the corresponding criterion: the output increases when an input with a positive relevancy factor is increased, and it decreases when an input with a negative relevancy factor increases.
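Computationally, the relevancy factor is the Pearson-type correlation between an input column and the output, so it can be evaluated in a few lines. The snippet below is a sketch, with X_train and y_train assumed to hold the input features and one target as in Section 3.1.

```python
import numpy as np

def relevancy_factor(x, y):
    """Relevancy factor r between one input variable x and the output y; r lies in [-1, 1]."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    num = np.sum((x - x.mean()) * (y - y.mean()))
    den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
    return num / den

# example: rank every input feature by its relevancy to the target
# r = {col: relevancy_factor(X_train[col], y_train) for col in X_train.columns}
# sorted(r.items(), key=lambda kv: abs(kv[1]), reverse=True)
```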

Twenty-three different input criteria were investigated in this study, all of which directly impacted the corresponding results. Figures 1 and 2 show the results of the sensitivity analysis. As shown, the largest relevancy value belongs to Lg, with a positive relevancy factor of 0.59.

3.3. The Preanalysis Phase

In this work, five distinct statistical approaches were used to evaluate and validate the CEF and SR values generated by the SGB tree model. The model was implemented in MATLAB R2018. Approximately 75% of the data gathered for this study were used to train the model, while the remaining 25% were used to test it. Additionally, the data were normalized as follows [39–41]:

$$x_{\text{norm}} = 2\,\frac{x - x_{\min}}{x_{\max} - x_{\min}} - 1 \qquad (9)$$

In equation (9), x denotes the value of the parameter and, as intended, the absolute value of the normalized quantity does not exceed 1. The CEF and SR are the primary outputs to be predicted, whereas the remaining variables serve as input data for the SGB tree model.
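A minimal sketch of this normalization, assuming the min-max scaling onto the interval [−1, 1] implied by equation (9), is given below; column-wise application to a feature matrix is shown for illustration.

```python
import numpy as np

def normalize(x):
    """Scale the values of one parameter onto [-1, 1] as in equation (9); |x_norm| <= 1."""
    x = np.asarray(x, dtype=float)
    return 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0

# applied column by column to an (n_samples, n_features) array of inputs
# X_norm = np.column_stack([normalize(col) for col in X.T])
```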

3.4. Methodology for Modelling and Verifying

Verification of the model, its outputs, and its accuracy is a critical step in the model development process, and validation becomes increasingly necessary as the parameter ranges are expanded and further experiments are incorporated. The statistical criteria described below, namely the mean squared error (MSE), root mean squared error (RMSE), standard deviation of the errors (STD), mean relative error (MRE%), and coefficient of determination (R2), are used to assess the proposed model's accuracy [42].
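For reference, a sketch of these criteria using their standard definitions (assumed here; the paper's exact expressions are not reproduced) is shown below.

```python
import numpy as np

def accuracy_metrics(y_true, y_pred):
    """Standard error statistics for comparing predictions with experimental values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)                                    # mean squared error
    rmse = np.sqrt(mse)                                        # root mean squared error
    std = np.std(err)                                          # standard deviation of the errors
    mre = 100.0 * np.mean(np.abs(err / y_true))                # mean relative error, %
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - np.mean(y_true)) ** 2)  # coefficient of determination
    return {"MSE": mse, "RMSE": rmse, "STD": std, "MRE%": mre, "R2": r2}
```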

4. Results and Discussion

In Figures 3 and 4, the experimental values are compared to the CEF and SR calculated via the SGB tree for both the testing and training datasets.

As seen in these figures, the presented model can accurately estimate the outputs. Several statistical and graphical techniques were employed to evaluate the suggested model's validity. The regression plots in Figures 5 and 6 demonstrate the proposed model's capacity to predict the targets, as seen from the dense clustering of the data points around the Y = X line.

The error plots of the proposed SGB tree are depicted in Figures 7 and 8, which show the relative error versus the CEF and SR for the training and testing datasets, respectively. The estimated errors are generally centered on the line of zero deviation. The mean relative error of the suggested model was found to be less than 17%, which demonstrates the prediction accuracy of the suggested SGB tree model.

Table 1 contains the statistical parameters derived from the suggested model. The model exhibits low MSE, STD, RMSE, and MRE% values and a high R2, showing that it accurately predicts the outputs.

Phromphithak et al. used the random forest method to estimate the two output parameters CEF and SR [10] and concluded that their model was able to predict these parameters with accuracies of R2 = 0.94 and 0.84, respectively, which is weaker than the performance of our proposed model.

5. Conclusions

The present work provides a novel perspective on the prediction of CEF and SR. To this end, we have developed a precise stochastic gradient boosting decision tree model. The model was constructed and validated using the training and test sets, and external assessment sets were used to evaluate its actual predictive capability. The design is based solely on effective inputs; thus, the CEF and SR estimated by the suggested method can supplement experimental observations by providing values that are unknown or missing. Furthermore, the proposed prediction mechanism may point researchers toward a novel and successful measurement approach. This work also quantified the error analysis for the various inputs. According to the results, the strategy outperformed earlier models in terms of generalizability, validity, and accuracy. Finally, the suggested method can serve as an efficient tool for analyzing and designing more effective related units.

Data Availability

The data used to support the findings of this study are provided within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported by the Shandong Higher Education Undergraduate Education Reform Project “Research and Practice of Automation Major Upgrading and Transformation to Serve the Transformation of Old and New Driving Forces” (Item no. Z2018S027), Dezhou 253034, Shandong, China, and the project “Emerging-Newly Built-New: Research and Practice of Electrical Engineering and Automation Professional Upgrade and Digital Transformation” (Project no. 2020JGPY06).