A modified genetic algorithm and weighted principal component analysis based feature selection and extraction strategy in agriculture
Introduction
Data pre-processing is an essential phase that must be performed before applying any ML technique. It transforms the original raw data into a format that is useful to machine learning algorithms. Data pre-processing consists of several steps, such as data cleaning, feature selection, and feature extraction [1]. Agricultural datasets usually comprise many features that may not be useful to the prediction task. In such a scenario, feature selection and feature extraction are important tasks that must be performed to improve algorithmic accuracy [2].
Feature selection (FS) is a data pre-processing technique in which appropriate attributes are chosen manually or automatically. It is the determination of relevant attributes that enhance the accuracy of a model [3]; these features contribute to the variable to be predicted. Essentially, FS removes features that are irrelevant or do not influence the feature to be predicted. It is one of the fundamental concepts in ML that affects a model's performance: irrelevant or unnecessary attributes negatively influence the ML model, and if the data contain many irrelevant features, the model's accuracy decreases. The advantages of feature selection are highlighted below [4]:
- Reduces overfitting: when irrelevant features are removed, the noise is removed along with them, so the model also performs better on the test data.
- Reduces training time: when the number of dimensions in the data is reduced, the time taken by the ML model decreases and it becomes computationally faster.
- Improves model accuracy: the accuracy of the model increases when the irrelevant features are eliminated from the dataset.
Several feature selection techniques exist. Some of the popular techniques are:
- Filter methods: feature selection happens before the ML algorithm is applied. Correlation is a popular example of this technique [5].
- Wrapper methods: a subset of features is chosen, and the model is trained on it. "Forward selection", "backward feature elimination", and "recursive feature elimination" are three popular examples of this technique.
  - In "forward selection", the model initially has no features; at each iteration, the new feature that best enhances the model's performance is added.
  - In "backward elimination", the model starts with all attributes and, at each iteration, removes the least important feature whose removal enhances the model's performance. This continues until no performance improvement is observed [6].
  - "Recursive feature elimination" finds an optimal subset of attributes using a greedy approach. Models are built repeatedly, choosing the best or worst attribute at each iteration; the next model is created from the attributes that remain. The process continues until all attributes have been tried and tested, and the attributes are then ranked by the order of their elimination [7].
- Embedded methods: these techniques integrate both the wrapper and filter approaches. They are employed by algorithms that possess their own built-in FS methods [8].
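To make the filter approach concrete, a correlation-based filter like the one in [5] can be sketched as below. This is a minimal sketch on synthetic data; the 0.3 threshold and the dataset are arbitrary illustrative choices, not values from this work.

```python
import numpy as np

def correlation_filter(X, y, threshold=0.3):
    """Rank features by absolute Pearson correlation with the target
    and keep those at or above the threshold (a simple filter method)."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.where(scores >= threshold)[0], scores

# Synthetic data: features 0 and 2 drive the target, feature 1 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

selected, scores = correlation_filter(X, y)
print(selected)  # the noise feature should score low and be dropped
```

Because the scoring step never trains the downstream model, the filter runs once over the data, which is the computational advantage that motivates its use later in this paper.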
Agricultural data contain diverse features related to climate, soil, and fertilizer. In such a context, FS attains significant importance, as irrelevant features may adversely impact the predictions of the model [3]. In our work, we used the filter approach for FS, since the other two techniques are computationally expensive [7]. Specifically, a modified genetic algorithm (m-GA) was developed for FS by designing a new fitness function based on mutual information (MutInf) and root mean square error (RtMSE).
Feature extraction (FeExt) is a technique for decreasing the number of attributes in the original dataset by constructing new features from the existing attributes. The information in the original attributes is retained in the new attributes without duplication [9]. FeExt is useful because it reduces the number of features in the raw data without loss of information, and it removes duplicate or redundant information that may exist in the raw data. FeExt can also speed up the ML algorithm and improve its generalization. "Principal Component Analysis" (PrCA) is one of the best-known FeExt methods: it transforms the original data from a higher-dimensional space to a lower-dimensional one so that the ML task can be performed effectively. In this work, wgt-PCA, an improved version of the original PrCA, was employed. wgt-PCA uses a weighted covariance matrix that emphasizes training records that are very near the test record and diminishes the influence of the other training records [10].
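The weighted covariance idea behind wgt-PCA can be sketched as follows. This is a generic illustration rather than the exact formulation of [10]: the Gaussian distance-based weights and the bandwidth parameter are assumptions made for the example.

```python
import numpy as np

def weighted_pca(X_train, x_test, n_components=2, bandwidth=1.0):
    """Sketch of weighted PCA: training records closer to the test
    record receive larger weights in the covariance estimate."""
    # Distance-based weights (Gaussian kernel) -- an illustrative choice.
    d = np.linalg.norm(X_train - x_test, axis=1)
    w = np.exp(-(d ** 2) / (2 * bandwidth ** 2))
    w = w / w.sum()

    # Weighted mean and weighted covariance matrix.
    mu = w @ X_train
    Xc = X_train - mu
    cov = (Xc * w[:, None]).T @ Xc

    # Principal axes = leading eigenvectors of the weighted covariance.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return Xc @ components, components

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Z, comps = weighted_pca(X, X[0], n_components=2)
print(Z.shape)  # (50, 2)
```

With uniform weights this reduces to ordinary PrCA; the nonuniform weights are what let the projection adapt to the neighbourhood of the test record.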
By combining feature selection and feature extraction, the benefits of both techniques may be reaped: FS removes irrelevant features from the original data, while FeExt reduces the remaining features without loss of information [11].
In view of this, a combined FS and FeExt approach for agricultural data prediction is presented in this work. The designed technique consists of three stages. Pre-processing of the data forms the first stage, in which missing values are predicted and the data are normalized. Missing values in the real-world data are predicted using the "mean imputation" method. The data are then normalized to the range [0, 1] to reduce the dominance of high-valued features over small-valued features during the prediction task. In the feature selection stage, we designed a modified genetic algorithm (m-GA) with an improved fitness function consisting of weights, MutInf, and RtMSE. In the FeExt stage, wgt-PCA is applied to the attributes obtained from the m-GA; that is, feature selection is performed first using m-GA and then feature extraction using wgt-PCA. In the third stage, prediction of the agricultural data is performed. For the real-world farming data, the fertilizer and crop information was acquired from "IndiaStats" [12]. Rainfall information was taken from "IndiaStats" [12] and the "India Water Portal" [13]. The soil data were acquired from the "Karnataka Atlas of Soil Fertility" [14]. The fertilizer, crop yield, rainfall, and soil information was merged to create a separate data file for each of the Indian crops, namely wheat, maize, bajra, jowar, cotton, groundnut, sugarcane, and ragi. The benchmark datasets used were "Forest Fires" (FF) [15], "weather Ankara" (wAnk) [16], and "weather Izmir" (wIzm) [17].
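The first (pre-processing) stage described above, mean imputation followed by min-max normalization into [0, 1], can be sketched as:

```python
import numpy as np

def preprocess(X):
    """Stage 1: mean imputation of missing values followed by
    min-max normalization of each feature into [0, 1]."""
    X = X.astype(float).copy()

    # Mean imputation: replace NaNs with the column mean.
    col_means = np.nanmean(X, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_means[nan_cols]

    # Min-max normalization so high-valued features do not dominate.
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ranges = np.where(maxs > mins, maxs - mins, 1.0)
    return (X - mins) / ranges

X = np.array([[10.0, 200.0],
              [np.nan, 400.0],
              [30.0, np.nan]])
print(preprocess(X))
```

Note that the second column (values in the hundreds) ends up on the same [0, 1] scale as the first, which is exactly the dominance problem the normalization step addresses.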
The key contributions of this work are as follows:
- Designed a modified genetic algorithm (m-GA) with an improved fitness function centred on mutual information and root mean square error.
- Designed a hybrid approach (m-GA+wgt-PCA) by integrating feature selection (m-GA) and feature extraction (wgt-PCA).
- Compared the designed hybrid approach (m-GA+wgt-PCA) with other feature selection and extraction methods, namely correlation analysis, singular value decomposition, genetic algorithm, and weighted principal component analysis, on 3 benchmark and 8 real-world agricultural datasets with respect to multiple performance metrics: R-squared, root mean square error, and mean absolute error.
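The m-GA idea of the first contribution can be sketched as below. The paper's exact fitness function, weights, and genetic operators are not reproduced here: the weighted-sum fitness, the correlation-based stand-in for mutual information, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def rmse_of_subset(X, y, mask):
    """RMSE of a least-squares fit on the selected features."""
    if not mask.any():
        return np.inf
    A = np.column_stack([X[:, mask], np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

def fitness(X, y, mask, w1=0.5, w2=0.5):
    """Illustrative m-GA-style fitness: reward relevance (a correlation
    stand-in for mutual information) and penalise RMSE. The weighted-sum
    form and the weights w1, w2 are assumptions for this sketch."""
    rel = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1])
                   for j in np.where(mask)[0]]) if mask.any() else 0.0
    return w1 * rel - w2 * rmse_of_subset(X, y, mask)

def ga_select(X, y, pop=20, gens=30, p_mut=0.1):
    """Minimal genetic algorithm over binary feature masks."""
    n = X.shape[1]
    population = rng.integers(0, 2, size=(pop, n)).astype(bool)
    for _ in range(gens):
        scores = np.array([fitness(X, y, m) for m in population])
        order = np.argsort(scores)[::-1]
        parents = population[order[:pop // 2]]      # elitist selection
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n) < p_mut          # bit-flip mutation
            children.append(child)
        population = np.vstack([parents, children])
    scores = np.array([fitness(X, y, m) for m in population])
    return population[int(np.argmax(scores))]

# Synthetic data: only features 0 and 3 influence the target.
X = rng.normal(size=(80, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=80)
best = ga_select(X, y)
print(best)  # a good mask should include features 0 and 3
```

Because elitist selection carries the best masks forward unchanged, the highest-fitness subset found so far is never lost between generations.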
The remainder of this paper is organized as follows. Section 2 reviews work related to hybrid feature selection and extraction methods. Section 3 presents the designed approach. Section 4 demonstrates the experimental results. The paper ends with the conclusion and future work.
Related work
This section presents feature selection and feature extraction techniques used in agriculture and other domains. The following paragraphs present works on FS and FeExt techniques in the agricultural domain (with a focus on crop yield prediction):
In [18], the authors used random search, best search, and genetic methods for determining the relevant features for land classification. The results revealed that electrical conductivity, exchangeable sodium percentage, soil texture and wetness
Proposed work
This section presents the approach and algorithm of the devised hybrid FS and FeExt strategy for agricultural prediction. Fig. 1 depicts the flow of the devised technique, comprising three stages.
The designed approach consisted of three stages. In the first stage, the data were pre-processed by predicting the missing values and normalizing the data. The missing values in the real-world agricultural datasets were predicted using the "mean imputation" technique. The
Experimental setup and findings
This section describes the experimental setup along with the results obtained from the feature selection methods. "Matlab R2016a" was employed as the coding language on a "Windows 7" environment.
In this work, experiments were conducted on the Indian crop datasets and the benchmark datasets. On each dataset we compared our hybrid feature selection method (m-GA+wgt-PCA) with other feature selection methods (m-GA, GA, and correlation) and feature extraction methods (SiVD, wgt-PCA) with
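The comparison metrics used throughout the experiments (R-squared, RMSE, and MAE) can be computed as in this minimal sketch:

```python
import numpy as np

def metrics(y_true, y_pred):
    """R-squared, RMSE and MAE for comparing regression methods."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {"R2": float(1 - ss_res / ss_tot),
            "RMSE": float(np.sqrt(np.mean(resid ** 2))),
            "MAE": float(np.mean(np.abs(resid)))}

print(metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
```

Higher R-squared and lower RMSE/MAE indicate a better model, which is the sense in which the methods are ranked in the tables that follow.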
Discussion
This section discusses why the proposed hybrid feature selection method performed better than the other feature selection methods.
- The proposed method performed better than the original data (comprising all the features), since the original data contained features that did not contribute to the target attribute. Removing the unimportant attributes from the original data made the predictive models perform with higher accuracy. For instance, with regards to
Conclusion and future scope
In this work, a hybrid feature selection technique was developed by combining the feature selector m-GA with the feature extractor wgt-PCA. The m-GA was developed by designing a fitness function based on MutInf and RtMSE to select the best features, which significantly affected the crop yields. The selected features were then subjected to feature extraction using wgt-PCA. By integrating m-GA and wgt-PCA, the strengths of both FS and FeExt were obtained. Extensive experiments were conducted on 8
CRediT authorship contribution statement
K. Aditya Shastry: Conceptualization, Methodology, Software, Data curation, Writing – original draft, Visualization, Investigation, Writing – review & editing. Sanjay H.A.: Conceptualization, Supervision, Validation, Reviewing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (50)
- et al., A survey on feature selection methods, Comput. Electr. Eng. (2014)
- et al., A survey on big data pre-processing
- et al., Data mining and pattern recognition in agriculture, Künstl. Intell. (2013)
- et al., A literature review of feature selection techniques and applications: Review of feature selection in data mining
- L.J. Herrera, V. Lafuente, R. Ghinea, M.M. Perez, I. Negueruela, H. Pomares, I. Rojas, A. Guillén, Mutual...
- et al., Feature selection: Filter methods performance challenges
- et al., Review on wrapper feature selection approaches
- Feature selection
- et al., An introduction to feature extraction
- et al., Weighted principal component analysis
- Combined feature extraction and selection in texture analysis
- Feature selection as a time and cost-saving approach for land suitability classification (case study of Shavur plain, Iran), Agriculture
- Selection of important features for optimizing crop yield prediction, Int. J. Agricult. Environ. Inf. Syst.
- Crop yield prediction using deep neural networks, Front. Plant Sci.
- A hybrid CFS filter and RF-RFE wrapper-based feature extraction for enhanced agricultural crop yield prediction modeling, Agriculture
- Extracting important features for crop yield prediction with convolutional neural networks on remote sensing and meteorological data, Geophys. Res. Abst.
- Crop yield prediction using machine learning: A systematic literature review, Comput. Electron. Agric.
- Wheat crop yield prediction using deep LSTM model, Comput. Vis. Pattern Recognit.