A modified genetic algorithm and weighted principal component analysis based feature selection and extraction strategy in agriculture
Introduction
Data pre-processing is an essential phase that must be performed before applying any ML technique. It transforms the original raw data into a format that is useful to machine learning algorithms. Data pre-processing consists of several steps, such as data cleaning, feature selection, and feature extraction [1]. Agricultural datasets usually comprise many features that may not be useful to the prediction task. In such a scenario, feature selection and feature extraction are important tasks that must be performed to improve algorithmic accuracy [2].
Feature selection (FS) is a data pre-processing technique in which appropriate attributes are chosen manually or automatically. It is the determination of relevant attributes that enhance the accuracy of a model [3]; these features contribute to the variable to be predicted. Essentially, FS removes features that are irrelevant or do not influence the feature to be predicted. It is one of the fundamental concepts in ML that affects a model's performance: irrelevant or unnecessary attributes negatively influence the ML model, and if the data contain many irrelevant features, the model's accuracy decreases. The advantages of feature selection are highlighted below [4]:
- Reduces overfitting: when irrelevant features are removed, the noise is removed along with them, so the model also performs better on the test data.
- Reduces training time: when the number of dimensions in the data is reduced, the time taken by the ML model decreases and it becomes computationally faster.
- Improves model accuracy: the accuracy of the model increases when the irrelevant features are eliminated from the dataset.
Several feature selection techniques exist. Some of the popular techniques are:
- Filter methods: feature selection happens before the ML algorithm is applied. Correlation is a popular example of this technique [5].
- Wrapper methods: a subset of features is chosen, and the model is trained on it. "Forward selection", "backward feature elimination", and "recursive feature elimination" are three popular examples of this technique.
  - In "forward selection", the model initially has no features; at each iteration, the new feature that best enhances the model's performance is added.
  - In "backward elimination", the model starts with all attributes and, at each iteration, removes the least important feature whose removal enhances the model's performance. This continues until no performance improvement is observed [6].
  - "Recursive feature elimination" finds an optimal subset of attributes using a greedy approach. Models are built repeatedly, choosing the best or worst attribute at each iteration; the next model is created from the attributes that remain. The process continues until all attributes have been tried and tested, and the attributes are then ranked by the order of their elimination [7].
- Embedded methods: these techniques integrate both the wrapper and filter approaches. They are employed by algorithms that possess their own built-in FS methods [8].
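To make the filter approach concrete, a correlation-based filter like the one in [5] can be sketched as below. This is a minimal sketch on synthetic data; the 0.3 threshold and the dataset are arbitrary illustrative choices, not values from this work.

```python
import numpy as np

def correlation_filter(X, y, threshold=0.3):
    """Rank features by absolute Pearson correlation with the target
    and keep those at or above the threshold (a simple filter method)."""
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.where(scores >= threshold)[0], scores

# Synthetic data: features 0 and 2 drive the target, feature 1 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

selected, scores = correlation_filter(X, y)
print(selected)  # the noise feature should score low and be dropped
```

Because the scoring step never trains the downstream model, the filter runs once over the data, which is the computational advantage that motivates its use later in this paper.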
Agricultural data contain diverse features related to climate, soil, and fertilizer. In such a context, FS attains significant importance, as irrelevant features may adversely impact the predictions of the model [3]. In our work, we used the filter approach for FS, since the other two techniques are computationally expensive [7]. Specifically, a modified genetic algorithm (m-GA) was developed for FS by designing a new fitness function based on mutual information (MutInf) and root mean square error (RtMSE).
Feature extraction (FeExt) is a technique for decreasing the number of attributes in the original dataset by constructing new features from the existing attributes. The information in the original attributes is retained in the new attributes without duplication [9]. FeExt is useful because it reduces the number of features in the raw data without loss of information, and it removes duplicate or redundant information that may exist in the raw data. FeExt can also speed up the ML algorithm and improve its generalization. "Principal Component Analysis" (PrCA) is one of the best-known FeExt methods: it transforms the original data from a higher-dimensional space to a lower-dimensional one so that the ML task can be performed effectively. In this work, wgt-PCA, an improved version of the original PrCA, was employed. wgt-PCA uses a weighted covariance matrix that emphasizes training records that are very near the test record and diminishes the influence of the other training records [10].
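The weighted covariance idea behind wgt-PCA can be sketched as follows. This is a generic illustration rather than the exact formulation of [10]: the Gaussian distance-based weights and the bandwidth parameter are assumptions made for the example.

```python
import numpy as np

def weighted_pca(X_train, x_test, n_components=2, bandwidth=1.0):
    """Sketch of weighted PCA: training records closer to the test
    record receive larger weights in the covariance estimate."""
    # Distance-based weights (Gaussian kernel) -- an illustrative choice.
    d = np.linalg.norm(X_train - x_test, axis=1)
    w = np.exp(-(d ** 2) / (2 * bandwidth ** 2))
    w = w / w.sum()

    # Weighted mean and weighted covariance matrix.
    mu = w @ X_train
    Xc = X_train - mu
    cov = (Xc * w[:, None]).T @ Xc

    # Principal axes = leading eigenvectors of the weighted covariance.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return Xc @ components, components

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Z, comps = weighted_pca(X, X[0], n_components=2)
print(Z.shape)  # (50, 2)
```

With uniform weights this reduces to ordinary PrCA; the nonuniform weights are what let the projection adapt to the neighbourhood of the test record.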
By combining feature selection and feature extraction, the benefits of both techniques may be reaped: FS removes irrelevant features from the original data, while FeExt reduces the remaining features without loss of information [11].
In view of this, a combined FS and FeExt approach for agricultural data prediction is presented in this work. The designed technique consists of three stages. Pre-processing of the data forms the first stage, in which missing values are predicted and the data are normalized. Missing values in the real-world data are predicted using the "mean imputation" method. The data are then normalized to the range [0, 1] to reduce the dominance of high-valued features over small-valued features during the prediction task. In the feature selection stage, we designed a modified genetic algorithm (m-GA) with an improved fitness function consisting of weights, MutInf, and RtMSE. In the FeExt stage, wgt-PCA is applied to the attributes obtained from the m-GA; that is, feature selection is performed first using m-GA and then feature extraction using wgt-PCA. In the third stage, prediction of the agricultural data is performed. For the real-world farming data, the fertilizer and crop information was acquired from "IndiaStats" [12]. Rainfall information was taken from "IndiaStats" [12] and the "India Water Portal" [13]. The soil data were acquired from the "Karnataka Atlas of Soil Fertility" [14]. The fertilizer, crop yield, rainfall, and soil information was merged to create a separate data file for each of the Indian crops, namely wheat, maize, bajra, jowar, cotton, groundnut, sugarcane, and ragi. The benchmark datasets used were "Forest Fires" (FF) [15], "weather Ankara" (wAnk) [16], and "weather Izmir" (wIzm) [17].
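The first (pre-processing) stage described above, mean imputation followed by min-max normalization into [0, 1], can be sketched as:

```python
import numpy as np

def preprocess(X):
    """Stage 1: mean imputation of missing values followed by
    min-max normalization of each feature into [0, 1]."""
    X = X.astype(float).copy()

    # Mean imputation: replace NaNs with the column mean.
    col_means = np.nanmean(X, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_means[nan_cols]

    # Min-max normalization so high-valued features do not dominate.
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ranges = np.where(maxs > mins, maxs - mins, 1.0)
    return (X - mins) / ranges

X = np.array([[10.0, 200.0],
              [np.nan, 400.0],
              [30.0, np.nan]])
print(preprocess(X))
```

Note that the second column (values in the hundreds) ends up on the same [0, 1] scale as the first, which is exactly the dominance problem the normalization step addresses.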
The key contributions of this work are as follows:
- Designed a modified genetic algorithm (m-GA) with an improved fitness function centred on mutual information and root mean square error.
- Designed a hybrid approach (m-GA+wgt-PCA) by integrating feature selection (m-GA) and feature extraction (wgt-PCA).
- Compared the designed hybrid approach (m-GA+wgt-PCA) with other feature selection and extraction methods, namely correlation analysis, singular value decomposition, genetic algorithm, and weighted principal component analysis, on 3 benchmark and 8 real-world agricultural datasets with respect to multiple performance metrics: R-squared, root mean square error, and mean absolute error.
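The m-GA idea of the first contribution can be sketched as below. The paper's exact fitness function, weights, and genetic operators are not reproduced here: the weighted-sum fitness, the correlation-based stand-in for mutual information, and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def rmse_of_subset(X, y, mask):
    """RMSE of a least-squares fit on the selected features."""
    if not mask.any():
        return np.inf
    A = np.column_stack([X[:, mask], np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ coef - y) ** 2)))

def fitness(X, y, mask, w1=0.5, w2=0.5):
    """Illustrative m-GA-style fitness: reward relevance (a correlation
    stand-in for mutual information) and penalise RMSE. The weighted-sum
    form and the weights w1, w2 are assumptions for this sketch."""
    rel = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1])
                   for j in np.where(mask)[0]]) if mask.any() else 0.0
    return w1 * rel - w2 * rmse_of_subset(X, y, mask)

def ga_select(X, y, pop=20, gens=30, p_mut=0.1):
    """Minimal genetic algorithm over binary feature masks."""
    n = X.shape[1]
    population = rng.integers(0, 2, size=(pop, n)).astype(bool)
    for _ in range(gens):
        scores = np.array([fitness(X, y, m) for m in population])
        order = np.argsort(scores)[::-1]
        parents = population[order[:pop // 2]]      # elitist selection
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)                # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n) < p_mut          # bit-flip mutation
            children.append(child)
        population = np.vstack([parents, children])
    scores = np.array([fitness(X, y, m) for m in population])
    return population[int(np.argmax(scores))]

# Synthetic data: only features 0 and 3 influence the target.
X = rng.normal(size=(80, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=80)
best = ga_select(X, y)
print(best)  # a good mask should include features 0 and 3
```

Because elitist selection carries the best masks forward unchanged, the highest-fitness subset found so far is never lost between generations.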
The remainder of this paper is organized as follows. Section 2 reviews work related to hybrid feature selection and extraction methods. Section 3 presents the designed approach. Section 4 demonstrates the experimental results. The paper ends with the conclusion and future work.
Related work
This section presents feature selection and feature extraction techniques used in agriculture and other domains. The following paragraphs present works on FS and FeExt techniques in the agricultural domain (with a focus on crop yield prediction):
In [18], the authors used random search, best search, and genetic methods for determining the relevant features for land classification. The results revealed that electrical conductivity, exchangeable sodium percentage, soil texture and wetness
Proposed work
This section presents the approach and algorithm of the devised hybrid FS and FeExt strategy for agricultural prediction. Fig. 1 depicts the flow of the devised technique, comprising three stages.
The designed approach consisted of three stages. In the first stage, the data were pre-processed by predicting the missing values and normalizing the data. The missing values in the real-world agricultural datasets were predicted using the "mean imputation" technique. The
Experimental setup and findings
This section describes the experimental setup along with the results obtained from the feature selection methods. "Matlab R2016a" was employed as the coding language on a "Windows 7" environment.
In this work, experiments were conducted on the Indian crop datasets and the benchmark datasets. On each dataset we compared our hybrid feature selection method (m-GA+wgt-PCA) with other feature selection methods (m-GA, GA, and correlation) and feature extraction methods (SiVD, wgt-PCA) with
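The comparison metrics used throughout the experiments (R-squared, RMSE, and MAE) can be computed as in this minimal sketch:

```python
import numpy as np

def metrics(y_true, y_pred):
    """R-squared, RMSE and MAE for comparing regression methods."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    ss_res = np.sum(resid ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return {"R2": float(1 - ss_res / ss_tot),
            "RMSE": float(np.sqrt(np.mean(resid ** 2))),
            "MAE": float(np.mean(np.abs(resid)))}

print(metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))
```

Higher R-squared and lower RMSE/MAE indicate a better model, which is the sense in which the methods are ranked in the tables that follow.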
Discussion
This section discusses why the proposed hybrid feature selection method performed better than the other feature selection methods.
- The proposed method performed better than the original data (comprising all the features), since the original data contained features that did not contribute to the target attribute. Removing the unimportant attributes from the original data made the predictive models perform with higher accuracy. For instance, with regards to
Conclusion and future scope
In this work, a hybrid feature selection technique was developed by combining the feature selector m-GA with the feature extractor wgt-PCA. The m-GA was developed by designing a fitness function based on MutInf and RtMSE to select the best features, which significantly affected the crop yields. The selected features were then subjected to feature extraction using wgt-PCA. By integrating m-GA and wgt-PCA, the strengths of both FS and FeExt were obtained. Extensive experiments were conducted on 8
CRediT authorship contribution statement
K. Aditya Shastry: Conceptualization, Methodology, Software, Data curation, Writing – original draft, Visualization, Investigation, Writing – review & editing. Sanjay H.A.: Conceptualization, Supervision, Validation, Reviewing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (50)
- et al., A survey on feature selection methods, Comput. Electr. Eng. (2014)
- et al., A survey on big data pre-processing
- et al., Data mining and pattern recognition in agriculture, Künstl. Intell. (2013)
- et al., A literature review of feature selection techniques and applications: Review of feature selection in data mining
- L.J. Herrera, V. Lafuente, R. Ghinea, M.M. Perez, I. Negueruela, H. Pomares, I. Rojas, A. Guillén, Mutual...
- et al., Feature selection: Filter methods performance challenges
- et al., Review on wrapper feature selection approaches
- Feature selection
- et al., An introduction to feature extraction
- et al., Weighted principal component analysis
- Combined feature extraction and selection in texture analysis
- Feature selection as a time and cost-saving approach for land suitability classification (case study of Shavur plain, Iran), Agriculture
- Selection of important features for optimizing crop yield prediction, Int. J. Agricult. Environ. Inf. Syst.
- Crop yield prediction using deep neural networks, Front. Plant Sci.
- A hybrid CFS filter and RF-RFE wrapper-based feature extraction for enhanced agricultural crop yield prediction modeling, Agriculture
- Extracting important features for crop yield prediction with convolutional neural networks on remote sensing and meteorological data, Geophys. Res. Abst.
- Crop yield prediction using machine learning: A systematic literature review, Comput. Electron. Agric.
- Wheat crop yield prediction using deep LSTM model, Comput. Vis. Pattern Recognit.