Fast method for GA-PLS with simultaneous feature selection and identification of optimal preprocessing technique for datasets with many observations

Journal of Chemometrics (IF 2.4), Pub Date: 2020-03-01, DOI: 10.1002/cem.3195
Petter Stefansson, Kristian H. Liland, Thomas Thiis, Ingunn Burud

A fast and memory-efficient new method for performing genetic algorithm partial least squares (GA-PLS) on spectroscopic data preprocessed in multiple different ways is presented. The method, which is primarily intended for datasets containing many observations, involves preprocessing a spectral dataset with several different techniques and concatenating the different versions of the data horizontally into a design matrix X that is both tall and wide. The large matrix is then condensed into a substantially smaller covariance matrix XᵀX whose size is independent of the number of observations in the dataset, i.e., the height of X. It is demonstrated that the smaller covariance matrix can be used to efficiently calibrate partial least squares (PLS) models containing feature selections from any of the involved preprocessing techniques. The method is incorporated into GA-PLS and used to evolve variable selections for a set of different preprocessing techniques concurrently within a single algorithm. This allows a single instance of GA-PLS to determine which preprocessing technique, within the set of considered methods, is best suited for the spectroscopic dataset. Additionally, the method allows feature selections to be evolved that contain variables from a mixture of different preprocessing techniques. The benefits of the introduced GA-PLS technique are threefold: (1) for datasets with many observations, the proposed method is substantially faster than conventional GA-PLS implementations based on NIPALS, SIMPLS, etc.; (2) a single GA-PLS run automatically reveals which of the considered preprocessing techniques results in the lowest model error; and (3) it allows the exploration of highly complex solutions composed of features preprocessed using various techniques.
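The central idea of the abstract, calibrating PLS models for arbitrary variable selections from a precomputed XᵀX, can be illustrated with a minimal sketch. The abstract does not specify the exact algorithm, so the code below is an assumption: it uses a kernel-style PLS1 recursion (in the spirit of the improved kernel PLS algorithm of Dayal and MacGregor) that needs only XᵀX and Xᵀy. The function name `pls1_from_cov`, the synthetic data, and the two "preprocessings" (raw data and a first derivative) are all hypothetical choices made for illustration.

```python
import numpy as np

def pls1_from_cov(XtX, Xty, n_components):
    """PLS1 regression coefficients computed from X^T X and X^T y only.

    Assumes X and y were mean-centered before the products were formed.
    """
    p = XtX.shape[0]
    R = np.zeros((p, n_components))    # weights mapping X directly to scores
    P = np.zeros((p, n_components))    # X loadings
    q = np.zeros(n_components)         # y loadings
    s = np.array(Xty, dtype=float)     # working copy of X^T y, deflated below
    for a in range(n_components):
        w = s / np.linalg.norm(s)
        r = w.copy()
        for j in range(a):
            r -= (P[:, j] @ w) * R[:, j]
        XtX_r = XtX @ r
        tt = r @ XtX_r                 # t^T t without ever forming t = X r
        P[:, a] = XtX_r / tt
        q[a] = (r @ s) / tt
        s = s - P[:, a] * (q[a] * tt)  # deflate X^T y instead of X
        R[:, a] = r
    return R @ q

# Synthetic stand-in for a spectral dataset with many observations.
rng = np.random.default_rng(0)
n, p_raw = 5000, 20
raw = rng.normal(size=(n, p_raw))
y = raw[:, 1] - 0.5 * raw[:, 7] + 0.1 * rng.normal(size=n)

# Two hypothetical preprocessings, concatenated horizontally.
X = np.hstack([raw, np.gradient(raw, axis=1)])
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Computed once; the size depends on the number of columns, not on n.
XtX = Xc.T @ Xc
Xty = Xc.T @ yc

# A candidate selection mixing raw (1, 7) and derivative (22, 35) columns.
idx = np.array([1, 7, 22, 35])
b = pls1_from_cov(XtX[np.ix_(idx, idx)], Xty[idx], n_components=2)
print(XtX.shape, b)
```

In a GA-PLS setting, each candidate selection would correspond to an index set like `idx`, so evaluating it only requires slicing the small XᵀX rather than revisiting all observations. How the paper scores each candidate (for example, its validation scheme) is not stated in the abstract and is not reproduced here.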

Updated: 2020-03-01
