Evaluating causal-based feature selection for fuel property prediction models
Statistical Analysis and Data Mining (IF 1.3), Pub Date: 2021-05-11, DOI: 10.1002/sam.11511
Bernard Nguyen, Leanne S. Whitmore, Anthe George, Corey M. Hudson

In-silico screening of novel biofuel molecules based on chemical and fuel properties is a critical first step in the biofuel evaluation process due to the significant volumes of samples required for experimental testing, the destructive nature of engine tests, and the costs associated with bench-scale synthesis of novel fuels. Predictive models are limited by training sets of few existing measurements, often containing similar classes of molecules that represent just a subset of the potential molecular fuel space. Software tools can be used to generate every possible molecular descriptor for use as input features, but most of these features are largely irrelevant and training models on datasets with higher dimensionality than size tends to yield poor predictive performance. Feature selection has been shown to improve machine learning models, but correlation-based feature selection fails to provide scientific insight into the underlying mechanisms that determine structure–property relationships. The implementation of causal discovery in feature selection could potentially inform the biofuel design process while also improving model prediction accuracy and robustness to new data. In this study, we investigate the benefits causal-based feature selection might have on both model performance and identification of key molecular substructures. We found that causal-based feature selection performed on par with alternative filtration methods, and that a structural causal model provides valuable scientific insights into the relationships between molecular substructures and fuel properties.

Updated: 2021-05-11