Unassisted noise reduction of chemical reaction datasets
Nature Machine Intelligence (IF 18.8), Pub Date: 2021-03-29, DOI: 10.1038/s42256-021-00319-w
Alessandra Toniato, Philippe Schwaller, Antonio Cardinale, Joppe Geluykens, Teodoro Laino

Existing deep learning models applied to reaction prediction in organic chemistry can reach high levels of accuracy (>90% for natural language processing-based ones). With no chemical knowledge embedded other than the information learnt from reaction data, the quality of the datasets plays a crucial role in the performance of the prediction models. Human curation is prohibitively expensive, so unaided approaches to remove chemically incorrect entries from existing datasets are essential to improve the performance of artificial intelligence models in synthetic chemistry tasks. Here, we propose a machine learning-based, unassisted approach to remove chemically wrong entries from chemical reaction collections. We apply this method to the Pistachio collection of chemical reactions and to an open dataset, both extracted from United States Patent and Trademark Office patents. Our results show an improved prediction quality for models trained on the cleaned and balanced datasets. For retrosynthetic models, the roundtrip accuracy metric grows by 13 percentage points and the value of the cumulative Jensen–Shannon divergence decreases by 30% compared to its original record. The coverage remains high at 97%, and the value of the class diversity is not affected by the cleaning. The proposed strategy is the first unassisted rule-free technique to address automatic noise reduction in chemical datasets.
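
As a rough illustration of the divergence measure named in the abstract, the sketch below computes the standard Jensen–Shannon divergence between two discrete distributions, for example per-class counts before and after cleaning. The abstract does not define the cumulative variant it reports, so this is only the textbook two-distribution form, not the authors' metric; the function name and the example counts are hypothetical.

```python
import math


def jensen_shannon_divergence(p, q):
    """Standard Jensen-Shannon divergence (log base 2) of two discrete distributions.

    Assumption: p and q are non-negative weights over the same ordered set of
    categories. They are normalised here, so raw counts are accepted.
    This is NOT the cumulative variant reported in the abstract.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of categories")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("each distribution needs positive total mass")

    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution M = (P + Q) / 2
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i == 0 contribute 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical class counts, for illustration only.
before = [50, 30, 20]
after = [40, 35, 25]
print(jensen_shannon_divergence(before, after))
```

With base-2 logarithms the value lies between 0 (identical distributions) and 1 (disjoint support), which makes a relative decrease such as the one quoted in the abstract straightforward to interpret.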




Updated: 2021-03-29