Unassisted Noise-Reduction of Chemical Reactions Data Sets
ChemRxiv Pub Date: 2020-06-01, DOI: 10.26434/chemrxiv.12395120.v1
Alessandra Toniato, Philippe Schwaller, Antonio Cardinale, Joppe Geluykens, Teodoro Laino

Existing deep learning models applied to reaction prediction in organic chemistry are able to reach extremely high levels of accuracy (> 90% for NLP-based ones [1]). With no chemical knowledge embedded other than the information learnt from reaction data, the quality of the data sets plays a crucial role in the performance of the prediction models. Because human curation is prohibitively expensive, unaided approaches that remove chemically incorrect entries from existing data sets are essential to improve the performance of artificial intelligence models in synthetic chemistry tasks. Here we propose a machine learning-based, unassisted approach to remove chemically wrong entries (noise) from chemical reaction collections. Results show that models trained on cleaned and balanced data sets improve the quality of the predictions without a decrease in performance. For the retrosynthetic models, the round-trip accuracy is enhanced by 13% and the value of the cumulative Jensen-Shannon metric is lowered to 70% of its original value, while maintaining high coverage (97%) and constant class diversity (1.6) at inference.
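
For readers unfamiliar with the Jensen-Shannon divergence that underlies the metric named above, the following is a minimal sketch of the standard (non-cumulative) divergence between two discrete distributions. It is not the authors' cumulative variant, which is defined in the full paper and is not reproduced here; the function names `kl_divergence` and `jensen_shannon_divergence` and the example distributions are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Standard Jensen-Shannon divergence between two discrete distributions.

    Assumption: p and q are same-length sequences of non-negative
    probabilities that each sum to 1. With base-2 logarithms the
    result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution m = (p + q) / 2; m_i > 0 wherever p_i or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Hypothetical class-likelihood distributions, for illustration only.
    uniform = [0.25, 0.25, 0.25, 0.25]
    skewed = [0.70, 0.10, 0.10, 0.10]
    print(jensen_shannon_divergence(uniform, skewed))
```

A lower value means the two distributions are more similar; identical distributions give 0 and distributions with disjoint support give 1.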



