Inferring experimental procedures from text-based representations of chemical reactions
Nature Communications (IF 14.7), Pub Date: 2021-05-06, DOI: 10.1038/s41467-021-22951-1
Alain C Vaucher 1, Philippe Schwaller 1, Joppe Geluykens 1, Vishnu H Nair 1, Anna Iuliano 2, Teodoro Laino 1

The experimental execution of chemical reactions is a context-dependent and time-consuming process, often carried out by drawing on experience collected over decades of laboratory work or by searching for similar, previously executed experimental protocols. Although data-driven schemes, such as retrosynthetic models, are becoming established technologies in synthetic organic chemistry, converting proposed synthetic routes into experimental procedures remains a burden on the shoulders of domain experts. In this work, we present data-driven models that predict the entire sequence of synthesis steps from a textual representation of a chemical equation, for application in batch organic chemistry. We generated a data set of 693,517 chemical equations and associated action sequences by extracting and processing experimental procedure text from patents, using state-of-the-art natural language models. We used the resulting data set to train three different models: a nearest-neighbor model based on recently introduced reaction fingerprints, and two deep-learning sequence-to-sequence models based on the Transformer and BART architectures. An analysis by a trained chemist revealed that the predicted action sequences are adequate for execution without human intervention in more than 50% of the cases.
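The nearest-neighbor idea mentioned in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the three-component fingerprint vectors, the cosine similarity metric, the action strings, and the function names are hypothetical and are not taken from the paper, whose fingerprints and action vocabulary are not described in this abstract.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_action_sequence(query_fp, library):
    """Return the action sequence of the library entry most similar to query_fp.

    `library` is a list of (fingerprint, action_sequence) pairs.
    """
    best = max(library, key=lambda entry: cosine_similarity(query_fp, entry[0]))
    return best[1]


# Toy library: hypothetical fingerprints paired with hypothetical action sequences.
LIBRARY = [
    ([1.0, 0.0, 0.2], ["ADD reagent", "STIR 2 h", "FILTER"]),
    ([0.0, 1.0, 0.1], ["ADD solvent", "REFLUX 1 h", "CONCENTRATE"]),
]

# The query is closest to the first entry, so its action sequence is reused.
predicted = nearest_action_sequence([0.9, 0.1, 0.2], LIBRARY)
print(predicted)
```

A retrieval model of this kind reuses the procedure of the most similar known reaction unchanged, whereas the sequence-to-sequence models named in the abstract generate the action sequence token by token.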




Updated: 2021-05-06