DECIMER 1.0: deep learning for chemical image recognition using transformers
Journal of Cheminformatics (IF 8.6), Pub Date: 2021-08-17, DOI: 10.1186/s13321-021-00538-8
Kohulan Rajan 1 , Achim Zielesny 2 , Christoph Steinbeck 1

The amount of data available on chemical structures and their properties has increased steadily over the past decades. However, articles published before the mid-1990s are available only in printed or scanned form. Extracting the data from those articles and storing it in a publicly accessible database is desirable, but doing this manually is slow and error-prone. Optical Chemical Structure Recognition (OCSR) tools have been developed to extract chemical structure depictions and convert them into a computer-readable format; the best-performing OCSR tools are mostly rule-based. The DECIMER (Deep lEarning for Chemical ImagE Recognition) project was launched to address the OCSR problem with the latest computational intelligence methods and to provide an automated open-source software solution. Various current deep learning approaches were explored to find the best-fitting solution to the problem. In a preliminary communication, we outlined the prospect of predicting SMILES encodings of chemical structure depictions with about 90% accuracy using a dataset of 50–100 million molecules. This article presents the new DECIMER model, a transformer-based network that predicts SMILES with above 96% accuracy from depictions of chemical structures without stereochemical information and with above 89% accuracy from depictions with stereochemical information.

Updated: 2021-08-17