The (Re)-Evolution of Quantitative Structure–Activity Relationship (QSAR) Studies Propelled by the Surge of Machine Learning Methods
Journal of Chemical Information and Modeling (IF 5.6), Pub Date: 2022-11-28, DOI: 10.1021/acs.jcim.2c01422
Thereza A. Soares, Ariane Nunes-Alves, Angelica Mazzolari, Fiorella Ruggiu, Guo-Wei Wei, Kenneth Merz

In their seminal work on quantitative structure–activity relationship (QSAR), Hansch and co-workers predicted in 1962 that Hammett functions and partition coefficients would become significant in establishing the relationship between structure and activity. (1) Over the last 60 years, QSAR has evolved from the crude regression/classification analysis of a small set of similar compounds to sophisticated machine learning (ML)-based techniques that can extract the chemical, physical, and biological functions embedded in a massive data set of complex molecular structures. Throughout this transformation, QSAR became a vital component of drug discovery, allowing for the highly efficient, low-cost prediction of activities and properties as well as structure-based virtual screening of potentially active hits from chemical libraries composed of millions of drug candidates. Machine learning is also applied in various other fields, (2,3) including retrosynthetic route prediction, (4,5) protein (6) and compound design, (7) conformer generation, (8) force-field optimization, (9,10) and protein structure prediction. (11) The classical QSAR approach relies on mathematical models to establish a relation between molecular structure, encoded through various descriptors (i.e., two-dimensional (2D) descriptors, fingerprints, graphs, or other mathematical representations), and biological activities, including absorption, distribution, metabolism, excretion, and toxicity (ADMET) profiling, (12) binding free energies, (13,14) and kinetic rates for protein–ligand complexes, (15,16) derived from a set of molecules of similar topology and functionality. Because a broad repertoire of mathematical models can be used, QSAR incorporated ML algorithms early on to process huge multidimensional data sets through multitask models successfully applied to nonlinear structure–function relationships.
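The classical descriptor-to-activity mapping described above can be illustrated with a Hansch-type linear model, log(1/C) = a·logP + b·σ + c. The following is a minimal sketch only: the descriptor values, the coefficients used to generate the synthetic activities, and the helper names (`fit_hansch`, `solve_linear_system`, `predict`) are assumptions made for illustration and are not taken from the editorial or from reference 1.

```python
# Hansch-type QSAR sketch: fit log(1/C) = a*logP + b*sigma + c by least squares.
# All numbers below are made up for illustration; they are not experimental data.

def solve_linear_system(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        # Bring the row with the largest pivot into position for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_hansch(descriptors, activities):
    """Least-squares fit via the normal equations (X^T X) w = X^T y.

    A constant column is appended so that the last weight is the intercept c.
    """
    rows = [list(d) + [1.0] for d in descriptors]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, activities)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(weights, descriptor):
    """Evaluate the fitted linear model for one (logP, sigma) pair."""
    return sum(w * x for w, x in zip(weights, descriptor)) + weights[-1]


# Hypothetical congeneric series: (logP, Hammett sigma) for each compound.
descriptors = [(1.0, 0.0), (1.5, 0.2), (2.0, -0.1), (2.5, 0.4), (3.0, 0.1)]
# Synthetic activities generated from log(1/C) = 0.8*logP + 1.2*sigma + 2.0.
activities = [0.8 * logp + 1.2 * sigma + 2.0 for logp, sigma in descriptors]

weights = fit_hansch(descriptors, activities)  # [a, b, c]
```

Because the activities here are generated without noise, the fit recovers the generating coefficients; with real assay data the residuals, and the validity of a linear form at all, would have to be examined.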
JCIM was a pioneer among ACS journals in publishing content on the application of Artificial Intelligence (AI) and ML to chemistry. (4,17−20) What appears to be the first ML-related contribution published in JCIM, then under the name Journal of Chemical Information and Computer Sciences, was the construction and refinement of a database for synthetic organic chemistry via inductive and deductive ML algorithms in 1990. (4) In 1992, two papers using neural networks (NNs) were published, one analyzing the partial derivatives of neural networks for structure–activity relationship (SAR) analysis (20) and the other predicting phosphorus NMR shifts using NNs. (19) In 1994, more papers were published: one compared the performance of different QSAR property prediction methods, including an attribute-value ML model for the construction of regression trees, (18) and another used NNs to obtain quantitative information from ion mobility spectrometry data. (17)
Moving forward, the association between ML and QSAR became an integral part of many JCIM publications. From 1995, for almost a decade, the main ML applications involved Artificial Neural Networks (ANNs), alone or benchmarked against other algorithms, often applied to bioactivity and toxicity prediction. (21−23) ANNs became very popular in recent years. The k-nearest neighbors algorithm (K-NN) was introduced in 2000, (24) support-vector machines (SVMs) in 2003, (25) and Random Forest (RF) in 2003. (26) These methods were applied to both regression and classification tasks in chemical information. Two of the most successful ML algorithms for general QSAR applications in drug discovery are RF and deep neural networks (DNNs). (27−29) The study by Svetnik and colleagues, (26) published in the Journal of Chemical Information and Computer Sciences in 2003, is one of the first examples of the use of RF in QSAR, and RF was later often used as a gold standard for comparisons with other QSAR methods.
However, after the Kaggle Merck Molecular Activity Challenge 2013 (https://www.kaggle.com/c/MerckActivity, accessed 2022-09-16) and the Tox21 Data Challenge 2015 (https://ncats.nih.gov/news/releases/2015/tox21-challenge-2014-winners, accessed 2022-09-16), DNNs have emerged as the method of choice for QSAR applications in drug discovery. (27)
This virtual issue consists of a collection of 13 selected papers published in JCIM from 2017 onward. It highlights the re-evolution of QSAR propelled by the surge of ML and serves as a reference for those who want to learn about modern QSAR methods and applications. It covers different ML methods (e.g., random forest and deep learning) and grapples with significant issues for drug discovery, such as activity cliffs in the activity landscape (very similar molecules with very different activities), the representation of different conformations of small molecules, and how to share models without revealing the data of the training set, which can facilitate greater collaboration in the industry sector.
Two Perspectives by Sheridan and co-workers (30,31) investigate the performance of multiple versions of RF models spanning a period of 10 years. Such models displayed unexpected behavior in the prediction of ADMET properties. In one of the Perspectives, (30) the predictivity of the models is shown to vary widely between versions. This could be explained by activity cliffs: the presence of molecules in the test set with activities different from those of similar molecules in the training set, leading to less accurate predictions. The second Perspective explores the reasons underlying the large changes in ADMET predictions for some molecules when different model versions were applied. (31) Interestingly, for most molecules with prediction changes, the prediction improved in later model versions.
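The notion of an activity cliff mentioned above (structurally similar molecules with very different activities) can be sketched in a few lines. This is an illustrative sketch, not the procedure used in the cited Perspectives: fingerprints are represented as plain sets of on-bit indices, and the molecule names, bit sets, activity values, and both thresholds are hypothetical.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def find_activity_cliffs(molecules, min_similarity=0.8, min_activity_gap=2.0):
    """Return pairs that are similar in structure but far apart in activity.

    `molecules` is a list of (name, fingerprint, activity) tuples, where the
    activity is assumed to be on a log scale such as pIC50.
    """
    cliffs = []
    for (name_a, fp_a, act_a), (name_b, fp_b, act_b) in combinations(molecules, 2):
        similarity = tanimoto(fp_a, fp_b)
        gap = abs(act_a - act_b)
        if similarity >= min_similarity and gap >= min_activity_gap:
            cliffs.append((name_a, name_b, similarity, gap))
    return cliffs


# Hypothetical data: mol_A and mol_B share 8 of 10 distinct bits but differ by 3.5 log units.
molecules = [
    ("mol_A", frozenset({1, 2, 3, 4, 5, 6, 7, 8, 9}), 8.5),
    ("mol_B", frozenset({1, 2, 3, 4, 5, 6, 7, 8, 10}), 5.0),
    ("mol_C", frozenset({20, 21, 22, 23}), 5.2),
]

cliffs = find_activity_cliffs(molecules)
```

With these made-up inputs only the mol_A/mol_B pair is flagged. If such a pair straddles the training/test split, a similarity-driven model would be expected to mispredict the test member, which is the failure mode the Perspective describes.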
Metrics indicating which molecules will display large prediction changes were explored, leading to the observation that the molecules with large changes were associated with large prediction uncertainty in the models. However, the converse was not true, and therefore there is room for further investigation. Essentially, the similarity between the training set and test set is directly related to the predictability of ML models. The training set must cover a sufficiently large chemical space to render a reliable ML model.
Two Research articles have also addressed the hurdles of predicting activity cliffs and how to improve the accuracy of ADMET properties for some end points. (32,33) Coley and co-workers addressed the identification of activity cliffs associated with the roughness of the property landscape of molecular data sets. (32) The authors proposed a new measure of the structure–property landscape roughness for molecular data sets, the Roughness Index. This measure can be applied to regression tasks for any property and to binary classification, extending beyond the quantitative measure of structure–property relationships of small organic molecules to more complex chemical systems such as crystalline materials. Lim et al. have generated new deep learning models for the prediction of quantum mechanics (QM) descriptors that, combined with graph convolutional neural network models, resulted in some improvement for selected ADME properties. The take-home message is that the integration of QM descriptors into ML approaches to QSAR is a path to be explored, but further development is required. (33)
The inclusion of three-dimensional descriptors in QSAR contributes information on the spatial structure of molecules, improving the accuracy of models for protein–ligand recognition. However, the choice of the bioactive conformation to be encoded in the QSAR model is not trivial. Znakov et al. tackle this issue via multi-instance learning approaches, which make it possible to represent each molecule in the data set by multiple conformations with automatic selection of the most relevant ones. (34) Single- and multi-instance machine learning algorithms are compared for the prediction of the biological activity of chemical compounds from 175 data sets extracted from the ChEMBL23 database. The multi-instance QSAR models outperformed the single-instance ones in most cases while alleviating the difficulty of selecting the bioactive conformation out of all possible conformations during model building.
The prediction of molecular activities is at the core of QSAR models, and it has been greatly boosted by the integration of ML algorithms, particularly DNNs. (29,35) Wallqvist and co-workers examined the performance of ML methods (DNN, RF, and variable nearest neighbor) in predicting the molecular activities (36) of 21 data sets from the Leadscope Toxicity Database and the Merck Molecular Activity Challenge. Although the different algorithms led to accurate predictions for molecules structurally related to the ones in the training sets, performance was poorer for molecules increasingly dissimilar from those used for training. The work argued that the foremost source of error in predictions of molecular properties is not the ML algorithm but the degree of similarity between molecules in the test and training sets.
Different contributions addressed the use of large data sets. The method DLCA (deep learning consensus architecture) (37) is a new deep learning architecture that incorporates consensus modeling inside a neural net. It was tested in a regression task using a data set of 251 998 compounds with available half-maximal inhibitory concentration (IC50) values against protein targets and in a classification task using a data set of 7857 compounds, which were classified as toxic or nontoxic.
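The observation above, that prediction quality tracks the similarity between a query molecule and the training set, suggests a simple diagnostic: compute each query's nearest-neighbor similarity to the training set and flag queries that fall below a cutoff. The sketch below is an assumption-laden illustration, not the analysis of reference 36; the fingerprints are toy sets of on-bit indices, and the 0.5 cutoff and the function names are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def nearest_training_similarity(query_fp, training_fps):
    """Similarity of the query to its most similar training molecule."""
    return max(tanimoto(query_fp, fp) for fp in training_fps)


def flag_low_similarity(queries, training_fps, threshold=0.5):
    """Map each query name to True when it lies far from every training molecule.

    A True flag is meant as a warning that the prediction is an extrapolation,
    not as a statement that the prediction is wrong.
    """
    return {
        name: nearest_training_similarity(fp, training_fps) < threshold
        for name, fp in queries.items()
    }


# Hypothetical training fingerprints and two query molecules.
training_fps = [frozenset({1, 2, 3, 4}), frozenset({3, 4, 5, 6})]
queries = {
    "close_analog": frozenset({1, 2, 3, 5}),   # shares 3 of 5 distinct bits with the first training molecule
    "new_scaffold": frozenset({7, 8, 9}),      # shares no bits with the training set
}

flags = flag_low_similarity(queries, training_fps)
```

In practice the cutoff would be calibrated against held-out error, for example by binning test molecules by nearest-neighbor similarity and inspecting the error in each bin.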
DLCA performed better on both data sets than other consensus approaches. In another contribution, an automated workflow was created to build a classification-based model for diverse and imbalanced data sets. (38) This workflow was tested using a data set composed of 196 173 compounds, of which 1063 displayed antileishmanial activity. Six different methods were tested to build a consensus model, and the model using decision trees had the best performance.
Another collection of papers reported on the construction of classification-based QSAR models. The open-source software QSAR-Co (QSAR with conditions) (39) was developed to set up models that handle response data with diverse experimental or theoretical conditions. Classification can be performed using two methods, namely, two-class linear discriminant analysis or RF. The new algorithm RINH (rivality of the neighborhood) (40) uses a rivality index to build models, allowing one to obtain robust measurements of the reliability of the predictions. The method was compared against 12 different algorithms, including Support Vector Machine and RF, and generated classification models as accurate as those obtained by the other ML algorithms.
The Review by Polishchuk (27) provides a critical assessment of different approaches for the interpretation of QSAR models. Model-dependent approaches are discussed for different methods, such as multiple linear regression, partial least-squares regression, and decision trees, as well as for methods considered to be black boxes, such as neural nets. Model-independent approaches, which investigate how the model output changes with variation in the input, are also presented, such as sensitivity analysis, variable importance, and partial derivatives. Interestingly, the author claims that any contemporary QSAR model can be considered interpretable, even the ones considered to be black boxes.
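The model-independent idea described above, probing how the output changes when an input is varied, can be sketched with a finite-difference sensitivity estimate that treats the model purely as a callable. Everything in this sketch is an assumption for illustration: `black_box_model` is a stand-in with an invented functional form, the descriptor names and sample values are hypothetical, and this is not the specific procedure of the Review.

```python
def black_box_model(x):
    """Stand-in for any trained QSAR model; the functional form is hypothetical."""
    logp, n_hbd, mol_weight = x
    # The model deliberately ignores mol_weight so that its sensitivity is zero.
    return 0.9 * logp - 0.5 * n_hbd * n_hbd + 0.0 * mol_weight


def mean_sensitivity(model, samples, feature_index, delta=1e-4):
    """Average absolute central-difference derivative of the output w.r.t. one input.

    Only model evaluations are used, so the same function applies to a linear
    model, a random forest, or a neural net.
    """
    total = 0.0
    for sample in samples:
        up = list(sample)
        down = list(sample)
        up[feature_index] += delta
        down[feature_index] -= delta
        total += abs(model(up) - model(down)) / (2 * delta)
    return total / len(samples)


# Hypothetical descriptor vectors: (logP, number of H-bond donors, molecular weight).
samples = [(1.0, 1.0, 300.0), (2.0, 2.0, 350.0), (3.0, 0.0, 250.0)]
feature_names = ["logP", "n_hbd", "mol_weight"]

sensitivities = {
    name: mean_sensitivity(black_box_model, samples, index)
    for index, name in enumerate(feature_names)
}
```

For discrete or strongly correlated descriptors a small continuous perturbation is not meaningful, and a permutation-based variable importance would be a more natural choice; the sketch is limited to smooth, continuous inputs.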
The Review concludes by providing a list of criteria to avoid misinterpretation of QSAR models, which remain highly relevant as QSAR models (re)evolve around ML methods.
In the last several decades, QSAR became an integral tool in the drug discovery pipeline of major pharmaceutical and start-up companies. The fusion of QSAR and ML methods shifted the concept of drug discovery from rule-based to data-driven, advancing the discovery of new compounds but also unveiling new challenges for the industry. Martin and Zhu argued that, by sharing assay data and training models, companies can obtain better predictions than those achieved individually. (41) To make their case, the authors adapted the RF-based profile-QSAR (42) for model sharing and explored the benefits of collaboration between companies via partial multitask model sharing. The authors concluded that the sharing of individual assay models, without the sharing of compounds, targets, or activity data, could expand the applicability of the models beyond what was individually available to each company. Pande and co-workers argued that, despite the performance of deep learning on drug discovery data sets, the pharma and biotech industries struggle to transition from prototypes into production. (35) This is attributed to two factors: the challenge of implementing deep architectures and the poor understanding of the failure modes of multitask deep networks relative to other ML methods. To promote the adoption of deep networks in commercial drug discovery, multitask deep networks were implemented in the DeepChem library for drug discovery, together with an analysis of statistical robustness tested on the Kaggle, Factors, Kinase, and UV data sets. (35)
Recent years have witnessed the development of advanced mathematical tools for QSAR.
Techniques derived from differential geometry, algebraic topology (i.e., topological data analysis (TDA)), algebraic graph theory, and combinatorial graph theory have been devised to offer benchmark predictions for docking, virtual scoring, and toxicity analysis.
As QSAR re-evolves around ML, it will remain at the core of the drug discovery process but not without some old challenges. ML methods are dependent on experimental data, which can often be sparse, imbalanced, and noisy, as experiments and conditions are not standardized across different laboratories. Large and consistent experimental data sets and new ML algorithms should be the way ahead. It is, however, clear that the big surge of ML in chemistry has just begun and has already expanded the horizons of potential QSAR application far beyond what was initially foreseen by Hansch and co-workers. (1)
It is worth noting that the recent development of ML, including deep learning (DL), has gone well beyond QSAR. For example, advanced natural language processing (NLP)-based autoencoders offer sequence embeddings from unlabeled data for accurate molecular property predictions. Additionally, Transformers utilize self-supervised learning (SSL) strategies to draw on hundreds of millions of molecular and biomolecular sequences and enable structure-free activity predictions. In terms of tasks, ML and DL concern not only regression and classification but also clustering and dimensionality reduction, which have wide applications in omics data analysis. A vast variety of ML and DL techniques have been developed and applied to chemical science, including generative adversarial networks (GANs), U-Net, long short-term memory (LSTM), graph neural networks (GNNs), reinforcement learning (RL), and Boltzmann machines. While transfer learning continues to be a popular topic, active learning has also been applied to chemical information.
Various ML and DL strategies have been developed to extract chemical activity under adverse conditions, such as diverse data, imbalanced data, data imputation, noisy data, and small data challenges. The field of chemical information and modeling has evolved beyond chemistry, with players and stakeholders from computer science, mathematics, and the chemical, biological, and medical engineering industries, to mention only a few.
JCIM has a long tradition of advancing the integration of ML in chemistry, as outlined by this and other editorials. (28,43) It was publishing manuscripts in the field well before the recent explosion in interest (Figure 1). Given our long-term commitment to QSAR and its applications, JCIM encourages the submission of manuscripts on a wide variety of ML-related topics, (43) including new developments contributing to the re-evolution of quantitative structure–activity relationships.
Figure 1. Histograms of the number of times the expressions “quantitative structure activity relationship”, “machine learning”, or “deep learning” were mentioned in the title of manuscripts published in JCIM (under its current name or under its previous names, Journal of Chemical Documentation or Journal of Chemical Information and Computer Sciences) since 1960. Data extracted from pubs.acs.org using the search tool in October of 2022.
A.N.A. is thankful for the funding from DFG under Germany’s Excellence Strategy–EXC 2008/1-390540038–UniSysCat. T.A.S. acknowledges financial support from FAPESP (Grant No. 2021/04283-3), CNPq (Grant No. 307193/2021-7), and RCN (Grant No. 262695).
This article references 43 other publications.

Updated: 2022-11-29