Current journal: Journal of Cheminformatics
  • Building attention and edge message passing neural networks for bioactivity and physical–chemical property prediction
    J. Cheminform. (IF 4.154) Pub Date: 2020-01-08
    M. Withnall; E. Lindelöf; O. Engkvist; H. Chen

    Neural Message Passing for graphs is a promising and relatively recent approach for applying Machine Learning to networked data. As molecules can be described intrinsically as a molecular graph, it makes sense to apply these techniques to improve molecular property prediction in the field of cheminformatics. We introduce Attention and Edge Memory schemes to the existing message passing neural network framework, and benchmark our approaches against eight different physical–chemical and bioactivity datasets from the literature. We remove the need to introduce a priori knowledge of the task and chemical descriptor calculation by using only fundamental graph-derived properties. Our results consistently perform on-par with other state-of-the-art machine learning approaches, and set a new standard on sparse multi-task virtual screening targets. We also investigate model performance as a function of dataset preprocessing, and make some suggestions regarding hyperparameter selection.
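
    The message passing idea in this abstract can be illustrated with a minimal sketch. It is not the authors' model: the one-hot atom features, the dot-product attention score, and the additive update below are hypothetical stand-ins for functions that a real network would learn, and the edge memory scheme is not shown.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def message_passing_step(node_feats, adjacency, use_attention=False):
    """One round: every atom aggregates the feature vectors of its neighbours."""
    new_feats = {}
    for v, h_v in node_feats.items():
        neighbours = adjacency[v]
        if not neighbours:
            new_feats[v] = list(h_v)
            continue
        if use_attention:
            # Toy attention: dot product between the atom and each neighbour.
            scores = [sum(a * b for a, b in zip(h_v, node_feats[w]))
                      for w in neighbours]
            weights = softmax(scores)
        else:
            weights = [1.0] * len(neighbours)
        message = [0.0] * len(h_v)
        for w, alpha in zip(neighbours, weights):
            for i, x in enumerate(node_feats[w]):
                message[i] += alpha * x
        # Toy update: add the message to the current state.
        new_feats[v] = [a + m for a, m in zip(h_v, message)]
    return new_feats


def readout(node_feats):
    """Sum-pool the atom states into one molecule-level vector."""
    dim = len(next(iter(node_feats.values())))
    return [sum(h[i] for h in node_feats.values()) for i in range(dim)]


# Heavy-atom graph of ethanol (C-C-O), hypothetical features [is_C, is_O].
feats = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
adj = {0: [1], 1: [0, 2], 2: [1]}
updated = message_passing_step(feats, adj)
molecule_vector = readout(updated)
print(updated[1], molecule_vector)
```

    After one step the central carbon has absorbed information from both neighbours, and the readout vector is what a downstream property predictor would consume.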

    Updated: 2020-01-09
  • Mol-CycleGAN: a generative model for molecular optimization
    J. Cheminform. (IF 4.154) Pub Date: 2020-01-08
    Łukasz Maziarka; Agnieszka Pocha; Jan Kaczmarczyk; Krzysztof Rataj; Tomasz Danel; Michał Warchoł

    Designing a molecule with desired properties is one of the biggest challenges in drug development, as it requires optimization of chemical compound structures with respect to many complex properties. To improve the compound design process, we introduce Mol-CycleGAN—a CycleGAN-based model that generates optimized compounds with high structural similarity to the original ones. Namely, given a molecule our model generates a structurally similar one with an optimized value of the considered property. We evaluate the performance of the model on selected optimization objectives related to structural properties (presence of halogen groups, number of aromatic rings) and to a physicochemical property (penalized logP). In the task of optimization of penalized logP of drug-like molecules our model significantly outperforms previous results.

    Updated: 2020-01-09
  • CyBy2: a strongly typed, purely functional framework for chemical data management
    J. Cheminform. (IF 4.154) Pub Date: 2019-12-30
    Stefan Höck; Rainer Riedl

    We present the development of CyBy2, a versatile framework for chemical data management written in purely functional style in Scala, a modern multi-paradigm programming language. Together with the core libraries we provide a fully functional example implementation of an HTTP server together with a single page web client with powerful querying and visualization capabilities, providing essential functionality for people working in the field of organic and medicinal chemistry. The main focus of CyBy2 is the diverse needs of different research groups in the field and therefore the flexibility required from the underlying data model. Techniques for writing type level specifications giving strong guarantees about the correctness of the implementation are described, together with the resulting gain in confidence during refactoring. Finally, we discuss the advantages of using a single code base from which the server, the client and the software’s documentation pages are being generated. We conclude with a comparison with existing open source solutions. All code described in this article is published under version 3 of the GNU General Public License and available from GitHub including an example implementation of both backend and frontend together with documentation on how to download and compile the software (available at https://github.com/stefan-hoeck/cyby2).

    Updated: 2019-12-31
  • Learning important features from multi-view data to predict drug side effects
    J. Cheminform. (IF 4.154) Pub Date: 2019-12-16
    Xujun Liang; Pengfei Zhang; Jun Li; Ying Fu; Lingzhi Qu; Yongheng Chen; Zhuchu Chen

    The problem of drug side effects is one of the most crucial issues in pharmacological development. As there are many limitations in current experimental and clinical methods for detecting side effects, a lot of computational algorithms have been developed to predict side effects with different types of drug information. However, there is still a lack of methods which could integrate heterogeneous data to predict side effects and select important features at the same time. Here, we propose a novel computational framework based on multi-view and multi-label learning for side effect prediction. Four different types of drug features are collected and a graph model is constructed from each feature profile. After that, all the single view graphs are combined to regularize the linear regression functions which describe the relationships between drug features and side effect labels. L1 penalties are imposed on the regression coefficient matrices in order to select features relevant to side effects. Additionally, the correlations between side effect labels are also incorporated into the model by graph Laplacian regularization. The experimental results show that the proposed method could not only provide more accurate prediction for side effects but also select drug features related to side effects from heterogeneous data. Some case studies are also supplied to illustrate the utility of our method for prediction of drug side effects.

    Updated: 2019-12-17
  • ChemScanner: extraction and re-use(ability) of chemical information from common scientific documents containing ChemDraw files
    J. Cheminform. (IF 4.154) Pub Date: 2019-12-11
    An Nguyen; Yu-Chieh Huang; Pierre Tremouilhac; Nicole Jung; Stefan Bräse

    We developed ChemScanner, software that can be used for the extraction of chemical information from ChemDraw binary (CDX) or ChemDraw XML-based (CDXML) files and to retrieve the ChemDraw scheme from DOC, DOCX or XML documents. This can facilitate the reuse of chemical information embedded into diverse documents used as standard storage and communication instruments in chemical sciences (e.g. for students’ theses, PhD theses, or publications). The extracted information is processed to reactions, molecules, as well as additional text and values and can be accessed via the ChemScanner UI. ChemScanner supports the export to Excel and CML, the direct import of the extracted data to the Open Source ELN Chemotion or the use via “copy and paste” of selected information. The software was designed with a focus on the processing of documents with embedded molecular structure information as CDX or CDXML as these are the most common file formats for chemical drawings. The project aims to support chemists in their efforts to re-use chemistry research data by providing them with missing tools for an automated assembly of reaction data.

    Updated: 2019-12-11
  • IntelliPatent: a web-based intelligent system for fast chemical patent claim drafting
    J. Cheminform. (IF 4.154) Pub Date: 2019-12-11
    Pei-Hua Wang; Yufeng Jane Tseng

    The first step of automating composition patent drafting is to draft the claims around a Markush structure with substituents. Currently, this process depends heavily on experienced attorneys or patent agents, and few tools are available. IntelliPatent was created to accelerate this process. Users can simply upload a series of analogs of interest, and IntelliPatent will automatically extract the general structural scaffold and generate the patent claim text. The program can also extend the patent claim by adding commonly seen R groups from historical lists of the top 30 selling drugs in the US for all R substituents. The program takes MDL SD file formats as inputs, and the invariable core structure and variable substructures will be identified as the initial scaffold and R groups in the output Markush structure. The results can be downloaded in MS Word format (.docx). The suggested claims can be quickly generated with IntelliPatent. This web-based tool is freely accessible at https://intellipatent.cmdm.tw/.

    Updated: 2019-12-11
  • HastaLaVista, a web-based user interface for NMR-based untargeted metabolic profiling analysis in biomedical sciences: towards a new publication standard
    J. Cheminform. (IF 4.154) Pub Date: 2019-12-05
    Julien Wist

    Metabolic profiling has been shown to be useful to improve our understanding of complex metabolic processes. Shared data are key to the analysis and validation of metabolic profiling and untargeted spectral analysis and may increase the pace of new discovery. Improving the existing portfolio of open software may increase the fraction of shared data by decreasing the amount of effort required to publish them in a manner that is useful to others. However, a weakness of open software, when compared to commercial ones, is the lack of user-friendly graphical interface that may discourage inexperienced researchers. Here, a web-browser-oriented solution is presented and demonstrated for metabolic profiling analysis that combines the power of R for back-end statistical analyses and of JavaScript for front-end visualisations and user interactivity. This unique combination of statistical programming and web-browser visualisation brings enhanced data interoperability and interactivity into the open source realm. It is exemplified by characterizing the extent to which bariatric surgery perturbs the metabolisms of rats, showing the value of the approach in iterative analysis by the end-user to establish a deeper understanding of the system perturbation. HastaLaVista is available at: (https://github.com/jwist/hastaLaVista, https://doi.org/10.5281/zenodo.3544800) under MIT license. The approach described in this manuscript can be extended to connect the interface to other scripting languages such as Python, and to create interfaces for other types of data analysis.

    Updated: 2019-12-05
  • The chemfp project
    J. Cheminform. (IF 4.154) Pub Date: 2019-12-05
    Andrew Dalke

    The chemfp project has had four main goals: (1) promote the FPS format as a text-based exchange format for dense binary cheminformatics fingerprints, (2) develop a high-performance implementation of the BitBound algorithm that could be used as an effective baseline to benchmark new similarity search implementations, (3) experiment with funding a pure open source software project through commercial sales, and (4) publish the results and lessons learned as a guide for future implementors. The FPS format has had only minor success, though it did influence development of the FPB binary format, which is faster to load but more complex. Both are summarized. The chemfp benchmark and the no-cost/open source version of chemfp are proposed as a reference baseline to evaluate the effectiveness of other similarity search tools. They are used to evaluate the faster commercial version of chemfp, which can test 130 million 1024-bit fingerprint Tanimotos per second on a single core of a standard x86-64 server machine. When combined with the BitBound algorithm, a k = 1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query. The same search of 970 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest CPU-based similarity search implementations. Modern CPUs are fast enough that memory bandwidth and latency are now important factors. Single-threaded search uses most of the available memory bandwidth. Sorting the fingerprints by popcount improves memory coherency, which when combined with 4 OpenMP threads makes it possible to construct an N × N similarity matrix for 1 million fingerprints in about 30 min. These observations may affect the interpretation of previous publications which assumed that search was strongly CPU bound. The chemfp project funding came from selling a purely open-source software product. Several product business models were tried, but none proved sustainable. Some of the experiences are discussed, in order to contribute to the ongoing conversation on the role of open source software in cheminformatics.
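
    The Tanimoto search and the popcount bound behind BitBound can be sketched in a few lines. This is an illustrative pure-Python version and not chemfp's API or implementation: fingerprints are modelled as Python integers, and the function names are hypothetical. The bound used is that the Tanimoto similarity of two fingerprints with popcounts q and p cannot exceed min(q, p) / max(q, p).

```python
def popcount(fp: int) -> int:
    """Number of bits set in a fingerprint stored as an integer."""
    return bin(fp).count("1")


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity: |A and B| / |A or B| (0.0 if both are empty)."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 0.0


def threshold_search(query, fingerprints, threshold):
    """Return (index, score) pairs with Tanimoto >= threshold.

    Candidates whose popcount makes the threshold unreachable are skipped
    before the (more expensive) intersection is computed.
    """
    q = popcount(query)
    hits = []
    for i, fp in enumerate(fingerprints):
        p = popcount(fp)
        hi = max(q, p)
        # Upper bound on the similarity, from the popcounts alone.
        if hi and min(q, p) / hi < threshold:
            continue
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((i, score))
    return hits


targets = [0b1111, 0b0111, 0b0001, 0b11111111]
hits = threshold_search(0b1111, targets, 0.7)
print(hits)
```

    In this toy run the last two targets are rejected on popcount alone. If the targets are stored sorted by popcount, the candidates that survive the bound form one contiguous range, so the skipped fingerprints never have to be read.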

    Updated: 2019-12-05
  • A de novo molecular generation method using latent vector based generative adversarial network
    J. Cheminform. (IF 4.154) Pub Date: 2019-12-03
    Oleksii Prykhodko; Simon Viet Johansson; Panagiotis-Christos Kotsias; Josep Arús-Pous; Esben Jannik Bjerrum; Ola Engkvist; Hongming Chen

    Deep learning methods applied to drug discovery have been used to generate novel structures. In this study, we propose a new deep learning architecture, LatentGAN, which combines an autoencoder and a generative adversarial neural network for de novo molecular design. We applied the method in two scenarios: one to generate random drug-like compounds and another to generate target-biased compounds. Our results show that the method works well in both cases. Sampled compounds from the trained model can largely occupy the same chemical space as the training set and also generate a substantial fraction of novel compounds. Moreover, the drug-likeness score of compounds sampled from LatentGAN is also similar to that of the training set. Lastly, generated compounds differ from those obtained with a Recurrent Neural Network-based generative model approach, indicating that both methods can be used complementarily.

    Updated: 2019-12-04
  • ACID: a free tool for drug repurposing using consensus inverse docking strategy
    J. Cheminform. (IF 4.154) Pub Date: 2019-11-27
    Fan Wang; Feng-Xu Wu; Cheng-Zhang Li; Chen-Yang Jia; Sun-Wen Su; Ge-Fei Hao; Guang-Fu Yang

    Drug repurposing offers a promising alternative to dramatically shorten the process of traditional de novo development of a drug. These efforts leverage the fact that a single molecule can act on multiple targets and could be beneficial to indications where the additional targets are relevant. Hence, extensive research efforts have been directed toward developing drug-based computational approaches. However, many drug-based approaches are known to incur low success rates, due to incomplete modeling of drug-target interactions. There are also many technical limitations in transforming theoretical computational models into practical use. Drug-based approaches may, thus, still face challenges in drug repurposing tasks. To address this challenge, we developed a consensus inverse docking (CID) workflow, which has a ~10% enhancement in success rate compared with the current best method. In addition, an easily accessible web server named auto in silico consensus inverse docking (ACID) was designed based on this workflow (http://chemyang.ccnu.edu.cn/ccb/server/ACID).

    Updated: 2019-11-28
  • Efficient learning of non-autoregressive graph variational autoencoders for molecular graph generation
    J. Cheminform. (IF 4.154) Pub Date: 2019-11-21
    Youngchun Kwon; Jiho Yoo; Youn-Suk Choi; Won-Joon Son; Dongseon Lee; Seokho Kang

    With the advancements in deep learning, deep generative models combined with graph neural networks have been successfully employed for data-driven molecular graph generation. Early methods based on the non-autoregressive approach have been effective in generating molecular graphs quickly and efficiently but have suffered from low performance. In this paper, we present an improved learning method involving a graph variational autoencoder for efficient molecular graph generation in a non-autoregressive manner. We introduce three additional learning objectives and incorporate them into the training of the model: approximate graph matching, reinforcement learning, and auxiliary property prediction. We demonstrate the effectiveness of the proposed method by evaluating it for molecular graph generation tasks using QM9 and ZINC datasets. The model generates molecular graphs with high chemical validity and diversity compared with existing non-autoregressive methods. It can also conditionally generate molecular graphs satisfying various target conditions.

    Updated: 2019-11-22
  • Randomized SMILES strings improve the quality of molecular generative models
    J. Cheminform. (IF 4.154) Pub Date: 2019-11-21
    Josep Arús-Pous; Simon Viet Johansson; Oleksii Prykhodko; Esben Jannik Bjerrum; Christian Tyrchan; Jean-Louis Reymond; Hongming Chen; Ola Engkvist

    Recurrent Neural Networks (RNNs) trained with a set of molecules represented as unique (canonical) SMILES strings have shown the capacity to create large chemical spaces of valid and meaningful structures. Herein we perform an extensive benchmark on models trained with subsets of GDB-13 of different sizes (1 million, 10,000 and 1000), with different SMILES variants (canonical, randomized and DeepSMILES), with two different recurrent cell types (LSTM and GRU) and with different hyperparameter combinations. To guide the benchmarks, new metrics were developed that define how well a model has generalized the training set. The generated chemical space is evaluated with respect to its uniformity, closedness and completeness. Results show that models that use LSTM cells trained with 1 million randomized SMILES, a non-unique molecular string representation, are able to generalize to larger chemical spaces than the other approaches and they represent more accurately the target chemical space. Specifically, a model was trained with randomized SMILES that was able to generate almost all molecules from GDB-13 with a quasi-uniform probability. Models trained with smaller samples show an even bigger improvement when trained with randomized SMILES. Additionally, models were trained on molecules obtained from ChEMBL and illustrate again that training with randomized SMILES leads to models having a better representation of the drug-like chemical space. Namely, the model trained with randomized SMILES was able to generate at least double the number of unique molecules with the same distribution of properties compared to one trained with canonical SMILES.

    Updated: 2019-11-22
  • A comprehensive analysis of the history of DFT based on the bibliometric method RPYS
    J. Cheminform. (IF 4.154) Pub Date: 2019-11-21
    Robin Haunschild; Andreas Barth; Bernie French

    This bibliometric study aims at providing a comprehensive analysis of the history of density functional theory (DFT) from a perspective of chemistry by using reference publication year spectroscopy (RPYS). 114,138 publications with their 4,412,152 non-distinct cited references are analyzed. The RPYS analysis revealed three different groups of seminal papers which researchers in DFT have drawn from: (i) some long-known experimental studies from the 19th century about physical and chemical phenomena were referenced rather frequently in contemporary DFT publications. (ii) Fundamental quantum-chemical papers from the time period 1900–1950 which predate DFT form another group of seminal papers. (iii) Finally, various very frequently employed DFT approximations, basis sets, and other techniques (e.g., implicit descriptions of solvents) constitute another group of seminal papers. The earliest cited reference we found was published in 1806. The references to papers published in the 19th century mainly served the purpose of referring to long-known physical and chemical phenomena which were used to test if DFT approximations deliver correct results (e.g., Van der Waals interactions). The foundational papers of DFT by Hohenberg and Kohn as well as Kohn and Sham do not seem to be affected by obliteration by incorporation as they appear as pronounced peaks in our RPYS analysis. Since the 1990s, only very few pronounced peaks occur as most years were referenced nearly equally often. Exceptions are 1993 and 1996 due to seminal papers by Axel Becke, John P. Perdew and co-workers, and Georg Kresse and co-workers.

    Updated: 2019-11-22
  • Multi-task learning with a natural metric for quantitative structure activity relationship learning
    J. Cheminform. (IF 4.154) Pub Date: 2019-11-12
    Noureddin Sadawi; Ivan Olier; Joaquin Vanschoren; Jan N. van Rijn; Jeremy Besnard; Richard Bickerton; Crina Grosan; Larisa Soldatova; Ross D. King

    The goal of quantitative structure activity relationship (QSAR) learning is to learn a function that, given the structure of a small molecule (a potential drug), outputs the predicted activity of the compound. We employed multi-task learning (MTL) to exploit commonalities in drug targets and assays. We used datasets containing curated records about the activity of specific compounds on drug targets provided by ChEMBL. In total, 1091 assays have been analysed. As a baseline, a single task learning approach that trains a random forest to predict drug activity for each drug target individually was considered. We then carried out feature-based and instance-based MTL to predict drug activities. We introduced a natural metric of evolutionary distance between drug targets as a measure of task relatedness. Instance-based MTL significantly outperformed both feature-based MTL and the base learner on 741 drug targets out of 1091. Feature-based MTL won on 179 occasions and the base learner performed best on 171 drug targets. We conclude that MTL QSAR is improved by incorporating the evolutionary distance between targets. These results indicate that QSAR learning can be performed effectively, even if little data is available for specific drug targets, by leveraging what is known about similar drug targets.

    Updated: 2019-11-13
  • Dataset’s chemical diversity limits the generalizability of machine learning predictions
    J. Cheminform. (IF 4.154) Pub Date: 2019-11-12
    Marta Glavatskikh; Jules Leguy; Gilles Hunault; Thomas Cauchy; Benoit Da Mota

    The QM9 dataset has become the gold standard for Machine Learning (ML) predictions of various chemical properties. QM9 is based on the GDB, which is a combinatorial exploration of the chemical space. ML molecular predictions have been recently published with an accuracy on par with Density Functional Theory calculations. Such ML models need to be tested and generalized on real data. PC9, a new QM9 equivalent dataset (only H, C, N, O and F and up to 9 “heavy” atoms) of the PubChemQC project is presented in this article. A statistical study of bonding distances and chemical functions shows that this new dataset encompasses more chemical diversity. Kernel Ridge Regression, Elastic Net and the Neural Network model provided by SchNet have been used on both datasets. The overall accuracy in energy prediction is higher for the QM9 subset. However, a model trained on PC9 shows a stronger ability to predict energies of the other dataset.

    Updated: 2019-11-13
  • Identifying new topoisomerase II poison scaffolds by combining publicly available toxicity data and 2D/3D-based virtual screening
    J. Cheminform. (IF 4.154) Pub Date: 2019-11-09
    Anna Lovrics; Veronika F. S. Pape; Dániel Szisz; Adrián Kalászi; Petra Heffeter; Csaba Magyar; Gergely Szakács

    Molecular descriptor (2D) and three dimensional (3D) shape based similarity methods are widely used in ligand based virtual drug design. In the present study pairwise structure comparisons among a set of 4858 DTP compounds tested in the NCI60 tumor cell line anticancer drug screen were computed using chemical hashed fingerprints and 3D molecule shapes to calculate 2D and 3D similarities, respectively. Additionally, pairwise biological activity similarities were calculated by correlating the 60 element vectors of pGI50 values corresponding to the cytotoxicity of the compounds across the NCI60 panel. Subsequently, we compared the power of 2D and 3D structural similarity metrics to predict the toxicity pattern of compounds. We found that while the positive predictive value and sensitivity of 3D and molecular descriptor based approaches to predict biological activity are similar, a subset of molecule pairs yielded contradictory results. By simultaneously requiring similarity of biological activities and 3D shapes, and dissimilarity of molecular descriptor based comparisons, we identify pairs of scaffold hopping candidates displaying characteristic core structural changes such as heteroatom/heterocycle change and ring closure. Attempts to discover scaffold hopping candidates of mitoxantrone recovered known Topoisomerase II (Top2) inhibitors, and also predicted new, previously unknown chemotypes possessing in vitro Top2 inhibitory activity.

    Updated: 2019-11-11
  • A multiple classifier system identifies novel cannabinoid CB2 receptor ligands
    J. Cheminform. (IF 4.154) Pub Date: 2019-11-07
    David Ruano-Ordás; Lindsey Burggraaff; Rongfang Liu; Cas van der Horst; Laura H. Heitman; Michael T. M. Emmerich; Jose R. Mendez; Iryna Yevseyeva; Gerard J. P. van Westen

    Drugs have become an essential part of our lives due to their ability to improve people’s health and quality of life. However, for many diseases, approved drugs are not yet available or existing drugs have undesirable side effects, making the pharmaceutical industry strive to discover new drugs and active compounds. The development of drugs is an expensive process, which typically starts with the detection of candidate molecules (screening) after a protein target has been identified. To this end, the use of high-performance screening techniques has become a critical issue in order to palliate the high costs. Therefore, the popularity of computer-based screening (often called virtual screening or in silico screening) has rapidly increased during the last decade. A wide variety of Machine Learning (ML) techniques has been used in conjunction with chemical structure and physicochemical properties for screening purposes including (i) simple classifiers, (ii) ensemble methods, and more recently (iii) Multiple Classifier Systems (MCS). Here, we apply an MCS for virtual screening (D2-MCS) using circular fingerprints. We applied our technique to a dataset of cannabinoid CB2 ligands obtained from the ChEMBL database. The HTS collection of Enamine (1,834,362 compounds), was virtually screened to identify 48,232 potential active molecules using D2-MCS. Identified molecules were ranked to select 21 promising novel compounds for in vitro evaluation. Experimental validation confirmed six highly active hits (> 50% displacement at 10 µM and subsequent Ki determination) and an additional five medium active hits (> 25% displacement at 10 µM). Hence, D2-MCS provided a hit rate of 29% for highly active compounds and an overall hit rate of 52%.

    Updated: 2019-11-07
  • Reply to “Missed opportunities in large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery”
    J. Cheminform. (IF 4.154) Pub Date: 2019-11-06
    Nicolas Bosc; Francis Atkinson; Eloy Félix; Anna Gaulton; Anne Hersey; Andrew R. Leach

    In response to Krstajic’s letter to the editor concerning our published paper, we here take the opportunity to reply, to re-iterate that no errors in our work were identified, to provide further details, and to re-emphasise the outputs of our study. Moreover, we highlight that all of the data are freely available for the wider scientific community (including the aforementioned correspondent) to undertake follow-on studies and comparisons.

    Updated: 2019-11-06
  • Missed opportunities in large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery
    J. Cheminform. (IF 4.154) Pub Date: 2019-11-06
    Damjan Krstajic

    Recently Bosc et al. (J Cheminform 11(1): 4, 2019), published an article describing a case study that directly compares conformal predictions with traditional QSAR methods for large-scale predictions of target-ligand binding. We consider this study to be very important. Unfortunately, we have found several issues in the authors’ approach as well as in the presentation of their findings.

    Updated: 2019-11-06
  • Large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery
    J. Cheminform. (IF 4.154) Pub Date: 2019-01-12
    Nicolas Bosc; Francis Atkinson; Eloy Felix; Anna Gaulton; Anne Hersey; Andrew R Leach

    Structure-activity relationship modelling is frequently used in the early stage of drug discovery to assess the activity of a compound on one or several targets, and can also be used to assess the interaction of compounds with liability targets. QSAR models have been used for these and related applications over many years, with good success. Conformal prediction is a relatively new QSAR approach that provides information on the certainty of a prediction, and so helps in decision-making. However, it is not always clear how best to make use of this additional information. In this article, we describe a case study that directly compares conformal prediction with traditional QSAR methods for large-scale predictions of target-ligand binding. The ChEMBL database was used to extract a data set comprising data from 550 human protein targets with different bioactivity profiles. For each target, a QSAR model and a conformal predictor were trained and their results compared. The models were then evaluated on new data published since the original models were built to simulate a "real world" application. The comparative study highlights the similarities between the two techniques but also some differences that it is important to bear in mind when the methods are used in practical drug discovery applications.
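
    How a conformal predictor attaches certainty information to a prediction can be shown with a generic sketch of inductive conformal classification. It is not the authors' pipeline: the nonconformity scores, the label-conditional calibration sets, and the function names below are all assumptions made for illustration.

```python
def conformal_p_value(calibration_scores, new_score):
    """p = (number of calibration scores >= new score, plus 1) / (n + 1)."""
    at_least = sum(1 for s in calibration_scores if s >= new_score)
    return (at_least + 1) / (len(calibration_scores) + 1)


def prediction_set(p_values, significance):
    """Keep every label whose p-value exceeds the significance level."""
    return {label for label, p in p_values.items() if p > significance}


# Hypothetical nonconformity scores (higher = stranger) from held-out
# calibration compounds, kept separately for each class.
calibration = {
    "active": [0.1, 0.2, 0.3, 0.6],
    "inactive": [0.05, 0.15, 0.4, 0.7],
}
# Hypothetical nonconformity of one new compound under each candidate label.
new_scores = {"active": 0.2, "inactive": 0.8}

p_values = {label: conformal_p_value(calibration[label], new_scores[label])
            for label in calibration}
print(p_values)
print(prediction_set(p_values, 0.25))
print(prediction_set(p_values, 0.1))
```

    Unlike a single-label QSAR output, the result is a set of labels: at a stricter confidence requirement (significance 0.1) both labels remain and the prediction is flagged as uncertain, while at significance 0.25 only "active" survives.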

    Updated: 2019-11-01
  • Consensus queries in ligand-based virtual screening experiments
    J. Cheminform. (IF 4.154) Pub Date: 2017-12-01
    Francois Berenger; Oanh Vu; Jens Meiler

    BACKGROUND: In ligand-based virtual screening experiments, a known active ligand is used in similarity searches to find putative active compounds for the same protein target. When there are several known active molecules, screening using all of them is more powerful than screening using a single ligand. A consensus query can be created by either screening serially with different ligands before merging the obtained similarity scores, or by combining the molecular descriptors (i.e. chemical fingerprints) of those ligands. RESULTS: We report on the discriminative power and speed of several consensus methods, on two datasets made only of experimentally verified molecules. The two datasets contain a total of 19 protein targets, 3776 known active and ~2 × 10^6 inactive molecules. Three chemical fingerprints are investigated: MACCS 166 bits, ECFP4 2048 bits and an unfolded version of MOLPRINT2D. Four different consensus policies and five consensus sizes were benchmarked. CONCLUSIONS: The best consensus method is to rank candidate molecules using the maximum score obtained by each candidate molecule versus all known actives. When the number of actives used is small, the same screening performance can be approached by a consensus fingerprint. However, if the computational exploration of the chemical space is limited by speed (i.e. throughput), a consensus fingerprint can outperform this consensus of scores.
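
    The two families of consensus query described here can be contrasted in a small sketch. Fingerprints are modelled as sets of on-bit indices, and the union of bits is only one possible way to build a consensus fingerprint; it is an assumption of this example, not necessarily one of the policies the authors benchmarked.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_score(candidate, actives):
    """Consensus of scores: best similarity to any known active."""
    return max(tanimoto(candidate, q) for q in actives)


def union_fingerprint(actives):
    """Consensus fingerprint (assumed policy): union of all active bits."""
    return frozenset().union(*actives)


def rank(candidates, score):
    """Candidate names sorted from highest to lowest score."""
    return sorted(candidates, key=lambda name: score(candidates[name]),
                  reverse=True)


actives = [frozenset({1, 2, 3, 4}), frozenset({5, 6, 7, 8})]
candidates = {
    "x": frozenset({1, 2, 3, 4}),   # identical to the first active
    "y": frozenset({1, 2, 5, 6}),   # shares two bits with each active
    "z": frozenset({9, 10}),        # unrelated
}

consensus = union_fingerprint(actives)
by_max = rank(candidates, lambda fp: max_score(fp, actives))
by_union = rank(candidates, lambda fp: tanimoto(fp, consensus))
print(by_max, by_union)
```

    The consensus of scores costs one similarity calculation per active for every candidate, whereas the consensus fingerprint costs one per candidate. In this toy case the cheaper query can no longer separate "x" from "y" (both score 0.5), which illustrates the trade-off between throughput and discriminative power.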

    Updated: 2019-11-01
  • chemmodlab: a cheminformatics modeling laboratory R package for fitting and assessing machine learning models
    J. Cheminform. (IF 4.154) Pub Date: 2018-11-30
    Jeremy R Ash; Jacqueline M Hughes-Oliver

    The goal of chemmodlab is to streamline the fitting and assessment pipeline for many machine learning models in R, making it easy for researchers to compare the utility of these models. While focused on implementing methods for model fitting and assessment that have been accepted by experts in the cheminformatics field, all of the methods in chemmodlab have broad utility for the machine learning community. chemmodlab contains several assessment utilities, including a plotting function that constructs accumulation curves and a function that computes many performance measures. The most novel feature of chemmodlab is the ease with which statistically significant performance differences for many machine learning models are presented by means of the multiple comparisons similarity plot. Differences are assessed using repeated k-fold cross validation, where blocking increases precision and multiplicity adjustments are applied. chemmodlab is freely available on CRAN at https://cran.r-project.org/web/packages/chemmodlab/index.html.

    Updated: 2019-11-01
  • The nature of ligand efficiency.
    J. Cheminform. (IF 4.154) Pub Date : 2019-02-02
    Peter W Kenny

    Ligand efficiency is a widely used design parameter in drug discovery. It is calculated by scaling affinity by molecular size and has a nontrivial dependency on the concentration unit used to express affinity that stems from the inability of the logarithm function to take dimensioned arguments. Consequently, perception of efficiency varies with the choice of concentration unit and it is argued that the ligand efficiency metric is not physically meaningful nor should it be considered to be a metric. The dependence of ligand efficiency on the concentration unit can be eliminated by defining efficiency in terms of sensitivity of affinity to molecular size and this is illustrated with reference to fragment-to-lead optimizations. Group efficiency and fit quality are also examined in detail from a physicochemical perspective. The importance of examining relationships between affinity and molecular size directly is stressed throughout this study and an alternative to ligand efficiency for normalization of affinity with respect to molecular size is presented.
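
    The unit dependence described in this abstract can be made concrete with a small worked example. This is a sketch under stated assumptions: efficiency is taken as -log10(Kd / C_ref) divided by the heavy-atom count, where C_ref is the reference concentration implied by the chosen unit, and both compounds and their values are hypothetical.

```python
import math


def scaled_affinity_per_atom(kd_molar, n_heavy, c_ref_molar=1.0):
    """-log10(Kd / C_ref) per heavy atom; C_ref encodes the concentration unit."""
    return -math.log10(kd_molar / c_ref_molar) / n_heavy


# Hypothetical compounds: a 10-atom fragment (Kd = 1 mM) and a 30-atom lead (Kd = 1 nM).
fragment = {"kd": 1e-3, "atoms": 10}
lead = {"kd": 1e-9, "atoms": 30}

for label, c_ref in [("1 M", 1.0), ("1 mM", 1e-3), ("1 uM", 1e-6)]:
    frag_eff = scaled_affinity_per_atom(fragment["kd"], fragment["atoms"], c_ref)
    lead_eff = scaled_affinity_per_atom(lead["kd"], lead["atoms"], c_ref)
    print(f"C_ref = {label}: fragment {frag_eff:+.2f}, lead {lead_eff:+.2f}")
```

    With a 1 M reference the two compounds appear equally efficient, while with a 1 mM or 1 µM reference the lead appears more efficient, so the perceived ranking depends on an arbitrary choice of unit.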

    Updated: 2019-11-01
  • Configurable web-services for biomedical document annotation.
    J. Cheminform. (IF 4.154) Pub Date : 2018-12-24
    Sérgio Matos

    The need to efficiently find and extract information from the continuously growing biomedical literature has led to the development of various annotation tools aimed at identifying mentions of entities and relations. Many of these tools have been integrated in user-friendly applications facilitating their use by non-expert text miners and database curators. In this paper we describe the latest version of Neji, a web-services ready text processing and annotation framework. The modular and flexible architecture facilitates adaptation to different annotation requirements, while the built-in web services allow its integration in external tools and text mining pipelines. The evaluation of the web annotation server on the technical interoperability and performance of annotation servers track of BioCreative V.5 further illustrates the flexibility and applicability of this framework.

    Updated: 2019-11-01
  • A probabilistic molecular fingerprint for big data settings.
    J. Cheminform. (IF 4.154) Pub Date : 2018-12-20
    Daniel Probst, Jean-Louis Reymond

    BACKGROUND Among the various molecular fingerprints available to describe small organic molecules, extended connectivity fingerprint, up to four bonds (ECFP4) performs best in benchmarking drug analog recovery studies as it encodes substructures with a high level of detail. Unfortunately, ECFP4 requires high dimensional representations (≥ 1024D) to perform well, resulting in ECFP4 nearest neighbor searches in very large databases such as GDB, PubChem or ZINC to perform very slowly due to the curse of dimensionality. RESULTS Herein we report a new fingerprint, called MinHash fingerprint, up to six bonds (MHFP6), which encodes detailed substructures using the extended connectivity principle of ECFP in a fundamentally different manner, increasing the performance of exact nearest neighbor searches in benchmarking studies and enabling the application of locality sensitive hashing (LSH) approximate nearest neighbor search algorithms. To describe a molecule, MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set. MHFP6 outperforms ECFP4 in benchmarking analog recovery studies. By leveraging locality sensitive hashing, LSH approximate nearest neighbor search methods perform as well on unfolded MHFP6 as comparable methods do on folded ECFP4 fingerprints in terms of speed and relative recovery rate, while operating in very sparse and high-dimensional binary chemical space. CONCLUSION MHFP6 is a new molecular fingerprint, encoding circular substructures, which outperforms ECFP4 for analog searches while allowing the direct application of locality sensitive hashing algorithms. It should be well suited for the analysis of large databases. The source code for MHFP6 is available on GitHub ( https://github.com/reymond-group/mhfp ).
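
    The MinHash step mentioned in this abstract can be illustrated with a generic sketch. It is not the MHFP6 code from the linked repository: the substructure SMILES sets below are hypothetical hand-written strings (a real pipeline would extract them with a cheminformatics toolkit), and the hash family, signature length, and seed are assumptions chosen for the example.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # large prime used for the (a*h + b) mod p hash family


def base_hash(text):
    """Deterministic 32-bit hash of a string (assumed choice: truncated SHA-1)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:4], "little")


def make_hash_family(n_hashes, seed=42):
    """Random (a, b) pairs defining n_hashes linear hash functions."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n_hashes)]


def minhash_signature(shingles, family):
    """Keep, for each hash function, the minimum value over a non-empty set."""
    hashed = [base_hash(s) for s in shingles]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in family]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


family = make_hash_family(256)
mol_a = {"C", "CC", "CCO", "CO", "O"}  # hypothetical substructure SMILES
mol_b = {"C", "CC", "CCN", "CN", "N"}  # hypothetical substructure SMILES
similarity = estimated_jaccard(minhash_signature(mol_a, family),
                               minhash_signature(mol_b, family))
```

    Because two signatures agree at a position with probability equal to the Jaccard similarity of the underlying sets, signatures can be bucketed by bands of positions, which is what makes locality sensitive hashing applicable.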

    Updated: 2019-11-01
  • "We were here before the Web and hype…": a brief history of and tribute to the Computational Chemistry List.
    J. Cheminform. (IF 4.154) Pub Date : 2018-12-20
    Frédéric Wieber, Alejandro Pisanty, Alexandre Hocquet

    The Computational Chemistry List is a mailing list, portal, and community which brings together people interested in computational chemistry, mostly practitioners. It was formed in 1991 and continues to exist as a vibrant discussion space, highly valued by its members, and serving both its original and new functions. Its duration has been unusual for online communities. We analyze some of its characteristics, the reasons for its duration, value, and resilience, the ways it embodies and preceded the affordances of online communities recognized elsewhere long after its foundations, and project some aspects into the future. We also highlight its value as a corpus for historians of science.

    Updated: 2019-11-01
  • A neural network approach to chemical and gene/protein entity recognition in patents.
    J. Cheminform. (IF 4.154) Pub Date : 2018-12-20
    Ling Luo, Zhihao Yang, Pei Yang, Yin Zhang, Lei Wang, Jian Wang, Hongfei Lin

    In biomedical research, patents contain a significant amount of information, and biomedical text mining in patents has recently received much attention. To accelerate the development of biomedical text mining for patents, the BioCreative V.5 challenge organized three tracks, i.e., chemical entity mention recognition (CEMP), gene and protein related object recognition (GPRO) and technical interoperability and performance of annotation servers, to focus on biomedical entity recognition in patents. This paper describes our neural network approach for the CEMP and GPRO tracks. In the approach, a bidirectional long short-term memory with a conditional random field layer is employed to recognize biomedical entities from patents. To improve the performance, we explored the effect of additional features (i.e., part of speech, chunking and named entity recognition features generated by the GENIA tagger) for the neural network model. In the official results, our best runs achieve the highest performances (a precision of 88.32%, a recall of 92.62%, and an F-score of 90.42% in the CEMP track; a precision of 76.65%, a recall of 81.91%, and an F-score of 79.19% in the GPRO track) among all participating teams in both tracks.
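
    The reported F-scores are consistent with the balanced F-measure, i.e. the harmonic mean of precision and recall. A short check, assuming that the F-score here is F1 (the abstract does not name the variant):

```python
def f_score(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall values (in percent) as reported in the abstract.
cemp_f = f_score(88.32, 92.62)
gpro_f = f_score(76.65, 81.91)
print(f"CEMP F-score: {cemp_f:.2f}%  GPRO F-score: {gpro_f:.2f}%")
```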

    Updated: 2019-11-01
  • Statistical principle-based approach for gene and protein related object recognition.
    J. Cheminform. (IF 4.154) Pub Date : 2018-12-19
    Po-Ting Lai, Ming-Siang Huang, Ting-Hao Yang, Wen-Lian Hsu, Richard Tzong-Han Tsai

    The large number of chemical and pharmaceutical patents has attracted researchers doing biomedical text mining to extract valuable information such as chemicals, genes and gene products. To facilitate gene and gene product annotations in patents, BioCreative V.5 organized a gene- and protein-related object (GPRO) recognition task, in which participants were assigned to identify GPRO mentions and determine whether they could be linked to their unique biological database records. In this paper, we describe the system constructed for this task. Our system is based on two different NER approaches: the statistical-principle-based approach (SPBA) and conditional random fields (CRF). Therefore, we call our system SPBA-CRF. SPBA is an interpretable machine-learning framework for gene mention recognition. The predictions of SPBA are used as features for our CRF-based GPRO recognizer. The recognizer was developed for identifying chemical mentions in patents, and we adapted it for GPRO recognition. In the BioCreative V.5 GPRO recognition task, SPBA-CRF obtained an F-score of 73.73% on the evaluation metric of GPRO type 1 and an F-score of 78.66% on the evaluation metric of combining GPRO types 1 and 2. Our results show that SPBA trained on an external NER dataset can perform reasonably well on the partial match evaluation metric. Furthermore, SPBA can significantly improve performance of the CRF-based recognizer trained on the GPRO dataset.

    Updated: 2019-11-01
  • JPlogP: an improved logP predictor trained using predicted data.
    J. Cheminform. (IF 4.154) Pub Date : 2018-12-16
    Jeffrey Plante, Stephane Werner

    The partition coefficient between octanol and water (logP) has been an important descriptor in QSAR predictions for many years and therefore the prediction of logP has been examined countless times. One of the best performing approaches is to predict the logP using multiple methods and average the result. We have used those averaged predictions to develop a training set which was able to distil the information present across the disparate logP methods into one single model. Our model was built using extendable atom-types, where each atom is distilled down into a 6 digit number, and each individual atom is assumed to have a small additive effect on the overall logP of the molecule. Beyond the simple coefficient model a consensus model is evaluated, which uses known compounds as a starting point in the calculation and modifies the experimental logP using the same coefficients as in the first model. We then test the performance of our models against two different datasets, one on which many different models routinely perform well, and another designed to better represent pharmaceutical space. The true strength of the model is represented in the pharmaceutical benchmark set, where both models perform better than any previously developed models.
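
    The additive atom-type idea and the known-compound correction can be sketched as follows. Everything in this example is hypothetical: the atom-type labels stand in for the 6 digit codes mentioned in the abstract, the coefficients and the reference logP are invented numbers, and the correction formula is only one plausible reading of "modifies the experimental logP using the same coefficients".

```python
# Hypothetical atom-type coefficients (not taken from JPlogP).
COEFFICIENTS = {"CH3": 0.5, "CH2": 0.4, "OH": -1.0}


def additive_logp(atom_types, coefficients=COEFFICIENTS):
    """Simple coefficient model: sum one additive contribution per atom."""
    return sum(coefficients[atom] for atom in atom_types)


def corrected_logp(target_atoms, reference_atoms, reference_experimental_logp):
    """Start from a known compound's experimental logP and add the difference
    between the target's and the reference's additive estimates (assumed scheme)."""
    delta = additive_logp(target_atoms) - additive_logp(reference_atoms)
    return reference_experimental_logp + delta


target = ["CH3", "CH2", "OH"]   # hypothetical target compound
reference = ["CH3", "OH"]       # hypothetical known compound
plain_estimate = additive_logp(target)
anchored_estimate = corrected_logp(target, reference, reference_experimental_logp=-0.7)
```

    Anchoring on a measured value means that systematic errors shared by the target and the reference cancel in the difference, which is one reason such a correction can help.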

    Updated: 2019-11-01
  • SIA: a scalable interoperable annotation server for biomedical named entities.
    J. Cheminform. (IF 4.154) Pub Date : 2018-12-16
    Johannes Kirschnick, Philippe Thomas, Roland Roller, Leonhard Hennig

    Recent years showed a strong increase in biomedical sciences and an inherent increase in publication volume. Extraction of specific information from these sources requires highly sophisticated text mining and information extraction tools. However, the integration of freely available tools into customized workflows is often cumbersome and difficult. We describe SIA (Scalable Interoperable Annotation Server), our contribution to the BeCalm-Technical interoperability and performance of annotation servers (BeCalm-TIPS) task, a scalable, extensible, and robust annotation service. The system currently covers six named entity types (i.e., chemicals, diseases, genes, miRNA, mutations, and organisms) and is freely available under Apache 2.0 license at https://github.com/Erechtheus/sia .

    Updated: 2019-11-01
  • Chaos-embedded particle swarm optimization approach for protein-ligand docking and virtual screening.
    J. Cheminform. (IF 4.154) Pub Date : 2018-12-16
    Hio Kuan Tai, Siti Azma Jusoh, Shirley W I Siu

    BACKGROUND Protein-ligand docking programs are routinely used in structure-based drug design to find the optimal binding pose of a ligand in the protein's active site. These programs are also used to identify potential drug candidates by ranking large sets of compounds. As more accurate and efficient docking programs are always desirable, constant efforts focus on developing better docking algorithms or improving the scoring function. Recently, chaotic maps have emerged as a promising approach to improve the search behavior of optimization algorithms in terms of search diversity and convergence speed. However, their effectiveness on docking applications has not been explored. Herein, we integrated five popular chaotic maps (logistic, Singer, sinusoidal, tent, and Zaslavskii maps) into PSOVina[Formula: see text], a recent variant of the popular AutoDock Vina program with enhanced global and local search capabilities, and evaluated their performances in ligand pose prediction and virtual screening using four docking benchmark datasets and two virtual screening datasets. RESULTS Pose prediction experiments indicate that chaos-embedded algorithms outperform AutoDock Vina and PSOVina in ligand pose RMSD, success rate, and run time. In virtual screening experiments, Singer map-embedded PSOVina[Formula: see text] achieved a very significant five- to sixfold speedup with comparable screening performances to AutoDock Vina in terms of area under the receiver operating characteristic curve and enrichment factor. Therefore, our results suggest that chaos-embedded PSOVina methods might be a better option than AutoDock Vina for docking and virtual screening tasks. The success of chaotic maps in protein-ligand docking reveals their potential for improving optimization algorithms in other search problems, such as protein structure prediction and folding. The Singer map-embedded PSOVina[Formula: see text], which is named PSOVina-2.0, and all testing datasets are publicly available at https://cbbio.cis.umac.mo/software/psovina .
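
    The general idea of embedding a chaotic map in particle swarm optimization can be sketched as below. This is an illustration only: it uses the logistic map, one of the five maps named in the abstract, and assumes that the chaotic sequence simply replaces the uniform random numbers in a textbook velocity update. How PSOVina actually integrates the maps is not described here, and the inertia and acceleration constants are arbitrary.

```python
def logistic_map(x0, n):
    """Return n successive iterates of the logistic map x -> 4x(1 - x)."""
    values = []
    x = x0
    for _ in range(n):
        x = 4.0 * x * (1.0 - x)
        values.append(x)
    return values


def velocity_update(v, x, personal_best, global_best, r1, r2, w=0.7, c1=1.5, c2=1.5):
    """Textbook PSO velocity update for one dimension; r1 and r2 lie in [0, 1]."""
    return w * v + c1 * r1 * (personal_best - x) + c2 * r2 * (global_best - x)


# Draw the two stochastic factors from the chaotic sequence instead of a uniform RNG.
r1, r2 = logistic_map(0.7, 2)
new_velocity = velocity_update(v=0.0, x=0.0, personal_best=1.0, global_best=2.0, r1=r1, r2=r2)
```

    The seed 0.7 is arbitrary; values such as 0, 0.25, 0.5, 0.75 and 1 should be avoided because the logistic map collapses to a fixed point from them.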

    Updated: 2019-11-01
  • Chemlistem: chemical named entity recognition using recurrent neural networks.
    J. Cheminform. (IF 4.154) Pub Date : 2018-12-14
    Peter Corbett, John Boyle

    Chemical named entity recognition (NER) has traditionally been dominated by conditional random fields (CRF)-based approaches but, given the success of the artificial neural network techniques known as "deep learning", we decided to examine them as an alternative to CRFs. We present here several chemical named entity recognition systems. The first system translates the traditional CRF-based idioms into a deep learning framework, using rich per-token features and neural word embeddings, and producing a sequence of tags using bidirectional long short term memory (LSTM) networks, a type of recurrent neural net. The second system eschews the rich feature set (and even tokenisation) in favour of character labelling using neural character embeddings and multiple LSTM layers. The third system is an ensemble that combines the results of the first two systems. Our original BioCreative V.5 competition entry was placed in the top group with the highest F scores, and subsequent work using transfer learning has achieved a final F score of 90.33% on the test data (precision 91.47%, recall 89.21%).

    Updated: 2019-11-01
  • MER: a shell script and annotation server for minimal named entity recognition and linking.
    J. Cheminform. (IF 4.154) Pub Date : 2018-12-07
    Francisco M Couto, Andre Lamurias

    Named-entity recognition aims at identifying the fragments of text that mention entities of interest, which can afterwards be linked to a knowledge base where those entities are described. This manuscript presents our minimal named-entity recognition and linking tool (MER), designed with flexibility, autonomy and efficiency in mind. To annotate a given text, MER only requires: (1) a lexicon (text file) with the list of terms representing the entities of interest; (2) optionally a tab-separated values file with a link for each term; and (3) a Unix shell. Alternatively, the user can provide an ontology from which MER will automatically generate the lexicon and links files. The efficiency of MER derives from exploiting the high performance and reliability of the text processing command-line tools grep and awk, and a novel inverted recognition technique. MER was deployed in a cloud infrastructure using multiple Virtual Machines to work as an annotation server and participate in the Technical Interoperability and Performance of annotation Servers task of BioCreative V.5. The results show that our solution processed each document (text retrieval and annotation) in less than 3 s on average without using any type of cache. MER was also compared to a state-of-the-art dictionary lookup solution, obtaining competitive results not only in computational performance but also in precision and recall. MER is publicly available in a GitHub repository ( https://github.com/lasigeBioTM/MER ) and through a RESTful Web service ( http://labs.fc.ul.pt/mer/ ).
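
    The lexicon-plus-links inputs described in this abstract suggest a simple dictionary lookup, sketched below. This is not MER itself, which is a shell script built on grep and awk with its own inverted recognition technique; the function name, the case-insensitive whole-word matching, and the identifiers in the links table are all assumptions made for illustration.

```python
import re


def annotate(text, lexicon, links=None):
    """Return (start, end, matched_text, link) for each lexicon term found in text."""
    links = links or {}
    found = []
    for term in lexicon:
        # Whole-word, case-insensitive match (an assumed matching policy).
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), match.group(0), links.get(term)))
    return sorted(found, key=lambda item: (item[0], item[1]))


lexicon = ["aspirin", "caffeine"]                       # hypothetical lexicon
links = {"aspirin": "ID:0001", "caffeine": "ID:0002"}   # hypothetical link table
annotations = annotate("Aspirin and caffeine; aspirin again.", lexicon, links)
```

    Each result carries character offsets, so annotations can be serialized in whatever stand-off format a downstream pipeline expects.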

    Updated: 2019-11-01
  • Statistical-based database fingerprint: chemical space dependent representation of compound databases.
    J. Cheminform. (IF 4.154) Pub Date : 2018-11-24
    Norberto Sánchez-Cruz, José L Medina-Franco

    BACKGROUND Simplified representation of compound databases has several applications in cheminformatics. Herein, we introduce an alternative and general method to build single fingerprint representations of compound databases. The approach is inspired by the previously published modal fingerprints that are aimed to capture the most significant bits of a fingerprint representation for a compound data set. The novelty of the herein proposed statistical-based database fingerprint (SB-DFP) is that it is generated based on binomial proportions comparisons taking as reference the distribution of "1" bits on a large representative set of the chemical space. RESULTS To illustrate the method, SB-DFPs were constructed for 28 epigenetic target data sets retrieved from a recently published epigenomics database of interest in probe and drug discovery. For each target data set, the SB-DFPs were built based on two representative fingerprints of different design using as reference a data set with more than 15 million compounds from ZINC. The application of SB-DFP was illustrated and compared to other methods through association relationships of the 28 epigenetic data sets and similarity searching. It was found that SB-DFPs captured, overall, the common features between data sets and the distinct features of each set. In similarity searching SB-DFP equaled or outperformed other approaches for at least 20 out of the 28 sets. CONCLUSIONS SB-DFP is a general approach based on binomial proportion comparisons to represent a compound data set with a single fingerprint. SB-DFP can be developed, at least in principle, based on any fingerprint and reference data set. SB-DFP is a good alternative for exploration of relationships between targets through its associated compound data sets and performing similarity searching.
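
    A binomial proportion comparison of the kind described in this abstract can be sketched per bit position. The specifics below are assumptions, not the published procedure: a one-sided z-test of the data set's bit frequency against the reference frequency, with a threshold of 2.326 (roughly the 1% level), and a tiny invented data set.

```python
import math


def database_fingerprint(dataset_fps, reference_freqs, z_threshold=2.326):
    """Set a bit to 1 when its frequency in the data set is significantly
    higher than in the reference collection (one-sided z-test, assumed rule)."""
    n = len(dataset_fps)
    bits = []
    for i, p_ref in enumerate(reference_freqs):
        p_obs = sum(fp[i] for fp in dataset_fps) / n
        std_err = math.sqrt(p_ref * (1.0 - p_ref) / n)
        if std_err == 0.0:
            z = math.inf if p_obs > p_ref else 0.0
        else:
            z = (p_obs - p_ref) / std_err
        bits.append(1 if z > z_threshold else 0)
    return bits


# Hypothetical 3-bit fingerprints for four compounds, plus reference bit frequencies.
dataset = [[1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 0, 0]]
reference = [0.2, 0.3, 0.5]
sb_dfp = database_fingerprint(dataset, reference)
```

    Only the first bit is set here: it is on in every compound of the data set but in just 20% of the reference, whereas the other two bits occur no more often than expected.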

    Updated: 2019-11-01
  • Implicit-descriptor ligand-based virtual screening by means of collaborative filtering.
    J. Cheminform. (IF 4.154) Pub Date : 2018-11-24
    Raghuram Srinivas, Pavel V Klimovich, Eric C Larson

    Current ligand-based machine learning methods in virtual screening rely heavily on molecular fingerprinting for preprocessing, i.e., explicit description of ligands' structural and physicochemical properties in a vectorized form. Of particular importance to current methods are the extent to which molecular fingerprints describe a particular ligand and what metric sufficiently captures similarity among ligands. In this work, we propose and evaluate methods that do not require explicit feature vectorization through fingerprinting, but, instead, provide implicit descriptors based only on other known assays. Our methods are based upon well known collaborative filtering algorithms used in recommendation systems. Our implicit descriptor method does not require any fingerprint similarity search, which makes the method free of the bias arising from the empirical nature of the fingerprint models. We show that implicit methods significantly outperform traditional machine learning methods, and the main strengths of implicit methods are their resilience to target-ligand sparsity and high potential for spotting promiscuous ligands.

    Updated: 2019-11-01
  • Improved understanding of aqueous solubility modeling through topological data analysis.
    J. Cheminform. (IF 4.154) Pub Date : 2018-11-22
    Mariam Pirashvili, Lee Steinberg, Francisco Belchi Guillamon, Mahesan Niranjan, Jeremy G Frey, Jacek Brodzki

    Topological data analysis is a family of recent mathematical techniques seeking to understand the 'shape' of data, and has been used to understand the structure of the descriptor space produced by a standard chemical informatics software package from the point of view of solubility. We have used the mapper algorithm, a TDA method that creates low-dimensional representations of data, to create a network visualization of the solubility space. While descriptors with clear chemical implications are prominent features in this space, reflecting their importance to the chemical properties, an unexpected and interesting correlation between chlorine content and rings and their implication for solubility prediction is revealed. A parallel representation of the chemical space was generated using persistent homology applied to molecular graphs. Links between this chemical space and the descriptor space were shown to be in agreement with chemical heuristics. The use of persistent homology on molecular graphs, extended by the use of norms on the associated persistence landscapes, allows the conversion of discrete shape descriptors to continuous ones, and a perspective of the application of these descriptors to quantitative structure property relations is presented.

    Updated: 2019-11-01
  • Cheminformatics-based enumeration and analysis of large libraries of macrolide scaffolds.
    J. Cheminform. (IF 4.154) Pub Date : 2018-11-14
    Phyo Phyo Kyaw Zin, Gavin Williams, Denis Fourches

    We report on the development of a cheminformatics enumeration technology and the analysis of a resulting large dataset of virtual macrolide scaffolds. Although macrolides have been shown to have valuable biological properties, there is no ready-to-screen virtual library of diverse macrolides in the public domain. Conducting molecular modeling (especially virtual screening) of these complex molecules is highly relevant as the organic synthesis of these compounds, when feasible, typically requires many synthetic steps, and thus dramatically slows the discovery of new bioactive macrolides. Herein, we introduce a cheminformatics approach and associated software that allows for designing and generating libraries of virtual macrocycle/macrolide scaffolds with user-defined constitutional and structural constraints (e.g., types and numbers of structural motifs to be included in the macrocycle, ring size, maximum number of compounds generated). To study the chemical diversity of such generated molecules, we enumerated the V1M (Virtual 1 million Macrolide scaffolds) library, each scaffold containing twelve common structural motifs. For each macrolide scaffold, we calculated several key properties, such as molecular weight, hydrogen bond donors/acceptors, and topological polar surface area. In this study, we discuss (1) the initial concept and current features of our PKS (polyketides) Enumerator software, (2) the chemical diversity and distribution of structural motifs in the V1M library, and (3) the unique opportunities for future virtual screening of such enumerated ensembles of macrolides. Importantly, V1M is provided in the Supplementary Material of this paper, allowing other researchers to conduct any type of molecular modeling and virtual screening studies. Therefore, this technology for enumerating extremely large libraries of macrolide scaffolds could hold a unique potential in the field of computational chemistry and drug discovery for the rational design of new antibiotics and anti-cancer agents.

    Updated: 2019-11-01
  • An automated framework for NMR chemical shift calculations of small organic molecules.
    J. Cheminform. (IF 4.154) Pub Date : 2018-10-28
    Yasemin Yesiltepe, Jamie R Nuñez, Sean M Colby, Dennis G Thomas, Mark I Borkum, Patrick N Reardon, Nancy M Washton, Thomas O Metz, Justin G Teeguarden, Niranjan Govind, Ryan S Renslow

    When using nuclear magnetic resonance (NMR) to assist in chemical identification in complex samples, researchers commonly rely on databases for chemical shift spectra. However, authentic standards are typically depended upon to build libraries experimentally. Considering complex biological samples, such as blood and soil, the entirety of NMR spectra required for all possible compounds would be infeasible to ascertain due to limitations of available standards and experimental processing time. As an alternative, we introduce the in silico Chemical Library Engine (ISiCLE) NMR chemical shift module to accurately and automatically calculate NMR chemical shifts of small organic molecules through use of quantum chemical calculations. ISiCLE performs density functional theory (DFT)-based calculations for predicting chemical properties (specifically NMR chemical shifts in this manuscript) via the open source, high-performance computational chemistry software NWChem. ISiCLE calculates the NMR chemical shifts of sets of molecules using any available combination of DFT method, solvent, and NMR-active nuclei, using user-selected reference compounds and/or linear regression methods. Calculated NMR chemical shifts are provided to the user for each molecule, along with comparisons with respect to a number of metrics commonly used in the literature. Here, we demonstrate ISiCLE using a set of 312 molecules, ranging in size up to 90 carbon atoms. For each, calculation of NMR chemical shifts has been performed with 8 different levels of DFT theory, and with solvation effects using the implicit solvent Conductor-like Screening Model. The DFT method dependence of the calculated chemical shifts has been systematically investigated through benchmarking and subsequently compared to experimental data available in the literature. Furthermore, ISiCLE has been applied to a set of 80 methylcyclohexane conformers, combined via Boltzmann weighting and compared to experimental values. We demonstrate that our protocol shows promise in the automation of chemical shift calculations and, ultimately, the expansion of chemical shift libraries.
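
    Boltzmann weighting of conformer properties, as mentioned for the methylcyclohexane set, follows a standard formula that can be sketched briefly. This is not ISiCLE code: the function names, the energy unit (kcal/mol), the temperature, and the conformer values are assumptions for illustration.

```python
import math

GAS_CONSTANT = 0.0019872041  # kcal / (mol K)


def boltzmann_weights(energies_kcal, temperature=298.15):
    """Population weight of each conformer from its relative energy."""
    lowest = min(energies_kcal)
    factors = [math.exp(-(e - lowest) / (GAS_CONSTANT * temperature)) for e in energies_kcal]
    partition = sum(factors)
    return [f / partition for f in factors]


def weighted_shift(shifts_ppm, energies_kcal, temperature=298.15):
    """Boltzmann-averaged chemical shift over a set of conformers."""
    weights = boltzmann_weights(energies_kcal, temperature)
    return sum(w * s for w, s in zip(weights, shifts_ppm))


# Hypothetical conformers: relative energies (kcal/mol) and one shift each (ppm).
energies = [0.0, 1.8]
shifts = [22.5, 17.0]
average = weighted_shift(shifts, energies)
```

    Subtracting the lowest energy before exponentiating leaves the weights unchanged but avoids overflow or underflow when absolute electronic energies are large.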

    Updated: 2019-11-01
  • Choquet integral-based fuzzy molecular characterizations: when global definitions are computed from the dependency among atom/bond contributions (LOVIs/LOEIs).
    J. Cheminform. (IF 4.154) Pub Date : 2018-10-27
    César R García-Jacas, Lisset Cabrera-Leyva, Yovani Marrero-Ponce, José Suárez-Lezcano, Fernando Cortés-Guzmán, Mario Pupo-Meriño, Ricardo Vivas-Reyes

    BACKGROUND Several topological (2D) and geometric (3D) molecular descriptors (MDs) are calculated from local vertex/edge invariants (LOVIs/LOEIs) by performing an aggregation process. To this end, norm-, mean- and statistic-based (non-fuzzy) operators are used, under the assumption that LOVIs/LOEIs are independent of (orthogonal to) one another. These operators are based on additive and/or linear measures and, consequently, they cannot be used to encode information from interrelated criteria. Thus, as LOVIs/LOEIs are not orthogonal values, non-additive (fuzzy) measures can be used to encode the interrelation among them. RESULTS General approaches to compute fuzzy 2D/3D-MDs from the contribution of each atom (LOVIs) or covalent bond (LOEIs) within a molecule are proposed, by using the Choquet integral as fuzzy aggregation operator. The Choquet integral-based operator is rather different from the other operators often used for the 2D/3D-MDs calculation. It performs a reordering step to fuse the LOVIs/LOEIs according to their magnitudes and, in addition, it considers the interrelation among them through a fuzzy measure. With this operator, fuzzy definitions can be derived from traditional or recent MDs; for instance, fuzzy Randic-like connectivity indices, fuzzy Balaban-like indices, fuzzy Kier-Hall connectivity indices, among others. To demonstrate the feasibility of using this operator, the QuBiLS-MIDAS 3D-MDs were used as a case study and, as a result, a module was built into the corresponding software to compute them ( http://tomocomd.com/qubils-midas ). Thus, it is the only software reported in the literature that can be employed to determine Choquet integral-based fuzzy MDs. Moreover, regression models were created on eight chemical datasets. In this way, the results achieved by the models based on the non-fuzzy QuBiLS-MIDAS 3D-MDs were compared with those achieved by the models based on the fuzzy QuBiLS-MIDAS 3D-MDs. As a result, the models built with the fuzzy QuBiLS-MIDAS 3D-MDs achieved the best performance, which was statistically corroborated through the Wilcoxon signed-rank test. CONCLUSIONS All in all, it can be concluded that the Choquet integral constitutes a prominent alternative to compute fuzzy 2D/3D-MDs from LOVIs/LOEIs. In this way, better characterizations of the compounds can be obtained, which will be ultimately useful in enhancing the modelling ability of existing traditional 2D/3D-MDs.
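
    The reordering step and the role of the fuzzy measure can be seen in a small sketch of the discrete Choquet integral. It is a generic textbook formulation, not the QuBiLS-MIDAS implementation: the atom contributions are invented, they are assumed to be non-negative, and the cardinality-based measure (|A|/n)^q is just one convenient choice of fuzzy measure.

```python
def choquet_integral(values, measure):
    """Discrete Choquet integral of non-negative values with respect to a fuzzy
    measure given as a function on frozensets of indices."""
    order = sorted(range(len(values)), key=lambda i: values[i])  # reordering step
    total = 0.0
    previous = 0.0
    for rank, index in enumerate(order):
        coalition = frozenset(order[rank:])  # indices whose value is >= current one
        total += (values[index] - previous) * measure(coalition)
        previous = values[index]
    return total


def cardinality_measure(n, q):
    """Symmetric fuzzy measure mu(A) = (|A| / n) ** q (assumed for illustration)."""
    return lambda subset: (len(subset) / n) ** q


atom_contributions = [1.0, 2.0, 3.0]  # hypothetical LOVIs
additive = choquet_integral(atom_contributions, cardinality_measure(3, 1.0))
non_additive = choquet_integral(atom_contributions, cardinality_measure(3, 2.0))
```

    With q = 1 the measure is additive and the integral reduces to the arithmetic mean (2.0 here); with q = 2 the measure is non-additive and the result shifts toward the smaller contributions, which shows how the measure changes the aggregation.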

    Updated: 2019-11-01
  • A new chemoinformatics approach with improved strategies for effective predictions of potential drugs.
    J. Cheminform. (IF 4.154) Pub Date : 2018-10-13
    Ming Hao, Stephen H Bryant, Yanli Wang

    BACKGROUND Fast and accurate identification of potential drug candidates against therapeutic targets (i.e., drug-target interactions, DTIs) is a fundamental step in the early drug discovery process. However, experimental determination of DTIs is time-consuming and costly, especially for testing the associations between the entire chemical and genomic spaces. Therefore, computationally efficient algorithms with accurate predictions are required to achieve such a challenging task. In this work, we design a new chemoinformatics approach derived from neighbor-based collaborative filtering (NBCF) to infer potential drug candidates for targets of interest. One of the fundamental steps of NBCF in the application of DTI predictions is to accurately measure the similarity between drugs solely based on the DTI profiles of known knowledge. However, commonly used similarity calculation methods such as COSINE may be noise-prone due to the extremely sparse property of the DTI bipartite network, which decreases the model performance of NBCF. We herein propose three strategies to remedy such a dilemma, which include: (1) adopting a positive pointwise mutual information (PPMI)-based similarity metric, which is noise-immune to some extent; (2) performing low-rank approximation of the original prediction scores; (3) incorporating auxiliary (complementary) information to produce the final predictions. RESULTS We test the proposed methods in three benchmark datasets and the results indicate that our strategies are helpful to improve the NBCF performance for DTI predictions. Compared with the prior algorithm, our methods exhibit better results as assessed by a recall-based evaluation metric. CONCLUSIONS A new chemoinformatics approach with improved strategies was successfully developed to predict potential DTIs. Among them, the model based on the sparsity-resistant PPMI similarity metric exhibits the best performance, which may be helpful to researchers for identifying potential drugs against therapeutic targets of interest, and can also be applied to related research such as identifying candidate disease genes.
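
    Strategy (1) can be illustrated with a generic PPMI transform followed by cosine similarity between drug profiles. This sketch uses the standard definition of positive pointwise mutual information on a binary drug-target matrix; the matrix is invented, and whether the authors compute similarity exactly this way is an assumption.

```python
import math


def ppmi_matrix(dti):
    """Positive pointwise mutual information of a drug x target count matrix."""
    total = sum(sum(row) for row in dti)
    row_sums = [sum(row) for row in dti]
    col_sums = [sum(col) for col in zip(*dti)]
    result = []
    for i, row in enumerate(dti):
        new_row = []
        for j, value in enumerate(row):
            if value == 0:
                new_row.append(0.0)
            else:
                pmi = math.log(value * total / (row_sums[i] * col_sums[j]))
                new_row.append(max(0.0, pmi))
        result.append(new_row)
    return result


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


# Hypothetical interactions: rows are drugs, columns are targets.
interactions = [[1, 1, 0],
                [1, 1, 0],
                [0, 0, 1]]
weighted = ppmi_matrix(interactions)
drug_similarity = cosine(weighted[0], weighted[1])
```

    PPMI gives more weight to interactions with rarely hit targets than to interactions with frequently hit ones, which is one intuition for why it may be less sensitive to sparsity than cosine similarity on raw profiles.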

    Updated: 2019-11-01
  • Evaluating parameters for ligand-based modeling with random forest on sparse data sets.
    J. Cheminform. (IF 4.154) Pub Date : 2018-10-12
    Alexander Kensert, Jonathan Alvarsson, Ulf Norinder, Ola Spjuth

    Ligand-based predictive modeling is widely used to generate predictive models aiding decision making in e.g. drug discovery projects. With growing data sets and requirements on low modeling time comes the necessity to analyze data sets efficiently to support rapid and robust modeling. In this study we analyzed four data sets and studied the efficiency of machine learning methods on sparse data structures, utilizing Morgan fingerprints of different radii and hash sizes, and compared them with the molecular signatures descriptor at different heights. We specifically evaluated the effect these parameters had on modeling time, predictive performance, and memory requirements using two implementations of random forest: Scikit-learn and FEST. We also compared with a support vector machine implementation. Our results showed that unhashed fingerprints yield significantly better accuracy than hashed fingerprints ([Formula: see text]), with no pronounced deterioration in modeling time and memory usage. Furthermore, the fast execution and low memory usage of the FEST algorithm suggest that it is a good alternative for large, high dimensional sparse data. Support vector machines and random forest performed equally well, but results indicate that the support vector machine was better at using the extra information from larger values of the Morgan fingerprint's radius.

    Updated: 2019-11-01
  • Life beyond the Tanimoto coefficient: similarity measures for interaction fingerprints.
    J. Cheminfom. (IF 4.154) Pub Date : 2018-10-06
    Anita Rácz,Dávid Bajusz,Károly Héberger

    BACKGROUND Interaction fingerprints (IFP) have been repeatedly shown to be valuable tools in virtual screening to identify novel hit compounds that can subsequently be optimized to drug candidates. As a complementary method to ligand docking, IFPs can be applied to quantify the similarity of predicted binding poses to a reference binding pose. For this purpose, a large number of similarity metrics can be applied, and various parameters of the IFPs themselves can be customized. In a large-scale comparison, we have assessed the effect of similarity metrics and IFP configurations on a number of virtual screening scenarios with ten different protein targets and thousands of molecules. In particular, the effect of considering general interaction definitions (such as Any Contact, Backbone Interaction and Sidechain Interaction), the effect of filtering methods and the different groups of similarity metrics were studied. RESULTS The performances were primarily compared based on AUC values, but we have also used the original similarity data for the comparison of similarity metrics with several statistical tests and the novel, robust sum of ranking differences (SRD) algorithm. With SRD, we can evaluate the consistency (or concordance) of the various similarity metrics with an ideal reference metric, which is provided by data fusion from the existing metrics. Different aspects of IFP configurations and similarity metrics were examined based on SRD values with analysis of variance (ANOVA) tests. CONCLUSION A general approach is provided that can be applied for the reliable interpretation and usage of similarity measures with interaction fingerprints. Metrics that are viable alternatives to the commonly used Tanimoto coefficient were identified based on a comparison with an ideal reference metric (consensus). A careful selection of the applied bits (interaction definitions) and IFP filtering rules can improve the results of virtual screening (in terms of their agreement with the consensus metric). The open-source Python package FPKit was introduced for the similarity calculations and IFP filtering; it is available at: https://github.com/davidbajusz/fpkit .
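
For readers unfamiliar with the Tanimoto coefficient, the sketch below compares a docked pose to a reference pose using bit-string fingerprints. It is a plain-Python illustration and does not use the FPKit API. The bit strings are invented, and the Dice coefficient is included only as one well-known alternative, not as a metric the abstract names.

```python
def on_bits(fingerprint):
    """Positions of the '1' bits in a bit-string fingerprint."""
    return {i for i, bit in enumerate(fingerprint) if bit == "1"}

def tanimoto(fp_a, fp_b):
    """Shared on-bits divided by the on-bits present in either fingerprint."""
    a, b = on_bits(fp_a), on_bits(fp_b)
    union = a | b
    # Convention chosen here (an assumption): two empty fingerprints score 0.
    return len(a & b) / len(union) if union else 0.0

def dice(fp_a, fp_b):
    """Twice the shared on-bits divided by the total on-bit count."""
    a, b = on_bits(fp_a), on_bits(fp_b)
    denom = len(a) + len(b)
    return 2 * len(a & b) / denom if denom else 0.0

# Hypothetical interaction fingerprints: each bit is one residue-level interaction.
reference_pose = "110100"
docked_pose = "100110"

print(tanimoto(reference_pose, docked_pose))  # prints: 0.5
```

The two poses share two of four distinct interactions, giving a Tanimoto score of 0.5. Dice weights the shared bits more heavily and returns about 0.67 for the same pair, which shows how the choice of metric changes the ranking scale.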

    Updated: 2019-11-01
  • Exploring non-linear distance metrics in the structure-activity space: QSAR models for human estrogen receptor.
    J. Cheminfom. (IF 4.154) Pub Date : 2018-09-20
    Ilya A Balabin,Richard S Judson

    BACKGROUND Quantitative structure-activity relationship (QSAR) models are important tools used in discovering new drug candidates and identifying potentially harmful environmental chemicals. These models often face two fundamental challenges: limited amount of available biological activity data and noise or uncertainty in the activity data themselves. To address these challenges, we introduce and explore a QSAR model based on custom distance metrics in the structure-activity space. METHODS The model is built on top of the k-nearest neighbor model, incorporating non-linearity not only in the chemical structure space, but also in the biological activity space. The model is tuned and evaluated using activity data for human estrogen receptor from the US EPA ToxCast and Tox21 databases. RESULTS The model closely trails the CERAPP consensus model (built on top of 48 individual human estrogen receptor activity models) in agonist activity predictions and consistently outperforms the CERAPP consensus model in antagonist activity predictions. DISCUSSION We suggest that incorporating non-linear distance metrics may significantly improve QSAR model performance when the available biological activity data are limited.

    Updated: 2019-11-01
  • rBAN: retro-biosynthetic analysis of nonribosomal peptides.
    J. Cheminfom. (IF 4.154) Pub Date : 2019-02-10
    Emma Ricart,Valérie Leclère,Areski Flissi,Markus Mueller,Maude Pupin,Frédérique Lisacek

    Proteinogenic and non-proteinogenic amino acids, fatty acids or glycans are some of the main building blocks of nonribosomal peptides (NRPs) and as such may give insight into the origin, biosynthesis and bioactivities of their constitutive peptides. Hence, the structural representation of NRPs using monomers provides a biologically interesting skeleton of these secondary metabolites. Databases dedicated to NRPs, such as Norine, already integrate monomer-based annotations in order to facilitate the development of structural analysis tools. In this paper, we present rBAN (retro-biosynthetic analysis of nonribosomal peptides), a new computational tool designed to predict the monomeric graph of NRPs from their atomic structure in SMILES format. This prediction is achieved through the "in silico" fragmentation of a chemical structure and matching the resulting fragments against the monomers of Norine for identification. Structures containing monomers not yet recorded in Norine are processed in a "discovery mode" that uses the RESTful service from PubChem to search the unidentified substructures and suggest new monomers. rBAN was integrated in a pipeline for the curation of Norine data in which it was used to check the correspondence between the monomeric graphs annotated in Norine and SMILES-predicted graphs. The process concluded with the validation of 97.26% of the records in Norine, a two-fold extension of its SMILES data and the introduction of 11 new monomers suggested in the discovery mode. The accuracy, robustness and high performance of rBAN were demonstrated by benchmarking it against other tools with the same functionality: Smiles2Monomers and GRAPE.

    Updated: 2019-11-01
  • Programming languages in chemistry: a review of HTML5/JavaScript.
    J. Cheminfom. (IF 4.154) Pub Date : 2019-02-06
    Kevin J Theisen

    This is one part of a series of reviews concerning the application of programming languages in chemistry, edited by Dr. Rajarshi Guha. This article reviews the JavaScript technology as it applies to the chemistry discipline. A discussion of the history, scope and technical details of the programming language is presented.

    Updated: 2019-11-01
  • Implementing cheminformatics.
    J. Cheminfom. (IF 4.154) Pub Date : 2019-02-06
    Rajarshi Guha

    Updated: 2019-11-01
  • Chemoinformatics and structural bioinformatics in OCaml.
    J. Cheminfom. (IF 4.154) Pub Date : 2019-02-06
    Francois Berenger,Kam Y J Zhang,Yoshihiro Yamanishi

    BACKGROUND OCaml is a functional programming language with strong static types, Hindley-Milner type inference and garbage collection. In this article, we share our experience in prototyping chemoinformatics and structural bioinformatics software in OCaml. RESULTS First, we introduce the language, list entry points for chemoinformaticians who would be interested in OCaml and give code examples. Then, we list some scientific open source software written in OCaml. We also present recent open source libraries useful in chemoinformatics. The parallelization of OCaml programs and their performance is also shown. Finally, tools and methods useful when prototyping scientific software in OCaml are given. CONCLUSIONS In our experience, OCaml is a programming language of choice for method development in chemoinformatics and structural bioinformatics.

    Updated: 2019-11-01
  • Avoiding hERG-liability in drug design via synergetic combinations of different (Q)SAR methodologies and data sources: a case study in an industrial setting.
    J. Cheminfom. (IF 4.154) Pub Date : 2019-02-04
    Thierry Hanser,Fabian P Steinmetz,Jeffrey Plante,Friedrich Rippmann,Mireille Krier

    In this paper, we explore the impact of combining different in silico prediction approaches and data sources on the predictive performance of the resulting system. We use inhibition of the hERG ion channel target as the endpoint for this study as it constitutes a key safety concern in drug development and a potential cause of attrition. We will show that combining data sources can improve the relevance of the training set with regard to the target chemical space, leading to improved performance. Similarly, we will demonstrate that combining multiple statistical models together, and with expert systems, can lead to positive synergistic effects when taking into account the confidence in the predictions of the merged systems. The best combinations analyzed display good hERG predictivity. Finally, this work demonstrates the suitability of the SOHN methodology for building models in the context of receptor-based endpoints like hERG inhibition when using the appropriate pharmacophoric descriptors.

    Updated: 2019-11-01
  • OGER++: hybrid multi-type entity recognition.
    J. Cheminfom. (IF 4.154) Pub Date : 2019-01-23
    Lenz Furrer,Anna Jancso,Nicola Colic,Fabio Rinaldi

    BACKGROUND We present a text-mining tool for recognizing biomedical entities in scientific literature. OGER++ is a hybrid system for named entity recognition and concept recognition (linking), which combines a dictionary-based annotator with a corpus-based disambiguation component. The annotator uses an efficient look-up strategy combined with a normalization method for matching spelling variants. The disambiguation classifier is implemented as a feed-forward neural network which acts as a postfilter to the previous step. RESULTS We evaluated the system in terms of processing speed and annotation quality. In the speed benchmarks, the OGER++ web service processes 9.7 abstracts or 0.9 full-text documents per second. On the CRAFT corpus, we achieved 71.4% and 56.7% F1 for named entity recognition and concept recognition, respectively. CONCLUSIONS Combining knowledge-based and data-driven components allows creating a system with competitive performance in biomedical text mining.

    Updated: 2019-11-01
  • Universal nanohydrophobicity predictions using virtual nanoparticle library.
    J. Cheminfom. (IF 4.154) Pub Date : 2019-01-20
    Wenyi Wang,Xiliang Yan,Linlin Zhao,Daniel P Russo,Shenqing Wang,Yin Liu,Alexander Sedykh,Xiaoli Zhao,Bing Yan,Hao Zhu

    To facilitate the development of new nanomaterials, especially nanomedicines, a novel computational approach was developed to precisely predict the hydrophobicity of gold nanoparticles (GNPs). The core of this study was to develop a large virtual gold nanoparticle (vGNP) library with computational nanostructure simulations. Based on the vGNP library, a nanohydrophobicity model was developed and then validated against externally synthesized and tested GNPs. This approach and the resulting model are an efficient and effective universal tool to visualize and predict critical physicochemical properties of new nanomaterials before synthesis, guiding nanomaterial design.

    Updated: 2019-11-01
  • QBMG: quasi-biogenic molecule generator with deep recurrent neural network.
    J. Cheminfom. (IF 4.154) Pub Date : 2019-01-19
    Shuangjia Zheng,Xin Yan,Qiong Gu,Yuedong Yang,Yunfei Du,Yutong Lu,Jun Xu

    Biogenic compounds are important materials for drug discovery and chemical biology. In this work, we report a quasi-biogenic molecule generator (QBMG) to compose virtual quasi-biogenic compound libraries by means of gated recurrent unit recurrent neural networks. The library includes stereo-chemical properties, which are crucial features of natural products. QBMG can reproduce the property distribution of the underlying training set, while being able to generate realistic, novel molecules outside of the training set. Furthermore, these compounds are associated with known bioactivities. A focused compound library based on a given chemotype/scaffold can also be generated by combining this approach with transfer learning. This approach can be used to generate virtual compound libraries for pharmaceutical lead identification and optimization.

    Updated: 2019-11-01
  • LSTMVoter: chemical named entity recognition using a conglomerate of sequence labeling tools.
    J. Cheminfom. (IF 4.154) Pub Date : 2019-01-12
    Wahed Hemati,Alexander Mehler

    BACKGROUND Chemical and biomedical named entity recognition (NER) is an essential preprocessing task in natural language processing. The identification and extraction of named entities from scientific articles is also attracting increasing interest in many scientific disciplines. Locating chemical named entities in the literature is an essential step in chemical text mining pipelines for identifying chemical mentions, their properties, and relations as discussed in the literature. In this work, we describe an approach to the BioCreative V.5 challenge regarding the recognition and classification of chemical named entities. For this purpose, we transform the task of NER into a sequence labeling problem. We present a series of sequence labeling systems that we used, adapted and optimized in our experiments for solving this task. To this end, we experiment with hyperparameter optimization. Finally, we present LSTMVoter, a two-stage application of recurrent neural networks that integrates the optimized sequence labelers from our study into a single ensemble classifier. RESULTS We introduce LSTMVoter, a bidirectional long short-term memory (LSTM) tagger that utilizes a conditional random field layer in conjunction with attention-based feature modeling. Our approach explores information about features that is modeled by means of an attention mechanism. LSTMVoter outperforms each extractor integrated by it in a series of experiments. On the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus, LSTMVoter achieves an F1-score of 90.04%; on the BioCreative V.5 chemical entity mention in patents corpus, it achieves an F1-score of 89.01%. AVAILABILITY AND IMPLEMENTATION Data and code are available at https://github.com/texttechnologylab/LSTMVoter .

    Updated: 2019-11-01
  • BioTransformer: a comprehensive computational tool for small molecule metabolism prediction and metabolite identification.
    J. Cheminfom. (IF 4.154) Pub Date : 2019-01-07
    Yannick Djoumbou-Feunang,Jarlei Fiamoncini,Alberto Gil-de-la-Fuente,Russell Greiner,Claudine Manach,David S Wishart

    BACKGROUND A number of computational tools for metabolism prediction have been developed over the last 20 years to predict the structures of small molecules undergoing biological transformation or environmental degradation. These tools were largely developed to facilitate absorption, distribution, metabolism, excretion, and toxicity (ADMET) studies, although there is now a growing interest in using such tools to facilitate metabolomics and exposomics studies. However, their use and widespread adoption are still hampered by several factors, including their limited scope, breadth of coverage, availability, and performance. RESULTS To address these limitations, we have developed BioTransformer, a freely available software package for accurate, rapid, and comprehensive in silico metabolism prediction and compound identification. BioTransformer combines a machine learning approach with a knowledge-based approach to predict small molecule metabolism in human tissues (e.g. liver tissue), the human gut as well as the environment (soil and water microbiota), via its metabolism prediction tool. A comprehensive evaluation of BioTransformer showed that it was able to outperform two state-of-the-art commercially available tools (Meteor Nexus and ADMET Predictor), with precision and recall values up to 7 times better than those obtained for Meteor Nexus or ADMET Predictor on the same sets of pharmaceuticals, pesticides, phytochemicals or endobiotics under similar or identical constraints. Furthermore, BioTransformer was able to reproduce 100% of the transformations and metabolites predicted by the EAWAG pathway prediction system. Using mass spectrometry data obtained from a rat experimental study with epicatechin supplementation, BioTransformer was also able to correctly identify 39 previously reported epicatechin metabolites via its metabolism identification tool, and suggest 28 potential metabolites, 17 of which matched nine monoisotopic masses for which no evidence of a previous report could be found. CONCLUSION BioTransformer can be used as an open access command-line tool, or a software library. It is freely available at https://bitbucket.org/djoumbou/biotransformerjar/ . Moreover, it is also freely available as an open access RESTful application at www.biotransformer.ca , which allows users to manually or programmatically submit queries, and retrieve metabolism predictions or compound identification data.

    Updated: 2019-11-01
  • A retrosynthetic analysis algorithm implementation.
    J. Cheminfom. (IF 4.154) Pub Date : 2019-01-04
    Ian A Watson,Jibo Wang,Christos A Nicolaou

    The need for synthetic route design arises frequently in discovery-oriented chemistry organizations. While finding solutions to this problem has traditionally been the domain of human experts, several computational approaches, aided by algorithmic advances and the availability of large reaction collections, have recently been reported. Herein we present our own implementation of a retrosynthetic analysis method and demonstrate its capabilities in an attempt to identify synthetic routes for a collection of approved drugs. Our results indicate that the method, leveraging reaction transformation rules learned from a large patent reaction dataset, can identify multiple theoretically feasible synthetic routes and, thus, support research chemists' everyday efforts.

    Updated: 2019-11-01
  • Erratum to: ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics.
    J. Cheminfom. (IF 4.154) Pub Date : 2017-11-01
    Jiangming Sun,Nina Jeliazkova,Vladimir Chupakhin,Jose-Felipe Golib-Dzib,Ola Engkvist,Lars Carlsson,Jörg Wegner,Hugo Ceulemans,Ivan Georgiev,Vedrin Jeliazkov,Nikolay Kochev,Thomas J Ashby,Hongming Chen

    Updated: 2019-11-01
  • Erratum to: The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching.
    J. Cheminfom. (IF 4.154) Pub Date : 2017-11-01
    Egon L Willighagen,John W Mayfield,Jonathan Alvarsson,Arvid Berg,Lars Carlsson,Nina Jeliazkova,Stefan Kuhn,Tomáš Pluskal,Miquel Rojas-Chertó,Ola Spjuth,Gilleain Torrance,Chris T Evelo,Rajarshi Guha,Christoph Steinbeck

    Updated: 2019-11-01
Contents have been reproduced by permission of the publishers.