Elsevier

World Patent Information

Volume 57, June 2019, Pages 59-69
Automatic generation of Markush structures from specific compounds

https://doi.org/10.1016/j.wpi.2019.03.006

Abstract

Markush structures play an important role in cheminformatics, especially in chemical patents. This paper presents a novel algorithm for automatically generating Markush structures from series of specific compounds. The method can be used to assist patent drafting or to compose combinatorial libraries based on several molecules of interest. To the authors’ knowledge, the presented algorithm is the first solution to this problem. It is available in several ChemAxon software products.

Introduction

Markush structures are generic chemical descriptions used for specifying collections of related compounds. They are widely applied in chemical patents and other chemistry texts as well as for describing large combinatorial libraries [1–7]. Markush structures are named after Eugene A. Markush, who was awarded a patent for pyrazolone dyes in 1924 by the U.S. Patent Office (now the U.S. Patent and Trademark Office, USPTO) [8]. He claimed generic chemical structures in this patent, which set a precedent, and the technique became common practice [1,5,6].

A typical patent Markush structure represents a large or even unlimited number of substances under a single disclosure. The invariable part of the structure, called the scaffold, embodies the common structural features of the specific molecules. To describe the differences, the following kinds of variations are commonly used in patent claims [1,9–12].

  • Substituent variation: a marked point of the structure can be substituted with a substructure selected from a specified set. It is typically represented by so-called R-groups, which are often denoted as R1, R2, etc., or with letters like X or Y. For example, “R1 is selected from H, CH3, and OH”.

  • Position variation: a substituent can be connected to various alternative positions of the structure, e.g., “monochlorophenyl”.

  • Frequency variation: a specific part of the structure can be repeated to form a chain or a part of a ring, e.g., “(CH2)n, where n is 1–3”.

  • Homology variation: a general nomenclatural term that represents a large or theoretically infinite number of substructures with similar structural features (i.e., they are homologous to each other), for example, “alkyl” or “aryl”. Additional constraints are often specified along with homology groups to limit the number of carbon atoms or heteroatoms, make topological restrictions, or describe optional substituents (e.g., “C13 alkyl”).

These variations can even be nested, that is, the alternative substructures may themselves contain further variations of any kind. Nested R-groups are especially common: they are widely used in chemical patents and, together with homology variation, account for much of the complexity of Markush structures [13]. Other types of variations are also used in some applications, for example, atom and bond variation, in which a specific atom or bond in the Markush structure can be of multiple types. However, these variations can easily be described using R-groups as well.
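Nested substituent variation can be illustrated with a small sketch. It is not the representation used by the authors or by any ChemAxon product: the placeholder syntax `{R1}`, the function name `enumerate_markush`, and the use of plain text fragments such as `Ph-` instead of real molecular graphs are all assumptions made for illustration. The definitions are assumed to be acyclic, and each placeholder occurrence is expanded independently.

```python
import re
from itertools import product

# Hypothetical placeholder syntax: "{R1}" marks a point of substituent variation.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def enumerate_markush(template, rgroups):
    """Yield every specific string represented by `template`.

    `rgroups` maps a placeholder name to its list of alternatives.
    Alternatives may contain placeholders themselves (nested R-groups).
    """
    names = PLACEHOLDER.findall(template)
    if not names:
        # No variation left: the template is already a specific structure.
        yield template
        return
    # For each placeholder, collect all fully expanded alternatives (recursion
    # handles nesting).
    choices = [
        [s for alt in rgroups[name] for s in enumerate_markush(alt, rgroups)]
        for name in names
    ]
    # re.split with a capturing group returns [text, name, text, name, ..., text];
    # every second element starting at index 0 is literal text.
    literals = PLACEHOLDER.split(template)[0::2]
    for combo in product(*choices):
        pieces = [literals[0]]
        for substituent, literal in zip(combo, literals[1:]):
            pieces.append(substituent)
            pieces.append(literal)
        yield "".join(pieces)


# Toy example: R1 is H, CH3, OH, or CH2-R2, where R2 is Cl or Br.
SCAFFOLD = "Ph-{R1}"
RGROUPS = {
    "R1": ["H", "CH3", "OH", "CH2-{R2}"],
    "R2": ["Cl", "Br"],
}

if __name__ == "__main__":
    for structure in enumerate_markush(SCAFFOLD, RGROUPS):
        print(structure)
```

The single nested alternative `CH2-{R2}` contributes two structures, so the toy Markush structure represents five specific strings in total.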

Combinatorial libraries that are to be synthesized typically contain only a modest number of compounds (a few thousand or fewer), but diversity analyses are often carried out on larger virtual libraries (e.g., on the order of 10^12 compounds or more). These libraries are usually represented by combinatorial Markush structures that do not involve homology groups, so every represented molecule can be enumerated (provided that the required time and space are available). In contrast, patent Markush structures tend to describe much larger or infinite chemical spaces because homology groups are commonly used [3,14].
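The size of such a library can be computed without enumerating it, because independent R-groups multiply and the alternatives of one R-group add. The sketch below is only an illustration under assumed conventions: the `{R1}` placeholder syntax and the name `count_structures` are hypothetical, and the result counts combinations, so it is an upper bound on the number of distinct molecules (different combinations can yield the same compound, for example through symmetry).

```python
import re

# Hypothetical placeholder syntax: "{R1}" marks a point of substituent variation.
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def count_structures(template, rgroups):
    """Count the combinations represented by `template` without enumerating them.

    Independent placeholders multiply; the alternatives of one placeholder add.
    Nested definitions are handled by recursion (assumed to be acyclic).
    """
    total = 1
    for name in PLACEHOLDER.findall(template):
        total *= sum(count_structures(alt, rgroups) for alt in rgroups[name])
    return total


if __name__ == "__main__":
    # Four R-groups with 1000 alternatives each: 1000**4 = 10**12 combinations.
    big = {f"R{i}": [f"s{j}" for j in range(1000)] for i in range(1, 5)}
    print(count_structures("core({R1})({R2})({R3})({R4})", big))
```

Four variation points with a thousand substituents each already reach the 10^12 scale mentioned above, which shows why full enumeration quickly becomes impractical.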

Patent information is essential in the pharmaceutical industry and biotechnology, not only for the protection of intellectual property, but also because a major part of novel chemistry is published only in patents [1,5,15]. Markush structures are used effectively in chemical patents because they give the inventor a broad scope of legal coverage. Such broad coverage is particularly important because similar compounds usually show similar biological activity, so inventors also want to protect compounds derived from the original active compounds by minor modifications. On the other hand, Markush structures introduce several difficulties in drafting and analyzing patent claims, in computer representation, and in searching, mainly because the chemical language and notations used in patent claims are not standardized and are sometimes ambiguous or too vague. Furthermore, the size and complexity of patent Markush structures have increased over the decades [4].

During the process of patent drafting, an important step is the generation of a Markush structure based on a list of specific compounds. The goal of patent applicants and attorneys is to cover all compounds delivered by the research group, as well as other structurally similar compounds, while retaining the degree of specificity required by patent offices. This critical step is typically carried out manually, although it is difficult, time-consuming, and error-prone.

In recent decades, the digital representation and analysis of Markush structures have been the subjects of intensive research [1–5,9–12,14,16–21]. Various tools for representing, visualizing, and searching Markush structures are described in the literature [3,14,20–30]. To the authors’ knowledge, however, no complete algorithmic solution has been published so far for generating Markush structures from a list of target compounds.

This paper presents a novel algorithm for automatic construction of Markush structures, which can be used both for patent drafting and for the composition of (virtual) combinatorial libraries. A simple example is presented in Fig. 1. Given a series of specific compounds, the algorithm generates a Markush structure that encompasses all compounds along with other similar ones due to additional combinations of the variable parts derived from the original compounds. As Fig. 1 demonstrates, substituent variation (R-groups), position variation, and atom variation are applied by the algorithm. Other types of variations, such as homology groups, are not supported currently, but the algorithm enables further extension to support them.

A similar concept called R-group decomposition is available in different commercial and open-source cheminformatics software packages [31–33]. In contrast with the presented algorithm, however, R-group decomposition is typically used for generating structure–activity relationship (SAR) tables, not for generating Markush structures. R-group decomposition algorithms can decompose molecules into fragments based on a predefined scaffold, but they cannot identify the scaffold automatically and can only generate single-level R-groups. They do not support nested R-groups or other features such as position variation. Therefore, R-group decomposition cannot be considered a real alternative to the solution discussed in this paper.

The presented algorithm, which we call “Markush Composer”, was developed at ChemAxon. It is integrated into desktop applications with a graphical user interface (Markush Editor, ChemCurator) [34,35] and is also available as a Java API [36] and a KNIME component [37]. The algorithm depends strongly on maximum common substructure (MCS) search, which is a well-known NP-hard problem [38] with numerous applications in the field of computational chemistry and, increasingly, computational biology [6,39–41]. We use an efficient MCS algorithm, which applies the clique-based approach combined with various heuristics [42].
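The clique-based formulation of MCS search can be sketched as follows. This is a textbook version, not the authors' implementation and without any of the heuristics of [42]: it builds the modular product of two atom-labeled graphs and finds a maximum clique with Bron–Kerbosch pivoting, which yields a maximum common induced subgraph that may be disconnected. The graph encoding (a list of element symbols plus a dictionary of bond orders) and all function names are assumptions made for illustration.

```python
from itertools import combinations


def make_bonds(pairs):
    """Build a bond table {frozenset({a, b}): order} from (a, b, order) triples."""
    return {frozenset((a, b)): order for a, b, order in pairs}


def bond_order(bonds, a, b):
    """Return the bond order between atoms a and b, or 0 if they are not bonded."""
    return bonds.get(frozenset((a, b)), 0)


def modular_product(atoms1, bonds1, atoms2, bonds2):
    """Adjacency sets of the modular product of two labeled molecular graphs.

    Nodes are atom pairs (u, v) with equal element symbols. Two nodes are
    adjacent when mapping both pairs at once is consistent: distinct atoms on
    each side and the same bond order (including "no bond") between them.
    """
    nodes = [(u, v) for u, a in enumerate(atoms1)
             for v, b in enumerate(atoms2) if a == b]
    adj = {n: set() for n in nodes}
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        if (u1 != u2 and v1 != v2
                and bond_order(bonds1, u1, u2) == bond_order(bonds2, v1, v2)):
            adj[(u1, v1)].add((u2, v2))
            adj[(u2, v2)].add((u1, v1))
    return adj


def max_clique(adj):
    """Return one maximum clique using Bron-Kerbosch with pivoting and a size bound."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        if len(r) + len(p) <= len(best):
            return  # this branch cannot beat the best clique found so far
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for n in list(p - adj[pivot]):
            expand(r + [n], p & adj[n], x & adj[n])
            p = p - {n}
            x = x | {n}

    expand([], set(adj), set())
    return best


def mcs_mapping(atoms1, bonds1, atoms2, bonds2):
    """Atom mapping (index in molecule 1 -> index in molecule 2) of one MCS."""
    return dict(max_clique(modular_product(atoms1, bonds1, atoms2, bonds2)))


if __name__ == "__main__":
    # Heavy-atom graphs of 1-propanol (C-C-C-O) and 2-propanol (C-C(-O)-C).
    atoms_a = ["C", "C", "C", "O"]
    bonds_a = make_bonds([(0, 1, 1), (1, 2, 1), (2, 3, 1)])
    atoms_b = ["C", "C", "C", "O"]
    bonds_b = make_bonds([(0, 1, 1), (1, 2, 1), (1, 3, 1)])
    print(mcs_mapping(atoms_a, bonds_a, atoms_b, bonds_b))
```

For the two propanol isomers the largest common induced subgraph has three atoms, since a four-atom match would require the two graphs to be isomorphic. The modular product grows with the product of the atom counts, which is one reason practical implementations add heuristics on top of this basic scheme.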

Methods

Fig. 2 provides a brief overview of the algorithm. It consists of five consecutive phases, which will be discussed in detail. The main concept is to first generate a single-level Markush structure and then call the algorithm recursively for each R-group, using the represented substituents as input, in order to generate nested variations.

The input molecules may not share a common scaffold of reasonable size (e.g., at least 2–3 atoms). This can be the case when the algorithm is applied

Experimental evaluation

According to the authors’ knowledge, the presented algorithm is the first solution to automatic Markush generation, so it cannot be compared with existing methods. However, experimental tests were carried out to evaluate the algorithm using real-world input data: combinatorial libraries and exemplified structures of different patents. The parameters of the heuristics and scoring methods were configured based on result statistics, chemical intuition, and the common practice of Markush

Conclusions and future work

Composing Markush structures from specific compounds is a difficult and time-consuming task, but it is typically performed manually without significant computer assistance. This paper introduces a novel method for Markush generation, which can effectively be used for multiple purposes, including patent drafting and composing (virtual) combinatorial libraries. The presented algorithm is the first automatic solution to this problem, according to the authors’ knowledge. It can construct a Markush

Acknowledgment

The authors express their appreciation to Ákos Tarcsay, János Kendi, and Gábor Hornyák for their comments and suggestions.

References (51)

  • L.J. Brown. The Markush challenge. J. Chem. Inf. Comput. Sci. (1991)
  • M.J. White. Chemical patents
  • N. Brown. Chemoinformatics–An introduction for computer scientists. ACM Comput. Surv. (2009)
  • T. Engel. Basic overview of chemoinformatics. J. Chem. Inf. Model. (2006)
  • E.A. Markush. Pyrazolone Dye and Process of Making the Same (1924)
  • M.F. Lynch et al. Computer storage and retrieval of generic chemical structures in patents, 1. Introduction and general strategy. J. Chem. Inf. Comput. Sci. (1981)
  • J.M. Barnard et al. Computer storage and retrieval of generic chemical structures in patents, 2. GENSAL, a formal language for the description of generic chemical structures. J. Chem. Inf. Comput. Sci. (1981)
  • S.M. Welford et al. Computer storage and retrieval of generic chemical structures in patents, 3. Chemical grammars and their role in the manipulation of chemical structures. J. Chem. Inf. Comput. Sci. (1981)
  • M.F. Lynch et al. The Sheffield generic structures project–a retrospective review. J. Chem. Inf. Comput. Sci. (1996)
  • D.A. Cosgrove et al. A system for encoding and searching Markush structures. J. Chem. Inf. Model. (2012)
  • Chemical Structures: the International Language of Chemistry (1988)
  • Chemical Structures 2: the International Language of Chemistry (1993)
  • E.S. Simmons. The grammar of Markush structure searching: vocabulary vs. syntax. J. Chem. Inf. Comput. Sci. (1991)
  • H. Tokuno. Comparison of Markush structure databases. J. Chem. Inf. Comput. Sci. (1993)
  • A. Barth et al. A novel concept for the search and retrieval of the Derwent Markush resource database. J. Chem. Inf. Model. (2016)

Péter Kovács received his M.Sc. in Computer Science from Eötvös Loránd University, Budapest, Hungary in 2007. He was a member of the Egerváry Research Group on Combinatorial Optimization (EGRES) at Eötvös Loránd University. Since 2010, he has been working at ChemAxon as a software developer and researcher in the field of cheminformatics. His main research interests are optimization problems and algorithms related to networks, graphs, and computational chemistry.

Gábor Botka received his M.Sc. in Computer Science from Eötvös Loránd University, Budapest, Hungary in 2013. After graduation, he started to work at ChemAxon's newly formed Markush technology team, which has been developing patent information related solutions.

Árpád Figyelmesi received his M.Sc. in Chemical Engineering from the Budapest University of Technology and Economics, Hungary and completed his postgraduate diploma in IT Management from Corvinus University of Budapest, Hungary. After years of industrial experience, he joined the Hungarian Intellectual Property Office as a patent examiner and later as a business analyst. Since 2013, he has been working at ChemAxon, leading the development of Markush technology and patent information related solutions.