Computers & Graphics

Volume 91, October 2020, Pages 189-198

Special Section on 3DOR 2020
SHREC 2020: Multi-domain protein shape retrieval challenge

https://doi.org/10.1016/j.cag.2020.07.013

Highlights

  • Proteins are 3D molecular shapes of utmost importance in vivo.

  • Methods to compare protein shape need to be evaluated on a benchmark dataset.

  • This Shape Retrieval Contest aims to assess the performance of shape comparison methods.

  • The trade-off between performance and computational cost is evaluated.

Abstract

Proteins are natural modular objects usually composed of several domains, each domain bearing a specific function that is mediated through its surface, which is accessible to vicinal molecules. This draws attention to an understudied characteristic of protein structures, the surface, which is mostly unexploited by protein structure comparison methods. In the present work, we evaluated the performance of six shape comparison methods, among which three are based on machine learning, to distinguish between 588 multi-domain proteins and to recreate the evolutionary relationships at the protein and species levels of the SCOPe database.

The six groups that participated in the challenge submitted a total of 15 sets of results. We observed that the performance of all the methods significantly decreases at the species level, suggesting that shape-only protein comparison is challenging for closely related proteins. Even if the dataset is limited in size (only 588 proteins are considered whereas more than 160,000 protein structures are experimentally solved), we think that this work provides useful insights into the performance of current shape comparison methods, and highlights possible limitations to large-scale applications due to the computational cost.

Introduction

Proteins are complex macromolecules with various shapes and sizes ranging from hundreds to millions of atoms [1]. The 3D arrangement of protein atoms is directly linked to specific functions that are mostly mediated through the protein surface. Protein surfaces are of great interest in drug discovery pipelines, the study of adverse drug reactions and the characterization of cellular processes at the molecular level. However, challenges in protein surface comparison may arise from (a) the dynamical, non-rigid nature of the proteins that allows protein conformational changes, i.e., surficial modifications and therefore specific functions, (b) the intrinsic structure of multi-domain proteins, i.e., the fusion of multiple, individual domains into one protein throughout evolution, and (c) the similarity between distinct protein structures and surfaces inherited from their evolutionary relationships.

The SHape REtrieval Challenges (SHREC) are time-restricted challenges, which aim to evaluate the effectiveness of 3D-shape retrieval algorithms. Typically, a challenge is opened by proposing a dataset of related shapes to participants while retaining the class membership. In the SHape REtrieval Challenge 2020 (SHREC2020) track on multi-domain protein shapes, the participants had 7 weeks from the dataset publication to send their results with a description of the methods used to generate the results (see Section 4). This SHREC2020 track on multi-domain protein shapes evaluates the current ability of shape comparison methods proposed by 6 different groups to tackle the protein surface comparison problem. The participants were asked to send their results in the form of matrices containing all-to-all dissimilarity scores. The results were analyzed and the overall retrieval performances are presented here.

The dataset includes 588 proteins consisting of two domains (the functional units of the proteins); only the corresponding triangulated meshes of their solvent-excluded surfaces (SES) [2] were provided as input to the participants. We then evaluated how well each method retrieves the evolutionary relationships between orthologous proteins (proteins that have the same function in different organisms) and the different conformations of an individual protein. Here, we present the results of all the participants and methods, and briefly discuss the trade-off between retrieval performance and computational cost for each method.

Dataset

Proteins are linear polymers (the so-called protein chains) made of amino-acid residues (up to several hundreds), which fold into a specific, well-defined 3D structure. Furthermore, many proteins need to form a complex of several chains to become functional. For instance, human haemoglobin requires two α-globin and two β-globin chains to be fully functional. Domains define the functional units of the proteins, and are usually associated with a specific function and/or interaction; it is

Evaluation

Analyses were performed with scikit-learn [13] and numpy [14], and Figs. 4 and 5 were produced using matplotlib [15].

Nearest Neighbor, First-tier and Second-tier. These retrieval metrics measure the ratio of models that belong to the same class as the query. For Nearest Neighbor (NN), only the first match is considered (the query itself is excluded), while the first |C| − 1 and 2(|C| − 1) matches, where |C| denotes the size of the query’s class, are considered for First-tier (T1) and Second-tier
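As an illustration of the tier metrics described above, the following minimal sketch computes mean NN, T1 and T2 scores from an all-to-all dissimilarity matrix and a list of class labels. It is not the organizers' evaluation code: the function name `retrieval_scores` is hypothetical, and the Second-tier normalization (relevant matches among the first 2(|C| − 1), divided by |C| − 1) is an assumption based on the common SHREC convention.

```python
import numpy as np


def retrieval_scores(dissimilarity, labels):
    """Mean NN, T1 and T2 over all queries (hypothetical helper).

    dissimilarity: (n, n) array, smaller values mean more similar shapes.
    labels: length-n sequence of class labels.
    """
    d = np.asarray(dissimilarity, dtype=float)
    labels = np.asarray(labels)
    nn, t1, t2 = [], [], []
    for q in range(len(labels)):
        # Rank every shape by increasing dissimilarity to the query,
        # then drop the query itself from the ranked list.
        order = np.argsort(d[q], kind="stable")
        order = order[order != q]
        relevant = labels[order] == labels[q]
        k = int(relevant.sum())  # |C| - 1 other members of the query's class
        if k == 0:
            continue  # singleton class: the tiers are undefined, skip it
        nn.append(float(relevant[0]))
        t1.append(relevant[:k].sum() / k)
        # Assumption: T2 is normalized by |C| - 1, not by the window size.
        t2.append(relevant[:2 * k].sum() / k)
    return float(np.mean(nn)), float(np.mean(t1)), float(np.mean(t2))
```

A submitted matrix for this track would be 588 × 588; the function only requires that rows and labels follow the same ordering.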

Participants & methods

Six groups from five different countries registered for the track and submitted 15 dissimilarity matrices in the requested time (8 weeks) along with the description of their protocol. To ease the reading, we have assigned each group a short name for referencing in the following text.

  1. CODSEQ by Halim Benhabiles, Karim Hammoudi, Adnane Cabani, Feryal Windal, Mahmoud Melkemi (Section 4.1),

  2. 3DZ by Tunde Aderinwale, Genki Terashi, Charles Christoffer, Daisuke Kihara (Section 4.2),

  3. WKS/SGWS by Yuxu

Results & discussion

In this section, we assess quantitatively the performance of each method described in Section 4. We analyzed the performance at the protein (Fig. 4 and Table 6) and the species (Fig. 5 and Table 7) levels as described in Section 3.

Protein level

At the protein level, the 588 shapes were gathered into 7 classes of multi-domain orthologous proteins; among each class, all members share at least one common domain while the other domains are different.

This feature allows the methods for having

Conclusion

In the present work, we have presented a dataset of shapes from multi-domain proteins. Six groups, among which three used machine learning approaches in their respective workflows, submitted 15 sets of results. The performance was assessed at the protein and species levels of the SCOPe database.

Shape retrieval methods displayed high-quality results at the protein level. We observed a significant decrease in the performance of all the methods at the species level. These results indicate that

CRediT authorship contribution statement

Florent Langenfeld: Conceptualization, Data curation, Formal analysis, Investigation, Writing - original draft, Supervision. Yuxu Peng: Software, Investigation, Resources, Writing - review & editing. Yu-Kun Lai: Software, Investigation, Resources, Writing - review & editing. Paul L. Rosin: Software, Investigation, Resources, Writing - review & editing. Tunde Aderinwale: Software, Investigation, Resources, Writing - review & editing. Genki Terashi: Software, Investigation, Resources, Writing -

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Yuxu Peng was supported by the Young teachers growth plan project (2019QJCZ014) funded by Changsha University of Science & Technology.

Stelios Mylonas, Apostolos Axenopoulos and Petros Daras were supported by the ATXN1-MED15 PPI project funded by the GSRT - Hellenic Foundation for Research and Innovation.

Matthieu Montes and Florent Langenfeld were supported by the European Research Council Executive Agency under the research grant number 640283.

References (35)

  • F. Langenfeld et al. Shrec 2018 protein shape retrieval. Proceedings of the Eurographics workshop on 3D object retrieval (2018)
  • F. Langenfeld et al. Protein shape retrieval contest
  • P. Gainza et al. Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat Methods (2019)
  • F. Pedregosa et al. Scikit-learn: machine learning in Python. J Mach Learn Res (2011)
  • T. Oliphant. NumPy: a guide to NumPy. USA: Trelgol Publishing; 2006. [Online: Accessed 15 June 2020];...
  • J.D. Hunter. Matplotlib: a 2D graphics environment. Comput Sci Eng (2007)
  • C. Szegedy et al. Inception-v4, inception-resnet and the impact of residual connections on learning. Proceedings of the 31st AAAI conference on artificial intelligence, AAAI17 (2017)
    1

    Track organizers and corresponding authors.
