Journal of Molecular Biology
Predicting Secondary Structure Propensities in IDPs Using Simple Statistics from Three-Residue Fragments
Introduction
Intrinsically disordered proteins (IDPs) have emerged as key actors in a multitude of biological processes such as signaling, regulation and homeostasis [1,2,3]. Moreover, malfunction of IDPs has been linked to a large proportion of cancers and neurodegenerative and cardiovascular diseases [4]. IDPs perform highly specialized functions despite being devoid of permanent secondary or tertiary structure. Indeed, their malleability enables biological tasks that are out of reach for their globular counterparts [5]. In most cases, function is manifested when these flexible proteins interact with globular partners to trigger signaling or metabolic cascades [6]. These interactions are normally of low or moderate affinity, giving rise to fuzzy complexes where the IDP remains flexible upon binding [7,8]. They are often mediated by short linear motifs or molecular recognition elements that specifically recognize the surface of the partner [9,10,11,12]. The presence of partially structured elements in short linear motifs tunes the thermodynamics and kinetics of the interaction, often assisted by their flanking regions [13]. Structural and electrostatic changes induced by post-translational modifications can also modulate the affinity of the interaction and represent efficient mechanisms of regulation [14,15].
The identification and characterization of partially structured elements in IDPs is complex and requires extensive experimental work, mainly using NMR. In particular, NMR chemical shifts and residual dipolar couplings (RDCs) are sensitive to small populations of secondary structural elements [16,17,18]. Computational tools represent a good complement or an alternative to experimental studies to localize such structurally biased elements. For over 40 years, numerous methods have been developed to predict secondary structure in proteins from their amino acid sequence [19]. However, current secondary structure predictors are in general trained and evaluated on folded/globular proteins, and thus are not necessarily appropriate to identify partially structured regions in IDPs. Numerous methods have also been proposed to predict structural disorder from protein sequence ([20,21] and references therein). Most of the available disorder predictors focus on the identification of disordered regions in predominantly folded proteins. In general, they only provide a binary output (i.e. ordered/disordered) or a residue-specific disorder probability, but do not identify structural classes. Because they aim at providing different information, secondary structure and disorder predictors have traditionally been developed independently from each other. One exception is the s2D method [22], which predicts secondary structure populations and disorder in a unified framework. Like the work presented here, s2D relies on a more holistic view of IDPs by exploring structural descriptors that span the continuum between ordered and disordered proteins [23,24,12].
In contrast to the most recent approaches, which are based on intricate machine-learning techniques, here we present an extremely simple strategy to identify secondary structural propensities from protein sequences. Like machine-learning-based approaches, our method exploits structural information contained in databases. However, instead of training a machine-learning model or architecture, our approach performs simple statistical operations. These operations are based on a classification of the conformational preferences of three-residue fragments extracted from coil regions of experimentally determined high-resolution protein structures. Although small, tripeptides have been shown to encode relevant sequence-dependent structural information [25] and are valuable building blocks to model unfolded states and disordered proteins or regions [26,27,28]. Furthermore, statistical analyses of three-residue fragments have also been used as key components of knowledge-based potentials and protein fold recognition methods [29,30].
We have evaluated the performance of our method, called local structural propensity predictor (LS2P), using a benchmark of nine well-characterized IDPs. LS2P accurately predicts previously identified helical and extended regions in the benchmark. Moreover, small stretches forming β-turns or promoting α-helices emerge from the analysis of the preferred structural classes of the tripeptides within the local sequence context. The main advantage of our strategy with respect to most machine-learning-based methods for secondary structure prediction, especially those using neural networks, is that it enables a comprehensible connection between amino acid sequence and structural preferences. LS2P is publicly available through a web server at: https://moma.laas.fr/applications/LS2P
Theory
The prediction method proposed in this work, LS2P, exploits statistical information about the structural preferences of three-residue fragments, called tripeptides from now on. This information was extracted from a structural database constructed from coil regions in high-resolution protein structures. Details about the tripeptide database construction can be found in the Materials and Methods section.
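As an illustration of the general idea of working with tripeptide statistics, the following minimal Python sketch splits a sequence into overlapping three-residue fragments and converts per-class counts into relative frequencies. It is not the LS2P implementation: the class labels (`helical`, `extended`, `other`), the toy counts in `TOY_COUNTS`, and the function names are all hypothetical and chosen only for the example.

```python
from collections import Counter


def overlapping_tripeptides(sequence):
    """Return the overlapping three-residue fragments of a sequence, in order."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]


def class_frequencies(counts):
    """Convert raw counts per structural class into relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical toy database: counts of each structural class per tripeptide.
# Real counts would come from a structural database; these numbers are made up.
TOY_COUNTS = {
    "AEL": Counter({"helical": 60, "extended": 20, "other": 20}),
    "ELK": Counter({"helical": 50, "extended": 30, "other": 20}),
}


def tripeptide_profile(sequence, database):
    """Pair each tripeptide of the sequence with its class frequencies.

    Tripeptides absent from the database get an empty frequency table.
    """
    return [
        (tri, class_frequencies(database.get(tri, Counter())))
        for tri in overlapping_tripeptides(sequence)
    ]


if __name__ == "__main__":
    for tri, freqs in tripeptide_profile("AELK", TOY_COUNTS):
        print(tri, freqs)
```

How such per-tripeptide frequencies would be combined into residue-level propensities is a separate design choice that this sketch deliberately leaves open.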
To simplify the structural classification, the conformational space of each residue ri was
Identification of secondary structure propensities in IDPs: An overall picture
A benchmark set of nine structurally well-characterized IDPs was used to evaluate the performance of our approach. Specifically, MAPK kinase 7 (MKK7) [32], the fragment 945–1097 of the erythrocyte binding antigen 181 (EBA-181) [33], p15 [34], Sic1 [14], measles virus ntail (ntailMV) [35], Sendai virus ntail (ntailSV) [36], the unique domain of the src kinase (USrc) [37], K18 construct of Tau protein (K18) [38], and full-length Tau protein [39] were used in our study. Predictions of secondary
Discussion
In this work, we have investigated the ability to predict secondary structure propensities within IDPs using local sequence-dependent information encoded in small protein fragments extracted from coil regions in high-resolution protein structures. We have developed an extremely simple statistical approach based on a coarse classification of tripeptide conformations. In contrast with currently popular neural-network-based secondary structure predictors, this approach enables a comprehensible
Tripeptide database
The tripeptide database was built from a curated database of high-resolution experimentally determined protein structures. More precisely, we used protein domains from the SCOPe [55] 2.06 release. In order to remove highly redundant sequences, we used the 95% sequence-identity-filtered subset of these domains. This subset consists of PDB-style files for 28,011 domains. DSSP [45] was employed to assign secondary structure labels to each residue in these files.
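The step of collecting tripeptides from coil regions can be sketched in Python as follows, assuming the amino acid sequence and the per-residue DSSP labels of a domain are already available as two strings of equal length. Which DSSP codes count as non-coil is an assumption here (the `structured` parameter defaults to `H` and `E`), and the function name `coil_tripeptides` is hypothetical; the authors' actual criteria are not reproduced.

```python
from collections import Counter


def coil_tripeptides(sequence, dssp, structured="HE"):
    """Count tripeptides whose three residues all lie outside `structured` labels.

    sequence   : one-letter amino acid sequence of a domain
    dssp       : per-residue DSSP labels, same length as `sequence`
    structured : DSSP codes treated as non-coil (an assumption of this sketch)
    """
    if len(sequence) != len(dssp):
        raise ValueError("sequence and DSSP string must have the same length")
    counts = Counter()
    for i in range(len(sequence) - 2):
        window_labels = dssp[i:i + 3]
        # Keep the fragment only if none of its residues is labeled as structured.
        if not any(label in structured for label in window_labels):
            counts[sequence[i:i + 3]] += 1
    return counts


if __name__ == "__main__":
    # Toy input: '-' marks residues without an assigned DSSP class.
    print(coil_tripeptides("GSPDGAK", "---TT-H"))
```

Summing the returned counters over all domains of a non-redundant set would give one way to accumulate database-wide tripeptide counts.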
Each structure file was processed by
Availability
LS2P is publicly available through a web server at: https://moma.laas.fr/applications/LS2P.
The code of LS2P (in Python) and the data (number of structures for each tripeptide type and structural class extracted from high-resolution experimentally determined protein structures) are available upon request to the Lead Contact.
CRediT authorship contribution statement
Alejandro Estaña: Methodology, Data curation, Software, Writing - original draft. Amélie Barozet: Methodology, Writing - review & editing. Assia Mouhand: Investigation, Data curation. Marc Vaisset: Data curation, Software. Christophe Zanon: Software. Pierre Fauret: Software. Nathalie Sibille: Investigation, Writing - review & editing. Pau Bernadó: Conceptualization, Investigation, Supervision, Writing - original draft, Writing - review & editing. Juan Cortés: Conceptualization, Methodology,
Acknowledgments
This work was supported by the European Research Council under the H2020 Programme (2014–2020) chemREPEAT (648030) and Labex EpiGenMed (ANR-10-LABX-12-01) awarded to P.B., and the ANR GPCteR (ANR-17CE11-0022-01) to N.S. The CBS is a member of France-BioImaging (FBI) and the French Infrastructure for Integrated Structural Biology (FRISBI), two national infrastructures supported by the French National Research Agency (ANR-10INBS-04-01 and ANR-10-INBS-05, respectively).
Declaration of Competing Interest
The authors declare no conflict of interest.
References (55)
- et al. Intrinsically disordered proteins: regulation and disease. Curr. Opin. Struct. Biol. (2011)
- et al. Intrinsically disordered proteins: Emerging interaction specialists. Curr. Opin. Struct. Biol. (2015)
- et al. Interplay of protein disorder in retinoic acid receptor heterodimer and its corepressor regulates gene expression. Structure (2019)
- et al. Analysis of molecular recognition features (MoRFs). J. Mol. Biol. (2006)
- The functional importance of structure in unstructured protein regions. Curr. Opin. Struct. Biol. (2019)
- et al. Modulation of intrinsically disordered protein function by post-translational modifications. J. Biol. Chem. (2016)
- et al. Quantitative determination of the conformational properties of partially folded and intrinsically disordered proteins using NMR dipolar couplings. Structure (2009)
- et al. Characterization of intrinsically disordered proteins and their dynamic complexes: from in vitro to cell-like environments. Prog. Nucl. Magn. Reson. Spectrosc. (2018)
- et al. Protein secondary structure prediction: a survey of the state of the art. J. Mol. Graph. Model. (2017)
- et al. The s2D method: simultaneous sequence-based prediction of the statistical populations of ordered and disordered regions in proteins. J. Mol. Biol. (2015)
- Realistic ensemble models of intrinsically disordered proteins using a structure-encoding coil database. Structure
- Intrinsic disorder within the erythrocyte binding-like proteins from Plasmodium falciparum. Biochim. Biophys. Acta
- p15PAF is an intrinsically disordered protein with nonrandom structural preferences at sites of interaction with other proteins. Biophys. J.
- Structural characterization of the natively unfolded N-terminal domain of human c-Src kinase: Insights into the role of phosphorylation of the unique domain. J. Mol. Biol.
- Predictive atomic resolution descriptions of intrinsically disordered hTau40 and α-synuclein in solution from NMR and small angle scattering. Structure
- Protein disorder prediction: implications for structural proteomics. Structure
- SPOT-Disorder2: improved protein intrinsic disorder prediction by ensembled deep learning. Genom. Proteom. Bioinf.
- Structure/function implications in a dynamic complex of the intrinsically disordered Sic1 with the Cdc4 subunit of an SCF ubiquitin ligase. Structure
- Characterization of amino acid sequences in proteins by statistical methods. J. Theor. Biol.
- The anatomy and taxonomy of protein structure
- Dynamic protein interaction networks and new structural paradigms in signaling. Chem. Rev.
- Intrinsically disordered proteins in cellular signalling and regulation. Nat. Rev. Mol. Cell Biol.
- Functional anthology of intrinsic disorder. 1. Biological processes and functions of proteins with long disordered regions. J. Proteome Res.
- Fuzziness: linking regulation to protein dynamics. Mol. BioSyst.
- Short linear motifs: Ubiquitous and functionally diverse protein interaction modules directing cell regulation. Chem. Rev.
- Interactions via intrinsically disordered regions: what kind of motifs? IUBMB Life
† Lead contact.