To the Editor — Much of life science research revolves around understanding the biological function of proteins. Some proteins, such as the tumor suppressor p53, have been studied extensively1. By contrast, thousands of human proteins remain ‘understudied’: their biological function is poorly understood and annotation of their molecular properties is scarce2,3,4,5,6. However, without a minimal amount of molecular annotation, it is difficult to formulate effective research questions and design experiments to investigate the function of these proteins in mechanistic detail2.

The disparity in how much we know about individual proteins leads to a phenomenon known as the ‘streetlight effect’ or the ‘rich-get-richer syndrome’, in which research in a field preferentially targets proteins that are already well studied7. There are many reasons for this, including practical considerations (for example, the abundance, solubility and size of a protein), the ease of designing a research plan that depends on available knowledge (for example, knockout phenotype, molecular interactions) and the availability of tools such as antibodies. In addition, working on proteins that already receive a lot of attention (for example, some disease-associated proteins) increases the chances of high-impact publications and funding. Hypothesis-driven (rather than question-driven) research may also contribute, as hypothesizing about the potential function of a completely uncharacterized protein is nearly impossible. Finally, some proteins may remain understudied because they are not expressed or required under standard laboratory conditions. Ironically, this part of the problem stems from the widespread effort to make research more reproducible by standardizing experimental conditions.

One counter-argument is that the important proteins are being studied and the others are not as important to pursue. The evidence suggests otherwise: genome-wide studies show that research attention bias does not reflect the importance of genes for cellular processes and human disease2,5. For example, more than half of the host genes implicated in COVID-19 identified by genome-wide studies have not been pursued in more detail by targeted studies of the COVID-19 field8. Furthermore, the creation of a synthetic minimal bacterium required 149 proteins of unknown function9. If these proteins are crucial for the most-minimal cell possible to survive, they should be important to us.

As current approaches to study proteins often reinforce the streetlight effect, we seek to pursue a different approach. We propose that a coordinated effort of the functional proteomics field could be an effective way to systematically advance the basic molecular characterization of understudied proteins, such that detailed studies become more feasible. With the goal of openly discussing, coordinating and initiating efforts to address these challenges, we established the Understudied Proteins Initiative10, with participation of the Wellcome Trust (Fig. 1). In essence, for each understudied protein, we aim to provide enough molecular information (for example, protein interactions, colocalization or coexpression) that hypotheses about its putative function can be made. Importantly, this should make it clear which field or laboratory with a particular research focus would be best placed to carry out further detailed studies of the protein. Thus, the giant task of characterizing the many understudied proteins is split into two parts: a large-scale precharacterization by omics laboratories, and a focused detailed investigation by molecular biology laboratories.

Fig. 1: Roadmap of the Understudied Proteins Initiative.

Stages 1 and 2 focus on defining the challenge and building a community. First, a survey among biomedical researchers (https://understudiedproteins.org/survey) will define the minimal information needed to counterbalance the current data bias that works against understudied proteins being included in mechanistic investigations. The survey will also reveal how many proteins are to be considered understudied and provide the data to train an algorithm to automatically assess annotation bias in the future. In addition, the survey will reveal at which locations researchers look for annotation and thus where new annotations should be added. In a second step, a workshop will bring together experts in different disciplines and technologies that provide large-scale data for systematic annotation of proteins to establish the framework of a coordinated understudied proteins initiative. The six action areas to be discussed are data generation, data integration, dissemination of results, assessment of progress, model systems and conditions to cover, and quality control. This will then lead to stage 3, the experimental work that will see a collaborative effort of many laboratories to tackle the problem of understudied proteins.

Choosing the right tools and experiments for such a large-scale data-generation effort requires critical input before data collection begins. As a first step, we have recently launched an openly accessible survey to allow us to better understand which human proteins remain understudied, what the minimal information is that would kick-start their inclusion in mechanistic investigations and where this information should be available (https://understudiedproteins.org/survey). Scientists who engage in mechanistic investigations are best placed to define this.

As a second step, we will then gather experimentalists and computational experts interested in large-scale approaches at a conference (https://understudiedproteins.org/conference) to discuss and identify ways to deliver this information. Ultimately, individual researchers stand to gain from the results of this initiative whenever they face new proteins in an ongoing study and need to prioritize novel targets for further investigation.

Survey participants will be shown a randomly selected human protein and asked to assign it to one of three annotation levels. In addition, they will declare which tools and resources were used for that assessment and what information they regard as important before starting experimental work with a new protein. We envision that respondents will need no more than five minutes per protein. Each protein will be presented to multiple participants, allowing us to average responses and capture the range of different interpretations and assessments of a protein’s annotation level. In this way, the survey will deliver a manually curated assessment of the annotation level of human proteins. Although scores exist that express various aspects of protein annotation3,6,11,12, our survey will return a score that specifically expresses how amenable a protein is to detailed mechanistic investigations.
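The aggregation step described above can be illustrated with a minimal sketch. It assumes that the three annotation levels are encoded as the integers 1 (least annotated) to 3 (best annotated) and that each response arrives as a (protein, level) pair; this encoding, the function name and the protein identifiers are hypothetical and are not taken from the survey itself.

```python
from collections import defaultdict
from statistics import mean


def summarize_votes(votes):
    """Collapse individual survey responses into a per-protein summary.

    votes: iterable of (protein_id, level) pairs, where level is assumed
    to be 1, 2 or 3 (hypothetical encoding of the three annotation levels).
    Returns {protein_id: {"n", "mean", "min", "max"}}.
    """
    by_protein = defaultdict(list)
    for protein_id, level in votes:
        if level not in (1, 2, 3):
            raise ValueError(f"unexpected annotation level: {level!r}")
        by_protein[protein_id].append(level)

    # The mean gives the averaged score; min and max capture the range
    # of interpretations among participants.
    return {
        pid: {"n": len(lv), "mean": mean(lv), "min": min(lv), "max": max(lv)}
        for pid, lv in by_protein.items()
    }


# Illustrative responses for two made-up proteins.
example_votes = [
    ("protein_A", 3), ("protein_A", 3),
    ("protein_B", 1), ("protein_B", 2), ("protein_B", 1),
]
summary = summarize_votes(example_votes)
for pid, stats in summary.items():
    print(pid, stats)
```

Reporting the range alongside the mean keeps disagreement between participants visible instead of hiding it behind a single number.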

Next, we will cross-reference this vote-based annotation score with the quantifiable annotation information available for the same protein and its homologs in publicly available resources named by participants and others, which could include PubMed, STRING, BioGRID, UniProt, GeneCards, Wikipedia, Complex Portal and the Human Protein Atlas. This collated information will reveal key characteristics of understudied proteins, such as what type of quantifiable experimental evidence is available or lacking, and where it is accessible. Notably, this understanding is not limited to human proteins and will guide the extension of our efforts toward other species.

The free-text answers from survey respondents will allow us to cross-check whether our data-based assessment agrees with what participants regard as the minimal information that makes a protein a viable target of study, and where and how annotation should be accessible. In addition, on the basis of the annotation score and the cross-referenced quantifiable annotation information, we will train a machine-learning algorithm to automate the annotation scoring. An automated annotation scoring system will allow us to keep scores up to date, assess proteins of other species and transparently monitor progress in protein annotation over time. All of this depends on participation: if a sizeable proportion of the community that reads this Correspondence and the paper in Nature Methods10 participates in the survey and shares it with colleagues, we will build a community-driven foundation for the Understudied Proteins Initiative.
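As a rough illustration of how a survey-derived score could be predicted from quantifiable annotation, the sketch below uses a k-nearest-neighbour regressor. The text does not specify which machine-learning algorithm will be used, so this choice is only a stand-in. The two features (log10-scaled publication count and log10-scaled number of recorded interaction partners) and all values are invented for the example.

```python
import math


def knn_predict(train, query, k=3):
    """Predict an annotation score as the mean score of the k nearest proteins.

    train: list of (features, score) pairs from surveyed proteins.
    query: feature list of a protein that was not surveyed.
    Features are assumed to be on comparable scales (here log10 counts).
    """
    if not train:
        raise ValueError("training set is empty")
    # Rank surveyed proteins by Euclidean distance to the query protein.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return sum(score for _, score in nearest) / len(nearest)


# Hypothetical training data: ([log10 publications, log10 interactors], score)
surveyed = [
    ([3.5, 2.0], 3.0),
    ([3.0, 1.8], 2.8),
    ([1.5, 1.0], 2.0),
    ([0.3, 0.0], 1.0),
    ([0.0, 0.3], 1.2),
]

# A sparsely annotated protein lands close to the low-scoring examples.
print(knn_predict(surveyed, [0.2, 0.1], k=2))
```

Any regression or classification method could take the place of this one. What matters is that the human-assigned scores act as training labels and the database-derived counts act as features, so that scores can be recomputed whenever the databases change.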

With a clear understanding of what constitutes the experimental information that would make an understudied protein amenable to study, we will then start a discussion with funding agencies on how to set up calls aimed at providing this information. A critical component will be the evaluation of the effect of different information sources, facilitated by our automated annotation scoring. We will reveal the benefit of the respective datasets and approaches by monitoring the rate of annotation of understudied proteins. Measuring the effect of large-scale data will inform the effective use of funding, but also highlight where technology developments are needed to fill any systematic gaps left by current tools. Instead of lots of data, we aim to generate meaningful data. Eventually, thousands of laboratories around the world will be able to add those currently understudied proteins that fall into their own fields of interest to ongoing and future mechanistic investigations, thereby ending the era of understudied proteins.

Our initiative complements those that have a strong emphasis either on bacterial proteins (COMBREX13 and the Enzyme Function Initiative14) or on protein–small molecule interactions, such as the Structural Genomics Consortium5,15, Open Targets16 and the Illuminating the Druggable Genome program6, which aims to improve our understanding of uncharacterized proteins within the three most commonly drug-targeted protein families (G-protein-coupled receptors, ion channels and protein kinases).
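Monitoring the rate of annotation, as described above, could be as simple as tracking the fraction of proteins whose automated score falls below a cut-off at successive time points. The sketch below assumes a score on a 1 to 3 scale and a cut-off of 1.5. Both are hypothetical, as are the snapshot labels and values.

```python
def understudied_fraction(scores, threshold=1.5):
    """Fraction of proteins whose annotation score is below the threshold.

    scores: {protein_id: score}; the threshold is an assumed cut-off.
    """
    if not scores:
        raise ValueError("no scores supplied")
    below = sum(1 for s in scores.values() if s < threshold)
    return below / len(scores)


def annotation_progress(snapshots, threshold=1.5):
    """Understudied fraction per snapshot, in the order given.

    snapshots: list of (label, {protein_id: score}) in chronological order.
    """
    return [
        (label, understudied_fraction(scores, threshold))
        for label, scores in snapshots
    ]


# Two made-up snapshots of the same four proteins.
snapshots = [
    ("before", {"A": 1.0, "B": 1.2, "C": 2.8, "D": 1.4}),
    ("after", {"A": 1.0, "B": 2.1, "C": 2.9, "D": 1.9}),
]
for label, fraction in annotation_progress(snapshots):
    print(label, fraction)
```

Comparing such curves for proteins covered by a given dataset with those that are not covered would be one way to estimate the benefit of that dataset.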

By providing a basic molecular characterization of all proteins, the Understudied Proteins Initiative will catalyze mechanistic investigations of understudied proteins, drive new biomedical research, and boost our understanding of the human proteome and its role in disease. We invite the community to get involved by participating in the survey and spreading the word.