The Immunological Genome Project (ImmGen) is a collaborative group of immunology and computational biology laboratories that perform a thorough dissection of gene expression and its regulation in the immune system of the mouse. This activity first centered on mRNA expression and then expanded to microRNA (miRNA), chromatin structure, nuclear organization and protein–RNA relationships. Shared protocols, data generation and QC pipelines have yielded data that can be directly compared from >250 stem, lymphoid and myeloid cell types, at baseline or under challenge. The group develops and applies computational tools to decipher regulatory connections and transcriptional control. From its inception, data generated by ImmGen were meant to be a public resource, and they can be accessed through dedicated web and smartphone platforms whose interactive graphic displays make the results intuitive to users.

Basic tenets

ImmGen has been an interesting example of consortium science1, wherein each member performs a focused exploration of their particular interest and, in doing so, contributes to a larger whole. Participation resulted in varied experiences, as reflected in collected soundbites at https://docs.google.com/spreadsheets/d/1_nNvGduRXox0sqfDSydxoLNH0eZ50W4RVMJ7hGNMZfw/edit#gid=0. Several tenets have distinguished this activity. First, in vivo veritas: only ex vivo cells are analyzed, which avoids the biases of established cell lines or cytokine-dependent primary cultures (a choice that certainly carried technical challenges). Second, to ensure data are as comparable as possible, shared pipelines are used for the genomic steps (profiling and ATAC-seq (assay for transposase-accessible chromatin using sequencing)), performed on cells that have been sorted according to rigorous standards in different labs and that are shipped to a central location. Finally, to match computational with experimental rigor, Venn diagrams were banished.

Catherine Laplace’s rendition of ImmGen productions.

ImmGen is all mouse. Naturally, an extension to the human immune system was suggested many times but was not implemented, principally because the magnitude of the task would overwhelm a group already stretched thin, requiring new structures and logistics. On the other hand, as highlighted by ImmGen results2,3, the immune systems of mice and humans are remarkably similar in structure and regulation, when properly compared (every recent success in immunotherapy derives from mouse pilots). In addition, ImmGen results proved instrumental in opening vistas into the human immune system. While exploring dendritic cells (DCs) from mouse parenchymal or lymphoid organs, we discovered a module of genes distinguishing ‘migratory DCs’ that had descended to the draining lymph node from both the DCs in the tissue they came from and lymphoid-resident DCs4. DCs with precisely the same pattern of gene expression were recognizable in single-cell RNA sequencing (scRNA-seq) of human tumors, and we recently realized that the signature was in fact independent of migration and was instead triggered by uptake of cell-associated antigens and includes a strong immunoregulatory component. In addition, ImmGen data browsers do host human datasets (including topical COVID-19-related datasets), and several ImmGen members partake in the Immune Cell Atlas within the Human Cell Atlas5.

Evolving technologies: what is a cell type?

While the ultimate goals and mission have not really changed, technological advances have expanded the breadth and depth of the project (accompanied by some turnover in groups and several students and postdocs becoming ImmGen principal investigators). Starting from quasi-exclusive profiling of protein-coding genes, explorations expanded to include miRNAs (S.R. et al., unpublished observations), chromatin marks and structure by ATAC-seq and ChIP–seq (chromatin immunoprecipitation followed by sequencing) in the Buenrostro and Josefowicz labs6, and three-dimensional nuclear architecture (HiChIP, HiC (high-throughput chromosome conformation capture)). Finally, ImmGen is beginning to relate mRNA and protein expression in a large proteogenomics effort with BioLegend, using DNA-tagged antibodies to combine transcriptome and surface proteome data.

ImmGen profiling started with Affymetrix arrays, which yielded tight data for a 30,000-cell input, allowing identification of cell populations at high resolution. Replacement by ULI RNA-seq (ultra-low-input RNA sequencing) improved information in the low-expression range, and the 1,000-cell requirement opened the door to profiling of rare populations such as non-classical T cells or innate lymphoid cells. A third expansion came with the advent of scRNA-seq, in which some ImmGen members played a lead role. Single-cell profiling could be construed as the final step of the ‘subset-splitting’ that immunologists have long engaged in (B and T lymphocytes, and so on). It represented a sea change in ImmGen’s operational principle (which, until then, was anchored by expertly defined cell populations), as it held the promise of an unbiased and definitive atlas of cell types. But scRNA-seq did not quite yield the clarity that was hoped for. For some lineages, for example, DCs, existing populations were confirmed, and new populations emerged. But this was not the case for other cells, such as T cells, for which previously defined subsets have melted away (E. Kiner et al., unpublished observations), and even boundaries with other lineages have become blurred. For instance, in splenocyte datasets, natural killer (NK) and CD8+ T cells run together. There may be technical reasons (for example, sparsity of scRNA-seq data), but this also brings back earlier observations from population profiling: in early principal component analysis plots, NK cells and activated CD8+ T cells were surprisingly close, and kinetic analysis of NK cells and CD8+ T cells responding to viral infections by the Lanier and Goldrath labs7,8 identified shared features of these cytotoxic lymphocytes, with parallels between resting NK cells and ‘central memory’ CD8+ T cells, or cytomegalovirus-memory NK cells and vesicular stomatitis virus–specific ‘effector memory’ CD8+ T cells (similar parallels could be made between T cells and other innate lymphoid cell populations in the Colonna lab9). In the overall design of the immune system, NK and CD8+ T cells are very different actors, separated by the adaptive T cell receptor. This then raises the question of how to define a cell type’s identity: should it be defined by its broader transcriptome or by one or a few determining genes?
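The tension between a transcriptome-wide and a marker-based definition of identity can be illustrated with a toy calculation. The sketch below uses synthetic data only, not ImmGen profiles: the gene counts, module sizes, noise level and the function names (`make_toy_profiles`, `pca_scores`, `centroid_distance`, `marker_gap`) are all assumptions made for illustration. Two populations share a large hypothetical "cytotoxic" module and differ by a handful of "determining" genes, while a third population carries an unrelated module.

```python
import numpy as np

def make_toy_profiles(seed=0, n_genes=200, n_reps=3):
    """Build synthetic log-expression profiles for three hypothetical populations."""
    rng = np.random.default_rng(seed)
    base = rng.normal(5.0, 1.0, size=n_genes)
    programs = {"NK": base.copy(), "CD8_T": base.copy(), "B": base.copy()}
    programs["NK"][:40] += 4.0        # shared cytotoxic-like module (assumed)
    programs["CD8_T"][:40] += 4.0
    programs["CD8_T"][40:43] += 4.0   # a few determining genes (assumed)
    programs["B"][100:140] += 4.0     # unrelated module (assumed)
    rows, labels = [], []
    for name, profile in programs.items():
        for _ in range(n_reps):
            rows.append(profile + rng.normal(0.0, 0.3, size=n_genes))
            labels.append(name)
    return np.array(rows), labels

def pca_scores(expr, n_components=2):
    """Project samples (rows) onto the top principal components via SVD."""
    centered = expr - expr.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def centroid_distance(scores, labels, a, b):
    """Euclidean distance between the mean positions of populations a and b."""
    labels = np.asarray(labels)
    return float(np.linalg.norm(scores[labels == a].mean(axis=0)
                                - scores[labels == b].mean(axis=0)))

def marker_gap(expr, labels, gene, a, b):
    """Difference in mean expression of one gene between populations b and a."""
    labels = np.asarray(labels)
    return float(expr[labels == b, gene].mean() - expr[labels == a, gene].mean())

expr, labels = make_toy_profiles()
scores = pca_scores(expr)
print("NK vs CD8_T in PC space:", centroid_distance(scores, labels, "NK", "CD8_T"))
print("NK vs B in PC space:    ", centroid_distance(scores, labels, "NK", "B"))
print("Gap at determining gene:", marker_gap(expr, labels, 40, "NK", "CD8_T"))
```

In this toy setting the two populations sharing the large module sit close together in principal component space, yet a single determining gene separates them cleanly. Which of the two views defines the cell type is the open question, and the code does not settle it.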

Success stories

One of the key assets of ImmGen has been the deep and complementary expertise of its participants, with each of the member laboratories bringing knowledge and know-how on specific facets, lineages or cell types. This unique set has allowed the group to specialize in a range of cells, from stem cells in the Wagers and Rossi labs10, to mast cells in the Austen lab11, to stromal cells in the Turley lab12, and every lymphocyte in between. The results obtained for macrophages further illustrate this diversity. It is obvious today that tissue-resident macrophages are remarkably diverse in origin and phenotype, although this was not the case when ImmGen started. The Turley lab was initially asked to analyze all ‘accessory’ myeloid cells, realized that this would be a monumental task, and recruited the Merad and Randolph labs. The inclusion of macrophage populations was considered necessary but was primarily a way to improve our understanding of the DCs, which were more fashionable at that time. Research on macrophages was a backwater compared with that on DCs, and macrophages in different tissues were presumed to be functionally interchangeable. Work within ImmGen helped to rectify that erroneous conception, uncovering unexpectedly large differences between macrophages from different tissues13. Because the consortium allowed comparison with all the other immune cell types profiled, macrophages stood out as the most diverse among the lineages. This appreciation had practical impacts: never again would it be acceptable to utilize simplistic strategies to universally identify macrophages as CD11b+, or DCs as the only myeloid cells expressing CD11c. This demonstration of the diversity of macrophages (and mononuclear phagocytes (MNPs) in general) was recently taken to the next level with ImmGen’s MNP OpenSource program, in which many labs outside ImmGen contributed to what is an astounding collection of data14 (A. Gainullina et al., unpublished observations). Similarly, ImmGen studies in the Monach lab also shone a new light on neutrophil diversity during activation15.

The analysis of long-lived tissue-resident mast cells (MCs)11 also illustrated the power of complementarity. Both MCs and basophils express the IgE receptor Fcer1a, although they predate IgE by hundreds of millions of years. Basophils emerge from the bone marrow as mature effectors with a short lifespan, whereas MCs only mature within tissues. ImmGen provided the unique opportunity to place these ancient cells within the context of the modern mammalian immune system. MCs proved to be incredibly distinct, forming an independent cluster separate from lymphoid and other myeloid cells. By contrast, basophils clustered with eosinophils and neutrophils and had far more in common with other circulating granulocytes than with MCs. We also identified a core connective tissue MC signature distinct from mucosal mast cells.

Finally, areas too big for any of us to tackle alone included the full differentiation cascade of B cells, undertaken in the Hardy and Nutt labs, and the large galaxy of T lymphocytes, which spans somatically adaptive and innate-like moieties. It took the combined expertise of the Kang (γδ T cells16), Brenner and Kronenberg (non-classical innate-like αβ T cells17), CBDM (differentiation18) and Goldrath and Dustin (activation, effector and memory8) laboratories to stitch together what is certainly the broadest survey of transcription and the chromatin accessibility landscape anywhere. Whereas effector T subtypes share gene programs to execute their function, distinctions emerged in how functional specialization is achieved, from preprogramming of innate-like effector subtypes to becoming adaptive effectors post-antigen experience.

Mistakes and horror stories

Naturally, the group had ups and downs, some mistakes and dead-ends. At an unusually tense ImmGen workshop several years ago, the core team was taken to task for its persistent failure to develop a robust RNA-seq protocol compatible with low cell numbers. Luckily, the Broad Technology Labs devised a fabulous implementation of SmartSeq2 for low-input profiling that sidestepped RNA purification, and it has been ImmGen’s workhorse ever since. Smiles returned. Some cell populations had to be pulled from the website because of contamination issues that had not been initially realized (in some cases, because surface markers for sorting were not as specific as had been thought). Proteome profiling also proved challenging. In spite of the best intentions and efforts, collaborations with the systematic mouse mutagenesis programs (M. Malissen and Phenomin, Knockout Mouse Project (KOMP)) never yielded very striking results, partly because gene choices and timelines proved hard to articulate, and because many single knockouts turned out to have little impact on the transcriptomes of CD4+ T cells or macrophages, indicating a strong resilience of the regulatory network.

Extracting the “substantifique moelle”

The 16th-century writer Rabelais coined the metaphor “substantifique moelle”, advocating for meditation on texts to extract deep meaning and knowledge, as one would extract marrow from bone, or as computational analysis is able to extract meaning from Big Data. As reflected by some of its founding members (Koller, Collins and Regev), ImmGen always aimed to go beyond mere cataloging and used computational mining to exploit the data for implications of regulatory connections within the immune system. Tools for network inference were being developed in the early 2000s, and we set out to reconstruct the regulatory network of the mouse immune system. The data were huge and dynamic: new cell types were added continuously, forcing recomputation every few months. A few thousand genes capture the dynamics of most systems, but, for the entire immune system, even 8,000 genes seemed small; there were long debates on the depth of clustering that should be applied across several hundred samples, representing >300 cell types (a compromise was reached on two levels of clustering, coarse or fine). The next step entailed developing a novel algorithm that exploited stepwise transitions in the differentiation cascades to infer the regulatory transcription factors19. This type of approach was carried further through several projects6,20,21,22 in the Shay and Mostafavi labs. Most recently, we exploited the potential of deep neural networks to learn and ‘understand’ highly complex and non-linear relationships in large datasets. A trained deep neural network can accurately predict, from DNA sequence alone, the activity of enhancers across the whole immune system and the transcription factors that mediate this activity3. In an overnight run, the machine rediscovers 30 years of hard-won immunogenetics. Humbling!
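To give a sense of what "from DNA sequence alone" can mean in practice, the sketch below shows one common way of presenting sequence to a neural network: one-hot encoding followed by a sliding filter, which is the operation performed by the first layer of a convolutional model. This is an illustration under stated assumptions, not the architecture of the model in ref. 3. The motif `GGAA`, the match and mismatch weights and all function names are hypothetical choices, and a real network would learn its filter weights from data rather than take them from a fixed motif.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases such as N give all-zero rows."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4), dtype=float)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded

def motif_filter(motif, match=1.0, mismatch=-1.0):
    """Build a toy (width, 4) filter that rewards the given motif (hand-set weights, an assumption)."""
    weights = np.full((len(motif), 4), mismatch)
    weights[one_hot(motif) == 1.0] = match
    return weights

def scan_filter(encoded, weights):
    """Slide the filter along the sequence and return one score per valid offset."""
    width = weights.shape[0]
    n_offsets = encoded.shape[0] - width + 1
    if n_offsets <= 0:
        return np.zeros(0)
    return np.array([np.sum(encoded[i:i + width] * weights) for i in range(n_offsets)])

def max_pooled_activation(seq, weights):
    """Rectify and max-pool the filter scores into a single feature for the sequence."""
    scores = scan_filter(one_hot(seq), weights)
    return float(max(scores.max(), 0.0)) if scores.size else 0.0

weights = motif_filter("GGAA")
for seq in ("TTGGAATT", "TTTTTTTT"):
    print(seq, max_pooled_activation(seq, weights))
```

A trained network stacks many such filters and combines their pooled outputs through further layers to predict activity per cell type. Inspecting what the learned filters respond to is one route by which such models can point back to transcription factors.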

Value as a resource

One of the ‘wow moments’ came when we started monitoring traffic to the website and realized that ~50 visitors quietly consulted the site every day, whereas we had expected one or two. Someone was seeing value in the effort. This number has now grown and has stabilized at ~250 independent visits per weekday, and the smartphone app also has widespread use23. The ImmGen data browsers are different from the portals of many consortia. They do not aim to serve raw datasets (repositories like the Gene Expression Omnibus do this far more professionally), but are designed to answer the diversity of “Show me…” questions that an immunologist might raise, and they aim for consultation rather than downloads, and graphic visualization rather than tables. Although the Skyline expression histograms account for half the traffic, almost all other tools are queried >100 times per month. To be fair, ImmGen data browsers are an idiosyncratic lot with respect to design and architecture, contributed by different software developers over time. Some are a bit quirky, and interconnections could be improved. David Laidlaw, who helped launch ImmGen visualizations, argued for a lightweight, flexible and evolving assemblage, rather than a large architected machine with industrial-strength software engineering. But ImmGen browsers are robust: the original Skyline developed 15 years ago by the Park–Seguritan–Hyatt trio is still running strong.

Nature Immunology

Fittingly for this anniversary issue, the special relationship between Nature Immunology (NI) and ImmGen was instrumental to the latter’s success. The Berlin Accord (really an impromptu discussion with a senior NI editor at a poster session) acknowledged that data-rich and descriptive ImmGen reports were inherently worthwhile, and could be published without the “driving hypothesis”, “mechanistic insight” and knockout follow-ups requested by Reviewer 3. However, there was a stipulation that the landscape information be broad and integrative, and it had to generate truly novel insights from the cell studies, not mere gene lists. ImmGen reports published in refs. 7,8,9,11,12,13,16,17,18,19,21 met these criteria, and the agreement gave ImmGen the freedom of mind to pursue its endeavors for true resource building. But, contrary to rumors, ImmGen had no free pass at NI.

When is ImmGen finished?

What lies ahead for ImmGen? Answers to this question are exciting and tantalizing. The overall vision could be stated as providing a resource that details the expression of every gene and protein, the regulatory elements (enhancers and nuclear structures, transcription factors and regulatory RNAs) that control their expression, and how this network is brought to bear for organismal homeostasis and immune responses to challenges.

On the way to this elusive holy grail, equally ambitious stepping stones may be to fully define cell types (whether as discrete cell types or as continua, which may require a new vocabulary), to generate a developmental atlas of immune cells across tissues and time, to generate a comprehensive chart of all cis-regulatory elements that our field can adopt as a roadmap, and to link transcriptomes and proteomes. These monumental challenges will also require us to harness the all-encompassing power of machine learning and, from these foundational questions, to ask whether the blueprint provides robust nodes that can be utilized to bolster or repair fragilities and blind spots in immunity.