Introduction

Aging refers to the systematic decline in cellular and organismal function over time. The ubiquity of age-related disease makes chronological age the single most important risk factor for morbidity and mortality [1]. Interventions to slow, delay, or even reverse the aging process thus have the potential to mitigate multiple age-related pathologies [2].

To quantify the effectiveness of such interventions, it is necessary to have a reliable measure of biological age. Aging timers, or clocks, accomplish this by using specific biomarkers whose states change systematically with chronological age. A variety of biomarker modalities have been studied, particularly not only epigenetic, but also transcriptomic, proteomic, and metabolomic, among others [3, 4]. A necessary step in the development of current aging clocks is to show that the chosen biomarker states are associated with chronological age across a population. This correlation captures the average changes over lifespan and establishes a baseline to which individuals can be compared. A desirable property of these biomarker timers is that they be directly linked to the hallmarks of aging [5]. This potentially allows the biomarker states to be interpreted in terms of the mechanisms of aging.

Genome instability due to somatic mutations is the first hallmark of aging [5]. In blood, mutations can lead to somatic mosaicism and eventually clonal hematopoiesis, where cell populations harboring particular allele variants outgrow others. Animal models of clonal hematopoiesis have been shown to contribute to disease progression [6]. More generally, diseases characterized by accelerated aging typically involve the increased accumulation of DNA damage [7]. The idea that somatic mutations can drive clonal expansion has stimulated renewed interest in the mutational theory of aging. This represents a new mechanism by which mutations can lead to aging phenotypes [8, 9] and is distinct from earlier proposals which treated absolute mutation burden as a sufficient cause for organismal aging [10]. Given its importance as a driver of aging, it would seem that somatic evolution could form the basis for a new type of aging timer.

Somatic mutations (single-nucleotide variants (SNVs) and copy-number variants (CNVs)) are naturally occurring barcodes [11] that enable phylogenetic inference of cell lineage trees (cell trees from now on). Cell trees are a representation of the mitotic branching order and clonal structure of a sampled cell population [12, 13]. These partial cell trees are subtrees of the whole organismal cell lineage tree which in an adult human consists of tens of trillions of cells [14]. The shape of a tree refers to the ordering and length of its branches and reflects the clonal structure and evolutionary distances between cells.

The central conjecture behind our proposed aging timer is that the structure, or “shape,” of cell trees is a representation of the biological aging process (Csordas, 2019, https://doi.org/10.7287/peerj.preprints.27821v7). There are two reasons for this hypothesis. The first is that phylogenetic systematics has long shown how genetic distances between species existing today reflect evolutionary changes in the past. It is reasonable then to expect that genetic distances between single cells can be used to infer the somatic evolutionary history of cells, a driver and indicator of aging. The second is that biomedical life history can leave its imprint on the cell tree [15], providing a record of major transitions in the aging process. An additional benefit of cell trees is that they provide an intuitively appealing representation of the dynamics of aging that naturally lends itself to interpretation.

Using human peripheral blood cells from healthy individuals (n=18, age range 21–82 years of age), we have developed a new aging timer called Cell Tree Rings (CTR) with the following characteristics:

  1. 1.

    Naturally occurring somatic single-nucleotide variants (SNVs) are used to build cell trees using standard phylogenetic algorithms,

  2. 2.

    SNVs are called directly and de novo from scRNA-seq data from hundreds or thousands of cells,

  3. 3.

    A broad set of tree metrics is used to identify aspects of tree shape that are associated with chronological age using a penalized multiple regression model,

  4. 4.

    The model is used to predict a cell tree age for individuals.

Two different types of phylogenetic algorithms, UPGMA and maximum likelihood, are shown to produce a working cell tree age model, providing extra evidence for the hypothesis. Importantly, the model is validated with public data as an independent test set. The predicted cell tree ages are also shown to correlate with some clinical blood biomarkers, for instance glucose, albumin, leukocytes, and monocytes (see Supplementary Information: Clinical Markers).

Methods

Experimental data and protocol

Biological sample collection and isolation of cells

In total, 18 blood samples, 5 ml each, have been collected by venipuncture at the Healthy Longevity Clinic (HLC) in Prague, Czech Republic. The samples have been taken with informed consent from healthy patients of the clinic. The Healthy Longevity Clinic Ethical Committee has reviewed and approved the Tree Ring Pilot observational study protocol with the reference number 20220301_001. The age range of the volunteers was 21–82 years old at the time of blood collection, 10 volunteers were males and eight were females. Samples were processed by the same protocol. In the following the data related to the first six samples are detailed. Viable peripheral blood mononuclear cells (PBMCs) were isolated from the collected biological sample. Four milliliters of peripheral blood was diluted with 4 ml of 2% fetal bovine serum (FBS) in phosphate-buffered saline (PBS). Subsequently, 8 ml of diluted peripheral blood was carefully layered on top of 4 ml of a density gradient (such as Lymphoprep™) and centrifuged at 300g for 30 min. The cells were carefully harvested from the interface with a plastic pasteur pipette. Then, another 6 ml of 2% FBS/PBS was added to the cells and centrifuged at 300g for 8 min discarding the supernatant and resuspending the cells in 1 ml of lysis solution. After 1-min incubation on ice, 4 ml of 2% FBS/PBS was added to the cells and centrifuged at 300g for 5 min discarding the supernatant and resuspending the cells in 1 ml of 2% FBS/PBS. Subsequently, the vitality and concentration of cells were determined through Acridine Orange and Propidium Iodide assay at LUNA Automated Cell Counter. Cell concentration range was between 3.72×106 and 6.35×106 b/ml, and cell viability was between 99.1 and 99.7%.

Labeling the cells with CellPlex

The cells were labelled with molecular tags or CellPlex (according to original protocol CG000391 Cell Labeling with Cell Multiplexing Oligo RevA). Later, a specific volume of each sample was transferred into new 2-ml tubes, and, after labeling, the cells were washed three times with 2% FBS/PBS (compared to two times in the original protocol). After the last wash, the cells were resuspended in 600 μl of 2% FBS/PBS and counted at LUNA. Cell concentration range was between 3.15×106 and 3.95×106 b/ml, and cell viability was between 99.3 and 99.7%, post-labeling.

The samples were pooled proportionally, and the final pool was passed through a 30-μm filter. Finally, the cells were counted and diluted to optimal concentration.

Loading and library preparation

Cells were loaded on the Chromium Controller and libraries prepared according to the original protocol CG000390 Chromium Next GEM Single Cell3 v3.1 Cell Surface Protein Cell Multiplexing RevB, aiming for 16,000 recovered cells. Some library preparation steps were modified slightly. During cDNA amplification, the polymerization step was extended to 1.5 min. After cDNA purification, the samples were split into two aliquots (A and B) that were processed in parallel and differed only in the implementation of size selection. After fragmentation, double size selection was modified for samples according to Table 1.

Table 1 Double-sided size selection using SPRIselect beads

After PCR amplification, both samples were purified using SPRIselect beads according to Table 1. Finally, the quality and quantity of libraries were determined using the Fragment Analyzer and QuantiFluor dsDNA System.

The various chemicals or kits used were Next GEM Chip G Single Cell Kit, Next GEM Single Cell 3’ Gel Beads Kit v3.1, Next GEM Single Cell 3’ GEM Kit v3.1, Dynabeads MyOne Silane, Next GEM Single Cell 3’ Library Kit v3.1, Single Index Kit T Set A, 3’ CellPlex Kit Set A, 3’ Feature Barcode Kit, and Dual Index Kit NN SetA.

Sequencing

Library pools were sequenced on an Illumina NovaSeq 6000 using the S4 300-cycle kit and with 150-bp-long R2.

Bioinformatics PROCESSING of the experimental data

The output sequencing files have been processed with Cell Ranger v6.0.2 and the indexed paired-end bam files have been converted into fastq files with bamtofastq v2.30.0. The fastq files and the identified barcode list were used for further processing.

General schematics of the Cell Tree Rings computational workflow

The Cell Tree Rings computational pipeline involves four consecutive steps, with names of the sub-pipelines highlighted in bold.

  1. 1.

    Tizkit: Barcode specific calling of SNVs with scSNV and germline filtering.

  2. 2.

    Tiznit: Generating fasta files and phylogenetic inference of cell trees.

  3. 3.

    AgeTreeShape: Compute tree features and univariate regression on age.

  4. 4.

    CellTreeAge: Building multiple penalized regression model and cell tree age prediction.

The following four sections provide further details of this workflow.

Somatic mutation calling de novo from scRNA-seq

We have used scSNV v1.0b [16] to call somatic mutations, specifically single-nucleotide variants, directly from 10x Genomics scRNA-seq data through collapsed molecular duplicates to increase mutation coverage. The GRCh38 (hg38) reference human genome build was used for mapping and alignment, specifically GENCODE Release 44, GRCh38.p14. Potential germline variants over 1% of minor allele frequency have been removed using the latest version 110 release of the 1000GENOMES vcf file using the Ensembl ftp directory. The “V3” parameter was set to process 3-prime libraries. The default setting of scSNV has been used with Maximum Variant Allele Fraction set to 0.999. To reduce the number of false-positive calls, two important consecutive filters were introduced. First, only somatic variants detected by at least eight different UMIs per barcode have been used, and second, only somatic variants that were present in at least four barcodes were considered further for phylogenetic tree generation. The inputs were fastq files and the outputs were sparse SNV count matrices for alternative and reference alleles along with annotated SNV files in csv and vcf format.

Phylogenetic tree inference

The matrices and csv files generated in the previous step are used to generate fasta alignments. Fasta files are generated from all the cells and a subset of the cells with SeqKit v2.1.0 [17]. To track within-sample variability of the trees, we generate five replicate trees from each sample with each tree being constructed from a random selection of 700 out of the ~1400 cells from that sample. Because these subsets of 700 cells are partially overlapping with each other, each tree is a pseudo-replicate. We will refer to a pseudo-replicate tree as a “partial tree.”

For phylogenetic inference with UPGMA, the R package phangorn v2.8.1 [18] was used with helper functions from the ape package v5.6.2. We used the p-distance (the proportion of sites that differ between a sequence pair) to determine branch lengths by setting the evolutionary substitution model to “raw.” The matrix of pairwise distances was computed with the dist.dna function of the ape package. Tree inference provided rooted, ultrametric trees by default. The trees were stored in newick files.

For Maximum Likelihood, IQ-TREE multicore version 2.2.0-beta COVID-edition was used. The substitution model was JC69.

Cell trees have been visualized with version 1.4.4 of the FigTree tree figure drawing tool.

Cell tree metrics

The tree metrics, or features, can be split into five groups based on their technical properties (details in Supplementary Information: Tree Metrics). Group I contains spectral tree metrics that are based on the transform matrices of the cell trees which are discrete analogues of the generalized Fourier [19] and Laplacian transforms [20], respectively. Group II contains specialized phylogenetic features focusing on aggregated branch length statistics and their derivatives, including entropy-based metrics. Group III includes well-known general phylogenetic tree statistics used in the biodiversity literature. Group IV focuses on branch length values specifically and generates summary statistics based on the distance matrix between the tips of the tree. Finally, Group V has 2 powergraph-based features generating the Laplacian transforms of the square of the tree graphs, similar to Group I.

Regression analysis

Elastic net regression

To build a predictor model, we regress chronological age on 31 tree statistics in addition to Sex, giving a total of 32 features. We allow pairwise interactions between sex and each tree statistic giving a total of 32+31=63 predictors.

In line with the majority of previous aging timers [3, 21,22,23], we employ elastic net regularization [24, 25]. This requires solving

$$\mu, \hat{\beta}=\arg \underset{\left(\mu, \beta \right)}{\min}\left\{\frac{1}{2{N}_{\textrm{tot}}}\sum\limits_{i=1}^{Ns}\sum\limits_{j=1}^{n_i}{\left({y}_{ij}-\mu -{x}_{ij}^{\intercal}\beta \right)}^2+\lambda \left[\frac{1}{2}\left(1-\alpha \right){\left\Vert \beta \right\Vert}_2^2+\alpha {\left\Vert \beta \right\Vert}_1\right]\right\},$$
(1)

which is a convex program when the hyperparameters λ (the regularization constant) and α (the lasso fraction) are fixed. Here, xij is a predictor and yij is the chronological age for pseudo-replicate j in sample i. β is the vector of regression coefficients, μ is the constant offset, Ns is the number of samples, ni is the number of replicates in sample i, and \({N}_{\textrm{tot}}={\sum}_{i=1}^{N_s}{n}_i\) is the total number of replicates across all samples.

Equation 1 is solved using the elastic-net routine from scikit-learn [26] (version 1.3.0) in Python3 [27] (version 3.11.5).

Nested cross-validation

To test predictive accuracy using this model, we implement a nested cross-validation scheme [28, 29]. We used leave-one-out cross-validation in both outer and inner loops: since there are 18 samples this means there were 18 folds in the outer loop and 17 folds in the inner loop (a fold partitions data into training and test sets with each sample assigned to a test set exactly once). All pseudo-replicates from a sample are assigned the same chronological age and are never split across training and test sets.

Hyperparameter grid search

Hyperparameters are determined by solving Eq. 1 multiple times, each time with different hyperparameter value combinations. Hyperparameter values are chosen from the sets λ ∈ {0.1,0.3,1,3,10} and α ∈ {0.6,0.7,0.8,0.9,1} in an exhaustive grid search. The hyperparameter combination giving the lowest mean absolute error, as found by cross-validation in the inner loop, is chosen as optimal. Once the optimal hyperparameters have been found for a given outer training set, Eq. 1 is solved for one step in the outer loop. The procedure is then repeated for other steps in the outer loop, calculating a new set of hyperparameters each time.

Regression coefficients and prediction accuracy

Each step in the outer loop produces a vector of regression coefficient estimates, \(\hat{\beta}\), and a subset of test sample predictions, \({\hat{y}}_{ij}\). Prediction accuracy is calculated from the full set of predicted and chronological age pairs, \(\left\{{\hat{y}}_{ij},{y}_{ij}\right\}\). Performance metrics used are the mean absolute error (mae), median absolute error (mdae), root-mean-squared error (rmse), and Pearson correlation (r) defined as follows:

$${\displaystyle \begin{array}{c}\textrm{mse}=\frac{1}{N_{\textrm{tot}}}\sum\limits_{ij}\left|{\hat{y}}_{ij}-{y}_{ij}\right|,\\ {}\textrm{mdae}=\textrm{median}\left\{\left|{\hat{y}}_{ij}-{y}_{ij}\right|\right\},\\ {}\textrm{rmse}=\sqrt{\frac{1}{N_{\textrm{tot}}}}\sum\limits_{ij}{\left({\hat{y}}_{ij}-{y}_{ij}\right)}^2,\\ {}r=\frac{\sum_{ij}\left({\hat{y}}_{ij}-\left\langle {\hat{y}}_{ij}\right\rangle \right)\left({y}_{ij}-\left\langle {y}_{ij}\right\rangle \right)}{\sqrt{\sum_{ij}{\left({\hat{y}}_{ij}-\left\langle {\hat{y}}_{ij}\right\rangle \right)}^2}\sqrt{\sum_{ij}{\left({y}_{ij}-\left\langle {y}_{ij}\right\rangle \right)}^2}},\end{array}}$$

where, for compactness, we write the double sum \({\sum}_{i=1}^{N_s}{\sum}_{j=1}^{n_i}\)as ∑ij

This nested cross-validation generates a single age prediction for each pseudo-replicate. Because the outer loop has 18 folds, each regression coefficient is estimated 18 times. Results are shown in Fig. 2A and B.

Testing on public data

A final step in testing the model is to examine its prediction on an external dataset. This involves, in essence, just one step in the outer loop of the nested cross-validation where all the 18 HLC samples are used as a single training set, and all the 18 AIDA samples are used as a single test set. Cross-validation is still performed in the inner loop to optimize the hyperparameters.

Feature pre-selection

The regularized regression procedure described above combines feature selection and coefficient estimation in a single optimization step. This can result in biased predictions since the regularization required to shrink the weaker predictors also shrinks the better predictors [25]. This bias can be mitigated by pre-selecting features in an initial step, prior to elastic net regression [30]. The basic idea is that, by removing some of the weaker predictors prior to regression, the degree of regularization needed in the coefficient estimation step is decreased, thus reducing prediction bias.

Our approach to pre-selecting features is to use the output from the elastic net itself. We take the features selected by the procedure described above and use them in a second (elastic net) regression. This two-step regression approach is similar to adaptive regularization methods where an initial regression step is used to determine feature-specific regularization parameters [31, 32]. In our scheme, the first regression (the selection step) produces a ranking of features based on the magnitude of their regression coefficients. The second regression (the estimation step) is performed with only the top k ranked features. It is the model fit from this second step that is used for prediction.

To determine the optimal value of k, we repeat the estimation step across different k, finding the value that maximizes performance. Technically, the number of features k is a hyperparameter of the model and should be determined in the inner cross-validation loop, along with α and l1. This would help account for the uncertainties in post-selection inference [33, 34] and reduce the risk of overfitting. However, for our exploratory purposes, we simply perform the estimation step at several different values of k and use the k that gives the best performance.

We find that feature pre-selection does improve predictability, albeit only slightly (see Supplementary Figure 1 in Supplementary Information: Tree Metrics). Thus, for simplicity, our default model uses all features, without pre-selection. Nevertheless, demonstrating that a subset of the features provides as good, if not slightly better, predictability as all the features is helpful for reducing the number of regression coefficients that need to be interpreted (see Discussion).

Processing public datasets

The Asian Immune Diversity Atlas (AIDA) public Human Cell Atlas (HCA) scRNA-seq dataset has been used to process candidate samples for model validation (https://data.humancellatlas.org/explore/projects/f0f89c14-7460-4bab-9d42-22228a91f185). Fastq files have been downloaded from the HCA site. Filtered barcode lists have been directly downloaded from the CellRanger output available at the Chan Zuckerberg CELLxGENE Collections at the following URL: https://cellxgene.cziscience.com/collections/ced320a1-29f3-47c1-a735-513c7084d508. For scSNV, the “V2_5P” parameter was set to process the 10X V2 5-prime libraries.

Results

Cell trees

Phylogenetic trees were inferred using the distance matrix algorithm UPGMA and maximum likelihood. To characterize the variability in trees sampled from a single individual in the HLC dataset, we construct trees from five subsets of 700 randomly sampled cells rather than a single tree from all ~1400 cells per individual. Because the sets of 700 cells are partially overlapping, we refer to these “partial” trees as pseudo-replicates. Each partial tree is generated using the same filters as described in “Somatic mutation calling de novo from scRNA-seq” in the “Methods” section. The regression model fit to the HLC dataset is tested on samples from the AIDA dataset using five pseudo-replicate trees, each generated from 700 cells per individual.

For visualization, Fig. 1 shows the complete trees utilizing information from all ~1400 barcoded cells from each individual in the HLC cohort. The circular rendering, which places the root at the center and the cells around the perimeter, provided inspiration for the name “Cell Tree Rings.”

Fig. 1
figure 1

Complete cell trees from each of the 18 HLC participants in the study

Cell tree metrics

The central hypothesis of the study is that the shape of cell trees is a measure of biological age. Here shape refers to the combination of topology (branching order) and branch lengths. Topology, in the case of cell lineage trees, corresponds to branching patterns of mitotic division in somatic cells. Branch lengths represent the amount of evolutionary change and are usually defined as the product of mutation rates and a suitable unit of time.

Different tree metrics capture either topology only or branch-length only, or a combination of both. We have applied a set of 31 tree metrics to characterize the shape of cell trees built from somatic mutations from human peripheral blood mononuclear cells (see Supplementary Information: Tree Metrics). This set of measures comprises both traditional tree metrics used in phylogenetics and some that were developed specifically for this study.

Model building and prediction results

An age predictor has been constructed by regressing chronological age on the 31 tree metrics, along with Sex. Elastic Net regularization and nested leave-one-out cross-validation were used to validate and test the model (see “Regression analysis” in the “Methods” section). The resulting prediction errors, correlation coefficient, p-values, and explained variance were estimated by comparing cell tree ages to chronological ages.

Figure 2 shows the performance of the default cell tree age model cross-validated on the 18 samples from the HLC data. In this default model, UPGMA is used for phylogenetic tree inference, five pseudo-replicate trees are used per sample, and all 32 features are used in the regression. For this model, r=0.813, p=0.00004, R2=0.660, MAE=7.625, MdAE=4.396, and RMSE=10.760.

Fig. 2
figure 2

A Comparison of predicted versus chronological age for 18 human HLC blood samples. Dots represent predictions for each of the 5 pseudo-replicates per sample while crosses represent the mean prediction for each sample. The dotted red line is the reference for perfect prediction (y=x). B Performance metrics, MAE (mean absolute error), MdAE (median absolute error), RMSE (root mean square error), r (Pearson’s correlation coefficient)

We also modeled the HLC data with a dummy regressor instead of regressing on tree metrics using the elastic net. The dummy error terms were the following: mae=16.471, mdae=15.882, rmse=19.515. The improvement when including the tree metrics as covariates is apparent.

The Supplementary information shows how feature pre-selection can slightly improve performance and how using Maximum Likelihood instead of UPGMA for tree inference affects results.

The slope of predicted to chronological age is less than 1, indicating that low ages are systematically overestimated while high ages are systematically underestimated. This is, in large part, due to regularization which biases the regression coefficients towards 0 and thus biases predictions towards the mean of the data. Figure 3A illustrates this trend by showing how the age difference (predicted minus chronological age) is positive below and negative above the mean age of ~47 years. This bias in age prediction can be characterized by a linear trend line (green dashed). It has become customary to refer to the difference between the predicted age and this linear trend line as the “age acceleration” [22], shown in Fig. 3B.

Fig. 3
figure 3

Age difference (A) and age acceleration (B) from the default cell tree age prediction model in Fig. 2. The linear trend of predicted to chronological age is given by the green linear. Age difference is predicted age minus chronological age. Age acceleration is the difference between predicted age and the linear fit of predicted to chronological age

Public validation of cell tree age model

Having developed the cell tree age model on HLC data, we then tested it on a public scRNA-seq dataset. For this, we used 18 peripheral blood samples from the Asian Immune Diversity Atlas (AIDA). This involved 10 females and eight males ranging in age between 21 and 65 years old. For these data, as with the HLC data, we used UPGMA to generate five pseudo-replicate trees from each sample where each pseudo-replicate used 700 cells randomly selected from the sample. The default model was trained on all 18 of the HLC data (see “Testing on public data” in the “Regression analysis” section). The resulting performance metrics were r=0.853, p = 0.00001, R2=0.728, MAE=12.791, MdAE=13.636, RMSE=14.081.

Cell tree age associations with clinical biomarkers

As part of the approved study protocol, a wide range of clinical blood biomarker work has been shared. These were obtained during the same period as the blood draws for the scRNA-seq. For the following list of clinical blood markers, we had access to values of 13 samples out of the 18 HLC samples, nine males and four females: complete leukocyte and erythrocyte blood panel (13 markers), albumin, glucose, Hba1c, total cholesterol, alkaline phosphatase, uric acid, creatinine, blood urea nitrogen. This provided a total of 21 markers.

Table 2 shows statistically significant associations between predicted cell tree ages and chronological ages and several of the blood markers. The most accurate model was used for the predicted cell tree ages, involving six pre-selected features (see Supplementary Figure 1 in Supplementary Information Tree Metrics).

Table 2 Statistically significant associations between cell tree age and various clinical markers

Since reference ranges can differ between females and males for particular blood markers and since we had nine male samples available compared with four female samples, male-specific results are provided as well in order to examine their associations with predicted cell tree age and chronological age.

Out of the 21 blood markers, eight markers were associated significantly with cell tree age or chronological age or both. Six out of these eight significant associations showed cell tree ages to be stronger predictors than chronological age with the metabolic marker blood glucose showing the biggest difference.

Discussion

We have shown that cell trees constructed using SNVs from human peripheral blood mononuclear cells can predict chronological age. The SNVs underlying these trees can be directly called from the most accessible single-cell sequencing approach, scRNA-seq. Importantly, the trained cell tree age model was validated on an independent test set and the predicted cell tree age was found to be significantly associated with several blood markers (see Supplementary Information: Clinical Markers).

The resulting new molecular aging timer, Cell Tree Rings, requires dozens of cell tree metrics as inputs. Performance of this default model can be improved by ranking the features and regressing on a subset of them (see Feature Pre-Selection in the “Regression analysis” section). Figure 1 of Supplementary Information Tree Metrics shows that pre-selection of between five and 12 features improves predictive accuracy. At peak accuracy, using six regressors (Sex, “tipDistNorm,” “mMax_eigengap,” “mMaxEigen,” “AC_2,” “tipRootPatr”) results in predictions with a median absolute error of ~3.5 years, a correlation coefficient of 0.9, and ~82% explained variance. Note that accuracy rapidly declines with fewer predictors such that univariate prediction, even with the single best regressor, has poor accuracy.

Feature selection is useful for identifying the metrics important for prediction. “mMax_eigengap” has been used as a heuristic to identify different modes of divisions or modalities within the tree [20]. “AC_2” is the algebraic connectivity of a graph, a well-known feature of graph robustness. Overall, these tree metrics may represent a conceptually new and experimentally verifiable class of quantitative predictors of age.

The Maximum Likelihood method for phylogenetic inference serves as important additional phylogenetic evidence to justify the cell tree age model approach (see Supplementary Figure 3 Supplementary Information: Tree Metrics). On the HLC data, the best performing model contained six predictors with performance metrics r=0.797, p = 0.00007, mae=8.260, and mdae=5.654 (see Supplementary Figure 4 Supplementary Information: Tree Metrics).

Most importantly, the cell tree age model, trained on the HLC data, when tested on samples from the public AIDA dataset gave an r of 0.85 indicating that the model does generalize to existing public data. To further the generalizability, the HLC data came from a Central European cohort, and the AIDA data were from an East Asian cohort. However, despite this strong correlation between predicted and chronological ages, the clock appears to systematically overestimate age in the AIDA data (see Fig. 4). Although feature pre-selection reduces this bias (see Supplementary Figure 2 in Supplementary Information: Tree Metrics), there is still a clear overestimate of age.

Fig. 4
figure 4

A Comparison of predicted vs chronological ages for 18 independent human blood samples from the public AIDA dataset. The default model, without feature pre-selection, was trained on 18 samples from the HLC dataset. As in Fig. 2, dots represent predictions for each of the five pseudo-replicates per sample while crosses represent the mean prediction for each sample. The dotted red line is the reference for perfect prediction (y=x). B Performance metrics, MAE (mean absolute error), MdAE (median absolute error), RMSE (root mean square error), r (Pearson’s correlation coefficient)

There are differences between the HLC training set and the AIDA test set that might account for this batch effect. While the HLC data used 10x V3 3’ prime-end libraries, the AIDA data used 10X V2 5’ prime-end libraries. This means that the former method detects variants in the 3’ end region of the genes, including UTR regions; the latter detects variants from the 5’ prime end region, including regulatory elements. Furthermore, the AIDA dataset is primarily restricted to the 20–50-year-old age range, with just a single sample over 60. Further investigations are needed to understand this batch effect.

We believe Cell Tree Rings to be the only existing natural barcoding approach that can build larger trees from thousands of cells using somatic mutations called de novo from scRNA-seq data alone as well as from all autosomes, sex chromosomes, and the mitochondrial genome combined. Representative cell trees can capture multiple clonal events in different parts of the accessible cellular genomes, reaching a higher resolution in cell population history than possible with targeted genomics approaches alone.

Another advantage of Cell Tree Rings is its potential to extract both lineage histories and gene expression levels from the same cells using a single method and lab protocol. Combining lineage and phenotypic expression information from the same sample is a considerable challenge and existing approaches offer complex solutions combining different protocols [35].

We have also examined the association of cell tree age with 21 clinical blood markers (see “Cell tree age associations with clinical biomarkers”). Of the eight markers that show significant associations with cell tree or chronological age, six of them are better correlated with cell tree age than chronological age. These results provide a preliminary indication of how cell tree age could provide valuable clinical indicators. Ultimately, clocks measuring biological age must be better predictors of clinical markers than chronological age. Not all of these associations were related to explicit immune functions, for instance the glucose and albumin associations suggest that if these results are confirmed on bigger cohorts, then Cell Tree Rings might provide a measure of broader multi-tissue and multi-organ age modalities.

Methodological comparison between single-cell DNA and RNA sequencing of somatic mutations

Although single-cell whole-genome DNA sequencing would appear to be the most obvious and comprehensive solution for calling somatic SNVs in single cells, there are two main reasons we chose to pursue a scRNA-seq protocol. The first is translational in that we seek to develop a simpler, more affordable platform for clinical application that can be scaled up to meet increasing demand. The second reason is scientific.

In terms of translational arguments, there are four different but overlapping aspects of choosing scRNA-seq over single-cell whole-genome DNA sequencing to generate cell lineage trees: affordability, costs, scale, and complexity of protocols.

  1. 1.

    Affordability: Few laboratories have the resources to perform single-cell, whole-genome DNA sequencing combined with error-corrected sequencing. Single-cell RNA sequencing, in contrast, is a commercially available technique now used in hundreds of facilities worldwide.

  2. 2.

    Costs: Single-cell, whole-genome DNA sequencing costs are an order of magnitude greater than commercially available single-cell RNA sequencing.

  3. 3.

    Scale: Because of these high costs, current cell lineage tree analysis using whole-genome scDNA-seq protocols is likely difficult to scale beyond a few hundred cells. In the Mitchell et al. study [36], an average number of 358 cells were studied over 10 individuals. In our first implementation of the Cell Tree Rings method, we have already studied an average of 1358 cells over 18 samples. With the external validation set, we have added 18 public samples to the original 18 HLC samples, giving a total of 36 samples. This demonstrates the ability, potentially, to extract hundreds or thousands of new cell trees from existing public scRNA-seq samples with Cell Tree Rings methodology and add it to the model. Also it is straightforward to scale this to tens of thousands of cells or more.

  4. 4.

    Protocol: Our technique can be implemented in essentially any 10x Chromium sequencing facility and the public validation of our data shows that it can be retrospectively applied to datasets already generated worldwide. In contrast, the [36] whole-genome scDNA-seq protocol requires growing colonies from the sampled individual cells in vitro and extensive QC steps to control for the artificially introduced genome changes.

In terms of scientific arguments, there are three aspects to consider currently:

  1. 1.

    Genome Coverage: One obvious advantage of single-cell whole-genome DNA sequencing is the ability to track somatic mutations across the whole genome, while sc-RNA-seq is restricted to protein coding and nearby regulatory regions. However, the current study proved that cell lineage trees generated from scRNA-seq data can reliably deliver a strong age signal.

  2. 2.

    Transcriptomics information: Because we are calling variants from RNA transcripts, our method has the potential to integrate both genomics and transcriptomics modalities. We anticipate that an scRNA-seq aging timer which has both gene expression and variant information will result in a more versatile and insightful aging timer than an scDNA-seq timer which has variant information alone. However, our current study, the first implementation of Cell Tree Rings, has focused on using mutation information alone.

  3. 3.

    Cell type access: Our scRNA-seq–based method captures blood cells from a wide variety of myeloid and lymphoid lineages. These can be at different maturity states, from stem cells to fully differentiated white blood cells. In contrast, scDNA-seq techniques are more restricted to certain cell types. For example, if growing colonies in vitro is required, as in the Mitchell et al. study [36], it may only be possible to study hematopoietic stem and progenitor cells, but not more mature phenotypes.

Potential directions for improvement in CTR

The version of CTR reported here represents a basic proof-of-principle. Here we discuss some of the improvements envisioned. The current version of Cell Tree Rings has been achieved with direct and de novo scRNA-seq alone without the aid of any bulk sequencing approach. At a technical level, bulk exome sequencing data can improve true mutation calls at the single cell level and filter out more noise.

In the future, some of this residual variation in predicated age could be explained by individual medical histories or phenotypes, when they become available. In addition, as mentioned above, gene expression information can be extracted from the same cellular barcodes and from the very same genes whose SNVs have been used to generate the cell tree. Efforts are currently underway to see how much of the currently unexplained variation can be accounted for by the cellular phenotypes and the individual medical histories.

Translational geroscience seeks to identify which elements of the aging process are irreversible under current available treatments and which are amenable to modification by existing therapeutic interventions. The tree features involved in Cell Tree Rings may be valuable in diagnosing the influence of a particular intervention by examining its effect on tree shape.

When extending Cell Tree Rings to other tissues, an important question is how spatial aspects of the tree can be incorporated. Cellular elements in complex biofluids, such as blood and saliva, have considerable freedom of movement throughout the human body. In contrast, resident cells of compact, solid tissues in the kidney, intestine, or liver for instance are under considerable spatial restrictions. The infiltrating immune cells of compact tissues are less constricted spatially than the resident cells, but more constricted than circulating blood cells. It is an open question how tree shape metrics contributing to the age regression model will change in a more restricted spatial environment. Spatial restrictions have been shown to be important in cancer where evolutionary phylodynamic models have been applied to model boundary-driven solid tumor growth [37]. Combining spatial information with the temporal information of cell trees could thus help improve the ability of cell trees to quantify biological age.

Questions that Cell Tree Rings can help answer

One surprising finding in the phylogenetic tree-aided developmental biology literature is the degree of asymmetry in phylogenetic lineage trees [38, 39]. These studies showed how, at least on the few samples studied, there can be a substantial difference in the number of surviving progeny between offspring of the first or first few cell divisions, the asymmetry reaching sometimes as large as 10:90%. Cell Tree Rings can help quantify this developmental imbalance further and separate it from changes in later life.

Somatic mutations in a small set of (growth and cancer-associated) genes have been shown to propagate clones that become dominant in the hematopoietic system of older individuals [36] and have been directly linked to increased risk of cancer and other chronic diseases [40]. Additionally, somatic mosaicism has been shown, in multiple tissues, to rise with age and to predict disease in animal models [6].

From the perspective of designing interventions, it is important to understand which of the somatic mutations are simply passive indicators and which are active drivers of the aging process. In addition, it will be important to establish which can be targeted with clinical interventions.

By providing a way to quantify the different aspects of tree shape, Cell Tree Rings can be used to identify early indicators of clonal hematopoiesis and diagnose why certain individuals display resilience to the effects of somatic mutation and experience reduced chronic age-associated disease.

Cell Tree Rings and the timescales and convergence of different clocks

When evaluating the effectiveness of different biological aging clocks, it is important to address the question of what the minimal meaningful temporal unit of biological aging is. While clocks with high temporal resolution can evaluate the short-term effects of interventions, these effects can often be difficult to distinguish from physiological noise. On the other hand, lower temporal resolution over longer time windows may miss important short-term signals [41]. Cell Tree Rings captures the long-term dynamics of somatic evolution that relates to the decades-long processes usually associated with aging but is insensitive to processes that are shorter than the characteristic timescale for detecting somatic mutations. It is an open question whether clocks based on different mechanisms and with different time resolutions can be combined and merged, but ultimately the results from different clocks should be reconciled.

Our hope is that Cell Tree Rings may provide a baseline integrative framework for different aging hallmarks and clocks. There are three reasons this may be possible:

First, Cell Tree Rings operate on the fundamental genome-instability level by tracking somatic mutations in hundreds or thousands of single cells. Importantly, it is the use of the tree structure to constrain these mutations that helps to improve their detection accuracy.

Second, Cell Tree Rings is based on a foundational construct, the somatic evolutionary cell tree, that relates the tens of trillions of somatic cells of a human body to each other and to time.

Third, without identifying the damage somatic mutations cause, it is difficult to design healthy longevity therapies and regimens. Cell Tree Rings captures and organizes this basic mutation information at different levels of the tree hierarchy, potentially providing signposts for which interventions are likely to be most effective.

Cell Tree Rings is thus not simply another aging timer. It aims to provide a foundational principle for clocks. There is considerable uncertainty about whether epigenetic aging clocks can inform us about biological age reversals in clinical trials [42]. Adding Cell Tree Rings as a single-cell resolution clock component might mitigate this uncertainty and improve the assessment of geroprotective trials.