Validation of crowd-sourced plant genome size measurements
Cytometry Part A ( IF 3.7 ) Pub Date : 2021-08-07 , DOI: 10.1002/cyto.a.24493
David W Galbraith 1

In this increasingly data-rich era of scientific enquiry, attention is turning to the value of archiving information in publicly available, searchable databases, which permit future analyses without the need to acquire additional data. The establishment of repositories is facilitated by the continuing decrease in computational overhead described by Moore's law [1], and therefore offers considerable and increasing value in situations where sample collection is rate-limited by costs, legal restrictions, unavailability of collection expertise, or other factors. Such a situation exists in flow cytometric analyses of plant genome sizes. Including those species yet to be discovered, we estimate that there are approximately 330,000–400,000 species of flowering plants worldwide [2], a very small percentage of which have been characterized with any degree of sophistication, whether at the ecological, agronomic, morphological, physiological, molecular, or cellular level. Given the impact of the Anthropocene, increasing rates of species extinction pose the real risk that we will lose entire plant species before their currently unsuspected importance is discovered [3]. Archiving specimens and the data already derived from them provides one small way in which we can address this problem, by providing a molecular-taxonomic inventory of all extant plant species. However, how representative and reliable are the stored measurements?

A significant problem with archiving data for future use relates to unforeseen and unrecorded variables. As cytometrists, our interest naturally focuses on measurement variables, and on the performance of the associated instrumentation and methods of measurement. Our example here is the measurement of plant genome sizes, defined operationally as the 2C nuclear DNA content, in pg, of somatic cells. This information has become readily accessible using flow cytometry [4] and, with the global spread of this particular application, the number of publications reporting plant genome sizes has rapidly increased. Recognizing this, archives of genome size measurements emerged, one of the most comprehensive of which is the RBG Kew Plant C-value database [5]. This database contains a compilation of genome size values harvested from the primary scientific literature, representing the contributions of many different laboratories located around the world. These values are sorted by measurement method (most involve flow cytometry, FCM), presented in the form of ranges and global average values, and manually curated into “prime values” for each species, together with any other estimates reported for the species. Prime values were originally described as those which represent “the most consistent value obtained under best-practice methods (as originally defined by Bennett & Smith, 1976)” [6]. Since this reference predates the use of flow cytometry, the criteria for defining prime values have been adapted and refined empirically (Leitch IJ: Pers. Commun.). Thus, the following questions are asked: (a) Did the study follow best practices: that is, what DNA-specific staining procedure was employed, were there a suitable number of replicates, good CVs, and internal calibration, and was a suitable reference standard used (i.e., a species having a genome less than three times the size of the species of interest)? (b) Were chromosome counts made on the same materials used to estimate the genome size? (c) What is the reputation of the laboratories generating the data, and how reliable are they considered to be? (d) Does a herbarium voucher exist for the analyzed material?

In general, estimates made using flow cytometry have been considered “more reliable”, and hence were, and continue to be, selected in preference to those made using Feulgen microdensitometry. Nevertheless, over time it became increasingly hard to distinguish between estimates, owing to the comparably high quality of data coming from an increasing number of groups. Thus, for stability of the lists, if an estimate for a species had previously been designated as prime, it was retained unless there was a compelling reason to change it. Currently, the Kew Plant C-value database (release 7.1; https://cvalues.science.kew.org/) contains data for 12,273 species, comprising 10,770 angiosperms, 421 gymnosperms, 303 pteridophytes (246 ferns and fern allies and 57 lycophytes), 334 bryophytes, and 445 algae.

The availability of these data now makes it possible to directly test the relationship between the crowd-sourced prime genome size values and measurements made in a single laboratory under controlled conditions. This effectively examines whether the curated, crowd-sourced data (genome size values within the C-value database measured in many different laboratories, using different instruments and experimental conditions, and calibration standards with different assumed genome sizes for converting relative measurements into absolute amounts) are a useful permanent record. To test this, we employed the Beckman Coulter Cytoflex for the analysis of homogenates of four species, staining with propidium iodide/ribonuclease [7] (for methods specific to the Cytoflex, see https://www.beckman.com/resources/reading-material/application-notes/plant-genome-size-flow-cytometry-analysis, also provided as Appendix S2). The Kew Plant C-value prime values for the four species span a range of 0.32 pg DNA (the 2C value for A. thaliana leaf cells) to 101.12 pg (the 32C value for endoreduplicated pericarp cells of Capsicum annuum); about 95% of the plant nuclear DNA content 2C-values in the Kew database fall between 0.32 and 101.12 pg. Figure 1 illustrates the experimental strategy with arabidopsis: a biparametric plot of side scatter versus PI (area) fluorescence (Figure 1A) reveals five clusters of nuclei, equally spaced across the PI dimension (log scale). Figure 1B illustrates the time-dependency of PI fluorescence, gating on Region P1 of Figure 1A. Further gating (Region P2, Figure 1B) to include only those nuclei whose fluorescence remains constant over time provides the uniparametric histogram of the individual classes of endoreduplicated nuclei (Figure 1C). Table 1 lists summary values for the fluorescence values representing the positions of these peaks, the associated CVs, and the corresponding prime DNA content values taken from the Kew Plant C-value database.
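The two-step gating described above can be sketched in a few lines of Python. This is only an illustration on synthetic events, not the acquisition software's own gating: the event fields (`time`, `pi_a`, `ssc`), the gate limits, the simulated early drift in staining, and the helper names are all assumptions made for the example.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the synthetic example is reproducible


def make_events():
    """Build synthetic events; every number here is an illustrative assumption."""
    events = []
    for _ in range(400):  # nuclei forming a single 2C-like peak
        t = rng.uniform(0.0, 60.0)  # acquisition time in seconds
        # Assumed drift: signal rises during the first 10 s, then stays constant.
        drift = 0.8 + 0.02 * t if t < 10.0 else 1.0
        events.append({"time": t,
                       "pi_a": rng.gauss(36470.0, 550.0) * drift,
                       "ssc": rng.gauss(1000.0, 100.0)})
    for _ in range(100):  # debris: weak PI signal and low side scatter
        events.append({"time": rng.uniform(0.0, 60.0),
                       "pi_a": rng.uniform(100.0, 5000.0),
                       "ssc": rng.uniform(10.0, 200.0)})
    return events


def gate_region(events, pi_range, ssc_range):
    """Rectangular gate on PI area versus side scatter (stand-in for region P1)."""
    return [e for e in events
            if pi_range[0] <= e["pi_a"] <= pi_range[1]
            and ssc_range[0] <= e["ssc"] <= ssc_range[1]]


def gate_time(events, t_start, t_end):
    """Keep events from the window where fluorescence is stable (stand-in for the time gate)."""
    return [e for e in events if t_start <= e["time"] <= t_end]


def percent_cv(values):
    """Coefficient of variation, in percent, using the sample standard deviation."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)


events = make_events()
p1 = gate_region(events, pi_range=(20000.0, 60000.0), ssc_range=(500.0, 1500.0))
stable = gate_time(p1, t_start=10.0, t_end=60.0)

cv_p1 = percent_cv([e["pi_a"] for e in p1])
cv_stable = percent_cv([e["pi_a"] for e in stable])
print(f"P1 only: {len(p1)} events, CV = {cv_p1:.2f}%")
print(f"P1 + time gate: {len(stable)} events, CV = {cv_stable:.2f}%")
```

In this toy data set the time gate removes the events acquired while the signal is still drifting, so the peak CV tightens; with real data the gate boundaries would be drawn by eye on the time versus PI plot.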

FIGURE 1. Experimental pipeline illustrating the gating strategy for Arabidopsis thaliana. (A) Biparametric analysis of side-scatter (SSC) versus propidium iodide fluorescence (PI-A). The nuclei are gated in region P1. (B) Time dependency of nuclear PI fluorescence, gated on region P1. (C) Uniparametric analysis of time-invariant nuclear PI fluorescence, gating on region P7.
TABLE 1. Comparison of measured DNA content values: the peak positions of the propidium iodide/RNase signals measured within homogenates of the four species, and the prime values for DNA content recovered from the Kew Plant C-value database. The prime value entries were extracted for the 2C nuclei, and the DNA contents for the higher C-value levels of each species were calculated by simple multiplication. Homogenates were prepared from young leaves, with the exception of pepper, for which fruit pericarp tissues were sampled. The Arabidopsis accession was Col-0, as for the Kew database. The tomato accession used for homogenization is an unknown popular commercial variety, and the Kew database value is for cv. Alicante. The pepper accession used for homogenization is an unknown popular commercial variety, and the Kew database value is for var. annuum.
| Species | PI (A) × 10⁻³ | Nuclear DNA content (pg) | CV (%) |
|---|---|---|---|
| 2C arabidopsis Col-0 | 36.47 | 0.32ᵃ | 1.77 |
| 4C arabidopsis | 73.23 | 0.64 | 1.68 |
| 8C arabidopsis | 146.71 | 1.28 | 1.39 |
| 16C arabidopsis | 293.13 | 2.56 | 1.32 |
| 2C tomato | 227.60 | 2.00ᵃ | 1.28 |
| 4C tomato | 454.80 | 4.00 | 1.32 |
| 2C maize | 609.83 | 5.50ᵃ | 1.31 |
| 4C maize | 1194.59 | 11.00 | 1.27 |
| 2C pepper | 839.83 | 6.32ᵃ | 1.83 |
| 4C pepper | 1162.40 | 12.64 | 1.64 |
| 8C pepper | 3298.16 | 25.28 | 0.83 |
| 16C pepper | 6490.37 | 50.56 | 1.81 |
| 32C pepper | 12746.90 | 101.12 | 1.19 |

  • Note: Mean CV (±SD) = 1.43% (0.29).
  • ᵃ 2C nuclear DNA content prime values from the Kew C-value database.
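The derived columns of Table 1 are simple to reproduce. The sketch below assumes that "simple multiplication" means scaling the 2C prime value by (C-level / 2), and that the reported SD is the sample standard deviation; the variable and function names are illustrative only.

```python
import statistics

# 2C prime values (pg) as listed in Table 1.
prime_2c_pg = {"arabidopsis": 0.32, "tomato": 2.00, "maize": 5.50, "pepper": 6.32}


def dna_content_pg(species, c_level):
    """DNA content at a given C-level, scaled from the 2C prime value (assumed rule)."""
    return prime_2c_pg[species] * c_level / 2


# CV (%) column of Table 1, in row order.
cv_percent = [1.77, 1.68, 1.39, 1.32, 1.28, 1.32, 1.31,
              1.27, 1.83, 1.64, 0.83, 1.81, 1.19]

mean_cv = statistics.mean(cv_percent)
sd_cv = statistics.stdev(cv_percent)  # sample SD (n - 1 denominator)

print(f"32C pepper: {dna_content_pg('pepper', 32):.2f} pg")
print(f"Mean CV (SD) = {mean_cv:.2f}% ({sd_cv:.2f})")
```

With the sample standard deviation the summary matches the table note (1.43%, SD 0.29), which suggests, but does not prove, that this is the convention used.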

From these experiments, we make a number of observations. First, the CV values for these peaks are remarkably low and consistent, in all cases <2%, even for the largest nuclear genome (1.19%; pepper, 101.12 pg, 32C). Second, endoreduplication in arabidopsis, spanning a low range of DNA contents (0.32–2.56 pg), and in pepper, spanning a higher range (6.32–101.12 pg), results in an almost perfect linear correspondence between peak positions and DNA content (r² values of 1.0 and 0.9999, respectively) (Figure 2). Finally, the combined regression analysis of peak positions against DNA content values from the Kew database, across all species, also provides an almost perfect correspondence (r² = 0.9997). The line of best fit passes through the origin (0,0), which indicates an absence of systematic error in the measurements, and no unusual species-specific deviations are noted.
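As an illustration of this kind of regression, the sketch below fits an ordinary least-squares line to the four arabidopsis rows of Table 1. The choice of ordinary least squares, the assignment of DNA content to the x-axis, and the helper name `linear_fit` are assumptions; the source does not state how its fits were computed.

```python
def linear_fit(x, y):
    """Ordinary least-squares line through (x, y); returns slope, intercept, r squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared


# Arabidopsis rows of Table 1: DNA content (pg) and PI (A) x 10^-3 peak positions.
dna_pg = [0.32, 0.64, 1.28, 2.56]
peak_pi = [36.47, 73.23, 146.71, 293.13]

slope, intercept, r_squared = linear_fit(dna_pg, peak_pi)
print(f"slope = {slope:.2f} per pg, intercept = {intercept:.3f}, r^2 = {r_squared:.6f}")
```

An intercept close to zero relative to the peak positions is what "passes through the origin" means in practice; how close is close enough depends on the measurement noise.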

FIGURE 2. Linear regression analysis of the nuclear DNA content data. (A) All values combined from separate runs. (B) Arabidopsis alone. (C) Tomato and maize data combined from two runs. (D) Pepper alone.

A number of conclusions can be drawn: (1) In these species, endoreduplication results in the precise and complete duplication of the nuclear genome, through at least three (arabidopsis) or four (pepper) endocycles. It should be recognized that, in a very small minority of plant groups, such as orchids, partial endoreduplication is seen; further details of this interesting phenomenon can be found in Trávníček et al. [8]. Although it is rarely encountered, care should be taken to accommodate its potential occurrence. (2) Nuclei occupying different size classes due to endocycling are quantitatively measured by the cytometer, meaning that geometric interactions between the size of the nucleus and the height of the illumination spot are not a factor affecting measurement efficiency. We can predict nuclear sizes, assuming a spherical shape (which will most likely be the case after the nuclei are isolated), based on the reported volume of Arabidopsis nuclei (presumptively 2C) of 32 μm³ [9]; from V = (4/3)πr³, this gives r ≈ 1.97 μm. Assuming nuclear volume scales linearly with DNA content, the largest endoreduplicated nuclei will occupy a volume of 32 × 101.12/0.32 = 10,112 μm³, predicting a sphere of radius 13.4 μm. The Cytoflex illumination beam (spot size: 5 μm × 80 μm) can therefore be considered to act in slit-scanning mode for these larger nuclei, emphasizing the importance of acquiring area and not height measurements from the pulse waveforms corresponding to the individual nuclei. It would be interesting to independently confirm the predicted sizes of endoreduplicated nuclei using fluorescence microscopy or, better perhaps, image cytometry. (4) Since the scaling of DNA content with endocycle status is highly compelling, this further implies that differences in chromatin packing, otherwise shown to influence flow cytometric DNA measurements using PI [10], are insignificant across the endocycles and the species measured here. Parenthetically, this confirms earlier reports of linearity of duplication of plant genomes accompanying endocycles or autopolyploidization [7, 11, 12]. (5) Remarkably, the precise degree of scaling between the DNA contents measured in a single run, using the Cytoflex instrument and identical experimental conditions, and those values identified as “prime” in the Kew database (Table 1) implies that the process of crowd-sourcing plant genome size measurements, across many laboratories, has converged on a meaningful relationship. This enhances the value of the content of the Kew Plant C-value database. (6) That the prime values in the database are not assigned to varieties, lines, or cultivars argues that the 2C nuclear DNA contents of the species selected for analysis here must be the same as, or very close to, these prime values.
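The nuclear size prediction in conclusion (2) can be checked with a short calculation. The sketch assumes spherical nuclei and a volume that scales linearly with DNA content, as the text does, and additionally assumes that the 5 μm dimension of the quoted spot size is the beam height that each nucleus traverses. Function and constant names are illustrative.

```python
import math


def sphere_radius_um(volume_um3):
    """Radius of a sphere of the given volume, from V = (4/3) * pi * r^3."""
    return (3.0 * volume_um3 / (4.0 * math.pi)) ** (1.0 / 3.0)


VOLUME_2C_UM3 = 32.0     # reported Arabidopsis nuclear volume (presumptively 2C)
DNA_2C_PG = 0.32         # arabidopsis 2C prime value
DNA_LARGEST_PG = 101.12  # 32C pepper, the largest nucleus measured
BEAM_HEIGHT_UM = 5.0     # assumed to be the short axis of the 5 x 80 um spot

# Volume scaled in proportion to DNA content (assumption stated in the text).
volume_largest_um3 = VOLUME_2C_UM3 * DNA_LARGEST_PG / DNA_2C_PG

radius_2c_um = sphere_radius_um(VOLUME_2C_UM3)            # about 1.97 um
radius_largest_um = sphere_radius_um(volume_largest_um3)  # about 13.4 um

# A nucleus wider than the beam is illuminated slice by slice as it passes through.
slit_scanned = 2.0 * radius_largest_um > BEAM_HEIGHT_UM

print(f"2C radius: {radius_2c_um:.2f} um")
print(f"largest nucleus: {volume_largest_um3:.0f} um^3, radius {radius_largest_um:.1f} um")
print(f"wider than the beam height: {slit_scanned}")
```

Under these assumptions the largest nuclei are several times wider than the beam, which is why the integrated pulse area, rather than the pulse height, tracks total DNA content.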

Taken together, these observations lend confidence to the concept of data repositories, but only in the context of plant nuclear DNA content measurement using flow cytometry. A caveat is that similar strategies for evaluating the “quality” of stored data against “Best Practices” must be devised for every situation and every database, and this may not turn out to be possible in all cases. An obvious issue is the discovery, after the fact, of critical variables that were not recorded in the databases and are no longer available. For further discussion of Best Practices in plant cytometry, see Galbraith et al. [13] and references cited therein.




Updated: 2021-08-07