Avoiding the Misuse of Pathway Analysis Tools in Environmental Metabolomics
Environmental Science & Technology ( IF 10.8 ) Pub Date : 2022-09-26 , DOI: 10.1021/acs.est.2c05588
Cecilia Wieder 1 , Jacob G Bundy 1 , Clément Frainay 2 , Nathalie Poupin 2 , Pablo Rodríguez-Mier 2 , Florence Vinson 2 , Juliette Cooke 2 , Rachel P J Lai 3 , Fabien Jourdan 2, 4 , Timothy M D Ebbels 1

Within the past 20 years, metabolomics has moved from an exciting innovation within the environmental sciences to something that is almost routine. It can be considered as a means to generate metabolite biomarkers, although it is also important to note the cogent criticisms of the environmental biomarker approach that have been made within ecotoxicology: briefly, that biomarkers are surrogates for macro phenotypes (e.g., survival, reproduction, and behavior) that have population-level effects and that it is generally more straightforward and meaningful to measure these end points directly. (1) Some studies have emphasized instead the ability to gain potentially relevant mechanistic information, even for nonmodel organisms, especially when used as part of a multiomic approach. (2) An improved biological understanding is often implicitly or explicitly part of the justification of including metabolomics in a study. So far, so good, but there is a problem: there is no simple, universally accepted way of reverse engineering mechanistic understanding from metabolomic data, even for model organisms, and the problem is even more complicated for nonmodel species. The closest thing to a standard approach is pathway analysis (PA), i.e., making use of existing biochemical knowledge. There are multiple approaches to PA, but we will focus on just one, over-representation analysis (ORA). (NB that the term ORA is often not used, and many authors refer generically to “pathway enrichment” methods.) It should clearly be understood, though, that ORA is certainly not the only approach to analyzing metabolomics data. It is beyond the scope of this work to review the options available, but we direct the interested reader to recent reviews. (3,4) ORA uses the intuitive approach of identifying metabolite biomarker “hits” and comparing them to the numbers of metabolites in specific pathways, to determine if there are either more or fewer hits than one would expect by chance. 
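The comparison described above is commonly framed as a one-sided hypergeometric (Fisher's exact) test. The sketch below is a minimal illustration under that assumption, not the procedure of any particular tool; the function name `ora_p_value` and the toy counts are invented for the example.

```python
from math import comb


def ora_p_value(n_background, n_in_pathway, n_hits, n_hits_in_pathway):
    """Upper-tail hypergeometric probability P(X >= n_hits_in_pathway).

    n_background      -- metabolites in the reference set (the "universe")
    n_in_pathway      -- of those, how many belong to the pathway
    n_hits            -- metabolites selected as significant "hits"
    n_hits_in_pathway -- hits that belong to the pathway
    """
    most_possible = min(n_in_pathway, n_hits)
    # Count the ways of drawing at least the observed number of pathway hits.
    favourable = sum(
        comb(n_in_pathway, i) * comb(n_background - n_in_pathway, n_hits - i)
        for i in range(n_hits_in_pathway, most_possible + 1)
    )
    return favourable / comb(n_background, n_hits)


# Toy numbers (hypothetical): 200 annotated metabolites, 10 of them in the
# pathway, 20 hits overall, 4 of the hits in the pathway.
expected_by_chance = 20 * 10 / 200  # pathway hits expected under the null
p = ora_p_value(200, 10, 20, 4)
print(f"expected by chance: {expected_by_chance:.1f}, observed: 4, p = {p:.4f}")
```

This version tests only for more hits than expected; testing for fewer hits would sum the lower tail instead. Note that every argument is a choice made by the analyst, which is where the sensitivity discussed later comes from.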
It therefore has the twin advantages of being simple to calculate and simple to understand. It does, though, have disadvantages. One potential limitation is shared with all methods that rely on predetermined pathway definitions: traditional pathways are, generally, subjective and heuristic approaches to imposing order on a biochemical network. (5) While this is an important point, we will simply note it here and pass on, and bear in mind that “pathways” are, at least to some extent, arbitrary definitions. The problem is exacerbated for nonmodel organisms, in that accurate metabolic pathway definitions may not be available. It should also be noted that metabolites may contribute to many different pathways: for example, glucose is present in 23 of 263 pathways (KEGG, human), and ATP is present in 880 of 1669 pathways (Reactome, human). Just because a metabolite may be part of a particular pathway, then, does not mean that changes in that metabolite necessarily mean changes in that pathway. Particular care must be taken with environmental organisms not to misinterpret changes with respect to examples from human medicine. A second obvious limitation of ORA is that the criteria for defining significant metabolites are also arbitrary, usually, but not necessarily, based on selecting a threshold for P values from null hypothesis significance testing. It is also possible to draw incorrect conclusions from ORA. For instance, the online Metaboanalyst web server provides a suite of tools for metabolomic analysis, including, but not limited to, ORA. (6) These have become justly popular, as they are free to use, available online, integrated with data processing and biostatistical modules, and updated to ensure they remain current. They also provide some opportunities to set parameters that affect the results, opening up the possibility of inadvertently misusing the tools. 
(NB that this is not an implicit criticism of the team behind Metaboanalyst: individual researchers should take responsibility for their own results, including interpretation.)

We recently published a study of the sensitivity of ORA of metabolomics data to some of the different parameters that can be chosen. (7) A wide range of different factors affect the results (Figure 1). First, and obviously, the choice of database and pathway definitions has a major impact. Second, the precise P value cutoff used for selecting metabolite hits had, unsurprisingly, major effects on the number of significantly enriched pathways. Third, using a background or reference metabolome (i.e., the total list of annotated metabolites detected in a particular experiment) is critically important: if the background is not taken into consideration, the results tend to be very overoptimistic (the P values obtained are much more significant than they should be).

Figure 1. Schematic illustration of the factors that can affect metabolic pathway analysis of metabolomics data, at different stages of the study. Inputs: affected by the organism and pathway database chosen; affected by the significance threshold used to choose the number of metabolite hits. Pathway analysis: the use of a background set (reference metabolome) is particularly important. Outputs: have the P values for the pathways been corrected for multiple testing (based on the total number of pathways in the database)?

We decided to survey the literature to get an idea of the current practice in environmental metabolomics. We searched for environmental metabolomics papers (Clarivate Web of Science core database, July 7, 2022; searched all fields for “metabolom* or metabonom*”, and constrained by Topic = Environmental Sciences, by Document Type = Article, and by Publication Year = 2020–2022) and identified 988 recent papers.
We randomly selected 30 papers from this list (after manually excluding three more reviews that had been incorrectly labeled; the list of papers is given in Table S1) and checked to see what form of PA, if any, was used. Two-thirds of the studies (20 of 30) employed PA (two additional studies mapped metabolites to pathways, but without an associated statistical test); all of these (20 of 20) used ORA, although this generally was not specified by name. Two of them used simultaneous enrichment of transcriptomic and metabolomic data, although full details were not given. Fourteen of the studies used Metaboanalyst; one used the R package Mummichog, and seven failed to specify which software was used. With the exception of the study that had used Mummichog, the studies generally failed to report key parameters that affect the outcome. No studies specified exactly which pathway database was used for the analysis, including for which organism; eight mentioned KEGG pathways but with no more detail given. No studies reported if they corrected for multiple testing in the software output (i.e., based on the number of different metabolic pathways tested); several used plots including an uncorrected P value scale with no additional information given. No studies made any mention of a reference or background metabolome set. Some studies set ad hoc thresholds based on the “pathway impact” statistic provided by Metaboanalyst. It is clear that ORA is being unintentionally misused in environmental metabolomics research, in a fashion that is likely to lead to misleading results.

We conclude by making some brief recommendations for using ORA with environmental metabolomics data (see ref (7) for more detail and fuller discussion).

(1) Accurately report the analyses carried out. The specific software package/online tool used should be reported, along with all of the parameters, even if they were left as defaults, including the database version and organism used for pathways. Specify which P value cutoff or other parameter was used for selection of metabolite hits for PA, including any correction for multiple testing, and also whether correction for multiple testing was carried out on the output (i.e., based on the number of pathways).

(2) Always upload a reference metabolome, or “background set”, i.e., the list of all metabolites that have been identified in that specific study. If this is not done, the results should be treated with extreme caution, as they may inaccurately identify pathways as significantly enriched.

(3) Avoid definitive statements about which pathways have been impacted in a particular study. This type of pathway-based approach is, ideally, used to help generate hypotheses that can then be validated by independent experiments; even if further experiments are not feasible, the limitations should be appreciated.

We hope these simple recommendations will help researchers avoid some of the common errors that currently plague environmental metabolomics research.

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.est.2c05588. A list of environmental metabolomics papers used to investigate real-world pathway analysis approaches in environmental metabolomics (Table S1) (XLSX)

Most electronic Supporting Information files are available without a subscription to ACS Web Editions. Such files may be downloaded by article for research use (if there is a public use license linked to the relevant article, that license may permit other uses). Permission may be obtained from ACS for other uses through requests via the RightsLink permission system: http://pubs.acs.org/page/copyright/permissions.html.

Jacob G. Bundy is a Senior Lecturer in Biological Chemistry at Imperial College London. He studied chemistry and environmental sciences, before obtaining a Ph.D. in environmental microbiology from the University of Aberdeen in 2000.
He then carried out post docs at Imperial, the University of California, Davis, and Cambridge, before returning to Imperial in 2005. His research is in metabolomics, focusing on microbial and environmental applications, with particular reference to terrestrial ecotoxicology and earthworms. He also works on method development in metabolomics, with relevant publications on both mass spectrometry and nuclear magnetic resonance spectroscopy. He is grateful for the opportunity to collaborate here with experts in computational bioinformatics.

C.W. is supported by a Wellcome Trust PhD Studentship (222837/Z/21/Z). J.G.B. was supported by the UK Natural Environment Research Council (NERC) for this work (NE/S000240/1). R.P.J.L. receives support from the UK Medical Research Council (MR/R008922/1). J.C. is supported by a state-funded Ph.D. contract [MESRI (Minister of Higher Education, Research and Innovation)]. F.J. is supported by the French Ministry of Research and National Research Agency as part of the French MetaboHUB, the national metabolomics and fluxomics infrastructure (Grant ANR-INBS-0010), and the MetClassNet project (ANR-19-CE45-0021 and DFG 431572533). T.M.D.E. gratefully acknowledges partial support from BBSRC Grant BB/T007974/1, National Institutes of Health Grant R01 HL133932-01 and the NIHR Imperial Biomedical Research Centre (BRC).
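As a rough illustration of recommendation (2) and of the output correction mentioned in recommendation (1), the sketch below runs the same toy ORA twice: once against the metabolites actually detected, and once against every compound in a made-up pathway database. It assumes that omitting a background amounts to treating the whole database as the universe, and it uses Benjamini-Hochberg adjustment as one possible multiple-testing correction (the text does not prescribe a method). All identifiers, pathway names, and counts are hypothetical.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(population N, K marked, n drawn)."""
    top = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return top / comb(N, n)


def ora(pathways, hits, universe):
    """Return {pathway: p}, restricting hits and pathways to the universe."""
    universe = set(universe)
    hits = set(hits) & universe
    results = {}
    for name, members in pathways.items():
        members = set(members) & universe
        if not members:
            continue  # pathway cannot be observed in this universe
        results[name] = hypergeom_tail(
            len(universe), len(members), len(hits), len(hits & members)
        )
    return results


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned as {name: q}."""
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    m = len(ranked)
    adjusted, running_min = {}, 1.0
    for rank in range(m, 0, -1):  # walk from the largest p to the smallest
        name, p = ranked[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[name] = running_min
    return adjusted


# Hypothetical database of 1000 compounds and three pathways.
database = [f"M{i:04d}" for i in range(1000)]
pathways = {
    "pathway_A": database[0:20],
    "pathway_B": database[20:50],
    "pathway_C": database[50:90],
}
# Only 60 compounds were detected and annotated in this imaginary study.
detected = database[0:10] + database[20:30] + database[50:60] + database[500:530]
# Twelve of them passed the significance threshold.
hits = database[0:5] + database[20:22] + database[50:51] + database[500:504]

p_bg = ora(pathways, hits, universe=detected)   # with a background set
p_db = ora(pathways, hits, universe=database)   # no background supplied
q_bg = benjamini_hochberg(p_bg)                 # corrected over pathways tested

for name in pathways:
    print(f"{name}: p(background)={p_bg[name]:.3g}  "
          f"p(no background)={p_db[name]:.3g}  q(background)={q_bg[name]:.3g}")
```

In this toy setting every pathway receives a smaller P value when the background is ignored, because hits are then compared against compounds that the assay could never have detected. Reporting which universe, threshold, and correction were used is what makes such a result reproducible.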

Updated: 2022-09-26