A data value metric for quantifying information content and utility
Journal of Big Data ( IF 8.1 ) Pub Date : 2021-06-05 , DOI: 10.1186/s40537-021-00446-6
Morteza Noshad 1, 2 , Jerome Choi 3, 4 , Yuming Sun 3, 5 , Alfred Hero 1, 4, 6 , Ivo D Dinov 3, 7
Data-driven innovation is propelled by recent scientific advances, rapid technological progress, substantial reductions of manufacturing costs, and significant demand for effective decision support systems. This has led to efforts to collect massive amounts of heterogeneous and multisource data; however, not all data are of equal quality or equally informative. Previous methods to capture and quantify the utility of data include value of information (VoI), quality of information (QoI), and mutual information (MI). This manuscript introduces a new measure to quantify whether larger volumes of increasingly complex data enhance, degrade, or alter their information content and utility with respect to specific tasks. We present a new information-theoretic measure, called the Data Value Metric (DVM), that quantifies the useful information content (energy) of large and heterogeneous datasets. The DVM formulation is based on a regularized model balancing data analytical value (utility) against model complexity. DVM can be used to determine whether appending, expanding, or augmenting a dataset may be beneficial in specific application domains. Subject to the choice of data analytic, inferential, or forecasting techniques employed to interrogate the data, DVM quantifies the information boost, or degradation, associated with increasing the data size or expanding the richness of its features. DVM is defined as a mixture of a fidelity term and a regularization term. The fidelity term captures the usefulness of the sample data specifically in the context of the inferential task. The regularization term represents the computational complexity of the corresponding inferential method. Inspired by the concept of the information bottleneck in deep learning, the fidelity term depends on the performance of the corresponding supervised or unsupervised model.
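The abstract describes DVM as a mixture of a fidelity term and a regularization term, but does not give the exact functional form. The following is a minimal illustrative sketch, assuming a simple convex combination in which fidelity is a task-performance score in [0, 1] (e.g., cross-validated accuracy) and complexity is a normalized computational-cost penalty in [0, 1]; the `trade_off` weight and both inputs are assumptions for illustration, not the paper's actual formulation.

```python
def dvm_score(fidelity: float, complexity: float, trade_off: float = 0.5) -> float:
    """Hypothetical DVM-style score: reward task fidelity, penalize
    the computational complexity of the inferential method.

    fidelity   -- task performance on the inferential task, in [0, 1]
    complexity -- normalized computational-complexity penalty, in [0, 1]
    trade_off  -- assumed mixing weight between the two terms
    """
    return trade_off * fidelity - (1.0 - trade_off) * complexity


# A larger dataset may raise fidelity slightly while raising the
# complexity penalty substantially, lowering the net score:
small = dvm_score(fidelity=0.80, complexity=0.10)  # 0.35
large = dvm_score(fidelity=0.85, complexity=0.60)  # 0.125
```

Under this sketch, the larger dataset's marginal fidelity gain does not offset its added complexity, which is the kind of trade-off the DVM is designed to expose.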
We tested the DVM method on several alternative supervised and unsupervised regression, classification, clustering, and dimensionality reduction tasks. Both real and simulated datasets with weak and strong signal information were used in the experimental validation. Our findings suggest that DVM effectively captures the balance between analytical value and algorithmic complexity. Changes in the DVM expose the tradeoffs between algorithmic complexity and data analytical value in terms of the sample size and the feature richness of a dataset. DVM values may be used to determine the size and characteristics of the data that optimize the relative utility of various supervised or unsupervised algorithms.
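The abstract notes that changes in the DVM expose the tradeoff between sample size and net analytical value. The sketch below illustrates that idea with entirely synthetic assumptions: a saturating learning curve for fidelity, a linearly growing complexity penalty, and an equal-weight mixture. The functional forms, the time constant `tau`, and the normalizer `n_max` are all hypothetical, chosen only to show how a DVM-style curve can peak at an intermediate sample size.

```python
import math

def fidelity(n: int, tau: float = 200.0) -> float:
    # Assumed saturating learning curve: performance gains diminish
    # as the sample size n grows.
    return 1.0 - math.exp(-n / tau)

def complexity(n: int, n_max: float = 2000.0) -> float:
    # Assumed complexity penalty growing with sample size,
    # normalized to [0, 1] over the range considered.
    return n / n_max

def dvm(n: int, trade_off: float = 0.5) -> float:
    # Same hypothetical mixture of fidelity and regularization terms.
    return trade_off * fidelity(n) - (1.0 - trade_off) * complexity(n)

# Sweep candidate sample sizes and pick the one with the highest score:
sizes = range(50, 2001, 50)
best_n = max(sizes, key=dvm)
```

With these assumed curves the score peaks at an interior sample size, after which additional data degrades net utility: the kind of information boost/degradation behavior the DVM is meant to quantify.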



Updated: 2021-06-05