Incremental Factorization of Big Time Series Data with Blind Factor Approximation
IEEE Transactions on Knowledge and Data Engineering (IF 8.9) Pub Date: 2021-02-01, DOI: 10.1109/tkde.2019.2931687
Dan Chen, Yunbo Tang, Hao Zhang, Lizhe Wang, Xiaoli Li

Extracting the latent factors of big time series data is an important means of examining the dynamic complex systems under observation. These low-dimensional, "small" representations reveal key insights into the overall mechanisms, which can otherwise be obscured by the notoriously high dimensionality and scale of big data as well as the enormously complicated interdependencies amongst data elements. However, grand challenges remain: (1) incrementally deriving the multi-mode factors of augmenting big data, and (2) achieving this goal with insufficient a priori knowledge. This study develops an incremental parallel factorization solution (namely I-PARAFAC) for huge augmenting tensors (multi-way arrays), consisting of three phases over a cutting-edge GPU cluster: in the "giant-step" phase, a variational Bayesian inference (VBI) model estimates the distribution of the close neighborhood of each factor at a high confidence level, without requiring a priori knowledge of the tensor or the problem domain; in the "baby-step" phase, a massively parallel Fast-HALS algorithm (namely G-HALS) derives the accurate subfactors of each subtensor from those initial factors; in the final fusion phase, I-PARAFAC fuses the known factors of the original tensor with the accurate subfactors of the "increment" to obtain the final full factors. Experimental results indicate that: (1) the VBI model enables blind factor approximation, where the distribution of the close neighborhood of each final factor can be derived quickly (10 iterations for the test case).
As a result, the low time complexity of the model significantly accelerates the derivation of the final accurate factors and lowers the risk of errors; (2) I-PARAFAC significantly outperforms even the latest high-performance counterparts when handling augmenting tensors: its extra overhead is proportional only to the increment, whereas the counterparts have to repeatedly factorize the whole tensor, and its overhead of fusing subfactors is always minimal; (3) I-PARAFAC can factorize a huge tensor (up to 500 TB over 50 nodes) as a whole, with a capacity several orders of magnitude beyond that of conventional methods, and its runtime scales as $\frac{1}{n}$ with the number of compute nodes; (4) I-PARAFAC supports correct factorization-based analysis of a real 4-order EEG dataset captured from a variety of epilepsy patients. Overall, it should also be noted that counterpart methods have to factorize the whole tensor from scratch whenever the tensor is augmented in any dimension; in contrast, the I-PARAFAC framework only needs to incrementally compute the full factors of the huge augmented tensor.
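The abstract names Fast-HALS as the core update rule that G-HALS parallelizes. As a minimal illustration, the sketch below implements the standard column-wise nonnegative-CP Fast-HALS scheme for a small dense 3-way tensor in serial NumPy; the function names (`unfold`, `khatri_rao`, `fast_hals_cp`) are illustrative and this is the textbook algorithm, not the authors' GPU implementation:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: rows index the chosen mode, columns the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Khatri-Rao product of (I x R) and (J x R) -> (I*J x R)."""
    return (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])

def fast_hals_cp(T, rank, n_iter=300, eps=1e-12, seed=0):
    """Nonnegative CP decomposition of a 3-way tensor via Fast-HALS:
    each factor matrix is refined column by column against the Gram
    matrix of the Khatri-Rao product of the other two factors."""
    rng = np.random.default_rng(seed)
    factors = [rng.random((s, rank)) for s in T.shape]
    for _ in range(n_iter):
        for n in range(3):
            B, C = [factors[m] for m in range(3) if m != n]
            KR = khatri_rao(B, C)        # order matches the unfolding
            W = unfold(T, n) @ KR        # (I_n x R) MTTKRP term
            V = KR.T @ KR                # (R x R) Gram matrix
            A = factors[n]
            for r in range(rank):        # column-wise HALS update
                A[:, r] = np.maximum(
                    eps, A[:, r] + (W[:, r] - A @ V[:, r]) / V[r, r])
    return factors

def reconstruct(A, B, C):
    """Rebuild the tensor from its CP factors."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)
```

Each column update is an exact coordinate minimization projected onto the nonnegative orthant, which is what makes HALS both cheap per step and monotone; the per-subtensor parallelism described in the baby-step phase distributes these updates across GPUs.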
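The incremental scenario can also be sketched in its simplest form: when new slices are appended along one mode and the factors of the other modes are taken as known, the new rows of that mode's factor follow from a least-squares fit against the unfolded increment. The snippet below is a generic incremental-CP illustration under that assumption; `extend_time_factor` is a hypothetical helper, far simpler than I-PARAFAC's VBI-initialized fusion phase, but it shows why the cost of the update is proportional only to the increment:

```python
import numpy as np

def extend_time_factor(dT, B, C):
    """Solve for the new temporal rows of a CP model when an increment
    dT of shape (dI, J, K) is appended along the time mode, with the
    non-temporal factors B (J x R) and C (K x R) held fixed."""
    R = B.shape[1]
    KR = (B[:, None, :] * C[None, :, :]).reshape(-1, R)  # Khatri-Rao (J*K x R)
    dT0 = dT.reshape(dT.shape[0], -1)                    # unfold along time
    # Least squares: dT0 ~= A_new @ KR.T
    A_new, *_ = np.linalg.lstsq(KR, dT0.T, rcond=None)
    return A_new.T

# Usage: stack the recovered rows under the existing temporal factor,
# instead of re-factorizing the whole augmented tensor:
# A_full = np.vstack([A_old, extend_time_factor(dT, B, C)])
```

Note that this only touches the increment `dT`; a batch method would have to redo the full decomposition of the augmented tensor, which is the overhead gap the experiments quantify.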
