Learning Individual Models for Imputation (Technical Report)
arXiv - CS - Databases Pub Date : 2020-04-07 , DOI: arxiv-2004.03436
Aoqian Zhang, Shaoxu Song, Yu Sun, Jianmin Wang

Missing numerical values are prevalent, e.g., owing to unreliable sensor readings, collection, and transmission among heterogeneous sources. Unlike categorical data imputation over a limited domain, numerical values suffer from two issues: (1) the sparsity problem, where an incomplete tuple may not have sufficiently many complete neighbors sharing the same/similar values for imputation, owing to the (almost) infinite domain; (2) the heterogeneity problem, where different tuples may not fit the same (regression) model. In this study, enlightened by conditional dependencies that hold over certain tuples rather than the whole relation, we propose to learn a regression model individually for each complete tuple together with its neighbors. Our IIM, Imputation via Individual Models, thus no longer relies on sharing similar values among the k complete neighbors for imputation, but utilizes their regression results produced by the aforesaid learned individual (not necessarily the same) models. Remarkably, we show that some existing methods are indeed special cases of our IIM, under extreme settings of the number l of learning neighbors considered in individual learning. In this sense, a proper number l of neighbors is essential to learn the individual models (avoiding over-fitting or under-fitting). We propose to adaptively learn individual models over various numbers l of neighbors for different complete tuples. By devising efficient incremental computation, the time complexity of learning a model reduces from linear to constant. Experiments on real data demonstrate that our IIM with adaptive learning achieves higher imputation accuracy than the existing approaches.
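The imputation scheme the abstract describes can be sketched as follows: for an incomplete tuple, find its k complete neighbors; each such neighbor carries its own regression model, learned over its l nearest complete tuples; the imputed value combines the k individual predictions. This is a minimal numpy-only illustration of the idea, not the paper's implementation — the function names, the plain least-squares fit, the fixed k and l, and the unweighted averaging step are all assumptions (the paper learns l adaptively and combines predictions more carefully).

```python
import numpy as np

def learn_individual_model(idx, X, y, l):
    """Fit a linear model on the l nearest complete neighbors of
    complete tuple idx (an 'individual model' in the sketch above)."""
    d = np.linalg.norm(X - X[idx], axis=1)
    nbr = np.argsort(d)[:l]                           # includes the tuple itself
    A = np.hstack([X[nbr], np.ones((len(nbr), 1))])   # append intercept column
    coef, *_ = np.linalg.lstsq(A, y[nbr], rcond=None)
    return coef

def iim_impute(x, X, y, k=5, l=10):
    """Impute the missing attribute of incomplete tuple x: each of its
    k complete neighbors predicts via its own individual model, and the
    predictions are combined (a plain average here, as a placeholder)."""
    d = np.linalg.norm(X - x, axis=1)
    nbrs = np.argsort(d)[:k]
    preds = []
    for i in nbrs:
        coef = learn_individual_model(i, X, y, l)
        preds.append(np.append(x, 1.0) @ coef)        # predict with intercept
    return float(np.mean(preds))

# Toy usage: complete tuples lying on y = 2x + 1; impute at x = 7.5.
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0
print(iim_impute(np.array([7.5]), X, y))              # close to 16.0
```

Because each neighbor's model is fit locally rather than over the whole relation, heterogeneous tuples need not share one global regression — which is the point of contrast with single-model approaches in the abstract.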

Updated: 2020-04-08