Abstract
Data repairing is a key problem in data cleaning that aims to uncover and rectify data errors. Traditional methods rely on data dependencies to detect errors, but they cannot rectify them. To overcome this limitation, recent methods define repairing rules that both detect and fix errors. However, all existing data repairing rules are provided by experts, which is expensive in time and effort. Moreover, rule-based data repairing methods need an external, verified data source or user verification; without these, they are incomplete and can repair only a small number of errors. In this paper, we define weighted matching rectifying rules (WMRRs), which are based on similarity matching, to capture more errors. We propose a novel algorithm that discovers WMRRs automatically from the dirty data at hand. We also develop an automatic algorithm for resolving rule inconsistencies. Additionally, based on WMRRs, we propose an automatic data repairing algorithm (WMRR-DR) that uncovers a large number of errors and rectifies them dependably. We experimentally evaluate our method on both real-life and synthetic data. The experimental results show that our method can discover effective WMRRs from the dirty data at hand and perform dependable, fully automatic repairing based on the discovered WMRRs, with higher accuracy than existing dependable methods.
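To give a rough feel for what a rule that combines similarity matching with a weight might look like, the following is a minimal sketch under stated assumptions. It is not the paper's definition of WMRRs, nor the WMRR-DR algorithm. The `Rule` fields, the `similar` helper, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity function are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Hypothetical similarity test: case-insensitive ratio against a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


@dataclass
class Rule:
    """Illustrative rule shape (an assumption, not the paper's formalism)."""
    evidence: dict          # attribute -> value the tuple must be similar to
    target: str             # attribute that may hold an error
    wrong_values: frozenset  # values of the target treated as errors
    correct_value: str      # value written when the rule fires
    weight: float           # used here only to order rule application


def applies(rule: Rule, row: dict, threshold: float = 0.8) -> bool:
    """A rule fires when all evidence is similar and the target holds a known wrong value."""
    evidence_ok = all(
        similar(str(row.get(attr, "")), value, threshold)
        for attr, value in rule.evidence.items()
    )
    return evidence_ok and row.get(rule.target) in rule.wrong_values


def repair(row: dict, rules: list, threshold: float = 0.8) -> dict:
    """Return a repaired copy of the row, trying higher-weight rules first."""
    fixed = dict(row)
    for rule in sorted(rules, key=lambda r: r.weight, reverse=True):
        if applies(rule, fixed, threshold):
            fixed[rule.target] = rule.correct_value
    return fixed


if __name__ == "__main__":
    rule = Rule(
        evidence={"country": "China"},
        target="capital",
        wrong_values=frozenset({"Shanghai", "Hongkong"}),
        correct_value="Beijing",
        weight=0.9,
    )
    # The misspelled country still matches the evidence through similarity.
    print(repair({"country": "Chiina", "capital": "Shanghai"}, [rule]))
```

In this sketch, similarity matching lets the rule fire even when the evidence value itself is misspelled, which is one plausible reading of how similarity-based rules could capture more errors than exact matching. How the paper actually defines, weights, and discovers its rules is specified in the full text, not here.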
Notes
http://www.cs.utexas.edu/users/ml/riddle/data.html
Acknowledgements
This paper was partially supported by NSFC Grants U1509216, U1866602, and 61772157; the National Key Research and Development Program of China 2016YFB1000703; NSFC Grants 61472099 and 61602129; National Sci-Tech Support Plan 2015BAH10F01; and the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province LC2016026. The authors would like to thank Prof. Jiannan Wang for his support in this work, and the anonymous reviewers for their valuable comments, which greatly improved this paper.
Cite this article
Abu Ahmad, H., Wang, H. Automatic weighted matching rectifying rule discovery for data repairing. The VLDB Journal 29, 1433–1447 (2020). https://doi.org/10.1007/s00778-020-00617-6
DOI: https://doi.org/10.1007/s00778-020-00617-6