Automatic weighted matching rectifying rule discovery for data repairing

Can we discover effective repairing rules automatically from dirty data?

  • Regular Paper
  • The VLDB Journal

Abstract

Data repairing is a key problem in data cleaning that aims to uncover and rectify data errors. Traditional methods rely on data dependencies to detect errors, but they cannot rectify them. To overcome this limitation, recent methods define repairing rules that both detect and fix errors. However, all existing repairing rules are provided by experts, which is expensive in time and effort. Moreover, rule-based repairing methods need an externally verified data source or user verification; without them, they are incomplete and can repair only a small number of errors. In this paper, we define weighted matching rectifying rules (WMRRs), which are based on similarity matching, to capture more errors. We propose a novel algorithm that discovers WMRRs automatically from the dirty data at hand, and we develop an automatic algorithm for resolving inconsistencies among rules. Based on WMRRs, we further propose an automatic data repairing algorithm (WMRR-DR) that uncovers a large number of errors and rectifies them dependably. We verify our method experimentally on both real-life and synthetic data. The results show that our method can discover effective WMRRs from the dirty data at hand and perform dependable, fully automatic repairing based on the discovered rules, with higher accuracy than existing dependable methods.
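The abstract does not give the formal definition of a WMRR, so the following is only an illustrative sketch of the general idea it names: a rule that matches a record through weighted similarity over evidence attributes and, on a match, rectifies a target attribute. The class `Rule`, its fields, the threshold, the sample values, and the use of `difflib.SequenceMatcher` as the similarity function are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; 1.0 means identical (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


@dataclass
class Rule:
    """Hypothetical weighted matching rule (not the paper's formal WMRR)."""
    evidence: Dict[str, str]   # attribute -> value the rule expects to see
    weights: Dict[str, float]  # attribute -> importance of that attribute
    threshold: float           # minimum weighted similarity for a match
    target: str                # attribute the rule may rectify
    correct: str               # value written to the target on a match

    def score(self, record: Dict[str, str]) -> float:
        """Weighted average similarity between the record and the evidence."""
        total = sum(self.weights.values())
        matched = sum(
            w * similarity(record.get(attr, ""), self.evidence[attr])
            for attr, w in self.weights.items()
        )
        return matched / total

    def repair(self, record: Dict[str, str]) -> Dict[str, str]:
        """Return a repaired copy if the rule matches, else the record itself."""
        if self.score(record) >= self.threshold and record.get(self.target) != self.correct:
            return {**record, self.target: self.correct}
        return record


# Hypothetical rule: records resembling (city=Harbin, zip=150001)
# should have province=Heilongjiang.
rule = Rule(
    evidence={"city": "Harbin", "zip": "150001"},
    weights={"city": 0.5, "zip": 0.5},
    threshold=0.8,
    target="province",
    correct="Heilongjiang",
)

dirty = {"city": "Harbn", "zip": "150001", "province": "Jilin"}
unrelated = {"city": "Boston", "zip": "02108", "province": "MA"}

repaired = rule.repair(dirty)       # misspelled city still matches the rule
untouched = rule.repair(unrelated)  # dissimilar record is left unchanged
```

Similarity matching lets the rule fire even though the city is misspelled, which exact matching would miss. That is the sense in which similarity-based rules can capture more errors. How the paper actually assigns weights, discovers rules, and resolves inconsistencies among them is not stated in the abstract and is not modeled here.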

Notes

  1. http://www.hospitalcompare.hhs.gov/

  2. http://www.cs.utexas.edu/users/ml/riddle/data.html

Acknowledgements

This paper was partially supported by NSFC Grants U1509216, U1866602, 61772157, 61472099, and 61602129; the National Key Research and Development Program of China (2016YFB1000703); the National Sci-Tech Support Plan (2015BAH10F01); and the Scientific Research Foundation for the Returned Overseas Chinese Scholars of Heilongjiang Province (LC2016026). The authors thank Prof. Jiannan Wang for his support of this work and the anonymous reviewers for their valuable comments, which greatly improved this paper.

Author information

Corresponding authors

Correspondence to Hiba Abu Ahmad or Hongzhi Wang.

About this article

Cite this article

Abu Ahmad, H., Wang, H. Automatic weighted matching rectifying rule discovery for data repairing. The VLDB Journal 29, 1433–1447 (2020). https://doi.org/10.1007/s00778-020-00617-6
